eCommons

 

The Document Representation Problem: An Analysis of LSI and Iterative Residual Rescaling

Author(s)

Abstract

Important text analysis problems in information retrieval and natural language processing, such as document clustering and automatic text summarization, require accurate measurement of inter-document similarity. The goal of this work is to find methods for automatically creating document representations in which inter-document similarity measurements correspond to human judgment. We present a new model for the task of creating document representations. From this model, we derive a new analysis of Latent Semantic Indexing (LSI), a successful and extensively studied approach. In particular, we show a precise relationship between LSI's performance and the uniformity of the underlying distribution of documents over topics. As a consequence, we propose a novel alternative method called Iterative Residual Rescaling (IRR) that, crucially, compensates for distributional non-uniformity. Experiments over a variety of practically encountered settings and with several evaluation metrics validate our theoretical prediction and confirm the effectiveness of IRR in comparison to LSI. We also propose several extensions, including a new document sampling method that scales IRR up to large document collections. Comparison with random sampling provides further empirical evidence that performance can be improved by counteracting non-uniformity. Finally, we present a system for multi-document summarization based on IRR, which demonstrates that IRR can be immediately useful in applications. We show that IRR works as a framework for finding a tightly connected (and therefore interpretable) set of coherent texts and presenting them effectively to the user.

Date Issued

2001-07-02

Publisher

Cornell University

Keywords

computer science; technical report

Previously Published As

http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cs/TR2001-1843

Types

technical report
