eCommons

The Document Representation Problem: An Analysis of LSI and Iterative Residual Rescaling

dc.contributor.author: Ando, Rie [en_US]
dc.date.accessioned: 2007-04-09T19:56:19Z
dc.date.available: 2007-04-09T19:56:19Z
dc.date.issued: 2001-07-02 [en_US]
dc.description.abstract: Important text analysis problems in information retrieval and natural language processing, such as document clustering and automatic text summarization, require accurate measurement of inter-document similarity. The goal of this work is to find methods for automatically creating document representations in which inter-document similarity measurements correspond to human judgment. We present a new model for the task of creating document representations. From this model, we derive a new analysis of Latent Semantic Indexing (LSI), which is one of the successful approaches that has been studied extensively. In particular, we show a precise relationship between LSI's performance and the uniformity of the underlying distribution of documents over topics. As a consequence, we propose a novel alternative method called Iterative Residual Rescaling (IRR), that, crucially, compensates for distributional non-uniformity. Experiments over a variety of practically-encountered settings and with several evaluation metrics validate our theoretical prediction and confirm the effectiveness of IRR in comparison to LSI. We also propose several extensions including a new document sampling method to scale IRR up to large document collections. Comparison with random sampling provides further empirical evidence that performance can be improved by counteracting non-uniformity. Finally, we present a system for multi-document summarization based on IRR, which demonstrates that IRR can be immediately useful in applications. We show that IRR works as a framework to find a tightly connected (and therefore interpretable) set of coherent texts, and effectively present them to the user. [en_US]
dc.format.extent: 1079864 bytes
dc.format.mimetype: application/pdf
dc.identifier.citation: http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cs/TR2001-1843 [en_US]
dc.identifier.uri: https://hdl.handle.net/1813/5830
dc.language.iso: en_US [en_US]
dc.publisher: Cornell University [en_US]
dc.subject: computer science [en_US]
dc.subject: technical report [en_US]
dc.title: The Document Representation Problem: An Analysis of LSI and Iterative Residual Rescaling [en_US]
dc.type: technical report [en_US]
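
Illustrative note: the abstract refers to Latent Semantic Indexing (LSI) as a way of building document representations for similarity measurement. The following is a minimal sketch of standard LSI using a truncated SVD. It is not taken from the report and does not implement IRR. The function names and the toy term-document matrix are hypothetical, chosen only for illustration.

```python
import numpy as np


def lsi_document_vectors(term_doc, k):
    """Return k-dimensional LSI representations, one row per document.

    term_doc has one row per term and one column per document.
    Projecting each document onto the top-k left singular vectors
    gives coordinates equal to the columns of S_k @ Vt_k.
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    return (s[:k, None] * vt[:k, :]).T


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)  # avoid dividing by zero
    return unit @ unit.T


# Hypothetical toy data: documents 0 and 1 share one vocabulary,
# documents 2 and 3 share another.
term_doc = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

doc_vectors = lsi_document_vectors(term_doc, k=2)
similarity = cosine_similarity_matrix(doc_vectors)
```

In this toy case the two topics are equally represented. The abstract's claim concerns what happens when the distribution of documents over topics is not uniform, which this sketch does not attempt to reproduce.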

Files

Original bundle
Name: 2001-1843.pdf
Size: 1.03 MB
Format: Adobe Portable Document Format