The Document Representation Problem: An Analysis of LSI and Iterative Residual Rescaling
dc.contributor.author | Ando, Rie | en_US |
dc.date.accessioned | 2007-04-09T19:56:19Z | |
dc.date.available | 2007-04-09T19:56:19Z | |
dc.date.issued | 2001-07-02 | en_US |
dc.description.abstract | Important text analysis problems in information retrieval and natural language processing, such as document clustering and automatic text summarization, require accurate measurement of inter-document similarity. The goal of this work is to find methods for automatically creating document representations in which inter-document similarity measurements correspond to human judgment. We present a new model for the task of creating document representations. From this model, we derive a new analysis of Latent Semantic Indexing (LSI), which is one of the successful approaches that has been studied extensively. In particular, we show a precise relationship between LSI's performance and the uniformity of the underlying distribution of documents over topics. As a consequence, we propose a novel alternative method called Iterative Residual Rescaling (IRR), that, crucially, compensates for distributional non-uniformity. Experiments over a variety of practically-encountered settings and with several evaluation metrics validate our theoretical prediction and confirm the effectiveness of IRR in comparison to LSI. We also propose several extensions including a new document sampling method to scale IRR up to large document collections. Comparison with random sampling provides further empirical evidence that performance can be improved by counteracting non-uniformity. Finally, we present a system for multi-document summarization based on IRR, which demonstrates that IRR can be immediately useful in applications. We show that IRR works as a framework to find a tightly connected (and therefore interpretable) set of coherent texts, and effectively present them to the user. | en_US |
dc.format.extent | 1079864 bytes | |
dc.format.mimetype | application/pdf | |
dc.identifier.citation | http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cs/TR2001-1843 | en_US |
dc.identifier.uri | https://hdl.handle.net/1813/5830 | |
dc.language.iso | en_US | en_US |
dc.publisher | Cornell University | en_US |
dc.subject | computer science | en_US |
dc.subject | technical report | en_US |
dc.title | The Document Representation Problem: An Analysis of LSI and Iterative Residual Rescaling | en_US |
dc.type | technical report | en_US |