
dc.contributor.author: Shaparenko, Benyah [en_US]
dc.date.accessioned: 2010-04-09T19:58:05Z
dc.date.available: 2010-04-09T19:58:05Z
dc.date.issued: 2010-04-09T19:58:05Z
dc.identifier.other: bibid: 6890861
dc.identifier.uri: https://hdl.handle.net/1813/14733
dc.description.abstract: One goal of text mining is to provide automatic methods to help people grasp the key ideas in ever-increasing document collections. Often these text corpora accumulate incrementally over time by a self-referential process as documents propose new ideas, build on or refute existing ideas, or draw connections between different existing ideas, and so on. Such corpora are pervasive, including email, news articles, blogs, and research publications. Search engines are effective for retrieving individual documents from such corpora, but they do not typically provide information about the structure of the corpora and how their ideas developed over time. We propose a set of tasks, which we call information genealogy, which seek to analyze and summarize a document collection’s development over time in terms of its ideas. These methods focus on helping people grasp the document collection as a whole. Specifically, we address the following tasks: What is each document’s (interesting) original contribution of ideas to the corpus? How do ideas flow from one document to another? What are the most important, influential documents and ideas? We develop methods grounded in probability and statistics, specifically based on generative mixture models for document language modeling. Consequently, unlike heuristic approaches, these methods are both extensible and readily analyzable. In addition, the input for these methods consists of only the text and temporal ordering of the documents, not any hyperlink information. Exclusively using document text in an unsupervised setting allows these methods to apply in many domains. We evaluate these methods on both synthetically-generated and actual research publications. In general, these methods outperform heuristic baseline methods based on text similarity alone. [en_US]
dc.language.iso: en_US [en_US]
dc.title: Information Genealogy: Modeling Idea Origins And Flows In Text [en_US]
dc.type: dissertation or thesis [en_US]

