
Information Genealogy: Modeling Idea Origins And Flows In Text

File(s)
Shaparenko, Benyah.pdf (632.29 KB)
Permanent Link(s)
https://hdl.handle.net/1813/14733
Collections
Cornell Theses and Dissertations
Author
Shaparenko, Benyah
Abstract

One goal of text mining is to provide automatic methods that help people grasp the key ideas in ever-growing document collections. These corpora often accumulate incrementally through a self-referential process: documents propose new ideas, build on or refute existing ones, or draw connections between existing ideas. Such corpora are pervasive, including email, news articles, blogs, and research publications. Search engines are effective for retrieving individual documents from such corpora, but they typically provide no information about the structure of a corpus or how its ideas developed over time. We propose a set of tasks, which we call information genealogy, that seek to analyze and summarize a document collection’s development over time in terms of its ideas. These methods focus on helping people grasp the document collection as a whole. Specifically, we address the following tasks: What is each document’s (interesting) original contribution of ideas to the corpus? How do ideas flow from one document to another? What are the most important, influential documents and ideas? We develop methods grounded in probability and statistics, specifically based on generative mixture models for document language modeling. Consequently, unlike heuristic approaches, these methods are both extensible and readily analyzable. In addition, the input for these methods consists only of the text and temporal ordering of the documents, not any hyperlink information. Using only document text in an unsupervised setting allows these methods to apply in many domains. We evaluate these methods on both synthetically generated and real research publications. In general, these methods outperform heuristic baseline methods based on text similarity alone.
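To make the phrase "generative mixture models for document language modeling" concrete, here is a minimal sketch in Python. It is not the thesis's actual model, which the abstract does not specify. It assumes one simple instance of the idea: a new document is treated as a mixture of smoothed unigram language models built from earlier documents, and the mixture weights are fit with EM. The function names `unigram_model` and `mixture_weights`, the smoothing constant `alpha`, and the toy documents are all hypothetical.

```python
from collections import Counter


def unigram_model(tokens, vocab, alpha=0.1):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}


def mixture_weights(new_tokens, prior_models, iterations=50):
    """Fit mixture weights over fixed unigram models with EM.

    Each weight estimates how much of the new document is explained by
    the corresponding earlier document's language model.
    """
    k = len(prior_models)
    weights = [1.0 / k] * k  # start from a uniform mixture
    counts = Counter(new_tokens)
    n = sum(counts.values())
    for _ in range(iterations):
        expected = [0.0] * k
        for word, c in counts.items():
            # E-step: responsibility of each component for this word.
            joint = [weights[j] * prior_models[j][word] for j in range(k)]
            z = sum(joint)  # positive because every model is smoothed
            for j in range(k):
                expected[j] += c * joint[j] / z
        # M-step: renormalize expected counts into weights.
        weights = [e / n for e in expected]
    return weights


# Toy corpus in temporal order (hypothetical data).
earlier_docs = [
    "topic models for text mining".split(),
    "support vector machines for classification".split(),
]
new_doc = "topic models for document text".split()

# The vocabulary must cover every word the models will be asked about.
vocab = set(new_doc).union(*earlier_docs)
models = [unigram_model(doc, vocab) for doc in earlier_docs]
weights = mixture_weights(new_doc, models)
print(weights)
```

Under these assumptions, a larger weight on an earlier document suggests that more of the new document's wording is explained by it. That is one rough way to read "idea flow" from text and temporal order alone, without any hyperlink or citation data.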

Date Issued
2010-04-09T19:58:05Z
Type
dissertation or thesis
