Information Genealogy: Modeling Idea Origins And Flows In Text

One goal of text mining is to provide automatic methods that help people grasp the key ideas in ever-growing document collections. Such text corpora often accumulate incrementally through a self-referential process: documents propose new ideas, build on or refute existing ideas, or draw connections between existing ideas. These corpora are pervasive and include email, news articles, blogs, and research publications. Search engines are effective at retrieving individual documents from such corpora, but they typically do not provide information about the structure of a corpus or how its ideas developed over time. We propose a set of tasks, which we call information genealogy, that analyze and summarize a document collection’s development over time in terms of its ideas. These methods focus on helping people grasp the document collection as a whole. Specifically, we address the following questions: What is each document’s (interesting) original contribution of ideas to the corpus? How do ideas flow from one document to another? Which documents and ideas are the most important and influential? We develop methods grounded in probability and statistics, based specifically on generative mixture models for document language modeling. Consequently, unlike heuristic approaches, these methods are both extensible and readily analyzable. In addition, the input to these methods consists only of the text and temporal ordering of the documents, with no hyperlink information. Because they use only document text in an unsupervised setting, the methods apply in many domains. We evaluate them on both synthetically generated and real research publications. In general, they outperform heuristic baseline methods based on text similarity alone.
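To make the idea of a generative mixture model over document language models concrete, the following is a minimal illustrative sketch and not the model developed in the thesis. It assumes that each earlier document is represented by an additively smoothed unigram distribution and that a later document is a mixture of those distributions, with the mixing weights estimated by EM. The function names, the smoothing constant `alpha`, and the toy documents are all hypothetical.

```python
from collections import Counter


def unigram_model(tokens, vocab, alpha=0.1):
    """Additively smoothed unigram distribution over vocab (assumed form)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}


def mixture_weights(new_doc, earlier_docs, iterations=50, alpha=0.1):
    """Estimate, by EM, how strongly each earlier document's language
    model contributes to new_doc. Inputs are lists of tokens; only the
    text and the temporal split (earlier vs. new) are used."""
    if not new_doc or not earlier_docs:
        raise ValueError("need a non-empty new_doc and at least one earlier doc")
    vocab = sorted(set(new_doc).union(*earlier_docs))
    models = [unigram_model(doc, vocab, alpha) for doc in earlier_docs]
    k = len(models)
    weights = [1.0 / k] * k          # start from uniform mixing weights
    counts = Counter(new_doc)
    n = len(new_doc)
    for _ in range(iterations):
        expected = [0.0] * k
        for word, c in counts.items():
            # E-step: responsibility of each source model for this word
            joint = [weights[j] * models[j][word] for j in range(k)]
            z = sum(joint)           # positive because alpha > 0
            for j in range(k):
                expected[j] += c * joint[j] / z
        # M-step: weights become the normalized expected counts
        weights = [e / n for e in expected]
    return weights


if __name__ == "__main__":
    d1 = "topic model mixture topic model".split()
    d2 = "kernel margin support vector kernel".split()
    new = "topic model mixture kernel topic model".split()
    print(mixture_weights(new, [d1, d2]))
```

In this toy setting, a larger weight for an earlier document suggests that more of the new document's vocabulary is explained by it, which is one simple way to read "idea flow" from text and temporal order alone. A fuller treatment would also need a component for the new document's original contribution and a statistical test for whether a weight is significant; both are omitted here.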
Type: dissertation or thesis