Information Genealogy: Modeling Idea Origins And Flows In Text
Abstract
One goal of text mining is to provide automatic methods that help people grasp the key ideas in ever-growing document collections. Often these text corpora accumulate incrementally over time through a self-referential process: documents propose new ideas, build on or refute existing ideas, or draw connections between different existing ideas. Such corpora are pervasive and include email, news articles, blogs, and research publications. Search engines are effective for retrieving individual documents from such corpora, but they typically do not provide information about the structure of the corpora or how their ideas developed over time. We propose a set of tasks, collectively called information genealogy, that seek to analyze and summarize a document collection’s development over time in terms of its ideas. These methods focus on helping people grasp the document collection as a whole. Specifically, we address the following tasks: What is each document’s (interesting) original contribution of ideas to the corpus? How do ideas flow from one document to another? What are the most important, influential documents and ideas? We develop methods grounded in probability and statistics, specifically generative mixture models for document language modeling. Consequently, unlike heuristic approaches, these methods are both extensible and readily analyzable. In addition, the input for these methods consists only of the text and temporal ordering of the documents, not any hyperlink information. Relying exclusively on document text in an unsupervised setting allows these methods to apply in many domains. We evaluate these methods on both synthetically generated and actual research publications. In general, these methods outperform heuristic baseline methods based on text similarity alone.
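To make the idea of a generative mixture model over document language more concrete, the following is a minimal illustrative sketch and not the dissertation's actual model. It assumes each earlier document is summarized by a smoothed unigram language model, treats a new document's words as draws from a mixture of those models plus a uniform "novel" component, and estimates the mixing weights with EM. The function names, the smoothing constant `alpha`, the uniform novelty component, and the toy documents are all assumptions introduced for illustration.

```python
from collections import Counter


def unigram_model(tokens, vocab, alpha=0.1):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def mixture_weights(new_tokens, component_models, iterations=50):
    """EM for the mixing weights of fixed component language models."""
    counts = Counter(new_tokens)
    n = sum(counts.values())
    k = len(component_models)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        expected = [0.0] * k
        for word, c in counts.items():
            # E-step: responsibility of each component for this word.
            joint = [weights[j] * component_models[j][word] for j in range(k)]
            norm = sum(joint)
            for j in range(k):
                expected[j] += c * joint[j] / norm
        # M-step: new weights are the normalized expected counts.
        weights = [e / n for e in expected]
    return weights


# Hypothetical toy corpus: two earlier documents and one later document.
earlier = {
    "doc_a": "topic models describe documents as mixtures of topics".split(),
    "doc_b": "citation graphs link research papers to earlier papers".split(),
}
new_doc = "mixtures of topics describe research documents".split()

vocab = set(new_doc)
for tokens in earlier.values():
    vocab.update(tokens)
vocab = sorted(vocab)

names = list(earlier) + ["novel"]
models = [unigram_model(earlier[name], vocab) for name in earlier]
# Assumption: a uniform distribution stands in for ideas not seen before.
models.append({w: 1.0 / len(vocab) for w in vocab})

weights = mixture_weights(new_doc, models)
flow = dict(zip(names, weights))
print(flow)
```

In this sketch, a large weight on an earlier document can be read as a rough proxy for ideas flowing from it into the new document, while the weight on the "novel" component loosely stands in for original contribution. Only the text and the temporal ordering (which documents count as "earlier") are used, with no hyperlink information.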
Date Issued
2010-04-09T19:58:05Z
Types
dissertation or thesis