Information Genealogy: Modeling Idea Origins And Flows In Text
One goal of text mining is to provide automatic methods that help people grasp the key ideas in ever-growing document collections. Often these text corpora accumulate incrementally over time through a self-referential process, as documents propose new ideas, build on or refute existing ideas, or draw connections between different existing ideas. Such corpora are pervasive and include email, news articles, blogs, and research publications. Search engines are effective for retrieving individual documents from such corpora, but they do not typically provide information about the structure of the corpora or how their ideas developed over time. We propose a set of tasks, which we call information genealogy, that seek to analyze and summarize a document collection’s development over time in terms of its ideas. These methods focus on helping people grasp the document collection as a whole. Specifically, we address the following tasks: What is each document’s (interesting) original contribution of ideas to the corpus? How do ideas flow from one document to another? What are the most important, influential documents and ideas? We develop methods grounded in probability and statistics, specifically based on generative mixture models for document language modeling. Consequently, unlike heuristic approaches, these methods are both extensible and readily analyzable. In addition, the input for these methods consists only of the text and temporal ordering of the documents, not any hyperlink information. Using document text exclusively in an unsupervised setting allows these methods to apply in many domains. We evaluate these methods on both synthetically generated and real research publications. In general, these methods outperform heuristic baseline methods based on text similarity alone.
dissertation or thesis