Text Processing for the Effective Application of Latent Dirichlet Allocation

Author
SCHOFIELD, ALEXANDRA K
Abstract
Distributional semantic models such as LDA (Blei et al., 2003) are powerful methods for extracting patterns of word co-occurrence to explore a textual corpus. This is of particular interest to social scientists and humanists, who may wish to explore large collections of text in their fields of expertise without specific hypotheses to test. However, using topic models effectively relies on choices about both text processing and model initialization. Without prior experience in machine learning and natural language processing, these choices may be challenging to navigate. I focus on two primary challenges in establishing datasets for effective topic models: pre-processing and privacy. In the first part, I share a number of experiments to discover the effects of standard text pre-processing steps on the learned topic models. My work shows that common practices in text cleaning, including stemming, stopword removal, and text de-duplication, may be less necessary than conventionally assumed. In the second part, I discuss a workflow to apply differential privacy through randomization for Poisson factorization models, a broad class of distributional models of count data that includes LDA. My work includes multiple methods for efficient inference of private Poisson factorization models on large datasets, including approximations to an MCMC algorithm and a new variational inference (VI) algorithm. I also discuss approaches to introducing randomness to privatize unigram frequencies in ways that better preserve the sparse, correlated structure of the true data.
Date Issued
2019-05-30
Subject
Differential Privacy; Artificial intelligence; Bayesian Poisson factorization; latent Dirichlet allocation; stemmers; stopwords; topic models
Committee Chair
Mimno, David
Committee Member
Danescu-Niculescu-Mizil, Cristian; Birman, Kenneth Paul
Degree Discipline
Computer Science
Degree Name
Ph.D., Computer Science
Degree Level
Doctor of Philosophy
Rights
Attribution-ShareAlike 2.0 Generic
Type
dissertation or thesis