eCommons

 

Text Processing for the Effective Application of Latent Dirichlet Allocation

dc.contributor.author: SCHOFIELD, ALEXANDRA K
dc.contributor.chair: Mimno, David
dc.contributor.committeeMember: Danescu-Niculescu-Mizil, Cristian
dc.contributor.committeeMember: Birman, Kenneth Paul
dc.date.accessioned: 2019-10-15T15:29:21Z
dc.date.available: 2020-06-05T06:00:53Z
dc.date.issued: 2019-05-30
dc.description.abstract: Distributional semantic models such as LDA (Blei et al., 2003) are powerful methods for extracting patterns of word co-occurrence when exploring a textual corpus. This is of particular interest to social scientists and humanists, who may wish to explore large collections of text in their fields of expertise without specific hypotheses to test. However, using topic models effectively relies on choices about both text processing and model initialization, and without prior experience in machine learning and natural language processing, these choices can be challenging to navigate. I focus on two primary challenges in establishing datasets for effective topic models: pre-processing and privacy. In the first part, I share a number of experiments that examine the effects of standard text pre-processing steps on the learned topic models. My work shows that common practices in text cleaning, including stemming, stopword removal, and text de-duplication, may be less necessary than conventionally assumed. In the second part, I discuss a workflow for applying differential privacy through randomization to Poisson factorization models, a broad class of distributional models of count data that includes LDA. My work includes multiple methods for efficient inference of private Poisson factorization models on large datasets, including approximations to an MCMC algorithm and a new variational inference (VI) algorithm. I also discuss approaches to introducing randomness to privatize unigram frequencies while better preserving the sparse, correlated structure of the true data.
dc.identifier.doi: https://doi.org/10.7298/5b9e-pw26
dc.identifier.other: SCHOFIELD_cornellgrad_0058F_11395
dc.identifier.other: http://dissertations.umi.com/cornellgrad:11395
dc.identifier.other: bibid: 11050287
dc.identifier.uri: https://hdl.handle.net/1813/67305
dc.language.iso: en_US
dc.rights: Attribution-ShareAlike 2.0 Generic
dc.rights.uri: https://creativecommons.org/licenses/by-sa/4.0/
dc.subject: Differential Privacy
dc.subject: Artificial intelligence
dc.subject: Bayesian Poisson factorization
dc.subject: latent Dirichlet allocation
dc.subject: stemmers
dc.subject: stopwords
dc.subject: topic models
dc.title: Text Processing for the Effective Application of Latent Dirichlet Allocation
dc.type: dissertation or thesis
dcterms.license: https://hdl.handle.net/1813/59810
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Cornell University
thesis.degree.level: Doctor of Philosophy
thesis.degree.name: Ph.D., Computer Science
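
The abstract describes the kind of pre-processing pipeline whose value the first part of the dissertation tests: stemming, stopword removal, and similar cleaning applied before fitting LDA. The sketch below is an illustrative example of such a pipeline, not the dissertation's own code; the toy corpus, the scikit-learn/NLTK choices, and every parameter are assumptions made only for this example.

```python
# Minimal illustrative sketch: "standard" text cleaning (stopword removal,
# stemming) followed by LDA on the resulting word counts. Not the thesis's
# experimental code; corpus and parameters are placeholders.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the model learns topics from patterns of word co-occurrence",
    "topic models help social scientists explore large text collections",
    "stemming and stopword removal are common text cleaning steps",
]

stemmer = PorterStemmer()

def preprocess(text):
    # Lowercase, drop stopwords, then stem the remaining tokens.
    tokens = [t for t in text.lower().split() if t not in ENGLISH_STOP_WORDS]
    return " ".join(stemmer.stem(t) for t in tokens)

# Bag-of-words counts over the cleaned text, then LDA on the count matrix.
counts = CountVectorizer(preprocessor=preprocess).fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics.round(2))  # per-document topic proportions
```

Running this prints a small per-document topic-proportion matrix; the dissertation's experiments compare the topics learned with and without cleaning steps like the stopword filter and stemmer used above.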

Files

Original bundle
Name: SCHOFIELD_cornellgrad_0058F_11395.pdf
Size: 1.98 MB
Format: Adobe Portable Document Format