Text Processing for the Effective Application of Latent Dirichlet Allocation
dc.contributor.author | SCHOFIELD, ALEXANDRA K | |
dc.contributor.chair | Mimno, David | |
dc.contributor.committeeMember | Danescu-Niculescu-Mizil, Cristian | |
dc.contributor.committeeMember | Birman, Kenneth Paul | |
dc.date.accessioned | 2019-10-15T15:29:21Z | |
dc.date.available | 2020-06-05T06:00:53Z | |
dc.date.issued | 2019-05-30 | |
dc.description.abstract | Distributional semantic models such as LDA (Blei et al., 2003) are a powerful way to extract patterns of word co-occurrence for exploration of a textual corpus. This is of particular interest to social scientists and humanists, who may wish to explore large collections of text in their fields of expertise without specific hypotheses to test. However, using topic models effectively relies on choices about both text processing and model initialization, which can be challenging to navigate without prior experience in machine learning and natural language processing. I focus on two primary challenges in establishing datasets for effective topic models: pre-processing and privacy. In the first part, I share a number of experiments that examine the effects of standard text pre-processing steps on the learned topic models. My work shows that common practices in text cleaning, including stemming, stopword removal, and text de-duplication, may be less necessary than conventionally assumed (see the pre-processing sketch after this record). In the second part, I discuss a workflow for applying differential privacy through randomization to Poisson factorization models, a broad class of distributional models of count data that includes LDA. My work includes multiple methods for efficient inference of private Poisson factorization models on large datasets, including approximations to an MCMC algorithm and a new variational inference (VI) algorithm. I also discuss approaches to introducing randomness to privatize unigram frequencies while better preserving the sparse, correlated structure of the true data. | |
dc.identifier.doi | https://doi.org/10.7298/5b9e-pw26 | |
dc.identifier.other | SCHOFIELD_cornellgrad_0058F_11395 | |
dc.identifier.other | http://dissertations.umi.com/cornellgrad:11395 | |
dc.identifier.other | bibid: 11050287 | |
dc.identifier.uri | https://hdl.handle.net/1813/67305 | |
dc.language.iso | en_US | |
dc.rights | Attribution-ShareAlike 4.0 International | |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ | |
dc.subject | Differential Privacy | |
dc.subject | Artificial intelligence | |
dc.subject | Bayesian Poisson factorization | |
dc.subject | latent Dirichlet allocation | |
dc.subject | stemmers | |
dc.subject | stopwords | |
dc.subject | topic models | |
dc.title | Text Processing for the Effective Application of Latent Dirichlet Allocation | |
dc.type | dissertation or thesis | |
dcterms.license | https://hdl.handle.net/1813/59810 | |
thesis.degree.discipline | Computer Science | |
thesis.degree.grantor | Cornell University | |
thesis.degree.level | Doctor of Philosophy | |
thesis.degree.name | Ph.D., Computer Science |
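
A minimal sketch (not taken from the dissertation) of the kind of pre-processing choices the abstract refers to: toggling stopword removal and stemming and observing how the vocabulary passed to a topic model such as LDA changes. The inline stopword list and suffix-stripping "stemmer" are toy stand-ins for real resources (e.g. NLTK's stopword list and the Porter stemmer); the thesis's point is that such steps may matter less than conventionally assumed.

```python
# Toy illustration of optional text pre-processing for a topic model.
# The stopword list and "stemmer" below are illustrative placeholders.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}  # toy list


def naive_stem(token: str) -> str:
    """Crude suffix stripping; a real pipeline might use the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(doc: str, remove_stopwords: bool = True, stem: bool = True) -> list:
    """Tokenize a document, with stopword removal and stemming as optional steps."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


docs = [
    "The models are learning patterns of word co-occurrences.",
    "Topic models extract patterns in a textual corpus.",
]

# Compare the vocabulary with and without the optional cleaning steps.
raw_vocab = Counter(t for d in docs for t in preprocess(d, False, False))
cleaned_vocab = Counter(t for d in docs for t in preprocess(d, True, True))
print("raw vocabulary:    ", sorted(raw_vocab))
print("cleaned vocabulary:", sorted(cleaned_vocab))
```

Either vocabulary could then be used to build the bag-of-words counts that LDA (or a Poisson factorization model) takes as input; the experiments summarized above compare topic models trained under such alternative pre-processing settings.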
Files
Original bundle
- Name: SCHOFIELD_cornellgrad_0058F_11395.pdf
- Size: 1.98 MB
- Format: Adobe Portable Document Format