eCommons

 

Text Processing for the Effective Application of Latent Dirichlet Allocation

Abstract

Distributional semantic models such as LDA (Blei et al., 2003) are a powerful method for extracting patterns of word co-occurrence when exploring a textual corpus. This is of particular interest to social scientists and humanists, who may wish to explore large collections of text in their fields of expertise without specific hypotheses to test. However, using topic models effectively relies on choices about both text processing and model initialization. Without prior experience in machine learning and natural language processing, these choices may be challenging to navigate. I focus on two primary challenges in establishing datasets for effective topic models: pre-processing and privacy. In the first part, I share a number of experiments to discover the effects of standard text pre-processing steps on the learned topic models. My work shows that common practices in text cleaning, including stemming, stopword removal, and text de-duplication, may be less necessary than conventionally assumed. In the second part, I discuss a workflow to apply differential privacy through randomization for Poisson factorization models, a broad class of distributional models of count data that includes LDA. My work includes multiple methods for efficient inference of private Poisson factorization models on large datasets, including approximations to an MCMC algorithm and a new variational inference (VI) algorithm. I also discuss approaches to introducing randomness that privatize unigram frequencies while better preserving the sparse, correlated structure of the true data.
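As a rough illustration of the kind of pre-processing steps the abstract refers to, the following minimal Python sketch toggles stopword removal and exact de-duplication on a toy corpus. It is not taken from the dissertation: the stopword list, the function names `preprocess` and `vocabulary_counts`, and the example documents are all hypothetical, and stemming is left out because it would require an external stemmer.

```python
from collections import Counter

# Hypothetical toy stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "of", "and", "to", "is"}


def preprocess(docs, remove_stopwords=True, deduplicate=True):
    """Lowercase and whitespace-tokenize documents, optionally dropping
    stopwords and exact duplicate documents (compared after tokenization)."""
    seen = set()
    out = []
    for doc in docs:
        tokens = doc.lower().split()
        if remove_stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        if deduplicate:
            key = tuple(tokens)
            if key in seen:
                continue  # skip a document already seen in identical form
            seen.add(key)
        out.append(tokens)
    return out


def vocabulary_counts(token_docs):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for tokens in token_docs:
        counts.update(tokens)
    return counts


docs = [
    "The model of the corpus",
    "the model of the corpus",
    "Topics and words",
]

raw = preprocess(docs, remove_stopwords=False, deduplicate=False)
clean = preprocess(docs)

print(len(raw), len(clean))
print(vocabulary_counts(raw).most_common(3))
print(vocabulary_counts(clean).most_common(3))
```

Comparing the two sets of counts shows how much of the raw token mass these steps remove before any topic model is trained; whether that removal actually improves the learned topics is the question the first part of the dissertation examines.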

Date Issued

2019-05-30

Keywords

Differential Privacy; Artificial intelligence; Bayesian Poisson factorization; latent Dirichlet allocation; stemmers; stopwords; topic models

Committee Chair

Mimno, David

Committee Member

Danescu-Niculescu-Mizil, Cristian
Birman, Kenneth Paul

Degree Discipline

Computer Science

Degree Name

Ph.D., Computer Science

Degree Level

Doctor of Philosophy

Rights

Attribution-ShareAlike 2.0 Generic

Types

dissertation or thesis
