eCommons

 

Robust and Scalable Spectral Topic Modeling for Large Vocabularies

dc.contributor.author: Cho, Sungjun
dc.contributor.chair: Bindel, David
dc.contributor.committeeMember: Mimno, David
dc.date.accessioned: 2020-08-10T20:07:54Z
dc.date.available: 2020-08-10T20:07:54Z
dc.date.issued: 2020-05
dc.description: 56 pages
dc.description.abstract: Across many data domains, co-occurrence statistics about the joint appearance of objects are powerfully informative. In topic modeling, spectral methods can provably learn low-dimensional latent topics from easily collected word co-occurrence statistics, unlike likelihood-based methods that require repeated passes through the corpus. However, spectral methods suffer from two major drawbacks: the quality of learned topics deteriorates drastically when the empirical data does not follow the generative model, and the co-occurrence statistics themselves grow to an intractable size when working with large vocabularies. This thesis aims to overcome these drawbacks by developing a scalable and robust spectral topic inference framework based on Joint Stochastic Matrix Factorization. First, we provide theoretical foundations for spectral topic inference, as well as step-wise algorithmic implementations of our anchor-based approach, which can learn quality topics despite model-data mismatch. We then scale to larger vocabularies by operating solely on compressed low-rank representations of the co-occurrence statistics, keeping the overall cost linear in the vocabulary size. Quantitative and qualitative experiments on various datasets not only demonstrate our framework's consistency and efficiency in inferring high-quality topics, but also show improvements in the interpretability of individual topics.
dc.identifier.doi: https://doi.org/10.7298/eymc-6724
dc.identifier.other: Cho_cornell_0058O_10866
dc.identifier.other: http://dissertations.umi.com/cornell:10866
dc.identifier.uri: https://hdl.handle.net/1813/70307
dc.language.iso: en
dc.subject: Natural Language Processing
dc.subject: Nonlinear Dimensionality Reduction
dc.subject: Spectral Methods
dc.subject: Unsupervised Learning
dc.title: Robust and Scalable Spectral Topic Modeling for Large Vocabularies
dc.type: dissertation or thesis
dcterms.license: https://hdl.handle.net/1813/59810
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Cornell University
thesis.degree.level: Master of Science
thesis.degree.name: M.S., Computer Science
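The anchor-based spectral inference the abstract describes can be illustrated, in simplified form, by the classical anchor-word recovery idea that Joint Stochastic Matrix Factorization builds on: greedily pick the rows of the row-normalized co-occurrence matrix that are farthest from the span of previously chosen rows (the anchors), then express every other word's co-occurrence profile as a convex combination of the anchor rows. This is a minimal sketch, not the thesis's actual implementation — the function names are invented here, and the least-squares-plus-clipping step stands in for a proper probability-simplex projection.

```python
import numpy as np

def find_anchors(q_bar, k):
    """Greedy pivoted anchor selection (Gram-Schmidt style):
    repeatedly pick the row farthest from the span of the rows
    chosen so far. q_bar is the row-normalized V x V word
    co-occurrence matrix; returns k candidate anchor-word indices."""
    anchors = []
    residual = q_bar.astype(float).copy()
    for _ in range(k):
        norms = np.linalg.norm(residual, axis=1)
        i = int(np.argmax(norms))        # farthest remaining row
        anchors.append(i)
        u = residual[i] / norms[i]       # new orthonormal direction
        residual -= np.outer(residual @ u, u)  # project it out
    return anchors

def recover_topics(q_bar, anchors):
    """Write each word's co-occurrence profile as a convex combination
    of the anchor rows. Least squares + clip-and-renormalize is used
    here for brevity; a faithful implementation would project onto
    the probability simplex instead."""
    anchor_rows = q_bar[anchors]         # k x V
    num_words, k = q_bar.shape[0], len(anchors)
    weights = np.zeros((num_words, k))
    for w in range(num_words):
        c, *_ = np.linalg.lstsq(anchor_rows.T, q_bar[w], rcond=None)
        c = np.clip(c, 0.0, None)
        s = c.sum()
        weights[w] = c / s if s > 0 else np.full(k, 1.0 / k)
    return weights  # row w approximates p(topic | word w)
```

For example, on a toy 4-word matrix whose last two rows are exact 50/50 mixtures of the first two, `find_anchors` picks words 0 and 1 as anchors and `recover_topics` assigns the mixed words weights of about one half per topic. The scalability contribution of the thesis then replaces the dense `q_bar` above with a compressed low-rank representation so that cost stays linear in the vocabulary size.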

Files

Original bundle
Name: Cho_cornell_0058O_10866.pdf
Size: 1.23 MB
Format: Adobe Portable Document Format