Robust and Scalable Spectral Topic Modeling for Large Vocabularies

Cho, Sungjun

Files

Cho_cornell_0058O_10866.pdf (1.23 MB)

Permanent Link(s)

https://doi.org/10.7298/eymc-6724

https://hdl.handle.net/1813/70307

Collections

Cornell Theses and Dissertations

Author(s)

Cho, Sungjun

Abstract

Across many data domains, co-occurrence statistics about the joint appearance of objects are powerfully informative. In topic modeling, spectral methods can provably learn low-dimensional latent topics from easily collected word co-occurrence statistics, unlike likelihood-based methods, which require repeated exhaustive passes through the corpus. However, spectral methods suffer from two major drawbacks: the quality of learned topics deteriorates drastically when the empirical data do not follow the generative model, and the co-occurrence statistics themselves grow to an intractable size when working with large vocabularies. This thesis attempts to overcome these drawbacks by developing a scalable and robust spectral topic inference framework based on Joint Stochastic Matrix Factorization. First, we provide theoretical foundations of spectral topic inference as well as step-wise algorithmic implementations of our anchor-based approach that can learn quality topics despite model-data mismatch. We then scale towards larger vocabularies by operating solely on compressed low-rank representations of co-occurrence statistics, keeping the overall cost linear with respect to the vocabulary size. Quantitative and qualitative experiments on various datasets not only demonstrate our framework's consistency and efficiency in inferring high-quality topics, but also show improvements in the interpretability of individual topics.

Description

56 pages

Date Issued

2020-05

Keywords

Natural Language Processing; Nonlinear Dimensionality Reduction; Spectral Methods; Unsupervised Learning

Committee Chair

Bindel, David

Committee Member

Mimno, David

Degree Discipline

Computer Science

Degree Name

M.S., Computer Science

Degree Level

Master of Science

Types

dissertation or thesis
