Joint-stochastic Spectral Inference for Robust Co-occurrence Modeling and Latent Topic Analysis

Other Titles


Co-occurrence information is powerful statistics that can model various discrete objects by their joint instances with other objects. Transforming unsupervised problems of learning low-dimensional geometry into provable decompositions of co-occurrence information, spectral inference provides fast algorithms and optimality guarantees for non-linear dimensionality reduction or latent topic analysis. Spectral approaches reduce the dependence on the original training examples and produce substantial gain in efficiency, but at costs: a) The algorithms perform poorly on real data that does not necessarily follow underlying models; b) Users can no longer infer information about individual examples, which is often important for real-world applications; c) Model complexity rapidly grows as the number of objects increases, requiring a careful curation of the vocabulary. The first issue is called model-data mismatch, which is a fundamental problem common in every spectral inference method for latent variable models. As real data never follows any particular computational model, this issue must be ad- dressed for practicality of the spectral inference beyond synthetic settings. For the second issue, users could revisit probabilistic inference to infer information about individual examples, but this brings back all the drawbacks of traditional approaches. One method is recently developed for spectral inference, but it works only on tiny models, quickly losing its performance for the datasets whose underlying structures exhibit realistic correlations. While probabilistic inference also suffers from the third issue, the problem is more serious for spectral inferences because co-occurrence information easily exceeds storable capacity as the size of vocabulary becomes larger. We cast the learning problem in the framework of Joint Stochastic Matrix Factorization (JSMF), showing that existing methods violate the theoretical conditions necessary for a good solution to exist. Proposing novel rectification paradigms for handling the model-data mismatch, the Rectified Anchor Word Algorithm (RAWA) is able to learn quality latent structures and their interactions even on small noisy data. We also propose the Prior Aware Dual Decomposition (PADD) that is capable of considering the learned interactions as well as the learned latent structures to robustly infer example- specific information. Beyond the theoretical guarantees, our experimental results show that RAWA recovers quality low-dimensional geometry on various textual/non-textual datasets comparable to probabilistic Gibbs sampling, and PADD substantially outperforms the recently developed method for learning low-dimensional representations of individual examples. Although this thesis does not address the complexity issue for large vocabulary, we have developed new methods that can drastically compress co-occurrence information and learn only with the compressed statistics without losing much precision. Providing rich capability to operate on millions of objects and billions of examples, we complete all the necessary tools to make spectral inference robust and scalable competitor to probabilistic inference for unsupervised latent structure learning. We hope our research serves an initial basis for a new perspective that combines the benefits of both spectral and probabilistic worlds.

Journal / Series

Volume & Issue



Date Issued




Statistics; Applied mathematics; Anchor Word Algorithm; Co-occurrence Modeling; Joint-Stochastic Matrix Factorization; Rectification; Spectral Inference; Topic Modeling; Artificial intelligence


Effective Date

Expiration Date




Union Local


Number of Workers

Committee Chair

Mimno, David

Committee Co-Chair

Committee Member

Frazier, Peter
Bindel, David S.

Degree Discipline

Computer Science

Degree Name

Ph. D., Computer Science

Degree Level

Doctor of Philosophy

Related Version

Related DOI

Related To

Related Part

Based on Related Item

Has Other Format(s)

Part of Related Item

Related To

Related Publication(s)

Link(s) to Related Publication(s)


Link(s) to Reference(s)

Previously Published As

Government Document




Other Identifiers


Rights URI


dissertation or thesis

Accessibility Feature

Accessibility Hazard

Accessibility Summary

Link(s) to Catalog Record