
On the Use of Linear Programming for Unsupervised Text Classification

File(s)
TR2005-1992.pdf (1.25 MB)
Permanent Link(s)
https://hdl.handle.net/1813/5692
Collections
Computing and Information Science Technical Reports
Author
Sandler, Mark
Abstract

We present a new algorithm for large-scale unsupervised text classification. Our method views each document as a sample of fixed size from a mixture model, and uses a novel L1-norm-based theoretical approach due to Kleinberg and Sandler. We show that our algorithm performs extremely well on data sets of $10^5$ documents and more, and in particular outperforms Latent Semantic Indexing (LSI) by a large margin. Furthermore, on some tests its prediction accuracy approaches that of supervised learning with a training set of 5,000 or more documents. Unlike LSI, our algorithm produces a "well-behaved" projection in general, which in many cases does not require an additional clustering algorithm to separate topics. We experiment with the arXiv, a collection of scientific abstracts, and the newsgroups dataset, a small snapshot of 20 specific newsgroups.

Date Issued
2005-05-06
Publisher
Cornell University
Keywords
computer science; technical report
Previously Published as
http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cis/TR2005-1992
Type
technical report
