On the Use of Linear Programming for Unsupervised Text Classification
We present a new algorithm for large scale unsupervised text classification. Our method views each document as a sample of fixed size from a mixture model, and uses a novel L1-norm based theoretical approach due to Kleinberg and Sandler. We show that our algorithm performs extremely well on data sets of $10^5$ documents and more, and in particular out-performs Latent Semantic Indexing by a large margin. Furthermore, on some tests its prediction accuracy approaches that of {\it supervised} learning with training set of 5,000 or more documents. Unlike LSI, our algorithm produces a ``well-behaved'' projection in general, that in many cases does not require additional clustering algorithm to separate topics. We experiment with the \arxiv - a collection of scientific abstracts and the \news~dataset - a small snapshot of 20 specific newsgroups.