Show simple item record

dc.contributor.authorSandler, Marken_US
dc.description.abstractWe present a new algorithm for large scale unsupervised text classification. Our method views each document as a sample of fixed size from a mixture model, and uses a novel L1-norm based theoretical approach due to Kleinberg and Sandler. We show that our algorithm performs extremely well on data sets of $10^5$ documents and more, and in particular out-performs Latent Semantic Indexing by a large margin. Furthermore, on some tests its prediction accuracy approaches that of {\it supervised} learning with training set of 5,000 or more documents. Unlike LSI, our algorithm produces a ``well-behaved'' projection in general, that in many cases does not require additional clustering algorithm to separate topics. We experiment with the \arxiv - a collection of scientific abstracts and the \news~dataset - a small snapshot of 20 specific newsgroups.en_US
dc.format.extent1307845 bytes
dc.publisherCornell Universityen_US
dc.subjectcomputer scienceen_US
dc.subjecttechnical reporten_US
dc.titleOn the Use of Linear Programming for Unsupervised Text Classificationen_US
dc.typetechnical reporten_US

Files in this item


This item appears in the following Collection(s)

Show simple item record