dc.contributor.author Sandler, Mark en_US
dc.date.accessioned 2007-04-04T19:44:24Z
dc.date.available 2007-04-04T19:44:24Z
dc.date.issued 2005-05-06 en_US
dc.identifier.citation http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cis/TR2005-1992 en_US
dc.identifier.uri https://hdl.handle.net/1813/5692
dc.description.abstract We present a new algorithm for large-scale unsupervised text classification. Our method views each document as a fixed-size sample from a mixture model and uses a novel L1-norm-based theoretical approach due to Kleinberg and Sandler. We show that our algorithm performs extremely well on data sets of $10^5$ documents and more, and in particular outperforms Latent Semantic Indexing (LSI) by a large margin. Furthermore, on some tests its prediction accuracy approaches that of supervised learning with a training set of 5,000 or more documents. Unlike LSI, our algorithm generally produces a "well-behaved" projection that in many cases does not require an additional clustering algorithm to separate topics. We experiment with the arXiv, a collection of scientific abstracts, and the 20 Newsgroups dataset, a small snapshot of 20 specific newsgroups. en_US
dc.format.extent 1307845 bytes
dc.format.mimetype application/pdf
dc.language.iso en_US en_US
dc.publisher Cornell University en_US
dc.subject computer science en_US
dc.subject technical report en_US
dc.title On the Use of Linear Programming for Unsupervised Text Classification en_US
dc.type technical report en_US