dc.contributor.author: Singhal, Amitabh [en_US]
dc.date.accessioned: 2007-04-23T18:09:24Z
dc.date.available: 2007-04-23T18:09:24Z
dc.date.issued: 1997-03 [en_US]
dc.identifier.citation: http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cs/TR97-1626 [en_US]
dc.identifier.uri: https://hdl.handle.net/1813/7281
dc.description.abstract: Term weighting is an essential part of modern information retrieval systems. Of the three main components of a term weighting strategy (term frequency, inverse document frequency, and document length normalization), the term frequency factor has been investigated recently by researchers. In this work, we study the inverse document frequency and document length normalization components of term weights. We observe that a document length normalization scheme that retrieves documents of all lengths with chances similar to their likelihood of relevance will outperform a scheme that retrieves documents with chances very different from their likelihood of relevance. We present pivoted normalization, a technique that can be used to modify normalization functions to reduce the gap between the relevance and the retrieval probabilities. We present two new normalization functions, pivoted unique normalization and pivoted byte size normalization, both of which yield significant improvements over the previous state-of-the-art normalization functions. When optical character recognition (OCR) is used to create large information bases, term weighting schemes can be highly sensitive to errors in the input text introduced by the OCR process. This work examines the effects of the well-known cosine normalization method in the presence of OCR errors and proposes a new, more robust normalization method. Experiments show that the new scheme is less sensitive to OCR errors and facilitates the use of more diverse basic weighting schemes. This study also explains why the use of cosine normalization in the presence of the inverse document frequency factor is not advisable in large document collections. When a user types a natural language query for an IR system, certain keywords in the query are more pertinent to the user's information need than others. Most modern IR systems incorporate these distinctions by using an inverse document frequency (idf) factor in term weighting. Preliminary experiments show that the usefulness of an idf-type function is high at low ranks. We observe that the main reason for this effect is the widened gap between the weights of the rare terms and the non-rare query terms. The standard idf function works very well across query sets. Experiments show that there is room for improvement in the idf function. Further studies are needed to discover a better replacement for the standard idf function. [en_US]
dc.format.extent: 1559987 bytes
dc.format.extent: 2263893 bytes
dc.format.mimetype: application/pdf
dc.format.mimetype: application/postscript
dc.language.iso: en_US [en_US]
dc.publisher: Cornell University [en_US]
dc.subject: computer science [en_US]
dc.subject: technical report [en_US]
dc.title: Term Weighting Revisited [en_US]
dc.type: technical report [en_US]