Pivoted Document Length Normalization
Singhal, Amit; Buckley, Chris; Mitra, Mandar; Salton, Gerard
Document length normalization is an important aspect of term weight assignment in an automatic information retrieval system. In this study, we observe that a normalization scheme that retrieves documents of every length with a probability close to their likelihood of relevance will outperform a scheme whose retrieval probabilities differ sharply from the likelihood of relevance. We show that the retrieval probabilities for a particular normalization method deviate systematically from the relevance probabilities across different collections. We present pivoted normalization, a technique that can be used to reduce the gap between the relevance and the retrieval probabilities. After training pivoted normalization on one collection, we can successfully use it on other (new) text collections, yielding a robust, collection-independent normalization technique. We use the idea of pivoting with the well-known cosine normalization scheme. We point out some shortcomings of the cosine normalization function and present two new normalization functions: pivoted unique normalization and pivoted byte size normalization.
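The abstract does not spell out the formula, so the following is only a minimal sketch of the pivoting idea in the form it is usually stated: the old normalization factor is replaced by (1 - slope) * pivot + slope * old_norm. The function names, the toy documents, the slope value of 0.2, and the choice of the collection average as the pivot are all assumptions made for illustration, not details taken from this record.

```python
def pivoted_norm(old_norm: float, pivot: float, slope: float) -> float:
    """Tilt the old normalization factor around the pivot point.

    At old_norm == pivot the factor is unchanged. With slope < 1, documents
    above the pivot get a smaller factor than before and documents below it
    get a larger one.
    """
    return (1.0 - slope) * pivot + slope * old_norm


def pivoted_unique_norms(docs, slope=0.2):
    """Illustrative pivoted unique normalization.

    The old normalization factor is the number of unique terms in each
    document; the pivot is assumed to be the collection average.
    """
    unique_counts = [len(set(doc)) for doc in docs]
    pivot = sum(unique_counts) / len(unique_counts)
    return [pivoted_norm(u, pivot, slope) for u in unique_counts]


# Hypothetical toy collection: one short, one medium, one long document.
docs = [
    "length normalization".split(),
    "pivoted document length normalization for retrieval".split(),
    "a much longer document with many more distinct terms than the others".split(),
]
norms = pivoted_unique_norms(docs)
```

A term weight would then be divided by the corresponding entry of `norms` rather than by the raw unique-term count, so the longest document is penalized less than before and the shortest one more.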
computer science; technical report