Document Length Normalization
Singhal, Amit; Salton, Gerard; Mitra, Mandar; Buckley, Chris
In the TREC collection -- a large full-text experimental text collection with widely varying document lengths -- we observe that the likelihood of a document being judged relevant by a user increases with the document length. We show that a retrieval strategy, such as the vector-space cosine match, that retrieves documents of different lengths with roughly equal probability, will not optimally retrieve useful documents from such a collection. We present a modified technique that attempts to match the likelihood of retrieving a document of a certain length to the likelihood of documents of that length being judged relevant, and show that this technique yields significant improvements in retrieval effectiveness.
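The comparison described in the abstract can be illustrated with a minimal sketch. It assumes documents are grouped into equal-sized bins by length and that the fraction of relevant (or retrieved) documents in each bin stands in for the likelihood at that length. The abstract does not state a formula for the modified technique. The `pivoted_norm` function below is an assumed linear form, usually called pivoted normalization, shown only as one way such a correction could be expressed. All function names, parameters, and toy data are hypothetical.

```python
from typing import Dict, List, Set


def likelihood_by_length(doc_lengths: Dict[str, int],
                         flagged: Set[str],
                         n_bins: int) -> List[float]:
    """Fraction of documents in each length bin that appear in `flagged`.

    Documents are ranked by length (shortest first) and assigned to n_bins
    bins of near-equal size. `flagged` may hold relevant or retrieved ids.
    """
    ordered = sorted(doc_lengths, key=doc_lengths.get)
    counts = [0] * n_bins
    hits = [0] * n_bins
    for rank, doc_id in enumerate(ordered):
        b = rank * n_bins // len(ordered)  # bin index from rank
        counts[b] += 1
        hits[b] += doc_id in flagged
    return [h / c if c else 0.0 for h, c in zip(hits, counts)]


def pivoted_norm(old_norm: float, pivot: float, slope: float) -> float:
    """Assumed linear correction: tilt the normalization factor about `pivot`.

    With slope < 1, documents whose old factor exceeds the pivot are divided
    by a smaller factor (scored higher), and shorter ones by a larger factor.
    """
    return (1.0 - slope) * pivot + slope * old_norm


# Toy data (hypothetical): relevance concentrated in the longer documents,
# retrieval spread evenly across lengths.
lengths = {"d1": 100, "d2": 200, "d3": 400, "d4": 800, "d5": 1600, "d6": 3200}
relevant = {"d4", "d5", "d6"}
retrieved = {"d1", "d3", "d5"}

p_relevant = likelihood_by_length(lengths, relevant, n_bins=3)
p_retrieved = likelihood_by_length(lengths, retrieved, n_bins=3)
```

In this toy data the relevance fraction rises with length while the retrieval fraction stays flat, which is the kind of mismatch the abstract describes. A slope below 1 in `pivoted_norm` would shift retrieval toward longer documents. The pivot and slope values would have to be tuned on a real collection.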
computer science; technical report