A Theory of Indexing
The content analysis, or indexing, problem is fundamental in information storage and retrieval. Several automatic procedures are examined for the assignment of significance values to the terms, or keywords, identifying the documents of a collection. Good and bad index terms are characterized by objective measures, leading to the conclusion that the best index terms are those with medium document frequency and skewed frequency distributions. A discrimination value model is introduced which makes it possible to construct effective indexing vocabularies by using phrase and thesaurus transformations to modify poor discriminators (those whose document frequency is too high or too low) into better discriminators, and hence more useful index terms. Test results are included which illustrate the effectiveness of the theory.
computer science; technical report