Theory of Indexing and Classification
Yu, C. T.
Given a written text in natural language, it is convenient to represent the information content of the text by one or more entities, variously known as concepts, keywords, or terms. It is desired to choose "good" terms which collectively reflect the information content as accurately as possible. A characterization is given of discriminating (good) and non-discriminating (bad) terms, based on the document frequencies of occurrence and the distribution of frequencies of the terms in the documents (texts) of a given document collection. Based on this characterization, reasons are presented for the success and/or the failure of some well-known indexing methods, namely thesaurus construction, "weighting" of the rare terms, and the deletion of non-discriminating terms. A method, which converts non-discriminating terms to discriminating terms is described. Experiments are performed to show that the new method is of practical use. In order to improve the content analysis (indexing) further, term classes are constructed using a process known as pseudoclassification. It is shown that the to construct term classes through discriminating terms. Experiments using the new method are performed on two document collections in medicine and aerodynamics. For both collections, the new method yields substantial improvements over the method of the deletion of non-discriminating terms. In Chapter III, the idea of a term class is formalized and generalized. Techniques from the theory of algorithms are used to examine the computational complexity of certain useful methods for the automatic construction of term classes. It is shown that any algorithm which maximizes the number of relevant documents that are retrieved (recall) and the number of irrelevant documents that are rejected (precision) under some pre-assigned term classes is polynomial complete (likely to take exponential amount of computer time). Four heuristic methods which decrease the computaational time for the automatic construction of term classes using a "pseudo-classification" process are presented. Experiments with these methods and their variations produce surprisingly good results. Chapter IV deals with the clustering of documents. A new clustering method which makes use of the query formulations previously processed by the system is presented. The proposed method clusters the requests in the form of a tree. From that tree, a corresponding tree of documents is constructed. The resulting clusters have the following useful properties. (1) The clusters are independent of the order in which the documents are processed. (2) Overlapping of clusters is allowed in that documents may appear in more than one class or cluster (3) Not all the documents need to be clustered. The new clustering method is then compared with other automatic classification procedures, namely the Single-Link Method, Dattola's Method, Rocchio's Method and Bonner's Method with repect to systems effectiveness, systems efficiency, and computer time required for clustering. Both theoretical and experimental results seem to indicate that the new method is superior to the olded methods. Chapter V summarizes the resultss obtained in this thesis and indicates possible future research areas. Many of the more tedious or difficult proofs in different parts of the thesis are included in Appendices I, II, and III.
computer science; technical report
Previously Published As