We address these problems with a system for topical information space navigation that combines the query-based and taxonomic approaches. Our system, named SONIA (Service for Organizing Networked Information Autonomously), has been implemented as part of the Stanford Digital Libraries testbed. It enables the creation of dynamic hierarchical document categorizations based on the full text of articles. Using probability theory as a formal foundation, we have developed a number of machine learning methods that allow document collections to be automatically organized at a topical level. First, to generate such topical hierarchies, we employ a novel probabilistic clustering scheme that outperforms traditional methods used in both Information Retrieval and Probabilistic Reasoning. Second, we have developed methods for classifying new articles into such automatically generated, or existing manually generated, hierarchies. In contrast to standard classification approaches, which do not make use of the taxonomic relations in a topic hierarchy, our method makes explicit use of the existing hierarchical relationships between topics, leading to improvements in classification accuracy. Much of this improvement derives from the fact that the classification decisions in such a hierarchy can be made by considering only the presence (or absence) of a small number of features (words) in each document. The choice of relevant words is made using a novel information-theoretic algorithm for feature selection. We note that many of the components developed as part of SONIA are general enough that they have been successfully applied to data mining problems in domains entirely different from text.
The integration of the hierarchical clustering and classification methods will allow large amounts of information to be organized and presented to a user in a comprehensible way, one tailored to that user's particular needs. By alleviating the information bottleneck, we hope to provide users with a solution to the problems of information access on the Internet.
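To make the idea of hierarchical classification over a small set of word-presence features concrete, the following is a minimal illustrative sketch, not SONIA's actual implementation. It assumes plain mutual information as a stand-in for the information-theoretic feature selection algorithm, and a Laplace-smoothed Bernoulli naive Bayes classifier at each internal node of the hierarchy. All function names, the toy documents, and the choice of `k` are hypothetical.

```python
import math
from collections import defaultdict

def mutual_information(docs, labels, word):
    """Mutual information (nats) between presence of `word` and the label."""
    n = len(docs)
    joint = defaultdict(int)
    for doc, lab in zip(docs, labels):
        joint[(word in doc, lab)] += 1
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), c in joint.items():
        p_x[x] += c / n
        p_y[y] += c / n
    return sum((c / n) * math.log((c / n) / (p_x[x] * p_y[y]))
               for (x, y), c in joint.items())

def select_features(docs, labels, k):
    """Keep the k words with highest MI (assumed stand-in criterion)."""
    vocab = set().union(*docs)
    # Round so that floating-point noise does not affect tie-breaking.
    key = lambda w: (-round(mutual_information(docs, labels, w), 9), w)
    return sorted(vocab, key=key)[:k]

def train_node(docs, labels, k):
    """Bernoulli naive Bayes over the k selected words at one node."""
    feats = select_features(docs, labels, k)
    model = {}
    for c in sorted(set(labels)):
        members = [d for d, l in zip(docs, labels) if l == c]
        prior = len(members) / len(docs)
        # Laplace-smoothed P(word present | class)
        probs = {w: (sum(w in d for d in members) + 1) / (len(members) + 2)
                 for w in feats}
        model[c] = (prior, probs)
    return feats, model

def classify_node(node, doc):
    feats, model = node
    def score(c):
        prior, probs = model[c]
        return math.log(prior) + sum(
            math.log(probs[w] if w in doc else 1 - probs[w]) for w in feats)
    return max(model, key=score)

def train_hierarchy(data, k, prefix=()):
    """Train one small classifier per internal node of the topic tree."""
    depth = len(prefix)
    here = [(d, p) for d, p in data if p[:depth] == prefix and len(p) > depth]
    if not here:
        return {}
    docs = [d for d, _ in here]
    labels = [p[depth] for _, p in here]
    nodes = {prefix: train_node(docs, labels, k)}
    for child in set(labels):
        nodes.update(train_hierarchy(data, k, prefix + (child,)))
    return nodes

def classify(nodes, doc):
    """Descend from the root, making one local decision per level."""
    path = ()
    while path in nodes:
        path += (classify_node(nodes[path], doc),)
    return path

# Toy corpus: each document is a set of words with a topic path.
data = [
    ({"quark", "energy", "particle"}, ("science", "physics")),
    ({"particle", "energy", "field"}, ("science", "physics")),
    ({"cell", "gene", "energy"}, ("science", "biology")),
    ({"gene", "protein", "cell"}, ("science", "biology")),
    ({"goal", "team", "match"}, ("sports", "soccer")),
    ({"team", "goal", "league"}, ("sports", "soccer")),
    ({"serve", "match", "racket"}, ("sports", "tennis")),
    ({"racket", "serve", "set"}, ("sports", "tennis")),
]
nodes = train_hierarchy(data, k=3)
print(classify(nodes, {"gene", "cell", "lab"}))
print(classify(nodes, {"serve", "racket", "court"}))
```

In this sketch each node sees only the documents beneath it, so the words selected at the root separate broad topics while those selected deeper in the tree separate sibling subtopics. This is one way the taxonomic structure can keep each local decision dependent on only a handful of words.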