Automatic Identification of Significant Topics in Domain-independent Full Text Documents

Contact Information

Judith L. Klavans	Nina Wacholder
Center for Research on Information Access	Center for Research on Information Access
Columbia University	Columbia University
535 W. 114th Street, MC 1101	535 W. 114th Street, MC 1101
New York, NY 10027	New York, NY 10027
Phone: 212-854-7443	Phone: 212-939-7119
Fax: 212-854-9099	Fax: 212-666-0140
klavans@cs.columbia.edu	nina@cs.columbia.edu

WWW Page: http://www.columbia.edu/cria/

Keywords: Information retrieval, natural language processing, computational linguistics, grammatical analysis, parsing, document analysis, topic identification, topic detection and tracking.

Project Award Information:
Award Number: IRI-97-12069
Title: Automatic Identification of Significant Topics in Domain-Independent Full Text Documents
Duration: Three years
Dates: September 1997 to August 2000

Project Summary
The goal of this project is to develop a suite of techniques for identifying significant topics in edited documents such as newspaper articles. Significant entities and concepts are most often referred to in text with nominal expressions such as noun phrases (e.g., computer science) and proper names (e.g., Buffalo Bills). However, achieving even a shallow understanding of nominal expressions remains a major obstacle for natural language processing (NLP). Related to this is the question of the role of natural language information in information retrieval (IR), and specifically the role of nominals in information access. An additional goal is to achieve a better understanding of the relative strengths and weaknesses of statistically-based and rule-based natural language processing systems.

We have developed linguistically motivated techniques that link related nominal expressions in order to identify documents or portions of documents that are relevant for a given application or user. To the extent that our techniques are based on linguistically motivated patterns and not on domain-dependent vocabularies, they apply to text in any genre or domain. We assume that efficient and effective known statistical techniques will continue to be used for large-scale tasks such as analysis of large corpora and retrieval of a set of possibly relevant documents from the web. This research is useful for applications such as automatic indexing, either for digital library applications or for direct user manipulation; second-stage information retrieval, where a subset of a larger corpus has been determined to be potentially relevant; advanced information extraction, where important entities in the document must be identified and linked; summarization; and topic detection and tracking.

Goals, Objectives, and Targeted Activities

a domain-general method for identifying a list of candidate significant topics in a document that is as complete as is practical, given that full natural language understanding is not likely to be achieved by NLP systems for the foreseeable future.
a suite of functions for identifying the most significant of these candidate topics. Our particular contribution solves some of the more challenging problems associated with nominal referring expressions and their variants.
a set of tools for evaluating and analyzing the contribution of this type of linguistic information to statistical information retrieval systems.

Indications of Success
We developed a modular system, LinkIT, for building a list of candidate significant topics for each document. The SNAP module identifies Simplex Noun Phrases (SNPs) (Wacholder 1998), such as asbestos fiber, 9.8 billion Kent cigarettes, and Department of Energy. SNPs can be contrasted with complex NPs such as 9.8 billion Kent cigarettes sold by the company, where the head is followed by a participial verb. LinkIT then links together expressions that refer to the same higher-level concept, such as rights in reproduction rights and literary rights. This year we have achieved:

Refinement of the SNAP module

development of a filtering and ranking metric to identify the most significant of the SNPs, based on frequency of the head of the SNP and on frequency of the modifier
evaluation of the ranking metric and comparison with two related techniques

Implementation of a new module to identify main verbs in documents (VERSO)
Analysis of the role of grammatical categories in a statistical IR system (Wacholder et al. under review)

creation of a set of resources used for training, testing and evaluation, including a tagged 330 MB corpus based on the Text Retrieval Conference (TREC) collections
development of the DFI (Distance from Ideal) metric to closely analyze the performance of SMART on different versions of documents for specific queries (Evans et al. under review).

Refinement of a visualization tool for displaying marked up documents.
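
The head-based linking and frequency-based ranking of SNPs described above can be illustrated with a minimal sketch. It is not the LinkIT implementation: the function names are hypothetical, the head is assumed to be the last word of the phrase (which does not hold for SNPs such as Department of Energy), and the score, head frequency followed by the frequency of the most common modifier, is only one plausible reading of a metric based on head and modifier frequency.

```python
from collections import Counter, defaultdict


def split_snp(snp):
    """Split an SNP into (modifiers, head).

    Assumption: the head is the last word of the phrase.
    """
    words = snp.lower().split()
    return words[:-1], words[-1]


def link_by_head(snps):
    """Group SNPs that share a head, e.g. all phrases headed by 'rights'."""
    groups = defaultdict(list)
    for snp in snps:
        _, head = split_snp(snp)
        groups[head].append(snp)
    return dict(groups)


def rank_snps(snps):
    """Order distinct SNPs by head frequency, then by modifier frequency.

    Hypothetical score, used here only for illustration.
    """
    head_freq = Counter()
    mod_freq = Counter()
    for snp in snps:
        mods, head = split_snp(snp)
        head_freq[head] += 1
        mod_freq.update(mods)

    def score(snp):
        mods, head = split_snp(snp)
        best_mod = max((mod_freq[m] for m in mods), default=0)
        return (head_freq[head], best_mod)

    # dict.fromkeys removes duplicates while keeping document order;
    # sorted() is stable, so tied SNPs stay in document order.
    return sorted(dict.fromkeys(snps), key=score, reverse=True)


if __name__ == "__main__":
    phrases = ["reproduction rights", "asbestos fiber", "literary rights",
               "asbestos workers", "rights"]
    print(link_by_head(phrases)["rights"])
    print(rank_snps(phrases))
```

In this sketch, phrases headed by rights rank first because that head occurs most often in the sample list; a real system would need a parser or tagger to find SNP boundaries and heads reliably.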

LinkIT is used by two NSF-sponsored STIMULATE projects, for the computation of paragraph similarity (Eskin et al. under review) and for the analysis of image captions for building an ontology of multimedia objects (Chang et al. under review), and by a distributed database project.

Project Impact / Human Resources

David K. Evans, Ph.D. student, has developed the LinkIT software and the Distance From Ideal (DFI) metric used in evaluation of LinkIT.
Sonja Allin, B.A. student, major in computer science, has completed a project to improve and refine our tool for document visualization.

Your department/institution infrastructure
This project is part of the Columbia University Digital Library program under the Center for Research on Information Access (CRIA). CRIA sponsors projects to develop tools and techniques which are essential for effective user-oriented text analysis and retrieval.

Industry -- collaborations, transfer of technology, patents
We have developed a plan with the electronic publishing division of Columbia University Press (CUP) and with the Columbia Law School for the potential use of LinkIT in an intelligent indexing tool.

What activities have been enabled/spawned because of the accomplishments made possible by your award?

Eleazar Eskin, a CS Ph.D. student, used LinkIT output as a feature in a machine learning approach to paragraph similarity.
Carl Sable, a CS Ph.D. student, used LinkIT to help in caption analysis for image classification.
Stefan Negrila, a Columbia undergraduate, used LinkIT output to cluster documents that discussed the same event. (Supervised by Luis Gravano)

Project References
1. Chang, Shih-Fu et al. (under review) "Integration of Visual and Text-Based Approaches for the Content Labeling and Classification of Photographs".
2. Eskin, Eleazar, Judith Klavans, and Vasileios Hatzivassiloglou (under review). "Detecting Similarity by Applying Learning over Indicators".
3. Evans, David Kirk, Judith L. Klavans, and Nina Wacholder (under review). "The Impact of Document Collection Characteristics on Information Access in Digital Libraries".
4. Klavans, Judith L. (1998) "Databases in Digital Libraries: Where Computer Science and Information Management Meet," ACM-PODS Invited Tutorial, available at http://www.cs.columbia.edu/~klavans/Slides/PODS98/index.htm.
5. McKeown, Kathleen R, Judith Klavans, Vasileios Hatzivassiloglou, Regina Barzilay, and Eleazar Eskin (under review) "Towards multidocument summarization by reformulation: Progress and prospects".
6. Wacholder, Nina, Judith L. Klavans and David Kirk Evans (under review) "The role of grammatical categories in a statistical information retrieval system".
7. Wacholder, Nina (1998). "Simplex NPs Sorted by Head: a Method for Identifying Significant Topics within a Document," Proceedings of the COLING-ACL Workshop on the Computational Treatment of Nominals, Montreal, Canada, August 16, 1998.
8. Wacholder, Nina, Yael Ravin and Misook Choi (1997) "Disambiguation of proper names in text," Proceedings of the ANLP, ACL, Washington, DC.

Area Background
The NSF is actively involved in funding research to enable increased universal access to the fast-growing body of electronic text. Our research directly addresses the need to find information more easily and more reliably. We are developing a range of innovative methods to improve current methodologies for information retrieval, indexing, extraction, and summarization.

Area References

Losee, Robert M., 1998, Text Retrieval and Filtering: Analytic Models of Performance. Kluwer Academic Publishers, Boston, 1st edition.
Klavans, Judith L. and Philip Resnik, editors (1996). The Balancing Act: Combining Symbolic and Statistical Approaches to Language. MIT Press, Cambridge, Mass.