Automatic Identification of Significant Topics in Domain-Independent Full Text Documents

Contact Information
Judith L. Klavans
Center for Research on Information Access
Columbia University
535 W. 114th Street, MC 1101
New York, NY 10027
Phone: 212-854-7443
Fax: 212-854-9099
klavans@cs.columbia.edu

Nina Wacholder
Center for Research on Information Access
Columbia University
535 W. 114th Street, MC 1101
New York, NY 10027
Phone: 212-939-7119
Fax: 212-666-0140
nina@cs.columbia.edu

WWW Page

http://www.columbia.edu/cu/cria/

Project WWW Page: http://www.columbia.edu/cu/SigTops/

List of Supported Students and Staff

See Project Impact: Human Resources.

Project Award Information

Keywords

Information retrieval, natural language processing, computational linguistics, grammatical analysis, parsing, document analysis, topic identification, topic detection and tracking, noun phrases.

Project Summary

The goal of this project is to develop a suite of techniques for identifying significant topics in edited documents such as newspaper articles. Significant entities and concepts are most often referred to in text with nominal expressions such as noun phrases (e.g., computer science) and proper names (e.g., Buffalo Bills). However, achieving even a shallow understanding of nominal expressions remains a major blocking point for natural language processing (NLP). Related to this is the question of the role of natural language information in information retrieval (IR), and specifically the role of nominals in information access. An additional goal is to achieve a better understanding of the relative strengths and weaknesses of statistically-based and rule-based natural language processing systems.

We have developed linguistically motivated techniques that link related nominal expressions in order to identify documents or portions of documents that are relevant for a given application or user. To the extent that our techniques are based on linguistically motivated patterns and not on domain-dependent vocabularies, they apply to text in any genre or domain. We assume that efficient and effective known statistical techniques will continue to be used for large-scale tasks such as analysis of large corpora and retrieval of a set of possibly relevant documents from the web. This research is useful for applications such as automatic indexing, either for digital library applications or for direct user manipulation; second-stage information retrieval, where a subset of a larger corpus has been determined to be potentially relevant; advanced information extraction, where important entities in the document must be identified and linked; summarization; and topic detection and tracking. Our system has been evaluated against two similar noun phrase identifiers and has been shown to outperform both.
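
As a rough illustration of what linking related nominal expressions can look like, the sketch below groups noun phrases that share the same final word and ranks the groups by size. This is not the project's actual method: the rule "the last word is the head", the function names link_by_head and rank_heads, and the sample phrases are all assumptions made for illustration, and the noun phrases are assumed to have been extracted already.

```python
from collections import defaultdict


def link_by_head(noun_phrases):
    """Group noun phrases by their final word.

    Assumption for illustration only: the last token of an English
    noun phrase is treated as its head.
    """
    groups = defaultdict(list)
    for phrase in noun_phrases:
        tokens = phrase.lower().split()
        if not tokens:
            continue  # skip empty strings
        head = tokens[-1]
        if phrase not in groups[head]:  # keep each distinct phrase once
            groups[head].append(phrase)
    return dict(groups)


def rank_heads(groups):
    """Order heads by number of linked phrases, ties broken alphabetically."""
    return sorted(groups, key=lambda head: (-len(groups[head]), head))


# Hypothetical input: noun phrases already extracted from one document.
phrases = [
    "computer science",
    "science",
    "information retrieval",
    "document retrieval",
    "second stage information retrieval",
]
groups = link_by_head(phrases)
ranked = rank_heads(groups)
```

Here the head "retrieval" links three phrases and "science" links two, so "retrieval" would be ranked first as a candidate topic. A purely lexical rule like this needs no domain-dependent vocabulary, which is the property the paragraph above emphasizes, but it ignores proper names and any deeper grammatical analysis.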

Publications and Products

Project Impact

Goals, Objectives, and Targeted Activities

Our goal is to develop a domain-general method for identifying a list of candidate significant topics in a document that is as complete as is practical, given that full natural language understanding is not likely to be achieved by NLP systems for the foreseeable future. We are developing:

Project References

Area Background

Our research directly addresses the need to find information more easily and more reliably. We are developing a range of innovative methods to improve current methodologies for information retrieval, indexing, extraction, and summarization.

Area References

Losee, Robert M. (1998). Text Retrieval and Filtering: Analytic Models of Performance. Kluwer Academic Publishers, Boston, 1st edition.
Klavans, Judith L. and Philip Resnik, editors (1996). The Balancing Act: Combining Symbolic and Statistical Approaches to Language. MIT Press, Cambridge, Mass.
Wacholder, Nina, Yael Ravin and Misook Choi (1997). "Disambiguation of proper names in text," Proceedings of the ANLP, ACL, Washington, DC.

Potential Related Projects


  *All award information can be found on the NSF on-line Awards Abstracts system, http://www.fastlane.nsf.gov/a6/A6Start.htm.