IDM'2000 Project Report for Automatic Identification of Significant Topics in Domain-Independent Full Text Documents

Automatic Identification of Significant Topics in Domain-Independent Full Text Documents

Contact Information

Judith L. Klavans	Nina Wacholder
Center for Research on Information Access	Center for Research on Information Access
Columbia University	Columbia University
535 W. 114th Street, MC 1101	535 W. 114th Street, MC 1101
New York, NY 10027	New York, NY 10027
Phone: 212-854-7443	Phone: 212-939-7119
Fax: 212-854-9099	Fax: 212-666-0140
klavans@cs.columbia.edu	nina@cs.columbia.edu

WWW Page

http://www.columbia.edu/cu/cria/

Project WWW Page: http://www.columbia.edu/cu/SigTops/

List of Supported Students and Staff

See Project Impact: Human Resources.

Project Award Information

Award Number: IRI-97-12069 *
Duration: 09/01/1997 -- 08/01/2000
Title: Automatic Identification of Significant Topics in Domain-Independent Full Text Documents

Keywords

Information retrieval, natural language processing, computational linguistics, grammatical analysis, parsing, document analysis, topic identification, topic detection and tracking, noun phrases.

Project Summary

The goal of this project is to develop a suite of techniques for identifying significant topics in edited documents such as newspaper articles. Significant entities and concepts are most often referred to in text with nominal expressions such as noun phrases (e.g., computer science) and proper names (e.g., Buffalo Bills). However, achieving even shallow understanding of nominal expressions remains a major blocking point for natural language processing (NLP). Related to this is the question of the role of natural language information in information retrieval (IR), and specifically the role of nominals in information access. An additional goal is to achieve a better understanding of the relative strengths and weaknesses of statistically based and rule-based natural language processing systems.

We have developed linguistically motivated techniques that link related nominal expressions in order to identify documents or portions of documents that are relevant for a given application or user. To the extent that our techniques are based on linguistically motivated patterns and not on domain-dependent vocabularies, they apply to text in any genre or domain. We assume that known efficient and effective statistical techniques will continue to be used for large-scale tasks such as analyzing large corpora and retrieving a set of possibly relevant documents from the web. This research is useful for applications such as automatic indexing, either for digital library applications or for direct user manipulation; second-stage information retrieval, where a subset of a larger corpus has been determined to be potentially relevant; advanced information extraction, where important entities in the document must be identified and linked; summarization; and topic detection and tracking. Our system has been evaluated against two similar noun phrase identifiers and has been shown to perform comparably to or better than them.

Publications and Products

Publications: See Project References.
What Web sites or other Internet sites have you created?

http://www.columbia.edu/cu/SigTops/

What other specific products (databases, physical collections, educational aids, software, instruments, or the like) have you developed?

Document Visualization Tool (software)
Intelligent Indexer's Aid (software)

Project Impact

Human Resources
- David Kirk Evans, Ph.D. student, has developed and performed evaluations of the LinkIT software.
- Himani Naresh, B.S. student, major in computer science, was supported as an undergraduate to develop the initial version of the document visualization tool.
- Sonja Allin, B.A. student, major in computer science, has completed a project to improve and refine our tool for document visualization.
Your department/institution infrastructure
This project is part of the Columbia University Digital Library program under the Center for Research on Information Access (CRIA). CRIA sponsors projects to develop tools and techniques which are essential for effective user-oriented text analysis and retrieval. CRIA supports students in the Fu School of Engineering and Applied Sciences at Columbia University.
Industry -- collaborations, transfer of technology, patents
- We have developed a plan with the electronic publishing division of Columbia University Press (CUP) for potential use of LinkIT in an intelligent indexing tool for scholarly publications.
- Our software is a key component in the Digital Government work for automatic ontology compilation. (Digital Government URL: http://www.cs.columbia.edu/digigov/)

Goals, Objectives, and Targeted Activities

Our goal is to develop a domain-general method for identifying a list of candidate significant topics in a document that is as complete as is practical, given that full natural language understanding is not likely to be achieved by NLP systems for the foreseeable future. We are developing:

- a suite of functions for identifying the most significant of these candidate topics. Our particular contribution solves some of the more challenging problems associated with nominal referring expressions and their variants.
- a set of tools for evaluating and analyzing the contribution of this type of linguistic information to statistical information retrieval systems.
- an evaluation of LinkIT (the software for Significant Topic Identification) against two other noun phrase identification tools on the task of noun phrase identification. LinkIT was shown to have performance comparable to or better than the other noun phrase identification systems (Evans, to appear). The evaluation of LinkIT can be viewed as a two-stage process. The first stage is the evaluation of noun phrase identification, which we have completed. The second stage is an evaluation of the LinkIT feature that not only identifies noun phrases (e.g., "asbestos workers") but also links each noun phrase to related noun phrases within the article via the modifiers (e.g., "asbestos poisoning") or via the head (e.g., "factory workers"). Designing an evaluation for the second stage is a complex task, as no clear metrics exist. In the next phase of our research, we will evaluate these components of the system in a task-based evaluation.
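
The linking of noun phrases via heads and modifiers described above can be illustrated with a minimal Python sketch. This is not the LinkIT implementation. It assumes that the noun phrases have already been identified, that each is a simple English noun phrase whose head is its last word, and that every earlier word is a modifier. The function names (split_noun_phrase, link_noun_phrases, related_phrases) are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def split_noun_phrase(phrase):
    """Split a noun phrase into (modifiers, head).

    Assumption for this sketch: the head is the last token and all
    preceding tokens are modifiers. The phrase must be non-empty.
    """
    tokens = phrase.lower().split()
    return tokens[:-1], tokens[-1]


def link_noun_phrases(phrases):
    """Index noun phrases by head and by each modifier."""
    by_head = defaultdict(list)
    by_modifier = defaultdict(list)
    for phrase in phrases:
        modifiers, head = split_noun_phrase(phrase)
        by_head[head].append(phrase)
        for modifier in modifiers:
            by_modifier[modifier].append(phrase)
    return by_head, by_modifier


def related_phrases(phrase, by_head, by_modifier):
    """Return the other phrases that share a head or a modifier with phrase."""
    modifiers, head = split_noun_phrase(phrase)
    via_head = [p for p in by_head.get(head, []) if p != phrase]
    via_modifier = [
        p
        for modifier in modifiers
        for p in by_modifier.get(modifier, [])
        if p != phrase
    ]
    return {"via_head": via_head, "via_modifier": via_modifier}


# The three example phrases used in the text above.
phrases = ["asbestos workers", "asbestos poisoning", "factory workers"]
by_head, by_modifier = link_noun_phrases(phrases)
links = related_phrases("asbestos workers", by_head, by_modifier)
print(links)
```

With the three example phrases, "asbestos workers" is linked to "factory workers" through the shared head "workers" and to "asbestos poisoning" through the shared modifier "asbestos". A real system would need proper noun phrase identification and head detection rather than the last-word assumption used here.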

Project References

Web Site: Significant Topics Website (http://www.columbia.edu/cu/cria/SigTops/)
Evans, David Kirk, Judith L. Klavans, Nina Wacholder (2000, to appear). "Document processing with LinkIT," RIAO 2000 Recherche d'Informations Assistée par Ordinateur (Content-Based Multimedia Information Access), Paris, France.
Hatzivassiloglou, Vasileios, Judith L. Klavans and Eleazar Eskin (1999). "Detecting Text Similarity over Short Passages: Exploring Linguistic Feature Combinations via Machine Learning," EMNLP/VLC-99 Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora, University of Maryland, College Park, MD, USA.
Klavans, Judith L., David K. Evans, Nina Wacholder (2000, to appear). "Evaluation of Computational Linguistic Techniques for Identifying Significant Topics for Browsing Applications," 2nd International Language Resources and Evaluation Conference (LREC2000), Athens, Greece.
Klavans, Judith L. (1998). "Databases in Digital Libraries: Where Computer Science and Information Management Meet," ACM-PODS Invited Tutorial, available at http://www.cs.columbia.edu/~klavans/Slides/PODS98/index.htm.
McKeown, Kathleen R., Judith L. Klavans, Vasileios Hatzivassiloglou, Regina Barzilay and Eleazar Eskin (1999). "Towards Multidocument Summarization by Reformulation: Progress and Prospects," Proceedings of the Sixteenth National Conference on Artificial Intelligence AAAI-1999, Orlando, Florida.
Negrilla, Stefan (1998). "Clustering Algorithms Summer Project," Computer Science Report, Columbia University.
Wacholder, Nina (1998). "Simplex NPs Sorted by Head: a Method for Identifying Significant Topics within a Document," Proceedings of the COLING-ACL Workshop on the Computational Treatment of Nominals, Montreal, Canada, August 16, 1998.
Wacholder, Nina, Judith L. Klavans, David K. Evans (2000, to appear). "Evaluation of Automatically Identified Index Terms for Browsing Electronic Documents," Applied Natural Language Processing Conference (ANLP-2000), Seattle, Washington.

Area Background

Our research directly addresses the need to find information more easily and more reliably. We are developing a range of innovative methods to improve current methodologies for information retrieval, indexing, extraction, and summarization.

Area References

Losee, Robert M. (1998). Text Retrieval and Filtering: Analytic Models of Performance. Kluwer Academic Publishers, Boston, 1st edition.
Klavans, Judith L. and Philip Resnik, editors (1996). The Balancing Act: Combining Symbolic and Statistical Approaches to Language. MIT Press, Cambridge, Mass.
Wacholder, Nina, Yael Ravin and Misook Choi (1997). "Disambiguation of proper names in text," Proceedings of the ANLP, ACL, Washington, DC.

Potential Related Projects

The use of natural language techniques to extract structured information from free text.
Automatic construction of ontologies using natural language techniques.
New methods to integrate ontologies and metadata.
Evaluation techniques for natural language and database applications with partial matching.

* All award information can be found on the NSF on-line Awards Abstracts system, http://www.fastlane.nsf.gov/a6/A6Start.htm.