Automatic Identification of Significant Topics in
Domain-independent Full Text Documents
Contact Information
Judith L. Klavans (klavans@cs.columbia.edu)
Nina Wacholder (nina@cs.columbia.edu)
Center for Research on Information Access
Columbia University
535 W. 114th Street, MC 1101
New York, NY 10027
Phone: 212-854-7443
Fax: 212-666-0140
WWW Page: http://www.columbia.edu/cria/
Keywords: Natural language processing, information access,
information retrieval, document analysis, topic identification, topic
detection and tracking, text retrieval
Project Award Information:
Award Number: IRI-97-12069
Title: Automatic Identification of
Significant Topics in Domain-Independent Full Text Documents
Duration: Three years
Dates: September 1997 to August 2000
Project Summary
The goal of this project is to develop a suite of techniques for
identifying significant topics in edited documents such as newspaper
articles. For the purposes of this research, a 'topic' is any event or
entity explicitly referred to in a document, and a 'significant topic' is
a topic central to what is sometimes called the 'aboutness' of a document.
The notion 'significant', like the notion 'relevant', is both task- and
user-dependent.
This research is important because significant entities and
concepts are most often referred to in text with nominal expressions
such as noun phrases (e.g., computer science) and proper names
(e.g., Buffalo Bills). However, achieving even shallow understanding
of nominal expressions remains a major blocking point for NLP
systems.
We have developed linguistically motivated techniques that link
related nominal expressions in order to identify documents, or
portions of documents, that are relevant to a given application or
user. To the extent that our techniques are based on linguistically
motivated patterns rather than on domain-dependent vocabularies, they
apply to text in any genre or domain. We assume that efficient and
effective statistical techniques will continue to be used for
large-scale tasks such as analysis of large corpora or retrieval of a
set of possibly relevant documents from the web. One of the
hypotheses of our work is that better coverage and better portability
across domains and genres are achievable using creative combinations
of statistically and linguistically motivated techniques.
The natural language processing (NLP) applications for which this
research is useful utilize efficient shallow language analysis in
order to produce output that improves information access, for example:
- automatic indexing tasks, either for digital library
applications or for direct user manipulation.
- second stage information retrieval, where a subset of a larger
corpus has been determined to be potentially relevant, perhaps by a
statistically based system.
- advanced information extraction where important entities in the
document must be identified and linked so that information about the
entity from different parts of the document can be related.
- summarization or other techniques for conveying the gist of a
document.
- topic detection and tracking.
Goals, Objectives, and Targeted Activities
To achieve our goal, we have undertaken three tasks:
- development of a domain-general method for identifying a list of
candidate significant topics in a document that is as complete as is
practical, given that full natural language understanding is not
likely to be achieved by NLP systems for the foreseeable future.
- development of a suite of functions for identifying the most
significant of these candidate topics. Our particular
contribution solves some of the more challenging problems associated
with nominal referring expressions and their variants.
- development of a set of tools for evaluating and analyzing the
contribution of linguistic information to statistical information
retrieval systems.
Indications of Success
In the first year of
the project, we developed a modular system, tentatively named
LinkIT. The SNAP module of the LinkIT tool builds, for each document,
a list of candidate significant topics consisting of the complete set
of Simplex Noun Phrases (SNPs) in the document (Wacholder 1998). A SNP
is a maximal noun phrase with a common noun as its head, where the NP
may include premodifiers such as determiners and possessives but not
post-nominal constituents such as prepositions or
relativizers. Examples are 'asbestos fiber' and '9.8 billion
Kent cigarettes'. Simplex NPs can be contrasted with complex NPs
such as '9.8 billion Kent cigarettes with the filters', where the
head of the NP is followed by a preposition, or '9.8 billion Kent
cigarettes sold by the company', where the head is followed by a
participial verb. LinkIT then sorts the Simplex Noun Phrases
so as to link together expressions that refer to the same concept,
e.g., 'reproduction rights' and 'literary rights'.
In addition to our work on identifying and linking simplex noun
phrases, which we reported on in last year's IDM report, we have achieved
the following over the past year:
- Refinement of the SNAP module
- development of a filtering and ranking metric to identify the most
significant of the SNPs, based on the frequency of the SNP's head and
on the frequency of its modifiers (see the sketch after this list)
- evaluation of the ranking metric
- comparison of the performance of the head sorting method
for conveying the 'gist' of a document with two other
techniques: keyword frequency (the tf of the tf*idf
method) and repeated word sequences (based on the
technical term approach of Justeson and Katz 1995)
- Implementation of a new module to identify main verbs in
documents (VERSO)
- Analysis of the role of grammatical
categories in a statistical IR system (Wacholder et al. 1999)
- creation of a 330 MB corpus based on the
Text Retrieval Conference (TREC) Disk 1 and Disk 2 collections
- use of LinkIT output to create six versions of
the corpus, and comparison of the results of running an IR
system over the different versions of the corpus.
- development of the DFI (Distance from Ideal) metric to
closely analyze the performance of different versions of
documents on specific queries (Evans et al. under
review).
- Refinement of the visualization tool for displaying marked-up documents.
- Ongoing discussion with the Columbia University Law School and
the Columbia University Press about how to adapt LinkIT for use
in indexing of electronic documents.
- Successful use of LinkIT output to measure paragraph similarity
in the NSF Stimulate Project (Eskin et al., under review)
The major focus of our effort for the third and final year of the
project will be the development of additional techniques for
identifying significant topics for natural language
applications, together with an ongoing process of evaluating and refining
the LinkIT system,
in order to make sure that it identifies complex nominals and proper
names correctly and to improve the quality of linking. We are also
using the LinkIT output, along with other linguistic information
obtained by shallow parsing, in order to analyze the contribution of
nominal expressions to the SMART system.
Over the next year, our output will be used in other projects,
such as summarization of multiple documents (e.g., the Columbia
NSF-STIMULATE projects) and a project that involves finding
background information on the web about significant topics in a
document; we plan to evaluate the contribution that our work makes
to these applications. If our hypotheses are correct, we will
have developed new ways to identify significant topics; we will also
have achieved a better understanding of the relative strengths and
weaknesses of statistically based and rule-based natural language
processing systems.
Project Impact
Human Resources
- David K. Evans, a Columbia University graduate student, is fully
funded. He is responsible for developing the LinkIT software. This
year he transferred from the Master's degree program to the Ph.D.
program.
- Sonja Allin, a Columbia graduate, has returned to campus as a General
Studies student for a B.A. in computer science. She has been working on a
non-credit project to improve and refine our tool for document
visualization.
- Eleazar Eskin, a CS Ph.D. student, used LinkIT output to measure
document similarity.
- Stefan Negrila, a Columbia undergraduate, used LinkIT output
to cluster documents that discussed the same event. (Supervised
by Luis Gravano)
Department/Institution Infrastructure
This project is part of the Columbia University Digital Library
program under the Center for Research on Information Access
(CRIA). The purpose of CRIA is to establish new links between Computer
Science research relevant to digital libraries and the
information services division of the university. We are developing
tools and techniques which are essential for effective user-oriented
text analysis and retrieval, and are also useful for publishing,
library, and other information management tasks.
Industry -- collaborations, transfer of technology, patents
We have developed a plan with the Columbia University Libraries and
with the electronic publishing division of Columbia University Press (CUP) to
use LinkIT to build an intelligent indexing tool.
We are also in contact with the Columbia Law School for potential use
in the Columbia Law Review.
What activities have been enabled/spawned because of the
accomplishments made possible by your award?
- collaboration with related projects in the Department of
Computer Science
- discussion with Columbia University Press about an indexer's aid
tool
- collaboration with Dr. Christian Jacquemin (LIMSI) to use LinkIT
with FASTR for variant analysis
Project References
Aberdeen, J., J. Burger, D. Day, L. Hirschman, and M.
Vilain (1995) "Description of the Alembic system used for MUC-6".
In Proceedings of MUC-6, Morgan Kaufmann.
Eskin, Eleazar, Judith Klavans, and Vasileios Hatzivassiloglou (under
review). "Paper title here." ACL 1999.
Evans, David Kirk (under review). "The Impact of Document Collection
Characteristics on Information Access in Digital Libraries." ACM-DL 1999.
Jacquemin, C., Klavans, J., and Tzoukermann, E. (1997).
"Expansion of multi-word terms for indexing and retrieval using morphology
and syntax." Proceedings of the 35th Annual Meeting of the ACL, 24-31.
Justeson, John and Slava Katz (1995). "Technical
Terminology: some linguistic properties and an algorithm for
identification in text." Natural Language Engineering 1(1): 9-27.
Klavans, Judith L. (1998, to appear). "Databases in
Digital Libraries: Where Computer Science and Information Management
Meet." ACM-PODS Invited Tutorial.
Klavans, Judith L. and Min-Yen Kan (under
review). "The Role of Verbs in Document Analysis."
TREC Disks 1 and 2, Penn Treebank, University of
Pennsylvania, Philadelphia, PA.
Wacholder, Nina, Judith L. Klavans, and David Kirk
Evans (under review). "The role of grammatical categories in a
statistical information retrieval system."
Wacholder, Nina (1998). "Simplex NPs Sorted by Head: a
Method for Identifying Significant Topics within a Document."
Proceedings of the COLING-ACL Workshop on the Computational
Treatment of Nominals, Montreal, Canada, August 16, 1998.
Wacholder, Nina, Yael Ravin and Misook Choi (1997)
"Disambiguation of proper names in text," Proceedings of the ANLP,
ACL, Washington, DC.
Area Background
As the NSF works toward the goal of enabling increased universal
access to the fast-growing body of electronic text, our research
directly addresses the need of information-seeking individuals
to find what they need more easily and more reliably. To achieve this
end, we are developing a range of innovative methods to improve current
methodologies for information retrieval, indexing, extraction, and
summarization. The specific focus of our project is on the
identification of significant information in documents or sets of
documents. This type of information is under-utilized by most
available systems.