Automatic Identification of Significant Topics in Domain-independent Full Text Documents

Contact Information

Judith L. Klavans
Center for Research on Information Access
Columbia University
535 W. 114th Street, MC 1101
New York, NY 10027
Phone: 212-854-7443
Fax: 212-666-0140
klavans@cs.columbia.edu

Nina Wacholder
Center for Research on Information Access
Columbia University
535 W. 114th Street, MC 1101
New York, NY 10027
Phone: 212-854-7443
Fax: 212-666-0140
nina@cs.columbia.edu

WWW Page

http://www.cs.columbia.edu/~klavans

Keywords

Natural language processing, information access, information retrieval, document analysis, topic identification, topic detection and tracking.

Project Award Information

Award Number: IRI-97-12069

Title: Automatic Identification of Significant Topics in Domain-Independent Full Text Documents

Duration: Three years

Dates: September 1997 to August 2000

Project Summary

The goal of this project is to develop a suite of techniques for identifying significant topics in edited documents such as newspaper articles. For the purposes of this research, a 'topic' is any event or entity explicitly referred to in a document, and a 'significant topic' is a topic central to what is sometimes called the 'aboutness' of a document. The notion 'significant', like the notion 'relevant', is both task and user dependent. For example, what is considered significant for an application that answers specific questions differs from what is significant for an application that conveys the gist of a particular document or set of documents; what is significant in a domain for a naive user may be quite different than what is significant to an expert.

The natural language processing (NLP) applications for which this research is useful utilize efficient shallow language analysis in order to produce output that improves information access, for example:

We have developed linguistically motivated techniques that can identify relevant information over a set of documents for finer-grained analysis, e.g. to improve the relative ranking of a document or to identify portions of documents that are relevant for a given application or user. To the extent that our techniques are based on linguistically motivated patterns and not on domain-dependent vocabularies, our patterns will apply to domain- and genre-independent text. This research is important because achieving even shallow understanding of nominal expressions remains a major blocking point for NLP systems. The hypothesis of our research is that better coverage and better portability across domains and genres are achievable using creative combinations of statistically and linguistically motivated techniques. We assume that efficient and effective known statistical techniques will continue to be used for large-scale tasks such as analysis of large corpora or retrieval of a set of possibly relevant documents from the web.

Goals, Objectives, and Targeted Activities

To achieve our goal, we have undertaken two tasks:

Based on our research, we are using a complete list of simplex noun phrases (NPs) in a document as our candidate list of significant topics (Wacholder 1998). A simplex NP is a maximal NP with a common noun as its head, where the NP may include premodifiers such as determiners and possessives but not post-nominal constituents such as prepositions or relativizers. Examples are 'asbestos fiber' and '9.8 billion Kent cigarettes'. Simplex NPs can be contrasted with complex NPs such as '9.8 billion Kent cigarettes with the filters', where the head of the NP is followed by a preposition, or '9.8 billion Kent cigarettes sold by the company', where the head is followed by a participial verb. An important part of identifying candidate significant topics is the linking of references within the document to the same entity, for example by anaphor resolution (linking a pronoun with its antecedent, e.g., 'lung cancer' in one sentence and 'it' in the next) (Kennedy and Boguraev 1996) or by linking syntactic variants (e.g., 'lung cancer' and 'cancer of the lung') (Jacquemin, Klavans, and Tzoukermann 1997).
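The simplex NP definition above can be illustrated with a small chunker over part-of-speech-tagged text. This is a minimal sketch, not the LinkIT implementation: it assumes Penn Treebank-style tags, and the tag sets and the function name simplex_nps are chosen here for illustration only.

```python
# Tags allowed inside a simplex NP (assumed set, Penn Treebank style):
# determiners, possessives, numbers, adjectives, and nouns.
NP_INTERNAL_TAGS = {"DT", "PRP$", "POS", "CD", "JJ", "JJR", "JJS",
                    "NN", "NNS", "NNP", "NNPS"}
# The head of a simplex NP must be a common noun.
COMMON_NOUN_TAGS = {"NN", "NNS"}


def simplex_nps(tagged):
    """Return maximal runs of NP-internal tokens that end in a common noun.

    tagged is a list of (word, tag) pairs. Any other tag (preposition,
    verb, relativizer, punctuation) closes the current run, so
    post-nominal material never becomes part of the phrase.
    """
    phrases, run = [], []
    # A sentinel with a non-NP tag flushes the final run.
    for word, tag in tagged + [("", "END")]:
        if tag in NP_INTERNAL_TAGS:
            run.append((word, tag))
            continue
        # Drop trailing tokens until the phrase ends in a common noun head.
        while run and run[-1][1] not in COMMON_NOUN_TAGS:
            run.pop()
        if run:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases


example = [("9.8", "CD"), ("billion", "CD"), ("Kent", "NNP"),
           ("cigarettes", "NNS"), ("with", "IN"), ("the", "DT"),
           ("filters", "NNS")]
print(simplex_nps(example))
# ['9.8 billion Kent cigarettes', 'the filters']
```

The complex NP in the example is split at the preposition into two simplex NPs, which matches the contrast drawn in the text.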

Both the linking of candidate significant topics and the identification of topics that are indeed significant are derived from the list of nominals, their variants, and relatively simple associated information about document structure. By document structure, we refer to information about where in the document the noun phrase occurred, based on information such as sentence number, position relative to other simplex noun phrases in the document, token span, section or segment, and text markup code.
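One way to picture the document-structure information is as a small record attached to each noun phrase occurrence. The sketch below is hypothetical: the class NPOccurrence, the function record_occurrences, and the field names are not taken from LinkIT, and section, segment, and markup information are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class NPOccurrence:
    text: str          # the noun phrase as it appears in the document
    sentence: int      # 0-based sentence number
    np_index: int      # position among all simplex NPs in the document
    token_span: tuple  # (start, end) document-level token offsets, end exclusive


def record_occurrences(sentences, np_spans):
    """Attach position information to each noun phrase.

    sentences: list of token lists, one per sentence.
    np_spans:  list of (sentence_no, start, end) in document order, with
               sentence-local token offsets and an exclusive end.
    """
    # Document-level offset of the first token of each sentence.
    offsets, total = [], 0
    for tokens in sentences:
        offsets.append(total)
        total += len(tokens)

    records = []
    for i, (s, start, end) in enumerate(np_spans):
        records.append(NPOccurrence(
            text=" ".join(sentences[s][start:end]),
            sentence=s,
            np_index=i,
            token_span=(offsets[s] + start, offsets[s] + end),
        ))
    return records


sentences = [["Asbestos", "fiber", "causes", "lung", "cancer", "."],
             ["The", "filters", "were", "removed", "."]]
for occurrence in record_occurrences(sentences, [(0, 0, 2), (0, 3, 5), (1, 0, 2)]):
    print(occurrence)
```

Keeping document-level token offsets alongside the sentence number makes it straightforward to compare the positions of two phrases that occur in different sentences.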

Indication of Success

The project award was received in August 1997, and we are well on target for the three-year period of the grant. In the initial eight-month period, we developed a modular system, tentatively named LinkIT. The input to LinkIT is text which has been tagged with part-of-speech tags by the Alembic system, a publicly available part-of-speech tagger developed by the Mitre Corporation [http://www.mitre.org/resources/centers/it/g063/alembic.html; Aberdeen et al. 1996]. LinkIT then parses the tagged text in order to collect a variety of information that is useful in identifying significant topics. In addition to our work on identifying and linking simplex noun phrases, which was reported on in last year's IDM report, we have achieved the following over the past year:

In addition, we have identified a novel technique for identifying significant topics which we call 'head sorting' (Wacholder 1998, under review). We are engaged in an ongoing process of evaluating and refining the LinkIT system, in order to make sure that it identifies complex nominals and proper names correctly and to improve the quality of linking. In addition, we are preparing to undertake an analysis of the significant topics identified by head sorting by comparing these nominals with the output of statistically-based systems (e.g. the SMART system or LSI). We also plan to explore ways to incorporate our output into the statistical indexing stage.
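This report does not spell out the head sorting procedure. The sketch below rests on an assumption suggested by the title of Wacholder (1998): simplex NPs are grouped by their head noun, and heads that occur in more NPs are ranked higher. The function name sort_by_head and the ranking rule are illustrative, not the authors' algorithm.

```python
from collections import defaultdict


def sort_by_head(noun_phrases):
    """Group simplex NPs by head and rank heads by how many NPs they head.

    Assumption: the head of a simplex NP is its last token, which holds
    for English phrases with no post-nominal material.
    """
    groups = defaultdict(list)
    for phrase in noun_phrases:
        tokens = phrase.split()
        if not tokens:
            continue  # ignore empty strings
        groups[tokens[-1].lower()].append(phrase)
    # Most frequent heads first; ties are broken alphabetically.
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))


phrases = ["asbestos fiber", "lung cancer", "the cancer",
           "9.8 billion Kent cigarettes", "cancer"]
for head, members in sort_by_head(phrases):
    print(head, members)
```

In this toy input the head 'cancer' comes first because it heads three of the five phrases. Whether raw frequency is the right ranking signal is exactly the kind of question the planned comparison with statistically based systems would address.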

The major focus of our effort for the third and last year of our project will be on <> techniques for identification of significant topics and on evaluation of our results by both qualitative and quantitative techniques. Comparison with the output of other systems will be an important part of this analysis. Our output will be used for other projects such as summarization of multiple documents (e.g. the Columbia NSF-STIMULATE projects) and a project which involves finding background information on the web about significant topics in a document. If our hypotheses are correct, we will have developed new ways to identify significant topics; we will also have achieved a better understanding of the relative strengths and weaknesses of statistically-based and rule-based natural language processing systems.

Project Impact

Human Resources (student participation -- graduate and undergraduate, minorities, persons with disabilities, women), directly funded students,

Your department/institution infrastructure

This project is part of the Columbia University Digital Library program under the Center for Research on Information Access (CRIA). The purpose of CRIA is to establish new links between Computer Science research which is relevant to digital libraries and the information services division of the university. Our goals are to perform creative research as well as to link this research with needed applications. To this end, we are developing tools and techniques which are essential for effective user-oriented text analysis and retrieval, and are also useful for publishing, library, and other information management tasks.

Industry -- collaborations, transfer of technology, patents

We have developed a plan with the Columbia University Libraries and with the electronic publishing division of Columbia University Press (CUP) to use LinkIT to build an intelligent indexing tool. We are also in contact with the Columbia Law School for potential use in Columbia Law Reviews.

What activities have been enabled/spawned because of the accomplishments made possible by your award?

Project References

Evans et al. 1999

Aberdeen, J., J. Burger, D. Day, L. Hirschman, and M. Vilain (1995) "Description of the Alembic system used for MUC-6". In Proceedings of MUC-6, Morgan Kaufmann.

Jacquemin, C., Klavans, J., and Tzoukermann, E. (1997) "Expansion of multi-word terms for indexing and retrieval using morphology and syntax." Proceedings of the 35th Annual ACL. 24-31.

Klavans, Judith L. (1998, to appear) "Databases in Digital Libraries: Where Computer Science and Information Management Meet." ACM-PODS Invited Tutorial.

Klavans, Judith L. and Min-Yen Kan (under review). "The Role of Verbs in Document Analysis."

Penn Treebank. Wall Street Journal, 1988. Treebank, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.

Wacholder et al. 1999

Wacholder, Nina (1998). "Simplex NPs Sorted by Head: a Method for Identifying Significant Topics within a Document."

Wacholder, Nina, Yael Ravin and Misook Choi (1997) "Disambiguation of proper names in text," Proceedings of the ANLP, ACL, Washington, DC.

Area Background

As the NSF works toward the goal of enabling increased universal access to the fast-growing body of electronic text, our research directly addresses the needs of information-seeking individuals to find what they need more easily and more reliably. To achieve this end, we are developing a range of innovative methods to improve current methodologies for information retrieval, indexing, extraction, and summarization. The specific focus of our project is on the identification of significant information in documents or sets of documents. This type of information is under-utilized by most available systems.

Area References

Cowie, Jim and Wendy Lehnert (1996). "Information Extraction." Communications of the ACM, 39(1): 80-91.

Hirschman, Lynette and Marc Vilain (1995). Extracting Information from the MUC. ACL Tutorial.

Klavans, Judith L. and Philip Resnik, editors (1996). The Balancing Act: Combining Symbolic and Statistical Approaches to Language. MIT Press, Cambridge, Mass.

Paice, Chris D. (1990). "Constructing literature abstracts by computer: techniques and prospects." Information Processing & Management, 26: 171-186.

Wilkinson, R. (1994). "Effective Retrieval of Structured Documents," ACM-SIGIR Proceedings. 311-317.


Last modified: Mon Feb 1 01:28:54 EST 1999