Automatic Identification of Significant Topics in Domain-independent Full Text Documents

Contact Information

Judith L. Klavans	Nina Wacholder
Center for Research on Information Access	Center for Research on Information Access
Columbia University	Columbia University
535 W. 114th Street, MC 1101	535 W. 114th Street, MC 1101
New York, NY 10027	New York, NY 10027
Phone: 212-854-7443	Phone: 212-854-7443
Fax: 212-666-0140	Fax: 212-666-0140
klavans@cs.columbia.edu	nina@cs.columbia.edu

WWW Page

http://www.cs.columbia.edu/~klavans

Keywords

Natural language processing, information access, information retrieval, document analysis, topic identification, topic detection and tracking.

Project Award Information

Award Number: IRI-97-12069

Title: Automatic Identification of Significant Topics in Domain-Independent Full Text Documents

Duration: Three years

Dates: September 1997 to August 2000

Project Summary

The goal of this project is to develop a suite of techniques for identifying significant topics in edited documents such as newspaper articles. For the purposes of this research, a 'topic' is any event or entity explicitly referred to in a document, and a 'significant topic' is a topic central to what is sometimes called the 'aboutness' of a document. The notion 'significant', like the notion 'relevant', is both task and user dependent. For example, what is considered significant for an application that answers specific questions differs from what is significant for an application that conveys the gist of a particular document or set of documents; what is significant in a domain for a naive user may be quite different than what is significant to an expert.

The natural language processing (NLP) applications for which this research is useful utilize efficient shallow language analysis in order to produce output that improves information access, for example:

automatic indexing tasks, either for digital library applications or for direct user manipulation.
second stage information retrieval, where a subset of a larger corpus has been determined to be potentially relevant, perhaps by a statistically based system. The subset can then be further filtered in order to identify documents which are likely to be of interest for a particular query or which may provide the answer to a specific question.
advanced information extraction where important entities in the document must be identified and linked so that information about the entity from different parts of the document can be related.
summarization or other techniques for conveying the gist of a document.

We have developed linguistically motivated techniques which can identify relevant information over a set of documents for finer-grained analysis, e.g., to improve the relative ranking of the document or to identify portions of documents which are relevant for a given application or user. To the extent that our techniques are based on linguistically motivated patterns and not on domain-dependent vocabularies, our patterns will apply to domain- and genre-independent text. This research is important because achieving even shallow understanding of nominal expressions remains a major blocking point for NLP systems. The hypothesis of our research is that better coverage and better portability across domains and genres are achievable using creative combinations of statistically and linguistically motivated techniques. We assume that efficient and effective known statistical techniques will continue to be used for large-scale tasks such as analysis of large corpora or retrieving a set of possibly relevant documents from the web.

Goals, Objectives, and Targeted Activities

To achieve our goal, we have undertaken three tasks:

development of a domain-general method for identifying a list of candidate significant topics in a document that is as complete as is practical, given that full natural language understanding is not likely to be achieved by NLP systems for the foreseeable future.
development of a suite of functions for identifying the most significant of these candidate topics. Our particular contribution will consist of a set of carefully targeted techniques for solving some of the more challenging problems associated with nominal referring expressions and their variants.
development of evaluation tools which consist of incorporating our techniques into standard systems for direct comparison with other results. Our contribution consists of creating new evaluation methods which accurately reflect our linguistic information, while permitting traditional measurements simultaneously.

Based on our research, we are using a complete list of simplex noun phrases (NPs) in a document as our candidate list of significant topics (Wacholder 1998). A simplex NP is a maximal NP with a common noun as its head, where the NP may include premodifiers such as determiners and possessives but not post-nominal constituents such as prepositions or relativizers. Examples are 'asbestos fiber' and '9.8 billion Kent cigarettes'. Simplex NPs can be contrasted with complex NPs such as '9.8 billion Kent cigarettes with the filters', where the head of the NP is followed by a preposition, or '9.8 billion Kent cigarettes sold by the company', where the head is followed by a participial verb. An important part of identifying candidate significant topics is the linking of references within the document to the same entity, for example by anaphor resolution (linking a pronoun with its antecedent, e.g., 'lung cancer' in one sentence and 'it' in the next) (Kennedy and Boguraev 1996) or by linking syntactic variants (e.g., 'lung cancer' and 'cancer of the lung') (Jacquemin, Klavans, and Tzoukermann 1997).
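The simplex NP definition above can be approximated over part-of-speech tagged text. The following is a minimal sketch under stated assumptions, not LinkIT's actual implementation: it assumes Penn Treebank style tags, and the tag sets and the function name simplex_nps are illustrative choices that do not come from the project.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, Penn Treebank style POS tag)

# Assumed tag sets, chosen for illustration only (not LinkIT's rules).
PREMODIFIER_TAGS = {"DT", "PRP$", "POS", "CD", "JJ", "JJR", "JJS", "NNP", "NNPS"}
COMMON_NOUN_TAGS = {"NN", "NNS"}
NP_TAGS = PREMODIFIER_TAGS | COMMON_NOUN_TAGS


def simplex_nps(tagged: List[Token]) -> List[str]:
    """Return maximal runs of NP-internal tags that end in a common-noun head."""
    phrases: List[str] = []
    run: List[Token] = []
    # A sentinel token with a non-NP tag flushes the final run.
    for word, tag in tagged + [("", "EOS")]:
        if tag in NP_TAGS:
            run.append((word, tag))
            continue
        # Trim trailing tokens until the run ends in a common noun.
        while run and run[-1][1] not in COMMON_NOUN_TAGS:
            run.pop()
        if run:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases
```

For the tagged sequence 9.8/CD billion/CD Kent/NNP cigarettes/NNS with/IN the/DT filters/NNS, this sketch yields '9.8 billion Kent cigarettes' and 'the filters': each phrase stops at the preposition, so the complex NP is split into its simplex parts.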

Both the linking of candidate significant topics and the identification of topics that are indeed significant are derived from the list of nominals, their variants, and relatively simple associated information about document structure. By document structure, we refer to information about where in the document the noun phrase occurred, based on information such as sentence number, position relative to other simplex noun phrases in the document, token span, section or segment, and text markup code.
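The document structure information listed above could be stored as one record per noun phrase occurrence. This is only a sketch of one possible representation: the class name NPOccurrence, its field names, and the helper first_mention are hypothetical and are not taken from LinkIT.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class NPOccurrence:
    """One occurrence of a simplex NP, with hypothetical structure fields."""
    text: str                    # the noun phrase as it appears
    sentence_number: int         # index of the containing sentence
    np_index: int                # position relative to other simplex NPs
    token_span: Tuple[int, int]  # (first token, last token) in the document
    segment: str                 # section or segment label
    markup: str                  # text markup code, e.g. a headline tag


def first_mention(occurrences: Iterable[NPOccurrence]) -> NPOccurrence:
    """Return the occurrence that appears earliest in the document."""
    return min(occurrences, key=lambda o: (o.sentence_number, o.np_index))
```

Keeping these fields on every occurrence lets later stages compare, for instance, a phrase's first mention with its later mentions without reparsing the text.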

Indication of Success

The project award was received in August 1997, and we are well on target for the three-year period of the grant. In the initial eight-month period, we developed a modular system, tentatively named LinkIT. The input to LinkIT is text which has been tagged with part-of-speech tags by the Alembic system, a publicly available part-of-speech tagger developed by the Mitre Corporation [http://www.mitre.org/resources/centers/it/g063/alembic.html; Aberdeen et al. 1996]. LinkIT then parses the tagged text in order to collect a variety of information that is useful in identifying significant topics. In addition to our work on identifying and linking simplex noun phrases, which was reported on in last year's IDM report, we have achieved the following over the past year:

Refinement of the SNAP module for identifying Simplex Noun Phrases

implementation of linguistically motivated techniques to filter the list of SNPs in the topic.
development of a ranking metric to identify the most significant of the SNPs, based on frequency of the head of the SNP and on frequency of the modifier
evaluation of the ranking metric
comparison of the performance of the head sorting method for conveying the 'gist' of a document to two other techniques: keyword frequency (the tf of the tf*idf method) and repeated word sequences (based on the technical term approach of Justeson and Katz (1995), Natural Language Engineering 1(1): 9-27; see also John S. Justeson and Slava M. Katz, "Technical terminology: some linguistic properties and an algorithm for identification in text," IBM Research Report xxxx, 1993)

Implementation of a new module, VERSO, to identify main verbs in the document
analysis of the role of grammatical categories in a statistical IR system

created a part-of-speech tagged corpus of over 200 MB, based on the TREC disk 1 and disk 2 collections (the Ziff-Davis, Wall Street Journal, AP Newswire, and Federal Register data sets were used), for testing and evaluating our techniques.
used LinkIT output to create enhanced versions of the full text documents and also reduced versions of the documents consisting only of words from different grammatical categories, and compared the results of IR over these corpora to the original full text corpus.
running all versions of documents through the SMART IR system
development of the DFI (Distance from Ideal) metric to closely analyze the performance of different versions of documents on specific queries
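The reduced document versions described above, which keep only words from selected grammatical categories, can be sketched as a filter over tagged tokens. This is an illustrative sketch only: it assumes Penn Treebank style tags, and the function name reduce_document and its prefix-matching parameter are hypothetical rather than part of LinkIT or SMART.

```python
from typing import Iterable, Set, Tuple


def reduce_document(tagged: Iterable[Tuple[str, str]], keep_prefixes: Set[str]) -> str:
    """Keep only the words whose POS tag starts with one of the given prefixes."""
    prefixes = tuple(keep_prefixes)  # str.startswith accepts a tuple of prefixes
    kept = [word for word, tag in tagged if tag.startswith(prefixes)]
    return " ".join(kept)
```

With this sketch, passing {"NN"} keeps common and proper nouns (NN, NNS, NNP, NNPS) and passing {"VB"} keeps all verb forms, so each call yields one reduced version of the document that could then be indexed separately.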

refinement of visualization tool for displaying documents, significant topics in the document and grammatical constituents that may help in the identification of significant topics.
ongoing discussion with the Columbia University Law School and the Columbia University Press about how to adapt LinkIT for use in indexing of electronic documents.
successful use of LinkIT output to measure paragraph similarity in the NSF Stimulate Project

In addition, we have identified a novel technique for identifying significant topics which we call 'head sorting' (Wacholder 1998, under review). We are engaged in an ongoing process of evaluating and refining the LinkIT system, in order to make sure that it identifies complex nominals and proper names correctly and to improve the quality of linking. In addition, we are preparing to undertake an analysis of the significant topics identified by head sorting by comparing these nominals with the output of statistically-based systems (e.g. the SMART system or LSI). We also plan to explore ways to incorporate our output into the statistical indexing stage.
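The idea behind head sorting can be illustrated with a small sketch. It is not the method as specified in Wacholder (1998): it assumes that the head of a simplex NP is its final word, does no morphological normalization, and ranks heads by raw frequency only. The function name head_sort is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def head_sort(phrases: List[str]) -> List[Tuple[str, List[str]]]:
    """Group simplex NPs by head and order the groups by head frequency."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for phrase in phrases:
        words = phrase.lower().split()
        if words:
            # Assumption: the head is the final word of a simplex NP.
            groups[words[-1]].append(phrase)
    # Most frequent heads first; ties are broken alphabetically by head.
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
```

Given the phrases 'asbestos fiber', 'lung cancer', 'cancer', 'the fiber', 'Kent cigarettes', and 'cancer', the sketch ranks the head 'cancer' first (three phrases), then 'fiber' (two), then 'cigarettes' (one), with each head listed alongside the phrases that share it.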

The major focus of our effort for the third and last year of our project will be on techniques for identification of significant topics and on evaluation of our results by both qualitative and quantitative techniques. Comparison with the output of other systems will be an important part of this analysis. Our output will be used for other projects such as summarization of multiple documents (e.g. the Columbia NSF-STIMULATE projects) and a project which involves finding background information on the web about significant topics in a document. If our hypotheses are correct, we will have developed new ways to identify significant topics; we will also have achieved a better understanding of the relative strengths and weaknesses of statistically-based and rule-based natural language processing systems.

Project Impact

Human Resources (student participation -- graduate and undergraduate, minorities, persons with disabilities, women), directly funded students

David K. Evans, a Columbia University graduate student, is fully funded. He is responsible for developing the LinkIT software. This year he transferred from the Master's degree program to the PhD program.
Sonja Allin, a Columbia graduate, has returned to campus as a General Studies student for a BA in computer science. She has been working on a non-credit project to improve and refine our tool for document visualization.
Eleazar Eskin, a CS PhD student, used LinkIT output to measure document similarity.

Your department/institution infrastructure

This project is part of the Columbia University Digital Library program under the Center for Research on Information Access (CRIA). The purpose of CRIA is to establish new links between Computer Science research which is relevant to digital libraries and the information services division of the university. Our goals are to perform creative research as well as to link this research with needed applications. To this end, we are developing tools and techniques which are essential for effective user-oriented text analysis and retrieval, and are also useful for publishing, library, and other information management tasks.

Industry -- collaborations, transfer of technology, patents

We have developed a plan with the Columbia University Libraries and with the electronic publishing division of Columbia University Press (CUP) to use LinkIT to build an intelligent indexing tool. We are also in contact with the Columbia Law School for potential use in Columbia Law Reviews.

What activities have been enabled/spawned because of the accomplishments made possible by your award?

collaboration with related applications in the Department of Computer Science
discussion with Columbia University Press for an indexer's aid tool

Project References

Evans et al. 1999

Aberdeen, J., J. Burger, D. Day, L. Hirschman, and M. Vilain (1995) "Description of the Alembic system used for MUC-6". In Proceedings of MUC-6, Morgan Kaufmann.

Jacquemin, C., Klavans, J., and Tzoukermann, E. (1997) "Expansion of multi-word terms for indexing and retrieval using morphology and syntax." Proceedings of the 35th Annual ACL. 24-31.

Klavans, Judith L. (1998, to appear) "Databases in Digital Libraries: Where Computer Science and Information Management Meet." ACM-PODS Invited Tutorial.

Klavans, Judith L. and Min-Yen Kan (under review). "The Role of Verbs in Document Analysis."

Penn Treebank. Wall Street Journal, 1988. Treebank, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.

Wacholder et al. 1999

Wacholder, Nina (1998). "Simplex NPS Sorted by Head: a Method for Identifying Significant Topics within a Document."

Wacholder, Nina, Yael Ravin and Misook Choi (1997) "Disambiguation of proper names in text," Proceedings of the ANLP, ACL, Washington, DC.

Area Background

As the NSF works toward the goal of enabling increased universal access to the fast-growing body of electronic text, our research directly addresses the needs of information-seeking individuals to find what they need more easily and more reliably. To achieve this end, we are developing a range of innovative methods to improve current methodologies for information retrieval, indexing, extraction, and summarization. The specific focus of our project is on the identification of significant information in documents or sets of documents. This type of information is under-utilized by most available systems.

Area References

Cowie, Jim and Wendy Lehnert (1996). "Information Extraction." Communications of the ACM, 39(1): 80-91.

Hirschman, Lynette and Marc Vilain (1995). Extracting Information from the MUC. ACL Tutorial.

Klavans, Judith L. and Philip Resnik, editors (1996). The Balancing Act: Combining Symbolic and Statistical Approaches to Language. MIT Press, Cambridge, Mass.

Paice, Chris D. (1990). "Constructing literature abstracts by computer: techniques and prospects." Information Processing & Management, 26: 171-186.

Wilkinson, R. (1994). "Effective Retrieval of Structured Documents," ACM-SIGIR Proceedings. 311-317.

Last modified: Mon Feb 1 01:28:54 EST 1999