Project Summary

Automatic Identification of Significant Topics in Domain-independent Full Text Documents

Contact Information

Judith L. Klavans
Center for Research on Information Access
Columbia University
535 W. 114th Street, MC 1101
New York, NY 10027
Phone: 212-854-7443
Fax: 212-666-0140
klavans@cs.columbia.edu

Nina Wacholder
Center for Research on Information Access
Columbia University
535 W. 114th Street, MC 1101
New York, NY 10027
Phone: 212-854-7443
Fax: 212-666-0140
nina@cs.columbia.edu

WWW Page

http://www.cs.columbia.edu/~klavans

Keywords

Natural language processing, information access, information retrieval, document analysis, topic identification.

Project Award Information

Award Number: IRI-97-12069

Title: Automatic Identification of Significant Topics in Domain-Independent Full Text Documents

Duration: Three years

Dates: September 1997 to August 2000

Project Summary

The goal of this project is to develop a suite of techniques for identifying significant topics in edited documents such as newspaper articles. For the purposes of this research, a ‘topic’ is any event or entity explicitly referred to in a document, and a ‘significant topic’ is a topic central to what is sometimes called the ‘aboutness’ of a document. The notion ‘significant’, like the notion ‘relevant’, is both task and user dependent. For example, what is considered significant for an application that answers specific questions differs from what is significant for an application that conveys the gist of a particular document or set of documents; what is significant in a domain for a naive user may be quite different from what is significant to an expert.

The natural language processing (NLP) applications for which this research is useful require shallow language analysis in order to produce output that users will find satisfactory. Examples of this type of application include:

summarization or other techniques for conveying the gist of a document.
advanced information extraction where important entities in the document must be identified and linked so that information about the entity from different parts of the document can be related.
second stage information retrieval, where a subset of a larger corpus has been determined to be potentially relevant, perhaps by a statistically based system. The subset can then be further filtered in order to identify documents which are likely to be of interest for a particular query or which may provide the answer to a specific question.

We are developing linguistically motivated techniques that can identify relevant information over a set of documents for finer-grained analysis, e.g. to improve the relative ranking of a document or to identify portions of documents that are relevant for a given application or user. To the extent that our techniques are based on linguistically motivated patterns and not on domain-dependent vocabularies, our patterns will apply to text regardless of domain and genre. This research is important because achieving even shallow understanding of nominal expressions remains a major blocking point for NLP systems. The hypothesis of our research is that better coverage and better portability across domains and genres are achievable using creative combinations of statistically and linguistically motivated techniques. We assume that efficient and effective known statistical techniques will continue to be used for large-scale tasks such as analysis of large corpora or retrieval of a set of possibly relevant documents from the web.

Goals, Objectives, and Targeted Activities

To achieve our goal, we have undertaken two tasks:

development of a domain-general method for identifying a list of candidate significant topics in a document that is as complete as is practical, given that full natural language understanding is not likely to be achieved by NLP systems for the foreseeable future.
development of a suite of functions for identifying the most significant of these candidate topics. Our particular contribution will consist of a set of carefully targeted techniques for solving some of the more challenging problems associated with nominal referring expressions and their variants.

Every document can be viewed as forming its own self-contained world. A document is written to get across a particular idea or set of ideas. The task of the author, at least in documents intended for public distribution, is to convey to the reader what general knowledge is assumed and to inform the reader of the context so that ambiguous expressions such as proper names can be easily identified. For example, in edited text, the first reference to a named entity typically uses a relatively full form of the name, in a version which is sufficient to disambiguate the reference for the expected audience. Later in the document, the same entity can be referred to by shorter, more ambiguous forms (Wacholder et al. 1997). For example, an article might first refer to Columbia University (or, more formally, Columbia University in the City of New York), and later refer to it only by the name Columbia. Without the initial disambiguating reference, Columbia by itself is quite ambiguous. It might be a city (Columbia, MD), a bank (Columbia Savings and Loan) or one of many other entities. Once the entity has been disambiguated, a common NP (an NP headed by a common noun) such as the university or a pronoun such as it could also be used to refer to the entity formally called Columbia University in the City of New York.
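The pattern of a full first reference followed by shorter forms can be illustrated with a minimal sketch. This is not the disambiguation method of Wacholder et al. (1997); it is a simple word-subset heuristic, and the function name link_name_mentions is hypothetical. It assumes the proper names have already been extracted in document order.

```python
def link_name_mentions(mentions):
    """Link each name mention to the earliest fuller form that contains it.

    `mentions` is a list of proper-name strings in document order.
    Assumption (illustrative only): a mention is a variant of an earlier
    name if all of its words also occur in that earlier name.
    Returns a list of (mention, canonical_form) pairs.
    """
    seen = []   # distinct fuller forms seen so far, in document order
    links = []
    for mention in mentions:
        words = set(mention.lower().split())
        canonical = mention
        for earlier in seen:
            if words <= set(earlier.lower().split()):
                canonical = earlier  # shorter (or repeated) form of an earlier name
                break
        else:
            # No earlier name covers this mention: treat it as a new entity.
            seen.append(mention)
        links.append((mention, canonical))
    return links


if __name__ == "__main__":
    for mention, canonical in link_name_mentions(
        ["Columbia University", "Columbia", "New York", "Columbia"]
    ):
        print(f"{mention!r} -> {canonical!r}")
```

A heuristic of this kind only works because the full form appears first; a bare Columbia with no earlier fuller reference is left unlinked, which mirrors the ambiguity described above.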

Based on our research, we are using a complete list of simplex noun phrases (NPs) in a document as our candidate list of significant topics (Wacholder 1998, under review). A simplex NP is a maximal NP with a common noun as its head, where the NP may include premodifiers such as determiners and possessives but not post-nominal constituents such as prepositions or relativizers. Examples are asbestos fiber and 9.8 billion Kent cigarettes. Simplex NPs can be contrasted with complex NPs such as 9.8 billion Kent cigarettes with the filters, where the head of the NP is followed by a preposition, or 9.8 billion Kent cigarettes sold by the company, where the head is followed by a participial verb. An important part of identifying candidate significant topics is the linking of references within the document to the same entity, for example by anaphor resolution (linking a pronoun with its antecedent, e.g., lung cancer in one sentence and it in the next) (Kennedy and Boguraev 1996) or by linking syntactic variants (e.g., lung cancer and cancer of the lung) (Jacquemin, Klavans, and Tzoukermann 1997).
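The definition of a simplex NP can be made concrete with a small sketch that works over part-of-speech-tagged tokens. This is not the LinkIT implementation: the tag sets below are an assumption based on Penn Treebank tags, the function names are hypothetical, and pronouns and proper-name heads are deliberately ignored.

```python
# Assumed Penn Treebank tags that may appear inside a simplex NP.
PREMODIFIER_TAGS = {"DT", "PRP$", "CD", "JJ", "JJR", "JJS",
                    "NN", "NNS", "NNP", "NNPS", "POS"}
HEAD_TAGS = {"NN", "NNS"}  # the head must be a common noun


def simplex_np_spans(tagged):
    """Return (start, end, head_index) for each simplex NP; end is exclusive.

    `tagged` is a list of (token, tag) pairs for one sentence.
    """
    spans = []
    i, n = 0, len(tagged)
    while i < n:
        if tagged[i][1] not in PREMODIFIER_TAGS:
            i += 1
            continue
        # Take the maximal run of premodifier and noun tags.
        j = i
        while j < n and tagged[j][1] in PREMODIFIER_TAGS:
            j += 1
        # Trim from the right until the last token is a common-noun head.
        end = j
        while end > i and tagged[end - 1][1] not in HEAD_TAGS:
            end -= 1
        if end > i:
            spans.append((i, end, end - 1))
        i = j
    return spans


def simplex_np_texts(tagged):
    """Return the simplex NPs of a tagged sentence as strings."""
    return [" ".join(tok for tok, _ in tagged[s:e])
            for s, e, _ in simplex_np_spans(tagged)]


if __name__ == "__main__":
    sentence = [("9.8", "CD"), ("billion", "CD"), ("Kent", "NNP"),
                ("cigarettes", "NNS"), ("with", "IN"), ("the", "DT"),
                ("filters", "NNS")]
    print(simplex_np_texts(sentence))
```

Because the preposition with is not in the allowed tag set, the complex NP is split into two simplex NPs, 9.8 billion Kent cigarettes and the filters, rather than being returned as one unit.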

Both the linking of candidate significant topics and the identification of topics that are indeed significant will be derived from the list of nominals and their variants, together with relatively simple associated information about document structure. By document structure, we refer to information about where in the document the noun phrase occurred, based on information such as sentence number, position relative to other simplex noun phrases in the document, token span, section or segment, and text markup code.
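One way to hold this kind of positional information is a small record per simplex NP occurrence. The sketch below is an assumption about a convenient representation, not a description of LinkIT's internal data structures; the class name, field names, and helper functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NPOccurrence:
    """One occurrence of a simplex NP plus where it sits in the document."""
    text: str
    sentence: int                   # 1-based sentence number
    np_number: int                  # running count of simplex NPs in the document
    token_start: int                # 1-based position of the first token in the sentence
    token_end: int                  # 1-based position of the last token (inclusive)
    section: Optional[str] = None   # section or segment label, if known


def in_document_order(occurrences):
    """Sort occurrences by sentence, then by position within the sentence."""
    return sorted(occurrences, key=lambda o: (o.sentence, o.token_start))


def first_occurrence(occurrences, text):
    """Return the earliest occurrence whose text matches, or None."""
    for occ in in_document_order(occurrences):
        if occ.text.lower() == text.lower():
            return occ
    return None


if __name__ == "__main__":
    records = [
        NPOccurrence("asbestos", sentence=1, np_number=2, token_start=4, token_end=4),
        NPOccurrence("a form", sentence=1, np_number=1, token_start=1, token_end=2),
    ]
    for occ in in_document_order(records):
        print(occ)
```

Keeping position in a plain record like this makes it straightforward to ask positional questions, such as where a nominal first appears, without re-reading the text.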

Indication of Success

The project award was received in August 1997, and we are well on target for the three-year period of the grant. In the initial eight-month period, we have developed a modular system, tentatively named LinkIT. The input to LinkIT is text that has been tagged with part-of-speech tags by the Alembic system, a publicly available part-of-speech tagger developed by the Mitre Corporation [http://www.mitre.org/resources/centers/advanced_info/g04h/alembic.html; Aberdeen et al. 1996].

The examples below are from wsj 0003 of the Penn Treebank. The opening sentences of the article are reproduced here; simplex NPs are bracketed and head nouns are in italics: "[A form] of [asbestos] once used to make [Kent cigarette filters] has caused [a high percentage] of [cancer deaths] among [a group] of [workers] exposed to [it] more than [30 years] ago, [researchers] reported. [The asbestos fiber], [crocidolite], is unusually resilient once [it] enters [the lung], with [even brief exposures] to [it] causing [[symptoms] that show up [decades] later], [researchers] said."

To date, LinkIT identifies the following:

a list of simplex noun phrases in the document (a form, asbestos, Kent cigarette filters),

the sentence number, nominal number and token span of each simplex NP (e.g., sentence 1, tokens 1-2, simplex NP 1 for a form)

a preliminary version of the linking technology, which identifies a variety of information such as:

candidate appositives (e.g., crocidolite is an appositive modifying the NP the asbestos fiber)

candidate postmodifiers (e.g. of asbestos is a postmodifier of the NP a form)
candidate head links (e.g. asbestos is the head of simplex NPs 2, 41, 43, 67, 115, 118, 141 and 143)
document statistics such as the number of simplex NPs per document
a visualization tool for displaying documents and their significant topics.

In addition, we have identified a novel technique for identifying significant topics which we call ‘head sorting’ (Wacholder 1998, under review).
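The general idea of grouping simplex NPs under a shared head can be sketched as follows. The details of head sorting are given in Wacholder (1998, under review) and are not reproduced here; this sketch assumes that the head is the last word of each simplex NP and that heads are ranked by how many NPs they govern, both of which are simplifying assumptions. The function name sort_by_head is hypothetical.

```python
from collections import defaultdict


def sort_by_head(simplex_nps):
    """Group simplex NPs by head and rank the heads by frequency.

    Assumptions (illustrative only): the head is the last word of the NP,
    and more frequent heads are listed first, with ties broken alphabetically.
    Returns a list of (head, [NPs sharing that head]) pairs.
    """
    groups = defaultdict(list)
    for np_text in simplex_nps:
        head = np_text.lower().split()[-1]
        groups[head].append(np_text)
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))


if __name__ == "__main__":
    nps = ["a form", "asbestos", "Kent cigarette filters",
           "the asbestos fiber", "asbestos", "the filters"]
    for head, members in sort_by_head(nps):
        print(f"{head}: {members}")
```

Grouping by head brings together NPs such as Kent cigarette filters and the filters, which share a head but differ in their premodifiers.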

We are engaged in an ongoing process of evaluating and refining the LinkIT system, in order to make sure that it identifies complex nominals and proper names correctly and to improve the quality of linking. In addition, we are preparing to undertake an analysis of the significant topics identified by head sorting by comparing these nominals with the output of statistically-based systems (e.g. the SMART system or LSI). We also plan to explore ways to incorporate our output into the statistical indexing stage.

The major focus of our effort for the remaining two years of the project will be on creative use of linguistically motivated techniques for identification of significant topics and on evaluation of our results by both qualitative and quantitative techniques. Comparison with the output of other systems will be an important part of this analysis. Our output will be used for other projects such as summarization of multiple documents (e.g. the Columbia NSF-STIMULATE projects) and a project which involves finding background information on the web about significant topics in a document. If our hypotheses are correct, we will have developed new ways to identify significant topics; we will also have achieved a better understanding of the relative strengths and weaknesses of statistically-based and rule-based natural language processing systems.

Project Impact

Human Resources (student participation -- graduate and undergraduate, minorities, persons with disabilities, women), directly funded students,

David K. Evans, a Columbia University graduate student, is fully funded. He is responsible for developing the LinkIT software.
Himani Naresh, a junior computer science major at Columbia College, is working for credit on the visualization tool research project.
Sonja Allin, a Columbia graduate, has returned to campus as a Barnard student for a BA in computer science. She is working on a non-credit project to adapt the visualization tool for other uses.
Adam Dinwoodie, a Columbia College sophomore, is doing a special project using the same tagging tools to identify verbs.

Your department/institution infrastructure

This project is part of the Columbia University Digital Library program under the Center for Research on Information Access (CRIA). The purpose of CRIA is to establish new links between Computer Science research which is relevant to digital libraries and the information services division of the university. Our goals are to perform creative research as well as to link this research with needed applications. To this end, we are developing tools and techniques which are essential for effective user-oriented text analysis and retrieval, and are also useful for publishing, library, and other information management tasks.

Industry -- collaborations, transfer of technology, patents.

We have initiated discussions with the electronic publishing division of Columbia University Press (CUP), which is interested in using LinkIT in an indexer's workstation. Our goal is to perfect the nominal identification tool, along with the visualization tool, to the point where we can test their usefulness with the CUP indexers as an indexing aid.

What activities have been enabled/spawned because of the accomplishments made possible by your award?

collaboration with related applications in the Department of Computer Science
discussion with Columbia University Press for an indexer's aid tool

Project References

Aberdeen, J., J. Burger, D. Day, L. Hirschman, and M. Vilain (1995) "Description of the Alembic system used for MUC-6". In Proceedings of MUC-6, Morgan Kaufmann.

Boguraev, Branimir and Christopher Kennedy (1997) "Technical terminology for domain specification and content characterization."

Jacquemin, C., Klavans, J., and Tzoukermann, E. (1997) "Expansion of multi-word terms for indexing and retrieval using morphology and syntax." Proceedings of the 35th Annual ACL. 24-31.

Kameyama, Megumi "Recognizing referential links: an information extraction perspective." cmp-lg/9707009.

Klavans, Judith L. (1998, to appear) "Databases in Digital Libraries: Where Computer Science and Information Management Meet." ACM-PODS Invited Tutorial.

Klavans, Judith L. and Min-Yen Kan (under review). "The Role of Verbs in Document Analysis."

Penn Treebank. Wall Street Journal, 1988. Treebank, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.

Wacholder, Nina (under review). "Simplex NPs Sorted by Head: a Method for Identifying Significant Topics within a Document."

Wacholder, Nina, Yael Ravin and Misook Choi (1997) "Disambiguation of proper names in text," Proceedings of the ANLP, ACL, Washington, DC.

Area Background

The focus of this research is on the identification of significant information in documents or sets of documents. The results of our research will be useful for applications such as effective indexing for many areas of information access including information extraction, retrieval, and presentation of results in summarization and visualization. The motivation is discussed in the first two paragraphs above in the project summary.

Area References

Cowie, Jim and Wendy Lehnert (1996). "Information Extraction." Communications of the ACM, 39(1): 80-91.

Hirschman, Lynette and Marc Vilain (1995). Extracting Information from the MUC. ACL Tutorial.

Klavans, Judith L. and Philip Resnik, editors (1996). The Balancing Act: Combining Symbolic and Statistical Approaches to Language. MIT Press, Cambridge, Mass.

Paice, Chris D. (1990). "Constructing literature abstracts by computer: techniques and prospects." Information Processing & Management, 26: 171-186.

Wilkinson, R. (1994). "Effective Retrieval of Structured Documents," ACM-SIGIR Proceedings. 311-317.