Automatic Identification of Significant Topics in Domain-independent Full Text Documents

 

Contact Information

 

Judith L. Klavans

Center for Research on Information Access

Columbia University

535 W. 114th Street, MC 1101

New York, NY 10027

Phone: 212-854-7443

Fax: 212-666-0140

klavans@cs.columbia.edu

 

Nina Wacholder

Center for Research on Information Access

Columbia University

535 W. 114th Street, MC 1101

New York, NY 10027

Phone: 212-854-7443

Fax: 212-666-0140

nina@cs.columbia.edu

 

WWW Page

 

http://www.cs.columbia.edu/~klavans

 

Keywords

 

Natural language processing, information access, information retrieval, document analysis, topic identification.

Project Award Information

 

Award Number: IRI-97-12069

Title: Automatic Identification of Significant Topics in Domain-Independent Full Text Documents

Duration: Three years

Dates: September 1997 to August 2000

 

Project Summary

 

The goal of this project is to develop a suite of techniques for identifying significant topics in edited documents such as newspaper articles. For the purposes of this research, a ‘topic’ is any event or entity explicitly referred to in a document, and a ‘significant topic’ is a topic central to what is sometimes called the ‘aboutness’ of a document. The notion ‘significant’, like the notion ‘relevant’, is both task and user dependent. For example, what is considered significant for an application that answers specific questions differs from what is significant for an application that conveys the gist of a particular document or set of documents; what is significant in a domain for a naive user may be quite different from what is significant to an expert.

The natural language processing (NLP) applications for which this research is useful require shallow language analysis in order to produce output that users will find satisfactory. Examples of this type of application include:

 

We are developing linguistically motivated techniques which can identify relevant information over a set of documents for finer-grained analysis, e.g. to improve the relative ranking of a document or to identify portions of documents which are relevant for a given application or user. To the extent that our techniques are based on linguistically motivated patterns and not on domain-dependent vocabularies, our patterns will apply to text independent of domain and genre. This research is important because achieving even shallow understanding of nominal expressions remains a major blocking point for NLP systems. The hypothesis of our research is that better coverage and better portability across domains and genres are achievable using creative combinations of statistically and linguistically motivated techniques. We assume that efficient and effective known statistical techniques will continue to be used for large-scale tasks such as analysis of large corpora or retrieving a set of possibly relevant documents from the web.

 

Goals, Objectives, and Targeted Activities

 

To achieve our goal, we have undertaken two tasks:

 

Every document can be viewed as forming its own self-contained world. A document is written to get across a particular idea or set of ideas. The task of the author, at least in documents intended for public distribution, is to convey to the reader what general knowledge is assumed and to inform the reader of the context so that ambiguous expressions such as proper names can be easily identified. For example, in edited text, the first reference to a named entity typically uses a relatively full form of the name, in a version which is sufficient to disambiguate the reference for the expected audience. Later in the document, the same entity can be referred to by shorter, more ambiguous forms (Wacholder et al. 1997). For example, an article might first refer to Columbia University (or, more formally, Columbia University in the City of New York), and later refer to it only by the name Columbia. Without the initial disambiguating reference, Columbia by itself is quite ambiguous. It might be a city (Columbia, MD), a bank (Columbia Savings and Loan) or one of many other entities. Once the entity has been disambiguated, a common NP (an NP headed by a common noun) such as the university, or a pronoun such as it, could also be used to refer to the entity formally called Columbia University in the City of New York.
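
The pattern described above, a full first reference followed by shorter forms, can be illustrated with a small sketch. This is not the method of Wacholder et al. (1997); it is a simplified stand-in that assumes the name mentions have already been extracted in document order, and that a shorter form refers to the earliest previously seen name containing all of its words. The function name link_short_names is hypothetical.

```python
def link_short_names(mentions):
    """Link each name mention to an earlier, fuller form of the same name.

    Assumption for this sketch: a mention refers to the earliest previously
    seen name whose words include all of the mention's words. A mention with
    no such antecedent is treated as a new entity and linked to itself.
    """
    full_forms = []  # names that introduced a new entity, in document order
    links = []
    for mention in mentions:
        words = set(mention.lower().split())
        antecedent = next(
            (full for full in full_forms if words <= set(full.lower().split())),
            None,
        )
        if antecedent is None:
            full_forms.append(mention)
            links.append((mention, mention))
        else:
            links.append((mention, antecedent))
    return links


if __name__ == "__main__":
    for mention, entity in link_short_names(
        ["Columbia University", "Columbia", "New York", "Columbia University"]
    ):
        print(f"{mention!r} -> {entity!r}")
```

Because the sketch always picks the earliest matching name, it would link a bare Columbia to Columbia University even in a document that later introduces Columbia Savings and Loan; a real system needs more context than word overlap to handle that case.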

 

Based on our research, we are using a complete list of simplex noun phrases (NPs) in a document as our candidate list of significant topics (Wacholder 1998, under review). A simplex NP is a maximal NP with a common noun as its head, where the NP may include premodifiers such as determiners and possessives but not post-nominal constituents such as prepositions or relativizers. Examples are asbestos fiber and 9.8 billion Kent cigarettes. Simplex NPs can be contrasted with complex NPs such as 9.8 billion Kent cigarettes with the filters, where the head of the NP is followed by a preposition, or 9.8 billion Kent cigarettes sold by the company, where the head is followed by a participial verb. An important part of identifying candidate significant topics is the linking of references within the document to the same entity, for example by anaphor resolution (linking a pronoun with its antecedent, e.g., lung cancer in one sentence and it in the next) (Kennedy and Boguraev 1996) or by linking syntactic variants (e.g., lung cancer and cancer of the lung) (Jacquemin, Klavans, and Tzoukermann 1997).
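
The definition of a simplex NP can be approximated over part-of-speech-tagged text with a short sketch. The following is an illustration only, not the project's implementation: it assumes Penn Treebank tags, assumes a particular set of tags counts as premodifiers, and ignores pronouns and coordination. The names PREMODIFIER_TAGS, COMMON_NOUN_TAGS and simplex_nps are hypothetical.

```python
# Assumption: these Penn Treebank tags may appear inside a simplex NP.
PREMODIFIER_TAGS = {"DT", "PRP$", "JJ", "JJR", "JJS", "CD", "NN", "NNS", "NNP", "NNPS"}
# Assumption: the head must be a common noun (singular or plural).
COMMON_NOUN_TAGS = {"NN", "NNS"}


def simplex_nps(tagged):
    """Return (phrase, head) pairs for maximal premodifier runs ending in a common noun.

    `tagged` is a list of (token, tag) pairs. A run stops at the first token
    whose tag is not a premodifier or noun tag (e.g. a preposition or verb),
    so post-nominal material is never included.
    """
    results = []
    run = []
    for token, tag in list(tagged) + [("", "END")]:  # sentinel flushes the final run
        if tag in PREMODIFIER_TAGS:
            run.append((token, tag))
            continue
        # Drop trailing tokens until the run ends in a common noun (the head).
        while run and run[-1][1] not in COMMON_NOUN_TAGS:
            run.pop()
        if run:
            phrase = " ".join(tok for tok, _ in run)
            results.append((phrase, run[-1][0]))
        run = []
    return results


if __name__ == "__main__":
    sentence = [
        ("9.8", "CD"), ("billion", "CD"), ("Kent", "NNP"), ("cigarettes", "NNS"),
        ("with", "IN"), ("the", "DT"), ("filters", "NNS"),
    ]
    print(simplex_nps(sentence))
```

On the complex NP from the text, the sketch stops at the preposition and returns two simplex NPs, one headed by cigarettes and one headed by filters.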

 

Both the linking of candidate significant topics and the identification of topics that are indeed significant will be derived from the list of nominals and their variants, together with relatively simple associated information about document structure. By document structure, we refer to information about where in the document the noun phrase occurred, based on information such as sentence number, position relative to other simplex noun phrases in the document, token span, section or segment, and text markup code.
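
One way to picture the document-structure information listed above is as a record attached to each noun phrase occurrence. The sketch below is only an illustration; the class name NPOccurrence, its field names, and the sample values are all assumptions, not the project's data format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class NPOccurrence:
    """One occurrence of a simplex NP with its position in the document."""

    phrase: str                    # the NP as it appears in the text
    head: str                      # the head noun of the NP
    sentence_number: int           # sentence in which the NP occurs
    np_index: int                  # position relative to other simplex NPs
    token_span: Tuple[int, int]    # (start, end) token offsets, end exclusive
    section: Optional[str] = None  # section or segment label, if known
    markup: Optional[str] = None   # text markup code, if any


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    occurrence = NPOccurrence(
        phrase="The asbestos fiber",
        head="fiber",
        sentence_number=2,
        np_index=0,
        token_span=(0, 3),
    )
    print(occurrence)
```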

 

Indication of Success

 

The project award was received in August 1997, and we are well on target for the three-year period of the grant. In the initial eight-month period, we have developed a modular system, tentatively named LinkIT. The input to LinkIT is text which has been tagged with part-of-speech tags by the Alembic system, a publicly available part-of-speech tagger developed by the Mitre Corporation [http://www.mitre.org/resources/centers/advanced_info/g04h/alembic.html; Aberdeen et al. 1996]. To date, LinkIT identifies the following:

(Examples are from wsj 0003 of the Penn Treebank. The first three sentences of the article are reproduced here; simplex NPs are bracketed and head nouns are in italics.)

"[A form] of [asbestos] once used to make [Kent cigarette filters] has caused [a high percentage] of [cancer deaths] among [a group] of [workers] exposed to [it] more than [30 years] ago, [researchers] reported. [The asbestos fiber], [crocidolite], is unusually resilient once [it] enters [the lung], with [even brief exposures] to [it] causing [[symptoms] that show up [decades] later], [researchers] said."

 

 

In addition, we have identified a novel technique for identifying significant topics which we call ‘head sorting’ (Wacholder 1998, under review).
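
This report does not describe how head sorting works. Going only by the name and by the title of the cited paper ("Simplex NPs Sorted by Head"), one plausible reading is that simplex NPs are grouped under their head nouns. The sketch below illustrates that reading; ordering the groups by size is a further assumption made here, and the function name sort_by_head is hypothetical.

```python
from collections import defaultdict


def sort_by_head(nps):
    """Group (phrase, head) pairs by lowercased head noun.

    Assumption for this sketch: groups are ordered by the number of phrases
    they contain (largest first), with ties broken alphabetically by head.
    """
    groups = defaultdict(list)
    for phrase, head in nps:
        groups[head.lower()].append(phrase)
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))


if __name__ == "__main__":
    # Simplex NPs taken from the wsj 0003 example; heads assigned by hand.
    nps = [
        ("A form", "form"),
        ("asbestos", "asbestos"),
        ("Kent cigarette filters", "filters"),
        ("researchers", "researchers"),
        ("The asbestos fiber", "fiber"),
        ("researchers", "researchers"),
    ]
    for head, phrases in sort_by_head(nps):
        print(head, phrases)
```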

 

We are engaged in an ongoing process of evaluating and refining the LinkIT system, in order to make sure that it identifies complex nominals and proper names correctly and to improve the quality of linking. In addition, we are preparing to undertake an analysis of the significant topics identified by head sorting by comparing these nominals with the output of statistically-based systems (e.g. the SMART system or LSI). We also plan to explore ways to incorporate our output into the statistical indexing stage.

 

The major focus of our effort for the second two years of our project will be on creative use of linguistically motivated techniques for identification of significant topics and on evaluation of our results by both qualitative and quantitative techniques. Comparison with the output of other systems will be an important part of this analysis. Our output will be used for other projects such as summarization of multiple documents (e.g. the Columbia NSF-STIMULATE projects) and a project which involves finding background information on the web about significant topics in a document. If our hypotheses are correct, we will have developed new ways to identify significant topics; we will also have achieved a better understanding of the relative strengths and weaknesses of statistically-based and rule-based natural language processing systems.

 

Project Impact

 

Human Resources (student participation -- graduate and undergraduate, minorities, persons with disabilities, women), directly funded students,

 

Your department/institution infrastructure

This project is part of the Columbia University Digital Library program under the Center for Research on Information Access (CRIA). The purpose of CRIA is to establish new links between Computer Science research which is relevant to digital libraries and the information services division of the university. Our goals are to perform creative research as well as to link this research with needed applications. To this end, we are developing tools and techniques which are essential for effective user-oriented text analysis and retrieval, and are also useful for publishing, library, and other information management tasks.

 

Industry -- collaborations, transfer of technology, patents.

We have initiated discussions with the electronic publishing division of Columbia University Press (CUP), which is interested in using LinkIT in an indexer's workstation. Our goal is to perfect the nominal identification tool, along with the visualization tool, to the point where we can test its usefulness with the CUP indexers as an indexing aid.

 

What activities have been enabled/spawned because of the accomplishments made possible by your award?

 

Project References

 

Aberdeen, J., J. Burger, D. Day, L. Hirschman, and M. Vilain (1995) "Description of the Alembic system used for MUC-6". In Proceedings of MUC-6, Morgan Kaufmann.

Boguraev, Branimir and Christopher Kennedy (1997) "Technical terminology for domain specification and content characterization."

Jacquemin, C., Klavans, J., and Tzoukermann, E. (1997) "Expansion of multi-word terms for indexing and retrieval using morphology and syntax." Proceedings of the 35th Annual ACL. 24-31.

Kameyama, Megumi "Recognizing referential links: an information extraction perspective." cmp-lg/9707009.

Klavans, Judith L. (1998, to appear) "Databases in Digital Libraries: Where Computer Science and Information Management Meet." ACM-PODS Invited Tutorial.

Klavans, Judith L. and Min-Yen Kan (under review). "The Role of Verbs in Document Analysis."

Penn Treebank. Wall Street Journal, 1988. Treebank, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.

Wacholder, Nina (under review). "Simplex NPs Sorted by Head: a Method for Identifying Significant Topics within a Document."

Wacholder, Nina, Yael Ravin and Misook Choi (1997) "Disambiguation of proper names in text." Proceedings of the ANLP, ACL, Washington, DC.

 

Area Background

 

The focus of this research is on the identification of significant information in documents or sets of documents. The results of our research will be useful for applications such as effective indexing for many areas of information access including information extraction, retrieval, and presentation of results in summarization and visualization. The motivation is discussed in the first two paragraphs above in the project summary.

 

Area References

 

Cowie, Jim and Wendy Lehnert (1996). "Information Extraction." Communications of the ACM, 39(1): 80-91.

Hirschman, Lynette and Marc Vilain (1995). Extracting Information from the MUC. ACL Tutorial.

Klavans, Judith L. and Philip Resnik, editors (1996). The Balancing Act: Combining Symbolic and Statistical Approaches to Language. MIT Press, Cambridge, Mass.

Paice, Chris D. (1990). "Constructing literature abstracts by computer: techniques and prospects." Information Processing & Management, 26: 171-186.

Wilkinson, R. (1994). "Effective Retrieval of Structured Documents," ACM-SIGIR Proceedings. 311-317.