Automatic Identification of Significant Topics in
Domain-independent Full Text Documents
Contact Information
Judith L. Klavans
Center for Research on Information Access
Columbia University
535 W. 114th Street, MC 1101
New York, NY 10027
Phone: 212-854-7443
Fax: 212-854-9099
klavans@cs.columbia.edu

Nina Wacholder
Center for Research on Information Access
Columbia University
535 W. 114th Street, MC 1101
New York, NY 10027
Phone: 212-939-7119
Fax: 212-666-0140
nina@cs.columbia.edu
WWW Page:
http://www.columbia.edu/cria/
Keywords: Natural language processing, information access,
information retrieval, document analysis, topic identification, topic
detection and tracking, text retrieval
Project Award Information:
Award Number: IRI-97-12069
Title: Automatic Identification of
Significant Topics in Domain-Independent Full Text Documents
Duration: Three years
Dates: September 1997 to August 2000
Project Summary
The goal of this project is to develop a suite of techniques for
identifying significant topics in edited documents such as newspaper
articles. Significant entities and concepts are most often referred
to in text with nominal expressions such as noun phrases (e.g.,
computer science) and proper names (e.g., Buffalo Bills). However,
achieving even shallow understanding of nominal expressions remains a
major blocking point for natural language processing (NLP). Related to this
is the question of the role of natural language information in
information retrieval (IR) and specifically the role of nominals in
information access.
This research is useful for applications such as:
- automatic indexing tasks, either for digital library
applications or for direct user manipulation,
- second stage information retrieval, where a subset of a larger
corpus has been determined to be potentially relevant,
- advanced information extraction, where important entities in the
document must be identified and linked,
- summarization, and
- topic detection and tracking.
We have developed linguistically motivated techniques which
link related nominal expressions in order to identify documents or
portions of documents that are relevant for a given
application or user. To the extent that our techniques are based on
linguistically-motivated patterns and not on domain-dependent
vocabularies, our patterns apply to text in any genre or
domain. We assume that efficient and effective known statistical
techniques will continue to be used for large scale tasks such as
analysis of large corpora and retrieving a set of possibly relevant
documents from the web. One of the hypotheses of our work is that
better coverage and better portability across domains and genres are
achievable using creative combinations of statistically and
linguistically motivated techniques.
Goals, Objectives, and Targeted Activities
To achieve our goal, we are developing:
- a domain-general method for identifying a list of
candidate significant topics in a document that is as complete as is
practical, given that full natural language understanding is not
likely to be achieved by NLP systems for the foreseeable future.
- a suite of functions for identifying the most
significant of these candidate topics. Our particular
contribution solves some of the more challenging problems associated
with nominal referring expressions and their variants.
- a set of tools for evaluating and analyzing the
contribution of linguistic information to statistical information
retrieval systems.
Indications of Success
In the first year of
the project, we developed a modular system,
LinkIT. The first module of LinkIT, called SNAP, builds a list of
candidate significant topics for each document consisting of the
complete list of Simplex Noun Phrases (SNPs) (Wacholder 1998). A SNP
is a maximal noun phrase with a common noun as its head, where the NP
may include premodifiers such as determiners and possessives but not
post-nominal constituents such as prepositions or
relativizers. Examples are asbestos fiber and 9.8 billion
Kent cigarettes. Simplex NPs can be contrasted with complex NPs
such as 9.8 billion Kent cigarettes with the filters where the
head of the NP is followed by a preposition, or 9.8 billion Kent
cigarettes sold by the company, where the head is followed by a
participial verb. LinkIT then sorts the Simplex Noun Phrases by head
to link together expressions that refer to the same higher-level concept,
such as rights in reproduction rights and literary rights.
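The SNP definition and the grouping by head described above can be sketched roughly as follows. This is a minimal illustration, not the SNAP or LinkIT implementation: it assumes the input is already tagged with Penn Treebank-style part-of-speech tags, and the tag sets and the function names extract_snps and group_by_head are hypothetical choices made for this example.

```python
from collections import defaultdict

# Assumed tag inventory (Penn Treebank style); not taken from the project.
COMMON_NOUN_TAGS = {"NN", "NNS"}
PREMODIFIER_TAGS = {"DT", "PRP$", "POS", "CD", "JJ", "JJR", "JJS", "NNP", "NNPS"}
NP_TAGS = COMMON_NOUN_TAGS | PREMODIFIER_TAGS


def extract_snps(tagged_tokens):
    """Return simplex NPs as word lists, each ending in its common-noun head."""
    snps, run = [], []
    # A sentinel with a tag outside NP_TAGS flushes the final run.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag in NP_TAGS:
            run.append((word, tag))
            continue
        # Prepositions, verbs, etc. end the run, so post-nominal material
        # is never included. Trim until the run ends in a common noun.
        while run and run[-1][1] not in COMMON_NOUN_TAGS:
            run.pop()
        if run:
            snps.append([w for w, _ in run])
        run = []
    return snps


def group_by_head(snps):
    """Group SNPs under their head, taken here to be the last word."""
    groups = defaultdict(list)
    for snp in snps:
        groups[snp[-1].lower()].append(" ".join(snp))
    return dict(groups)


complex_np = [("9.8", "CD"), ("billion", "CD"), ("Kent", "NNP"),
              ("cigarettes", "NNS"), ("with", "IN"), ("the", "DT"),
              ("filters", "NNS")]
print(extract_snps(complex_np))
# [['9.8', 'billion', 'Kent', 'cigarettes'], ['the', 'filters']]

rights = [("reproduction", "NN"), ("rights", "NNS"), ("and", "CC"),
          ("literary", "JJ"), ("rights", "NNS")]
print(group_by_head(extract_snps(rights)))
# {'rights': ['reproduction rights', 'literary rights']}
```

In this sketch the complex NP from the example above is split at the preposition into two simplex NPs, and the two rights phrases end up under the same head.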
This year we have achieved:
- refinement of the SNAP module
- development of a filtering and ranking metric to identify the most
significant of the SNPs, based on frequency of the head of
the SNP and on frequency of the modifier
- evaluation of the ranking metric and comparison with two
related techniques
- implementation of a new module to identify main verbs in
documents (VERSO)
- analysis of the role of grammatical
categories in a statistical IR system (Wacholder et al. under review)
- creation of a set of resources used for training, testing
and evaluation, including a tagged 330 MB corpus based on the
Text Retrieval Conference (TREC) collections
- development of the DFI (Distance from Ideal) metric to
closely analyze the performance of SMART on different versions of
documents for specific queries (Evans et al. under review)
- refinement of a visualization tool for displaying marked up documents
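A ranking based on head and modifier frequency, as mentioned in the list above, might look roughly like the following sketch. The exact formula used in LinkIT is not given here, so the scoring is an assumption made for illustration: SNPs are ordered first by how often their head occurs as a head in the document, then by the frequency of their most frequent modifier. The function name rank_snps is hypothetical.

```python
from collections import Counter


def rank_snps(snps):
    """Rank distinct SNPs (word lists, head last) by head, then modifier frequency.

    Assumed scoring for illustration only; not the project's actual metric.
    """
    head_freq = Counter(snp[-1].lower() for snp in snps)
    mod_freq = Counter(w.lower() for snp in snps for w in snp[:-1])

    def score(snp):
        best_mod = max((mod_freq[w] for w in snp[:-1]), default=0)
        return head_freq[snp[-1]], best_mod

    distinct = {tuple(w.lower() for w in snp) for snp in snps}
    # Higher frequencies first; alphabetical order breaks remaining ties.
    ranked = sorted(distinct, key=lambda s: (-score(s)[0], -score(s)[1], s))
    return [" ".join(s) for s in ranked]


doc_snps = [["reproduction", "rights"], ["literary", "rights"],
            ["asbestos", "fiber"], ["literary", "agent"]]
print(rank_snps(doc_snps))
# ['literary rights', 'reproduction rights', 'literary agent', 'asbestos fiber']
```

Here rights is the most frequent head, so both rights phrases rank first, and literary, the more frequent modifier, breaks the tie between them.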
LinkIT is used by two NSF-sponsored STIMULATE projects for the
computation of paragraph similarity (Eskin et al. under review) and
for the analysis of image captions for building an ontology of
multimedia objects (Chang et al. under review). Its output is also used
as input to a distributed database project that finds background
information on the web about significant topics in a document.
If our hypotheses are correct, we will
have developed new ways to identify significant topics; we will also
have achieved a better understanding of the relative strengths and
weaknesses of statistically-based and rule-based natural language
processing systems.
Project Impact / Human Resources
- David K. Evans, Ph.D. student, has developed the LinkIT software and
the Distance From Ideal (DFI) metric used in evaluation of LinkIT.
- Sonja Allin, B.A. student majoring in computer science, has
completed a project to improve and refine our tool for document
visualization.
Your department/institution infrastructure
This project is part of the Columbia University Digital Library
program under the Center for Research on Information Access (CRIA).
CRIA sponsors projects to develop tools and techniques which are
essential for effective user-oriented text analysis and retrieval, and
are also useful for publishing, library, electronic commerce, and
other information management tasks in distributed networked environments.
Industry -- collaborations, transfer of technology, patents
We have developed a plan with the Columbia University Libraries and
with the electronic publishing division of Columbia University Press (CUP) to
use LinkIT to build an intelligent indexing tool.
We are also in contact with the Columbia Law School for potential use
in the Columbia Law Review.
What activities have been enabled/spawned because of the
accomplishments made possible by your award?
- Eleazar Eskin, a CS Ph.D. student, used LinkIT output as a
feature in a machine learning approach to paragraph similarity.
- Carl Sable, a CS Ph.D. student, used LinkIT to help in
caption analysis for image classification.
- Stefan Negrila, a Columbia undergraduate, used LinkIT output
to cluster documents that discussed the same event. (Supervised
by Luis Gravano)
Project References
1. Chang, Shih-Fu et al. (under review). "Integration of Visual and
Text-Based Approaches for the Content Labeling and Classification of
Photographs".
2. Eskin, Eleazar, Judith Klavans, and Vasileios Hatzivassiloglou
(under review). "Detecting Similarity by Applying
Learning over Indicators".
3. Evans, David Kirk, Judith L. Klavans, and Nina Wacholder
(under review). "The Impact of Document Collection Characteristics on
Information Access in Digital Libraries".
4. Klavans, Judith L. (1998). "Databases in Digital Libraries: Where
Computer Science and Information Management Meet". ACM-PODS Invited
Tutorial, available at
http://www.cs.columbia.edu/~klavans/Slides/PODS98/index.htm
5. McKeown, Kathleen R, Judith Klavans, Vasileios Hatzivassiloglou,
Regina Barzilay, and Eleazar Eskin (under review) "Towards
multidocument summarization by reformulation: Progress and prospects".
6. Wacholder, Nina, Judith L. Klavans, and David Kirk
Evans (under review). "The role of grammatical categories in a
statistical information retrieval system".
7. Wacholder, Nina (1998). "Simplex NPs Sorted by Head: a
Method for Identifying Significant Topics within a Document,"
Proceedings of the COLING-ACL Workshop on the Computational
Treatment of Nominals, Montreal, Canada, August 16, 1998.
8. Wacholder, Nina, Yael Ravin and Misook Choi (1997)
"Disambiguation of proper names in text," Proceedings of the ANLP,
ACL, Washington, DC.
Area Background
The NSF is actively involved in funding research to enable increased
universal access to the fast-growing body of electronic text. Our
research directly addresses the need to find information more easily
and more reliably. We are developing a range of innovative methods to
improve current methodologies for information retrieval, indexing,
extraction, and summarization.
Area References
Losee, Robert M. (1998). Text Retrieval and Filtering: Analytic
Models of Performance. Kluwer Academic Publishers, Boston, 1st edition.
Klavans, Judith L. and Philip Resnik, editors (1996). The Balancing
Act: Combining Symbolic and Statistical Approaches to Language. MIT
Press, Cambridge, Mass.
Last modified: Mon Feb 15 10:36:33 Eastern Standard Time 1999