Automatic Identification of Significant
Topics in Domain-Independent Full Text Documents
Contact Information
Judith L. Klavans | Nina Wacholder
Center for Research on Information Access | Center for Research on Information Access
Columbia University | Columbia University
535 W. 114th Street, MC 1101 | 535 W. 114th Street, MC 1101
New York, NY 10027 | New York, NY 10027
Phone: 212-854-7443 | Phone: 212-939-7119
Fax: 212-854-9099 | Fax: 212-666-0140
klavans@cs.columbia.edu | nina@cs.columbia.edu
WWW Page
http://www.columbia.edu/cu/cria/
Project WWW Page:
http://www.columbia.edu/cu/SigTops/
List of Supported Students and Staff
See Project Impact: Human Resources.
Project Award Information
- Award Number: IRI-97-12069 *
- Duration: 09/01/1997 -- 08/01/2000
- Title: Automatic Identification of Significant Topics in
Domain-Independent Full Text Documents
Keywords
Information retrieval, natural language
processing, computational linguistics, grammatical analysis, parsing,
document analysis, topic identification, topic detection and tracking,
noun phrases.
Project Summary
The goal of this project is to develop a suite of techniques for
identifying significant topics in edited documents such as newspaper
articles. Significant entities and concepts are most often referred
to in text with nominal expressions such as noun phrases (e.g.,
computer science) and proper names (e.g., Buffalo Bills). However,
achieving even shallow understanding of nominal expressions remains a
major blocking point for natural language processing (NLP). Related
to this is the question of the role of natural language information in
information retrieval (IR) and specifically the role of nominals in
information access. An additional goal is to achieve a better
understanding of the relative strengths and weaknesses of
statistically-based and rule-based natural language processing
systems.
We have developed linguistically motivated techniques that link
related nominal expressions in order to identify documents or portions of
documents that are relevant for a given application or user. To the
extent that our techniques are based on linguistically motivated
patterns and not on domain-dependent vocabularies, our patterns apply
to text in any genre or domain. We assume that efficient and
effective known statistical techniques will continue to be used for
large scale tasks such as analysis of large corpora and retrieving a
set of possibly relevant documents from the web. This research is
useful for applications such as automatic indexing, either for
digital library applications or for direct user manipulation;
second-stage information retrieval, where a subset of a larger corpus has been
determined to be potentially relevant; advanced information extraction,
where important entities in the document must be identified and
linked; summarization; and topic detection and tracking. Our system
has been evaluated against two similar noun phrase
identifiers and has been shown to perform better than both.
Publications and Products
- Publications:
See Project References.
- What Web sites or other Internet sites have you created?
- What other specific products
(databases, physical collections, educational aids, software, instruments,
or the like) have you developed?
- Document Visualization Tool (software)
- Intelligent Indexer's Aid (software)
Project Impact
- Human Resources
- David Kirk Evans, Ph.D. student, has developed and performed
evaluations of the LinkIT software.
- Himani Naresh, B.S. student, major in computer science,
undergraduate supported for the initial version of the document
visualization tool.
- Sonja Allin, B.A. student, major in computer science, has
completed a project to improve and refine our tool for document
visualization.
- Your department/institution infrastructure
This project is part of the Columbia University Digital Library
program under the Center for Research on Information Access (CRIA).
CRIA sponsors projects to develop tools and techniques which are
essential for effective user-oriented text analysis and
retrieval. CRIA supports students in the Fu School of Engineering
and Applied Sciences at Columbia University.
- Industry -- collaborations, transfer of technology, patents
- We have developed a plan with the electronic publishing
division of Columbia University Press (CUP) for potential
use of LinkIT in an intelligent indexing tool for scholarly
publications.
- Our software is a key component in the Digital Government
work on automatic ontology compilation (Digital Government URL:
http://www.cs.columbia.edu/digigov/).
Goals, Objectives, and Targeted Activities
Our goal is to develop a domain-general method for identifying a list of
candidate significant topics in a document that is as complete as is
practical, given that full natural language understanding is not
likely to be achieved by NLP systems for the foreseeable
future. We are developing:
- a suite of functions for identifying the most
significant of these candidate topics. Our particular
contribution solves some of the more challenging problems
associated with nominal referring expressions and their
variants.
- a set of tools for evaluating and analyzing the
contribution of this type of linguistic information
to statistical information retrieval systems.
- an evaluation comparing LinkIT (the software for Significant
Topic Identification) with two other noun phrase identification
tools on the task of noun phrase identification. LinkIT was shown to
have performance comparable to or better than the other noun phrase
identification systems (Evans, to appear).
The evaluation of LinkIT can be
viewed as a two-stage process. The first stage is the
evaluation of noun phrase identification, which we have
completed. The second stage is an evaluation of the feature of
LinkIT that not only identifies noun phrases (e.g., "asbestos
workers") but also links each noun phrase to related noun phrases
within the article via the modifiers (e.g., "asbestos
poisoning") or via the head (e.g., "factory workers").
Designing an evaluation for the second stage is a
complex task, as no clear metrics exist. In the next phase
of our research, we will evaluate these components of the
system in a task-based evaluation.
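The linking step described above can be illustrated with a minimal sketch. It is not the LinkIT implementation. It assumes the noun phrases have already been identified, and it treats the last word of each simplex noun phrase as the head and every earlier word as a modifier, which is a simplifying heuristic for English. The function name `link_noun_phrases` is hypothetical.

```python
from collections import defaultdict


def link_noun_phrases(noun_phrases):
    """Group noun phrases that share a head or a modifier.

    Assumption: the head is the last word of the phrase and all
    preceding words are modifiers (a simplification for English).
    """
    by_head = defaultdict(list)
    by_modifier = defaultdict(list)
    for phrase in noun_phrases:
        words = phrase.lower().split()
        if not words:
            continue  # skip empty strings
        head, modifiers = words[-1], words[:-1]
        by_head[head].append(phrase)
        for modifier in modifiers:
            by_modifier[modifier].append(phrase)
    return dict(by_head), dict(by_modifier)


# The three example phrases from the text above.
phrases = ["asbestos workers", "asbestos poisoning", "factory workers"]
by_head, by_modifier = link_noun_phrases(phrases)

print(by_head["workers"])       # ['asbestos workers', 'factory workers']
print(by_modifier["asbestos"])  # ['asbestos workers', 'asbestos poisoning']
```

With these inputs, "asbestos workers" is linked to "factory workers" through the shared head and to "asbestos poisoning" through the shared modifier, matching the example in the text.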
Project References
- Web Site: Significant Topics Website
(http://www.columbia.edu/cu/cria/SigTops/)
- Evans, David Kirk, Judith L. Klavans, Nina Wacholder (2000, to
appear). "Document processing with LinkIT," RIAO 2000 Recherche
d'Informations Assistée par Ordinateur (Content-Based Multimedia
Information Access), Paris, France.
- Hatzivassiloglou, Vasileios, Judith L. Klavans and Eleazar Eskin
(1999). "Detecting Text Similarity over Short Passages: Exploring
Linguistic Feature Combinations via Machine Learning," EMNLP/VLC-99
Joint SIGDAT Conference on Empirical Methods in NLP and Very Large
Corpora, University of Maryland, College Park, MD, USA.
- Klavans, Judith L., David K. Evans, Nina Wacholder (2000, to
appear). "Evaluation of Computational Linguistic Techniques for
Identifying Significant Topics for Browsing Applications,"
2nd International Language Resources and Evaluation Conference
(LREC2000), Athens, Greece.
- Klavans, Judith L. (1998). "Databases in Digital Libraries: Where
Computer Science and Information Management Meet," ACM-PODS Invited
Tutorial, available at
http://www.cs.columbia.edu/~klavans/Slides/PODS98/index.htm.
- McKeown, Kathleen R., Judith L. Klavans, Vasileios Hatzivassiloglou,
Regina Barzilay and Eleazar Eskin, (1999). "Towards
Multidocument Summarization by Reformulation: Progress and
Prospects," Proceedings of the Sixteenth National Conference
on Artificial Intelligence AAAI-1999, Orlando, Florida.
- Negrilla, Stefan (1998). "Clustering Algorithms Summer Project,"
Computer Science Report, Columbia University.
- Wacholder, Nina (1998). "Simplex NPS Sorted by Head: a
Method for Identifying Significant Topics within a Document,"
Proceedings of the COLING-ACL Workshop on the Computational
Treatment of Nominals, Montreal, Canada, August 16, 1998.
- Wacholder, Nina, Judith L. Klavans, David K. Evans (2000, to
appear). "Evaluation of Automatically Identified Index Terms for
Browsing Electronic Documents," Applied Natural Language
Processing Conference (ANLP-2000), Seattle, Washington.
Area Background
Our research directly addresses the need to find information more
easily and more reliably. We are developing a range of innovative
methods to improve current methodologies for information retrieval,
indexing, extraction, and summarization.
Area References
- Losee, Robert M. (1998). Text Retrieval and Filtering: Analytic
Models of Performance. Kluwer Academic Publishers, Boston, 1st edition.
- Klavans, Judith L. and Philip Resnik, editors (1996). The Balancing
Act: Combining Symbolic and Statistical Approaches to Language. MIT
Press, Cambridge, Mass.
- Wacholder, Nina, Yael Ravin and Misook Choi (1997).
"Disambiguation of proper names in text," Proceedings of the ANLP,
ACL, Washington, DC.
Potential Related Projects
- The use of natural language techniques to extract structured
information from free text.
- Automatic construction of ontologies using natural language
techniques.
- New methods to integrate ontologies and metadata.
- Evaluation techniques for natural language and database
applications with partial matching.
*All award information can be found on the NSF on-line
Awards Abstracts system http://www.fastlane.nsf.gov/a6/A6Start.htm.