Automatic Identification of Significant Topics in
Domain-independent Full Text Documents
Contact Information
Judith L. Klavans (klavans@cs.columbia.edu)
Nina Wacholder (nina@cs.columbia.edu)
Center for Research on Information Access
Columbia University
535 W. 114th Street, MC 1101
New York, NY 10027
Phone: 212-854-7443
Fax: 212-666-0140
WWW Page: http://www.columbia.edu/cria/
Keywords: Natural language processing, information access,
information retrieval, document analysis, topic identification, topic
detection and tracking, text retrieval
Project Award Information:
Award Number: IRI-97-12069
Title: Automatic Identification of
Significant Topics in Domain-Independent Full Text Documents
Duration: Three years
Dates: September 1997 to August 2000
Project Summary
The goal of this project is to develop a suite of techniques for
identifying significant topics in edited documents such as newspaper
articles. For the purposes of this research, a 'topic' is any event or
entity explicitly referred to in a document, and a 'significant topic' is
a topic central to what is sometimes called the 'aboutness' of a document.
The notion 'significant', like the notion 'relevant', is both task- and
user-dependent.
This research is important because significant entities and
concepts are most often referred to in text with nominal expressions
such as noun phrases (e.g., computer science) and proper names
(e.g., Buffalo Bills). However, achieving even shallow understanding
of nominal expressions remains a major blocking point for NLP
systems.
We have developed linguistically motivated techniques that link
related nominal expressions in order to identify documents, or
portions of documents, that are relevant to a given application or
user. To the extent that our techniques are based on linguistically
motivated patterns rather than on domain-dependent vocabularies, they
apply to text in any genre or domain. We assume that efficient and
effective statistical techniques will continue to be used for
large-scale tasks such as analysis of large corpora or retrieval of a
set of possibly relevant documents from the web. One of the
hypotheses of our work is that better coverage and better portability
across domains and genres are achievable using creative combinations
of statistically and linguistically motivated techniques.
The natural language processing (NLP) applications for which this
research is useful utilize efficient shallow language analysis in
order to produce output that improves information access, for example:
- automatic indexing tasks, either for digital library
applications or for direct user manipulation.
- second stage information retrieval, where a subset of a larger
corpus has been determined to be potentially relevant, perhaps by a
statistically based system.
- advanced information extraction where important entities in the
document must be identified and linked so that information about the
entity from different parts of the document can be related.
- summarization or other techniques for conveying the gist of a
document.
- topic detection and tracking.
Goals, Objectives, and Targeted Activities
To achieve our goal, we have undertaken three tasks:
- development of a domain-general method for identifying a list of
candidate significant topics in a document that is as complete as is
practical, given that full natural language understanding is not
likely to be achieved by NLP systems for the foreseeable future.
- development of a suite of functions for identifying the most
significant of these candidate topics. Our particular
contribution solves some of the more challenging problems associated
with nominal referring expressions and their variants.
- development of a set of tools for evaluating and analyzing the
contribution of linguistic information to statistical information
retrieval systems.
Indications of Success
In the first year of
the project, we developed a modular system, tentatively named
LinkIT. The SNAP module of the LinkIT tool builds, for each document,
a list of candidate significant topics consisting of the complete set
of Simplex Noun Phrases (SNPs) in the document (Wacholder 1998). A SNP
is a maximal noun phrase with a common noun as its head, where the NP
may include premodifiers such as determiners and possessives but not
post-nominal constituents such as prepositions or
relativizers. Examples are 'asbestos fiber' and '9.8 billion
Kent cigarettes'. Simplex NPs can be contrasted with complex NPs
such as '9.8 billion Kent cigarettes with the filters', where the
head of the NP is followed by a preposition, or '9.8 billion Kent
cigarettes sold by the company', where the head is followed by a
participial verb. LinkIT then sorts the Simplex Noun Phrases
so as to link together expressions that refer to the same concept,
e.g., 'reproduction rights' and 'literary rights'.
In addition to our work on identifying and linking simplex noun
phrases, which we reported on in last year's IDM report, we have achieved
the following over the past year:
- Refinement of the SNAP module
- development of a filtering and ranking metric to identify the most
significant of the SNPs, based on the frequency of the SNP's head and
on the frequency of its modifiers (see the sketch after this list)
- evaluation of the ranking metric
- comparison of the performance of the head sorting method
for conveying the 'gist' of a document with two other
techniques: keyword frequency (the tf of the tf*idf
method) and repeated word sequences (based on the
technical term approach of Justeson and Katz 1995)
- Implementation of a new module to identify main verbs in
documents (VERSO)
- Analysis of the role of grammatical
categories in a statistical IR system (Wacholder et al. 1999)
- creation of a 330 MB corpus based on the
Text Retrieval Conference (TREC) Disk 1 and Disk 2 collections
- use of LinkIT output to create six versions of
the corpus, and comparison of the results of running an IR
system over the different versions of the corpus.
- development of the DFI (Distance from Ideal) metric to
closely analyze the performance of different versions of
documents on specific queries (Evans et al. under
review).
- Refinement of the visualization tool for displaying marked-up documents.
- Ongoing discussion with the Columbia University Law School and
the Columbia University Press about how to adapt LinkIT for use
in indexing of electronic documents.
- Successful use of LinkIT output to measure paragraph similarity
in the NSF Stimulate Project (Eskin et al., under review)
The major focus of our effort for the third and final year of the
project will be the development of additional techniques for
identifying significant topics for natural language
applications, together with an ongoing process of evaluating and refining
the LinkIT system,
in order to make sure that it identifies complex nominals and proper
names correctly and to improve the quality of linking. We are also
using the LinkIT output, along with other linguistic information
obtained by shallow parsing, in order to analyze the contribution of
nominal expressions to the SMART system.
Over the next year, our output will be used in other projects,
such as summarization of multiple documents (e.g., the Columbia
NSF-STIMULATE projects) and a project that involves finding
background information on the web about significant topics in a
document; we plan to evaluate the contribution that our work makes
to these applications. If our hypotheses are correct, we will
have developed new ways to identify significant topics; we will also
have achieved a better understanding of the relative strengths and
weaknesses of statistically based and rule-based natural language
processing systems.
Project Impact
Human Resources
- David K. Evans, a Columbia University graduate student, is fully
funded. He is responsible for developing the LinkIT software. This
year he transferred from the Master's degree program to the Ph.D.
program.
- Sonja Allin, a Columbia graduate, has returned to campus as a General
Studies student for a B.A. in computer science. She has been working on a
non-credit project to improve and refine our tool for document
visualization.
- Eleazar Eskin, a CS Ph.D. student, used LinkIT output to measure
document similarity.
- Stefan Negrila, a Columbia undergraduate, used LinkIT output
to cluster documents that discussed the same event. (Supervised
by Luis Gravano)
Department/Institution Infrastructure
This project is part of the Columbia University Digital Library
program under the Center for Research on Information Access
(CRIA). The purpose of CRIA is to establish new links between Computer
Science research relevant to digital libraries and the
information services division of the university. We are developing
tools and techniques which are essential for effective user-oriented
text analysis and retrieval, and are also useful for publishing,
library, and other information management tasks.
Industry -- collaborations, transfer of technology, patents
We have developed a plan with the Columbia University Libraries and
with the electronic publishing division of Columbia University Press (CUP) to
use LinkIT to build an intelligent indexing tool.
We are also in contact with the Columbia Law School for potential use
in the Columbia Law Review.
What activities have been enabled/spawned because of the
accomplishments made possible by your award?
- collaboration with related projects in the Department of
Computer Science
- discussion with Columbia University Press about an indexer's aid
tool
- collaboration with Dr. Christian Jacquemin (LIMSI) to use LinkIT
with FASTR for variant analysis
Project References
Aberdeen, J., J. Burger, D. Day, L. Hirschman, and M.
Vilain (1995) "Description of the Alembic system used for MUC-6".
In Proceedings of MUC-6, Morgan Kaufmann.
Eskin, Eleazar, Judith Klavans, and Vasileios Hatzivassiloglou (under
review). "Paper title here." ACL 1999.
Evans, David Kirk (under review). "The Impact of Document Collection
Characteristics on Information Access in Digital Libraries." ACM-DL 1999.
Jacquemin, C., Klavans, J., and Tzoukermann, E. (1997).
"Expansion of multi-word terms for indexing and retrieval using morphology
and syntax." Proceedings of the 35th Annual Meeting of the ACL, 24-31.
Justeson, John and Slava Katz (1995). "Technical
Terminology: some linguistic properties and an algorithm for
identification in text." Natural Language Engineering 1(1): 9-27.
Klavans, Judith L. (1998, to appear). "Databases in
Digital Libraries: Where Computer Science and Information Management
Meet." ACM-PODS Invited Tutorial.
Klavans, Judith L. and Min-Yen Kan (under
review). "The Role of Verbs in Document Analysis."
TREC Disks 1 and 2, Penn Treebank, University of
Pennsylvania, Philadelphia, PA.
Wacholder, Nina, Judith L. Klavans, and David Kirk
Evans (under review). "The role of grammatical categories in a
statistical information retrieval system."
Wacholder, Nina (1998). "Simplex NPs Sorted by Head: a
Method for Identifying Significant Topics within a Document."
Proceedings of the COLING-ACL Workshop on the Computational
Treatment of Nominals, Montreal, Canada, August 16, 1998.
Wacholder, Nina, Yael Ravin and Misook Choi (1997)
"Disambiguation of proper names in text," Proceedings of the ANLP,
ACL, Washington, DC.
Area Background
As the NSF works toward the goal of enabling increased universal
access to the fast-growing body of electronic text, our research
directly addresses the need of information-seeking individuals
to find what they need more easily and more reliably. To achieve this
end, we are developing a range of innovative methods to improve current
methodologies for information retrieval, indexing, extraction, and
summarization. The specific focus of our project is on the
identification of significant information in documents or sets of
documents. This type of information is under-utilized by most
available systems.