Automatic Identification of Significant Topics in Domain Independent Full Text Analysis

The goal of this research is to investigate the relationship between the occurrence of significant topics in a document and the document's structure. The unique contribution of our research lies in the combination of methods used to produce a list of significant topics: statistical and rule-based techniques that identify term variants as a function of their distribution in focus areas of documents.
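To make the idea concrete, the following is a minimal sketch, not the project's actual method. It assumes that "focus areas" are the title and the first sentence of each paragraph, that variant conflation is a single plural-stripping rule, and that a length cutoff stands in for stopword filtering. The names `normalize`, `significant_topics`, `FOCUS_WEIGHT`, and `BODY_WEIGHT` are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Assumed weights: occurrences in focus areas count more than body text.
FOCUS_WEIGHT = 2.0
BODY_WEIGHT = 1.0


def normalize(term):
    """Rule-based variant conflation (assumption: lowercase, strip plural 's')."""
    term = term.lower()
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        term = term[:-1]
    return term


def significant_topics(title, paragraphs, top_n=3):
    """Rank conflated terms by weighted frequency across focus and body text."""
    scores = Counter()

    def add(text, weight):
        for word in re.findall(r"[A-Za-z]+", text):
            if len(word) > 3:  # crude stand-in for a stopword filter
                scores[normalize(word)] += weight

    add(title, FOCUS_WEIGHT)
    for para in paragraphs:
        # Assumption: the first sentence of a paragraph is a focus area.
        first, _, rest = para.partition(". ")
        add(first, FOCUS_WEIGHT)
        add(rest, BODY_WEIGHT)
    return [term for term, _ in scores.most_common(top_n)]


if __name__ == "__main__":
    doc = [
        "Topics depend on structure. Some filler text follows here.",
        "Structure guides topics. More filler text appears.",
    ]
    print(significant_topics("Topic identification", doc, top_n=2))
```

The statistical component here is only a weighted count and the rule-based component a single suffix rule; a real system would substitute richer linguistic patterns for both.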

Applications include information retrieval, passage retrieval, relevance feedback, information extraction, and summarization. Our results will be used directly in ongoing research projects on the automatic summarization of documents, using both statistical and information extraction techniques. To the extent that our techniques rely on linguistically motivated patterns rather than domain-dependent vocabularies, they should apply to general text. We will apply our approach to several domains to test its generality and its applicability across document types, which will let us measure the cost of porting across genres. Formative and summative evaluation procedures will be developed and carried out at each step of the analysis.

This research will be undertaken in the context of the Digital Library Research program at Columbia University, in conjunction with the Center for Research on Information Access.

Problems with this page? Send email to klavans@cs.columbia.edu

This page is located at http://www.cs.columbia.edu/~klavans/Cria/Current-projects/Significant-Topics/summary.html.

This page was last updated on 8/5/97