Generating Coherent Summaries of On-Line Documents:
Combining Statistical and Symbolic Techniques

Kathleen R. McKeown and Judith L. Klavans

CONTACT INFORMATION

Kathleen R. McKeown
Department of Computer Science
450 Computer Science Building
1214 Amsterdam Avenue
New York, N.Y. 10027
Phone: (212) 939-7118
Fax: (212) 666-0140
Email: kathy@cs.columbia.edu

Judith L. Klavans
Center for Research on Information Access
535 West 114th Street, MC 1103
New York, New York 10027
Phone: (212) 854-7443
Fax: (212) 222-0331
Email: klavans@cs.columbia.edu

WWW PAGE

Stimulate-Cogito Project Page

PROGRAM AREA

Speech and Natural Language Understanding.

KEYWORDS

Presentation of information, information overload, symbolic natural language processing, statistical natural language processing, summarization, language generation, WordNet, lexical semantics, document analysis, segmentation.

PROJECT SUMMARY

Goals

Given the exponential growth of online information, one of the primary difficulties facing Internet users is information overload. Summaries can function as an abbreviated form of a document or as an aid in assessing the relevance of a document to a selected topic, thereby reducing the amount of information a user must read. We propose a multi-level process of summarization for the presentation of available information during browsing or searching. Key features of this project include summarization of a set of documents as opposed to just a single document, integration of information from related data sources (e.g., structured databases) with information derived from text, summaries which are both indicative and informative, and generation of updates as new information becomes available.

The unique aspects of our research are

  1. the integration of knowledge about a document derived with both statistical and symbolic techniques;
  2. the use of language generation for reformulating this information into a concise and coherent summary; and
  3. summarization across sets of related articles.

Unlike other approaches which use purely statistical techniques to extract existing sentences from documents and present them as a "summary", our summaries are coherent and highly readable. Furthermore, our approach will allow us to make progress in developing techniques for domain independent summarization.

Research Plans

The system being proposed will have two basic components: one for document analysis and one for summary generation. Document analysis involves sectioning a document into topics, identifying key content within each section, and constructing a lexical semantic representation for use by the summary generator. In segmenting a document, statistical methods will be used to determine topics and subtopics. To improve information quality, we propose to further analyze selected segments of the text with grammatical parsing and lexical semantic information, thus producing a structured representation of key information from the articles being summarized.
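The statistical segmentation step could be sketched as a lexical-cohesion comparison in the spirit of TextTiling: hypothesize a topic boundary wherever the vocabulary of adjacent sentence blocks stops overlapping. The stopword list, block size, threshold, and sample text below are illustrative assumptions, not parameters of the proposed system.

```python
import math
from collections import Counter

# Words too common to signal a topic shift; a real system would use a
# much fuller stopword list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "is", "was", "and"}

def bag(sentence: str) -> Counter:
    """Bag of content words for one sentence."""
    return Counter(w for w in sentence.lower().split() if w not in STOPWORDS)

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def segment(sentences, block_size=2, threshold=0.1):
    """Return sentence indices at which a topic boundary is hypothesized."""
    bags = [bag(s) for s in sentences]
    boundaries = []
    for i in range(block_size, len(bags) - block_size + 1):
        left = sum(bags[i - block_size:i], Counter())
        right = sum(bags[i:i + block_size], Counter())
        if cosine(left, right) < threshold:   # low cohesion => boundary
            boundaries.append(i)
    return boundaries

sents = [
    "The stock market rose sharply on Monday.",
    "Investors bought stock in technology companies.",
    "Meanwhile, heavy rain flooded parts of the city.",
    "Rain is expected to continue through the week.",
]
print(segment(sents))  # → [2]: the topic shifts before sentence 2
```

The boundaries would then delimit the segments handed on to the deeper grammatical and lexical-semantic analysis.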

The output of the document analysis component will serve as input to the summary generator, whose primary tasks are to produce a coherent summary of individual articles, and to identify similarities and differences among documents for summaries of sets of articles. Our research will focus on integration of relevant information from several data sources, on discovery of conflicting information, and on the use of planning operators to combine information from the separate representations into a single readable text. In addition, by comparing newly received information against a representation of already analyzed information, updates can be generated as appropriate. A second major focus will be on generating concise wording to convey maximal information, using words from the input in addition to related words from the lexical semantic relations constructed in the first component, in order to enhance accuracy and to avoid the need for large domain-specific lexicons.
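One way to picture the cross-document comparison step is as a diff over structured representations: shared facts can be stated once, conflicts flagged for hedged wording, and source-unique facts attributed to their source. The flat attribute-value dictionaries and slot names below are invented for illustration; the project's actual representations are richer lexical-semantic structures.

```python
# Toy comparison of two articles' structured representations. The slot
# names ("event", "casualties", ...) are hypothetical examples.
def compare(rep_a: dict, rep_b: dict):
    """Split two attribute-value representations into shared facts,
    conflicting values, and facts unique to each source."""
    shared = {k: rep_a[k] for k in rep_a if rep_b.get(k) == rep_a[k]}
    conflicts = {k: (rep_a[k], rep_b[k])
                 for k in rep_a if k in rep_b and rep_b[k] != rep_a[k]}
    only_a = {k: rep_a[k] for k in rep_a if k not in rep_b}
    only_b = {k: rep_b[k] for k in rep_b if k not in rep_a}
    return shared, conflicts, only_a, only_b

a = {"event": "earthquake", "location": "Kobe", "casualties": 5000}
b = {"event": "earthquake", "location": "Kobe", "casualties": 5500,
     "magnitude": 7.2}
shared, conflicts, only_a, only_b = compare(a, b)
# shared facts are stated once; conflicts can be hedged ("reports vary
# between ..."); source-unique facts can be attributed to their article.
```

The same comparison against an already-summarized representation would drive update generation: only the conflicts and the source-unique facts need to be reported.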

Evaluation

Both coverage and robustness of the analyzer and generator will be measured, breaking each of these measures down into conceptual coverage and linguistic coverage. Quantitative evaluation metrics will be developed for comparing system coverage with summaries found in online corpora, based both on an approach used in earlier work at Columbia University and on new approaches to be developed as part of the research. Human judges will also be used to provide a separate set of qualitative measures.
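As a placeholder for the quantitative metrics to be developed, a crude coverage measure might compare the vocabulary of a system summary against a human-written summary from an online corpus. The function below is an illustrative sketch only, not the project's metric.

```python
def overlap(system: str, reference: str) -> float:
    """Fraction of the reference summary's words that also appear in
    the system summary (a rough proxy for content coverage)."""
    sys_words = set(system.lower().split())
    ref_words = set(reference.lower().split())
    return len(sys_words & ref_words) / len(ref_words) if ref_words else 0.0

print(overlap("the quake killed thousands",
              "a quake killed thousands in kobe"))  # → 0.5
```

A real metric would have to score content units rather than surface words, which is one reason the annotated test-bed material discussed below matters.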

Expected Results

This work fits within the larger scenario of digital library efforts and information technology research on text analysis and generation for the purpose of summarization. Through collaboration with Columbia's Center for Research on Information Access, dynamically generated summaries will ultimately be part of a digital library interface that filters documents according to a user's interests and needs. By integrating statistics, lexical semantics, and language generation, we will develop new techniques for automated coherent summarization across domains and for multiple documents.

PROJECT REFERENCES

Kathleen R. McKeown, Karen Kukich, and Jacques Robin. Generating concise natural language summaries. Information Processing and Management, 31(5), September 1995, pp. 703-733, Special Issue on Summarization.

Jacques Robin and Kathleen R. McKeown. Empirically designing and evaluating a new revision-based model for summary generation. Artificial Intelligence Journal, 85, August 1996, Special Issue on Empirical Methods.

Klavans, Judith L. and Virginia Tantral (1996) Analysis of Documents for Summarization: The Role of Verbs, paper presented at the International Workshop on Predicative Forms in Natural Language and Lexical Knowledge Bases, Toulouse, France.

Tzoukermann, Evelyne, Judith L. Klavans, and Christian Jacquemin (1997) "Effective Use of Natural Language Processing Techniques for Automatic Conflation of Multi-Word Terms: The Role of Derivational Morphology, Part of Speech Tagging, and Shallow Parsing", in Proceedings of the ACM Special Interest Group on Information Retrieval (SIGIR), Philadelphia, Pennsylvania.

AREA BACKGROUND

Summarization of natural language texts involves identification of important information in the input text and generation of a paragraph or more that summarizes that information. Our work focuses on summarization of multiple articles using domain independent techniques.

Previous work in summarization has focused either on statistical extraction, identifying and lifting key sentences to use as the summary, or on information extraction, using domain-specific symbolic techniques to find and extract specific types of information to represent in database form. The first of these approaches identifies and extracts key sentences from an article, using statistical measures to locate important phrases. These key sentences are then grouped together to form the summary. This work has a long history within the field of information retrieval, beginning with early work by Luhn, and including a flurry of more recent work as exemplified by this year's ACL/EACL Workshop on Intelligent Scalable Text Summarization.
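The Luhn-style extraction approach can be sketched in a few lines: score each sentence by the document frequency of its content words and lift the top-scoring sentences verbatim as the "summary". The tokenizer, stopword list, and sample text are simplified assumptions for illustration.

```python
from collections import Counter

# Simplified stopword list; real systems use far larger ones.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was"}

def tokens(sentence: str):
    """Lowercased content words with trailing punctuation stripped."""
    return [w.strip(".,").lower() for w in sentence.split()
            if w.strip(".,").lower() not in STOPWORDS]

def extract(sentences, k=1):
    """Return the k highest-scoring sentences, in their original order."""
    freq = Counter(w for s in sentences for w in tokens(s))
    score = lambda s: sum(freq[w] for w in tokens(s))
    ranked = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))
    return [sentences[i] for i in sorted(ranked[:k])]

sample = [
    "Acme announced record profits.",
    "The profits of Acme rose because Acme sold more widgets.",
    "Weather was pleasant in Madrid.",
]
print(extract(sample, k=1))  # the sentence densest in frequent terms wins
```

Note that the output is a set of lifted sentences, not a rewritten text; this is exactly the limitation the generation component of the proposed system addresses.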

Work in summarization using symbolic techniques has tended to focus more on identifying information in text that can serve as summary content than on generating the summary itself. The DARPA message understanding systems, which process news articles in specific domains to extract specified types of information, fall within this category. As output, work of this type produces templates that identify important pieces of information in the text, representing them as attribute-value pairs which could be part of a database entry. As stand-alone systems, however, they do not address the full task of summarization, since they do not combine and rephrase the extracted information as a textual summary; moreover, they are domain specific.
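A toy illustration of this template-filling style of extraction: a domain-specific pattern maps a sentence to attribute-value pairs. Real message understanding systems use much richer grammars and domain models; the acquisition pattern, slot names, and sample sentence here are invented.

```python
import re

def extract_template(text: str) -> dict:
    """Fill a toy corporate-acquisition template if the pattern matches;
    the slots mimic the attribute-value output of MUC-style systems."""
    template = {}
    m = re.search(r"(\w+) was acquired by (\w+) for \$([\d.]+) (million|billion)",
                  text)
    if m:
        template["acquired-company"] = m.group(1)
        template["acquiring-company"] = m.group(2)
        template["amount"] = f"${m.group(3)} {m.group(4)}"
    return template

print(extract_template("Acme was acquired by Globex for $2.5 billion."))
```

The filled slots are database-ready, but turning them into fluent prose still requires a generation step, which is the gap the proposed system fills.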

A more comprehensive approach to summarization would use some form of analysis (either statistical, symbolic, or a combination of the two) to identify key information in the input text and language generation techniques to merge and condense information to produce a coherent paragraph. In our previous work on summarization, we have distinguished between conceptual and linguistic summarization. Conceptual summarization must determine which concepts out of a large number of concepts in the input should be included in the summary. Linguistic summarization is concerned with expressing that information in the most concise way possible. Unlike traditional language generation, summarization is concerned with conveying the maximal amount of information within minimal space. To date, there has been far more work on generating summaries of data (e.g., stock market summaries, basketball summaries) than generating summaries of text.

AREA REFERENCES

Luhn, Hans P. The Automatic Creation of Literature Abstracts, IBM Journal of Research and Development, 1958, pp. 159-165.

Kukich, Karen K. Design of a Knowledge-Based Report Generator, Proceedings of the 21st Annual Meeting of the ACL, Cambridge, Mass., 1983, pp. 145-150.

Kathleen R. McKeown. Text generation: Using discourse strategies and focus constraints to generate natural language text. Cambridge University Press, Cambridge, England, 1985.

Paice, Chris D. Constructing Literature Abstracts by Computer: Techniques and Prospects, Information Processing and Management, Vol 26, 1990, pp. 171-186.

Proceedings of any of the DARPA Message Understanding Conferences; for example, Proceedings of the Fourth Message Understanding Conference (MUC-4), DARPA Software and Intelligent Systems Technology Office, 1992.

Klavans, Judith L. and Philip Resnik, eds. (1996) The Balancing Act: Combining Symbolic and Statistical Approaches to Language. MIT Press: Cambridge, Massachusetts.

Proceedings of the ACL/EACL Workshop on Intelligent Scalable Text Summarization, held jointly with the ACL/EACL97 joint conference, Madrid, Spain, 1997.

RELATED PROGRAM AREAS

Adaptive Human Interfaces, Usability and User-Centered Design, Intelligent Interactive Systems for Persons with Disabilities.

POTENTIAL RELATED PROJECTS

This project addresses one aspect of the information overload problem; analysis and summarization of text is just one feature of this very complex issue. Related areas that would inform and enhance our overall results include input from groups working on ways to present information effectively, techniques to organize information intuitively, and cognitive measurements of human acceptability for comprehension of information: how much information is too much, how much is needed for making decisions, how much is too little for certain tasks, and what people want and need to know. Results of such studies will help us determine what kinds of information to extract and what kinds of summaries are most useful. Furthermore, ways to combine textual summaries with audio and multimedia output would improve the value of our results.

The effect of summarization in reducing information overload is also relevant to visually impaired people. The sheer volume of text available makes it impossible for text-to-speech machines (or people) to read everything aloud. This has always been the case, even before the advent of electronic information. Previously, information for the visually impaired was filtered by humans, and then presented in short form. The technology we are developing could help visually impaired persons perform more direct filtering of material themselves, thus giving them a better understanding of the range of possible information available and more control over choice.

The development of reliable evaluation methodologies for measuring the progress and success of summarization systems has not yet been fully addressed in the computational linguistics community. While evaluation itself has been a major component of many of the DARPA-funded information retrieval and information extraction tasks, such as ATIS, MUC, and TIPSTER, none of these projects directly addresses the question of how to measure the success of summarization. One of the tasks we are undertaking is to formulate such evaluation techniques, and other projects addressing factors affecting evaluation would be useful to our research. A related component of evaluation is the development of test-bed material and annotated resources, which requires an enormous investment of time, energy, and expertise to create. We are currently investigating ways to build such annotated material for use in evaluation. We plan to develop evaluation metrics that are appropriate to the summarization task, since standard metrics such as precision and recall do not apply directly.

This report was prepared for the NSF Interactive Systems Grantees Workshop at Stevenson, Washington August 17-19, 1997


