Level of Participation - Billed

Kathleen McKeown, 17%
Vasileios Hatzivassiloglou, 33%
Judith Klavans, 17%

Level of Participation - Unbilled

Kathleen McKeown, 10%
Vasileios Hatzivassiloglou, 10%
Judith Klavans, 10%

Project URL: http://www.cs.columbia.edu/TIDES

Objective:

The aim is to develop a system that automatically generates a short English summary of a set of documents, in multiple languages, on the same event. By providing a concise view of the consensus across documents, the system will dramatically reduce the amount of reading required. By highlighting differences between documents, it will point out inconsistencies between views of the event, whether from different sources or different countries.

Approach:

Columbia University is developing a practical, multilingual and multidocument summarization system. The design features the integration of robust statistical techniques, shallow linguistic approaches and machine learning to achieve scalability within languages and portability across languages. To realize these goals, the research will develop methods for summarization across documents using information fusion and identification of key differences, summarization across languages relying on identification and translation of terms, and new methods for identification, expansion and translation of terms. Support for Chinese and Japanese is currently being implemented, and rapid portability to additional languages is one of the goals of the project.

Key features of the approach include:

Summarization of similarities and differences across multiple documents based on extraction, comparison and reformulation of phrases. Unlike most other approaches, which rely on sentence extraction, this work uses information fusion: repeated phrases are merged into a single phrase, allowing a dramatic reduction of information across many articles. Language generation techniques are used to merge multiple phrases into a short paragraph. The work will focus on characterizing and implementing the types of differences to include in a summary, which is an unexplored direction in multi-document summarization. Difference operators will be developed to identify new information, contradictions, trends, multiple perspectives, and different topics. Thus, a summary of a stream of documents on the same event will concisely present common information, point out differences that are reported, and provide links to the original articles.

Summarization across languages to highlight differences in reports of the same event from different countries. In order to minimize reliance on full machine translation given the possibility for errors and the lack of tools for many languages, the approach will use identification, expansion and translation of terms where possible. Terms will be translated using statistical methods and expanded using structured, cross-language rules. Partially translated documents will be merged to identify repeated information and clear differences. Methods inherent in summarization, such as information fusion, will aid in reducing errors and ambiguity from translation. The result will be an automatically generated English summary of a stream of articles in different languages on the same event.

Generic tools that are used across components and that will be made available to others. Broad-coverage tools required for many tasks will be developed, such as term extraction and expansion, new information detection, short text clustering, paraphrase identification, and segmentation. Statistical algorithms, which offer increased robustness, will be used, integrating shallow forms of symbolic knowledge when available. The result will be a set of tools for language analysis that can be re-used for different tasks, different domains, and different languages.

Rapid deployment in new languages through the use of machine learning to automatically augment data and rules. Our system is designed to require only resources and tools that are broadly available in many languages. For example, part-of-speech tags and bracketings can be obtained relatively easily with a large corpus with no annotations (e.g., a year's worth of news stories in the language of interest) plus a small hand-labeled corpus for bootstrapping. Machine learning techniques will be used to augment the shallow knowledge available for specific languages. Thus, the approach will exploit the benefits of language technology without being limited to a small set of languages for which advanced tools are available. At the same time, it can move beyond traditional IR string matching.

Recent Accomplishments:

Extended SimFinder, our tool for identifying "themes" across an input set of documents. Each theme consists of a set of similar clauses, sentences, or paragraphs. In the past year, we added new features, improved the machine learning component, and increased the robustness of SimFinder, resulting in a relative accuracy increase of 22%. We also evaluated SimFinder's performance on a second data collection (involving 10,000 comparisons of sentences) and tested its generality by interfacing it to multiple summarizers.
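
As a rough illustration of what grouping similar sentences into themes involves, the following minimal sketch clusters sentences by word overlap. It is not SimFinder's method: the Jaccard measure, the 0.5 threshold, and the names tokens, jaccard and find_themes are assumptions made for this example, whereas SimFinder combines multiple features through a machine learning component.

```python
import re
from itertools import combinations


def tokens(sentence):
    """Return the set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two token sets, in [0, 1]."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_themes(sentences, threshold=0.5):
    """Group sentences into themes (hypothetical stand-in for SimFinder).

    Two sentences land in the same theme if a chain of pairs with
    similarity >= threshold connects them (single-link grouping,
    implemented with a small union-find structure).
    """
    parent = list(range(len(sentences)))

    def find(i):
        # Follow parent links to the group representative,
        # shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    token_sets = [tokens(s) for s in sentences]
    for i, j in combinations(range(len(sentences)), 2):
        if jaccard(token_sets[i], token_sets[j]) >= threshold:
            parent[find(i)] = find(j)

    # Collect sentences by representative, preserving input order.
    groups = {}
    for i, sentence in enumerate(sentences):
        groups.setdefault(find(i), []).append(sentence)
    return list(groups.values())


if __name__ == "__main__":
    example = [
        "The plane crashed near the airport on Monday.",
        "On Monday the plane crashed near the airport.",
        "Officials promised a full investigation.",
    ]
    for theme in find_themes(example):
        print(theme)
```

In this sketch the first two example sentences share every word and form one theme, while the third stands alone. A real system would need similarity features that survive paraphrase, which plain word overlap does not.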

Significantly increased the speed and robustness of MultiGen, our similarity-based multi-document summarization system. MultiGen combines statistical measures of similarity produced by SimFinder with information fusion and regeneration to summarize multiple documents, and achieves high compression rates (typically around 50:1). We implemented a client-server architecture and other enhancements that resulted in a five-fold increase in MultiGen's speed, and tested MultiGen with inputs other than articles on the same event.

Developed tools for increasing MultiGen's accuracy and quality of output. We implemented a tool for learning paraphrases out of unmarked text that outperforms alignment techniques proposed earlier in machine translation. We also implemented a tool for ordering the sentences in a summary based on the coherence of successive sentences and time information in the source articles.

Developed a prototype system to discern significant differences between related documents. The system groups common nouns and verbs into semantic classes based on WordNet and compares the ways the classes appear in two documents. We also implemented a multidocument summarizer that uses the above system to produce summaries of a group of related but dissimilar documents.
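
The idea of comparing how semantic classes appear in two documents can be sketched as follows. This is only an illustration under stated assumptions: the tiny hand-written SEMANTIC_CLASSES lexicon stands in for the WordNet-based classes, raw frequency counts stand in for whatever comparison the prototype performs, and the function names class_profile and class_differences are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy lexicon mapping words to semantic classes.
# The prototype derives its classes from WordNet; this table is
# an assumption made only to keep the example self-contained.
SEMANTIC_CLASSES = {
    "attack": "conflict",
    "bombing": "conflict",
    "talks": "negotiation",
    "treaty": "negotiation",
    "agreement": "negotiation",
    "killed": "casualty",
    "injured": "casualty",
    "wounded": "casualty",
}


def class_profile(text):
    """Count how often each semantic class occurs in a document."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(SEMANTIC_CLASSES[w] for w in words if w in SEMANTIC_CLASSES)


def class_differences(doc_a, doc_b):
    """Return {class: (count_in_a, count_in_b)} for classes whose counts differ."""
    profile_a = class_profile(doc_a)
    profile_b = class_profile(doc_b)
    all_classes = sorted(set(profile_a) | set(profile_b))
    return {
        c: (profile_a[c], profile_b[c])
        for c in all_classes
        if profile_a[c] != profile_b[c]
    }


if __name__ == "__main__":
    doc_a = "The bombing killed two people and injured five."
    doc_b = "After the bombing, talks on a treaty resumed."
    print(class_differences(doc_a, doc_b))
```

Both example documents mention the bombing equally often, so the "conflict" class is not reported; only the casualty and negotiation classes, which appear in one document but not the other, are flagged as differences.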

Implemented a conversion tool that unifies in XML the markup of documents coming from different sources, and a router that decides for each set of documents which summarizer is most appropriate for it. As a result, different kinds of document sets (documents on a single event, on related events, on biographical information, etc.) are handled by specialized summarizers in a transparent manner.

Designed an architecture for a multilingual version of SimFinder, using a plug-in architecture for supporting different languages and different similarity features in each language. We also installed and tested a machine learning system for compiling a bilingual lexicon from parallel data.

Current Plan:

Further improve the robustness and speed of SimFinder and MultiGen, achieving an additional doubling of speed for the latter by the end of 2001.

Apply our similarity measures, with appropriate modifications, to the problem of detecting blocks of new information and sub-events as well as changes of topic within an input document. We will also use our information fusion algorithm to automatically identify temporal updates and include them in the summary.

Conduct additional corpus analysis to improve the WordNet-based lexicon now used in the difference recognizer and to develop methods to determine the relative significance of differing passages found in the comparisons. In addition, we will investigate methods for identifying and presenting opposing perspectives, contradictory statements and changes over time.

Implement our new multilingual architecture for SimFinder with plug-ins for Chinese and Japanese, develop a translation method specific to noun phrases for use in the multilingual information fusion component, and adapt MultiGen's regeneration component to handle partially translated or ill-formed input. Part of this work will be based on an annotated Chinese-English parallel corpus to be produced at a DARPA-sponsored Johns Hopkins workshop this summer with the participation of one of our researchers.

Technology Transition:

Currently, we have a number of domain-independent tools for text analysis and generation that are available for dissemination. These tools are building blocks for the larger tasks of detecting similarity across documents, identifying contradictions, partial translation, and summary generation. They include Segmenter, a tool for finding the topical segmentation of documents (http://www.cs.columbia.edu/nlp/licenses/segmenterLicenseDownload.html); LinkIt, a software tool for identifying significant topics in domain-independent full text through the identification of noun phrases, linking those noun phrases, grouping them, and ranking the significance of the noun phrase groups (contact: David Evans devans@cs.columbia.edu); and data collections such as sets of paragraphs from documents on the same event marked for meaning similarity (contact: Regina Barzilay regina@cs.columbia.edu). These tools and collections are currently in use by researchers at the MIT Media Lab, the University of New Mexico, and internationally (France, Israel, New Zealand).

In addition, during the past year we have made prototype versions of our higher-level systems available to researchers at other institutions. Our sentence similarity and clustering tool, SimFinder, has been licensed to CoGenTex, which will use it as a first component of their summarization system under the TIDES program, to collect similar information in groups in a manner similar to our approach to summarization. SimFinder has also been interfaced with components of other systems developed at Columbia, e.g., the large-scale medical summarization system currently being designed as part of Columbia's Digital Libraries Initiative Phase 2, funded by NSF. The MultiGen system is being used by MITRE to summarize collections of articles automatically formed by their retrieval tools, and we are in the process of developing formal APIs for MultiGen and SimFinder so that they can be made generally available to the community.