Level of Participation - Billed

Kathleen McKeown, 17%
Vasileios Hatzivassiloglou, 33%
Judith Klavans, 17%

Level of Participation - Unbilled

Kathleen McKeown, 10%
Vasileios Hatzivassiloglou, 10%
Judith Klavans, 10%
Project URL: http://www.cs.columbia.edu/TIDES

Objective:

The aim is the development of a system that automatically generates a short English summary of a set of documents, in multiple languages, on the same event. By providing a concise view of the consensus on the event, presented across documents, the system dramatically reduces the amount of reading that is required. By highlighting differences between documents, the system will point out inconsistencies in different views of the event, whether from different sources or different countries.

Approach:

Columbia University is developing a practical, multilingual and multidocument summarization system. The design features the integration of robust statistical techniques, shallow linguistic approaches and machine learning to achieve scalability within languages and portability across languages. To realize these goals, the research will develop methods for summarization across documents using information fusion and identification of key differences, summarization across languages relying on identification and translation of terms, and new methods for identification, expansion and translation of terms. Work will begin with a language such as Spanish, but will quickly expand to include Asian languages and other non-Indo-European languages, such as Middle Eastern languages.

Key features of the approach include:

Summarization of similarities and differences across multiple documents based on extraction, comparison and reformulation of phrases. Unlike most other approaches, which rely on sentence extraction, this work uses information fusion: repetitive phrases that convey similar information are merged into a single phrase, allowing a dramatic reduction of information across many articles. Language generation techniques are used to merge multiple phrases into a short paragraph. The proposed work will focus on characterizing and implementing the types of differences to include in a summary, which is an unexplored direction in multi-document summarization. Difference operators will be developed to identify new information, contradictions, trends, multiple perspectives, and different topics. Thus, a summary of a stream of documents on the same event will concisely present common information, point out differences that are reported, and provide links to the original articles.

Summarization across languages to highlight differences in reports of the same event from different countries. In order to minimize reliance on full machine translation given the possibility for errors and the lack of tools for many languages, the approach will use identification, expansion and translation of terms where possible. Terms will be translated using statistical methods and expanded using structured, cross-language rules. Partially translated documents will be merged to identify repeated information and clear differences. Methods inherent in summarization, such as information fusion, will aid in reducing errors and ambiguity from translation. The result will be an automatically generated English summary of a stream of articles in different languages on the same event.

Generic tools that are used across components and that will be made available to others. Broad-coverage tools required for many tasks will be developed, such as term extraction and expansion, new information detection, paraphrase identification, and segmentation. Statistical algorithms, which offer increased robustness, will be used, integrating shallow forms of symbolic knowledge when available. The result will be a set of language analysis tools that can be reused for different tasks, different domains, and different languages.

Rapid deployment in new languages through the use of machine learning to automatically augment data and rules. Our system is designed to require only resources and tools that are broadly available in many languages. For example, part-of-speech tags and bracketings can be obtained relatively easily with a large corpus with no annotations (e.g., a year's worth of news stories in the language of interest) plus a small hand-labeled corpus for bootstrapping. Machine learning techniques will be used to augment the shallow knowledge available for specific languages. Thus, the approach will exploit the benefits of language technology without being limited to a small set of languages for which advanced tools are available. At the same time, it can move beyond traditional IR string matching.

Recent Accomplishments:

Developed a prototype multi-document summarization system, MULTIGEN, which takes a set of English news articles on the same event as input and produces a paragraph summary of similarities across documents as output. The result is a dramatic decrease in the amount of text to read: in one example, given 34 input documents of about 13 sentences each, it produces a paragraph summary of 9 sentences in total.
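To make the scale of that reduction concrete, the figures quoted in the example can be worked through directly. This is only back-of-the-envelope arithmetic on those numbers; the per-document sentence count is approximate in the report, and the variable names are illustrative.

```python
# Figures from the MULTIGEN example; "13 sentences each" is an approximation.
num_documents = 34
sentences_per_document = 13
summary_sentences = 9

input_sentences = num_documents * sentences_per_document  # roughly 442
fraction_kept = summary_sentences / input_sentences
reduction_factor = input_sentences / summary_sentences

print(f"Input: about {input_sentences} sentences")
print(f"Summary keeps about {fraction_kept:.1%} of the input")
print(f"Reduction: roughly {reduction_factor:.0f}x")
```

Under these assumptions the summary retains on the order of 2% of the input sentences.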

Developed a tool for identifying "themes" across an input set of documents. Each theme consists of a set of similar paragraphs. A unique similarity metric based on linguistic features derived from machine learning is used to identify pairs of similar sentences and clustering is used to group pairs of sentences into themes.
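The two-step structure described here (score pairs, then cluster) can be sketched as follows. This is a minimal illustration, not the project's tool: the learned similarity metric is not specified in this report, so plain Jaccard word overlap stands in for it, and the function names, the threshold of 0.5, and the sample sentences are all hypothetical. Clustering is done by linking every pair above the threshold and taking connected components.

```python
import re
from itertools import combinations


def tokens(sentence):
    """Lowercased word set; a crude stand-in for richer linguistic features."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    """Jaccard word overlap (an assumption, not the learned metric)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def find_themes(sentences, threshold=0.5):
    """Group sentences into themes by linking pairs at or above the threshold."""
    parent = list(range(len(sentences)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(sentences)), 2):
        if similarity(sentences[i], sentences[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, sentence in enumerate(sentences):
        groups.setdefault(find(i), []).append(sentence)
    return list(groups.values())


sentences = [
    "The storm hit the coast on Monday.",
    "On Monday the storm hit the coast.",
    "Officials ordered an evacuation of the town.",
]
for theme in find_themes(sentences):
    print(theme)
```

With this toy input, the two reworded storm sentences fall into one theme and the evacuation sentence forms its own. A real metric would need to recognize paraphrases that share few words, which simple overlap cannot do.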

Completed a single-document, domain-independent summarization system (begun under earlier funding) that generates the sentences of a summary by extracting sentences, removing extraneous phrases within the extracted sentences, and regenerating the summary sentences by combining the resulting phrases. It produces more concise (by 88%) and more coherent (by 56%) summaries than the standard method of sentence extraction.

Completed implementation of a range of document clustering techniques operating on both traditional features and features utilizing shallow linguistic analysis. Detailed comparison experiments on a collection of over 40,000 news articles establish that the linguistic features are useful for document clustering. This clustering tool will be used for organizing incoming information into sets of related documents to be processed by our summarizer.

Current Plan:

Improve the efficiency of MULTIGEN tenfold. Summary generation is currently slow, requiring from 28 to 55 minutes per summary, depending on the size and number of input documents. Optimizations are currently being implemented which will reduce running time by a factor of 10.

Increase the robustness of MULTIGEN by experimenting with increased noise in the input set of articles. Current performance degrades as the articles become less similar. Experiments will be performed to identify the source of the degradation and its dependence on the type of noise, and improvements will be implemented to scale MULTIGEN to handle an order of magnitude greater variety in input.

Extend the similarity detection tool so that it can run on input documents in languages other than English. The first step will be to handle input documents that are all in one language other than English, and the second will be to handle input documents in different languages.

Implement a terminology finder for Japanese and Chinese, and tools for translating the identified terms into English.

Technology Transition:

Currently, we have a number of domain independent tools for text analysis and generation that are available for dissemination. These tools are building blocks for the larger tasks of detecting similarity across documents, identifying contradictions, partial translation, and summary generation.

Segmenter

Developed key technology for classifying different types of noun phrases (NPs) with different roles in finding the topical segmentation. Used in primary research at Columbia and at other sites internationally (MIT Media Lab, Avignon Univ. (FR), Waikato Univ. (NZ), and others) to improve segmentation techniques and to develop hierarchical document segmentation (document trees/maps). Segmenter 1.7 is being exported under Columbia's academic license agreement. Its purpose is to segment articles into adjacent, multiple-paragraph topic chunks. The tool runs on UNIX platforms with standard Perl support using access to Berkeley DB files, and accepts input in raw ASCII or SGML. To learn more about Segmenter, contact Min-Yen Kan, min@cs.columbia.edu, or visit the URL: http://www.cs.columbia.edu/nlp/licenses/segmenterLicenseDownload.html

LinkIT

LinkIT is a software tool for identifying significant topics in domain-independent full text through the identification of noun phrases, linking those noun phrases, grouping them, and ranking the significance of the noun phrase groups. LinkIT identifies simplex noun phrases efficiently using a hand-constructed finite state grammar over part-of-speech tags. It has been transferred to several other research sites including LIMSI (C. Jacquemin) and New Mexico State University (J. Wiebe) and is being transferred for use by Columbia University Press as part of an intelligent book indexing system. Contact: David Evans, devans@cs.columbia.edu
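As an illustration of what a finite state grammar over part-of-speech tags looks like, the sketch below recognizes the regular pattern "optional determiner, any adjectives, one or more nouns" in already-tagged text. The pattern, the Penn Treebank style tag names, and the function name are assumptions made for this example; LinkIT's actual grammar is not given in this report and is certainly richer.

```python
def simplex_noun_phrases(tagged):
    """Return phrases matching DT? JJ* NN+ in a list of (word, tag) pairs.

    Hypothetical pattern for illustration only; tags are assumed to follow
    Penn Treebank conventions (DT, JJ/JJR/JJS, NN/NNS/NNP/NNPS).
    """
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        # State 1: optional determiner.
        if tagged[j][1] == "DT":
            j += 1
        # State 2: zero or more adjectives.
        while j < n and tagged[j][1].startswith("JJ"):
            j += 1
        # State 3: one or more nouns (required to accept).
        k = j
        while k < n and tagged[k][1].startswith("NN"):
            k += 1
        if k > j:
            phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return phrases


tagged_sentence = [
    ("The", "DT"), ("new", "JJ"), ("summarization", "NN"), ("system", "NN"),
    ("reads", "VBZ"), ("long", "JJ"), ("news", "NN"), ("articles", "NNS"),
    ("quickly", "RB"),
]
print(simplex_noun_phrases(tagged_sentence))
```

Because the recognizer makes a single left-to-right pass with no backtracking beyond the current start position, it runs in time roughly linear in the number of tokens, which is consistent with the efficiency the description attributes to a finite state approach.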

Corpus of similar paragraphs

Collected a corpus in which paragraphs that convey the same meaning, drawn from articles about the same event, are marked up. Researchers at Columbia and Bar-Ilan University (Israel) have used this corpus to train statistical tools for identifying similar paragraphs in related articles. Point of contact: Regina Barzilay, regina@cs.columbia.edu