We propose to develop a practical, multilingual, multidocument information tracking and summarization system. Our design integrates robust statistical techniques, shallow linguistic approaches, and machine learning to achieve scalability within languages and portability across languages. To realize these goals, we will develop methods for information tracking based on a novel algorithm for identification of events, summarization across documents using information fusion and identification of key differences, summarization across languages relying on identification and translation of terms, and new methods for identification, expansion, and translation of terms. We will begin work with a language such as Spanish, but quickly expand to include Asian languages and other non-Indo-European languages such as Middle Eastern languages.
The key features of our approach include:
Tracking events over time through the development of techniques to identify events based on extraction of participants, location, and time. An event is an activity with a start and end point, involving a fixed set of participants. We hypothesize that a participant, location, and time triple is adequate to uniquely identify an event. Our goal is to develop a clustering architecture that treats events differently from general topics. Our approach will investigate statistical models that use features such as named entities and noun phrases to define an event more accurately than vector-based approaches do. Extracting these features is a task that can be performed with high accuracy and that extends relatively easily to other languages.
Summarization of similarities and differences across multiple documents based on extraction, comparison, and reformulation of phrases. Unlike most other approaches, which rely on sentence extraction, our work uses information fusion of similar information, merging repeated phrases into a single phrase and thereby allowing a dramatic reduction of information across many articles. Our work will focus on characterizing the types of differences to include in a summary, a totally unexplored direction in multi-document summarization. We will develop difference operators to identify new information, contradictions, trends, multiple perspectives, and different topics.
Summarization across languages to highlight differences in reports of the same event from different countries. Our approach will minimize reliance on full machine translation, instead using identification, expansion and translation of terms where possible. Terms will be translated using statistical methods and expanded using structured, cross-language rules. Methods inherent in summarization, such as information fusion, will aid in reducing errors and ambiguity from translation.
Generic tools that are used across components and that will be made available to others. We will develop broad coverage tools required for many tasks, such as term extraction and expansion, paraphrase identification, and segmentation. We use statistical algorithms which offer increased robustness, integrating shallow forms of symbolic knowledge when available.
Rapid deployment in new languages through the use of machine learning to automatically augment data and rules. We have designed our system to require only resources and tools that are broadly available in many languages. For example, part-of-speech tags and bracketings can be obtained relatively easily from a large unannotated corpus (e.g., a year's worth of news stories in the language of interest) plus a small hand-labeled corpus for bootstrapping. Thus, we can bring in the benefits of language technology and move beyond traditional IR string matching without limiting ourselves to the small set of languages for which advanced tools are available.
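To make the event-tracking hypothesis above concrete, the following minimal sketch groups reports into events by comparing their participant, location, and time triples. It is an illustration under stated assumptions, not the proposed algorithm: the proposal calls for statistical models, whereas this sketch substitutes a simple rule-based comparison. The `Report` type, the `same_event` and `cluster_events` functions, and the `max_days` and `min_overlap` thresholds are all hypothetical, and the triples are assumed to have been extracted by an upstream named-entity step.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Report:
    """One news report with a pre-extracted event triple (hypothetical type)."""
    doc_id: str
    participants: frozenset  # named entities taking part in the event
    location: str
    day: date


def same_event(a, b, max_days=2, min_overlap=0.5):
    """Rule-based stand-in for a statistical event model.

    Two reports match when their locations agree, their dates are close,
    and their participant sets overlap enough (Jaccard similarity).
    Both thresholds are illustrative assumptions.
    """
    if a.location.lower() != b.location.lower():
        return False
    if abs((a.day - b.day).days) > max_days:
        return False
    union = a.participants | b.participants
    if not union:
        return False
    return len(a.participants & b.participants) / len(union) >= min_overlap


def cluster_events(reports, **thresholds):
    """Greedy single-pass clustering of reports in chronological order."""
    clusters = []
    for report in sorted(reports, key=lambda r: r.day):
        for cluster in clusters:
            if any(same_event(report, member, **thresholds) for member in cluster):
                cluster.append(report)
                break
        else:
            # No existing event matched, so this report starts a new one.
            clusters.append([report])
    return clusters
```

Because each report is compared on entities, place, and date rather than on a bag-of-words vector, two stories about the same people in different cities, or in the same city months apart, end up in separate events even when their vocabulary is nearly identical.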