Multilingual Multidocument Information Tracking and Summarization
We propose to develop techniques to explore and summarize large, rapidly
changing, multilingual online collections. Our work will make possible automated
tracking of events over time, summarization of multiple documents on the event
of interest, multilingual summarization using techniques to avoid full machine
translation where possible, and a set of domain-independent, portable tools
that will be used for all three tasks. A key feature of our work will be the
integration of robust statistical techniques, shallow linguistic approaches,
and machine learning to automatically acquire any new rules or data required to
move the system to new languages. Multilingual tracking and summarization will
be based on identification and translation of terms, using information fusion
techniques in summarization to fill in gaps and resolve errors and ambiguity
due to translation. To support these capabilities, our work features
representation and identification of events through extraction of participants,
location, and time, and summarization through reformulation of extracted
phrases, using both language generation and statistical methods.
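
As a minimal sketch of the event representation described above, an event could be stored as its participants, location, and time, with a simple heuristic deciding whether two records describe the same event. The names EventRecord and same_event, the thresholds, and the matching rule are all hypothetical illustrations, not the project's actual design.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EventRecord:
    """Hypothetical event representation: participants, location, and time."""
    participants: frozenset
    location: str
    when: date


def same_event(a, b, max_days=3, min_shared=1):
    """Assumed heuristic: two records match if they share a location,
    fall within max_days of each other, and have at least min_shared
    participants in common."""
    close_in_time = abs((a.when - b.when).days) <= max_days
    shared = len(a.participants & b.participants)
    return a.location == b.location and close_in_time and shared >= min_shared
```

A tracking component built on such records could link a new article to an existing event when same_event returns True and open a new event otherwise. A real system would need fuzzier matching of names and places than the exact comparison used here.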
NewsBlaster
The NewsBlaster project showcases many of the methodologies that have resulted
from TIDES research so far. Every night, NewsBlaster crawls the web,
starting from many popular news sites, searching for news articles.
These articles, along with images and captions (if any), are
downloaded. They are then clustered into groups of articles
representing single events and categorized into sections similar to
those from manually constructed news sites. A user can select a
single event (cluster) to obtain an automatically generated summary of
the event based on all the articles in the cluster. Appropriate
images, if any, will also be displayed with the summary. NewsBlaster
provides an efficient way for users to obtain daily updates of news.
Currently, it works only with English news articles, but plans are
underway to add multilingual capabilities.
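
The clustering step described above can be illustrated with a small sketch. It groups articles by word overlap using single-pass clustering with cosine similarity over bag-of-words counts. This is an assumed stand-in technique chosen for brevity, not NewsBlaster's actual clustering method, and the function names and the threshold value are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_articles(articles, threshold=0.3):
    """Single-pass clustering: assign each article to the most similar
    existing cluster if similarity reaches the threshold, else start a
    new cluster. Returns lists of article indices."""
    clusters = []  # each entry: {"centroid": Counter, "members": [indices]}
    for i, text in enumerate(articles):
        vec = bag_of_words(text)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # add counts into the centroid
            best["members"].append(i)
    return [c["members"] for c in clusters]
```

Each returned group of indices plays the role of one event cluster, from which a multidocument summary would then be generated. The result of single-pass clustering depends on article order and on the threshold, so a production system would likely use weighted terms and a more robust algorithm.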