Speaker Name: Martin Jansche
Speaker Info: Post Doc, Center for Computational Learning Systems; jansche@cs.columbia.edu
Date: Thursday, November 11th
Time: 11:30am-12:30pm
Location: CCLS conference room
Abstract:
Finding regions of interest in text is a fundamental task in Natural
Language Processing. Typical regions of interest include noun phrases
(Church 1988, Ramshaw & Marcus 1995), subject-verb phrases (Punyakanok
& Roth 2001), named entities (Bikel et al. 1997), and word tokens
(Zhou 2003), among others. Viewing this task abstractly, one can
speak of smaller ``chunks'' to be located inside larger strings.
Chunking -- the process of finding chunks -- is evaluated like an
information retrieval task in terms of precision and recall.
Evaluation uses a given ``gold standard'' of chunks, against which one
compares the chunks found by a system. The same gold standard can be
used for supervised learning of chunkers. The chunking task is usually
reduced to a sequence labeling task, so at the core of the
corresponding learning task is the well-understood problem of learning
with sequential data (Bengio 1999, Dietterich 2002, Collins 2003).
This reduction leads to two questions that have apparently not even
been asked before: First, how does one undo the reduction and recover
chunks from labeled phrases in a way that is informed by the
evaluation metric of the chunking task? Second, how can the evaluation
metric of the chunking task inform the sequence learning task? This
talk is about the algorithmic challenges posed by these two questions.
I will present algorithms for incorporating the evaluation measure of
the chunking task into a stochastic approach to the sequence labeling
task and its associated learning problem.
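
To make the reduction and the evaluation concrete, here is a minimal
sketch of the conventional baseline, not of the algorithms presented in
the talk. It assumes a B/I/O labeling scheme (one label per token:
begin, inside, outside), which the abstract does not specify, and it
undoes the reduction naively, without reference to the evaluation
metric. The function names labels_to_chunks and precision_recall are
hypothetical.

```python
from typing import List, Set, Tuple

# A chunk is a half-open span [start, end) over token positions.
Chunk = Tuple[int, int]


def labels_to_chunks(labels: List[str]) -> Set[Chunk]:
    """Recover chunks from a B/I/O label sequence (assumed scheme)."""
    chunks: Set[Chunk] = set()
    start = None  # start position of the chunk currently open, if any
    for i, label in enumerate(labels):
        if label == "B":
            # A new chunk begins; close any chunk that was still open.
            if start is not None:
                chunks.add((start, i))
            start = i
        elif label == "I":
            # Lenient choice: a stray "I" with no open chunk starts one.
            if start is None:
                start = i
        else:  # "O"
            if start is not None:
                chunks.add((start, i))
            start = None
    if start is not None:
        chunks.add((start, len(labels)))
    return chunks


def precision_recall(gold: Set[Chunk], found: Set[Chunk]) -> Tuple[float, float]:
    """Exact-match precision and recall of found chunks against gold chunks."""
    correct = len(gold & found)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold_labels = ["B", "I", "O", "B", "O", "B", "I", "I"]
system_labels = ["B", "I", "O", "O", "O", "B", "I", "O"]

gold_chunks = labels_to_chunks(gold_labels)
system_chunks = labels_to_chunks(system_labels)
precision, recall = precision_recall(gold_chunks, system_chunks)
print(sorted(gold_chunks), sorted(system_chunks), precision, recall)
```

In this example the system finds two chunks, only one of which matches
a gold chunk exactly, so precision is 1/2 and recall is 1/3. Note that
the third system chunk agrees with the gold standard on two of three
token labels yet earns no credit, which illustrates why per-label
accuracy and chunk-level precision and recall can diverge.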