New Algorithms for Minimum Risk Chunking and Extraction

Speaker Name: Martin Jansche
Speaker Info: Post Doc, Center for Computational Learning Systems; jansche@cs.columbia.edu
Date: Thursday, November 11th
Time: 11:30am-12:30pm
Location: CCLS conference room

Abstract:
Finding regions of interest in text is a fundamental task in Natural Language Processing. Typical regions of interest include noun phrases (Church 1988, Ramshaw & Marcus 1995), subject-verb phrases (Punyakanok & Roth 2001), named entities (Bikel et al. 1997), and word tokens (Zhou 2003), among others. Viewing this task abstractly, one can speak of smaller ``chunks'' to be located inside larger strings. Chunking -- the process of finding chunks -- is evaluated like an information retrieval task in terms of precision and recall. Evaluation uses a given ``gold standard'' of chunks, against which one compares the chunks found by a system.
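
To make the evaluation concrete, here is a minimal sketch of chunk-level precision and recall, assuming each chunk is represented as a (start, end) pair of token offsets and that a predicted chunk counts as correct only if it matches a gold-standard chunk exactly. The function name chunk_scores and the span representation are illustrative assumptions, not something specified in the abstract.

```python
def chunk_scores(gold, predicted):
    """Return (precision, recall, F1) for predicted chunks against gold chunks.

    Both arguments are iterables of (start, end) spans (an assumed
    representation). A predicted chunk is correct only on an exact match.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)

    # Guard against empty sets to avoid division by zero.
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: three gold chunks, four predicted, two exact matches.
gold = [(0, 2), (3, 5), (7, 8)]
predicted = [(0, 2), (3, 4), (7, 8), (9, 10)]
p, r, f = chunk_scores(gold, predicted)
print(p, r, f)
```

In this made-up example two of the four predicted chunks are correct (precision 1/2) and two of the three gold chunks are found (recall 2/3). The chunk (3, 4) earns no partial credit for overlapping the gold chunk (3, 5).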

The same gold standard can be used for supervised learning of chunkers. The chunking task is usually reduced to a sequence labeling task, so at the core of the corresponding learning task is the well-understood problem of learning with sequential data (Bengio 1999, Dietterich 2002, Collins 2003). This reduction leads to two questions that have apparently not even been asked before: First, how does one undo the reduction and recover chunks from labeled phrases in a way that is informed by the evaluation metric of the chunking task? Second, how can the evaluation metric of the chunking task inform the sequence learning task? This talk is about the algorithmic challenges posed by these two questions. I will present algorithms for incorporating the evaluation measure of the chunking task into a stochastic approach to the sequence labeling task and its associated learning problem.
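
For readers unfamiliar with the reduction, the following sketch shows the conventional way of undoing it, assuming a B/I/O label encoding (B begins a chunk, I continues it, O is outside any chunk). The encoding, the function name labels_to_chunks, and the lenient handling of a stray I label are all assumptions made for illustration. This is the simple deterministic decoding, which ignores the evaluation metric; it is not the minimum-risk algorithm presented in the talk.

```python
def labels_to_chunks(labels):
    """Convert a sequence of "B"/"I"/"O" labels into (start, end) chunks.

    End offsets are exclusive. An "I" with no open chunk is treated as
    starting a new chunk (a lenient convention chosen for this sketch).
    """
    chunks = []
    start = None  # start offset of the currently open chunk, if any
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:
                chunks.append((start, i))  # close the previous chunk
            start = i
        elif label == "I":
            if start is None:
                start = i  # stray "I": open a chunk here
        elif label == "O":
            if start is not None:
                chunks.append((start, i))
                start = None
        else:
            raise ValueError(f"unknown label: {label!r}")
    if start is not None:
        chunks.append((start, len(labels)))  # chunk runs to end of input
    return chunks


# Hypothetical label sequence for a seven-token sentence.
print(labels_to_chunks(["B", "I", "O", "B", "O", "I", "B"]))
```

Because the decoding is a fixed function of the single best label sequence, one mislabeled token can destroy an entire chunk under exact-match scoring. That gap between per-label decisions and chunk-level evaluation is what the first question above is concerned with.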