Bob Carpenter, Alias I, Inc.
Natural Language Scientist and Chief Software Architect

carp@alias-i.com


Title: Character Language Modeling for Word Segmentation and Entity Detection

Abstract:

I'll discuss the application of LingPipe's character language models to two problems in Chinese language processing: word segmentation and named entity extraction.

For word segmentation, we use the same noisy channel model that we use for spelling correction. The source model is a character language model trained on word-segmented Chinese data. The channel model is weighted edit distance; for word segmentation, this reduces to deterministic space deletion. There are no Chinese-specific features at all in the models. The bakeoff F1 measure for our segmenter was .961; the winning F1 was .972.
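
The noisy channel setup above can be illustrated with a small sketch. This is not LingPipe's code: the class `CharNGramLM`, the function `segment`, the add-k smoothing, the beam search, and the three training sentences are all assumptions chosen to keep the example short. The source model is a character trigram model trained on space-separated text. Because the channel only deletes spaces, decoding amounts to choosing, before each input character, whether the source string had a space there.

```python
import math
from collections import Counter


class CharNGramLM:
    """Toy character n-gram model with add-k smoothing (illustrative only)."""

    def __init__(self, n=3, k=0.01):
        self.n, self.k = n, k
        self.ngrams = Counter()    # counts of context + next character
        self.contexts = Counter()  # counts of context alone
        self.vocab = set()

    def train(self, segmented_lines):
        for line in segmented_lines:
            # "^" pads the start of the line, "$" marks its end.
            text = "^" * (self.n - 1) + line.strip() + "$"
            self.vocab.update(text)
            for i in range(self.n - 1, len(text)):
                ctx = text[i - self.n + 1:i]
                self.ngrams[ctx + text[i]] += 1
                self.contexts[ctx] += 1

    def log_prob(self, history, ch):
        ctx = history[-(self.n - 1):] if self.n > 1 else ""
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        num = self.ngrams[ctx + ch] + self.k
        return math.log(num / (self.contexts[ctx] + self.k * v))


def segment(lm, chars, beam=8):
    """Re-insert the spaces that the deterministic channel deleted."""
    start = "^" * (lm.n - 1)
    hyps = [(0.0, start, "")]  # (log probability, history, output so far)
    for ch in chars:
        nxt = []
        for score, hist, out in hyps:
            # Hypothesis 1: the source had no space before ch.
            nxt.append((score + lm.log_prob(hist, ch), hist + ch, out + ch))
            # Hypothesis 2: the source had a space before ch.
            if out:
                s = score + lm.log_prob(hist, " ") + lm.log_prob(hist + " ", ch)
                nxt.append((s, hist + " " + ch, out + " " + ch))
        hyps = sorted(nxt, key=lambda h: h[0], reverse=True)[:beam]
    # Score the end-of-line event and keep the best complete hypothesis.
    final = [(s + lm.log_prob(h, "$"), o) for s, h, o in hyps]
    return max(final, key=lambda f: f[0])[1]


lm = CharNGramLM(n=3)
lm.train(["我 喜欢 北京", "我 喜欢 上海", "北京 欢迎 你"])
print(segment(lm, "我喜欢北京"))  # prints: 我 喜欢 北京
```

Beam search is approximate, so a wider beam or an exact dynamic program over n-gram contexts may be preferable for real data. Nothing in the sketch is specific to Chinese: the model sees only characters and spaces.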

For named entity extraction, we use a two-stage process. The first stage is an HMM with character language model emissions. For Chinese, where we consider each character a token, this reduces to the more usual multinomial emission HMM. We encode entity extraction as a tagging problem, using fine-grained states to effectively encode a higher-order HMM. For rescoring, we use a pure character language model approach that allows longer-distance dependencies, encoding chunking information as characters within the models. As with word segmentation, there are no Chinese-specific features. The bakeoff F1 for our entity extractor was .855; the winning F1 was .890.
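
To make the "entity extraction as tagging" idea concrete, here is a minimal sketch of one possible fine-grained tag encoding. The tag names (`B_`, `M_`, `E_`, `W_`, `O`) and the helper functions `encode` and `decode` are assumptions for illustration; the state set LingPipe actually uses may differ. Each character gets a tag recording both the entity type and its position within the entity, so a first-order tagger over these tags carries some of the context that a higher-order model over plain in/out tags would need.

```python
def encode(chars, chunks):
    """Turn chunks into per-character tags.

    chunks: non-overlapping (start, end, type) triples, end exclusive.
    """
    tags = ["O"] * len(chars)
    for start, end, typ in chunks:
        if end - start == 1:
            tags[start] = "W_" + typ          # single-character entity
        else:
            tags[start] = "B_" + typ          # first character
            for i in range(start + 1, end - 1):
                tags[i] = "M_" + typ          # middle characters
            tags[end - 1] = "E_" + typ        # last character
    return tags


def decode(tags):
    """Recover (start, end, type) chunks from a tag sequence."""
    chunks, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        pos, typ = tag.split("_", 1)
        if pos == "W":
            chunks.append((i, i + 1, typ))
        elif pos == "B":
            start = i
        elif pos == "E" and start is not None:
            chunks.append((start, i + 1, typ))
            start = None
    return chunks


text = "江泽民在北京"
chunks = [(0, 3, "PER"), (4, 6, "LOC")]
tags = encode(text, chunks)
print(tags)          # ['B_PER', 'M_PER', 'E_PER', 'O', 'B_LOC', 'E_LOC']
print(decode(tags))  # [(0, 3, 'PER'), (4, 6, 'LOC')]
```

The round trip from chunks to tags and back is lossless for non-overlapping chunks, which is what lets a tagger's output be read back as an entity chunking.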

Time permitting, I'll discuss our confidence-ranking entity and part-of-speech taggers and show some output from MEDLINE POS tagging and gene mention extraction.

The LingPipe web site provides tutorials on both word segmentation and entity extraction. There are also web demos for both applications. The sandbox contains the complete code used to generate entries for the SIGHAN bakeoff; the data is available from SIGHAN. Two papers covering roughly the same material as the talk are:




About the speaker:

Before joining Alias I, Bob Carpenter was at SpeechWorks International and Lucent Technologies Multimedia Communications Laboratory. Prior to Lucent, he was an Associate Professor of Computational Linguistics in the Philosophy Department at Carnegie Mellon. His Ph.D. is from the University of Edinburgh. For more details about his projects and interests, see Bob Carpenter's Projects.