If a speech synthesizer were reading this sentence, it cld have some trbl since abbrvs are being used. Preparing text to be read by a speech synthesizer, or for other kinds of subsequent processing (such as automatic translation), is known as 'text normalization' and may include tasks like abbreviation expansion or number name identification, depending on the application. Incorrect normalization can be particularly damaging for applications like text-to-speech synthesis (TTS), where the result is presented directly to the user: it is much worse to normalize an abbreviation to the wrong expansion than to simply let the synthesizer read the abbreviation as a letter sequence. In this talk, I'll present recent work on abbreviation expansion for TTS following a “do no harm”, high-precision approach, which yields few expansion errors at the cost of leaving relatively many abbreviations unexpanded. After introducing and motivating text normalization, I'll present methods for training classifiers to establish whether a particular expansion is apt. To do so, we leverage very large amounts of unlabeled data to build models of abbreviations and of clean (expanded) text; we then train a classifier on a small amount of labeled data, using features derived from those models. When combined with the baseline text normalization component of the TTS system, our methods yield a large increase in correct abbreviation expansions, together with a substantial reduction in incorrect expansions. I'll close the talk with a discussion of some additional precision-improving methods and next steps. Joint work with Richard Sproat.
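To make the “do no harm” decision rule concrete, here is a minimal sketch in Python of how such a high-precision expansion step might look. Everything in it is an illustrative assumption rather than the actual system: the feature names, the stand-in linear classifier, and the confidence threshold are all hypothetical; the real work uses classifiers trained on labeled data with features derived from the unsupervised abbreviation and clean-text models.

```python
# Hypothetical sketch of a "do no harm" abbreviation-expansion decision.
# Feature names, the classifier, and the threshold are illustrative
# assumptions, not the actual system described in the talk.
from dataclasses import dataclass

@dataclass
class Candidate:
    abbreviation: str          # e.g. "St."
    expansion: str             # e.g. "Street"
    lm_score: float            # score of the expansion in context under a
                               # model trained on clean (expanded) text
    abbrev_model_score: float  # plausibility of expansion -> abbreviation
                               # under a model built from unlabeled data

def classifier_confidence(c: Candidate) -> float:
    """Stand-in for a trained classifier: combines features derived from
    the unsupervised models into a single confidence in [0, 1]."""
    # A real system would train this on a small labeled set; the linear
    # combination below is purely illustrative.
    score = 0.6 * c.lm_score + 0.4 * c.abbrev_model_score
    return max(0.0, min(1.0, score))

def normalize(c: Candidate, threshold: float = 0.9) -> str:
    """Expand only when confidence clears a high-precision threshold;
    otherwise leave the abbreviation for the TTS letter-sequence fallback."""
    if classifier_confidence(c) >= threshold:
        return c.expansion
    return c.abbreviation  # "do no harm": prefer no expansion to a wrong one

# A confident candidate expands; a doubtful one is left unexpanded.
print(normalize(Candidate("St.", "Street", 0.95, 0.9)))  # -> "Street"
print(normalize(Candidate("St.", "Saint", 0.4, 0.5)))    # -> "St."
```

Setting the threshold high trades coverage for precision: fewer abbreviations are expanded, but those that are expanded are rarely wrong, which is the stated goal of the approach.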
Before joining Google as a research scientist in mid-2013, Brian Roark was a faculty member for nine years in the Center for Spoken Language Understanding (CSLU) at Oregon Health & Science University (OHSU) in Portland, Oregon. He received his PhD from Brown University in 2001 and spent three years at AT&T Labs - Research before joining CSLU. He is a computational linguist working on various topics in natural language processing. His research interests include language modeling for automatic speech recognition and other applications; syntactic parsing of text and speech; supervised and unsupervised learning of language and parsing models; text entry, accessibility, and augmentative & alternative communication (AAC); and spoken language processing for diagnosis of neurodevelopmental and neurodegenerative disorders.