Morphological Preprocessing for Statistical Machine Translation

talk by Nizar Habash, Oct. 19, CCLS conf. room


Statistical machine translation is quite robust when it comes to the
choice of input representation. It only requires consistency between
training and testing. As a result, there is a wide range of possible
preprocessing choices for data used in statistical machine translation.
This is even more true for morphologically rich languages such as Arabic. In
this talk, I present a study of the effect of eleven different word-level
preprocessing schemes for Arabic (and techniques for applying them) on the
quality of phrase-based statistical machine translation.
The study results show that given large amounts of training data,
splitting off only proclitics performs best. However, for small amounts of
training data, it is best to apply English-like tokenization using
part-of-speech tags, and sophisticated morphological analysis and
disambiguation. Moreover, choosing the appropriate preprocessing produces
a significant increase in BLEU score if there is a change in genre between
training and test data. Finally, I present approaches for combining
preprocessing schemes that result in improved translation quality.
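To make the notion of a preprocessing scheme concrete, below is a minimal,
purely illustrative Python sketch of proclitic splitting on words in
Buckwalter transliteration. The function name, the greedy rules, and the
example word are my own assumptions for illustration; the actual systems
discussed in the talk rely on morphological analysis and disambiguation to
resolve ambiguity that simple prefix stripping cannot.

```python
# Toy rule-based splitter that detaches common Arabic proclitics from words
# in Buckwalter transliteration, producing tokens such as "w+ b+ AlktAb".
# This is an illustrative sketch, not the implementation used in the study.

# Proclitics checked in order: conjunctions (w+, f+), prepositions/particles
# (b+, l+, k+), and the definite article (Al+).
CONJUNCTIONS = ("w", "f")
PARTICLES = ("b", "l", "k")
ARTICLE = "Al"


def split_proclitics(word: str, split_article: bool = False) -> list[str]:
    """Greedily strip proclitics from the front of a Buckwalter-transliterated
    Arabic word. With split_article=False this roughly corresponds to a
    'proclitics only' scheme; True also detaches the determiner Al+."""
    tokens = []
    stem = word

    # Conjunction proclitic (w+ / f+), only if a substantial stem remains.
    if len(stem) > 2 and stem[0] in CONJUNCTIONS:
        tokens.append(stem[0] + "+")
        stem = stem[1:]

    # Prepositional/particle proclitic (b+ / l+ / k+).
    if len(stem) > 2 and stem[0] in PARTICLES:
        tokens.append(stem[0] + "+")
        stem = stem[1:]

    # Definite article Al+ (optional, depending on the scheme).
    if split_article and len(stem) > 3 and stem.startswith(ARTICLE):
        tokens.append(ARTICLE + "+")
        stem = stem[len(ARTICLE):]

    tokens.append(stem)
    return tokens


if __name__ == "__main__":
    # "wbAlktAb" ~ "and with the book"
    print(split_proclitics("wbAlktAb"))                      # ['w+', 'b+', 'AlktAb']
    print(split_proclitics("wbAlktAb", split_article=True))  # ['w+', 'b+', 'Al+', 'ktAb']
```

The intuition is that splitting off clitics exposes Arabic function morphemes
as separate tokens that align more directly with English function words,
which matters most when training data is too sparse to learn the attached
forms as whole units.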

NOTE: I will also present an overview of the machine translation (MT)
effort at Columbia over the past year. There are three MT projects that
explore different degrees of hybridization of statistical and rule-based
approaches: (a.) statistically enriched generation-heavy MT, (b.)
syntax-aware MT, and (c.) linguistic preprocessing for statistical MT. The
focus of the talk will be on (c.).