Combination of Statistical Word Alignments Based on
Multiple Tokenization Schemes


talk by Jakob Elming, Nov. 16


Statistically determined word alignment using GIZA++ is standard
practice in NLP research that exploits parallel data. Word alignments
are used in many applications, from machine translation to the
projection of linguistic knowledge from one language to another. Some
applications, such as phrase-based MT, are relatively robust to
alignment errors, whereas others, such as parse projection or
gapped-phrase-based MT, are more sensitive. The focus of this research
is improving word-alignment quality under the high sparsity caused by
morphological complexity.


The approach we propose exploits the complementarity of the alignment
decisions made when training on multiple tokenizations of a
morphologically rich language such as Arabic (here, in an
Arabic-English alignment task). We use supervised machine learning to
discover the best way to combine the tokenization-variant alignments,
and we show large reductions in alignment error rate over a standard
baseline.
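To make the combination idea concrete, here is a minimal sketch, not the authors' actual system: alignments produced under different tokenization schemes are first remapped to a common base-word indexing, then combined. The talk's approach learns the combination with a supervised classifier; simple majority voting over the variants stands in for it here, and all names and data are illustrative.

```python
from collections import Counter

def remap(alignment, seg_to_word):
    """Map links over segmented tokens back to base-word indices."""
    return {(seg_to_word[src], tgt) for src, tgt in alignment}

def combine(alignments, min_votes=2):
    """Keep a link if at least min_votes tokenization variants propose it.
    (A stand-in heuristic; the described method learns this decision.)"""
    votes = Counter(link for al in alignments for link in al)
    return {link for link, n in votes.items() if n >= min_votes}

# Two tokenization variants of a 3-word source sentence:
# variant A splits word 0 into two segments, variant B does not.
align_A = {(0, 0), (1, 0), (2, 1), (3, 2)}   # links over A's segments
align_B = {(0, 0), (1, 1), (2, 2)}           # links over B's tokens
A_to_word = {0: 0, 1: 0, 2: 1, 3: 2}         # A segment -> base word
B_to_word = {0: 0, 1: 1, 2: 2}               # B token   -> base word

combined = combine([remap(align_A, A_to_word),
                    remap(align_B, B_to_word)])
print(sorted(combined))  # -> [(0, 0), (1, 1), (2, 2)]
```

Remapping first is what lets variants with different segment counts vote on the same link space, which is the prerequisite for any per-link combination scheme.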