Generation-Heavy Hybrid Machine Translation

Speaker Name: Nizar Habash
Speaker Info: Postdoctoral Researcher, Center for Computational Learning Systems;
habash@cs.columbia.edu
Date: Thursday, October 28th
Time: 11:30am-12:30pm
Location: CS Conference Room

Abstract:
Generation-Heavy Hybrid Machine Translation (GHMT) is an asymmetrical hybrid approach that addresses the issue of MT resource poverty in source-poor/target-rich language pairs by exploiting available symbolic and statistical target-language (TL) resources. This talk presents a specific implementation of this approach in which the expected source-language (SL) resources include a syntactic parser and a simple one-to-many translation dictionary. Expensive parallel resources, such as transfer rules, complex interlingual lexicons, or even bitexts, are not used. Rich TL symbolic resources such as word lexical semantics, categorial variations, and subcategorization frames are used to overgenerate multiple structural variations from a TL-glossed syntactic dependency representation of SL sentences. This SL-independent symbolic overgeneration accounts for possible translation divergences: cases where the underlying concept or "gist" of a sentence is distributed differently in two languages. The overgeneration is constrained by multiple statistical TL models, including surface n-grams and structural n-grams. The first implementation of this approach focused on Spanish-English MT. An evaluation of this system will be presented, together with a discussion of ongoing work on retargeting the system to Chinese and Arabic.
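The overgenerate-and-rank idea at the heart of GHMT can be sketched in miniature. The following is a toy illustration only; the gloss dictionary, the single categorial-variation rule, and the bigram counts are all invented for this example and are not the actual resources of the system. It glosses a Spanish sentence exhibiting a classic translation divergence ("Juan entró corriendo" vs. "John ran in"), overgenerates structural variants, and lets a crude bigram surrogate of a TL statistical model pick the winner.

```python
# Toy sketch of GHMT-style overgeneration constrained by TL statistics.
# All data below (glosses, the variation rule, bigram counts) is invented
# for illustration; it does not reflect the system's actual resources.
from itertools import product

# One-to-many SL->TL gloss dictionary for "Juan entró corriendo".
GLOSSES = {
    "Juan": ["John"],
    "entró": ["entered"],
    "corriendo": ["running"],
}

# A single hand-written categorial-variation rule: a motion verb plus a
# manner gerund may be re-expressed as a manner verb plus a particle.
CAT_VAR = {("entered", "running"): ("ran", "in")}

# Toy bigram counts standing in for a TL surface n-gram model.
BIGRAMS = {
    ("John", "ran"): 5, ("ran", "in"): 5,
    ("John", "entered"): 2, ("entered", "running"): 1,
}

def overgenerate(sl_tokens):
    """Expand each SL token to all TL glosses, then add structural
    variants licensed by the categorial-variation rules."""
    candidates = [list(c) for c in product(*(GLOSSES[t] for t in sl_tokens))]
    for cand in list(candidates):
        for i in range(len(cand) - 1):
            pair = (cand[i], cand[i + 1])
            if pair in CAT_VAR:
                candidates.append(cand[:i] + list(CAT_VAR[pair]) + cand[i + 2:])
    return candidates

def score(cand):
    """Rank a candidate by summed bigram counts (a crude LM surrogate)."""
    return sum(BIGRAMS.get(p, 0) for p in zip(cand, cand[1:]))

def best_translation(sl_tokens):
    return " ".join(max(overgenerate(sl_tokens), key=score))

print(best_translation(["Juan", "entró", "corriendo"]))  # -> John ran in
```

Here symbolic knowledge proposes the divergent variant and the statistical model selects it, mirroring the division of labor the abstract describes, though the real system operates over dependency structures rather than flat token strings.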