Word Sense Disambiguation within a Multilingual Framework

Speaker Name: Mona Diab
Speaker Info: Postdoc, CCLS Group; mdiab@cs.columbia.edu
Date: Thursday March 31
Time: 10:30am-11:30am
Location: CCLS Conference Room (Interchurch)

Ambiguity is an inherent characteristic of natural language, permeating its various levels of representation. From a human language processing perspective, ambiguity is not a severe problem. However, from a machine processing perspective, the story is quite different. Resolving ambiguity in natural language has been of central interest to researchers since the early 1950s. In particular, Word Sense Disambiguation (WSD) occupied center stage in the early work on Natural Language Processing (NLP).

What constitutes a word sense is a subject of great debate. An appealing perspective aims to define senses in terms of their multilingual correspondences. This idea has been explored by several researchers [Dyvik (1998), Ide (1999), Resnik & Yarowsky (1999), and Chugur, Gonzalo & Verdejo (2002)], but to date it has not been given any practical demonstration. This talk presents an empirical validation of these ideas of characterizing word meaning using cross-linguistic correspondences. The idea is that a word sense is quantifiable to the extent that it is uniquely translated in some language or set of languages.

Consequently, I address the problem of WSD from a multilingual perspective; I expand the notion of context to encompass multilingual evidence. I devise a new approach to resolving word sense ambiguity in natural language, using a source of information that has never before been exploited on a large scale for WSD.

The core of the work presented in this talk builds on exploiting word correspondences across languages for sense distinction. In essence, it is a practical and functional implementation of the basic idea shared by research on defining word meanings in cross-linguistic terms.

I devise an algorithm, SALAAM, that empirically investigates the feasibility and validity of utilizing translations for WSD. SALAAM is an unsupervised approach to word sense tagging of large amounts of text, given a parallel corpus and a sense inventory for one of the languages in the corpus. Using SALAAM, I obtain large amounts of sense-annotated data in both languages of the parallel corpus simultaneously. The quality of the tagging is rigorously evaluated for both languages of the corpus, yielding some of the best results to date for an unsupervised method.