Grammar Approximation by Representative Sublanguage:

A New Model for Grammar Induction

Speaker Name: Smaranda Muresan
Speaker Info: PhD student, NLP Group; smara@cs.columbia.edu
Date: Thursday October 20 27
Time: 11:30am-12:30pm
Location: CCLS Conference Room (Interchurch) / CS Conference Room (Mudd)

Abstract:
The question "What does it mean to learn language?" is one of the great topics of scientific inquiry. The problem has attracted researchers in linguistics, computer science, and cognitive science, and attempts to answer it vary greatly from discipline to discipline. In my thesis, I treat language learning as a grammar learning problem: the grammar encodes both syntax and semantics, and an ontology is used during learning to provide access to meaning.

In this talk, I will present a new, computationally efficient model for language learning, called Grammar Approximation by Representative Sublanguage (GARS). In this model, a language is taken to be a set of strings together with their syntactic-semantic representations. The learner is presented with a set of positive representative examples of the target language, together with an additional set of positive examples used for generalization, which we call a representative sublanguage. The task of the learner is to induce a grammar that generates the target language. Constraint-based grammar formalisms have been widely used to capture natural language, and we define a new type of constraint-based grammar, Lexicalized Well-Founded Grammar (LWFG), that is always learnable under the GARS model, i.e., learning always converges to the target grammar. We show that the search space is a grammar lattice, and we provide polynomial-time algorithms for grammar induction that are provably correct.
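
For readers who want a concrete picture before the talk, here is a toy sketch in Python. It is purely illustrative and assumes a heavily simplified setting: a small pool of CNF context-free rules, a subset lattice in place of the LWFG grammar lattice, and made-up example sentences. None of the names or procedures below are the actual LWFG induction algorithms; they only show how extra positive examples (the representative sublanguage) can steer the learner toward the right generalization.

    # Toy illustration of the GARS idea (hypothetical names and data):
    # search a lattice of candidate grammars, ordered by specificity, and
    # return the most specific one that still covers both example sets.
    # Here the "lattice" is just the subset lattice of a small CNF rule pool.

    from itertools import combinations

    RULE_POOL = [
        ("S", ("NP", "VP")),
        ("NP", ("Det", "N")),
        ("NP", ("NP", "PP")),      # recursive NP: a candidate generalization
        ("VP", ("V", "NP")),
        ("PP", ("P", "NP")),
        ("Det", "the"), ("N", "cat"), ("N", "mat"),
        ("V", "saw"), ("P", "on"),
    ]

    def parses(rules, words):
        """CKY recognition: can this rule set derive the sentence from S?"""
        n = len(words)
        chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
        for i, w in enumerate(words):
            chart[i][i + 1] = {lhs for lhs, rhs in rules if rhs == w}
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                k = i + span
                for j in range(i + 1, k):
                    for lhs, rhs in rules:
                        if (isinstance(rhs, tuple) and rhs[0] in chart[i][j]
                                and rhs[1] in chart[j][k]):
                            chart[i][k].add(lhs)
        return "S" in chart[0][n]

    def covers(rules, sentences):
        return all(parses(rules, s.split()) for s in sentences)

    def induce(representative, sublanguage):
        """Most specific hypothesis (smallest rule set) covering both sets."""
        for size in range(1, len(RULE_POOL) + 1):
            for rules in combinations(RULE_POOL, size):
                if covers(rules, representative) and covers(rules, sublanguage):
                    return rules
        return None

    representative = ["the cat saw the mat"]
    sublanguage = ["the cat on the mat saw the cat"]   # forces NP recursion

    print(len(induce(representative, [])), "rules without the sublanguage")
    print(len(induce(representative, sublanguage)), "rules with it")

With the representative example alone, the learner settles on a small, non-recursive grammar; the sublanguage sentence forces it to adopt the recursive NP and PP rules. In GARS proper, this choice among generalizations is made over a lattice of LWFGs rather than rule subsets, which is what makes polynomial-time, provably convergent induction possible.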