Towards Efficient Statistical Parsing Using Lexicalized Grammatical Information

John Chen

Many natural language understanding systems require efficient and accurate parsing disambiguation to be effective. State-of-the-art parsers owe their high performance in large part to statistical modeling of lexical features. Although lexicalized tree adjoining grammar (TAG) is a lexicalized grammatical formalism for natural language, its use in statistical parsing has remained relatively unexplored. In this work, I address issues in statistical TAG parsing. First, I explore linear-time TAG parsing disambiguation (supertagging). Through careful analysis and use of features, I achieve the highest reported accuracies on this task. Second, to provide a robust resource for statistical TAG models, I develop and evaluate procedures for extracting a TAG from a treebank. I also introduce procedures that organize the resulting grammar, both for smoothing purposes and to help interface the grammar with semantics or other grammatical frameworks. Third, I explore smoothing approaches for TAG; smoothing is essential because of the data sparseness inherent in broad-coverage TAG parsers.