CS 4705: Natural Language Processing

Prof. Kathleen R. McKeown

Fall 1998; Mon & Wed 2:40-4:00

Newsgroup: columbia.fall.cs4705

Newsgroup: CUNIX Web-based newsgroup

Office Hours

About the Course

Prerequisites

Assignments and Grading

Submitting Assignments

Office Hours

Prof. Kathleen R. McKeown
Monday 4:00-5:00, 450 CS Building
Thursday 11:00-12:00, 450 CS Building

TA: Barry Schiffman
Monday 11-12, 702 CEPSR
Tuesday 11-12, 702 CEPSR

About the Course

With the explosion of the Internet, the amount of online textual material has grown exponentially. There are many situations in which automatic natural language processing could be used to exploit online material and make it easier to use the Internet. Some of these applications are already available online. For example, tools exist to translate any web page on the Internet from English into another language and back. Tools to automatically summarize a textual document are beginning to appear. These summaries can help a user of an information retrieval (IR) system determine whether a retrieved document is relevant without having to read the full document. Telephone spoken language interfaces to online material are also beginning to appear. These systems allow a user to call up and get information about the weather, for example, from an online weather source.

This class will cover the basic techniques used to create such systems. We will explore techniques and tools used for parsing input text or speech (syntax), techniques for understanding the meanings of words within those sentences (semantics), and techniques that interpret a sentence within the context of surrounding text or dialogue (pragmatics). We will see how these tools can be used in practical systems that understand or produce language. We will spend a good amount of time on new, statistical methods that can be used to scale up a system to handle a wide variety of input and output.

The assignments in this class will all center around a single project on natural language for the Internet. There will be some choice in the design of the project, which will involve summarization and information extraction. Assignments will be individual concrete modules of this project, culminating in a demo of the full project at the end of the semester. The final exam will be a written take-home exam, examining how theoretical issues addressed in the class might apply to summarization and information extraction. Summarization, particularly multi-lingual summarization, is currently a topic of much interest in the field. Recent advances make the problem look more feasible than it was in earlier years. We will study some recent advances in both these areas and use them to focus the course. Summarization is particularly relevant to use of the world wide web and could potentially be used to provide descriptions of available sources in order to allow users to decide whether information at a site is relevant. We will explore other means for integrating natural language into the world wide web as well. This project will serve as a focus for studying the main techniques of natural language processing, which include syntax, semantics, pragmatics, and statistical techniques.

There will be four main assignments. The first two will introduce you to parsing and generation so that you can make an informed decision about the project that you want to work on. You will use parsing and generation tools to do these assignments. By the third assignment, you will have decided whether you will work on interpretation or generation and for what application. The third assignment will consist of a written part and a component that uses semantic information to arrive at some meaning. The fourth assignment will also consist of a written part and a programming part: the written part will ask you to identify the hard problems in your application and specify techniques that look promising, and the programming part will require implementing some subset of them. You may work with a partner or alone throughout the semester. You will then have time to prepare for the class presentation. The take-home final will ask you to consider theoretical issues that were not included in your programming assignment.

Prerequisites

Required: CS 3139 (Data Structures)

Recommended: one of CS4701 (AI), CS3261 (Models of Computation), CS4115 (Programming Languages and Translators)

Programming Language: Your choice of LISP or PERL

Assignments and Grading

There will be 4 programming assignments, 1 mid-term, a class demo and presentation, and a final paper. The class grade will be determined by the following weights: 20% mid-term, 50% programming assignments (including demo and presentation), 20% final paper, 10% class participation.

Sep 16 Syntax 1 -- parsing with CASS
Sep 28 Syntax 1 due
Sep 28 Syntax 2 -- generation with FUF
Oct 12 Syntax 2 due
Oct 14 Semantics -- interpretation of syntactic output
Nov 4 Semantics due
Nov 4 Pragmatics
Nov 25 Pragmatics due

Homeworks

Readings

Readings will be drawn primarily from Natural Language Understanding by James Allen. Other articles will be available on reserve in the library. Click here to see a list of readings, which will be updated as the class progresses.