COMS E6998: Machine Learning for Natural Language Processing
Update (21st March): The first deadline for the project: you should
send a 1-page project proposal to the TAs by Friday, March 30th, 5pm.
The choice of project is up to you, but it should be clearly
related to the course material.
- Design and implementation of a machine-learning model for
some NLP task; the write-up would describe the technical
details of the model, as well as experimentation with the
model on some dataset.
- Implementation of an approach (or approaches) described in
one or more papers in the research literature.
Purely "theoretical" projects (no experimentation) may also be
possible, although these projects will be less common.
Group projects are allowed (up to a maximum of 3 people).
We expect a 6-page write-up for 1-person projects, 8 pages for
2-person projects, and 10 pages for 3-person projects.
Datasets. We can obtain the following datasets (and no doubt
others; in particular, we can get corpora from the
LDC). Please let us know
if you'd like to use one of these datasets, or have another request or
- Brown Treebank Corpus (part-of-speech tagging and parsed data).
- Wall Street Journal Treebank Corpus (part-of-speech tagging and parsed data).
- There are treebanks in a variety of other languages: please ask
us if you have a specific request.
- The CONLL dependency parsing data (dependency parse training and
test data for several different languages).
- Machine-translation datasets (corpora of translations between
two or more languages): for example, the Europarl data.
- Named entity data (e.g., the CONLL datasets).
- Gigaword (a large amount of plain text, from various newswire sources).
Some example projects. Here are some possible project suggestions:
- Implement a CRF or perceptron trained model for some task (e.g.,
part-of-speech tagging, dependency parsing).
- Implement the EM algorithm for some task (e.g., for parsing via
the inside-outside algorithm; for tagging via the forward-backward algorithm;
for translation using IBM Model 2; for document clustering using Naive
Bayes). Carry out an experimental evaluation, looking at the accuracy of
the model on some task; study how the method behaves with respect to the
local optima problem; etc.
- Implement an inference algorithm based on dual decomposition or
Lagrangian relaxation for some task (e.g., parsing, MT alignment). Ask
us if you'd like pointers to existing models that can be combined using