COMS E6998: Machine Learning for Natural Language Processing (Spring 2012)

Problem sets

Update (21st March): First project deadline: send a 1-page project proposal to the TAs by Friday, March 30th, 5pm.

The choice of project is up to you, but it should be clearly related to the course material.

Example projects:

    Design and implementation of a machine-learning model for some NLP task; the write-up would describe the technical details of the model, as well as experimentation with the model on some dataset.

    Implementation of an approach (or approaches) described in one or more papers in the research literature.

    Purely "theoretical" projects (no experimentation) may also be possible, although these projects will be less common.

Group projects are allowed (up to a maximum of 3 people).

We'll expect a 6-page write-up for 1-person projects, 8 pages for 2-person projects, and 10 pages for 3-person projects.

Datasets. We can obtain the following datasets (and no doubt others; in particular, we can get corpora from the LDC). Please let us know if you'd like to use one of these datasets, or if you have another request or query:

  • Brown Treebank Corpus (part-of-speech tagging and parsed data).
  • Wall Street Journal Treebank Corpus (part-of-speech tagging and parsed data).
  • There are treebanks in a variety of other languages: please ask us if you have a specific request.
  • The CoNLL dependency parsing data (dependency parse training and test data for several different languages).
  • Machine-translation datasets (corpora of translations between two or more languages): for example, the Europarl data.
  • Framenet.
  • PropBank.
  • CCGBank.
  • Named entity data (e.g., the CoNLL datasets).
  • Gigaword (a large amount of plain text, from various newswire sources).

Some example project suggestions:
  • Implement a CRF or perceptron trained model for some task (e.g., part-of-speech tagging, dependency parsing).
  • Implement the EM algorithm for some task (e.g., for parsing via the inside-outside algorithm; for tagging via the forward-backward algorithm; for translation using IBM Model 2; for document clustering using Naive Bayes). Carry out an experimental evaluation, looking at the accuracy of the model on some task; study how the method behaves with respect to the local optimum problem; etc.
  • Implement an inference algorithm based on dual decomposition or Lagrangian relaxation for some task (e.g., parsing, MT alignment). Ask us if you'd like pointers to existing models that can be combined using dual decomposition.
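To give a flavor of the first suggestion, here is a minimal sketch of a perceptron-trained part-of-speech tagger that classifies each word independently using simple local features. The feature templates and toy setup are our own illustrative choices; a serious project would use a structured model (e.g., Viterbi decoding over tag sequences) with richer features and averaged weights.

```python
from collections import defaultdict

def features(words, i):
    """Simple local feature templates for the word at position i
    (illustrative; a real tagger would use much richer templates)."""
    w = words[i]
    return ["word=" + w,
            "suffix3=" + w[-3:],
            "is_title=" + str(w.istitle()),
            "prev=" + (words[i - 1] if i > 0 else "<s>")]

def train_perceptron(tagged_sents, tagset, epochs=5):
    """Multiclass perceptron: weights[tag][feature] accumulates updates."""
    weights = {t: defaultdict(float) for t in tagset}
    for _ in range(epochs):
        for words, tags in tagged_sents:
            for i, gold in enumerate(tags):
                feats = features(words, i)
                # Predict the highest-scoring tag under current weights.
                pred = max(tagset,
                           key=lambda t: sum(weights[t][f] for f in feats))
                if pred != gold:  # mistake-driven update
                    for f in feats:
                        weights[gold][f] += 1.0
                        weights[pred][f] -= 1.0
    return weights

def tag(weights, words):
    """Tag each word independently with the current weights."""
    tagset = list(weights)
    return [max(tagset,
                key=lambda t: sum(weights[t][f] for f in features(words, i)))
            for i in range(len(words))]
```

A project along these lines would train on one of the treebank corpora above and report tagging accuracy on held-out data.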
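For the EM suggestion, a compact sketch of document clustering with a Naive Bayes (mixture-of-multinomials) model is below. The add-one smoothing, random initialization, and iteration budget are illustrative assumptions; running it from several random seeds is one easy way to study the local optimum problem mentioned above.

```python
import math
import random

def em_cluster(docs, K, iterations=20, seed=0):
    """EM for a K-cluster Naive Bayes mixture.
    docs: list of token lists. Returns (priors, word_probs, posteriors)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    # Random soft assignments q(cluster | doc) to break symmetry.
    q = [[rng.random() for _ in range(K)] for _ in docs]
    q = [[p / sum(row) for p in row] for row in q]
    for _ in range(iterations):
        # M-step: re-estimate mixture weights and per-cluster word
        # distributions from soft counts (add-one smoothing, an
        # illustrative choice).
        priors = [sum(q[d][k] for d in range(len(docs))) / len(docs)
                  for k in range(K)]
        word_probs = []
        for k in range(K):
            counts = {w: 1.0 for w in vocab}
            for d, doc in enumerate(docs):
                for w in doc:
                    counts[w] += q[d][k]
            total = sum(counts.values())
            word_probs.append({w: c / total for w, c in counts.items()})
        # E-step: recompute posteriors, in log space for stability.
        for d, doc in enumerate(docs):
            logs = [math.log(priors[k])
                    + sum(math.log(word_probs[k][w]) for w in doc)
                    for k in range(K)]
            m = max(logs)
            exps = [math.exp(l - m) for l in logs]
            s = sum(exps)
            q[d] = [e / s for e in exps]
    return priors, word_probs, q
```

The same E-step/M-step structure carries over to the other EM instances listed (inside-outside, forward-backward, IBM Model 2), with the soft counts computed by the corresponding dynamic program.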
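For the dual decomposition suggestion, the subgradient recipe can be illustrated on a deliberately tiny agreement problem: two independent per-position scorers (standing in for two real models, e.g. a tagger and a parser) are forced to agree through dual variables. The function name, step-size schedule, and scores below are our own illustrative assumptions, not a reference implementation.

```python
def agree_decode(scores1, scores2, labels, iterations=100):
    """Dual decomposition: make two per-position scorers agree.
    scores1, scores2: lists (one per position) of {label: score} dicts.
    Returns one label per position."""
    n = len(scores1)
    u = [{y: 0.0 for y in labels} for _ in range(n)]  # dual variables
    z1 = [labels[0]] * n
    for t in range(1, iterations + 1):
        # Each subproblem decodes independently, given the duals.
        z1 = [max(labels, key=lambda y: scores1[i][y] + u[i][y])
              for i in range(n)]
        z2 = [max(labels, key=lambda y: scores2[i][y] - u[i][y])
              for i in range(n)]
        if z1 == z2:       # agreement: a certificate of optimality
            return z1
        step = 1.0 / t     # standard decreasing step size
        for i in range(n):
            if z1[i] != z2[i]:  # subgradient update on the duals
                u[i][z1[i]] -= step
                u[i][z2[i]] += step
    return z1  # no agreement within the budget; return one solution
```

In a real project the two subproblems would be nontrivial decoders (e.g., Viterbi for a tagger and CKY for a parser) rather than independent per-position maximizations; ask us for pointers to model pairs that combine well.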