6998 Section 1, NLP for the Web

Spring 2010



There will be three types of assignments in this class: presentation of readings, discussant, and semester project. Students are also expected to read all assigned papers and participate in the in-class discussion of the papers. The weight of these assignments towards the final grade is as follows:

Class Presentation

You will be assigned one or more of the papers for one class. Your job is to prepare a 10-minute overview of the paper. You should provide a framework for the approach taken in the paper, highlighting the main points and contributions. You should point out any claims or results that you find controversial. If there is a technical point in the paper that you think will be difficult to understand, you might make it the main part of your presentation and spend time explaining how you think it works.

Since the class is discussion oriented, everyone will give their presentation seated as part of the circle. I would prefer that people not use PowerPoint. If you feel you need something written to help you make your points or to explain a technical point, you may bring a one-page handout for the class. However, if you feel strongly that you will do much better using PowerPoint, let me know and that will be an option. We will evaluate as a class whether the style of presentation is working. Your presentation will be strictly timed in order to allow enough time for discussion. Expect to be cut off if you go over time.


Discussant

As discussant, your job is to raise questions about the papers for discussion. To make the discussion interesting, think about higher-level issues that the class papers raise rather than detail-oriented questions. For example, you might consider controversial claims, pros and cons of the approach, points of agreement or overlap between the different papers, directions for future research, and implications for the field and whether it is on the right track.

Questions must be prepared and distributed to the class by email 24 hours ahead of time (i.e., by 5:00pm on Wednesday, the day before). The two discussants should agree between themselves on how they want to start off the discussion. Leading the discussion should be shared in an integrated way; that is, one discussant should not take a full turn first and then hand over to the other. You will get plenty of help from me in directing the discussion.


Project Proposal

Write a 1-2 page proposal describing what you want to do for the project. Give as much detail as you can about how you plan to do the project. Will you use any existing tools and, if so, which ones? Will you use any corpora? Do you need to collect a corpus? What kinds of annotations do you need? What components will your system have? In case something goes wrong with your plan, what would your backup plan be?

Interim Results

In your proposal you suggested the modules that you will finish by March 11th. In addition, you have received feedback on whether your suggestions were reasonable and/or whether you should turn in additional results. Now, create a web page for your contributions to this course. Submit the URL of your web page in CourseWorks (in a text file). The web page does not have to be anything fancy; it should contain, for example:

Your code (for the modules that you have completed) as a zip file (it should compile), plus a readme describing how to run the modules, with examples of input and output.

Your primary results and a short write-up of what you have done so far.

A link to the corpora you have downloaded/preprocessed (if any).

Further clarification on your remaining plans.

Anything else you think will be helpful (figures, charts, ...).


Final Project

For the final submission you will need to do the following:

· Class presentation: This should be a 10-minute presentation. It will be strictly timed, as conference presentations are. You will be given a 5-minute warning, a 2-minute warning, and a 1-minute warning. Going over time will cause you to lose points. The presentation should provide an overview of your goals, your results (which might be charts showing accuracy or examples of output), and a demo.

· Face-to-face grading session, where we will focus more on the implementation. You should be prepared to:
    · Run the project on 5-10 examples
        · At least 3 new ones
    · Review the components you implemented
        · Including showing the code
    · Describe the components that you used from elsewhere and why you chose them
    · Explain the data you used and why

· 5-page write-up that includes:
    · Overview of project (what was it in the end?)
    · Components (make this the second section of the paper and format it as an itemized list)
    · Results
    · Conclusions
        · Summary of feasibility
        · Future directions
    · Anything else that you think is important


· Presentations: April 29th and May 6th
· Face-to-face sessions: dates and times TBD
· Write-up: due by May 3rd



You may design your own project or you may choose one of the suggested projects below. In either case, you should discuss your project with both Prof. McKeown and the TA, Yves Petinot, before submitting your proposals. Some possible projects include:

Question Answering system based on Web Data:

Search engines such as Google, Yahoo or Ask typically rely on high-quality sources to handle question-like queries and often use custom algorithms to extract information from these sources. In this project we propose to build a question answering system based on Wikipedia. Given a factoid question (What, Who, Where, When), the system is expected to retrieve the most relevant content from Wikipedia that answers the question (assuming, of course, that an answer is available in Wikipedia).

For instance, for the question "When was Barack Obama born?" the output of your system could be: (en.wikipedia.org/wiki/Barack_Obama, "Barack Hussein Obama II (born August 4, 1961) is the 44th and current President of the United States.")

You will have complete freedom in the actual interface to your system, although for any given query, the output should probably consist of a list of relevant Wikipedia URLs as well as, for each URL, a span of text containing part (or all) of the answer. In its simplest form your system will comprise an information retrieval component coupled with a snippet generator (query-focused extractive summarizer). Heuristics can also be developed to take advantage of the structural (e.g., entity hierarchy, links, etc.) and textual specificity of Wikipedia content.

Keywords: Wikipedia, IR, Stanford POS tagger, Lucene
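
As a rough illustration of the simplest form described above (a retrieval component coupled with a query-focused extractive snippet generator), the sketch below ranks a toy in-memory corpus with a basic TF-IDF score and then picks the best-matching sentence from the top page. It is only a sketch under stated assumptions: the `PAGES` dictionary, the stopword list, and all function names are hypothetical, and a real system would likely index Wikipedia with a tool such as Lucene rather than scan strings in memory.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus standing in for indexed Wikipedia pages.
PAGES = {
    "en.wikipedia.org/wiki/Barack_Obama": (
        "Barack Hussein Obama II (born August 4, 1961) is the 44th and current "
        "President of the United States. "
        "He was previously a United States Senator from Illinois."
    ),
    "en.wikipedia.org/wiki/Illinois": (
        "Illinois is a state in the Midwestern United States. "
        "Its capital is Springfield."
    ),
}

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "is", "was", "of", "in",
             "when", "who", "what", "where"}


def tokenize(text):
    """Lowercase, keep alphanumeric tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]


def idf_table(tokenized_docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs.values():
        df.update(set(tokens))
    return {term: math.log((n + 1) / (count + 1)) + 1.0
            for term, count in df.items()}


def score(query_tokens, text_tokens, idf):
    """Sum of term frequency times IDF over the distinct query terms."""
    tf = Counter(text_tokens)
    return sum(tf[t] * idf.get(t, 0.0) for t in set(query_tokens))


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def answer(question, pages, top_k=1):
    """Return (url, snippet) pairs for the top_k best-scoring pages."""
    tokenized = {url: tokenize(text) for url, text in pages.items()}
    idf = idf_table(tokenized)
    q = tokenize(question)
    ranked = sorted(pages, key=lambda url: score(q, tokenized[url], idf),
                    reverse=True)
    results = []
    for url in ranked[:top_k]:
        # Snippet generation: choose the sentence that best matches the query.
        best = max(split_sentences(pages[url]),
                   key=lambda s: score(q, tokenize(s), idf))
        results.append((url, best))
    return results


results = answer("When was Barack Obama born?", PAGES)
for url, snippet in results:
    print(url, "|", snippet)
```

The sentence-level scoring reuses the document-level IDF table, which keeps the sketch short; a fuller system might weight answer-type cues (for example, dates for "When" questions) when choosing the snippet.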

Web page classification:

In this project we propose to construct a Web page classifier that could be used, for instance, to collect content for a vertical search engine. Starting with the DMOZ corpus, you will be expected to train and evaluate an N-way classifier able to classify an arbitrary Web page into one of the DMOZ categories. Given a previously unseen Web page and a subset of DMOZ categories (or all of them, depending on the approach chosen), your classifier should identify the most likely category for this page. An alternative take on this project could be the construction of a spam site/page classifier.

Keywords: DMOZ, Machine Learning
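
One possible baseline for the N-way classifier described above is multinomial Naive Bayes over page text. The sketch below is a minimal, pure-Python version under stated assumptions: the training examples and category labels are toy stand-ins rather than real DMOZ data, the class and method names are hypothetical, and the project does not prescribe this particular learning method. The optional `candidates` argument mirrors the idea of restricting prediction to a subset of categories.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()               # documents seen per category
        self.term_counts = defaultdict(Counter)   # term counts per category
        self.vocab = set()

    def fit(self, examples):
        """examples: iterable of (page_text, category) pairs."""
        for text, label in examples:
            tokens = tokenize(text)
            self.class_docs[label] += 1
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text, candidates=None):
        """Most likely category, optionally restricted to trained candidates."""
        labels = [c for c in (candidates or self.class_docs)
                  if c in self.class_docs]
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        # Terms never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]

        def log_posterior(label):
            counts = self.term_counts[label]
            total_terms = sum(counts.values())
            lp = math.log(self.class_docs[label] / total_docs)
            for t in tokens:
                lp += math.log((counts[t] + 1) / (total_terms + vocab_size))
            return lp

        return max(labels, key=log_posterior)


# Toy training data (hypothetical; real pages would come from the DMOZ corpus).
TRAIN = [
    ("football scores league match goal team", "Sports"),
    ("tennis tournament player match win", "Sports"),
    ("python programming software compiler code", "Computers"),
    ("linux kernel software code release", "Computers"),
    ("nutrition diet exercise doctor medicine", "Health"),
    ("hospital doctor patient medicine treatment", "Health"),
]

model = NaiveBayes().fit(TRAIN)
print(model.predict("The team scored a late goal in the match"))
print(model.predict("New compiler release for the kernel",
                    candidates=["Computers", "Health"]))
```

For evaluation, held-out pages from each category would be needed; the toy data here is far too small to say anything about accuracy.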

Related query generator:

One of the most valuable features of recent search engines is the ability, given a user query, to suggest new queries that may also be of interest to the user. In this project you will build a system that is able to make such query recommendations. The system will take a user query (or potentially a session of user queries) as input and return a list of related/suggested queries. One way of identifying related queries could be to use the Wikipedia corpus (be creative!), or potentially a publicly available search engine query log, such as the AOL query log. If possible, you should also run a human evaluation of your system against the recommendations of one of the major commercial search engines for a small set of user queries.

Keywords: AOL query log, etc.
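
One simple way to get related queries out of a query log is to count which queries co-occur within the same user session. The sketch below illustrates that idea only: the `SESSIONS` data is invented for the example, no particular log file format is assumed, and the function names are hypothetical. Splitting a real log into sessions (for example, by user and time gap) would be a separate preprocessing step.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical sessions: each inner list holds the queries from one session.
SESSIONS = [
    ["cheap flights", "flight deals", "hotel booking"],
    ["cheap flights", "flight deals"],
    ["cheap flights", "car rental"],
    ["python tutorial", "python list comprehension"],
]


def normalize(query):
    """Trim whitespace and lowercase so trivial variants are merged."""
    return query.strip().lower()


def build_cooccurrence(sessions):
    """Map each query to a Counter of queries seen in the same session."""
    related = defaultdict(Counter)
    for session in sessions:
        unique = sorted({normalize(q) for q in session})
        # Count each ordered pair once per session.
        for a, b in permutations(unique, 2):
            related[a][b] += 1
    return related


def suggest(query, related, k=3):
    """Top-k co-occurring queries; ties are broken alphabetically."""
    counts = related.get(normalize(query), Counter())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [q for q, _ in ranked[:k]]


related = build_cooccurrence(SESSIONS)
print(suggest("Cheap Flights", related))
print(suggest("unknown query", related))
```

Raw co-occurrence counts favor very frequent queries, so a fuller system might normalize them (for example, with pointwise mutual information) before ranking.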

Sentiment Analysis:

Keywords: Machine Learning

Phrase Polarity Detection:

Keywords: Machine Learning