Statistical Methods for Natural Language Processing (NLP) 
Spring 2010 
Course Information:
Time: Tuesday, 4:10pm to 6:00pm
Location: 606 Lewisohn
Office Hours: Tuesday, 2pm to 4pm (or by appointment), Speech Lab, CEPSR Building (7th floor)
Instructor:
Dr. Sameer Maskey
smaskey @ cs.columbia.edu
914 945 1573
Teaching Assistant:
Kapil Thadani, kapil at cs. domain .edu (domain = columbia)
Office Hours: Thursday, 3pm to 5pm (or by appointment), Office 724, CEPSR Building
Guest lectures by:
Dr. Salim Roukos
Dr. Bowen Zhou

Course Description
This course will explore topics in statistical methods and machine learning for real-world Natural Language Processing (NLP) problems. We will study ML techniques commonly used in NLP, such as Maximum Entropy Models, Hidden Markov Models, clustering techniques, Conditional Random Fields, the Expectation-Maximization algorithm, Active Learning, and Support Vector Machines. We will examine how these methods are applied to real-world NLP problems such as information extraction, stochastic parsing, text segmentation and classification, topic/document clustering, and word sense disambiguation. We will also study the details of inference algorithms such as Viterbi, synchronous chart parsing, and beam search. Students will get hands-on experience by implementing some of these ML techniques for classification, clustering, and a complex NLP task: machine translation.

Academic Integrity
Presenting copied work as your own will not be tolerated and will result in an automatic zero. If you believe you need extra time to complete an assignment, please email the TA or the instructor in advance.

Prerequisites
Background knowledge in probability, statistics, and linear algebra, and some experience in at least one programming language.

Grading
There will be three homework assignments, a final exam, and a final project; there will be no midterm exam. Each homework will contain a programming assignment (some homeworks may also include a brief written assignment). The grade breakdown is: HW1 (15%), HW2 (15%), HW3 (15%), Final Project (40%), Final Exam (15%). You have three 'no penalty' late days in total to use during the semester. Each additional late day (without approval) will be penalized 20% per day.

Tentative Class Schedule

Week 1 (19 Jan): Introduction, Text Mining, Linear Models of Regression

Week 2 (26 Jan): Text Categorization, Linear Methods of Classification
Readings: 23.1.1, 23.1.2, 23.1.3 J&M Book; 1.1, 3.1, 4.1 Bishop Book

Week 3 (2 Feb): Text Categorization, Support Vector Machines
Assignments: HW1 assigned (HW1 solutions)
Readings: 6.1, 6.2, 7.1 (up to 7.1.1 only) Bishop Book; 3.1, 4.5.1 J&M Book; Sebastiani, F., "Machine Learning in Automated Text Categorization," ACM Computing Surveys, 2002
Optional reading: Christopher Burges's SVM tutorial

Week 4 (9 Feb): Information Extraction, Sequential Stochastic Models, HMMs
Readings: 22.1, 6.1–6.5 J&M Book

Week 5 (16 Feb): Hidden Markov Models II
Assignments: HW1 due; HW2 assigned (HW2 solutions)
Readings: 22.2 J&M Book; 13.1, 13.2 Bishop Book

Week 6 (23 Feb): Maximum Entropy Models
Assignments: Project proposal due (11:59pm)
Readings: 6.6–6.8 J&M Book

Week 7 (2 Mar): Semantics, Brief Introduction to Graphical Models
Readings (please come prepared with at least one question for each paper):
~ Liang P., Jordan M., Klein D., "Learning Semantic Correspondences with Less Supervision," ACL 2009
~ Shen D. and Lapata M., "Using Semantic Roles to Improve Question Answering," EMNLP 2007
~ Carlson A. et al., "Coupling Semi-Supervised Learning of Categories and Relations," HLT 2009 Workshop

Week 8 (9 Mar): Topic/Document Clustering, K-means, Mixture Models, Expectation Maximization
Assignments: HW2 due (March 14)
Readings: 9.1–9.4 Bishop Book

Week 9 (16 Mar): No class, Spring Break
Assignments: Project intermediate results due (March 25); see Project Information

Week 10 (23 Mar): Conditional Random Fields
Assignments: HW3 assigned (HW3 solutions)
Readings: 8.3 Bishop Book; Sutton, C. and McCallum, A., "An Introduction to Conditional Random Fields for Relational Learning," 2006

Week 11 (30 Mar): Machine Translation I
Readings: 25.1–25.13 J&M Book
Remarks: Invited lecture by Dr. Salim Roukos

Week 12 (6 Apr): Machine Translation II
Assignments: HW3 due (April 9)
Remarks: Invited lecture by Dr. Bowen Zhou

Week 13 (13 Apr): Language Models, Graphical Models
Readings: 4.2–4.7 J&M Book; 8.1–8.3 Bishop Book

Week 14 (20 Apr): Part I: Markov Random Fields; Part II: Equations to Implementation

Week 15 (27 Apr): Project Presentations
Assignments: Final projects due (April 25, 11:59pm)


Books
For the NLP topics of the course we will use the following book:
Speech and Language Processing (2nd Edition) by Daniel Jurafsky and James H. Martin (ISBN-13: 978-0131873216)
For statistical methods/machine learning topics we will partly use:
Pattern Recognition and Machine Learning by Christopher M. Bishop (ISBN-13: 978-0387310732)
We may also use one of the online textbooks, and we will have assigned readings from various published papers.
Another good ML book is "The Elements of Statistical Learning: Data Mining, Inference, and Prediction," Second Edition, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
