Statistical Methods for Natural Language Processing (NLP)

Spring 2010

Course Information:
Time : Tuesday, 4:10 - 6pm
Location : 606 Lewisohn
Office Hours: Tuesday, 2 - 4pm (or by appointment), Speech Lab, CEPSR building (7th floor)

Instructor: Dr. Sameer Maskey
smaskey @
914 945 1573

Teaching Assistant: Kapil Thadani, kapil at cs. domain .edu (domain = columbia)
Office Hours: 3 - 5pm (or by appointment), Thursday, Office 724, CEPSR building

Guest lectures by :
Dr. Salim Roukos
Dr. Bowen Zhou

Course Description
This course will explore topics in Statistical Methods/Machine Learning for real-world Natural Language Processing (NLP) problems. We will study ML topics that are commonly used in NLP, such as Maximum Entropy Models, Hidden Markov Models, clustering techniques, Conditional Random Fields, the Expectation-Maximization algorithm, Active Learning and Support Vector Machines. We will see how these methods are applied to real-world NLP problems such as information extraction, stochastic parsing, text segmentation and classification, topic/document clustering and word sense disambiguation. We will also study the details of inference algorithms such as Viterbi, Synchronous Chart Parsing and Beam Search. Students will get hands-on experience by implementing some of these ML techniques for classification, clustering and a more complex NLP task, machine translation.

Academic Integrity

Presenting copied work as your own will not be tolerated and will result in an automatic zero. If you believe you need extra time to complete an assignment, please email the TA or the instructor in advance.


Prerequisites
Background knowledge in probability, statistics and linear algebra. Some experience in at least one programming language.


Grading
There will be three homework assignments, a final exam and a final project. There will be no midterm exam. Each homework will contain a programming assignment (some may also contain a brief written assignment). The grade breakdown is: HW1 (15%), HW2 (15%), HW3 (15%), Final Project (40%), Final Exam (15%). You have 3 'no penalty' late days in total to use during the semester. Each additional late day (without approval) will incur a 20% penalty.
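
As an informal illustration of the grading arithmetic, here is a minimal Python sketch. The weights and the late-day numbers come from the policy above. The function names, the order in which free late days are used up, and the reading of the 20% penalty as a fraction of that homework's score are assumptions made for this example, not official course policy.

```python
# Weights from the grade breakdown (they sum to 100%).
WEIGHTS = {
    "HW1": 0.15,
    "HW2": 0.15,
    "HW3": 0.15,
    "Final Project": 0.40,
    "Final Exam": 0.15,
}

FREE_LATE_DAYS = 3      # 'no penalty' late days for the whole semester
PENALTY_PER_DAY = 0.20  # penalty per additional unapproved late day


def late_penalties(days_late_per_hw):
    """Return the penalty fraction applied to each homework.

    Assumption (not stated in the syllabus): free late days are used up in
    submission order, and the penalty is a fraction of that homework's
    score, capped at 100%.
    """
    remaining = FREE_LATE_DAYS
    penalties = []
    for days in days_late_per_hw:
        free = min(days, remaining)
        remaining -= free
        penalties.append(min(1.0, (days - free) * PENALTY_PER_DAY))
    return penalties


def course_grade(scores):
    """Weighted course grade from component scores on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # HW1 and HW2 each two days late, HW3 on time.
    print(late_penalties([2, 2, 0]))
    print(course_grade({"HW1": 90, "HW2": 80, "HW3": 85,
                        "Final Project": 88, "Final Exam": 75}))
```

Under these assumptions, two days late on HW1 and two on HW2 uses up all three free days, and one day of HW2 is penalized.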

Tentative Class Schedule

Week | Date | Topics | Slides | Assignments | Readings and Remarks
Week 1 | 19 Jan | Introduction, Text Mining, Linear Models of Regression | pdflink | |
Week 2 | 26 Jan | Text Categorization, Linear Methods of Classification | pdflink | | 23.1.1, 23.1.2, 23.1.3 - J&M Book; 1.1, 3.1, 4.1 - Bishop Book
Week 3 | 2 Feb | Text Categorization, Support Vector Machines | pdflink | HW1 Assigned pdflink; HW1 Solutions pdflink | 6.1, 6.2, 7.1 (up to 7.1.1 only) - Bishop Book; 3.1, 4.5.1 - J&M Book; Sebastiani, F., Machine Learning in Automated Text Categorization, ACM Computing Surveys, 2002; Optional Reading: Christopher Burges's SVM tutorial
Week 4 | 9 Feb | Information Extraction, Sequential Stochastic Models, HMM | pdflink | | 22.1, 6.1 to 6.5 - J&M Book
Week 5 | 16 Feb | Hidden Markov Models II | pdflink | HW1 Due, HW2 Assigned pdflink; HW2 Solutions pdflink | 22.2 - J&M Book; 13.1, 13.2 - Bishop Book
Week 6 | 23 Feb | Maximum Entropy Models | pdflink | Project Proposal Due (11:59pm) | 6.6-6.8 - J&M Book
Week 7 | 2 Mar | Semantics, Brief Introduction to Graphical Models | pdflink | | Please come prepared with at least 1 question for each paper: Liang P., Jordan M., Klein D., Learning Semantic Correspondences with Less Supervision, ACL 2009; Shen D. and Lapata M., Using Semantic Roles to Improve Question Answering, EMNLP 2007; Carlson A. et al., Coupling Semi-Supervised Learning of Categories and Relations, HLT 2009 Workshop
Week 8 | 9 Mar | Topic/Document Clustering, K-means, Mixture Models, Expectation Maximization | pdflink | HW2 Due (March 14) | 9.1-9.4 - Bishop Book
Week 9 | 16 Mar | No Class, Spring Break | | Project Information pdflink | Project Intermediate Results Due (March 25)
Week 10 | 23 Mar | Conditional Random Fields | pdflink | HW3 Assigned pdflink; HW3 Solutions pdflink | 8.3 - Bishop Book; Sutton, C. and McCallum, A., "An Introduction to Conditional Random Fields for Relational Learning", 2006
Week 11 | 30 Mar | Machine Translation I | pdflink | | 25.1-25.13 - J&M Book; Invited Lecture: Dr. Salim Roukos
Week 12 | 6 Apr | Machine Translation II | pdflink | HW3 Due (April 9) | Invited Lecture: Dr. Bowen Zhou
Week 13 | 13 Apr | Language Models, Graphical Models | pdflink | | 4.2-4.7 - J&M Book; 8.1-8.3 - Bishop Book
Week 14 | 20 Apr | Part I: Markov Random Fields; Part II: Equations to Implementation | | |
Week 15 | 27 Apr | Project Presentations | | Final Projects Due (April 25, 11:59pm) |


For NLP topics of the course we will use the following book :
Speech and Language Processing (2nd Edition) by Daniel Jurafsky and James H. Martin (ISBN-13: 9780131873216)

For statistical methods/Machine Learning topics we will partly use :
Pattern Recognition and Machine Learning by Christopher M. Bishop (ISBN-13: 9780387310732)
We may also use one of the online textbooks. We will also have assigned readings from various published papers.

Another good ML book is "The Elements of Statistical Learning: Data Mining, Inference, and Prediction," Second Edition, by Trevor Hastie, Robert Tibshirani and Jerome Friedman.