Statistical NLP for the Web

Fall 2012

Course Information:
Time : Wednesday, 4:10 - 6pm
Location : 627 Mudd

Instructor: Dr. Sameer Maskey
smaskey [at] cs.columbia.edu
Office Hours: 2 - 4pm (or by appointment), Wednesday, 457 CS building

TA: Morgan Ulinski
Office Hours: Tuesday, 2 - 4pm, Speech Lab, CEPSR building (7th floor)
mulinski [at] cs.columbia.edu

Course Description
Are you interested in developing a Sentiment Analysis algorithm that uses Twitter firehose data? Do you want to learn how Hidden Markov Models and Finite State Machines can be used to implement a Spoken Dialog System like Siri? Would you like to understand news clustering algorithms or Maximum Entropy based Question Answering systems? This course will explore topics that juxtapose Statistical/Machine Learning algorithms with real-world NLP/Speech tasks that use large amounts of web data. We will study NLP/Speech topics such as Text Mining, Document Classification, Topic Clustering, Summarization and Dialog Systems. We will explore the Statistical Methods/Machine Learning techniques used to address some of these NLP/Speech problems, such as Linear Classifiers, Clustering techniques, Inference algorithms and Ranking Methods. Students will get hands-on experience implementing some of these techniques efficiently to build an NLP/Speech system that can handle a significant amount of unstructured web data (text, speech and video).

Academic Integrity

Presenting copied work as your own will not be tolerated and will result in an automatic zero. If you believe you need extra time to complete an assignment, please email the TA or the instructor in advance.

Prerequisites

Background knowledge in probability, statistics and linear algebra. Experience in at least one programming language.

Grading

There will be three homework assignments and a Final Project (no Final Exam). Each homework will contain a programming assignment (some may also contain a brief written assignment). Weights: HW1 (15%), HW2 (15%), HW3 (15%), Final Project (55%). You have 3 'no penalty' late days in total that can be used during the semester. Each additional late day (without approval) incurs a 20% penalty.
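As a rough illustration of how the weights and late days combine, here is a minimal Python sketch. The weights and the 3 free late days come from the policy above. Everything else is an assumption made for illustration, not official policy: free late days are consumed in submission order, each penalized day removes 20% of that homework's score (floored at zero), and all scores are on a 0 to 100 scale. The function names are hypothetical.

```python
# Weights (in percent) as stated in the Grading section.
WEIGHT_PERCENT = {"hw1": 15, "hw2": 15, "hw3": 15, "project": 55}
FREE_LATE_DAYS = 3           # 'no penalty' late days for the whole semester
PENALTY_PERCENT_PER_DAY = 20  # penalty for each additional late day


def penalized_days(late_days_per_hw):
    """Return the number of penalized late days for each homework.

    Assumption: the free late days are used up in submission order.
    """
    remaining = FREE_LATE_DAYS
    result = []
    for days in late_days_per_hw:
        free = min(days, remaining)
        remaining -= free
        result.append(days - free)
    return result


def apply_penalty(score, days):
    """Reduce a homework score for the given number of penalized days.

    Assumption: each penalized day removes 20% of the homework's score,
    and the score never drops below zero.
    """
    kept_percent = max(0, 100 - PENALTY_PERCENT_PER_DAY * days)
    return score * kept_percent / 100


def course_score(hw_scores, hw_late_days, project_score):
    """Combine three homework scores and the project score (all 0 to 100)."""
    days = penalized_days(hw_late_days)
    adjusted = [apply_penalty(s, d) for s, d in zip(hw_scores, days)]
    weighted = (
        WEIGHT_PERCENT["hw1"] * adjusted[0]
        + WEIGHT_PERCENT["hw2"] * adjusted[1]
        + WEIGHT_PERCENT["hw3"] * adjusted[2]
        + WEIGHT_PERCENT["project"] * project_score
    )
    return weighted / 100
```

For example, with homework scores of 80, 90 and 100, late days of 2, 3 and 0, and a project score of 90, `course_score([80, 90, 100], [2, 3, 0], 90)` gives 84.6 under these assumptions: the second homework has two penalized days and counts as 54.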

Tentative Class Schedule


Week 1: September 5, 2012
Topics: Introduction, Text Mining and Linear Methods of Regression
Slides: pdflink

Week 2: September 12, 2012
Topics: Text Categorization and Linear Classifiers
Slides: pdflink (updated Dec 4th)
Assignments: Final Project Information pdflink
Readings and Additional Material: 23.1.1, 23.1.2, 23.1.3 J&M Book; 1.1, 3.1, 4.1 Bishop Book; Elkan's intro pdflink

Week 3: September 19, 2012
Topics: Topic/Document Clustering, Unsupervised Learning, K-Means, Expectation Maximization algorithms, Hierarchical Clustering
Slides: pdflink
Assignments: Homework1 Assigned pdflink; Project Proposal Draft Due (11:59pm)
Readings and Additional Material: 9.1 to 9.4 Bishop Book; Document Clustering Overview pdflink; Eisner's excel pdflink

Week 4: September 26, 2012
Topics: Non-Metric Methods, Statistical Parsing, PCFGs, Synchronous PCFGs
Slides: pdflink
Assignments: Project Proposal Due (11:59pm)
Readings and Additional Material: Chapters 12, 13 and 14 J&M Book

Week 5: October 3, 2012
Topics: Information Extraction, Tagging, Stochastic Sequential Models, Hidden Markov Models
Slides: pdflink
Assignments: Homework1 Due (Oct 4th, 11:59pm); Homework2 Assigned pdflink
Readings and Additional Material: 22.2 J&M Book; 13.1 and 13.2 Bishop Book; Rabiner Paper (Section I, II and III) pdflink; Eisner's excel pdflink; F-measure Example excel pdflink

Week 6: October 10, 2012
Topics: Hidden Markov Models II, MapReduce
Slides: pdflink
Readings and Additional Material: J&M Book 6.1 to 6.5

Week 7: October 17, 2012
Topics: MapReduce for Statistical NLP/Machine Learning
Slides: pdflink
Assignments: Project Intermediate Report I Due (October 17, 11:59pm)
Readings and Additional Material: MapReduce paper pdflink; MapReduce ML pdflink; Language Model MapReduce pdflink

Week 8: October 24, 2012
Topics: Neural Networks
Slides: pdflink
Readings and Additional Material: Bishop Book 5.1 to 5.3

Week 9: October 31, 2012
Topics: Deep Belief Networks
Slides: pdflink
Assignments: Homework 2 Due (Oct 30, 11:59pm)
Readings and Additional Material: Hinton's Deep Belief Network Paper pdflink; Colbert's DBN for NLP Tasks pdflink; Semantic hashing pdflink

Week 10: November 7, 2012
Topics: Machine Translation I
Slides: pdflink
Assignments: Homework3 Assigned pdflink
Readings and Additional Material: J&M Book 25.1 to 25.7; Brown Paper pdflink; Kevin Knight's Workbook pdflink

Week 11: November 14, 2012
Topics: Maximum Entropy Models
Slides: pdflink
Readings and Additional Material: J&M 6.6-6.8; MaxEnt for NLP pdflink; Eisner's excel pdflink

Week 12: November 21, 2012
Topics: Machine Translation Decoding
Slides: pdflink
Assignments: Project Intermediate Report Oral (Nov 21, 10:00 - 4:00)
Readings and Additional Material: J&M Book 25.8 to 25.12
Remarks: Invited Guest Lecture: Dr. Ahmad Emami

Week 13: November 28, 2012
Topics: Log Linear Models in general, Conditional Random Fields, Question Answering
Slides: pdflink
Assignments: Homework 3 Due (Dec 5, 11:59pm)
Readings and Additional Material: Charles Elkan's cikm tutorial pdflink

Week 14: December 5, 2012
Topics: Equations to Implementation/Building Scalable Statistical Web NLP Applications

Week 15: December 12, 2012
Topics: Final Project Demo/Presentation Day
Assignments: Final Project Report Due (Dec 12, 11:59); Demo and Presentation (Dec 12, 10:00-2:00), CS Conf room
Remarks: Last week of classes

Week 16: December 19, 2012
Topics: Finals Week (no Finals for this class)

Examples of Previous Student Projects


Section Classification in Clinical Notes using Supervised HMM - Ying
Automatic Summarization of Recipe Reviews - Benjamin
Classifying Kid-submitted Comments using Machine Learning Techniques - Tony
Towards An Effective Feature Selection Framework - Boyi
Using Output Codes as a Boosting Mechanism - Green
Enriching CATiB Treebank with Morphological Features - Sarah
SuperWSD: Supervised Word Sense Disambiguation by Cross-Lingual Lexical Substitution - Wenhan
L1 regularization in log-linear Models - Tony
A System for Routing Papers to Proper Reviewers - Zhihai

Books

We will provide handouts in the class. Besides the handouts we will also use the following books.
For statistical methods/Machine Learning topics we will partly use :
Pattern Recognition and Machine Learning by Christopher M. Bishop (ISBN-13: 9780387310732)
For NLP topics of the course we will partly use the following book :
Speech and Language Processing (2nd Edition) by Daniel Jurafsky and James H. Martin (ISBN-13: 9780131873216)
We may also use one of the online textbooks. We will also have assigned readings from various published papers.

Another good ML book is "The Elements of Statistical Learning: Data Mining, Inference, and Prediction," Second Edition, by Trevor Hastie, Robert Tibshirani and Jerome Friedman.