Probabilistic Models of Discrete Data

Spring 2016, Columbia University

David M. Blei

Day/Time: Fridays, 12:10PM-2:30PM
Location: Mudd 633

Piazza site

Course Description

We will study probabilistic models of discrete data, focusing on large-scale data sets that are high-dimensional and sparse. Discrete data sets arise in diverse applications of statistical machine learning, such as natural language processing, recommendation systems, computational neuroscience, and statistical genetics. Topics will include embeddings, mixed-membership models (topic models), scalable computation, Bayesian nonparametrics, and model diagnosis. Over the semester, each student will be expected to complete an ambitious project centered on a real-world problem.

Prerequisites. The prerequisite course is Foundations of Graphical Models, and you should be comfortable with its material. Specifically, you should be able to write down a new model in which each complete conditional is in the exponential family, derive and implement an approximate inference algorithm for the model, and understand how to interpret its results. You should also be fluent in the semantics of graphical models. Finally, note that this is a seminar. It is only open to PhD students. Auditors are not permitted.
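As a concrete illustration of the kind of derivation the prerequisite assumes (this sketch is not part of the course materials, and the names are illustrative): in a Gamma-Poisson model with counts x_1, ..., x_n ~ Poisson(lambda) and lambda ~ Gamma(a, b), the complete conditional of lambda is Gamma(a + sum_i x_i, b + n), i.e., it stays in the exponential family. A minimal sketch in Python:

```python
import numpy as np

def gamma_poisson_complete_conditional(x, a=1.0, b=1.0):
    """Complete conditional of a Poisson rate under a Gamma(a, b) prior.

    With counts x_1..x_n ~ Poisson(lam) and lam ~ Gamma(a, b)
    (shape/rate parameterization), the complete conditional is
    Gamma(a + sum(x), b + n) -- the conjugate, exponential-family
    update the prerequisite asks you to be able to derive.
    """
    x = np.asarray(x)
    return a + x.sum(), b + x.size

# Example: compute the conditional's parameters and take a Gibbs-style draw.
rng = np.random.default_rng(0)
x = rng.poisson(3.0, size=100)            # simulated counts
shape, rate = gamma_poisson_complete_conditional(x, a=1.0, b=1.0)
lam_draw = rng.gamma(shape, 1.0 / rate)   # numpy's gamma takes a scale
print(shape, rate, lam_draw)
```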

Reading assignments and notes

  1. Introduction and logistics
  2. Word embeddings I
  3. Word embeddings II
  4. Word embeddings III
  5. Factorization I
  6. Aside: Stochastic optimization and variational inference
  7. Inverse regression for text
  8. Bayesian nonparametrics
  9. More Bayesian nonparametrics
  10. Discrete choice models
  11. Statistical analysis of networks