Truth In Data

Spring 2015, Columbia University

David M. Blei

Day/Time: Mondays, 10:10AM-12:00PM
Location: SSW 1025

Course Description

This is a seminar course about how to develop probabilistic models to analyze complex data sets. We will explore two aspects of probabilistic modeling: model checking and causality.

In the (prerequisite) course Foundations of Graphical Models, students learned how to develop probabilistic models tailored to a problem at hand, and how to derive efficient algorithms for computing with those models. Here we ask: what next?

The first subject is model checking. In model checking we measure the fidelity between our model and our data, seeking to understand in which ways the model works and in which ways it fails. This guides the iterative nature of practical model building, where we repeatedly build and refine a probabilistic model to solve a problem using data; checking our model is crucial to this process. The circle of ideas we will study relates to goodness-of-fit tests, Bayesian-frequentist compromises, cross-validation, exploratory data analysis, and even the philosophy of science.
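As a concrete taste of one model-checking idea we will read about, here is a minimal sketch of a posterior predictive check. The model choices (a Gaussian likelihood with known variance and a conjugate prior on its mean) and the test statistic are illustrative assumptions, not material from the course readings.

```python
# Sketch of a posterior predictive check: fit a simple model, draw
# replicated data sets from the posterior predictive distribution,
# and compare a test statistic on the replications to its observed
# value. All modeling choices here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# "Observed" data (in practice, the data set under study).
y = rng.normal(loc=2.0, scale=1.0, size=100)

# Posterior for the mean under a N(0, 10^2) prior, known variance 1.
prior_var, lik_var = 100.0, 1.0
post_var = 1.0 / (1.0 / prior_var + len(y) / lik_var)
post_mean = post_var * (y.sum() / lik_var)

# Replicate data from the posterior predictive; record the statistic.
T_obs = y.max()
T_rep = []
for _ in range(1000):
    mu = rng.normal(post_mean, np.sqrt(post_var))
    y_rep = rng.normal(mu, np.sqrt(lik_var), size=len(y))
    T_rep.append(y_rep.max())

# The posterior predictive p-value: values near 0 or 1 flag misfit.
ppp = np.mean(np.array(T_rep) >= T_obs)
print(f"posterior predictive p-value for max(y): {ppp:.2f}")
```

Swapping in a different test statistic (a quantile, a measure of skew) probes a different aspect of model fit, which is part of what makes these checks a flexible diagnostic.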

The second subject is causality. Many of the questions that we would like to answer about a data set are ultimately causal questions, that is, questions about the effect of one variable on another, or counterfactual questions of "what if."

Causal inference, especially from observational data, is both an important activity and a controversial one. We will study various perspectives on causality, including the potential outcome framework, graphical models for causality, and others. We will also study the statistical and algorithmic problems that arise from causal inference goals. We will try to connect modern probabilistic modeling to causal inference, including the ideas we study about building and checking models.
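To illustrate why causal inference from observational data is delicate, here is a toy simulation in the spirit of the potential outcome framework mentioned above. The data-generating process is an invented example: a confounder affects both treatment and outcome, so the naive difference in means is biased, while adjusting for the confounder recovers the true effect.

```python
# Toy potential-outcomes simulation (invented example, not from the
# course): a binary confounder z drives both treatment assignment and
# the outcome, so the naive contrast is biased; stratifying on z
# (the backdoor adjustment) recovers the true effect of 1.0.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

z = rng.binomial(1, 0.5, size=n)          # confounder
p_treat = np.where(z == 1, 0.8, 0.2)      # z influences treatment
a = rng.binomial(1, p_treat)              # treatment assignment
y0 = 2.0 * z + rng.normal(size=n)         # potential outcome, untreated
y1 = y0 + 1.0                             # true treatment effect is 1.0
y = np.where(a == 1, y1, y0)              # observed outcome

# Naive contrast: biased, because treated units tend to have z = 1.
naive = y[a == 1].mean() - y[a == 0].mean()

# Adjust for z: average the within-stratum contrasts, weighted by P(z).
adjusted = sum(
    (y[(a == 1) & (z == v)].mean() - y[(a == 0) & (z == v)].mean())
    * np.mean(z == v)
    for v in (0, 1)
)
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")
```

Here the simulation lets us check the estimators against the known truth; with real observational data we never observe both potential outcomes, which is exactly what makes the assumptions behind such adjustments worth careful study.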

Prerequisite Requirements

This is a small seminar course. It is open only to PhD students. Auditors are not permitted.

The prerequisite course is Foundations of Graphical Models and you should be comfortable with its material. Specifically, you should be able to write down a new model where each complete conditional is in the exponential family, derive and implement a scalable approximate inference algorithm for the model, and understand how to interpret its results. You should also be fluent in the semantics of graphical models, especially d-separation.
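To make the d-separation prerequisite concrete, here is a small sketch of a d-separation checker. It uses the standard reduction: restrict to the ancestral graph of the query variables, moralize, delete the conditioning set, and test connectivity. The example DAG at the end is an illustration, not part of the course material.

```python
# A compact d-separation checker via the ancestral-moral-graph
# reduction. DAGs are dicts mapping each node to its list of parents.
from collections import deque

def ancestors(dag, nodes):
    """All ancestors of `nodes` (inclusive) in a {child: parents} DAG."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in dag.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(dag, x, y, given):
    """True iff x and y are d-separated by the set `given` in the DAG."""
    keep = ancestors(dag, {x, y} | set(given))
    # Build the moralized, undirected graph over the ancestral set.
    adj = {v: set() for v in keep}
    for child in keep:
        parents = [p for p in dag.get(child, ()) if p in keep]
        for p in parents:                  # keep child-parent edges
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(parents):    # "marry" co-parents
            for q in parents[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    # Delete the conditioning set, then test reachability from x to y.
    blocked = set(given)
    frontier, seen = deque([x]), {x}
    while frontier:
        v = frontier.popleft()
        if v == y:
            return False
        for w in adj[v] - blocked:
            if w not in seen:
                seen.add(w)
                frontier.append(w)
    return True

# Collider example: smoke -> cancer <- gene. The causes are marginally
# independent but become dependent once we condition on the collider.
dag = {"cancer": ["smoke", "gene"], "smoke": [], "gene": []}
print(d_separated(dag, "smoke", "gene", []))          # True
print(d_separated(dag, "smoke", "gene", ["cancer"]))  # False
```

The collider example is the classic case where conditioning opens a path rather than blocking one, which is why fluency with d-separation matters before studying graphical approaches to causality.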

If you are interested in the seminar but have not taken Foundations, I suggest you take it in Fall 2015 and then take the seminar offered in Spring 2016.

This seminar will require dedication beyond a typical PhD-level seminar. I recommend it only to students whose research is centered around building and fitting models to answer questions about data. See below for a full description of the workload.

Format and Workload

The seminar involves weekly readings and a project.

Many of these ideas are at the cutting edge of applied probabilistic modeling. As a group, we will try to understand the methods, compare them to each other, and consider research opportunities to improve them.

In parallel with readings and discussion, each student will explore a data set with probability models. We will share our progress with the rest of the class through modern electronic research "notebooks" for reporting on data analysis (e.g., IPython or knitr) in a shared software repository.

My hope is that this practice will have several benefits.

At the end of the semester, each student will also write a report that summarizes their work. The final grade is based on consistent progress throughout the semester and on the final report.


Our readings will give historical perspective on the subjects and a snapshot of the state of the art. The syllabus will evolve throughout the semester. Readings may include the following.

Model specification

Model checking and criticism