Truth In Data

Spring 2015, Columbia University

David M. Blei

Day/Time: Mondays, 10:10AM-12:00PM
Location: SSW 1025

Course Description

This is a seminar course about how to develop probabilistic models to analyze complex data sets. We will explore two aspects of probabilistic modeling: model checking and causality.

In the (prerequisite) course Foundations of Graphical Models, students learned how to develop probabilistic models tailored to a problem at hand, and how to derive efficient algorithms for computing with those models. Here we ask: what next?

The first subject is model checking. In model checking we want to measure the fidelity between our model and our data, and to understand in which ways the model works and in which ways it fails. This guides the iterative nature of practical model building, where we repeatedly build and refine a probabilistic model to solve a problem using data. Checking our model is crucial to this process. The circle of ideas we will study relates to goodness-of-fit tests, Bayesian-Frequentist compromises, cross validation, exploratory data analysis, and even the philosophy of science.
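As one concrete illustration of model checking, the sketch below computes a posterior predictive p-value. It is a minimal example under stated assumptions, not a method prescribed by the course: the Poisson model, the Gamma(1, 1) prior, the sample variance as the discrepancy, the simulated data, and the function name posterior_predictive_pvalue are all choices made for illustration.

```python
import numpy as np


def posterior_predictive_pvalue(y, discrepancy, n_rep=2000, a=1.0, b=1.0, seed=0):
    """Posterior predictive p-value for a Poisson model with a Gamma(a, b) prior.

    `discrepancy` must accept an array and reduce over its last axis.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = y.size
    # Conjugate update: rate | y ~ Gamma(shape = a + sum(y), rate = b + n).
    rates = rng.gamma(shape=a + y.sum(), scale=1.0 / (b + n), size=n_rep)
    # Draw one replicated data set of size n for each posterior draw of the rate.
    y_rep = rng.poisson(lam=rates[:, None], size=(n_rep, n))
    # Fraction of replicated data sets at least as extreme as the observed data.
    return float(np.mean(discrepancy(y_rep) >= discrepancy(y)))


def sample_variance(y):
    return np.var(y, axis=-1)


data_rng = np.random.default_rng(1)
# Simulated data (hypothetical): counts that are overdispersed relative to a Poisson,
# and counts that really are Poisson.
y_overdispersed = data_rng.negative_binomial(2, 0.3, size=200)
y_poisson = data_rng.poisson(4.7, size=200)

p_overdispersed = posterior_predictive_pvalue(y_overdispersed, sample_variance)
p_well_specified = posterior_predictive_pvalue(y_poisson, sample_variance)
print("overdispersed data:", p_overdispersed)
print("Poisson data:", p_well_specified)
```

A p-value near zero for the overdispersed counts indicates that the fitted Poisson model almost never reproduces the observed variance, which points to a specific way in which the model fails.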

The second subject is causality. Many of the questions that we would like to answer about a data set are ultimately causal questions, that is, questions about the effect of one variable on another or counterfactual questions of "what if". For example:

Does a drug work?
Will this recommendation system lead to more clicks?
Which genes increase vulnerability to a disease?
Does the "broken window" approach to policing reduce serious crime?

Causal inference, especially from observational data, is both an important activity and a controversial one. We will study various perspectives on causality, including the potential outcome framework, graphical models for causality, and others. We will also study the statistical and algorithmic problems that arise from causal inference goals. We will try to connect modern probabilistic modeling to causal inference, including the ideas we study about building and checking models.
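To make the difficulty with observational data concrete, here is a small simulated sketch. Everything in it is an assumption chosen for illustration: a single binary confounder z, a treatment t that is more likely when z = 1, a linear outcome, and a true effect of 1.0. It compares the naive difference in means with an estimate that adjusts for z by stratification.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
true_effect = 1.0  # assumed effect of treatment on outcome in this simulation

# The confounder z influences both who gets treated and the outcome.
z = rng.binomial(1, 0.5, size=n)
t = rng.binomial(1, np.where(z == 1, 0.8, 0.2))
y = true_effect * t + 2.0 * z + rng.normal(size=n)

# Naive estimate: compare treated and untreated units, ignoring z.
naive = y[t == 1].mean() - y[t == 0].mean()

# Adjusted estimate: compare within each stratum of z, then average
# the stratum-level differences weighted by the frequency of each stratum.
adjusted = 0.0
for value in (0, 1):
    stratum = z == value
    diff = y[stratum & (t == 1)].mean() - y[stratum & (t == 0)].mean()
    adjusted += stratum.mean() * diff

print("naive:", naive)
print("adjusted:", adjusted)
```

In this setup the naive comparison overstates the effect because treated units are more likely to have z = 1, while the stratified estimate recovers a value close to the assumed effect. The adjustment is only valid here because z is observed and is the only confounder, which is an assumption of the simulation rather than something the data can confirm.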

Prerequisite Requirements

This is a small seminar course. It is only open to PhD students. Auditors are not permitted.

The prerequisite course is Foundations of Graphical Models and you should be comfortable with its material. Specifically, you should be able to write down a new model where each complete conditional is in the exponential family, derive and implement a scalable approximate inference algorithm for the model, and understand how to interpret its results. You should also be fluent in the semantics of graphical models, especially d-separation.

If you are interested in the seminar but have not taken Foundations, I suggest you take it in Fall 2015 and then take the seminar offered in Spring 2016.

This seminar will require dedication beyond a typical PhD-level seminar. I recommend it only to students whose research is centered around building and fitting models to answer questions about data. See below for a full description of the workload.

Format and Workload

The seminar involves weekly readings and a project.

Many of these ideas are at the cutting edge of the practice of applied probabilistic models. As a group, we will try to understand the methods, compare them to each other, and consider research opportunities to improve them.

In parallel with readings and discussion, each student will explore a data set with probability models. We will share our progress with the rest of the class through modern electronic research "notebooks" for reporting data analysis (e.g., IPython or knitr) in a shared software repository.

My hope is that this will have several benefits:

We will connect the abstract ideas from the readings to real problems that we are invested in solving. We understand ideas more concretely and more deeply when we try them on problems we care about.
The seminar will become a forum to share best practices in analyzing data with probability models. We can talk about and teach each other about the practice of probabilistic modeling. This will include: nagging issues such as hyperparameters, initialization, and regularization; useful tools such as Stan, PyMC, Pandas, SQLite, ggplot2, Julia, etc.; and fundamental issues, such as choosing a discrepancy function to check a model, finding a causal question, and how to visualize high-dimensional data. We will share software, scripts, tips, and pitfalls.
We will work on projects throughout the semester and produce reproducible research. We will learn how to communicate research results in progress.

At the end of the semester, each student will also write a report that summarizes their progress through the semester. The final grade is based on consistent progress and the final report.

Readings

Our readings will give historical perspective about the subjects and a snapshot of the state of the art. The syllabus will evolve throughout the semester. Readings may include the following.

Model specification

E. Lehmann. Model specification: The views of Fisher and Neyman, and later developments. Statistical Science, 5(2):160–168, 1990.
G. Box. Sampling and Bayes’ inference in scientific modeling and robustness. Journal of the Royal Statistical Society, Series A, 143(4):383–430, 1980.
H. Simon. Science seeks parsimony, not simplicity: Searching for pattern in phenomena. Simplicity, inference and modeling: Keeping it sophisticatedly simple, pages 32–72, 2001.

Model checking and criticism

A. Gelman and C. Shalizi. Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66:8–38, 2012.
A. Gelman, X. Meng, and H. Stern. Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6:733–807, 1996.
D. Rubin. Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4):1151–1172, 1984.
P. Diaconis. Theories of data analysis: From magical thinking through classical statistics. In Exploring Data: Tables, Trends, and Shapes, pages 1–36. 1985.
S. Geisser and W. Eddy. A predictive approach to model selection. Journal of the American Statistical Association, pages 153–160, 1979.

Causality

M. Hernan and J. Robins. Causal Inference, manuscript, 2014.
G. Imbens and D. Rubin. Causal Inference in Statistics and Social Sciences, manuscript, 2014.
J. Pearl. Causal inference in statistics: An overview. Statistics Surveys, 2009.
L. Bottou, J. Peters, J. Quinonero-Candela, D. Charles, D. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207–3260, 2013.
J. Mooij, J. Peters, D. Janzing, J. Zscheischler, B. Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. arXiv, 1412.3773, 2014.