Machine Learning Reading Group Meetings at Columbia
Machine Learning at Columbia Reading Group Schedule
All meetings are at 2:45 in CEPSR 6LE5, unless otherwise noted.

Old meetings are below.
May 4, 2006 Pannagadatta K Shivaswamy and Tony Jebara: Permutation Invariant SVMs (Pannagadatta presenting)
March 2, 2006 Yirong Shen, Andrew Ng, Matthias Seeger: Fast Gaussian Process Regression using KD-Trees (Vlad presenting)
January 26, 2006 Koby Crammer and Yoram Singer: Ultraconservative Online Algorithms for Multiclass Problems (Risi presenting)
January 19, 2006 Linli Xu, James Neufeld, Bryce Larson, Dale Schuurmans: Maximum Margin Clustering (Pannagadatta presenting)
January 5, 2006 (Bert Presenting)
December 29, 2005 No group meeting due to holidays.
December 22, 2005 Vlad Shchogolev: Using b-matching to improve clustering (Vlad Presenting)
December 15, 2005 NIPS Highlights (Andy Presenting)
December 8, 2005 Jean-Philippe Vert, Robert Thurman and William Stafford Noble: Kernels for Regulatory Regions (Rui Presenting)
December 1, 2005 Brad Schumitch, Sebastian Thrun, Gary Bradski and Kunle Olukotun: The Information-Form Data Association Filter (Tony Presenting)
November 29, 2005 Neil Lawrence (University of Sheffield)
High Dimensional Probabilistic Modelling through Manifolds
2:30-3:15 in Interschool Lab

Density modelling in high dimensions is traditionally a very difficult problem. Approaches such as mixtures of Gaussians typically fail to capture the structure of data sets in high dimensional spaces. In this talk we will argue that for many data sets of interest, the data can be represented as a lower dimensional manifold immersed in the higher dimensional space. We will then present the Gaussian Process Latent Variable Model (GP-LVM), a non-linear probabilistic variant of principal component analysis (PCA) which implicitly assumes that the data lies in a lower dimensional space. We will demonstrate the application of the model to a range of data sets, but with a particular focus on human motion data. We will show some preliminary work on facial animation and make use of a skeletal motion capture data set to illustrate differences between our model and traditional manifold techniques.

BIO: Neil Lawrence received his PhD from Cambridge University in 2000, after which he spent a year at Microsoft Research, Cambridge. Currently he is a senior lecturer in the Department of Computer Science, University of Sheffield, U.K. His research interests are probabilistic models with a particular focus on Gaussian processes. He has a particular interest in applications of these models, and recent work has involved him in computational biology, speech, vision and graphics.
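The core computation behind the GP-LVM described above can be sketched in a few lines. This is an illustrative sketch only (the kernel choice and function names are our own assumptions, not code from the talk): the objective is the GP marginal likelihood of the data Y with the kernel evaluated on the latent coordinates X.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0, variance=1.0, noise=1e-2):
    # Squared-exponential kernel on the latent points, plus observation noise.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2) \
        + noise * np.eye(len(X))

def gplvm_log_likelihood(X, Y):
    """log p(Y | X): each of the D output dimensions is an independent GP
    over the N latent points X (N x q), observed as Y (N x D)."""
    N, D = Y.shape
    K = rbf_kernel(X)
    _, logdet = np.linalg.slogdet(K)
    alpha = np.linalg.solve(K, Y)
    return -0.5 * D * (N * np.log(2.0 * np.pi) + logdet) - 0.5 * np.sum(Y * alpha)
```

In the GP-LVM the latent X (typically initialized by PCA) is then adjusted by gradient-based optimization of this quantity.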
November 17, 2005 Risi Kondor: Fourier Transformation on non-Abelian Groups
November 10, 2005 Risi Kondor: Representations of the Symmetric Group
November 3, 2005 Raphael Pelossof: Particle Filtering Using Dynamic Clustering (Raphael presenting)
October 27, 2005 Chiranjib Bhattacharyya: Second Order Cone Programming Formulations for Feature Selection (Pannagadatta presenting)
November 12, 2004 Sanjeev Arora and Ravi Kannan: Learning mixtures of arbitrary Gaussians (Darrin presenting)
November 5, 2004 Matching via loopy propagation (Bert Presenting)
October 29, 2004 Trevor Hastie, Saharon Rosset, Rob Tibshirani and Ji Zhu: The Entire Regularization Path for the Support Vector Machine (Andy presenting)
September 27, 2004 Yee Whye Teh (U.C. Berkeley):
Hierarchical Dirichlet processes: A Bayesian approach to sharing clusters among related groups

We consider problems involving groups of data, where each observation within a group is drawn from a mixture model, and where it is desirable to share mixture components both within and between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from data. Such problems occur often in practice: e.g., in topic discovery in document corpora, each document is treated as a group of data items (a bag of words), and each topic corresponds to a mixture component. In this setting, sharing components between groups simply means that topics can occur across a number of documents, allowing dependencies across documents (groups) to be modeled effectively as well as conferring generalization to new documents (groups).
The hierarchical Dirichlet process (HDP) is a Bayesian solution to this problem, utilizing both hierarchical and nonparametric modeling ideas. Each group is modeled using a Dirichlet process (DP) mixture, which provides a nonparametric prior for the number of components within each group. To facilitate sharing components between groups, we consider a hierarchical extension where the common base distribution for the DPs is itself distributed according to a DP. Because such a base distribution is discrete, the group-specific DPs necessarily share atoms, and hence the mixtures for different groups necessarily share components.
We discuss a variety of representations for the HDP, as well as Markov chain Monte Carlo algorithms for posterior inference. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.
Technical Report: Hierarchical Dirichlet processes. Teh, Jordan, Beal and Blei (2004). UC Berkeley Department of Statistics, TR 653. Can be obtained at:
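A minimal sketch may help make the "discrete base distribution" point concrete. The following truncated stick-breaking draw from a DP is our own illustration (the function names are made up, and this is not the inference machinery discussed in the talk):

```python
import random

def stick_breaking_weights(alpha, n_sticks, rng):
    """Truncated stick-breaking weights for a draw from DP(alpha, H)."""
    weights, remaining = [], 1.0
    for _ in range(n_sticks):
        frac = rng.betavariate(1.0, alpha)  # break off a Beta(1, alpha) fraction
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)  # leftover mass assigned to the final atom
    return weights

def sample_dp(alpha, base_sampler, n_sticks, rng):
    """A truncated draw G ~ DP(alpha, H), returned as (atoms, weights):
    atoms are i.i.d. draws from the base distribution H."""
    atoms = [base_sampler(rng) for _ in range(n_sticks + 1)]
    return atoms, stick_breaking_weights(alpha, n_sticks, rng)
```

Because a draw G from a DP is discrete with probability one, using one such draw as the base distribution of the group-level DPs (the HDP construction) forces every group to pick its atoms from the parent's atoms, which is exactly how mixture components come to be shared across groups.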

September 2, 2004 Matthias Seeger (U.C. Berkeley):
Sparse Multi Gaussian Process Methods: Multi-way Classification and beyond

While supervised kernel techniques involving a single latent function (or discriminant) are well understood and powerful methods have emerged, much less is known about models which combine a number of latent functions. We describe a generic way of generalizing the sparse Bayesian Gaussian process Informative Vector Machine (IVM) to such multi-process models, emphasizing the key techniques which are required for an efficient solution (exploiting matrix structure, numerical quadrature).
We apply our method to the multi-way classification problem, obtaining a scheme which scales essentially linearly in the number of datapoints and classes. We show how kernel parameters can be learned by empirical Bayesian techniques. We argue that a good solution for the multi-class problem leads to schemes for larger structured graphical models such as conditional random fields.
Joint work with Michael Jordan.

July 20, 2004 Pat Langley (Center for the Study of Language and Information, Stanford University):
Computational Discovery of Explanatory Process Models
11:00am CS Conference Room

The growing amount of scientific data has led to the increased use of computational discovery methods to understand and interpret them. However, most work has relied on knowledge-lean techniques like clustering and classification learning, which produce descriptive rather than explanatory models, and it has utilized formalisms developed in AI or statistics, so that results seldom make contact with current theories or scientific notations. In this talk, I present an approach to computational discovery that encodes explanatory scientific models as sets of quantitative processes, simulates these models' behavior over time, incorporates background knowledge to constrain model construction, and induces the models from time-series data. I illustrate this framework on data and models from Earth science and microbiology, two domains in which explanatory process accounts occur frequently. In closing, I describe our progress toward an interactive software environment for the construction, evaluation, and revision of such explanatory scientific models.
This talk describes joint work with Kevin Arrigo, Nima Asgharbeygi, Stephen Bay, Andrew Pohorille, and Jeff Shrager.

July 2, 2004 Risi: Highlights of the Barcelona conference on the Mathematical Foundations of Learning Theory
June 21, 2004 Gokhan BakIr (Tuebingen):
Breaking SVM Complexity with Cross-Training
11:00am Machine Learning Lab

We propose an algorithm for selectively removing examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). The procedure creates a separable distribution of training examples with minimal impact on the decision boundary position. It breaks the linear dependency between the number of support vectors (SVs) and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages.

June 18, 2004 Thomas Blumensath (University of London):
Bayesian Music Transcription
2:30pm Machine Learning Lab

German Creamer (Columbia University):
Automated trading strategies using the boosting approach, technical indicators and constant rebalanced portfolio
2:45pm Machine Learning Lab
April 30, 2004 Tony:
Orbit Learning via Convex Optimization
ML Lab, 2:30pm
Snowbird workshop
April 16, 2004 At 2:30pm, we will be continuing with the Chinese Restaurant Process and Dirichlet Process papers with Risi leading the discussion from April 2nd.
April 13, 2004 Risi:
Multi-facet Learning in Hilbert Spaces
ML Lab, 5pm
Slides from Snowbird workshop: [ps]
April 16, 2004 Deb Roy (MIT Media Laboratory):
Meaning Machines
11:00am Interschool Lab

Computers don't grasp meaning in any deep, human sense. From a computer's point of view, words are meaningless bits of information to be processed with speed and precision, but without any genuine interpretation of content. We envision a new class of machines that understand the meaning of information in more human-like ways by grounding knowledge in the physical world and in the machines' own goals. Towards this long term goal, our group develops conversational robots and other situated systems designed to communicate with human partners using natural language. In this talk I will highlight recent experimental results and applications in multimodal human-machine interaction. I will also discuss an emerging theoretical framework for connecting language, action, and perception which has relevance both for designing situated systems and for cognitive modeling.
Professor Roy's and the Cognitive Machines Group's Web pages are at:

April 9, 2004 Sayan Mukherjee (MIT):
Gene Set Enrichment Analysis
11:00am Interschool Lab

The selection and analysis of differentially expressed gene profiles (markers) helps associate a biological phenotype with its underlying molecular mechanisms and provides valuable insights into the structure of pathways and cellular regulation. However, analyzing and interpreting a given list of gene markers to glean useful biological insights can be extremely challenging. This is in part due to the difficulty of objectively evaluating how well members of a given pathway or functional class of interest (Gene Set) are represented in the marker list. To address this problem we introduce a statistical methodology called Gene Set Enrichment Analysis (GSEA) for determining whether a given Gene Set is over-represented or enriched in a Gene List of markers ordered by their correlation with a phenotype or class distinction of interest. The method is based upon a score computed as the maximum deviation of a random walk (in the same spirit as the Kolmogorov-Smirnov statistic) and uses permutation testing to assess significance. When multiple Gene Sets are tested simultaneously we propose two approaches to address the multiplicity: Validation GSEA, which controls the family-wise error rate (FWER), and Discovery GSEA, which controls the false discovery rate (FDR). The utility of this procedure will be illustrated on two biological problems: validating a mouse model of lung cancer and finding chromosomal dislocations for myeloid leukemia.
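The random-walk score in the abstract can be sketched directly; the equal step sizes and function names below are simplifying assumptions of ours, not the exact weighting used in GSEA.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Maximum deviation of a random walk down a ranked gene list:
    step up on gene-set members ("hits"), down otherwise ("misses")."""
    gene_set = set(gene_set)
    n_hits = sum(1 for g in ranked_genes if g in gene_set)
    n_miss = len(ranked_genes) - n_hits
    up, down = 1.0 / n_hits, 1.0 / n_miss  # assumes 0 < n_hits < len(list)
    walk = best = 0.0
    for g in ranked_genes:
        walk += up if g in gene_set else -down
        if abs(walk) > abs(best):
            best = walk
    return best

def permutation_pvalue(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Assess significance of the score by permuting the gene ranking."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, gene_set))
    genes, hits = list(ranked_genes), 0
    for _ in range(n_perm):
        rng.shuffle(genes)
        if abs(enrichment_score(genes, gene_set)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids a zero p-value
```

A Gene Set concentrated at the top of the ranking drives the walk far above zero before it can return, yielding a large score and a small permutation p-value.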

April 2, 2004 Blei, Griffiths, Jordan and Tenenbaum: Hierarchical Topic Models and the Nested Chinese Restaurant Process. (Risi presenting) Slides: [ps] [pdf]
March 25, 2004 Naftali Tishby (The Hebrew University):
Efficient data representations that preserve relevant information - A new look at Shannon's Information Theory
11:00am Interschool Lab

In this talk I will take a new look at Shannon's Information theory from the machine learning perspective. I will argue that Shannon's theory provides a compelling mechanism for quantifying the fundamental tradeoff between complexity and accuracy, by unifying the source and channel coding theorems into one principle which I call the "Information Bottleneck" (IB). This unified view of the coding theorems can shed new light on the question of "relevant data representation" and suggests new algorithms for extracting such representations from co-occurrence statistics. It also provides new ways of thinking about neural coding and neural data analysis. When applied to the analysis of human language it reveals a new - possibly universal - scaling law that may reflect the way words are acquired in natural languages. The IB principle has interesting new extensions that deal with multivariate data using graphical models and multi-information; allow adding irrelevant side information; and extract nonlinear continuous dimension reductions that preserve information (SDR). I will describe some of those extensions as time allows.
More information and many related papers can be found at:
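As a toy illustration of the IB principle (a sketch under our own assumptions, not the algorithms of the talk), the self-consistent IB equations can be iterated directly: p(t|x) is proportional to p(t) * exp(-beta * KL(p(y|x) || p(y|t))).

```python
import numpy as np

def information_bottleneck(p_xy, n_clusters, beta, n_iter=100, seed=0):
    """Iterate the self-consistent IB equations for a joint p(x, y) with
    strictly positive entries; returns the soft assignment p(t | x)."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    q = rng.random((len(p_x), n_clusters))          # random soft initialization
    p_t_given_x = q / q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_t = p_t_given_x.T @ p_x
        p_y_given_t = (p_t_given_x.T @ p_xy) / p_t[:, None]
        # kl[x, t] = KL( p(y|x) || p(y|t) )
        kl = (p_y_given_x[:, None, :]
              * np.log(p_y_given_x[:, None, :] / p_y_given_t[None, :, :])).sum(-1)
        logits = np.log(p_t)[None, :] - beta * kl
        p_t_given_x = np.exp(logits - logits.max(axis=1, keepdims=True))
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
    return p_t_given_x
```

The parameter beta sets the complexity-accuracy tradeoff: small beta compresses X aggressively, large beta preserves more information about Y.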

March 12, 2004 Brendan J. Frey: A Principled and Computationally Efficient Approach to Visual and Auditory Scene Analysis
11:00am Interschool Lab

Scene analysis is currently the most exciting problem to work on in sensory processing. Scene analysis entails decomposing sensory inputs into a combinatorial explanation that can be used to predict new inputs. The problem is old and can be traced back to Helmholtz and more fuzzily to the ancient Greeks (who thought that fire, etc, was the combinatorial explanation). However, only recently do we have 1) Efficient algorithms that turn exponential-time combinatorial inference and learning algorithms into "linear time" approximate algorithms; and 2) Fast computers that can process sensory data repeatedly and quickly enough that grad students stick around to look at the results. In this talk, I'll describe research on visual and auditory scene analysis, being done in my group at the University of Toronto (PSI-Lab), using graphical models and efficient, approximate inference algorithms. The focus will be on the generative modeling approach and why this approach holds the most promise for "solving" the problem of sensory scene analysis.

For serious reading, check out the following tutorial paper:

For fun videos, audio clips, gene expression array results, etc, see the web pages of Nebojsa Jojic, my postdocs (Romer Rosales, Quaid Morris) and my current students (Kannan Achan, Anitha Kannan, Chris Pal).

March 5, 2004 John Langford: The Method of Reduction in Machine Learning

Reductions transform a solver for one domain into a solver for another domain. I will discuss a general framework for analyzing reductions between learning domains which captures Boosting, ECOC, and other well known algorithms. I will also present new algorithmic reductions and empirical tests of their performance.
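As a concrete (if standard) example of a reduction in this spirit, here is a minimal one-vs-all sketch that turns any binary learner into a multiclass learner; the toy centroid learner and all names are our own illustration, not code from the talk.

```python
def one_vs_all_train(binary_learner, X, y, classes):
    """Reduce multiclass learning to len(classes) binary problems,
    training one class-vs-rest scorer per class."""
    return {c: binary_learner(X, [1 if label == c else 0 for label in y])
            for c in classes}

def one_vs_all_predict(models, x):
    # Predict the class whose binary scorer is most confident on x.
    return max(models, key=lambda c: models[c](x))

def centroid_learner(X, y01):
    """Toy binary 'learner' for 1-D data: score by negative distance to the
    positive-class centroid (stands in for any real binary learner)."""
    positives = [x for x, t in zip(X, y01) if t == 1]
    center = sum(positives) / len(positives)
    return lambda x, c=center: -abs(x - c)
```

Swapping `centroid_learner` for a stronger binary learner changes nothing in the reduction itself, which is the point of the framework.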
February 27, 2004 Yoav Freund, Robert E. Schapire, Yoram Singer, Manfred K. Warmuth: Using and combining predictors that specialize. (Ray presenting)
February 6, 2004 Denis V. Chigirev, William S. Bialek: Optimal Manifold Representation of Data: An Information Theoretic Approach. (Andy presenting)
January 30, 2004 Vishy (SVN Vishwanathan):
2:30pm in CEPSR 6LE5

We present Hilbert space embeddings of dynamical systems and embeddings generated via dynamical systems. This is achieved by following the behavioural framework invented by Willems, namely by comparing trajectories of states. As important special cases we recover the diffusion kernels of Kondor and Lafferty, generalised versions of the directed graph kernels of Gartner, novel kernels on matrices, and new similarity measures on Markov models. We show applications of our method to dynamical texture recognition problems from computer vision.
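For reference, the diffusion kernels of Kondor and Lafferty mentioned above can be computed on a small undirected graph as a matrix exponential of the graph Laplacian; this sketch (our own, via an eigendecomposition) is illustrative only.

```python
import numpy as np

def diffusion_kernel(adjacency, beta):
    """K = exp(beta * L) with L = A - D, the (negative) Laplacian of an
    undirected graph with adjacency matrix A and degree matrix D."""
    degrees = np.diag(adjacency.sum(axis=1))
    L = adjacency - degrees
    eigvals, eigvecs = np.linalg.eigh(L)  # L is symmetric, so eigh applies
    return eigvecs @ np.diag(np.exp(beta * eigvals)) @ eigvecs.T
```

The result is symmetric positive definite, hence a valid kernel; K[i, j] can be read as the amount of heat that diffuses from node i to node j in time beta.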

January 27, 2004 Eleazar Eskin: The Homology Kernel: A Biologically Motivated Sequence Embedding
4:30pm in CS Lounge

Many recent techniques in learning over biological sequences implicitly embed sequences into a Euclidean space in order to take advantage of strong margin-based learning algorithms. However, these embeddings often do not take advantage of the rich biological intuitions that have motivated the development of Hidden Markov Model style biological sequence models and have led to great successes in computational biology. In this talk, we present a new biologically motivated sequence embedding. We discuss several of the formal properties of the embedding, including its connection to local sequence alignment. One of the key features of the embedding is that a sequence is embedded along with its homologues or neighboring sequences. The distance between two sequences is defined by the distance between close neighbors of the sequences. We demonstrate the embedding on several applications: learning protein secondary structure and protein family classification. We also show how the embedding can be used for aligning two sequences based on their homologues. Finally, we discuss how, due to the properties of the embedding, we can efficiently compute the nearest neighbors. Joint work with Sagi Snir.

December 19, 2003 Casual meeting to discuss plans for ICPR, JMLR, ICML, UAI, COLT, and Snowbird.
December 15, 2003 Andrew Gelman: Bayesian data analysis: What it is and what it isn't
11am in Interschool Lab

Bayesian inference is often associated with the concept of subjective probability and prior distributions. We have a different attitude, in which inference proceeds from a posited (not "elicited") probability model, and then the implications of these inferences are compared to existing data, new data, and other subject-matter knowledge. The resulting approach to statistics places the flexibility of complex probability modeling in a scientific context of falsifiable hypotheses. In practice, it means that we can solve lots of problems without agonizing over philosophical questions of subjectivity. We illustrate with several examples in social science and public health.

December 5, 2003 Zhou, Bousquet, Lal, Weston and Scholkopf: Learning with Local and Global Consistency. (Darrin presenting)
November 21, 2003 M. Hein and O. Bousquet: Maximal Margin Classification in Metric Spaces. (Risi presenting)
November 7, 2003 J. Lafferty, A. McCallum, F. Pereira: Conditional Random Fields. (Deep presenting)
October 31, 2003 B. Taskar, C. Guestrin and D. Koller: Max Margin Markov Networks.
October 10, 2003 T. Jebara: Latent Entropy Discrimination.
September 26, 2003 O. Bousquet: New Approaches to Statistical Learning Theory. (Risi presenting)
September 19, 2003 Brief hello meeting to hear about everyone's work and progress.
September 5, 2003 Yoav Freund: How to be a Bayesian without believing
At the Probability Seminar, 11am in Mathematics Building (Room 520)

Most of the study of Bayesian prediction procedures is premised on some strong assumptions regarding the prior distribution. In particular, an assumption that always needs to be made is that the prior distribution assigns a non-zero probability to the correct model. In practice, however, we often have to restrict the set of models (in the support of the prior) in order to make it feasible to compute the posterior average. As a result, we often can't assume that the correct model is in our set, and the standard Bayesian theory cannot be applied.
In this work we show a classification procedure which uses model averaging and can be interpreted as a Bayesian procedure. We show that this procedure has some desirable properties when it is applied to *any* source of IID examples, regardless of whether or not the source distribution is related to the models in the support of the prior. Our main result is that the predictions made by this procedure are very stable with regard to the choice of the random training set.
This stability property has some far-reaching implications on a variety of issues, including Bias-Variance decomposition of classification error, the "curse of dimensionality" and Bagging.
This is joint work with Yishay Mansour and Rob Schapire.

August 29, 2003 Alex Smola: Exponential Families in Feature Space
12:45pm in CS Conference Room
August 15, 2003 Y. Altun, I. Tsochantaridis, and T. Hofmann: Hidden Markov Support Vector Machines. (ICML 2003)
August 8, 2003 Paolo Frasconi: Learning Structured Data: Theory and Applications

An informal talk but should be very interesting!

July 18, 2003 Nebojsa Jojic: Epitomic analysis of appearance and shape
11am in Interschool Lab

We present novel simple appearance and shape models that we call epitomes. The epitome of an image is its miniature, condensed version containing the essence of the textural and shape properties of the image. As opposed to previously used simple image models, such as templates or basis functions, the size of the epitome is considerably smaller than the size of the image or object it represents, but the epitome still contains most constitutive elements needed to reconstruct the image. A collection of images often shares an epitome, e.g., when images are a few consecutive frames from a video sequence, or when they are photographs of similar objects. A particular image in a collection is defined by its epitome and a smooth mapping from the epitome to the image pixels. When the epitomic representation is used within a hierarchical generative model, appropriate inference algorithms can be derived to extract the epitome from a single image or a collection of images and at the same time perform various inference tasks, such as image segmentation, motion estimation, object removal and super-resolution. We have also had some preliminary success in other domains (audio waveforms and MIDI files, for example).

Joint work with Brendan Frey and Anitha Kannan.

July 11, 2003 John Langford: Bounds for Sale

Constructing tight bounds on the future error rate of a classifier has historically been a rather difficult task. This task has been accomplished for support vector machines. I'll discuss the intuitions behind a couple of bounds, and present some empirical results. With bounds this tight, direct bound optimization algorithms are motivated, for which I'll also present results.

Related Papers:
PAC-Bayes and Margins
Relating Data Compression and Learnability
July 1, 2003 Chris Pal: Probabilistic Montage Models of Images

In this talk I'll present a new class of image models we call Probabilistic Montages. Probabilistic Montages represent an image using a grid of smaller sub-images. Each sub-image is then modeled as arising from a transformed and cropped version of one of a collection of possible slightly larger images. We describe the probability distribution relating the way in which sub-images are cropped and transformed with respect to the other locations on the grid using various graphical models. In this talk I'll illustrate a number of variations of the probabilistic montage. I'll present an EM algorithm for estimating parameters in Bayesian tree structured montages and show how probabilistic montages are applicable to various classification and segmentation problems. In particular, I will illustrate the use of a tree structured model for recognition tasks applied to full body images of people in motion.

Joint work with Brendan Frey and Nebojsa Jojic

ECCV Paper: Learning Montages of Transformed Latent Images...
June 20, 2003 G. Lebanon: Learning Riemannian Metrics
June 13, 2003 K. Barnard, P. Duygulu, N. de Freitas, D. A. Forsyth, D. M. Blei, and M. I. Jordan: Matching words and pictures (presented by Andy)
May 9, 2003 O. Bousquet and A. Elisseeff: Algorithmic Stability and Generalization Performance (presented by Risi)
April 25, 2003 T. Jebara and A. Howard: Dynamical Systems Trees for Communities of Interacting Processes (presented by Andy)
April 11, 2003 M. Belkin and P. Niyogi: Using Manifold Structure for Partially Labelled Classification (presented by Darrin)
April 4, 2003 F. Cucker and S. Smale: On the Mathematical Foundations of Learning (presented by Risi)
March 14, 2003 T. Jebara: Alternating Projection for Independent Component Analysis
March 7, 2003 K. Bennett and A. Demiriz: Semi-Supervised Support Vector Machines;
T. Joachims: Transductive Inference for Text Classification using Support Vector Machines (presented by Darrin)
February 24, 2003 Balazs Szendroi: Power laws and self-similarity in large networks: a case study

Using a social network consisting of 10^4 human individuals as a motivating example, I will recall some generic features occurring in various classes of large networks such as "small world" properties and power laws. Going beyond aggregated measures, I will present a somewhat mysterious representation of the self-similar architecture of the network based on its adjacency spectrum. A model reproducing both aggregated and self-similar features of the network will also be discussed. I will conclude by showing that such self-similarity is present in several, though not all, real-world networks.

Joint work with Gabor Csanyi

Some background articles: (Small World)
February 14, 2003 Gunter Ratsch, Alexander Smola, and Sebastian Mika: Adapting Codes and Embeddings for Polychotomies, NIPS 2002 (Katherine)
February 7, 2003 K. Nigam, A. McCallum, S. Thrun, and T. Mitchell: Text Classification from Labeled and Unlabeled Documents using EM (Darrin)
January 31, 2003 E. Segal, Y. Barash, I. Simon, N. Friedman and D. Koller: From Promoter Sequence to Expression: A Probabilistic Framework (Omar)
January 22, 2003 John Langford and John Shawe-Taylor: PAC-Bayes And Margins (NIPS 2002) (Darrin);
Michael Tipping, Anita Faul: Fast Marginal Likelihood Maximisation for Sparse Bayesian Models. (AISTATS 2003)
December 4, 2002 John Lafferty and Guy Lebanon: Information Diffusion Kernels (Risi);
Gal Chechik and Naftali Tishby: Extracting relevant structures with side information (Anshul)
November 27, 2002 Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell: Distance Metric Learning, with application to Clustering with side-information (Risi)
November 20, 2002 We will be preparing for the ACM research fair on Friday, sorting out presentations, etc.: Details and schedule
November 13, 2002 V. Pavlovic, J. Rehg and J. MacCormick: Learning Switching Linear Models of Human Motion. (NIPS 2000) (Andy)
October 30, 2002 We will wrap up the Sufficient Dimensionality paper and look at the following related paper: D. D. Lee and H.S. Seung: Algorithms for Non-negative Matrix Factorization. (NIPS 2000) (Risi)
October 23, 2002 A. Globerson and N. Tishby: Sufficient Dimensionality Reduction.
October 16, 2002 N. Tishby, F. Pereira and W. Bialek: The Information Bottleneck Method. (Anshul)
October 9, 2002 We will continue with: F. R. Bach and M. I. Jordan: Kernel independent component analysis.
October 2, 2002 F. R. Bach and M. I. Jordan: Kernel independent component analysis.
September 25, 2002 Paul Viola and Michael Jones: Rapid Object Detection using a Boosted Cascade of Simple Features
Paul Viola and Michael Jones: Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade