Columbia Computer Science
Faculty Candidate Colloquium

Spring 2004

Learning Predictive Models for Genomic Data

Christina Leslie


Center for Computational Learning Systems
Columbia University

Monday, April 26th, 11 AM, Interschool Lab, 7th floor, CEPSR

Abstract

We present our recent work on two central problems in computational biology: inferring gene regulatory networks and remote protein homology detection. In both cases, we develop discriminative machine learning methods that allow us both to achieve strong predictive performance and to extract biologically important interpretations.

A cell's gene regulatory network refers to the coordinated switching on and off of genes by regulatory proteins that bind to non-coding DNA. We present a new methodology for learning predictive models of gene regulation from gene expression and regulatory sequence data for simple organisms such as yeast. The core of our approach is a novel algorithm called GeneClass, which is based on boosting. While descriptive models such as probabilistic graphical models focus on finding structure in the training data, our method makes accurate predictions about which genes will be up- or down-regulated in new experiments. We also show how to use GeneClass to identify biologically important regulators and binding motifs for specific regulatory pathways.
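To make the boosting idea concrete, the following is a minimal sketch of generic AdaBoost with one-feature decision stumps. It is not the GeneClass algorithm itself. The binary features (read here as something like "motif present in the promoter and regulator active"), the toy data, and the names train_adaboost and predict are all assumptions made for illustration.

```python
import math

def stump(x, j, polarity):
    """Weak rule: predict `polarity` if binary feature j is on, else the opposite."""
    return polarity if x[j] else -polarity

def train_adaboost(X, y, rounds=3):
    """X: list of binary feature vectors; y: labels in {-1, +1} (down/up-regulated)."""
    n, d = len(X), len(X[0])
    w = [1.0 / n] * n              # example weights, initially uniform
    model = []                     # list of (feature index, polarity, alpha)
    for _ in range(rounds):
        # Pick the stump with the smallest weighted training error.
        best = None
        for j in range(d):
            for polarity in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump(xi, j, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, j, polarity)
        err, j, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((j, polarity, alpha))
        # Increase the weight of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump(xi, j, polarity))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump(x, j, polarity) for j, polarity, alpha in model)
    return 1 if score >= 0 else -1

# Hypothetical toy data: the label follows feature 0; features 1 and 2 are noise.
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
y = [1, 1, -1, -1, 1, -1]
model = train_adaboost(X, y)
```

Because each stump tests a single named feature, the features chosen in early rounds can be read off the model directly, which is the sense in which a boosted predictor can also point to candidate regulators and motifs.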

Protein classification is the prediction of the structural or functional class of a protein from its primary sequence. A difficult subproblem is remote homology detection, where one wants to predict a structural relationship between sequences that have diverged over a long evolutionary distance. Building on our earlier work on string kernels for support vector machine (SVM) classifiers, we develop new semi-supervised learning approaches to these problems. We take advantage of abundant unlabeled data -- large databases of protein sequences whose structural class is unknown -- to define cluster kernels and profile-based string kernels that outperform all competing remote homology detection methods. In the case of profile kernels, we can interpret the SVM classifier by extracting discriminative motif regions that suggest conserved structural subunits in a protein superfamily.
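As an illustration of what a string kernel computes, here is a minimal sketch of one simple instance, a k-mer spectrum kernel, which scores two sequences by the k-length substrings they share. The choice of this particular kernel, the value k = 3, the function names, and the two short example sequences are assumptions for illustration. The cluster and profile kernels described above additionally use unlabeled sequence data and are not shown.

```python
from collections import Counter
import math

def kmer_counts(seq, k):
    """Count every contiguous length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of sequences a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(count * cb[kmer] for kmer, count in ca.items() if kmer in cb)

def normalized_kernel(a, b, k=3):
    """Cosine-style normalization so that a sequence scores 1.0 against itself."""
    denom = math.sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0

# Hypothetical short sequences; real protein sequences are far longer.
s1 = "MKVLA"   # 3-mers: MKV, KVL, VLA
s2 = "KVLAG"   # 3-mers: KVL, VLA, LAG
shared = spectrum_kernel(s1, s2)      # two 3-mers in common
similarity = normalized_kernel(s1, s2)
```

A matrix of such kernel values over a training set can be passed to any SVM implementation that accepts a precomputed kernel.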

Biosketch

Dr. Leslie received her PhD in Mathematics from Berkeley and held an NSERC Postdoctoral Fellowship in the Mathematics Department at Columbia University. She joined the Columbia Computer Science Department in Fall 2000 and moved to the Center for Computational Learning Systems (CLASS) at Columbia in Spring 2004, where she is currently a Research Scientist. The research focus of her lab is the application of machine learning methods to computational biology problems, including modeling gene regulation, protein classification and remote homology detection, improving protein ranking, signal finding for pre-mRNA splicing, and local protein conformation prediction.