New Project Ideas

Predictive models for gene regulation

My research group has recently developed a method for learning predictive models of gene regulation from gene expression data and regulatory sequence data in simple organisms such as yeast. By "predictive", we mean that the learned model predicts which genes will be up- or down-regulated under different experimental conditions. The method uses boosting, a classification algorithm from machine learning, with alternating decision trees to represent the learned predictive model. There are many directions for extension or for application to other simple organisms. This work is to appear in ISMB 2004 (the largest international conference on computational biology).
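
The boosting idea can be sketched with one-feature decision stumps in place of full alternating decision trees (a simplification of the actual method); the motif-presence features and labels below are invented for illustration:

```python
import math

def train_adaboost_stumps(X, y, rounds=5):
    """AdaBoost with one-feature decision stumps.
    X: list of binary feature vectors (e.g., motif presence in a promoter),
    y: +1/-1 labels (up-/down-regulated).
    Returns a list of weak learners (feature, polarity, weight alpha)."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None  # (weighted error, feature index, polarity)
        for j in range(len(X[0])):
            for pol in (+1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (pol if xi[j] else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, pol)
        err, j, pol = best
        err = max(err, 1e-10)                      # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((j, pol, alpha))
        # re-weight: mistakes go up, correct predictions go down
        w = [wi * math.exp(-alpha * yi * (pol if xi[j] else -pol))
             for xi, yi, wi in zip(X, y, w)]
        s = sum(w)
        w = [wi / s for wi in w]
    return model

def predict(model, x):
    """Sign of the weighted vote of all stumps."""
    score = sum(alpha * (pol if x[j] else -pol) for j, pol, alpha in model)
    return 1 if score >= 0 else -1
```

An alternating decision tree generalizes this by letting the weak rules build on one another instead of voting independently.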

References:

Contacts:

Integrating data types for learning models of regulation

Combining multiple sources of data -- such as sequence data from promoter regions containing transcription factor binding sites, gene expression data, and binding localization data -- in a learning approach can lead to improved understanding of transcriptional regulation. The REDUCE paper was presented in lecture earlier in the semester. The reference from Daphne Koller's lab is technically more difficult, but it contains interesting ideas -- you may be able to try a simpler model. We have also done work on joint clustering models in our research group -- see me for additional ideas and references.

References:

Predicting Protein-Protein Interactions

Learning to predict which pairs of proteins will interact is an important but difficult new problem. There are new high-throughput techniques like yeast two-hybrid screens for detecting pairwise protein interactions, but these assays are notoriously noisy -- that is, the + and - labels (for interaction and non-interaction) are uncertain. Some recent efforts have focused on combining different kinds of evidence for supervised learning (see Janssen reference below) or incorporating protein motif data (unpublished, but see abstract referenced below). A possible new approach for this problem would be to avoid using labels and instead weigh different sources of evidence for consistency in order to predict interactions. Dr. Phil Long (at Columbia's CLASS research center) and collaborators have developed a machine learning technique for weighing evidence that could be applied to this problem.
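
One simple way to weigh noisy evidence sources (a naive-Bayes-style sketch, not Dr. Long's actual technique) is to combine each assay's +/- call according to an estimated reliability:

```python
import math

def combine_evidence(observations, reliabilities, prior=0.5):
    """Combine noisy binary evidence for 'do these two proteins interact?'
    observations: list of +1/-1 calls (e.g., from two-hybrid screens);
    reliabilities: P(source's call is correct) for each source.
    Naive-Bayes log-odds combination, assuming sources err independently."""
    log_odds = math.log(prior / (1 - prior))
    for obs, p in zip(observations, reliabilities):
        lr = p / (1 - p)  # likelihood ratio contributed by one source
        log_odds += math.log(lr) if obs == +1 else -math.log(lr)
    return 1 / (1 + math.exp(-log_odds))  # posterior P(interaction)
```

Note that a coin-flip source (reliability 0.5) contributes nothing, while agreeing reliable sources drive the posterior toward 1.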

References:

Contacts:

Inferring Regulatory Networks from Expression Data

A new and exciting area of computational biology is the problem of inferring regulatory networks in the cell from gene expression data. The theory of Bayes nets (or "graphical models") -- probabilistic generative models that describe a joint probability distribution over an acyclic network of random variables -- provides a framework for learning such networks. Unfortunately, structure learning in graphical models is quite involved, and limited and noisy data make the inference problem difficult; currently, only a handful of groups in the world have strong expertise in this area, and they use their own internal (unavailable) software for computations.

Given the advanced nature of the model and the importance of the problem, it would be a worthwhile project simply to implement a Matlab prototype of the learning algorithms discussed in one of the references below and to try to reproduce the results on the (publicly available) datasets that these papers use. As a starting point, download and study Kevin Murphy's Bayes Net Toolbox for Matlab -- it includes code to set up graphical models, learn parameters from data, and even perform a few types of structure learning.

The easier project would be to reproduce the results in the Hartemink paper: construct a set of candidate models for the small network that they consider and calculate the "Bayesian" score for each model in order to rank the candidates. Ideally, you would find a second biological example on which to validate the method.
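
As a sketch of the scoring step, here is one standard "Bayesian" score for discrete data (the BDeu marginal likelihood), assuming hypothetical binary (on/off) expression levels; the candidate structures are given as parent lists:

```python
import math
from itertools import product

def bdeu_score(data, structure, ess=1.0):
    """Log BDeu marginal-likelihood score of a DAG over binary variables.
    data: list of dicts {variable: 0/1}; structure: {variable: [parents]};
    ess: equivalent sample size controlling the Dirichlet pseudocounts."""
    score = 0.0
    for var, parents in structure.items():
        q = 2 ** len(parents)   # number of parent configurations
        a_ij = ess / q          # pseudocount per parent configuration
        a_ijk = a_ij / 2        # pseudocount per (configuration, value)
        for config in product((0, 1), repeat=len(parents)):
            rows = [r for r in data
                    if all(r[p] == v for p, v in zip(parents, config))]
            score += math.lgamma(a_ij) - math.lgamma(a_ij + len(rows))
            for val in (0, 1):
                n_ijk = sum(1 for r in rows if r[var] == val)
                score += math.lgamma(a_ijk + n_ijk) - math.lgamma(a_ijk)
    return score
```

Ranking candidates is then just a matter of scoring each structure on the same data and sorting.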

A more involved project would be to model "interventions", as discussed in the Pe'er paper, for dealing with knock-out data. You would want to set up the bootstrapping (sampling) process to calculate confidence scores for small features of the network and see if you can validate the high-confidence features that the authors obtained for the mating response and/or ergosterol cycles.
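
The bootstrapping step can be sketched generically (the `learn_edges` routine below stands in for whatever structure learner you plug in; it is a placeholder, not part of the published method):

```python
import random

def edge_confidence(data, learn_edges, n_boot=200, seed=0):
    """Bootstrap confidence for network features, in the style of Pe'er et al.:
    resample the dataset with replacement, re-learn the network each time,
    and report the fraction of resamples in which each edge appears."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    counts = {}
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]
        for edge in learn_edges(sample):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_boot for edge, c in counts.items()}
```

Features recovered in nearly all resamples are the "high-confidence" ones worth trying to validate biologically.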

A newer paper (the Minreg paper of Pe'er et al.) used prior knowledge of transcription factors in yeast to learn a simpler network structure.

References:

Learning rankings for protein sequence searches

Algorithms like PSI-BLAST use pairwise similarity measures to produce a ranking of protein sequences from a database relative to a sequence query. Sequences near the top of the list (the top "hits") are most likely to be homologs of the query. We have recently developed a graph-based algorithm called RankProp for learning to improve the ranking of protein sequences returned by algorithms like PSI-BLAST. RankProp defines a graph on the space of all protein sequences, where edge weights are derived from pairwise similarity scores, and uses this global structure to learn an improved ranking. This work is to appear in PNAS. Due to PNAS rules, I cannot post a preprint online, but see me for more information and project ideas.
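
The flavor of the algorithm can be conveyed with a simplified diffusion sketch (not the published RankProp update): start from the direct similarity scores to the query and let activation flow along the edges of the similarity graph, so that homologs-of-homologs rise in the ranking:

```python
def rankprop(W, query, alpha=0.5, iters=50):
    """Graph-based re-ranking sketch in the spirit of RankProp.
    W: row-normalized pairwise-similarity matrix (list of lists);
    query: index of the query sequence; alpha: diffusion strength (< 1)."""
    n = len(W)
    y0 = list(W[query])  # initial activation: direct similarity to the query
    y = list(y0)
    for _ in range(iters):
        # each node keeps its direct score plus diffused neighbor activation
        y = [y0[i] + alpha * sum(W[i][j] * y[j] for j in range(n))
             for i in range(n)]
    ranking = sorted(range(n), key=lambda i: -y[i])
    return [i for i in ranking if i != query]
```

In the toy graph below, sequence 2 has zero direct similarity to the query but is pulled up the ranking through its neighbor, while the isolated sequence 3 stays at the bottom.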

Contacts:

Inference from Single Nucleotide Polymorphism Data

The human genome project has led to considerable progress in understanding and characterizing variation in the human genome. A dense collection of sequence variants (i.e., genetic markers) has been mapped across the genome, which will help researchers identify disease-causing sequence variants. Most stable variation in the genome occurs in the form of single-nucleotide polymorphisms (SNPs), which represent about 90% of the common variation in the genome. This variation arises through a single mutation event in the history of the human population, and the likelihood of recurrent mutation at the same site is low. Consequently, SNPs are stable genetic markers.

The extensive repository of these SNP markers provides a tool for discovering the genetic basis of common complex diseases (those due to multiple interacting genes and the environment). The approach involves typing a large number of SNP markers, in case-control samples, across a set of candidate genes thought to be functionally significant in the manifestation of the disease of interest. The expectation is that SNPs associated with the disease will have a different profile in the case versus the control sample.

There are a number of possible learning-based approaches to this problem. One could view the SNPs as (typically binary-valued) features for a multi-class classification problem (where the classes correspond to phenotypes or diseases) and use standard supervised learning techniques to train a classifier. More meaningful to a medical researcher or biologist, perhaps, would be to develop a probabilistic model for this data that is somewhat more involved than the one implicitly used by population geneticists -- for example, a graphical model that allows for interaction between SNPs in different candidate genes, producing SNP-based network configurations that distinguish the case and control populations. Any interesting learning approach applied to this data would be a novel contribution and an interesting project.
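
A minimal version of the supervised route (naive Bayes over binary SNP features, with an invented genotype matrix for illustration) looks like this:

```python
import math

def train_snp_nb(snps, labels):
    """Naive Bayes with Laplace smoothing over binary SNP features.
    snps: list of 0/1 vectors (one per individual); labels: class names."""
    model = {}
    n = len(labels)
    for cls in set(labels):
        rows = [x for x, l in zip(snps, labels) if l == cls]
        # smoothed per-SNP frequency of allele 1 within this class
        freqs = [(sum(x[j] for x in rows) + 1) / (len(rows) + 2)
                 for j in range(len(snps[0]))]
        model[cls] = (math.log(len(rows) / n), freqs)
    return model

def classify_snp(model, x):
    """Return the class with the highest posterior log-probability."""
    def loglik(cls):
        prior, freqs = model[cls]
        return prior + sum(math.log(f if v else 1 - f)
                           for v, f in zip(x, freqs))
    return max(model, key=loglik)
```

A graphical model over the SNPs would replace the independence assumption inside each class with learned interactions.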

References:

Contacts:

Haplotype Mapping

Here is some information about the Haplotype Mapping project ("HapMap") from the National Institutes of Health: "Sites in the genome where individuals differ in their DNA sequence by a single base are called single nucleotide polymorphisms (SNPs). Recent work has shown that there are about 10 million SNPs that are common in human populations. SNPs are not inherited independently; rather, sets of adjacent SNPs are inherited in blocks. The specific pattern of particular SNP alleles in a block is called a haplotype. Recent studies show that most haplotype blocks in the human genome have been transmitted through many generations without recombination. Furthermore, each block has only a few common haplotypes. This means that although a block may contain many SNPs, it takes only a few SNPs to uniquely identify or 'tag' each of the haplotypes in the block."

Computational approaches are being developed for determining haplotype blocks from genotype data from many individuals, as well as for associating haplotypes with disease. I list just a few references; more can be found in the bibliography of the second paper. A possible project would be to implement one of these haplotype algorithms and test it on a small dataset; extensions of these methods or comparisons between methods would be very interesting.
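
The "tag SNP" idea can be sketched as a greedy set-cover problem: pick SNPs until every pair of haplotypes in the block differs at some chosen position (the toy haplotypes below are made up; published methods are considerably more sophisticated):

```python
from itertools import combinations

def tag_snps(haplotypes):
    """Greedy tag-SNP selection: repeatedly add the SNP that separates the
    most not-yet-distinguished haplotype pairs, until all pairs differ at
    some chosen SNP. haplotypes: distinct, equal-length 0/1 tuples."""
    pairs = {(a, b) for a, b in combinations(range(len(haplotypes)), 2)}
    chosen = []
    while pairs:
        best = max(range(len(haplotypes[0])),
                   key=lambda j: sum(haplotypes[a][j] != haplotypes[b][j]
                                     for a, b in pairs))
        chosen.append(best)
        # keep only pairs still identical at every chosen SNP
        pairs = {(a, b) for a, b in pairs
                 if haplotypes[a][best] == haplotypes[b][best]}
    return chosen
```

Because adjacent SNPs within a block are highly correlated, a few tags typically suffice, which is exactly the point of the HapMap quotation above.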

References:

Motif models and discovery

Probabilistic and combinatorial models for regulatory motifs (e.g., binding sites for transcription factors) have been used to search for new signals in promoter regions and full genomes. We will probably cover one EM-based approach, called MEME, later in the semester.
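
As a warm-up, here is the scoring half of such a probabilistic model: a log-odds position weight matrix built from aligned example sites under a uniform background (MEME's EM step, by contrast, re-estimates such a matrix from unaligned sequences; the sites below are toy data):

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudo=1.0):
    """Log-odds position weight matrix from aligned example binding sites,
    with pseudocounts and a uniform (0.25 per base) background."""
    pwm = []
    for pos in range(len(sites[0])):
        col = [s[pos] for s in sites]
        pwm.append({b: math.log(((col.count(b) + pseudo) /
                                 (len(sites) + 4 * pseudo)) / 0.25)
                    for b in BASES})
    return pwm

def best_site(pwm, seq):
    """Scan seq and return (offset, score) of the highest-scoring window."""
    width = len(pwm)
    scores = [(sum(pwm[i][seq[j + i]] for i in range(width)), j)
              for j in range(len(seq) - width + 1)]
    score, offset = max(scores)
    return offset, score
```

A positive score means the window looks more like the motif than like background sequence.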

References:

String Kernels for Sequence Data

String kernels are functions that implicitly map sequences into a feature space and compute inner products in that space; they allow us to apply learning algorithms designed for vector-valued data (SVMs, clustering, principal component analysis) to sequence data. Various string kernels have recently been used in computational biology for applications such as protein classification (many were introduced by our group at Columbia) and peptide cleavage site recognition; they have also appeared in natural language processing for text classification. New string kernels, extensions of existing string kernels, and new biological applications of string kernels would all make interesting subjects for a project. More recently, our group has been developing profile-based string kernels and semi-supervised approaches to building kernels -- see me for additional, newer references.

I list only a few references below -- I can provide more to interested students. Among other things, there is a connection between these kernels and non-deterministic finite state automata.
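
The simplest example of the idea is the k-spectrum kernel used in protein classification: the implicit feature vector counts every length-k substring, and the kernel value is the inner product of those count vectors:

```python
from collections import Counter

def spectrum_kernel(x, y, k=3):
    """k-spectrum kernel: inner product of k-mer count vectors.
    The implicit feature space is indexed by all length-k strings
    over the alphabet, but only observed k-mers need to be counted."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[s] * cy[s] for s in cx)
```

Mismatch and profile kernels extend this by letting a k-mer match near-neighbors as well as exact occurrences.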

References:

Contacts:

New Approaches for Time Series Expression Data

Many labs are now producing time series gene expression data sets, in which multiple microarray assays are made at different time points in some biological process. One should be able to learn more from time series data than from the same number of unrelated replicates, since we can see the evolution of a process. However, the data is too sparse and noisy (and the genes too numerous) for many standard time series analysis techniques.

Below I list two clustering algorithms specifically designed to deal with time series data. An implementation of the spline-based clustering is available from my lab. I can make specific suggestions for projects related to the spline approach if a group is interested. Otherwise, any implementation or comparison of time series clustering or analysis techniques for this data could be interesting.
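
As a baseline to compare against (a crude greedy clustering under Pearson correlation, not the spline-based method; the toy profiles stand in for per-gene time courses):

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant time courses."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def corr_cluster(series, threshold=0.9):
    """Greedy single-pass clustering: each profile joins the first cluster
    whose seed it correlates with above the threshold, otherwise it starts
    a new cluster. Returns lists of member indices."""
    clusters = []  # each entry: [seed_index, member_indices]
    for i, s in enumerate(series):
        for cluster in clusters:
            if pearson(series[cluster[0]], s) >= threshold:
                cluster[1].append(i)
                break
        else:
            clusters.append([i, [i]])
    return [members for _, members in clusters]
```

Correlation ignores the ordering of time points entirely, which is exactly the weakness the spline-based approach addresses.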

References: