Past Project Ideas

Here is a collection of project ideas from Prof. Bill Noble's offering of the course in Spring 2001.

Gene clustering by promoter patterns

The upstream regions of genes are characterized by transcription factor binding sites, short conserved sequence motifs that are recognized by the transcription factors that control transcription initiation. The purpose of this project is to cluster genes according to the occurrences of these sites in the upstream regions.

The project consists of three steps: (1) convert each promoter region to a fixed-length vector that represents the motif occurrences, (2) feed these vectors into a clustering program, and (3) evaluate the significance of the resulting clusters. You have choices at each step.

For step (1), the fixed-length vectors could be of length 4^n, where each element corresponds to a single length-n subsequence and contains the number of occurrences of that subsequence in the promoter region, divided by the frequency of that subsequence in the complete genome. Or the fixed-length vector could be derived by comparing each of the transcription factor binding sites in SCPD to the promoter region. In this case, each element in the vector would be the maximum matching score of the SCPD matrix versus the promoter. Or the fixed-length vector could be derived from motifs discovered by a fast motif discovery program such as SPLASH.

For step (2), you have a multitude of options for clustering algorithms -- hierarchical clustering, self-organizing maps, superparamagnetic clustering, etc. You can search the web for software to perform the clustering. One option is S-plus or R.

For step (3), one way to evaluate the significance of the clusters is to compare each one to the classes in a database such as MIPS and find the most similar class. You can evaluate the statistical significance of the similarity by computing the total number of differences (false positives plus false negatives) and then computing this same value for 10,000 randomly generated classes of the same size.
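
As an illustration of the first vector representation, here is a minimal Python sketch that counts every length-n subsequence in a promoter and divides each count by the relative frequency of that subsequence in a genome string. The function names (`kmer_counts`, `promoter_vector`) are made up for this example, and the choice to assign 0 to a subsequence that never occurs in the genome is an assumption, not part of the project description.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, n):
    """Count every overlapping length-n subsequence of seq."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def promoter_vector(promoter, genome, n):
    """Return a list of length 4**n: promoter counts of each length-n
    subsequence, divided by that subsequence's relative frequency in genome."""
    kmers = ["".join(p) for p in product("ACGT", repeat=n)]
    prom = kmer_counts(promoter, n)
    gen = kmer_counts(genome, n)
    total = sum(gen[k] for k in kmers)
    vector = []
    for k in kmers:
        freq = gen[k] / total if total else 0.0
        # Assumption: a subsequence absent from the genome gets 0
        # instead of causing a division by zero.
        vector.append(prom[k] / freq if freq > 0 else 0.0)
    return vector
```

The elements are ordered alphabetically by subsequence (AA, AC, AG, ...), so every promoter maps to a vector with the same layout and the vectors can be passed directly to a clustering program.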

Input: A collection of upstream regions.

Output: A set of cluster labels (1, 2, 3, etc.) for subsets of the upstream regions that share similar sets of motifs.
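
The randomization test suggested for the evaluation step can be sketched as follows. This is one possible reading of the description: the "differences" between a cluster and a class are taken to be the size of their symmetric difference, random classes are sampled without replacement from all genes, and the p-value is the fraction of random classes that match the cluster at least as well as the real class. The add-one correction and the fixed random seed are assumptions added for this example.

```python
import random


def differences(cluster, klass):
    """False positives plus false negatives: size of the symmetric difference."""
    return len(set(cluster) ^ set(klass))


def permutation_p_value(cluster, klass, all_genes, trials=10000, seed=0):
    """Estimate how often a random class of the same size as klass
    differs from the cluster by no more than the real class does."""
    rng = random.Random(seed)
    observed = differences(cluster, klass)
    genes = sorted(all_genes)
    size = len(set(klass))
    hits = 0
    for _ in range(trials):
        random_class = rng.sample(genes, size)
        if differences(cluster, random_class) <= observed:
            hits += 1
    # Assumption: add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)
```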


Identifying putative promoter motifs

You can also turn the previous project on its head, and simply use the MIPS classifications to produce inputs for a motif discovery algorithm. For each class in MIPS, gather up the corresponding upstream regions, and feed them into one or more motif discovery algorithms. Evaluate the statistical significance of the motifs that you find. Also, check to see whether the motifs correspond to binding sites in TRANSFAC or the SCPD.

Protein clustering

This project is similar to the previous one, except that instead of clustering promoter regions, you will cluster proteins. You can convert the proteins to fixed-length vectors by computing match scores versus a database of motifs or a database of family models, or by computing Fisher scores from a hidden Markov model. Clustering and evaluation can be performed as described above.

Input: A collection of proteins.

Output: A set of protein family labels (1, 2, 3, etc.) for subsets of the proteins that share primary sequence similarity.


Supervised learning of gene classes

In this project, you would use the fixed-length vector representations of promoter regions or primary sequences, as described in the previous three projects, but your task would be to learn the given set of classes in a supervised fashion. Your choices for supervised learning algorithm are similarly varied: a Bayesian classifier (as implemented, for example, by AutoClass), a decision tree algorithm such as C4.5, the k-nearest neighbor algorithm, neural networks, support vector machines, etc. Evaluation should be performed using cross-validation (divide the data into n subsets; train on n-1 subsets and test on the remaining subset; repeat this process n times) and by computing an ROC score for the predicted classes. The ROC score is the normalized area under a curve that plots true positives as a function of false positives.
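
The evaluation procedure can be sketched in a few lines of Python. The fold assignment below (every n-th example goes to the same fold) is just one simple choice, and the ROC score is computed through its equivalent pairwise form: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function names are hypothetical.

```python
def cross_validation_folds(n_items, n_folds):
    """Yield (train_indices, test_indices) once for each fold."""
    indices = list(range(n_items))
    for fold in range(n_folds):
        test = indices[fold::n_folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def roc_score(scores, labels):
    """Normalized area under the ROC curve for real-valued scores
    and binary labels (truthy = positive)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A score of 1.0 means every positive example is ranked above every negative one; 0.5 is what a random ranking would give on average. In practice you would shuffle the examples before assigning folds, so that the order of the input file does not determine the split.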

Input: A training set of proteins or gene promoter regions, each labeled as belonging to a protein family, and an unlabeled test set of the same kind of data.

Output: A set of predicted class labels for the test set.

Data: See previous three projects.


Gene cluster annotation

The scientific literature is full of methods for clustering genes according to their mRNA expression levels, primary sequence, promoter region sequences, or other types of genomic data. The goal of this project is to write and validate a program for automatically annotating and evaluating the clusters derived from an automatic clustering technique.

Given a cluster of genes, the annotation program will extract the corresponding keyword annotations from a database. For each keyword found, the program will compute the probability of seeing the observed number of genes with that keyword if the cluster were drawn randomly from the set of annotated genes, and will report all keywords in a cluster that occur more frequently than expected by chance. You can evaluate this clustering annotation tool by attempting to automatically annotate clusters from published sets of microarray expression data, or by comparing the resulting cluster annotations to known classifications such as the MIPS catalog, SCOP database or PROSITE.
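
One common way to compute such a probability is the hypergeometric tail: the chance of drawing at least the observed number of keyword-bearing genes when the cluster is sampled without replacement from all annotated genes. The sketch below assumes that interpretation (the description above does not fix the exact test), and the function and parameter names are made up for illustration.

```python
from math import comb


def enrichment_p_value(total_genes, genes_with_keyword, cluster_size, hits):
    """P(X >= hits), where X is the number of keyword-bearing genes in a
    cluster of cluster_size genes drawn at random from total_genes."""
    upper = min(cluster_size, genes_with_keyword)
    denom = comb(total_genes, cluster_size)
    tail = sum(
        comb(genes_with_keyword, k)
        * comb(total_genes - genes_with_keyword, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / denom
```

For example, `enrichment_p_value(6000, 40, 25, 5)` asks how surprising it is to find 5 of 40 keyword-bearing genes in a 25-gene cluster drawn from 6000 annotated genes. Because many keywords are tested for each cluster, some correction for multiple testing would be needed before reporting the p-values.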

An additional component of this project could be to use the annotations to select clusters automatically from a hierarchical clustering dendrogram. This component would require coming up with a single statistical score to represent the significance of the found annotations, and then computing these scores at various levels of the dendrogram in search of significant clusters.

Input: A list of gene identifiers, keyword annotations, and cluster labels. A collection of keyword annotations from a large genomic database.

Output: For each cluster, a set of keywords that occur more frequently than would be expected by chance, along with their associated p-values.



Principal components analysis of genomic data

Principal components analysis (PCA) is a useful way of finding patterns in high-dimensional data sets. Perform PCA on any of the fixed-length vector representations described in the previous projects. Plot the vectors in the space defined by the first two or first three principal components, and color the vectors according to some external classification. You can use gene functional classes, protein structural classes, or (for microarray data) diagnostic classes such as cancerous versus non-cancerous samples. Does the resulting picture show an obvious clustering or clear separation between classes? If so, you may be able to roughly correlate the learnability of a given class (using a simple supervised algorithm such as k-nearest neighbor) with the class's appearance in the PCA plot. Compare the utility of the PCA picture with that of a simple colorized view of the two sets of class vectors, as given by a tool such as Treeview. A useful outcome of this project would be a standalone program or a web page that automatically performs PCA and plots the resulting data.
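
The projection step can be sketched with NumPy's singular value decomposition. This is a minimal example that assumes NumPy is installed and that the data fit in memory as a dense matrix with one row per vector; the function name `pca_project` is hypothetical, and plotting and coloring by class are left out.

```python
import numpy as np


def pca_project(vectors, n_components=2):
    """Project the rows of `vectors` onto the first principal components.

    Returns (coordinates, explained_variance_ratio)."""
    data = np.asarray(vectors, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:n_components].T
    variance = singular_values ** 2
    ratio = variance[:n_components] / variance.sum()
    return coords, ratio
```

The returned coordinates can be passed to any plotting tool, with one color per class label. The explained variance ratio gives a rough idea of how much of the structure in the data is visible in a two- or three-dimensional picture.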

Input: Any type of genomic data (protein sequences, promoter regions, microarray data, yeast two-hybrid data, phylogenetic profiles, etc.), represented as fixed-length vectors, plus a corresponding set of class labels.

Output: A picture of the data in the space defined by the first two or three principal components, colored according to the given classes.




A new metric for evaluating changes in microarray expression levels

Gene expression is typically measured as a log ratio of dye intensity in an experimental versus control sample. Many published papers use as a rule of thumb that an n-fold change in expression is significant, where n is usually a value between 1.3 and 2. Besides being arbitrary, this rule is risky because ratios become unstable for small values. The purpose of this project is to come up with a better method for selecting the significantly changed genes in a single microarray experiment.

Here is one possible metric. Imagine that each intensity measurement includes normally distributed error with standard deviation S. Then, for an expression ratio R of intensities in experimental (E) versus control (C) conditions, a simple method for putting error bars around the ratio is to include the standard deviation term in both measurements; i.e., (E-S)/(C+S) < R < (E+S)/(C-S). These error bars will be wider for low-intensity measurements and narrower for high-intensity measurements. For a given n-fold threshold, you can select either all the genes that exhibit definitely greater than n-fold change (i.e., (E-S)/(C+S) > n) or all genes that exhibit possibly greater than n-fold change (i.e., (E+S)/(C-S) > n). Compare these lists to the set of genes that were selected using only the raw ratios, as well as the corresponding lists derived from a gold standard of truly changed genes.
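
A minimal sketch of this selection rule is shown below. It handles only increases in expression (a decrease could be treated by swapping the two intensities), it clamps the lower bound at zero, and it treats the upper bound as infinite whenever the control intensity does not exceed the error term. Those three choices, like the function names, are assumptions made for this example.

```python
def ratio_bounds(experimental, control, s):
    """Return (low, high) bounds on experimental/control for error term s."""
    low = max(experimental - s, 0.0) / (control + s)
    # Assumption: if control <= s the ratio is unbounded above.
    high = (experimental + s) / (control - s) if control > s else float("inf")
    return low, high


def select_changed(genes, s, n):
    """genes maps a gene name to (experimental, control) intensities.

    Returns (definitely, possibly): genes whose lower bound exceeds n,
    and genes whose upper bound exceeds n."""
    definitely, possibly = [], []
    for name, (e, c) in genes.items():
        low, high = ratio_bounds(e, c, s)
        if low > n:
            definitely.append(name)
        if high > n:
            possibly.append(name)
    return definitely, possibly
```

With an error term of 20 and a 2-fold threshold, a gene measured at 4000 versus 1000 is selected as definitely changed, while a gene measured at 40 versus 10 has the same raw ratio but is only possibly changed, because its error bars are much wider.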

You can derive your gold standard by using a data set that includes multiple repetitions of a single experiment. Derive the true error bars by computing a standard deviation for each gene directly from the multiple experiments.

Comparison of feature selection metrics for microarray expression data

An important task in analyzing microarray expression data is the identification of a subset of genes that are relevant to a particular phenotype. A number of feature ranking metrics have been proposed for this task. The purpose of this project is to evaluate the relative utility of these metrics.

Golub et al. proposed a simple ranking scheme: for each gene, compute the mean and standard deviation of the expression measurements within each of two pre-defined classes. The ranking metric is the absolute value of the difference between the means, divided by the sum of the standard deviations. This metric is closely related to Student's t statistic, as well as the Fisher criterion score. Another metric is the Pearson or Spearman correlation coefficient between the expression levels and the binary class labels. Park et al. proposed a nonparametric scoring scheme based upon the Wilcoxon signed rank test.
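
The first of these metrics is simple enough to sketch directly. The example below uses the sample standard deviation from Python's `statistics` module (whether to use the sample or population form is a choice the description leaves open), requires at least two experiments per class, and uses hypothetical function names.

```python
from statistics import mean, stdev


def golub_score(values, labels):
    """|mean1 - mean0| / (sd1 + sd0) for one gene across all experiments."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    spread = stdev(class0) + stdev(class1)
    if spread == 0:
        # Assumption: constant values within both classes rank first if the
        # class means differ, and last otherwise.
        return float("inf") if mean(class0) != mean(class1) else 0.0
    return abs(mean(class1) - mean(class0)) / spread


def rank_genes(matrix, labels):
    """matrix maps gene name -> list of expression values, one per experiment.

    Returns gene names sorted from most to least discriminative."""
    scores = {gene: golub_score(values, labels) for gene, values in matrix.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The other metrics can be plugged into `rank_genes` in place of `golub_score`, which makes it straightforward to produce one ranked list per metric for comparison.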

The project consists of ranking the features in one or more data sets according to each of these metrics, and comparing the resulting ranked lists. One difficult part of this project is finding a suitable gold standard list of genes.

Input: A matrix of gene expression levels, with a binary class label for each experiment.

Output: A ranked list of genes.



Comparison of motif discovery algorithms

Evaluate the sensitivity and selectivity of motif discovery algorithms for locating transcription factor binding sites in promoter regions. You can do this using simulated data, which I will provide upon request.


Comparison of multiple sequence alignment algorithms

Evaluate several multiple alignment programs using the BAliBASE test suite. Candidate programs to evaluate are HMMER and CLUSTALW, though you can choose whatever programs you'd like (or write your own).

Locating gene transcripts from EST data

The gene-finding task can be usefully decomposed into two subproblems: (1) locate a region of genomic DNA that contains a gene, and (2) predict the gene transcript (intron-exon structure) within the located region. The purpose of this project is to evaluate the utility of expressed sequence tag (EST) information for addressing the first problem.

You will use the EST information to attempt to predict the locations of genes within human chromosomes 21 and 22. You will use as a gold standard annotations provided by the Sanger Center. You should report the sensitivity and specificity of your method, and compare them with similar values from a simple algorithm that looks for open reading frames.
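
One way to score predicted regions against the gold standard is at the nucleotide level, as sketched below. The definitions used here are the ones customary in gene finding, which is an assumption about what the project intends: sensitivity is the fraction of annotated positions that are predicted, and specificity is the fraction of predicted positions that are annotated. Coordinates are taken to be inclusive, and the set-based approach is only practical for small regions; a whole chromosome would call for interval arithmetic instead.

```python
def covered_positions(intervals):
    """Expand inclusive (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def sensitivity_specificity(predicted, annotated):
    """Nucleotide-level sensitivity and specificity of predicted regions."""
    pred = covered_positions(predicted)
    gold = covered_positions(annotated)
    true_pos = len(pred & gold)
    sensitivity = true_pos / len(gold) if gold else 0.0
    specificity = true_pos / len(pred) if pred else 0.0
    return sensitivity, specificity
```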

If you are interested in this project, please contact Victoria Haghighi, who will assist in interpreting the data files below. We will also provide you with more detailed suggestions about how best to interpret the EST data and make predictions from it.

Input: A complete chromosome or the corresponding aligned EST data.

Output: A prediction (preferably in GFF) of chromosomal regions in which genes are likely to occur.

Building a better null model for BLAST

There are a lot of aspects of the statistical structure of genomic DNA that are not taken into account by the BLAST null model, which is designed primarily for efficiency. After running BLAST to get a few thousand hits, these hits could be rescored using a more sophisticated null model, and perhaps borderline hits of real biological significance will move closer to the top of the list. Currently many of these borderline hits are lost in the noise, and this borderline area is where many of the more interesting discoveries are to be made.

Comparison of nematode genomes

There are two related nematode genomes, C. elegans and C. briggsae. A lot can be learned from comparing two related genomes. For example, what features of the corresponding genes between these two organisms are conserved? Of special interest would be intron features, and features of the promoter region. Transcription factor binding sites in promoter regions can often only be discovered by bioinformatic methods that use direct comparison of promoter regions from the corresponding genes from related organisms. Use a tool such as this one to pull out some common features between the genes of these two organisms. Then do this comparison and some preliminary exploration of the new discoveries that pop up.

Support vector machine analysis of microarray expression data

Support vector machines have been used successfully to classify genes into functional categories. Several follow-up projects are possible:
  1. determine how well this method generalizes to other data sets,
  2. investigate alternative representations of the original data set or alternative kernel functions, or
  3. use the SVM to refine clusters initially learned by an unsupervised method.

You should replicate the methods described in the paper listed below. The SVM software is available here.

Talk to me or Paul Pavlidis for more details and project suggestions.



Correlate gene expression and promoter motif patterns in C. elegans

The following projects were suggested by Jim Kent:

I'd encourage people to look for motifs in the promoter region of the worm data, especially in the ones that have C. elegans homology. This is a project I've built all sorts of tools for (the Improbizer, the C. elegans/C. briggsae alignments, the Intronerator in general) but will probably never get to work on, between my splicing related dissertation research and the human genome stuff. If you or some of your students can exploit these it would be wonderful. Just running Improbizer on the upstream regions of all genes named unc something (or on the other major gene classes) actually gives a very strong motif I haven't had time to do anything but notice.

I can think of two good projects on this:

It might be a good project for two people where one finds the Kim data and clusters it, and the other does the Improbizer stuff initially on unc* and other gene families, and then at the very end on the clusters.

Improve PSI-BLAST using sequence weighting

Attempt to improve PSI-BLAST by using weighting with respect to the query sequence, as opposed to weighting with respect to the profile. This project would involve implementing your own approximation of PSI-BLAST, since the program source code is not available. You could use an existing BLAST binary as the inner loop, and write a Perl script to do the iterations. The goal would be to prevent PSI-BLAST from iteratively pulling in false positives by weighting every training set sequence with respect to the original query, rather than with respect to the entire training set (which may be corrupted by false positives).

Evaluation of the ESIZE algorithm

The ESIZE algorithm calculates the degree of redundancy in a given set of proteins and summarizes this information in a single number. The purpose of this project is to test the reliability and usefulness of the ESIZE statistic.

First, test the reliability of the ESIZE statistic by running it repeatedly on a set of protein families using different numbers and types of random queries. Second, investigate the relationship between effective family size on the one hand and percent sequence identity, pairwise sequence p-values, and other measures of sequence similarity on the other. For example, a comparable statistic could be derived from the purge program, which removes from a given data set all proteins with pairwise BLAST similarity scores higher than a given threshold. Third, characterize existing databases in terms of the esizes of the families represented.

The ESIZE algorithm is available as part of the FPS software. Purge is available as a Solaris binary.

Input: A collection of protein sequences.

Output: A single number that summarizes the degree of similarity among the sequences.



Use information retrieval techniques to analyze Medline abstracts

Most people are familiar with the task performed by information retrieval algorithms via web search engines. The goal of this project is to apply these techniques to the analysis of the scientific literature. For example, you can begin by reproducing the method given in the paper below. The method consists of representing Medline abstracts using fixed-length vectors, and then querying this database of vectors to discover gene-pair relationships.


Automated clustering of C. elegans proteins

Try to cluster the worm proteins with different clustering algorithms, based on sequences alone or based on a subset of proteins with known structural domains. Compare these results to one another or to an existing clustering such as Pfam or ProtoMap.

Remote homology assessment of proteins related to the nuclear receptor superfamily

This project was suggested by and would involve collaboration with Joe Thornton, who is a Research Fellow in the Center for Environmental Research and Conservation and in the Department of Biological Sciences. Here is Joe's description:
The purpose of this project is to identify proteins or protein families from which the nuclear receptor superfamily originated during evolution. The nuclear receptor superfamily is a large group of transcriptional regulators that play essential roles in development and physiology throughout metazoans. Many are ligand-activated transcription factors, and the superfamily includes the receptors for estrogens, androgens, thyroid hormones, retinoic acids, ecdysone, and other essential hormones.

The nuclear receptors have been detected throughout the metazoa, but none are present in fungi, plants, protists, bacteria, or archaea. How then did these biologically essential proteins originate? Presumably they are highly diverged descendants of other proteins, but standard homology search techniques do not detect meaningful relationships between nuclear receptors and any other protein families.

We will use new techniques for remote homology assessment to identify proteins that are related to but not members of the nuclear receptor superfamily. Results of this identification will allow important insights into the evolution of novel functions at the biochemical level.

Evaluate a new, hybrid HMM/progressive alignment method for multiple sequence alignment

Evaluate a new multiple alignment method developed by Felix Sheinerman. The method is designed to align sequences of all members of a large protein family starting from a "seed" alignment. The seed alignment may be obtained, e.g., from structural superposition or from an "expert" alignment. Examples of input files and general information on how the method works can be found here. (Note that this page is only accessible from Columbia URLs.) To run the method, you can use this set of Perl scripts. You will also need to install the HMMER hidden Markov model toolkit and the ClustalW multiple alignment software. You can compare the performance of this method to that of ClustalW alone, using the BAliBASE evaluation database.

If you plan to work on this project, you should contact Felix Sheinerman in the Department of Biochemistry & Molecular Biophysics (212 305-8172).

Build hidden Markov models by combining structure and sequence alignments

Probability models of protein families allow for the detection of remote protein homologs. However, when the similarity between a query sequence and a model is very low, it is still difficult to detect structural and functional relationships. A complementary approach is to incorporate structure alignment, because structure is more conserved than sequence through evolution. However, structural alignment suffers from two major problems. First, there exist large insertions and deletions in many parts of structural alignments. Second, it is more difficult to obtain the optimal multiple structural alignment than the optimal multiple sequence alignment, especially in loop regions. Thus, significant sequence signals are attenuated in the structural alignments.

One approach to overcoming the above difficulties is to build a combined HMM from both structure and sequence alignment. This may involve the following steps. Firstly, core structure regions and non-core regions will be defined from the multiple structure alignment. Then, several small HMMs will be trained for the core regions from the multiple structure alignment and for the non-core regions from the multiple sequence alignment, respectively. Last, these trained small HMMs will be connected as a single linear model. The resulting connected HMM will be used to search a database.

If you are interested in this project, please contact Lei Xie in the Department of Biochemistry & Molecular Biophysics.

Iterative evolutionary distance adjusted pairwise sequence alignment

It is believed that the alignment of two sequences performed with a similarity matrix corresponding to their evolutionary distance will be optimal. The project is to write a program that will align two protein sequences using a distance matrix that is optimal for the sequences. This is accomplished by the following iterative loop:
  1. The sequences are aligned.
  2. The distance between the two sequences is calculated based upon the alignment.
  3. Step 1 is repeated using the distance calculated in step 2.
The loop is repeated until it converges upon a single PAM matrix or a range of PAM matrices.
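
The control flow of this loop can be sketched independently of any particular alignment program. In the example below, `align` and `estimate_pam` are placeholders that the caller must supply (for instance, wrappers around existing programs); they are not real library functions, and the starting PAM value and round limit are arbitrary assumptions.

```python
def iterate_alignment(seq_a, seq_b, align, estimate_pam, start_pam=120, max_rounds=20):
    """Alternate alignment and distance estimation until a PAM value repeats.

    align(seq_a, seq_b, pam) -> alignment         (caller-supplied placeholder)
    estimate_pam(alignment) -> integer PAM value  (caller-supplied placeholder)

    Returns (alignment, pams), where pams lists every PAM value visited,
    ending with the first repeated value if one was reached."""
    pams = [start_pam]
    alignment = align(seq_a, seq_b, start_pam)
    for _ in range(max_rounds):
        next_pam = estimate_pam(alignment)
        if next_pam in pams:
            # Same value as the last round: a single matrix was converged upon.
            # An earlier value: the loop is cycling through a range of matrices.
            pams.append(next_pam)
            break
        pams.append(next_pam)
        alignment = align(seq_a, seq_b, next_pam)
    return alignment, pams
```

If the last two entries of `pams` are equal, the loop settled on one matrix. Otherwise the values between the two occurrences of the repeated entry give the range of matrices the loop cycles through.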

The project would involve writing such a program and testing the resulting alignments by comparing them to structurally derived alignments. The programs could also be tested by comparing database search performance with and without the iterative component.

The alignment can be performed either by writing code that will align two sequences and measure their distance or by writing a script that will run existing programs to do this. If the latter option is picked, then FASTA would be a good alignment program and ClustalW a good distance analysis program.

Input: Two protein sequences.

Output: An alignment between the two sequences.


Richard Friedman is available for consultation ((212) 305-6901).