Project References

Project Resources and References

General References:

The following links provide access to general bioinformatics journals, conferences, datasets and research organizations.

Journals: Bioinformatics journal resources
Conferences: Bioinformatics conferences
Datasets: Experimental datasets
Other links: Bioinformatics organizations

Specific References:

Below is a selection of representative bioinformatics research papers in central computational areas. These papers focus primarily on computational and learning techniques for analyzing biological data rather than on solving specific biological problems. They may help you choose a project focus: either applying these or similar techniques to specific biological problems, or developing and prototyping variants of these methods.

Statistical analysis of microarray data (Normalization and noise analysis)

Gene expression measurements are very noisy, and separating noise from true signal is an interesting statistical problem.

Noise in cDNA gene expression microarrays (pdf)
Noise in Affymetrix oligonucleotide gene expression arrays (pdf)
Noise in SAGE tag gene expression arrays (pdf)
Noise in MPSS tag gene expression arrays (pdf)

Clustering Gene Expression Data and Inferring Gene regulatory networks

Microarray technology allows us to measure the expression of genes spanning the entire genome of an organism. Clustering this expression data helps identify genes that behave similarly, which in turn simplifies the analysis of datasets. A more interesting problem is to identify gene regulatory networks, i.e., how different genes interact with and control each other under diverse conditions.
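As a rough illustration of what "clustering genes that behave similarly" can mean, here is a minimal sketch of average-linkage hierarchical clustering with a correlation-based distance. The gene names and expression values are made up for the example, and the function names (`pearson`, `average_linkage`) are hypothetical helpers, not taken from any of the papers below. The brute-force merge loop is only suitable for small toy inputs.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # flat profile: treat as uncorrelated
    return cov / (sx * sy)


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    profiles: dict mapping gene name -> list of expression values.
    Distance between genes is 1 - Pearson correlation; distance between
    clusters is the mean pairwise gene distance (average linkage).
    """
    genes = list(profiles)
    dist = {frozenset((g, h)): 1.0 - pearson(profiles[g], profiles[h])
            for g, h in combinations(genes, 2)}
    clusters = [[g] for g in genes]

    def cluster_distance(a, b):
        total = sum(dist[frozenset((g, h))] for g in a for h in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # combinations() yields i < j, so deleting j leaves index i valid.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]],
                                                   clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Made-up expression profiles over four conditions (illustration only).
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.1, 4.0, 2.1],
}
toy_clusters = average_linkage(toy_profiles, 2)
print(toy_clusters)
```

In this toy input, geneA and geneB rise together while geneC and geneD fall together, so the two groups separate. Using correlation rather than Euclidean distance means genes are grouped by the shape of their profile, not its magnitude.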

Cluster analysis and display of genome-wide expression patterns. Eisen, Spellman, Brown and Botstein. (pdf, ps)
Principal component analysis for clustering gene expression data. Yeung and Ruzzo. (pdf, ps)
Inferring subnetworks from perturbed expression profiles. Dana Pe'er, Aviv Regev, Gal Elidan, Nir Friedman. (pdf, ps)
Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Hartemink, Gifford, Jaakkola and Young. (pdf, ps)
Predicting gene regulatory response using classification Link
Papers by Eran Segal Link
Papers by Nir Friedman Link

Regulatory Motif discovery

Gene expression is regulated by activating and repressing proteins that bind to short nucleotide sequences (motifs), generally upstream of the genes. Identifying these motifs computationally is a very interesting problem.
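To make the problem concrete, here is a minimal sketch of the simplest enumerative approach: look for short words (k-mers) that occur in more of the upstream sequences of a gene set of interest than in a background set. This is only a baseline under stated assumptions, not the method of any paper listed below. The sequences are invented, and `kmer_presence` and `enriched_kmers` are hypothetical helper names.

```python
from collections import Counter


def kmer_presence(sequences, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def enriched_kmers(foreground, background, k, top=3):
    """Rank k-mers by the difference in the fraction of sequences containing them.

    foreground: upstream sequences of the genes of interest.
    background: upstream sequences of unrelated genes (must be non-empty).
    """
    fg = kmer_presence(foreground, k)
    bg = kmer_presence(background, k)
    scores = {kmer: fg[kmer] / len(foreground) - bg[kmer] / len(background)
              for kmer in fg}
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Invented upstream sequences sharing one 7-mer (illustration only).
toy_foreground = ["AATGACTCATT", "GGTGACTCAAC", "CCTGACTCAGG"]
toy_background = ["AAAAAAAAAAA", "CCCCCCCCCCC", "GGGGGGGGGGG"]
toy_top = enriched_kmers(toy_foreground, toy_background, 7, top=1)
print(toy_top)
```

Real motifs are degenerate (binding sites vary from gene to gene), which is why the references below use probabilistic models such as mixture models fitted by expectation maximization rather than exact word counts.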

Regulatory element detection using correlation with expression (Bussemaker et al.) Link
From promoter sequence to expression: a probabilistic framework Link
Motif discovery using gene expression and sequence data Link
Papers by Itzhak Pilpel Link
Papers by Saeed Tavazoie Link
Modeling dependencies in protein-DNA binding sites Link
Fitting a mixture model by expectation maximization to find motifs in biopolymers Link
Finding composite regulatory patterns in DNA sequences Link

Protein structure prediction and Protein classification (Sequence to Structure)

Predicting protein structure is one of the hardest tasks in computational biology.

Towards predicting coiled-coil protein interactions. Mona Singh, Peter S. Kim. (pdf, ps)
Predicting the beta-helix fold from protein sequence data. Phil Bradley, Lenore Cowen, Matthew Menke, Jonathan King, Bonnie Berger. (pdf, ps)
Papers by Burkhard Rost Link
Papers on protein classification and ranking by Christina Leslie Link
Papers by Mona Singh Link
Advanced papers at CASP (a worldwide contest for protein structure prediction) Link. Check the abstracts submitted to each CASP contest for links to authors.
A discriminative framework for detecting remote protein homologies. Jaakkola, Diekhans and Haussler. (pdf, ps)
An Introduction to Hidden Markov Models for Biological Sequences. Anders Krogh. (pdf, ps)
Papers on protein structure prediction PSB2002 Link

Gene Finding, Gene Structure, and Splicing

Massive database of publications on Gene finding Link
Computational identification of promoters and first exons in the human genome. Davuluri, Grosse and Zhang. (pdf, ps)
Integrating genomic homology into gene structure prediction. Ian Korf, Paul Flicek, Daniel Duan, Michael R. Brent (pdf, ps)
Prediction of complete gene structures in human genomic DNA. Burge and Karlin. (pdf, ps)
Engineering support vector machine kernels that recognize translation initiation sites. Zien, Ratsch, Mika, Scholkopf, Lengauer, and Muller. (pdf, ps)

Phylogenetics, phylogenomics, Haplotype mapping, Comparative Genomics and Single nucleotide polymorphisms

Single nucleotide polymorphisms: The Human Genome Project has led to considerable progress in understanding and characterizing variation in the human genome. A dense collection of sequence variants (i.e., genetic markers) has been mapped across the genome, which will help researchers identify disease-causing sequence variants. Most stable variation in the genome occurs in the form of single-nucleotide polymorphisms (SNPs), which represent about 90% of the common variation in the genome. Each SNP arises through a single mutation event in the history of the human population, and the likelihood of recurrent mutation at the same site is low. Consequently, SNPs are stable genetic markers. The extensive repository of these SNP markers provides a tool for discovering the genetic basis of common complex diseases (those due to multiple interacting genes and the environment).

The approach involves typing a large number of SNP markers, in a set of candidate genes thought to be functionally significant in the manifestation of the disease of interest, using case-control samples. The expectation is that SNPs associated with the disease will have a different profile in the case sample than in the control sample. There are a number of possible learning-based approaches to this problem. One could view the SNPs as (typically binary-valued) features for a multi-class classification problem (the classes correspond to phenotypes or diseases) and use standard supervised learning techniques to train a classifier. More meaningful to a medical researcher or biologist, perhaps, would be a probabilistic model for this data that is somewhat more involved than the one implicitly used by population geneticists: for example, a graphical model that allows for interaction between SNPs in different candidate genes, producing SNP-based network configurations that distinguish the case and control populations.
Any interesting learning approach applied to this data would be a novel contribution and an interesting project.
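
As a starting point for such a project, the sketch below shows the simplest single-SNP baseline: for each SNP, compare how often an allele is carried in cases versus controls using a Pearson chi-square statistic on a 2x2 table. It treats every SNP independently, so it is not the graphical model suggested above, only something to compare against. The data layout (one 0/1 list per individual) and the names `chi_square_2x2` and `rank_snps` are assumptions made for this example.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 d.f.) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table: the SNP does not vary or a group is empty
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def rank_snps(cases, controls):
    """Rank SNPs by how differently an allele is distributed in the two groups.

    cases, controls: lists of individuals, each a list of 0/1 values
    (1 = carries the allele at that SNP). Returns (snp_index, statistic)
    pairs, most strongly associated first.
    """
    n_snps = len(cases[0])
    scores = []
    for j in range(n_snps):
        a = sum(ind[j] for ind in cases)      # cases carrying the allele
        c = sum(ind[j] for ind in controls)   # controls carrying the allele
        b, d = len(cases) - a, len(controls) - c
        scores.append((j, chi_square_2x2(a, b, c, d)))
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Invented genotypes: 4 cases and 4 controls typed at 3 SNPs (illustration only).
toy_cases = [[1, 0, 1], [1, 1, 0], [1, 0, 1], [1, 1, 0]]
toy_controls = [[0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0]]
toy_ranking = rank_snps(toy_cases, toy_controls)
print(toy_ranking)
```

With real data the sample sizes would be far larger, the statistic would be converted to a p-value, and a correction for testing many SNPs would be needed. None of that is shown here.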

Haplotype mapping: Here is some information about the Haplotype Mapping project ("HapMap") from the National Institutes of Health: "Sites in the genome where individuals differ in their DNA sequence by a single base are called single nucleotide polymorphisms (SNPs). Recent work has shown that there are about 10 million SNPs that are common in human populations. SNPs are not inherited independently; rather, sets of adjacent SNPs are inherited in blocks. The specific pattern of particular SNP alleles in a block is called a haplotype. Recent studies show that most haplotype blocks in the human genome have been transmitted through many generations without recombination. Furthermore, each block has only a few common haplotypes. This means that although a block may contain many SNPs, it takes only a few SNPs to uniquely identify or 'tag' each of the haplotypes in the block." Computational approaches are being developed for determining haplotype blocks from the genotype data of many individuals, as well as for associating haplotypes with disease. I list just a few references; more references can be found in the bibliography for the second paper. A possible project would be to implement one of these haplotype algorithms and test it on a small dataset; extensions of these methods or comparisons between methods would be very interesting.
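
The "tagging" idea in the quotation can be illustrated with a small sketch: given the few common haplotypes in a block, find a smallest set of SNP positions whose alleles already tell the haplotypes apart. The 0/1 string encoding of haplotypes and the function names are assumptions made for this example, and the exhaustive search is only practical for small blocks; it is not one of the published haplotype algorithms referenced below.

```python
from itertools import combinations


def distinguishes(haplotypes, snps):
    """True if the chosen SNP positions give every distinct haplotype its own pattern."""
    patterns = {tuple(h[i] for i in snps) for h in haplotypes}
    return len(patterns) == len(set(haplotypes))


def smallest_tag_set(haplotypes):
    """Return a smallest list of SNP positions that tags each distinct haplotype.

    haplotypes: list of equal-length strings over {'0', '1'}, one per common
    haplotype in the block. Subsets are tried in order of increasing size,
    so the first one that works is minimal. The full SNP set always works,
    so the loop always returns.
    """
    n_snps = len(haplotypes[0])
    for size in range(n_snps + 1):
        for snps in combinations(range(n_snps), size):
            if distinguishes(haplotypes, snps):
                return list(snps)


# An invented block of 6 SNPs with 4 common haplotypes (illustration only).
toy_block = ["000000", "001101", "110010", "111111"]
toy_tags = smallest_tag_set(toy_block)
print(toy_tags)
```

In this toy block, SNPs 0, 1 and 4 always carry the same information, as do SNPs 2, 3 and 5, so one SNP from each group is enough to identify all four haplotypes.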

Linkage disequilibrium (LD): a phenomenon that arises when two chromosomal locations (two markers, two genes, two loci) are so close to each other that there have been few or no historical (ancestral) recombination events between them. This lack of crossing-over has several effects. If one of the locations is where the disease gene lies, the marker that is in LD with the gene is "hitchhiked" ("dragged along") with the disease gene. Thus persons affected with the disease tend to carry a particular allele at this marker.
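
LD between two biallelic loci is commonly quantified with the coefficient D (the difference between the observed haplotype frequency and the frequency expected under independence), its normalized form D', and the squared correlation r^2. The sketch below computes these from counts of the four two-locus haplotypes. The argument names and the toy counts are assumptions made for the example, and D' is returned with its sign.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r_squared) from counts of the four two-locus haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at the first locus
    p_B = (n_AB + n_aB) / total   # frequency of allele B at the second locus
    d = p_AB - p_A * p_B          # observed minus expected under independence
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        return 0.0, 0.0, 0.0      # a monomorphic locus carries no LD information
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return d, d / d_max, d * d / denom


# Invented haplotype counts (illustration only).
complete_ld = ld_measures(50, 0, 0, 50)      # alleles A and B always travel together
no_ld = ld_measures(25, 25, 25, 25)          # alleles combine independently
print(complete_ld, no_ld)
```

When the marker allele always travels with the disease allele, as in the "hitchhiking" case described above, both D' and r^2 reach 1; when the two loci combine at random, all three measures are 0.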

Phylogenomic inference of protein molecular function: advances and challenges (An excellent review by Kimmen Sjolander) (pdf)
Other papers from the Berkeley Phylogenomics Lab Link
Other research projects from the Berkeley Phylogenomics Lab Link
Journal of Molecular Biology and Evolution Link
Haplotyping as perfect phylogeny: conceptual framework and efficient solutions (pdf)
Large-scale reconstruction of haplotypes from genotype data (pdf)
Haplotype analysis papers by Eran Halperin Link
Papers by Mike Eisen's group Link
Papers on Comparative genomics by Manolis Kellis Link