Project Resources and References

General References:

The following links provide access to general bioinformatics journals, conferences, datasets and research organizations.

Specific References:

Below is a selection of representative bioinformatics research papers in central computational areas. These papers focus primarily on computational and learning techniques for analyzing biological data rather than on solving specific biological problems. They may help you choose a project focus: either applying these or similar techniques to a specific biological problem, or developing and prototyping variants of these methods.

Statistical analysis of microarray data (Normalization and noise analysis)

Gene expression measurements are very noisy, and separating true signal from noise is an interesting statistical problem.
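As a small illustration of what normalization can mean in practice, the sketch below log-transforms raw intensities and median-centers each array so that arrays scanned at different overall brightness become comparable. The function name and the intensities are made up for illustration, and median-centering is only one simple choice, not a method prescribed by the papers.

```python
import math
from statistics import median

def median_center_log2(arrays):
    """Log2-transform each array's intensities and subtract the array median.

    `arrays` maps a (hypothetical) array name to a list of positive raw
    intensities, one per gene, in the same gene order for every array.
    """
    normalized = {}
    for name, intensities in arrays.items():
        logged = [math.log2(x) for x in intensities]
        center = median(logged)
        normalized[name] = [value - center for value in logged]
    return normalized

# Made-up intensities: array_b is array_a scanned twice as brightly.
raw = {
    "array_a": [100.0, 200.0, 400.0, 800.0, 1600.0],
    "array_b": [200.0, 400.0, 800.0, 1600.0, 3200.0],
}
norm = median_center_log2(raw)
for name, values in norm.items():
    print(name, [round(v, 3) for v in values])
```

After centering, the two arrays carry the same relative expression values, so the global brightness difference no longer masquerades as differential expression.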

Clustering Gene Expression Data and Inferring Gene regulatory networks

Microarray technology allows us to measure the expression of genes across the entire genome of an organism. Clustering this expression data helps identify genes that behave similarly, which simplifies the analysis of large datasets. A more interesting problem is to infer gene regulatory networks, i.e., how different genes interact with and control each other under diverse conditions.
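To make "genes that behave similarly" concrete, here is a minimal sketch that groups genes whose expression profiles are linked by a high Pearson correlation (connected components at a threshold). This is a deliberately simple stand-in for the clustering methods in the literature; the gene names, profiles, and the 0.9 threshold are assumptions chosen for illustration.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def correlation_clusters(profiles, threshold=0.9):
    """Group genes connected by pairwise correlation >= threshold."""
    genes = list(profiles)
    cluster_of = {gene: index for index, gene in enumerate(genes)}
    for i, gene in enumerate(genes):
        for other in genes[i + 1:]:
            if pearson(profiles[gene], profiles[other]) >= threshold:
                # Merge the two clusters by relabeling one of them.
                old, new = cluster_of[other], cluster_of[gene]
                for name in genes:
                    if cluster_of[name] == old:
                        cluster_of[name] = new
    clusters = {}
    for gene in genes:
        clusters.setdefault(cluster_of[gene], []).append(gene)
    return sorted(clusters.values())

# Hypothetical expression profiles over four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.1, 4.0, 2.0],
}
clusters = correlation_clusters(profiles)
print(clusters)
```

Correlation ignores absolute expression level, so geneA and geneB group together even though geneB is roughly twice as strongly expressed. Inferring which gene regulates which is a much harder problem that this sketch does not attempt.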

Regulatory Motif discovery

Gene expression is regulated by activating and repressing proteins that bind to short nucleotide sequences (motifs), generally upstream of the genes they regulate. Identifying these motifs computationally is a very interesting problem.
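One common way to represent such a motif is a position weight matrix (PWM). The sketch below builds a log-odds PWM from a few aligned binding sites and scans an upstream sequence for the best-scoring window. Note that this is scanning with a known motif, the building block that de novo motif finders optimize over, not motif discovery itself. The sites, the sequence, the uniform background, and the pseudocount are all assumptions for illustration.

```python
import math

def build_pwm(sites, pseudocount=1.0):
    """Log-odds PWM from equal-length aligned sites, uniform background."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return pwm

def best_match(sequence, pwm):
    """Return (score, start) of the highest-scoring window in `sequence`.

    The sequence may contain only the letters A, C, G, and T.
    """
    width = len(pwm)
    scored = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)

# Hypothetical aligned binding sites and upstream sequence.
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
upstream = "GGCGCTATAATGCGC"
pwm = build_pwm(sites)
score, start = best_match(upstream, pwm)
print(start, upstream[start:start + len(pwm)], round(score, 2))
```

The pseudocount keeps unseen bases from receiving a score of negative infinity, which matters when only a handful of sites are known.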

Protein structure prediction and Protein classification (Sequence to Structure)

Predicting protein structure from sequence is one of the hardest problems in computational biology.

Gene Finding, Gene Structure, and Splicing

Phylogenetics, phylogenomics, Haplotype mapping, Comparative Genomics and Single nucleotide polymorphisms

Single nucleotide polymorphisms: The Human Genome Project has led to considerable progress in understanding and characterizing variation in the human genome. A dense collection of sequence variants (i.e., genetic markers) has been mapped across the genome, which will help researchers identify disease-causing sequence variants. Most stable variation in the genome takes the form of single-nucleotide polymorphisms (SNPs), which represent about 90% of the common variation in the genome. Each SNP typically arises through a single mutation event in the history of the human population, and the likelihood of recurrent mutation at the same site is low. Consequently, SNPs are stable genetic markers. The extensive repository of SNP markers provides a tool for discovering the genetic basis of common complex diseases (those due to multiple interacting genes and the environment). The approach involves typing a large number of SNP markers in a set of candidate genes thought to be functionally significant in the manifestation of the disease of interest, using case-control samples. The expectation is that SNPs associated with the disease will have a different profile in the case sample than in the control sample.

There are a number of possible learning-based approaches to this problem. One could view the SNPs as (typically binary-valued) features for a multi-class classification problem (the classes correspond to phenotypes or diseases) and use standard supervised learning techniques to train a classifier. More meaningful to a medical researcher or biologist, perhaps, would be a probabilistic model for this data that is somewhat more involved than the one implicitly used by population geneticists: for example, a graphical model that allows for interaction between SNPs in different candidate genes, producing SNP-based network configurations that distinguish the case and control populations. Any interesting learning approach applied to this data would be a novel contribution and an interesting project.
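Before trying a learning-based approach, it can help to have a single-SNP baseline to compare against. The sketch below runs a chi-square test on a 2x2 table of allele counts (cases vs. controls) for each SNP, treating every SNP independently. The SNP names and counts are invented, and this per-SNP test is only an assumed baseline, not the method of any particular paper.

```python
import math

def allele_chi_square(case_minor, case_major, control_minor, control_major):
    """Chi-square statistic (1 degree of freedom) and p-value for a 2x2 table."""
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0, 1.0  # a row or column is empty: no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical allele counts:
# (case minor, case major, control minor, control major).
snp_counts = {
    "snp1": (60, 140, 30, 170),
    "snp2": (50, 150, 48, 152),
}
results = {name: allele_chi_square(*counts) for name, counts in snp_counts.items()}
for name, (chi2, p_value) in results.items():
    print(f"{name}: chi2={chi2:.2f}, p={p_value:.4f}")
```

A test like this cannot capture interactions between SNPs in different candidate genes, which is exactly the gap a classifier or graphical model would aim to fill. With many SNPs, the p-values would also need a multiple-testing correction.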

Haplotype mapping: Here is some information about the Haplotype Mapping project ("HapMap") from the National Institutes of Health: "Sites in the genome where individuals differ in their DNA sequence by a single base are called single nucleotide polymorphisms (SNPs). Recent work has shown that there are about 10 million SNPs that are common in human populations. SNPs are not inherited independently; rather, sets of adjacent SNPs are inherited in blocks. The specific pattern of particular SNP alleles in a block is called a haplotype. Recent studies show that most haplotype blocks in the human genome have been transmitted through many generations without recombination. Furthermore, each block has only a few common haplotypes. This means that although a block may contain many SNPs, it takes only a few SNPs to uniquely identify or 'tag' each of the haplotypes in the block." Computational approaches are being developed for determining haplotype blocks from the genotype data of many individuals, as well as for associating haplotypes with disease. I list just a few references; more can be found in the bibliography of the second paper. A possible project would be to implement one of these haplotype algorithms and test it on a small dataset; extensions of these methods or comparisons between methods would be very interesting.
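The idea that a few SNPs can "tag" every haplotype in a block can be sketched with a brute-force search for the smallest set of SNP positions that still tells all observed haplotypes apart. The haplotypes below are invented, alleles are coded as '0' and '1', and exhaustive search is only practical for small blocks; published tag-SNP methods use more scalable strategies.

```python
from itertools import combinations

def minimal_tag_snps(haplotypes):
    """Smallest tuple of SNP positions that distinguishes all distinct haplotypes.

    `haplotypes` is a list of equal-length strings over '0'/'1', already
    phased and restricted to one block. Subsets are tried in order of size,
    so the first hit is minimal; the full set of positions always works.
    """
    distinct = sorted(set(haplotypes))
    n_snps = len(distinct[0])
    for size in range(n_snps + 1):
        for positions in combinations(range(n_snps), size):
            signatures = {tuple(h[p] for p in positions) for h in distinct}
            if len(signatures) == len(distinct):
                return positions

# Hypothetical block of five SNPs with four common haplotypes.
block = ["00000", "00000", "11100", "11100", "00111", "11111"]
tags = minimal_tag_snps(block)
print(tags)
```

Here two of the five SNPs are enough to identify each of the four haplotypes, because the remaining SNPs are redundant with them within the block.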

Linkage disequilibrium (LD): a phenomenon that arises when two chromosomal locations (two markers, two genes, two loci) are so close to each other that there is a lack of historical (ancestral) recombination events between them. This lack of crossing-over has several effects. If one of the locations is where the disease gene lies, the marker that is in LD with the gene is "hitchhiked" ("dragged along") with the disease gene. Thus people affected by the disease tend to carry a particular allele at this marker.
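LD between two loci is usually quantified from haplotype frequencies. The sketch below computes two standard measures, D = p(AB) - p(A)p(B) and r^2 = D^2 / (p(A)(1 - p(A))p(B)(1 - p(B))), from phased two-locus haplotypes. The allele labels and haplotype counts are invented for illustration.

```python
def linkage_disequilibrium(haplotypes, allele_a="A", allele_b="B"):
    """Return (D, r_squared) for two loci from phased two-letter haplotypes.

    Each haplotype is a string whose first character is the allele at
    locus 1 and whose second character is the allele at locus 2.
    """
    n = len(haplotypes)
    p_a = sum(1 for h in haplotypes if h[0] == allele_a) / n
    p_b = sum(1 for h in haplotypes if h[1] == allele_b) / n
    p_ab = sum(1 for h in haplotypes if h[0] == allele_a and h[1] == allele_b) / n
    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r^2 is undefined when either locus is monomorphic; report 0.0 then.
    r_squared = d * d / denominator if denominator > 0 else 0.0
    return d, r_squared

# Hypothetical sample: alleles A and B usually travel together.
sample = ["AB"] * 40 + ["ab"] * 40 + ["Ab"] * 10 + ["aB"] * 10
d, r_squared = linkage_disequilibrium(sample)
print(round(d, 3), round(r_squared, 3))
```

An r^2 near 1 means the marker allele almost perfectly predicts the allele at the other locus, which is what makes a marker useful as a proxy for a nearby disease gene.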

Alternative Splicing

Protein-protein interaction

Computational Immunology (Epitope discovery and classification)