A Bayesian Approach to Motif-Based Protein Modeling

William Noble Grundy
Department of Computer Science & Engineering
University of California, San Diego

 

Abstract

The Human Genome Project and similar work on other species produce biological sequence data at an accelerating rate. The GenBank database of publicly available DNA sequences currently contains over 1.6 million entries. Sophisticated computational tools are required to analyze this wealth of data. Hidden Markov models (HMMs) provide a theoretically justified means of representing families of related proteins. These models provide insight into the structural and functional operation of the proteins in question, and may be used to discover functional and evolutionary relationships between protein sequences.

This work addresses two major shortcomings of the standard approach to protein modeling. First, linear HMMs typically contain several thousand parameters and therefore require large training sets, on the order of 200 protein sequences. Second, these HMMs imply a relatively simple model of molecular evolution. The Meta-MEME software toolkit builds motif-based HMMs that focus on the biologically important motif regions, which are highly conserved throughout the protein family due to functional or structural constraints. Meta-MEME models are smaller than standard HMMs, allowing for smaller training sets and faster database searching. Furthermore, Meta-MEME employs a non-linear topology that allows for the representation of large-scale evolutionary events, such as the deletion, copying, and shuffling of protein domains.

The models produced by Meta-MEME give biologists insight into the general characteristics of a family of related proteins. The models may also be used to produce multiple alignments and to search for remote homologs. For smaller training sets, Meta-MEME detects homologs more accurately than standard HMMs.
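
The model-size argument above can be made concrete with a rough count of free parameters. The sketch below is illustrative only. It assumes a 20-letter amino acid alphabet, a linear HMM with match, insert, and delete states at each position, and a motif-based model in which only motif columns carry emission distributions and each spacer region contributes a single parameter. The function names, the model length of 100, and the motif widths are hypothetical and are not taken from this work.

```python
ALPHABET = 20  # number of standard amino acids


def linear_hmm_params(model_length: int) -> int:
    """Rough free-parameter count for a linear HMM (assumed architecture).

    Assumes three states per position (match, insert, delete), with
    emission distributions on the match and insert states and three
    outgoing transitions from every state.
    """
    emissions = 2 * (ALPHABET - 1)  # each distribution sums to 1
    transitions = 3 * (3 - 1)       # three states, three-way choices
    return model_length * (emissions + transitions)


def motif_hmm_params(motif_widths: list[int]) -> int:
    """Rough free-parameter count for a motif-based model (assumed layout).

    Assumes one emission distribution per motif column and one
    parameter for each spacer before, between, and after the motifs.
    """
    motif_emissions = sum(motif_widths) * (ALPHABET - 1)
    spacers = len(motif_widths) + 1
    return motif_emissions + spacers


if __name__ == "__main__":
    # Hypothetical family: 100 modeled positions versus three motifs.
    print("linear HMM:", linear_hmm_params(100))
    print("motif-based HMM:", motif_hmm_params([15, 20, 12]))
```

Under these assumptions the linear model has several thousand free parameters while the motif-based model has under a thousand, which is the kind of gap that lets a smaller training set suffice. The exact numbers depend entirely on the assumed architecture and sizes.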


