Due Date: Friday, March 12 (11:59pm) Note that the due date is the Friday before Spring Break, and that there is also a midterm that week -- start early! Submit electronically via Courseworks, following the instructions posted previously by our head TA Eugene Ie (firstname.lastname@example.org). For theory questions, write your answers in a convenient electronic format -- plain text, pdf, postscript, or doc (if you must!). For other formats, ask the TA first to make sure it will be readable. For programming questions, please submit both your source code and your results/plots (in a standard format like ps, jpg, etc) along with a plain text "readme.txt" file that explains to the TA where everything is.
Suggested languages and tools: We suggest that you use matlab as the programming language for this assignment. We will provide some matlab code as a starting point for the various dynamic programming algorithms associated with alignment and HMM prediction (more complete code is provided for some problems than for others). Matlab has a fairly simple programming language and good predefined functions for matrix manipulations, and the matlab environment allows you to interactively look at current values of variables. Many researchers in machine learning use matlab as a prototyping language. In fact, if you get a chance, you might want to take a look at Kevin Murphy's Bayes Net Toolbox for matlab. (HMMs are a special example of more general graphical models called Bayes Nets.) You won't need the Bayes Net Toolbox for this homework, but you might use it in your final project.
Once again, the TAs will compile a list of online resources, post some tips for doing this assignment in matlab, and run a matlab tutorial. However, if you prefer to use other programming languages for this assignment, just email one of the TAs ahead of time to make sure your choice is acceptable.
Biology Reference: For an interesting recent article on finding both CpG-related and non-CpG-related promoter regions in order to locate first exons, see Davuluri R. V., Grosse I., Zhang M.Q. Computational identification of promoters and first exons in the human genome. Nat Genet. 2001 Dec;29(4):412-7.
Background: The first programming question is about optimal alignment, but for the bulk of this homework, we'll concentrate on developing a Hidden Markov Model for CpG detection and using the model to predict the location of CpG islands in a set of genomic test sequence data. We outline some details of the model here.
We'll use the 8-state model developed in the text, with the simplifying assumption that I will cover in class -- namely, rather than try to estimate all the transition probabilities from positive states to negative states and negative states to postive states, we let
p = P(pii is + | pii-1 is +)We also assume that the transition probabilities from a fixed positive state to each of the four negative states are all equal, hence each is equal to (1-p)/4; similarly, the transition probabilities from negative states to positive states are are (1-q)/4.
q = P(pii is - | pii-1 is -)
For the transition probabilities from positive to positive states, we estimate symbol transition probabilities as we did with Markov chains (in fact, we'll just use the estimates given in the text), then multiply by p to get the state transition probabilities. Similarly, we can obtain the negative to negative transition probabilities from negative training data.
We obtain the following model:
The emission probabilities are as given in the text and in class: the state A+ emits A with probability 1, while eA+(b) = 0 for all b different from A, etc.
Theory (10 points):
>> x = 'ACGGCTTTC' >> y = 'ACCTTTC' >> alignto produce the following global alignment:
(Try it, then examine the code.) Now change the code so that it does local alignment for protein sequences. You will need to make a few changes to the dynamic programming algorithm as discussed in class. Note that unfortunately, since matlab starts indexing at 1 instead of 0, the (i,j) indices are "off by one" from the conventions used in the text. You will also need to incorporate substitution matrix match/mismatch scores. Use the BLOSUM50 matrix (copy from an online version to save typing) and a linear gap penalty of d=8 as in the text. You may find the index function in index.m convenient for converting from symbols to indices -- for example, typing:
>> amino = 'ARNDCQEGHILKMFPSTWYVBZX' >> index('R', amino)returns the value 2 (index of 'R' in the character array called amino).
Once your Smith-Waterman program is working, use it to produce the optimal local alignment of the following two sequences, given in FASTA format:
We'll use the model described above with fixed p and q values. A reasonable choice is to take p=.98, q=.999 -- with enough annotated training data, we could get a better estimate. We'll use the small dataset as our test set -- we want to see how well our predictions compare to the biologically-motivated annotations so that we get an empirical measure of the performance of our model. (Why is it not a good idea to keep rerunning our prediction algorithms on the test set with different values of p and q, and then report the best result?)
The first prediction algorithm we'll use is the Viterbi algorithm, to find the most probable path given the observed data. As a starting place, the file viterbi.m is a simple implementation of the Viterbi algorithm (using the log transform). This code assumes that the observed symbol sequence is represented by a variable x and that certain array/matrix variables -- the alphabet of symbols, the alphabet of states, matrices representing the transition and emission probabilities -- are already defined. For example, the file casino_model.m defines all the variables for implementation of the occasionally dishonest casino model discussed in your text -- here the symbols are '1' through '6' and the states are 'F' for fair die and 'L' for loaded die. For example, here I use the casino model to make the following state path prediction for observed sequence x:
>> x = '1263543264666666664666444625342323' x = 1263543264666666664666444625342323 >> viterbi LLLLLLLLLLLLLLLLLLLLLLFFFFFFFFFFFFSo, for this part, there is not much to do to get the code working for CpG island prediction: examine and make sure you understand the Viterbi code, write a model file to implement our CpG island HMM, and use the code to predict the most probable path for each of the test sequences. (Suggestion: you could use 'A', 'C', 'G', 'T' to represent positive states and 'a', 'c', 'g', 't' to represent negative states.) Save the results of your predictions -- below you will be asked to do some analysis on the performance of the model using Viterbi prediction.
Implementation/theoretical note: in this code, I am not explicitly modeling the begin and end silent states. Instead, I am using the following expression for the joint probability of the observed sequence x and state sequence pi:
P(x,pi) = P(pi1) epi1(x1) api1,pi2 epi2(x2) ... apiL-1,piL epiL(xL)I am also assuming a uniform distribution over initial states -- if there are m possible states, then:
P(pi1= k) = 1/mNote that you are free to modify (and improve) my code! You should include all the code you write as part of your homework submission.
The file forward.m contains matlab code to compute scaled forward probabilities, using the formula for the scaling factors si that I outlined in class (it's also in your text). You should write code for the backward algorithm (to compute scaled backward probabilities) and complete the implementation of posterior decoding. Use your code to predict state paths for the test data sequences, and save your results.
Implementation note: The backward algorithm in the text assumes that you are modeling an end state -- that is, you introduce a transition probability to the end state, and adjust the other transition probabilities in the model accordingly. For us, it's probably just easier to slightly modify the initialization of the backward algorithm so that it computes the probabilities without an end state. Either way, as long as what you compute allows you to make correct predictions about the state pii with maximal posterior probability, you are fine.
Please include all the code you write as part of your homework submission.