## Alignment Algorithms and HMMs for CpG Island Prediction

Due Date: Friday, March 12 (11:59pm) Note that the due date is the Friday before Spring Break, and that there is also a midterm that week -- start early! Submit electronically via Courseworks, following the instructions posted previously by our head TA Eugene Ie (eie@cs.columbia.edu). For theory questions, write your answers in a convenient electronic format -- plain text, pdf, postscript, or doc (if you must!). For other formats, ask the TA first to make sure it will be readable. For programming questions, please submit both your source code and your results/plots (in a standard format like ps, jpg, etc) along with a plain text "readme.txt" file that explains to the TA where everything is.

Suggested languages and tools: We suggest that you use matlab as the programming language for this assignment. We will provide some matlab code as a starting point for the various dynamic programming algorithms associated with alignment and HMM prediction (more complete code is provided for some problems than for others). Matlab has a fairly simple programming language and good predefined functions for matrix manipulations, and the matlab environment allows you to interactively look at current values of variables. Many researchers in machine learning use matlab as a prototyping language. In fact, if you get a chance, you might want to take a look at Kevin Murphy's Bayes Net Toolbox for matlab. (HMMs are a special example of more general graphical models called Bayes Nets.) You won't need the Bayes Net Toolbox for this homework, but you might use it in your final project.

Once again, the TAs will compile a list of online resources, post some tips for doing this assignment in matlab, and run a matlab tutorial. However, if you prefer to use other programming languages for this assignment, just email one of the TAs ahead of time to make sure your choice is acceptable.

Biology Reference: For an interesting recent article on finding both CpG-related and non-CpG-related promoter regions in order to locate first exons, see Davuluri R. V., Grosse I., Zhang M.Q. Computational identification of promoters and first exons in the human genome. Nat Genet. 2001 Dec;29(4):412-7.

Background: The first programming question is about optimal alignment, but for the bulk of this homework, we'll concentrate on developing a Hidden Markov Model for CpG detection and using the model to predict the location of CpG islands in a set of genomic test sequence data. We outline some details of the model here.

We'll use the 8-state model developed in the text, with the simplifying assumption that I will cover in class -- namely, rather than try to estimate all the transition probabilities from positive states to negative states and negative states to postive states, we let

p = P(pii is + | pii-1 is +)
q = P(pii is - | pii-1 is -)
We also assume that the transition probabilities from a fixed positive state to each of the four negative states are all equal, hence each is equal to (1-p)/4; similarly, the transition probabilities from negative states to positive states are are (1-q)/4.

For the transition probabilities from positive to positive states, we estimate symbol transition probabilities as we did with Markov chains (in fact, we'll just use the estimates given in the text), then multiply by p to get the state transition probabilities. Similarly, we can obtain the negative to negative transition probabilities from negative training data.

We obtain the following model:

 ak,l A+ C+ G+ T+ A- C- G- T- A+ 0.180p 0.274p 0.426p 0.120p (1-p)/4 (1-p)/4 (1-p)/4 (1-p)/4 C+ 0.171p 0.368p 0.274p 0.188p (1-p)/4 (1-p)/4 (1-p)/4 (1-p)/4 G+ 0.161p 0.339p 0.375p 0.125p (1-p)/4 (1-p)/4 (1-p)/4 (1-p)/4 T+ 0.079p 0.355p 0.384p 0.182p (1-p)/4 (1-p)/4 (1-p)/4 (1-p)/4 A- (1-q)/4 (1-q)/4 (1-q)/4 (1-q)/4 0.300q 0.205q 0.285q 0.210q C- (1-q)/4 (1-q)/4 (1-q)/4 (1-q)/4 0.322q 0.298q 0.078q 0.302q G- (1-q)/4 (1-q)/4 (1-q)/4 (1-q)/4 0.248q 0.246q 0.298q 0.208q T- (1-q)/4 (1-q)/4 (1-q)/4 (1-q)/4 0.177q 0.239q 0.292q 0.292q

The emission probabilities are as given in the text and in class: the state A+ emits A with probability 1, while eA+(b) = 0 for all b different from A, etc.

Theory (10 points):

1. When you use BLAST (a widely-used heuristic alignment program) for a query sequence against a database, the program reports a bit score each alignment it finds as well as an E-value. What, roughly, does the E-value tell you? What does a p-value tell you? For optimal local alignment algorithms and the heuristic algorithms that approximate local alignment, how do you convert an E-value into a p-value? (Theoretically, it is best to restrict to optical local ungapped alignment, but empirically the results are similar for gapped/heuristic local alignment.)
2. Give at least one reason why the above simplified model for CpG islands might be better than trying to estimate all the original transition probabilities, even when we have annotated training data (CpG island locations known)?
3. Given annotated training data -- genomic sequences where the CpG islands are known -- how do you calculate the maximum likelihood estimates for the parameters p and q above?
4. What is the different between the Viterbi state path and the path obtained from posterior decoding? Give the probabilistic definition for each, and give a few sentences to explain what the definition means.
Programming (40 points):

1. As a warm-up for dynamic programming, implement the Smith-Waterman algorithm for local gapped alignment. You can start with the following matlab code (borrowed from Prof. Anastassiou) for doing global alignment of DNA sequences, located in the file align.m. The code assumes that the two strings to be aligned are called x and y; e.g. you type the following commands in the matlab command window
```  >> x = 'ACGGCTTTC'

>> y = 'ACCTTTC'

>> align
```
to produce the following global alignment:
```  ACGGCTTTC
AC--CTTTC
```

(Try it, then examine the code.) Now change the code so that it does local alignment for protein sequences. You will need to make a few changes to the dynamic programming algorithm as discussed in class. Note that unfortunately, since matlab starts indexing at 1 instead of 0, the (i,j) indices are "off by one" from the conventions used in the text. You will also need to incorporate substitution matrix match/mismatch scores. Use the BLOSUM50 matrix (copy from an online version to save typing) and a linear gap penalty of d=8 as in the text. You may find the index function in index.m convenient for converting from symbols to indices -- for example, typing:

```  >> amino = 'ARNDCQEGHILKMFPSTWYVBZX'

>> index('R', amino)
```
returns the value 2 (index of 'R' in the character array called amino).

Once your Smith-Waterman program is working, use it to produce the optimal local alignment of the following two sequences, given in FASTA format:

Report both the optimal local alignment and the score for this alignment. Also remember to include your code with your homework submission.

2. We'll now move on to the problem of using the HMM described above for prediction of CpG islands. We'll look for CpG islands in a small dataset consisting of a collection of 9 sequences containing promoter regions of human genes with a total of 10 annotated CpG islands. This test dataset is in the file cpg_file.txt.

We'll use the model described above with fixed p and q values. A reasonable choice is to take p=.98, q=.999 -- with enough annotated training data, we could get a better estimate. We'll use the small dataset as our test set -- we want to see how well our predictions compare to the biologically-motivated annotations so that we get an empirical measure of the performance of our model. (Why is it not a good idea to keep rerunning our prediction algorithms on the test set with different values of p and q, and then report the best result?)

The first prediction algorithm we'll use is the Viterbi algorithm, to find the most probable path given the observed data. As a starting place, the file viterbi.m is a simple implementation of the Viterbi algorithm (using the log transform). This code assumes that the observed symbol sequence is represented by a variable x and that certain array/matrix variables -- the alphabet of symbols, the alphabet of states, matrices representing the transition and emission probabilities -- are already defined. For example, the file casino_model.m defines all the variables for implementation of the occasionally dishonest casino model discussed in your text -- here the symbols are '1' through '6' and the states are 'F' for fair die and 'L' for loaded die. For example, here I use the casino model to make the following state path prediction for observed sequence x:

```  >> x = '1263543264666666664666444625342323'

x =

1263543264666666664666444625342323

>> viterbi
LLLLLLLLLLLLLLLLLLLLLLFFFFFFFFFFFF
```
So, for this part, there is not much to do to get the code working for CpG island prediction: examine and make sure you understand the Viterbi code, write a model file to implement our CpG island HMM, and use the code to predict the most probable path for each of the test sequences. (Suggestion: you could use 'A', 'C', 'G', 'T' to represent positive states and 'a', 'c', 'g', 't' to represent negative states.) Save the results of your predictions -- below you will be asked to do some analysis on the performance of the model using Viterbi prediction.

Implementation/theoretical note: in this code, I am not explicitly modeling the begin and end silent states. Instead, I am using the following expression for the joint probability of the observed sequence x and state sequence pi:

P(x,pi) = P(pi1) epi1(x1) api1,pi2 epi2(x2) ... apiL-1,piL epiL(xL)
I am also assuming a uniform distribution over initial states -- if there are m possible states, then:
P(pi1= k) = 1/m
Note that you are free to modify (and improve) my code! You should include all the code you write as part of your homework submission.

3. Now you will implement and use posterior decoding to make predictions about the locations of CpG islands. Please read the last section in Chapter 3 of the text (or review notes from lecture) on scaling probabilities, i.e. using scaling factors si so that the scaled forward and backward probabilities are in a numerically better range.

The file forward.m contains matlab code to compute scaled forward probabilities, using the formula for the scaling factors si that I outlined in class (it's also in your text). You should write code for the backward algorithm (to compute scaled backward probabilities) and complete the implementation of posterior decoding. Use your code to predict state paths for the test data sequences, and save your results.

Implementation note: The backward algorithm in the text assumes that you are modeling an end state -- that is, you introduce a transition probability to the end state, and adjust the other transition probabilities in the model accordingly. For us, it's probably just easier to slightly modify the initialization of the backward algorithm so that it computes the probabilities without an end state. Either way, as long as what you compute allows you to make correct predictions about the state pii with maximal posterior probability, you are fine.

Please include all the code you write as part of your homework submission.

4. In this part, I would like you to compare your predictions based on the most probable path method and the posterior decoding method with each other and with the annotation given on the test set. We do not have enough test data to draw too much of a conclusion from simple statistics like number of false positives and false negatives. However, I would still like you to report the performance of the two methods quantitatively and qualitatively. First, for each sequence, produce some graphical (or at least tabular) representation of the location of the "real" (annotated) CpG islands, the ones predicted by Viterbi, and the ones predicted by posterior decoding. Second, make some quantitative statements about the performance of each prediction method: How many annotated CpG islands were completely missed by each method (false negatives)? How many spurious islands were predicted (false positives)? There is also a grey area -- predicted islands may overlap but not entirely coincide with annotated ones; make a decision about how to report these results. Note that exact borders of CpG islands are somewhat arbitrary, and it is possible that some true CpG islands have not been annotated (though our test set comes from reference sequences from Genbank and should be fairly well annotated). Finally, make some qualitative statements about the performance of each prediction method, relating your results (if possible) to the probabilistic interpretation of the method.
Acknowledgements: Many thanks to Dr. Victoria Haghighi of the Columbia Genome Center for providing the annotated dataset used in this homework and to Ilan Wapinski for post-processing the fasta file.