**Due Date: Friday, March 12 (11:59pm)** Note that the due date
is the Friday before Spring Break, and that there is also a midterm
that week -- start early! Submit electronically via
Courseworks, following
the instructions posted previously by our
head TA Eugene Ie
(eie@cs.columbia.edu).
For theory questions, write your answers in a convenient electronic
format -- plain text, pdf, postscript, or
doc (if you must!). For
other formats, ask the TA first to make sure it will be readable.
For programming questions, please submit both your source code and your
results/plots (in a standard format like ps, jpg, etc) along with a plain
text "readme.txt" file that explains to
the TA where everything is.

**Suggested languages and tools:** We suggest that you
use **matlab** as the
programming language for this assignment. We will provide some matlab code
as a starting point for the various dynamic programming algorithms associated
with alignment and HMM prediction (more
complete code is provided for some problems than for others). Matlab has a fairly
simple programming language and good predefined functions for matrix
manipulations, and the matlab environment allows you to interactively
look at current values of variables.
Many researchers in machine learning use matlab as a
prototyping language. In fact, if you get a chance, you might want to take
a look at Kevin Murphy's
Bayes Net Toolbox for matlab. (HMMs are a special example of
more general graphical models called Bayes Nets.) You won't need the Bayes
Net Toolbox
for this homework, but you might use it in your final project.

Once again, the TAs will compile a list of online resources, post some tips for doing this assignment in matlab, and run a matlab tutorial. However, if you prefer to use other programming languages for this assignment, just email one of the TAs ahead of time to make sure your choice is acceptable.

**Biology Reference:** For an interesting recent article on finding both
CpG-related and non-CpG-related promoter regions in order to locate first
exons, see
Davuluri R. V., Grosse I., Zhang M.Q.
Computational identification of promoters and first exons in the human genome.
Nat Genet. 2001 Dec;29(4):412-7.

**Background:** The first programming question is about optimal
alignment, but for the bulk of this homework,
we'll concentrate on developing a Hidden Markov Model for CpG detection
and using the model to predict the location of CpG islands in a set of
genomic test sequence data. We outline some details of the model here.

We'll use the 8-state model developed in the text, with a simplifying assumption that I will cover in class -- namely, rather than try to estimate all the transition probabilities from positive states to negative states and from negative states to positive states, we let

p = P(pi_{i} is + | pi_{i-1} is +)

q = P(pi_{i} is - | pi_{i-1} is -)

We also assume that the transition probabilities from a fixed positive state to each of the four negative states are all equal, hence each is equal to (1-p)/4; similarly, the transition probabilities from negative states to positive states are each (1-q)/4.

For the transition probabilities from positive to positive states, we estimate symbol transition probabilities as we did with Markov chains (in fact, we'll just use the estimates given in the text), then multiply by p to get the state transition probabilities. Similarly, we can obtain the negative to negative transition probabilities from negative training data.

We obtain the following model:

| a_{k,l} | A^{+} | C^{+} | G^{+} | T^{+} | A^{-} | C^{-} | G^{-} | T^{-} |
|---------|-------|-------|-------|-------|---------|---------|---------|---------|
| A^{+} | 0.180p | 0.274p | 0.426p | 0.120p | (1-p)/4 | (1-p)/4 | (1-p)/4 | (1-p)/4 |
| C^{+} | 0.171p | 0.368p | 0.274p | 0.188p | (1-p)/4 | (1-p)/4 | (1-p)/4 | (1-p)/4 |
| G^{+} | 0.161p | 0.339p | 0.375p | 0.125p | (1-p)/4 | (1-p)/4 | (1-p)/4 | (1-p)/4 |
| T^{+} | 0.079p | 0.355p | 0.384p | 0.182p | (1-p)/4 | (1-p)/4 | (1-p)/4 | (1-p)/4 |
| A^{-} | (1-q)/4 | (1-q)/4 | (1-q)/4 | (1-q)/4 | 0.300q | 0.205q | 0.285q | 0.210q |
| C^{-} | (1-q)/4 | (1-q)/4 | (1-q)/4 | (1-q)/4 | 0.322q | 0.298q | 0.078q | 0.302q |
| G^{-} | (1-q)/4 | (1-q)/4 | (1-q)/4 | (1-q)/4 | 0.248q | 0.246q | 0.298q | 0.208q |
| T^{-} | (1-q)/4 | (1-q)/4 | (1-q)/4 | (1-q)/4 | 0.177q | 0.239q | 0.292q | 0.292q |

The emission probabilities are as given in the text and in class: the
state A+ emits A with probability 1, while e_{A+}(b) = 0 for
all b different from A, etc.
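
If you decide to use a language other than matlab for this assignment (remember to clear your choice with a TA first), the model above is straightforward to build programmatically. Here is a Python sketch; the state ordering (A+, C+, G+, T+, A-, C-, G-, T-) and all names are my own choices, not part of the provided code:

```python
import numpy as np

# Within-island (+/+) and background (-/-) symbol transition tables,
# copied from the model above; rows and columns are ordered A, C, G, T.
PLUS = np.array([[0.180, 0.274, 0.426, 0.120],
                 [0.171, 0.368, 0.274, 0.188],
                 [0.161, 0.339, 0.375, 0.125],
                 [0.079, 0.355, 0.384, 0.182]])
MINUS = np.array([[0.300, 0.205, 0.285, 0.210],
                  [0.322, 0.298, 0.078, 0.302],
                  [0.248, 0.246, 0.298, 0.208],
                  [0.177, 0.239, 0.292, 0.292]])

def transition_matrix(p, q):
    """8x8 state transition matrix; states ordered A+,C+,G+,T+,A-,C-,G-,T-."""
    A = np.zeros((8, 8))
    A[:4, :4] = p * PLUS          # stay inside a CpG island
    A[:4, 4:] = (1 - p) / 4       # leave the island
    A[4:, :4] = (1 - q) / 4       # enter an island
    A[4:, 4:] = q * MINUS         # stay in the background
    return A

A = transition_matrix(0.98, 0.999)
# Each row sums to 1 up to the rounding in the published tables.
print(A.sum(axis=1))
```
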

**Theory (10 points):**

- When you use BLAST (a widely-used heuristic alignment program) for a query sequence against a database, the program reports a bit score for each alignment it finds, as well as an E-value. What, roughly, does the E-value tell you? What does a p-value tell you? For optimal local alignment algorithms and the heuristic algorithms that approximate local alignment, how do you convert an E-value into a p-value? (Theoretically, it is best to restrict to optimal local ungapped alignment, but empirically the results are similar for gapped/heuristic local alignment.)
- Give at least one reason why the above simplified model for CpG islands might be better than trying to estimate all of the original transition probabilities, even when we have annotated training data (CpG island locations known).
- Given annotated training data -- genomic sequences where the CpG islands are known -- how do you calculate the maximum likelihood estimates for the parameters p and q above?
- What is the difference between the Viterbi state path and the path obtained from posterior decoding? Give the probabilistic definition of each, and explain in a few sentences what each definition means.

- As a warm-up for dynamic programming, implement the Smith-Waterman algorithm
for local gapped alignment. You can start with the following matlab code
(borrowed from Prof. Anastassiou) for doing **global alignment** of DNA
sequences, located in the file align.m. The code assumes that the two strings
to be aligned are called x and y; e.g. you type the following commands in the
matlab command window:

```
>> x = 'ACGGCTTTC'
>> y = 'ACCTTTC'
>> align
```

to produce the following global alignment:

```
ACGGCTTTC
AC--CTTTC
```

(Try it, then examine the code.) Now change the code so that it does **local alignment** for protein sequences. You will need to make a few changes to the dynamic programming algorithm as discussed in class. Note that, unfortunately, since matlab starts indexing at 1 instead of 0, the (i,j) indices are "off by one" from the conventions used in the text. You will also need to incorporate substitution matrix match/mismatch scores. Use the BLOSUM50 matrix (copy from an online version to save typing) and a linear gap penalty of d=8 as in the text. You may find the index function in index.m convenient for converting from symbols to indices -- for example, typing:

```
>> amino = 'ARNDCQEGHILKMFPSTWYVBZX'
>> index('R', amino)
```

returns the value 2 (the index of 'R' in the character array called amino).

Once your Smith-Waterman program is working, use it to produce the optimal local alignment of the following two sequences, given in FASTA format:

Report both the optimal local alignment and the score for this alignment. Also remember to include your code with your homework submission.

- We'll now move on to the problem of using the HMM described above
for prediction of CpG islands. We'll look for CpG islands in a small dataset
consisting of a collection of 9 sequences containing promoter regions
of human genes with a total of 10 annotated CpG islands.
This test dataset is in the file cpg_file.txt.
We'll use the model described above with fixed p and q values. A reasonable choice is to take p=.98, q=.999 -- with enough annotated training data, we could get a better estimate. We'll use the small dataset as our test set -- we want to see how well our predictions compare to the biologically-motivated annotations, so that we get an empirical measure of the performance of our model. (Why is it **not** a good idea to keep rerunning our prediction algorithms on the test set with different values of p and q, and then report the best result?)

The first prediction algorithm we'll use is the Viterbi algorithm, to find the most probable path given the observed data. As a starting place, the file viterbi.m is a simple implementation of the Viterbi algorithm (using the log transform). This code assumes that the observed symbol sequence is represented by a variable x and that certain array/matrix variables -- the alphabet of symbols, the alphabet of states, and matrices representing the transition and emission probabilities -- are already defined. For example, the file casino_model.m defines all the variables for implementation of the occasionally dishonest casino model discussed in your text -- here the symbols are '1' through '6' and the states are 'F' for fair die and 'L' for loaded die. Here, for instance, I use the casino model to make the following state path prediction for observed sequence x:

```
>> x = '1263543264666666664666444625342323'
x =
1263543264666666664666444625342323
>> viterbi
LLLLLLLLLLLLLLLLLLLLLLFFFFFFFFFFFF
```

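If you are implementing Viterbi outside matlab, the log-transformed recurrence can be sketched as follows in Python. This is my own illustrative version, not the provided viterbi.m; the casino parameters below are the usual fair/loaded die values from the text, and all names are my own:

```python
import numpy as np

def viterbi(x, states, symbols, A, E, init=None):
    """Most probable state path for symbol sequence x.
    A: KxK transition matrix; E: KxM emission matrix (states x symbols)."""
    K = len(states)
    if init is None:
        init = np.full(K, 1.0 / K)        # uniform over initial states
    with np.errstate(divide='ignore'):     # log(0) -> -inf is fine here
        logA, logE, logI = np.log(A), np.log(E), np.log(init)
    idx = [symbols.index(c) for c in x]
    L = len(idx)
    V = np.full((L, K), -np.inf)           # V[i, k]: best log-prob ending in k
    ptr = np.zeros((L, K), dtype=int)      # best predecessor state
    V[0] = logI + logE[:, idx[0]]
    for i in range(1, L):
        scores = V[i - 1][:, None] + logA  # scores[k, l]: come from k, go to l
        ptr[i] = scores.argmax(axis=0)
        V[i] = scores.max(axis=0) + logE[:, idx[i]]
    # traceback from the best final state
    path = [int(V[-1].argmax())]
    for i in range(L - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return ''.join(states[k] for k in reversed(path))

# Casino model: fair die vs. loaded die (loaded favors 6).
A = np.array([[0.95, 0.05],
              [0.10, 0.90]])
E = np.array([[1/6] * 6,
              [0.1] * 5 + [0.5]])
print(viterbi('1234516666666666', 'FL', '123456', A, E))
```

The run of sixes at the end of the example sequence should come out labeled 'L', and the mixed prefix 'F'.
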
So, for this part, there is not much to do to get the code working for CpG island prediction: examine and make sure you understand the Viterbi code, write a model file to implement our CpG island HMM, and use the code to predict the most probable path for each of the test sequences. (Suggestion: you could use 'A', 'C', 'G', 'T' to represent positive states and 'a', 'c', 'g', 't' to represent negative states.) Save the results of your predictions -- below you will be asked to do some analysis on the performance of the model using Viterbi prediction.

Implementation/theoretical note: in this code, I am **not** explicitly modeling the begin and end silent states. Instead, I am using the following expression for the joint probability of the observed sequence x and state sequence pi:

P(x, pi) = P(pi_{1}) e_{pi_{1}}(x_{1}) a_{pi_{1},pi_{2}} e_{pi_{2}}(x_{2}) ... a_{pi_{L-1},pi_{L}} e_{pi_{L}}(x_{L})

I am also assuming a uniform distribution over initial states -- if there are m possible states, then P(pi_{1} = k) = 1/m.

Note that you are free to modify (and improve) my code! You should include all the code you write as part of your homework submission.

- Now you will implement and use posterior decoding to make predictions about
the locations of CpG islands. Please read the last section in Chapter 3 of the text (or review notes from lecture) on scaling probabilities, i.e. using scaling factors s_{i} so that the scaled forward and backward probabilities are in a numerically better range.

The file forward.m contains matlab code to compute scaled forward probabilities, using the formula for the scaling factors s_{i} that I outlined in class (it's also in your text). You should write code for the backward algorithm (to compute **scaled** backward probabilities) and complete the implementation of posterior decoding. Use your code to predict state paths for the test data sequences, and save your results.

Implementation note: the backward algorithm in the text assumes that you are modeling an end state -- that is, you introduce a transition probability to the end state and adjust the other transition probabilities in the model accordingly. For us, it's probably just easier to slightly modify the initialization of the backward algorithm so that it computes the probabilities without an end state. Either way, as long as what you compute allows you to make correct predictions about the state pi_{i} with maximal posterior probability, you are fine.

Please include all the code you write as part of your homework submission.
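
To make the scaling scheme concrete, here is a Python sketch of scaled forward-backward with posterior decoding and no explicit end state (the backward pass is initialized with b_L(k) = 1). The structure and all names are my own; forward.m is the matlab counterpart for the forward half:

```python
import numpy as np

def posterior_decode(x, states, symbols, A, E, init=None):
    """Scaled forward-backward; returns per-position posteriors and
    the posterior-decoding path (no explicit end state)."""
    K, L = len(states), len(x)
    if init is None:
        init = np.full(K, 1.0 / K)
    idx = [symbols.index(c) for c in x]
    # forward pass: rescale each column so it sums to 1; the scaling
    # factor s_i is exactly the normalizer
    f, s = np.zeros((L, K)), np.zeros(L)
    f[0] = init * E[:, idx[0]]
    s[0] = f[0].sum(); f[0] /= s[0]
    for i in range(1, L):
        f[i] = (f[i - 1] @ A) * E[:, idx[i]]
        s[i] = f[i].sum(); f[i] /= s[i]
    # backward pass, reusing the same scaling factors; without an end
    # state we initialize b_L(k) = 1 instead of a_{k,end}
    b = np.zeros((L, K))
    b[L - 1] = 1.0
    for i in range(L - 2, -1, -1):
        b[i] = (A @ (E[:, idx[i + 1]] * b[i + 1])) / s[i + 1]
    post = f * b   # post[i, k] = P(pi_i = k | x); each row sums to 1
    path = ''.join(states[k] for k in post.argmax(axis=1))
    return post, path

# Casino example: fair die vs. loaded die (loaded favors 6).
A = np.array([[0.95, 0.05], [0.10, 0.90]])
E = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])
post, path = posterior_decode('1234516666666666', 'FL', '123456', A, E)
```

With this choice of scaling, the product of the scaled forward and backward values at position i is exactly the posterior P(pi_{i} = k | x), so no renormalization is needed at the end.
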

- In this part, I would like you to compare your predictions based on the most
probable path method and the posterior
decoding method with each other and with the annotation given on the test set.
We do not have enough test data to draw too much of a conclusion from simple
statistics like number of false positives and false negatives. However, I
would still like you to report the performance of the two methods
quantitatively and qualitatively. First, for each sequence,
produce some graphical (or at least tabular) representation
of the location of the "real" (annotated) CpG islands, the ones predicted by
Viterbi, and the ones predicted by posterior decoding. Second, make some
quantitative statements about the performance of each prediction method:
How many annotated CpG islands were completely missed by each method (false
negatives)? How many spurious islands were predicted (false positives)?
There is also a grey area -- predicted islands may overlap but not entirely
coincide with annotated ones; make a decision about how to report these
results. Note that exact borders of CpG islands are somewhat arbitrary,
and it is possible that some true CpG islands have not been annotated
(though our test set comes from reference sequences from Genbank and should
be fairly well annotated). Finally, make some qualitative statements about
the performance of each prediction method, relating your results (if possible)
to the probabilistic interpretation of the method.
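
One simple way to produce the quantitative part of this comparison is to extract predicted islands as runs of consecutive positive states and count overlaps against the annotation. Here is a Python sketch; the interval convention, the overlap criterion, and the function names are my own choices, and it assumes the upper-case/lower-case state naming suggested earlier:

```python
def islands(path, plus="ACGT"):
    """Extract (start, end) runs of positive states from a predicted
    state path; intervals are 0-based and end-exclusive."""
    runs, start = [], None
    for i, st in enumerate(path):
        inside = st in plus
        if inside and start is None:
            start = i                      # a run of + states begins
        elif not inside and start is not None:
            runs.append((start, i)); start = None
    if start is not None:                  # path ends inside an island
        runs.append((start, len(path)))
    return runs

def compare(predicted, annotated):
    """Count annotated islands with no overlapping prediction (missed)
    and predicted islands overlapping no annotation (spurious)."""
    overlaps = lambda a, b: a[0] < b[1] and b[0] < a[1]
    missed = sum(not any(overlaps(a, p) for p in predicted) for a in annotated)
    spurious = sum(not any(overlaps(p, a) for a in annotated) for p in predicted)
    return missed, spurious
```

Partial overlaps count as hits under this criterion; you should state whatever convention you adopt when you report your results.
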

*Acknowledgements: Many thanks to Dr. Victoria Haghighi of the Columbia Genome Center for providing the annotated dataset used in this homework and to Ilan Wapinski for post-processing the fasta file.*