Due Date: Friday, February 20. Submit electronically via Courseworks. Our TA Eugene Ie (eie@cs.columbia.edu) will post instructions on the Courseworks discussion board for how to submit homework. For theory questions, write your answers in a convenient electronic format -- plain text, pdf, postscript, or doc (if you must!). For other formats, ask the TA first to make sure it will be readable. For programming questions, please submit both your source code and your results/plots (in a standard format like ps, jpg, etc) along with a plain text "readme.txt" file that explains to the TA where everything is.
Suggested languages and tools: We suggest you use perl for file processing and basic computation and matlab for generating histograms and plots. (Alternatively, Excel is another simple possibility for generating histograms.) Both perl and matlab will be useful to you in subsequent homeworks and in your group project. However, if you prefer to use other programming languages or plotting tools, just email the TA ahead of time to make sure your choice is acceptable (Excel is fine for plotting).
Reference: See Gary Stormo's review article on modeling transcription factor binding sites for a historical survey and different viewpoints on the problem.
Background: First we review some key ideas about defining and searching for motifs, as discussed in lecture. A motif (or "signal") can be loosely defined as some local feature in the genome that has characteristic properties at the nucleotide level stemming from its evolution and function. For this homework, we are interested in modeling binding sites for transcription factors (TFs), proteins that bind to promoter regions of genes and help regulate transcription.
A candidate site for a motif (such as TFBS) will be denoted by b. If the site has m bases, then we denote these bases as b1 ... bm.
Given a particular TF and examples of its binding sites, how do we predict if a new candidate site b = b1 ... bm is also a binding site for this TF? As in class, we build two different probabilistic models: one that defines the probability distribution over real TFBSs, called the "site model", and the other that defines the "null model" probability distribution. The null model represents the background distribution in upstream regions, i.e. length m windows that are not binding sites (for any TF). Given a candidate site b, P(b|Msite) denotes the probability of the candidate b under the site model, and P(b|Mnull) denotes the probability of b under the null model. The key quantity we need to compute is the likelihood ratio:
P(b|Msite)/P(b|Mnull)
It is convenient for many reasons to take the log of the likelihood ratio as our primary statistic. The log likelihood ratio is often called the score, as in chapter 2 of the Durbin et al. text.
s(b) = log(P(b|Msite)/P(b|Mnull))
Recall from class that we developed the following TFBS model:
P(b|Msite) = (p1b1) (p2b2) .. (pmbm)       [Equation 1]
where pix is the probability of seeing character x in position i in a true site.
The higher the score for a candidate site b, the more likely we consider it to be a real splice site. We can define a threshold T, and test each candidate splice site b by computing score(b) and comparing it to the threshold T. If the score exceeds T, the candidate is predicted to be a true site, otherwise it is predicted not to be site. But how high a threshold T do you need? If the threshold is too low, we will get too many false positives i.e. sites that pass our test of having a score greater than T, but are in fact not real TFBSs. On the other hand, if the threshold T is too high, then there may be many real sites whose scores do not exceed the threshold, and thus will be missed. In the theory questions, we discuss how to associate a p-value to a particular threshold or score, and the corresponding E-value for this score when looking at a 500 bp upstream region.
Theory (15 points):
The file utr5_sc_500.fasta contains the 500 bp 5' upstream untranslated regions (UTR's) for all 5859 (predicted) yeast genes, downloaded from the Saccharomyces Genome Database (SGD). The file is in "fasta" format, a common and simple format for sequence data. Each entry looks like:
>YAL013W SGDID:S0000011 5' untranslated region, Chr I 128772 - 129271, 500 bp GTTGGAACCTATACAACTCCTTGTCAACCGCCGTATCAGGAATTTTGTCCAGTATGGTAT TGTATCGGGATATTAGTTCATCATTGGGCGCACACTTCTGCAATAACTCCAATATTGAGC CCAATTGCCGTTTCAAAGTAACGTTATCGTTTGAGGTCGGGGCGAGCTTCAACACCGACA CTAGCCTTGTTCTTTCTTCGACCAGATCAGAGAGCTGGTCCAGTTCATAACCCAGCTTCA ACACATCCATACCGTGTGTGCTTACGCATCTATTTGCTGTCGTGAGATCTGTCTCTATGC TTTATTCGTTTTCCATTGTAAAGTCTCAGCATTTATTTCTTGTTCTTTACTTGTTTTTGC CCTTTCGGGTGATCAAAGTCGTGCTGGGAAATTTTATTCTTATAAAATGATTTTTAGAAA TAATAAACTCATAACAGTGCAACGGCAAAGTACAAGGGAAGGAAGCACAGAAGCAAGAGG AGGCGCATCGATCGTGGCAGThere is one line of header information, starting with the symbol ">" and followed by a sequence identifier -- here, the yeast open reading frame (ORF) identifier, "YAL013W", which specifies the gene -- and then other descriptive information like chromosome number; the header line is followed by one or more lines of sequence.
The file tfbs_models.txt contains experimentally verified binding sequences for various transcription factors. For example, the binding site sequences listed under model ID "M00125" are for a TF called MCM1. These training sequences come from the Transfac database. You can find more information about the TF's used in this homework in the file tfbs_notes.txt or, for example, the Cold Spring Harbor yeast Promoter Database.
Also compute the PSSMs for the following TFBS motifs: M0031, M00197, and M00754. For the second two examples, there are no sequences or not many sequences for training data, so use the frequency matrices given in tfbs_notes.txt to estimate probabilities. For these examples, you just need to submit the final PSSMs.
Repeat using the PSSM for M00197 on the 5' UTR for ORF YFL014W, and submit plots of window scores and location of highest scoring hit(s).
There are more sophisticated ways of estimating p-values for these models, especially since we want to concentrate on the tail of the distribution. See the discussion of p-values for TFBS models from Nir Friedman's lab for a good overview.
Supplementary information:
A more difficult problem than the one we deal with in this homework is the problem of discovering new (putative) TFBSs by computational means. If you would like to try out motif discovery program, look at Jim Kent's improbizer program. First select improbizer from the web page. Then paste some subset of the 5' UTRs (in fasta format) -- it will work best if you already have reason to believe that the sequences might share regulatory elements, maybe based on your searches with the four PSSMs in the homework. The web-based version of Improbizer does not process all data in huge files, so limit to a few 100 sequences at most. Read the Information about Improbizer and then start with the following options: set "Number of motifs to find" to 1, "Maximum Occurrences per sequence" to 1, "Initial motif size" to the length of one of our PSSMs, and "Restrain Expansionist tendencies" to 1.0. Click "Submit". Improbizer will now try to discover a motif that is over-represented in the data sequences that you submitted. Improbizer is designed to discover the motif in unaligned sequences, even if the motif occurs a variable number of times and places -- you can experiment with different options. The algorithmic approach used is an example of expectation maximization (EM) -- we'll talk more about EM later in the course.
Improbizer returns the motif it discovers in the data as described by a weight matrix. Below the table describing the motif, the output shows the actual motifs found in the data you pasted in (the part of it that Improbizer processed at least). The location of the motif is shown in upper case. The number in the left column is the score. This is the log likelihood ratio, the same notion of score defined above. However, it should be noted that Improbizer is a bit smarter about the null model than we might have been above. It does not just assume that all bases are equally likely, but rather estimates their frequencies from the data you provide. P(b|Mnull) is defined using these estimated frequencies. It can also use a higher order Markov model as the null model (see pages 46-51 of the text) and estimate a null model from a separate data file.