## Probabilistic Models for Transcription Factor Binding Sites

Due Date: Friday, February 20. Submit electronically via Courseworks. Our TA Eugene Ie (eie@cs.columbia.edu) will post instructions on the Courseworks discussion board for how to submit homework. For theory questions, write your answers in a convenient electronic format -- plain text, pdf, postscript, or doc (if you must!). For other formats, ask the TA first to make sure it will be readable. For programming questions, please submit both your source code and your results/plots (in a standard format like ps, jpg, etc) along with a plain text "readme.txt" file that explains to the TA where everything is.

Suggested languages and tools: We suggest you use perl for file processing and basic computation and matlab for generating histograms and plots. (Alternatively, Excel is another simple possibility for generating histograms.) Both perl and matlab will be useful to you in subsequent homeworks and in your group project. However, if you prefer to use other programming languages or plotting tools, just email the TA ahead of time to make sure your choice is acceptable (Excel is fine for plotting).

Reference: See Gary Stormo's review article on modeling transcription factor binding sites for a historical survey and different viewpoints on the problem.

Background: First we review some key ideas about defining and searching for motifs, as discussed in lecture. A motif (or "signal") can be loosely defined as some local feature in the genome that has characteristic properties at the nucleotide level stemming from its evolution and function. For this homework, we are interested in modeling binding sites for transcription factors (TFs), proteins that bind to promoter regions of genes and help regulate transcription.

A candidate site for a motif (such as a TFBS) will be denoted by b. If the site has m bases, then we denote these bases as b1 ... bm.

Given a particular TF and examples of its binding sites, how do we predict if a new candidate site b = b1 ... bm is also a binding site for this TF? As in class, we build two different probabilistic models: one that defines the probability distribution over real TFBSs, called the "site model", and the other that defines the "null model" probability distribution. The null model represents the background distribution in upstream regions, i.e. length m windows that are not binding sites (for any TF). Given a candidate site b, P(b|Msite) denotes the probability of the candidate b under the site model, and P(b|Mnull) denotes the probability of b under the null model. The key quantity we need to compute is the likelihood ratio:

P(b|Msite)/P(b|Mnull)

It is convenient for many reasons to take the log of the likelihood ratio as our primary statistic. The log likelihood ratio is often called the score, as in chapter 2 of the Durbin et al. text.

s(b) = log(P(b|Msite)/P(b|Mnull))

Recall from class that we developed the following TFBS model:

P(b|Msite) = p_{1,b1} * p_{2,b2} * ... * p_{m,bm}       [Equation 1]

where p_{i,x} is the probability of seeing character x at position i in a true site.

The higher the score for a candidate site b, the more likely we consider it to be a real binding site. We can define a threshold T and test each candidate site b by computing s(b) and comparing it to the threshold T. If the score exceeds T, the candidate is predicted to be a true site; otherwise it is predicted not to be a site. But how high a threshold T do you need? If the threshold is too low, we will get too many false positives, i.e. sites that pass our test of having a score greater than T but are in fact not real TFBSs. On the other hand, if the threshold T is too high, then there may be many real sites whose scores do not exceed the threshold, and these will be missed. In the theory questions, we discuss how to associate a p-value with a particular threshold or score, and the corresponding E-value for this score when looking at a 500 bp upstream region.
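As a concrete illustration of the score-and-threshold test, here is a small sketch (in Python rather than the perl suggested above; the site model probabilities, null model, and threshold are made-up values for illustration only):

```python
import math

# Hypothetical site model for a length m=3 motif: site_model[i][x] is the
# probability of seeing base x at position i in a true site (made-up values).
site_model = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

# Simple null model: one background probability per base.
null_model = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def score(b):
    """Log likelihood ratio s(b) = log2 P(b|Msite) - log2 P(b|Mnull)."""
    return sum(math.log2(site_model[i][x]) - math.log2(null_model[x])
               for i, x in enumerate(b))

# Predict: b is called a site exactly when s(b) exceeds the threshold T.
T = 2.0  # arbitrary threshold for illustration
predicted_site = score("AGC") > T
```

A candidate that matches the preferred base at every position ("AGC") scores well above this T, while a poor match like "TTT" scores negative.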

Theory (15 points):

1. What independence assumption are we making when we use Equation 1 to define our site model?
2. If we have a set of training data D -- true examples of binding sites for the TF of interest, all of fixed length m -- explain how to calculate the maximum likelihood estimate of the parameters p_{i,x}, where x = A, C, G, or T and i = 1 .. m.
3. What are pseudocounts, and how do we use them to modify the maximum likelihood estimate? Explain why we should use pseudocounts in this setting.
4. Give a reasonable model for Mnull -- that is, give an expression for the probability P(b|Mnull) that such a null model assigns to candidates b. Give a reasonable choice for estimating the parameters of the null model from training data.
5. Using log rules, give an expression for the log likelihood score that is most computationally sound, i.e. unlikely to cause underflow errors due to very small numbers.
6. Explain what is meant by the "p-value" associated with a particular score. To be concrete, suppose that we empirically estimate or exactly compute the "random score distribution" for s(b), i.e. the probability distribution over all possible scores for sites b coming from the background model. How would you find the p-value for a particular score S? Why are we interested in having a p-value for a given score S?
7. Suppose we have a PSSM model for a length m=10 TFBS motif, and we use a score threshold T whose p-value is .001. Now we slide an m=10 length window over one strand of a length 500 promoter sequence as well as over the reverse complement strand, and we score every window b to see if s(b) >= T. If we make the simplifying assumption that each test is independent (not actually true, due to overlapping and reverse complement windows!), what is the number of "hits" (predicted sites) that we expect to see at random? This number is the E-value for the threshold T (for 500 bp promoter sequences). Why are we interested in this E-value? (More precisely, one usually talks about an E-value for a score S rather than for a threshold. If one finds a hit with a certain score S, the E-value for this score S is defined as the expected number of hits with score >= S for the given search space size.)

Programming (35 points):

The file utr5_sc_500.fasta contains the 500 bp 5' upstream untranslated regions (UTRs) for all 5859 (predicted) yeast genes, downloaded from the Saccharomyces Genome Database (SGD). The file is in "fasta" format, a common and simple format for sequence data. Each entry looks like:

```
>YAL013W SGDID:S0000011 5' untranslated region, Chr I 128772 - 129271, 500 bp
GTTGGAACCTATACAACTCCTTGTCAACCGCCGTATCAGGAATTTTGTCCAGTATGGTAT
TGTATCGGGATATTAGTTCATCATTGGGCGCACACTTCTGCAATAACTCCAATATTGAGC
CCAATTGCCGTTTCAAAGTAACGTTATCGTTTGAGGTCGGGGCGAGCTTCAACACCGACA
CTAGCCTTGTTCTTTCTTCGACCAGATCAGAGAGCTGGTCCAGTTCATAACCCAGCTTCA
ACACATCCATACCGTGTGTGCTTACGCATCTATTTGCTGTCGTGAGATCTGTCTCTATGC
TTTATTCGTTTTCCATTGTAAAGTCTCAGCATTTATTTCTTGTTCTTTACTTGTTTTTGC
CCTTTCGGGTGATCAAAGTCGTGCTGGGAAATTTTATTCTTATAAAATGATTTTTAGAAA
TAATAAACTCATAACAGTGCAACGGCAAAGTACAAGGGAAGGAAGCACAGAAGCAAGAGG
AGGCGCATCGATCGTGGCAG
```
There is one line of header information, starting with the symbol ">" and followed by a sequence identifier -- here, the yeast open reading frame (ORF) identifier, "YAL013W", which specifies the gene -- and then other descriptive information like chromosome number; the header line is followed by one or more lines of sequence.
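A minimal fasta reader along the lines just described can be sketched as follows (Python for illustration; it assumes the simple header-plus-sequence-lines layout shown in the example entry):

```python
def read_fasta(path):
    """Return a dict mapping each ORF identifier to its sequence string."""
    seqs = {}
    name = None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                # First whitespace-separated token after ">" is the ORF id.
                name = line[1:].split()[0]
                seqs[name] = []
            elif name is not None:
                seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}
```

For example, read_fasta("utr5_sc_500.fasta") would give a dict keyed by ORF identifiers such as "YAL013W".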

The file tfbs_models.txt contains experimentally verified binding sequences for various transcription factors. For example, the binding site sequences listed under model ID "M00125" are for a TF called MCM1. These training sequences come from the Transfac database. You can find more information about the TF's used in this homework in the file tfbs_notes.txt or, for example, the Cold Spring Harbor yeast Promoter Database.

1. Estimating PSSMs. Use the MCM1 binding sequences (under "M00125") to build your own Msite model -- that is, to estimate the parameters p_{i,x} of the probability model. Use all the 5' UTR sequence information to estimate the null or background model, i.e. the background nucleotide probabilities. Usually, we would try to keep training data separate from test data (we'll use these upstream regions as test data later), but for the null model this is not so serious a problem; the TFBS training sequences all come from yeast promoters, but there are many other MCM1 binding sites that are not included in the training data. Now compute a "position specific scoring matrix" (PSSM) that we can later use to score candidates b. Use perl or another suitable programming language to make these PSSM computations from the sequence data, and report (a) the parameters of the background model, (b) the site model, and (c) the PSSM in your homework submission. Remember that the entries of the matrix should be log probability ratios of position-specific site model probabilities over background probabilities (use log base 2). Rather than using the strict maximum likelihood estimate for the parameters, use small pseudocounts (you can use the same pseudocount value for each parameter -- since this is just a simple model, we won't worry about choosing the pseudocounts in a "principled way"). Make a comment in your output file about your choice of pseudocount.

Also compute the PSSMs for the following TFBS motifs: M0031, M00197, and M00754. For the last two, there are few or no training sequences available, so use the frequency matrices given in tfbs_notes.txt to estimate the probabilities. For these examples, you only need to submit the final PSSMs.
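The estimation step above might be sketched like this (Python for illustration; the training sites and pseudocount value shown are toy stand-ins, not the real M00125 data):

```python
import math

BASES = "ACGT"

def estimate_pssm(sites, background, pseudocount=0.5):
    """Build a PSSM (log2 odds) from aligned fixed-length binding sites.

    sites: list of equal-length strings over A/C/G/T (training data).
    background: dict of background base probabilities (the null model).
    Returns a list of dicts: pssm[i][x] = log2(p_ix / background[x]).
    """
    m = len(sites[0])
    pssm = []
    for i in range(m):
        counts = {x: pseudocount for x in BASES}  # pseudocounts smooth zeros
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pssm.append({x: math.log2((counts[x] / total) / background[x])
                     for x in BASES})
    return pssm

# Toy example: three made-up sites, uniform background.
bg = {x: 0.25 for x in BASES}
pssm = estimate_pssm(["ACGT", "ACGA", "ACGT"], bg)
```

Each PSSM entry is the log2 ratio of a pseudocount-smoothed position-specific probability over the background probability, so a positive entry means that base is enriched at that position in true sites.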

2. Scoring sequence windows. Use the PSSM that you learned from the M00125 sequences (for TF MCM1) to search for sites in the 5' UTR of ORF YBR160W (this is a gene called cdc28, or cell division cycle gene 28). Score every window in both the forward and reverse complement sequence, and use matlab to produce a plot of these scores (sequence position on the x-axis, window score on the y-axis) for both forward and reverse sequences. Does there appear to be a hit? Where is the most likely hit located? Submit your plots and the location of the highest scoring hit(s).

Repeat using the PSSM for M00197 on the 5' UTR for ORF YFL014W, and submit plots of window scores and location of highest scoring hit(s).
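The scan itself can be sketched as follows (Python; `pssm` here is assumed to be a length-m list of per-position log2-odds dicts, this sketch's own convention rather than anything fixed by the assignment):

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[x] for x in reversed(seq))

def window_scores(seq, pssm):
    """Score every length-m window of seq: s(b) = sum of PSSM entries."""
    m = len(pssm)
    return [sum(pssm[i][seq[j + i]] for i in range(m))
            for j in range(len(seq) - m + 1)]

def scan_both_strands(seq, pssm):
    """Window scores for the forward strand and its reverse complement."""
    return window_scores(seq, pssm), window_scores(revcomp(seq), pssm)
```

Plotting sequence position against window score for each strand then shows where the high-scoring hit(s) sit.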

3. Estimating p-values. In order to get some idea of statistical significance, we'd like to estimate the score distribution for each of the PSSMs on background sequences. We will do this by a naive approximation of the score distribution: generating many random background sequences of length m (where m is the length of the PSSM model) and evaluating the PSSM score on each to produce an empirical histogram. We can then figure out the (empirical) p-value associated with any score. There are two possible ways to generate background sequences. One is to sample randomly from our simple background model, i.e. generate random length m sequences of A, C, G, T using the background nucleotide probabilities. (In fact, for small m, it's quite feasible to do an exact computation of p-values using this background model.) A slightly better way is to randomly shuffle individual promoter sequences and select length m subsequences. This method preserves the differences in nucleotide composition that exist between sequences, rather than assuming a homogeneous background model. You can choose whichever method you prefer. Generate at least 100,000 random length m sequences for each score distribution. For each PSSM, submit (a) a plot of the background score distribution (essentially just a histogram of scores on the random data) and (b) the score corresponding to p-value=.001.
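Sampling from the simple background model and reading off the empirical p-value=.001 score can be sketched as follows (Python; the fixed seed and the particular cutoff convention are illustrative choices):

```python
import random

def sample_background(m, bg, n=100000, seed=0):
    """Draw n random length-m sequences from background base frequencies bg."""
    rng = random.Random(seed)
    bases, weights = zip(*sorted(bg.items()))
    return ["".join(rng.choices(bases, weights=weights, k=m))
            for _ in range(n)]

def score_threshold(scores, pvalue=0.001):
    """Score S such that roughly a fraction `pvalue` of the random scores
    are >= S (the empirical p-value cutoff)."""
    ranked = sorted(scores, reverse=True)
    k = max(int(pvalue * len(ranked)) - 1, 0)
    return ranked[k]
```

Scoring each sampled sequence with the PSSM gives the empirical histogram; score_threshold then reads off the score whose empirical p-value is .001.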

There are more sophisticated ways of estimating p-values for these models, especially since we want to concentrate on the tail of the distribution. See the discussion of p-values for TFBS models from Nir Friedman's lab for a good overview.

4. Genome-wide search. Finally, for each of the four PSSMs, do a genome-wide search for hits using the log likelihood ratio score s(b). Remember to look in both forward and reverse complement sequences. For each PSSM and each ORF, report the location and score of all hits above the p-value=.001 score threshold. Report your results for each PSSM in a separate file using a fasta-like format, with a header line followed by the hit information, one line per hit. Note that even with the p-value threshold, we expect to get many hits at random! If interested, you can follow up on some of your hits in SGD or Transfac to see if they are indeed annotated (not required for submission).
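The reporting loop might be sketched as follows (Python; `scan` is a hypothetical callable returning (position, score) pairs for one strand, and the exact output fields shown are one possible reading of the fasta-like format, not a prescribed one):

```python
def report_hits(seqs, scan, threshold, out_path):
    """Write hits above the score threshold: one header line per ORF with
    hits, followed by one line per hit (position, strand, score)."""
    with open(out_path, "w") as out:
        for orf, seq in sorted(seqs.items()):
            hits = [(pos, strand, s)
                    for strand in ("+", "-")
                    for pos, s in scan(seq, strand)
                    if s >= threshold]
            if hits:
                out.write(">%s %d hit(s)\n" % (orf, len(hits)))
                for pos, strand, s in hits:
                    out.write("%d\t%s\t%.2f\n" % (pos, strand, s))
```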

Supplementary information:

A more difficult problem than the one we deal with in this homework is discovering new (putative) TFBSs by computational means. If you would like to try out a motif discovery program, look at Jim Kent's Improbizer program. First select Improbizer from the web page. Then paste in some subset of the 5' UTRs (in fasta format) -- it will work best if you already have reason to believe that the sequences might share regulatory elements, perhaps based on your searches with the four PSSMs in the homework. The web-based version of Improbizer does not process all the data in huge files, so limit your input to a few hundred sequences at most. Read the information about Improbizer and then start with the following options: set "Number of motifs to find" to 1, "Maximum Occurrences per sequence" to 1, "Initial motif size" to the length of one of our PSSMs, and "Restrain Expansionist tendencies" to 1.0. Click "Submit". Improbizer will now try to discover a motif that is over-represented in the data sequences you submitted. Improbizer is designed to discover motifs in unaligned sequences, even if the motif occurs a variable number of times and in variable places -- you can experiment with different options. The algorithmic approach used is an example of expectation maximization (EM) -- we'll talk more about EM later in the course.

Improbizer returns the motif it discovers in the data as a weight matrix. Below the table describing the motif, the output shows the actual motif occurrences found in the data you pasted in (at least the part of it that Improbizer processed). The location of the motif is shown in upper case. The number in the left column is the score. This is the log likelihood ratio, the same notion of score defined above. However, Improbizer is a bit smarter about the null model than we might have been above: it does not just assume that all bases are equally likely, but rather estimates their frequencies from the data you provide, and P(b|Mnull) is defined using these estimated frequencies. It can also use a higher order Markov model as the null model (see pages 46-51 of the text) and estimate a null model from a separate data file.

Acknowledgements: Thanks to Eugene Ie and Anshul Kundaje for helping to prepare the datasets for this homework.