Final Study Guide
General comments
- The final test is open-book: you can refer to your text, your lecture notes, and your study notes
- The final test covers material since the midterm -- it does not focus on the biosequence analysis we did before the midterm. However, many of the early ideas from the course are needed to understand the later material.
- You should know the basic rules of probability and ideas from machine learning (e.g. training a model, log-odds scores, linear classifiers, etc.), but you will not be asked to reproduce the longer derivations from class or to prove new results
- You should be able to give basic biological motivation for the biosequence and gene expression analysis problems we have discussed
- You should be able to describe all the algorithms and methods we discussed in class and give examples of their use, but you will not be asked to formally derive or justify them.
Topics
- Final HMM topics
[References: Durbin, Section 3.3 and Chapter 5]
- EM algorithm (Baum-Welch) for parameter estimation in the incomplete-data case; see the sketch at the end of this topic
- Profile HMMs
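For the Baum-Welch item above, here is a minimal sketch (not the full algorithm as presented in Durbin) of one EM iteration for a small discrete-emission HMM. The state and symbol counts, variable names, and toy observation sequence are invented, and the sketch omits the scaling/log-space tricks and multiple-sequence handling that a real implementation (e.g. for training a profile HMM) would need.

```python
# Minimal Baum-Welch sketch for a discrete-emission HMM (illustration only:
# the toy parameters and observations below are made up, and no scaling is
# used, so this is only numerically safe for short sequences).
import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(o_1..o_t, q_t = i)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = i)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(pi, A, B, obs):
    """One EM iteration: E-step posteriors, then M-step re-estimation."""
    T, N = len(obs), len(pi)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    likelihood = alpha[-1].sum()                       # P(O | current parameters)
    gamma = alpha * beta / likelihood                  # P(q_t = i | O)
    xi = np.zeros((T - 1, N, N))                       # P(q_t = i, q_{t+1} = j | O)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / likelihood
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B, likelihood

# Toy run: 2 hidden states, alphabet {0, 1}, arbitrary starting parameters.
obs = np.array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1])
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.6, 0.4], [0.3, 0.7]])
for _ in range(5):
    pi, A, B, like = baum_welch_step(pi, A, B, obs)
print("P(O) at the last E-step:", like)
```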
- Classification for Gene Expression Profiles
[References: Golub paper, Burges SVM tutorial]
- classification vs. clustering, supervised vs. unsupervised learning
- k-nearest neighbor
- Fisher's linear discriminant
- Example of classification problem: Golub leukemia
- Support vector machines
- Linear classifiers, geometric margin
- Hard margin SVM, dual optimization problem, interpretation of the weights (Lagrange multipliers)
- Soft margin SVMs
- Kernel trick, examples of kernels
- Feature selection: filter methods (e.g. Fisher score), wrapper methods (e.g. recursive feature elimination); see the filter + k-NN sketch at the end of this topic
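As a small concrete example for the k-NN and feature-selection items, below is a sketch that ranks genes with a Fisher/signal-to-noise-style score (a filter method, roughly in the spirit of the Golub two-class setup), keeps the top few, and classifies with k-nearest neighbor. The toy expression matrix and function names are invented; note also that a careful evaluation would redo the feature selection inside each leave-one-out fold, which this sketch skips for brevity.

```python
# Filter-style feature selection + k-NN sketch (toy data; sizes and names invented).
import numpy as np

def fisher_scores(X, y):
    """Per-gene separation score for two classes: |mu0 - mu1| / (sigma0 + sigma1)."""
    X0, X1 = X[y == 0], X[y == 1]
    return np.abs(X0.mean(axis=0) - X1.mean(axis=0)) / (X0.std(axis=0) + X1.std(axis=0) + 1e-12)

def knn_predict(X_train, y_train, x, k=3):
    """Classify one profile by majority vote among its k nearest training profiles."""
    dist = np.linalg.norm(X_train - x, axis=1)         # Euclidean distance to each training sample
    nearest = np.argsort(dist)[:k]
    return np.bincount(y_train[nearest]).argmax()

# Toy expression matrix: 20 samples x 100 genes, two classes, genes 0-4 informative.
rng = np.random.default_rng(0)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 100))
X[y == 1, :5] += 2.0

keep = np.argsort(fisher_scores(X, y))[::-1][:5]       # filter: keep the top-scoring genes
X_sel = X[:, keep]

# Leave-one-out evaluation of 3-NN on the selected genes.
errors = sum(knn_predict(np.delete(X_sel, i, 0), np.delete(y, i), X_sel[i]) != y[i]
             for i in range(len(y)))
print("selected genes:", keep, " LOO errors:", errors)
```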
- Graphical Models for Regulatory Network Inference
[References: Hartemink paper, inferring subnetworks paper, module networks paper, module networks tech report]
- Basic ideas of transcriptional regulation: promoter regions for genes and transcription factors
- Bayes net models: graphical model, conditional independencies, joint probability distribution, parameters, arrows do not imply causality
- Hartemink paper: Bayesian score, statistically validating candidate models; see the toy structure-scoring sketch at the end of this topic
- Pe'er "Inferring subnetworks" paper: use of knock-out data modeled by "interventions" in graph, greedy search for high scoring structure, bootstrapping to get high confidence features
- Segal "Module Networks" paper: notion of transcriptional module, outline of structure search algorithm, overview of validation using other data and GO annotations
- Motif discovery
[References: REDUCE paper, MEME paper ]
- REDUCE algorithm -- linear regression model (we covered this before spring break, but since it did not appear on the first test, it will be on the final test); see the least-squares sketch at the end of this topic
- MEME algorithm -- mixture model and use of expectation maximization (overview only, since we didn't present details in lecture)
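For the REDUCE item, below is a minimal least-squares sketch of its linear model: the expression change of each gene is modeled as an intercept plus a sum of motif "activities" times the motif's counts in that gene's promoter. The motif counts, activities, and expression values are simulated, and REDUCE's iterative motif selection and significance testing are omitted.

```python
# Least-squares sketch of the REDUCE linear model (simulated data; details invented).
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_motifs = 200, 6
counts = rng.poisson(1.0, size=(n_genes, n_motifs))    # N[g, m]: occurrences of motif m in gene g's promoter

true_activity = np.array([0.8, -0.5, 0.0, 0.0, 0.0, 0.0])          # only motifs 0 and 1 matter
log_ratio = counts @ true_activity + rng.normal(0, 0.3, n_genes)   # A_g = C + sum_m F_m * N[g, m] + noise

# Fit the model A_g = C + sum_m F_m * N[g, m] by ordinary least squares.
X = np.column_stack([np.ones(n_genes), counts])
coef, *_ = np.linalg.lstsq(X, log_ratio, rcond=None)
print("intercept:", round(coef[0], 3))
print("fitted motif activities:", np.round(coef[1:], 3))
```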
- Gene-finding
[References: Burge paper on GENSCAN, Haussler review article, TWINSCAN paper]
- Different approaches to gene-finding: biological, database-driven, microarray techniques, computational gene-finding models
- Gene-finding in prokaryotes, ORFs
- GENSCAN model
- Hidden semi-Markov model
- State diagram for GENSCAN, how it models genomic sequence
- Prediction using "Viterbi parse"; see the toy Viterbi sketch at the end of this topic
- TWINSCAN model
- Use of orthologous sequences, the conservation sequence
- How TWINSCAN adds conservation information to the GENSCAN model
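For the "Viterbi parse" item, below is a toy Viterbi decoder for a two-state (intergenic vs. coding) HMM over DNA. GENSCAN's actual model is a hidden semi-Markov model with explicit length distributions and many more states (exons, introns, UTRs, splice and promoter signals, both strands); the states, parameters, and sequence here are invented purely to show how the most probable state path is computed.

```python
# Toy Viterbi parse for a two-state gene-finding HMM (all parameters invented;
# GENSCAN's real model is a hidden semi-Markov model with many more states).
import numpy as np

states = ["intergenic", "coding"]
pi = np.log(np.array([0.9, 0.1]))                      # initial state log-probabilities
A = np.log(np.array([[0.95, 0.05],                     # log transition probabilities
                     [0.10, 0.90]]))
# Emission log-probabilities over A, C, G, T: the toy "coding" state is GC-richer.
B = np.log(np.array([[0.30, 0.20, 0.20, 0.30],
                     [0.15, 0.35, 0.35, 0.15]]))
idx = {c: i for i, c in enumerate("ACGT")}

def viterbi(seq):
    """Return the most probable state path (the 'Viterbi parse') for seq."""
    obs = [idx[c] for c in seq]
    T, N = len(obs), len(states)
    V = np.full((T, N), -np.inf)                       # best log-prob of a path ending in state j at t
    ptr = np.zeros((T, N), dtype=int)                  # backpointers
    V[0] = pi + B[:, obs[0]]
    for t in range(1, T):
        scores = V[t - 1][:, None] + A                 # scores[i, j]: come from state i, move to j
        ptr[t] = scores.argmax(axis=0)
        V[t] = scores.max(axis=0) + B[:, obs[t]]
    path = [int(V[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(ptr[t][path[-1]]))
    path.reverse()
    return [states[s] for s in path]

seq = "ATATATGCGCGCCGCGATATAT"
labels = viterbi(seq)
print("".join("C" if s == "coding" else "." for s in labels))
print(seq)
```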
Not on the test: the last few topics from Lecture 13 will not be covered, i.e.
- String kernels for protein classification
- Whole genome comparative annotation (4 yeast species paper)