Lecture schedule from Spring 2002
Lecture 1
Most of the following slides are borrowed from Prof. Russ Altman (Stanford Medical Informatics):
- Online repositories for biological data
- Eukaryotic cell
- Lipid membrane
- Short protein segment
- Human chromosomes
- Packing DNA into chromosomes
- Universal genetic code
- Myoglobin example
- Yeast microarray
Lecture 2
Reading: Section 1.3 in Durbin. There is also good but terse material on probabilistic methods in Chapter 11 of the text -- see in particular Section 11.3 on inference.
Some background material and references for the splice site recognition problem (supplied for your interest only -- you aren't required to know details about splicing beyond what I present in class):
- Spliceosome diagram
- Training data for donor splice site
- Sanger Center splice site database
- M. Burset, I. A. Seledtsov, V. V. Solovyev (2000), "Analysis of canonical and non-canonical splice sites in mammalian genomes", Nucleic Acids Research, 28(21):4364-4375. (html)
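A minimal sketch (not part of the assigned reading) of the kind of probabilistic model used for donor splice site recognition: a position weight matrix scored as a log-odds ratio against a background model. The count table is purely illustrative; real counts would be estimated from aligned donor sites such as those in the training data linked above.

```python
# Illustrative PWM log-odds scoring of a candidate donor splice site.
import math

BASES = "ACGT"
# Hypothetical counts for a 4-position window around the exon/intron boundary
# (real donor sites have a nearly invariant GT at intron positions +1/+2).
counts = [
    {"A": 30, "C": 35, "G": 20, "T": 15},   # last exon position
    {"A": 2,  "C": 1,  "G": 95, "T": 2},    # intron +1 (almost always G)
    {"A": 1,  "C": 1,  "G": 2,  "T": 96},   # intron +2 (almost always T)
    {"A": 55, "C": 10, "G": 25, "T": 10},   # intron +3
]
background = {b: 0.25 for b in BASES}       # uniform background model

def log_odds_score(site):
    """Sum of per-position log-odds of the site model vs. background."""
    score = 0.0
    for pos, base in enumerate(site):
        total = sum(counts[pos].values()) + 4          # +1 pseudocount per base
        p = (counts[pos][base] + 1) / total
        score += math.log2(p / background[base])
    return score

print(log_odds_score("AGTA"))   # consensus-like site: high score
print(log_odds_score("ACCA"))   # non-site: low score
```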
Lecture 3
Reading: Durbin, Sections 2.1, 2.2 and 2.3 until the end of the subsection on global alignment (Needleman-Wunsch algorithm).
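A minimal sketch of the global alignment (Needleman-Wunsch) dynamic program with a linear gap penalty, following the recurrence in Section 2.3; the scoring values here are illustrative rather than the book's exact parameters, and traceback is omitted.

```python
# Global alignment score by dynamic programming (Needleman-Wunsch).
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    n, m = len(x), len(y)
    # F[i][j] = best score aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # (mis)match
                          F[i - 1][j] + gap,     # gap in y
                          F[i][j - 1] + gap)     # gap in x
    return F[n][m]

print(needleman_wunsch("HEAGAWGHEE", "PAWHEAE"))
```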
Lecture 4
Reading: Durbin, Section 2.3 until the end of the subsection on local alignment (Smith-Waterman). Also take a look at the affine gap penalty part of Section 2.4. We won't do every variant of pairwise alignment in class, but it's useful to see how many different versions there are.
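A minimal sketch of local alignment (Smith-Waterman) with a linear gap penalty; the affine-gap version in Section 2.4 adds separate gap-open and gap-extend penalties, which is omitted here for brevity.

```python
# Local alignment score by dynamic programming (Smith-Waterman).
def smith_waterman(x, y, match=2, mismatch=-1, gap=-2):
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,                       # start a new local alignment
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            best = max(best, F[i][j])
    return best

print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))
```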
Lecture 5
Reading: This lecture we'll finish up pairwise alignment. Read through Section 2.5 on heuristic alignment algorithms, Section 2.7 on significance of scores (the "classical approach" subsection is most important), and Section 2.8 on deriving score parameters from data. Some of the explanation is quite sketchy, and the links below provide clearer exposition (a short sketch of the E-value calculation follows the links). Also start in on Section 3.1 for Markov chains.
- Homepage for BLAST, a widely-used heuristic alignment program
- Statistical significance for (ungapped) local alignment scores
- BLAST results for human p53 protein query
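A minimal sketch of the "classical" significance calculation for ungapped local alignment scores (Karlin-Altschul statistics), E = K m n exp(-lambda S). The K and lambda values below are placeholders; real values depend on the scoring matrix and sequence composition.

```python
# E-value and P-value for an ungapped local alignment score.
import math

def evalue(score, m, n, K=0.1, lam=0.3):
    """Expected number of distinct local alignments with score >= `score`
    between a query of length m and a database of total length n."""
    return K * m * n * math.exp(-lam * score)

def pvalue(score, m, n, K=0.1, lam=0.3):
    """P-value from the extreme-value (Gumbel) distribution: P = 1 - exp(-E)."""
    return 1.0 - math.exp(-evalue(score, m, n, K, lam))

print(evalue(80, 400, 1_000_000))   # high score -> small E-value
print(pvalue(80, 400, 1_000_000))
```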
Lecture 6
Reading: Sections 3.1 and start of 3.2 on Markov chains and Hidden Markov Models for CpG island detection.
- CpG Island description and tagged sequences from the Sanger Center
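A minimal sketch of the Markov-chain discrimination from Section 3.1: score a window by the log-odds of its transitions under a CpG-island ("+") model versus a background ("-") model. The transition probabilities here are illustrative placeholders, not the estimates tabulated in the book; only the row for C is made to differ, since the depleted C-to-G transition outside islands is the key signal.

```python
# Log-odds scoring of a sequence under two first-order Markov chains.
import math

BASES = "ACGT"
plus  = {x: {y: 0.25 for y in BASES} for x in BASES}   # CpG-island model
minus = {x: {y: 0.25 for y in BASES} for x in BASES}   # background model
plus["C"]  = {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20}
minus["C"] = {"A": 0.32, "C": 0.30, "G": 0.08, "T": 0.30}

def cpg_log_odds(seq):
    """Sum over adjacent pairs of log2 P+(transition) / P-(transition)."""
    score = 0.0
    for a, b in zip(seq, seq[1:]):
        score += math.log2(plus[a][b] / minus[a][b])
    return score

print(cpg_log_odds("ACGCGCGT"))   # CG-rich: positive score
print(cpg_log_odds("ATTATAAT"))   # AT-rich: near zero
```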
Lecture 7
Reading: Continue with Section 3.2 on the Viterbi algorithm for Hidden Markov models.
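A minimal sketch of the Viterbi algorithm for a small two-state HMM (CpG island "+" vs. background "-"), using log probabilities; the parameters are illustrative, not the book's estimates.

```python
# Viterbi decoding: most probable state path for a sequence.
import math

states = ["+", "-"]
start  = {"+": 0.5, "-": 0.5}
trans  = {"+": {"+": 0.8, "-": 0.2}, "-": {"+": 0.1, "-": 0.9}}
emit   = {"+": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
          "-": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}}

def viterbi(seq):
    """Return the most probable state path for `seq`."""
    V = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    back = []
    for x in seq[1:]:
        V.append({})
        back.append({})
        for s in states:
            # best previous state for a transition into s
            prev = max(states, key=lambda p: V[-2][p] + math.log(trans[p][s]))
            V[-1][s] = V[-2][prev] + math.log(trans[prev][s]) + math.log(emit[s][x])
            back[-1][s] = prev
    # traceback from the best final state
    path = [max(states, key=lambda s: V[-1][s])]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return "".join(reversed(path))

print(viterbi("ATATCGCGCGATAT"))
```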
Lecture 8
Reading: Section 3.2 on posterior decoding (the forward and backward algorithms).
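A minimal sketch of posterior decoding with the forward and backward algorithms, using the same illustrative two-state model as the Viterbi sketch above. No scaling is done, so this is only suitable for short sequences (scaling is covered at the end of Chapter 3).

```python
# Forward/backward algorithms and per-position posterior state probabilities.
states = ["+", "-"]
start  = {"+": 0.5, "-": 0.5}
trans  = {"+": {"+": 0.8, "-": 0.2}, "-": {"+": 0.1, "-": 0.9}}
emit   = {"+": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
          "-": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}}

def forward(seq):
    f = [{s: start[s] * emit[s][seq[0]] for s in states}]
    for x in seq[1:]:
        f.append({s: emit[s][x] * sum(f[-1][p] * trans[p][s] for p in states)
                  for s in states})
    return f

def backward(seq):
    b = [{s: 1.0 for s in states}]
    for x in reversed(seq[1:]):
        b.insert(0, {s: sum(trans[s][p] * emit[p][x] * b[0][p] for p in states)
                     for s in states})
    return b

def posterior(seq):
    """P(state at position i = '+' | sequence), for each position i."""
    f, b = forward(seq), backward(seq)
    px = sum(f[-1][s] for s in states)      # total probability of the sequence
    return [f[i]["+"] * b[i]["+"] / px for i in range(len(seq))]

print([round(p, 2) for p in posterior("ATATCGCGCGATAT")])
```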
Lecture 9
Reading: Training HMMs -- the parameter estimation problem (Section 3.3). Maximum likelihood estimation when (1) states for the training data are known and (2) states for the training data are unknown (Expectation Maximization). Also read about scaling probabilities for the forward/backward algorithms in the last section of Chapter 3.
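A minimal sketch of case (1): when the state path of each training sequence is known, the maximum likelihood estimates come from counting transitions and emissions and normalizing, optionally with pseudocounts as in Section 3.3. The tiny training set below is illustrative.

```python
# ML estimation of HMM parameters from labeled (sequence, state path) pairs.
from collections import defaultdict

def estimate(training_pairs, states, alphabet, pseudocount=1.0):
    """training_pairs: list of (sequence, state_path) of equal lengths."""
    A = {s: defaultdict(lambda: pseudocount) for s in states}   # transition counts
    E = {s: defaultdict(lambda: pseudocount) for s in states}   # emission counts
    for seq, path in training_pairs:
        for x, s in zip(seq, path):
            E[s][x] += 1
        for s, t in zip(path, path[1:]):
            A[s][t] += 1
    trans = {s: {t: A[s][t] / sum(A[s][u] for u in states) for t in states}
             for s in states}
    emit = {s: {x: E[s][x] / sum(E[s][y] for y in alphabet) for x in alphabet}
            for s in states}
    return trans, emit

data = [("CGCGATAT", "++++----"), ("ATCGCG", "--++++")]
trans, emit = estimate(data, states="+-", alphabet="ACGT")
print(trans)
print(emit)
```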
Lecture 10
Reading: Section 3.3 on the Baum-Welch algorithm. We'll finally cover the learning algorithm (a special case of Expectation Maximization) used to train the parameters of an HMM when the state sequence for the training data is unknown. If there's time, we'll start Chapter 4 on pair HMMs, used to produce alignments.
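A minimal sketch of one Baum-Welch (EM) re-estimation step for a two-state HMM over DNA, without scaling (so fine only for short sequences). The updates follow Section 3.3: expected transition and emission counts computed from forward/backward quantities, then renormalized. The initial parameters are placeholders; the initial distribution is not re-estimated here.

```python
# One Baum-Welch iteration: E-step (forward/backward) + M-step (renormalize).
import numpy as np

alphabet = "ACGT"
K = 2                                    # number of hidden states
a = np.array([[0.8, 0.2], [0.1, 0.9]])   # transition matrix (initial guess)
e = np.array([[0.15, 0.35, 0.35, 0.15],
              [0.30, 0.20, 0.20, 0.30]]) # emission matrix (initial guess)
pi = np.array([0.5, 0.5])                # initial state distribution

def baum_welch_step(seqs, a, e, pi):
    A = np.ones_like(a)                  # pseudocounts
    E = np.ones_like(e)
    for seq in seqs:
        x = [alphabet.index(c) for c in seq]
        L = len(x)
        f = np.zeros((L, K)); b = np.zeros((L, K))
        f[0] = pi * e[:, x[0]]
        for t in range(1, L):            # forward recursion
            f[t] = e[:, x[t]] * (f[t - 1] @ a)
        b[-1] = 1.0
        for t in range(L - 2, -1, -1):   # backward recursion
            b[t] = a @ (e[:, x[t + 1]] * b[t + 1])
        px = f[-1].sum()                 # P(sequence | current model)
        for t in range(L - 1):           # expected transition counts
            A += f[t][:, None] * a * (e[:, x[t + 1]] * b[t + 1])[None, :] / px
        for t in range(L):               # expected emission counts
            E[:, x[t]] += f[t] * b[t] / px
    return A / A.sum(axis=1, keepdims=True), E / E.sum(axis=1, keepdims=True)

a_new, e_new = baum_welch_step(["ACGCGCGT", "ATATATCG"], a, e, pi)
print(a_new)
print(e_new)
```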
Lecture 11
Reading: Chapter 4 on pair HMMs, used to produce alignments. If you want to read more about Expectation Maximization in general and the Baum-Welch algorithm in particular (material from last time), you can check out Chapter 11 in the text (beware of typos in equations).
Lecture 12
Reading: Chapter 5 on profile HMMs for modelling protein families (a short sketch of match-state estimation follows the links below).
- Pfam database of multiple alignments and the HMMER hidden Markov Model package
- SAM profile HMM package, developed at UC Santa Cruz
- SCOP database, a hierarchical classification of proteins based on structure
- Protein Data Bank (PDB), repository of structural data for proteins
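A minimal sketch of one piece of profile HMM construction from Chapter 5: estimating match-state emission probabilities from the columns of a multiple alignment, with pseudocounts, treating mostly-gapped columns as insert columns. The tiny alignment is purely illustrative.

```python
# Match-state emission probabilities from a toy multiple alignment.
AMINO = "ACDEFGHIKLMNPQRSTVWY"

alignment = [
    "VGA-HAGEY",
    "VNV-DEVGG",
    "VEA-DVAGH",
    "VKG------",
    "VYS-TYETS",
]

def match_emissions(alignment, pseudocount=1.0):
    """For each column kept as a match state, return emission probabilities."""
    ncol = len(alignment[0])
    profile = []
    for j in range(ncol):
        column = [row[j] for row in alignment if row[j] != "-"]
        if len(column) < len(alignment) / 2:    # mostly gaps: treat as insert column
            continue
        total = len(column) + pseudocount * len(AMINO)
        probs = {aa: (column.count(aa) + pseudocount) / total for aa in AMINO}
        profile.append(probs)
    return profile

profile = match_emissions(alignment)
print(len(profile), "match states")
print(sorted(profile[0].items(), key=lambda kv: -kv[1])[:3])   # top residues, column 1
```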
Lecture 13
First, to help you get groups together for the project -- please copy and fill out the HTML information sheet template with information about yourself, your background, and your project interests. Either post the template yourself and mail the URL to Ilan, or send the HTML file to Ilan and he'll post it. Please do this sometime in the next week. I would like to have information for everyone in the class posted before Spring Break.
In this lecture, we'll do a brief introduction to gene expression data and machine learning approaches to classification and clustering problems for vector-valued data -- this lecture is preparation for the guest lecture by Dr. Paul Pavlidis next time. This material is not in the text -- I'll work on finding and posting some good reference material for these topics. In the meantime, some of the links below provide background and pictures.
Lecture 14
Guest lecture by Dr. Paul Pavlidis, head of the Gene Expression Informatics Group at the Columbia Genome Center.
- Guest lecture by Paul Pavlidis in PowerPoint format
Lecture 15
In this lecture, we'll discuss two clustering algorithms, hierarchical clustering and K-means (a short K-means sketch follows the links below).
- Hierarchical clustering of genes by their expression profiles across time series experiments
- Stanford Microarray Database
- Inferring regulatory networks from gene expression data -- here is an inferred Bayes net that is similar to the mating response subnetwork in yeast from the paper Inferring Subnetworks from Perturbed Expression Profiles, by Dana Pe'er, Aviv Regev, Gal Elidan, Nir Friedman.
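A minimal sketch of K-means on expression profiles (rows = genes, columns = experiments); hierarchical clustering would instead repeatedly merge the two closest clusters. The data here are random placeholders, not a real expression matrix.

```python
# K-means clustering of gene expression profiles.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iter):
        # assign each profile to its nearest center (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned profiles
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.random.default_rng(1).normal(size=(100, 8))   # 100 genes x 8 experiments
labels, centers = kmeans(X, k=3)
print(np.bincount(labels))   # cluster sizes
```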
Lecture 16
Midterm test.
Lecture 17
We'll start giving more details on support vector machines and kernel methods (for classification problems).
- Golub paper on analysis of gene expression data set for leukemia
- Tutorial on SVMs by Chris Burges (pdf, ps) -- some reference material for the SVM optimization problems that we'll outline in class
Lecture 18
Presentation of the SVM hard margin ("maximal margin") classifier.
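A minimal sketch (not the lecture's own code): the hard-margin classifier can be approximated with a linear soft-margin SVM whose C parameter is very large, so margin violations are effectively forbidden. This uses scikit-learn's SVC; the toy data are well separated by construction.

```python
# Approximate hard-margin SVM via a linear SVC with a huge C.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+3.0, size=(20, 2))
X_neg = rng.normal(loc=-3.0, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6)   # huge C ~ hard margin
clf.fit(X, y)
print("support vectors:", clf.support_vectors_.shape[0])
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
```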
Lecture 19
Soft-margin SVM classifiers, kernels, and feature selection.
- A good reference for this material is the book "An Introduction to Support Vector Machines" by Nello Cristianini and John Shawe-Taylor
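A minimal sketch of the soft-margin and kernel ideas: the same scikit-learn SVC, now with a moderate C (allowing margin violations) and an RBF kernel so the decision boundary can be nonlinear. The "two concentric rings" data are a standard toy example where a linear kernel fails.

```python
# Linear vs. RBF kernel on data that is not linearly separable.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

linear = SVC(kernel="linear", C=1.0)
rbf = SVC(kernel="rbf", C=1.0, gamma=2.0)

print("linear kernel accuracy:", cross_val_score(linear, X, y, cv=5).mean())
print("RBF kernel accuracy:   ", cross_val_score(rbf, X, y, cv=5).mean())
```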
Lecture 20
We'll discuss kernels for SVMs in general, and in particular the paper on the Fisher-SVM approach to remote homology detection.
- A discriminative framework for detecting remote protein homologies by Jaakkola, Diekhans, and Haussler
Lecture 21
We'll finish up with standard kernels and operations on kernels, and we'll discuss principal component analysis (PCA) -- a standard dimension reduction technique -- and kernel PCA.
- Empirical study of PCA discussing challenges of using PCA for visualization and clustering
- Kernel PCA reference -- we didn't cover this in class, but you can read it as an explanation of ordinary PCA also
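A minimal sketch of ordinary PCA via the singular value decomposition: center the expression matrix, take the top right-singular vectors as principal axes, and project onto them. The data here are random placeholders (genes x experiments).

```python
# PCA of an expression matrix via SVD.
import numpy as np

def pca(X, n_components=2):
    Xc = X - X.mean(axis=0)                     # center each column (experiment)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T           # projection onto the top axes
    var_ratio = (S ** 2) / (S ** 2).sum()       # fraction of variance per axis
    return scores, Vt[:n_components], var_ratio[:n_components]

X = np.random.default_rng(0).normal(size=(100, 8))
scores, axes, var_ratio = pca(X)
print(scores.shape)       # (100, 2) -- each gene as a 2-D point
print(var_ratio)          # fraction of total variance captured by each component
```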
Lecture 22
Introduction to Bayes nets for inferring regulatory networks from gene expression data. Overview of the following three papers:
- Inferring regulatory subnetworks [Pe'er et al.]
- Statistical validation of network models [Hartemink et al.]
- Dynamic Bayes net for time series expression data in E. coli [Ong et al.]
Lecture 23
We'll start discussing the use of Bayes nets in the above three papers in more detail, starting with the Pe'er paper.
Lecture 24
Finish discussion of Pe'er paper and Ong paper on dynamic Bayes nets.
Lecture 25
Introduction to computational gene-finding for eukaryotes (in particular, vertebrates and humans). The main reference is Chris Burge's paper on GENSCAN, one of the best-known gene-finding programs. The second reference is David Haussler's review article on computational gene-finding.
- Burge paper on GENSCAN
- Haussler review article
- Typical human gene structure
- Detail of transcription, splicing, translation
- Detail on intron/exon structure
- State model for GENSCAN
- Length distributions
Lecture 26
More details on GENSCAN and a quick discussion of TWINSCAN, a new gene-finding algorithm that uses both the GENSCAN model and a model of conservation across two organisms to improve prediction.
Lecture 27
For the last week, we'll discuss approaches to computational signal finding. In this lecture, we'll discuss MEME, a popular motif discovery algorithm based on expectation maximization.
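A stripped-down sketch of the EM idea behind MEME in the simplest one-occurrence-per-sequence setting: the E-step computes a posterior over motif start positions in each sequence, and the M-step re-estimates the motif's position weight matrix from those posteriors. This is not the MEME program itself, and the toy sequences are illustrative.

```python
# EM for a one-occurrence-per-sequence motif model.
import numpy as np

BASES = "ACGT"
seqs = ["TTTACGTAAA", "GGACGTTTTT", "CCCCTACGTG", "ACGTGGGGGG"]
W = 4                                          # motif width
bg = np.full(4, 0.25)                          # uniform background

def em(seqs, W, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.ones(4), size=W)  # motif PWM: W x 4, rows sum to 1
    X = [np.array([BASES.index(c) for c in s]) for s in seqs]
    for _ in range(n_iter):
        counts = np.full((W, 4), 0.1)          # pseudocounts
        for x in X:
            starts = len(x) - W + 1
            # E-step: relative likelihood of a motif occurrence at each start
            lik = np.array([np.prod([theta[j, x[i + j]] / bg[x[i + j]]
                                     for j in range(W)]) for i in range(starts)])
            post = lik / lik.sum()
            # M-step contribution: expected base counts at each motif column
            for i, p in enumerate(post):
                for j in range(W):
                    counts[j, x[i + j]] += p
        theta = counts / counts.sum(axis=1, keepdims=True)
    return theta

theta = em(seqs, W)
print("consensus:", "".join(BASES[j] for j in theta.argmax(axis=1)))  # likely ACGT
```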
Lecture 28
More computational approaches to motif discovery.