For a tentative schedule, see last year's lecture page.
Lecture Schedule
CBMF 4761
Department of Computer Science
Spring Semester, 2003
Lecture 1 (Tues Jan 21)
Most of the following slides are borrowed from Prof. Russ Altman (Stanford Medical Informatics):
- Eukaryotic cell
- Lipid membrane
- Short protein segment
- Human chromosomes
- Packing DNA into chromosomes
- Universal genetic code
- Myoglobin example
- Schematic of DNA and proteins
- Yeast microarray
Online repositories for biological data:
- Genbank
- Structural Classification of Proteins
- Swiss-Prot
- Protein Data Bank
- Stanford Microarray Database
Lecture 2 (Thurs Jan 23)
Reading: Section 1.3 in Durbin. There is also good but terse material on probabilistic methods in Chapter 11 of the text -- see in particular Section 11.3 on inference.
Some background material and references for the splice site recognition problem (supplied for your interest only -- you aren't required to know details about splicing beyond what I present in class); a small weight-matrix scoring sketch follows the links below:
- Spliceosome diagram
- Training data for donor splice site
- Sanger Center splice site database
- M. Burset, I. A. Seledtsov, V. V. Solovyev (2000), "Analysis of canonical and non-canonical splice sites in mammalian genomes", Nucleic Acids Research, 28(21):4364-4375. (html)
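To make the probabilistic scoring idea concrete, here is a minimal weight-matrix (position-specific scoring) sketch for a donor-site-like signal. The per-position probabilities are invented toy numbers, not values from the training data linked above, and the function name is mine.

```python
import math

# Toy per-position nucleotide probabilities for a 4-position donor-site-like signal
# (invented numbers; real values would be estimated from aligned training sites).
donor_probs = [
    {"A": 0.30, "C": 0.30, "G": 0.30, "T": 0.10},   # last exonic position
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},   # intron +1, nearly always G
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},   # intron +2, nearly always T
    {"A": 0.60, "C": 0.10, "G": 0.20, "T": 0.10},   # intron +3
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def log_odds_score(site):
    """Sum of per-position log-odds (site model vs. background) for a candidate site."""
    return sum(math.log(donor_probs[i][base] / background[base]) for i, base in enumerate(site))

print(log_odds_score("AGTA"))   # consensus-like candidate scores high
print(log_odds_score("CCCC"))   # non-site sequence scores low
```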
Lecture 3 (Tues Jan 28)
Reading: Durbin, Sections 2.1, 2.2 and 2.3 until the end of the subsection on global alignment (Needleman-Wunsch algorithm).
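If it helps to see the global-alignment recurrence as code, here is a minimal sketch of the Needleman-Wunsch fill step with a linear gap penalty; the score values are placeholders and the traceback is omitted.

```python
def needleman_wunsch_score(x, y, match=1, mismatch=-1, gap=-2):
    """Fill the DP matrix F and return the optimal global alignment score F[len(x)][len(y)]."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x[i] with y[j]
                          F[i - 1][j] + gap,     # gap in y
                          F[i][j - 1] + gap)     # gap in x
    return F[n][m]

print(needleman_wunsch_score("HEAGAWGHEE", "PAWHEAE"))
```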
Lecture 4 (Thurs Jan 30)
Reading: Durbin, Section 2.3 until the end of the subsection on local alignment (Smith-Waterman). Also take a look at the affine gap penalty part of Section 2.4. We won't do every variant of pairwise alignment in class, but it's useful to see how many different versions there are.
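Relative to the global fill sketched under Lecture 3, the local (Smith-Waterman) version changes only two things: cells are clamped at zero, and the answer is the best cell anywhere in the matrix. Again a minimal sketch with placeholder scores and no traceback.

```python
def smith_waterman_score(x, y, match=1, mismatch=-1, gap=-2):
    """Best local alignment score under linear gap penalties (traceback omitted)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,                      # option to start a fresh local alignment
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            best = max(best, F[i][j])
    return best

print(smith_waterman_score("HEAGAWGHEE", "PAWHEAE"))
```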
Lecture 5 (Tues Feb 4)
Reading: Read through Section 2.5 on heuristic alignment algorithms, Section 2.7 on significance of scores (the "classical approach" subsection is most important), and Section 2.8 on deriving score parameters from data. Some of the explanation is quite sketchy, and the links below provide clearer exposition. Also start in on Section 3.1 for Markov chains.
- Homepage for BLAST, a widely-used heuristic alignment program
- Statistical significance for (ungapped) local alignment scores
- BLAST results for human p53 protein query
- Reference on PSI-BLAST
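For the "classical approach" of Section 2.7: the number of ungapped local alignments scoring at least S is approximately Poisson with mean E = K * m * n * exp(-lambda * S). Here is a tiny sketch; in practice K and lambda must be computed from the score matrix and residue frequencies, so the values below are placeholders.

```python
import math

def evalue(S, m, n, K, lam):
    """Expected number of ungapped local alignments with score >= S (Karlin-Altschul)."""
    return K * m * n * math.exp(-lam * S)

def pvalue(S, m, n, K, lam):
    """P(at least one hit scoring >= S), treating the number of hits as Poisson."""
    return 1.0 - math.exp(-evalue(S, m, n, K, lam))

# Placeholder parameters, for illustration only.
print(evalue(S=40, m=250, n=1000000, K=0.1, lam=0.3))
```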
Lecture 6 (Thurs Feb 6)
Reading: Sections 3.1 and start of 3.2 on Markov chains and Hidden Markov Models for CpG island detection.
- CpG Island description and tagged sequences from the Sanger Center
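A minimal sketch of the two-Markov-chain log-odds score from Section 3.1. The transition probabilities below are rough placeholders in the spirit of the table in the text, not the exact published values.

```python
import math

# Placeholder first-order transition probabilities (each row sums to 1).
plus = {  # "inside CpG island" model
    "A": {"A": 0.18, "C": 0.27, "G": 0.43, "T": 0.12},
    "C": {"A": 0.17, "C": 0.37, "G": 0.27, "T": 0.19},
    "G": {"A": 0.16, "C": 0.34, "G": 0.38, "T": 0.12},
    "T": {"A": 0.08, "C": 0.36, "G": 0.38, "T": 0.18},
}
minus = {  # "outside CpG island" model; note the suppressed C->G transition
    "A": {"A": 0.30, "C": 0.20, "G": 0.29, "T": 0.21},
    "C": {"A": 0.32, "C": 0.30, "G": 0.08, "T": 0.30},
    "G": {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20},
    "T": {"A": 0.18, "C": 0.24, "G": 0.29, "T": 0.29},
}

def cpg_log_odds(seq):
    """log P(seq | +) - log P(seq | -), summed over adjacent-pair transitions."""
    return sum(math.log(plus[a][b] / minus[a][b]) for a, b in zip(seq, seq[1:]))

print(cpg_log_odds("CGCGCGCG"))   # comes out positive (island-like)
print(cpg_log_odds("ATATTTAA"))   # comes out negative
```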
Lecture 7 (Tues Feb 11)
Reading: Continue with Section 3.2 on the Viterbi algorithm for Hidden Markov Models.
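If it helps to see the recursion in code, here is a minimal log-space Viterbi sketch for a generic discrete HMM (not tied to the CpG-island model). The two-state fair/loaded coin example at the end is invented just to exercise the function.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for an observation sequence (log-space Viterbi)."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: V[t - 1][r] + math.log(trans_p[r][s]))
            V[t][s] = V[t - 1][prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy two-state example (fair vs. loaded coin), invented just to exercise the function.
states = ("F", "L")
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.9, "T": 0.1}}
print(viterbi("HHHHHHHT", states, start, trans, emit))
```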
Lecture 8 (Thurs Feb 13)
Reading: Section 3.2 on posterior decoding (the forward and backward algorithms). We'll fully cover the Viterbi algorithm and hopefully finish posterior decoding also.
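A minimal sketch of the forward recursion in plain probability space; it underflows on long sequences, which is why the scaling tricks at the end of Chapter 3 matter. The backward recursion is the mirror image, and the posterior probability of being in state k at time t is forward times backward, divided by P(obs).

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm: f[t][s] = P(obs[0..t], state at t = s).

    Returns the full table and P(obs), which is the sum of the last column."""
    f = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for t in range(1, len(obs)):
        f.append({s: emit_p[s][obs[t]] * sum(f[t - 1][r] * trans_p[r][s] for r in states)
                  for s in states})
    return f, sum(f[-1].values())
```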
Lecture 9 (Tues Feb 18)
Reading: Training HMMs -- the parameter estimation problem (Section 3.3). We'll finish discussion of the forward/backward algorithms for posterior decoding. We'll also discuss maximum likelihood estimation when (1) states for the training data are known and (2) states for the training data are unknown (Expectation Maximization). Also read about scaling probabilities for the forward/backward algorithms in the last section of Chapter 3.
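For case (1), where the state paths of the training data are known, maximum likelihood estimation of the transition probabilities is just normalized counting; here is a minimal sketch with add-one pseudocounts to avoid zero probabilities (the toy paths are invented).

```python
def estimate_transitions(state_paths, states, pseudocount=1.0):
    """Maximum likelihood estimate (with pseudocounts) of transition probabilities
    from training paths whose states are known."""
    counts = {a: {b: pseudocount for b in states} for a in states}
    for path in state_paths:
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in states} for a in states}

print(estimate_transitions(["FFFFLLF", "FFLLL"], states=("F", "L")))
```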
Lecture 10 (Thurs Feb 20)
Reading: Section 3.3 on the Baum-Welch algorithm. We'll finish discussing the learning algorithm (a special case of Expectation Maximization) used to train the parameters of an HMM when the state sequence for the training data is unknown. If there's time, we'll briefly go over Chapter 4 on pair HMMs, used to produce alignments.
Lecture 11 (Tues Feb 25)
Reading: Chapter 4 on pair HMMs, used to produce alignments. If you want to read more about Expectation Maximization in general and the Baum-Welch algorithm in particular (material from last time), you can check out Chapter 11 in the text (beware of typos in equations).
Lecture 12 (Thurs Feb 27)
Reading: Chapter 5 on profile HMMs for modeling protein families.
- Pfam database of multiple alignments and the HMMER hidden Markov Model package
- SAM profile HMM package, developed at UC Santa Cruz
- SCOP database, a hierarchical classification of proteins based on structure
- Protein Databank (PDB), repository of structural data for proteins
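A minimal sketch of one ingredient of a profile HMM: estimating match-state emission probabilities column by column from a multiple alignment, here with simple add-one pseudocounts rather than the Dirichlet-mixture priors we'll discuss next lecture. The toy alignment columns are invented.

```python
def match_emissions(alignment_columns, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1.0):
    """Per-column emission probabilities for the match states of a profile HMM."""
    emissions = []
    for column in alignment_columns:
        counts = {a: pseudocount for a in alphabet}
        for residue in column:
            if residue != "-":            # gaps don't contribute to match emissions
                counts[residue] += 1
        total = sum(counts.values())
        emissions.append({a: counts[a] / total for a in alphabet})
    return emissions

# Toy 3-column alignment (rows are sequences, shown here already split into columns).
columns = ["VVVI", "GGA-", "LLLL"]
for probs in match_emissions(columns):
    print(max(probs, key=probs.get), round(max(probs.values()), 2))
```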
Lecture 13 (Tues March 4)
We'll finish profile HMMs with a discussion of the sophisticated prior distributions, i.e. mixtures of Dirichlet priors, used in parameter estimation for these models. We'll also do a quick tour of the secondary structure prediction problem:
- Alpha helix schematic and detailed example
- Beta strand schematic and detailed example
- Neural net for secondary structure prediction and full network topology
- Example of predictions
- Reference: HMMSTR model, combining local structure PSSMs with HMM topology
- Reference: Bayesian segmentation, HMM-based model for secondary structure prediction
Lecture 14 (Thurs March 6)
In this lecture, we'll give an introduction to microarray technology, gene expression data, and an overview of some of the main learning problems of interest for this data: classification of samples, clustering, and inference of regulatory relationships.
- Gene expression -- central dogma
- Spotted cDNA microarray
- Yeast microarray image
- Oligonucleotide and cDNA arrays
- Clustering for visualization and class discovery
- Stanford Microarray Database
Lecture 15 (Tues March 11)
First, to help you get groups together for the project -- please post an information web page about yourself and your project interests. Please see the project guidelines page for required information. Send the URL to Omar (osa2001@cs.columbia.edu), or send him the HTML page itself if you prefer. Please do this sometime during the week. I would like to have information for everyone in the class posted before Spring Break.
In this lecture, we'll discuss clustering algorithms for gene expression data, such as hierarchical clustering and K-means (a small K-means sketch follows the links below). We'll also touch on some other learning problems in functional genomics, including regulatory network inference.
- Hierarchical clustering of genes by their expression profiles across time series experiments
- Stanford Microarray Database
- Inferring regulatory networks from gene expression data -- here is an inferred Bayes net that is similar to the mating response subnetwork in yeast from the paper Inferring Subnetworks from Perturbed Expression Profiles, by Dana Pe'er, Aviv Regev, Gal Elidan, Nir Friedman.
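A minimal K-means sketch for expression profiles (each gene is a vector of log-ratios across experiments); the random initialization and empty-cluster handling are deliberately naive, and the toy data at the end is invented.

```python
import random

def kmeans(profiles, k, iterations=100):
    """Cluster expression profiles (lists of floats) into k groups by Euclidean distance."""
    centers = random.sample(profiles, k)
    for _ in range(iterations):
        # Assignment step: put each profile in the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in profiles:
            nearest = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster (empty clusters keep the old center).
        centers = [
            [sum(vals) / len(cluster) for vals in zip(*cluster)] if cluster else centers[c]
            for c, cluster in enumerate(clusters)
        ]
    return clusters, centers

# Toy expression profiles (4 genes, 3 experiments), invented for illustration.
genes = [[0.1, 0.2, 0.1], [2.0, 2.1, 1.9], [0.0, 0.1, 0.2], [2.2, 1.8, 2.0]]
clusters, centers = kmeans(genes, k=2)
print(centers)
```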
Lecture 16 (Thurs March 13)
Midterm test
Lecture 17 (Tues March 25)
We'll talk about approaches to the classification problem -- including k-nearest neighbor and Fisher's linear discriminant -- and introduce support vector machines and kernel methods.
- Golub paper on analysis of gene expression data set for leukemia
- Tutorial on SVMs by Chris Burges (pdf, ps) -- some reference material for the SVM optimization problems that we'll outline in class
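A minimal k-nearest-neighbor sketch (Euclidean distance, majority vote over the k closest training profiles); this is the generic method, not the exact procedure used in the Golub paper.

```python
from collections import Counter

def knn_classify(train_profiles, train_labels, query, k=3):
    """Label a query expression profile by majority vote among its k nearest training profiles."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_profiles, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```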
Lecture 18 (Thurs March 27)
Presentation of the SVM hard margin ("maximal margin") classifier, slack variable idea for soft margin SVMs.
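One way to see the slack-variable idea: at the optimum each slack equals the hinge loss max(0, 1 - y(w.x + b)), so the soft-margin primal objective can be written down and evaluated directly. A minimal sketch of the objective only; the optimization itself is not shown.

```python
def soft_margin_objective(w, b, X, y, C=1.0):
    """(1/2)||w||^2 + C * sum of slacks, where slack_i = max(0, 1 - y_i (w.x_i + b))."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slacks = [max(0.0, 1.0 - yi * (sum(wi * xi for wi, xi in zip(w, x)) + b))
              for x, yi in zip(X, y)]
    return margin_term + C * sum(slacks)
```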
Lecture 19 (Tues April 1)
We'll quickly finish soft-margin SVM classifiers, kernels, and feature selection. We'll also start a discussion of transcriptional regulation, in preparation for talking about Bayes nets and other models for learning regulatory networks.
- A good reference for the SVM material is the book "An Introduction to Support Vector Machines" by Nello Cristianini and John Shawe-Taylor
Lecture 20 (Thurs April 3)
We'll present an overview of Bayes nets for inferring regulatory networks and start discussing the papers listed below.
- Here are some pictures to illustrate binding of transcription factors to promoter regions and the effect on transcription: general picture of TFs and RNA polymerase; model of human b-globin gene regulation; tryptophan-regulated repressor and picture in E. coli
- Statistical validation of network models [Hartemink et al.]
- Inferring regulatory subnetworks [Pe'er et al.]
- Minreg: Inferring an active regulator set [Pe'er et al.]
Lecture 21/Lecture 22 (Tues April 8)
We'll have back-to-back lectures today, giving details of the three papers introduced in the last lecture for using Bayes nets to learn regulatory networks. In particular, we'll talk about the Bayesian score for scoring network structures and several approaches for learning structures. Note that there will be no new lecture on Thursday, April 10, but CVN has agreed to show the video of Lecture 22 during the regular class time on Thursday.
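As a sketch of what the Bayesian score means for one node and its parents in a discrete network: with Dirichlet priors, the marginal likelihood of that node's data has a closed form in Gamma functions. The uniform pseudocount below is a simplifying assumption; the papers use particular BDe prior choices.

```python
from math import lgamma

def family_bde_score(counts, alpha=1.0):
    """Log marginal likelihood contribution of one node given its parents.

    counts[j][k] = number of training cases with parent configuration j and node state k;
    alpha is a uniform Dirichlet pseudocount per (configuration, state) cell."""
    score = 0.0
    for row in counts:
        a_j = alpha * len(row)            # total pseudocount for this parent configuration
        n_j = sum(row)                    # total observed count for this parent configuration
        score += lgamma(a_j) - lgamma(a_j + n_j)
        score += sum(lgamma(alpha + n) - lgamma(alpha) for n in row)
    return score

# Toy example: a binary node with one binary parent (two parent configurations).
print(family_bde_score([[8, 2], [1, 9]]))
```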
Lecture 23 (Tues April 15)
We'll have a guest lecture by Prof. Harmen Bussemaker from the Biology Department, who will talk about the REDUCE algorithm, which detects regulatory elements (motifs) from promoter sequences via correlation with gene expression.
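To make "correlation with gene expression" concrete, here is a minimal sketch that scores a candidate motif by the Pearson correlation, across genes, between the motif's count in each promoter and that gene's expression log-ratio. This captures only the flavor of the approach; as I understand it, REDUCE itself goes further and fits a linear model over multiple motifs.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def motif_expression_correlation(promoters, log_ratios, motif):
    """Correlation across genes between motif counts in promoters and expression log-ratios."""
    counts = [promoter.count(motif) for promoter in promoters]
    return pearson(counts, log_ratios)

# Invented toy data: four promoters and their expression log-ratios.
promoters = ["TTACGCGTAA", "ACGCGTACGCGT", "TTTTTTTTTT", "ACGCGTTTTT"]
log_ratios = [1.1, 2.0, -0.5, 0.9]
print(motif_expression_correlation(promoters, log_ratios, "ACGCGT"))
```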
Lecture 24 (Thurs April 17)
We'll discuss some approaches to motif discovery, also called computational signal finding. In particular, we'll cover MEME, a popular motif discovery algorithm based on expectation maximization.
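A minimal sketch of the E-step for the simplest (one-occurrence-per-sequence) motif model: given the current motif weight matrix and a 0th-order background, compute the posterior over start positions in one sequence. The M-step, which re-estimates the weight matrix from these posteriors, is omitted, and MEME itself handles richer models; the toy matrix below is invented.

```python
import math

def start_position_posteriors(seq, pwm, background):
    """E-step sketch (one occurrence per sequence): posterior probability of each
    motif start position, given the current motif PWM and a background model."""
    W = len(pwm)
    weights = []
    for i in range(len(seq) - W + 1):
        # Likelihood ratio of motif vs. background over the W positions of this window.
        log_ratio = sum(math.log(pwm[j][seq[i + j]] / background[seq[i + j]]) for j in range(W))
        weights.append(math.exp(log_ratio))
    total = sum(weights)
    return [w / total for w in weights]

# Toy width-2 motif (roughly "AG") and uniform background, invented for illustration.
pwm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
       {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(start_position_posteriors("TTAGTT", pwm, bg))   # posterior mass concentrates on index 2
```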
Lecture 25 (Tues April 22)
Introduction to computational gene-finding for eukaryotes (in particular, vertebrates and humans). The main reference is Chris Burge's paper on GENSCAN, one of the best-known gene-finding programs. The second reference is David Haussler's review article on computational gene-finding.
- Burge paper on GENSCAN
- Haussler review article
- Typical human gene structure
- Detail of transcription, splicing, translation
- Detail on intron/exon structure
- State model for GENSCAN
- Length distributions
Lecture 26 (Thurs April 24)
We'll do a quick discussion of TWINSCAN, a new gene-finding algorithm that uses both the GENSCAN model and a model of conservation across two organisms to improve prediction. We'll also talk about a new comparative genomics paper from Eric Lander's group (the computational companion paper to an upcoming Nature paper), which used comparative annotation of four species of yeast to do regulatory motif discovery.
- TWINSCAN paper
- Comparative annotation paper (from RECOMB 2003 Proceedings)
Lecture 27 (Tues April 29)
We'll talk about protein classification and remote homology detection, one of the central problems in computational biology.
- A discriminative framework for detecting remote protein homologies by Jaakkola, Diekhans, and Haussler
- Mismatch kernels for discriminative protein classification by Leslie, Eskin, Weston, and Noble.
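As a concrete starting point for the kernel idea in the second paper, here is the exact-match k-spectrum kernel (the m = 0 special case of the mismatch kernel): the inner product of the two sequences' k-mer count vectors. The example sequences are invented.

```python
from collections import Counter

def spectrum_features(seq, k=3):
    """Counts of all k-mers in a sequence (the k-spectrum feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k=3):
    """Inner product of the two k-mer count vectors."""
    fx, fy = spectrum_features(x, k), spectrum_features(y, k)
    return sum(fx[kmer] * fy[kmer] for kmer in fx if kmer in fy)

print(spectrum_kernel("MKVLAAGV", "KVLAGGVM", k=3))   # counts shared 3-mers (KVL, VLA)
```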
Lecture 28 (Thurs May 1)
Final test