Course description
In this course, we explore new computational approaches for studying genomic data, including biological sequence data and gene expression ("gene chip") data. Our focus is on machine learning techniques: probabilistic models such as hidden Markov models, learning algorithms like support vector machines for classification problems, clustering algorithms, and Bayesian networks for inferring regulatory networks.
Instructor: Prof. Christina Leslie
Phone: (212) 939-7043
Office: 466 Mudd
Office Hours: Wed 4-6pm (tentative) and by appointment
Teaching Assistant: Eugene Ie
Office Hours: TBA
Courseworks site and homepage
The course home page can be found at http://www.cs.columbia.edu/~cleslie/cs4761.
We plan to use the CBMF 4761 Courseworks website to host a bulletin board discussion of class material and for electronic submission of homework. The Courseworks site will be accessible to all students soon. Students should read the course bulletin board on a daily basis and are responsible for information posted there.
Course Goals
The goals for students who take this course:
- Understand the main challenges of computational molecular biology
- Understand, implement and use fundamental algorithms of the field
- Study important applications of machine learning techniques to biological problems
- In group projects, apply computational techniques to real biological data and assess results
Some notes on our approach:
- This course is not a tutorial on off-the-shelf bioinformatics software -- though we will discuss how some of these programs work
- I am a mathematician, so I like to fully explain algorithms, probabilistic models, and theoretical background! However, I don't expect students to reproduce these explanations in full detail on tests and homework -- mainly, I want students to understand the ideas well enough to implement and use them
- Towards the end of the course, I'll present a handful of current research papers in the field, so students see recent examples of applications of course ideas.
Topics to be covered
Sequence alignment, hidden Markov models, information-based sequence analysis. Learning algorithms for classification problems: support vector machines, kernel techniques, and clustering algorithms for gene expression and sequence data. Computational signal finding. Inferring regulatory networks using Bayesian nets.
Prerequisites
A sufficient prerequisite for the course is:
- ECBM E4060 Introduction to Genomic Information Science & Technology
If you haven't taken this course, the following list should provide an idea of necessary background:
- Solid programming skills -- the equivalent of COMS 1003, COMS 1004 or COMS 1007
- Data Structures and Algorithms (COMS 3134 or 3137) is an advantage, but not required
- No previous background in biology is required, but you must be interested in and prepared to learn necessary topics in biology
- Probability and statistics -- the equivalent of IEOR 4150 or SIEO 3600 -- or willingness to study this material
- Basic knowledge of linear algebra (vectors, matrices) would be helpful
The following textbook is required for all students:
Biological Sequence Analysis by Durbin, Eddy, Krogh and Mitchison. Cambridge University Press, 1999 (ISBN: 0521629713).
The text will (soon) be available in the Columbia University Bookstore in Lerner Hall.
The following textbook is recommended for background in machine learning:
Pattern Classification (2nd Edition). Duda, Hart and Stork.
For students without significant background in biology, the following textbook is recommended:
Essential Cell Biology: An Introduction to the Molecular Biology of the Cell by Bruce Alberts, Dennis Bray, Alexander Johnson, Julian Lewis, Peter Walter, Keith Roberts and Martin Raff. Garland Pub, 1997 (ISBN: 0815320450).
The following books are not required but may be of interest.
Computational Biology:
- Introduction to Computational Molecular Biology. Setubal and Meidanis. PWS Publishing, 1997.
- Computational Molecular Biology: An Algorithmic Approach. Pevzner. MIT Press, 2000.
- Computational Methods in Molecular Biology. Salzberg, Searls, Kasif. Elsevier, 1998.
- Bioinformatics: The Machine Learning Approach. Baldi, Brunak. MIT, 1998.
- Introduction to Computational Biology. Waterman. Chapman & Hall, 1995.
Computer Science/Machine Learning:
- Neural Networks for Pattern Recognition. C. Bishop.
- Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Gusfield. 1997.
- A First Course in Probability, 5th edition. Ross. Prentice Hall, 1998.
- Neural Networks, 2nd edition. Haykin. Macmillan, 1999.
- An Introduction to Support Vector Machines. Cristianini and Shawe-Taylor. Cambridge UP, 2000.
Biology:
- Biochemistry, 4th ed. Stryer. W. H. Freeman, 1995.
- Biology, 5th ed. Campbell. 1999.
- Molecular Cell Biology, 3rd ed. Lodish et al. 1995.
- Genes VI. Lewin. 1997.
Additional readings will be available on the web (see the links in the lecture schedule below).
As the final project for the course, students will complete a group research-oriented project (ideally, teams of 3-4 people). Projects will consist of writing a computer program (or using an existing one), running experiments on real biological data, summarizing the results on a web site, and writing up the results in a technical report. Suggestions for projects will be made available during the term.
In addition, 3-4 homework assignments consisting of theory and programming problems will be assigned during the semester. Late homework is penalized 10% per calendar day.
There will be two in-class, 75-minute, open-book tests. The tests are scheduled for Monday, March 8 and Monday, May 3.
Grades will be weighted as follows:
- 30% final project
- 30% two 75-minute tests
- 40% homework assignments