Due Date: Friday, April 23 (11:59pm) Submit electronically via Courseworks, following the instructions posted previously by our head TA Eugene Ie (eie@cs.columbia.edu). For theory questions, write your answers in a convenient electronic format -- plain text, pdf, postscript, or doc (if you must!). For other formats, ask the TA first to make sure it will be readable. For programming questions, please submit both your source code and your results/plots (in a standard format like ps, jpg, etc) along with a plain text "readme.txt" file that explains to the TA where everything is.
Suggested languages and tools: There are two suggested options for this homework. The first option is to use the matlab spider machine learning package, available at http://www.kyb.tuebingen.mpg.de/bs/people/spider/, for the entire homework. This package is written in object-oriented matlab -- you train "algorithm objects" and test on "data objects". A short tutorial can be found on the spider website, and many demos for different types of algorithms (clustering, classification, transductive learning) are included in the package. You might like to learn about spider so that you can use it in your class project. The second option is to implement the k-nearest neighbor algorithm yourself (e.g. in matlab, perl, Java -- or you can look online for an implementation of this simple algorithm) and use an available SVM software package for the SVM classification problems. The two recommended SVM software packages are William Stafford Noble's GIST software, which can be downloaded from http://microarray.cpmc.columbia.edu/gist/, and Thorsten Joachims' SVM-light, which is available at http://svmlight.joachims.org/. (For GIST, follow the link to the download page to get precompiled binaries for Linux or Solaris. If you are unsure what unix-like system you are running on, use the command "uname -a" to find out.) Note that the spider package already comes with the SVM-light optimization algorithm (among others). There are a number of other SVM packages -- each with various advantages and disadvantages -- available online: you are free to use other implementations rather than GIST or SVM-light. A good place to look for SVM software as well as other tutorials and resources is the kernel machines homepage.
Biology Reference: "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring", T. R. Golub et al., Science, Volume 286, 1999.
Background on the dataset: The Golub dataset consists of a training set (in the file golub-data-train.txt) of gene expression profiles for 38 bone marrow samples from acute leukemia patients, with each profile consisting of about 7000 gene expression levels. The training samples are labelled as either ALL (acute lymphoblastic leukemia) or AML (acute myeloid leukemia), two clinically distinct types of leukemia. The ALL samples can be further divided into T-lineage ALL and B-lineage ALL (see the paper for details). Finally, there is a test ("independent") set of 50 additional samples (in the file golub-data-independent.txt), also consisting of AML, T-lineage ALL, and B-lineage ALL leukemia types.
Theory (15 points):
(x_{1}, y_{1}), ... , (x_{m}, y_{m}) with x_{i} in R^{N} and y_{i} in {-1,1}. Why might this version of SVM be inappropriate for real data?
Programming (35 points): For this assignment, we concentrate on supervised approaches, using k-nearest neighbor and support vector machine classifiers for the leukemia discrimination problem.
If you use spider, you will be able to train both kNN and SVM classifiers by creating and training the appropriate objects -- see online and internal documentation for spider. Once you get the hang of the object-oriented framework, spider will be the fastest way to complete the assignment. Otherwise, you'll need to implement (or find an implementation of) kNN, and use one of the SVM software implementations suggested above. A note about the GIST software: this SVM implementation uses a modification of the SVM optimization problem that only considers zero-bias (b=0) linear classifiers. This allows for a simple algorithm to solve for the optimal classifier, but because of the zero bias, the result is slightly different from the regular SVM solution.
One possibility is to use the Fisher criterion score as our feature filtering statistic -- there are many other choices. The Fisher score for the j^{th} feature (coordinate) is given by
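|mu^{+}_{j} - mu^{-}_{j}|^{2} / ((sigma^{+}_{j})^{2} + (sigma^{-}_{j})^{2})
where mu^{+}_{j} and sigma^{+}_{j} are the mean and standard deviation of the j^{th} feature over the positive training examples, and mu^{-}_{j} and sigma^{-}_{j} are the corresponding quantities over the negative training examples.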
You can use a feature selection object in spider, the fselect program included with the GIST package, or your own script in perl or matlab (or other language) to calculate the Fisher criterion score (or other chosen score) for all the genes across the training set, and produce a ranked list of genes, ordered by decreasing score. This list will be used for feature selection in your classification experiments.
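For those writing their own script, the ranking step might look roughly like the following Python sketch. The function names `fisher_scores` and `rank_features` are hypothetical, the data layout (one list of expression values per sample, labels in {+1, -1}) is an assumption, and the use of the population variance rather than the sample variance is an arbitrary choice that you should state if you make it.

```python
from statistics import mean, pvariance


def fisher_scores(samples, labels):
    """Return the Fisher criterion score of every feature (gene).

    samples: list of equal-length lists of floats, one row per sample.
    labels:  list of +1 / -1 class labels, one per sample.
    """
    pos = [row for row, y in zip(samples, labels) if y == 1]
    neg = [row for row, y in zip(samples, labels) if y == -1]
    scores = []
    for j in range(len(samples[0])):
        pos_j = [row[j] for row in pos]
        neg_j = [row[j] for row in neg]
        numerator = (mean(pos_j) - mean(neg_j)) ** 2
        # Assumption: population variance (divide by n, not n - 1).
        denominator = pvariance(pos_j) + pvariance(neg_j)
        if denominator > 0:
            scores.append(numerator / denominator)
        else:
            # Assumption: a feature that is constant within both classes
            # scores 0 if the class means agree and infinity otherwise.
            scores.append(float("inf") if numerator > 0 else 0.0)
    return scores


def rank_features(scores):
    """Return feature indices ordered by decreasing score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

Compute the scores on the training set only, so that the ranking does not depend on test labels.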
Train soft-margin linear SVM classifiers using the following sets of features:
Depending on the SVM software that you choose, you may be using a 2-norm or 1-norm soft-margin classifier -- state clearly what version of the optimization problem you are using, what value of the parameter C you chose (or equivalently, state what command-line options you used for training), and how you tuned this parameter. (Note: one principled way to choose parameters is to use cross-validation on the training set; if you try multiple parameter values and choose the one that performs best on the test set, you may actually be overfitting to the test set!) Typically, people apply a mean 0 and unit variance transform to the data, or normalize so that the vectors are unit length -- state what pre-processing choices you make.
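As an illustration of the pre-processing and cross-validation choices mentioned above, here is a minimal Python sketch. All function names are hypothetical and the data layout (a list of rows, one per sample) is an assumption. The mean and standard deviation are fit on the training set and then reused for the test set, and the fold splitter is a simple interleaved split with no shuffling.

```python
import math


def fit_standardizer(train):
    """Compute per-feature mean and standard deviation on the training rows."""
    n = len(train)
    n_features = len(train[0])
    means = [sum(row[j] for row in train) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((row[j] - means[j]) ** 2 for row in train) / n
        # Assumption: constant features get std 1 to avoid dividing by zero.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds


def standardize(rows, means, stds):
    """Apply a mean-0, unit-variance transform using precomputed statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def unit_length(rows):
    """Scale every sample vector to Euclidean length 1 (zero vectors are left alone)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        out.append([x / norm for x in row])
    return out


def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, held_out_idx) pairs; fold f holds out samples f, f + n_folds, ..."""
    for f in range(n_folds):
        held_out = list(range(f, n_samples, n_folds))
        train = [i for i in range(n_samples) if i % n_folds != f]
        yield train, held_out
```

To tune C, you could train on each `train_idx` subset for each candidate value, measure the error on the matching held-out subset, and keep the value with the lowest average held-out error, all without touching the test set.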
Also try kNN classifiers with k=1, 3, 5, 7, to give a baseline performance measure. Explain what distance measure you are using (Euclidean distance or correlation coefficient are standard choices).
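If you implement kNN yourself, a minimal Python sketch could look like the one below. The names `euclidean`, `correlation_distance`, and `knn_predict` are hypothetical, and turning the correlation coefficient into a distance as 1 minus the Pearson correlation is one common convention, not the only one.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def correlation_distance(u, v):
    """1 minus the Pearson correlation coefficient (assumed convention)."""
    mu_u = sum(u) / len(u)
    mu_v = sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    norm = math.sqrt(
        sum((a - mu_u) ** 2 for a in u) * sum((b - mu_v) ** 2 for b in v)
    )
    # Assumption: a constant vector has undefined correlation; treat it as 0.
    return 1.0 - cov / norm if norm > 0 else 1.0


def knn_predict(train_x, train_y, x, k, distance=euclidean):
    """Predict the label of x by majority vote among its k nearest training points."""
    neighbors = sorted(zip(train_x, train_y), key=lambda pair: distance(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

With two classes and odd k, as in k=1, 3, 5, 7, the vote can never tie.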
For each kNN/SVM experiment, report the predicted labels and calculate the "confusion matrix" on the test set for the 2-class ALL versus AML problem (choose one class to be "Positive" and one class to be "Negative"):
Actual \ Predicted | Negative | Positive
Negative           | a        | b
Positive           | c        | d
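The four entries of this table can be tallied with a few lines of Python. This is only a sketch: the function name `confusion_matrix` is hypothetical, and the encoding of the labels as +1 (Positive) and -1 (Negative) is an assumption.

```python
def confusion_matrix(actual, predicted, negative=-1):
    """Count the entries a, b, c, d of the table (rows: actual, columns: predicted)."""
    counts = {"a": 0, "b": 0, "c": 0, "d": 0}
    for y, y_hat in zip(actual, predicted):
        if y == negative:
            # Actual Negative: a if predicted Negative, b if predicted Positive.
            counts["a" if y_hat == negative else "b"] += 1
        else:
            # Actual Positive: c if predicted Negative, d if predicted Positive.
            counts["c" if y_hat == negative else "d"] += 1
    return counts
```

In this layout b counts the false positives and c the false negatives, so the number of test errors is b + c.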
For SVM classifiers, a better way to view results is to plot an ROC (receiver operating characteristic) curve: plot the rate of true positives as a function of the rate of false positives, as you vary the threshold for the classifier (recall we predict x is positive if f(x) > c, where c is our threshold). That is, plot the number of TP as a function of the number of FP, and scale both axes so that they vary between 0 and 1. The area under this curve is called the ROC score. Compute the ROC score for your classifier on the test set (compute the area by summing the areas of rectangles). Note that the ROC curve and ROC score are completely determined by the ranking of test examples by the classifier f(x) and by the true labels on the test examples. (For kNN, you can still vary the classification threshold, but the most natural threshold corresponds to a vote of the neighbors.) Report the ROC scores for your SVM classifiers.
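The rectangle-summing computation can be sketched in Python as follows. The names `roc_points` and `roc_score` are hypothetical, labels are assumed to be +1 / -1, and the sketch assumes no two test examples receive exactly the same value of f(x) (with ties, the order among tied examples would affect the result).

```python
def roc_points(scores, labels):
    """Return the (FP rate, TP rate) points traced as the threshold is lowered.

    scores: classifier outputs f(x) on the test examples.
    labels: true labels, +1 for Positive and -1 for Negative.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    # Visit test examples from the highest to the lowest classifier output.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def roc_score(points):
    """Area under the ROC curve as a sum of rectangles (width * height)."""
    return sum((x1 - x0) * y1 for (x0, _), (x1, y1) in zip(points, points[1:]))
```

For example, outputs 0.9, 0.8, 0.7, 0.6 with true labels +1, -1, +1, -1 give an ROC score of 0.75, since three of the four (positive, negative) pairs are ranked correctly.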