Homework 3

Classification for Leukemia Expression Profiles

Due Date: Friday, April 23 (11:59pm) Submit electronically via Courseworks, following the instructions posted previously by our head TA Eugene Ie (eie@cs.columbia.edu). For theory questions, write your answers in a convenient electronic format -- plain text, pdf, postscript, or doc (if you must!). For other formats, ask the TA first to make sure it will be readable. For programming questions, please submit both your source code and your results/plots (in a standard format like ps, jpg, etc) along with a plain text "readme.txt" file that explains to the TA where everything is.

Suggested languages and tools: There are two suggested options for this homework. The first option is to use the matlab spider machine learning package, available at http://www.kyb.tuebingen.mpg.de/bs/people/spider/, for the entire homework. This package is written in object-oriented matlab -- you train "algorithm objects" and test on "data objects". A short tutorial can be found on the spider website, and many demos for different types of algorithms (clustering, classification, transductive learning) are included in the package. You might like to learn spider so that you can use it in your class project. The second option is to implement the k-nearest neighbor algorithm yourself (e.g. in matlab, perl, or Java -- or you can look online for an implementation of this simple algorithm) and use an available SVM software package for the SVM classification problems. The two recommended SVM software packages are William Stafford Noble's GIST software, which can be downloaded from http://microarray.cpmc.columbia.edu/gist/, and Thorsten Joachims' SVM-light, which is available at http://svmlight.joachims.org/. (For GIST, find the link for the download page to get precompiled binaries for Linux or Solaris. If you are unsure what unix-like system you are running on, use the command "uname -a" to find out.) Note that the spider package already comes with the SVM-light optimization algorithm (among others). There are a number of other SVM packages -- each with various advantages and disadvantages -- available online: you are free to use other implementations rather than GIST or SVM-light. A good place to look for SVM software as well as other tutorials and resources is the kernel machines homepage (http://www.kernel-machines.org/).

Biology Reference: "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring", T. R. Golub et al., Science, Volume 286, 1999.

Background on the dataset: The Golub dataset consists of a training set (in the file golub-data-train.txt) of gene expression profiles for 38 bone marrow samples from acute leukemia patients, with each profile consisting of about 7000 gene expression levels. The training samples are labelled as either ALL (acute lymphoblastic leukemia) or AML (acute myeloid leukemia), two clinically distinct types of leukemia. The ALL-type samples can be further divided into T-lineage ALL and B-lineage ALL (see the paper for details). Finally, there is a test ("independent") set of 50 additional samples (in the file golub-data-independent.txt), also consisting of AML, T-lineage ALL, and B-lineage ALL leukemia types.

Theory (15 points):

  1. Distinguish between supervised and unsupervised learning. Distinguish between clustering and classification.
  2. Precisely state the optimization problem that the "hard margin" or "maximal margin" SVM solves, given a labeled training set
    $(x_1, y_1), \ldots, (x_m, y_m)$
    with $x_i \in \mathbb{R}^N$ and $y_i \in \{-1, +1\}$. Why might this version of SVM be inappropriate for real data?
  3. Training an SVM produces a set of "weights" usually denoted $\alpha_1, \ldots, \alpha_m$. Explain briefly how these weights arise from solving the optimization problem given above (no need to repeat the entire derivation from class), and explain how they determine a linear classifier. What does it mean if a particular weight $\alpha_i$ is 0, non-zero but small, or non-zero and large?
  4. Briefly outline what slack variables are and how they are used to define soft-margin versions of the SVM optimization problem.
  5. Briefly explain the theoretical motivation behind the Fisher criterion score (used in the first programming problem) for feature selection.
  6. What is the difference between a filter and a wrapper feature selection technique?

Programming (35 points): For this assignment, we concentrate on supervised approaches, using k-nearest neighbor and support vector machine classifiers for the leukemia discrimination problem.

If you use spider, you will be able to train both kNN and SVM classifiers by creating and training the appropriate objects -- see the online and internal documentation for spider. Once you get the hang of the object-oriented framework, spider will be the fastest way to complete the assignment. Otherwise, you'll need to implement (or find an implementation of) kNN, and use one of the SVM software implementations suggested above. A note about the GIST software: this SVM implementation uses a modification of the SVM optimization problem that considers only zero-bias (b = 0) linear classifiers. This allows a simple algorithm for solving for the optimal classifier, but, because of the zero bias, the solution is slightly different from the regular SVM solution.

  1. Since there are so many features for this dataset (and relatively few samples), we expect that many of the features are irrelevant for discrimination between classes and will merely add noise and degrade performance for our classifiers. Therefore, we want to use a simple filtering approach to feature selection: we try to choose the features that are most discriminative between classes in the training set.

    One possibility is to use the Fisher criterion score as our feature filtering statistic -- there are many other choices. The Fisher score for the jth feature (coordinate) is given by

    $\left| \mu_j^+ - \mu_j^- \right|^2 \big/ \left( (\sigma_j^+)^2 + (\sigma_j^-)^2 \right)$

    where $\mu_j^+$ (resp. $\mu_j^-$) is the sample mean of the $j$th feature values across the positive (resp. negative) training vectors, while $\sigma_j^+$ (resp. $\sigma_j^-$) is similarly the sample standard deviation across the positive (resp. negative) training set. You are free to use a different feature selection score -- just state clearly what you are using.

    You can use a feature selection object in spider, the fselect program included with the GIST package, or your own script in perl or matlab (or other language) to calculate the Fisher criterion score (or other chosen score) for all the genes across the training set, and produce a ranked list of genes, ordered by decreasing score. This list will be used for feature selection in your classification experiments.
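    For example, a minimal matlab sketch of this computation might look like the following (the variable names Xtrain and y are our own assumptions -- you must load the expression data and the labels yourself):

      % Fisher criterion score for every gene; Xtrain is a samples-by-genes
      % matrix and y is a column vector of labels in {-1,+1} (assumed names)
      pos = (y == 1);   neg = (y == -1);
      muP = mean(Xtrain(pos,:), 1);    muN = mean(Xtrain(neg,:), 1);
      sdP = std(Xtrain(pos,:), 0, 1);  sdN = std(Xtrain(neg,:), 0, 1);
      fisher = (muP - muN).^2 ./ (sdP.^2 + sdN.^2);
      [scores, ranked] = sort(fisher, 'descend');   % ranked(1) is the top gene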

  2. Now you'll train kNN and SVM classifiers to make predictions on the Golub test set. There's actually very little programming to do (especially if you use spider for kNN), but there are many SVM experiments to run, requiring various pre-processing steps and post-processing analysis.

    Train soft-margin linear SVM classifiers using several sets of features from your ranked gene list: for example, all genes, and the top-ranked subsets at a few decreasing sizes. State clearly which feature set sizes you use.

    Depending on the SVM software that you choose, you may be using a 2-norm or 1-norm soft-margin classifier -- state clearly what version of the optimization problem you are using, what value of the parameter C you chose (or equivalently, what command-line options you used for training), and how you tuned this parameter. (Note: one principled way to choose parameters is to use cross-validation on the training set, as in the sketch below; if you try multiple parameter values and choose the one that performs best on the test set, you may actually be overfitting to the test set!) Typically, people apply a mean-0, unit-variance transform to each feature, or normalize the profiles so that the vectors are unit length -- state what pre-processing choices you make.
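
    One way to organize the parameter search, as a sketch only (train_svm and predict_svm stand for hypothetical wrappers around whichever SVM package you use; they are not real package calls):

      % 5-fold cross-validation on the training set over a grid of C values;
      % train_svm/predict_svm are hypothetical wrappers you must supply
      Cs = [0.01 0.1 1 10 100];
      m = size(Xtrain, 1);
      fold = mod(randperm(m), 5) + 1;              % random fold labels 1..5
      errs = zeros(size(Cs));
      for ci = 1:length(Cs)
        for f = 1:5
          tr = (fold ~= f);  te = (fold == f);
          model = train_svm(Xtrain(tr,:), y(tr), Cs(ci));   % hypothetical
          yhat  = predict_svm(model, Xtrain(te,:));         % hypothetical
          errs(ci) = errs(ci) + sum(yhat ~= y(te));
        end
      end
      [ignore, best] = min(errs);
      bestC = Cs(best);                            % lowest cross-validation error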

    Also try kNN classifiers with k=1, 3, 5, 7, to give a baseline performance measure. Explain what distance measure you are using (Euclidean distance or correlation coefficient are standard choices).
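
    If you implement kNN yourself, a minimal matlab version with Euclidean distance and a majority vote might look like this sketch:

      function yhat = knn_predict(Xtr, ytr, Xte, k)
      % kNN sketch: predict each test row of Xte from the k nearest training
      % rows of Xtr, whose labels ytr are in {-1,+1}; use odd k to avoid ties
      yhat = zeros(size(Xte,1), 1);
      for i = 1:size(Xte,1)
        d = sum((Xtr - repmat(Xte(i,:), size(Xtr,1), 1)).^2, 2);  % squared distances
        [ignore, idx] = sort(d);                  % nearest training points first
        yhat(i) = sign(sum(ytr(idx(1:k))));       % majority vote of the k neighbors
      end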

    For each kNN/SVM experiment, report the predicted labels and calculate the "confusion matrix" on the test set for the 2-class ALL versus AML problem (choose one class to be "Positive" and one class to be "Negative"):

    Actual \ Predicted    Negative    Positive
    Negative                  a           b
    Positive                  c           d


    where a, b, c, and d are the numbers of test examples falling into each category, and calculate simple statistics such as the test error rate (b + c)/(a + b + c + d), the true positive rate d/(c + d), and the false positive rate b/(a + b).
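
    In matlab, for instance, the entries and statistics can be read off directly from the true test labels y and the predicted labels yhat, with Negative coded as -1 and Positive as +1 (a sketch):

      % confusion matrix entries and simple statistics for labels in {-1,+1}
      a = sum(y == -1 & yhat == -1);         % true negatives
      b = sum(y == -1 & yhat ==  1);         % false positives
      c = sum(y ==  1 & yhat == -1);         % false negatives
      d = sum(y ==  1 & yhat ==  1);         % true positives
      errRate = (b + c) / (a + b + c + d);   % test error rate
      tpRate  = d / (c + d);                 % sensitivity
      fpRate  = b / (a + b);                 % 1 - specificity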

    For SVM classifiers, a better way to view results is to plot an ROC (receiver operating characteristic) curve: plot the rate of true positives as a function of the rate of false positives as you vary the threshold for the classifier (recall we predict that x is positive if f(x) > c, where c is our threshold). That is, plot the number of TP as a function of the number of FP, and scale both axes so that they vary between 0 and 1. The area under this curve is called the ROC score. Compute the ROC score for your classifier on the test set (compute the area by summing the areas of rectangles). Note that the ROC curve and ROC score are completely determined by the ranking of test examples by the classifier f(x) and by the true labels on the test examples. (For kNN, you can still vary the classification threshold, but the most natural threshold corresponds to a vote of the neighbors.) Report the ROC scores for your SVM classifiers.
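
    A minimal matlab sketch of the ROC computation (assuming f is the column vector of real-valued classifier outputs on the test examples, and y the column vector of true labels in {-1,+1}):

      % sweep the threshold from high to low by sorting on decreasing f(x)
      [ignore, ord] = sort(-f);
      ys = y(ord);
      tp = cumsum(ys ==  1) / sum(ys ==  1);   % true positive rate at each cut
      fp = cumsum(ys == -1) / sum(ys == -1);   % false positive rate at each cut
      rocScore = sum(diff([0; fp]) .* tp);     % sum the areas of the rectangles
      plot(fp, tp); xlabel('FP rate'); ylabel('TP rate');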

  3. For the last set of results, rerun the SVM experiments using your choice of kernel (second degree polynomial and radial basis function kernels are standard choices on this kind of dataset). Again, report the type of kernel and the kernel parameters that you used, and report the confusion matrix and the ROC score.
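
    For reference, the two standard kernels mentioned above can be written in matlab as follows (a sketch; the additive constant 1 and the width sigma are parameter choices you must make and report):

      % second-degree polynomial and radial basis function kernels on row vectors
      poly2 = @(u, v) (u * v' + 1).^2;
      rbf   = @(u, v, sigma) exp(-sum((u - v).^2) / (2 * sigma^2));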

  4. Finally, write a paragraph or two about your results: How much feature elimination is useful for this dataset? Did the use of kernels with the SVM improve performance on the test set or lead to overfitting? How does the performance of your SVM compare with the "informative gene" classifier described in the original paper by Golub et al.?