MACHINE LEARNING April 2, 2010
COMS 4771
HOMEWORK #4
PROF. TONY JEBARA
|
DUE |
THURSDAY,
APRIL 15th, 2010 BEFORE CLASS |
Your write-up should be in ASCII plain text format (.txt) or Postscipt (.ps) or the Adobe
Portable Document Format (.pdf). Please do not submit
Microsoft Office documents, LaTeX source code, or
something more exotic. We will not be able to read it. LaTeX
is preferred and highly recommended, but it is not mandatory. You can use any
document editing software you wish, as long as the final product is in .ps or .pdf. Even if you do not
use LaTeX to prepare your document, you can use LaTeX notation to mark up complicated mathematical
expressions, for example, in comments in your Matlab
code. All your code should be written in Matlab.
Please submit all your souce files, each function in
a separate file. Clearly denote what each function does, which problem it
belongs to, and what the inputs and the outputs are. Do not resubmit code or
data provided to you. Do not submit code written by others. Identical
submissions will be detected and both parties will get zero credit. Shorter
code is better. Sample code is available on the “Tutorials” web page. Datasets
are available from the “Handouts” web page.
You may include figures directly in your write-up or include them
separately as .jpg, .gif, .ps or .eps
files, and refer to them by filename. Submit work via courseworks.columbia.edu.
If something goes wrong, ask the TA's for help.
Thanks.
1. (15
points) Expectation Maximization for Multinomial Mixtures:
Start by downloading the
implementation of the Expectation-Maximization algorithm for Gaussian
mixture-models for clustering D-dimensional vector data using a mixture of M
multivariate Gaussian models (with variable covariance). The code is available
form the tutorials link as mixmodel.m . You will also need to download the four .m files below it including randInit.m to initialize the
parameters randomly. Also, the functions plotClust.m
and plotGauss.m to display the Gaussians overlayed on a plot of the first two dimensions of the data
sets after EM converges. Try this code out on your own data and get a feel for how it works.
2. (10
points) Cross Validation and Principal Components Analysis: You will use the provided mixture of Gaussians code for this question. Cross-validation
for probability models involves splitting a data set into two where one piece
of it is used to train up a model (for example using maximum likelihood) and
the other piece of the data (the control set) is used to test or evaluate that
model. That way, you can try different models and see which one is best (for
example different numbers of Gaussians in a mixture model). Download datasetC (training data) and datasetD
(testing data) from the web page. These are image datasets where each row is a
64-dimensional vector that corresponds to the intensities of an 8x8 gray scale
image. The images are of various hand drawn 0’s and 1’s. You can view what each
data point looks like with the imageData.m.
You aren’t given the labels ‘0’ and ‘1’ for the images so you will use EM to
cluster the images into groups. Run the EM code on datasetC
until it converges for multiple values of M=1 to 5 and repeat each run multiple
times (remember each EM run will have a different random configuration and end
up in a possibly different local maximum of likelihood). Record the
log-likelihood on the training and testing data sets and show it in a table or
plot. Using the multiple runs, guess or estimate what is the best value of M
for this data set and discuss the issues of overfitting
and underfitting and why you came to that conclusion. You
should hand in the plots of a single log-likelihood converged from random
initialization to convergence for datasetC as well as
a plot of the converged log-likelihoods on datasetC
(training) and datasetD (testing) for various numbers
of Gaussians (1 to 5 Gaussians in the mixture) and various random trials. Then, for
just two Gaussians, and the best EM run (i.e. the one that gives the highest
log-likelihood on datasetD after training on datasetC), you should show the means of the Gaussians that
are produced by printing them out with the imageData.m
function. Then show the first two eigenvectors (the principal components with
highest eigenvalues) of each covariance matrix using
the imageData.m function. You should write a
brief blurb commenting about your results, what seems to be happening with EM,
numerical issues, convergence issues, etc.
on this 64-dimensional data set, and what the
above images seem to be for the means and the two eigenvectors for each
Gaussian (i.e. interpret the means and how the eigenvectors would change them).
The work for this question will be to produce code to do a mixture of multinomials instead of a mixture of Gaussians. Use this mixture of multinomials to cluster the Shakespeare and Middelton documents for M=2 different clusters. Show which documents go with which cluster and the training log-likelihood as you start EM and iterate it. Then report the log-likelihood as you vary the number of clusters for various random initializations. Then cross-validate to determine the best number of multinomials in the mixture (M=1,2,3 and 4) by splitting the documents into training and testing (only use 2 documents of Shakespeare and 2 from Middleton for testing since the data is so small). Report average and standard deviation of training and testing log-likelihoods over 10 different random initializes for M=1,2,3 and 4 multinomials in the mixture model.
You may need to add a small amount (i.e. do MAP estimation) to the multinomials so that they don't give you numerical problems and you will need to work with log probabilities to avoid NaN and Inf numerical issues.
Write a brief discussion about some of the peculiarities you noted with EM,
numerical issues, convergence issues, etc.
HINT: For the E-step you will need to calculate the following responsibility values (tau) which is the ratio:
tau = pk / (p1+p2+...pK)
Instead, to avoid numerical problems and 0/0 problems, compute lk=log(pk) for all k=1,...K probabilities.
Find the largest value z=max(l1,l2,...,lK)
Then compute tau using this formula:
tau = exp( lk-z ) / ( exp(l1-z) + exp(l2-z) + ... exp(lK-z) )