MACHINE LEARNING                                April 2, 2010

COMS 4771 HOMEWORK #4
PROF. TONY JEBARA

 

DUE

THURSDAY, APRIL 15th, 2010 BEFORE CLASS

Instructions


1. Put everything you need to submit (matlab code, figures, writeup etc) in a folder.
2. change to that directory and create a tar ball named .tar *
For example:
pks2103@knu:~/temp1/hw2$ tar -cvf pks2103.tar *
hw2.txt
prob1.m
This would create pks2103.tar which has the all the files in the folder hw2.
3. zip the tar ball
For exampe:
pks2103@knu:~/temp1/hw2$ gzip pks2103.tar
This gives a file named pks2103.tar.gz
4. Log on to courseworks.columbia.edu
5. Go to 4771 course
6. Click on the Class Files link from the menu on the left.
7. Click on postfile
8. Upload your .tar.gz
9. From the drop down menu, select HWi (i=2,3,4) for homework i (in the shared files section)
10. Press the submit button
11. You can see the status of your post by clicking on Log.
PS : These files can not be seen by other students. Do not worry about it.


Your write-up should be in ASCII plain text format (.txt) or Postscipt (.ps) or the Adobe Portable Document Format (.pdf). Please do not submit Microsoft Office documents, LaTeX source code, or something more exotic. We will not be able to read it. LaTeX is preferred and highly recommended, but it is not mandatory. You can use any document editing software you wish, as long as the final product is in .ps or .pdf. Even if you do not use LaTeX to prepare your document, you can use LaTeX notation to mark up complicated mathematical expressions, for example, in comments in your Matlab code. All your code should be written in Matlab. Please submit all your souce files, each function in a separate file. Clearly denote what each function does, which problem it belongs to, and what the inputs and the outputs are. Do not resubmit code or data provided to you. Do not submit code written by others. Identical submissions will be detected and both parties will get zero credit. Shorter code is better. Sample code is available on the “Tutorials” web page. Datasets are available from the “Handouts” web page.  You may include figures directly in your write-up or include them separately as .jpg, .gif, .ps or .eps files, and refer to them by filename. Submit work via courseworks.columbia.edu. If something goes wrong, ask the TA's for help.  Thanks.

 

1. (15 points) Expectation Maximization for Multinomial Mixtures:

 

Start by downloading the implementation of the Expectation-Maximization algorithm for Gaussian mixture-models for clustering D-dimensional vector data using a mixture of M multivariate Gaussian models (with variable covariance). The code is available form the tutorials link as mixmodel.m . You will also need to download the four .m files below it including randInit.m to initialize the parameters randomly. Also, the functions plotClust.m and  plotGauss.m to display the Gaussians overlayed on a plot of the first two dimensions of the data sets after EM converges. Try this code out on your own data and get a feel for how it works.

The work for this question will be to produce code to do a mixture of multinomials instead of a mixture of Gaussians. Use this mixture of multinomials to cluster the Shakespeare and Middelton documents for M=2 different clusters. Show which documents go with which cluster and the training log-likelihood as you start EM and iterate it. Then report the log-likelihood as you vary the number of clusters for various random initializations. Then cross-validate to determine the best number of multinomials in the mixture (M=1,2,3 and 4) by splitting the documents into training and testing (only use 2 documents of Shakespeare and 2 from Middleton for testing since the data is so small). Report average and standard deviation of training and testing log-likelihoods over 10 different random initializes for M=1,2,3 and 4 multinomials in the mixture model. You may need to add a small amount (i.e. do MAP estimation) to the multinomials so that they don't give you numerical problems and you will need to work with log probabilities to avoid NaN and Inf numerical issues. Write a brief discussion about some of the peculiarities you noted with EM, numerical issues, convergence issues, etc.

HINT: For the E-step you will need to calculate the following responsibility values (tau) which is the ratio:
tau = pk / (p1+p2+...pK)
Instead, to avoid numerical problems and 0/0 problems, compute lk=log(pk) for all k=1,...K probabilities.
Find the largest value z=max(l1,l2,...,lK)
Then compute tau using this formula:
tau = exp( lk-z ) / ( exp(l1-z) + exp(l2-z) + ... exp(lK-z) )

 

2. (10 points) Cross Validation and Principal Components Analysis:

 

You will use the provided mixture of Gaussians code for this question. Cross-validation for probability models involves splitting a data set into two where one piece of it is used to train up a model (for example using maximum likelihood) and the other piece of the data (the control set) is used to test or evaluate that model. That way, you can try different models and see which one is best (for example different numbers of Gaussians in a mixture model).

 

Download datasetC (training data) and datasetD (testing data) from the web page. These are image datasets where each row is a 64-dimensional vector that corresponds to the intensities of an 8x8 gray scale image. The images are of various hand drawn 0’s and 1’s. You can view what each data point looks like with the imageData.m. You aren’t given the labels ‘0’ and ‘1’ for the images so you will use EM to cluster the images into groups. Run the EM code on datasetC until it converges for multiple values of M=1 to 5 and repeat each run multiple times (remember each EM run will have a different random configuration and end up in a possibly different local maximum of likelihood). Record the log-likelihood on the training and testing data sets and show it in a table or plot. Using the multiple runs, guess or estimate what is the best value of M for this data set and discuss the issues of overfitting and underfitting and why you came to that conclusion.

 

You should hand in the plots of a single log-likelihood converged from random initialization to convergence for datasetC as well as a plot of the converged log-likelihoods on datasetC (training) and datasetD (testing) for various numbers of Gaussians (1 to 5 Gaussians in the mixture) and various random trials.

 

Then, for just two Gaussians, and the best EM run (i.e. the one that gives the highest log-likelihood on datasetD after training on datasetC), you should show the means of the Gaussians that are produced by printing them out with the imageData.m function. Then show the first two eigenvectors (the principal components with highest eigenvalues) of each covariance matrix using the imageData.m function. You should write a brief blurb commenting about your results, what seems to be happening with EM, numerical issues, convergence issues, etc.  on this 64-dimensional data set, and what the above images seem to be for the means and the two eigenvectors for each Gaussian (i.e. interpret the means and how the eigenvectors would change them).