MACHINE LEARNING                                                March 4, 2010
COMS 4771
HOMEWORK #3
PROF. TONY JEBARA
DUE: Thursday, March 25th, 2010, before class
Your write-up should be in ASCII plain text format (.txt), Postscript (.ps), or the Adobe Portable Document Format (.pdf). Please do not submit Microsoft Office documents, LaTeX source code, or anything more exotic; we will not be able to read it. LaTeX is preferred and highly recommended, but it is not mandatory. You can use any document editing software you wish, as long as the final product is in .ps or .pdf. Even if you do not use LaTeX to prepare your document, you can use LaTeX notation to mark up complicated mathematical expressions, for example, in comments in your Matlab code. All your code should be written in Matlab.

Please submit all your source files, each function in a separate file. Clearly denote what each function does, which problem it belongs to, and what the inputs and the outputs are. Do not resubmit code or data provided to you. Do not submit code written by others. Identical submissions will be detected and both parties will get zero credit. Shorter code is better. Sample code is available on the “Tutorials” web page. Datasets are available from the “Handouts” web page.

You may include figures directly in your write-up or include them separately as .jpg, .gif, .ps or .eps files, and refer to them by filename. Submit work via courseworks.columbia.edu. If something goes wrong, ask the TAs for help. Thanks.
1. (20 points) Principal Components Analysis:
Download the dataset of 100 teapot images of size 38x50 in the file datasetPCA.mat. The images are organized as vectors of length 1900 in the matrix teapotImages. To view an image, say the second one in the dataset, type:
load datasetPCA.mat
imagesc(reshape(teapotImages(2,:),38,50));
colormap gray;
Compute the mean, the covariance matrix, and the top 3 eigenvectors of the covariance matrix of the dataset. Show the mean and the top 3 eigenvectors as images. Reconstruct the data using principal components analysis (PCA) with least-squares error, i.e. using only the mean and a linear combination of the top 3 eigenvectors. Show ten different images from the dataset as they originally appeared and after reconstruction from the mean and the top 3 eigenvectors.
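As a rough sketch of one possible way to carry this out (the variable names such as mu, V, and recon are mine, not part of the assignment):

load datasetPCA.mat
n = size(teapotImages, 1);                        % 100 images, each 1 x 1900

% Mean image and covariance matrix of the data.
mu = mean(teapotImages, 1);                       % 1 x 1900
C  = cov(teapotImages);                           % 1900 x 1900

% Top 3 eigenvectors of the covariance matrix.
[V, D] = eigs(C, 3);                              % V is 1900 x 3

% Show the mean and each eigenvector as a 38 x 50 image.
figure; imagesc(reshape(mu, 38, 50)); colormap gray;
for k = 1:3
    figure; imagesc(reshape(V(:,k), 38, 50)); colormap gray;
end

% Least-squares reconstruction from the mean plus the top 3 components.
centered = teapotImages - repmat(mu, n, 1);
recon    = repmat(mu, n, 1) + (centered * V) * V';

% Compare ten originals with their reconstructions, side by side.
for i = 1:10
    figure;
    subplot(1,2,1); imagesc(reshape(teapotImages(i,:), 38, 50)); colormap gray;
    subplot(1,2,2); imagesc(reshape(recon(i,:), 38, 50)); colormap gray;
end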
2. (15 points) Multinomial Classification: Write an implementation for fitting 2-class multinomial models to the dataset in dataset3.data. Just type ‘load dataset3.data’ in Matlab. This dataset3 is a matrix of 18 columns and 10025 rows. The data are the word counts from 9 plays by William Shakespeare and 9 plays by Thomas Middleton. The first 9 columns correspond to the Shakespeare plays.
The texts were downloaded (and parsed to remove caps, punctuation, etc.) from
http://www.tech.org/~cleary/middhome.html
http://the-tech.mit.edu/Shakespeare/
The ‘words’ (about 10025 words total were used in the dictionary of all the above 18 plays) for each row are shown in the first column of dataset3.txt if you are curious what the word counts correspond to. We also added a 1 to the counts, so that words that never appear in a document get a count of 1, words that appear once get a count of 2, and so forth.
Start by training your multinomials on only the first 2 Shakespeare plays and the first 2 Middleton plays, then use the learned models to classify the remaining plays and report which author each play is attributed to.
Note: you should compare log probabilities instead of regular probabilities to avoid numerical problems.
Note: the code for this problem is actually very short!
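A minimal sketch of the kind of code this asks for, assuming the train/test split described above (first 2 plays of each author for training); the variable names pShake and pMidd are mine:

load dataset3.data                 % 10025 x 18 matrix of (count + 1) values
X = dataset3;

% Fit one multinomial per class by normalizing the summed word counts.
% Columns 1-9 are Shakespeare, 10-18 Middleton; train on the first 2 of each.
% The added 1 in the counts guarantees no zero probabilities below.
pShake = sum(X(:, 1:2),   2); pShake = pShake / sum(pShake);
pMidd  = sum(X(:, 10:11), 2); pMidd  = pMidd  / sum(pMidd);

% Classify every play by comparing log-likelihoods (logs avoid underflow);
% up to the multinomial coefficient, log P(x) = sum_w count_w * log(p_w).
for j = 1:18
    llShake = X(:,j)' * log(pShake);
    llMidd  = X(:,j)' * log(pMidd);
    if llShake > llMidd
        fprintf('play %2d -> Shakespeare\n', j);
    else
        fprintf('play %2d -> Middleton\n', j);
    end
end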
3. (15 points) How predictable is human behavior?
This question looks at the rock-paper-scissors game. You will play this game against the computer in Matlab, but have it use learning to find patterns of play and hopefully start to beat you. We will assume that we have a fully interconnected graphical model, or fully general distribution, with no conditional independencies. The distribution is over the past M moves:

P(X_M, X_{M-1}, X_{M-2}, ..., X_1)

where each X is in the set {1,2,3}, which stands for {Rock, Paper, Scissors}; X_M is the oldest move in the sequence and the newest is X_1. The way we will do learning here is to look at the past M moves each time you play the game and choose rock, paper, or scissors. By simply counting the frequency of the various combinations of rock, paper and scissors from the past plays, we can form a multidimensional multinomial model (or probability table). The counts will start off with ones everywhere.

Download the code rockpaper.m from the Matlab tutorial page and modify the parts that are commented to compute statistics on the M-th order multinomial distribution from the clicks (currently it always guesses ‘Rock’, which is not very good). You will need to write code for orders M=1,2,3,4,5. You will also need code to find the slice of the M-th order distribution given the past M-1 clicks and compute the most likely next guess (R, P, or S); a sketch of this indexing is given below. In Matlab, the 3 configurations are labeled 1,2,3 (i.e. R, P, S). You will only need to add about 20 or so lines of code (which are actually quite trivial). Hand in your code as well as a short discussion of how the algorithm behaves for 1st order, 2nd order, ... up to 5th order probability tables and how it compares against a human player.
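For illustration, one way to store and index such a table (the actual variable names in rockpaper.m may differ; these are assumptions):

M = 3;                                 % order of the model
counts = ones(3 * ones(1, M));         % 3x3x3 table of ones (for M=1, use ones(3,1))

% After each click, increment the count for the last M moves
% (entries are 1=Rock, 2=Paper, 3=Scissors, oldest first).
history = [1 2 3];                     % hypothetical last M clicks
idx = num2cell(history);               % lets us write counts(idx{:}), i.e. counts(1,2,3)
counts(idx{:}) = counts(idx{:}) + 1;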
As an example, for order M=2, we would have the following initial values for the counts cube:

cube = [1 1 1; 1 1 1; 1 1 1];
After observing the following sequence (the code assumes we have RRRR... in our memory), and with the last click being on the top of the window, which quits the program, we get:

RPSRPSRPS
Basically, this is generated by clicking in the bottom part of the window with the left, middle and then right mouse buttons, repeating three times. This gives us the following data, which we would treat as iid samples for our multidimensional multinomial:

RR, RP, PS, SR, RP, PS, SR, RP, PS
For the counts cube, we would get something like this:

cube = [2 4 1; 1 1 4; 3 1 1];
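As a quick check, this short loop (the variable names are mine) reproduces the counts above from the example sequence, with rows indexing the older click and columns the newer one:

seq  = [1 1 2 3 1 2 3 1 2 3];          % one R of memory, then RPSRPSRPS (1=R, 2=P, 3=S)
cube = ones(3, 3);                     % counts start at one everywhere
for t = 2:length(seq)
    cube(seq(t-1), seq(t)) = cube(seq(t-1), seq(t)) + 1;
end
cube                                   % gives [2 4 1; 1 1 4; 3 1 1]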
The code should predict that your next move is most likely rock as you click the last scissors with the cursor at the top of the window to quit. You don’t need to normalize the counts cube since this is trivial, but in general a true probability table would be normalized. Also, it is easier to accumulate more counts in an online manner if we don’t normalize.
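For the prediction step, one possible sketch for the M=2 case, using the cube above (for higher orders you would slice by the last M-1 clicks, e.g. counts(i,j,:) for M=3); the names are mine, not from rockpaper.m:

last = 3;                          % previous human click was Scissors
[mx, guess] = max(cube(last, :));  % slice cube(3,:) = [3 1 1], so guess = 1 (Rock)

% To beat the guessed move: Paper beats Rock, Scissors beats Paper,
% Rock beats Scissors.
counter = mod(guess, 3) + 1;       % here 1 -> 2, i.e. the computer plays Paper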