Homework #4: Machine Learning
Experimentation due: December 3rd
It is the eve of the primaries for the 2008 presidential election and we are all wondering who the winners will be. Will the Republicans or the Democrats come out ahead?
For this assignment you will be working with a dataset called OpenSecret, linked to a subset of Census data from April 2000. OpenSecret (http://opensecrets.org/presidential/AllCands.asp) lists contributions to the Democratic and Republican candidates for the presidential primary election along with information about each contributor, including name, city, state, zip code, company employed by, and job title. From the Census data for 2000, you will have access to information about each zip code, including race, education level, income level, and employment. You can find both of these data sets in this directory. From this data you could compute, for example, the ethnic majority of the area a contribution comes from. That lets you determine experimentally whether the fact that a contributor is from an area that is primarily Hispanic (vs. Caucasian, African American, or Asian) is a good predictor of which candidate that person will donate money to. (So, no, this will not tell us who will come out ahead, but it will give us an idea of how we could learn from data to make these kinds of predictions.)
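As a rough illustration of the linking idea above, the following Python sketch attaches an area-level "ethnic majority" feature to a contributor by looking up the zip code. The column names (`white`, `black`, `hispanic`, `asian`), the sample counts, and the record layout are all made up for illustration; the real Census and OpenSecret files will look different.

```python
# Hypothetical census rows keyed by zip code. The group names and counts are
# invented for illustration and are NOT the headers of the provided files.
census = {
    "10027": {"white": 9000, "black": 21000, "hispanic": 15000, "asian": 2000},
    "94110": {"white": 25000, "black": 3000, "hispanic": 30000, "asian": 8000},
}


def ethnic_majority(counts):
    """Return the group with the largest population count (a plurality)."""
    return max(counts, key=counts.get)


def add_area_features(contributor, census_by_zip):
    """Copy a contributor record and attach area-level features via zip code."""
    area = census_by_zip.get(contributor["zip"])
    features = dict(contributor)
    # "?" marks a missing value when the zip code has no census row.
    features["area_majority"] = ethnic_majority(area) if area else "?"
    return features


# Hypothetical contributor record.
donor = {"name": "J. Doe", "zip": "94110", "candidate": "candidate_a"}
print(add_area_features(donor, census))
```

Note that the zip code is used only as the join key here; it would be dropped before the record is handed to the learner.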
Your job is to build a machine learning model from this data using Weka such that, given a new person with all of the associated information specified above, your model can predict which presidential candidate that person will donate money to and at what level. The level must be one of three categories: less than $500, between $500 and $1000, or more than $1000.
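The three donation levels can be derived from the raw contribution amount with a small helper like the one below. This is only a sketch: the label strings are arbitrary, and it assumes that amounts of exactly $500 and exactly $1000 fall in the middle category, which the assignment text does not specify.

```python
def donation_level(amount):
    """Map a dollar amount to one of three nominal class labels.

    Assumption: the boundaries $500 and $1000 are both treated as part of
    the middle category.
    """
    if amount < 500:
        return "low"
    if amount <= 1000:
        return "mid"
    return "high"


for amount in (250, 500, 1000, 2300):
    print(amount, donation_level(amount))
```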
The tools that you have to use to perform your experiments can be
accessed at http://www.cs.waikato.ac.nz/ml/weka/
You will need to download the WEKA software to use locally. A tutorial
is also available at this site.
After pre-processing the data to put it in the necessary format for
WEKA, you will experiment with the impact of three parameters on the
accuracy of your results:
1. Learning algorithm. You will experiment with two different learning algorithms: you should use decision trees and then pick one of Naive Bayes or linear regression. You are asked to document which algorithm produces more accurate results for this data set. You should discuss in your paper why you think the algorithms perform differently.
2. Attributes. You will use feature selection to pinpoint which attributes most significantly impact results. You should experiment with extracting different types of features from the data. Think about the kinds of things that you think intuitively should impact who a person votes for. Some things are given for each individual (e.g., job title is explicit, while gender is only indirectly available), while other things are implicit in the census data from the characteristics of the neighborhood the person is from (e.g., is it primarily rural, does it consist of highly educated people or of low-income families, or does it have a large number of unemployed individuals?). Zip code should be used to link an individual from the OpenSecret data to the area s/he is from; don't use zip code as a predictor, as that would dominate any other features you may find in the Census data.
3. Training data size. You will experiment with different sizes of the training data set to determine the smallest amount of training data you need to get good results.
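For the pre-processing step mentioned before the list, WEKA's native input format is ARFF: a header of `@attribute` declarations followed by comma-separated rows under `@data`. The sketch below writes such a file from Python dictionaries. The attribute names and values are hypothetical, and the helper assumes nominal values contain no spaces or commas (otherwise they would need quoting).

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string
    "numeric" or a list of allowed nominal values.
    Assumption: names and values contain no spaces, commas, or quotes.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        spec = "{" + ",".join(kind) + "}" if isinstance(kind, list) else kind
        lines.append("@attribute " + name + " " + spec)
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(row[name]) for name, _ in attributes))
    return "\n".join(lines) + "\n"


# Hypothetical feature set; zip code is deliberately not included.
attributes = [
    ("area_income", "numeric"),
    ("area_majority", ["white", "black", "hispanic", "asian"]),
    ("level", ["low", "mid", "high"]),  # class attribute goes last
]
rows = [{"area_income": 52000, "area_majority": "hispanic", "level": "mid"}]
print(to_arff("donors", attributes, rows))
```

WEKA treats the last attribute as the class by default, which is why the label is placed last here.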
You will use cross validation to select the model that you think is most accurate. We will test your submissions on a new set of test data and you will receive a ranking of your model against all other assignments.
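WEKA can run cross validation for you, but if you drive experiments from your own scripts it helps to see what the procedure does. The following is a minimal sketch of k-fold splitting, not WEKA's implementation; the function name and the default seed are arbitrary choices.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation.

    Each of the n examples appears in exactly one test fold.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


for train, test in k_fold_indices(10, 5):
    print(len(train), "training examples; test fold:", sorted(test))
```

The model's cross-validated accuracy is then the average of the accuracies measured on the k test folds.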
Notes on feature selection: You may experiment with feature selection entirely within WEKA or outside of WEKA. WEKA implements an incremental search over features (attributes) to select the ones that are the best predictors of your classes as described in this paper. However, you may wish to have finer-grained control over selection of features. For example, you might represent the feature "personal information" as a single text string containing name, title, city and state. Or, you might extract a separate feature for each one of these attributes. If you do the latter, you could experiment with the impact of different features outside of WEKA by systematically exploring the impact of these different attributes on results. There are many choices for how to do this. There is the chance that one attribute determines everything (e.g., that city alone determines whether you give money and to whom). If this is the case, you should continue experimenting by removing the one dominating feature and then measuring how other attributes contribute. If you don't do this, you will not have much to write about and your homework grade will reflect that. If you want, you can use any of the more elaborate (e.g., non-greedy) feature selection algorithms described in the machine learning literature. Note that, when performing feature selection, your data should be divided into at least three sets: a training set (on which you train your model), a validation set (on which you evaluate selection of features), and a test set (on which you perform the final evaluation). In particular, this means that you should never use any of your testing data to run the feature selection algorithm (testing data is only used after the choice of features has been finalized).
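If you run feature selection outside of WEKA, one simple option is greedy forward selection scored on the validation set. The sketch below shows a three-way split and the greedy loop. It is one possible approach, not WEKA's search; `score` stands in for "train on the training set with these features and return validation accuracy", and the toy scores used in the example are invented numbers.

```python
import random


def three_way_split(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


def forward_selection(candidates, score):
    """Greedily add the feature that most improves score; stop when none does.

    score(features) must be computed on the validation set only, never on
    the test set.
    """
    selected, best = [], score([])
    remaining = list(candidates)
    while remaining:
        top_score, top_feature = max((score(selected + [f]), f) for f in remaining)
        if top_score <= best:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Toy stand-in for validation accuracy (in percent); the gains are made up.
gains = {"job_title": 10, "area_income": 5, "city": 0}


def toy_score(features):
    return 40 + sum(gains[f] for f in features)


print(forward_selection(list(gains), toy_score))
```

Only after the feature set is fixed would the chosen model be evaluated once on the held-out test set.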
You are to write a five-page paper that contains:
A. A description of the model you submitted with charts that document why the model you chose did best. This part of the paper should show the differences in learning between the two different methods you used, in terms of accuracy and in terms of what is learned. Provide a short description of the learning algorithm itself (e.g., a paragraph on decision trees and either Naive Bayes or linear regression). For your decision tree results, you should use the learned tree to identify generalizations that were made and discuss whether these learned generalizations are meaningful (e.g., do they correspond to your intuition about why these attributes play a role?) (30 points)
B. Quantification of the effect of different attributes on the learning process. This discussion should examine the impact of feature selection, commenting on what happened as you systematically varied the features used. Which attributes were most important and how do your results illustrate that? How did accuracy change as you varied the attributes? Use charts and a description of the charts to answer this question. Discuss your results and explain why they do or do not make sense. (30 points)
C. Quantification of the effect of amount of data on the learning process. Again, use charts and description of the charts to show how accuracy is affected by data set size. These charts should show the learning curve. (15 points)
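A learning curve is just accuracy measured at increasing training-set sizes while the evaluation set stays fixed. The sketch below shows the bookkeeping. The `train_and_score` callback is a placeholder for however you train and evaluate a model (for example, by invoking WEKA from a script), and the majority-class baseline and sample labels are included only to make the example runnable.

```python
from collections import Counter


def learning_curve(train_rows, eval_rows, sizes, train_and_score):
    """Return (size, accuracy) points, training on the first `size` rows."""
    return [(n, train_and_score(train_rows[:n], eval_rows))
            for n in sizes if n <= len(train_rows)]


def majority_baseline(train, evaluation):
    """Placeholder learner: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return sum(1 for _, y in evaluation if y == label) / len(evaluation)


# Rows are (features, label) pairs; features are unused by the baseline.
train_rows = [(None, "A"), (None, "B"), (None, "B"), (None, "B")]
eval_rows = [(None, "B"), (None, "B"), (None, "A"), (None, "B")]
print(learning_curve(train_rows, eval_rows, [1, 4, 8], majority_baseline))
```

Plotting the returned points with size on the x-axis and accuracy on the y-axis gives the chart requested in part C. Shuffle the training rows before slicing so the smaller subsets are not biased by file order.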
You should hand in:
A. Any scripts or programs you wrote to assist you in running the experiments. Note that you are free this time to use any language that you find convenient. PERL is a good choice. JAVA is also fine. (15 points)
B. A readme file describing your approach and the scripts (1 page) (10 points)
C. The report (75 points, as broken down above)
Everything should be submitted electronically, and you should hand in a hard copy of your report in class on Dec. 3rd.