Machine Learning (W4771)

Homework 2: Decision Trees

Due Date: Class time, Tuesday, February 15.

Reading: Chapter 3 of "Machine Learning" by Tom M. Mitchell

Part I: Implement decision tree induction, following the recursive partitioning algorithm outlined in Table 3.1 of the text. You may program in any computer language. Your initial implementation should follow Table 3.1 closely, implementing exactly the three base cases described in class.
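To make the Table 3.1 recursion concrete, here is a minimal Python sketch of the partitioning loop. This is an illustration under assumptions, not a reference solution: the dict-based example representation is one arbitrary choice, and information_gain is a hypothetical helper (sketched under the implementation tips below).

    from collections import Counter

    def majority_class(labels):
        """The most common class label in a collection of labels."""
        return Counter(labels).most_common(1)[0][0]

    def build_tree(examples, labels, attributes):
        """ID3-style induction; each example is a dict mapping attribute -> value."""
        # Base case 1: all examples share one class -> leaf with that class.
        if len(set(labels)) == 1:
            return labels[0]
        # Base case 2: no attributes left to split on -> majority-class leaf.
        if not attributes:
            return majority_class(labels)
        # Split on the attribute with the highest information gain.
        best = max(attributes, key=lambda a: information_gain(examples, labels, a))
        node = {"attribute": best, "branches": {}, "default": majority_class(labels)}
        rest = [a for a in attributes if a != best]
        for value in set(e[best] for e in examples):
            subset = [(e, l) for e, l in zip(examples, labels) if e[best] == value]
            node["branches"][value] = build_tree([e for e, _ in subset],
                                                 [l for _, l in subset], rest)
        # Base case 3 (an empty branch, i.e. an attribute value unseen in this
        # subset) is handled at classification time via the stored default.
        return node

    def classify(node, example):
        """Follow branches until a leaf (a bare class label) is reached."""
        if not isinstance(node, dict):
            return node
        branch = node["branches"].get(example[node["attribute"]])
        return classify(branch, example) if branch is not None else node["default"]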

You will try out your implementation on three datasets, which are available from ~es66/irvine/ on AcIS machines. If you do not have an account with Columbia's AcIS, please make arrangements with the teaching assistant to get the data. Each dataset's documentation describes its attributes and their ranges of values; further details will be given in class on the day this assignment is handed out.

Implementation tips:

For your own enlightenment, contrast the performance of your trees with that obtained by simply always guessing "yes" or always guessing "no" -- is your tree smarter than a monkey?
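A quick way to compute that baseline, as a self-contained sketch (the names are illustrative):

    from collections import Counter

    def baseline_accuracy(train_labels, test_labels):
        """Accuracy of always guessing the most common class in the training data."""
        guess = Counter(train_labels).most_common(1)[0][0]
        return sum(1 for label in test_labels if label == guess) / len(test_labels)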

Only the training data should be used by your induction algorithm to build the tree. That is, entropy should be measured over (subsets of) the training data only. This way, the performance of a resulting tree, measured over the "unseen" test data, can be used to estimate the true performance of the tree in general.
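For reference, entropy and information gain over (a subset of) the training sample might look like the following sketch; this is the information_gain assumed by the induction sketch in Part I, and the formulas follow Chapter 3 of Mitchell.

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy(S) = -sum over classes i of p_i * log2(p_i)."""
        total = len(labels)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(labels).values())

    def information_gain(examples, labels, attribute):
        """Expected reduction in entropy from partitioning on `attribute`."""
        total = len(labels)
        gain = entropy(labels)
        for value in set(e[attribute] for e in examples):
            subset = [l for e, l in zip(examples, labels) if e[attribute] == value]
            gain -= (len(subset) / total) * entropy(subset)
        return gain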

The three datasets are:

Part II: Once you get your basic implementation working, your assignment is to test at least 3 of the following variations. Note that some are much simpler to implement than others. You may receive extra credit for doing more than 3, or for doing an extensive job on a particular item. However, be careful not to take on terribly large ambitions; each of these is bound to take more time than one might predict, and it is important to conserve your programming effort for the three remaining programming projects this semester. The following list is long primarily to get you to think about these different learning issues.

  1. An attribute selection criterion other than entropy, e.g., accuracy. Note that a negative result (i.e., a decrease in performance) would be interesting for this one.
  2. Effect of incrementally decreasing the size of the training sample, e.g., does the tree do just as well on the test data after training on only half of the training data?
  3. Avert overlearning with an alternative stopping criterion (i.e., base case of the tree growth algorithm).
  4. Avert overlearning with a post-pruning method.
  5. Deal with the numerical attributes of the loan data, either by partitioning each numerical attribute into a small number of value ranges (thereby making it discrete, as was already done for some attributes), or by programming your system to automatically select a threshold with which to divide the examples into two subsets (a threshold-selection sketch follows this list). Note that if you choose the second alternative, a resulting decision tree should be able to test the same numerical attribute more than once on a path from the root to a leaf.
  6. Incorporating costs associated with attributes, e.g., some medical tests are more expensive than others.
  7. Incorporating a cost matrix of penalties, e.g., false positives might be worse than false negatives. Such a matrix is shown in the documentation for the German loan applicant data.
  8. Create and interpret a visualization of the resulting classification performance such as that shown in class for the Monk's datasets.
  9. Create another benchmark dataset such as the Monk's problems, and try your implementation on it. The dataset's contents should be well-motivated.
  10. Active learning: the algorithm picks each training example; what question would most help the learner at each moment? Note that this is only possible for a made-up problem such as the Monk's data, since your system must be able to produce the correct answer (supervised training examples) to any question the learning module asks.
  11. Evaluating tree performance with accuracy ("how often is it correct?") may not always tell you everything you want to know about how well the tree is doing. For example, even if a tree's accuracy is no better than always guessing "class 1", how about examining how frequently each class is correctly handled, i.e., recall? (A small recall sketch follows this list.)
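For variation 5, one common approach, given here only as a sketch that reuses the hypothetical entropy helper from the implementation tips, is to sort the numeric values and evaluate candidate thresholds at midpoints between adjacent distinct values, keeping the one with the highest information gain:

    def best_threshold(values, labels):
        """Binary-split threshold on a numeric attribute maximizing information gain."""
        pairs = sorted(zip(values, labels))
        total = len(pairs)
        base = entropy([l for _, l in pairs])
        best_gain, best_t = -1.0, None
        for i in range(1, total):
            if pairs[i - 1][0] == pairs[i][0]:
                continue  # identical values: no boundary here
            t = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            left = [l for v, l in pairs if v <= t]
            right = [l for v, l in pairs if v > t]
            gain = (base - (len(left) / total) * entropy(left)
                         - (len(right) / total) * entropy(right))
            if gain > best_gain:
                best_gain, best_t = gain, t
        return best_t, best_gain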
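And for variation 11, per-class recall is simple to tabulate once you have parallel lists of actual and predicted labels (again a sketch with illustrative names):

    from collections import Counter

    def per_class_recall(actual, predicted):
        """For each class, the fraction of its examples classified correctly."""
        totals = Counter(actual)
        correct = Counter(a for a, p in zip(actual, predicted) if a == p)
        return {cls: correct[cls] / totals[cls] for cls in totals}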

Since you have several datasets to work with, we recommend focusing on only one or two of them when deciding which variations to experiment with.

Do not strain yourself to get "good" results! Your goal is to have a working implementation (Part I), and to also try a small number of modifications to see their effect on accuracy (Part II). Low accuracy is to be expected, especially for certain parts of this assignment.

Please organize your code well; e.g., the I/O code should be in a separate file from the induction code, and all code should be nicely documented.

What to hand in:

For your information, the data sets for this assignment, as well as many others, are available from the University of California, Irvine machine learning repository: http://www.ics.uci.edu/~mlearn/MLRepository.html.

The breast cancer domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and M. Soklic for providing the data. Please include this citation if you plan to use this database for published research. Please restrict the use of this data to non-profit research. Finally, do not disseminate this data; other parties interested in obtaining it should be directed to the web site above, which provides the relevant information.

email: evs at cs dot columbia dot edu