Machine Learning (W4771)
Homework 2: Decision Trees
Due: at class time, Tuesday, February 15.
Chapter 3 of "Machine Learning" by Tom M. Mitchell
Part I: Implement decision tree induction, following
the recursive partitioning algorithm outlined in Table 3.1 of the text.
You may program in any computer language.
Your initial implementation should follow Table 3.1 closely, implementing
exactly the three base cases described in class.
You will try out your implementation on three datasets, which are
available from ~es66/irvine/ on AcIS machines.
If you do not
have an account with Columbia's AcIS, please make arrangements with
the teaching assistant to get the data. Each dataset's documentation
describes its attributes and their ranges of values; further details will
be given in class on the day this assignment is handed out.
- Organize your code well:
- Same code should work for all 5 data sets (the three monk's problems, plus the loan and cancer data)
- NOTE: TARGET CLASSIFICATION IS IN DIFFERENT COLUMNS, depending
on the data set.
- data sets have different numbers of attributes
- incorporate meaningful attribute names, so output trees make sense
- plan for easily incorporating modifications
- If not in LISP, you might want a symbol table of value names for each attribute.
- data structures:
- nodes (each with an attribute)
- sets of children (not necessarily a binary tree)
- subsets (i.e., partitions) of training data
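One minimal way to sketch these data structures in Python (the names and shapes here are illustrative, not required; examples are assumed to be dictionaries mapping attribute names to values):

```python
class Node:
    """A decision-tree node: either an internal test on an attribute,
    or a leaf holding a class label."""
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute   # attribute tested here (internal nodes only)
        self.label = label           # class label (leaf nodes only)
        self.children = {}           # attribute value -> child Node (not necessarily binary)

    def classify(self, example):
        """Follow branches by attribute value until a leaf is reached."""
        if self.label is not None:
            return self.label
        return self.children[example[self.attribute]].classify(example)

# A one-level tree over a hypothetical "outlook" attribute:
root = Node(attribute="outlook")
root.children["sunny"] = Node(label="no")
root.children["rain"] = Node(label="yes")
print(root.classify({"outlook": "rain"}))  # -> yes
```

Keeping `children` as a dictionary keyed by attribute value handles non-binary splits directly, and storing attribute names (rather than column indices) in the nodes makes the printed trees readable.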
For your own enlightenment, contrast your trees' resulting performance to that
obtained by simply always guessing "yes" or always guessing "no" -- is it smarter than a monkey?
Only the training data should be used by
your induction algorithm to build the tree. That is, entropy
should be measured over (subsets of) the training data only. This
way, the performance of a resulting tree, measured over the "unseen" test data, can be used
to estimate the true performance of the tree in general.
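For reference, entropy and information gain over (a subset of) the training data can be computed as follows; this is a sketch, with examples assumed to be dictionaries and `target` naming the classification column:

```python
import math
from collections import Counter

def entropy(examples, target):
    """Entropy of the class distribution among `examples`."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    """Expected entropy reduction from partitioning on `attribute`."""
    total = len(examples)
    partitions = {}
    for ex in examples:
        partitions.setdefault(ex[attribute], []).append(ex)
    remainder = sum(len(part) / total * entropy(part, target)
                    for part in partitions.values())
    return entropy(examples, target) - remainder

data = [{"a": "x", "class": "yes"}, {"a": "x", "class": "yes"},
        {"a": "y", "class": "no"}, {"a": "y", "class": "no"}]
print(entropy(data, "class"))                # -> 1.0 (evenly split classes)
print(information_gain(data, "a", "class"))  # -> 1.0 (attribute "a" separates perfectly)
```

At each recursive step the induction algorithm would compute `information_gain` for every remaining attribute over the current training subset and split on the best one.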
The three datasets are:
- The "monk's problems." These are three artificial concepts created for
the purpose of benchmarking classification algorithms.
- German loan applicants. Note, this data has some numerical attributes,
which should be ignored by your system, at least in its first version.
- Breast cancer recurrence. Note, this data contains a small
number of unknown values, marked with question marks. Your system
should handle these values judiciously, for example, by using a method
described in class. Please document your choice clearly.
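One simple method is to replace each "?" with the most common known value of that attribute in the training data. A sketch (this is only one of the options; document whichever method you actually choose):

```python
from collections import Counter

def fill_unknowns(examples, attribute):
    """Replace '?' in `attribute` with its most common known value."""
    known = [ex[attribute] for ex in examples if ex[attribute] != "?"]
    most_common = Counter(known).most_common(1)[0][0]
    for ex in examples:
        if ex[attribute] == "?":
            ex[attribute] = most_common
    return examples

rows = [{"size": "small"}, {"size": "?"}, {"size": "small"}, {"size": "large"}]
fill_unknowns(rows, "size")
print(rows[1]["size"])  # -> small
```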
Part II: Once you get your basic implementation working, your assignment is to test
at least 3 of the following variations. Note that some are much simpler
to implement than others. You may receive extra credit for doing
more than 3, or for doing an extensive job on a particular item. However,
be careful not to take on terribly large ambitions; each of these is bound
to take more time than one might predict, and it is important to preserve
your programming efforts for the remaining 3 programming projects coming this
semester. The following list is long primarily to get you
to think about these different learning issues.
- An attribute selection criterion other than entropy,
e.g., accuracy. Note that a negative result (i.e., a decrease
in performance) would be interesting for this one.
- Effect of decreasing the size of the training sample incrementally,
e.g., does it do just as well over the test data after training over
only half of the training data?
- Avert overlearning with an alternative stopping criterion (i.e., base
case of the tree growth algorithm).
- Avert overlearning with a post-pruning method.
- Deal with the numerical attributes of the loan data, either
by partitioning each numerical attribute into a small number of
value ranges (thereby making it discrete, as already done for some attributes), or by programming
your system to automatically select a threshold with which to divide
examples into two subsets. Note that if you choose the second
alternative, a resulting decision tree should be able to test the same
attribute more than once on a path from the root to a leaf.
- Incorporating costs associated with attributes, e.g., some medical
tests are more expensive than others.
- Incorporating a cost matrix of penalties, e.g., false positives might be
worse than false negatives. Such a matrix is shown in the documentation
for the German loan applicant data.
- Create and interpret a visualization of the resulting classification
performance such as that shown in class for the Monk's datasets.
- Create another benchmark dataset such as the Monk's problems, and try
your implementation on it.
The dataset's contents should be well-motivated.
- Active learning: the algorithm picks each training example; what
question would most help the learner at each moment? Note that this
is only possible for a made-up problem such as the Monk's data, since
your system must be able to produce the correct answer (supervised
training examples) to any question the learning module asks.
- Evaluating tree performance with accuracy ("how often correct?")
may not always tell you everything you want to know about how
well the tree is doing. For example, even if a tree's accuracy is not
better than always guessing "class 1", how about examining
how frequently each class is correctly handled, i.e., recall?
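For the numeric-threshold variation above, one common approach (a sketch, not the required method) evaluates candidate thresholds at midpoints between consecutive sorted values and keeps the one giving the highest information gain for the binary split:

```python
import math
from collections import Counter

def class_entropy(labels):
    """Entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_threshold(values, labels):
    """Pick the midpoint threshold t on a numeric attribute that maximizes
    information gain for the binary partition (<= t) vs (> t)."""
    pairs = sorted(zip(values, labels))
    base = class_entropy(labels)
    best_t, best_gain = None, -1.0
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue                      # no threshold between equal values
        t = (v1 + v2) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = base - (len(left) / len(pairs) * class_entropy(left)
                       + len(right) / len(pairs) * class_entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t

# A perfectly separable example: the threshold falls between 30 and 40.
print(best_threshold([10, 20, 30, 40, 50], ["no", "no", "no", "yes", "yes"]))  # -> 35.0
```

Since the threshold chosen depends on the current training subset, the same numeric attribute may usefully be tested again (with a different threshold) further down the tree.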
Since you have several datasets to work with, we recommend you
focus on only one or two of them when determining which type of variation
you choose to experiment with.
Do not strain yourself to get "good" results! Your goal is to have a
working implementation (Part I), and to also try a small number of
modifications to see their effect on accuracy (Part II). Low accuracy
is to be expected, especially for certain parts of this assignment.
Please organize your code well, e.g., the I/O code should be in a separate file
from the induction code, and all code should be nicely documented.
What to hand in:
- hard copy of your code (well documented)
- summary (one or two paragraphs) giving an overview of
your basic implementation, and describing any difficulties you encountered.
- hard copy of sample runs (also well documented)
across all three datasets, showing resulting decision trees and their
performance over the test data.
- a written description of each of the 3 or more modifications you
made to your basic system, 1 page max each. This should describe
the details of your modification, the motivation ("why?"), and
the resulting change of performance.
- (an electronic submission of your code, or a brief demo,
may also be requested later)
For your information, the data sets for this assignment, as well as many
others, are available from the University of California, Irvine (UCI)
machine learning repository.
The breast cancer domain was obtained from the University Medical Centre,
Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and
M. Soklic for providing the data. Please include this citation if you plan
to use this database for publicized research. Please restrict the use of
this data to non-profit research. Finally, do not disseminate this data.
Other parties interested in obtaining this data should be directed to the
repository above, which provides the relevant information.
email: evs at cs dot columbia dot edu