## Random Decision Tree (RDT)

### What it is

Random decision tree algorithm constructs multiple decision trees randomly.

When constructing each tree, the algorithm picks a "remaining" feature randomly at each node expansion without any purity function check (such as information gain, gini index, etc.). A categorical feature (such as gender) is considered "remaining" if the same categorical feature has not been chosen previously in a particular decision path starting from the root of tree to the current node. Once a categorical feature is chosen, it is useless to pick it again on the same decision path because every example in the same path will have the same value (either male or female). However, a continuous feature (such as income) can be chosen more than once in the same decision path. Each time the continuous feature is chosen, a random threshold is selected.

A tree stops growing any deeper if one of the following conditions is met:

• A node becomes empty or there are no more examples to split in the current node.
• The depth of tree exceeds some limits.

Each node of the tree records class distributions. Assume that a node has a total of 1000 examples from the training data that pass through it. Among these 1000 examples, 200 of them are + and 800 of them are -. Then the class probability distribution for + is P(+|x) = 200/1000 = 0.2 and the class probability distribution for - is P(-|x) = 800/1000 = 0.8

The algorithm does not prune the randomly built decision tree in the conventional sense (such as MDL-based pruning and cost-based pruning, and etc.). However, it does remove "unnecessary" nodes. A node expansion is considered unnecessary, if none of its descendents have significantly different class distribution from this node. When this happens, we remove the expansion from this node and make the node a leaf node. In our implementation, the random tree is built recursively and "necessity check" is performed when the recursion returns.

Classification is always done at the leaf node level. Each tree outputs a class probability distribution (as defined above). The class distribution outputs from multiple trees are "simply" averaged as the final class distribution output. For example, assume that there are two trees. Tree 1 outputs P(+|x) = 0.2 and P(-|x) = 0.8, and tree 2 outputs P(+|x) = 0.3 and P(-|x) = 0.7. The averaged probability output will be P(+|x) = (0.2 + 0.3)/2 = 0.25, and P(-|x) = (0.8 + 0.7) = 0.75.

In some situations, a leaf node could be empty. When this happens, the algorithm doesn't output NaN (not a number) or 0 as the class probability distribution, but it goes one level up to its parent node and output its parent's class distribution.

In order to make a final prediction, a loss function is needed. For example, under traditional 0-1 loss, the best prediction is to choose the class label that is most likely. For example, for the binary problem, such as + and -, we predict class label + iff P(+|x) >= 0.5. In some situations, when the training data and testing data are not drawn with the same probability, the optimal decision threshold may not be exactly 0.5.

### Theoretical Explanation

Random decision tree is an efficient implementation of Bayes Optimal Classifier or (OBC). It is also termed as Model Averaging (MA).

### target distribution

### Good Generality

The same code can perform classification, regression, ranking and multi-label classification and can be easily implemented in MapReduce/Hadoop.

### Related Papers

[2010]   Xia Tian Zhang, Quan Yuan, Shiwan Zhao, Wei Fan, Wen Tao Zheng, and Dong Wang, "Multi-label Learning without Multi-label Cost", 2010 SIAM International Conference on Data Mining, Columbus, OH, April, 2010. The Powerpoint presentation can be found here.

[2006]   Kun Zhang, Wei Fan, Xiaojing Yuan, Ian Davidson, and Xiangshang Li, "Forecasting Skewed Stochastic Biased Ozone Days: Analyses and Solutions" , 2006 IEEE International Conference on Data Mining (ICDM'2006), December 2006. Best Paper Award: Application Category . Presentation PowerPoint (lots of animations).

An extended journal version of this paper comparing random decision trees with SVM, AdaBoosting and a number of other popular learning algorithms on this problem can be found here .

The dataset is available here.

[2006]   Kun Zhang, Wei Fan, Bill Buckles, Xiaojing Yuan, and Zujia Xu: "Discovering Unrevealed Properties of Probability Estimation Trees: On Algorithm Selection and Performance Explanation", 2006 IEEE International Conference on Data Ming (ICDM'2006), December 2006.

[2006]   Ian Davidson, and Wei Fan, "When Efficient Model Averaging Out-performs Boosting and Bagging", PKDD 2006, Berlin, September 2006. Click here for its PowerPoint presentation.

[2006]   Wei Fan, Joe McClosky, and Philip S. Yu, "A General Framework for Accuracy and Fast Regression by Data Summarization in Random Decision Trees", KDD2006, Philadelphia, August 2006. Click here for PowerPoint presentation.

[2005]   Wei Fan, Ed Greengrass, Joe McCloskey, and Philip S. Yu, "Effective Estimation of Posterior Probabilities: Explaining the Accuracy of Randomized Decision Tree Approaches". The Fifth IEEE International Conference on Data Mining (ICDM'05), Houston, Texas, November 2005.

[2005]   Wei Fan, Janek Mathuria, and Chang-tien Lu ``Making Data Mining Models Useful to Model Non-paying Customers of Exchange Carriers''. 2005 SIAM International Conference on Data Mining (SDM'05), Newport Beach, CA, April 2005.

[2005]   Fei Tony Liu, Kai Ming Ting, and Wei Fan ``Maximizing Tree Diversity by Building Complete-Random Decision Trees'' (short paper), 2005 Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'05), Hanoi, Vietnam, May 2005.

[2004]   Wei Fan "On the Optimality of Probability Estimation by Random Decision Trees", The Nineteenth National Conference on Artificial Intelligence (AAAI'04), San Jose, CA, July, 2004.

[2004]   Wei Fan ``Systematic Data Selection to Mine Concept-drifting Data Streams'', The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05), Seattle, Washington, August 2004.

[2003]   Wei Fan, Haixun Wang, Philip S. Yu, Sheng Ma "Is random model better? On its accuracy and efficiency", 3rd IEEE International Conference on Data Mining (ICDM 2003), November 19-22, Melbourne, FL: 51-58.

[2008]   Kun Zhang, and Wei Fan: "Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond". Knowl. Inf. Syst. 14(3): 507-527 (2008)

#### Papers that I am not an author

[2005] Fei Tony Liu "The Utility of Randomness in Decision Tree Ensembles", Master Thesis, Faculty of Information Technology, Monash University, Australia.

[2005] Kun Zhang, Zujia Xu, Jing Peng, and Bill Buckles Learning through Changes: An Empirical Study of Dynamic Behaviors of Probability Estimation Trees", Fifth IEEE International Conference on Data Mining (ICDM'05), Houston, TX.

An open source project for Random Decision Tree (RDT) is avalable from http://www.dice4dm.com . The source code is in JAVA and can perform classification, ranking, regression and multi-label classification.

## Ozone Dataset

The ozone data set is a streaming problem with both skewed distribution (3 to 5% positive) and unbalanced loss function. The dataset has 72 continuous features normalized between 0 and 1. The dataset spans over a period of 7 years and each data entry has a date stamp.

This dataset is available for research purposes. Please send a request to either Dr. Kun Zhang (zhang.kun05@gmail.com) or Wei (wei.fan@gmail.com). A shorter version of the dataset is now available from UCI machine learning database at here.

## Artificial Anomaly

Artificial anomaly generates "anomaly" from normal data by analyzing the characteristics of its distribution. The work is described in the following KAIS'04 paper. The source code to generate artificial anomaly can be found here.

## Sample Selection bias

An auto-clustering based method to correct all types of sample selection bias has been proposed and discussed in the following SDM'08 paper. The MatLab code developed by Xiaoxiao Shi and related datasets can be downloaded here.

## Transfer Learning

We propose a framework (as shown below) to actively transfer examples from out-domain into in-domain. We formally analyze how the error descreases as more in-domain examples are actively queried and how out-of-domain data could help reduce the cost of ask domain experts to provide labeled in-domain data. The paper by Xiaoxiao Shi, Wei Fan, and Jiangtao Ren can be found here, and the software and dataset (synthetic and landmine) written and prepared by Xiaoxiao Shi can be found here, and the 20 Newsgroup data can be found here.

In another approach, we combine models previously trained from different domains via locally weighted framework to transfer the knowledge into a new domain. One important difference is that the out-of-domain models in our approach did not use any in-domain data at all during their training. In other words, our proposed approach simply "adopt" them into the in-domain.

In this example, both training set 1 and set 2 have "conflicting" regions from the in-domain data. Thus the models trained cannot be adopted directly by simple combination or averaging. However, using the proposed approach, as shown in the bottom right plot below, we can successfully transfer the right knowledge from several different domains. The paper by Jing Gao, Wei Fan, Jing Jiang and Jiawei Han can be found here. The code and dataset written by Jing Gao can be found here.

## Frequent Pattern based Feature Construction

Most inductive algorithms can only model from well-structured feature vectors, but in reality, raw data from many applications do not have any feature vectors available. Repeated patterns in raw data can provide solutions, and some examples include frequent graphs, itemsets and sequential patterns. It is well understood that frequent pattern mining is non-trivial since the number of unique patterns is exponential but many are non-discriminative and correlated.

Currently, frequent pattern mining is performed in batch mode of two sequential steps: enumerating a set of frequent patterns as candidate features, followed by feature selection. Although many methods have been proposed in the past few years on how to perform each separate step efficiently, there is still limited success in eventually finding those highly compact and discriminative patterns. The culprit is due to the inherent nature of this widely adopted batch approach.

We proposes a new and different approach to mine frequent patterns as discriminative features. It builds a decision tree that sorts or partitions the data onto nodes of tree. Then at each node, it directly discovers a discriminative pattern to further divide its examples into purer subsets, that previously chosen patterns during the same run cannot separate. Since the number of examples towards leaf level is relatively small, the new approach is able to examine patterns with extremely low global support that could not be enumerated on the whole dataset by the batch method.

The discovered feature vectors are more accurate on some of the most difficult graph and frequent itemset problems than most recently proposed algorithms but the total size is typically 50% or more smaller. Importantly, the minimum support of some discriminative patterns can be extremely low (e.g. 0.03%). In order to enumerate these low support patterns, state-of-the-art frequent pattern algorithm either cannot finish due to huge memory consumption or have to enumerate 10^1 to 10^3 times more patterns before they can even be found.

Details on this work can be found in this paper., and the code and dataset is here.