## Simple Nonparametric Methods

The essence of learning is to find a model M to approximate an unknown target function F(x). For classification problems, the main task is to look for a model to approximate the unknown distribution P(y|x). [One could also estimate the joint probability P(x,y) and this formulation is particularly useful for semi-supervised learning where the label of some data is unknown]. The estimated probability is P(y|x,M) which is conditional on the choice of model M. Thus, the main challenge is to find the best model M to make P(y|x,M) as close as possible to P(y|x) given some loss function (such as 0-1 loss, cost-sensitive loss, AUC, MSE, log loss or KL-divergence).

If the physics that generate a true function is known, such as a multinomial Gaussian distribution, learning is simply to estimate parameters within the true function form. This is normally performed thru regularization in order to resolve related problems such as ill-posed problem, numerical instability etc. One big challenge for these methods is to apply to a new application where the true physics is unknown or too complicated to understand is to choose the right form of function. This can be measured by "goodness of fit" criteria, such as AIC or BIC. But still, it is a non-trivial task to look for the most appropriate form of function or sometimes functionals. For example, for most effective use of SVM, one ought to choose the right kernel.

Given these observations, we have been interested in developing a family of methods that do not require deep understanding of the algorithms, and do not require too much tuning and parameterization. This is described by the following PowerPoint.

The source code, some graphical examples, related papers and some sample dataset can be found here.

## Stream Mining

I am interested in inductive mining of data streams. One of our early work is on the use of weighted ensembles to combine models trained from most recent data with models trained in the past in order to reduce the variance term of error bias and variance decomposition. A later improvement over this is the ability to handle situations where new data is inconsistent with older data. Both works are demonstrated in VLDB'04, and the software and dataset is available from Software and Dataset Download.

Another related problem in data stream mining is that labeled data may not be available instantly but it takes time and money to obtain them. We have proposed a method that decides which "sections of the model" may perform worse than expected and request labeled data to improve the learning model. A related problem on is how to evolve the model with a small number of labeled examples.

Some stream applications have extremely skewed distributions (like 1% or so positives). Specialized sampling techniques have to be proposed in order to handle these problems.

Research in on data streams have been active for the past few years and many algorithms have been proposed. One bottom line question is: given a stream mining application, what is the most important assumption that one can safely assume and what techniques would guarantee perform and return of technology investment? Our analysis on this problem can be found here.

A good resource to find stream mining paper is maintained by Dr. Mohamed M. Gabor at here.

## Sample Selection Bias

Normally, inductive learning assumes that the training data and future testing data follow the same distribution. However, in reality, this may not be true. For example, when a bank approves a loan the risk model can only be constructed from the data of its own customers. However, people applying for a loan may not necessarily follow the distribution as its existing customers and the actual risk factor may be quite different from the computed value using existing customers. Thus, using the computed risk factors, the bank may either accept customers that may default a loan or refuse customers that can be loyal and profitable.

All these types of problems are called Sample Selection Bias. Assume that P(x,y) is the true distribution to model. The collected data usually follows a distribution P(s,x,y) where s is a significant variable that affects the quality of collected data.

For a formal description of how s affects the quality of the dataset, please refer to this paper. Besides formulating the problem, the same paper also provides a model averaging approach and the use of unlabeled data to correct the effect of sample selection bias. A bias-type independent method based on structural discovery and sample re-balancing method is described in this paper. An earlier paper demonstrates that when there is sample selection bias model averaging based methods can significantly outperform bagging and boosting.

Traditionally, cross-validation has been widely used in inductive learning to choose the best model. However, the implicit assumption is that the training and test data follow the same distribution. When this assumption is violated, cross-validation could choose an insignificant model. This paper discusses this problem as well as provides a solution called "ReverseTesting".

One earlier paper clarifies the concept that "feature bias" doesn't affect the learning accuracy of some learners. The long version including experimental studies can be found here.

## Ensembles

One of my research interests is on ensemble of classifiers or multiple models. My Ph.D. thesis is titled "Cost-sensitive, Scalable and Adaptive Learning using Ensemble-based Methods, supervised by Professor Salvatore J. Stolfo.

I am particularly interested in very simple ways to construct multiple models. One family of methods that other researchers and I have been working on is Random Decision Trees. You can find its description, source code and datasets here.

The advantage of model averaging based ensemble to have the lowest expected error for streaming environment has been discussed in this ICDM'07 paper, to overcome sample selection bias has been discussed in this ECML'06 paper, how to reliably guess the perform on a large dataset before fully constructing a model has been discussed in this ICDM'02 paper, how to reduce the operational cost and increase throughput by building and scheduling a "pipeline" of classifiers has been discussed in a ECML'00 paper for intrusion detection, and in a AAAI'02 paper for cost-sensitive learning.

Its application to mine skewed data streams can be found in this SDM'07 paper, to mine extremely skewed trading anomalies can be found in this EDBT'04 paper, to mine customer default can be found in this SDM'05 paper, to construct an adaptive intrusion detection system can be found in this SDM'02 paper, to reduce expected operational cost for building intrusion detection systems has been discussed in this ECML'00 paper as well as this Journal of Computer Security paper, its usage to build a cost-effective (minimizing both misclassification and operational cost) can be found in this DISCEX'01 paper.

The use of ensemble approach for cost-sensitive learning that estimate the risk can be found in this SDM'02 paper, its distributed adaptation appears in the ICDCS'02 paper (the longer version). A reliable method to estimate posterior probability (important to estimate risk) has been discussed in this AAAI'04 paper, ICDM'05 paper and ICDM'06 paper's extended version. A boosting based approach to reduce per-example misclassification cost has been discussed in this ICML'99 paper. Applications on cost-sensitive credit card fraud detection has been discussed in IEEE Intelligent System paper.

We have two published papers using ensemble for anomaly detection. The first generates "artificial anomalies" based on collected normal data, and the KAIS journal version can be found here. Later, we proposed a method that detects anomaly based on the correlation among multiple features in the data stream, and the paper appeared in ICDCS'03.

## Intrusion Detection

Since labeled data is difficult to find in real-world intrusion detection systems, anomaly detection is an important alternative. I am interested in general purpose anomaly detection algorithms. One method is to synthetically generate "artificial anomalies" from normal data, and the KAIS journal paper can be found here. Another method we proposed is to mine the correlation among feature values to predict an anomaly score based on normal data, and the extended ICDCS'03 paper can be found here

Intrusion detection systems ought to be cost-effective. The cost-effectiveness can be measured from two perspectives. First, various types of intrusions incur significantly different damage to the system, and in the same time, different features has different levels of operational cost to collect. Secondly, during model deployment in real-time, inductive models takes time to execute. Ideally, one would like a pipeline of models that first cheaply filter out those easy cases than use more costly models to concentrate on those more difficult cases. We have several papers discussing this issue and its implementation in a data mining-based intrusion detection prototype. The paper that addresses this issue as a machine learning problem appears in ECML'00 and can be found here.

In system aspects, the following papers describes our work in building a data mining-based intrusion detection system prototype, DISCEX'01 paper , DISCEX'00 paper, and Journal of Computer Security paper.

Much of the data mining material is organized in my Ph.D. thesis "Cost-sensitive, Scalable and Adaptive Learning using Ensemble-based Methods"

## Cost-sensitive Learning

One of my earlier interests is to adapt boosting techniques to focus on examples with higher misclassification cost and this appears in this ICML'99 paper. Then I was interested in minimizing both operational cost to run a system and misclassification cost. A good example is a data-mining based intrusion detection system (IDS) where the target is to catch as many malicious attacks as possible and in the same minimize the overhead of prediction and maximize system throughput. This appears in the following ECML'00 paper.

My recent interests on cost-sensitive learning focus on reliably estimating the risk or P(class label|x) * misclassification. Then the task of learning and decision making is to minimize the expected risk over a distribution. The main advantage for this framework is that it is straightforward and robust. One earlier work that uses ensemble for cost-sensitive learning appears in SDM'02 can be found here. In particular, the results can "make more money" on the donation dataset of KDD'98 winner. Since the framework is naturally parallel and distributed, in a ICDCS'02 paper, we describe a distributed implementation. Two applications on cost-sensitive learning appear in an EDBT'04 Industry paper for extremely skewed trading anomaly detection as well as in a SDM'05 paper to predict customers who may default their monthly payment.

## Frequent Pattern based Feature Construction

Most inductive algorithms can only model from well-structured feature vectors, but in reality, raw data from many applications do not have any feature vectors available. Repeated patterns in raw data can provide solutions, and some examples include frequent graphs, itemsets and sequential patterns. It is well understood that frequent pattern mining is non-trivial since the number of unique patterns is exponential but many are non-discriminative and correlated.

Currently, frequent pattern mining is performed in batch mode of two sequential steps: enumerating a set of frequent patterns as candidate features, followed by feature selection. Although many methods have been proposed in the past few years on how to perform each separate step efficiently, there is still limited success in eventually finding those highly compact and discriminative patterns. The culprit is due to the inherent nature of this widely adopted batch approach.

We proposes a new and different approach to mine frequent patterns as discriminative features. It builds a decision tree that sorts or partitions the data onto nodes of tree. Then at each node, it directly discovers a discriminative pattern to further divide its examples into purer subsets, that previously chosen patterns during the same run cannot separate. Since the number of examples towards leaf level is relatively small, the new approach is able to examine patterns with extremely low global support that could not be enumerated on the whole dataset by the batch method.

The discovered feature vectors are more accurate on some of the most difficult graph and frequent itemset problems than most recently proposed algorithms but the total size is typically 50% or more smaller. Importantly, the minimum support of some discriminative patterns can be extremely low (e.g. 0.03%). In order to enumerate these low support patterns, state-of-the-art frequent pattern algorithm either cannot finish due to huge memory consumption or have to enumerate 10^1 to 10^3 times more patterns before they can even be found.

Details on this work can be found in this paper., and the code and dataset is here.

## Transfer Learning

Transfer learning attempts to use labelled examples from one domain ("out-domain") to construct accurate models to predict on examples from another domain ("in-domain"). The main challenge is that in-domain and out-domain data explicitly do not follow the same distribution, and importantly there is a large number of labeled examples from out-domain and none or very few examples from in-domain. This is quite different from traditional inductive learning formulation where training and testing data are assumed to follow the same or very similar distributions.

### I.

In one approach, we combine models previously trained from different domains via locally weighted framework to transfer the knowledge into a new domain. One important difference is that the out-of-domain models in our approach did not use any in-domain data at all during their training. In other words, our proposed approach simply "adopt" them into the in-domain.

In this example, both training set 1 and set 2 have "conflicting" regions from the in-domain data. Thus the models trained cannot be adopted directly by simple combination or averaging. However, using the proposed approach, as shown in the bottom right plot below, we can successfully transfer the right knowledge from several different domains. The paper by Jing Gao, Wei Fan, Jing Jiang and Jiawei Han can be found here. The code and dataset written by Jing Gao can be found here.

### II

In a separate approach, we propose a framework (as shown below) to actively transfer examples from out-domain into in-domain. We formally analyze how the error descreases as more in-domain examples are actively queried and how out-of-domain data could help reduce the cost of ask domain experts to provide labeled in-domain data. The paper by Xiaoxiao Shi, Wei Fan, and Jiangtao Ren can be found here, and the software and dataset (synthetic and landmine) written and prepared by Xiaoxiao Shi can be found here, and the 20 Newsgroup data can be found here.

### III

In a most recently proposed approach, we address the problem of transfer learning between datasets of high dimensions. Transferring knowledge from one domain to another is challenging due to a number of reasons. Since both conditional and marginal distribution of the training data and test data are non-identical, model trained in one domain, when directly applied to a different domain, is usually low in accuracy. For many applications with large feature sets, such as text document, sequence data, medical data, image data of different resolutions, etc. two domains usually do not contain exactly the same features, thus introducing large numbers of "missing values" when considered over the union of features from both domains. In other words, its marginal distributions are at most overlapping. In the same time, these problems are usually high dimensional, such as, several thousands of features. Thus, the combination of high dimensionality and missing values make the relationship in conditional probabilities between two domains hard to measure and model. To address these challenges, we propose a framework that first brings the marginal distributions of two domains closer by "filling up" those missing values of disjoint features. Afterwards, it looks for those comparable sub-structures in the "latent-space" as mapped from the expanded feature vector, where both marginal and conditional distribution are similar. With these sub-structures in latent space, the proposed approach then find common concepts that are transferable across domains with high probability. During prediction, unlabeled instances are treated as "queries", the mostly related labeled instances from outdomain are retrieved, and the classification is made by weighted voting using retrieved out-domain examples. We formally show that importing feature values across domains and latentsemantic index can jointly make the distributions of two related domains easier to measure than in original feature space, the nearest neighbor method employed to retrieve related out domain examples is bounded in error when predicting in-domain examples.

The paper by Sihong Xie, Wei Fan, Jing Peng, Olivier Verscheure, and Jiangtao Ren can be found here, the Powerpoint presentation can be founed here, and the code and dataset can be downloaded from here.

## Feature Selection

We have proposed a graph-based hybrid feature selection method that combines the advantage of both supervised and semi-supervised feature selection.

When the number of labeled examples is limited, traditional supervised feature selection techniques often fail due to sample selection bias or unrepresentative sample problem. To solve this, semi-supervised feature selection techniques exploit the statistical information of both labeled and unlabeled examples in the same time. However, the results of semi-supervised feature selection can be at times unsatisfactory, and the culprit is on how to effectively use the unlabeled data. Quite different from both supervised and semi-supervised feature selection, we propose a "hybrid" framework based on graph models. We first apply supervised methods to select a small set of most critical features from the labeled data. Importantly, these initial features might otherwise be missed when selection is performed on the labeled and unlabeled examples simultaneously. Next, this initial feature set is expanded and corrected with the use of unlabeled data. We formally analyze why the expected performance of the hybrid framework is better than both supervised and semi-supervised feature selection. Experimental results demonstrate that the proposed method outperforms both traditional supervised and state-of-the-art semisupervised feature selection algorithms by at least 10% in accuracy on a number of text and biomedical problems with thousands of features to choose from.

The paper by Erheng Zhong, Sihong Xie, Wei Fan, Jiangtao Ren, Jing Peng and Kun Zhang can be found in here, the PowerPoint presentation can be found here and the code and dataset is available from here.