Tutorials Given

  1. Sample Selection Bias - Covariate Shift: Problems, Solutions, and Applications
  2. Data Stream Mining: Challenges and Techniques
  3. On the Power of Ensemble: Supervised and Unsupervised Methods Reconciled




Sample Selection Bias - Covariate Shift: Problems, Solutions, and Applications

by Wei Fan and Masashi Sugiyama, given in ICDM'08, Pisa, Italy, December 2008

Sample selection bias/covariate shift is a common problem encountered when applying data mining algorithms to real-world applications. Traditionally, it is assumed that training and test data are sampled from the same probability distribution, the so-called "stationary or non-biased distribution assumption." However, this assumption is often violated in practice. Typical examples include marketing solicitation, fraud detection, drug testing, loan approval, school enrollment, and medical diagnosis. For these applications, the only labeled data available for training is a biased representation, in various ways, of the future data on which the inductive model will predict. Intuitively, some examples sampled frequently into the training data may actually be infrequent in the testing data, and vice versa. When this happens, an inductive model constructed from a biased training set may not be as accurate on unbiased testing data as it would have been had there been no selection bias in the training data. For example, there has been speculation that the most recent US subprime mortgage problem is due to sample selection bias, in that the defaulting customers do not follow the same risk model as traditional mortgage customers. In this tutorial, we will employ various examples to describe the problem, describe various solutions, and end the tutorial with a systematic approach to addressing a real-world problem.
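To make the mismatch between training and testing data concrete, here is a minimal Python sketch. It uses importance weighting, one standard correction for covariate shift, which is not necessarily the solution covered in the tutorial. The customer segments "A" and "B", the toy labels, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def importance_weights(train_x, test_x):
    """Estimate w(x) = p_test(x) / p_train(x) for a discrete feature by counting."""
    train_counts = Counter(train_x)
    test_counts = Counter(test_x)
    n_train, n_test = len(train_x), len(test_x)
    return {
        x: (test_counts[x] / n_test) / (train_counts[x] / n_train)
        for x in train_counts
    }


def weighted_mean(values, weights):
    """Weighted average of values, normalized by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy data: x is a customer segment, y = 1 means default.
# Segment "A" dominates the training sample but is rare in the test data.
train_x = ["A"] * 8 + ["B"] * 2
train_y = [0, 0, 0, 0, 0, 0, 0, 1] + [1, 1]
test_x = ["A"] * 2 + ["B"] * 8

w = importance_weights(train_x, test_x)

# Default rate estimated naively from the biased training sample.
naive_rate = sum(train_y) / len(train_y)

# Default rate after reweighting training examples toward the test distribution.
corrected_rate = weighted_mean(train_y, [w[x] for x in train_x])

print(naive_rate, corrected_rate)
```

In this toy setting the naive estimate is dominated by the over-sampled segment, while the reweighted estimate reflects the segment mix of the testing data. The sketch assumes the label distribution within each segment is the same in both samples, which is exactly the covariate shift assumption.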

The PowerPoint slides can be found here.

Data Stream Mining: Challenges and Techniques

by Latifur Khan, Wei Fan, Jiawei Han, Jing Gao and Mohammad M. Masud, given in PAKDD'11, Shenzhen, China, May 2011

Data streams are continuous flows of data. Examples of data streams include network traffic, sensor data, and call center records. Their sheer volume and speed make mining them a great challenge for the data mining community. Data streams demonstrate several unique properties: infinite length, concept-drift, concept-evolution, feature-evolution and limited labeled data. Concept-drift occurs in data streams when the underlying concept of the data changes over time. Concept-evolution occurs when new classes evolve in streams. Feature-evolution occurs when the feature set varies with time in data streams. Data streams also suffer from scarcity of labeled data, since it is not possible to manually label all the data points in the stream. Each of these properties adds a challenge to data stream mining. This tutorial presents an organized picture of how to apply various data mining techniques to data streams: in particular, how to handle classification and clustering in evolving data streams by addressing these challenges.
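As a small illustration of concept-drift, the Python sketch below monitors the error rate of a fixed classifier over consecutive windows of a stream. It is only an assumed toy setup, not a technique taken from the tutorial: the stream, the window size of 10, and the 0.3 error threshold are hypothetical.

```python
def windowed_error_rates(stream, predict, window):
    """Split a stream of (x, y) pairs into consecutive windows and
    return the error rate of `predict` on each full window."""
    rates = []
    for start in range(0, len(stream) - window + 1, window):
        chunk = stream[start:start + window]
        errors = sum(predict(x) != y for x, y in chunk)
        rates.append(errors / window)
    return rates


# Hypothetical stream: the concept is y = (x > 5) for the first 20 points,
# then flips to y = (x <= 5), i.e. the underlying concept changes over time.
xs = [i % 10 for i in range(40)]
stream = [(x, int(x > 5)) for x in xs[:20]] + [(x, int(x <= 5)) for x in xs[20:]]


def predict(x):
    """A model fitted to the first concept only."""
    return int(x > 5)


rates = windowed_error_rates(stream, predict, window=10)

# Flag windows whose error rate exceeds an assumed threshold of 0.3.
drift_windows = [i for i, r in enumerate(rates) if r > 0.3]

print(rates, drift_windows)
```

The model is perfect on the early windows and fails on the later ones, which is the signature of concept-drift. Note that this check needs true labels for every point, so it does not address the limited labeled data property described above.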

More information can be found here.

On the Power of Ensemble: Supervised and Unsupervised Methods Reconciled

by Jing Gao, Wei Fan and Jiawei Han, given in SDM'10, Columbus, OH, May 2010

Ensemble methods have emerged as a powerful approach for improving the robustness as well as the accuracy of both supervised and unsupervised solutions. Moreover, as enormous amounts of data are continuously generated from different views, it is important to consolidate different concepts for intelligent decision making. In the past decade, there have been numerous studies on the problem of combining competing models into a committee, and the success of ensemble techniques has been observed in multiple disciplines, including recommendation systems, anomaly detection, stream mining, and web applications.

Ensemble techniques have mostly been studied separately in the supervised and unsupervised learning communities. However, they share the same basic principle, i.e., combining diversified base models strengthens weak models. Also, when both supervised and unsupervised models are available for a single task, merging all of the results leads to better performance. Therefore, there is a need for a systematic introduction and comparison of the ensemble techniques, combining the views of both supervised and unsupervised learning ensembles.

In this tutorial, we will present an organized picture of ensemble methods with a focus on the mechanism used to merge the results. We start with the description and applications of ensemble methods. Through reviews of well-known and state-of-the-art ensemble methods, we show that supervised learning ensembles usually "learn" this mechanism based on the available labels in the training data, whereas unsupervised ensembles simply combine multiple clustering solutions based on "consensus". We end the tutorial with a systematic approach to combining both supervised and unsupervised models, and several applications of ensemble methods.
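The contrast between a "learned" and a "consensus" merging mechanism can be sketched in Python as follows. This is an illustrative simplification, not the method presented in the tutorial: the supervised side uses accuracy-weighted voting, the unsupervised side uses a co-association matrix (the fraction of clusterings that place two points together), and all data and function names are hypothetical.

```python
from collections import Counter


def accuracy_weights(predictions, labels):
    """Supervised: weight each model by its accuracy on the labeled data."""
    return [
        sum(p == y for p, y in zip(preds, labels)) / len(labels)
        for preds in predictions
    ]


def weighted_vote(predictions, weights):
    """Combine class predictions, letting each model vote with its weight."""
    combined = []
    for votes in zip(*predictions):
        scores = Counter()
        for vote, weight in zip(votes, weights):
            scores[vote] += weight
        combined.append(scores.most_common(1)[0][0])
    return combined


def co_association(clusterings):
    """Unsupervised: entry [i][j] is the fraction of clusterings that
    assign points i and j to the same cluster (no labels are needed)."""
    n, m = len(clusterings[0]), len(clusterings)
    return [
        [sum(c[i] == c[j] for c in clusterings) / m for j in range(n)]
        for i in range(n)
    ]


# Hypothetical labeled data and three classifiers of varying quality.
labels = [1, 0, 1, 1]
predictions = [[1, 0, 1, 1], [0, 0, 1, 0], [0, 1, 1, 0]]
weights = accuracy_weights(predictions, labels)
merged = weighted_vote(predictions, weights)

# Hypothetical clusterings of four points; cluster ids are arbitrary per model.
clusterings = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
co = co_association(clusterings)

print(weights, merged)
print(co)
```

The weights are evaluated here on the same data they were computed from, which is only acceptable in a toy example. The co-association matrix sidesteps the fact that cluster ids from different clusterings are not aligned, since it compares pairs of points and never the ids themselves.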

More information can be found here.