Data Mining and Knowledge Discovery in Databases: So What's New?

Usama Fayyad
Microsoft Research

Abstract

Knowledge Discovery in Databases (KDD) and Data Mining are concerned with the extraction of high-level information (knowledge) from low-level data (usually stored in large databases). I will give an overview of this rapidly growing area, define its goals, present the motivation, and give a high-level definition of the KDD process and how it relates to Data Mining. I will then focus on data mining methods, which have their origins in statistics, pattern recognition, learning, visualization, databases, and parallel computing. I will cover a sampling of these methods at a basic level to give a feel for what they are about and how they are used. I will outline the research challenges and opportunities posed by the problem of extracting models from massive data sets (i.e., much larger than main memory). Operating under such scalability constraints poses interesting problems for how models can be built and which methods are practical. I will use an application in astronomy, done at JPL/Caltech, to motivate the need for dealing with large databases, to illustrate problems of classification and clustering with very large data sets, and to show how these techniques can offer powerful novel solutions to significant problems.



Luis Gravano
gravano@cs.columbia.edu