Data Mining and Knowledge Discovery in Databases: So What's New?
Abstract
Knowledge Discovery in Databases (KDD) and Data Mining are concerned with
the extraction of high-level information (knowledge) from low-level data
(usually stored in large databases). I give an overview of this rapidly
growing area, define the goals, present motivation, and give a high-level
definition of the KDD Process and how it relates to Data Mining. I then
focus on data mining methods. These methods have their origins in statistics,
pattern recognition, learning, visualization, databases, and parallel computing.
I will cover a sample of these methods at a basic level to give a feel
for what they are about and how they are used. I will outline the
research challenges and opportunities posed by the problem of extracting
models from massive data sets (i.e., much larger than main memory). Operating
under such scalability constraints poses interesting problems for how models
can be built and what methods are practical. I will use an application
in astronomy, done at JPL/Caltech, to motivate the need for dealing with
large databases, to illustrate problems of classification and clustering
with very large data sets, and to illustrate how these techniques can offer
powerful novel solutions to significant problems.
Luis Gravano
gravano@cs.columbia.edu