Indexing and Mining Multimedia Databases

Flip Korn
Computer Science Department
University of Maryland, College Park

Abstract

This talk focuses on similarity searching and data mining in large multimedia databases. Similarity search is the retrieval of multimedia objects (e.g., images, time series) that are most "similar" to a query object. Example queries are 'Find images from a given collection of X-rays that contain a nodule similar to the given tumor shape' and 'Find all stocks with movement similar to that of IBM.' Using concepts from mathematical morphology and state-of-the-art indexing tools, we developed a system that efficiently searches for similar tumor shapes while guaranteeing correct output (i.e., no false dismissals). The system is 27 times faster than sequential scanning and exhibits excellent precision (80%) at perfect recall (100%).
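The "no false dismissals" guarantee is commonly obtained with a filter-and-refine search: a cheap distance on reduced features that never exceeds the true distance prunes the collection, and only the surviving candidates are compared in full. The sketch below illustrates that general idea on short numeric sequences. It is not the tumor-shape system from the talk: the feature extractor `reduce_features`, the Euclidean distance, and the linear filter (standing in for an index) are all assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reduce_features(obj, k):
    """Hypothetical feature extractor: keep only the first k coordinates.

    Dropping coordinates can only shrink a Euclidean distance, so the
    distance between reduced objects lower-bounds the true distance.
    """
    return obj[:k]


def range_query(collection, query, radius, k=2):
    """Return (candidates, answers) for a range query.

    Filter step: keep objects whose lower-bounding distance is within radius.
    Refine step: verify the candidates with the full distance.
    Because the filter distance never exceeds the true distance, every true
    answer survives the filter (no false dismissals).
    """
    q_feat = reduce_features(query, k)
    candidates = [
        obj for obj in collection
        if euclidean(reduce_features(obj, k), q_feat) <= radius
    ]
    answers = [obj for obj in candidates if euclidean(obj, query) <= radius]
    return candidates, answers


def sequential_scan(collection, query, radius):
    """Baseline: compare the query against every object in full."""
    return [obj for obj in collection if euclidean(obj, query) <= radius]
```

In this setting, recall is 100% by construction, and precision is the fraction of candidates that pass the refine step. A real system would replace the linear filter with a spatial index over the reduced features.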

The second part of the talk examines data mining. The goal is to support ad hoc queries on large data matrices that might not fit on disk. Such a matrix could have, e.g., customers for rows and days of the year for columns, with each cell value representing the amount spent on products. The target queries are single-cell queries ('Find the amount spent by Smith on 1/1/96') and aggregate queries ('Find the sales of customers from New York on December 1st'). We propose a compression format that permits random access and thus efficiently supports ad hoc queries. To this end, we developed SVDD, a novel lossy compression method for very large data matrices. In experiments on real data (e.g., AT&T customer sales), it reduces the matrix to 2% of the original space (i.e., a 50:1 compression ratio) with 0.5% reconstruction error.
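The abstract does not spell out how SVDD works. The sketch below assumes it combines a truncated singular value decomposition with explicitly stored corrections ("deltas") for the worst-reconstructed cells, and shows how such a format can answer single-cell and aggregate queries without decompressing the whole matrix. The function names, the power-iteration routine, and the choice of which cells receive deltas are illustrative assumptions, not the authors' implementation.

```python
import math


def top_singular_triplet(A, iters=200):
    """Approximate the largest singular triplet (sigma, u, v) by power iteration."""
    n_rows, n_cols = len(A), len(A[0])
    # Deterministic, non-uniform starting vector (an arbitrary choice).
    v = [1.0 / (j + 1) for j in range(n_cols)]
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(A[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return sigma, u, v


def compress(A, k, n_deltas):
    """Keep k rank-one factors plus deltas for the n_deltas worst cells."""
    residual = [list(map(float, row)) for row in A]
    factors = []
    for _ in range(k):
        sigma, u, v = top_singular_triplet(residual)
        factors.append((sigma, u, v))
        # Deflate: subtract the rank-one component just extracted.
        for i, row in enumerate(residual):
            for j in range(len(row)):
                row[j] -= sigma * u[i] * v[j]
    cells = [(i, j) for i in range(len(A)) for j in range(len(A[0]))]
    worst = sorted(cells, key=lambda c: abs(residual[c[0]][c[1]]), reverse=True)
    deltas = {(i, j): residual[i][j] for i, j in worst[:n_deltas]}
    return factors, deltas


def cell(factors, deltas, i, j):
    """Single-cell query: O(k) work, independent of the matrix size."""
    approx = sum(sigma * u[i] * v[j] for sigma, u, v in factors)
    return approx + deltas.get((i, j), 0.0)


def aggregate(factors, deltas, rows, j):
    """Aggregate query: total of column j over a chosen set of rows."""
    return sum(cell(factors, deltas, i, j) for i in rows)
```

With customers as rows and days as columns, `cell(factors, deltas, i, j)` plays the role of 'Find the amount spent by Smith on 1/1/96', and `aggregate` sums one day's column over the rows of a selected customer group. Storage is roughly k values per row and per column plus the delta table, which is where the space saving would come from when k is much smaller than the matrix dimensions.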



Luis Gravano
gravano@cs.columbia.edu