Fast Decision Support Queries in Main-Memory: Algorithms, Optimization and Implementation

Kenneth A. Ross
Department of Computer Science
Columbia University

Contact Information

1214 Amsterdam Avenue, Mail Code 0401
New York, NY 10027
Phone: (212) 939-7058
Fax : (212) 666-0140
Email: kar@cs.columbia.edu

WWW PAGES

Kenneth Ross: http://www.cs.columbia.edu/~kar
The Columbia Fast Query Project: http://www.cs.columbia.edu/~kar/fastqueryproj.html

Project Award Information

Keywords

Query processing, main memory databases, cache performance

Project Summary

An organization may place its data into a centralized database from which complex decision-support queries can be made. These queries typically need to process and combine large amounts of data very efficiently. Answers to these queries can be very valuable in the application domain, enabling informed changes to organizational behavior. New techniques for evaluating and optimizing general decision-support queries are being developed in the context of a main-memory database. It is becoming possible to store gigabytes of data in RAM. Thus one begins to ask whether decision support query processing could obtain the performance benefits of having data reside in main memory. The approach taken is to study the performance issues associated with processing complex queries in main memory. By designing and using algorithms, access structures, and data representations tailored to main-memory databases, we can avoid much of the overhead that is apparent in a disk-based system even when the data set fits in main-memory. By allowing updates only in an off-line batch window, one gains the flexibility to organize the data in a way that is tuned for fast query response. Prototype implementations demonstrate the benefits of the proposed techniques, measured in terms of query response times. The novel techniques developed in the course of this project will have broad use in applications that process complex queries over datasets that can fit in main memory.

Publications and Products

In addition to conference and journal publications, we have made software available over the Internet.  In particular, we have made available two packages.  The first package implements the Memory-Cube algorithm for computing datacubes.  The second package implements the CSS-Tree and CSB+-Tree data structures for fast in-memory indexing.  These packages can be found from the project URL  http://www.cs.columbia.edu/~kar/fastqueryproj.html.

Project Impact

Goals, Objectives, and Targeted Activities

Our project focuses on data sets that fit in main memory.  With recent growth in the amount of RAM one can buy for a fixed amount of money, gigabyte main memories are now common for server machines.  In the near future, memories in the tens or even hundreds of gigabytes will be affordable.  While there will always be data sets too big for main memory there are likely to be more and more applications whose data requirements match main memory sizes. Main memory is appropriate for: This project is distinguished from traditional work in main-memory databases by its focus on decision-support applications, and the complex bulk queries that occur in that domain.  Our goal is to provide very fast response to complex queries, to enable interactive exploratory queries, and to generate high query throughput.  The software developed could be used by an organization with data to disseminate (say on the Web), in order to enhance their service by providing fast query processing capabilities at relatively low cost. We believe that there are inherent reasons why the performance of a disk-based system is likely to be substantially worse than a main-memory system.  These performance differences can make the difference between interactivite and batch query modes.  We expect to be able to demonstrate significant performance improvements for a variety of query processing tasks, and to improve the response time of a large class of queries by 10-50%.

Recent progress includes a VLDB 1999 paper showing how to index decision-support databases in main memory so that the computer's cache line is efficiently utilized.  In particular, many keys are packed into each cache line, and pointers are eliminated entirely.  Performance is substantially better than conventional indexes.

In a follow-up SIGMOD 2000 paper, we show how to extend these techniques to other kinds of databases that allow on-line updates.  The trick is to use a mix of pointers and array arithmetic to locate child nodes.  Under realistic conditions, our rechniques outperform B+-trees for both lookups and updates.

In two SSDBM 2000 papers, we describe two datacube-related innovations.  One is a method for serving datacube records from main-memory.  The second shows how to reduce datacube computation if one is interested in placing conditions on the aggregate values output.

In a SIGMOD 2001 paper, we desribe techniques for implementing a publish-subscribe system that can handle large numbers of events per second, even with millions of active subscriptions.

Project References

See "Publications and Products" above.

Area Background

Decision-support databases aim to provide answers to complex queries over very large datasets.  In a commercial setting, for example, these answers can have high value in that they enable the organization to optimize its behavior based on current performance.  Query optimization and evaluation algorithms enable faster execution plans to be used.  Main memory databases are used in order to achieve very fast response times (in comparison with disk-based systems that can be significantly slower).

Area References

"Database Management Systems" by Raghu Ramakrishnan.  McGraw Hill.
"Principles of Database and Knowledge-Base Systems" by Jeffrey Ullman.  Computer Science Press.

Potential Related Projects

I would be interested in finding out from others about the kinds of complex queries (on small or large datasets) that they have encountered during their work.  The longer the query the better!