Fast Decision Support Queries in Main-Memory: Algorithms, Optimization and Implementation

Kenneth A. Ross
Department of Computer Science
Columbia University

Contact Information

1214 Amsterdam Avenue, Mail Code 0401
New York, NY 10027
Phone: (212) 939-7058
Fax : (212) 666-0140
Email: kar@cs.columbia.edu

WWW PAGES

Kenneth Ross: http://www.cs.columbia.edu/~kar
The Columbia Fast Query Project: http://www.cs.columbia.edu/~kar/fastqueryproj.html

Project Award Information

Award Number: IIS-9812014
Duration: 9/15/98-8/31/01; 3rd year.
Title: Fast Decision Support Queries in Main-Memory: Algorithms, Optimization and Implementation

Keywords

Query processing, main memory databases, cache performance

Project Summary

An organization may place its data into a centralized database from which complex decision-support queries can be made. These queries typically need to process and combine large amounts of data very efficiently. Answers to these queries can be very valuable in the application domain, enabling informed changes to organizational behavior. New techniques for evaluating and optimizing general decision-support queries are being developed in the context of a main-memory database. It is becoming possible to store gigabytes of data in RAM. Thus one begins to ask whether decision support query processing could obtain the performance benefits of having data reside in main memory. The approach taken is to study the performance issues associated with processing complex queries in main memory. By designing and using algorithms, access structures, and data representations tailored to main-memory databases, we can avoid much of the overhead that is apparent in a disk-based system even when the data set fits in main-memory. By allowing updates only in an off-line batch window, one gains the flexibility to organize the data in a way that is tuned for fast query response. Prototype implementations demonstrate the benefits of the proposed techniques, measured in terms of query response times. The novel techniques developed in the course of this project will have broad use in applications that process complex queries over datasets that can fit in main memory.

Publications and Products

"Filtering Algorithms and Implementation for Very Fast Publish/Subscribe," F. Fabret, H. A. Jacobsen, F. Llirbat, J. Pereira, K. A. Ross, D. Shasha, Proceedings of the 2001 SIGMOD Conference, May 2001.
``Adapting Materialized Views After Redefinitions: Techniques and a Performance Study,'' A. Gupta, I. S. Mumick, J. Rao, K. A. Ross, Information Systems, Special issue on Data Warehousing (to appear, 2001).
``Independence Diagrams: A Technique for Data Visualization,'' S. Berchtold, H. V. Jagadish, K. A. Ross, Journal of Electronic Imaging, Volume 9, Issue 4, October 2000, pages 375--384.
"Publish/Subscribe on the Web at Extreme Speed," J. Pereira, F. Fabret, F. Llirbat, R. Preotiuc-Pietro, K. A. Ross, D. Shasha, Proceedings of the 2000 VLDB Conference (demo), September 2000.
``Optimizing Selections over Datacubes'' K. A. Ross and K. A. Zaman, Proceedings of the 2000 SSDBM Conference, July 2000.
``Serving Datacube Tuples from Main Memory'' K. A. Ross and K. A. Zaman, Proceedings of the 2000 SSDBM Conference, July 2000.
``Making B+-Trees Cache Conscious in Main Memory,'' J. Rao and K. A. Ross, Proceedings of the ACM SIGMOD Conference, May 2000.
``Cache Conscious Indexing for Decision-Support in Main Memory'' J. Rao and K. A. Ross, Proceedings of the 1999 VLDB Conference, September 1999.

In addition to conference and journal publications, we have made software available over the Internet. In particular, we have made available two packages. The first package implements the Memory-Cube algorithm for computing datacubes. The second package implements the CSS-Tree and CSB⁺-Tree data structures for fast in-memory indexing. These packages can be found from the project URL http://www.cs.columbia.edu/~kar/fastqueryproj.html.

Project Impact

Patent applications have been filed on the indexing techniques.

Goals, Objectives, and Targeted Activities

Our project focuses on data sets that fit in main memory. With recent growth in the amount of RAM one can buy for a fixed amount of money, gigabyte main memories are now common for server machines. In the near future, memories in the tens or even hundreds of gigabytes will be affordable. While there will always be data sets too big for main memory there are likely to be more and more applications whose data requirements match main memory sizes. Main memory is appropriate for:

Applications requiring fast response times, where answers lose value if they arrive late.
Applications having a core ``hot'' dataset that is accessed often and that fits in memory, even when the entire database may be larger than the available RAM.
Exploratory applications such as decision support, in order to make the data analysis process interactive.
A disk-resident data-set might be summarized by aggregating detailed data, with the summary stored in a main-memory database.
A disk-resident data-set might be compressed by approximating detailed data, with the approximation stored in a main-memory database.

This project is distinguished from traditional work in main-memory databases by its focus on decision-support applications, and the complex bulk queries that occur in that domain. Our goal is to provide very fast response to complex queries, to enable interactive exploratory queries, and to generate high query throughput. The software developed could be used by an organization with data to disseminate (say on the Web), in order to enhance their service by providing fast query processing capabilities at relatively low cost. We believe that there are inherent reasons why the performance of a disk-based system is likely to be substantially worse than a main-memory system. These performance differences can make the difference between interactivite and batch query modes. We expect to be able to demonstrate significant performance improvements for a variety of query processing tasks, and to improve the response time of a large class of queries by 10-50%.

Recent progress includes a VLDB 1999 paper showing how to index decision-support databases in main memory so that the computer's cache line is efficiently utilized. In particular, many keys are packed into each cache line, and pointers are eliminated entirely. Performance is substantially better than conventional indexes.

In a follow-up SIGMOD 2000 paper, we show how to extend these techniques to other kinds of databases that allow on-line updates. The trick is to use a mix of pointers and array arithmetic to locate child nodes. Under realistic conditions, our rechniques outperform B+-trees for both lookups and updates.

In two SSDBM 2000 papers, we describe two datacube-related innovations. One is a method for serving datacube records from main-memory. The second shows how to reduce datacube computation if one is interested in placing conditions on the aggregate values output.

In a SIGMOD 2001 paper, we desribe techniques for implementing a publish-subscribe system that can handle large numbers of events per second, even with millions of active subscriptions.

Project References

See "Publications and Products" above.

Area Background

Decision-support databases aim to provide answers to complex queries over very large datasets. In a commercial setting, for example, these answers can have high value in that they enable the organization to optimize its behavior based on current performance. Query optimization and evaluation algorithms enable faster execution plans to be used. Main memory databases are used in order to achieve very fast response times (in comparison with disk-based systems that can be significantly slower).

Area References

"Database Management Systems" by Raghu Ramakrishnan. McGraw Hill.
"Principles of Database and Knowledge-Base Systems" by Jeffrey Ullman. Computer Science Press.

Potential Related Projects

I would be interested in finding out from others about the kinds of complex queries (on small or large datasets) that they have encountered during their work. The longer the query the better!