Fast Decision Support Queries in Main-Memory: Algorithms,
Optimization and Implementation
Kenneth A. Ross
Department of Computer Science
Columbia University
Contact Information
1214 Amsterdam Avenue, Mail Code 0401
New York, NY 10027
Phone: (212) 939-7058
Fax : (212) 666-0140
Email: kar@cs.columbia.edu
WWW PAGES
Kenneth Ross: http://www.cs.columbia.edu/~kar
The Columbia Fast Query Project: http://www.cs.columbia.edu/~kar/fastqueryproj.html
Project Award Information
-
Award Number: IIS-9812014
-
Duration: 9/15/98-8/31/01; 3rd year.
-
Title: Fast Decision Support Queries in Main-Memory: Algorithms, Optimization
and Implementation
Keywords
Query processing, main memory databases, cache performance
Project Summary
An organization may place its data into a centralized database from which
complex decision-support queries can be made. These queries typically need
to process and combine large amounts of data very efficiently. Answers
to these queries can be very valuable in the application domain, enabling
informed changes to organizational behavior. New techniques for evaluating
and optimizing general decision-support queries are being developed in
the context of a main-memory database. It is becoming possible to store
gigabytes of data in RAM. Thus one begins to ask whether decision support
query processing could obtain the performance benefits of having data reside
in main memory. The approach taken is to study the performance issues associated
with processing complex queries in main memory. By designing and using
algorithms, access structures, and data representations tailored to main-memory
databases, we can avoid much of the overhead that is apparent in a disk-based
system even when the data set fits in main-memory. By allowing updates
only in an off-line batch window, one gains the flexibility to organize
the data in a way that is tuned for fast query response. Prototype implementations
demonstrate the benefits of the proposed techniques, measured in terms
of query response times. The novel techniques developed in the course of
this project will have broad use in applications that process complex queries
over datasets that can fit in main memory.
Publications and Products
-
"Filtering Algorithms and Implementation for Very Fast Publish/Subscribe,"
F. Fabret, H. A. Jacobsen, F. Llirbat, J. Pereira, K. A. Ross, D.
Shasha, Proceedings of the 2001 SIGMOD Conference, May 2001.
-
``Adapting Materialized Views After Redefinitions: Techniques
and a Performance Study,'' A. Gupta, I. S. Mumick, J. Rao, K. A. Ross,
Information Systems, Special issue on Data Warehousing (to appear, 2001).
-
``Independence Diagrams: A Technique for Data Visualization,''
S. Berchtold, H. V. Jagadish, K. A. Ross, Journal of Electronic Imaging,
Volume 9, Issue 4, October 2000, pages 375--384.
-
"Publish/Subscribe on the Web at Extreme Speed," J. Pereira,
F. Fabret, F. Llirbat, R. Preotiuc-Pietro, K. A. Ross, D. Shasha,
Proceedings of the 2000 VLDB Conference (demo), September 2000.
-
``Optimizing Selections over Datacubes'' K. A. Ross
and K. A. Zaman, Proceedings of the 2000 SSDBM Conference, July 2000.
-
``Serving Datacube Tuples from Main Memory'' K. A.
Ross and K. A. Zaman, Proceedings of the 2000 SSDBM Conference, July 2000.
-
``Making B+-Trees Cache Conscious in Main Memory,'' J. Rao
and K. A. Ross, Proceedings of the ACM SIGMOD Conference, May 2000.
-
``Cache Conscious Indexing for Decision-Support in Main Memory''
J. Rao and K. A. Ross, Proceedings of the 1999 VLDB Conference, September
1999.
In addition to conference and journal publications, we have
made software available over the Internet. In particular, we have
made available two packages. The first package implements the Memory-Cube
algorithm for computing datacubes. The second package implements
the CSS-Tree and CSB+-Tree data structures for fast in-memory
indexing. These packages can be found from the project URL
http://www.cs.columbia.edu/~kar/fastqueryproj.html.
Project Impact
-
Patent applications have been filed on the indexing techniques.
Goals, Objectives, and Targeted Activities
Our project focuses on data sets that fit in main memory. With recent
growth in the amount of RAM one can buy for a fixed amount of money, gigabyte
main memories are now common for server machines. In the near future,
memories in the tens or even hundreds of gigabytes will be affordable.
While there will always be data sets too big for main memory there are
likely to be more and more applications whose data requirements match main
memory sizes. Main memory is appropriate for:
-
Applications requiring fast response times, where answers lose value if
they arrive late.
-
Applications having a core ``hot'' dataset that is accessed often and that
fits in memory, even when the entire database may be larger than the available
RAM.
-
Exploratory applications such as decision support, in order to make the
data analysis process interactive.
-
A disk-resident data-set might be summarized by aggregating detailed
data, with the summary stored in a main-memory database.
-
A disk-resident data-set might be compressed by approximating detailed
data, with the approximation stored in a main-memory database.
This project is distinguished from traditional work in main-memory databases
by its focus on decision-support applications, and the complex bulk queries
that occur in that domain. Our goal is to provide very fast response
to complex queries, to enable interactive exploratory queries, and to generate
high query throughput. The software developed could be used by an
organization with data to disseminate (say on the Web), in order to enhance
their service by providing fast query processing capabilities at relatively
low cost. We believe that there are inherent reasons why the performance
of a disk-based system is likely to be substantially worse than a main-memory
system. These performance differences can make the difference between
interactivite and batch query modes. We expect to be able to demonstrate
significant performance improvements for a variety of query processing
tasks, and to improve the response time of a large class of queries by
10-50%.
Recent progress includes a VLDB 1999 paper showing how to index decision-support
databases in main memory so that the computer's cache line is efficiently
utilized. In particular, many keys are packed into each cache line,
and pointers are eliminated entirely. Performance is substantially
better than conventional indexes.
In a follow-up SIGMOD 2000 paper, we show how to extend these techniques
to other kinds of databases that allow on-line updates. The trick
is to use a mix of pointers and array arithmetic to locate child nodes.
Under realistic conditions, our rechniques outperform B+-trees for both
lookups and updates.
In two SSDBM 2000 papers, we describe two datacube-related innovations.
One is a method for serving datacube records from main-memory. The
second shows how to reduce datacube computation if one is interested in
placing conditions on the aggregate values output.
In a SIGMOD 2001 paper, we desribe techniques for implementing a publish-subscribe
system that can handle large numbers of events per second, even with millions
of active subscriptions.
Project References
See "Publications and Products" above.
Area Background
Decision-support databases aim to provide answers to complex queries over
very large datasets. In a commercial setting, for example, these
answers can have high value in that they enable the organization to optimize
its behavior based on current performance. Query optimization and
evaluation algorithms enable faster execution plans to be used. Main
memory databases are used in order to achieve very fast response times
(in comparison with disk-based systems that can be significantly slower).
Area References
"Database Management Systems" by Raghu Ramakrishnan.
McGraw Hill.
"Principles of Database and Knowledge-Base Systems" by
Jeffrey Ullman. Computer Science Press.
Potential Related Projects
I would be interested in finding out from others about the kinds of complex
queries (on small or large datasets) that they have encountered during
their work. The longer the query the better!