Thursday April 2, 2009, 11:59pm
You will carry out this project in teams of two. You do not need to notify us now of your team composition. Instead, when you submit your project you should indicate in your submission your team compositio. Both students in a team will receive the same grade for Project 1. Please check the Collaboration Policy web page for important information on what kinds of collaboration are allowed for projects.
This project is about query result-size estimation as performed by a relational query optimizer. For this, you will extract different statistics for a given relation. Afterwards, you will estimate the result size of a given selection query (i.e., the number of tuples in the result for the query) using just the statistics and without examining the actual relation tuples. You will then be able to compare the accuracy of the estimates, and understand the impact of the data-distribution assumptions made when using the different statistics. Your implementation will be completely on main memory, so you don't need to worry about saving anything to disk, for example.
Specifically, we will focus in this project on a relation R(A1, A2), where A1 and A2 are integers. (You do not need to consider other relations or attributes.) Given the name of a file containing an instance of R, you need to compute the following three separate sets of statistics:
Hence Set 1 consists of just 5 numbers for our relation R(A1, A2).
The equiwidth histogram for attribute Ai with Bi buckets simply divides the range of values of Ai into Bi ranges of equal length W (see below), and counts how many tuples of the relation lie in each range. Hence, each bucket corresponds to one range and will be of the form [x, y): f, indicating that there are f tuples t in R such that: x<=t.Ai<y. In other words, the equiwidth histogram for Ai will consist of the following Bi buckets:
where W=floor((High(Ai)-Low(Ai))/Bi), m=Low(Ai), and M=High(Ai). Note that the last bucket is a special case in that it's "closed" on both ends, to also cover the tuples with Ai value equal to High(Ai). The last bucket might also be slightly larger than the others, if High(Ai)-Low(Ai) is not divisible by Bi. Note also that the number of tuples in the relation, N, is exactly equal to f1+f2+...+fBi. Finally, in the special case where parameter Bi is greater than High(Ai)-Low(Ai), you should create one bucket for each value in the range of values for attribute Ai. In other words, you should set Bi equal to High(Ai)-Low(Ai)+1.
Example: Consider the following instance of relation R:
Then, the Set 1 statistics for R are: N=5, Low(A1)=1, High(A1)=7, Low(A2)=2, and High(A2)=11.
The Set 2 statistics for R for B1=3 and B2=2 are:
Hence, Set 2 consists of just 3*(B1+B2) numbers for our relation R(A1, A2) (i.e., B1 buckets for A1 and B2 buckets for A2, with each value being represented as three numbers, as discussed above).
In the example above, if B3=3 and B4=2, the 2-dimensional histogram for R will consist of the following 6 buckets:
Given statistics of the contents of a relation such as those described above, the query optimizer estimates the result size of a given query. For this project, we will restrict our attention to queries of the form:
WHERE A1>=r1 AND A1<=s1 AND A2>=r2 AND A2<=s2
where r1, s1, r2, and s2 are integers. If only the Set 1 statistics are available, you can estimate the size of the query by adapting the description on pages 483 and 484 of the textbook, assuming that the conditions on A1 and on A2 are independent. (Hint: Do not consider the two conditions on A1 (i.e., A1>=r1 and A1<=s1) as being independent, but rather compute the reduction factor of r1<=A1<=s1 directly. Same thing for the two conditions on A2.)
In contrast, if the Set 2 statistics are available, you can derive a closer estimate of the result size of the query by exploiting the two histograms, but still assuming independence of the conditions on A1 and on A2.
Finally, if the Set 3 statistics are available, then you should be able to derive an even closer estimate of the result size of the query by exploiting your single 2-dimensional histogram and without having to assume independence of the conditions on A1 and on A2.
In this project, you need to write a program to do the following:
Note: When you are given a query, you should compute all "intermediate" estimates for the query using floating-point precision, with no rounding. Then, as a last step before you output each of the four result-estimates above, take the ceiling so your reported result-size estimates are always integers.
To facilitate our grading, your program should run in two modes: "normal" and "verbose":
Your executable should be invoked using the format that follows, which we ask that you please follow strictly:
$ java Estimate [-v] filename B1 B2 B3 B4
-v: verbose mode
filename: file to read data set from
B1: an integer
B2: an integer
B3: an integer
B4: an integer
Each query is entered (from stdin) as a line with 4 space-separated integers (e.g., "1 10 50 320<newline>").
The output of your program should follow exactly the format of the sample runs available as file results.txt. The program that we used to generate this output used floats for all intermediate calculations and considers that a condition 5 <= x <= 10 matches 6/10 of a bucket [1,11). This is because we are using integer semantics and the values are discrete. Similarly, a condition 23 <= x <= 23 matches 1/5 of a bucket [20,25). To generate the output in file results.txt, run tests.txt with queries.txt and the dataset.txt data set in the same directory.
You should submit:
To submit your project, follow these instructions:
IMPORTANT NOTE: You must write your project in Java. Your project must compile and run under Solaris in your CS account. (Use ssh to cluster.cs.columbia.edu to access Solaris machines.) We cannot make exceptions to this rule.
Q: What should my output look like?
A: Check the sample runs available in the "Output format" section above. Your output should exactly match the format specified in these examples for the "Normal" mode. The format of your "Verbose" mode output is up to you, but it must be clearly readable. We strongly suggest that when printing histograms you use the formats given in the examples above.
Q: Is error handling required?
A: You don't need to implement sophisticated error handling, but you should definitely perform basic error checking (such as checking for the correct number and type of arguments). However, you don't need to do any error checking for the data sets; you can assume that we will only test your program with valid data sets.
Q: What should I do if the range of attribute values and the values for B3 and B4 are so large that my 2-dimensional histogram doesn't fit in memory?
A: You can assume that we will only test your program with sufficiently small values for B3 and B4 such that this case never occurs.
Q: How will the project be graded?
A: Your grade will depend on the completeness of your implementation (whether you have met all of the requirements), the accuracy of your estimates, and -to a smaller extent- the quality of your README file.
Q: Will I be graded on the efficiency of my solution?
Q: Should I comment my code?
A: Yes. If we find problems with your solution, we will look at your code in assigning partial credit. Therefore, it is important that you provide enough comments so that we can understand your implementation.
Q: Where can I find a sample Makefile?
A: Here is a sample for Java:
%.class : %.java
$(JAVAC) -classpath $(CLASSPATH) $<
rm -f *.class