cs4112: Project 1

COMS W4112-Database Systems Implementation
Spring 2009

Project 1

Due Date: Thursday April 2, 2009, 11:59pm

FAQs

Teams

You will carry out this project in teams of two. You do not need to notify us now of your team composition. Instead, when you submit your project you should indicate in your submission your team compositio. Both students in a team will receive the same grade for Project 1. Please check the Collaboration Policy web page for important information on what kinds of collaboration are allowed for projects.

Important notes:

Please find a partner for your Project 1 immediately.
Specifically, if you can't find a team-mate, please follow these steps:
1. Post a message in the class discussion board asking for a team mate - the best way.
2. Send email to TA Weiwei right away (and definitely before Tue Feb 17 at 5pm) asking him to pair you up with another student without a team-mate. Weiwei will do his best to find you a team-mate.
You have to use Java for this project.

Description

This project is about query result-size estimation as performed by a relational query optimizer. For this, you will extract different statistics for a given relation. Afterwards, you will estimate the result size of a given selection query (i.e., the number of tuples in the result for the query) using just the statistics and without examining the actual relation tuples. You will then be able to compare the accuracy of the estimates, and understand the impact of the data-distribution assumptions made when using the different statistics. Your implementation will be completely on main memory, so you don't need to worry about saving anything to disk, for example.

Specifically, we will focus in this project on a relation R(A1, A2), where A1 and A2 are integers. (You do not need to consider other relations or attributes.) Given the name of a file containing an instance of R, you need to compute the following three separate sets of statistics:

Set 1 is the simplest set of statistics, and consists of the following information:
- N: the total number of tuples in R (counting repetitions, if any).
- Low(Ai): the minimum value that Ai takes in the relation instance, for each attribute Ai (i=1, 2).
- High(Ai): the maximum value that Ai takes in the relation instance, for each attribute Ai (i=1, 2).
Hence Set 1 consists of just 5 numbers for our relation R(A1, A2).
Set 2 has more sophisticated statistics about the distribution of values within each attribute, and consists of the following information:
- An equiwidth histogram for attribute A1 (see page 487 of textbook, and below) with B1 buckets, for an input integer B1.
- An equiwidth histogram for attribute A2 (see page 487 of textbook, and below) with B2 buckets, for an input integer B2.
The equiwidth histogram for attribute Ai with Bi buckets simply divides the range of values of Ai into Bi ranges of equal length W (see below), and counts how many tuples of the relation lie in each range. Hence, each bucket corresponds to one range and will be of the form [x, y): f, indicating that there are f tuples t in R such that: x<=t.Ai<y. In other words, the equiwidth histogram for Ai will consist of the following Bi buckets:
- [m, m+W): f1
- [m+W, m+2*W): f2
- ...
- [m+(Bi-2)*W, m+(Bi-1)*W): fBi-1
- [m+(Bi-1)*W, M]: fBi
where W=floor((High(Ai)-Low(Ai))/Bi), m=Low(Ai), and M=High(Ai). Note that the last bucket is a special case in that it's "closed" on both ends, to also cover the tuples with Ai value equal to High(Ai). The last bucket might also be slightly larger than the others, if High(Ai)-Low(Ai) is not divisible by Bi. Note also that the number of tuples in the relation, N, is exactly equal to f1+f2+...+fBi. Finally, in the special case where parameter Bi is greater than High(Ai)-Low(Ai), you should create one bucket for each value in the range of values for attribute Ai. In other words, you should set Bi equal to High(Ai)-Low(Ai)+1.

Example: Consider the following instance of relation R:

A1 A2

5 5

1 5

3 9

3 11

7 2

Then, the Set 1 statistics for R are: N=5, Low(A1)=1, High(A1)=7, Low(A2)=2, and High(A2)=11.

The Set 2 statistics for R for B1=3 and B2=2 are:
- An equiwidth histogram for A1 with the following B1=3 buckets: [1, 3): 1; [3, 5): 2; [5, 7]: 2.
- An equiwidth histogram for A2 with the following B2=2 buckets: [2, 6): 3; [6, 11]: 2.
Hence, Set 2 consists of just 3*(B1+B2) numbers for our relation R(A1, A2) (i.e., B1 buckets for A1 and B2 buckets for A2, with each value being represented as three numbers, as discussed above).
Set 3 has even more sophisticated statistics about the joint distribution of the two attributes, and consists of just one 2-dimensional histogram (and nothing else):
- One 2-dimensional histogram jointly for A1 and A2. Your 2-dimensional histogram will be a set of buckets, where the buckets now specify a "rectangle." Each bucket will then consist of two ranges and a frequency: [x1, y1) * [x2, y2): f, indicating that there are f tuples t in R such that x1<=t.A1<y1 AND x2<=t.A2<y2. Your 2-dimensional buckets are simply a conceptual extension of the 1-dimensional equiwidth histogram idea to two dimensions. To build them, your program will take two parameters B3 and B4 (see below), which indicate the number of equal-size "ranges" into which you partition attributes A1 and A2, respectively. Just as in the 1-dimensional case, the "rightmost" range for each dimension should be "closed" on both ends, and might be slightly larger than the others.
  Hence, Set 3 consists of just 5*B3*B4 numbers for our relation R(A1, A2) (i.e., B3*B4 2-dimensional (i.e., rectangular) buckets, each with two associated ranges and one tuple frequency).
In the example above, if B3=3 and B4=2, the 2-dimensional histogram for R will consist of the following 6 buckets:
- [1, 3) * [2, 6): 1
- [1, 3) * [6, 11]: 0
- [3, 5) * [2, 6): 0
- [3, 5) * [6, 11]: 2
- [5, 7] * [2, 6): 2
- [5, 7] * [6, 11]: 0

Given statistics of the contents of a relation such as those described above, the query optimizer estimates the result size of a given query. For this project, we will restrict our attention to queries of the form:

SELECT *
FROM R
WHERE A1>=r1 AND A1<=s1 AND A2>=r2 AND A2<=s2

where r1, s1, r2, and s2 are integers. If only the Set 1 statistics are available, you can estimate the size of the query by adapting the description on pages 483 and 484 of the textbook, assuming that the conditions on A1 and on A2 are independent. (Hint: Do not consider the two conditions on A1 (i.e., A1>=r1 and A1<=s1) as being independent, but rather compute the reduction factor of r1<=A1<=s1 directly. Same thing for the two conditions on A2.)

In contrast, if the Set 2 statistics are available, you can derive a closer estimate of the result size of the query by exploiting the two histograms, but still assuming independence of the conditions on A1 and on A2.

Finally, if the Set 3 statistics are available, then you should be able to derive an even closer estimate of the result size of the query by exploiting your single 2-dimensional histogram and without having to assume independence of the conditions on A1 and on A2.

What you need to do

In this project, you need to write a program to do the following:

Receive as input the name of a file that contains an instance of the relation R described above. So you can test your system, we are providing a data set with 10,000 tuples. The attribute values are positive integers ranging from 0 to (2^16)-1 . The data set is in a text file with one tuple per line and comma-separated attributes. You can assume that your program will only be applied on integer-valued relations.
Create the Set 1, Set 2, and Set 3 statistics. Your program should also receive as input B1 and B2, the number of buckets for the two histograms in Set 2, plus B3 and B4, which determine the number of buckets in the histogram in Set 3 (see above).
Accept queries as 4 space-separated integers, corresponding to r1, s1, r2, and s2 in the selection query above, in this order. For each input query, you should:
- Report the result-size estimate using only the Set 1 statistics.
- Report the result-size estimate using only the Set 2 statistics.
- Report the result-size estimate using only the Set 3 statistics.
- Report the exact number of tuples in R that match the selection query associated with r1, s1, r2, and s2. To report this exact number you will have to inspect all tuples in R.
Note: When you are given a query, you should compute all "intermediate" estimates for the query using floating-point precision, with no rounding. Then, as a last step before you output each of the four result-estimates above, take the ceiling so your reported result-size estimates are always integers.

To facilitate our grading, your program should run in two modes: "normal" and "verbose":

"Normal" mode: Just report the information listed above for each query that you enter.
"Verbose" mode: In addition to the information listed above, report as much information as you judge necessary for us to understand how you are computing your result-size estimates. For example, for the Set 1 and Set 2 statistics you might want to report -in addition to other things- the reduction factor of the condition on A1 and the reduction factor of the condition on A2, how you derived them, and how you combined them into the final result-size estimate. Also, for the Set 2 and Set 3 statistics you might want to report the actual histogram buckets that you used in the computation of the result-size estimate. Use common sense to report information that we could use to give you partial credit. Also, your verbose mode should allow us to ask to see the histograms that you computed.

Your executable should be invoked using the format that follows, which we ask that you please follow strictly:

$ java Estimate [-v] filename B1 B2 B3 B4
see below
-v: verbose mode
filename: file to read data set from
B1: an integer
B2: an integer
B3: an integer
B4: an integer

Each query is entered (from stdin) as a line with 4 space-separated integers (e.g., "1 10 50 320<newline>").

Output format

The output of your program should follow exactly the format of the sample runs available as file results.txt. The program that we used to generate this output used floats for all intermediate calculations and considers that a condition 5 <= x <= 10 matches 6/10 of a bucket [1,11). This is because we are using integer semantics and the values are discrete. Similarly, a condition 23 <= x <= 23 matches 1/5 of a bucket [20,25). To generate the output in file results.txt, run tests.txt with queries.txt and the dataset.txt data set in the same directory.

What to submit

You should submit:

Code that implements the functionality described above, supporting the two modes of operation (i.e., "normal" and "verbose") and generally the input and output formats specified above.
A README file explaining your implementation, focusing on how you compute the result-size estimates; specifically, your README file should contain:
- Your name and your partner's name;
- A list of all the files that you are submitting;
- A clear description of how to run your program (note that your project must compile/run under Solaris in your CS account);
- A clear description of the internal design of your project;
- Any additional information that you consider significant.
A Makefile file to compile on a CS machine running Solaris, including what input your program expects and in what format. We should be able to just run "make" and generate an executable from your code. This executable should be called "estimate" if you use C++, or as "java Estimate" if you use Java.

To submit your project, follow these instructions:

Create a directory named <your-UNI>-proj1, where you should replace <your-UNI> with the Columbia UNI of one teammate (for example, if the teammate's UNI is abc123, then the directory should be named abc123-proj1).
Copy the source code files into the <your-UNI>-proj1 directory, and include all the other files that are necessary for your program to run.
Copy your README and Makefile files into the <your-UNI>-proj1 directory.
Tar and gzip the <your-UNI>-proj1 directory, to generate a single file <your-UNI>-proj1.tar.gz, which is the file that you will submit.
Login to Courseworks at https://courseworks.columbia.edu/ and select the site for our class.
Select "Class Files" and then "Post File."
Enter "Project 1" in the "Title" field and select your <your-UNI>-proj1.tar.gz file in the "File" field using the "Browse" button.
Select folder "Project 1" (under "Shared Folders") in the "Post File To" field.
Enter the names of the members of the team and their UNI in the "Comments" field.
Hit the "Submit" button.

IMPORTANT NOTE: You must write your project in Java. Your project must compile and run under Solaris in your CS account. (Use ssh to cluster.cs.columbia.edu to access Solaris machines.) We cannot make exceptions to this rule.

Frequently-asked questions

Q: What should my output look like?
A: Check the sample runs available in the "Output format" section above. Your output should exactly match the format specified in these examples for the "Normal" mode. The format of your "Verbose" mode output is up to you, but it must be clearly readable. We strongly suggest that when printing histograms you use the formats given in the examples above.

Q: Is error handling required?
A: You don't need to implement sophisticated error handling, but you should definitely perform basic error checking (such as checking for the correct number and type of arguments). However, you don't need to do any error checking for the data sets; you can assume that we will only test your program with valid data sets.

Q: What should I do if the range of attribute values and the values for B3 and B4 are so large that my 2-dimensional histogram doesn't fit in memory?
A: You can assume that we will only test your program with sufficiently small values for B3 and B4 such that this case never occurs.

Q: How will the project be graded?
A: Your grade will depend on the completeness of your implementation (whether you have met all of the requirements), the accuracy of your estimates, and -to a smaller extent- the quality of your README file.

Q: Will I be graded on the efficiency of my solution?
A: No.

Q: Should I comment my code?
A: Yes. If we find problems with your solution, we will look at your code in assigning partial credit. Therefore, it is important that you provide enough comments so that we can understand your implementation.

Q: Where can I find a sample Makefile?
A: Here is a sample for Java:

CLASSPATH=.
JAVAC=javac

all: Example.class

%.class : %.java
$(JAVAC) -classpath $(CLASSPATH) $<

clean:
rm -f *.class

Good luck!

Kenneth Ross

COMS W4112-Database Systems Implementation Spring 2009