COMS W4112-Database Systems Implementation
Spring 2009

Project 1

Due Date: Thursday April 2, 2009, 11:59pm

FAQs

Teams

You will carry out this project in teams of two. You do not need to notify us now of your team composition. Instead, when you submit your project you should indicate in your submission your team compositio. Both students in a team will receive the same grade for Project 1. Please check the Collaboration Policy web page for important information on what kinds of collaboration are allowed for projects.

Important notes:

Description

This project is about query result-size estimation as performed by a relational query optimizer. For this, you will extract different statistics for a given relation. Afterwards, you will estimate the result size of a given selection query (i.e., the number of tuples in the result for the query) using just the statistics and without examining the actual relation tuples. You will then be able to compare the accuracy of the estimates, and understand the impact of the data-distribution assumptions made when using the different statistics. Your implementation will be completely on main memory, so you don't need to worry about saving anything to disk, for example.

Specifically, we will focus in this project on a relation R(A1, A2), where A1 and A2 are integers. (You do not need to consider other relations or attributes.) Given the name of a file containing an instance of R, you need to compute the following three separate sets of statistics:

Given statistics of the contents of a relation such as those described above, the query optimizer estimates the result size of a given query. For this project, we will restrict our attention to queries of the form:

SELECT *
FROM R
WHERE A1>=r1 AND A1<=s1 AND A2>=r2 AND A2<=s2

where r1, s1, r2, and s2 are integers. If only the Set 1 statistics are available, you can estimate the size of the query by adapting the description on pages 483 and 484 of the textbook, assuming that the conditions on A1 and on A2 are independent. (Hint: Do not consider the two conditions on A1 (i.e., A1>=r1 and A1<=s1) as being independent, but rather compute the reduction factor of r1<=A1<=s1 directly. Same thing for the two conditions on A2.)

In contrast, if the Set 2 statistics are available, you can derive a closer estimate of the result size of the query by exploiting the two histograms, but still assuming independence of the conditions on A1 and on A2.

Finally, if the Set 3 statistics are available, then you should be able to derive an even closer estimate of the result size of the query by exploiting your single 2-dimensional histogram and without having to assume independence of the conditions on A1 and on A2.

What you need to do

In this project, you need to write a program to do the following:

  1. Receive as input the name of a file that contains an instance of the relation R described above. So you can test your system, we are providing a data set with 10,000 tuples. The attribute values are positive integers ranging from 0 to (2^16)-1 . The data set is in a text file with one tuple per line and comma-separated attributes. You can assume that your program will only be applied on integer-valued relations.
  2. Create the Set 1, Set 2, and Set 3 statistics. Your program should also receive as input B1 and B2, the number of buckets for the two histograms in Set 2, plus B3 and B4, which determine the number of buckets in the histogram in Set 3 (see above).
  3. Accept queries as 4 space-separated integers, corresponding to r1, s1, r2, and s2 in the selection query above, in this order. For each input query, you should:

    Note: When you are given a query, you should compute all "intermediate" estimates for the query using floating-point precision, with no rounding. Then, as a last step before you output each of the four result-estimates above, take the ceiling so your reported result-size estimates are always integers.

To facilitate our grading, your program should run in two modes: "normal" and "verbose":

Your executable should be invoked using the format that follows, which we ask that you please follow strictly:

$ java Estimate [-v] filename B1 B2 B3 B4
see below
-v: verbose mode
filename: file to read data set from
B1: an integer
B2: an integer
B3: an integer
B4: an integer

Each query is entered (from stdin) as a line with 4 space-separated integers (e.g., "1 10 50 320<newline>").

Output format

The output of your program should follow exactly the format of the sample runs available as file results.txt. The program that we used to generate this output used floats for all intermediate calculations and considers that a condition 5 <= x <= 10 matches 6/10 of a bucket [1,11). This is because we are using integer semantics and the values are discrete. Similarly, a condition 23 <= x <= 23 matches 1/5 of a bucket [20,25). To generate the output in file results.txt, run tests.txt with queries.txt and the dataset.txt data set in the same directory.

What to submit

You should submit:

  1. Code that implements the functionality described above, supporting the two modes of operation (i.e., "normal" and "verbose") and generally the input and output formats specified above.
  2. A README file explaining your implementation, focusing on how you compute the result-size estimates; specifically, your README file should contain:
  3. A Makefile file to compile on a CS machine running Solaris, including what input your program expects and in what format. We should be able to just run "make" and generate an executable from your code. This executable should be called "estimate" if you use C++, or as "java Estimate" if you use Java.

To submit your project, follow these instructions:

  1. Create a directory named <your-UNI>-proj1, where you should replace <your-UNI> with the Columbia UNI of one teammate (for example, if the teammate's UNI is abc123, then the directory should be named abc123-proj1).
  2. Copy the source code files into the <your-UNI>-proj1 directory, and include all the other files that are necessary for your program to run.
  3. Copy your README and Makefile files into the <your-UNI>-proj1 directory.
  4. Tar and gzip the <your-UNI>-proj1 directory, to generate a single file <your-UNI>-proj1.tar.gz, which is the file that you will submit.
  5. Login to Courseworks at https://courseworks.columbia.edu/ and select the site for our class.
  6. Select "Class Files" and then "Post File."
  7. Enter "Project 1" in the "Title" field and select your <your-UNI>-proj1.tar.gz file in the "File" field using the "Browse" button.
  8. Select folder "Project 1" (under "Shared Folders") in the "Post File To" field.
  9. Enter the names of the members of the team and their UNI in the "Comments" field.
  10. Hit the "Submit" button.

IMPORTANT NOTE: You must write your project in Java. Your project must compile and run under Solaris in your CS account. (Use ssh to cluster.cs.columbia.edu to access Solaris machines.) We cannot make exceptions to this rule.

Frequently-asked questions

Q: What should my output look like?
A: Check the sample runs available in the "Output format" section above. Your output should exactly match the format specified in these examples for the "Normal" mode. The format of your "Verbose" mode output is up to you, but it must be clearly readable. We strongly suggest that when printing histograms you use the formats given in the examples above.

Q: Is error handling required?
A: You don't need to implement sophisticated error handling, but you should definitely perform basic error checking (such as checking for the correct number and type of arguments). However, you don't need to do any error checking for the data sets; you can assume that we will only test your program with valid data sets.

Q: What should I do if the range of attribute values and the values for B3 and B4 are so large that my 2-dimensional histogram doesn't fit in memory?
A: You can assume that we will only test your program with sufficiently small values for B3 and B4 such that this case never occurs.

Q: How will the project be graded?
A: Your grade will depend on the completeness of your implementation (whether you have met all of the requirements), the accuracy of your estimates, and -to a smaller extent- the quality of your README file.

Q: Will I be graded on the efficiency of my solution?
A: No.

Q: Should I comment my code?
A: Yes. If we find problems with your solution, we will look at your code in assigning partial credit. Therefore, it is important that you provide enough comments so that we can understand your implementation.

Q: Where can I find a sample Makefile?
A: Here is a sample for Java:

CLASSPATH=.
JAVAC=javac

all: Example.class

%.class : %.java
        $(JAVAC) -classpath $(CLASSPATH) $<

clean:
        rm -f *.class


 

Good luck!


Kenneth Ross