Spring 2009

**Project 1**

**Due Date: Thursday April 2, 2009, 11:59pm**

You will carry out this project in **teams of two**.
You do not need to notify us now of your team composition. Instead,
when you submit your project you should indicate your team
composition in your submission. Both students in a team will
receive the same grade for Project 1. Please check the
Collaboration Policy web page for important information on what kinds of
collaboration are allowed for projects.

- Please find a partner for your Project 1 immediately.
- If you can't find a teammate, please follow these steps:
  - Post a message in the class discussion board asking for a teammate - **the best way**.
  - Send email to TA Weiwei right away (and definitely **before Tue Feb 17 at 5pm**) asking him to pair you up with another student without a teammate. Weiwei will do his best to find you a teammate.
- You have to use **Java** for this project.

This project is about query result-size estimation as performed by a relational
query optimizer.
For this, you will extract different statistics for a given relation.
Afterwards, you will estimate the **result size** of a given selection query
(i.e., the **number of tuples in the result** for the query) using just the
statistics and without examining the actual relation tuples. You will then be
able to compare the accuracy of the estimates, and understand the impact of the
data-distribution assumptions made when using the different statistics. Your implementation
will run entirely in **main memory**, so you don't
need to worry about saving anything to disk.

Specifically, in this project we will focus on a relation **R(A1, A2)**,
where **A1** and **A2** are integers. (You do not need to
consider other relations or attributes.) Given the name of a file containing an
instance of **R**, you need to compute
the following three separate sets of statistics:

**Set 1** is the simplest set of statistics, and consists of the following information:

- **N**: the total number of tuples in **R** (counting repetitions, if any).
- **Low(Ai)**: the minimum value that **Ai** takes in the relation instance, for each attribute **Ai** (i=1, 2).
- **High(Ai)**: the maximum value that **Ai** takes in the relation instance, for each attribute **Ai** (i=1, 2).

Hence **Set 1** consists of just 5 numbers for our relation **R(A1, A2)**.

**Set 2** has more sophisticated statistics about the distribution of values within each attribute, and consists of the following information:

- An **equiwidth histogram for attribute A1** (see page 487 of the textbook, and below) with **B1** buckets, for an input integer **B1**.
- An **equiwidth histogram for attribute A2** (see page 487 of the textbook, and below) with **B2** buckets, for an input integer **B2**.

The **equiwidth histogram** for attribute **Ai** with **Bi** buckets simply divides the range of values of **Ai** into **Bi** ranges of equal length **W** (see below), and counts how many tuples of the relation lie in each range. Hence, each bucket corresponds to one range and will be of the form **[x, y): f**, indicating that there are **f** tuples **t** in **R** such that **x<=t.Ai<y**. In other words, the equiwidth histogram for **Ai** will consist of the following **Bi** buckets:

**[m, m+W): f1**
**[m+W, m+2*W): f2**
**...**
**[m+(Bi-2)*W, m+(Bi-1)*W): fBi-1**
**[m+(Bi-1)*W, M]: fBi**

where **W=floor((High(Ai)-Low(Ai))/Bi)**, **m=Low(Ai)**, and **M=High(Ai)**. Note that the last bucket is a special case in that it is "closed" on both ends, to also cover the tuples with **Ai** value equal to **High(Ai)**. The last bucket might also be slightly larger than the others, if **High(Ai)-Low(Ai)** is not divisible by **Bi**. Note also that the number of tuples in the relation, **N**, is exactly equal to **f1+f2+...+fBi**. Finally, in the special case where parameter **Bi** is greater than **High(Ai)-Low(Ai)**, you should create one bucket for each value in the range of values for attribute **Ai**. In other words, you should set **Bi** equal to **High(Ai)-Low(Ai)+1**.
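As an illustration, the bucket construction just described (the floor width, the closed last bucket, and the **Bi > High(Ai)-Low(Ai)** special case) might be sketched in Java as follows; the class and method names are ours, not part of the assignment:

```java
import java.util.Arrays;

// Sketch of the equiwidth histogram construction described above.
// Class and method names are illustrative only.
public class EquiwidthSketch {

    /** Returns the bucket frequencies f1..fBi for one attribute's values. */
    static int[] buildHistogram(int[] values, int bi) {
        int low = Arrays.stream(values).min().getAsInt();    // Low(Ai)
        int high = Arrays.stream(values).max().getAsInt();   // High(Ai)
        if (bi > high - low) {
            bi = high - low + 1;          // special case: one bucket per value
        }
        int w = (high - low) / bi;        // integer division gives the floor
        int[] counts = new int[bi];
        for (int v : values) {
            int idx = (w == 0) ? (v - low) : (v - low) / w;
            if (idx >= bi) idx = bi - 1;  // last bucket [m+(Bi-1)*W, M] is closed
            counts[idx]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // A1 column of the example that follows: [1,3): 1, [3,5): 2, [5,7]: 2
        System.out.println(Arrays.toString(buildHistogram(new int[]{5, 1, 3, 3, 7}, 3)));
        // prints [1, 2, 2]
    }
}
```

Note that the frequencies alone are printed here; your program also needs to remember each bucket's range endpoints for the estimation step.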

**Example**: Consider the following instance of relation **R**:

| A1 | A2 |
|----|----|
| 5  | 5  |
| 1  | 5  |
| 3  | 9  |
| 3  | 11 |
| 7  | 2  |

Then, the **Set 1** statistics for **R** are: **N=5**, **Low(A1)=1**, **High(A1)=7**, **Low(A2)=2**, and **High(A2)=11**.

The **Set 2** statistics for **R** for **B1=3** and **B2=2** are:

- An equiwidth histogram for **A1** with the following **B1**=3 buckets: **[1, 3): 1**; **[3, 5): 2**; **[5, 7]: 2**.
- An equiwidth histogram for **A2** with the following **B2**=2 buckets: **[2, 6): 3**; **[6, 11]: 2**.

Hence, **Set 2** consists of just **3*(B1+B2)** numbers for our relation **R(A1, A2)** (i.e., **B1** buckets for **A1** and **B2** buckets for **A2**, with each bucket being represented as three numbers, as discussed above).

**Set 3** has even more sophisticated statistics about the joint distribution of the two attributes, and consists of **just one 2-dimensional histogram (and nothing else)**:

- **One 2-dimensional histogram jointly for A1 and A2**. Your 2-dimensional histogram will be a set of buckets, where the buckets now specify a "rectangle." Each bucket will then consist of two ranges and a frequency: **[x1, y1) * [x2, y2): f**, indicating that there are **f** tuples **t** in **R** such that **x1<=t.A1<y1 AND x2<=t.A2<y2**. Your 2-dimensional buckets are simply a conceptual extension of the 1-dimensional equiwidth histogram idea to two dimensions. To build them, your program will take two parameters **B3** and **B4** (see below), which indicate the number of equal-size "ranges" into which you partition attributes **A1** and **A2**, respectively. Just as in the 1-dimensional case, the "rightmost" range for each dimension should be "closed" on both ends, and might be slightly larger than the others.

Hence, **Set 3** consists of just **5*B3*B4** numbers for our relation **R(A1, A2)** (i.e., **B3*B4** 2-dimensional (i.e., rectangular) buckets, each with two associated ranges and one tuple frequency).

In the example above, if **B3**=3 and **B4**=2, the 2-dimensional histogram for **R** will consist of the following 6 buckets:

**[1, 3) * [2, 6): 1**
**[1, 3) * [6, 11]: 0**
**[3, 5) * [2, 6): 0**
**[3, 5) * [6, 11]: 2**
**[5, 7] * [2, 6): 2**
**[5, 7] * [6, 11]: 0**
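The 2-dimensional construction can be sketched in the same spirit; the names below are ours, and the **Bi > High(Ai)-Low(Ai)** special case is omitted for brevity:

```java
import java.util.Arrays;

// Sketch of the 2-dimensional equiwidth histogram described above.
public class TwoDimSketch {

    /** counts[i][j] = tuples whose A1 falls in range i and A2 in range j. */
    static int[][] build2D(int[][] tuples, int b3, int b4) {
        int[] a1 = Arrays.stream(tuples).mapToInt(t -> t[0]).toArray();
        int[] a2 = Arrays.stream(tuples).mapToInt(t -> t[1]).toArray();
        int[][] counts = new int[b3][b4];
        for (int[] t : tuples) {
            counts[range(t[0], a1, b3)][range(t[1], a2, b4)]++;
        }
        return counts;
    }

    /** Index of the equiwidth range that value v falls into. */
    static int range(int v, int[] column, int b) {
        int low = Arrays.stream(column).min().getAsInt();
        int high = Arrays.stream(column).max().getAsInt();
        int w = (high - low) / b;                 // floor, as in the 1-D case
        int idx = (w == 0) ? (v - low) : (v - low) / w;
        return Math.min(idx, b - 1);              // rightmost range is closed
    }

    public static void main(String[] args) {
        int[][] r = {{5, 5}, {1, 5}, {3, 9}, {3, 11}, {7, 2}};
        // Rows are the A1 ranges [1,3), [3,5), [5,7];
        // columns are the A2 ranges [2,6), [6,11].
        System.out.println(Arrays.deepToString(build2D(r, 3, 2)));
        // prints [[1, 0], [0, 2], [2, 0]]
    }
}
```

The printed matrix lists the same six frequencies as the buckets above, row by row.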

Given statistics of the contents of a relation such as those described above, the query optimizer estimates the result size of a given query. For this project, we will restrict our attention to queries of the form:

    SELECT *
    FROM R
    WHERE A1>=r1 AND A1<=s1 AND A2>=r2 AND A2<=s2

where **r1**, **s1**, **r2**, and **s2** are integers. If only
the **Set 1** statistics are available, you can estimate the size of the query
by adapting the description on pages 483 and 484 of the textbook, assuming that
the conditions on **A1** and on **A2** are independent. (**Hint:** Do
**not** consider the two conditions on **A1** (i.e., **A1>=r1** and
**A1<=s1**) as being independent, but rather
compute the reduction factor of **r1<=A1<=s1** directly. Do the same for
the two conditions on **A2**.)
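For instance, the hinted computation might be sketched in Java as below; the query values and the final combination into **N** times the two reduction factors are our illustration, using the integer (discrete-value) semantics described later in this handout:

```java
// Sketch of the Set 1 estimate under the uniformity assumption.
// Integer semantics: the reduction factor counts discrete values.
public class Set1EstimateSketch {

    /** Reduction factor of the combined condition r <= A <= s. */
    static double reductionFactor(int r, int s, int low, int high) {
        int overlap = Math.min(s, high) - Math.max(r, low) + 1;
        if (overlap <= 0) return 0.0;               // query range misses [low, high]
        return (double) overlap / (high - low + 1);
    }

    public static void main(String[] args) {
        // Example instance: N=5, Low(A1)=1, High(A1)=7, Low(A2)=2, High(A2)=11;
        // hypothetical query: 3 <= A1 <= 6 AND 2 <= A2 <= 11.
        double estimate = 5 * reductionFactor(3, 6, 1, 7) * reductionFactor(2, 11, 2, 11);
        System.out.println(Math.ceil(estimate));    // ceiling only at the very end
        // prints 3.0
    }
}
```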

In contrast, if the **Set 2** statistics are available, you can derive a
closer estimate of the result size of the query by exploiting the two
histograms, but still assuming independence of the conditions on **A1** and
on **A2**.

Finally, if the **Set 3** statistics are available, then you should be able
to derive an even closer estimate of the result size of the query by exploiting
your single 2-dimensional histogram and without having to assume independence of
the conditions on **A1** and on **A2**.
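One plausible way to use a 1-dimensional histogram for this (our reading, not mandated by the handout) is to sum, over all buckets, the bucket frequency times the fraction of the bucket's values that the query range covers:

```java
// Sketch of a result-size estimate from one equiwidth histogram.
// Each bucket is an {x, y, f} triple meaning [x, y): f; the last one is [x, y]: f.
public class HistogramEstimateSketch {

    /** Estimated number of tuples with r <= A <= s. */
    static double estimate(int[][] buckets, int r, int s) {
        double total = 0.0;
        for (int i = 0; i < buckets.length; i++) {
            int x = buckets[i][0], y = buckets[i][1], f = buckets[i][2];
            // Largest value covered: y - 1 normally, y for the closed last bucket.
            int hi = (i == buckets.length - 1) ? y : y - 1;
            int overlap = Math.min(s, hi) - Math.max(r, x) + 1;
            if (overlap > 0) {
                // Assume the f tuples are spread uniformly over the bucket's values.
                total += f * (double) overlap / (hi - x + 1);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // A2 histogram of the example: [2, 6): 3 and [6, 11]: 2; query 5 <= A2 <= 9.
        int[][] a2 = {{2, 6, 3}, {6, 11, 2}};
        System.out.println(estimate(a2, 5, 9));
    }
}
```

Under independence, the two per-attribute estimates can then be combined (e.g., as estA1 * estA2 / N); the Set 3 estimate is the analogous sum of each 2-dimensional bucket's frequency times its fractional overlap with the query rectangle, with no independence assumption needed.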

In this project, you need to write a program to do the following:

- Receive as input the name of a file that contains an instance of the relation **R** described above. So you can test your system, we are providing a data set with 10,000 tuples. The attribute values are positive integers ranging from 0 to (2^16)-1. The data set is in a text file with one tuple per line and comma-separated attributes. You can assume that your program will only be applied to integer-valued relations.
- Create the **Set 1**, **Set 2**, and **Set 3** statistics. Your program should also receive as input **B1** and **B2**, the number of buckets for the two histograms in **Set 2**, plus **B3** and **B4**, which determine the number of buckets in the histogram in **Set 3** (see above).
- Accept queries as **4 space-separated integers**, corresponding to **r1**, **s1**, **r2**, and **s2** in the selection query above, in this order. For each input query, you should:
  - Report the result-size estimate using only the **Set 1** statistics.
  - Report the result-size estimate using only the **Set 2** statistics.
  - Report the result-size estimate using only the **Set 3** statistics.
  - Report the exact number of tuples in **R** that match the selection query associated with **r1**, **s1**, **r2**, and **s2**. To report this exact number you will have to inspect all tuples in **R**.

**Note:** When you are given a query, you should compute all "intermediate" estimates for the query **using floating-point precision, with no rounding**. Then, as a last step before you output each of the four result estimates above, take the ceiling so your reported result-size estimates are always integers.
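Loading the comma-separated data set into main memory is straightforward; here is a sketch (the class and method names, and the built-in fallback sample, are ours):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Sketch of loading the data set into main memory.
public class LoadRelationSketch {

    /** Parses one "a1,a2" tuple per line into a list of int[2] arrays. */
    static List<int[]> parse(String text) {
        List<int[]> tuples = new ArrayList<>();
        for (String line : text.split("\\R")) {       // split on any line break
            if (line.trim().isEmpty()) continue;      // skip blank lines
            String[] parts = line.split(",");
            tuples.add(new int[]{Integer.parseInt(parts[0].trim()),
                                 Integer.parseInt(parts[1].trim())});
        }
        return tuples;
    }

    public static void main(String[] args) throws IOException {
        // Reads the file named on the command line, or a tiny built-in sample.
        String text = args.length > 0
                ? new String(Files.readAllBytes(Paths.get(args[0])))
                : "5,5\n1,5\n3,9\n";
        System.out.println(parse(text).size() + " tuples loaded");
    }
}
```

With the whole relation in a list like this, the exact answer to a query is a single pass that counts the tuples satisfying all four conditions.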

To facilitate our grading, your program should run in two modes: "normal" and "verbose":

- **"Normal" mode:** Just report the information listed above for each query that you enter.
- **"Verbose" mode:** In addition to the information listed above, report as much information as you judge necessary for us to understand how you are computing your result-size estimates. For example, for the **Set 1** and **Set 2** statistics you might want to report -in addition to other things- the reduction factor of the condition on **A1** and the reduction factor of the condition on **A2**, how you derived them, and how you combined them into the final result-size estimate. Also, for the **Set 2** and **Set 3** statistics you might want to report the actual histogram buckets that you used in the computation of the result-size estimate. Use common sense to report information that we could use to give you partial credit. Also, your verbose mode should allow us to ask to **see the histograms** that you computed.

Your executable should be invoked using the format that follows, which we ask that you please follow strictly:

$ **java Estimate [-v] filename B1 B2 B3 B4**

where:

**-v**: verbose mode

**filename**: file to read the data set from

**B1**: an integer

**B2**: an integer

**B3**: an integer

**B4**: an integer

Each query is entered (from stdin) as a line with 4 space-separated integers
(e.g., "1 10 50 320<newline>").

The output of your program should follow exactly the format of the **sample
runs** available as file results.txt.
The program that we used to generate this output uses floating-point
arithmetic for all intermediate calculations, and considers that a condition
5 <= x <= 10 matches 6/10 of a bucket [1,11). This is because we are using
integer semantics and the values are discrete. Similarly, a condition
23 <= x <= 23 matches 1/5 of a bucket [20,25). To generate the output in file
results.txt, run tests.txt with queries.txt and the dataset.txt data set in
the same directory.
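The two fractions quoted above follow from counting the discrete values in a bucket; a small helper (our formulation of that semantics, with y - x values in a bucket [x, y)) reproduces them:

```java
// Checks the integer-semantics fractions quoted above.
public class OverlapSemanticsSketch {

    /** Fraction of a bucket [x, y) matched by r <= value <= s, counting integers. */
    static double fraction(int r, int s, int x, int y) {
        int overlap = Math.min(s, y - 1) - Math.max(r, x) + 1;
        return overlap <= 0 ? 0.0 : (double) overlap / (y - x);
    }

    public static void main(String[] args) {
        System.out.println(fraction(5, 10, 1, 11));   // prints 0.6  (6 of 10 values)
        System.out.println(fraction(23, 23, 20, 25)); // prints 0.2  (1 of 5 values)
    }
}
```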

You should submit:

- **Code** that implements the functionality described above, supporting the two modes of operation (i.e., "normal" and "verbose") and generally the input and output formats specified above.
- A **README** file explaining your implementation, focusing on how you compute the result-size estimates; specifically, your README file should contain:
  - Your name and your partner's name;
  - A list of all the files that you are submitting;
  - A clear description of how to run your program (note that your project __must__ compile/run under Solaris in your CS account);
  - A clear description of the internal design of your project;
  - Any additional information that you consider significant.
- A **Makefile** to compile your code on a CS machine running Solaris, including what input your program expects and in what format. We should be able to just run "**make**" and generate an executable from your code. This executable should be called **"estimate"** if you use C++, or invoked as **"java Estimate"** if you use Java.

To submit your project, follow these instructions:

- Create a directory named **<your-UNI>-proj1**, where you should replace **<your-UNI>** with the Columbia UNI of one teammate (for example, if the teammate's UNI is **abc123**, then the directory should be named **abc123-proj1**).
- Copy the source code files into the **<your-UNI>-proj1** directory, and include all the other files that are necessary for your program to run.
- Copy your **README** and **Makefile** files into the **<your-UNI>-proj1** directory.
- **Tar** and **gzip** the **<your-UNI>-proj1** directory, to generate a single file **<your-UNI>-proj1.tar.gz**, which is the file that you will submit.
- Login to Courseworks at https://courseworks.columbia.edu/ and select the site for our class.
- Select "Class Files" and then "Post File."
- Enter **"Project 1"** in the "Title" field and select your **<your-UNI>-proj1.tar.gz** file in the "File" field using the "Browse" button.
- Select **folder "Project 1" (under "Shared Folders")** in the "Post File To" field.
- Enter the names of the members of the team and their UNIs in the "Comments" field.
- Hit the "Submit" button.

__IMPORTANT NOTE:__ **You must write your project in Java.** Your project **__must__**
compile and run under Solaris in your CS account. (Use **ssh** to **cluster.cs.columbia.edu** to access
Solaris machines.) We cannot make exceptions to this rule.

**Q: What should my output look like?**

**A:** Check the sample runs available in the "Output format" section
above. Your output should **exactly**
match the format specified in these examples for the "**Normal**" mode. The format of your "**Verbose**"
mode output is up to you, but it must be clearly readable. We strongly suggest
that when printing histograms you use the formats given in the examples above.

**Q: Is error handling required?**

**A:** You don't need to implement sophisticated error handling, but you
should definitely perform basic error checking (such as checking for the correct
number and type of arguments). However, you don't need to do any error checking
for the data sets; you can assume that we will only test your program with valid
data sets.

**Q: What should I do if the range of attribute values and the values for B3 and B4 are so large that my 2-dimensional histogram doesn't fit
in memory?**

**A:** You can assume that we will only test your program with sufficiently
small values for **B3** and **B4** such that this case never occurs.

**Q: How will the project be graded?**

**A: **Your grade will depend on the completeness of your implementation
(whether you have met all of the requirements), the accuracy of your estimates,
and -to a smaller extent- the quality of your README file.

**Q: Will I be graded on the efficiency of my solution?**

**A: **No.

**Q: Should I comment my code?**

**A: **Yes. If we find problems with your solution, we will look at your code
in assigning partial credit. Therefore, it is important that you provide enough
comments so that we can understand your implementation.

**Q: Where can I find a sample Makefile?**

**A:** Here is a sample for **Java**:

    CLASSPATH=.
    JAVAC=javac

    all: Example.class

    %.class: %.java
    	$(JAVAC) -classpath $(CLASSPATH) $<

    clean:
    	rm -f *.class

Good luck!

Kenneth Ross