COMS E6111 Advanced Database Systems
Fall 2015

Project 2

Due Date: Friday, November 6, 5 p.m. EST


You will carry out this project in teams of two. You do not need to notify us now of your team composition. Instead, when you submit your project you should indicate in your submission your team composition, which can be different from that of Project 1. Both students in a team will receive the same grade for Project 2. Please check the Collaboration Policy Web page for important information on what kinds of collaboration are allowed for projects, and how to compute the available number of project grace days for a team.

Important Notes:


This project has to do with Web "databases," and consists of two parts. In Part 1, you will implement the Web database classification algorithm that is described in the ACM TOIS 2003 "QProber" paper by Gravano, Ipeirotis, and Sahami. In Part 2, you will implement a simplified version of the content-summary extraction algorithm that we discussed in class.

Part 1: Web Database Classification

As Part 1 of this project, you will implement the QProber database classification algorithm that we discussed in class, which is described in the ACM TOIS 2003 "QProber: A System for Automatic Classification of Hidden-Web Databases" paper by Gravano, Ipeirotis, and Sahami. (This paper is part of the class reading list.) For your implementation, you should assume that the hierarchical categorization scheme that we will use consists of a root node with three children, "Computers," "Health," and "Sports," and that each of these three categories in turn has two leaf-level children, "Hardware" and "Programming" for "Computers," "Fitness" and "Diseases" for "Health," and "Basketball" and "Soccer" for "Sports." This 2-level categorization scheme is shown below:

Your program should implement the query-probing classification algorithm described in Figure 4 of the QProber paper. You should ignore (i.e., not use) the "confusion matrix adjustment" step. Also, you don't need to do any "training" of your system. Instead, we are providing you with the queries that you should use at each internal node of the categorization scheme. (Just for your information, these queries were derived automatically using the Support Vector Machines (SVM) version of QProber.) You can get the queries by clicking on each category above. For example, the line "Computers avi file" in the file associated with the "Root" category indicates that documents containing both the words "avi" and "file" should be classified under "Computers." Therefore, the number of documents matching query [avi file] in a database should be used towards the computation of the coverage of the database for the "Computers" category.
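As a concrete illustration, the per-node query files can be parsed into a mapping from child category to query list. This is only a sketch, under the assumption that each line has the form "<Category> <word1> <word2> ...", as in the "Computers avi file" example above:

```python
from collections import defaultdict

def load_probes(path):
    """Parse a probe file where each line is '<Category> <word1> <word2> ...'.
    Returns a dict mapping each child category to its list of queries
    (each query is a list of words that must all appear in a document)."""
    probes = defaultdict(list)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:          # skip blank lines
                continue
            category, query_words = parts[0], parts[1:]
            probes[category].append(query_words)
    return probes
```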

To derive the final classification of a database according to the Figure 4 algorithm, your program should receive as input:

  1. The URL of the database to be classified. You can assume that you will always get "http" URLs (i.e., not "https" or "ftp"), and you should omit specifying "http" with the URL (i.e., your input should be just the host name, without the "http://" prefix).
  2. The specificity and coverage thresholds. The specificity threshold t_es is a real number such that 0 <= t_es <= 1, while the coverage threshold t_ec is an integer such that t_ec >= 1.
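Putting these inputs together, the Figure 4 probing strategy can be outlined as a recursive procedure. This is only a sketch, not the assignment's required interface: `get_matches` (a function returning Bing's match count for a query against the host), and the `probes` and `children` dictionaries, are assumed helpers.

```python
def classify(host, node, t_es, t_ec, get_matches, probes, children, parent_spec=1.0):
    """Sketch of the Figure 4 probing algorithm (no confusion-matrix step).
    probes[node] maps each child of node to its list of queries;
    children[node] lists the child categories of node;
    get_matches(host, query) returns Bing's match count (assumed helper)."""
    coverage = {}
    for child in children.get(node, []):
        coverage[child] = sum(get_matches(host, q) for q in probes[node][child])
    total = sum(coverage.values())
    result = []
    for child, cov in coverage.items():
        spec = parent_spec * cov / total if total else 0.0
        if spec >= t_es and cov >= t_ec:
            # Push the database into every child that meets both thresholds.
            result.extend(classify(host, child, t_es, t_ec,
                                   get_matches, probes, children, spec))
    return result or [node]  # the database stays here if no child qualifies
```

Note that a database can be pushed into several children at once, which is why the result is a list of categories.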

You will use the Bing Search API - Web Results Only. In your project, rather than querying each database directly, you will use the Bing API to avoid writing database-specific "wrappers." Specifically, you will use the "Site Restricted Search" feature of the Bing API, so that each query that you send to Bing matches Web pages only from the desired database (or Web site). For example, if you wanted to issue the QUERY [avi file] to a database, you should encode the request URL using your ACCOUNT_KEY and the QUERY, and add the appropriate "site:" restriction for the database to the query, where you replace ":" with "%3a" to comply with Bing's format. The number of matches reported by Bing for this query indicates the number of documents in the Web site that match both the words "avi" and "file".
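The encoding step can be handled by the standard library rather than by hand. A minimal sketch follows; the host "example.org" is just a stand-in, and note that `quote()` emits the uppercase form "%3A", which is an equivalent percent-encoding of ":".

```python
from urllib.parse import quote

def site_restricted_query(query, host):
    """Append a site restriction to the query and percent-encode the result
    for use in a request URL (spaces become '%20', ':' becomes '%3A')."""
    return quote("%s site:%s" % (query, host))
```

For example, `site_restricted_query("avi file", "example.org")` yields `avi%20file%20site%3Aexample.org`.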

IMPORTANT NOTE: If you query Bing using a request URL ending in premiership%27&$top=10&$format=Atom, Bing returns the results for the query [premiership] from the restricted site. However, the results will not include the total number of pages matching the site-restricted query. To get this number, consider instead querying Bing with a request URL ending in $top=10&$format=Atom that returns metadata about the query results, including the total number of matching pages.

For testing, etc. you might want to save/cache locally the results that you get from Bing so that you only send a query to Bing if you haven't issued the exact same query before. This caching is not required, but it would help with the efficiency of your testing process.
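One simple way to implement this optional cache is a local JSON file keyed by the query string; `fetch` below stands in for whatever function actually contacts Bing, and the file name is arbitrary.

```python
import json, os

CACHE_FILE = "bing_cache.json"  # local cache of query -> raw response

def cached_query(query, fetch):
    """Return the cached result for `query`, calling fetch(query) only on a
    cache miss; the cache persists across runs via CACHE_FILE."""
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)
    if query not in cache:
        cache[query] = fetch(query)      # only contact Bing on a miss
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f)
    return cache[query]
```

Rereading and rewriting the whole file per query is inefficient but perfectly adequate for the query volumes in this project.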

Test Cases

So that you can test your program, here are the classifications of some Web sites in the above categorization scheme, as of October 9, 2015 and for t_es = 0.6 and t_ec = 100. The first four cases are examples of good classification decisions by our algorithm, while the last one is not so good:

Root/Health/Fitness
Root
Root/Sports/Soccer
Root/Health
Root/Computers

Note that the classification of a database will vary according to the threshold values that you choose, and that it is perfectly possible to classify a database under more than one category. Also, since these results were computed against the live version of Bing, which gets updated continuously, these results might of course change over time.

Part 2: Distributed Search over Web Databases

As Part 2 of this project, you will write a program to build simple content summaries of Web databases, where a content summary of a database includes a list of words together with their (estimated) document frequency. As we discussed in class, content summaries are used during the database selection step of the distributed search process, to decide what databases to contact to execute a query, and what databases to ignore. You do not have to implement a database selection module, but rather just a program that will produce a content summary for a database, as described below.  (If you are interested in reading some more about all this, you can refer to the ACM TOIS 2008 "Classification-Aware Hidden-Web Text Database Selection" paper by Ipeirotis and Gravano. Reading this paper is strictly optional, and you should not follow the more sophisticated techniques described there to construct content summaries, but rather just follow the procedure below.)

Part 2a: Document Sampling

You will construct the content summary of a database based on a small "sample" of the pages or documents in the database. You will extract the document sample of a database D while you classify it with the procedure of Part 1, as follows:

For each category node C (e.g., "Root") that you "visit" while you classify database D (using the Part 1 procedure):
For each query q associated with this node (e.g., [avi file]), retrieve the top-4 pages returned by Bing for the query. (These are all pages in database D.)
The document sample associated with a category node C and a database D, to which we refer as sample-C-D, is the set of documents retrieved from D as above by queries associated with C, plus the set of documents retrieved from D by queries associated with subcategories of C that were visited during the classification of D. You should eliminate duplicate documents from the document samples. For example, suppose that we classify a database under the "Health" category. During classification, we then visited the "Root" category node and the "Health" category node (and no other category). Therefore, the document sample associated with the "Root" node and the database consists of the top-4 documents returned by Bing for each of the queries [cpu], [java], [module], ..., [pc windows], [acupuncture], ..., [game league], which are associated with the "Root" category, plus the top-4 documents returned by Bing for each of the queries [aids], ..., [aerobic], ..., [exercise weight], which are associated with the "Health" category.
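The sampling step for a single category node can be sketched as follows; `top_urls` (returning the result URLs that Bing reports for a query) is an assumed helper, and representing the sample as a set takes care of the duplicate elimination automatically.

```python
def sample_documents(queries, top_urls):
    """Document sample contributed by one category node: the union of the
    top-4 result URLs of each of its queries (duplicates collapse in the set).
    top_urls(query) -> list of result URLs (assumed helper around Bing)."""
    sample = set()
    for q in queries:
        sample.update(top_urls(q)[:4])   # keep at most the top-4 pages
    return sample

# sample-C-D is then C's own documents plus those of C's visited
# subcategories, e.g. (with hypothetical query lists):
# sample_root = sample_documents(root_queries, top_urls) | \
#               sample_documents(health_queries, top_urls)
```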

Note: Please be nice to the Web sites when retrieving pages. For example, you might want to process (see below) each page that you retrieve before fetching the next page, so that you space out your requests. Alternatively, you can add a line to your program so that it waits for, say, 5 seconds between page requests.

Part 2b: Content Summary Construction

After you have obtained one document sample for each category node under which you classified a given database, you will build a "topic content summary" associated with each such sample. The topic content summary will contain a list of all the words that appear in the associated document sample, together with their document frequency in the sample (i.e., the number of documents in the sample that contain each word). Your program should output each topic content summary to a text file, containing all the words in the sample (in dictionary order) together with their respective document frequencies. (Note that you do not have to implement the absolute-frequency computation strategy that we briefly discussed in class; instead, simply compute the document frequency of a word as the number of documents in the sample that contain that word.)
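The document-frequency computation and sorted output can be sketched as below; each sampled document is represented by its set of words, so that a word is counted at most once per document. The single-separator output format shown here ("word#df") is illustrative only; see the reference-implementation format later in this document.

```python
from collections import Counter

def write_topic_summary(doc_words, out_path):
    """doc_words: one set of words per document in the sample.
    Writes each word with its document frequency, in dictionary order."""
    df = Counter()
    for words in doc_words:
        df.update(set(words))            # each document counts at most once
    with open(out_path, "w") as f:
        for word in sorted(df):          # dictionary order
            f.write("%s#%d\n" % (word, df[word]))
```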

For example, if a site was classified under category "Health," then your program should output two files: one with the topic content summary for the "Root" category, and one with the topic content summary for the "Health" category. The "Root" content summary is based on all the documents retrieved by "Root"-level queries plus all the documents retrieved by "Health"-level queries (see Part 2a), because "Health" is a subcategory of "Root." In contrast, the "Health" content summary is based only on the documents retrieved by "Health"-level queries.

To extract the text of a page, you can use the command "lynx --dump", available on the CS machines. Any part of the text after the "References" line should be ignored. Also, any text within brackets "[....]" should be ignored. Any character not in the English alphabet should be treated as a word separator, and the words are case-insensitive. Alternatively, you can directly use or adapt our Java code, available here. This code converts an HTML document to lowercase, treats any character not in the English alphabet as a word separator, and then returns the set of words that appear in the document.
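These extraction rules can be sketched in a few lines. The sketch assumes `lynx` is installed and that lynx prints the "References" heading on a line of its own (which is how its trailing link list normally appears):

```python
import re, subprocess

def words_from_text(text):
    """Apply the extraction rules: cut at the 'References' line, drop
    bracketed text, lowercase, split on non-alphabetic characters."""
    text = text.split("\nReferences\n")[0]       # ignore trailing link list
    text = re.sub(r"\[[^\]]*\]", " ", text)      # ignore bracketed text
    return set(w for w in re.split(r"[^a-z]+", text.lower()) if w)

def extract_words(url):
    """Run 'lynx -dump' on the page and extract its word set."""
    out = subprocess.run(["lynx", "-dump", url],
                         capture_output=True, text=True).stdout
    return words_from_text(out)
```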

Note: You do not have to output content summaries for level 2 categories (i.e., for leaf categories), as there are no additional queries performed at this level.

What You Should Submit

  1. Your well-commented Java or Python code implementing Parts 1 and 2, which should follow the format of our reference implementation (see below)
  2. A README file including the following information:
    1. Your name and your partner's name and Columbia UNI
    2. A list of all the files that you are submitting
    3. A clear description of how to run your program (note that your project must compile/run under Linux in your CS account)
    4. A clear description of the internal design of your project, for each part of the project
    5. Your Bing account key (so we can test your project)
    6. Any additional information that you consider significant

The README file will determine part of your grade for the project. The rest of the grade will be determined by the accuracy of your implementation of both parts of the project. We will test your project for the test databases above as well as for other (undisclosed) databases.

How to Submit

  1. Create a directory named <your-UNI>-proj2, where you should replace <your-UNI> with the Columbia UNI of one teammate (for example, if the teammate's UNI is abc123, then the directory should be named abc123-proj2).
  2. Copy the source code files into the <your-UNI>-proj2 directory, and include all the other files that are necessary for your program to run.
  3. Copy your README file into the <your-UNI>-proj2 directory.
  4. Tar and gzip the <your-UNI>-proj2 directory, to generate a single file <your-UNI>-proj2.tar.gz, which is the file that you will submit.
  5. Log in to Courseworks and select the site for our class.
  6. Select "Assignments."
  7. Upload your <your-UNI>-proj2.tar.gz file under "Project 2."

Reference Implementation

We have created a reference implementation for this project. To run the reference implementation from a directory where you have write permission, do:

/home/gravano/6111/Html/Proj2/bin/ <BING_ACCOUNT_KEY> <t_es> <t_ec> <host>


The reference implementation will return the classification of the <host> Web site and will create the required content summaries. The output files will be created in the directory from which you run the reference implementation. Each line in a content summary has the following format:

<word or query>#<document frequency in sample>#<number of matches>

Your program does not have to output the <number of matches> field.

Also, note that our reference implementation includes multiple-word entries, corresponding to multiple-word query probes. The line for a multiple-word query in a database content summary has zero as the frequency in the document sample (it is tricky to define "frequency" for multiple words that should not necessarily be considered as a phrase), followed by the number of matches for the query in the database. For example, a content summary line eating protein#0.0#1.0 corresponds to query probe [eating protein] and indicates that this query had exactly one match in the database in question. For your implementation, you have a choice: (a) you can include such multiple-word entries, just as we do in our reference implementation; alternatively, (b) you can decide not to include such multiple-word information in the content summaries. Either choice is fine, but you should specify your choice in your README file.

Luis Gravano