COMS E6111 Advanced Database Systems
Fall 2017

Project 1

Due Date: Friday, October 6, 5:00 p.m. ET

Teams

You will carry out this project in teams of two. If you can't find a teammate, please follow these steps:


You do not need to notify us of your team composition. Instead, you and your teammate should assign yourselves to an available "Project 1 Group" on CourseWorks. For this, go to the "People" section of CourseWorks for our course. Please join one of the already-created "Project 1 Group n" groups. Please do not create your own group. (We have already created sufficiently many such groups to accommodate all students in the course.) You will then upload your final electronic submission on CourseWorks exactly once per team, rather than once per student.
Here are some additional important points:

Project Description

In this project, you will implement an information retrieval system that exploits user-provided relevance feedback to improve the search results returned by Google. The relevance feedback mechanism is described in Singhal: Modern Information Retrieval: A Brief Overview, IEEE Data Engineering Bulletin, 2001, as well as in Chapter 9, “Relevance Feedback & Query Expansion,” of the Manning, Raghavan, and Schütze Introduction to Information Retrieval textbook, available online.

User queries are often ambiguous. For example, a user who issues a query [jaguar] might be after documents about the car or the animal, and in fact search engines like Bing and Google return pages on both topics among their top 10 results for the query. In this project, you will design and implement a query-reformulation system to disambiguate queries and improve the relevance of the query results that are produced. Here’s how your system, which should be written in Java or Python (your choice), should work:

  1. Receive as input a user query, which is simply a list of words, and a value—between 0 and 1—for the target “precision@10” (i.e., for the precision that is desired for the top-10 results for the query, which is the fraction of pages that are relevant out of the top-10 results).
  2. Retrieve the top-10 results for the query from Google, using the Google Custom Search API (see below) with the default values for all API parameters.
  3. Present these results to the user, so that the user can mark all the web pages that are relevant to the intended meaning of the query among the top-10 results. For each page in the query result, you should display its title, URL, and description returned by Google.

    IMPORTANT NOTE:
    You should display the exact top-10 results returned by Google for the query (i.e., you cannot add or delete pages in the results that Google returns). Also, the Google Custom Search API has a number of search parameters. Please do not modify the default values for these search parameters.

  4. If the precision@10 of the results from Step 2 for the relevance judgments of Step 3 is greater than or equal to the target value, then stop. If the precision@10 of the results is zero, then you should also stop. Otherwise, use the pages marked as relevant to automatically (i.e., with no further human input at this point) derive new words that are likely to identify more relevant pages. You may introduce at most 2 new words during each round.

    IMPORTANT NOTE 1: You cannot delete any words from the original query or from the query from the previous iteration; you can just add words, up to 2 new words in each round. Also, your queries must consist of just keywords, without any additional operators (e.g., you cannot use negation, quotes, or any other operator in your queries).

    IMPORTANT NOTE 2:
    The order of the words in the expanded query is important. Your program should automatically consider the alternate ways of ordering the words in a modified query, and pick the order that is estimated to be best. In each iteration, you can reorder all words--new and old--in the query, but you cannot delete any words, as explained in the note above.

  5. Modify the current user query by adding to it the newly derived words and ordering all words in the best possible order, as determined in Step 4, and go to Step 2.
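Steps 1 through 5 can be sketched as a simple loop. This is only a minimal outline under stated assumptions, not the reference implementation: the `search`, `judge`, and `expand` callables and the `max_rounds` safety cap are hypothetical names introduced here for illustration, and precision is computed over however many results are actually shown.

```python
from typing import Callable, List


def precision_at_k(relevant_flags: List[bool]) -> float:
    """Fraction of the displayed results that the user marked as relevant."""
    if not relevant_flags:
        return 0.0
    return sum(relevant_flags) / len(relevant_flags)


def feedback_loop(
    query: List[str],
    target: float,
    search: Callable[[List[str]], List[dict]],   # hypothetical: returns Google's results
    judge: Callable[[dict], bool],               # hypothetical: asks the user about one result
    expand: Callable[[List[str], List[dict], List[dict]], List[str]],  # hypothetical: Step 4
    max_rounds: int = 10,                        # assumption: a safety cap, not in the spec
) -> List[str]:
    """Run Steps 2-5 until the target precision is met or precision is zero."""
    for _ in range(max_rounds):
        results = search(query)[:10]             # Step 2: top-10 results, unmodified
        flags = [judge(r) for r in results]      # Step 3: relevance feedback
        precision = precision_at_k(flags)
        if precision >= target or precision == 0.0:
            break                                # Step 4: both stopping conditions
        relevant = [r for r, f in zip(results, flags) if f]
        non_relevant = [r for r, f in zip(results, flags) if not f]
        new_query = expand(query, relevant, non_relevant)
        # Enforce the rules of Step 4: keep every old word, add at most 2 new ones.
        added = [w for w in new_query if w not in query]
        if len(added) > 2 or any(w not in new_query for w in query):
            raise ValueError("expansion must keep all old words and add at most 2")
        query = new_query                        # Step 5: go back to Step 2
    return query
```

Passing the three callables in as arguments keeps the loop independent of how results are fetched and how the user is prompted, which makes the stopping logic easy to test without network access.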

The key challenge in the project is in designing Step 4, for which you should be creative and use the ideas that we discussed in class—as well as the above bibliography and the course reading materials—as inspiration. You are welcome to borrow techniques from the research literature at large (either exactly as published or modified as much as you feel necessary to get good performance in our particular query setting), but make sure that you cite the specific publications on which you based your solution. As a hint on how to search for relevant publications, you might want to check papers on “query expansion” in the main IR conference, SIGIR, at http://www.informatik.uni-trier.de/~ley/db/conf/sigir/index.html. If you choose to implement a technique from the literature, you still need to make sure that you adapt the chosen technique as much as necessary so that it works well for our specific query setting and scenario, since you will be graded based on how well your technique works. If you want to do stopword elimination (this is of course optional), you can find a list of stopwords here.
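As one possible starting point for Step 4, the sketch below scores candidate words in a Rocchio-style manner (rewarding terms from relevant results, penalizing terms from non-relevant ones) and then picks a word order by counting adjacent word pairs in the relevant text. It is an illustration only, not the method used by the reference implementation: the function names, the tiny stopword set, the raw term-frequency weights, and the `beta` and `gamma` values are all assumptions, and a stronger solution would likely need better weighting.

```python
import re
from collections import Counter
from itertools import permutations
from typing import List

# Assumption: a tiny illustrative stopword set; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}


def tokenize(text: str) -> List[str]:
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def pick_new_words(query: List[str], relevant_docs: List[str],
                   non_relevant_docs: List[str], beta: float = 0.75,
                   gamma: float = 0.15, max_new: int = 2) -> List[str]:
    """Return up to max_new words with the highest Rocchio-style score."""
    scores: Counter = Counter()
    for doc in relevant_docs:
        for term, tf in Counter(tokenize(doc)).items():
            scores[term] += beta * tf / len(relevant_docs)
    for doc in non_relevant_docs:
        for term, tf in Counter(tokenize(doc)).items():
            scores[term] -= gamma * tf / len(non_relevant_docs)
    current = {w.lower() for w in query}
    candidates = [(s, t) for t, s in scores.items() if t not in current and s > 0]
    candidates.sort(key=lambda st: (-st[0], st[1]))  # best score first, ties by word
    return [t for _, t in candidates[:max_new]]


def best_order(words: List[str], relevant_docs: List[str]) -> List[str]:
    """Pick the permutation whose adjacent pairs occur most often in relevant text."""
    bigrams: Counter = Counter()
    for doc in relevant_docs:
        toks = tokenize(doc)
        bigrams.update(zip(toks, toks[1:]))

    def score(order) -> int:
        return sum(bigrams[(a.lower(), b.lower())] for a, b in zip(order, order[1:]))

    # Trying every permutation is factorial in query length; fine for short queries only.
    return list(max(permutations(words), key=score))
```

Here each "document" is just a string, for example the title and description of a result concatenated. On ties, `max` keeps the first permutation it sees, which is the input order.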

You will use the Google Custom Search API (https://developers.google.com/custom-search/) in this project: this is Google’s web service to enable the creation of customized search engines. As a first step to use the Google Custom Search API, you will have to sign up for a Computer Science Cloud account, carefully following the instructions provided here. The document also explains how you should set up a VM on the cloud to develop and run your project.

As a second step, you will have to sign up for the Custom Search Engine service (https://cse.google.com/cse/):

  1. Press the "Sign in to Custom Search Engine" button on the top right corner.
  2. Create a new search engine by clicking the “New search engine” button on the top left corner.
  3. Specify the following field values:
  4. Press the “CREATE” button.
  5. Select “Edit search engine” on the left, choose search engine “cs6111,” and click "Setup."
  6. Under the top “Basics" button, select “Sites to search” and choose “Search the entire web but emphasize included sites.”
  7. Right below choose the "www.wikipedia.com" site, press the “Delete” button, and finally press the “Update” button at the bottom of the page. This will enable the creation of a search engine to search the entire web but without an emphasis on any particular website (i.e., you will be using the general Google search engine).
  8. Next to the “Details” label, click on “Search engine ID” to get your search engine key, which you will need for querying.
  9. Do not modify or change other settings.
  10. Google provides two APIs for the Google Custom Search Engine (https://developers.google.com/custom-search/docs/overview), namely, the JSON/ATOM API and the XML API. If you choose to use the JSON/ATOM API, you will additionally need to obtain a Google Custom Search JSON/ATOM API key at https://developers.google.com/custom-search/json-api/v1/overview by clicking on "GET A KEY" at the bottom of the page. The XML API does not require a key.

You will use your engine key, optionally your JSON/ATOM API key, and a query as parameters to encode a search request URL. When requested from a web browser, or from inside a program, this URL will return a document with the query results. Please refer to the Google Custom Search APIs documentation for details on the URL syntax and document schema. You should parse the response document in your program to extract the title, link, and description of each query result, so you can use this information in your algorithm. Here are examples of use of the Google Custom Search API that should be helpful: Java version, Python version (in the Python example, note that q refers to your query, developerKey refers to your Google Custom Search API key, and cx refers to your search engine key).
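For the JSON API, building the request URL and extracting the three fields could look roughly like the sketch below. It uses only the Python standard library rather than the client library from the linked Python example; the function names are hypothetical, and the code assumes the JSON response carries the results in an `items` list whose entries have `title`, `link`, and `snippet` fields. Only `key`, `cx`, and `q` are sent, so all other API parameters keep their default values.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key: str, engine_id: str, query_words: List[str]) -> str:
    """Encode the API key, the engine key (cx), and the query into a request URL."""
    params = {"key": api_key, "cx": engine_id, "q": " ".join(query_words)}
    return API_ENDPOINT + "?" + urlencode(params)


def parse_results(response: Dict) -> List[Dict[str, str]]:
    """Extract title, URL, and description from a decoded JSON response."""
    results = []
    for item in response.get("items", []):
        results.append({
            "title": item.get("title", ""),
            "url": item.get("link", ""),
            "description": item.get("snippet", ""),
        })
    return results


def search(api_key: str, engine_id: str, query_words: List[str]) -> List[Dict[str, str]]:
    """Issue one query (this counts against the daily quota) and parse the response."""
    with urlopen(build_search_url(api_key, engine_id, query_words)) as resp:
        return parse_results(json.loads(resp.read().decode("utf-8")))
```

Keeping `parse_results` separate from the network call makes it possible to develop and test the rest of the system against a saved response without spending quota.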

By default, the Google Custom Search API has a quota of 100 queries per day. If you exceed this quota, you can upgrade to 1000 queries per day for one month for $5 (i.e., $5 for the full month), which will be deducted from the coupon credit that Columbia provided (see above).

Test Cases

Your submission (see below) should include a transcript of the runs of your program on the following queries, with a goal of achieving a value of 0.9 for precision@10:

1.     Look for information on the Per Se restaurant in New York City, starting with the query [per se].

2.     Look for information on Google cofounder Sergey Brin, starting with the query [brin].

3.     Look for information on the animal jaguar, starting with the query [jaguar].

We will check the execution of your program on these three cases, as well as on some other queries. 

What You Should Submit

  1. Your well-commented Java or Python code, which should follow the format of our reference implementation (see below)

  2. A README file including the following information:

a)     Your group name on CourseWorks ("Project 1 Group n"), your name and Columbia UNI, and your partner's name and Columbia UNI

b)     A list of all the files that you are submitting

c)     A clear description of how to run your program. Note that your project must compile/run under Ubuntu in a Google Cloud VM. Provide all commands necessary to install the required software and dependencies for your program.

d)     A clear description of the internal design of your project

e)     A detailed description of your query-modification method (this is the core component of the project; see below)

f)      Your Google Custom Search Engine API Key and Engine ID (so we can test your project)

g)     Any additional information that you consider significant

  3. A transcript of the runs of your program on the 3 test cases above, with relevant results clearly marked, and with the rephrased query and precision@10 value for each run. The format of your transcript should closely follow the format of the interactive session of our reference implementation (see below).

Your grade will be based on the effectiveness of your query modification method (which, in turn, will be reflected in the number of iterations that your system takes to achieve the target precision, both for the test cases and for other unseen queries that we will use for grading), the quality of your code, and the quality of the README file.

How to Submit

  1. Create a directory named <groupn>-proj1, where you should replace <groupn> with your Project 1 Group as specified on CourseWorks (for example, if your group is "Project 1 Group 9," then the directory should be named group9-proj1). 
  2. Copy the source code files into the <groupn>-proj1 directory, and include all the other files that are necessary for your program to run.
  3. Tar and gzip the <groupn>-proj1 directory, to generate a single file <groupn>-proj1.tar.gz, which is the file that you will submit.
  4. Log in to CourseWorks at https://courseworks2.columbia.edu/ and select the site for our class. To submit this file, you need to be in the Class view (not the Group view) and then upload your file to the "Project 1" assignment under Assignments. Submit file <groupn>-proj1.tar.gz.
  5. Separately, submit your uncompressed README file as well as your uncompressed query transcript file, as two separate files.

In summary, you need to submit on CourseWorks exactly three files: (1) your <groupn>-proj1.tar.gz file with your code, (2) your uncompressed README file, and (3) your uncompressed query transcript file. You should submit these materials as a team (not once per student).
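If you prefer to script the packaging step, the sketch below builds the archive with Python's standard `tarfile` module. The helper name `package_submission` and its parameters are hypothetical; the directory and archive names follow the naming rule above.

```python
import tarfile
from pathlib import Path


def package_submission(group_number: int, parent_dir: str = ".") -> Path:
    """Tar and gzip <parent_dir>/group<n>-proj1 into group<n>-proj1.tar.gz."""
    name = f"group{group_number}-proj1"
    src = Path(parent_dir) / name
    if not src.is_dir():
        raise FileNotFoundError(f"expected a directory at {src}")
    archive = Path(parent_dir) / f"{name}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the project directory.
        tar.add(str(src), arcname=name)
    return archive
```

For "Project 1 Group 9," calling `package_submission(9)` from the parent directory has the same effect as running `tar -czf group9-proj1.tar.gz group9-proj1` in a terminal.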

Reference Implementation

We created a reference implementation for this project. To run the reference implementation, ssh as "guest-user" to the VM running at 35.196.34.40 (i.e., open a terminal and type "ssh guest-user@35.196.34.40"). Use the password for this VM that Min included in the email to you with your Google coupon code. After you have logged into the guest-user account, run the following from the home directory (i.e., from /home/guest-user):

/home/paparrizos/run <google api key> <google engine id> <precision> <query>

where:

The reference implementation is interactive. Please adhere to the format of its relevance feedback session in your submission and your transcript file.

Also, you can use this reference implementation to give you an idea of how good your algorithm should be. Ideally, the performance of your own algorithm, in terms of the number of iterations that the algorithm takes to achieve a given precision value for a query, should be at least as good as that of our reference implementation.

Hints and Additional Important Notes