COMS E6111-Advanced Database Systems
Spring 2024

Project 1

Summary of Deadlines

Teams

You will carry out this project in teams of two. If you can't find a teammate, please follow these steps:

You do not need to notify us of your team composition. Instead, you and your teammate will indicate your team composition when you submit your project on Gradescope (click on "Add Group Member" after one of you has submitted your project). You will upload your final electronic submission on Gradescope exactly once per team, rather than once per student.

Important notes:

Project Description

In this project, you will implement an information retrieval system that exploits user-provided relevance feedback to improve the search results returned by Google. The relevance feedback mechanism is described in Singhal: Modern Information Retrieval: A Brief Overview, IEEE Data Engineering Bulletin, 2001, as well as in Chapter 9, “Relevance Feedback & Query Expansion,” of the Manning, Raghavan, and Schütze Introduction to Information Retrieval textbook, available online.

User queries are often ambiguous. For example, a user who issues a query [jaguar] might be after documents about the car or the animal, and in fact search engines like Bing and Google return pages on both topics among their top 10 results for the query. In this project, you will design and implement a query-reformulation system to disambiguate queries and improve the relevance of the query results that are produced. Here’s how your system, which should be written in Python, should work:

  1. Receive as input a user query, which is simply a list of words, and a value—between 0 and 1—for the target “precision@10” (i.e., for the precision that is desired for the top-10 results for the query, which is the fraction of pages that are relevant out of the top-10 results).
  2. Retrieve the top-10 results for the query from Google via the Google Custom Search API (see below), with the default values for the various API parameters left unmodified.
  3. Present these results to the user, so that the user can mark all the webpages that are relevant to the intended meaning of the query among the top-10 results. For each page in the query result, you should display its title, URL, and description returned by Google.
    IMPORTANT NOTE: You should display the exact top-10 results returned by Google for the query (i.e., you cannot add or delete pages in the results that Google returns). Also, the Google Custom Search API has a number of search parameters. Please do not modify the default values for these search parameters.
  4. If the precision@10 of the results from Step 2 for the relevance judgments of Step 3 is greater than or equal to the target value, then stop. If the precision@10 of the results is zero, then you should also stop. Otherwise, use the pages marked as relevant to automatically (i.e., with no further human input at this point) derive new words that are likely to identify more relevant pages. You may introduce at most 2 new words during each round.
    IMPORTANT NOTE 1: You cannot delete any words from the original query or from the query from the previous iteration; you can just add words, up to 2 new words in each round. Also, your queries must consist of just keywords, without any additional operators (e.g., you cannot use negation, quotes, or any other operator in your queries).
    IMPORTANT NOTE 2: The order of the words in the expanded query is important. Your program should automatically consider the alternate ways of ordering the words in a modified query, and pick the order that is estimated to be best. In each iteration, you can reorder all words—new and old—in the query, but you cannot delete any words, as explained in the note above.
  5. Modify the current user query by adding to it the newly derived words and ordering all words in the best possible order, as determined in Step 4, and go to Step 2.
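The control flow of Steps 1–5 can be sketched as follows. This is only an outline, not a prescribed implementation; `fetch_top10`, `ask_user`, and `expand_query` are hypothetical stand-ins for your own implementations of Steps 2, 3, and 4:

```python
def precision_at_10(relevance_flags):
    """Fraction of the returned pages that the user marked relevant."""
    return sum(relevance_flags) / len(relevance_flags)

def feedback_loop(query, target, fetch_top10, ask_user, expand_query):
    """Run Steps 2-5 until the target precision@10 is met or no page is relevant."""
    while True:
        results = fetch_top10(query)      # Step 2: exact top-10 from Google
        flags = ask_user(results)         # Step 3: one True/False judgment per page
        p10 = precision_at_10(flags)
        if p10 >= target or p10 == 0.0:   # Step 4: both stopping conditions
            return query, p10
        # Steps 4-5: add at most 2 new words and reorder, then re-query
        query = expand_query(query, results, flags)
```

Note that the loop never deletes words: `expand_query` receives the full current query and must return a superset of its words, possibly reordered.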

The key challenge in the project is in designing Step 4, for which you should be creative and use the ideas that we discussed in class—as well as the above bibliography and the course reading materials—as inspiration. You are welcome to borrow techniques from the research literature at large (either exactly as published or modified as much as you feel necessary to get good performance in our particular query setting), but make sure that you cite the specific publications on which you based your solution. As a hint on how to search for relevant publications, you might want to check papers on “query expansion” in the main IR conference, SIGIR, at https://dblp.uni-trier.de/db/conf/sigir/index.html. If you choose to implement a technique from the literature, you still need to make sure that you adapt the chosen technique as much as necessary so that it works well for our specific query setting and scenario, since you will be graded based on how well your technique works. If you want to do stopword elimination (this is of course optional), you can find a list of stopwords here.
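One classical starting point, covered in the Singhal overview and the Manning et al. chapter cited above, is Rocchio-style relevance feedback: score candidate terms by how much more weight they carry in the relevant documents than in the non-relevant ones, and add the highest-scoring new terms. The sketch below is one possible adaptation, not a prescribed solution; for simplicity it uses raw term frequencies over the tokenized titles and descriptions rather than full tf-idf vectors, and the function name and parameters are illustrative:

```python
from collections import Counter

def rocchio_candidates(query, rel_docs, nonrel_docs,
                       alpha=1.0, beta=0.75, gamma=0.15):
    """Score terms Rocchio-style; return candidate new words, best first.

    rel_docs / nonrel_docs: lists of documents, each a list of lowercase
    tokens (e.g., tokenized titles + descriptions of the Google results).
    """
    query_terms = set(query.lower().split())
    scores = Counter()
    for term in query_terms:              # weight for the original query
        scores[term] += alpha
    for doc in rel_docs:                  # pull toward relevant pages
        for term, count in Counter(doc).items():
            scores[term] += beta * count / len(rel_docs)
    for doc in nonrel_docs:               # push away from non-relevant pages
        for term, count in Counter(doc).items():
            scores[term] -= gamma * count / len(nonrel_docs)
    # Keep only genuinely new, positively scored words, best first.
    return [t for t, s in scores.most_common()
            if t not in query_terms and s > 0]
```

In Step 4 you would take at most the first two words of this list. A stronger implementation would also incorporate idf weights and stopword removal, and would need a separate strategy for ordering the words of the expanded query (e.g., scoring candidate orders by how often each word pair appears adjacent in the relevant documents).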

You will use the Google Custom Search API (https://developers.google.com/custom-search/) in this project: this is Google’s web service to enable the creation of customized search engines. Furthermore, the code that you submit for your project must run on the Google Cloud, so it is a good idea to develop your code on a VM on the Google Cloud from the very beginning (see below), rather than writing it on a different platform and then adapting it to the Google Cloud for submission.

As a first step to develop your project, you should set up your Google Cloud account carefully following our instructions provided here. Our instructions also explain how you should set up a VM on the cloud, to develop and run your project. Please make sure that you do all this over your Lionmail account, not your personal Gmail account.

As a second step, you will have to sign up for the Programmable Search Engine service (https://programmablesearchengine.google.com/about/):

  1. Log off from all Gmail/Google accounts and then log on to only your Lionmail account. (Google doesn't let you switch between accounts when you are setting up a Google Programmable Search Engine service.)
  2. Press the "Get Started" button on the top right corner.
  3. On the "Create a new search engine" webpage, do the following:
    • For “Name of search engine,” enter "cs6111"
    • For “What to search?,” select "Search the entire web"
    • Check the "I'm Not a Robot" checkbox
    • Leave other settings unchecked
  4. Press the “CREATE” button.
  5. Copy your search engine ID. This is the 17-character alphanumeric value after "cx=" in the code box that pops up after creating your new search engine. There is no need to copy the entire code snippet.
  6. Do not modify or change other settings.
  7. Check the Google Custom Search JSON API documentation and obtain a JSON API key by clicking on "Get a Key"; you will have to select the Google Cloud project that you have already created using our instructions above.

You will use your search engine ID, your JSON API key, and a query as parameters to encode a search request URL. When requested from a web browser, or from inside a program, this URL will return a document with the query results. Please refer to the Google Custom Search JSON API documentation for details on the URL syntax and document schema. You should parse the response document in your program to extract the title, link, and description of each query result, so you can use this information in your algorithm. Here is a Python example of using the Google Custom Search API that should be helpful: example (note that q refers to your query, developerKey refers to your Google Custom Search API key, and cx refers to your search engine ID).
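Per the JSON API documentation, each entry in the response's `items` array carries `title`, `link`, and `snippet` fields. A small helper to pull out just those fields might look like the following (the sample dictionary is illustrative, not a real API response):

```python
def extract_results(response):
    """Return (title, url, description) triples from a Custom Search JSON response."""
    return [(item.get("title", ""), item.get("link", ""), item.get("snippet", ""))
            for item in response.get("items", [])]

# Illustrative shape of the parsed JSON response document:
sample = {"items": [{"title": "Per Se | Thomas Keller",
                     "link": "https://www.thomaskeller.com/perseny",
                     "snippet": "Per Se is a New American restaurant..."}]}
```

Using `.get` with defaults keeps the parser from crashing on results that omit a field, and on responses with no `items` at all (which can happen for very narrow expanded queries).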

By default, the Google Custom Search JSON API has a quota of 100 queries per day for free. Additional requests cost $5 per 1,000 queries, which will be deducted from the coupon credit that Columbia provided to you (see above). Please refer to the JSON API documentation for additional details, which you should check carefully to avoid billing-related surprises.
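As a quick sanity check on spending, under the quota and pricing quoted above (100 free queries per day, then $5 per 1,000):

```python
def daily_cost_usd(num_queries, free_quota=100, rate_per_1000=5.0):
    """Estimated charge for one day's API usage beyond the free quota."""
    return max(0, num_queries - free_quota) * rate_per_1000 / 1000
```

For example, 300 queries in one day would cost about $1.00, so routine debugging runs are unlikely to make a dent in the coupon credit.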

Test Cases

Your submission (see below) should include a transcript of the runs of your program on the following queries, with a goal of achieving a value of 0.9 for precision@10:

  1. Look for information on the Per Se restaurant in New York City, starting with the query [per se].
  2. Look for information on 23andMe cofounder Anne Wojcicki, starting with the query [wojcicki].
  3. Look for information on COVID-19 cases, starting with the query [cases].

We will check the execution of your program on these three cases, as well as on some other queries.

What to Submit and When

Your Project 1 submission will consist of the following three components, which you should submit on Gradescope by Monday, February 19, at 5 p.m. ET:

To submit your project, please follow these steps:

  1. Create a directory named proj1.
  2. Copy the source code files into the proj1 directory, and include all the other files that are necessary for your program to run.
  3. Tar and gzip the proj1 directory, to generate a single file proj1.tar.gz.
  4. Submit on Gradescope exactly three files:
    • Your proj1.tar.gz file with your code,
    • Your uncompressed README file (a PDF file is preferred), and
    • Your uncompressed query transcript file.

Reference Implementation for Project 1

We have created a reference implementation for this project. To run the reference implementation, ssh as "guest-user" to the VM running at 104.196.182.206 (i.e., open a terminal and type "ssh guest-user@104.196.182.206"). Use the password for this VM that Lara included in the email to you with your Google coupon code. After you have logged into the guest-user account, run the following command:

/home/gkaraman/run <google api key> <google engine id> <precision> <query>

where:

The reference implementation is interactive. Please adhere to the format of the relevance feedback session for your submission and your transcript file.

Also, you can use this reference implementation to give you an idea of how good your algorithm should be. Ideally, the performance of your own algorithm, in terms of the number of iterations that the algorithm takes to achieve a given precision value for a query, should be at least as good as that of our reference implementation.

Project Policies on Usage of AI Tools

To maintain the integrity of our learning process and ensure that the core objectives of the course are met by all students, we have established project-specific guidelines for using ChatGPT, Bard, and any other AI tools for our course projects. These guidelines are crafted to encourage independent problem-solving, hands-on engagement with the course material, and adherence to academic integrity.

Please read the guidelines carefully. Importantly, for the "usage allowed" cases below, please document extensively how you used any of these AI tools, including the exact "prompts" that you worked with, namely, the questions or statements you input to the AI tools. Please document this information clearly and comprehensively in the README file that you will include with the project submission. Thorough documentation is a key professional skill, and one that helps maintain transparency and proper attribution in the project development.

Greenlight: Usage Allowed

Redlight: Usage Not Allowed

Our understanding of AI tools is evolving constantly, and it is important that you seek clarification proactively if any of these guidelines are unclear or if you have any doubts. Also, your insights and inquiries are invaluable. So if you have any questions or comments about any of the above policies, please either post them to the class discussion board or contact a TA or the professor. Overall, we encourage you to approach these AI tools with a mindset of integrity and curiosity, and not to hesitate to seek clarification. This should ensure that your educational journey is both effective and ethically sound.

Hints and Additional Important Notes

Grading for Project 1

Your grade will be based on the effectiveness of your query-modification method (which, in turn, will be reflected in the number of iterations that your system takes to achieve the target precision, both for the test cases and for other unseen queries that we will use for grading), the quality of your code, and the quality of the README file. We will not grade your project in terms of efficiency.