COMS E6111-Advanced Database Systems
Spring 2024

Project 1

Summary of Deadlines

Teams

You will carry out this project in teams of two. If you can't find a teammate, please follow these steps:

You do not need to notify us of your team composition. Instead, you and your teammate will indicate your team composition when you submit your project on Gradescope (click on "Add Group Member" after one of you has submitted your project). You will upload your final electronic submission on Gradescope exactly once per team, rather than once per student.

Important notes:

Project Description

In this project, you will implement an information retrieval system that exploits user-provided relevance feedback to improve the search results returned by Google. The relevance feedback mechanism is described in Singhal: Modern Information Retrieval: A Brief Overview, IEEE Data Engineering Bulletin, 2001, as well as in Chapter 9, “Relevance Feedback & Query Expansion,” of the Manning, Raghavan, and Schütze Introduction to Information Retrieval textbook, available online.

User queries are often ambiguous. For example, a user who issues a query [jaguar] might be after documents about the car or the animal, and in fact search engines like Bing and Google return pages on both topics among their top 10 results for the query. In this project, you will design and implement a query-reformulation system to disambiguate queries and improve the relevance of the query results that are produced. Here’s how your system, which should be written in Python, should work:

  1. Receive as input a user query, which is simply a list of words, and a value—between 0 and 1—for the target “precision@10” (i.e., for the precision that is desired for the top-10 results for the query, which is the fraction of pages that are relevant out of the top-10 results).
  2. Retrieve the top-10 results for the query from Google via the Google Custom Search API (see below), with the default values for the various API parameters left unmodified.
  3. Present these results to the user, so that the user can mark all the webpages that are relevant to the intended meaning of the query among the top-10 results. For each page in the query result, you should display its title, URL, and description returned by Google.
    IMPORTANT NOTE: You should display the exact top-10 results returned by Google for the query (i.e., you cannot add or delete pages in the results that Google returns). Also, the Google Custom Search API has a number of search parameters. Please do not modify the default values for these search parameters.
  4. If the precision@10 of the results from Step 2 for the relevance judgments of Step 3 is greater than or equal to the target value, then stop. If the precision@10 of the results is zero, then you should also stop. Otherwise, use the pages marked as relevant to automatically (i.e., with no further human input at this point) derive new words that are likely to identify more relevant pages. You may introduce at most 2 new words during each round.
    IMPORTANT NOTE 1: You cannot delete any words from the original query or from the query from the previous iteration; you can just add words, up to 2 new words in each round. Also, your queries must consist of just keywords, without any additional operators (e.g., you cannot use negation, quotes, or any other operator in your queries).
    IMPORTANT NOTE 2: The order of the words in the expanded query is important. Your program should automatically consider the alternate ways of ordering the words in a modified query, and pick the order that is estimated to be best. In each iteration, you can reorder all words—new and old—in the query, but you cannot delete any words, as explained in the note above.
  5. Modify the current user query by adding to it the newly derived words and ordering all words in the best possible order, as determined in Step 4, and go to Step 2.
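The control flow of Steps 1–5 can be sketched as follows. This is only an outline, not a prescribed implementation; `fetch_top10`, `ask_user`, and `expand_query` are hypothetical stand-ins for your own implementations of Steps 2, 3, and 4:

```python
def precision_at_10(relevance_flags):
    """Fraction of the returned pages that the user marked relevant."""
    return sum(relevance_flags) / len(relevance_flags)

def feedback_loop(query, target, fetch_top10, ask_user, expand_query):
    """Run Steps 2-5 until the target precision@10 is met or no page is relevant."""
    while True:
        results = fetch_top10(query)      # Step 2: exact top-10 from Google
        flags = ask_user(results)         # Step 3: one True/False judgment per page
        p10 = precision_at_10(flags)
        if p10 >= target or p10 == 0.0:   # Step 4: both stopping conditions
            return query, p10
        # Steps 4-5: add at most 2 new words and reorder, then re-query
        query = expand_query(query, results, flags)
```

Note that the loop never deletes words: `expand_query` receives the full current query and must return a superset of its words, possibly reordered.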

The key challenge in the project is in designing Step 4, for which you should be creative and use the ideas that we discussed in class—as well as the above bibliography and the course reading materials—as inspiration. You are welcome to borrow techniques from the research literature at large (either exactly as published or modified as much as you feel necessary to get good performance in our particular query setting), but make sure that you cite the specific publications on which you based your solution. As a hint on how to search for relevant publications, you might want to check papers on “query expansion” in the main IR conference, SIGIR, at https://dblp.uni-trier.de/db/conf/sigir/index.html. If you choose to implement a technique from the literature, you still need to make sure that you adapt the chosen technique as much as necessary so that it works well for our specific query setting and scenario, since you will be graded based on how well your technique works. If you want to do stopword elimination (this is of course optional), you can find a list of stopwords here.
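One classical starting point, covered in the Singhal overview and the Manning et al. chapter cited above, is Rocchio-style relevance feedback: score candidate terms by how much more weight they carry in the relevant documents than in the non-relevant ones, and add the highest-scoring new terms. The sketch below is one possible adaptation, not a prescribed solution; for simplicity it uses raw term frequencies over the tokenized titles and descriptions rather than full tf-idf vectors, and the function name and parameters are illustrative:

```python
from collections import Counter

def rocchio_candidates(query, rel_docs, nonrel_docs,
                       alpha=1.0, beta=0.75, gamma=0.15):
    """Score terms Rocchio-style; return candidate new words, best first.

    rel_docs / nonrel_docs: lists of documents, each a list of lowercase
    tokens (e.g., tokenized titles + descriptions of the Google results).
    """
    query_terms = set(query.lower().split())
    scores = Counter()
    for term in query_terms:              # weight for the original query
        scores[term] += alpha
    for doc in rel_docs:                  # pull toward relevant pages
        for term, count in Counter(doc).items():
            scores[term] += beta * count / len(rel_docs)
    for doc in nonrel_docs:               # push away from non-relevant pages
        for term, count in Counter(doc).items():
            scores[term] -= gamma * count / len(nonrel_docs)
    # Keep only genuinely new, positively scored words, best first.
    return [t for t, s in scores.most_common()
            if t not in query_terms and s > 0]
```

In Step 4 you would take at most the first two words of this list. A stronger implementation would also incorporate idf weights and stopword removal, and would need a separate strategy for ordering the words of the expanded query (e.g., scoring candidate orders by how often each word pair appears adjacent in the relevant documents).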

You will use the Google Custom Search API (https://developers.google.com/custom-search/) in this project: this is Google’s web service to enable the creation of customized search engines. Furthermore, the code that you submit for your project must run on the Google Cloud, so it is a good idea to develop your code on a VM on the Google Cloud from the very beginning (see below), rather than writing it on a different platform and then adapting it to the Google Cloud for submission.

As a first step to develop your project, you should set up your Google Cloud account carefully following our instructions provided here. Our instructions also explain how you should set up a VM on the cloud, to develop and run your project. Please make sure that you do all this over your Lionmail account, not your personal Gmail account.

As a second step, you will have to sign up for the Programmable Search Engine service (https://programmablesearchengine.google.com/about/):

  1. Log off from all Gmail/Google accounts and then log on to only your Lionmail account. (Google doesn't let you switch between accounts when you are setting up a Google Programmable Search Engine service.)
  2. Press the "Get Started" button on the top right corner.
  3. On the "Create a new search engine" webpage, do the following:
    • For “Name of search engine,” enter "cs6111"
    • For “What to search?,” select "Search the entire web"
    • Check the "I'm Not a Robot" checkbox
    • Leave other settings unchecked
  4. Press the “CREATE” button.
  5. Copy your search engine ID. This is the 17-character alphanumeric value after "cx=" in the code box that pops up after creating your new search engine. There is no need to copy the entire code snippet.
  6. Do not modify or change other settings.
  7. Check the Google Custom Search JSON API documentation and obtain a JSON API key by clicking on "Get a Key"; you will have to select the Google Cloud project that you have already created using our instructions above.

You will use your search engine ID, your JSON API key, and a query as parameters to encode a search request URL. When requested from a web browser, or from inside a program, this URL will return a document with the query results. Please refer to the Google Custom Search JSON API documentation for details on the URL syntax and document schema. You should parse the response document in your program to extract the title, link, and description of each query result, so you can use this information in your algorithm. Here is a Python example of using the Google Custom Search API that should be helpful: example (note that q refers to your query, developerKey refers to your Google Custom Search API key, and cx refers to your search engine ID).
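Per the JSON API documentation, each entry in the response's `items` array carries `title`, `link`, and `snippet` fields. A small helper to pull out just those fields might look like the following (the sample dictionary is illustrative, not a real API response):

```python
def extract_results(response):
    """Return (title, url, description) triples from a Custom Search JSON response."""
    return [(item.get("title", ""), item.get("link", ""), item.get("snippet", ""))
            for item in response.get("items", [])]

# Illustrative shape of the parsed JSON response document:
sample = {"items": [{"title": "Per Se | Thomas Keller",
                     "link": "https://www.thomaskeller.com/perseny",
                     "snippet": "Per Se is a New American restaurant..."}]}
```

Using `.get` with defaults keeps the parser from crashing on results that omit a field, and on responses with no `items` at all (which can happen for very narrow expanded queries).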

By default, the Google Custom Search JSON API has a quota of 100 queries per day for free. Additional requests cost $5 per 1,000 queries, which will be deducted from the coupon credit that Columbia provided to you (see above). Please refer to the JSON API documentation for additional details, which you should check carefully to avoid billing-related surprises.
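As a quick sanity check on spending, under the quota and pricing quoted above (100 free queries per day, then $5 per 1,000):

```python
def daily_cost_usd(num_queries, free_quota=100, rate_per_1000=5.0):
    """Estimated charge for one day's API usage beyond the free quota."""
    return max(0, num_queries - free_quota) * rate_per_1000 / 1000
```

For example, 300 queries in one day would cost about $1.00, so routine debugging runs are unlikely to make a dent in the coupon credit.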

Test Cases

Your submission (see below) should include a transcript of the runs of your program on the following queries, with a goal of achieving a value of 0.9 for precision@10:

  1. Look for information on the Per Se restaurant in New York City, starting with the query [per se].
  2. Look for information on 23andMe cofounder Anne Wojcicki, starting with the query [wojcicki].
  3. Look for information on COVID-19 cases, starting with the query [cases].

We will check the execution of your program on these three cases, as well as on some other queries.

What to Submit and When

Your Project 1 submission will consist of the following three components, which you should submit on Gradescope by Monday, February 19, at 5 p.m. ET:

To submit your project, please follow these steps:

  1. Create a directory named proj1.
  2. Copy the source code files into the proj1 directory, and include all the other files that are necessary for your program to run.
  3. Tar and gzip the proj1 directory, to generate a single file proj1.tar.gz.
  4. Submit on Gradescope exactly three files:
    • Your proj1.tar.gz file with your code,
    • Your uncompressed README file (a PDF file is preferred), and
    • Your uncompressed query transcript file.

Reference Implementation for Project 1

We have created a reference implementation for this project. To run the reference implementation, ssh as "guest-user" to the VM running at 104.196.182.206 (i.e., open a terminal and type "ssh guest-user@104.196.182.206"). Use the password for this VM that Lara included in the email to you with your Google coupon code. After you have logged into the guest-user account, run the following command:

/home/gkaraman/run <google api key> <google engine id> <precision> <query>

where:

The reference implementation is interactive. Please adhere to the format of the relevance feedback session for your submission and your transcript file.

Also, you can use this reference implementation to give you an idea of how good your algorithm should be. Ideally, the performance of your own algorithm, in terms of the number of iterations that the algorithm takes to achieve a given precision value for a query, should be at least as good as that of our reference implementation.

Project Policies on Usage of AI Tools

To maintain the integrity of our learning process and ensure that the core objectives of the course are met by all students, we have established project-specific guidelines for using ChatGPT, Bard, and any other AI tools for our course projects. These guidelines are crafted to encourage independent problem-solving, hands-on engagement with the course material, and adherence to academic integrity.

Please read the guidelines carefully. Importantly, for the "usage allowed" cases below, please document extensively how you used any of these AI tools, including the exact "prompts" that you worked with, namely, the questions or statements you input to the AI tools. Please document this information clearly and comprehensively in the README file that you will include with the project submission. Thorough documentation is a key professional skill, and one that helps maintain transparency and proper attribution in the project development.

Greenlight: Usage Allowed

Redlight: Usage Not Allowed

Our understanding of AI tools is evolving constantly, and it is important that you seek clarification proactively if any of these guidelines are unclear or if you have any doubts. Also, your insights and inquiries are invaluable. So if you have any questions or comments about any of the above policies, please either post them to the class discussion board or contact a TA or the professor. Overall, we encourage you to approach these AI tools with a mindset of integrity and curiosity, and not to hesitate to seek clarification. This should ensure that your educational journey is both effective and ethically sound.

Hints and Additional Important Notes

Grading for Project 1

Your grade will be based on the effectiveness of your query-modification method (which, in turn, will be reflected in the number of iterations that your system takes to achieve the target precision, both for the test cases and for other unseen queries that we will use for grading), the quality of your code, and the quality of the README file. We will not grade your project in terms of efficiency.