COMS E6111-Advanced Database Systems
Spring 2024

Project 2

Due Date: Monday March 25, 5 p.m. ET

Teams

You will carry out this project in teams of two. You can do the project with the same teammate as for Project 1, or you can switch teammates if you wish; in that case, please be considerate and notify your Project 1 teammate immediately. If you want to form a new team but can't find a teammate, please follow these steps:

You do not need to notify us of your team composition. Instead, you and your teammate will indicate your team composition when you submit your project on Gradescope (click on "Add Group Member" after one of you has submitted your project). You will upload your final electronic submission on Gradescope exactly once per team, rather than once per student.

Important notes:

Overview

This project is about information extraction on the web, or the task of extracting "structured" information that is embedded in natural language text on the web. As we discussed in class, information extraction has many applications and, notably, is becoming increasingly important for web search.

In this project, you will implement a version of the Iterative Set Expansion (ISE) algorithm that we described in class. Given a target information extraction task, an "extraction confidence threshold," a "seed query" for the task (which should correspond to a plausible tuple for the relation to extract), and a desired number of tuples k, you will follow the ISE procedure outlined below, starting with the seed query, to return k tuples extracted for the specified relation from web pages, each with at least the given extraction confidence.

The objective of this project is to provide you with hands-on experience on how to (i) retrieve and parse webpages; (ii) prepare and annotate text on the webpages for subsequent analysis; and (iii) extract structured information from the webpages. You will exercise both a "traditional" information extraction approach (using SpanBERT) that involves multiple steps of data annotation, and an approach that reflects the ongoing paradigm shift from multi-step data pipelines with specialized models for extraction tasks to strong "few-shot" learners (using Google's Gemini API). You will implement both approaches, and then select one of them for each run of your project via a command-line argument.

You will develop and run your project on the Google Cloud infrastructure, using your LionMail account and VM as you did for Project 1.

IMPORTANT NOTE: When you restart your VM to work on this project, you must request at least 15 GB of RAM for the VM, so that your system will run without any memory issues. To do this, in the "VM instances" page of your Google Cloud account, click on the VM's name, then click on "EDIT" at the top, select n1-standard-4 (4 vCPU, 15 GB memory) as the "Machine type," and click on "Save."

Description

For this project, you will write a program that implements the ISE algorithm over the web. Before you get started, you should install Python 3.9 and create a Python 3.9 virtual environment to develop and test your code, as follows:

Also, when using the commands apt or apt-get below, you may get an error that says "ModuleNotFoundError: No module named 'apt_pkg'." If this happens, perform the following steps:
cd /usr/lib/python3/dist-packages
sudo ln -s apt_pkg.cpython-36m-x86_64-linux-gnu.so apt_pkg.so
cd ~
sudo pip3 install --upgrade google-api-python-client
Finally, note that some of the commands below may generate warnings (not errors) when you run them. As long as these are labeled "Warnings" (and not "Errors"), you can safely ignore them.

Your program will rely on:

Overall, your program should receive as input:

Then, your program should perform the following steps (a minimal sketch of this loop appears after the list):

  1. Initialize X, the set of extracted tuples, as the empty set.
  2. Query your Google Custom Search Engine to obtain the URLs for the top-10 webpages for query q; you can reuse your own code from Project 1 for this part if you so wish.
  3. For each URL from the previous step that you have not processed before (you should skip already-seen URLs, even if this involves processing fewer than 10 webpages in this iteration):
    1. Retrieve the corresponding webpage; if you cannot retrieve the webpage (e.g., because of a timeout), just skip it and move on, even if this involves processing fewer than 10 webpages in this iteration.
    2. Extract the actual plain text from the webpage using Beautiful Soup.
    3. If the resulting plain text is longer than 10,000 characters, truncate the text to its first 10,000 characters (for efficiency) and discard the rest.
    4. Use the spaCy library to split the text into sentences and extract named entities (e.g., PERSON, ORGANIZATION). See below for details on how to perform this step.
    5. If -spanbert is specified, use the sentences and named entity pairs as input to SpanBERT to predict the corresponding relations, and extract all instances of the relation specified by input parameter r. Otherwise, if -gemini is specified, use the Google Gemini API for relation extraction. See below for details on how to perform this step.
    6. If -spanbert is specified, identify the tuples that have an associated extraction confidence of at least t and add them to set X. Otherwise, if -gemini is specified, identify all the tuples that have been extracted and add them to set X (we do not receive extraction confidence values from the Google Gemini API, so feel free to hard-code in a value of 1.0 for the confidence value for all Gemini-extracted tuples).
  4. Remove exact duplicates from set X: if X contains tuples that are identical to each other, keep only the copy that has the highest extraction confidence (if -spanbert is specified) and remove from X the duplicate copies. (You do not need to remove approximate duplicates, for simplicity.)
  5. If X contains at least k tuples, return the top-k such tuples and stop. If -spanbert is specified, your output should have the tuples sorted in decreasing order by extraction confidence, together with the extraction confidence of each tuple. If -gemini is specified, your output can have the tuples in any order (if you have more than k tuples, then you can return an arbitrary subset of k tuples). (Alternatively, you can return all of the tuples in X, not just the top-k such tuples; this is what the reference implementation does.)
  6. Otherwise, select from X a tuple y such that (1) y has not been used for querying yet and (2) if -spanbert is specified, y has an extraction confidence that is highest among the tuples in X that have not yet been used for querying. (You can break ties arbitrarily.) Create a query q from tuple y by just concatenating the attribute values together, and go to Step 2. If no such y tuple exists, then stop. (ISE has "stalled" before retrieving k high-confidence tuples.)
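
The following minimal Python sketch illustrates the overall flow of the steps above. It is only a sketch, not the reference implementation: the helpers extract_plain_text and extract_tuples are hypothetical placeholders for Steps 3.1-3.5, tuples are represented here simply as (subject, object) string pairs, and you are free to structure your own code differently.

from googleapiclient.discovery import build

def google_top10(google_api_key, google_engine_id, query):
    # Step 2: top-10 URLs from the Google Custom Search JSON API
    service = build("customsearch", "v1", developerKey=google_api_key)
    res = service.cse().list(q=query, cx=google_engine_id).execute()
    return [item["link"] for item in res.get("items", [])]

def ise(google_api_key, google_engine_id, method, r, t, seed_query, k):
    X = {}                                       # Step 1: tuple -> best extraction confidence
    seen_urls, used_queries = set(), set()
    q = seed_query
    while True:
        used_queries.add(q)
        for url in google_top10(google_api_key, google_engine_id, q):
            if url in seen_urls:                 # Step 3: skip already-seen URLs
                continue
            seen_urls.add(url)
            text = extract_plain_text(url)       # Steps 3.1-3.2 (hypothetical helper)
            if text is None:                     # retrieval failed: skip this URL
                continue
            text = text[:10000]                  # Step 3.3: truncate for efficiency
            for tup, conf in extract_tuples(text, method, r):   # Steps 3.4-3.5 (hypothetical helper)
                if method == "-gemini":
                    conf = 1.0                   # no confidence values from Gemini
                elif conf < t:
                    continue                     # Step 3.6: enforce the confidence threshold
                X[tup] = max(conf, X.get(tup, 0.0))   # Step 4: keep only the highest-confidence copy
        if len(X) >= k:                          # Step 5: enough tuples extracted
            return sorted(X.items(), key=lambda kv: -kv[1])[:k]
        unused = [tup for tup in X if " ".join(tup) not in used_queries]
        if not unused:                           # Step 6: ISE has "stalled"
            return sorted(X.items(), key=lambda kv: -kv[1])
        y = max(unused, key=lambda tup: X[tup])  # highest-confidence tuple not yet used for querying
        q = " ".join(y)                          # new query: concatenate the attribute values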

Performing the Annotation and Information Extraction Steps

Steps 3.4 and 3.5 above require that you use the spaCy library to annotate the plain text from each webpage and extract tuples for the target relation r using (1) the pre-trained SpanBERT classifier, when -spanbert is specified; or (2) the Google Gemini API, when -gemini is specified.

Relation extraction is a complex task that traditionally operates over text that has been annotated with appropriate tools. In particular, the spaCy library that you will use in this project provides a variety of text pre-processing tools (e.g., sentence splitting, tokenization, named entity recognition).

For your project, you should use spaCy for splitting the text to sentences and for named entity recognition for each of the sentences. You can find instructions on how to apply spaCy for this task here and in our example script (see below).
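
As a minimal illustration of these two steps, the following sketch uses spaCy directly; here webpage_text stands for the plain text extracted in Step 3.2, and en_core_web_lg is one common model choice (see our example script for the setup we use).

import spacy

# Load a pre-trained English pipeline; download it first with:
#   python3 -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

doc = nlp(webpage_text)                  # webpage_text: the plain text from Step 3.2
for sentence in doc.sents:               # sentence splitting
    for ent in sentence.ents:            # named entities found in this sentence
        print(ent.text, ent.label_)      # e.g., "Bill Gates" PERSON, "Microsoft" ORG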

Using SpanBERT for Relation Extraction

If -spanbert is specified, after having identified named entities for a sentence you should use the pre-trained SpanBERT classifier for relation extraction. SpanBERT is a BERT-based relation classifier that considers as input (1) a sentence; (2) a subject entity from the sentence; and (3) an object entity from the sentence. SpanBERT then returns the predicted relation and the respective confidence value. You can find instructions on how to apply SpanBERT for this task here and in our example script (see below).

We have put together two minimal Python scripts, namely, spacy_help_functions.py and example_relations.py, that perform the full relation extraction pipeline, to illustrate how the spaCy library is integrated with SpanBERT. To run these scripts, you need to place them under the same directory as the spanbert.py file (provided here).

As an example, consider the following sentence and a "conceptual walk-through" of the various steps of the full information extraction process (note that this is not how the output of our reference implementation is formatted):

Note that in the above example, SpanBERT runs 6 times for the same sentence, each time with a different entity pair. SpanBERT extracts relations for 3 entity pairs and predicts the no_relation type for the rest of the pairs (i.e., no relations were extracted). Each relation type predicted by SpanBERT is listed together with the associated extraction confidence score.

Unfortunately, the SpanBERT classifier is computationally expensive, so for efficiency you need to minimize its use. Specifically, you should not run SpanBERT over entity pairs that do not contain named entities of the right type for the relation of interest r. The required named entity types for each relation type are as follows:

As an example, consider extraction for the Work_For relation (internal name: per:employee_of). You should only keep entity pairs where the subject entity type is PERSON and the object entity type is ORGANIZATION. By applying this constraint in the example sentence above ("Bill Gates stepped down ... appointed CEO Satya Nadella.") SpanBERT would run only for the first entity pair ('Bill Gates', 'Microsoft') and the sixth entity pair ('Satya Nadella', 'Microsoft'). Note that the subject and object entities might appear in either order in a sentence and this is fine.

So to annotate the text, you should implement two steps. First, you should use spaCy to identify the sentences in the webpage text, together with the named entities, if any, that appear in each sentence. Then, you should construct entity pairs and run the expensive SpanBERT model only over the entity pairs that contain both of the named entity types required for the relation of interest, as specified above. IMPORTANT: You must not run SpanBERT for any entity pairs that are missing one or both entities of the type required by the relation. If a sentence is missing one or both entities of the type required by the relation, you should skip it and move to the next sentence.
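
As a concrete illustration for the Work_For relation, the filtering could look like the following sketch, where doc is the spaCy document from the previous step; note that spaCy's label for organizations is ORG. How to package each surviving pair as input to SpanBERT is shown in the provided example_relations.py and spacy_help_functions.py.

# Sketch of the filtering step for Work_For (per:employee_of):
# the subject must be a PERSON and the object an ORGANIZATION (spaCy label ORG).
candidate_pairs = []
for sentence in doc.sents:
    persons = [e for e in sentence.ents if e.label_ == "PERSON"]
    orgs = [e for e in sentence.ents if e.label_ == "ORG"]
    if not persons or not orgs:
        continue                         # a required entity type is missing: skip this sentence
    for subj in persons:
        for obj in orgs:
            candidate_pairs.append((sentence, subj, obj))

# Only now run SpanBERT, and only over candidate_pairs; see the provided
# example scripts for how to construct SpanBERT's input from each pair.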

When running the second step over a sentence, SpanBERT looks for a predefined set of relations. We are interested in just the four relations mentioned above. (If you are curious about the other relations available, please check the complete list as well as an article with a detailed description.)

Using the Google Gemini API for Relation Extraction

If -gemini is specified, you will use Google's Gemini API for relation extraction, rather than SpanBERT. Google Gemini and its peer large language models, or LLMs, represent a major advance in generative modeling of text. LLMs have achieved state-of-the-art performance on many text-centric tasks, such as machine translation and question answering, and have impressive potential for relation extraction as well.

Google's Gemini API can extract relations directly from sentences, as a response to a textual "prompt" provided by you, as we discussed in class. The prompt can also specify the format that you would like the output in, so that you can easily integrate this output into the rest of the data pipeline in this project. Google Gemini's "free-form" playground is helpful to experiment with designing a good prompt to extract relations from some plain text.
IMPORTANT: You need to be logged into your personal Gmail/Google account for this playground to work; it will not work if you are logged into your Columbia account, unfortunately.
There are many prompts that will work well; there is not one single magic prompt that you have to discover. We have put together a minimal Python script, namely, gemini_helper_6111.py, to illustrate how to invoke the API.

For SpanBERT above, we followed a two-step process: first tagging sentences, and then classifying each candidate entity pair with SpanBERT for the relation types of interest. In contrast, Google Gemini can extract relations directly over the plain text, with no named-entity tagging.

However, because extraction using Google Gemini's API is still computationally expensive (and in other settings can cost actual money!), you should still follow the two-step process described above for SpanBERT, so that you only feed Gemini sentences that have the proper entities for the relation of interest. Specifically, you should first tag the sentences using spaCy to identify the sentences that contain named entity pairs of the right type for the relation of interest r, and then feed a plain-text version of these sentences --without any tags-- to Google Gemini's API.
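
As a rough sketch of how such a call might look with the google.generativeai client (see the provided gemini_helper_6111.py for the exact invocation we recommend; the model name below and the helpers build_prompt and parse_tuples are assumptions/placeholders for your own code):

import google.generativeai as genai

# Sketch only; see gemini_helper_6111.py for the recommended invocation.
# The model name is an assumption, and build_prompt/parse_tuples are
# placeholders for your own prompt construction and output parsing.
genai.configure(api_key=gemini_api_key)          # your Google Gemini API key
model = genai.GenerativeModel("gemini-pro")

for sentence_text in filtered_sentences:         # only sentences that passed the spaCy filter
    prompt = build_prompt(r, sentence_text)      # your prompt for relation r (see the NOTE/HINT below)
    response = model.generate_content(prompt)
    tuples = parse_tuples(response.text)         # assign each extracted tuple a confidence of 1.0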

IMPORTANT NOTE 1: You must not submit to Google's Gemini API any sentences that are missing one or two entities of the type required by the relation. If a sentence is missing one or two entities of the type required by the relation, you should skip it and move to the next sentence, for computational efficiency and, importantly, to avoid monetary charges.

IMPORTANT NOTE 2: To avoid overloading Google Gemini, you can assume that whenever we specify -gemini, the value of the number of tuples k that we request will never exceed a modest number such as 10.

IMPORTANT NOTE 3: You may occasionally encounter an Internal Server Error exception (google.api_core.exceptions.InternalServerError: 500) when making a call to the Google Gemini API. This is a transient error signifying that Google servers are currently at capacity. If you frequently receive this error, try waiting a bit for Google's servers to clear up.
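
If you want to handle this error programmatically, one option (just a sketch, not a requirement) is a simple retry with a short wait; here call_gemini is a placeholder for however you invoke the API in your own code.

import time
from google.api_core.exceptions import InternalServerError

def call_with_retries(call_gemini, max_retries=3, wait_seconds=10):
    # Retry a Gemini call a few times when the servers are at capacity.
    for attempt in range(max_retries):
        try:
            return call_gemini()
        except InternalServerError:
            if attempt == max_retries - 1:
                raise                            # give up after the last attempt
            time.sleep(wait_seconds)             # wait for Google's servers to clear up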

What to Submit and When

Your Project 2 submission will consist of the following four components, which you should submit on Gradescope by Monday March 25 at 5 p.m. ET:

  1. Your well-commented Python code, which should follow the format of our reference implementation (see below) and run on your Google Cloud VM, set up as detailed here (but with more memory, etc. for this project as indicated above). Your implementation should be called using the same exact name and arguments as the reference implementation.
  2. A README file including the following information:
    1. Your name and Columbia UNI, and your teammate's name and Columbia UNI
    2. A list of all the files that you are submitting
    3. A clear description of how to run your program. Note that your project must run in a Google Cloud VM set up exactly following our instructions (but with more memory, etc. for this project as indicated above). Provide all commands necessary to install the required software and dependencies for your program.
    4. A clear description of the internal design of your project, explaining the general structure of your code (i.e., what its main high-level components are and what they do), as well as acknowledging and describing all external libraries that you use in your code
    5. A detailed description of how you carried out Step 3 in the "Description" section above
    6. Your Google Custom Search Engine JSON API Key and Engine ID (so we can test your project); you do not need to send us your Google Gemini API key
    7. Any additional information that you consider significant
  3. A transcript of the run of your program on input parameters: -spanbert 2 0.7 "bill gates microsoft" 10 (i.e., using SpanBERT for r=2, t=0.7, q=[bill gates microsoft], and k=10). The format of your transcript should closely follow the format of the reference implementation, and should print the same information (i.e., number of characters, sentences, relations, etc.) as the corresponding session of the reference implementation.
  4. A transcript of the run of your program on input parameters: -gemini 2 0.0 "bill gates microsoft" 10 (i.e., using Google's Gemini API for r=2, t=0, which is ignored for -gemini, q=[bill gates microsoft], and k=10). The format of your transcript should closely follow the format of the reference implementation, and should print the same information (i.e., number of characters, sentences, relations, etc.) as the corresponding session of the reference implementation.

To submit your project, please follow these steps:

  1. Create a directory named proj2.
  2. Copy the source code files into the proj2 directory, and include all the other files that are necessary for your program to run.
  3. Tar and gzip the proj2 directory, to generate a single file proj2.tar.gz.
  4. Submit on Gradescope exactly four files:
    • Your proj2.tar.gz file with your code,
    • Your uncompressed README file (a PDF file is preferred),
    • Your uncompressed transcript file for -spanbert, and
    • Your uncompressed transcript file for -gemini.

Reference Implementation for Project 2

We created a reference implementation for this project. The reference implementation is called as follows:
python3 project2.py [-spanbert|-gemini] <google api key> <google engine id> <google gemini api key> <r> <t> <q> <k>
where:

Unfortunately, the SpanBERT classifier requires substantial amounts of memory to run. Therefore, a VM that could support many concurrent runs of the reference implementation, to accommodate the number of students in the class, would exceed our available Google Cloud budget. So rather than giving you direct access to the reference implementation, we provide the transcripts of a variety of runs of the reference implementation. We will keep adding and updating these transcripts periodically, so you have reasonably up-to-date runs available.

Please adhere to the format of the reference implementation for your submission and your transcript file. Also, you can use the transcripts of the reference implementation to give you an idea of how good your overall system should be. Ideally, the number of querying iterations that your system takes to extract the number of tuples requested should be at least as low as that of our reference implementation.

NOTE/HINT: The "prompt" for the -gemini option of the reference implementation uses the following examples of relations:
'["Jeff Bezos", "Schools_Attended", "Princeton University"]'
'["Alec Radford", "Work_For", "OpenAI"]'
'["Mariah Carey", "Live_In", "New York City"]'
'["Nvidia", "Top_Member_Employees", "Jensen Huang"]'
Specifically, the prompt includes the one example above that corresponds to the relation requested by input parameter r. (Think of it as a "one-shot in-context learning" prompt, as we discussed in class.) You are welcome to use or not use these examples as part of your prompt; overall, the design of the text prompt is part of what you need to complete for the project. Finally, please recall that we saw an example in class of a possible prompt for this general task, which you are welcome to adapt.
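
For instance, a one-shot prompt could be assembled along the following lines (just an illustration; the exact wording, and whether you use these examples at all, is up to you):

def one_shot_prompt(relation_name, one_shot_example, sentence_text):
    # relation_name: e.g., "Work_For"; one_shot_example: the corresponding
    # example from the list above, e.g., '["Alec Radford", "Work_For", "OpenAI"]'
    return (
        f"Extract all instances of the {relation_name} relation from the sentence below, "
        f"and output each instance as a JSON list formatted exactly like this example: "
        f"{one_shot_example}\n"
        f"Sentence: {sentence_text}"
    )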

Grading for Project 2

A part of your grade will be based on the correctness of your overall system. Another part of your grade will be based on the number of iterations that your system takes to extract the number of tuples requested: ideally, the number of querying iterations that your system takes to extract the number of tuples requested should be at least as low as that of our reference implementation. We will not grade you on the run-time efficiency of each individual iteration, as long as you correctly implement the two annotator "steps" described above; in particular, note that you must not run the second (expensive) step for all sentences, but rather you should restrict that second step to only those sentences that satisfy the criteria described above. We will also grade your submission based on the quality of your code, the quality of the README file, and the quality of your transcript.