COMS E6111-Advanced Database Systems
Spring 2024

Project 2

Due Date: Monday March 25, 5 p.m. ET

Teams

You will carry out this project in teams of two. You can do the project with the same teammate as for Project 1, or you can switch teammates if you wish; in that case, please be considerate and notify your Project 1 teammate immediately. If you want to form a new team but can't find a teammate, please follow these steps:

You do not need to notify us of your team composition. Instead, you and your teammate will indicate your team composition when you submit your project on Gradescope (click on "Add Group Member" after one of you has submitted your project). You will upload your final electronic submission on Gradescope exactly once per team, rather than once per student.

Important notes:

Overview

This project is about information extraction on the web, or the task of extracting "structured" information that is embedded in natural language text on the web. As we discussed in class, information extraction has many applications and, notably, is becoming increasingly important for web search.

In this project, you will implement a version of the Iterative Set Expansion (ISE) algorithm that we described in class. Given a target information extraction task, an "extraction confidence threshold," a "seed query" for the task (which should correspond to a plausible tuple for the relation to extract), and a desired number of tuples k, you will follow the ISE procedure outlined below, starting with the seed query, to return k tuples extracted for the specified relation from web pages, each with at least the given extraction confidence.

The objective of this project is to provide you with hands-on experience on how to (i) retrieve and parse webpages; (ii) prepare and annotate text on the webpages for subsequent analysis; and (iii) extract structured information from the webpages. You will exercise both a "traditional" information extraction approach (using SpanBERT) that involves multiple steps of data annotation, and an approach that reflects the ongoing paradigm shift from multi-step data pipelines with specialized models for extraction tasks to strong "few-shot" learners (using Google's Gemini API). You will implement both approaches, and then select one of them for each run of your project via a command-line argument.

You will develop and run your project on the Google Cloud infrastructure, using your LionMail account and VM as you did for Project 1.

IMPORTANT NOTE: When you restart your VM to work on this project, you must request at least 15 GB of RAM for the VM, so that your system will run without any memory issues. To do this, in the "VM instances" page of your Google Cloud account, click on the VM's name, then click on "EDIT" at the top, select n1-standard-4 (4 vCPU, 15 GB memory) as the "Machine type," and click on "Save."

Description

For this project, you will write a program that implements the ISE algorithm over the web. Before you get started, you should install Python 3.9 and create a Python 3.9 virtual environment to develop and test your code, as follows:

Also, when using the commands apt or apt-get below, you may get an error that says "ModuleNotFoundError: No module named 'apt_pkg'." If this happens, perform the following steps:
cd /usr/lib/python3/dist-packages
sudo ln -s apt_pkg.cpython-36m-x86_64-linux-gnu.so apt_pkg.so
cd ~
sudo pip3 install --upgrade google-api-python-client
Finally, note that some of the commands below may generate warnings (not errors) when you run them. As long as these are labeled "Warnings" (and not "Errors"), you can safely ignore them.

Your program will rely on:

Overall, your program should receive as input:

Then, your program should perform the following steps (a minimal sketch of this loop appears after the list):

  1. Initialize X, the set of extracted tuples, as the empty set.
  2. Query your Google Custom Search Engine to obtain the URLs for the top-10 webpages for query q; you can reuse your own code from Project 1 for this part if you so wish.
  3. For each URL from the previous step that you have not processed before (you should skip already-seen URLs, even if this involves processing fewer than 10 webpages in this iteration):
    1. Retrieve the corresponding webpage; if you cannot retrieve the webpage (e.g., because of a timeout), just skip it and move on, even if this involves processing fewer than 10 webpages in this iteration.
    2. Extract the actual plain text from the webpage using Beautiful Soup.
    3. If the resulting plain text is longer than 10,000 characters, truncate the text to its first 10,000 characters (for efficiency) and discard the rest.
    4. Use the spaCy library to split the text into sentences and extract named entities (e.g., PERSON, ORGANIZATION). See below for details on how to perform this step.
    5. If -spanbert is specified, use the sentences and named entity pairs as input to SpanBERT to predict the corresponding relations, and extract all instances of the relation specified by input parameter r. Otherwise, if -gemini is specified, use the Google Gemini API for relation extraction. See below for details on how to perform this step.
    6. If -spanbert is specified, identify the tuples that have an associated extraction confidence of at least t and add them to set X. Otherwise, if -gemini is specified, identify all the tuples that have been extracted and add them to set X (we do not receive extraction confidence values from the Google Gemini API, so feel free to hard-code in a value of 1.0 for the confidence value for all Gemini-extracted tuples).
  4. Remove exact duplicates from set X: if X contains tuples that are identical to each other, keep only the copy that has the highest extraction confidence (if -spanbert is specified) and remove from X the duplicate copies. (You do not need to remove approximate duplicates, for simplicity.)
  5. If X contains at least k tuples, return the top-k such tuples and stop. If -spanbert is specified, your output should have the tuples sorted in decreasing order by extraction confidence, together with the extraction confidence of each tuple. If -gemini is specified, your output can have the tuples in any order (if you have more than k tuples, then you can return an arbitrary subset of k tuples). (Alternatively, you can return all of the tuples in X, not just the top-k such tuples; this is what the reference implementation does.)
  6. Otherwise, select from X a tuple y such that (1) y has not been used for querying yet and (2) if -spanbert is specified, y has an extraction confidence that is highest among the tuples in X that have not yet been used for querying. (You can break ties arbitrarily.) Create a query q from tuple y by just concatenating the attribute values together, and go to Step 2. If no such y tuple exists, then stop. (ISE has "stalled" before retrieving k high-confidence tuples.)
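
The following minimal Python sketch illustrates the overall flow of the steps above. It is only a sketch, not the reference implementation: the helpers extract_plain_text and extract_tuples are hypothetical placeholders for Steps 3.1-3.5, tuples are represented here simply as (subject, object) string pairs, and you are free to structure your own code differently.

from googleapiclient.discovery import build

def google_top10(google_api_key, google_engine_id, query):
    # Step 2: top-10 URLs from the Google Custom Search JSON API
    service = build("customsearch", "v1", developerKey=google_api_key)
    res = service.cse().list(q=query, cx=google_engine_id).execute()
    return [item["link"] for item in res.get("items", [])]

def ise(google_api_key, google_engine_id, method, r, t, seed_query, k):
    X = {}                                       # Step 1: tuple -> best extraction confidence
    seen_urls, used_queries = set(), set()
    q = seed_query
    while True:
        used_queries.add(q)
        for url in google_top10(google_api_key, google_engine_id, q):
            if url in seen_urls:                 # Step 3: skip already-seen URLs
                continue
            seen_urls.add(url)
            text = extract_plain_text(url)       # Steps 3.1-3.2 (hypothetical helper)
            if text is None:                     # retrieval failed: skip this URL
                continue
            text = text[:10000]                  # Step 3.3: truncate for efficiency
            for tup, conf in extract_tuples(text, method, r):   # Steps 3.4-3.5 (hypothetical helper)
                if method == "-gemini":
                    conf = 1.0                   # no confidence values from Gemini
                elif conf < t:
                    continue                     # Step 3.6: enforce the confidence threshold
                X[tup] = max(conf, X.get(tup, 0.0))   # Step 4: keep only the highest-confidence copy
        if len(X) >= k:                          # Step 5: enough tuples extracted
            return sorted(X.items(), key=lambda kv: -kv[1])[:k]
        unused = [tup for tup in X if " ".join(tup) not in used_queries]
        if not unused:                           # Step 6: ISE has "stalled"
            return sorted(X.items(), key=lambda kv: -kv[1])
        y = max(unused, key=lambda tup: X[tup])  # highest-confidence tuple not yet used for querying
        q = " ".join(y)                          # new query: concatenate the attribute values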

Performing the Annotation and Information Extraction Steps

Steps 3.4 and 3.5 above require that you use the spaCy library to annotate the plain text from each webpage and extract tuples for the target relation r using (1) the pre-trained SpanBERT classifier, when -spanbert is specified; or (2) the Google Gemini API, when -gemini is specified.

Relation extraction is a complex task that traditionally operates over text that has been annotated with appropriate tools. In particular, the spaCy library that you will use in this project provides a variety of text pre-processing tools (e.g., sentence splitting, tokenization, named entity recognition).

For your project, you should use spaCy for splitting the text to sentences and for named entity recognition for each of the sentences. You can find instructions on how to apply spaCy for this task here and in our example script (see below).
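
As a minimal illustration of these two steps, the following sketch uses spaCy directly; here webpage_text stands for the plain text extracted in Step 3.2, and en_core_web_lg is one common model choice (see our example script for the setup we use).

import spacy

# Load a pre-trained English pipeline; download it first with:
#   python3 -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

doc = nlp(webpage_text)                  # webpage_text: the plain text from Step 3.2
for sentence in doc.sents:               # sentence splitting
    for ent in sentence.ents:            # named entities found in this sentence
        print(ent.text, ent.label_)      # e.g., "Bill Gates" PERSON, "Microsoft" ORG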

Using SpanBERT for Relation Extraction

If -spanbert is specified, after having identified named entities for a sentence you should use the pre-trained SpanBERT classifier for relation extraction. SpanBERT is a BERT-based relation classifier that considers as input (1) a sentence; (2) a subject entity from the sentence; and (3) an object entity from the sentence. SpanBERT then returns the predicted relation and the respective confidence value. You can find instructions on how to apply SpanBERT for this task here and in our example script (see below).

We have put together two minimal Python scripts, namely, spacy_help_functions.py and example_relations.py, that perform the full relation extraction pipeline, to illustrate how the spaCy library is integrated with SpanBERT. To run these scripts, you need to place them under the same directory as the spanbert.py file (provided here).

As an example, consider the following sentence and a "conceptual walk-through" of the various steps of the full information extraction process (note that this is not how the output of our reference implementation is formatted):

Note that in the above example, SpanBERT runs 6 times for the same sentence, each time with a different entity pair. SpanBERT extracts relations for 3 entity pairs and predicts the no_relation type for the rest of the pairs (i.e., no relations were extracted). Each relation type predicted by SpanBERT is listed together with the associated extraction confidence score.

Unfortunately, the SpanBERT classifier is computationally expensive, so for efficiency you need to minimize its use. Specifically, you should not run SpanBERT over entity pairs that do not contain named entities of the right type for the relation of interest r. The required named entity types for each relation type are as follows:

As an example, consider extraction for the Work_For relation (internal name: per:employee_of). You should only keep entity pairs where the subject entity type is PERSON and the object entity type is ORGANIZATION. By applying this constraint in the example sentence above ("Bill Gates stepped down ... appointed CEO Satya Nadella.") SpanBERT would run only for the first entity pair ('Bill Gates', 'Microsoft') and the sixth entity pair ('Satya Nadella', 'Microsoft'). Note that the subject and object entities might appear in either order in a sentence and this is fine.

So to annotate the text, you should implement two steps. First, you should use spaCy to identify the sentences in the webpage text, together with the named entities, if any, that appear in each sentence. Then, you should construct entity pairs and run the expensive SpanBERT model only over the entity pairs that contain both of the named entity types required for the relation of interest, as specified above. IMPORTANT: You must not run SpanBERT for any entity pairs that are missing one or both entities of the type required by the relation. If a sentence is missing one or both entities of the type required by the relation, you should skip it and move to the next sentence.
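
As a concrete illustration for the Work_For relation, the filtering could look like the following sketch, where doc is the spaCy document from the previous step; note that spaCy's label for organizations is ORG. How to package each surviving pair as input to SpanBERT is shown in the provided example_relations.py and spacy_help_functions.py.

# Sketch of the filtering step for Work_For (per:employee_of):
# the subject must be a PERSON and the object an ORGANIZATION (spaCy label ORG).
candidate_pairs = []
for sentence in doc.sents:
    persons = [e for e in sentence.ents if e.label_ == "PERSON"]
    orgs = [e for e in sentence.ents if e.label_ == "ORG"]
    if not persons or not orgs:
        continue                         # a required entity type is missing: skip this sentence
    for subj in persons:
        for obj in orgs:
            candidate_pairs.append((sentence, subj, obj))

# Only now run SpanBERT, and only over candidate_pairs; see the provided
# example scripts for how to construct SpanBERT's input from each pair.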

When running the second step over a sentence, SpanBERT looks for a predefined set of relations. We are interested in just the four relations mentioned above. (If you are curious about the other relations available, please check the complete list as well as an article with a detailed description.)

Using the Google Gemini API for Relation Extraction

If -gemini is specified, you will use Google's Gemini API for relation extraction, rather than SpanBERT. Google Gemini and its peer large language models, or LLMs, represent a major advance in generative modeling of text. LLMs have achieved state-of-the-art performance on many text-centric tasks, such as machine translation and question answering, and have impressive potential for relation extraction as well.

Google's Gemini API can extract relations directly from sentences, as a response to a textual "prompt" provided by you, as we discussed in class. The prompt can also specify the format that you would like the output in, so that you can easily integrate this output into the rest of the data pipeline in this project. Google Gemini's "free-form" playground is helpful to experiment with designing a good prompt to extract relations from some plain text.
IMPORTANT: You need to be logged into your personal Gmail/Google account for this playground to work; it will not work if you are logged into your Columbia account, unfortunately.
There are many prompts that will work well; there is not one single magic prompt that you have to discover. We have put together a minimal Python script, namely, gemini_helper_6111.py, to illustrate how to invoke the API.

For SpanBERT above, we followed a two-step process: first tagging sentences, and then classifying each candidate entity pair with SpanBERT for the relation types of interest. In contrast, Google Gemini can extract relations directly over the plain text, with no named-entity tagging.

However, because extraction using Google Gemini's API is still computationally expensive (and in other settings can cost actual money!), you should still follow the two-step process described above for SpanBERT, so that you only feed Gemini sentences that have the proper entities for the relation of interest. Specifically, you should first tag the sentences using spaCy to identify the sentences that contain named entity pairs of the right type for the relation of interest r, and then feed a plain-text version of these sentences --without any tags-- to Google Gemini's API.
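
As a rough sketch of how such a call might look with the google.generativeai client (see the provided gemini_helper_6111.py for the exact invocation we recommend; the model name below and the helpers build_prompt and parse_tuples are assumptions/placeholders for your own code):

import google.generativeai as genai

# Sketch only; see gemini_helper_6111.py for the recommended invocation.
# The model name is an assumption, and build_prompt/parse_tuples are
# placeholders for your own prompt construction and output parsing.
genai.configure(api_key=gemini_api_key)          # your Google Gemini API key
model = genai.GenerativeModel("gemini-pro")

for sentence_text in filtered_sentences:         # only sentences that passed the spaCy filter
    prompt = build_prompt(r, sentence_text)      # your prompt for relation r (see the NOTE/HINT below)
    response = model.generate_content(prompt)
    tuples = parse_tuples(response.text)         # assign each extracted tuple a confidence of 1.0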

IMPORTANT NOTE 1: You must not submit to Google's Gemini API any sentences that are missing one or two entities of the type required by the relation. If a sentence is missing one or two entities of the type required by the relation, you should skip it and move to the next sentence, for computational efficiency and, importantly, to avoid monetary charges.

IMPORTANT NOTE 2: To avoid overloading Google Gemini, you can assume that whenever we specify -gemini, the value of the number of tuples k that we request will never exceed a modest number such as 10.

IMPORTANT NOTE 3: You may occasionally encounter an Internal Server Error exception (google.api_core.exceptions.InternalServerError: 500) when making a call to the Google Gemini API. This is a transient error signifying that Google servers are currently at capacity. If you frequently receive this error, try waiting a bit for Google's servers to clear up.
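
If you want to handle this error programmatically, one option (just a sketch, not a requirement) is a simple retry with a short wait; here call_gemini is a placeholder for however you invoke the API in your own code.

import time
from google.api_core.exceptions import InternalServerError

def call_with_retries(call_gemini, max_retries=3, wait_seconds=10):
    # Retry a Gemini call a few times when the servers are at capacity.
    for attempt in range(max_retries):
        try:
            return call_gemini()
        except InternalServerError:
            if attempt == max_retries - 1:
                raise                            # give up after the last attempt
            time.sleep(wait_seconds)             # wait for Google's servers to clear up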

What to Submit and When

Your Project 2 submission will consist of the following four components, which you should submit on Gradescope by Monday March 25 at 5 p.m. ET:

  1. Your well-commented Python code, which should follow the format of our reference implementation (see below) and run on your Google Cloud VM, set up as detailed here (but with more memory, etc. for this project as indicated above). Your implementation should be called using the same exact name and arguments as the reference implementation.
  2. A README file including the following information:
    1. Your name and Columbia UNI, and your teammate's name and Columbia UNI
    2. A list of all the files that you are submitting
    3. A clear description of how to run your program. Note that your project must run in a Google Cloud VM set up exactly following our instructions (but with more memory, etc. for this project as indicated above). Provide all commands necessary to install the required software and dependencies for your program.
    4. A clear description of the internal design of your project, explaining the general structure of your code (i.e., what its main high-level components are and what they do), as well as acknowledging and describing all external libraries that you use in your code
    5. A detailed description of how you carried out Step 3 in the "Description" section above
    6. Your Google Custom Search Engine JSON API Key and Engine ID (so we can test your project); you do not need to send us your Google Gemini API key
    7. Any additional information that you consider significant
  3. A transcript of the run of your program on input parameters: -spanbert 2 0.7 "bill gates microsoft" 10 (i.e., using SpanBERT for r=2, t=0.7, q=[bill gates microsoft], and k=10). The format of your transcript should closely follow the format of the reference implementation, and should print the same information (i.e., number of characters, sentences, relations, etc.) as the corresponding session of the reference implementation.
  4. A transcript of the run of your program on input parameters: -gemini 2 0.0 "bill gates microsoft" 10 (i.e., using Google's Gemini API for r=2, t=0, which is ignored for -gemini, q=[bill gates microsoft], and k=10). The format of your transcript should closely follow the format of the reference implementation, and should print the same information (i.e., number of characters, sentences, relations, etc.) as the corresponding session of the reference implementation.

To submit your project, please follow these steps:

  1. Create a directory named proj2.
  2. Copy the source code files into the proj2 directory, and include all the other files that are necessary for your program to run.
  3. Tar and gzip the proj2 directory, to generate a single file proj2.tar.gz.
  4. Submit on Gradescope exactly four files:
    • Your proj2.tar.gz file with your code,
    • Your uncompressed README file (a PDF file is preferred),
    • Your uncompressed transcript file for -spanbert, and
    • Your uncompressed transcript file for -gemini.

Reference Implementation for Project 2

We created a reference implementation for this project. The reference implementation is called as follows:
python3 project2.py [-spanbert|-gemini] <google api key> <google engine id> <google gemini api key> <r> <t> <q> <k>
where:

Unfortunately, the SpanBERT classifier requires substantial amounts of memory to run. Therefore, a VM that could support many concurrent runs of the reference implementation, to accommodate the number of students in the class, would exceed our available Google Cloud budget. So rather than giving you direct access to the reference implementation, we provide the transcripts of a variety of runs of the reference implementation. We will keep adding and updating these transcripts periodically, so you have reasonably up-to-date runs available.

Please adhere to the format of the reference implementation for your submission and your transcript file. Also, you can use the transcripts of the reference implementation to give you an idea of how good your overall system should be. Ideally, the number of querying iterations that your system takes to extract the number of tuples requested should be at least as low as that of our reference implementation.

NOTE/HINT: The "prompt" for the -gemini option of the reference implementation uses the following examples of relations:
'["Jeff Bezos", "Schools_Attended", "Princeton University"]'
'["Alec Radford", "Work_For", "OpenAI"]'
'["Mariah Carey", "Live_In", "New York City"]'
'["Nvidia", "Top_Member_Employees", "Jensen Huang"]'
Specifically, the prompt includes the one example above that corresponds to the relation requested by input parameter r. (Think of it as a "one-shot in-context learning" prompt, as we discussed in class.) You are welcome to use or not use these examples as part of your prompt; overall, the design of the text prompt is part of what you need to complete for the project. Finally, please recall that we saw an example in class of a possible prompt for this general task, which you are welcome to adapt.
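
For instance, a one-shot prompt could be assembled along the following lines (just an illustration; the exact wording, and whether you use these examples at all, is up to you):

def one_shot_prompt(relation_name, one_shot_example, sentence_text):
    # relation_name: e.g., "Work_For"; one_shot_example: the corresponding
    # example from the list above, e.g., '["Alec Radford", "Work_For", "OpenAI"]'
    return (
        f"Extract all instances of the {relation_name} relation from the sentence below, "
        f"and output each instance as a JSON list formatted exactly like this example: "
        f"{one_shot_example}\n"
        f"Sentence: {sentence_text}"
    )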

Grading for Project 2

A part of your grade will be based on the correctness of your overall system. Another part of your grade will be based on the number of iterations that your system takes to extract the number of tuples requested: ideally, the number of querying iterations that your system takes to extract the number of tuples requested should be at least as low as that of our reference implementation. We will not grade you on the run-time efficiency of each individual iteration, as long as you correctly implement the two annotator "steps" described above; in particular, note that you must not run the second (expensive) step for all sentences, but rather you should restrict that second step to only those sentences that satisfy the criteria described above. We will also grade your submission based on the quality of your code, the quality of the README file, and the quality of your transcript.