Project 2
Due Date: Friday November 3, 5:00 p.m. ET

Teams

You will carry out this project in teams of two. Both students in a team will receive the same grade for Project 2. Please check the Collaboration Policy Webpage for important information on what kinds of collaboration are allowed for projects, and how to compute the available number of project grace days for a team.

You do not need to notify us now of your team composition. We have created "Project 2 Group n" groups on CourseWorks with exactly the same team composition and team numbers as for Project 1. If your team is the same as for Project 1, you don't need to modify your Project 2 group assignment: it is fine as is.

However, you are welcome to switch teammates if you wish. In this case, please be considerate and notify your Project 1 teammate immediately. If you switch teammates, you need to update your team composition by modifying the appropriate "Project 2 Group n" group on CourseWorks, so you can submit your project as a team properly on CourseWorks. For this, follow the "People" link on the left side of the CourseWorks page and then select the "Project 2 Groups" tab on top.

Overview

This project is about information extraction on the web, or the task of extracting "structured" information that is embedded in natural language text on the web. As we discussed in class, information extraction has many applications and, notably, is becoming increasingly important for web search.

In this project, you will implement a version of the Iterative Set Expansion (ISE) algorithm that we described in class. Given a target information extraction task, an "extraction confidence threshold" t, a "seed query" for the task (which should correspond to a plausible tuple for the relation to extract), and a desired number of tuples k, you will follow the ISE procedure outlined below, starting with the seed query, to return k tuples extracted for the specified relation from web pages, each with at least the given extraction confidence.

The objective of this project is to give you hands-on experience with how to (i) retrieve and parse webpages; (ii) prepare and annotate text on the webpages for subsequent analysis; and (iii) extract structured information from the webpages.

Description

For this project, you will write a program that implements the ISE algorithm over the web. Your program will rely on:

  - The Google Custom Search API, to retrieve the top-10 results for each query (as in Project 1).
  - Apache Tika, or a comparable toolkit of your choice, to extract the plain text from each webpage.
  - The Stanford CoreNLP software suite, and in particular the Stanford Relation Extractor, to annotate the text and extract tuples for the target relation.

You will develop and run your project on the Google Cloud infrastructure, using your VM in your Computer Science Cloud account, just as you did for Project 1. IMPORTANT NOTE: When you restart your VM to work on this project, we recommend that you request 6 GB of RAM for the VM, so that your system will run more efficiently than it would with less memory.

Overall, your program should receive as input:

  - An integer r between 1 and 4, indicating the relation to extract: 1 is for Live_In, 2 is for Located_In, 3 is for OrgBased_In, and 4 is for Work_For.
  - A real number t between 0 and 1, indicating the "extraction confidence threshold."
  - A "seed query" q, given as a list of words in double quotes, corresponding to a plausible tuple for the relation to extract (e.g., "bill gates microsoft" for relation Work_For).
  - An integer k greater than 0, indicating the number of tuples that we request in the output.

In addition, as in the reference implementation invocation below, your program should accept your Google Custom Search API key and engine ID.

Then, your program should perform the following steps (an illustrative sketch of this main loop in Java appears after the list):

  1. Initialize X, the set of extracted tuples, as the empty set.
  2. Query your Google Custom Search Engine to obtain the URLs for the top-10 webpages for query q; you can reuse your own code from Project 1 for this part if you so wish.
  3. For each URL from the previous step that you have not processed before (you should skip already-seen URLs, even if this involves processing fewer than 10 webpages in this iteration):
    1. Retrieve the corresponding webpage; if you cannot retrieve the webpage (e.g., because of a timeout), just skip it and move on, even if this involves processing fewer than 10 webpages in this iteration.
    2. Extract the actual plain text from the webpage using Apache Tika or your preferred toolkit (see above).
    3. Annotate the text with the Stanford CoreNLP software suite and, in particular, with the Stanford Relation Extractor (see above), to extract all instances of the relation specified by input parameter r. See below for details on how to perform this step. Note that you should only consider an extracted relation to be an instance of r if r has the highest extraction confidence among all relation types, as explained below.
    4. Identify the tuples for r that have an associated extraction confidence of at least t and add them to set X.
  4. Remove exact duplicates from set X: if X contains tuples that are identical to each other, keep only the copy that has the highest extraction confidence and remove from X the duplicate copies. (You do not need to remove approximate duplicates, for simplicity.)
  5. If X contains at least k tuples, return the top-k such tuples sorted in decreasing order by extraction confidence, together with the extraction confidence of each tuple, and stop.
  6. Otherwise, select from X a tuple y such that (1) y has not been used for querying yet and (2) y has an extraction confidence that is highest among the tuples in X that have not yet been used for querying. (You can break ties arbitrarily.) Create a query q from tuple y by just concatenating the attribute values together, and go to Step 2. If no such y tuple exists, then stop. (ISE has "stalled" before retrieving k high-confidence tuples.)
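
To make the control flow concrete, here is a minimal, hedged sketch of this main loop in Java. It is not our reference implementation: searchGoogle, fetchPlainText, extractTuples, nextUnusedQuery, and printTopK are hypothetical helper methods standing in for Steps 2 through 6, ExtractedTuple is a hypothetical record, and tuples are represented as plain strings for brevity.

import java.util.*;

// Hypothetical container for one extracted tuple and its extraction confidence.
record ExtractedTuple(String tuple, double confidence) {}

Set<String> seenUrls = new HashSet<>();
Set<String> usedQueries = new HashSet<>();
Map<String, Double> X = new HashMap<>();   // Step 1: X starts empty; maps tuple -> best confidence
String q = seedQuery;                      // the input seed query
while (q != null) {
    usedQueries.add(q);
    for (String url : searchGoogle(q)) {                   // Step 2: top-10 URLs for q
        if (!seenUrls.add(url)) continue;                  // Step 3: skip already-seen URLs
        String text = fetchPlainText(url);                 // Steps 3.a-3.b; null if retrieval fails
        if (text == null) continue;
        for (ExtractedTuple e : extractTuples(text, r)) {  // Steps 3.c-3.d
            if (e.confidence() >= t) {
                X.merge(e.tuple(), e.confidence(), Math::max);  // Step 4: keep the best duplicate
            }
        }
    }
    if (X.size() >= k) {                   // Step 5: enough tuples extracted
        printTopK(X, k);                   // print top-k in decreasing order of confidence
        break;
    }
    q = nextUnusedQuery(X, usedQueries);   // Step 6: best unused tuple, or null if ISE has stalled
}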

Performing the Annotation and Information Extraction Steps

Steps 3.c and 3.d above require that you run the Stanford CoreNLP and Relation Extractor software to annotate the plain text from each webpage and also to extract tuples for the target relation r.

Relation extraction is a complex task that generally operates over text that has been annotated with appropriate tools. In particular, the Stanford CoreNLP software suite that you will use in this project provides a variety of annotators for text, to be applied to text instances in sequence.

For your project, you should focus on "relation" annotation (https://stanfordnlp.github.io/CoreNLP/relation.html), which is what you need for the Stanford Relation Extractor. Relation annotation requires six annotators, namely, tokenize, ssplit, pos, lemma, ner, and parse. You can find examples of how to specify and apply annotators at https://stanfordnlp.github.io/CoreNLP/api.html. In particular, for relation extraction you should specify the six annotators in Java as follows. (IMPORTANT NOTE: The Python version of this process is slightly different; please refer to the README file and example in https://github.com/infobiac/PythonNLPCore for details on how to write the Python wrapper counterpart.)
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
props.setProperty("parse.model", "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
props.setProperty("ner.useSUTime", "0");
RelationExtractorAnnotator r = new RelationExtractorAnnotator(props);

IMPORTANT NOTE: You should use "parse" (not "depparse" as suggested in https://stanfordnlp.github.io/CoreNLP/dependencies.html), to avoid problems with the Stanford Relation Extractor.

Unfortunately, the parse annotator is computationally expensive, so for efficiency you need to minimize its use. Specifically, you should not run parse over sentences that do not contain named entities of the right type for the relation of interest r. The required named entities for each relation type are as follows:

  - Live_In: a person and a location
  - Located_In: two locations
  - OrgBased_In: an organization and a location
  - Work_For: a person and an organization

So to annotate the text, you should implement two pipelines. You should run the first pipeline, which consists of tokenize, ssplit, pos, lemma, and ner, for the full text that you extracted from a webpage. The output will identify the sentences in the webpage text together with the named entities, if any, that appear in each sentence.

Then, you should run the second pipeline, which includes the expensive parse annotator, separately over each sentence that contains the required named entities for the relation of interest, as specified above. Note that the two named entities might appear in either order in a sentence and this is fine. The second pipeline consists of tokenize, ssplit, pos, lemma, ner, and parse.
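
As an illustration, here is a hedged Java sketch of this two-pipeline organization. Here webpageText is the plain text from Step 3.b, props is the Properties object shown above, and requiredNerTags is a hypothetical helper returning the named-entity tags required for relation r.

import java.util.*;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// First (cheap) pipeline: no parse annotator; run once over the full webpage text.
Properties nerProps = new Properties();
nerProps.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
StanfordCoreNLP nerPipeline = new StanfordCoreNLP(nerProps);
Annotation doc = new Annotation(webpageText);
nerPipeline.annotate(doc);

// Second (expensive) pipeline: the six annotators, including parse.
StanfordCoreNLP parsePipeline = new StanfordCoreNLP(props);

for (CoreMap s : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
    // Gather the named-entity tags that appear in this sentence.
    Set<String> tags = new HashSet<>();
    for (CoreLabel token : s.get(CoreAnnotations.TokensAnnotation.class)) {
        tags.add(token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
    }
    // requiredNerTags(r) is a hypothetical helper returning the entity tags
    // needed for relation r (e.g., a person and an organization for Work_For).
    if (tags.containsAll(requiredNerTags(r))) {
        Annotation docTmp = new Annotation(s.get(CoreAnnotations.TextAnnotation.class));
        parsePipeline.annotate(docTmp);   // run the expensive pipeline on this sentence only
        // ... then apply the RelationExtractorAnnotator to docTmp (see below)
    }
}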

After running the second pipeline of annotators over a sentence (using annotators such as those in props above), the Stanford Relation Extractor (which you can think of as a classifier; it is invoked as the RelationExtractorAnnotator in the Stanford CoreNLP software suite) predicts whether any of four relation types is present in the sentence, namely, Live_In, Located_In, OrgBased_In, or Work_For, and with what extraction confidence. A fifth option is "_NR," for "no relation." You can access the annotations as follows:
// s below is a CoreMap holding one sentence identified by the first pipeline.
Annotation docTmp = new Annotation(s.get(CoreAnnotations.TextAnnotation.class));
r.annotate(docTmp); // r is the RelationExtractorAnnotator declared above; class RelationMention contains the output of this annotation.

As an example, consider the output of the full process for a sentence mentioning Gates and Microsoft.

Each relation type is listed together with the associated extraction confidence score. (Please ignore the "start" and "end" fields, which are not that meaningful.) The most likely relation type is then the one with "type=Work_For," with a score of 0.28470059995802677. The second most likely case is "no relation" (i.e., "type=_NR"), with a slightly lower score, followed by Live_In, OrgBased_In, and Located_In, all with even lower scores. Furthermore, the two entities found in the sentence are "Gates" (of type "PEOPLE") and "Microsoft" (of type "ORGANIZATION").

For each RelationMention, you should consider that its relation type is the one with the highest extraction confidence score (and hence you should ignore the other types). Furthermore, you should only consider a RelationMention in Steps 3.c and 3.d above if the highest-score relation type coincides with the relation that you are extracting.
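
For instance, assuming docTmp has just been annotated as above, here is one hedged way to apply this rule in Java. RelationMention and MachineReadingAnnotations are in the edu.stanford.nlp.ie.machinereading.structure package; targetRelation (the name of relation r) and the threshold t stand in for your input parameters.

import java.util.List;
import edu.stanford.nlp.ie.machinereading.structure.MachineReadingAnnotations;
import edu.stanford.nlp.ie.machinereading.structure.RelationMention;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.stats.Counter;
import edu.stanford.nlp.stats.Counters;
import edu.stanford.nlp.util.CoreMap;

for (CoreMap sentence : docTmp.get(CoreAnnotations.SentencesAnnotation.class)) {
    List<RelationMention> mentions =
        sentence.get(MachineReadingAnnotations.RelationMentionsAnnotation.class);
    if (mentions == null) continue;
    for (RelationMention m : mentions) {
        Counter<String> probs = m.getTypeProbabilities();  // scores for all five types
        String bestType = Counters.argmax(probs);          // highest-confidence type
        double conf = probs.getCount(bestType);
        // Consider the mention only if its top type is the target relation r
        // and the confidence is at least the threshold t (Steps 3.c-3.d).
        if (bestType.equals(targetRelation) && conf >= t) {
            // form the tuple from the mention's entity arguments and add it to X
        }
    }
}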

What You Should Submit

  1. Your well-commented Java or Python code, which should follow the format of our reference implementation (see below).
  2. A README file including the following information:
    1. Your group name on CourseWorks ("Project 2 Group n"), your name and Columbia UNI, and your partner's name and Columbia UNI.
    2. A list of all the files that you are submitting.
    3. A clear description of how to run your program. Note that your project must compile/run under Ubuntu in a Google Cloud VM. Provide all commands necessary to install the required software and dependencies for your program.
    4. A clear description of the internal design of your project.
    5. A detailed description of how you carried out Step 3 in the "Description" section above.
    6. Your Google Custom Search Engine API Key and Engine ID (so we can test your project).
    7. Any additional information that you consider significant.
  3. A transcript of the run of your program on input parameters: 4 0.35 "bill gates microsoft" 10 (i.e., for r=4, t=0.35, q="bill gates microsoft," and k=10). The format of your transcript should closely follow the format of the corresponding session of our reference implementation.

Grading

A part of your grade will be based on the correctness of your overall system. Another part will be based on the number of querying iterations that your system takes to extract the requested number of tuples: ideally, this number should be no higher than that of our reference implementation. (We will not grade you on the run-time efficiency of each individual iteration, as long as you implement the two annotator "pipelines" described above.) We will also grade your submission on the quality of your code, the quality of the README file, and the quality of your transcript.

How to Submit

  1. Create a directory named <groupn>-proj2, where you should replace <groupn> with your Project 2 Group as specified on CourseWorks (for example, if your group is "Project 2 Group 9," then the directory should be named group9-proj2).
  2. Copy the source code files into the <groupn>-proj2 directory, and include all the other files that are necessary for your program to run.
  3. Tar and gzip the <groupn>-proj2 directory, to generate a single file <groupn>-proj2.tar.gz, which is the file that you will submit (see the example command after this list).
  4. Log in to CourseWorks and select the site for our class. To submit this file, you need to be in the Class view (not the Group view) and then upload your file to the "Project 2" assignment under Assignments. Submit file <groupn>-proj2.tar.gz.
  5. Separately, submit your uncompressed README file as well as your uncompressed transcript file, as two separate files.
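
For example, assuming your group is "Project 2 Group 9," the following command, run from the directory that contains group9-proj2, produces the single file to submit:

tar -cvzf group9-proj2.tar.gz group9-proj2
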
In summary, you need to submit on CourseWorks exactly three files: (1) your <groupn>-proj2.tar.gz file with your code, (2) your uncompressed README file, and (3) your uncompressed transcript file. You should submit these materials only once per team (not once per student).

Reference Implementation

We created a reference implementation for this project. To run the reference implementation, ssh as "guest-user" to the VM running at 35.196.34.40 (i.e., open a terminal and type "ssh guest-user@35.196.34.40"). Use the password for this VM that Min included in the email to you with your Google coupon code at the beginning of the semester. (This is the same password that you used to access the Project 1 reference implementation.) After you have logged into the guest-user account, run the following from the home directory (i.e., from /home/guest-user):
/home/paparrizos/runProj2 <google api key> <google engine id> <r> <t> <q> <k>
where:

  - <google api key> is your Google Custom Search API key;
  - <google engine id> is your Google Custom Search Engine ID;
  - <r> is an integer between 1 and 4, indicating the relation to extract (1 for Live_In, 2 for Located_In, 3 for OrgBased_In, and 4 for Work_For);
  - <t> is the "extraction confidence threshold," a real number between 0 and 1;
  - <q> is the seed query, a list of words in double quotes (e.g., "bill gates microsoft");
  - <k> is an integer greater than 0, indicating the number of tuples requested in the output.

Please adhere to the format of the reference implementation for your submission and your transcript file. You can also use the reference implementation to gauge how good your overall system should be: ideally, the number of querying iterations that your system takes to extract the requested number of tuples should be no higher than that of our reference implementation.