Due Date: Friday, October 6, 5:00 p.m. ET
You will carry out this project in
teams of two. If you can't find a teammate, please follow these steps:
- Post a message on Piazza asking for a teammate (this is the best way).
- Send email to Christophe right
away (and definitely before
Monday, September 18, at 5:00 p.m. ET) asking him to pair you up with another
student without a teammate. Christophe will do his best to
find you a teammate.
You do not need to notify us of your team composition. Instead,
you and your teammate should assign yourselves to an available "Project 1 Group" on CourseWorks.
For this, go to the "People" section of CourseWorks for our
course. Please join one of the already-created "Project 1 Group
n" groups. Please do not
create your own group. (We have already created sufficiently
many such groups to accommodate all students in the course.) You
will then upload your final electronic submission on CourseWorks
exactly once per team, rather than once per student.
Here are some additional important points:
are even remotely considering doing so, please be considerate
and notify your teammate immediately.
- On a related note, do not wait until the day before the
deadline to start working on the project, just to realize then
that your teammate has dropped the class. It is your
responsibility to start working on the project and spot any
problems with your teammate early on.
- You can do this project by yourself if you so wish. Be aware,
however, that you will have to do exactly the same project as
two-student teams will.
In this project, you will
implement an information retrieval system that exploits
user-provided relevance feedback to improve the search results
returned by Google.
The relevance feedback mechanism is described in Singhal: Modern Information
Retrieval: A Brief Overview, IEEE Data Engineering Bulletin, 2001, as well as in
Chapter 9, “Relevance Feedback & Query Expansion,” of the
Manning, Raghavan, and Schütze Introduction to
Information Retrieval textbook, available online.
User queries are often ambiguous. For example, a user who
issues a query [jaguar] might be after documents about
the car or the animal, and in fact search engines like Bing and
Google return pages on both topics among their top 10 results
for the query. In this project, you will design and implement a
query-reformulation system to disambiguate queries and improve
the relevance of the query results that are produced. Here’s how
your system, which should be written in Java or
Python (your choice), should work:
1. Receive as input a user query, which is
a list of words, and a value between 0 and 1 for the target “precision@10” (i.e., the precision that is
desired for the top-10 results for the query, which is the
fraction of pages that are relevant out of the top-10
results).
2. Retrieve the top-10 results for the query from Google, using the Google
Custom Search API (see below) with the default values for the
various API parameters.
3. Present these results to the user, so that the user can mark all the web
pages that are relevant to the intended meaning of the query
among the top-10 results. For each page in the query result,
you should display its title, URL, and description returned by
Google.
NOTE: You should display the exact top-10 results returned by Google for the
query (i.e., you
cannot add or delete pages in the results that Google
returns). Also, the Google Custom Search API has a number of
search parameters. Please do not modify the default values for
these search parameters.
4. If the precision@10 of the results from Step 2
for the relevance judgments of Step 3 is greater than or equal
to the target value, then stop. If the precision@10 of the
results is zero, then you should also stop. Otherwise, use the
pages marked as relevant to automatically (i.e., with
no further human input at this point) derive new
words that are likely to identify more relevant pages.
You may introduce at most 2 new words during each round.
IMPORTANT NOTE 1: You cannot delete any words from the original query or
from the query from the previous iteration; you can just add
words, up to 2 new words in each round. Also, your queries
must consist of just keywords, without any additional
operators (e.g., you cannot use negation, quotes, or any other operator
in your queries).
IMPORTANT NOTE 2: The order
of the words in the expanded query is important. Your
program should automatically consider the alternate ways of
ordering the words in a modified query, and pick the order
that is estimated to be best. In each iteration, you
can reorder all words--new and old--in the query, but you
cannot delete any words, as explained in the note above.
5. Modify the current user query by adding to it the newly derived words and
ordering all words in the best possible order,
as determined in Step 4, and go to Step 2.
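The stopping rule in Step 4 can be illustrated with a minimal Python sketch. The function names `precision_at_10` and `should_stop` are hypothetical and chosen only for this illustration; the sketch assumes the user's judgments arrive as a list of ten booleans, one per result.

```python
def precision_at_10(relevant_flags):
    """Fraction of the top-10 results that the user marked as relevant."""
    if len(relevant_flags) != 10:
        raise ValueError("expected exactly 10 relevance judgments")
    return sum(1 for flag in relevant_flags if flag) / 10


def should_stop(relevant_flags, target):
    """Stop when the target precision is reached or when nothing is relevant."""
    precision = precision_at_10(relevant_flags)
    return precision >= target or precision == 0.0


# Seven of the ten results were marked relevant.
judgments = [True, True, False, True, False, True, True, False, True, True]
print(precision_at_10(judgments))   # 0.7
print(should_stop(judgments, 0.9))  # False, so another round is needed
```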
The key challenge in the project
is in designing Step 4, for which you should be creative and use
the ideas that we discussed in class—as well as the above
bibliography and the course reading materials—as inspiration.
You are welcome to borrow techniques from the research
literature at large (either exactly as published or modified as
much as you feel necessary to get good performance in our
particular query setting), but make sure that you
cite the specific publications on which you based your
solution. As a hint on how to search for relevant
publications, you might want to check papers on “query
expansion” in the main IR conference, SIGIR, at http://www.informatik.uni-trier.de/~ley/db/conf/sigir/index.html.
If you choose to implement a technique from the literature, you
still need to make sure that you adapt the chosen technique as
much as necessary so that it works well for our specific query
setting and scenario, since you will be graded based on how well
your technique works. If you want to do stopword elimination
(this is of course optional), you can find a list of stopwords here.
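As one possible starting point for Step 4, the sketch below scores candidate words in a simplified, Rocchio-inspired way: a word gains weight for each relevant result it appears in and loses weight for each non-relevant one. This is not the reference implementation's method. The names `pick_new_words` and `STOPWORDS`, the tiny stopword set, and the weights `beta=0.75` and `gamma=0.15` (values often quoted for Rocchio in textbooks) are assumptions made for illustration, and the input is assumed to be the snippet text of each result.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would load a fuller list.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def pick_new_words(query_words, relevant_docs, nonrelevant_docs,
                   max_new=2, beta=0.75, gamma=0.15):
    """Return up to max_new words that favor relevant over non-relevant results."""
    scores = Counter()
    for docs, weight in ((relevant_docs, beta), (nonrelevant_docs, -gamma)):
        if not docs:
            continue
        for doc in docs:
            # Count each word once per document (document frequency).
            for term in set(tokenize(doc)):
                scores[term] += weight / len(docs)
    excluded = STOPWORDS | {word.lower() for word in query_words}
    candidates = [(score, term) for term, score in scores.items()
                  if term not in excluded and score > 0]
    # Highest score first; ties are broken alphabetically for determinism.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in candidates[:max_new]]


relevant = ["Jaguar big cat native to the Americas",
            "The jaguar is a large cat species"]
nonrelevant = ["Jaguar luxury cars and SUVs", "Jaguar cars official site"]
print(pick_new_words(["jaguar"], relevant, nonrelevant))
```

The cap of two new words per round and the rule that existing query words are never dropped both come from the project description; everything else here is a design choice that you would likely need to tune.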
You will use the Google
Custom Search API
(https://developers.google.com/custom-search/) in this
project: this is Google’s web service to enable the creation
of customized search engines. As a first step to use the
Google Custom Search API, you will have to sign up for a Computer Science Cloud
account, following the instructions provided here
carefully. The document also explains how you should set up a
VM on the cloud, to develop and run your project.
As a second step, you will
have to sign up for the Custom Search Engine
- Press the "Sign in to
Custom Search Engine" button on the top right corner.
- Create a new search
engine by clicking the “New search engine” button on the top
- Specify the following:
- “Sites to search”
should be "www.wikipedia.com" for now
- “Language” should be
- “Name of search engine”
should be "cs6111"
- Press the “CREATE” button.
- Select “Edit search engine” on the left, choose search
engine “cs6111,” and click "Setup."
- Under the top “Basics” button, select “Sites to search” and
choose “Search the entire web but emphasize included sites.”
- Right below choose the "www.wikipedia.com" site, press the
“Delete” button, and finally press the “Update” button at the
bottom of the page. This will enable the creation of a search
engine to search the entire web but without an emphasis on any
particular website (i.e., you will be using the general Google
search engine).
- Next to the “Details” label, click on “Search engine ID” to
get your search engine key,
which you will need for querying.
- Do not modify or change other settings.
- Google provides two APIs for the Google
Custom Search Engine,
namely, the JSON/ATOM API and the XML API. If you choose to
use the JSON/ATOM API, you will additionally need to obtain a
Google Custom Search JSON/ATOM API key at https://developers.google.com/custom-search/json-api/v1/overview
by clicking on "GET A KEY" at the bottom of the page. The XML
API does not require a key.
You will use your engine key, optionally your JSON/ATOM API key,
and a query as parameters to encode a search request URL. When
requested from a web browser, or from inside a program, this URL
will return a document with the query results. Please refer to the
Google Custom Search APIs documentation for details on the URL
syntax and document schema. You should parse the response document
in your program to extract the title, link, and description of
each query result, so you can use this information in your
algorithm. Here are examples of use of the Google Custom Search
API that should be helpful: Java and Python
(in the Python example, note that q
refers to your query, developerKey
refers to your
Google Custom Search API key, and cx
refers to your search engine key).
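As a rough illustration of the request URL and response parsing for the JSON API, the sketch below builds a URL from the three parameters and extracts the title, link, and snippet of each result. The helper names `build_search_url` and `parse_results` are hypothetical. The sketch assumes the JSON API endpoint `https://www.googleapis.com/customsearch/v1` and a response whose `items` array carries `title`, `link`, and `snippet` fields; please check the API documentation for the authoritative schema. No network call is made here; the sample response is hand-written.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key, engine_id, query):
    """Encode the API key, engine ID, and query into a request URL."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    return API_ENDPOINT + "?" + urlencode(params)


def parse_results(response_text):
    """Extract title, URL, and description from a JSON response document."""
    data = json.loads(response_text)
    return [
        {
            "title": item.get("title", ""),
            "url": item.get("link", ""),
            "description": item.get("snippet", ""),
        }
        for item in data.get("items", [])
    ]


# Hand-written sample response, used instead of a live request.
sample = json.dumps({"items": [
    {"title": "Per Se", "link": "https://example.com/per-se",
     "snippet": "A restaurant in New York City."},
]})
print(build_search_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "per se"))
print(parse_results(sample))
```

Only `key`, `cx`, and `q` are set, so all other search parameters keep their default values, as the project requires.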
By default, the Google Custom Search API has a quota of 100 queries per day.
If you exceed this quota, you can upgrade to 1000 queries per day
for one month for $5 (i.e., $5 for the full month), which will be
deducted from the coupon credit that Columbia provided (see above).
Your submission (see below) should include a transcript of the runs
of your program on the following queries, with a goal of
achieving a value of 0.9 for precision@10:
Look for information on the Per Se
restaurant in New York City, starting with the query [per se].
Look for information on Google cofounder
Sergey Brin, starting with the query [brin].
Look for information on the animal
jaguar, starting with the query [jaguar].
We will check the execution of your program on these three cases, as
well as on some other queries.
What You Should Submit
- Your well-commented
Java or Python code, which should follow the
format of our reference implementation (see below)
- A README file
including the following information:
a) Your group name on CourseWorks
("Project 1 Group n"), your name and Columbia UNI, and your
partner's name and Columbia UNI
b) A list of all the files that you are submitting
c) A clear description of how to run your program. Note
that your project must
compile/run under Ubuntu in a Google Cloud VM.
Provide all commands necessary to install the required
software and dependencies for your program.
d) A clear description of the internal
design of your project
e) A detailed description of your
query-modification method (this is the core component
of the project; see below)
f) Your Google Custom Search Engine
API Key and Engine ID (so we can test your project)
g) Any additional information that you
- A transcript of the runs of your program on the 3
test cases above, with relevant results clearly marked, and
with the rephrased query and precision@10
value for each run. The format of your transcript should
closely follow the format of the interactive session of our
reference implementation (see below).
Your grade will be based on the
effectiveness of your query modification method (which, in turn,
will be reflected in the number of iterations that your system
takes to achieve the target precision, both for the test cases and
for other unseen queries that we will use for grading),
the quality of your code, and the quality of the README file.
How to Submit
In summary, you need to submit on
CourseWorks exactly three
files: (1) your <groupn>-proj1.tar.gz file with
your code, (2) your uncompressed README file, and (3) your
uncompressed query transcript file. You should submit these materials as a team (not once per student).
- Create a directory named <groupn>-proj1,
where you should replace <groupn> with your Project 1
Group as specified on CourseWorks (for example, if your group
is "Project 1 Group 9," then the directory should be named
group9-proj1).
- Copy the source code files into the
<groupn>-proj1 directory, and include all the other files that
are necessary for your program to run.
- Tar and gzip the <groupn>-proj1
directory, to generate a single file
<groupn>-proj1.tar.gz, which is the file that you
should submit.
- Login to CourseWorks at https://courseworks2.columbia.edu/
and select the site for our class. To submit this file, you
need to be in the Class view (not
the Group view) and then upload your file to the "Project 1"
assignment under Assignments. Submit file
<groupn>-proj1.tar.gz.
- Separately, submit your uncompressed README file
as well as your uncompressed
query transcript file, as two separate files.
We created a reference implementation for this project. To run
the reference implementation, ssh as "guest-user" to the VM
running at 188.8.131.52 (i.e., open a terminal and type "ssh
guest-user@188.8.131.52"). Use the password for this VM that
Min included in the email to you with your Google coupon code.
After you have logged into the guest-user account, run the
following from the home directory (i.e., from /home/guest-user):
<google api key> <google engine id> <precision> <query>
- <google api key>
is your Google Custom Search API Key (see above)
- <google engine id>
is your Google Custom Search Engine ID (see above)
- <precision> is
the target value for precision@10, a real number between 0 and 1
- <query> is your
query, a list of words in double quotes (e.g., "Milky Way")
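For your own program, the four arguments above could be read with Python's standard `argparse` module, as in the minimal sketch below. The function name `parse_args` and the attribute `query_words` are hypothetical; the argument order simply mirrors the list above, and the range check on the precision value is an added safeguard rather than a stated requirement.

```python
import argparse


def parse_args(argv=None):
    """Parse <google api key> <google engine id> <precision> <query>."""
    parser = argparse.ArgumentParser(
        description="Relevance feedback over Google Custom Search results")
    parser.add_argument("api_key", help="Google Custom Search API key")
    parser.add_argument("engine_id", help="Google Custom Search Engine ID")
    parser.add_argument("precision", type=float,
                        help="target precision@10, between 0 and 1")
    parser.add_argument("query", help="query words, quoted as one argument")
    args = parser.parse_args(argv)
    if not 0.0 <= args.precision <= 1.0:
        parser.error("precision must be between 0 and 1")
    # The quoted query arrives as one string; split it into individual words.
    args.query_words = args.query.split()
    return args


args = parse_args(["YOUR_API_KEY", "YOUR_ENGINE_ID", "0.9", "Milky Way"])
print(args.precision, args.query_words)
```

Because the shell strips the double quotes, "Milky Way" reaches the program as a single argument containing a space.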
The reference implementation is interactive.
Please adhere to the
format of its relevance feedback session in your submission
and your transcript file.
Also, you can use this reference implementation to give you an idea of how good
your algorithm should be. Ideally, the performance of your
own algorithm, in terms of the number of iterations that the
algorithm takes to achieve a given precision value for a query,
should be at least as good as that of our reference implementation.
Hints and Additional Important
- Your implementation should not
have any graphical user interface. Instead, please include a
plain-text terminal interface, just like that of the reference
implementation that we have provided (see above).
- You are welcome to ignore non-html files when you decide on
what keywords to add to your query in each iteration. (Most
likely there will not be many non-html files among the top-10
results for a query.) In other words, you will get the top-10
results, including perhaps non-html files, and you can just
focus your analysis on the html documents. However, the
queries that you send to Google should not limit the document types
that you receive (you should just include keywords in the queries).
- In each iteration, you can either just use and analyze the
short document "snippets" that Google returns in the query
results or, as an alternative, you can download and analyze
the full pages from the Web. This is completely up to you.
- We will not grade
your project in terms of efficiency.
- You are welcome to use external
resources such as WordNet (see http://wordnet.princeton.edu/).
However, such resources are not encouraged,
because they might introduce substantial "noise" into the
- You should not
query Google inside an
iteration. In other words, you should decide on the
query expansion for the next iteration based on the results
from the previous iteration and their relevance judgments, but
without querying Google again. So the order of the words
should be determined based on the contents of the query
results from the previous iteration. (For one thing, issuing
extra queries would be unlikely to be helpful without new
relevance judgments, since you are likely to get very
different query results --for which you would not have
judgments-- even with small modifications of the queries.)
- If in the first iteration there are no relevant results
among the top-10 pages that Google returns (i.e., precision@10 is zero),
then your program should simply terminate, just as the
reference implementation behaves.
- If in the first iteration there are fewer than 10 results
overall, then your program should simply terminate; there is
no need for your program to handle this case gracefully. (Keep
in mind that this project is about "broad," ambiguous queries,
which typically return well over 10 documents.)
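Since the word order has to be chosen from the previous iteration's results alone, one simple option is to score each candidate ordering by how often its adjacent word pairs occur in the relevant results. The sketch below illustrates that idea; it is only one possible heuristic, not the reference implementation's method. The function name `order_query` is hypothetical, query words are assumed to be lowercase, and trying every permutation is only practical while the query stays short.

```python
import re
from collections import Counter
from itertools import permutations


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def order_query(words, relevant_docs):
    """Pick the word order whose adjacent pairs occur most often in relevant text."""
    bigrams = Counter()
    for doc in relevant_docs:
        tokens = tokenize(doc)
        bigrams.update(zip(tokens, tokens[1:]))

    def score(order):
        return sum(bigrams[(a, b)] for a, b in zip(order, order[1:]))

    # max() keeps the earliest permutation on ties, and permutations() yields
    # the current order first, so the order changes only when it scores higher.
    return list(max(permutations(words), key=score))


relevant = ["The jaguar is a big cat", "Big cat facts about the jaguar"]
print(order_query(["jaguar", "cat", "big"], relevant))
```

No extra Google queries are issued: the scores come entirely from text that was already retrieved and judged.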