For this assignment, hardcopies (only) of the first Part are required, and both hardcopies and electronic submissions of the second and third Parts are required. The course web page provides directions for electronic submission. Hardcopies are due in class, as usual.
Read Sections 8.4.1 and 11.3 about the Structured Query Language (SQL) and Database queries.
Also, read the first 2 pages of Chapter 8 about why there is not one single high-level programming language that is used universally. By the way, although the text uses C++ and we will instead use Java, the two are very similar (looking) as far as the (limited) things we'll do with Java this semester. Look through Chapter 7 and Section 8.2 if you are interested in seeing information about other high level languages. They are all very similar in that when you are programming, they each need you to type in about the same level of detail as you would specify for a "pseudo-code" algorithm.
While you are reading you can always get some extra perspective by looking at where you are within the table of contents. The table of contents breaks down the organization of the text, which in turn follows the levels of abstraction depicted on the front cover!
The database you will use is in ~es66/W1001/films/. In that directory you will find a database of films ("films.txt") and a "README" file that lists the fields given for each film. Copy the database of films (and the README) into a directory of your own:
cp ~es66/W1001/films/* .
If you want to see a more complete
on-line database of movies check out the "Internet Movie Database" at
www.imdb.com.
You will do queries on this database. We will describe the queries abstractly using the Structured Query Language (SQL) as in the reading above. SQL is often the way a person describes a query to a machine. But, for this assignment, to actually make the queries happen automatically with a computer, you will use the versatility of the Unix Operating System. In particular, you will create a "Unix pipeline", which starts with "cat films.txt" and then pipes that to another Unix command. For example, with the Unix pipeline "cat films.txt | grep 'Robert De Niro'" you will enact the following query:
SELECT *
FROM films.txt
WHERE actors INCLUDES "Robert De Niro"
This is because "grep" will output all lines that have the given string of characters (characters means letters of the alphabet or numerals, not parts in a movie!). In this case 'Robert De Niro' is the character string -- I put this string in single quotes since it includes spaces -- you have to do that also -- double quotes also works. Now, if De Niro had directed a movie, it would also appear since "grep" just looks anywhere on the whole line. But for the sake of this assignment you can assume that no actor directs and no director acts (which I think is true for our small database anyway). However, we all know that De Niro's directorial debut, "A Bronx Tale", was smashing!
Ok, for the next example let's do a fancier query:
SELECT title
FROM films.txt
WHERE rating = "R" AND grade = "3 stars"
The Unix pipeline for this is "cat films.txt | grep ' R ' | grep '3 stars' | cut -d':' -f1" . Note that the "AND" is done by piping the output
of one "grep" into the input of another "grep". Each "grep" filters
out the ones that match the pattern. Also note that ' R ' has spaces
on either side of it. This is because otherwise "grep" will
look for any occurrence of "R", even as part of a bigger word like "Ronin". The
last command, "cut", pulls the "title" field from each line, ignoring
the rest of the lines (try out the pipeline without the last command).
Make use of the guide of common Unix filters (hard copy handed out to you in class) to do the following problems. Note that you will not necessarily need to use everything on that guide. For each problem, you must submit the Unix pipeline and the results of the query. You can put this all neatly into a text file that you edit with Pico (or Emacs) -- that would be a good way since you can then print out this file for hardcopy submission, and electronically submit it. It is pretty easy to cut and paste from your Unix window to another window that has Pico running in it. Using Pico is a nice skill; you'll be editing Java source code with it for the last two assignments of this semester.
SELECT director
FROM films.txt
WHERE rating = "R" AND grade = "3 stars"
This one differs from the example above in that it asks for director, not title.
SELECT title, time
FROM films.txt
WHERE rating = "PG" AND grade = "2 stars"
Note: put spaces around "PG" in the grep line in order to avoid
also picking out PG-13 movies.
SELECT title
FROM films.txt
WHERE rating =\= " R " (i.e. the rating does not equal R)
This requires doing a "NOT". Check out the "-v" option of "grep" on
the attached page. Also, remember to put spaces around the "R" in
the grep line so it doesn't find occurrences of "R" that are
part of a bigger word. For this one, do not put the results
of the query in the file you will submit. Instead, create
a separate file with the results (put "> kidmovies" at the end
of the pipeline, but be careful -- it erases the file "kidmovies" or
whatever name you choose, if such a file already exists).
SELECT *
FROM films.txt
WHERE director = "Mimi Leder" OR actors INCLUDES "Kate Capshaw"
This one has an "OR", which has to be handled differently than "AND". How
can you do this? Hint: you may need more than one pipeline to find
the comprehensive answer to this SQL query.
Thanks to Andrew Kosoresow and Michael Grossberg for help developing part III above.
email: evs at cs dot columbia dot edu