CS W1001 Introduction to Computer Science

Homework 3 (11 points)

Unix, Database Queries (and a little Java)

Due Date:

Tuesday, March 21.

For this assignment, hardcopies (only) of the first Part are required, and both hardcopies and electronic submissions of the second and third Parts are required. The course web page provides directions for electronic submission. Hardcopies are due in class, as usual.

Reading:

"An Invitation to Computer Science," Section 5.4 and page 233 and Sections 6.1, 6.2, 6.4 (but pages 268-284 of this section are optional), and 6.5, and page 291 and Sections 7.1 and 7.2 (but not 7.2.2, and don't worry about understanding the assembly language instructions -- assembly language is just a slightly-friendlier version of (the notoriously technical) machine language).

Read Sections 8.4.1 and 11.3 about the Structured Query Language (SQL) and Database queries.

Also, read the first 2 pages of Chapter 8 about why there is not one single high-level programming language that is used universally. By the way, although the text uses C++ and we will instead use Java, the two are very similar (looking) as far as the (limited) things we'll do with Java this semester. Look through Chapter 7 and Section 8.2 if you are interested in seeing information about other high level languages. They are all very similar in that when you are programming, they each need you to type in about the same level of detail as you would specify for a "pseudo-code" algorithm.

While you are reading you can always get some extra perspective by looking at where you are within the table of contents. The table of contents breaks down the organization of the text, which in turn follows the levels of abstraction depicted on the front cover!

Part I: Computer Organization Theory Questions (3 points)

Show your work or explain your answer to each of the following:

Chapter 5, exercises 4,5,7, and 11.
(This isn't a computer organization question, but I'm sticking it here anyway.) Describe two different algorithms to do the query below in Part II, question #1. Hint: The first method can be very similar to the algorithm in Figure 2.9 (page 46), especially after modifying it according to exercise 13 (page 62). The second method corresponds to the algorithm the computer is following when you use a Unix pipeline, as in Part II below.

Part II: Database Queries with Unix (5 points)

The database you will use is in ~es66/W1001/films/. In that directory you will find a database of films ("films.txt") and a "README" file that lists the fields given for each film. Copy the database of films (and the README) into a directory of your own:

    cp ~es66/W1001/films/* .

If you want to see a more complete on-line database of movies check out the "Internet Movie Database" at www.imdb.com.

You will do queries on this database. We will describe the queries abstractly using the Structured Query Language (SQL) as in the reading above. SQL is often the way a person describes a query to a machine. But, for this assignment, to actually make the queries happen automatically with a computer, you will use the versatility of the Unix Operating System. In particular, you will create a "Unix pipeline", which starts with "cat films.txt" and then pipes that to another Unix command. For example, with the Unix pipeline "cat films.txt | grep 'Robert De Niro'" you will enact the following query:

     SELECT *
     FROM films.txt
     WHERE actors INCLUDES "Robert De Niro"

This is because "grep" will output all lines that have the given string of characters (characters means letters of the alphabet or numerals, not parts in a movie!). In this case 'Robert De Niro' is the character string -- I put this string in single quotes since it includes spaces -- you have to do that also -- double quotes also works. Now, if De Niro had directed a movie, it would also appear since "grep" just looks anywhere on the whole line. But for the sake of this assignment you can assume that no actor directs and no director acts (which I think is true for our small database anyway). However, we all know that De Niro's directorial debut, "A Bronx Tale", was smashing!

Ok, for the next example let's do a fancier query:

     SELECT title
     FROM films.txt
     WHERE rating = "R" AND grade = "3 stars"

The Unix pipeline for this is "cat films.txt | grep ' R ' | grep '3 stars' | cut -d':' -f1" . Note that the "AND" is done by piping the output of one "grep" into the input of another "grep". Each "grep" filters out the ones that match the pattern. Also note that ' R ' has spaces on either side of it. This is because otherwise "grep" will look for any occurrence of "R", even as part of a bigger word like "Ronin". The last command, "cut", pulls the "title" field from each line, ignoring the rest of the lines (try out the pipeline without the last command).

Make use of the guide of common Unix filters (hard copy handed out to you in class) to do the following problems. Note that you will not necessarily need to use everything on that guide. For each problem, you must submit the Unix pipeline and the results of the query. You can put this all neatly into a text file that you edit with Pico (or Emacs) -- that would be a good way since you can then print out this file for hardcopy submission, and electronically submit it. It is pretty easy to cut and paste from your Unix window to another window that has Pico running in it. Using Pico is a nice skill; you'll be editing Java source code with it for the last two assignments of this semester.

Create a Unix pipeline to do the following query:
```
     SELECT director
     FROM films.txt
     WHERE rating = "R" AND grade = "3 stars"
```
This one differs from the example above in that it asks for director, not title.
Create a Unix pipeline to do the following query:
```
     SELECT title, time
     FROM films.txt
     WHERE rating = "PG" AND grade = "2 stars"
```
Note: put spaces around "PG" in the grep line in order to avoid also picking out PG-13 movies.
I want to select a movie that I can bring a child to see. How can I create a Unix pipeline to do the following SQL query:
```
     SELECT title
     FROM films.txt
     WHERE rating =\= " R "  (i.e. the rating does not equal R)
```
This requires doing a "NOT". Check out the "-v" option of "grep" on the attached page. Also, remember to put spaces around the "R" in the grep line so it doesn't find occurrences of "R" that are part of a bigger word. For this one, do not put the results of the query in the file you will submit. Instead, create a separate file with the results (put "> kidmovies" at the end of the pipeline, but be careful -- it erases the file "kidmovies" or whatever name you choose, if such a file already exists).
Create a Unix pipeline to compute the number of movies that are not rated. The final output must be a number, not a list of movies. Hint: "wc". Another hint: since some movies are "not rated" and others are "NOT RATED" or "Not Rated", use the -i option of grep to have it ignore capitalization.
Create a Unix pipeline to show the number of movies with each grade (this is called a histogram). The grade of each movie is the number of stars, from 1-4. "NA stars" means "not available". Hint: use "cut" to get the grade, then "uniq" with the "-c" option. Woops, "uniq" expects you to give it to "sort" first.
Use Unix to do the following query:
```
     SELECT *
     FROM films.txt
     WHERE director = "Mimi Leder" OR actors INCLUDES "Kate Capshaw"
```
This one has an "OR", which has to be handled differently than "AND". How can you do this? Hint: you may need more than one pipeline to find the comprehensive answer to this SQL query.
Design one or more Unix pipelines to find out which rating (i.e., "R", "PG", etc.) get's the best and worst grades (i.e. "1 stars", "2.5 stars", etc.). Show the results and draw conclusions with a couple English sentences. Perhaps "R" has better movies 'cause they can show exciting stuff. But perhaps "R" has worse movies since the film-maker can rely on superficial garbage to sell her/his movie. There is more than one way to do this problem -- it is open-ended. But you should find out how many movies of each rating get high or low ratings, somehow.
Think of a query that interests you. Show the SQL version of this query, the Unix pipeline to make it happen, and the results of posing the query to our database of films. Some queries may be difficult to do with Unix pipelines - if you get stuck, change your query. If the output of the query is long, put it in a different file and submit it electronically only.

Part III: Your First Java Program (3 points)

For this first jaunt with Java, you will copy a Java source code file we have already created for you. You will try it out, and then make a minor modification, and then try it again.

Use mkdir to create a new subdirectory for this part of this assignment.
Copy the file ~es66/W1001/src/Inchtocm.java to your own directory. (Ignore the other files in that src directory until otherwise instructed -- they are not really ready yet.)
Use "cat" or "pico" to look at the file. This file is Java source code.
"javac Inchtocm.java" will compile this program.
"ls" will show you there is a new file, Inchtocm.class. This is the Java bytecode that can be executed by the Java emulator.
"java Inchtocm 3" will execute it with the emulator. Be sure to include the "3". Now try it with a different number other than 3. This number is the input to the program. If you try it without a number, it will give you a nasty error message.
WARNING: Remember, "javac" is to compile, and "java" is to execute -- a difference of one letter only! This is easy to confuse - I do it all the time.
Now, edit and compile Inchtocm.java (saving it in Cmtoinch.java) so that it performs the opposite conversion. To get this to work, you only need to change the formula. However, only changing the formula means the "inches" variable will have the length in centimeters, and the "cms" variable will have the length in inches. Therefore, to do a complete job, you also need to change each "cms" to "inches" and vice versa.
Take notice: since you are now working with a file of java source named "Cmtoinch.java", you must change each occurrence of "Inchtocm" in the file to "Cmtoinch", or it will not compile without an error. Be careful: capitalization matters.
Note that when you try it out you still have to put a number on the command line to provide input. However, in this case, the number is the length in centimeters, not in inches.
You must make sure your new version actually works. It must compile with no error messages (if you make a mistake when you edit it, even a minor mistake such as deleting a semicolon, the compiler will give you an error message when you try to compile), and it must work when you try it out. This way you will know you did it correctly -- like the computer checks your work for you.
When you submit, only submit your new ".java" file (the Java source code). DO NOT submit any ".class" files (Java bytecode) -- these take up much more space than the source code files, and your TA can create the ".class" file herself if she needs to by using the Java compiler. This guideline applies to all Java programming assignments in the remaining homeworks this semester.

Thanks to Andrew Kosoresow and Michael Grossberg for help developing part III above.

email: evs at cs dot columbia dot edu