CS1004 Homework #5
Due on Tuesday, April 19, 2005 at 11:00am

There are two parts to this homework: a written component worth 8 points, and a programming component worth 17 points.  Submission instructions are available here.

Note: parts in red are revisions/clarifications (last revised 4/7/05).

Written questions

As described in the homework submission instructions, you may submit this as a hardcopy, or as a file along with your programming problems in one of four formats (Word, PDF, HTML, or plaintext).  Make sure to put at least your name and the section number on top of the homework whether it's submitted in written or electronic form, and if it's submitted electronically, make sure you name your file correctly. 

Note that problems assigned from Schneider/Gersting or Lewis/Loftus are the exercise problems at the end of each chapter, not the practice problems or self-review questions.  (The practice/self-review problems are optional, and solutions for them are provided in the book.  For obvious reasons, the solutions for the exercises are not. ;-))

  1. (3 points) Schneider/Gersting exercise 2.14.
  2. (2 points) Schneider/Gersting exercise 3.8 (ignore the "Compare the number of exchanges done here..." part).
  3. (1 point) Schneider/Gersting exercise 3.9.
  4. (1 point) Schneider/Gersting exercise 3.18.
  5. (1 point) Lewis/Loftus exercise 7.3.

Programming problems

As described below, you will submit this part of the assignment as five files: three .java files (one source file per problem), a README file, and a typescript showing that each function works, similar to the one pictured below.  Make sure to comment your code; you may lose points if you don't.  Also, make sure you're familiar with the code covered in class and in the book. We've covered most of these topics already, and the assignment becomes reasonably straightforward if you're up to speed with the course material.

The Problem

We are interested in doing simple statistical calculations on ASCII files: in particular, we're interested in the frequency with which words appear in a body of text.  Such a program has applications in data compression and natural-language processing.  For this assignment, you will write an interactive program that reads a text file, counts all the words in the file, and stores the words and their frequencies in an array. Once you have this word frequency information, your program will answer questions about it. For example: How many unique words were in the file? Which words have the top N (say, 10) frequencies?

We will walk you through some of the design, asking you to code certain classes and methods. Please pay particular attention to the specifications provided. For example, if we ask for a method that returns the total number of words as an int, then your method should do only that. It should not print anything unless specified.

Ready? Here we go!

At a high level, here's the plan: we are going to build three classes: Word, WordArray, and WordBank. WordBank is our driver program, so it will be the only class with a main method (in fact, main will be the only method in that class). WordBank will open the text file whose words we want to count and read each word. Every time it reads a word, it will tell the WordArray to add the word. The WordArray will determine whether this is a new word or one it has already seen, and it will act accordingly. To do this, WordArray will have an array of Words; Word is a class that holds a word and its frequency of occurrence.

  1. (2 points) First, build the class to store the information for one word, called Word. Here is the UML diagram for Word:

    It is a rather straightforward class that will store a word (a String) and its frequency (an int) as private member variables. The constructor initializes the word to an initial String value, passed as a parameter, and the frequency to 1.  In addition, write the following methods:
    No other accessor methods are needed for this assignment.
  2. (8 points) We now need a class that stores a collection (array) of Words and manages the array. We'll call this class WordArray. You should implement WordArray; it has the following UML diagram:

    (1 point) The wordList is your array of words. (Use the array syntax we discussed in class; do not use an ArrayList object!) Your list should have the capacity to store 1000 words initially. count is an int that will keep track of how many (unique) words are in your wordList (and will tell you where to put the next word). You should initialize these appropriately in your default constructor (which will take no parameters).  Finally, you should write the following methods.

  3. (7 points) Write the WordBank driver class. WordBank should first take a filename as a command-line parameter, create a Scanner for the file, and read all the words from the file one word at a time, adding each to your local WordArray.  (3 points)

    Recall that the default Scanner breaks tokens (via the next() method call) on whitespace and does not ignore punctuation in the file. However, the Scanner for your file can be customized to treat non-word characters as delimiters in addition to whitespace, thereby effectively ignoring punctuation. (In Java's regular expressions, the word characters are a-z, A-Z, 0-9, and the underscore.)  To do this, use the following code to create your file scanner:

      Scanner fileScan = new Scanner(new File(filename));
      fileScan.useDelimiter("[\\s\\W]+");

    where filename is the name of the file you want to open. (This method for changing the delimiter is non-optimal, as it breaks hyphenated words, possessives, etc. Can you think of a better way to do it? One point of extra credit if you can correctly handle words with hyphens and apostrophes.)
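    To see what this delimiter does to a line of text, here is a minimal sketch that applies the same pattern to a String instead of a file. The class name DelimiterDemo and the tokenize helper are made up for this illustration and are not part of the assignment.

```java
import java.util.Scanner;

class DelimiterDemo {
    // Hypothetical helper: returns the tokens of text, separated by single
    // spaces, using the same delimiter pattern given in the assignment.
    static String tokenize(String text) {
        Scanner scan = new Scanner(text);
        scan.useDelimiter("[\\s\\W]+");
        StringBuilder tokens = new StringBuilder();
        while (scan.hasNext()) {
            if (tokens.length() > 0) {
                tokens.append(' ');
            }
            tokens.append(scan.next());
        }
        scan.close();
        return tokens.toString();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Don't stop: the well-known cat's toy."));
    }
}
```

    Running this prints "Don t stop the well known cat s toy": the punctuation is gone, but the apostrophes and the hyphen have also split words into pieces.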

    Once you have read all the words in the file, sort them (so that getTopWords will work).  Next, use WordArray's getUniqueWordCount to print out the total number of unique words, create a new Scanner for user input, and repeatedly prompt the user until they hit enter at an empty prompt to quit the program.

    $ java WordBank test.txt
    test.txt has 865 unique words.
    Enter a word to get its frequency, a number to
    list the top N words, or a blank line to quit.
    > a
    Word a occurs 50 times.
    > 10
    the with frequency 191
    I with frequency 97
    to with frequency 74
    and with frequency 71
    of with frequency 64
    a with frequency 50
    my with frequency 43
    in with frequency 41
    was with frequency 40
    her with frequency 35
    > (user just hits Enter)
    $
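    The sorting step mentioned above has to cope with an array that is only partly filled. One possible approach, sketched below, sorts just the used portion of the array by descending frequency. The Entry class is a hypothetical stand-in for Word (its fields are assumptions), and SortSketch and sortByCountDescending are illustrative names; you may equally write your own sort, such as one of those covered in class.

```java
import java.util.Arrays;

class SortSketch {
    // Hypothetical stand-in for the assignment's Word class.
    static class Entry {
        final String text;
        final int count;

        Entry(String text, int count) {
            this.text = text;
            this.count = count;
        }
    }

    // Sorts only the first `used` slots, highest count first.
    // Slots from `used` onward (still null) are left untouched.
    static void sortByCountDescending(Entry[] list, int used) {
        Arrays.sort(list, 0, used, (a, b) -> Integer.compare(b.count, a.count));
    }

    public static void main(String[] args) {
        Entry[] list = new Entry[1000];   // capacity, mostly empty
        list[0] = new Entry("a", 50);
        list[1] = new Entry("the", 191);
        list[2] = new Entry("I", 97);
        sortByCountDescending(list, 3);
        for (int i = 0; i < 3; i++) {
            System.out.println(list[i].text + " with frequency " + list[i].count);
        }
    }
}
```

    Passing the range 0 to used matters: sorting the whole array would make the comparator dereference the null slots beyond the last stored word.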


    For full credit, you should handle both searching for words (2 points) and listing the top N words (2 points) by calling the appropriate method in the WordArray and printing out the returned result.  If the user enters a word that doesn't exist or enters a value of N less than 1, print an error.  Apart from these two scenarios, you can assume "valid" input. Hint: java.lang's Character class has a few utility methods that help you determine whether a character is a digit or a letter; you can grab the first character of the String containing a line of user input, use these methods to decide what kind of input the user has made, and then process it accordingly.
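    As a rough illustration of the hint, the sketch below classifies a line of input by its first character using Character.isDigit and Character.isLetter. The class InputKindDemo, the classify helper, and the labels it returns are made up for this example; how you act on each kind of input is up to your WordBank.

```java
class InputKindDemo {
    // Hypothetical helper: labels a line of user input by its first character.
    static String classify(String line) {
        if (line.isEmpty()) {
            return "quit";            // blank line: the user wants to stop
        }
        char first = line.charAt(0);
        if (Character.isDigit(first)) {
            return "number";          // e.g. "10": a top-N request
        }
        if (Character.isLetter(first)) {
            return "word";            // e.g. "the": a frequency lookup
        }
        return "other";               // anything else, e.g. a leading '-'
    }

    public static void main(String[] args) {
        System.out.println(classify("10"));
        System.out.println(classify("the"));
        System.out.println(classify(""));
    }
}
```

    Note that an input such as "0" still starts with a digit, so the check for N less than 1 has to happen after the number is parsed (for example, with Integer.parseInt).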

    By the way, if you want sample ASCII files to test your code, check out Project Gutenberg -- you can play around with classic literature of various shapes and sizes. Here's Machiavelli's The Prince, for example. Note that some of the larger files may be quite slow, so we encourage you to first test your code against a small sample text file of your own (or Google for one).