Homework Assignment 6

An Interactive Text Analysis Program


810:062
Computer Science II
Object-Oriented Programming


Due: Thursday, March 10, at 8:00 AM


Introduction

Let's write the first version of an interactive text analysis program. From Session 10 through Session 12, we know how to process files of text one word at a time. Now, let's use those techniques to delve into the characteristics of written documents.

This assignment focuses on just one characteristic, vocabulary. Most authors use a fairly predictable vocabulary. When they write large documents, the set of words they use will usually be roughly the same. Our text analysis program will let a human user begin to explore the vocabulary used in writing a document.

Your programming task is defined below as a sequence of small requirements. Write your program by implementing one requirement at a time in order. That is how we will grade your program.

Programming Advice

As always, you should write your program in the style we have used all semester long:

Take small steps, and your tests will give you feedback as soon as possible. Use the GenerateTest program to create your test class.


New Java for the Assignment

The class java.lang.Math contains a number of useful utility methods for working with numbers, such as Math.max(), Math.min(), Math.random(), and a method you will use on this assignment, Math.log(). Math.log() computes the natural logarithm of a double value. The natural logarithm of a positive value is the power to which the number e must be raised in order to equal the value.
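As a quick illustration of that definition, here is a minimal sketch that prints a few natural logarithms. The class name LogDemo is made up for this example; only Math.log(), Math.exp(), and Math.E come from java.lang.Math.

```java
public class LogDemo {
    public static void main(String[] args) {
        // ln(1) is 0, because e raised to the power 0 equals 1.
        System.out.println(Math.log(1.0));

        // ln(e) is 1 (up to floating-point rounding), because e^1 = e.
        System.out.println(Math.log(Math.E));

        // ln(8) is roughly 2.0794.
        System.out.println(Math.log(8.0));

        // Math.exp() undoes Math.log(), again up to rounding.
        System.out.println(Math.exp(Math.log(8.0)));
    }
}
```

Because the results are doubles, compare them in tests with a small tolerance rather than with ==.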

For this assignment, you will also need to be able to sort an array. The code you wrote for Homework 4 is a good starting point for a sorting method, but I do not expect you to write your own sort. Instead, you can use the Arrays.sort() method to do the job. sort() is a class method defined in the java.util.Arrays utility class. Arrays.sort() takes an array as an argument and returns nothing, but it leaves the argument array sorted in ascending order.

This simple example shows Arrays.sort() in action. Try it out!
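A minimal sketch of such an example follows. The class name SortDemo and the sample values are made up for illustration; Arrays.sort() and Arrays.toString() are the real java.util.Arrays methods.

```java
import java.util.Arrays;

public class SortDemo {
    public static void main(String[] args) {
        int[] counts = { 5, 2, 9, 1, 8 };
        double[] values = { 3.5, -1.0, 8.25, 0.0 };

        // Arrays.sort() returns nothing; it rearranges the array it is given.
        Arrays.sort(counts);
        Arrays.sort(values);

        // Both arrays are now in ascending order.
        System.out.println(Arrays.toString(counts));
        System.out.println(Arrays.toString(values));
    }
}
```

Note that the original order of the array is lost. If you still need it, sort a copy instead.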

Arrays.sort() works on arrays whose values are of the base types, including ints and doubles, with no extra help. If you want to sort an array of Objects, then the class of your objects must:

Feel free to only sort arrays of doubles and ints for now.

If you have any questions about this new Java idea, please ask questions soon! You do not need to scour the web for more information about these classes and methods, beyond the on-line Java documentation linked above.


Tasks

Write tests and code for each of the following requirements, in order. The words in bold indicate message names. Whenever a requirement says the user can "ask whether...", the expected answer is boolean. Whenever a requirement speaks of a "particular" item, then that item will be an argument to the method.

  1. A Bag is a collection of objects. It acts like a set, but it keeps track of how many times an object has been added to it.

    1. An empty Bag answers false when asked if it contains any particular string.

    2. An empty Bag returns 0 when asked for its wordCount.

    3. An empty Bag returns 0 when asked for its wordTotal.

    4. The user can add a particular string to a bag. Then:
      • It answers true when asked if it contains that string. Its answer doesn't change for any other string.
      • If it already contained that string, then its answer when asked for its wordCount will be the same, else its answer increases by 1.
      • Its answer when asked for its wordTotal increases by 1.

    5. The user can ask a Bag how many of its words overlapWith another Bag's words. The other Bag is sent as an argument with the overlapWith message. The answer is, of course, an int.

    6. The user can ask a Bag for its logDistribution. The log distribution is an array containing the natural logarithms of its word frequencies (the number of times each word was added), in descending order.

      For example, if a Bag contains the following words and counts:

           (a, 5), (b, 7), (c, 2), (d, 8), (e, 8), (f, 3), (g, 1), (h, 5)
      

      then logDistribution will return this array:

           [2.0794415416798357   2.0794415416798357   1.9459101490553132
            1.6094379124341003   1.6094379124341003   1.0986122886681096
            0.6931471805599453   0.0]
      

      (That's one useful test case, but it's not the simplest. Be sure to test some simpler cases first!)
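The arithmetic behind that example can be checked with a short sketch. It assumes that "word frequencies" means the raw counts, which the numbers above support (the natural logarithm of 8 is about 2.0794). The class name LogDistributionSketch and the helper logsDescending() are hypothetical, and the sketch works on a plain array of counts; how a Bag stores its counts is left to you.

```java
import java.util.Arrays;

public class LogDistributionSketch {

    // Hypothetical helper: natural logs of the given counts, largest first.
    public static double[] logsDescending(int[] counts) {
        double[] logs = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            logs[i] = Math.log(counts[i]);
        }

        // Arrays.sort() gives ascending order ...
        Arrays.sort(logs);

        // ... so reverse the array in place to get descending order.
        for (int i = 0, j = logs.length - 1; i < j; i++, j--) {
            double temp = logs[i];
            logs[i] = logs[j];
            logs[j] = temp;
        }
        return logs;
    }

    public static void main(String[] args) {
        // The counts for words a through h in the example above.
        int[] counts = { 5, 7, 2, 8, 8, 3, 1, 5 };
        double[] logs = logsDescending(counts);
        for (int i = 0; i < logs.length; i++) {
            System.out.println(logs[i]);
        }
    }
}
```

Sorting and then reversing is only one way to get descending order; it is used here because Arrays.sort() on a double[] always sorts ascending.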

  2. A Document is an object that represents a specific text. For now, we will have Document use a Bag to store what it knows about the text, and a Document will have only two behaviors:

    1. When a Document is created it is given the name of a file that contains its text. The constructor will read the file and add all of the words to a Bag instance variable.

      (You can use one of our early I/O examples, say Echo, as a basis for this code.)

    2. When asked for its logDistribution, a Document returns the log distribution of its Bag.

  3. Once you have written your Bag and Document classes, write a simple driver class named DocumentDemo that takes the name of a text file as a command-line argument. The main() method creates a Document on this text file and then writes the Document's log distribution to standard output.
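For reference, here is one possible sketch of reading text one word at a time and of taking a file name from the command line. It is not the Echo example from class. The class name WordReaderSketch and the helper countWords() are hypothetical, and the sketch assumes that a "word" is any run of characters separated by whitespace. It only counts the words; a Document would add each one to its Bag instead.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.StringTokenizer;

public class WordReaderSketch {

    // Hypothetical helper: counts the whitespace-separated words in a text source.
    public static int countWords(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source);
        int total = 0;
        String line = in.readLine();
        while (line != null) {
            StringTokenizer words = new StringTokenizer(line);
            while (words.hasMoreTokens()) {
                String word = words.nextToken();  // a Document would add this word to its Bag
                total++;
            }
            line = in.readLine();
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // The file name arrives as the first command-line argument.
        if (args.length != 1) {
            System.err.println("Usage: java WordReaderSketch <text-file>");
            return;
        }
        Reader file = new FileReader(args[0]);
        try {
            System.out.println(countWords(file));
        } finally {
            file.close();
        }
    }
}
```

Taking a Reader rather than a file name makes the helper easy to test with a StringReader, so a test does not need a file on disk.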


Deliverables

By the due date and time, submit the files

Be sure that your submission follows all homework submission requirements.


Eugene Wallingford ..... wallingf@cs.uni.edu .... March 3, 2005