810:062 Session 10

Towards Programs that Process Text

810:062
Computer Science II
Object-Oriented Programming

An Opening Exercise

Write a quick-and-dirty main() program named CharacterCount that displays the number of non-space characters on the command line after the command. For example:

    mac os x > java CharacterCount
    0

    mac os x > java CharacterCount a
    1

    mac os x > java CharacterCount a bc def ghij
    10

Here is the quickie. It's nothing special, really. Remember, args is just an array of Strings. Your code needs to add the length of each String in the array to a running total. This is a standard looping pattern.
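
A minimal sketch of the looping pattern described above, in case it helps to see it written out. The helper name countCharacters is our own invention for illustration, and this is only one of many ways to write the program.

```java
// A sketch of CharacterCount. The helper method countCharacters is a
// hypothetical name, introduced so the loop can be used on any array.
class CharacterCount
{
    // Adds the length of each String in the array to a running total.
    static int countCharacters( String[] args )
    {
        int total = 0;
        for ( int i = 0; i < args.length; i++ )
        {
            total += args[i].length();
        }
        return total;
    }

    public static void main( String[] args )
    {
        System.out.println( countCharacters( args ) );
    }
}
```

The shell splits the command line on spaces before Java ever sees it, so unquoted arguments contain no spaces and the total of their lengths is the number of non-space characters. A quoted argument such as "a b" would arrive as one String, and this sketch would count its space, too.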

Some New Java: `StringTokenizer`

Study EchoWordsInArgumentV1.java, which takes a single command-line argument and echoes all of the words that it contains, one per line.

    mac os x > java EchoWordsInArgumentV1 "a bc def ghij"
    a
    bc
    def
    ghij

    mac os x > java EchoWordsInArgumentV1 "eugene is a dandy fellow"
    eugene
    is
    a
    dandy
    fellow

Our new tool is the StringTokenizer. To "tokenize" means to break something down into the parts (tokens) that make it up. A StringTokenizer tokenizes a String. (Surprise!) Its interface lets us access the tokens one at a time.

You can create a StringTokenizer using a constructor that takes a string argument to process.
We send a StringTokenizer one of two messages: hasMoreElements() and nextElement(). The former returns true or false, depending on whether the tokenizer has a token we haven't seen yet. The latter returns the next token we haven't seen yet, if one exists.

We use a common loop to process the sequence of Strings that the tokenizer finds.

        while ( tokenizer.hasMoreElements() )
        {
            String word = (String) tokenizer.nextElement();
            // process word
        }

You can use a loop just like this one whenever you work with a StringTokenizer. Replace "process word" with whatever your program needs to do.
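
Here is a small, self-contained sketch that puts the loop to work. It is not the course's EchoWordsInArgumentV1.java; the class name EchoWordsSketch and the helper wordsIn are hypothetical names chosen for illustration. "Process word" here simply means collecting the word in a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// A sketch only: EchoWordsSketch and wordsIn are hypothetical names.
class EchoWordsSketch
{
    // Returns the whitespace-separated tokens of text, in order.
    static List<String> wordsIn( String text )
    {
        List<String> result = new ArrayList<String>();
        StringTokenizer tokenizer = new StringTokenizer( text );

        while ( tokenizer.hasMoreElements() )
        {
            String word = (String) tokenizer.nextElement();
            result.add( word );               // "process word"
        }
        return result;
    }

    public static void main( String[] args )
    {
        if ( args.length > 0 )
        {
            // Echo the words of the first argument, one per line.
            for ( String word : wordsIn( args[0] ) )
            {
                System.out.println( word );
            }
        }
    }
}
```

The cast to String is needed because nextElement() is declared to return an Object.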

But a default StringTokenizer "breaks" only on whitespace. That means, whenever it sees whitespace, it thinks it has found a boundary between two tokens. Every other character counts as part of a token. But consider this case:

    mac os x > java EchoWordsInArgumentV1 "Eugene, you are a dandy fellow, and a scholar\!"
    Eugene,
    you
    are
    a
    dandy
    fellow,
    and
    a
    scholar!

We may want to process only the words, not strings that contain punctuation characters, such as "Eugene," and "scholar!".

A StringTokenizer can do this for us if we tell it what characters to treat as delimiters. When we create a StringTokenizer, we can tell it what characters to use to break the string into words. Consider the small change in EchoWordsInArgumentV2.java:

      String delimiters = " .?!()[]{}|?/&\\,;:-\'\"\t\n\r";

      StringTokenizer words = new StringTokenizer( args[0], delimiters );

Now:

    mac os x > java EchoWordsInArgumentV2 "Eugene, you are a dandy fellow, and a scholar\!"
    Eugene
    you
    are
    a
    dandy
    fellow
    and
    a
    scholar
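
A sketch of the two-argument constructor in a complete program, for experimenting. The class name DelimiterSketch and the helper onePerLine are hypothetical; the delimiter string is copied from the fragment above.

```java
import java.util.StringTokenizer;

// A sketch only: DelimiterSketch and onePerLine are hypothetical names.
class DelimiterSketch
{
    // Same delimiter characters as in the fragment above.
    static final String DELIMITERS = " .?!()[]{}|?/&\\,;:-\'\"\t\n\r";

    // Returns the words of text, each followed by a newline character.
    static String onePerLine( String text )
    {
        StringBuilder result = new StringBuilder();
        StringTokenizer words = new StringTokenizer( text, DELIMITERS );

        while ( words.hasMoreElements() )
        {
            String word = (String) words.nextElement();
            result.append( word ).append( '\n' );
        }
        return result.toString();
    }

    public static void main( String[] args )
    {
        if ( args.length > 0 )
        {
            System.out.print( onePerLine( args[0] ) );
        }
    }
}
```

Every character in the delimiter string splits tokens, and delimiters never appear in the output. One consequence worth noticing: because the apostrophe and hyphen are delimiters, a word such as "don't" comes out as two tokens, "don" and "t".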

One more issue... Sometimes when we are processing a long stream of words, the mixture of upper- and lowercase characters can cause a problem. Suppose that we are creating an index of the words in a file. So we would need to sort the words we find:

    mac os x > java EchoWordsInArgumentV2 "Eugene, you are a dandy Computer Scientist, and a scholar\!" | sort
    Computer
    Eugene
    Scientist
    a
    a
    and
    are
    dandy
    scholar
    you

The capitalized words appear out of order, because:

The ASCII values of uppercase letters are less than those for lowercase letters.
Unix's sort program, like most sort programs, compares strings character by character using these ASCII values, not dictionary ordering, which would ignore case.
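
Java's own String comparison behaves the same way, so we can see the effect without leaving Java. The sketch below is an illustration only; the class name SortOrderSketch and the helper sortedCopy are hypothetical names.

```java
import java.util.Arrays;

// A sketch only: SortOrderSketch and sortedCopy are hypothetical names.
class SortOrderSketch
{
    // Returns a sorted copy of words. Arrays.sort orders Strings with
    // String.compareTo, which compares character values one by one.
    static String[] sortedCopy( String[] words )
    {
        String[] copy = words.clone();
        Arrays.sort( copy );
        return copy;
    }

    public static void main( String[] args )
    {
        String[] words = { "Eugene", "you", "are", "a", "dandy",
                           "Computer", "Scientist", "and", "a", "scholar" };

        // 'C' has a smaller character value than 'a', so this prints true.
        System.out.println( "Computer".compareTo( "a" ) < 0 );

        String[] sorted = sortedCopy( words );
        for ( int i = 0; i < sorted.length; i++ )
        {
            System.out.println( sorted[i] );
        }
    }
}
```

The sorted words come out in the same order as in the sort output above: the three capitalized words first, then the lowercase ones.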

The StringTokenizer can't do much for us here, but Java Strings can. We can ask a String for an all-lowercase version of itself. Consider the one new line in EchoWordsInArgumentV3.java:

      String word = (String) words.nextElement();
      word = word.toLowerCase();
      System.out.println( word );

And now:

    mac os x > java EchoWordsInArgumentV3 "Eugene, you are a dandy Computer Scientist, and a scholar\!" | sort
    a
    a
    and
    are
    computer
    dandy
    eugene
    scholar
    scientist
    you

The lesson of all this? We can use a StringTokenizer to access individual words from a string. Eventually, we will want to process more than one line of text, or text that already exists in a file. How can we do that?

More New Java: File I/O

Study Echo.java, which echoes all the words in one file to an output file, one per line.

    mac os x > less hamlet.txt
    1604


    THE TRAGEDY OF HAMLET, PRINCE OF DENMARK


    by William Shakespeare



    Dramatis Personae

      Claudius, King of Denmark.
      Marcellus, Officer.
      Hamlet, son to the former, and nephew to the present king.
      ...

    mac os x > java Echo hamlet.txt hamlet.out

    mac os x > less hamlet.out
    1604
    the
    tragedy
    of
    hamlet
    prince
    of
    denmark
    by
    william
    shakespeare
    dramatis
    personae
    claudius
    king
    of
    denmark
    marcellus
    ...

Our new tools are Java's reader and writer classes, with which we can do input/output. For now, we will focus on two of them, BufferedReader and PrintWriter, which use helper objects to access files in your directory.

We will use one constructor on each class.
We will send one input message to BufferedReaders, readLine().
We will send one of two output messages to PrintWriters, print() and println().

We will use a common loop for accessing lines of text from an input file, one line at a time.

        while ( true )
        {
            buffer = inputFile.readLine();

            if ( buffer == null ) break;

            // process buffer
        }

To process the line of text, we will sometimes use the common StringTokenizer loop we learned above to access the words on a line one at a time.
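
Putting the two loops together might look like the sketch below. It is not the course's Echo.java. The class name EchoSketch and the helper echoWords are hypothetical, and using FileReader and FileWriter as the helper objects is our assumption. The delimiter string and the toLowerCase() step are borrowed from the earlier examples.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.StringTokenizer;

// A sketch only: EchoSketch and echoWords are hypothetical names.
class EchoSketch
{
    static final String DELIMITERS = " .?!()[]{}|?/&\\,;:-\'\"\t\n\r";

    // Writes every word read from inputFile to outputFile, one per line,
    // in lowercase.
    static void echoWords( BufferedReader inputFile, PrintWriter outputFile )
        throws IOException
    {
        while ( true )
        {
            String buffer = inputFile.readLine();

            if ( buffer == null ) break;      // no more lines to read

            StringTokenizer words = new StringTokenizer( buffer, DELIMITERS );
            while ( words.hasMoreElements() )
            {
                String word = (String) words.nextElement();
                outputFile.println( word.toLowerCase() );
            }
        }
    }

    public static void main( String[] args ) throws IOException
    {
        // Assumption: FileReader and FileWriter serve as the helper objects.
        BufferedReader inputFile  = new BufferedReader( new FileReader( args[0] ) );
        PrintWriter    outputFile = new PrintWriter( new FileWriter( args[1] ) );

        echoWords( inputFile, outputFile );

        inputFile.close();
        outputFile.close();    // closing also flushes any buffered output
    }
}
```

Keeping the loops in a method that takes a BufferedReader and a PrintWriter means the same code works with any source of lines, not only files.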

Now watch this:

    mac os x > cat hamlet.out | sort | less
    1
    1
    1
    1
    1
    1
    1604
    a
    a
    a
    ...

A Quick Exercise

Turn Echo.java into WordCount, which prints the number of lines, words, and characters in a file to standard output. For example:

    mac os x > java WordCount hamlet.txt
    4792 32889 130156

Here is one possible solution. How much about it is different from Echo? How much is the same?
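
For comparison, here is a rough sketch of one way to do the counting. It is not the solution linked above, and it makes its own assumptions: a word is any whitespace-separated token, and the character count is the total length of the lines as returned by readLine(), which leaves out the line terminators. With these definitions the totals will not necessarily match the numbers shown in these notes. The names WordCountSketch and count are hypothetical.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.StringTokenizer;

// A sketch only: WordCountSketch and count are hypothetical names.
class WordCountSketch
{
    // Returns { lines, words, characters } for the text in inputFile.
    static int[] count( BufferedReader inputFile ) throws IOException
    {
        int lines = 0;
        int words = 0;
        int characters = 0;

        while ( true )
        {
            String buffer = inputFile.readLine();

            if ( buffer == null ) break;

            lines++;
            characters += buffer.length();    // excludes the line terminator
            words += new StringTokenizer( buffer ).countTokens();
        }
        return new int[] { lines, words, characters };
    }

    public static void main( String[] args ) throws IOException
    {
        BufferedReader inputFile = new BufferedReader( new FileReader( args[0] ) );
        int[] totals = count( inputFile );
        inputFile.close();

        System.out.println( totals[0] + " " + totals[1] + " " + totals[2] );
    }
}
```

The line-reading loop is the same one Echo uses; only the "process buffer" step changes. How a program defines "word" and "character" determines the totals it reports.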

Let's compare WordCount's output to that of the built-in Unix command wc, which does the same task:

    mac os x > java WordCount hamlet.txt   ;; our Java program
    4792 32889 130156

    mac os x > wc hamlet.txt               ;; standard Unix command
        4792   31957  196505 hamlet.txt

Quick Exercise: Why do you suppose that our word and character counts don't match, and in opposite directions?

What Have We Learned?

The Programming Ideas: standard loops for processing lines in a file and tokens in a string
The Java Ideas: the StringTokenizer, BufferedReader, and PrintWriter classes; command-line arguments
The Unix Ideas: command-line arguments, escape characters, piping, the sort and wc commands

Wrap Up

Reading -- Download and study today's code, then play with it: compile it, run it, and modify it. If you want a large file for testing your programs, use the text of Hamlet from today's examples, which is available from the Hamlet (zipped) link.
Programming -- Homework 3 is due today. Homework 4 is available and is due in one week.

Eugene Wallingford ..... wallingf@cs.uni.edu ..... February 10, 2005