Session 10

Towards Programs that Process Text


810:062
Computer Science II
Object-Oriented Programming


An Opening Exercise

Write a quick-and-dirty main() program named CharacterCount that displays the number of non-space characters on the command line after the command. For example:

    mac os x > java CharacterCount
    0

    mac os x > java CharacterCount a
    1

    mac os x > java CharacterCount a bc def ghij
    10

Here is the quickie. It's nothing special, really. Remember, args is just an array of Strings. Your code needs to add the length of each String in the array to a running total. This is a standard looping pattern.
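A minimal sketch of that looping pattern follows. It is not necessarily the solution handed out in class; the helper name countCharacters is an assumption, introduced only so that the loop can be called apart from main().

```java
// A sketch of the running-total loop; countCharacters is a hypothetical helper name.
class CharacterCount {

    // Adds the length of each String in the array to a running total.
    public static int countCharacters(String[] args) {
        int total = 0;
        for (int i = 0; i < args.length; i++) {
            total = total + args[i].length();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countCharacters(args));
    }
}
```

The shell splits the command line on spaces before Java ever sees it, so the Strings in args contain no spaces unless an argument is quoted.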


Some New Java: StringTokenizer

Study EchoWordsInArgumentV1.java, which takes a single command-line argument and echoes all of the words that it contains, one per line.

    mac os x > java EchoWordsInArgumentV1 "a bc def ghij"
    a
    bc
    def
    ghij

    mac os x > java EchoWordsInArgumentV1 "eugene is a dandy fellow"
    eugene
    is
    a
    dandy
    fellow

Our new tool is the StringTokenizer. To "tokenize" means to break something down into the parts (tokens) that make it up. A StringTokenizer tokenizes a String. (Surprise!) Its interface lets us access the tokens one at a time.

You can use a loop just like this one whenever you work with a StringTokenizer. Replace "process word" with whatever your program needs to do.
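The standard loop has roughly the following shape. This is a sketch, not the text of EchoWordsInArgumentV1.java: the class name TokenLoopSketch and the helper wordsIn are made up for illustration, and collecting each token into a list stands in for "process word".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical class and method names, used only for illustration.
class TokenLoopSketch {

    // Returns the whitespace-separated tokens of text, in order.
    public static List<String> wordsIn(String text) {
        List<String> result = new ArrayList<String>();
        StringTokenizer words = new StringTokenizer(text);
        while (words.hasMoreTokens()) {
            String word = words.nextToken();
            result.add(word);               // "process word" goes here
        }
        return result;
    }

    public static void main(String[] args) {
        // Fall back to a sample string when no argument is given.
        String text = (args.length > 0) ? args[0] : "a bc def ghij";
        for (String word : wordsIn(text)) {
            System.out.println(word);
        }
    }
}
```

nextToken() returns a String directly. The older nextElement() returns the same value typed as Object, so it needs a cast to String.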

But a default StringTokenizer "breaks" only on whitespace. Whenever it sees whitespace, it thinks it has found a boundary between two tokens. Every other character, punctuation included, counts as part of a token. Consider this case:

    mac os x > java EchoWordsInArgumentV1 "Eugene, you are a dandy fellow, and a scholar\!"
    Eugene,
    you
    are
    a
    dandy
    fellow,
    and
    a
    scholar!

We may want to process only the words, not strings that contain punctuation characters, such as "Eugene," and "scholar!".

A StringTokenizer can do this for us if we tell it which characters to treat as delimiters. We do that when we create the tokenizer, by passing the delimiter characters as a second argument. Consider the small change in EchoWordsInArgumentV2.java:

      String delimiters = " .?!()[]{}|?/&\\,;:-\'\"\t\n\r";

      StringTokenizer words = new StringTokenizer( args[0], delimiters );

Now:

    mac os x > java EchoWordsInArgumentV2 "Eugene, you are a dandy fellow, and a scholar\!"
    Eugene
    you
    are
    a
    dandy
    fellow
    and
    a
    scholar
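One detail is worth checking for yourself: a comma followed by a space is two delimiters in a row, yet no empty token appears between "Eugene" and "you". A StringTokenizer skips over runs of delimiters. The sketch below reuses the delimiter string from the excerpt above; the class name DelimiterSketch and the method split are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical class and method names, used only for illustration.
class DelimiterSketch {

    // The delimiter string from the excerpt above.
    static final String DELIMITERS = " .?!()[]{}|?/&\\,;:-\'\"\t\n\r";

    // Returns the tokens of text, breaking on any character in DELIMITERS.
    public static List<String> split(String text) {
        List<String> result = new ArrayList<String>();
        StringTokenizer words = new StringTokenizer(text, DELIMITERS);
        while (words.hasMoreTokens()) {
            result.add(words.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("Eugene, you are a dandy fellow, and a scholar!"));
        System.out.println(split("well-known"));
    }
}
```

Because the hyphen and the apostrophe are in the delimiter set, "well-known" comes back as the two tokens "well" and "known", and "don't" as "don" and "t". Whether that is what we want depends on the program.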

One more issue... Sometimes when we are processing a long stream of words, the mixture of upper- and lowercase characters can cause a problem. Suppose that we are creating an index of the words in a file. Then we would need to sort the words we find:

    mac os x > java EchoWordsInArgumentV2 "Eugene, you are a dandy Computer Scientist, and a scholar\!" | sort
    Computer
    Eugene
    Scientist
    a
    a
    and
    are
    dandy
    scholar
    you

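The capitalized words appear out of order because sort, as run here, compares words by character code, and in ASCII every uppercase letter comes before every lowercase letter.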

The StringTokenizer can't do much for us here, but Java Strings can. We can ask a String for an all-lowercase version of itself. Consider the one new line in EchoWordsInArgumentV3.java:

      String word = (String) words.nextElement();
      word = word.toLowerCase();
      System.out.println( word );

And now:

    mac os x > java EchoWordsInArgumentV3 "Eugene, you are a dandy Computer Scientist, and a scholar\!" | sort
    a
    a
    and
    are
    computer
    dandy
    eugene
    scholar
    scientist
    you
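The same ordering effect shows up inside Java. Here is a sketch, assuming that we sort with Arrays.sort: String's natural order compares character codes, so 'C' (code 67) sorts ahead of 'a' (code 97). The class name CaseSortSketch and both method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical class and method names, used only for illustration.
class CaseSortSketch {

    // Returns a sorted copy, using String's natural (character-code) order.
    public static String[] sorted(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Returns a sorted copy of the words after converting each to lowercase.
    public static String[] sortedLowercase(String[] words) {
        String[] copy = new String[words.length];
        for (int i = 0; i < words.length; i++) {
            copy[i] = words[i].toLowerCase();
        }
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        String[] words = { "you", "are", "a", "dandy", "Computer", "Scientist" };
        System.out.println(Arrays.toString(sorted(words)));
        // prints [Computer, Scientist, a, are, dandy, you]
        System.out.println(Arrays.toString(sortedLowercase(words)));
        // prints [a, are, computer, dandy, scientist, you]
    }
}
```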

The lesson of all this? We can use a StringTokenizer to access individual words from a string. Eventually, we will want to process more than one line of text, or text that already exists in a file. How can we do that?


More New Java: File I/O

Study Echo.java, which echoes all the words in one file to an output file, one per line.

    mac os x > less hamlet.txt
    1604


    THE TRAGEDY OF HAMLET, PRINCE OF DENMARK


    by William Shakespeare



    Dramatis Personae

      Claudius, King of Denmark.
      Marcellus, Officer.
      Hamlet, son to the former, and nephew to the present king.
      ...

    mac os x > java Echo hamlet.txt hamlet.out

    mac os x > less hamlet.out
    1604
    the
    tragedy
    of
    hamlet
    prince
    of
    denmark
    by
    william
    shakespeare
    dramatis
    personae
    claudius
    king
    of
    denmark
    marcellus
    ...

Our new tools come from Java's reader and writer classes, which handle input and output. For now, we will focus on two of them, BufferedReader and PrintWriter, each of which uses a helper object to access a file in your directory.
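Here is a sketch of how these pieces typically fit together. It is not the text of Echo.java. FileReader and FileWriter play the helper role; the class name EchoSketch, the method echoWords, and the reuse of the V2 delimiter string are all assumptions.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.StringTokenizer;

// Hypothetical class and method names, used only for illustration.
class EchoSketch {

    // Assumed delimiter set, copied from the EchoWordsInArgumentV2 excerpt.
    static final String DELIMITERS = " .?!()[]{}|?/&\\,;:-\'\"\t\n\r";

    // Reads lines until end of input and writes each word, in lowercase, on its own line.
    public static void echoWords(BufferedReader in, PrintWriter out) throws IOException {
        String line = in.readLine();
        while (line != null) {                  // readLine() returns null at end of input
            StringTokenizer words = new StringTokenizer(line, DELIMITERS);
            while (words.hasMoreTokens()) {
                out.println(words.nextToken().toLowerCase());
            }
            line = in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java EchoSketch <input file> <output file>");
            return;
        }
        // The FileReader and FileWriter are the helpers that open the actual files.
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        PrintWriter out = new PrintWriter(new FileWriter(args[1]));
        echoWords(in, out);
        in.close();
        out.close();                            // closing flushes buffered output to the file
    }
}
```

Writing echoWords in terms of a BufferedReader and a PrintWriter, rather than file names, means the same loop works on any source of text, not just files.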

Now watch this:

    mac os x > cat hamlet.out | sort | less
    1
    1
    1
    1
    1
    1
    1604
    a
    a
    a
    ...


A Quick Exercise

Turn Echo.java into WordCount, which prints the number of lines, words, and characters in a file to standard output. For example:

    mac os x > java WordCount hamlet.txt
    4792 32889 130156

Here is one possible solution. How much of it is different from Echo? How much is the same?
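For reference, one way such a program could look is sketched below. It is not the solution distributed with these notes. It assumes that a "word" is a token under the V2 delimiter string and that only characters inside tokens are counted; the class name WordCountSketch and the method count are hypothetical.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.StringTokenizer;

// Hypothetical class and method names, used only for illustration.
class WordCountSketch {

    // Assumed delimiter set, copied from the EchoWordsInArgumentV2 excerpt.
    static final String DELIMITERS = " .?!()[]{}|?/&\\,;:-\'\"\t\n\r";

    // Returns { lines, words, characters } for everything the reader supplies.
    // Assumption: only characters that belong to a token are counted.
    public static int[] count(BufferedReader in) throws IOException {
        int lines = 0;
        int words = 0;
        int characters = 0;
        String line = in.readLine();
        while (line != null) {
            lines++;
            StringTokenizer tokens = new StringTokenizer(line, DELIMITERS);
            while (tokens.hasMoreTokens()) {
                words++;
                characters += tokens.nextToken().length();
            }
            line = in.readLine();
        }
        return new int[] { lines, words, characters };
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java WordCountSketch <input file>");
            return;
        }
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        int[] counts = count(in);
        in.close();
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);
    }
}
```

The reading loop is the one from Echo. Only the "process word" step changes, from printing each token to updating counters, and the PrintWriter disappears because the result goes to standard output.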

Let's compare WordCount's output to that of the built-in Unix command wc, which does the same task:

    mac os x > java WordCount hamlet.txt   ;; our Java program
    4792 32889 130156

    mac os x > wc hamlet.txt               ;; standard Unix command
        4792   31957  196505 hamlet.txt

Quick Exercise: Why do you suppose that our word and character counts don't match wc's, and that they miss in opposite directions, with our word count higher and our character count lower?


What Have We Learned?


Wrap Up


Eugene Wallingford ..... wallingf@cs.uni.edu ..... February 10, 2005