Session 28

Processing Files and Text


CS 2530
Intermediate Computing


Opening Exercise: The Point of No Return

Here is a quickie:

Write a class RemoveReturns whose main() method echoes standard input to standard output, except that it replaces each new line character with a space.

In Java source code, the new line character is written '\n'.

For example:

    > less test-in.txt
    a
    b
    c
    1234
    e
    567890
    ff
    ghij
    kl
    12
    34
    56
    78
    90





    next
    > java RemoveReturns test-in.txt
    a b c 1234 e 567890 ff ghij kl 12 34 56 78 90      next > _



A Simple Solution

This sounds like a perfect job for CensorInputStream! We need to replace the new line character with a single space, so we can write a main() method that reads characters from a CensorInputStream that knows to make that substitution. Then all we have to do is write a loop that echoes the characters it receives to stdout.

Whenever possible, let other objects help us do our work! Laziness of this sort is a virtue in object-oriented programs, and object-oriented programmers. (Indeed, it is a virtue for all programmers.)

Of course, if you insist on reinventing wheels, you could write a different main() method that reads characters from an ordinary InputStream, prints a space whenever it sees the new line character, and prints the character itself when not.
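A minimal sketch of that character-at-a-time approach might look like the following. The class name RemoveReturnsByChar and the helper method copy() are made up for illustration, and the code treats each byte as one character, which is a safe assumption only for ASCII text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;

// Hypothetical class name, used only for this sketch.
class RemoveReturnsByChar
{
    // Copies every character from in to out, writing a space
    // in place of each new line character.
    static void copy( InputStream in, PrintStream out ) throws IOException
    {
        int c = in.read();              // read() returns -1 at end of stream
        while ( c != -1 )
        {
            if ( c == '\n' )
                out.print( ' ' );
            else
                out.print( (char) c );  // assumes one byte per character (ASCII)
            c = in.read();
        }
        out.flush();
    }

    public static void main( String[] args ) throws IOException
    {
        copy( System.in, System.out );
    }
}
```

Putting the loop in a method that takes its streams as arguments lets us try it on any InputStream, not just standard input.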

Both of these solutions approach the problem at the level of individual characters. We could instead approach it at the level of lines of text. If we could read whole lines of text, we could echo them back to stdout one at a time, followed by a space. InputStreams don't do that for us, so we will need some new tools.

We often want to work with files and other sources of text at levels higher than individual characters, so it's worth learning some new tools now.

By the end of the period, we will be able to write a line-based version of RemoveReturns.



Some New Java for File I/O and String Processing

The program Echo.java introduces us to a few new Java classes that allow us to work with files of text. Echo echoes the words from one file to another file, one word per line. For example:

    > less hamlet.txt
    1604


    THE TRAGEDY OF HAMLET, PRINCE OF DENMARK


    by William Shakespeare...

    > java Echo hamlet.txt hamlet.out

    > less hamlet.out
    1604
    the
    tragedy
    of
    hamlet
    prince
    of
    denmark
    by
    william
    shakespeare
    ...

We can understand many of the new features you see here in terms of concepts you have already learned.

One design pattern can help us to understand several new ideas relatively quickly! Then we can focus in on a few handy messages, such as BufferedReader.readLine() and PrintWriter.println(), and be relatively productive relatively quickly, too. Note that readLine() returns null (in place of a String) when it reaches the end of the stream, rather than read()'s -1 (in place of a character's integer value).
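To make the two end-of-stream signals concrete, here is a small sketch that counts characters with read() and lines with readLine(). The class and method names are hypothetical, and a StringReader stands in for a file so that the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical class, used only to contrast the two end-of-stream signals.
class SentinelSketch
{
    // read() returns an int: a character's value, or -1 at end of stream.
    static int countChars( BufferedReader in ) throws IOException
    {
        int count = 0;
        while ( in.read() != -1 )
            count++;
        return count;
    }

    // readLine() returns a String, or null at end of stream.
    static int countLines( BufferedReader in ) throws IOException
    {
        int count = 0;
        while ( in.readLine() != null )
            count++;
        return count;
    }

    public static void main( String[] args ) throws IOException
    {
        String text = "to be\nor not\n";
        System.out.println( countChars( new BufferedReader( new StringReader( text ) ) ) );
        System.out.println( countLines( new BufferedReader( new StringReader( text ) ) ) );
    }
}
```

Each count needs its own reader, because a reader that has reached the end of its stream has nothing left to give.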

Another new tool in this code is the StringTokenizer. To "tokenize" means to break something down into the parts (tokens) that make it up. A StringTokenizer tokenizes a String. (Surprise!) Its interface lets us access the tokens one at a time. We generally send a tokenizer two messages: hasMoreElements() and nextElement(). The former returns true or false, depending on whether the tokenizer has a String for us that we haven't seen yet. The latter returns the next token we haven't seen yet, if one exists.

We use a common loop to process the sequence of Strings that the tokenizer gives to us:

    while ( tokenizer.hasMoreElements() )
    {
      String word = (String) tokenizer.nextElement();
      process word
    }

You can use a loop just like this one whenever you work with a StringTokenizer. Replace "process word" with whatever your program needs to do.

But a default StringTokenizer "breaks" only on whitespace. That means, whenever it sees whitespace, it thinks it has found a boundary between two tokens. Every other character counts as part of a token. But consider this case:

    "Eugene, you are a dandy fellow, and a scholar!"

A default StringTokenizer will return:

    Eugene,
    you
    are
    a
    dandy
    fellow,
    and
    a
    scholar!

We may want to process only the words, not strings that contain punctuation characters, such as the ',' in Eugene, and the '!' in scholar!.

A StringTokenizer can do this for us if we tell it which characters to treat as delimiters. We can tell a tokenizer which characters to use to separate the string into words by passing a string of delimiters to its constructor. In Echo, I pass a whole bunch of characters that I do not want to be treated as part of a word, including some characters that need to be "escaped" in a Java program.
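Here is a short sketch of a tokenizer with an explicit delimiter string, applied to the sentence above. The delimiter list is an assumption chosen for illustration and is not necessarily the one Echo uses; the class and method names are hypothetical, too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical class, used only for this sketch.
class TokenizerSketch
{
    // Assumed delimiter set; Echo's actual list may differ.
    // The tab, new line, return, double quote, and backslash
    // characters must be escaped inside a Java string literal.
    static final String DELIMITERS = " \t\n\r.,;:!?()\"\\";

    // Returns the tokens of text, in order, split on DELIMITERS.
    static List<String> words( String text )
    {
        List<String> result = new ArrayList<String>();
        StringTokenizer tokenizer = new StringTokenizer( text, DELIMITERS );
        while ( tokenizer.hasMoreElements() )
        {
            String word = (String) tokenizer.nextElement();
            result.add( word );
        }
        return result;
    }

    public static void main( String[] args )
    {
        String sentence = "Eugene, you are a dandy fellow, and a scholar!";
        for ( String word : words( sentence ) )
            System.out.println( word );
    }
}
```

With ',' and '!' in the delimiter string, the tokenizer hands back Eugene and scholar without their punctuation.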



A Quick Exercise

Turn Echo into WordCount, which prints the number of lines, words, and characters in a file to standard output. For example:

    > java WordCount hamlet.txt
    4463 32885 130145

Here is one possible solution. How much about it is different from Echo? How much is the same?

Let's compare WordCount's output to that of the built-in Unix command wc, which does the same task:

    > wc hamlet.txt               // standard Unix command
        4463   31956  191734 hamlet.txt

Quick Exercise: Why do you suppose that our word and character counts don't match, and in opposite directions?



Working with Standard Input and Output as Files

Sometimes, we'd like to give the user an option of providing a file name or using standard I/O. Most Unix commands work this way. How can we make our Java programs do the same thing?

The critical lines in Echo.java are these:

    buffer = inputFile.readLine();
    outputFile.println( word );

Why are they critical? They are the only lines in our processing code that interact with the files. So, if we want to use standard input in place of the input file, we need a way to change the readLine statement; if we want to use standard output in place of the output file, then we need a way to change the println statement.

BufferedReaders and PrintWriters are virtual. They rely on other readers and writers to help them do their jobs. Unfortunately, standard input and output are streams, not readers and writers.

Let's again take advantage of an object-oriented idea: We ought to be able to substitute one object for another that has the same interface, even if its behavior is different, and let the new object fulfill the responsibilities of the one it replaces.

Java gives us the classes we need: InputStreamReader and OutputStreamWriter. They are decorators that convert stream input into reader input and writer output into stream output, respectively.

Take a look at this new, improved version of Echo. It does the same job as the original, but it allows the user to work with standard input and standard output as well as input and output files. The only changes to this file from the original are in these four set-up statements:

    BufferedReader inputFile = new BufferedReader(
                   new InputStreamReader( System.in ) );
    PrintWriter    outputFile = new PrintWriter(
                   new OutputStreamWriter( System.out ) );

    if ( args.length > 0 )
        inputFile = new BufferedReader( new FileReader( args[0] ) );
    if ( args.length > 1 )
        outputFile = new PrintWriter( new FileWriter( args[1] ) );

By default, the program reads from standard input and writes to standard output. If the user gives one command-line argument, it is the name of an input file. If the user gives more than one command-line argument, the second is the name of an output file. Here are some examples of how the new code works:

    > java Echo
    foo bar baz big
    Eugene Wallingford teaches this course.
    foo
    bar
    baz
    big
    eugene
    wallingford
    teaches
    this
    course

    > java Echo hamlet.txt | less
    1604
    the
    tragedy
    of
    hamlet
    prince
    of
    denmark
    by
    william
    shakespeare
    ...

    > java Echo hamlet.txt hamlet.out
    > less hamlet.out
    1604
    the
    tragedy
    of
    hamlet
    prince
    of
    denmark
    by
    william
    shakespeare
    ...

    > cat hamlet.txt | java Echo | less
    1604
    the
    tragedy
    of
    ...

Whenever you need to read from standard input instead of a file, or write to standard output instead of a file, you can use objects created in this way. If you would like to do something more sophisticated, feel free to look into the details of FileReader, BufferedReader, FileWriter, and PrintWriter, or even InputStreamReader and OutputStreamWriter.

Notice again: the processing code in this program stays exactly the same. This demonstrates yet again a wonderful degree of "say what you mean" and "say it once and only once". Objects give us the power to do this in a variety of ways.
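One way to see this point is to run the same processing method on readers built in two different ways. The sketch below is not the course's Echo.java: the class name, the run() helper, and the short delimiter string are assumptions made for illustration, and a ByteArrayInputStream stands in for a byte stream such as System.in.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.StringTokenizer;

// Hypothetical class, used only for this sketch.
class EchoSketch
{
    // Processing code: it knows only about a BufferedReader and a PrintWriter.
    static void echoWords( BufferedReader inputFile, PrintWriter outputFile )
        throws IOException
    {
        String buffer = inputFile.readLine();
        while ( buffer != null )
        {
            // Assumed delimiters, for illustration only.
            StringTokenizer tokenizer = new StringTokenizer( buffer, " \t.,;:!?" );
            while ( tokenizer.hasMoreElements() )
            {
                String word = (String) tokenizer.nextElement();
                outputFile.println( word.toLowerCase() );
            }
            buffer = inputFile.readLine();
        }
        outputFile.flush();    // push any pending output to the destination
    }

    // Helper for the demonstration: runs echoWords and collects its output.
    static String run( BufferedReader source ) throws IOException
    {
        StringWriter text = new StringWriter();
        echoWords( source, new PrintWriter( text ) );
        return text.toString();
    }

    public static void main( String[] args ) throws IOException
    {
        String input = "Eugene Wallingford teaches this course.\n";

        // A reader built directly on characters ...
        BufferedReader fromReader = new BufferedReader( new StringReader( input ) );

        // ... and a reader decorated around a byte stream, as with System.in.
        BufferedReader fromStream = new BufferedReader(
            new InputStreamReader( new ByteArrayInputStream( input.getBytes() ) ) );

        System.out.println( run( fromReader ).equals( run( fromStream ) ) );
    }
}
```

The echoWords() method never learns where its characters come from or where its words go. Only the set-up code in main() changes.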

Expound with great fervor.

Finally, a closing comment. That is a busy main() method. I'm already eager to find an object in the mess and factor it out of this code into a class, so that I could reuse it in different contexts. We're well on our way. We've managed to separate the creation of the input/output objects from the code that uses them to process a sequence of lines and words. Soon!



Remove Returns Redux

After learning about BufferedReader, we are ready to write a third main() method to echo a file, replacing new line characters with spaces. In some ways, this version is more straightforward than the InputStream versions, and in other ways it is more complex.
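A line-based version could be sketched as follows. The class name RemoveReturnsByLine and the copy() helper are hypothetical, and the set-up in main() borrows the optional file argument idea from the improved Echo rather than reproducing the course's own solution.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

// Hypothetical class name, used only for this sketch.
class RemoveReturnsByLine
{
    // Echoes each line of in to out, followed by a space instead of
    // a line terminator.
    static void copy( BufferedReader in, PrintWriter out ) throws IOException
    {
        String line = in.readLine();    // readLine() strips the line terminator
        while ( line != null )
        {
            out.print( line );
            out.print( ' ' );
            line = in.readLine();
        }
        out.flush();
    }

    public static void main( String[] args ) throws IOException
    {
        // Default to standard input; use a file if one is named.
        BufferedReader in = new BufferedReader( new InputStreamReader( System.in ) );
        if ( args.length > 0 )
            in = new BufferedReader( new FileReader( args[0] ) );

        copy( in, new PrintWriter( new OutputStreamWriter( System.out ) ) );
    }
}
```

This version behaves a little differently from a character-based one at the edges. It prints a trailing space even when the last line has no new line character, and because readLine() also accepts "\r\n" as a line terminator, it removes the '\r' characters as well.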

Whether the latest version is better or worse than the CensorInputStream version depends in part on stylistic preference and in part on the context. That is not unusual. The relative quality of competing designs depends on a lot of factors outside the programs themselves.



Wrap Up



Eugene Wallingford ..... wallingf@cs.uni.edu ..... November 29, 2012