Session 16

Processing Files


CS 1510
Introduction to Computing


Growing a File Formatter

I tend to work in plain text with files formatted to fit in the width of a standard window or printed document (80 or 132 characters wide, historically). But when I move to a word processor, all the hard line returns that I've typed are in the way. I need to delete them, often by hand.

Write a Python program that replaces all of the end-of-line characters in a file with spaces.

Here is the simplest first program I can imagine, based on our cat program from yesterday. It does just what I asked.
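A minimal sketch of what such a first version might look like, not the program from class. It creates a small sample file so that it runs on its own; the name sample-notes.txt and its contents are made up. It also collects the output in a string named result, rather than printing each piece, so the effect is easy to inspect.

```python
# Create a tiny sample file so the sketch runs on its own.
# The name 'sample-notes.txt' and the text are made up for illustration.
sample = open('sample-notes.txt', 'w')
sample.write('first line\nsecond line\n\nnew note\n')
sample.close()

infile = open('sample-notes.txt', 'r')
result = ''
for line in infile:
    # drop the end-of-line character and put a space in its place
    result = result + line.rstrip('\n') + ' '
infile.close()

print(result)
```

Every line, including the blank one, now ends in a space instead of a line break, so the whole file comes out as one long line.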

Now I realize that some of the hard line returns are important. The blank lines in the file separate notes, which are like paragraphs. Let's leave those line breaks in.

Modify your program to leave the end-of-line character in whenever it is the only character on the line.

Version 2 works, though I'll still have some editing to do in cases where I put notes on consecutive lines. Let's leave that as it is.
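A sketch of one way Version 2 could look, under the same assumptions as before: a made-up sample file, and a string named result that collects the output. The test line == '\n' is one way to ask whether the end-of-line character is the only character on the line.

```python
# Create a tiny sample file so the sketch runs on its own.
# The name 'sample-notes.txt' and the text are made up for illustration.
sample = open('sample-notes.txt', 'w')
sample.write('first line\nsecond line\n\nnew note\n')
sample.close()

infile = open('sample-notes.txt', 'r')
result = ''
for line in infile:
    if line == '\n':
        # a blank line: keep its end-of-line character
        result = result + line
    else:
        # a text line: swap the end-of-line character for a space
        result = result + line.rstrip('\n') + ' '
infile.close()

print(result)
```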

But the resulting text is quite hard to read. We either need to put a second line break in, to separate the paragraphs, or indent the lines so that they are readable.

Modify your program either to leave a blank line between the text lines or to start each text line with three spaces.

Either change affects the 'then' clause of our if statement. Adding a blank line requires that we print a \n character in addition to the one that the print() function supplies. Indenting lines requires that we print spaces after each line break, because that's the time when we know we are starting a new line. But that means printing an explicit \n followed by spaces and suppressing the \n that print() usually gives us. And there's the first line, too.
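The indenting option can be sketched as follows. This is an illustration under assumptions, not the class solution: the sample file is made up, and the text is collected in a string named result instead of being printed piece by piece, so there is no automatic \n to suppress. The explicit \n followed by spaces, and the special handling of the first line, are still visible.

```python
# Create a tiny sample file so the sketch runs on its own.
# The name 'sample-notes.txt' and the text are made up for illustration.
sample = open('sample-notes.txt', 'w')
sample.write('first line\nsecond line\n\nnew note\n')
sample.close()

indent = '   '          # three spaces

infile = open('sample-notes.txt', 'r')
result = indent         # the first line needs its indent, too
for line in infile:
    if line == '\n':
        # end the paragraph, then indent the start of the next one
        result = result + '\n' + indent
    else:
        result = result + line.rstrip('\n') + ' '
infile.close()

print(result)
```

One rough edge: a file that ends with a blank line would leave three stray spaces at the end of the output.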

We are in pretty good shape! But now I need to have this text in a new file.

Modify your program to write its output to a file. Name the output file filename-stripped.ext, where filename.ext was the name of the input file.

Here is our final program. The output file looks good. It is 20 characters bigger, though.

    > ls -l notes*
    -rw-r--r--  1 wallingf  staff  3530 Oct 16 10:54 notes-stripped.txt
    -rw-r--r--  1 wallingf  staff  3510 Oct 16 10:00 notes.txt

Why? Because we put an extra character in for every line return in the file.

    > wc notes-stripped.txt
          40     584    3530 notes-stripped.txt
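The file-writing step, including building the name filename-stripped.ext from filename.ext, might be sketched like this. It is a hedged illustration, not the final program itself: the sample file and the variable names dot and out_name are made up, the input name is assumed to contain a period, and the blank-line option is the one shown.

```python
# Create a tiny sample file so the sketch runs on its own.
# The name 'notes-sample.txt' and the text are made up for illustration.
filename = 'notes-sample.txt'
sample = open(filename, 'w')
sample.write('first line\nsecond line\n\nnew note\nlast line\n')
sample.close()

# filename.ext -> filename-stripped.ext
# (assumes the name contains at least one '.')
dot = filename.rfind('.')
out_name = filename[:dot] + '-stripped' + filename[dot:]

infile = open(filename, 'r')
outfile = open(out_name, 'w')
for line in infile:
    if line == '\n':
        # print() adds its own \n, so this writes two of them:
        # one ends the paragraph, one makes the blank line
        print('\n', file=outfile)
    else:
        print(line.rstrip('\n'), end=' ', file=outfile)
infile.close()
outfile.close()
```

In this sketch the output file is one character longer than the input for each blank line in the input, because each blank line becomes two \n characters.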

Unix is bigger than cat, more, and grep. You may enjoy learning it!



Lessons from the Lab

The cat exercise is just about the minimal example of reading from a file: just copy the file, character by character. That's how cat works, too, though our Python version works line by line. We may learn more about the different ways to read from files later in the course.

more adds one complication: we now need to count lines and ask the user whether to continue every so often. We already have a loop for our running total pattern, so we just add a counter and the if statement that checks our progress.

What happens if we wrap the for in a while? Nothing in my program, because I never set my counter back to zero. But many of your programs did. Once we run out of lines to process, we never get to increment the counter; it stays the same forever; and that's how long we keep looping: forever.

How can we know that we don't want another loop? Or need one?

The cat-to-file exercise adds an output file to the mix, but the behavior is still the same as cat: echo the content of the input to the output location. There are two inputs, so we need to execute the read-and-write process twice.

Given what we know at this point, this requires two loops. Next week, we will learn how we can write this code without copying and pasting the for loop. Heck, we may just sneak it in later today.

Finally, our grep-like program combines reading and counting lines for a new purpose: printing only the lines that contain a search string. There was one twist in the spec: matching the search string regardless of case but still printing the original line. My solution...
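One way to handle that twist, sketched under assumptions rather than taken from the lab solution: compare lowercase copies of the search string and the line, but report the line itself. The sample file, its contents, and the names target, matches, and line_number are made up, and the search string is assigned directly where a real program would call input().

```python
# Create a tiny sample file so the sketch runs on its own.
# The name 'grep-sample.txt' and the text are made up for illustration.
sample = open('grep-sample.txt', 'w')
sample.write('William Arrowsmith\nMichael Dirda\nWILLIAM Wordsworth\n')
sample.close()

search_str = 'William'         # would normally come from input()
target = search_str.lower()    # compare in lowercase ...

matches = ''
line_number = 0
infile = open('grep-sample.txt', 'r')
for line in infile:
    line_number = line_number + 1
    if target in line.lower():
        # ... but report the line exactly as it appears in the file
        matches = matches + str(line_number) + ' -- ' + line
infile.close()

print(matches, end='')
```

The lowercase copies exist only for the comparison. Neither line nor search_str is changed, so the original capitalization survives in the output.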

Not many of you got to Task 5, which I included as insurance and as a challenge. The idea is this: In Unix, I type commands and arguments on a single line:

    > grep William dirda-excerpts.txt

... but our Python versions prompt over multiple lines:

    grep: William
    in  : dirda-excerpts.txt

Wouldn't it be nice to prompt once and accept all the arguments at one time?

    grep: William dirda-excerpts.txt

How can we eliminate the extra prompts? In the Unix case, the arguments are separated by spaces. We can do the same thing in Python, using the string method split(' ').

    argument_str = input('grep: ')
    search_str, filename = argument_str.split(' ')

    # still make this lowercase
    search_str = search_str.lower()

Now running our program feels a lot more like the Unix experience:

    grep: William dirda-excerpts.txt
    32 -- William Arrowsmith
    47 -- William Wordsworth

If we are persistent, creative, and prepared, we can solve almost any problem. I love programming. (And I love discovering the boundaries of what we can and cannot solve.)



Next Idea: Functions

There is a bit more that we can learn about files, but these basics will carry us quite far. Besides, dealing with files in a program is more about details and trivia than big ideas. Let's move on to a new idea -- a big idea -- that can change how we think about programs.

Recall that in cat_to_file.py we wanted to do the same thing twice: loop through a file, and print its lines into an output file. Given what we know at this point, doing so requires two loops:

    for line in infile1:
        line_str = line.strip()
        print(line_str, file=outfile)

    for line in infile2:
        line_str = line.strip()
        print(line_str, file=outfile)

The two loops are identical. Wouldn't it be nice if we didn't have to repeat them?

... these lines are a "thing": copy one file to another. If we can name the thing, then we can refer to it in multiple places. The file objects are different, so we need to communicate that information, too.

... compare to square root. A process that we can name and apply to different values: sqrt(100) or sqrt(6.25).

We can do this in Python:

    def copy_file(in_file, out_file):
        for line in in_file:
            print(line, end='', file=out_file)

This allows us to write a new version of cat_to_file.py, with the two loops replaced:

    copy_file(infile1, outfile)
    copy_file(infile2, outfile)
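Put together, a complete run might look like the sketch below. The function is the one defined above; everything around it is an assumption for illustration: the three file names are made up, and the two input files are created on the spot so the sketch runs on its own.

```python
def copy_file(in_file, out_file):
    # copy every line of in_file to out_file, unchanged
    for line in in_file:
        print(line, end='', file=out_file)

# Two small input files, created here so the sketch runs on its own.
# All three file names are made up for illustration.
first = open('part1.txt', 'w')
first.write('alpha\nbeta\n')
first.close()

second = open('part2.txt', 'w')
second.write('gamma\n')
second.close()

infile1 = open('part1.txt', 'r')
infile2 = open('part2.txt', 'r')
outfile = open('combined.txt', 'w')

# the two identical loops are now two calls to one named process
copy_file(infile1, outfile)
copy_file(infile2, outfile)

infile1.close()
infile2.close()
outfile.close()
```

Note one small difference from the two-loop version: that code called strip() on each line, which also removes spaces at the start and end of the line, while copy_file writes each line exactly as it was read.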

The adventure is just beginning.



Wrap Up



Eugene Wallingford ..... wallingf@cs.uni.edu ..... October 16, 2014