Session 29

Analyzing Text With Multiple Strategies


CS 2530
Intermediate Computing


Using Computing to Solve a Real Problem

Every so often, we read about how computer programming and statistics have been used to crack a code or understand a literary work. Just last week, a news story (local mirror) reported that a 300-year-old code used by Roger Williams, the founder of the state of Rhode Island, had been broken through a combination of statistical analysis and historical research. A few years ago, a story in the Dallas-Fort Worth Star-Telegram (local mirror) reported the use of a computer program to identify novelist Henry James as the author of some older writings whose attribution was unknown.

Statistical literary analysis is a growing area of research in the humanities, in which scholars seek to identify or confirm who wrote a document by comparing the document to a corpus of works known to have been written by an author. Now that computer programs can be used so easily to assist with this sort of analysis, a new area of study -- digital humanities -- has arisen.

The most famous problem to which statistical literary analysis has been applied is "Who wrote Shakespeare?". Some literary scholars believe that the Elizabethan actor named Shakespeare did not write the works attributed to him, and that they were instead written by some other person with a motive for concealing his identity. Among the most prominent people proposed are Francis Bacon, Christopher Marlowe, and Edward de Vere (17th Earl of Oxford).

This debate rages on... A Google search for "shakespeare de vere", based on the most commonly suggested culprit, produces approximately 593,000 hits.

Statistical literary analysis requires some complex computer algorithms and a little knowledge of statistics. Yet you know enough Java to begin writing simple literary analysis programs. Many of the most useful metrics of a text are straightforward to compute, such as the lengths of words and the number of different words used.

Let's do it! For our work today, we will use Shakespeare's Hamlet and Much Ado About Nothing. If you would like to explore some of the ideas you see today with Shakespeare's other works, or other works altogether, I suggest you visit Project Gutenberg, a substantial collection of documents available in many forms, including plain text.



An Exercise: It Starts With a Single Step

Suppose that we have written a class named Play that is initialized with a filename and holds the full text of the play, in a Vector of Strings:

    import java.util.Vector;

    public class Play
    {
      private Vector<String> text;

      public Play( String filename )
      {
        // reads filename into text, one line at a time
      }
    }

We would like a Play to respond to a startsWith( char initial ) message, returning the number of Strings in its document that begin with the initial character.

Your task: write the int startsWith() method.

Here is a possible solution. Does it work?

    > java StartsWith hamlet.txt t
    The play in the file hamlet.txt contains 4533 words that start with 't'.
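
For comparison, here is a minimal sketch of what such a method might look like. It is not the linked solution. The class name StartsWithSketch, the constructor that takes lines already in memory, and the DELIMITERS set are all assumptions made so that the example runs on its own.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.Vector;

// Hypothetical stand-in for Play: built from lines in memory, not from a file.
class StartsWithSketch
{
  // Assumed word separators; the real program may use a different set.
  private static final String DELIMITERS = " \t\n\r.,;:!?()[]\"-";

  private Vector<String> text;

  StartsWithSketch( List<String> lines )
  {
    text = new Vector<String>( lines );
  }

  // Counts the words whose first character matches initial, ignoring case.
  public int startsWith( char initial )
  {
    char target = Character.toLowerCase( initial );
    int wordCount = 0;

    for ( int i = 0; i < text.size(); i++ )
    {
      String line = text.elementAt(i).toLowerCase();
      StringTokenizer words = new StringTokenizer( line, DELIMITERS );
      while ( words.hasMoreTokens() )
      {
        String word = words.nextToken();   // tokens are never empty
        if ( word.charAt( 0 ) == target )  // the only line specific to this query
          wordCount++;
      }
    }

    return wordCount;
  }

  public static void main( String[] args )
  {
    StartsWithSketch play = new StartsWithSketch( Arrays.asList(
        "To be, or not to be, that is the question",
        "The rest is silence." ) );
    System.out.println( play.startsWith( 't' ) );   // to, to, that, the, the
  }
}
```

Notice how little of the method has anything to do with the letter 't'.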



Another Exercise: It Won't Take Long!

Next, we need to know how many words there are of a given length. So we add to the Play class a method named wordsOfLength( int length ). In response to this message, a Play returns the number of Strings in the document that are length characters long.

What would you change in our previous solution?

Answer: just the test that decides whether to bump the counter!

Here is a possible solution. Does it work?

    > java FindLengthOf hamlet.txt 5
    The play in the file hamlet.txt contains 3785 words of 5 characters.

We could even look for words of multiple lengths:

    > java WordLengths hamlet.txt 20
    The play in the file hamlet.txt contains 2037 words of 1 characters.
    The play in the file hamlet.txt contains 5930 words of 2 characters.
    The play in the file hamlet.txt contains 7515 words of 3 characters.
    ...
    The play in the file hamlet.txt contains 16 words of 13 characters.
    The play in the file hamlet.txt contains 2 words of 14 characters.
    The play in the file hamlet.txt contains 0 words of 15 characters.



Yet Another Exercise: You'll Know This Idea Forwards and Backwards

Some crazy English professor gets in his (or her) head that the number of palindromes in a text is a significant marker of authorship. We are asked to add to the Play class a method named numberOfPalindromes(), which returns the number of Strings in the document that read the same forward and backward.

What would you change from our previous solution?

Same answer: just the test that decides whether to bump the counter!

Here is yet another solution. Does it work?

    > java FindPalindromes hamlet.txt
    The play in the file hamlet.txt contains 2248 palindromes.
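
The test itself might be sketched as below. This is one possible version, not necessarily the one in the linked solution, and the names PalindromeSketch and isPalindrome are made up for the illustration. One-letter words pass this test, which may explain much of the count above, given the 2037 one-letter words reported earlier.

```java
class PalindromeSketch
{
  // True when word reads the same forward and backward.
  // One-letter words (and the empty string) pass under this definition.
  static boolean isPalindrome( String word )
  {
    int left  = 0;
    int right = word.length() - 1;

    // Walk inward from both ends, comparing characters pairwise.
    while ( left < right )
    {
      if ( word.charAt( left ) != word.charAt( right ) )
        return false;
      left++;
      right--;
    }
    return true;
  }

  public static void main( String[] args )
  {
    System.out.println( isPalindrome( "madam" ) );    // true
    System.out.println( isPalindrome( "hamlet" ) );   // false
    System.out.println( isPalindrome( "a" ) );        // true
  }
}
```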

No problem. But...



The Real Problem: Duplication

The real problem here isn't writing these methods. After we write the first one, the rest are pretty easy.

The problem is: Each of these methods is nearly identical. How can we make the duplication go away?

What programming techniques do you know that you could use to do the job?

In the past, when we encountered common code in two methods, we have often factored out a helper method that contains the common behavior. Then the original methods call the helper instead of repeating the code. Indeed, this is one of the first tools we learned in our Introduction to Computing course for creating abstractions.

A Possible Solution

To apply that idea here, we need to create a template method. The template method factors out a helper method in a way that is backwards from how we usually do it. The helper method, which captures what is common among the three methods, is the calling method. The called methods are the code that differs among the three solutions, which is the test to be applied to the word.

So, Play would have a method that does all of the vector processing and calls a method to do the test. Then we write subclasses of Play, each of which implements a specific test and thus a specific counting behavior.

First, we would create a "template" method named countWords with a call to a "hook" method that runs the test:

    public int countWords()
    {
        ...
        while( words.hasMoreElements() )
        {
          word = (String) words.nextElement();
          if ( passesTest( word ) )
            wordCount++;
        }
        ...
    }

Then, we might write a subclass named WordsStartWithPlay that implements the passesTest method in its own way:

    public int startsWith( char targetChar )
    {
        this.targetChar = targetChar;
        return countWords();
    }

    // protected, not private, so that it overrides the hook called by countWords()
    protected boolean passesTest( String word )
    {
        return word.charAt( 0 ) == targetChar;
    }

A New Problem

A template method is a great tool for letting subclasses fill in the details of an otherwise specified process. That works when each different behavior being factored out indicates a different kind of object. If the different tests we want to run corresponded to different kinds of play, such as history, comedy, and tragedy, this would work great.

But that is not the case in our play-processing scenario. We can run any of our tests on any play that Shakespeare wrote. What happens in the proposed solution if we want to ask a Play both how many of its words start with a particular letter and how many of its words have a particular length? We can't!

This is a huge drawback to using a template method here. Our need is not for a different kind of play, but for a different kind of processing.
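
The drawback is easy to see once the template-method design is written out in full. The sketch below is only an illustration: TemplatePlay, WordsOfLengthPlay, TemplateDemo, the in-memory constructor, and the DELIMITERS set are assumptions, and only WordsStartWithPlay borrows its name from the discussion above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.Vector;

// Hypothetical base class: it owns the loop and defers the test to a hook.
abstract class TemplatePlay
{
  private static final String DELIMITERS = " \t\n\r.,;:!?";  // assumed separators
  private Vector<String> text;

  TemplatePlay( List<String> lines )
  {
    text = new Vector<String>( lines );
  }

  // Template method: identical for every kind of count.
  public int countWords()
  {
    int wordCount = 0;
    for ( String line : text )
    {
      StringTokenizer words = new StringTokenizer( line.toLowerCase(), DELIMITERS );
      while ( words.hasMoreTokens() )
      {
        if ( passesTest( words.nextToken() ) )
          wordCount++;
      }
    }
    return wordCount;
  }

  // Hook method: each subclass fills in its own test.
  protected abstract boolean passesTest( String word );
}

class WordsStartWithPlay extends TemplatePlay
{
  private char targetChar;

  WordsStartWithPlay( List<String> lines ) { super( lines ); }

  public int startsWith( char targetChar )
  {
    this.targetChar = Character.toLowerCase( targetChar );
    return countWords();
  }

  protected boolean passesTest( String word )
  {
    return word.charAt( 0 ) == targetChar;
  }
}

class WordsOfLengthPlay extends TemplatePlay
{
  private int targetLength;

  WordsOfLengthPlay( List<String> lines ) { super( lines ); }

  public int wordsOfLength( int length )
  {
    targetLength = length;
    return countWords();
  }

  protected boolean passesTest( String word )
  {
    return word.length() == targetLength;
  }
}

class TemplateDemo
{
  public static void main( String[] args )
  {
    List<String> lines = Arrays.asList(
        "To be, or not to be, that is the question",
        "The rest is silence." );

    // Two questions about one text require two Play objects.
    WordsStartWithPlay byLetter = new WordsStartWithPlay( lines );
    WordsOfLengthPlay  byLength = new WordsOfLengthPlay( lines );

    System.out.println( byLetter.startsWith( 't' ) );
    System.out.println( byLength.wordsOfLength( 2 ) );
  }
}
```

Each subclass can answer exactly one kind of question, so the same text has to be loaded once per question. Neither object can answer the other's query.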



Repeat After Me

"Say it once and only once."

We know it.
We believe it.
We love it.

But how do we do it?

We know how! We have done it before. In fact, you did it when you wrote your startsWith() method earlier. Even if I had not told you to write a method that takes the initial character as an argument, you would have done so. I doubt anyone would have felt a desire to write a method called startsWithT() and a method called startsWithJ() and a method startsWithS() and .... Why not?

Because you understand the idea of parameterized behavior -- even if you've never heard that phrase before. startsWith( char ) implements the behavior. The argument to the method parameterizes the behavior to work for a specific data value.

We parameterize a procedure by passing to it the value that differs among the various uses of the method.

So... The answer to "How do we do it?" in our Play class is to parameterize the behavior of the word-counting code by passing to it what differs.

Make the test we use on the string a parameter.

So I send you off to do the job.



Um, Excuse Me,
Professor Wallingford, ...

Then I wake up the next morning, go to my office as happy as can be, and find in my e-mail the following:

     Date: Wed, 5 Dec 2012 00:59:33 -0500 (CDT)
     From: Happy User (somestudent@uni.edu)
     To: My Favorite Teacher (wallingf@cs.uni.edu)
     Subject: say it once and only once

     How exactly do I do that ??

     I mean, I know how to pass characters and numbers
     to functions because, like, they're data, ya
     know?  How can I pass a function as an argument?
     How can a test be a parameter?

     Patiently awaiting enlightenment...



Enlightenment Arrives

Ah, Little One, you know the answer. You even know two possible answers, one of which is better than the other. You just don't realize that you know.

Idea #1: Objects are data, too.

Idea #2: Objects can do things!

Make an object. Pass it to the method. Tell it to do something for you.



A Better Solution

What changes in our set of applications is not the kind of play, but the way we count words in the query methods.

Make what changes -- the test -- an object.

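So, for our Play problem: make the word test an object, pass it to the method that counts words, and let that method ask the object whether each word passes.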

This is the Strategy design pattern.



Previous Experience with the Strategy Pattern

We have encountered this pattern before, in the AWT. Each of our Frames has a LayoutManager, which is an algorithm that lays out the Frame's items on the screen.

Making a test or a function or even a whole algorithm into an object that can be created, passed, and replaced is a common idea in object-oriented programming. In many ways, it's not a new idea at all, because OOP is all about creating and using objects!
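
As a reminder of how that looks in code, here is a small sketch. It uses a plain java.awt.Container rather than a Frame so that no window has to open; a Frame is a Container, too, and takes its layout strategy the same way. The class LayoutStrategyDemo and its helper layoutName are made up for the illustration.

```java
import java.awt.BorderLayout;
import java.awt.Container;
import java.awt.FlowLayout;
import java.awt.LayoutManager;

class LayoutStrategyDemo
{
  // Returns the name of the layout strategy currently installed in box.
  static String layoutName( Container box )
  {
    LayoutManager layout = box.getLayout();
    return layout == null ? "none" : layout.getClass().getSimpleName();
  }

  public static void main( String[] args )
  {
    Container box = new Container();

    // The container stays the same; only the strategy object changes.
    box.setLayout( new FlowLayout() );
    System.out.println( layoutName( box ) );

    box.setLayout( new BorderLayout() );
    System.out.println( layoutName( box ) );
  }
}
```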



Using the Strategy Pattern in our Word-Counting Program

First, let's factor out what is different about our solutions. The "variable" here is determining whether a certain word has a certain characteristic.

We define such a strategy using an interface:

    public interface WordFeature
    {
      public boolean hasFeature( String s );
    }

We can then implement different checking strategies as classes. For example:

    public class StartsWith    implements WordFeature {...}
    public class OfLength      implements WordFeature {...}
    public class IsAPalindrome implements WordFeature {...}



Implementing and Using a Strategy Object

Here's how we might implement one of our strategies:

    public class StartsWith implements WordFeature
    {
      private char targetChar;

      public StartsWith( char target )
      {
        targetChar = target;
      }

      public boolean hasFeature( String s )
      {
        if ( s == null || s.length() == 0 )
          return false;
    
        return s.charAt(0) == targetChar;
      }
    }

Any class that needs to test strings to see whether they start with a particular letter can create an instance of StartsWith ...

    WordFeature test = new StartsWith( 't' );

... and send it a message:

    if ( test.hasFeature( "the" ) ) ...

To do this in our Play class, let's extract the common parts of our three methods into a word-counting method:

    public int countWords( WordFeature test )
    {
      int wordCount = 0;
      String line;
      StringTokenizer words;
      String word;

      // visit every line of the play
      for ( int i = 0; i < text.size(); i++ )
      {
        line  = text.elementAt(i).toLowerCase();
        // DELIMITERS: a String of separator characters, defined elsewhere in Play
        words = new StringTokenizer( line, DELIMITERS );
        while( words.hasMoreElements() )
        {
          word = (String) words.nextElement();
          // the only step that varies: ask the strategy object
          if ( test.hasFeature( word ) )
            wordCount++;
        }
      }

      return wordCount;
    }

This is almost exactly the code that we re-wrote before. Now, we implement the specific methods we need, with calls to countWords():

    public int startsWith( char targetChar )
    {
      targetChar = Character.toLowerCase( targetChar );
      return countWords( new StartsWith(targetChar) );
    }

Likewise for wordsOfLength() and numberOfPalindromes().
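
For instance, the palindrome query might be sketched as follows. Treat this as one way to do it, not the course's code: StrategyPlay is a hypothetical stand-in for Play that is built from lines in memory, the DELIMITERS set is assumed, and the body of IsAPalindrome is only one of several reasonable implementations.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.Vector;

interface WordFeature
{
  boolean hasFeature( String s );
}

// One possible palindrome strategy. It holds no state, so it needs no constructor.
class IsAPalindrome implements WordFeature
{
  public boolean hasFeature( String s )
  {
    if ( s == null || s.length() == 0 )
      return false;

    // A palindrome equals its own reversal.
    return new StringBuilder( s ).reverse().toString().equals( s );
  }
}

// Hypothetical stand-in for Play, built from lines in memory instead of a file.
class StrategyPlay
{
  private static final String DELIMITERS = " \t\n\r.,;:!?";  // assumed separators
  private Vector<String> text;

  StrategyPlay( List<String> lines )
  {
    text = new Vector<String>( lines );
  }

  public int countWords( WordFeature test )
  {
    int wordCount = 0;
    for ( String line : text )
    {
      StringTokenizer words = new StringTokenizer( line.toLowerCase(), DELIMITERS );
      while ( words.hasMoreTokens() )
      {
        if ( test.hasFeature( words.nextToken() ) )
          wordCount++;
      }
    }
    return wordCount;
  }

  // The whole query is one line: pick a strategy and hand it to countWords().
  public int numberOfPalindromes()
  {
    return countWords( new IsAPalindrome() );
  }

  public static void main( String[] args )
  {
    StrategyPlay play = new StrategyPlay( Arrays.asList(
        "Madam, I did see a level deed at noon" ) );
    System.out.println( play.numberOfPalindromes() );
  }
}
```

Unlike the subclass approach, a single play object can now answer any question for which someone has written a WordFeature.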

To add new word-counting methods, we need only...

  1. write a new WordFeature class and
  2. add a one-line method to the Play class.

Writing such a class isn't much more work than writing a method, because a single method is nearly all the class contains. Very nice!



An Exercise: Implementing a Strategy

This can be a quickie.

Write the class OfLength that tests strings to see if they are a particular length.

Here's a solution. It's that easy!
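
In case the link is not handy, here is a sketch of what such a class could look like, modeled on the StartsWith class shown earlier. The field name targetLength, the null check, and the OfLengthDemo driver are my own choices for the illustration and may differ from the linked solution.

```java
interface WordFeature
{
  boolean hasFeature( String s );
}

class OfLength implements WordFeature
{
  private int targetLength;

  public OfLength( int length )
  {
    targetLength = length;
  }

  // A null string has no length, so it never has the feature.
  public boolean hasFeature( String s )
  {
    return s != null && s.length() == targetLength;
  }
}

class OfLengthDemo
{
  public static void main( String[] args )
  {
    WordFeature test = new OfLength( 5 );
    System.out.println( test.hasFeature( "ghost" ) );   // five letters
    System.out.println( test.hasFeature( "king" ) );    // four letters
  }
}
```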



Shakespeare Redux

If you are interested in learning more about the Shakespeare question, I suggest that you visit the following web sites:

And, lest you think that digital humanities involves only text, check out Computer Analysis Suggests Paintings Are Not Pollocks, which talks about a project that used computational techniques to study the provenance of a set of paintings!

Computer programming really is very cool.



Wrap Up



Eugene Wallingford ..... wallingf@cs.uni.edu ..... December 4, 2012