CSI Lab 13

Tuesday, April 15th


Introduction

In lecture yesterday we looked at reading from and writing to files.  Today we want to look at creating a series of methods that help us to "analyze" "large" text files.  I have two goals for this lab:

You MAY work with a partner on this lab.


Analyzing Hamlet

While I like to teach the introductory courses, I also like to teach courses dealing with Artificial Intelligence.  One of the projects that a team of students studied in my class was detecting whether or not a particular piece of text was written by a particular author.  One of the ways to do this is to look at statistical trends in text known to be from the author and to compare this to statistical information about the "questionable" piece of text.

While we will not go quite that far with this, let's write some code to help us consider statistics about a piece of text.  In this case, let's look at statistics for the play Hamlet.  To start with, download the following text file:

Now, open up JES and save your workspace as a file called lab13.py.  All of the code for this activity will go in this file and will be submitted for your grade this week.


Activity A : Figuring out how many words are in the file

One of the simplest statistical analysis tools for text is counting how many total words are in a document.  Let's write code to do that.

 

What I will want

By the time we are done with this activity, I want you to have a method called:

    howManyWords( filename)

In other words, by the end of this activity, I want to be able to ask:

    filename = pickAFile()

    quantity = howManyWords(filename)

and your code will return 52 if filename points to preamble.txt or 31956 if filename points to hamlet.txt.

 

Begin by writing some code and working on it until it works using the preamble file, a smaller file that is easier to check for errors in the beginning.  If you open preamble.txt in Notepad or some other text editor, you will notice that there are 52 words in this text.

What do you need to do in order to get an answer of 52?

Hints:

 

Keep working on this until you get an answer of 52.

THEN, test it with hamlet.txt.  You should get an answer of 31956.

 

I am purposefully being vague on how to do this.  I want you to experiment, try things, and ask questions until you can get it to work.
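
If you get stuck, here is one possible starting point, offered as a sketch rather than the expected solution.  It assumes that a "word" is any run of characters separated by whitespace (spaces, tabs, or newlines), which is what Python's split() gives you when called with no arguments.  The helper name countWordsInText is made up for this illustration.

```python
def countWordsInText(text):
    # split() with no arguments breaks the text apart at any run of
    # whitespace, so punctuation stays attached to its word.
    return len(text.split())


def howManyWords(filename):
    # Read the whole file into one string, then count its words.
    f = open(filename, "r")
    text = f.read()
    f.close()
    return countWordsInText(text)
```

Keeping the counting separate from the file reading lets you test the counting on a short string you type in yourself before you point the code at a real file.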

 


Activity B : Figuring out how many words start with a given letter

One of the next easiest statistical analysis tools for text is looking at how many words in a document start with a particular letter of the alphabet.  Let's write code to do that.

What I will want

By the time we are done with this activity, I want you to have a method called:

    howManyWordsStartWith( filename, givenLetter)

In other words, by the end of this activity, I want to be able to ask:

    quantity = howManyWordsStartWith(filename,'A')

and your code will return 5 if filename points to preamble.txt or 2893 if filename points to hamlet.txt.

 

How you should get there

  1. FIRST, write a method called:

 startsWith(word,char)

This method should take ONE string and one character and return true (1) if the word starts with that character and false (0) if it does not.

For this method let's be case-insensitive.  That is, it doesn't matter if I use upper or lower case letters.  The word "Python" starts with "P" and also starts with "p".

  2. Use the code you wrote in Activity A as a model and write the method called howManyWordsStartWith() as described above.  Rather than counting EVERY word, you should use the helper method you wrote in step 1 to only count words starting with your character.

 

Again, begin by testing with preamble (where the answer is 5) and then see if it still works with hamlet (2893).
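
The two steps above might be sketched as follows.  This is one way to do it, not the only one.  startsWith returns 1 or 0 as described in step 1; countStartingWith is a hypothetical helper name that works on a list of words, so you would still need to read the file and split it into words yourself inside howManyWordsStartWith.

```python
def startsWith(word, char):
    # An empty word cannot start with anything.
    if len(word) == 0:
        return 0
    # Compare lower-case versions so that 'A' and 'a' are treated alike.
    if word[0].lower() == char.lower():
        return 1
    return 0


def countStartingWith(words, givenLetter):
    # Hypothetical helper: count the words in a list that start
    # with givenLetter, using startsWith for each comparison.
    count = 0
    for word in words:
        if startsWith(word, givenLetter):
            count = count + 1
    return count
```

For a quick check without any file, try countStartingWith("and a more America Union".split(), 'A') and confirm that the count matches what you get by hand.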

 


Activity C : Doing a full statistical consideration

Now that we can ask the computer to figure out how many words start with a given letter, let's figure out how many words start with each of the 26 letters in our alphabet.

What I will want

By the time we are done with this activity, I want you to have a method called:

    calculateDistribution( inputFilename, outputFilename)

In other words, by the end of this activity, I want to be able to say:

    fin = pickAFile()  #pick the preamble

    fout = pickAFile()  #pick an empty file called output.txt

    calculateDistribution(fin,fout)

and your code will create a file called output.txt.  When I open that file I should see how many words start with A, how many words start with B, how many words start with C, etc., one letter per line:

There were 2893 that started with A
There were 1402 that started with B
There were 1002 that started with C
...

How you should get there

You already have a method that counts how many words start with a given letter (Activity B).

Your new method should set up a loop that asks that question over and over again, once for each of the 26 letters.

You might begin by printing to the screen to see if you are making progress.  THEN, revise your code to print to the file.
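
As a rough sketch of that loop, assuming the output lines follow the "There were ... that started with ..." pattern shown above: the howManyWordsStartWith below is only a compact stand-in so the example runs on its own, and you should use your own Activity B method instead.

```python
import string


def howManyWordsStartWith(filename, givenLetter):
    # Stand-in for your Activity B method so this sketch runs on its own.
    f = open(filename, "r")
    words = f.read().split()
    f.close()
    count = 0
    for word in words:
        # split() never produces empty strings, so word[0] is safe here.
        if word[0].lower() == givenLetter.lower():
            count = count + 1
    return count


def calculateDistribution(inputFilename, outputFilename):
    out = open(outputFilename, "w")
    # string.ascii_uppercase is the string of the 26 letters A through Z.
    for letter in string.ascii_uppercase:
        quantity = howManyWordsStartWith(inputFilename, letter)
        out.write("There were " + str(quantity) +
                  " that started with " + letter + "\n")
    out.close()
```

Note that this version reads the input file 26 times, once per letter.  That is simple and fine for a file the size of hamlet.txt, but it is a design choice rather than a requirement.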

 

TIPS

 


Activity D : [POTENTIAL FOR BONUS]  Figuring out how many words have a certain length

 

What I will want

By the time we are done with this activity, I want you to have a method called:

    wordsOfLength( filename, length)

In other words, by the end of this activity, I want to be able to ask:

    quantity = wordsOfLength(filename,5)

and your code will return 1 if filename points to preamble.txt or 4313 if filename points to hamlet.txt.

 

Some issues

On the surface, this is a pretty simple method to write at this point.  My guess is that most of you can copy and modify code that you already wrote and have this working and giving the answer of 4313 I listed above in only a matter of a few minutes.  In fact, if you get that working I will give you credit at this point.

Having said that, 4313 is actually not the correct answer.  Getting the ACTUAL correct answer is worth some bonus points.

To demonstrate why 4313 isn't correct, modify your code to print the word when it finds a word that is 5 characters long.  Not too far from the end of this list of words you should see the following:

minds
wild,
plots
Fort.

Notice that our code is actually working with words that have punctuation in them.

Clean up your code using some of the things suggested in your chapter so that you only consider the ACTUAL length of the words.  This means that wild and Fort would be four letter words, which would make our count smaller.  But it ALSO means that some of the words that USED to be 6 letter words (perhaps "music,") would now count as 5 letter words.

This isn't a completely trivial problem to solve, but you should be able to make a good dent in the problem.  According to my calculations there are 3805 words of length 5 in hamlet ASSUMING that words like prov'd and fall'n are words of length 5.
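
One possible way to make that dent, sketched under the assumption that only alphabetic characters count toward a word's length (so prov'd has 5 letters).  The names lettersOnly and countWordsOfLength are made up for this illustration, and this approach is not guaranteed to reproduce the 3805 figure, since choices such as how to treat digits or hyphenated words can shift the count.

```python
def lettersOnly(word):
    # Keep only alphabetic characters, dropping punctuation
    # such as commas, periods, apostrophes, and hyphens.
    cleaned = ""
    for ch in word:
        if ch.isalpha():
            cleaned = cleaned + ch
    return cleaned


def countWordsOfLength(words, length):
    # Hypothetical helper: count the words in a list whose
    # letters-only form has exactly the requested length.
    count = 0
    for word in words:
        if len(lettersOnly(word)) == length:
            count = count + 1
    return count
```

Trying it on the short list from above, ["minds", "wild,", "plots", "Fort."], is a quick way to see whether wild and Fort have moved out of the 5 letter group.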


Activity E : Doing a full statistical consideration

Now that we can ask the computer to figure out how many words there are with a given length...

What I will want

By the time we are done with this activity, I want you to have a method called:

    calculateLengthDistribution( inputFilename, outputFilename)

In other words, by the end of this activity, I want to be able to say:

    fin = pickAFile()  #pick the preamble

    fout = pickAFile()  #pick an empty file called output.txt

    calculateLengthDistribution(fin,fout)

and your code will create a file called output.txt.  When I open that file I should see how many words of each length from 1 to 12 there are in the input file:

There were 1275 of length 1
There were 5669 of length 2
There were 7564 of length 3
There were 7259 of length 4
...
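
A minimal sketch of one way to build this file, assuming the raw length of each word (punctuation included) is what gets counted.  The handout does not say whether the sample numbers above use raw or cleaned lengths, so treat that as an open choice.  Instead of calling wordsOfLength twelve times, this version reads the file once and keeps a list of counters.

```python
def calculateLengthDistribution(inputFilename, outputFilename):
    f = open(inputFilename, "r")
    words = f.read().split()
    f.close()

    maxLength = 12
    # counts[n] holds how many words have length n; index 0 is unused.
    counts = [0] * (maxLength + 1)
    for word in words:
        n = len(word)  # raw length, punctuation included (an assumption)
        if 1 <= n <= maxLength:
            counts[n] = counts[n] + 1

    out = open(outputFilename, "w")
    for n in range(1, maxLength + 1):
        out.write("There were " + str(counts[n]) +
                  " of length " + str(n) + "\n")
    out.close()
```

Words longer than 12 characters are simply ignored here.  If you have a cleaned-length helper from Activity D, you could use it in place of len(word).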

 


Getting Credit for this assignment

This week you should turn in your lab13.py code as your only deliverable for credit for this lab.  I will grade this much like I would a programming assignment.

When you are ready to submit the code, place it in a directory called "lab13_complete" in your root directory on the p: drive and then send me an email telling me I can grade this code.