Today we are going to do some text analysis on an American classic - "Green Eggs and Ham". You can download the full text in the following file:
I have heard the following fact about "Green Eggs and Ham" several times:
Green Eggs and Ham is one of Seuss's "Beginner Books", written in a very simple vocabulary for beginning readers. The vocabulary of the text consists of just fifty different words.
Is that actually true?
In today's lab we will use lists and dictionaries to confirm that this is true and play with some ways to look at how many times each word is used.
While I won't require it, I strongly encourage you to work with a partner.
Before you begin, please create a new Python file called lab12.py and write the traditional comments at the top (file, author(s), and description). In addition, copy the following function into lab12.py where you replace my name with your name(s):
In this lab you will be reading a variety of files containing text. I have three files you can use for testing purposes:
In two different activities in this lab you will be analyzing the individual words in these text files. In both of those activities you will need to process the text file into these individual words. Thus, it might be helpful to have a piece of code that does that common prep work.
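One possible shape for that common prep work is sketched below. It is only a sketch under some assumptions: that a "word" is anything separated by whitespace, that upper and lower case should be treated the same, and that punctuation on the ends of a word should be thrown away. The helper name splitIntoWords() is made up for illustration and works on a string rather than a file.

```python
import string

def splitIntoWords(text):
    """Return the words in text, lowercased, with punctuation stripped from the ends.

    Hypothetical helper for illustration; not the function the lab asks for.
    """
    words = []
    for token in text.split():
        # strip() only removes punctuation from the two ends of the token
        cleaned = token.strip(string.punctuation).lower()
        if cleaned != "":
            words.append(cleaned)
    return words

sample = "I do not like them in a box. I do not like them with a fox."
print(splitIntoWords(sample))
print(len(splitIntoWords(sample)))
```

To use something like this on a file, you would first read the file's contents into a string (for example with open() and read()) and then pass that string along.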
By the time you are done with this activity, I want you to have a function called createWordList() that:
The results of Activity A are nice, but they don't answer our question about whether or not there are only 50 words in Green Eggs and Ham. Heck, the answer given is that there are 804. But the word "like," for example, is in there many times. We want to figure out how many UNIQUE words are in the file. It turns out this isn't too difficult to do by using a list.
What I will want
By the time you are done with this activity, I want you to have a function called countUniqueWords() that:
Think about this problem for a little bit. It might sound easy at first, but how does the actual counting work? How can we take a section of the book like:
I do not like them in a box. I do not like them with a fox.
and get the correct count? Notice that while there are 16 words in that section, there are only 10 UNIQUE words. In other words, I need to write some code that does something like:
    i    => 1 word
    do   => 2 words
    ...
    a    => 7 words
    box  => 8 words
    i    => already seen this, so 8 words
    do   => already seen this, so 8 words
    ...
    with => 9 words
    a    => already seen this, so 9 words
    fox  => 10 words
Well, we could keep a list of words that we have seen. Thus, we are really saying:
    start with an empty list of words
    look at the first word => "i"
        Is that in my list? => no
        Add it to my list => ["i"]
    look at the next word => "do"
        Is that in my list? => no
        Add it to my list => ["i", "do"]
    ...
    look at the next word => "box"
        Is that in my list? => no
        Add it to my list => ["i", "do", "not", "like", "them", "in", "a", "box"]
    look at the next word => "i"
        Is that in my list? => yes
        move on => ["i", "do", "not", "like", "them", "in", "a", "box"]
    look at the next word => "do"
        Is that in my list? => yes
        move on => ["i", "do", "not", "like", "them", "in", "a", "box"]
    ...
    look at the next (last) word => "fox"
        Is that in my list? => no
        Add it to my list => ["i", "do", "not", "like", "them", "in", "a", "box", "with", "fox"]
    return how many words are in my list
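The steps above translate into Python fairly directly. Here is a minimal sketch, assuming the words have already been lowercased and had their punctuation removed. The name countUniqueSketch() and its single list parameter are my own choices for illustration; your countUniqueWords() may need to look different.

```python
def countUniqueSketch(words):
    """Count the distinct words in a list by keeping a list of words already seen."""
    seen = []                      # start with an empty list of words
    for word in words:             # look at each word in turn
        if word not in seen:       # Is that in my list?
            seen.append(word)      # no => add it to my list
        # yes => move on without changing the list
    return len(seen)               # return how many words are in my list

section = "i do not like them in a box i do not like them with a fox".split()
print(len(section))                # 16
print(countUniqueSketch(section))  # 10
```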
So Activity B produced cool results. We now know that it is true - Green Eggs and Ham does in fact contain only 50 unique words. But it's a reasonably long book. I got to thinking about how many times each word occurs - how many times do they use the word "eat" or "sam"?
This is a perfect place to use a dictionary - the keys in the dictionary are the words that are found, while the value associated with each key is the number of times that word is found.
In this part of the lab I want you to write a function that tallies the occurrences of each unique word.
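To make the key/value idea concrete, here is a small sketch of tallying with a dictionary. It assumes the words are already in a cleaned-up list. The name tallyWords() is hypothetical and used only for this illustration.

```python
def tallyWords(words):
    """Build a dictionary mapping each word to the number of times it appears."""
    counts = {}
    for word in words:
        if word in counts:
            counts[word] = counts[word] + 1   # seen before: bump the tally
        else:
            counts[word] = 1                  # first sighting: start the tally at 1
    return counts

section = "i do not like them in a box i do not like them with a fox".split()
tallies = tallyWords(section)
print(tallies["like"])   # 2
print(tallies["fox"])    # 1
print(len(tallies))      # 10
```

Notice that the number of keys in the dictionary is the same as the number of unique words.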
What I will want
By the time you are done with this activity, I want you to have a function called countTimes() that:
For example, my results looked like this (yours MAY be in a different order).
You are required to complete Activities A-C for full credit on this lab.
Let's figure out how to make the results a little more useful. Figure out how to alphabetize the word list:
The extra formatting is not required but encouraged.
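One way to approach the alphabetizing is to loop over the dictionary's keys in sorted order. The sketch below uses a tiny made-up dictionary (not real counts from the book), and the name alphabeticalLines() and the column widths are arbitrary choices for illustration.

```python
def alphabeticalLines(counts):
    """Return one formatted line per word, with the words in alphabetical order."""
    lines = []
    for word in sorted(counts):    # sorted() on a dictionary gives its keys in order
        # left-justify the word in 10 columns, right-justify the count in 4
        lines.append("{:<10}{:>4}".format(word, counts[word]))
    return lines

# Made-up sample data for illustration only
sampleCounts = {"sam": 3, "eat": 2, "ham": 1}
for line in alphabeticalLines(sampleCounts):
    print(line)
```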
Figure out how to sort the results by frequency.
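Sorting by frequency means sorting on the values instead of the keys. A possible sketch, again on made-up data, uses sorted() with a key function that picks the count out of each (word, count) pair. The name sortByFrequency() is hypothetical.

```python
def sortByFrequency(counts):
    """Return a list of (word, count) pairs, most frequent word first."""
    # items() gives (word, count) pairs; the key function picks out the count
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

# Made-up sample data for illustration only
sampleCounts = {"ham": 1, "sam": 3, "eat": 2}
for word, count in sortByFrequency(sampleCounts):
    print(word, count)
```

Words with the same count are not put in any special order relative to each other here. If you want ties broken alphabetically, you would need a slightly fancier key.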
This week I will again ask you to submit your code for electronic grading, using the eLearning submission system.
Follow the directions on the system to select the appropriate course and assignment and submit lab12.py.
If you worked with a partner, make sure that both your name and your partner's name are in the comment header at the top of the file and in the printName() function.