For our final assignment, you will use many of the skills and Python features you have learned over the semester:
You will also generate web pages that you can show your family and friends!
A tag cloud is a way to represent text data visually. It can helps us see what ideas are most important to a writer or speaker by showing us information about words that appear in the data using size, color, and other visual features.
Here is tag showing the hundred words used most often to describe production music tracks, from a 2012 BBC blog article:
In a tag cloud, size usually indicates the frequency of words, with more frequent words presented in a larger font. Color and placement can also carry information, such as categories, though they are often used to create a more visually appealing tag cloud.
Of course, many words in a data set may carry little information about important ideas. For example, some words appear frequently in most every set of spoken or written English. If we tabulate the number of times a, the, and of occur in a document, those numbers will dwarf the occurrences of the actual content of the document. They will also obscure differences among speakers, writers, and data sets. So, most implementations of tag clouds filter out common words, as well as numbers and punctuation.
For more information, check out the Wikipedia entry for the term.
For this assignment, you will write a Python program to analyze the transcript of a presidential debate and create a tag cloud for a candidate based on the words he or she used. The frequency of the words will determine the size of the font in the cloud.
Transcripts.
We will use transcripts of the three 2012 Presidential debates between Pres. Obama and Gov. Romney. These data files are part of the zip file for the assignment:
- 2012-debate-01.txt -- the October 3, 2012, debate
- 2012-debate-02.txt -- the October 16, 2012, debate
- 2012-debate-03.txt -- the October 22, 2012, debate
- 2012-debate-01-partial.txt -- a partial transcript of the first debate, which contains only the first five minutes or so of the full transcript
You may want to use the partial transcript for testing. This file is still large (~14,000 characters) but gives you an opportunity to work with a data set smaller than a full debate (~100,000 characters). This can provide you faster feedback than processing a full transcript would.
Notice that the transcript files share a common format. There are three speakers: Pres. Obama, Gov. Romney, and the moderator (Lehrer, Crowley, and Schieffer, respectively). Each time one of the three participants begins to speak, a line is marked with the speaker's name in all caps, followed by a colon -- OBAMA:, ROMNEY:, and, say, LEHRER:. Once such a label appears, all words are attributed to that speaker until another label occurs. Notice that this often does not occur for many lines.
Stop Words.
We need to omit common words that carry no value in your tag cloud. This data file is part of the zip file for the assignment:The basis for this file is a list distributed with MySQL, an open-source database, with a few additions.
Each line in the file contains a single word. Any word in the stop-word list should be omitted from the tag cloud. You can omit it entirely from your processing.
All five of these data files are included in the zip file for the assignment.
Your program will process the data files to create the data you need to create a tag cloud. Actually producing a tag cloud as an image would take a bit more work... Instead, we will use your data to produce a web page that displays words in different sizes. That takes a bit more work, too, but it can be done using Python features you already know. It does require some knowledge of HTML.
To make this assignment doable in the time we have available, use the html_tag_cloud module:
It contains three functions that will generate the HTML we want from the data you produce:
This function takes a word as input and returns a string that is the word wrapped in a FONT tag with a specific size. To do this, the function also needs count, the number of times the word occurs in the document; high, the highest word count of the words being processed; and low, the lowest word count of the words being processed. You can change the values of two constants in the file, HTML_BIG and HTML_LITTLE, to control the size range of the words in your tag cloud.
This function takes as input body, a string, and returns the body wrapped in an HTML tag that produces a box to be displayed. We can pass this function a string containing all the font-wrapped words we receive from make_html_word().
Takes the body and title for and a web page, wraps them in the HTML code for a standard web page, and writes the HTML code to a file. The file name is the title with an .html suffix. We can pass this function a string containing the HTML box we receive from make_html_box().
Play with this file for a few minutes until you see how it works. You do not need to understand all of the details, but you do need to understand what each function does and how they work together.
You can import these functions into your program by including these lines at the top of your program file:
from html_tag_cloud import make_html_word from html_tag_cloud import make_html_box from html_tag_cloud import print_html_file
Download this zip file containing the code and data files described above. Create your program file, named tag_cloud.py, in the same directory.
Define a function named main() that...
Your main function should call helper functions to achieve its tasks, including the three functions in the HTML module described above and no fewer than three helper functions defined by you.
main() must take two parameters:
This speaker's name should be the exact label used in the transcript file, without the colon. It may not be in all capital letters, though.
For example, if I invoke:
main('2012-debate-01.txt', 'Romney')your code should produce a file named ROMNEY.html containing the HTML for a web page that contains a word cloud for the words spoken by Gov. Romney in the first 2012 debate. That page should look like this one.
If instead I invoke:
main('2012-debate-01.txt', 'Obama')your code should produce a file named OBAMA.html containing the HTML for a web page that contains a word cloud for the words spoken by Pres. Obama in that debate. Your page should look like this one.
Checking Your Work. If you run the commands above on your finished code and the HTML files look different from the ones I posted, then either something is wrong with your output or something is wrong with the file I posted. It could be that the file I posted is wrong, but you should ask questions.
Break the problem down into smaller steps.
At the top level, I imagine main() calling functions to:
- pull a speaker's lines out of the file,
- create a list of the words spoken, minus stop words,
- create a list of word/count tuples,
- identify the 40 most frequently used words, and
- produce the web page with a tag cloud.
You may have a different idea of how to proceed, which is fine. In any case, breaking the problem down as we practiced in Lab 13 is likely to help you think about how to write actual lines of code. You don't have to stop at one level... Several of my steps can be broken down further.
Remember to clean up the lines and words before you process the words.
There will be end-of-line characters at the end of the lines you read from the file, and there may be other white space on the line.Remember to remove punctuation from words. Splitting on spaces will leave commas, periods, and other characters in the strings it produces.
Capital letters are a also potential problem. A word at the start of a sentence will be capitalized, but it's the same word as one that appears as lowercase in the middle of a sentence. For simplicity, you may do what we have usually done when processing words: convert all words to lowercase before processing.
But be careful when you convert words to lowercase. Remember that speakers are labeled in the transcript with uppecase strings such as "ROMNEY". Don't convert your strings before you have used that information.
The transcripts occasionally contain text that is spoken or created by the audience. For example, some lines contain (APPLAUSE) and (LAUGHTER) to indicate audience reaction. In general, "audience words" occur on separate lines and appear inside parentheses. You should omit them.
Use the file method readlines() to load the transcript.
We have not used the readlines() method yet. It does in one line what we have been writing a loop to do. After this snippet:transcript_file = open(filename, 'r') list_of_lines = transcript_file.readlines() transcript_file.close()the variable list_of_lines is a list in which the items are the individual lines of the file.
Let the data structures you know help you.
When you count the frequency of the words spoken by the candidate, use a dictionary in which the key is the word and the value is the count.When you find the forty most common words spoken by the candidate, you can sort a list of (count, word) tuples.
Once you have the forty most frequent words (and their counts), you need to alphabetize the list. That sounds like another list of tuples.
Let the HTML module functions help you.
The functions in the html_tag_cloud module will generate the word strings, the tag cloud box, and the HTML page you need. Call them at the appropriate time.Use the sample usage code in that module as a starting point.
Before you call the make_html_word function, you need to know the biggest word count and the smallest word count from the words in this Top 40 list.
Start a fresh Python shell. Run the final version of your program on at least three combinations of transcript file and speaker.
By the due date and time, submit:
Use the on-line submission system.
Make sure that your program meets the course programming standards.