Homework Assignment 12

Analyzing Documents with Tag Clouds


CS 1510
Introduction to Computing
Fall Semester 2014


Due: Saturday, December 13, at 10:00 AM


Introduction

For our final assignment, you will use many of the skills and Python features you have learned over the semester:

You will also generate web pages that you can show your family and friends!



Background: Tag Clouds for Presidential Debates

A tag cloud is a way to represent text data visually. It can help us see what ideas are most important to a writer or speaker by showing us information about words that appear in the data using size, color, and other visual features.

Here is a tag cloud showing the hundred words used most often to describe production music tracks, from a 2012 BBC blog article:

top 100 words describing production music tracks

In a tag cloud, size usually indicates the frequency of words, with more frequent words presented in a larger font. Color and placement can also carry information, such as categories, though they are often used to create a more visually appealing tag cloud.

Of course, many words in a data set may carry little information about important ideas. For example, some words appear frequently in most every set of spoken or written English. If we tabulate the number of times a, the, and of occur in a document, those numbers will dwarf the occurrences of the actual content of the document. They will also obscure differences among speakers, writers, and data sets. So, most implementations of tag clouds filter out common words, as well as numbers and punctuation.

For more information, check out the Wikipedia entry for the term.



Data for the Assignment

For this assignment, you will write a Python program to analyze the transcript of a presidential debate and create a tag cloud for a candidate based on the words he or she used. The frequency of the words will determine the size of the font in the cloud.

Transcripts.

We will use transcripts of the three 2012 Presidential debates between Pres. Obama and Gov. Romney. These data files are part of the zip file for the assignment:

You may want to use the partial transcript for testing. This file is still large (~14,000 characters) but gives you an opportunity to work with a data set smaller than a full debate (~100,000 characters). This can provide you faster feedback than processing a full transcript would.

Notice that the transcript files share a common format. There are three speakers: Pres. Obama, Gov. Romney, and the moderator (Lehrer, Crowley, and Schieffer, respectively). Each time one of the three participants begins to speak, a line is marked with the speaker's name in all caps, followed by a colon -- OBAMA:, ROMNEY:, and, say, LEHRER:. Once such a label appears, all words are attributed to that speaker until another label occurs. Notice that the next label often does not occur for many lines.
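To make the labeling rule concrete, here is one possible sketch of a helper that collects the lines attributed to a single speaker. The function name lines_spoken_by is hypothetical, and the sketch assumes that a label is always the first thing on its line. Check the transcripts to see whether that assumption holds before you rely on it.

```python
def lines_spoken_by(list_of_lines, speaker):
    """Return the text attributed to speaker, one string per line.

    Hypothetical helper: assumes a speaker label such as OBAMA:
    is always the first token on the line where it appears.
    """
    label = speaker.upper() + ':'
    current = None                  # label of whoever is speaking now
    spoken = []
    for line in list_of_lines:
        line = line.strip()
        words = line.split()
        # an all-caps first token ending in a colon starts a new speaker
        if (words and words[0].endswith(':')
                and words[0][:-1].isalpha() and words[0].isupper()):
            current = words[0]
            line = line[len(words[0]):].strip()   # drop the label itself
        if current == label and line:
            spoken.append(line)
    return spoken
```

Notice that the sketch remembers the current speaker between lines. That is what lets it handle a speech that runs for many lines after a single label.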

Stop Words.

We need to omit common words that carry no value in your tag cloud. This data file is part of the zip file for the assignment:

The basis for this file is a list distributed with MySQL, an open-source database, with a few additions.

Each line in the file contains a single word. Any word in the stop-word list should be omitted from the tag cloud. You can omit it entirely from your processing.
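Here is a minimal sketch of how the stop-word list might be loaded and applied. Both function names are hypothetical, and the use of a set is my assumption. A list works just as well with the same "in" test, only more slowly.

```python
def load_stop_words(list_of_lines):
    """Build a collection of stop words from the lines of the stop-word file.

    Hypothetical helper: assumes one word per line, as described above.
    """
    stop_words = set()
    for line in list_of_lines:
        word = line.strip().lower()
        if word:                    # skip any blank lines
            stop_words.add(word)
    return stop_words


def remove_stop_words(words, stop_words):
    """Return the words that are not in the stop-word collection."""
    return [word for word in words if word not in stop_words]
```

For example, with the stop words a, the, and of loaded, the list ['the', 'economy', 'of', 'jobs'] is reduced to ['economy', 'jobs'].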

All five of these data files are included in the zip file for the assignment.



Code for the Assignment

Your program will process the data files to create the data you need to create a tag cloud. Actually producing a tag cloud as an image would take a bit more work... Instead, we will use your data to produce a web page that displays words in different sizes. That takes a bit more work, too, but it can be done using Python features you already know. It does require some knowledge of HTML.

To make this assignment doable in the time we have available, use the html_tag_cloud module:

It contains three functions that will generate the HTML we want from the data you produce:

Play with this file for a few minutes until you see how it works. You do not need to understand all of the details, but you do need to understand what each function does and how they work together.

You can import these functions into your program by including these lines at the top of your program file:

    from html_tag_cloud import make_html_word
    from html_tag_cloud import make_html_box
    from html_tag_cloud import print_html_file

~~~~~

Download this zip file containing the code and data files described above. Create your program file, named tag_cloud.py, in the same directory.



Specification: A Tag Cloud Generator

Define a function named main() that...

Your main function should call helper functions to achieve its tasks, including the three functions in the HTML module described above and no fewer than three helper functions defined by you.

main() must take two parameters:

This speaker's name should be the exact label used in the transcript file, without the colon. The caller might not give it in all capital letters, though.

For example, if I invoke:

    main('2012-debate-01.txt', 'Romney')
your code should produce a file named ROMNEY.html containing the HTML for a web page that contains a word cloud for the words spoken by Gov. Romney in the first 2012 debate. That page should look like this one.

If instead I invoke:

    main('2012-debate-01.txt', 'Obama')
your code should produce a file named OBAMA.html containing the HTML for a web page that contains a word cloud for the words spoken by Pres. Obama in that debate. Your page should look like this one.
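Both examples follow the same naming rule: the output file is the speaker's name in all caps with .html added. A tiny sketch of that rule, using a hypothetical helper name, might look like this. Whether you need to build the name yourself depends on how you use the functions in the HTML module.

```python
def output_filename(speaker):
    """Return the name of the HTML file for speaker (hypothetical helper)."""
    return speaker.upper() + '.html'
```

With this helper, both 'Romney' and 'ROMNEY' lead to the file name ROMNEY.html.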

Checking Your Work. If you run the commands above on your finished code and the HTML files look different from the ones I posted, then either something is wrong with your output or something is wrong with the file I posted. It could be that the file I posted is wrong, but you should ask questions.



Design and Programming Hints

Break the problem down into smaller steps.

At the top level, I imagine main() calling functions to:

You may have a different idea of how to proceed, which is fine. In any case, breaking the problem down as we practiced in Lab 13 is likely to help you think about how to write actual lines of code. You don't have to stop at one level... Several of my steps can be broken down further.

Remember to clean up the lines and words before you process the words.

There will be end-of-line characters at the end of the lines you read from the file, and there may be other white space on the line.

Remember to remove punctuation from words. Splitting on spaces will leave commas, periods, and other characters in the strings it produces.

Capital letters are also a potential problem. A word at the start of a sentence will be capitalized, but it's the same word as one that appears as lowercase in the middle of a sentence. For simplicity, you may do what we have usually done when processing words: convert all words to lowercase before processing.

But be careful when you convert words to lowercase. Remember that speakers are labeled in the transcript with uppercase strings such as "ROMNEY". Don't convert your strings before you have used that information.

The transcripts occasionally contain text that is spoken or created by the audience. For example, some lines contain (APPLAUSE) and (LAUGHTER) to indicate audience reaction. In general, "audience words" occur on separate lines and appear inside parentheses. You should omit them.
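The clean-up steps above can be sketched roughly as follows. All three function names are hypothetical. The sketch assumes that audience words fill an entire line inside parentheses, and it does not try to filter out numbers. It should be applied only after any speaker label has been handled, since it converts everything to lowercase.

```python
import string


def clean_word(word):
    """Strip punctuation from both ends of word and convert to lowercase."""
    return word.strip(string.punctuation).lower()


def is_audience_line(line):
    """Return True if the whole line is wrapped in parentheses."""
    stripped = line.strip()
    return stripped.startswith('(') and stripped.endswith(')')


def clean_words(line):
    """Return the cleaned words on line, or [] for an audience line."""
    if is_audience_line(line):
        return []
    cleaned = []
    for word in line.split():       # split() also discards the end-of-line
        word = clean_word(word)
        if word:                    # drop strings that were all punctuation
            cleaned.append(word)
    return cleaned
```

Stripping only the ends of each word keeps an apostrophe inside a contraction such as don't, while removing a trailing comma or period.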

Use the file method readlines() to load the transcript.

We have not used the readlines() method yet. It does in one line what we have been writing a loop to do. After this snippet:
    transcript_file = open(filename, 'r')
    list_of_lines   = transcript_file.readlines()
    transcript_file.close()
the variable list_of_lines is a list in which the items are the individual lines of the file.

Let the data structures you know help you.

When you count the frequency of the words spoken by the candidate, use a dictionary in which the key is the word and the value is the count.
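A minimal sketch of the counting step, assuming you already have a list of cleaned words for the candidate. The function name is hypothetical.

```python
def count_words(words):
    """Return a dictionary that maps each word to its number of occurrences."""
    counts = {}
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1        # first time we have seen this word
    return counts
```

For example, counting ['jobs', 'tax', 'jobs'] gives {'jobs': 2, 'tax': 1}.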

When you find the forty most common words spoken by the candidate, you can sort a list of (count, word) tuples.

Once you have the forty most frequent words (and their counts), you need to alphabetize the list. That sounds like another list of tuples.
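Here is one way the two sorting steps might look, as a sketch rather than the required design. Both function names are hypothetical. Note one assumption: when two words have the same count, this sketch orders them by the word in reverse alphabetical order, so the words that make the cut-off at forty may differ from another solution's.

```python
def most_common(counts, how_many=40):
    """Return the how_many largest (count, word) tuples, biggest first."""
    pairs = [(count, word) for word, count in counts.items()]
    pairs.sort(reverse=True)        # tuples compare by count, then by word
    return pairs[:how_many]


def alphabetize(pairs):
    """Turn (count, word) tuples into (word, count) tuples sorted by word."""
    by_word = [(word, count) for count, word in pairs]
    by_word.sort()
    return by_word
```

Putting the count first in the tuple makes the first sort order by frequency. Swapping the items makes the second sort order by word.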

Let the HTML module functions help you.

The functions in the html_tag_cloud module will generate the word strings, the tag cloud box, and the HTML page you need. Call them at the appropriate time.

Use the sample usage code in that module as a starting point.

Before you call the make_html_word function, you need to know the biggest word count and the smallest word count from the words in this Top 40 list.
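Finding the two extremes is a short job for max and min. This sketch assumes the Top 40 list holds (word, count) tuples and uses a hypothetical function name. It does not call make_html_word, because that function's parameters are documented in the module itself, not here.

```python
def count_range(word_counts):
    """Return (biggest, smallest) count from a list of (word, count) tuples."""
    counts = [count for word, count in word_counts]
    return max(counts), min(counts)
```

For example, count_range([('jobs', 5), ('people', 5), ('tax', 3)]) returns (5, 3).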



Demonstration of Correctness

Start a fresh Python shell. Run the final version of your program on at least three combinations of transcript file and speaker.



Deliverables

By the due date and time, submit:

Use the on-line submission system.

Make sure that your program meets the course programming standards.



Eugene Wallingford ..... wallingf@cs.uni.edu ..... December 7, 2014