Intro to Computer Science

Program Assignment #8

Creating Word Tags/Clouds of a Speech


Code due by Saturday, December 12th at 11:45 PM

Paperwork due at the start of the last class
(or bring paperwork to 8 am Wednesday 12/16 final exam).


Assignment Overview

The goal of this project is to

  1. gain experience using pieces of code written by someone other than yourself,

  2. test your new top-down design skills in breaking a large problem down into steps and helper functions, and

  3. gain more practice with file I/O, dictionaries, lists, tuples, and functions.

Background

Tag clouds (http://en.wikipedia.org/wiki/Tag_cloud) are a common element on web pages these days. A tag cloud is a visual representation of word frequency, in which more frequent words appear in a larger font. Colors and placement can also be used.  We are going to analyze a presidential debate transcript and create a tag cloud for each candidate based on the words they used, where the frequency of each word determines its font size in the cloud.
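
As a rough illustration of how a word's count can become a font size, here is a minimal sketch that scales linearly between the smallest and largest counts. The function name font_size_for and the pixel bounds are assumptions made up for this example; the provided htmlFunctions.py may compute sizes differently.

```python
def font_size_for(count, low, high, min_px=10, max_px=48):
    """Map a word count onto a font size by linear interpolation.

    count  -- how often this word was used
    low    -- smallest count among the words being drawn
    high   -- largest count among the words being drawn
    min_px, max_px -- hypothetical font-size bounds (assumed values)
    """
    if high == low:
        # Every word is equally frequent, so avoid dividing by zero.
        return max_px
    fraction = (count - low) / (high - low)
    return int(round(min_px + fraction * (max_px - min_px)))
```

The least frequent word in the cloud gets min_px, the most frequent gets max_px, and everything else falls in between.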

To help you with this assignment you are provided with the following documents:

Each of these files is explained below:

Play with this file for a few minutes until you see how it works.  You do not need to understand all of the details, but you need to understand what each function does and how they work together.

I played with the htmlFunctions.py file and then renamed it htmlBoxingRocky.py.
This is shown here so you could see an example of achieving goal #1:
                    gain experience using pieces of code written by someone other than yourself.
To play with means to tinker with, to modify, to experiment and to use the code until you get a feel for how it works.

After playing with htmlFunctions.py, I got started by reading in the debate file and checking whether I could find the candidate name tags. Here is the evidence from that play, which amounts to solving a simpler problem first.
                  Reading files, finding candidates: gettingStarted.txt is an edited copy of a Python Shell session.

 

Project Specifications

In a file called pa10.py, you will need to write a group of helper functions called by a master function (named main()). Together they will read through one of the debate files, create a dictionary of words spoken by a specific speaker, remove stop words, identify the 40 most frequently used words by that speaker, and use that information to call the helper functions provided in the file htmlFunctions.py to create the word cloud. Your code should include at least 3 new helper functions (other than main()) to show that you understand how to break large problems into small, manageable steps. Of course, you can create more than 3 new helper functions if you like.

The master function main() will take in two parameters: the name of the debate file and the name of the candidate.

For example, if I invoke:

main("repDebate2015Sept16.txt","TRUMP")

my code should produce a file called TRUMP.html, which is the word cloud of the words spoken by that candidate in the Republican debate.  BTW, that file should look like this one (TRUMP.html).

 

If I invoke:

main("demDebate2015Oct13.txt","CLINTON")

my code should produce a file called CLINTON.html, which is the word cloud of the words spoken by that candidate in the Democratic debate.  BTW, that file should look like this one (CLINTON.html).

 

CHECKING YOUR WORK.  If you run the commands above on your finished code and the html files look different from mine, then something isn't quite right for one of us.  It COULD be me who is wrong, but you should ask questions.

 

Helpful Hints:
 

  1. Create your design diagrams and documents first! Break down all the steps that need to be performed, and figure out which functions they go into. (Note: You do not have to turn in a design tree, but it will probably be useful to you to create one.)
  2. Parsing the debate file. You have to read in the file and separate the lines according to who said them. Use the file format to help you with this. Remember, once you see one of the speaker tags, all lines/words belong to that speaker until you see another speaker tag.
  3. You have to remove the stop words.  You have two choices.  Either do it as you are reading the debate or go back and remove them once you are done reading the entire debate.  Each has advantages.  Pick one and go with it.
  4. Also remember to remove punctuation from words. Just because a word comes at the end of a sentence and has a period after it doesn’t make it a different word. Also, you don't need to count the audience words inside parentheses, like (APPLAUSE) and (LAUGHTER). You can tell audience words apart from regular speech because they occur on separate lines and are surrounded by parentheses.
  5. Capital letters are a potential problem.  If we aren't careful, the word "Economic" at the start of a sentence will be counted separately from the word "economic" inside a sentence.  For simplicity I will allow you to treat all words as lowercase. Having said that, be careful about when you convert to lowercase, since speakers are labeled in all caps ("SANDERS:") and you may want/need to use this information.
  6. Count the word frequency in the candidate’s words. Use a dictionary, where the key is the word and the value is the count.
  7. Once you have the dictionary for the candidate in question, you need to extract the 40 most frequently used non-stopwords and their counts.  This should be familiar.  You will need to use lists of tuples/lists as we have done before.
  8. We also need to extract the biggest count and the smallest count from the words in this top 40, as that information is needed by makeHTMLword().
  9. Once you have the forty most frequent words (and their counts), we want to alphabetize that list.  That sounds like another set of tuples/lists in a list.
  10. Finally, use the code presented to you in htmlFunctions.py to generate the html file with the appropriate name.
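
To make hints 2 through 9 more concrete, here is one possible sketch of the parsing, counting, and ranking steps. It is only an illustration, not the expected solution. All of the function names, the tiny STOP_WORDS set, and the assumption that a speaker tag is the first all-caps token on a line ending in a colon (like "SANDERS:") are made up for this example; check them against the real debate files and stop-word list before relying on them.

```python
import string

# Hypothetical stop words for illustration only; use the real list instead.
STOP_WORDS = {"the", "and", "a", "to", "of", "is"}


def words_by_speaker(lines, speaker):
    """Return the cleaned, lowercase, non-stop words spoken by `speaker`."""
    tag = speaker + ":"
    speaking = False
    words = []
    for line in lines:
        line = line.strip()
        if not line or (line.startswith("(") and line.endswith(")")):
            continue  # skip blank lines and audience lines like (APPLAUSE)
        first = line.split()[0]
        if first.endswith(":") and first.isupper():
            # Speaker tag found: compare it BEFORE converting to lowercase.
            speaking = (first == tag)
            line = line[len(first):]
        if speaking:
            for raw in line.split():
                word = raw.strip(string.punctuation).lower()
                if word and word not in STOP_WORDS:
                    words.append(word)
    return words


def count_words(words):
    """Build a dictionary mapping each word to how often it occurs."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def top_words(counts, n=40):
    """Return (alphabetized (word, count) list, smallest count, largest count)."""
    pairs = [(count, word) for word, count in counts.items()]
    pairs.sort(reverse=True)  # largest counts first
    top = pairs[:n]
    if not top:
        return [], 0, 0
    high = top[0][0]
    low = top[-1][0]
    alphabetical = sorted((word, count) for count, word in top)
    return alphabetical, low, high
```

Putting the count first in each tuple lets the default sort rank by frequency. Flipping the tuples to (word, count) afterwards makes the same default sort alphabetize the top words.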

 


Deliverables - by Saturday 12/12 at 11:45 pm

By the due date and time, submit:

Use the on-line submission system.

Make sure that your program meets the course programming standards.




Additional optional information and hints: More Goals, Directions, Hints, Explanations...

VIP: from_htmlFunctions_import.txt tip added on December 7th.

# Note how the htmlFunctions.py file has the 3 functions we need to utilize
# in our application, but that code does NOT clutter up our program file.

from htmlFunctions import makeHTMLbox
from htmlFunctions import printHTMLfile
from htmlFunctions import makeHTMLword