Intro to Computer Science
Program Assignment #8
Creating Word Tags/Clouds of a Speech
Code due by Saturday, December 12th at 11:45 PM
Paperwork due at the start of the last class
(or bring paperwork to 8 am Wednesday 12/16 final exam).
Assignment Overview
The goal of this project is to
- gain experience using pieces of code written by someone other than yourself,
- test your new top-down design skills in breaking a large problem down into steps and helper functions, and
- gain more practice with file I/O, dictionaries, lists, tuples, and
functions.
Background
A common element seen on web pages these days is the tag cloud (http://en.wikipedia.org/wiki/Tag_cloud).
A tag cloud is a visual representation of frequency of words, where more
frequent words are represented in larger font. One can also use colors and
placement. We are going to analyze a presidential debate transcript and create a tag
cloud for each candidate based on the words they used, where the frequency of
the words indicates the size of the font in the cloud.
To help you with this assignment you are provided with the following
documents:
- A transcript of a debate (take your pick)
- The September 16, 2015 Republican debate.
(repDebate2015.txt)
- The October 13, 2015 Democratic debate.
(demDebate2015.txt)
- For testing purposes only, a partial transcript of each debate that
only contains the first five minutes or so of the debate is given.
This is still big, but it might give you a chance to print some stuff to the
screen without flooding the system when you are testing your code.
(rep-small.txt and dem-small.txt)
- A list of stop words (stopSQL.txt)
- Some helper functions (htmlFunctions.py)
Each of these files is explained below:
- Transcript - Open up the file for the debate you
picked. Each transcript is in a particular format. Each time one of the
participants speaks, that line is marked with the speaker's name
followed by a colon (‘CLINTON:’, ‘SANDERS:’, ‘CHAFEE:’, ‘WEBB:’, ‘O'MALLEY:’, or ‘COOPER:’ for the Democratic debate). Once
a label is encountered, all words are attributed to that speaker until another label
occurs. Note that the next label may not occur for several lines.
- Stopwords - Not all words are worth counting. In the context of
speeches, ‘a’, ‘the’, ‘was’,
etc. are just junk. A list of such words is provided as stopSQL.txt. Each line
has a single word. No word in the stop word list should be counted in the tag
cloud. This is the list distributed with MySQL 4.0.20, with a few
additions (mostly just duplication of contractions; that is, both “can’t” and
“cant” are now in the list).
- Functions - Three functions and an example are provided in
htmlFunctions.py. Use them in your program. That file contains:
- makeHTMLword(word, cnt, high, low)
This function takes a word and
wraps it in a font tag with a specific size. The function takes the word to
be wrapped, how many times it occurred in the document, and the highest and
lowest word counts among the words being processed (the highest count
we are considering for this tag cloud and the lowest). It returns a string containing
the word with a font size between htmlBig and htmlLittle (two local variables in the
function; you can change them to be whatever you like).
- makeHTMLbox(body)
This function takes a single string of all the
font-wrapped words from makeHTMLword and places them in an html box to be
displayed. It returns a string which is the html code for the box.
- printHTMLfile(body,title)
This function takes the body returned from makeHTMLbox and
wraps a standard html web page around it. The string title is used in the
html. The title, with an ‘.html’ suffix, is also used as the file name.
Play with this file for a few minutes until you see how it works. You
do not need to understand all of the details, but you need to understand what
each function does and how they work together.
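To give a feel for what "a font size between htmlBig and htmlLittle" could mean, here is a minimal sketch of one plausible scaling rule (straight-line interpolation between the lowest and highest counts). This is NOT the code in htmlFunctions.py: the names scale_font_size and wrap_word, the default sizes 96 and 14, and the span markup are all made up for illustration, and the real makeHTMLword may work differently.

```python
def scale_font_size(cnt, high, low, html_big=96, html_little=14):
    """Map a word count onto a font size between html_little and html_big.

    Hypothetical stand-in: the real makeHTMLword may scale differently.
    """
    if high == low:
        # Every word has the same count, so avoid dividing by zero.
        return html_big
    # float() keeps the division exact under Python 2 as well as Python 3.
    ratio = float(cnt - low) / (high - low)
    return int(ratio * (html_big - html_little) + html_little)


def wrap_word(word, cnt, high, low):
    """Wrap a word in markup whose font size reflects its count."""
    size = scale_font_size(cnt, high, low)
    return '<span style="font-size:{}px">{}</span>'.format(size, word)
```

With this rule the most frequent word gets the largest size, the least frequent word in the group gets the smallest, and everything else falls in between.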
I played with the htmlFunctions.py file and then
renamed it htmlBoxingRocky.py.
This is shown here so you can see an example of
achieving goal #1:
gain experience using pieces of code written by someone other than yourself.
To play with code means to tinker with it, modify it, experiment with it, and use it until you get a feel for how it works.
After playing with htmlFunctions.py, I got started by reading in the file
and seeing if I could find the candidate name tags okay. Here is the
result of that play and of solving a simpler problem first.
Reading files, finding candidates: gettingStarted.txt is an edited copy of a Python Shell session.
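If you want a starting point for that simpler problem, here is a rough sketch of finding speaker tags and collecting one speaker's lines. It is not the contents of gettingStarted.txt. The function names are made up, and it assumes a label is a run of capital letters (apostrophes allowed, as in O'MALLEY) at the very start of a line, followed immediately by a colon. Check that assumption against the transcript you picked.

```python
def find_speaker_label(line):
    """Return the speaker label at the start of a line, or None.

    Assumption: a label is all capital letters (apostrophes allowed)
    followed immediately by a colon, e.g. "SANDERS:".
    """
    head, colon, _rest = line.partition(":")
    if not colon or not head:
        return None
    letters = head.replace("'", "")
    if letters.isalpha() and letters.isupper():
        return head
    return None


def lines_for_speaker(lines, speaker):
    """Collect the text attributed to one speaker.

    Text belongs to the most recent label seen, so a speech
    may continue over several unlabeled lines.
    """
    current = None
    kept = []
    for line in lines:
        line = line.strip()
        label = find_speaker_label(line)
        if label is not None:
            current = label
            line = line[len(label) + 1:].strip()  # drop "LABEL:"
        if current == speaker and line:
            kept.append(line)
    return kept
```

Since an open file can be looped over line by line, you could pass the file object itself as the lines argument. Try it on dem-small.txt or rep-small.txt first so the output stays readable.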
Project Specifications
In a file called pa10.py, you will need to write a group of helper functions called by a master function (named main() ) which will read through one of the debate files, create a dictionary of words spoken by a specific speaker, remove stop words, identify the 40 most frequently used words by that speaker, and use that information to call the helper functions provided in the file htmlFunctions.py to create the word cloud. Your code should include no fewer than 3 new helper functions (other than main()) to illustrate that you understand how to break large problems into small, manageable steps. Of course, you can create more than 3 new helper functions if you like.
The master function main() will take in two parameters:
- A string representing the name of the debate text file
- A string representing the speaker that you wish to analyze. This
string should be the exact name/label used in the debate transcript (without
the colon; see below).
For example, if I invoke:
main("repDebate2015Sept16.txt","TRUMP")
my code should produce a file called TRUMP.html which is the word cloud
of the words spoken by that candidate in the Republican debate. BTW, that file should look
like this one (TRUMP.html).
If I invoke:
main("demDebate2015Oct13.txt","CLINTON")
my code should produce a file called CLINTON.html which is the word cloud
of the words spoken by that candidate in the Democratic debate. BTW, that file should look
like this one (CLINTON.html).
CHECKING YOUR WORK. If you run the commands above on your finished code
and the html files look different from mine, then something isn't quite right
for one of us. It COULD be me who is wrong, but you should ask questions.
Helpful Hints:
- Create your design diagrams and documents first! Break down all the steps that need to be performed, and figure out which functions they go into. (Note: You do not have to turn in a design tree, but it will probably be useful to you to create one.)
- Parsing the debate file. You have to read in the file and separate the
lines according to who said them. Use the file format
to help you with this. Remember, once you see one of the speaker tags all lines/words belong to that speaker until you see
another speaker tag.
- You have to remove the stop words. You have two choices.
Either do it as you are reading the debate or go back and remove them once you
are done reading the entire debate. Each has advantages. Pick one
and go with it.
- Also remember to remove punctuation from words. Just because a word comes
at the end of a sentence and has a period at the end of it doesn’t make it a
different word. Also, you don't need to count the audience words inside of parentheses like
(APPLAUSE) and (LAUGHTER). You can tell audience words apart from regular speech because
they occur on separate lines and are surrounded by parentheses.
- Capital letters are a potential problem. If we aren't careful, the
word "Economic" at the start of a sentence will be counted separately from the
word "economic" inside of a sentence. For simplicity I will allow you to
treat all words as lower case letters. Having said that, be careful when you
decide to convert to lowercase letters, since speakers are labeled with all caps ("SANDERS:")
and you may want/need to use this information.
- Count the word frequency in the candidate’s words. Use a dictionary, where
the key is the word and the value is the count.
- Once you have the dictionary for the candidate in question, you need to
extract the 40 most frequently used non-stopwords and their counts. This
should be familiar. You will need to use lists of tuples/lists as we have done
before.
- We also need to extract the biggest count and the smallest count from the
words in this top-40, as that information is needed by makeHTMLword().
- Once you have the forty most frequent words (and their counts) we want to
alphabetize that list. That sounds like another set of tuples/lists in a list.
- Finally, use the code presented to you in htmlFunctions.py to generate the
html file with the appropriate name.
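The counting hints above can be sketched roughly as follows. Treat this as one possible breakdown, not the required design. The function names are made up, stop_words is assumed to be a set of lowercase words you have already read from stopSQL.txt, and lines is assumed to hold only the chosen speaker's text. Note that string.punctuation only covers plain ASCII punctuation, so curly quotes in a transcript would need extra handling.

```python
import string


def clean_word(raw):
    """Lowercase a word and strip punctuation from both ends."""
    return raw.strip(string.punctuation).lower()


def count_words(lines, stop_words):
    """Build a {word: count} dictionary, skipping stop words."""
    counts = {}
    for line in lines:
        line = line.strip()
        if line.startswith("(") and line.endswith(")"):
            continue  # audience reaction such as (APPLAUSE)
        for raw in line.split():
            word = clean_word(raw)
            if word and word not in stop_words:
                counts[word] = counts.get(word, 0) + 1
    return counts


def top_words(counts, how_many=40):
    """Return the how_many most frequent (word, count) tuples.

    Ties are broken alphabetically so the result is repeatable.
    """
    pairs = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return pairs[:how_many]


def cloud_inputs(counts, how_many=40):
    """Return (alphabetized tuples, highest count, lowest count)."""
    top = top_words(counts, how_many)
    if not top:
        return [], 0, 0
    high = top[0][1]   # list is sorted by count, biggest first
    low = top[-1][1]
    return sorted(top), high, low
```

The alphabetized list plus the high and low counts are what you would then feed, one word at a time, into makeHTMLword from htmlFunctions.py. How ties at the 40-word cutoff are broken is a choice this sketch makes on its own (alphabetical order), so your cloud may differ slightly from one that breaks ties another way.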
Deliverables
- by Saturday 12/12 at 11:45 pm
By the due date and time, submit:
- tag_cloud.py
- design documents and program printout in class or under office door or at the final exam period
- tag_cloud_helpers.py is completely optional, but is provided
in case you develop a 2nd file for some of your helper functions.
tag_cloud_helpers.py - Added at 4:14 PM on 12/07/Monday.
Use the on-line submission system.
Make sure that your program meets the course programming standards.
Additional optional information and hints: More Goals,
Directions, Hints, Explanations...
VIP: from_htmlFunctions_import.txt tip added on December 7th.
# Note how the htmlFunctions.py file has the 3 functions we need to utilize
# in our application, but that code does NOT clutter up our program file.
from htmlFunctions import makeHTMLbox
from htmlFunctions import printHTMLfile
from htmlFunctions import makeHTMLword