TITLE: When I Procrastinate, I Write Code AUTHOR: Eugene Wallingford DATE: November 20, 2014 3:23 PM DESC: ----- BODY: I procrastinated one day with my intro students in mind. This is the bedtime story I told them as a result. Yes, I know that I can write shorter Python code to do this. They are intro students, after all.
~~~~~
Once upon a time, a buddy of mine, Chad, sent out a tweet. Chad is a physics prof, and he was procrastinating. How many people would I need to have in class, he wondered, to have a 50-50 chance that my class roster will contain people whose last names start with every letter of the alphabet?
    Adams
    Brown
    Connor
    ...
    Young
    Zielinski
This is a lot like the old trivia about how we only need to have 23 people in the room to have a 50-50 chance that two people share a birthday. The math for calculating that is straightforward enough, once you know it. But last names are much more unevenly distributed across the alphabet than birthdays are across the days of the year. To do this right, we need to know rough percentages for each letter of the alphabet. I can procrastinate, too. So I surfed over to the US Census Bureau, rummaged around for a while, and finally found a page on Frequently Occurring Surnames from the Census 2000. It provides a little summary information and then links to a couple of data files, including a spreadsheet of data on all surnames that occurred at least 100 times in the 2000 census. This should, I figure, cover enough of the US population to give us a reasonable picture of how peoples' last names are distributed across the alphabet. So I grabbed it. (We live in a wonderful time. Between open government, open research, and open source projects, we have access to so much cool data!) The spreadsheet has columns with these headers:
    name,rank,count,prop100k,cum_prop100k,      \
                    pctwhite,pctblack,pctapi,   \
                    pctaian,pct2prace,pcthispanic
The first and third columns are what we want. After thirteen weeks, we know how to do compute the percentages we need: Use the running total pattern to count the number of people whose name starts with 'a', 'b', ..., 'z', as well as how many people there are altogether. Then loop through our collection of letter counts and compute the percentages. Now, how should we represent the data in our program? We need twenty-six counters for the letter counts, and one more for the overall total. We could make twenty-seven unique variables, but then our program would be so-o-o-o-o-o long, and tedious to write. We can do better. For the letter counts, we might use a list, where slot 0 holds a's count, slot 1 holds b's count, and so one, through slot 25, which holds z's count. But then we would have to translate letters into slots, and back, which would make our code harder to write. It would also make our data harder to inspect directly.
    ----  ----  ----  ...  ----  ----  ----    slots in the list

       0     1     2  ...    23    24    25    indices into the list
The downside of this approach is that lists are indexed by integer values, while we are working with letters. Python has another kind of data structure that solves just this problem, the dictionary. A dictionary maps keys onto values. The keys and values can be of just about any data type. What we want to do is map letters (characters) onto numbers of people (integers):
    ----  ----  ----  ...  ----  ----  ----    slots in the dictionary

     'a'   'b'   'c'  ...   'x'   'y'   'z'    indices into the dictionary
With this new tool in hand, we are ready to solve our problem. First, we build a dictionary of counters, initialized to 0.
    count_all_names = 0
    total_names = {}
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        total_names[letter] = 0
(Note two bits of syntax here. We use {} for dictionary literals, and we use the familiar [] for accessing entries in the dictionary.) Next, we loop through the file and update the running total for corresponding letter, as well as the counter of all names.
    source = open('app_c.csv', 'r')
    for entry in source:
        field  = entry.split(',')        # split the line
        name   = field[0].lower()        # pull out lowercase name
        letter = name[0]                 # grab its first character
        count  = int( field[2] )         # pull out number of people
        total_names[letter] += count     # update letter counter
        count_all_names     += count     # update global counter
    source.close()
Finally, we print the letter → count pairs.
    for (letter, count_for_letter) in total_names.items():
        print(letter, '->', count_for_letter/count_all_names)
(Note the items method for dictionaries. It returns a collection of key/value tuples. Recall that tuples are simply immutable lists.)

We have converted the data file into the percentages we need.

    q -> 0.002206197888442366
    c -> 0.07694634659082318
    h -> 0.0726864447688946
    ...
    f -> 0.03450702533438715
    x -> 0.0002412718532764804
    k -> 0.03294646311104032
(The entries are not printed in alphabetical order. Can you find out why?) I dumped the output to a text file and used Unix's built-in sort to create my final result. I tweet Chad, Here are your percentages. You do the math. Hey, I'm a programmer. When I procrastinate, I write code. -----