Session 26

How Many Names Start with Each Letter?


CS 1510
Introduction to Computing


Quick Follow-Up to the Lab



Counting Names by Their First Letter

Once upon a time, a buddy of mine, Chad, sent out a tweet. He was procrastinating. How many people would I need to have in class, he wondered, to have a 50-50 chance that my class roster will contain people whose last names start with every letter of the alphabet?

    Adams
    Brown
    Connor
    ...
    Young
    Zielinski

This is a lot like the old trivia about how we only need to have 23 people in the room to have a 50-50 chance that two people share a birthday. But last names are much more unevenly distributed across the alphabet than birthdays are across the days of the year. To do this right, we need to know rough percentages for each letter of the alphabet.
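The birthday figure is easy to check for ourselves. Here is a minimal sketch, assuming 365 equally likely birthdays and ignoring leap years; the function name is my own invention. It multiplies together the chances that each new person misses all of the birthdays already in the room, then subtracts from 1.

```python
def shared_birthday_probability(people, days=365):
    """Chance that at least two of `people` share a birthday,
    assuming every day is equally likely."""
    all_different = 1.0
    for k in range(people):
        # person k+1 must avoid the k birthdays already taken
        all_different *= (days - k) / days
    return 1 - all_different

for people in (22, 23):
    print(people, '->', shared_birthday_probability(people))
```

With 22 people the chance is still a little under one half; the 23rd person pushes it over. The uniform assumption is exactly what breaks down for last names, which is why we need real data.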

I can procrastinate, too. So I surfed over to the US Census Bureau, rummaged around for a while, and finally found a page on Frequently Occurring Surnames from the Census 2000. It provides a little summary information and links to a couple of data files, including a spreadsheet of data on all surnames that occurred at least 100 times in the 2000 census. This should have enough data to give us a reasonable picture of how people's last names are distributed across the alphabet. So I grabbed it.

(We live in a wonderful time. Between open government, open research, and open source projects, we have access to so many cool data files.)

The spreadsheet has columns with these headers:

    name,rank,count,prop100k,cum_prop100k,      \
                    pctwhite,pctblack,pctapi,   \
                    pctaian,pct2prace,pcthispanic

The first and third columns are what we need. After thirteen weeks, we know how to do this: Use the running total pattern to count up the number of people whose name starts with 'a', 'b', ..., 'z', as well as how many people there are altogether. Then loop through our list of letter counts and compute the percentages.

Now, how should we represent the data in our program? We need twenty-six counters for the letter counts, and one more for the overall total. We could make twenty-seven unique variables, but then our program would be so-o-o-o-o-o long and tedious to write. We can do better.

For the letter counts, we could use a list, where slot 0 holds a's count, slot 1 holds b's count, and so on, through slot 25, which holds z's count. But then we would have to translate letters into slots, and back, which would make our code harder to write. It would also make our data harder to inspect.

    ----  ----  ----  ...  ----  ----  ----    slots in the list

       0     1     2  ...    23    24    25    indices into the list
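To see what translating letters into slots, and back, would cost us, here is a small sketch of the list version. The helper names slot_for and letter_for and the sample names are made up for illustration. It leans on ord and chr, which convert between a character and its integer code.

```python
letter_counts = [0] * 26             # slot 0 is 'a', ..., slot 25 is 'z'

def slot_for(letter):
    """Translate a lowercase letter into its slot in the list."""
    return ord(letter) - ord('a')

def letter_for(slot):
    """Translate a slot number back into its lowercase letter."""
    return chr(slot + ord('a'))

# made-up sample, one person per name
for name in ['adams', 'brown', 'baker', 'zielinski']:
    letter_counts[slot_for(name[0])] += 1

print(letter_counts[slot_for('b')])  # count for 'b'
print(letter_for(25))                # the letter that lives in the last slot
```

It works, but every access goes through a helper, and printing letter_counts shows 26 bare numbers with no letters attached.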

The downside of this approach is that lists are 'indexed' by integer values, while we are working with letters. Python has another kind of data structure that solves just this problem, the dictionary. A dictionary maps keys onto values. The keys and values can be of just about any data type. We would like to map letters (characters) → numbers of people (integers):

    ----  ----  ----  ...  ----  ----  ----    slots in the dictionary

     'a'   'b'   'c'  ...   'x'   'y'   'z'    keys into the dictionary

First, we build a dictionary of counters, initialized to 0.

    count_all_names = 0
    total_names = {}
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        total_names[letter] = 0

(Note two bits of syntax: {} for dictionary literals, and the familiar [] for accessing entries in the dictionary.)
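For comparison, here is a sketch of a more compact way to build the same dictionary. It is an alternative, not what the session code uses: string.ascii_lowercase is the standard library's constant for the 26 lowercase letters, and the braces with a for inside are a dictionary comprehension.

```python
import string

# one entry per lowercase letter, every counter starting at 0
total_names = {letter: 0 for letter in string.ascii_lowercase}

total_names['w'] += 5                      # update one counter by key
print(total_names['w'])
print(len(total_names))                    # number of keys
print('w' in total_names, 'W' in total_names)
```

The last line shows that keys are case-sensitive, which is why the names need to be lowercased before we look up their first letter.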

Next, we loop through the file and update the running total for the corresponding letter, as well as the counter of all names.

    source = open('app_c.csv', 'r')
    for entry in source:
        field  = entry.split(',')        # split the line
        name   = field[0].lower()        # pull out lowercase name
        letter = name[0]                 # grab its first character
        count  = int( field[2] )         # pull out number of people
        total_names[letter] += count     # update letter counter
        count_all_names     += count     # update global counter
    source.close()
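If the file still carries its header row, the int conversion will choke on the word count. Here is a hedged sketch of the same loop using the standard csv module and skipping one header line. The three data rows are invented stand-ins so the example runs without the census file, and I am assuming the header is present.

```python
import csv
import io

# Made-up stand-in for the real file: same first three columns, invented counts.
sample = io.StringIO(
    "name,rank,count\n"
    "BROWN,1,300\n"
    "ADAMS,2,200\n"
    "BAKER,3,100\n"
)

total_names = {}
count_all_names = 0

reader = csv.reader(sample)
next(reader)                          # skip the header row (assumed present)
for field in reader:
    letter = field[0][0].lower()      # first character of the name
    count = int(field[2])             # number of people with that name
    # get returns 0 the first time we see a letter
    total_names[letter] = total_names.get(letter, 0) + count
    count_all_names += count

print(total_names)
print(count_all_names)
```

With the real file, we would replace the StringIO object with `with open('app_c.csv', newline='') as source:` and pass source to csv.reader; the with statement closes the file for us.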

Finally, we print the letter → count pairs.

    for (letter, count_for_letter) in total_names.items():
        print(letter, '->', count_for_letter/count_all_names)

(Note the items method for dictionaries. It returns a collection of key/value tuples. Recall that tuples are simply immutable lists.)

We have converted the data file into the per-letter proportions we need. (Multiply by 100 to read them as percentages.)

    q -> 0.002206197888442366
    c -> 0.07694634659082318
    h -> 0.0726864447688946
    ...
    f -> 0.03450702533438715
    x -> 0.0002412718532764804
    k -> 0.03294646311104032

(The entries are not printed in alphabetical order. Can you find out why?)
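If we want the output in alphabetical order no matter how the dictionary arranges its entries, one option is to loop over the sorted keys. A minimal sketch, using made-up counts rather than the census totals:

```python
total_names = {'q': 2, 'c': 77, 'h': 73, 'a': 48}    # made-up counts
count_all_names = sum(total_names.values())

# sorted() on a dictionary gives its keys in order
for letter in sorted(total_names):
    share = total_names[letter] / count_all_names
    print(letter, '->', format(share, '.2%'))         # show as a percentage
```

The '.2%' format multiplies by 100 and keeps two decimal places, which is easier on the eyes than sixteen digits.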

I tweet Chad: "Here are your percentages. You do the math." Hey, I'm a programmer. When I procrastinate, I write code.
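Of course, a programmer can do the math by writing more code. Here is a sketch of one approach, a simulation: build many random rosters and count how often every letter shows up. This is my own illustration, not Chad's solution. The function name is invented, and the three-letter alphabet and its shares are hypothetical, not the census figures.

```python
import random

def chance_of_full_coverage(shares, class_size, trials=2000, rng=None):
    """Estimate the chance that a random roster of `class_size` people
    contains every letter in `shares` (a letter -> proportion dictionary)."""
    if rng is None:
        rng = random.Random()
    letters = list(shares)
    weights = [shares[letter] for letter in letters]
    successes = 0
    for _ in range(trials):
        # draw one first letter per student, weighted by each letter's share
        roster = rng.choices(letters, weights=weights, k=class_size)
        if len(set(roster)) == len(letters):
            successes += 1
    return successes / trials

# Hypothetical three-letter alphabet, NOT the census figures.
toy_shares = {'a': 0.6, 'b': 0.3, 'c': 0.1}
rng = random.Random(26)
for class_size in (3, 10, 30):
    print(class_size, '->', chance_of_full_coverage(toy_shares, class_size, rng=rng))
```

To answer Chad's question, we would pass in the 26 proportions computed from the census file and look for the smallest class size whose estimate crosses 0.5. The result is only an estimate; more trials give a steadier number.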



Exam 2

The rest of the period is devoted to Exam 2.



Wrap Up



Eugene Wallingford ..... wallingf@cs.uni.edu ..... November 20, 2014