Lab Exercise 14 - Monday December 7th

Analyzing Customer Data with Dictionaries
Due by 11:45 pm Tuesday December 8th


CS 1510
Introduction to Computing


Introduction

This week, we use simple dictionaries and lists to analyze a spreadsheet of simulated customer data. In the process, you will write a tool or two that you can use to work with spreadsheets of your own.

Launch IDLE. Create a new program file named lab14.py in which to do all your work.

This week, you will not submit your shell window at the end of the session. You will submit a responses.txt file this week. Download this template file and use it to record any answers or predictions asked for in the exercises.



Customer Data

Increasingly, companies have access to data about products, customers, and transactions in the form of spreadsheets and other simple text formats. With a little Python, we can ask and answer a remarkable number of useful questions about this data. Lists and dictionaries help us organize the data in the file in ways that make answering those questions easier.

For this lab, you will work with simulated customer data. It resembles the sort of transaction data one might collect in a spreadsheet. The file is large enough -- 30,000 customer records -- that your analysis can identify patterns of the sort that corporate data analysis finds. Each line in the file consists of a single customer and his or her data in the now-familiar CSV format:

    record number,gender,name,address,email,phone,birthday,credit card,...

There is a single header line at the top of the file with column labels. Every other line is customer data.

Download simulated-customer-data.csv, our data file. This file is a large enough that copy-and-paste may not work as well as usual. You can right click on the link and choose Save As.



Task 1: Find Distribution of Customers By State

In order to understand how to target our products, we might like to know how many customers come from each state. Let's start with a function written specifically for our data file.

Step 1.
Write a function named find_state_distribution() that takes no arguments and returns a list of (state, count) tuples sorted by state. To do this, the function should:

A dictionary with states as keys and counts as values is a natural tool to use for this task.

Which column is the state in?


Step 2.
Write a quick function called display_tuple_list() that prints the items in a list of tuples one per line. Pass the result of running find_state_distribution() to your display function. Output will be sorted by state abbreviation. You should see output something like this:
       >>> display_tuple_list( find_state_distribution() )
       AK →  120
       AL →  462
       ...
       WV →  177
       WY →   75

Copy the results of your run into your responses file.



Task 2: Find Distribution of Any Column in Any File

Our function from Task 1 is designed to work on a specific column of a specific file. The same code could work for any column in the file, or even for another file, if we turn constants into function parameters.

Step 1.
Write a function named find_column_distribution(filename, column_number) that returns a list of (column_value, count) tuples sorted by the column values. filename is a string, the name of the CSV file, and column_number is an integer, the number of the column to process.

This function should do just what your previous function do, but with the filename given by the user and on the column number given by the user. If the user asks for a column that does not exist, find_column_distribution() should print a suitable error message and return an empty list.

       >>> filename = 'simulated-customer-data.csv'
       >>> find_column_distribution(filename, 20)
       There is no column 20 in the file.


Step 2.
Call find_column_distribution() with several column numbers, both legal and illegal, and display the results. You should see output something like this:
       >>> display_tuple_list( find_column_distribution(filename, 1) )
       female → 15167
       male   → 14833

       >>> display_tuple_list( find_column_distribution(filename, 14) )
       MasterCard → 14868
       Visa       → 15132

Copy the results of several runs of this code into your responses file.



Task 3: Find Distribution of Birth Years

Our function from Task 2 looks like a "universal solution", but it isn't as helpful as we might hope. Sometimes, interesting data is nested inside other data.

For example, suppose we would like to sort our customers by age. We can compute a rough distribution by grouping customers by the year they were born. Using find_column_distribution() on Column 13 (the birthdate) directly allows us to group and sort only by a month-day-year string.

So let's write another function written specifically for our data file.

Step 1.
Write a function named find_birthyear_distribution() that returns a list of (column_value, count) tuples sorted by year.

This function should do just what your previous two functions do, with one change: When you pull the birthdate value from its columnn, split it on the slashes and use only the year portion as a key in your distribution.


Step 2.
Call find_birthyear_distribution() and display its results. You should see output something like this:
       >>> display_tuple_list( find_birthyear_distribution() )
       1940 → 651
       1941 → 677
       ...
       1984 → 660
       1985 → 622

Copy the results of your run into your responses file.




Finishing Up

Make sure that your program file is complete and saved. Save your responses.txt file.

Submit your files for grading on the electronic submission system, at lab14 -- Analyzing Customer Data with Dictionaries.

As always, make sure you see the verification screen that says The files listed above were uploaded.

If you need any help, let me know.



Due by 11:45 pm Tuesday December 8th