## Lab Exercise 15

### Introduction

This week, we use simple dictionaries and lists to analyze a spreadsheet of simulated customer data. In the process, you will write a tool or two that you can use to work with spreadsheets of your own.

Create a directory on your USB device for this lab, say, lab15, and launch IDLE. Create a new program file named lab15.py in which to do all your work.

This week, you will not submit your shell window at the end of the session. You will submit a responses.txt file this week. Download this template file and use it to record any answers or predictions asked for in the exercises.

### Customer Data

Increasingly, companies have access to data about products, customers, and transactions in the form of spreadsheets and other simple text formats. With a little Python, we can ask and answer a remarkable number of useful questions about this data. Lists and dictionaries help us organize the data in the file in ways that make answering those questions easier.

For this lab, you will work with simulated customer data. It resembles the sort of transaction data one might collect in a spreadsheet. The file is large enough -- 30,000 customer records -- that your analysis can identify patterns of the sort that corporate data analysis finds. Each line in the file consists of a single customer and his or her data in the now-familiar CSV format:

```    record number,gender,name,address,email,phone,birthday,credit card,...
```

There is a single header line at the top of the file with column labels. Every other line is customer data.

Download simulated-customer-data.csv, our data file. This file is a large enough that copy-and-paste may not work as well as usual. You can right click on the link and choose Save As.

### Task 1: Find Distribution of Customers By State

In order to understand how to target our products, we might like to know how many customers come from each state. Let's start with a function written specifically for our data file.

Step 1.
Write a function named find_state_distribution() that takes no arguments and returns a list of (state, count) tuples sorted by state. To do this, the function should:
• open and read the data file line-by-line,
• for each customer line, split the data at commas,
• pull out the state for the customer, and
• updates a counter for that state.

A dictionary with states as keys and counts as values is a natural tool to use for this task.

Which column is the state in?

Step 2.
Write a quick function called display_tuple_list() that prints the items in a list of tuples one per line. Pass the result of running find_state_distribution() to your display function. You should see output something like this:
```       >>> display_tuple_list( find_state_distribution() )
AK →  120
AL →  462
...
WV →  177
WY →   75
```

### Task 2: Find Distribution of Any Column in Any File

Our function from Task 1 is designed to work on a specific column of a specific file. The same code could work for any column in the file, or even for another file, if we turn constants into function parameters.

Step 1.
Write a function named find_column_distribution(filename, column_number) that returns a list of (column_value, count) tuples sorted by the column values. filename is a string, the name of the CSV file, and column_number is an integer, the number of the column to process.

This function should do just what your previous function do, but with the filename given by the user and on the column number given by the user. If the user asks for a column that does not exist, find_column_distribution() should print a suitable error message and return an empty list.

```       >>> filename = 'simulated-customer-data.csv'
>>> find_column_distribution(filename, 20)
There is no column 20 in the file.
```

Step 2.
Call find_column_distribution() with several column numbers, both legal and illegal, and display the results. You should see output something like this:
```       >>> display_tuple_list( find_column_distribution(filename, 1) )
female → 15167
male   → 14833

>>> display_tuple_list( find_column_distribution(filename, 14) )
MasterCard → 14868
Visa       → 15132
```

Copy the results of several runs of this code into your responses file.

### Task 3: Find Distribution of Birth Years

Our function from Task 2 looks like a "universal solution", but it isn't as helpful as we might hope. Sometimes, interesting data is nested inside other data.

For example, suppose we would like to sort our customers by age. We can compute a rough distribution by grouping customers by the year they were born. Using find_column_distribution() on Column 13 (the birthdate) directly allows us to group and sort only by a month-day-year string.

So let's write another function written specifically for our data file.

Step 1.
Write a function named find_birthyear_distribution() that returns a list of (column_value, count) tuples sorted by year.

This function should do just what your previous two functions do, with one change: When you pull the birthdate value from its columnn, split it on the slashes and use only the year portion as a key in your distribution.

Step 2.
Call find_birthyear_distribution() and display its results. You should see output something like this:
```       >>> display_tuple_list( find_birthyear_distribution() )
1940 → 651
1941 → 677
...
1984 → 660
1985 → 622
```

### Task 4: Compute Column Distributions by Column Name

I don't like to count columns. I can see the column header...

Step 1.
Revise your find_column_distribution() function so that the second argument be
• either an integer, the number of the column to process,
• or a string, the name of the column to process.

This means the function should continue to work exactly as it did before if the user passes a number as the second argument. If the user passes a string as the second argument, the function should find the position of that name in the header line and use that column number.

If there is no column with that name, the function should print a suitable error message and return an empty list.

Review your textbook (or a piece of code I gave you yesterday...) to find out how a program can determine the type of a particular value.

Step 2.
Call your new find_column_distribution() function with several different second arguments, both column names and column numbers, and display the results. You should see output something like this:
```       >>> display_tuple_list( find_column_distribution(filename, 'State') )
AK →  120
AL →  462
...
WV →  177
WY →   75

>>> display_tuple_list( find_column_distribution(filename, 'CC') )
There is no column 'CC' in the file.

>>> display_tuple_list( find_column_distribution(filename, 5.2) )
There is no column '5.2' in the file.
```

Copy the results of several runs of this code into your responses file.

### Finishing Up

Make sure that your program file is complete and saved. Save your responses.txt file.

Submit your files for grading on the electronic submission system, at lab15 -- Analyzing Customer Data with Dictionaries.

• lab15.py
• responses.txt

As always, make sure you see the verification screen that says The files listed above were uploaded.

If you need any help, let me know.

Eugene Wallingford ..... wallingf@cs.uni.edu ..... December 10, 2014