This week, we use simple dictionaries and lists to analyze a spreadsheet of simulated customer data. In the process, you will write a tool or two that you can use to work with spreadsheets of your own.
Launch IDLE. Create a new program file named lab14.py in which to do all your work.
This week, you will not submit your shell window at the end of the session. You will submit a responses.txt file this week. Download this template file and use it to record any answers or predictions asked for in the exercises.
Increasingly, companies have access to data about products, customers, and transactions in the form of spreadsheets and other simple text formats. With a little Python, we can ask and answer a remarkable number of useful questions about this data. Lists and dictionaries help us organize the data in the file in ways that make answering those questions easier.
For this lab, you will work with simulated customer data. It resembles the sort of transaction data one might collect in a spreadsheet. The file is large enough -- 30,000 customer records -- that your analysis can identify patterns of the sort that corporate data analysis finds. Each line in the file consists of a single customer and his or her data in the now-familiar CSV format:
record number,gender,name,address,email,phone,birthday,credit card,...
There is a single header line at the top of the file with column labels. Every other line is customer data.
Download simulated-customer-data.csv, our data file. This file is a large enough that copy-and-paste may not work as well as usual. You can right click on the link and choose Save As.
In order to understand how to target our products, we might like to know how many customers come from each state. Let's start with a function written specifically for our data file.
A dictionary with states as keys and counts as values is a natural tool to use for this task.
Which column is the state in?
>>> display_tuple_list( find_state_distribution() ) AK → 120 AL → 462 ... WV → 177 WY → 75
Copy the results of your run into your responses file.
Our function from Task 1 is designed to work on a specific column of a specific file. The same code could work for any column in the file, or even for another file, if we turn constants into function parameters.
This function should do just what your previous function do, but with the filename given by the user and on the column number given by the user. If the user asks for a column that does not exist, find_column_distribution() should print a suitable error message and return an empty list.
>>> filename = 'simulated-customer-data.csv' >>> find_column_distribution(filename, 20) There is no column 20 in the file.
>>> display_tuple_list( find_column_distribution(filename, 1) ) female → 15167 male → 14833 >>> display_tuple_list( find_column_distribution(filename, 14) ) MasterCard → 14868 Visa → 15132
Copy the results of several runs of this code into your responses file.
Our function from Task 2 looks like a "universal solution", but it isn't as helpful as we might hope. Sometimes, interesting data is nested inside other data.
For example, suppose we would like to sort our customers by age. We can compute a rough distribution by grouping customers by the year they were born. Using find_column_distribution() on Column 13 (the birthdate) directly allows us to group and sort only by a month-day-year string.
So let's write another function written specifically for our data file.
This function should do just what your previous two functions do, with one change: When you pull the birthdate value from its columnn, split it on the slashes and use only the year portion as a key in your distribution.
>>> display_tuple_list( find_birthyear_distribution() ) 1940 → 651 1941 → 677 ... 1984 → 660 1985 → 622
Copy the results of your run into your responses file.
Make sure that your program file is complete and saved. Save your responses.txt file.
Submit your files for grading on the electronic submission system, at lab14 -- Analyzing Customer Data with Dictionaries.
As always, make sure you see the verification screen that says The files listed above were uploaded.
If you need any help, let me know.