TITLE: Cleaning Data Off My Desk
AUTHOR: Eugene Wallingford
DATE: July 06, 2009 3:26 PM
DESC:
-----
BODY:
As I mentioned last time, this week I am getting back to some regular work after mostly wrapping up a big project, including cleaning off my desk. It is cluttered with a lot of loose paper that the Digital Age had promised to eliminate. Some is my own fault, paper copies of notes and agendas I should probably find a way not to print. Old habits die hard.

But I also have a lot of paper sent to me as department head. Print-outs; old-style print-outs from a mainframe. The only thing missing from a 1980s flashback is the green bar paper.

Some of these print-outs are actually quite interesting. One set is of grade distribution reports produced by the registrar's office, which show how many students earned As, Bs, and so on in each course we offered this spring and for each instructor who taught a course in our department. This sort of data can be used to understand enrollment figures and maybe even performance in later courses. Some upper administrators have suggested using this data in anonymous form as a subtle form of peer pressure, so that profs who are outliers within a course might self-correct their own distributions. I'm not ready to think about going there yet, but the raw data seems useful, and interesting in its own right.

I might want to do more with the data. This is the first time I recall receiving this, but in the fall it would be interesting to cross-reference the grade distributions by course and instructor. Do the students who start intro CS in the fall tend to earn different grades than those who start in the spring? Are there trends we can see over falls, springs, or whole years? My colleagues and I have sometimes wondered aloud about such things, but having a concrete example of the data in hand has opened new possibilities in my mind. (A typical user am I...)
As a programmer, I have the ability to do such analyses with relatively straightforward scripts, but I can't. The data is closed. I don't receive actual data from the registrar's office; I receive a print-out of one view of the data, determined by people in that office. Sadly, this data is mostly closed even to them, because they are working with an ancient mainframe database system for which there is no support and a diminishing amount of corporate memory here on campus.

The university is in the process of implementing a new student information system, which should help solve some of these problems. I don't imagine that people across campus will have much access to this data, though. That's not the usual M.O. for universities.

Course enrollment and grade data aren't the only data we could benefit from opening up a bit. As a part of the big project I just wrapped up, the task force I was on collected a massive amount of data about expenditures on campus. This data is accessible to many administrators on campus, but only through a web interface that constrains interaction pretty tightly. Now that we have collected the data, processed almost all of it by hand (the roughness of the data made automated processing an unattractive alternative), and tabulated it for analysis, we are starting to receive requests for our spreadsheets from others on campus. These folks all have access to the data, just not in the cleaned-up, organized format into which we massaged it. I expressed frustration with our financial system in a mini-rant a few years ago, and other users feel similar limitations.

For me, having enrollment and grade data would be so cool. We could convert data into information that we could then use to inform scheduling, teaching assignments, and the like. Universities are inherently information-based institutions, but we don't always put our own understanding of the world into practice very well. Constrained resources and intellectual inertia slow us down or stop us altogether.

Hence my wistful hope while reading Tim Bray's "Hello-World" for Open Data. Vancouver has a great idea:

Publish the data in a usable form.

License it in a way that turns people loose to do whatever they want, but doesn't create unreasonable liability risk for the city.

See what happens. ...

Would anyone on campus take advantage? Maybe, maybe not. I can imagine some interesting mash-ups using only university data, let alone linking to external data. But this isn't likely to happen. GPA data and instructor data are closely guarded by departments and instructors, and throwing light on them would upset enough people that any benefits would probably be shouted down. But perhaps some subset of the data the university maintains, suitably anonymized, could be opened up. If nothing else, transparency sometimes helps to promote trust.

I should probably do this myself, at the department level, with data related to schedule, budget, and so on. I occasionally share the spreadsheets I build with the faculty, so they can see the information I use to make decisions. This spring, we even discussed opening up the historic formula used in the department to allocate our version of merit pay. (What a system that is -- so complicated that I've feared making more than small editorial changes to it in my time as head. I keep hoping to find the time and energy to build something meaningful from scratch, but that never happens. And it turns out that most faculty are happy with what we have now, perhaps for "the devil you know" reasons.)

I doubt even the CS faculty in my department would care to have open data of this form. We are a small crew, and they are busy with the business of teaching and research. It is my job to serve them by taking as much of this thinking out of their way as I can. Then again, who knows for sure until we try?

If the cost of sharing can be made low enough, I'll have no reason not to share. But whether anyone uses the data might not even be the real point. Habits change when we change them, when we take the time to create new ones to replace the old ones. This would be a good habit for me to have.
-----