TITLE: Using Language to Understand a Data Set AUTHOR: Eugene Wallingford DATE: May 10, 2013 4:03 PM DESC: ----- BODY: Today was our twice-annual undergraduate research presentation day. Every B.S. student must do an undergraduate research project and present the results publicly. For the last few years, we have pooled the presentations on the morning of the Friday in finals week, after all the exams are given and everyone has a chunk of time free to present. It also means that more students and professors can attend, which makes for more a more engaging audience and a nice end to everyone's semester. I worked with one undergraduate research student this spring. As I mentioned while considering the role of parsing in a compilers course, this student was looking for patterns in several years of professional basketball play-by-play data. His ultimate goal was to explore ways of measuring the impact of individual defensive performance in the NBA -- fairly typical MoneyBall stuff applied to an skill that is not well measured or understood. This project fell into my hands serendipitously. The student had approached a couple of other professors, who upon hearing the word "basketball" immediately pointed him to me. Of course, the project is really a data analytics project that just happens to involve a dataset from basketball, but... Fortunately, I am interested in both the approach and the domain! As research sometimes does, this problem led the student to a new problem first. In order to analyze data in the way he wanted, he needed data of a particular sort. There is plenty of play-by-play data available publicly on the web, but it's mostly prepared for presentation in HTML. So he first had to collect the data by scraping the web, and then organize it into a data format amenable to analysis. This student had taken my compiler course the last time around, and his ability to parse several files of similar but just-different-enough data proved to be invaluable. As presented on sites like nba.com, the data is no where near ready to be studied. As the semester wore on, he and I came to realize that his project this semester wouldn't be the data analysis he originally intended to do. It was a substantial project simply to make sense of the data he had found. As he presented his work today, I realized something further. He was using language to understand a data set. He started by defining a grammar to model the data he found, so that he could parse it into a database. This involved recognizing categories of expression that were on the surface of the data, such as made and missed field goals, timeouts, and turnovers. When he ran this first version of his parser, he found unhandled entries and extended his grammar. Then he looked at the semantics of the data and noticed discrepancies deeper in the data. The number of possessions his program observed in a game differed from the expected values, sometimes wildly and with no apparent pattern. As we looked deeper, we realized that the surface syntax of the data often obscured some events that would extend or terminate a possession. A simple example is a missed FT, which sometimes ends a possession and sometimes not. It depends in part on the next event in the timeline. To handle these case, the student created new syntactic categories that enabled his parser to resolve such issues by recognized composite events in the data. As he did this, his grammar grew, and his parser became better at building a more accurate semantic model of the game. This turned out to be a semester-long project in its own right. He's still not done and intends to continue with this research after graduation. We were both a bit surprised at how much effort it took to corral the data, but in retrospect we should not have been too surprised. Data are collected and presented with many different purposes in mind. Having an accurate deep model of the underlying the phenomenon in question isn't always one of them. I hope the student was pleased with his work and progress this semester. I was. In addition to its practical value toward solving a problem of mutual interest, it reminded me yet again of the value of language in understanding the world around us, and the remarkable value that the computational ideas we study in computer science have to offer. For some reason, it also reminded me, pleasantly, of the Racket Way. As I noted in that blog entry, this is really the essence of computer science. Of course, if some NBA team were to give my student the data he needs in suitable form, he could dive into the open question of how better to measure individual defensive performance in basketball. He has some good ideas, and the CS and math skills needed to try them out. Some NBA team should snatch this guy up. -----