TITLE: Using Language to Understand a Data Set
AUTHOR: Eugene Wallingford
DATE: May 10, 2013 4:03 PM
Today was our twice-annual undergraduate research presentation
day. Every B.S. student must do an undergraduate research
project and present the results publicly. For the last few
years, we have pooled the presentations on the morning of the
Friday in finals week, after all the exams are given and
everyone has a chunk of time free to present. It also means
that more students and professors can attend, which makes for
a more engaging audience and a nice end to everyone's
I worked with one undergraduate research student this spring.
As I mentioned while considering
the role of parsing in a compilers course,
this student was looking for patterns in several years of
professional basketball play-by-play data. His ultimate goal
was to explore ways of measuring the impact of individual
defensive performance in the NBA -- fairly typical MoneyBall
stuff applied to a skill that is not well measured or
This project fell into my hands serendipitously. The student
had approached a couple of other professors, who upon hearing
the word "basketball" immediately pointed him to me. Of
course, the project is really a data analytics project that
just happens to involve a dataset from basketball, but...
Fortunately, I am interested in both the approach and the
As research sometimes does, this problem led the student to a
new problem first. In order to analyze data in the way he
wanted, he needed data of a particular sort. There is plenty
of play-by-play data available publicly on the web, but it's
mostly prepared for presentation in HTML. So he first had to
collect the data by scraping the web, and then organize it
into a data format amenable to analysis.
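
To give a feel for that first step, here is a minimal sketch of pulling
rows out of an HTML table with Python's standard html.parser module. The
markup in SAMPLE and the names RowExtractor and extract_rows are my own
assumptions for illustration; they are not the student's code, and real
pages are much messier than this.

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collect the text of each <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return the table rows found in an HTML string."""
    parser = RowExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup; actual play-by-play pages differ.
SAMPLE = """
<table>
  <tr><td>11:42</td><td>Smith misses 2-pt shot</td></tr>
  <tr><td>11:40</td><td>Jones defensive rebound</td></tr>
</table>
"""

rows = extract_rows(SAMPLE)
```

Each row comes back as a plain list of strings, which is a reasonable
starting point for the parsing described next.
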
This student had taken my compiler course the last time around,
and his ability to parse several files of similar but
just-different-enough data proved to be invaluable. As
presented on sites like nba.com, the data is nowhere near
ready to be studied.
As the semester wore on, he and I came to realize that his
project this semester wouldn't be the data analysis he
originally intended to do. It was a substantial project
simply to make sense of the data he had found.
As he presented his work today, I realized something further.
He was using language to understand a data set.
He started by defining a grammar to model the data he found,
so that he could parse it into a database. This involved
recognizing categories of expression that were on the surface
of the data, such as made and missed field goals, timeouts,
and turnovers. When he ran this first version of his parser,
he found unhandled entries and extended his grammar.
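
The loop of "run the parser, look at what it could not handle, extend the
grammar" can be sketched roughly as follows. This is a simplification
under stated assumptions: it uses a table of regular expressions in place
of a real grammar, the line formats are invented, and the names RULES and
parse_entries are hypothetical. Only the categories themselves (made and
missed field goals, timeouts, turnovers) come from the project.

```python
import re

# Hypothetical surface patterns; real play-by-play text varies by source.
RULES = [
    ("made_fg", re.compile(r"^(?P<player>\w+) makes (?P<pts>[23])-pt shot$")),
    ("missed_fg", re.compile(r"^(?P<player>\w+) misses (?P<pts>[23])-pt shot$")),
    ("timeout", re.compile(r"^(?P<team>\w+) timeout$")),
    ("turnover", re.compile(r"^(?P<player>\w+) turnover$")),
]


def parse_entries(lines):
    """Classify each line; return (events, unhandled)."""
    events, unhandled = [], []
    for line in lines:
        text = line.strip()
        for category, pattern in RULES:
            match = pattern.match(text)
            if match:
                events.append({"type": category, **match.groupdict()})
                break
        else:
            # No rule matched: keep the entry so the rules can be extended.
            unhandled.append(text)
    return events, unhandled


events, unhandled = parse_entries([
    "Smith makes 2-pt shot",
    "Jones turnover",
    "Lee offensive rebound",
])
```

Here the rebound line lands in unhandled, which is the cue to add a new
category and run the parser again.
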
Then he looked at the semantics of the data and noticed
discrepancies deeper in the data. The number of possessions
his program observed in a game differed from the expected
values, sometimes wildly and with no apparent pattern.
As we looked deeper, we realized that the surface syntax of the
data often obscured some events that would extend or terminate
a possession. A simple example is a missed free throw, which sometimes
ends a possession and sometimes does not. It depends in part on
the next event in the timeline.
To handle these cases, the student created new syntactic
categories that enabled his parser to resolve such issues by
recognizing composite events in the data. As he did this, his
grammar grew, and his parser became better at building a more
accurate semantic model of the game.
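
A composite event of this kind might look something like the sketch
below, which decides whether an event ends a possession by peeking at the
event that follows it. The event type names, the last_of_trip flag, and
both function names are hypothetical, and the rules are deliberately
simplified (they ignore fouls on made shots and ends of periods, for
instance). It illustrates the idea, not the student's grammar.

```python
# Event type names here are hypothetical labels, chosen for illustration.
SIMPLE_ENDINGS = {"made_fg", "turnover"}
MISSES = {"missed_fg", "missed_ft"}


def ends_possession(event, next_event):
    """Decide whether `event` ends a possession, peeking at the next event."""
    kind = event["type"]
    if kind in SIMPLE_ENDINGS:
        return True
    if kind in MISSES:
        # A miss on any free throw but the last of a trip keeps the ball.
        if kind == "missed_ft" and not event.get("last_of_trip", True):
            return False
        # Composite event: a miss followed by a defensive rebound.
        return next_event is not None and next_event["type"] == "def_rebound"
    return False


def count_possession_ends(events):
    """Count possession-ending events in a timeline (a list of dicts)."""
    following = events[1:] + [None]
    return sum(ends_possession(ev, nxt) for ev, nxt in zip(events, following))


timeline = [
    {"type": "missed_ft", "last_of_trip": False},
    {"type": "missed_ft", "last_of_trip": True},
    {"type": "off_rebound"},
    {"type": "missed_fg"},
    {"type": "def_rebound"},
    {"type": "turnover"},
]
total = count_possession_ends(timeline)
```

In this timeline neither missed free throw ends the possession, because
the offense recovers the second one. Only the missed shot that the
defense rebounds and the later turnover do.
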
This turned out to be a semester-long project in its own right.
He's still not done and intends to continue with this research
after graduation. We were both a bit surprised at how much
effort it took to corral the data, but in retrospect we should
not have been too surprised. Data are collected and presented
with many different purposes in mind. Having an accurate deep
model of the underlying phenomenon in question isn't always
one of them.
I hope the student was pleased with his work and progress this
semester. I was. In addition to its practical value toward
solving a problem of mutual interest, it reminded me yet again
of the value of language in understanding the world around us,
and the remarkable value that the computational ideas we study
in computer science have to offer. For some reason, it also
reminded me, pleasantly, of
the Racket Way.
As I noted in that blog entry, this is really the essence of
Of course, if some NBA team were to give my student the data
he needs in suitable form, he could dive into the open question
of how better to measure individual defensive performance in
basketball. He has some good ideas, and the CS and math skills
needed to try them out.
Some NBA team should snatch this guy up.