TITLE: Studying Program Repositories AUTHOR: Eugene Wallingford DATE: October 06, 2010 11:24 AM DESC: ----- BODY: Last week, Garrett Morris presented an experience report at the 2010 Haskell Symposium entitled Using Hackage to Inform Language Design, which is also available at Morris's website. Hackage is an online repository for the Haskell programming community. Morris describes how he and his team studied programs in the Hackage repository to see how Haskell programmers work with a type-system concept known as overlapping instances. The details of this idea aren't essential to this entry, but if you'd like to learn more, check out Section 6.2.3 in this user guide.) In Morris's search, he sought answers to three questions:
Morris and his team used what they learned from this study to design the class system for their new language. They found that their language did not need to provide the full power of overlapping instances in order to support what programmers were really doing. Instead, they developed a new idea they call "instance chains" that was sufficient to support a large majority of the uses of overlapping instances. They are confident that their design can satisfy programmers because the design decision reflects actual uses of the concept in an extensive corpus of programs. I love this kind of empirical research. One of the greatest assets the web gives us is access to large corpora of code: SourceForge and GitHub are examples large public repositories, and there are an amazing number of smaller domain-specific repositories such as Hackage and RubyGems. Why design languages or programming tools blindly when we can see how real programmers work through the code they write? The more general notion of designing languages via observing behavior, forming theories, and trying them out is not new. In particular, I recommend the classic 1981 paper Design Principles Behind Smalltalk. Dan Ingalls describes the method by which the Smalltalk team grew its language and system as explicitly paralleling the scientific method. Morris's paper talks about a similar method, only with the observation phase grounded in an online corpus of code. Not everyone in computer science -- or outside CS -- thinks of this method as science. Just this weekend Dirk Riehle blogged about the need to broaden the idea of science in CS. In particular, he encourages us to include exploration as a part of our scientific method, as it provides a valuable way for us to form the theories that we will test using the sort of experiments that everyone recognizes as science.
- What proportion of the total code on Hackage uses overlapping instances?
- In code that uses overlapping instances, how many instances overlap each other?
- Are there common patterns among the uses of overlapping instances?