TITLE: Parsing Expression Grammars in the Compiler Course AUTHOR: Eugene Wallingford DATE: May 22, 2009 4:05 PM DESC: ----- BODY: Yesterday, a student told me about the Ruby gem Treetop, a DSL for writing language grammars. This language uses parsing expression grammar, which turns our usual idea of grammar inside out. Most compiler theory is built atop the context-free and regular grammars of Chomsky. These grammars are generative: they describe rules that allow us to create strings which are part of the language. Parsing expression grammars describe rules that allow us to recognize strings which are part of the language. This new kind of grammar offers a lot of advantages for working with programming languages, such as unifying lexical and syntactic descriptions and supporting the construction of linear-time parsers. I remember seeing Bryan Ford talk about packrat parsing at ICFP 2002, but at that point I wasn't thinking as much about language grammars and so didn't pay close attention the type of grammar that underlay his parsing ideas. While generative grammars are a fundamental part of computing theory, they don't map directly onto the primary task for which many software people use them: building scanners and parsers for programming languages. Our programs recognize strings, not generate them. So we have developed mechanisms for building and even generating scanners and parsers, given grammars that we have written under specific constraints and then massaged to fit our programming mechanisms. Sometimes the modified grammars aren't as straightforward as we might like. This can be a problem for anyone who comes to the grammar later, as well as a problem for the creators of the grammar when they want to change it in response to changes requested by users. A recognition-based grammar matches our goals as compiler writers more closely, which could be a nice advantage. Parsing expression grammars make explicit the specification of the code we write against them. For those of us who teach compiler courses, something like a parsing expression grammar raises another question. Oftentimes, we hope that the compiler course can do double duty: teach students how to build a compiler, and help them to understand the theory, history, and big issues of language processors. I think of this as a battle between two forces, "learning to" versus "learning about", a manifestation of epistemology's distinction between "knowing that" and "knowing how". Using recognition-based grammars as the foundation for a compiler course introduces a trade-off: students may be empowered more quickly to create language grammars and parsers but perhaps not learn as much about the standard terminology and techniques of the discipline. These standard ways are, of course, our historical ways of doing things. There is much value in learning history, but at what point do we take the step forward to techniques that are more practical than reminiscent? This is a choice that we have to make all the time in a compiler course: top-down versus bottom-up parsing, table-driven parsers versus recursive-descent parsers, writing parsers by hand versus using parser generators... As I've discussed here before, I still ask students to write their parser by hand because I think the experience of writing this code teaches them more than just about compilers. Now that I have been re-introduced to this notion of recognition-based grammars, I'm wondering whether they might help me to balance some of the forces at play more satisfactorily. Students would have the experience of writing a non-trivial parser by hand, but against a grammar that is more transparent and easier to work with. I will play with parsing expression grammars a bit in the next year or so and consider making a change the next time I teach the course. (If you have taught a compiler course using this approach, or know someone who has, please let me know.) Going this way would not commit me to having students write their parsers by hand. The link that started this thread of thought points to a tool for automating the manipulation of parsing expression grammars. Whatever I do, I'll add that tool to the list of tools I share with students. Oh, and a little Ruby Love to close. Take a look at TreeTop. Its syntax is beautiful. A Treetop grammar reads cleanly, crisply -- and is executable Ruby code. This is the sort of beauty that Ruby allows, even encourages, and is one of the reasons I remain enamored of the language. -----