TITLE: Parsing Expression Grammars in the Compiler Course
AUTHOR: Eugene Wallingford
DATE: May 22, 2009 4:05 PM
DESC:
-----
BODY:
Yesterday, a student
told me about
the Ruby gem
Treetop,
a DSL for writing language grammars. This language
uses
parsing expression grammar,
which turns our usual idea of grammar inside out.
Most compiler theory is built atop the context-free
and regular grammars of Chomsky. These grammars are
generative: they describe rules that allow
us to create strings which are part of the language.
Parsing expression grammars describe rules that allow
us to recognize strings which are part of
the language.
This new kind of grammar offers a lot of advantages
for working with programming languages, such as
unifying lexical and syntactic descriptions and
supporting the construction of linear-time parsers.
I remember seeing
Bryan Ford
talk about packrat parsing at ICFP 2002, but
at that point I wasn't thinking as much about
language grammars and so didn't pay close attention
the type of grammar that underlay his parsing ideas.
While generative grammars are a fundamental part of
computing theory, they don't map directly onto the
primary task for which many software people use
them: building scanners and parsers for programming
languages. Our programs recognize strings,
not generate them. So we have developed mechanisms
for building and even generating scanners and parsers,
given grammars that we have written under specific
constraints and then massaged to fit our programming
mechanisms. Sometimes the modified grammars aren't
as straightforward as we might like. This can be
a problem for anyone who comes to the grammar later,
as well as a problem for the creators of the grammar
when they want to change it in response to changes
requested by users.
A recognition-based grammar matches our goals as
compiler writers more closely, which could be a
nice advantage. Parsing expression grammars make
explicit the specification of the code we write
against them.
For those of us who teach compiler courses, something
like a parsing expression grammar raises another
question. Oftentimes, we hope that the compiler
course can do double duty: teach students how to
build a compiler, and help them to understand the
theory, history, and big issues of language
processors. I think of this as a battle between
two forces, "learning to" versus "learning about",
a manifestation of
epistemology's distinction
between "knowing that" and "knowing how".
Using recognition-based grammars as the foundation
for a compiler course introduces a trade-off:
students may be empowered more quickly to create
language grammars and parsers but perhaps not learn
as much about the standard terminology and techniques
of the discipline. These standard ways are, of
course, our historical ways of doing things.
There is much value in learning history, but at what
point do we take the step forward to techniques that
are more practical than reminiscent?
This is a choice that we have to make all the time
in a compiler course: top-down versus bottom-up
parsing, table-driven parsers versus recursive-descent
parsers, writing parsers by hand versus using
parser generators... As I've discussed here before,
I still ask students to write their parser by hand
because I think the experience of writing this code
teaches them more than just about compilers.
Now that I have been re-introduced to this notion
of recognition-based grammars, I'm wondering whether
they might help me to balance some of the forces at
play more satisfactorily. Students would have the
experience of writing a non-trivial parser by hand,
but against a grammar that is more transparent and
easier to work with. I will play with parsing
expression grammars a bit in the next year or so
and consider making a change the next time I teach
the course. (If you have taught a compiler course
using this approach, or know someone who has,
please let me know.)
Going this way would not commit me to having students
write their parsers by hand. The link that started
this thread of thought points to a tool for automating
the manipulation of parsing expression grammars.
Whatever I do, I'll add that tool to the list of
tools I share with students.
Oh, and a little Ruby Love to close. Take a look
at TreeTop. Its syntax is beautiful. A Treetop
grammar reads cleanly, crisply -- and is executable
Ruby code. This is the sort of beauty that Ruby allows,
even encourages, and is one of the reasons I remain
enamored of the language.
-----