Session 30

Course Wrap-Up


CS 4550
Translation of Programming Languages


Our Final Opening Exercise

Consider this little snippet of C, which operates on a string str. Let's assume that str is really long.

    for (int i = 0; i < strlen(str); i++) {
      process(str[i]);
    }

Last time, we learned that a compiler can generate more efficient code by "unrolling" the loop:

    // a real compiler would also emit cleanup code for the final
    // iterations when the string length is not a multiple of 5
    for (int i = 0; i < strlen(str); i += 5) {
      process(str[i]);
      process(str[i+1]);
      process(str[i+2]);
      process(str[i+3]);
      process(str[i+4]);
    }

Quick exercise:

How else could we optimize the original loop?
Recall that this is C, not Java...

The strlen() function is called once per iteration, or strlen(str) times in all. In C, strings are zero-terminated arrays of indeterminate length, so each call must walk the entire string to find the terminating zero. As James Hague points out, we have an unintentional O(n²) loop!

So, our compiler might want to optimize the loop by making the call once before entering the loop:

    int len = strlen(str);
    for (int i = 0; i < len; i++) {
      process(str[i]);
    }

Alas, it may not be able to. Hague notes:

As it turns out, this is much trickier to automate than may first appear. It's only safe if it can be guaranteed that the body of the loop doesn't modify the string, and that guarantee in C is hard to come by.

str is a pointer, and process() might modify the string. Hague gives a bit more on why it can be hard to prove that process() doesn't modify the string.
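For instance, consider this contrived process() (my example, not Hague's): if str has an alias that the function can reach, the loop body can change the string's length in mid-flight.

    #include <stdio.h>
    #include <string.h>

    char *global_str;                /* suppose elsewhere: global_str = str; */

    void process(char c) {
        putchar(c);
        if (c == 'x')
            global_str[0] = '\0';    /* truncates the string mid-loop! */
    }

    int main(void) {
        char str[] = "axbc";
        global_str = str;            /* str now has an alias */
        for (int i = 0; i < strlen(str); i++)
            process(str[i]);         /* prints "ax", then the loop stops */
        putchar('\n');
        return 0;
    }

The original loop stops as soon as the string is truncated, but a version with a hoisted length would march on through all four characters. The two programs behave differently, so the transformation is unsafe unless the compiler can prove that no such alias exists.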

Depending on the semantics of the source language, even simple improvements to code can be hard to implement automatically in a compiler. This helps me appreciate even more the ways in which compilers optimize our code.

Note that this is not a problem in Java, in which strings are immutable objects that know their own length. Check out this Java file... The unrolled loop runs six times faster for the short string and about 3.25 times faster for the much longer string.

Unexpected Compiler Moment. While researching this bit, I came across an article about why we use zero as the starting point of arrays and loops in computer science. The accepted folklore is that zero-based arrays are mathematically more elegant, or that they reduce execution time. Mike Hoye went on an archeological dig to find out.

... the reason we started using zero-indexed arrays was because it shaved a couple of processor cycles off of a program's compilation time. Not execution time; compile time.

That article is a fascinating read. Eugene sez: Check it out.



Optimization and Refactoring

While the compiler may not be able to safely pull the call to strlen() out of the loop, the programmer might. If I know that process() never modifies str, then I can refactor the code myself.

The same is true of any of the optimizations we saw last time, including inlining function calls by hand or using a loop or a goto to replace a function call entirely.

A programmer can do this, though, only if the programming language has the features we need. Klein doesn't have many features, so our refactoring is limited. In my work with Klein, I have converted if statements to boolean expressions and folded two function calls into one by substituting values by hand. I've also gone the other direction. As we learned last time, optimizations are just transformations that trade one cost for another, while preserving the meaning of the program.
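For example, converting an if statement into the boolean expression it computes is one such refactoring. Here is a sketch in C rather than Klein, to keep the syntax familiar:

    /* before: an if statement that computes a boolean the long way */
    int less_before(int a, int b) {
        if (a < b)
            return 1;
        else
            return 0;
    }

    /* after: the same value, with the conditional folded away */
    int less_after(int a, int b) {
        return a < b;
    }

Both functions compute the same value; only the shape of the code changes.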

A compiler can make these improvements at multiple levels: source language, AST, IR, or even target language. For example, after converting the AST for this Klein program:

    remainder(a : integer, b : integer) : integer
      if a < b
         then a
         else remainder(a-b, b)

... into three-address code of this form:

    L1:
    IF a >= b THEN GOTO L2
    T1 := a
    GOTO L3
    L2:
    T2 := a - b
    PARAM T2
    PARAM b
    T1 := CALL remainder
    L3:
    RETURN T1

... and then optimizing the call away as a loop:

    L1:
    IF a >= b THEN GOTO L2
    T1 := a
    GOTO L3
    L2:
    T2 := a - b
    a := T2          # PARAM T2
    b := b           # PARAM b
    GOTO L1          # T1 := CALL remainder
    L3:
    RETURN T1

We can prove that this change to the code does not change its value, only how the value is computed. If Klein itself included goto statements, we could imagine a programmer making this kind of change to her own code, in an effort to make the code more efficient. (But only if she knew that her compiler wouldn't do it for her!)
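To see what the loop version looks like at the source level, here is the same transformation sketched in C (Klein itself has neither loops nor gotos, so C stands in):

    #include <stdio.h>

    /* a direct C translation of the Klein function above */
    int remainder_rec(int a, int b) {
        if (a < b)
            return a;
        else
            return remainder_rec(a - b, b);   /* tail call */
    }

    /* the optimized version: the tail call becomes a jump back to the
       top of the loop, after rebinding the parameters (b is unchanged,
       just as in the three-address code) */
    int remainder_loop(int a, int b) {
        while (a >= b)
            a = a - b;
        return a;
    }

    int main(void) {
        printf("%d %d\n", remainder_rec(17, 5), remainder_loop(17, 5));  /* 2 2 */
        return 0;
    }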

When a programmer makes this sort of transformation to her code, it is called refactoring, which is defined as the process of changing the structure of a program to improve its design without changing its functional behavior. Compiler optimizations must also preserve a program's functional behavior, but their primary purpose is to improve efficiency. Programmers, on the other hand, refactor their code for a broader variety of reasons, with the emphasis usually on human-centered features such as extensibility or flexibility.

The techniques used to refactor code are often similar or identical to the techniques used to optimize code. As a result, we are able to build tools to help programmers refactor safely using many of the same techniques we use in compilers.

The relationship between refactoring and code optimization is not accidental. Back in Programming Languages and Paradigms, we learned about syntactic abstractions, which correspond to meaning-preserving transformations of programs. Any such transformation can be used as a refactoring (if it improves the design of the system in the minds of programmers) or as a code optimization (if it improves the quality of the generated target code in the minds of compiler writers and programmers who use the compiler).



The Final Recap

[ image: a recursive compiler course ]

We have talked about the analysis phase of a compiler: taking a program written in a source language as a stream of characters, doing lexical analysis to generate a stream of tokens, doing syntactic analysis to produce an abstract syntax tree, and doing semantic analysis to ensure that the AST satisfies the language definition.

We have talked about the synthesis phase of the compiler, which converts a semantically-valid abstract syntax tree into an equivalent program in the target language: building a run-time system to support the execution of the target program, translating the AST through one or more intermediate representations such as three-address code, translating the final intermediate representation into code in the target language, and optimizing the code.

What a busy semester! And you wrote a compiler of your own. As much work as it turned out to be, I think you probably learned a lot from writing such a large and complex program. In fact, that is probably the only way that we can really learn how to program.



Compiler Writing as Software Development

I'm sure you understand now, better than ever before, why this course satisfies the project requirement for the Computer Science major. For all of the technical content of this course, writing a compiler is, at its core, a software development project. All of the issues that matter when we write any other large program matter when we write a compiler: design, implementation, testing, teamwork, and managing a project over a full semester, to name a few.

Perhaps we could spend more time in this course learning specifically about software engineering and project management, though that would mean eliminating some compiler material. Maybe we don't need much; many of us think that the best way to learn about software engineering is to write a big program with a few other programmers.

Quick Question: How much did the software engineering aspect of this project affect your success?



Compilers in the World

Nearly 60 years ago, a team at IBM changed the computing world forever when it published the first programmer's reference manual [ large PDF ] for Fortran, "The IBM Mathematical Formula Translating System". Fortran was simultaneously a high-level programming language and a program that translated programs written in the new language into the machine language. A compiler.

Though many university CS departments have dropped their compiler courses, programmers still write compilers today. Sometimes, they invent a new language, such as Lua, Scala, or Clojure, and have to build compilers and tools for the new language. Sometimes, they want to retarget a huge mass of existing code to a new platform, say, from Java to JavaScript in a browser. So they build tools like the Google Web Toolkit to automate the port. Or maybe they want to generate code for new architectures or processors, or to help programmers using an existing language work in a new environment. One example I saw recently is Continuation Passing C (CPC), a new language designed for writing concurrent systems more reliably in a C environment.

Other programmers want to make their mainstream languages better or more widely useful. Take Ruby, for example. In an interview a few years ago, Chad Fowler talked about the diversity of Ruby compilers available:

There's Matz's Ruby (1.8), YARV (1.8 + a new VM and syntax for 1.9), JRuby, Rubinius, Maglev, IronRuby, MacRuby, Rite, and others in development. All of them can run real Ruby code. All of them provide advantages over the others. Each implementation is faster for some things than the current state-of-the-art canonical Ruby implementation.

Since then, Ruby 1.9 and Ruby 2.0 have been released, and other Ruby projects have come along. Programmers never sit still for long.

Bootstrapping

For many compilers, the ultimate test is to become "self-hosting". One of the best ways to demonstrate that a compiler is ready for prime time is to write the compiler in the source language it compiles and then use the compiler to compile itself. This serves to test the compiler, but more importantly, it also makes it possible to port the compiler to a new target machine by writing only a new code generator and compiling it with the existing compiler.

Of course, this creates a "chicken and egg" problem that we considered back in Session 4. Do you remember this?

[ image: creating the native C compiler ]

What does this diagram show?

And don't think that this is simply a theoretical exercise. I read about bootstrapping of various sorts surprisingly often in technical blogs. I recently re-read this write-up from a few years ago by the creator of Guile Scheme, who bootstraps his entire compiler from eval.scm and the barest C interpreter for it.

Quick Question: What is the smallest set of features we would have to add to the Klein programming language to make it powerful enough to write a Klein compiler? How close is your compiler to being good enough to compile a program of the size of a Klein compiler?

A wild digression: Bootstrapping is not just for programmers any more.



The Squeak Story

My favorite story about bootstrapping a compiler and implementing a language in itself isn't the Lisp story. Most people know that story, and it is indeed a classic story about a seminal moment in computing. Still, my favorite is the Squeak story, which is a modern incarnation of the Smalltalk story from the 1970s -- built on top of the same!

Back in the mid-1990s, computing pioneer Alan Kay assembled his original team from Xerox PARC and went to work at Disney, to develop the next generation of digital media. They wanted a software development environment in which to build educational media for use by children and non-programmers. Their original Smalltalk could do the job, but none of the modern implementations provided the primitives and power they desired. They considered using other modern languages, most notably Java, but found them to be too limiting. So they turned their attention back to Smalltalk and considered how to make a Smalltalk environment that would do what they wanted.

What they had to work with was an old implementation of Smalltalk-80 that ran on an old Mac. A Smalltalk implementation consists of a virtual machine, which interprets byte codes, and an image, or the object memory that is the program.

What they wanted was a VM implemented in C, for maximal portability and speed, and an image implemented as much in Smalltalk as possible, for maximal control and flexibility.

What they did was something like the bootstrapping idea we saw above, with a twist. As Ingalls and his colleagues recount in the paper linked below, they...

- wrote the new VM in a subset of Smalltalk simple enough to be translated mechanically to C,
- wrote a Smalltalk-to-C translator for that subset, in Smalltalk,
- debugged and tuned the new VM by running it as ordinary Smalltalk code inside the old system, and
- then translated the VM to C and compiled it, producing a fast, portable VM on which to run the new image.

Presumably, they then moved the old Smalltalk implementation to an archive, for posterity.

You can download Squeak from www.squeak.org, under the freest of free software licenses.

You can read the whole story, as well as some interesting technical ideas that go into implementing a pure object-oriented language, in the paper "Back to the Future", by Dan Ingalls and his colleagues at Squeak Central:

[ html | pdf ]

For what it's worth, if I needed a virtual machine and had all the money I needed to hire anyone I wanted, I'd ask Dan Ingalls. His work on Smalltalk in the last forty years, creating many of the ideas that everyone takes for granted these days, is remarkable.



Your Questions

One of you asked this week:

It was mentioned that most compiler development these days focuses on compiling down to language that might be a step above assembly, C or Javascript for example. I was wondering if the code architecture -- perhaps more in dealing with the backend/code generation phase -- of these compilers would be vastly different from compilers that actually do compile down to hard assembly? And if so, what would it look like?

Short answer: not much change.

One oldie: quantum compilers. A cool link: A Lambda Calculus for Quantum Computation ... with tools implemented in PLT Scheme, the forerunner of Racket.

Short answer: I don't understand them well yet. But there are still important questions to answer about the model of computation and the kind of programming languages we will want and need. Compilers will follow.

What's up with the programming language BF?

Short answer: Esoteric languages are a thing. BF is a superstar in that world. I like the idea of Piet. On a plane ride home from OOPSLA one fall fifteen years ago, I wrote an interpreter for Ook! in Scheme. Here is "Hello, World":

    Ook. Ook? Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook.
    Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook! Ook? Ook? Ook.
    Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook.
    Ook. Ook. Ook. Ook. Ook. Ook? Ook! Ook! Ook? Ook! Ook? Ook.
    Ook! Ook. Ook. Ook? Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook.
    Ook. Ook. Ook. Ook. Ook. Ook. Ook! Ook? Ook? Ook. Ook. Ook.
    Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook? Ook! Ook! Ook? Ook!
    Ook? Ook. Ook. Ook. Ook! Ook. Ook. Ook. Ook. Ook. Ook. Ook.
    Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook! Ook. Ook! Ook.
    Ook. Ook. Ook. Ook. Ook. Ook. Ook! Ook. Ook. Ook? Ook. Ook?
    Ook. Ook? Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook.
    Ook. Ook. Ook. Ook. Ook. Ook. Ook! Ook? Ook? Ook. Ook. Ook.
    Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook? Ook! Ook! Ook? Ook!
    Ook? Ook. Ook! Ook. Ook. Ook? Ook. Ook? Ook. Ook? Ook. Ook.
    Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook.
    Ook. Ook. Ook. Ook. Ook. Ook. Ook! Ook? Ook? Ook. Ook. Ook.
    Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook. Ook.
    Ook. Ook. Ook. Ook. Ook. Ook? Ook! Ook! Ook? Ook! Ook? Ook.
    Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook. Ook? Ook. Ook? Ook.
    Ook? Ook. Ook? Ook. Ook! Ook. Ook. Ook. Ook. Ook. Ook. Ook.
    Ook! Ook. Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook!
    Ook! Ook! Ook! Ook. Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook!
    Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook! Ook. Ook. Ook?
    Ook. Ook? Ook. Ook. Ook! Ook.
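
For the curious: Ook! is Brainfuck in disguise. Each pair of Ook words maps onto one of Brainfuck's eight operators, which act on a tape of byte cells. Here is a minimal interpreter sketched in C (my plane-ride version was in Scheme, and it parsed the Ook pairs; this sketch reads the equivalent Brainfuck operators directly):

    #include <stdio.h>

    /* a tiny Brainfuck machine; the comments give the Ook! spellings */
    void run(const char *prog) {
        static unsigned char tape[30000];
        unsigned char *ptr = tape;
        for (const char *pc = prog; *pc; pc++) {
            switch (*pc) {
                case '>': ptr++;            break;  /* Ook. Ook? */
                case '<': ptr--;            break;  /* Ook? Ook. */
                case '+': (*ptr)++;         break;  /* Ook. Ook. */
                case '-': (*ptr)--;         break;  /* Ook! Ook! */
                case '.': putchar(*ptr);    break;  /* Ook! Ook. */
                case ',': *ptr = getchar(); break;  /* Ook. Ook! */
                case '[':                           /* Ook! Ook? */
                    if (*ptr == 0) {                /* skip to the matching ] */
                        for (int d = 1; d > 0; ) {
                            pc++;
                            if (*pc == '[') d++;
                            else if (*pc == ']') d--;
                        }
                    }
                    break;
                case ']':                           /* Ook? Ook! */
                    if (*ptr != 0) {                /* jump back to the matching [ */
                        for (int d = 1; d > 0; ) {
                            pc--;
                            if (*pc == ']') d++;
                            else if (*pc == '[') d--;
                        }
                    }
                    break;
            }
        }
    }

    int main(void) {
        run("++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]"
            ">>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.");
        return 0;
    }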

Programmers are a fun lot.



Final Countdown

cue the music....

Next time is our final exam period. As the syllabus tells us, we won't have a final exam, but we will meet.

Final Exam Period

Each team will give a 10-15 minute presentation as part of the final exam period next week. These presentations are in lieu of an exam itself. The presentation should include:

You do not need to prepare slides and all the trappings of a more formal presentation, though you may use slides as an aid. You should take some care in presenting your work, both to show the class the result of your semester's labor and to give them a chance to learn from what you have done.

After your presentations, we will wrap up with a couple of quick activities:

Then, we relax and enjoy our accomplishments.

Project Submissions

Primary version to grade: what you submit on time.

Optimizations: extra credit!

Late submissions: okay. I will take them into account. Yes, you may demo the later version.

Optimizations as late submission: a little extra credit.

Last Reading

Please do read one last thing for me this semester: Steve Yegge's Rich Programmer Food. Like most of Yegge's writing, this essay contains a lot of fluff designed to entertain, which some people enjoy and some people don't. (It's a guilty pleasure of mine.) But the computing is usually solid and the reflections are usually thought-provoking.

This is one of the quotes where he gets to the heart of why compilers matter:

Do you just sit around and wait for "someone" to fix your editor [or any other tool you use]? What if it takes years? Doesn't it seem like you, as the perfectly good programmer that you are, ought to be able to fix it faster than that?

That, plus congratulatory lines such as

[The compilers course] brings together, in a very concrete way, almost everything you learned before you took the course.

make it a favorite of compiler course instructors.

We'll talk about it for a few minutes next week.

Good luck as you put the finishing touches on your compilers!



Eugene Wallingford ..... wallingf@cs.uni.edu ..... December 11, 2015