Session 22
Data Abstraction and Variety

The Infinite Variety of Implementations

Usually, we think of data structures as having alternative implementations. But even atomic types can be represented in a variety of ways.

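Consider one of the simplest data types of all: the non-negative integer. Non-negative integers can be defined with an "interface" of four parts:
  • The value zero.
  • The predicate is-zero?, which tells us whether a number is zero.
  • The operation next, which returns the successor of a number.
  • The operation previous, which returns the predecessor of a non-zero number.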

Every integer has a successor; zero doesn't have a predecessor.

(This interface is a subset of Peano's axioms.)

We can implement this interface directly in Racket using its own numbers:

(define zero     0)
(define is-zero? zero?)
(define next     add1)
(define previous sub1)

Using this interface, we can express the number two as (next (next zero)).

But we can also implement this interface just as easily using a Racket list:

(define zero     '())
(define is-zero? null?)
(define next     (lambda (n) (cons 1 n)))
(define previous rest)

With this implementation, we can still express the number two as (next (next zero)), just as we would in the number-based implementation. The underlying value would be represented differently — (1 1) rather than 2 — but its meaning relative to the other operations would be the same.
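
To see that client code really is independent of the representation, here is a minimal sketch. The function plus is my own illustration, not part of the interface above. It is written using only zero, is-zero?, next, and previous, and is shown here running on the list-based implementation.

```racket
#lang racket

;; The list-based implementation from above.
(define zero     '())
(define is-zero? null?)
(define next     (lambda (n) (cons 1 n)))
(define previous rest)

;; plus is a hypothetical client function. It uses only the four
;; parts of the interface, never the underlying representation.
(define (plus m n)
  (if (is-zero? m)
      n
      (next (plus (previous m) n))))

(define two   (next (next zero)))
(define three (next two))

(plus two three)   ; five, which this representation shows as '(1 1 1 1 1)
```

Swapping in the number-based definitions of the four names leaves plus unchanged; the same call then produces 5.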

Check out these implementations — and maybe even try to create your own. There are many more ways. We computer programmers are an ingenious lot. Computation is a flexible medium!

An Opening Exercise

As a warm-up exercise, I would like for you to brainstorm as many different implementations as possible for another simple data type: the pair.

A pair consists of two values. We can define a pair with an interface consisting of three functions:
  • The operation MAKE-PAIR is a constructor. It takes any two arguments and returns a new pair.
  • The access procedure FIRST takes a pair as an argument and returns the first part of that pair.
  • The access procedure SECOND takes a pair as an argument and returns the second part of that pair.
For example:
> (define pair1 (MAKE-PAIR 2 3))
> (define pair2 (MAKE-PAIR 1 pair1))
> (FIRST  pair2)
1
> (FIRST (SECOND pair2))
2
> (SECOND (SECOND pair2))
3
List all the different ways you can think of implementing this interface in Racket. You should be able to come up with at least two based on data types you have used in this course, and maybe more.

If you run out of ideas, list ways that you might do this in Python, Java, or some other language.

Some Possible Implementations of Pairs

The number of ways to implement a pair in Racket is probably larger than you first imagine. I can think of two ways using data types that you have been using all semester:
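
The built-in pair and the list are the natural candidates here. A minimal sketch of both follows, on the assumption that these are the two intended. The /cons and /list suffixes are only a device for keeping both versions in one file; in practice each version would define MAKE-PAIR, FIRST, and SECOND directly.

```racket
#lang racket

;; Version 1 (assumed): Racket's built-in pair, the cons cell.
(define MAKE-PAIR/cons cons)
(define FIRST/cons     car)
(define SECOND/cons    cdr)

;; Version 2 (assumed): a list with exactly two items.
(define (MAKE-PAIR/list a b) (list a b))
(define FIRST/list  first)
(define SECOND/list second)

(define pair1 (MAKE-PAIR/list 2 3))
(define pair2 (MAKE-PAIR/list 1 pair1))
(FIRST/list (SECOND/list pair2))   ; => 2
```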

Back in Session 4, you learned about and had a reading assignment on Racket vectors. We can implement a pair as a vector with two slots:

(define (MAKE-PAIR a b) (vector a b))
(define (FIRST  aPair)  (vector-ref aPair 0))
(define (SECOND aPair)  (vector-ref aPair 1))

If you started thinking about data structures in other languages, you might have listed a Python dictionary or a Java map. Back in Session 4, I also mentioned that Racket has a hash table:

(define (MAKE-PAIR a b) (hash 'first a 'second b))
(define (FIRST  aPair)  (hash-ref aPair 'first))
(define (SECOND aPair)  (hash-ref aPair 'second))

Or you may have thought of a C struct. Racket has structs, too:

(struct pair (one two))       ; a structure with two fields
(define MAKE-PAIR pair)       ; a Racket-generated constructor
(define FIRST     pair-one)   ;   and Racket-generated accessors
(define SECOND    pair-two)   ;   named by struct and field

If you thought of Java, you might have thought of using a class, which is pretty similar to a struct. Racket has classes and objects, too!
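
Here is a rough sketch of what a class-based pair might look like, using the class system that comes with #lang racket. The class name pair% and the method names get-one and get-two are my own choices for illustration.

```racket
#lang racket

;; A class with two fields and one accessor method per field.
(define pair%
  (class object%
    (init-field one two)
    (super-new)
    (define/public (get-one) one)
    (define/public (get-two) two)))

;; The pair interface, written in terms of the class.
(define (MAKE-PAIR a b) (new pair% [one a] [two b]))
(define (FIRST  aPair)  (send aPair get-one))
(define (SECOND aPair)  (send aPair get-two))

(define pair1 (MAKE-PAIR 2 3))
(define pair2 (MAKE-PAIR 1 pair1))
(FIRST (SECOND pair2))   ; => 2
```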

Wait, There's More...

What other values have we used this semester? Functions. Lots and lots of functions. In Racket, functions are values, too. Is it possible to implement a pair as a function? Given that, can we implement this interface such that the constructor MAKE-PAIR returns a function and the accessors FIRST and SECOND receive a function as an argument?

Yes, indeed! Here are three ways.

We could make the pair a selector function.

(define (MAKE-PAIR a b) (lambda (selector)
                          (if selector a b)))
(define (FIRST  aPair)  (aPair #t))
(define (SECOND aPair)  (aPair #f))

This approach uses boolean values in addition to functions, as well as an if expression.

We could use message passing to simulate how objects work. This generalizes the idea of a selector function to allow different (and more) arguments.

(define (MAKE-PAIR a b) (lambda (selector)
                          (cond ((eq? selector 'first ) a)
                                ((eq? selector 'second) b))))
(define (FIRST  aPair)  (aPair 'first))
(define (SECOND aPair)  (aPair 'second))

This approach uses symbols, and symbol equality, in addition to functions. It also uses an if or cond.

Both of these solutions use functions and another data type to implement a pair. Can we implement a pair using only functions?

We can. This implementation uses pure functions.

(define (MAKE-PAIR a b) (lambda (proc) (proc a b)))
(define (FIRST  aPair)  (aPair (lambda (x y) x)))
(define (SECOND aPair)  (aPair (lambda (x y) y)))

I love this last solution. Whenever I see it, I smile. It hints at how much one can do with nothing but functions. The lambda calculus underlies most programming language theory and inspired the creators of Lisp, Scheme, Racket, and many other languages. It relies solely on function definition, function application, and variable substitution to do all of its computation. It does not even use boolean values or a selection statement, which seem to be at the core of every programming language we know. Maybe those things aren't really essential after all?
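
To hint at why booleans and selection may not be essential, here is a small sketch of booleans built from nothing but functions. The names TRUE, FALSE, and IF are hypothetical, chosen for this illustration. Because Racket evaluates arguments eagerly, the two branches are passed as zero-argument functions.

```racket
#lang racket

;; A boolean is a function that chooses one of its two arguments.
(define TRUE  (lambda (x y) x))
(define FALSE (lambda (x y) y))

;; IF lets the boolean choose a branch, then runs only that branch.
(define (IF condition then-branch else-branch)
  ((condition then-branch else-branch)))

(IF TRUE  (lambda () 'yes) (lambda () 'no))   ; => 'yes
(IF FALSE (lambda () 'yes) (lambda () 'no))   ; => 'no
```

Notice that TRUE and FALSE are exactly the functions that FIRST and SECOND hand to a pair in the pure function implementation.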

Code. This file contains all eight Racket implementations of pairs shown above. Try them out! And, just so that you know this isn't a strange artifact of Racket, here are two implementations of the pair in Python, including the pure function implementation... (Remember, Python has lambdas, too.)

Of course, if we think a little harder, we can probably find cool ways to use Racket's other primitive types, such as numbers, strings, and symbols, to encode a pair. Those implementations will take a little bit more effort — and code. They also might not be as general.

Indeed, in this session two years ago, one of the students (thank you, Henry!) asked if we could implement a pair using the set data type that we implemented in Homework 7 and used in Sessions 19-20. I did not know... It seemed impossible, though, because a pair is ordered and sets are unordered. Even so, while the students worked on Quiz 3, I worked on this challenge as my quiz. It turns out that it is possible! If you'd like to see how, see this implementation.
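
One well-known way to do this is Kuratowski's encoding from set theory, which represents the ordered pair (a, b) as the set {{a}, {a, b}}. The sketch below rests on two assumptions: it may not be the approach taken in the implementation mentioned above, and it uses Racket's built-in sets rather than the set type from Homework 7.

```racket
#lang racket

;; The pair (a, b) becomes the set {{a}, {a, b}}.
(define (MAKE-PAIR a b) (set (set a) (set a b)))

;; The first part is the one element that belongs to every inner set.
(define (FIRST aPair)
  (set-first (apply set-intersect (set->list aPair))))

;; The second part is the element that is not in every inner set.
;; If there is none, both parts were equal and the pair is {{a}}.
(define (SECOND aPair)
  (define members (set->list aPair))
  (define extra   (set-subtract (apply set-union members)
                                (apply set-intersect members)))
  (if (set-empty? extra)
      (FIRST aPair)
      (set-first extra)))

(define pair1 (MAKE-PAIR 2 3))
(define pair2 (MAKE-PAIR 1 pair1))
(FIRST (SECOND pair2))    ; => 2
(SECOND (SECOND pair2))   ; => 3
```

The order survives because the two inner sets play different roles: only the first part appears in both of them.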

One moral of this story is:

Do not assume anything about the implementation of an interface — even the simplest interface!

Some of these implementations might be outside the scope of your imagination just yet. The pure functional implementation probably is. I hope that our study of data abstraction will stretch our minds to a point where these don't seem so strange. Note in particular that we will use the 'message passing' approach above to implement object-oriented programming in Racket.

Setting The Stage

For the last six (*) sessions, we have been exploring the idea of syntactic abstractions, those features of a language that are convenient to have but not essential to the language. We considered several examples: local variables, local functions, non-if selection statements, logical connectives, and — most recently — variable names. Our goal in studying these syntactic abstractions was not to study Racket per se but to see why and how language interpreters provide such abstractions. Indeed, you can identify many such abstractions in other languages you know.

(*) Well, five. We had an off-day and are saving Session 21 for a little later.

Beginning with this session, we move on to another sort of abstraction that all languages provide: data abstraction. We will introduce the idea of data abstraction by returning to an idea you know well: data types and their implementations.

Data Abstraction

Programming requires two kinds of abstraction.

A syntactic abstraction offers a different way to express behaviors. It does not add to what can be expressed in the language, but it does add to what can be expressed conveniently.

A data abstraction offers a different way to express values. Usually, a data abstraction allows you to both represent and manipulate the data. In practice, these data are often aggregate values, but that is not necessarily true. As with syntactic abstractions, a data abstraction does not add to the set of problems that can be solved in a language, but it does make some solutions easier or more convenient to create.

We have already used one data abstraction extensively in this course: Racket's list, which is constructed out of the more fundamental type, the pair. Racket lists are implemented in terms of another data type, so they are a data abstraction. Technically, we don't need lists, but they make our jobs more convenient. Each list is built out of a sequence of pairs (primitive cons cells) and the empty list (the null pointer). The language provides an interface that, for the most part, hides from us the underlying data representation.

When we build a list incorrectly, Racket occasionally reminds us that lists are built out of pairs, by showing "dotted pair" notation when displaying the structure. We can even use dotted pair notation ourselves to express lists and other structures:

> '(1 . (2 . (3 . ())))
'(1 2 3)

> (cons 1 (cons 2 3))
'(1 2 . 3)
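
A few quick checks make the relationship between lists and pairs concrete. This is only a sketch using standard Racket predicates; the name improper is mine.

```racket
#lang racket

;; list is shorthand for a chain of cons cells ending in the empty list.
(equal? (list 1 2 3)
        (cons 1 (cons 2 (cons 3 '()))))   ; => #t

;; Every non-empty list is a pair ...
(pair? (list 1 2 3))                      ; => #t

;; ... but a chain of pairs that does not end in '() is not a list.
(define improper (cons 1 (cons 2 3)))
(pair? improper)                          ; => #t
(list? improper)                          ; => #f
```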

Of course, the idea of constructing one type out of another is not peculiar to Racket. Can you think of an example from some other language you know? The one that comes immediately to my mind is the Java ArrayList. An ArrayList is constructed out of an array, which is a more fundamental type. Java compilers know about arrays, but they don't have to know much about ArrayLists. They do know how to manipulate classes as abstractions, though, and so they can compile and manipulate ArrayLists.

Indeed, every user-defined class in Java is a data abstraction implemented in terms of other values. This holds as well for the classes defined as a part of the Java programming language. Object-oriented programming is a style in which programmers create data abstractions that make it more convenient to write some solutions -- and to maintain and extend them over time.

Another prominent data abstraction from another language is C++'s class construct. As designed, C++ programs were to be compiled by C compilers, even though C compilers do not recognize classes. Instead, the C++ pre-processor would translate any code that creates and uses classes and their instances into equivalent C code. The pre-processor translates all classes into equivalent C structs. So, while the primitive term class is a syntactic abstraction, it is also a data abstraction.

The Racket list is a powerful data structure, due largely to its flexibility, but it is also rather inefficient when it comes to accessing elements. We cannot easily or efficiently access any item directly; instead, we must step through items one at a time. Knowing that a list is really a linked structure of cons cells gives this fact away, because we know from our data structures course that linked structures provide O(n) access time.
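
To make the cost visible, here is a sketch of how positional access on a list has to work. The function nth is a hypothetical stand-in for the built-in list-ref, written out so that we can see each step.

```racket
#lang racket

;; Reaching position i takes i steps down the chain of cons cells,
;; so the work grows with the index.
(define (nth lst i)
  (if (zero? i)
      (car lst)
      (nth (cdr lst) (- i 1))))

(nth '(a b c d e) 3)   ; => 'd
```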

Racket provides another data aggregate, the vector, which we have used only occasionally to date. Vectors are a primitive data type. We will soon put vectors to more frequent use. You should find that vectors feel rather familiar, based on your programming experiences in other languages. At this point, though, you will want to refresh your memory about vectors by reviewing the Racket Guide.
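
As a quick refresher, here is a short sketch of the basic vector operations. Unlike positional access on a list, vector-ref and vector-set! take constant time regardless of the index.

```racket
#lang racket

(define v (make-vector 3 0))   ; a mutable vector of three zeros
(vector-set! v 1 'middle)      ; change slot 1 in place
(vector-ref v 1)               ; => 'middle
(vector-length v)              ; => 3
(vector->list v)               ; => '(0 middle 0)
```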

In your data structures course, you learned about hash tables and how to implement them using arrays. We could do the same thing using Racket vectors, but the language creators have saved us the effort by providing this data abstraction as a primitive.
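
For comparison, here is a minimal sketch of what building a hash table on top of a Racket vector might look like. The names make-table, table-set!, and table-ref are hypothetical, and the design (a fixed number of buckets, each an association list) is one simple choice among many. It is not how Racket's own hash tables are implemented.

```racket
#lang racket

;; A table is a vector of buckets; each bucket is an association list.
(define (make-table size) (make-vector size '()))

;; Map a key to a slot. modulo with a positive divisor is never negative.
(define (bucket-index table key)
  (modulo (equal-hash-code key) (vector-length table)))

;; Add or replace the entry for key.
(define (table-set! table key value)
  (define i (bucket-index table key))
  (define others
    (filter (lambda (entry) (not (equal? (car entry) key)))
            (vector-ref table i)))
  (vector-set! table i (cons (cons key value) others)))

;; Look up key, returning default when it is absent.
(define (table-ref table key default)
  (define entry (assoc key (vector-ref table (bucket-index table key))))
  (if entry (cdr entry) default))

(define t (make-table 8))
(table-set! t 'first 2)
(table-set! t 'second 3)
(table-ref t 'first #f)    ; => 2
(table-ref t 'third #f)    ; => #f
```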

In our discussion of syntactic abstractions, we saw that we could define a syntactic abstraction such as let, write a program that translates expressions in a language containing let expressions into expressions in a language without let expressions, and our interpreter wouldn't know the difference. We might like to be able to do that with new data abstractions, too. Sometimes, we can.
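
As a reminder of what such a translation looks like, here is a minimal sketch. The function let->lambda is a hypothetical translator over quoted expressions. It assumes that every let has exactly one body expression and that the input contains no quoted data.

```racket
#lang racket

;; Rewrite (let ((var val) ...) body) as ((lambda (var ...) body) val ...),
;; translating nested expressions along the way.
(define (let->lambda exp)
  (match exp
    [(list 'let (list (list vars vals) ...) body)
     `((lambda ,vars ,(let->lambda body))
       ,@(map let->lambda vals))]
    [(list parts ...) (map let->lambda parts)]
    [_ exp]))

(let->lambda '(let ((x 1) (y 2)) (+ x y)))
;; => '((lambda (x y) (+ x y)) 1 2)
```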

A great example of programmers adding their own data abstraction to a language is Generic Java. Those of you who know Java or Ada have learned about generic data types. C++ offers the same capabilities with its template facility. In the beginning, Java did not provide generic class definitions in any way. This caused programmers a lot of grief, in particular all the downcasting we had to do when retrieving objects from a container, such as a Vector.

Well, some programmers solved the problem for themselves by adding generics to the language as a data abstraction. They defined new syntax for expressing generic classes such as Vector<String>, and then wrote a preprocessor that translated code containing generic classes into regular Java.

Eventually, the team in charge of Java decided to add generic classes to the language. When Java 1.5 arrived in September 2004, it included generic types as a part of the language, based on the Generic Java extension. The programmer-added feature became a language feature. Java's implementation of generic types is a great example of data and syntactic abstraction.

Making Data Abstractions

Often, you will find that Racket, or whatever other language you are using, does not provide a native data type that you need. Whenever you encounter this kind of situation, you know just what to do: create an abstract data type. These abstractions aren't "just" data abstractions, but they also are not "just" syntactic abstractions. Instead, they are behavioral abstractions that involve both data and functions.

Making data abstractions is something you studied in some detail in your Data Structures course. There, you probably spoke of "abstract data types", or ADTs. One of the essential ideas underlying ADTs was that the interface of the ADT is independent of its implementation. The interface specifies a set of values and a set of operations on these values. It is "abstract" in the sense that the interface makes no commitment to how these values and operations are actually implemented. The implementation provides a concrete representation, at least concrete in the sense that it is expressed in terms of some other executable type(s).

The value of ADTs lies in their independence from representation. They allow us to write client code in terms of an interface, for which we can provide a suitable implementation that meets a particular set of needs.

              [     client     ]
                      |
                      |
                      v
              [   interface    ]
                      |
                      |
        |---------------------------|
        |                           |
        v                           v
[ implementation ]        [ new implementation ]

Often, at the beginning of a project, we will choose a simple implementation that allows us to write client code as quickly as possible. Later, we can choose a more efficient implementation, or an implementation that better suits the program's environment, without having to change the client code.
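
Here is a small sketch of that separation, using the pair from earlier in the session. The function swap is a hypothetical piece of client code of my own. It is written against the interface only.

```racket
#lang racket

;; Implementation: the vector-based pair from earlier in the session.
(define (MAKE-PAIR a b) (vector a b))
(define (FIRST  aPair)  (vector-ref aPair 0))
(define (SECOND aPair)  (vector-ref aPair 1))

;; Client: swap never mentions vectors, only the three interface names.
(define (swap aPair)
  (MAKE-PAIR (SECOND aPair) (FIRST aPair)))

(FIRST  (swap (MAKE-PAIR 2 3)))   ; => 3
(SECOND (swap (MAKE-PAIR 2 3)))   ; => 2
```

Replacing the three definitions at the top with any other implementation of the pair, such as the struct or the pure function version, leaves swap untouched.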

This tells us something about data abstraction from the programmer's point of view, but what about from the programming language's point of view?

It turns out that data abstractions play a role in the language similar to the one played by syntactic abstractions. A language with a small set of primitive data types can be interpreted by a simpler program than one that offers a larger set of primitive types. But having too small a set of types makes programming in the language too inconvenient, so we might want to add one or more data abstractions for the programmer's benefit. These abstractions can be implemented in terms of the more fundamental types and thus interpreted by the simpler program.

Racket lists and pairs are a great example. The pair is more primitive because it has the simpler set of values and operations. But Racket also offers a list data type that has a straightforward implementation in terms of pairs, and it allows us to use pairs (and lists) to implement other sorts of data abstractions.

Wrap Up