TITLE: Softmax, Recursion, and Higher-Order Procedures
AUTHOR: Eugene Wallingford
DATE: October 03, 2011 7:20 AM
DESC:
-----
BODY:
**Update**: This entry originally appeared on September 28.
I bungled my blog directory and lost two posts, and the simplest
way to get the content back on-line is to repost.
John Cook
recently reported
that he has bundled up some of his earlier writings about the
**soft maximum** as
a tech report.
The soft maximum is "a smooth approximation to the maximum of
two real variables":
softmax(x, y) = log(exp(x) + exp(y))

When John posted his first blog entry about the softmax, I
grabbed the idea and made it a homework problem for my students,
who were writing their first Scheme procedures. I gave them a
link to John's page, so they had access to this basic formula as
well as a Python implementation of it. That was fine with me,
because I was simply trying to help students become more
comfortable using Scheme's unusual syntax:
(define softmax
  (lambda (x y)
    (log (+ (exp x)
            (exp y)))))
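For example (the exact digits will vary a bit with the floating-point
system), the soft maximum of 2 and 3 comes out a little above the true
maximum:

(softmax 2 3)    ;; => roughly 3.313, slightly more than (max 2 3)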

On the next assignment, I asked students to generalize the
definition of softmax to more than two variables. This gave
them an opportunity to write a **variable arity** procedure
in Scheme. At that point, they had seen only a couple of simple
examples of variable arity, such as this implementation of
addition using a binary `+` operator:
(define plus              ;; notice: no parentheses around
  (lambda args            ;; the args parameter in lambda
    (if (null? args)
        0
        (+ (car args) (apply plus (cdr args))))))
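With this definition, a call collects however many arguments it receives
into the list `args`:

(plus 1 2 3 4)    ;; args is (1 2 3 4); the result is 10
(plus)            ;; args is ();        the result is 0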

Many students followed this pattern directly for softmax:
(define softmax-var
  (lambda args
    (if (null? (cdr args))
        (car args)
        (softmax (car args)
                 (apply softmax-var (cdr args))))))

Some of their friends tried a different approach. They saw
that they could use **higher-order procedures** to solve
the problem -- without explicitly using recursion:
(define softmax-var
  (lambda args
    (log (apply + (map exp args)))))

When students saw each other's solutions, they wondered --
as students often do -- which one is correct?
John's
original blog post
on the softmax tells us that the function generalizes as we
might expect:
softmax(x_1, x_2, ..., x_n) = log(exp(x_1) + exp(x_2) + ... + exp(x_n))

Not many students had looked back for that formula, I think,
but we can see that it matches the higher-order softmax
almost perfectly. `(map exp args)` constructs
a list of the exp(x_i) values.
`(apply + ...)` adds them up.
`(log ...)` produces the final answer.
What about the recursive solution? If we look at how its
recursive calls unfold, we see that this procedure computes:
softmax(x_1, softmax(x_2, ..., softmax(x_{n-1}, x_n)...))
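For three arguments, for example, the calls unfold as:

(softmax-var x1 x2 x3)
  = (softmax x1 (softmax-var x2 x3))
  = (softmax x1 (softmax x2 x3))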

This is an interesting take on the idea of a soft maximum,
but it is not what John's generalized definition says, nor
is it particularly faithful to the original 2-argument function.
How might we roll our own recursive solution that computes the
generalized function faithfully? The key is to realize that
the recursion needs to follow the *summing* behavior, not the
maximizing behavior. So we might write:
(define softmax-var
  (lambda args
    (log (accumulate-exps args))))

(define accumulate-exps
  (lambda (args)
    (if (null? args)
        0
        (+ (exp (car args)) (accumulate-exps (cdr args))))))
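With this version, a call computes the single sum of exponentials
directly. For example:

(softmax-var 1 2 3)    ;; computes (log (+ (exp 1) (exp 2) (exp 3)))

which is exactly the generalized formula with n = 3.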

This solution turns `softmax-var` into an
interface procedure
and then uses
structural recursion
over a flat list of arguments. One advantage of using an
interface procedure is that the recursive procedure
`accumulate-exps` no longer has to deal with
variable arity, as it receives a list of arguments.
It was remarkable to me and some of my students just how
close the answers produced by the two student implementations
of softmax were, given how different the underlying behaviors
are. Often, the answers were identical. When different, they
differed only in the 12th or 15th decimal digit. As several
blog readers pointed out, softmax is associative, so the two
solutions are identical mathematically. The differences in the
values of the functions result from the vagaries of
floating-point precision.
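One way to see the effect for yourself is to keep the two student
versions around under separate names (`softmax-rec` and `softmax-ho`
are my labels, not part of the assignment) and subtract their results.
This sketch assumes the two-argument `softmax` defined above is
already loaded:

(define softmax-rec       ;; the recursive student solution
  (lambda args
    (if (null? (cdr args))
        (car args)
        (softmax (car args)
                 (apply softmax-rec (cdr args))))))

(define softmax-ho        ;; the higher-order student solution
  (lambda args
    (log (apply + (map exp args)))))

(- (apply softmax-rec '(1.1 2.2 3.3 4.4))
   (apply softmax-ho  '(1.1 2.2 3.3 4.4)))
;; => 0.0 on some systems, a difference on the order of 1e-15 on others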
The programmer in me left the exercise impressed by the
smoothness of the soft maximum. The idea is resilient across
multiple implementations, which makes it seem all the more
useful to me.
More important, though, this programming exercise led to
several interesting discussions with students about programming
techniques, higher-order procedures, and the importance of
implementing solutions that are faithful to the problem domain.
The teacher in me left the exercise pleased.
-----