TITLE: Softmax, Recursion, and Higher-Order Procedures AUTHOR: Eugene Wallingford DATE: October 03, 2011 7:20 AM DESC: ----- BODY: Update: This entry originally appeared on September 28. I bungled my blog directory and lost two posts, and the simplest way to get the content back on-line is to repost. John Cook recently reported that he has bundled up some of his earlier writings about the soft maximum as a tech report. The soft maximum is "a smooth approximation to the maximum of two real variables":
    softmax(x, y) = log(exp(x) + exp(y))
When John posted his first blog entry about the softmax, I grabbed the idea and made it a homework problem for my students, who were writing their first Scheme procedures. I gave them a link to John's page, so they had access to this basic formula as well as a Python implementation of it. That was fine with me, because I was simply trying to help students become more comfortable using Scheme's unusual syntax:
    (define softmax
      (lambda (x y)
        (log (+ (exp x)
                (exp y)))))
On the next assignment, I asked students to generalize the definition of softmax to more than two variables. This gave them an opportunity to write a variable arity procedure in Scheme. At that point, they had seen only a couple simple examples of variable arity, such as this implementation of addition using a binary + operator:
    (define plus              ;; notice: no parentheses around
      (lambda args            ;; the args parameter in lambda
        (if (null? args)
            (+ (car args) (apply plus (cdr args))) )))
Many students followed this pattern directly for softmax:
    (define softmax-var
      (lambda args
        (if (null? (cdr args))
            (car args)
            (softmax (car args)
                     (apply softmax-var (cdr args))))))
Some of their friends tried a different approach. They saw that they could use higher-order procedures to solve the problem -- without explicitly using recursion:
    (define softmax-var
      (lambda args
        (log (apply + (map exp args)))))
When students saw each other's solutions, they wondered -- as students often do -- which one is correct? John's original blog post on the softmax tells us that the function generalizes as we might expect:
    softmax(x1, x2, ..., xn) = log(exp(x1) + exp(x2) + ... + exp(xn))
Not many students had looked back for that formula, I think, but we can see that it matches the higher-order softmax almost perfectly. (map exp args) constructs a list of the exp(xi) values. (apply + ...) adds them up. (log ...) produces the final answer. What about the recursive solution? If we look at how its recursive calls unfold, we see that this procedure computes:
    softmax(x1, softmax(x2, ..., softmax(xn-1, xn)...))
This is an interesting take on the idea of a soft maximum, but it is not what John's generalized definition says, nor is it particularly faithful to the original 2-argument function. How might we roll our own recursive solution that computes the generalized function faithfully? The key is to realize that the function needs to iterate not over the maximizing behavior but the summing behavior. So we might write:
    (define softmax-var
      (lambda args
        (log (accumulate-exps args))))

    (define accumulate-exps
      (lambda (args)
        (if (null? args)
            (+ (exp (car args)) (accumulate-exps (cdr args))))))
This solution turns softmax-var into interface procedure and then uses structural recursion over a flat list of arguments. One advantage of using an interface procedure is that the recursive procedure accumulate-exps no longer has to deal with variable arity, as it receives a list of arguments. It was remarkable to me and some of my students just how close the answers produced by the two student implementations of softmax were, given how different the underlying behaviors are. Often, the answers were identical. When different, they differed only in the 12th or 15th decimal digit. As several blog readers pointed out, softmax is associative, so the two solutions are identical mathematically. The differences in the values of the functions result from the vagaries of floating-point precision. The programmer in me left the exercise impressed by the smoothness of the soft maximum. The idea is resilient across multiple implementations, which makes it seem all the more useful to me. More important, though, this programming exercise led to several interesting discussions with students about programming techniques, higher-order procedures, and the importance of implementing solutions that are faithful to the problem domain. The teacher in me left the exercise pleased. -----