
## Tricks of the Trade: LogSumExp

There are a lot of neat little tricks in Machine Learning to make things work better. They can do many different things: make a network train faster, improve performance, etc. Today I’ll discuss LogSumExp, which is a pattern that comes up quite a bit in Machine Learning. First let’s define the expression:
$$LogSumExp(x_1…x_n) = \log\big( \sum_{i=1}^{n} e^{x_i} \big)$$
When would we see such a thing? Well, one common place is calculating the cross entropy loss of the softmax function. If that sounded like a bunch of gobbledygook: 1. get used to it, there’s a bunch of crazily named stuff in ML and 2. just realize it’s not that complicated. Follow that link to the excellent Stanford cs231n class notes for a good explanation, or just realize for the purposes of this post that the softmax function looks like this:
$$\frac{e^{x_j}}{\sum_{i=1}^{n} e^{x_i}}$$
where the $x_j$ in the numerator is one of the values (one of the $x_i$s) in the denominator. So what this is doing is essentially exponentiating a few values and then normalizing so that the sum over all possible $x_j$ values is 1, as required to produce a valid probability distribution.
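To make the formula concrete, here’s a minimal sketch of that softmax definition in plain Python (the function name `softmax_naive` is mine; this is the direct translation of the formula, not a production implementation):

```python
import math

def softmax_naive(xs):
    """Directly translate the softmax formula: exponentiate each value,
    then divide by the sum so the outputs form a probability distribution."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax_naive([1.0, 2.0, 3.0]))  # three positive values summing to 1
```

Note the outputs always sum to 1 and preserve the ordering of the inputs, which is exactly what we want from a probability distribution over scores.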

So you can think of the softmax function as just a non-linear way to take any set of numbers and transform them into a probability distribution. And for the cross entropy bit, just accept that it involves taking the log of this function. This ends up producing the LogSumExp pattern since:
\begin{align}\log\left(\frac{e^{x_j}}{\sum_{i=1}^{n} e^{x_i}}\right) &= \log(e^{x_j}) \:-\: \log\left(\sum_{i=1}^{n} e^{x_i}\right) \\ &= x_j \:-\: \log\left(\sum_{i=1}^{n} e^{x_i}\right) & (1)\end{align}
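You can sanity-check equation (1) numerically: computing the log of the softmax directly and computing $x_j$ minus LogSumExp give the same answer (for inputs small enough that nothing overflows yet):

```python
import math

xs = [0.5, 1.5, 2.5]
j = 1

# Left side: log of the softmax value for x_j, computed directly.
lhs = math.log(math.exp(xs[j]) / sum(math.exp(x) for x in xs))

# Right side: x_j minus LogSumExp, as in equation (1).
rhs = xs[j] - math.log(sum(math.exp(x) for x in xs))

print(lhs, rhs)  # the two values agree
```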

It may seem a bit mysterious as to why this is a good way to produce a probability distribution, but just take it as an article of faith for now.

### Numerical Stability

Now for why LogSumExp is a thing. First, in pure mathematics, it’s not a thing. You don’t have to treat LogSumExp expressions specially at all. But when we cross over into running math on computers, it does become a thing. The reason is based in how computers represent numbers. Computers use a fixed number of bits to represent numbers. This works fine almost all of the time, but sometimes it leads to errors since it’s impossible to accurately represent an infinite set of numbers with a fixed number of bits.

To illustrate the problem, let’s take 2 examples for our $x_i$ sequence of numbers: {1000, 1000, 1000} and {-1000, -1000, -1000}. Due to my amazing mathematical ability, I know that feeding either of these sequences into the softmax function will yield a probability distribution of {1/3, 1/3, 1/3} and the log of 1/3 is a reasonable negative number. Now let’s try to calculate one of the terms of the summation in python:
```python
>>> import math
>>> math.e**1000
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: (34, 'Result too large')
```
Whoops. Maybe we’ll have better luck with -1000:
```python
>>> math.e**-1000
0.0
```
That doesn’t look right either. So we’ve run into some numerical stability problems even with seemingly reasonable input values.

### The Workaround

Luckily people have found a nice way to minimize these effects by relying on the fact that the product of exponentiations is equivalent to the exponentiation of the sum:
$$e^a \cdot e^b = e^{a+b}$$
and the logarithm of a product is equivalent to the sum of the logarithms:
$$\log(a \cdot b) = \log(a) + \log(b)$$
Let’s use these rules to start manipulating the LogSumExp expression.
\begin{align} LogSumExp(x_1…x_n) &= \log\big( \sum_{i=1}^{n} e^{x_i} \big) \\ &= \log\big( \sum_{i=1}^{n} e^{x_i - c}e^{c} \big) \\ &= \log\big( e^{c} \sum_{i=1}^{n} e^{x_i - c} \big) \\ &= \log\big( \sum_{i=1}^{n} e^{x_i - c} \big) + \log(e^{c}) \\ &= \log\big( \sum_{i=1}^{n} e^{x_i - c} \big) + c & (2)\\ \end{align}

Ok! So first we introduced a constant $c$ into the expression (line 2) and used the exponentiation rule. Since $c$ is a constant, we can factor it out of the sum (line 3) and then use the log rule (line 4). Finally, log and exp are inverse functions, so those 2 operations just cancel out to produce $c$. Critically, we’ve been able to create a term that doesn’t involve a log or exp function. Now all that’s left is to pick a good value for $c$ that works in all cases. It turns out $\max(x_1…x_n)$ works really well.
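Equation (2) with $c = \max(x_1…x_n)$ translates into just a few lines of Python (here as a hypothetical `logsumexp` helper, a sketch of the standard trick rather than any particular library’s implementation):

```python
import math

def logsumexp(xs):
    """Stable LogSumExp using the shift c = max(xs) from equation (2).

    Subtracting the max makes every exponent <= 0, so math.exp can
    never overflow; adding c back outside the log restores the value.
    """
    c = max(xs)
    return c + math.log(sum(math.exp(x - c) for x in xs))

# Both of the troublesome examples from above now work:
print(logsumexp([1000, 1000, 1000]))     # 1000 + log(3)
print(logsumexp([-1000, -1000, -1000]))  # -1000 + log(3)
```

With the shift, the largest exponent is exactly 0 (so $e^0 = 1$ contributes to the sum), which also guarantees the argument of the log is at least 1 and the log itself never sees 0.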

To convince ourselves of this, let’s construct a new expression for log softmax by plugging equation 2 into equation 1:
\begin{align} \log(Softmax(x_j, x_1…x_n)) &= x_j \:-\: LogSumExp(x_1…x_n) \\ &= x_j \:-\: \log\big( \sum_{i=1}^{n} e^{x_i – c} \big) \:-\: c \end{align}
and use this to calculate values for the 2 examples above. For {1000, 1000, 1000}, $c$ will be 1000 and $e^{x_i - c}$ will always be 1 since $x_i - c$ is always zero. So we’ll get:
\begin{align} \log(Softmax(1000, \left[1000,1000,1000\right])) &= 1000 \:-\: \log(3) \:-\: 1000 \\ &= -\log(3)\end{align}
$\log(3)$ is a very reasonable number that computers have no problem calculating. So that example worked great. Hopefully it’s clear that {-1000,-1000,-1000} will also work fine.
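Here’s the combined log-softmax expression as a small sketch (again, the function name `log_softmax` is mine), checking it reproduces the $-\log(3)$ result we just derived by hand:

```python
import math

def log_softmax(j, xs):
    """Stable log-softmax: x_j - LogSumExp(xs), with the shift c = max(xs)."""
    c = max(xs)
    lse = c + math.log(sum(math.exp(x - c) for x in xs))
    return xs[j] - lse

print(log_softmax(0, [1000, 1000, 1000]))     # -log(3), about -1.0986
print(log_softmax(0, [-1000, -1000, -1000]))  # also -log(3)
```

Both extreme inputs now give the same finite, sensible answer instead of an OverflowError or a bogus log of zero.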

### The Takeaway

By thinking through a few examples, we can reason about what will happen in general:

• If none of the $x_i$ values would cause any stability issues, the “naive” version of LogSumExp would work fine. But the “improved” version also works.
• If at least one of the $x_i$ values is huge, the naive version bombs out. The improved version does not. For the other $x_i$ values that are similarly huge, we get a good calculation. For the $x_i$s that are not huge, their contributions to the sum are essentially approximated as zero.
• For large negative numbers, the signs get flipped and things work the same way.
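The second bullet’s approximation behavior is easy to see with a mixed-magnitude input. Below is a small sketch using the same `logsumexp` shift from equation (2): after subtracting the max, the small values underflow to 0, so the result collapses to approximately the max itself:

```python
import math

def logsumexp(xs):
    # Shift by the max (equation 2) so no exponent can overflow.
    c = max(xs)
    return c + math.log(sum(math.exp(x - c) for x in xs))

# One huge value dominates: e^(3-1000) and e^(-5-1000) underflow to 0,
# so the sum inside the log is just 1 and the result is the max.
print(logsumexp([1000, 3, -5]))  # approximately 1000.0
```

That’s the graceful-degradation story: the small terms are silently dropped rather than crashing the computation, which is usually an acceptable trade since they were negligible anyway.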

So while things aren’t perfect, we get some pretty reasonable behavior most of the time and nothing ever blows up. I’ve created a simple python example where you can play around with this to convince yourself that things actually work fine.

So that’s a wrap on LogSumExp! It’s a neat little trick that’s actually pretty easy to understand once the mechanics are deconstructed. Once you know about it and the general numerical stability problem, it should demystify some of the documentation in libraries and source code.

To cement this in your mind (and get some math practice), I would wait a while and then try to work out the math yourself. Also think through various examples in your head and reason about what should happen. Then run my code (or rewrite the code yourself) and confirm your intuition.

If you enjoyed reading this, subscribe in Feedly to never miss another post.