Tricks of the Trade: LogSumExp

There are a lot of neat little tricks in Machine Learning to make things work better. They can do many different things: make a network train faster, improve performance, etc. Today I’ll discuss LogSumExp, which is a pattern that comes up quite a bit in Machine Learning. First let’s define the expression:
$$LogSumExp(x_1…x_n) = \log\big( \sum_{i=1}^{n} e^{x_i} \big)$$
When would we see such a thing? Well, one common place is calculating the cross entropy loss of the softmax function. If that sounded like a bunch of gobbledygook: 1. get used to it, there’s a bunch of crazily named stuff in ML and 2. just realize it’s not that complicated. Follow that link to the excellent Stanford cs231n class notes for a good explanation, or just realize for the purposes of this post that the softmax function looks like this:
$$\frac{e^{x_j}}{\sum_{i=1}^{n} e^{x_i}}$$
where the $x_j$ in the numerator is one of the values (one of the $x_i$s) in the denominator. So what this is doing is essentially exponentiating a few values and then normalizing so the sum over all possible $x_j$ values is 1, as is required to produce a valid probability distribution.
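To make that concrete, here is a minimal sketch of a naive softmax in plain Python (the function name softmax_naive is my own, not from any library):

import math

def softmax_naive(xs):
    # Exponentiate every value, then normalize so the results sum to 1.
    exps = [math.e ** x for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax_naive([1.0, 2.0, 3.0]))  # roughly [0.09, 0.245, 0.665], which sums to 1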

So you can think of the softmax function as just a non-linear way to take any set of numbers and transform them into a probability distribution. And for the cross entropy bit, just accept that it involves taking the log of this function. This ends up producing the LogSumExp pattern since:
$$\begin{align}\log\left(\frac{e^{x_j}}{\sum_{i=1}^{n} e^{x_i}}\right) &= \log(e^{x_j}) \:-\: \log\left(\sum_{i=1}^{n} e^{x_i}\right) \\ &= x_j \:-\: \log\left(\sum_{i=1}^{n} e^{x_i}\right) & (1)\end{align}$$

It may seem a bit mysterious as to why this is a good way to produce a probability distribution, but just take it as an article of faith for now.
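If you’d rather check equation (1) numerically than take the algebra on faith, here is a quick sanity check with small, well-behaved inputs (all the variable names are mine):

import math

xs = [1.0, 2.0, 3.0]
j = 2

# Left side of (1): the log of the softmax probability for x_j.
lhs = math.log(math.e ** xs[j] / sum(math.e ** x for x in xs))

# Right side of (1): x_j minus LogSumExp over all the values.
rhs = xs[j] - math.log(sum(math.e ** x for x in xs))

print(lhs, rhs)  # both roughly -0.4076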

Numerical Stability

Now for why LogSumExp is a thing. First, in pure mathematics, it’s not a thing. You don’t have to treat LogSumExp expressions specially at all. But when we cross over into running math on computers, it does become a thing. The reason lies in how computers represent numbers: they use a fixed number of bits. This works fine almost all of the time, but sometimes it leads to errors, since it’s impossible to accurately represent an infinite set of numbers with a fixed number of bits.

To illustrate the problem, let’s take 2 examples for our $x_i$ sequence of numbers: {1000, 1000, 1000} and {-1000, -1000, -1000}. Due to my amazing mathematical ability, I know that feeding either of these sequences into the softmax function will yield a probability distribution of {1/3, 1/3, 1/3} and the log of 1/3 is a reasonable negative number. Now let’s try to calculate one of the terms of the summation in python:

>>> import math
>>> math.e**1000
Traceback (most recent call last):
File "", line 1, in
OverflowError: (34, 'Result too large')

Whoops. Maybe we’ll have better luck with -1000:

>>> math.e**-1000
0.0

That doesn’t look right either. So we’ve run into some numerical stability problems even with seemingly reasonable input values.
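To see the failure end to end, here is a small sketch that pushes both example sequences through a naive softmax (throwaway code of my own, not a library call):

import math

for xs in ([1000, 1000, 1000], [-1000, -1000, -1000]):
    try:
        exps = [math.e ** x for x in xs]
        probs = [e / sum(exps) for e in exps]
        print(xs, "->", probs)
    except (OverflowError, ZeroDivisionError) as err:
        # 1000 overflows during exponentiation; -1000 underflows to 0.0,
        # so the normalization step divides by zero.
        print(xs, "-> failed:", err)

Neither input makes it through, even though the true answer is a perfectly ordinary distribution of {1/3, 1/3, 1/3}.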

The Workaround

Luckily people have found a nice way to minimize these effects by relying on the fact that the product of exponentiations is equivalent to the exponentiation of the sum:
$$e^a \cdot e^b = e^{a+b}$$
and the logarithm of a product is equivalent to the sum of the logarithms:
$$\log(a \cdot b) = \log(a) + \log(b)$$
Let’s use these rules to start manipulating the LogSumExp expression.
$$\begin{align}
LogSumExp(x_1…x_n) &= \log\big( \sum_{i=1}^{n} e^{x_i} \big) \\
&= \log\big( \sum_{i=1}^{n} e^{x_i - c}e^{c} \big) \\
&= \log\big( e^{c} \sum_{i=1}^{n} e^{x_i - c} \big) \\
&= \log\big( \sum_{i=1}^{n} e^{x_i - c} \big) + \log(e^{c}) \\
&= \log\big( \sum_{i=1}^{n} e^{x_i - c} \big) + c & (2)\\
\end{align}$$

Ok! So first we introduced a constant $c$ into the expression (line 2) and used the exponentiation rule. Since $c$ is a constant, we can factor it out of the sum (line 3) and then use the log rule (line 4). Finally, log and exp are inverse functions, so those 2 operations just cancel out to produce $c$. Critically, we’ve been able to create a term that doesn’t involve a log or exp function. Now all that’s left is to pick a good value for $c$ that works in all cases. It turns out $\max(x_1…x_n)$ works really well.
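Here is equation (2) translated into a small Python function (the name log_sum_exp is mine; in real code you would normally reach for a battle-tested version such as scipy.special.logsumexp or torch.logsumexp):

import math

def log_sum_exp(xs):
    # Shift by the max before exponentiating, then add the max back (equation 2).
    c = max(xs)
    return math.log(sum(math.e ** (x - c) for x in xs)) + c

print(log_sum_exp([1000, 1000, 1000]))     # 1000 + log(3), about 1001.0986
print(log_sum_exp([-1000, -1000, -1000]))  # -1000 + log(3), about -998.9014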

To convince ourselves of this, let’s construct a new expression for log softmax by plugging equation 2 into equation 1:
$$\begin{align}
\log(Softmax(x_j, x_1…x_n)) &= x_j \:-\: LogSumExp(x_1…x_n) \\
&= x_j \:-\: \log\big( \sum_{i=1}^{n} e^{x_i - c} \big) \:-\: c
\end{align}$$
and use this to calculate values for the 2 examples above. For {1000, 1000, 1000}, $c$ will be 1000 and $e^{x_i - c}$ will always be 1 as $x_i - c$ is always zero, so we’ll get:
$$\begin{align} \log(Softmax(1000, \left[1000,1000,1000\right])) &= 1000 \:-\: \log(3) \:-\: 1000 \\ &= -\log(3)\end{align}$$
log(3) is a very reasonable number that computers have no problem calculating. So that example worked great. Hopefully it’s clear that {-1000,-1000,-1000} will also work fine.
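The same worked example in code, using a little log_softmax helper built on the shifted sum (again my own naming, not a library API):

import math

def log_softmax(xs, j):
    # x_j minus the shifted LogSumExp, exactly as in the expression above.
    c = max(xs)
    return xs[j] - math.log(sum(math.e ** (x - c) for x in xs)) - c

print(log_softmax([1000, 1000, 1000], 0))      # about -1.0986
print(log_softmax([-1000, -1000, -1000], 0))   # about -1.0986
print(-math.log(3))                            # about -1.0986, i.e. -log(3)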

The Takeaway

By thinking through a few examples, we can reason about what will happen in general:

  • If none of the $x_i$ values would cause any stability issues, the “naive” version of LogSumExp would work fine. But the “improved” version also works.
  • If at least one of the $x_i$ values is huge, the naive version bombs out; the improved version does not. For the other $x_i$ values that are similarly huge, we get a good calculation. For the $x_i$s that are not huge, their shifted terms $e^{x_i - c}$ underflow, so we essentially approximate their contributions as zero (see the sketch after this list).
  • For large negative numbers, the signs get flipped and things work the same way.
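Here is a minimal illustration of the second and third bullets (re-using my log_sum_exp sketch from above, redefined so the snippet stands alone):

import math

def log_sum_exp(xs):
    c = max(xs)
    return math.log(sum(math.e ** (x - c) for x in xs)) + c

# The huge value dominates: e^(3 - 1000) underflows to zero inside the sum,
# so the smaller value's contribution is effectively approximated as zero.
print(log_sum_exp([1000, 3]))       # 1000.0
# Large negative inputs behave symmetrically.
print(log_sum_exp([-1000, -1003]))  # about -999.95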

So while things aren’t perfect, we get some pretty reasonable behavior most of the time and nothing ever blows up. I’ve created a simple python example where you can play around with this to convince yourself that things actually work fine.

So that’s a wrap on LogSumExp! It’s a neat little trick that’s actually pretty easy to understand once the mechanics are deconstructed. Once you know about it and the general numerical stability problem, it should demystify some of the documentation in libraries and source code.

To cement this in your mind (and get some math practice), I would wait a while and then try to work out the math yourself. Also think through various examples in your head and reason about what should happen. Then run my code (or rewrite the code yourself) and confirm your intuition.

Getting Started With Machine Learning

Hello! Welcome to the Machine Learning section of the Feedly Blog. We’ve been doing more and more ML work here in the company and thought it would be interesting to jot down some of our thoughts. My background is in software engineering, so a lot of posts will be about how to approach ML from an outsider’s perspective, aka keeping up with the cool kids.

By the way, we are building out our ML team here, so please drop us a note if you’re passionate about NLP. Anyways, on to the first post!


I originally went to college thinking I was going to work in computer hardware, maybe ending up at Intel or some company like that. But when I got to Carnegie Mellon, I found the computer science classes to be much more interesting and, not coincidentally, I seemed to do better in those classes. I really enjoyed almost every CS class I took. In fact, all but one: Intro to Machine Learning! I was really interested going into the course, but unfortunately the professor seemed thoroughly uninterested in teaching it and was just not very good. So machine learning dropped off my radar for a good long while.

But a couple years ago, I noticed machine learning really gaining traction and my curiosity was piqued all over again. This time, I started with a MOOC by Andrew Ng, who is a fantastic professor. The difference was night and day. I was immediately hooked and started scouring the web for more classes I could take. Here’s some advice and tips I’ve learned along the way.

Is it a Good Idea to Hop on the ML Bandwagon?

There’s no question Machine Learning is here to stay. There’s been a prolonged buzz about the field for a while now and, having followed developments closely, I can say there is substance behind the hype. There are some problems that machines are just better at solving than humans.

But that doesn’t make it the right fit for everyone. Working in machine learning is quite a bit different from other software engineering fields. It’s more research-y, more speculative. If you like to be able to plan out your work piece by piece and then have everything nicely tied up in a bow after x weeks, maybe it’s not a great fit. But if you enjoy working with data, continuously learning new techniques, and math (really), it might be a great career move.

How long will it take to get up to speed?

There are so many answers to this question. The first one that comes to mind is “forever”. Machine Learning is quite broad and moves incredibly fast. If, like me, you happen to need sleep, you probably won’t be able to keep up with every development in the field. But another, more optimistic answer might be 4 months at 10 hours/week. That, for example, should give you enough time to get through both parts of the excellent fast.ai courses.

This is not a trivial commitment as it’s probably going to be on top of your ongoing work and life commitments. But I can personally attest that it is possible, and not really that hard if you are willing to put the work in.

What are some good courses?

This really depends on how you like to learn. I personally love machine learning because of the elegant combination of so many fields of mathematics and computer science: probability, linear algebra, calculus, optimization, and much more. So I was naturally drawn to academic courses.

A terrific academic course is Stanford’s CS231n course. The videos I watched were done by Andrej Karpathy, who is a terrific lecturer. The course assignments were also well structured and doable remotely. Although the course is mostly about image problems and convolutional networks, they do start “from scratch” and also cover both feed forward and recurrent networks.

If you enjoy a more practical approach, the fast.ai courses are well done. There is no pretense here. Jeremy Howard approaches everything from a very grounded, systematic perspective and has designed the course so anyone with a moderately technical background can participate. Plus they’ve built up a nice community on their forums.

The aforementioned Andrew Ng has a new Coursera course sequence up (the course I took is quite outdated at this point). I haven’t personally tried it, but I’m sure there’s a lot of good stuff there. I would assume that everything there is also taught from a more practical perspective, but you can infer some of the math going on behind the scenes.

My recommendation would be to try a couple courses and pick the one that most captures your attention. But I would encourage you to eventually complete at least one practical course and one theoretical one. They complement each other quite nicely. To understand papers (and be warned: you will need to read academic papers), the academic courses will help you get acclimatized to the verbiage. To do projects, the practical courses will provide intuition on the dozens of decisions you have to make in any ML project.

If you need a math refresher or are looking for bonus points, MIT has a pair of great courses. A good grasp of probability is absolutely a must if you want to do any ML. The MIT class by John Tsitsiklis is amazingly well taught. The way the professor breaks down complex problems step by step until the answer just seems obvious is pure artistry.

The linear algebra class is also a fun one. The professor here is also very good and has a unique style. This one is not really necessary though; for most ML tasks you can get by with just understanding matrix multiplication.

What if I don’t know how to code?

Learn how to.

Most ML work is done in Python, which fortunately is pretty easy to pick up. And you don’t really need to be a world class programmer to do most ML work. But I would still do a quick online course before doing any ML specific work. Having to learn coding and machine learning concepts (not to mention re-learning a bunch of math you’ve probably forgotten) all at once is a recipe for disaster. Give yourself a chance and pace yourself.

I’ve got a basic grasp on things, now what?

Well, now it’s time to start modeling! There are generally 2 ways to go: find a project at work/on your own or find a Kaggle competition. This depends on your particular situation, but I would recommend the Kaggle option. The main reasons are:

  1. The problem is defined. Structuring a real-life ML problem properly can sometimes be tricky. With Kaggle, this isn’t an issue.
  2. Similarly, building a data set can sometimes involve several hard-to-diagnose pitfalls. A Kaggle competition will provide you with data.
  3. With Kaggle, you’ll get a built in community all working on the same thing. If you get stuck or need a little guidance, there’s a place to go.

On the other hand, if you have a problem at work that is tailor made for a ML solution (image classification for example) then maybe a work project is a quick way to impress your coworkers and convince your boss to let you invest more time on Machine Learning.

So if you have been thinking about digging into Machine Learning, take the plunge! One of the best things about Machine Learning is that people are really generous with their time and knowledge. Once you get started you’ll find a great support system online to help you along the way.