Tricks of the Trade: LogSumExp

There are a lot of neat little tricks in Machine Learning that make things work better: some speed up training, others improve accuracy or keep computations numerically stable. Today I’ll discuss LogSumExp, a pattern that comes up quite a bit in Machine Learning. First, let’s define the expression:
$$LogSumExp(x_1…x_n) = \log\big( \sum_{i=1}^{n} e^{x_i} \big)$$
When would we see such a thing? Well, one common place is calculating the cross entropy loss of the softmax function. If that sounded like a bunch of gobbledygook: 1. get used to it, there’s a bunch of crazily named stuff in ML, and 2. just realize it’s not that complicated. The excellent Stanford cs231n class notes have a good explanation, or just accept for the purposes of this post that the softmax function looks like this:
$$\frac{e^{x_j}}{\sum_{i=1}^{n} e^{x_i}}$$
where the $x_j$ in the numerator is one of the values (one of the $x_i$s) in the denominator. So what this is doing is essentially exponentiating a few values and then normalizing them so that the sum over all possible $x_j$ values is 1, as is required to produce a valid probability distribution.
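
As a concrete illustration, here is a minimal, naive sketch of that computation in plain Python (the function name softmax is just for this post):

import math

def softmax(xs):
    # Exponentiate every value, then normalize so the results sum to 1.
    exps = [math.e ** x for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1.0, 2.0, 3.0]))  # roughly [0.09, 0.245, 0.665], which sums to 1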

So you can think of the softmax function as just a non-linear way to take any set of numbers and transform them into a probability distribution. And for the cross entropy bit, just accept that it involves taking the log of this function. This ends up producing the LogSumExp pattern since:
$$\begin{align}\log\left(\frac{e^{x_j}}{\sum_{i=1}^{n} e^{x_i}}\right) &= \log(e^{x_j}) \:-\: \log\left(\sum_{i=1}^{n} e^{x_i}\right) \\ &= x_j \:-\: \log\left(\sum_{i=1}^{n} e^{x_i}\right) & (1)\end{align}$$

It may seem a bit mysterious as to why this is a good way to produce a probability distribution, but just take it as an article of faith for now.

Numerical Stability

Now for why LogSumExp is a thing. First, in pure mathematics, it’s not a thing. You don’t have to treat LogSumExp expressions specially at all. But when we cross over into running math on computers, it does become a thing. The reason lies in how computers represent numbers: they use a fixed number of bits. This works fine almost all of the time, but sometimes it leads to errors, since it’s impossible to accurately represent an infinite set of numbers with a fixed number of bits.

To illustrate the problem, let’s take 2 examples for our $x_i$ sequence of numbers: {1000, 1000, 1000} and {-1000, -1000, -1000}. Due to my amazing mathematical ability, I know that feeding either of these sequences into the softmax function will yield a probability distribution of {1/3, 1/3, 1/3}, and the log of 1/3 is a perfectly reasonable negative number. Now let’s try to calculate one of the terms of the summation in Python:

>>> import math
>>> math.e**1000
Traceback (most recent call last):
File "", line 1, in
OverflowError: (34, 'Result too large')

Whoops. Maybe we’ll have better luck with -1000:

>>> math.e**-1000
0.0

That doesn’t look right either. So we’ve run into some numerical stability problems even with seemingly reasonable input values.
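
To see how this bites the full softmax calculation, here’s roughly what happens if we feed our two example sequences into the naive softmax sketch from above:

>>> softmax([1000, 1000, 1000])
Traceback (most recent call last):
  ...
OverflowError: (34, 'Result too large')
>>> softmax([-1000, -1000, -1000])
Traceback (most recent call last):
  ...
ZeroDivisionError: float division by zero

The first case overflows just like math.e**1000 did; in the second, every term underflows to 0.0 and we end up dividing by zero. Either way, no probability distribution for us.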

The Workaround

Luckily people have found a nice way to minimize these effects by relying on the fact that the product of exponentiations is equivalent to the exponentiation of the sum:
$$e^a \cdot e^b = e^{a+b}$$
and the logarithm of a product is equivalent to the sum of the logarithms:
$$\log(a \cdot b) = \log(a) + \log(b)$$
Let’s use these rules to start manipulating the LogSumExp expression.
$$\begin{align}
LogSumExp(x_1…x_n) &= \log\big( \sum_{i=1}^{n} e^{x_i} \big) \\
&= \log\big( \sum_{i=1}^{n} e^{x_i – c}e^{c} \big) \\
&= \log\big( e^{c} \sum_{i=1}^{n} e^{x_i – c} \big) \\
&= \log\big( \sum_{i=1}^{n} e^{x_i – c} \big) + \log(e^{c}) \\
&= \log\big( \sum_{i=1}^{n} e^{x_i – c} \big) + c & (2)\\
\end{align}$$

Ok! So first we introduced a constant $c$ into the expression (line 2) and used the exponentiation rule. Since $e^c$ is a constant, we can factor it out of the sum (line 3) and then use the log rule (line 4). Finally, log and exp are inverse functions, so those 2 operations just cancel out to produce $c$ (line 5). Critically, we’ve been able to create a term, $c$, that doesn’t involve a log or exp function at all. Now all that’s left is to pick a good value for $c$ that works in all cases. It turns out $\max(x_1…x_n)$ works really well.
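
Translated into code, equation 2 is only a couple of lines. Here’s a minimal sketch (the function name is mine; libraries such as SciPy ship their own version as scipy.special.logsumexp):

import math

def logsumexp(xs):
    # Shift everything by the max (the constant c from equation 2) so the largest
    # exponent we ever compute is e**0 == 1, which can never overflow.
    c = max(xs)
    return math.log(sum(math.exp(x - c) for x in xs)) + c

print(logsumexp([1000, 1000, 1000]))     # roughly 1001.1, i.e. 1000 + log(3)
print(logsumexp([-1000, -1000, -1000]))  # roughly -998.9, i.e. -1000 + log(3)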

To convince ourselves of this, let’s construct a new expression for log softmax by plugging equation 2 into equation 1:
$$\begin{align}
\log(Softmax(x_j, x_1…x_n)) &= x_j \:-\: LogSumExp(x_1…x_n) \\
&= x_j \:-\: \log\big( \sum_{i=1}^{n} e^{x_i – c} \big) \:-\: c
\end{align}$$
and use this to calculate values for the 2 examples above. For {1000, 1000, 1000}, $c$ will be 1000 and $e^{x_i - c}$ will always be 1 since $x_i - c$ is always zero, so we’ll get:
$$\begin{align} \log(Softmax(1000, \left[1000,1000,1000\right])) &= 1000 \:-\: \log(3) \:-\: 1000 \\ &= -\log(3)\end{align}$$
$\log(3)$ is a very reasonable number that computers have no problem calculating. So that example worked great. Hopefully it’s clear that {-1000, -1000, -1000} will also work fine.
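
Here’s a quick sketch that checks both examples using the stable formula (again, the function name is just for illustration):

import math

def log_softmax(x_j, xs):
    # Equation 1 with the stable LogSumExp from equation 2 plugged in.
    c = max(xs)
    return x_j - math.log(sum(math.exp(x - c) for x in xs)) - c

print(log_softmax(1000, [1000, 1000, 1000]))      # roughly -1.0986, i.e. -log(3)
print(log_softmax(-1000, [-1000, -1000, -1000]))  # same value, with no overflow or underflow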

The Takeaway

By thinking through a few examples, we can reason about what will happen in general:

  • If none of the $x_i$ values would cause any stability issues, the “naive” version of LogSumExp works fine. But the “improved” version also works.
  • If at least one of the $x_i$ values is huge, the naive version bombs out, but the improved version does not. For the other $x_i$ values that are similarly huge, we get a good calculation. For the $x_i$s that are not huge, their exponentials are essentially approximated as zero (there’s a small check of this after the list).
  • For large negative numbers, the same logic applies with the signs flipped, and things work the same way.
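
To make that middle case concrete, here’s a small check using the logsumexp sketch from above (the specific numbers are just illustrative):

print(logsumexp([1000, 1000, 3]))
# roughly 1000.693, i.e. 1000 + log(2): exp(3 - 1000) underflows to 0.0, so the 3 is
# treated as if it contributed nothing, which is a fine approximation since its true
# contribution to the sum is only e**-997.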

So while things aren’t perfect, we get some pretty reasonable behavior most of the time and nothing ever blows up. I’ve created a simple Python example where you can play around with this to convince yourself that things actually work fine.

So that’s a wrap on LogSumExp! It’s a neat little trick that’s actually pretty easy to understand once the mechanics are deconstructed. Once you know about it and the general numerical stability problem, it should demystify some of the documentation in libraries and source code.

To cement this in your mind (and get some math practice), I would wait a while and then try to work out the math yourself. Also think through various examples in your head and reason about what should happen. Then run my code (or rewrite the code yourself) and confirm your intuition.

Getting Started With Machine Learning

Hello! Welcome to the Machine Learning section of the Feedly Blog. We’ve been doing more and more ML work here at the company and thought it would be interesting to jot down some of our thoughts. My background is in software engineering, so a lot of posts will be about how to approach ML from an outsider’s perspective, aka keeping up with the cool kids.

By the way, we are building out our ML team here, please drop us a note if you’re passionate about NLP. Anyways, on to the first post!


I originally went to college thinking I was going to work in computer hardware, maybe ending up at Intel or some company like that. But when I got to Carnegie Mellon, I found the computer science classes to be much more interesting and, not coincidentally, I seemed to do better in those classes. I really enjoyed almost every CS class I took. In fact, all but one: Intro to Machine Learning! I was really interested going into the course, but unfortunately the professor seemed uninterested in teaching it and was just not very good. So machine learning dropped off my radar for a good long while.

But a couple of years ago, I noticed machine learning really gaining traction and my curiosity was piqued all over again. This time, I started with a MOOC by Andrew Ng, who is a fantastic professor. The difference was night and day. I was immediately hooked and started scouring the web for more classes I could take. Here are some tips and advice I’ve learned along the way.

Is it a Good Idea to Hop on the ML Bandwagon?

There’s no question Machine Learning is here to stay. There’s been a prolonged buzz about the field for a while now, and having followed developments closely, I can say there is substance behind the hype. There are some problems that machines are just better at solving than humans.

But that doesn’t make it the right fit for everyone. Working in machine learning is quite a bit different from other software engineering fields. It’s more research-y, more speculative. If you like to be able to plan out your work piece by piece and then have everything nicely tied up in a bow after x weeks, maybe it’s not a great fit. But if you enjoy working with data, continuously learning new techniques, and math (really), it might be a great career move.

How long will it take to get up to speed?

There are so many answers to this question. The first one that comes to mind is “forever”. Machine Learning is quite broad and moves incredibly fast. If, like me, you happen to need sleep, you probably won’t be able to keep up with every development in the field. But another, more optimistic answer might be 4 months at 10 hours/week. That, for example, should give you enough time to get through both parts of the excellent fast.ai courses.

This is not a trivial commitment as it’s probably going to be on top of your ongoing work and life commitments. But I can personally attest that it is possible, and not really that hard if you are willing to put the work in.

What are some good courses?

This really depends on how you like to learn. I personally love machine learning because of the elegant combination of so many fields of mathematics and computer science: probability, linear algebra, calculus, optimization, and much more. So I was naturally drawn to academic courses.

A terrific academic option is Stanford’s CS231n course. The videos I watched were given by Andrej Karpathy, who is an excellent lecturer. The course assignments were also well structured and doable remotely. Although the course is mostly about image problems and convolutional networks, they do start “from scratch” and also cover both feed forward and recurrent networks.

If you enjoy a more practical approach, the fast.ai courses are well done. There is no pretense here. Jeremy Howard approaches everything from a very grounded, systematic perspective and has designed the course so anyone with a moderately technical background can participate. Plus they’ve built up a nice community on their forums.

The aforementioned Andrew Ng has a new Coursera course sequence up (the course I took is quite outdated at this point). I haven’t personally tried it, but I’m sure there’s a lot of good stuff there. I would assume that everything there is also taught from a more practical perspective, but you can infer some of the math going on behind the scenes.

My recommendation would be to try a couple courses and pick the one that most captures your attention. But I would encourage you to eventually complete at least one practical course and one theoretical one. They complement each other quite nicely. To understand papers (and be warned: you will need to read academic papers), the academic courses will help you get acclimatized to the verbiage. To do projects, the practical courses will provide intuition on the dozens of decisions you have to make in any ML project.

If you need a math refresher or are looking for bonus points, MIT has a pair of great courses. A good grasp of probability is absolutely a must if you want to do any ML. The MIT class by John Tsitsiklis is amazingly well taught. The way the professor breaks down complex problems step by step until the answer just seems obvious is pure artistry.

The linear algebra class is also a fun one. The professor here is also very good and has a unique style. This one is not strictly necessary though; for most ML tasks you can get by with just understanding matrix multiplication.

What if I don’t know how to code?

Learn how to.

Most ML work is done in Python, which fortunately is pretty easy to pick up. And you don’t really need to be a world class programmer to do most ML work. But I would still do a quick online course before doing any ML specific work. Having to learn coding and machine learning concepts (not to mention re-learning a bunch of math you’ve probably forgotten) all at once is a recipe for disaster. Give yourself a chance and pace yourself.

I’ve got a basic grasp on things, now what?

Well, now it’s time to start modeling! There are generally 2 ways to go: find a project at work (or on your own) or enter a Kaggle competition. This depends on your particular situation, but I would recommend the Kaggle option. The main reasons are:

  1. The problem is defined. Structuring a real-life ML problem properly can sometimes be tricky. With Kaggle, this isn’t an issue.
  2. Similarly, building a data set can involve several hard-to-diagnose pitfalls. A Kaggle competition will provide you with the data.
  3. With Kaggle, you get a built-in community all working on the same thing. If you get stuck or need a little guidance, there’s a place to go.

On the other hand, if you have a problem at work that is tailor-made for an ML solution (image classification, for example), then maybe a work project is a quick way to impress your coworkers and convince your boss to let you invest more time in Machine Learning.

So if you have been thinking about digging into Machine Learning, take the plunge! One of the best things about Machine Learning is that people are really generous with their time and knowledge. Once you get started you’ll find a great support system online to help you along the way.

What Goes Down Better Come Up a.k.a. Adventures in HBase Diagnostics

Earlier this year, the feedly cloud had a rough patch: API requests and feed updates started slowing down, eventually reaching the point where we experienced outages for a few days during our busiest time of the day (weekday mornings). For a cloud based company, being down for any period of time is soul-crushing, never mind for a few mornings in a row. This led to a lot of frustration, late nights, and general questioning of the order of the universe. But with some persistence we managed to get through it and figure out the problem. We thought it might be some combination of instructive and interesting for others to hear, so we’re sharing the story.

But first we’d especially like to thank Lars George. Lars is a central figure in the HBase community who has now founded a consulting company, OpenCore. It’s essentially impossible to find a great consultant in these situations, but through a bit of luck Lars happened to be in the Bay Area during this period and stopped by our offices for a few days to help out. His deep understanding of the HBase source code, as well as his past experience, was pivotal in diagnosing the problem.

The Cloud Distilled

Boiled down to basics, the feedly cloud does 2 things: it downloads new articles from websites (which we call “polling”) and serves API requests so users can read articles via our website, mobile apps, and even third party applications like Reeder. This sounds simple, and in some sense it is. Where it gets complex is the scale at which our system operates. On the polling side, there are about 40 million sources producing over 1000 articles every second. On the API side, we have about 10 million users generating over 200 million API requests per day. That’s a whole lotta bytes flowing through the system.

 

[feedly Cloud diagram: the main components of the cloud]

Due to this amount of data, the feedly cloud has grown significantly over the last 3 years: crawling more feeds, serving more users, and archiving more historical content – to allow users to search, go back in time, and dig deeper into topics.

Another source of complexity is openness. As a co-founder, this is one aspect of feedly that I really love. We allow essentially any website to connect with any user, and we allow 3rd party apps to use our API in their applications. As an engineer, this can cause lots of headaches. Sourcing article data from other websites leads to all kinds of strange edge cases (50MB articles, weird/incorrect character encodings, etc.), and 3rd party apps can generate strange or inefficient access patterns.

Both of these factors combine to make performance problems particularly hard to diagnose.

What Happened

We experienced degraded performance during the week of April 10th and more severe outages the following week. It was fairly easy to narrow the problem down to our database (HBase). In fact, in the weeks prior, we had noticed occasional ‘blips’ in performance and, during those blips, a slowdown in database operations, albeit on a much smaller scale.

[Artist’s depiction of the feedly cloud raining down destruction. The feedly cloud was not so happy that week, my friends.]

Fortunately, our ops team had already been collecting HBase metrics into a graphing system. I can’t emphasize enough how important this was. Without any historical information, we’d have been at a total loss as to what had changed in the system. After poking around the many, many, many HBase metrics, we found something that looked off (the “fsSyncLatencyAvgTime” metric). Better still, these anomalies roughly lined up with our down times. This led us to come up with a few theories:

  1. We were writing larger values. This could occur if user or article data changed somehow or due to a buggy code change.
  2. We were writing more data overall. Perhaps some new features we built were overwhelming HBase.
  3. We had some kind of hardware problem.
  4. We hit some kind of system limit in HBase and things were slowing down due to the amount or structure of our data.

Unfortunately, all of these theories are extremely hard to prove or disprove, and each team member had their own personal favorite. This is where Lars’s experience really helped. After reviewing the graphs, he dismissed the “system limit” theory. Our cluster is much smaller than those at some other companies out there, and the configuration seemed sane. His feeling was that it was a hardware/networking issue, but there was no clear indicator.

Theory 1: Writing Larger Values

This theory was kind of a long shot. The idea is that perhaps every so often we were writing really big values and that caused HBase to have issues. We added more metrics (this is a common theme when performance problems hit) to track when outlier read/write sizes occurred, e.g. if we read or wrote a value larger than 5MB. After examining the charts, we saw that large reads/writes kind of lined up with the slowdowns, but not really. To eliminate this as a possibility, we added an option to reject any large read/writes in our code. This wouldn’t be a final solution (all you oddballs that subscribe to 20,000 sources wouldn’t be able to access your feedly) but it let us confirm that this was not the root cause, as we continued to have problems.

Theory 2: Writing More Data

This theory was perhaps more plausible than theory 1. The idea was that, as feedly grew, we had eventually reached a point where our request volume was too much for our database cluster to handle. We again added some metrics to track overall data read and write rates to HBase. Here again, things kind of lined up, but not really. But we noticed we had high write volume on our analytics data table. This table contains a lot of valuable information for us, but we decided to disable all read/write activity to it as it’s not system critical.

After deploying the change, things got much better! Hour-long outages were reduced to a few small blips. But this didn’t sit well with us. Our cluster is pretty sizable and should be able to handle our request load. Also, the rate of increase in downtime was way faster than our increase in storage used or request rate. So we left the analytics table disabled to keep performance manageable, but continued the hunt.

Theory 3: Hardware Problem

As a software engineer, this is always my favorite theory. It generally means I’ve done nothing wrong and don’t have to do anything to fix the problem. Unfortunately, hardware fails in a myriad of oddball ways, so it can be very hard to convince everyone this is the cause and, more importantly, to identify the failing piece of equipment. This ended up being the root cause, but it was particularly hard to pin down in this case.

How we Found the Problem and Fixed it

Here again, Lars’s experience helped us out. He recommended isolating the HBase code where the problem surfaced and then creating a reproducible test by running it in a standalone manner. After about a day of work, I was able to build a test we could run on our database machines, independent of our production data. And it reproduced the problem! When debugging intermittent issues, having a reproducible test case is 90% of the battle. I enabled all the database log messages during the test and noticed that 2 machines, dn1 and dn3, were always involved in operations when slowdowns occurred.

I then extended the test to separately simulate the networking and disk write behavior the HBase code performed. This let us narrow the problem down to a network issue. We removed the 2 nodes from our production cluster and things immediately got better. Our ops team found out the problem was actually in a network cable or patch panel. This was an especially insidious failure as it didn’t manifest itself in any machine logs. Incidentally, a network issue was actually Lars’s original guess as to the problem!

[Chart: sync latency. The metric in question when we fixed the problem. Notice the immediate change in behavior.]

Lessons Learned

The important thing when dealing with performance problems (outside of, you know, fixing them) is trying to learn what you did well and what you could have done better.

Things we did well:

  • Had a good metrics collection/graphing system in place. This should go without saying, but lots of times these types of projects can get delayed or deferred.
  • Got expert help. There are lots of great resources out there. If you can’t find a great consultant, lots of people are generally willing to help on message boards or other places.
  • Stayed focused/methodical. It can get crazy when things are going wrong, but having a scientific process and a logical way to attack the problem keeps things manageable.
  • Dug into our tech stack. We rely almost exclusively on open source software. This enabled us to really understand and debug what was going on.

Things we could have done better:

  • Communicated. While Lars suggested networking, I initially discounted it since the problem manifested everywhere in our system, not just on one machine. Had we talked it through more, I would have learned that there are some shared resources specific to data center build-outs.
  • Gone more quickly to the hardware possibility. We did a lot of Google searching for the symptoms we were seeing in our system, but there was not much out there. That’s kind of an indicator that something weird is happening in your environment, and a hardware issue is pretty likely.
  • Attacked the problem earlier. As I mentioned, we had seen small blips prior to the outages and even done some (mostly unproductive) diagnostic work. Unfortunately, not giving this top priority came back to bite us.

[Image: The feedly cloud today, hardy and hale.]

But there’s a happy ending to this story. As this post hopefully demonstrates, we’ve learned a lot and come out stronger: the feedly cloud is faster than ever and we have a much better understanding of the inner workings of HBase. We realize speed is very important to our users and will continue to invest in making the feedly Cloud faster and more resilient. Speaking of resilience, though we had a small downturn in feedly pro signups in April, we are back to normal. This speaks to what a great community we have!