If you’re involved with machine learning, or happen to be an NVIDIA investor, you must have heard that GPUs are critical to the field. But for me, GPUs have always been a bit of a mystery. I had a pretty high-level understanding of GPU == fast in my head, but not much more than that.
And honestly, that gets you pretty far. I started doing hands-on ML work with Keras, which is an API on top of Google’s TensorFlow library. Keras is a beautifully designed API that has made the very practical decision that most things should be simple, and hard things should be possible.
Because of this, I didn’t really need to learn anything about the GPU. I just ran a few magic OS commands and a couple magic lines of code and — boom! — massive model training speedup achieved. And I was perfectly happy with this situation.
However, on our latest project here at feedly, we’ve decided to try PyTorch, basically because it’s been getting buzzier and buzzier. There have also been some good data points in its favor. The new (highly recommended) fast.ai course sequence switched from Keras to PyTorch. Lots of recent Kaggle competitions have also been won by PyTorch models.
So we dug in and found that PyTorch makes all things possible through clear and consistent APIs. It’s more flexible than Keras, so the trade-off is that easy things get a bit harder but hard things get a bit easier. If you’re experimenting with new or unusual types of models (or you happen to be a control freak), PyTorch is definitely a better fit.
When the time came to GPU-accelerate my PyTorch model and I googled for the magic GPU -> on line of code, I found out it didn’t exist! True to form, PyTorch makes this a bit harder than Keras, but provides clear APIs for doing it. The upshot of all this is that I had to bite the bullet and build a mental model of how the GPU is actually being used to speed up model training.
Why GPUs are Fast
Models generally have many, many parameters. For example, the popular VGG image classification model has about 140 million parameters divided over 16 layers! When running inference (predictions), you need to pass your input data (image) through each layer, usually multiplying that data by the layer parameters. During training, you have to also tweak each parameter a little bit to better fit the data. That’s a lot of arithmetic!
CPUs are good at doing a few things really fast. This is usually fine: there is enough branching (if a user does this, do that) and other sequential constraints that massive parallelism isn’t really possible. GPUs are good at doing a lot of things “slow”. Since they were originally built for graphics workloads, they expect to do a bunch of stuff at once (think converting all the pixels of an image to grayscale). So there’s a tradeoff here, and for ML the GPU wins big time because these huge arithmetic operations can be done in parallel.
To make things concrete, my MacBook has a CPU that runs at 3.1 GHz and has 4 cores. An NVIDIA K80 GPU has almost 5000 cores, albeit running at a much slower 562 MHz. Although this is not really a fair comparison, you can see that the K80 has a clock speed that is about 6 times slower, but is 1250 times more parallel.
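The back-of-the-envelope comparison above can be written out as a few lines of arithmetic. This is only a sketch using the rounded figures quoted in the text. The variable names are made up for illustration, and the final "naive speedup" assumes perfectly parallel work and treats a GPU core as equivalent to a CPU core, which it is not.

```python
# Rounded figures from the text: a 4-core 3.1 GHz laptop CPU
# and a K80 with roughly 5000 cores at 562 MHz.
cpu_clock_mhz, cpu_cores = 3100, 4
gpu_clock_mhz, gpu_cores = 562, 5000

# How much slower each GPU core is clocked than a CPU core.
clock_ratio = cpu_clock_mhz / gpu_clock_mhz

# How many more cores the GPU has.
core_ratio = gpu_cores / cpu_cores

# Naive upper bound for perfectly parallel work (an assumption, not a benchmark).
naive_speedup = core_ratio / clock_ratio

print(f"clock ratio: {clock_ratio:.1f}x slower")
print(f"core ratio: {core_ratio:.0f}x more parallel")
print(f"naive speedup: {naive_speedup:.0f}x")
```

Real speedups are far smaller than this naive number, since memory bandwidth, data transfer, and per-core capability all get in the way.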
How to Think About GPUs
Instead of the GPU -> on line of code, PyTorch has “CUDA” tensors. CUDA is NVIDIA’s platform for running general-purpose computations on its GPUs. Essentially, PyTorch requires you to declare what you want to place on the GPU and then you can do operations as usual. So I thought, let’s try placing a layer on the GPU:
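    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Model(nn.Module):
        def __init__(self):
            super(Model, self).__init__()
            self.hidden = nn.Linear(784, 50)
            self.final = nn.Linear(50, 10)

        def forward(self, features):
            # Flatten each image to a 784-element vector before the first layer.
            x = self.hidden(features.float().view(len(features), -1))
            x = self.final(x)
            return F.log_softmax(x, dim=1)

    ...

    model = Model()
    model.hidden.cuda()   # move only the hidden layer's parameters to the GPU
    model.forward(batch)  # batch is still a CPU tensor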
I ran the code and immediately got an error at the hidden layer calculation:
Expected object of type Variable[torch.cuda.FloatTensor] but found type Variable[torch.FloatTensor] for argument #1 'mat1'
Why? I knew immediately it had something to do with the .cuda() call I added, but I wasn’t sure what exactly. After a bit of thinking about how GPUs are supposed to speed things up, I realized, “Of course it doesn’t work, one tensor is on the GPU and another is still in main memory!” Everything kind of snapped into place. PyTorch allows you to allocate tensors in GPU memory and then do operations on those tensors utilizing the GPU. But where do the results of those operations go? Let’s try another example:
    import torch

    # Allocate two (uninitialized) 20x20 tensors directly in GPU memory.
    t1 = torch.cuda.FloatTensor(20, 20)
    t2 = torch.cuda.FloatTensor(20, 20)
    t3 = t1.matmul(t2)
    print(type(t3))

    <class 'torch.cuda.FloatTensor'>
It’s another tensor in GPU memory! After more trial and error I found I had to change my way of thinking. Originally I thought of Memory, CPU, and the GPU all jumbled together:
What I realized is that I needed to think of things like this:
Essentially the CPU/main memory and GPU/GPU memory each live in their own little universes. Coming from a software engineering background, I started to think of GPU operations as a REST API. When you’re using a REST API, the real cost is sending the data back and forth. Doing stuff locally is fast (or is as fast as it can be) as is stuff done on a remote server. But what you want to avoid is a lot of data shipping back and forth as that is pure overhead.
So carrying that analogy forward, we can see that of course it makes sense that the PyTorch matmul result is a GPU tensor. That makes it easy to do further manipulations on the GPU without shipping the data to main memory and then back to the GPU again. So if we want to use the GPU, we really want all of our parameters on the GPU, as those will be used over and over again to produce predictions in the forward pass and then updated in the backward pass. Each batch of features will have to be shipped into GPU memory. But then the intermediate and final results (e.g. the output of the hidden layer) can just live in GPU memory. All we need to do is keep sending commands to the GPU to tell it how to manipulate the data and the weights.
So in the API analogy, we just make two “heavy” requests (starred in the drawing above), one to initialize the weights and one to get the final weights after training. But we make potentially millions of lightweight requests in between those 2 to train the model.
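To make this concrete, here is a minimal sketch of the earlier model with everything placed on one device. It is not the code from our project. It assumes a reasonably recent PyTorch where .to(device) is available, the random batch is a hypothetical stand-in for a batch of MNIST images, and it falls back to the CPU when no GPU is present so it can run anywhere.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Use the GPU when one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(784, 50)
        self.final = nn.Linear(50, 10)

    def forward(self, features):
        x = self.hidden(features.float().view(len(features), -1))
        x = self.final(x)
        return F.log_softmax(x, dim=1)

# The "heavy" request: ship every parameter to the device once.
model = Model().to(device)

# Hypothetical stand-in for one batch: 64 images of 28x28 pixels, created in main memory.
batch = torch.rand(64, 28, 28)

# Each batch is shipped to the device; intermediate and final results stay there.
output = model(batch.to(device))
print(output.shape, output.device)
```

Moving the whole model rather than a single layer avoids the type mismatch from before, because the parameters and the inputs now live in the same memory.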
GPU Performance Gains are Real and they are Spectacular
So what kind of speedup is possible? PyTorch has a nice little MNIST example we can use. Running this for 10 epochs took 153 seconds using the CPU only and 83 seconds using the GPU. And we could theorize that larger models could experience even bigger gains. Not bad, not bad at all.
This was all great. After a little thinking and some terrible drawings, I understood GPUs much better and the extra coding required is not bad at all. But was my thinking correct? The best way to answer this question was to build an experiment. I wanted to prove to myself that shipping data is the “slow part” of the workflow. So I tried the following 3 things:
1. Do a 200x200 matrix multiply in numpy, a highly optimized CPU linear algebra library.
2. Do a 200x200 matrix multiply on the GPU using PyTorch cuda tensors.
3. Do a 200x200 matrix multiply on the GPU using PyTorch cuda tensors, copying the data back and forth every time.
As expected, the GPU-only operations were faster, this time by about 6x. Interestingly, options 1 and 3 took almost exactly the same amount of time. The efficiency of the GPU operations was balanced almost exactly by the inefficiency of the data shipping!
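For anyone who wants to try something similar, here is a rough sketch of how the three variants could be timed. It is not the original benchmark code. The repeat count, the helper names, and the CPU fallback are all assumptions, and the numbers you get will depend heavily on your hardware.

```python
import time

import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
N, REPEATS = 200, 1000  # REPEATS is an arbitrary choice for this sketch

def sync():
    # CUDA kernels launch asynchronously; wait for them before reading the clock.
    if device.type == "cuda":
        torch.cuda.synchronize()

def time_numpy():
    a = np.random.rand(N, N).astype(np.float32)
    b = np.random.rand(N, N).astype(np.float32)
    start = time.perf_counter()
    for _ in range(REPEATS):
        a @ b
    return time.perf_counter() - start

def time_device_resident():
    # Operands are created on the device and the results stay there.
    a = torch.rand(N, N, device=device)
    b = torch.rand(N, N, device=device)
    sync()
    start = time.perf_counter()
    for _ in range(REPEATS):
        a @ b
    sync()
    return time.perf_counter() - start

def time_with_copies():
    # Operands start in main memory and the result is copied back every time.
    a = torch.rand(N, N)
    b = torch.rand(N, N)
    sync()
    start = time.perf_counter()
    for _ in range(REPEATS):
        result = a.to(device) @ b.to(device)
        result.cpu()
    sync()
    return time.perf_counter() - start

timings = {
    "numpy": time_numpy(),
    "device resident": time_device_resident(),
    "device with copies": time_with_copies(),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s")
```

The explicit synchronize calls matter: without them the timer can stop before the GPU has actually finished its work, which makes the device-resident case look faster than it really is.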