In January, Jeremy Howard launched the newest iteration of his Practical Deep Learning course. Even if you are an experienced ML practitioner, I really recommend giving the lectures a quick watch (hint: watch at 1.25/1.5 speed until you get confused, then rewind and watch at normal playback). There are always a few tips and tricks in each lecture that are worth adding to your toolkit.
An added bonus is that this iteration is the first with his PyTorch-based fast.ai deep learning library. If you haven’t heard of it, the 6000+ stars on GitHub are probably an indication that you should give it a look. Given we are fans of PyTorch and Jeremy at Feedly, it seemed like a great time to watch the videos and start getting our hands on some fast.ai based code.
Lesson 1 focuses on machine setup (key takeaway: get a machine with a GPU!) and convolutional neural networks (conv nets for short). These networks are great for image processing. The lesson gently introduces students to key concepts and the basics of the fast.ai library. After building a basic classifier, he recommends students download their own images and train their own classifier. So that’s what I did!
Basketball vs Tennis
I decided to try to build a classifier that can differentiate between tennis players and basketball players. I downloaded 30 images of tennis players and 30 images of basketball players to play around with. Your data is probably the most important ingredient when developing your deep learning model, so I thought a bit and made a few key choices:
- Tennis and basketball at the professional level are not particularly diverse sports. I deliberately overrepresented players from underrepresented groups in the images so the model would not just pick up that most NBA basketball players are African American and most pro tennis players are not.
- I chose some images with the basketball and tennis ball in the frame and some with no ball. Same with a tennis racket. This was more for my own curiosity to see how the model would react.
- I only picked male players. This was just to save some time. I am almost certain I could have downloaded more images of women players with no loss in accuracy.
- I split the data evenly, 20 images of each class for training, 10 of each for validation. I got a bit creative for the test set, as you’ll see below.
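The 20/10 split per class described above could be scripted roughly as follows. This is a minimal sketch, not the code from the notebook: the function name `split_class_images` and the `train/<class>` and `valid/<class>` folder layout are assumptions for illustration, so check what layout your library actually expects.

```python
import random
import shutil
from pathlib import Path


def split_class_images(src_dir, dst_dir, class_name, n_train=20, n_valid=10, seed=0):
    """Copy one class's images into dst_dir/train/<class> and dst_dir/valid/<class>.

    The folder layout is an assumed convention, not something the post specifies.
    """
    files = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    if len(files) < n_train + n_valid:
        raise ValueError(f"need {n_train + n_valid} images, found {len(files)}")

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(files)
    splits = {
        "train": files[:n_train],
        "valid": files[n_train:n_train + n_valid],
    }

    for split_name, paths in splits.items():
        out_dir = Path(dst_dir) / split_name / class_name
        out_dir.mkdir(parents=True, exist_ok=True)
        for path in paths:
            shutil.copy(path, out_dir / path.name)

    # Report how many images landed in each split.
    return {name: len(paths) for name, paths in splits.items()}
```

Calling it once per class (for example with `"tennis"` and then `"basketball"`) gives 40 training and 20 validation images in total.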
The Amazing Power of Transfer Learning
There is a never-ending stream of amazing work in the field of deep learning, but probably my favorite development has been the advent of transfer learning. This is where very smart people train a very generalized model that consists of 2 parts: one that extracts “features” (properties) of the input data and another that uses these features to do classification. In transfer learning, you simply take this fully trained model, lop off the classification part and then plug in your own bit. Since researchers use a very general data set, it turns out the feature extraction part of these models is still quite applicable to most image classification problems.
This is great for a few reasons:
- Training a conv net from scratch is intense. Without doing anything exotic, it can take weeks.
- Training a conv net from scratch takes a lot of data.
- Conv nets are pretty deep models, so you don’t want to write your own and then spend a lot of time debugging your code.
In short, other people do all the hard work and then you swoop in and reap the rewards. Life is good.
This is why I had some hope of building a good model using only 60 images. A key thing to understand is that this is not a silly or contrived example. Researchers often work with standard and well-developed data sets, but in the industry, you’re often on your own. Building a good data set can actually be one of the biggest barriers to developing a model, so being able to train a sophisticated model using only a few data points is a massive win.
As it turns out, differentiating between tennis and basketball players is as easy for a conv net as it is for human beings. By simply putting my data in the right place and repeating the 3 lines of fast.ai code it takes to run training, the model achieved perfect accuracy within minutes of training:
epoch  trn_loss  val_loss  accuracy
0      1.09754   0.572311  0.7
1      0.802709  0.405828  0.9
2      0.581209  0.294582  0.95
3      0.456703  0.215485  0.95
4      0.371513  0.158434  1.0
You may wonder if the model is really this good or if it just got lucky. After all, we are only working with 60 images, and the model only had to get 20 validation examples correct. The answer is that it really is this good. The magic is the feature extraction layers that we stole. Those stayed unchanged during this process; I only allowed the model to optimize the classification section. This approach lets the model stay general.
It’s easy to test this claim. You can simply try to train an entire conv net from scratch. If you do this, you’ll quickly see training accuracy get to 100% and validation accuracy stay at 50%. You can think of convolutional networks as extremely powerful, but extremely lazy. With only 40 training examples to look at, they’ll simply memorize the right answers and not dig deeper to find any meaningful patterns in the data. It’s only when you have enough data such that memorization becomes impossible that the network is forced to get into gear and do good work. Through transfer learning, we lock in this nice generalization behavior.
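The "freeze the feature extractor, train only the classifier" idea can be illustrated with a toy NumPy sketch. Everything here is an assumption for illustration: `W_FROZEN` is a random stand-in for pretrained weights (a real conv net would supply these), and `extract_features` and `train_head` are hypothetical names. It is not the fast.ai code used in the post.

```python
import numpy as np

# Stand-in for the pretrained feature extractor. These weights are random here
# (an assumption for the sketch) and are never updated during training.
W_FROZEN = np.random.default_rng(0).normal(size=(8, 4))


def extract_features(x):
    """Frozen part: a fixed linear map followed by ReLU."""
    return np.maximum(x @ W_FROZEN, 0.0)


def train_head(x, y, epochs=300, lr=0.05):
    """Trainable part: logistic regression fitted on top of the frozen features."""
    feats = extract_features(x)  # computed once, since the extractor never changes
    w = np.zeros(feats.shape[1])
    b = 0.0
    losses = []
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        losses.append(float(loss))
        err = p - y
        # Gradient steps touch only the head's parameters (w, b).
        w -= lr * feats.T @ err / len(y)
        b -= lr * err.mean()
    return w, b, losses
```

Only `w` and `b` are learned, so there are very few parameters for the 40 training examples to fit, which is the property the paragraph above relies on.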
As a side note, this model could get way, way better. Just running through the data a few more times leads to great improvement in terms of validation loss. And Lesson 2 has a few more steps that make things even better.
Once you’ve validated your model, it’s time to release it into the wild. To do this, I had my wife snap 3 pictures of me. One in a basketball pose, one in a tennis pose, and finally one of me just standing there looking my usual goofy self. I ran my (supposedly) perfect model and it guessed tennis for all 3 pictures! What’s going on??? Maybe I just look tennis-y? I actually only play tennis nowadays. Maybe the model has secretly been spying on me?
After a few deep breaths, I calmed down and ran a simple experiment. I didn’t have a basketball handy, so I just (badly) photoshopped one into the basketball picture. Lo and behold, the model completely reversed its opinion and predicted the image was a basketball player. Order has been restored in the universe.
So this is simply your run of the mill model mystery. With tens or even hundreds of millions of learned parameters, conv nets are difficult to impossible to interpret directly. However, there are some techniques that can be used to generate some interesting visualizations.
A pretty basic thing to do is check the output activations of a convolutional layer. Basically, seeing how excited parts of the model get when they see something interesting:
Here we see a few different outputs for the 2 basketball pose pictures (the ones on the left are with the photoshopped ball). It’s pretty fun to look at these. It’s always amazing to get insights into how your model “thinks”. The top image looks like some kind of skin filter. It really likes my arms and the basketball (which is leather). The next picture seems to like clothing. The 3rd seems to like left edges? The final one is also pretty cool. It also likes clothing, but actually picks up the textures in my shirt.
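To make "output activations" concrete, here is a hand-rolled sketch of what a single convolutional filter computes on a grayscale image. It is a simplified stand-in: the function name `activation_map` and the example edge kernel are assumptions, and in practice you would capture the output of a real layer from your framework rather than compute it by hand.

```python
import numpy as np


def activation_map(image, kernel):
    """Slide a 2-D kernel over a 2-D image (no padding), then apply ReLU.

    Large values in the result mark places where the filter "gets excited".
    """
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)


# A toy image: dark on the left half, bright on the right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A filter that responds to dark-to-bright transitions from left to right.
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

activations = activation_map(image, edge_kernel)
```

The resulting map is zero over the flat regions and positive only in the columns that straddle the edge, which is the same kind of picture as the activation plots discussed above, just for one hand-written filter.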
As you get deeper into the model, things get more, shall we say, abstract:
We could posit some theories as to what’s going on, like the filter in the first row kind of likes circular type things. But honestly, these feel more like guesses. And more importantly, while it’s fun to poke around, these visualizations don’t really get us any closer to knowing what’s going on.
So instead of seeing how the convolutional filters processed the image, can we look at the filters themselves? It turns out the answer is yes. This is especially true of the first layer of the model. This layer processes the input image directly and must have a 3-dimensional volume in order to process each RGB channel. This means we can map each filter back into the RGB space:
This is pretty cool: we can see some filters like to detect vertical edges, others diagonal, and others look a bit blobby. Unfortunately, layers deeper in the model are not necessarily 3-dimensional, so we have to plot each dimension of the filter separately. Even worse, some layers have filters that are only 1 pixel big! These layers still do things (they collapse volumes), but it’s kind of a downer in terms of data viz. Other filters are bigger but still really small, like 3×3. You can visualize them, but it’s hard to know what’s going on:
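The first-layer mapping back into RGB space can be sketched like this. It assumes the weights arrive as an array shaped `(n_filters, 3, k, k)`, which is the layout PyTorch uses for convolution weights; the function name `filters_to_rgb` and the per-filter min-max scaling are illustrative choices, not necessarily what the notebook does.

```python
import numpy as np


def filters_to_rgb(weights):
    """Turn first-layer conv weights into small RGB images.

    weights: array shaped (n_filters, 3, k, k).
    Returns an array shaped (n_filters, k, k, 3) with values scaled to [0, 1],
    which is the layout image-plotting functions usually expect.
    """
    w = np.asarray(weights, dtype=float)
    if w.ndim != 4 or w.shape[1] != 3:
        raise ValueError("expected weights shaped (n_filters, 3, k, k)")

    # Scale each filter independently so its smallest weight maps to 0
    # and its largest to 1.
    lo = w.min(axis=(1, 2, 3), keepdims=True)
    hi = w.max(axis=(1, 2, 3), keepdims=True)
    scaled = (w - lo) / np.maximum(hi - lo, 1e-12)

    # Move the channel axis last: (n, 3, k, k) -> (n, k, k, 3).
    return scaled.transpose(0, 2, 3, 1)
```

Filters from deeper layers have more than 3 input channels, so this function rejects them; those would need one grayscale plot per input channel instead.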
Occlusion is a really clever technique. It’s one of those things that simultaneously is really simple and obvious once you hear about it but also something you never would have thought up on your own.
If we assume that our model “looks” at images as we do, an interesting thing to do might be to cover up different parts of the image and see how the model reacts. We can do this by just slapping an uninteresting gray square over different parts of the image and running inference:
We can be a bit more systematic and slide the gray box over the image, generating a prediction at each position. Then at each pixel position, we can capture the average classification when the pixel is covered up. Then we can plot things as a heatmap:
Now we’re getting somewhere. Here blue areas indicate the model didn’t change its original basketball prediction when those areas were covered up. The “hotter” colors indicate where the occlusion caused the model to shift its prediction towards tennis. We can see that when the basketball is occluded, the model really reacts. This explains why even in my basketball pose, an actual basketball was really important to the model.
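The sliding-box procedure described above can be sketched in a few lines of NumPy. This is a minimal version under stated assumptions, not the notebook's code: `predict` is a hypothetical callable that takes an `(H, W, C)` image and returns the score for the class of interest, and the box size, stride, and gray value are arbitrary defaults.

```python
import numpy as np


def occlusion_heatmap(image, predict, box=4, stride=1, fill=0.5):
    """Average the model's score over every box position that covers each pixel.

    image:   float array shaped (H, W, C)
    predict: callable(image) -> scalar score for the class of interest
    Returns an (H, W) array; pixels never covered by a box are NaN.
    """
    h, w = image.shape[:2]
    total = np.zeros((h, w))
    count = np.zeros((h, w))

    for top in range(0, h - box + 1, stride):
        for left in range(0, w - box + 1, stride):
            occluded = image.copy()
            occluded[top:top + box, left:left + box] = fill  # the gray square
            score = predict(occluded)
            # Credit this score to every pixel hidden by the box.
            total[top:top + box, left:left + box] += score
            count[top:top + box, left:left + box] += 1

    return np.where(count > 0, total / np.maximum(count, 1), np.nan)
```

Low values in the returned map mark regions whose occlusion hurt the score most, so you may want to invert it (or plot the drop from the unoccluded score) to get "hot" colors where the model reacts. A `stride` larger than 1 cuts the number of inference calls, at the cost of a coarser map.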
But But But
I was really careful when constructing my data set to include shots of players without a basketball. So what’s going on? Let’s try running the same process on Giannis:
Aha! Here we see things start getting interesting in 2 basic areas: Giannis’s right shoulder (really his left shoulder, but I’ll speak in terms of image layout) and the letters/numbers on his jersey. So it seems like our model not only likes basketballs, it likes basketball jerseys as well.
Let’s try things on a tennis image as well:
On this image, I was really expecting fireworks around the racket. But the model seems more interested around the right-most shoulder again. So it seems like our model has decided basketballs and right shoulders are critical areas of interest.
Is this OK? On one hand, it’s a bit bizarre. If you ask your buddy what’s the difference between basketball players and tennis players, she’s almost certainly not going to reply “His right shoulder of course!” But on the other hand, I chose all professional player pictures – so differences in attire seem like a perfectly reasonable criterion.
This is where your training data becomes critical. If I expect my deployed model to detect out of shape software engineers pretending to play basketball, I can’t just randomly train the model on NBA players. I might need some shots of people playing in their neighborhood or the local gym as well.
Luckily since we’re utilizing transfer learning, we only need a few examples of each situation to generalize our model in the ways we want.
There is a final technique I’ve seen a few times called deconvolutional networks. This process is similar to back-propagation but adapted in a few key ways to produce really nice visualizations of what each filter is really “thinking”. I didn’t get around to implementing this, but I’ll try to someday. I see the same data viz over and over when I do online classes, so it’d be nice for people to redo these so students don’t have to keep seeing the same ones!
So that’s it. Data visualizations can be pretty useful at times, and they are always fun time wasters you can use to impress your friends. I’ve shared my notebook and data (executed print out here) so you should be able to generate your own data viz pretty easily. The code should essentially just work on any typical conv net architecture, just tweak some paths and settings. Everything runs pretty quickly even in a non-GPU environment except for the occlusion images. You can introduce a stride > 1 to speed those up. Enjoy!