NLP Breakfast 8: Knowledge Distillation

Edouard Mehlman will be presenting two papers about Knowledge Distillation at the next NLP Breakfast.

A simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets.

The main idea behind knowledge distillation is to distill these large (teacher) models into smaller (student) models that are nearly as accurate yet far more production-friendly.

Edouard will explain the main ideas behind Knowledge Distillation (the teacher-student setup, temperature in the softmax activation, dark knowledge, and softened probabilities) and showcase how they can be used either to reduce the inference time of large neural network models or to combine multiple individually trained task-specific models into a single multi-task model.
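As a quick illustration of "temperature" and "softened probabilities": dividing the logits by a temperature T > 1 before the softmax spreads probability mass onto the non-argmax classes, exposing the teacher's dark knowledge about which wrong answers it considers plausible. A minimal numpy sketch (the logits here are made up for illustration):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax with temperature T; higher T yields softer probabilities."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 2.0, -1.0])       # illustrative teacher logits
p_hard = softmax_with_temperature(logits, T=1.0)  # peaked, near one-hot
p_soft = softmax_with_temperature(logits, T=4.0)  # softened targets
# The softened distribution gives the student a richer training signal:
# relative probabilities of the "wrong" classes encode class similarity.
```

During distillation the student is trained to match these softened targets (typically with the same temperature applied to its own logits), often combined with the usual cross-entropy on the gold labels.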


Thursday, July 18th at 9:30am PST



Distilling the Knowledge in a Neural Network (Hinton et al., 2015) is the original paper from Google providing insights about the motivation behind Knowledge Distillation.

This blog post illustrates how distilling BERT in a simple Logistic Regression improves the results on a Sentiment Classification Task.
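The core recipe there is to train a simple student on the teacher's soft outputs instead of (or in addition to) the hard gold labels. A minimal numpy sketch of that idea, with made-up data standing in for sentence features and made-up teacher probabilities (not the blog post's actual pipeline, which uses BERT):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: X plays the role of sentence features, and teacher_probs
# plays the role of BERT's softened sentiment probabilities (T=2 here).
X = rng.normal(size=(200, 8))
true_w = rng.normal(size=8)
teacher_probs = 1.0 / (1.0 + np.exp(-(X @ true_w) / 2.0))

# Train a logistic-regression student by gradient descent on the
# binary cross-entropy against the teacher's SOFT targets.
w = np.zeros(8)
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))       # student probabilities
    grad = X.T @ (p - teacher_probs) / len(X)  # BCE gradient w.r.t. w
    w -= lr * grad
```

Because the soft targets carry more information per example than hard labels, the tiny student can recover much of the teacher's decision boundary from the same data.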

BAM! Born-Again Multi-Task Networks for Natural Language Understanding
is a recent Stanford paper that shows how a multi-task student network can become better than its teachers using the teacher annealing technique.
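The teacher annealing idea can be sketched as interpolating the training target from the teacher's predictions toward the gold labels as training progresses, so the student relies on the teacher early on and on the ground truth by the end. A minimal sketch with made-up numbers (the linear schedule is one simple choice, not necessarily the paper's exact one):

```python
import numpy as np

def annealed_target(gold_onehot, teacher_probs, step, total_steps):
    """Interpolate the training target from teacher predictions toward
    gold labels: lambda goes from 0 (pure teacher) to 1 (pure gold)."""
    lam = step / total_steps
    return lam * gold_onehot + (1.0 - lam) * teacher_probs

gold = np.array([0.0, 1.0, 0.0])          # illustrative gold label
teacher = np.array([0.2, 0.7, 0.1])       # illustrative teacher output
early = annealed_target(gold, teacher, step=0, total_steps=100)
late = annealed_target(gold, teacher, step=100, total_steps=100)
```

Early in training the target equals the teacher's distribution; by the final step it equals the gold label, which is what lets the student eventually surpass its teachers.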