We are very excited to receive Emmanuel Ameisen in our Feedly office this Thursday for the first NLP breakfast of 2020!
The discussion will revolve around the intuitions that we should build before (and after) building models.
Emmanuel is a talented data scientist with solid industry experience as a machine learning engineer, and in the education space, where he led Insight Data Science’s AI program and directed 100+ ML projects.
This recent NLP paper from the Stanford NLP group presents an original approach to the Natural Language Inference problem, i.e., the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”.
The paper introduces a new dataset called SCI, for “Stanford Corpus of Implicatives”, in which each sentence is associated with a piece of metadata referred to as its “signature”. The signature of an implicative indicates the relation between the main clause and the complement clause. It is often determined by a single verb in the sentence, like manage or fail, and at other times by phrasal constructions like meet one’s duty or waste a chance.
A signature is composed of two symbols, each symbol being +, -, or o depending on the entailment relation. The first symbol corresponds to the sentence entailment in a positive environment, whereas the second corresponds to the sentence entailment in a negative environment. Here are two examples:
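To make the signature notation concrete, here is a small sketch: the signatures for manage and fail follow the standard implicative analysis, but the dictionary and helper below are illustrative, not code from the paper.

```python
# Entailment signatures for two classic implicative verbs.
# Each signature has two symbols: the entailment of the complement clause
# in a positive environment, then in a negated one.
# "+" = complement is entailed, "-" = its negation is entailed, "o" = neutral.
SIGNATURES = {
    "manage": ("+", "-"),  # "she managed to escape"       => she escaped
                           # "she didn't manage to escape" => she didn't escape
    "fail":   ("-", "+"),  # "he failed to escape"         => he didn't escape
                           # "he didn't fail to escape"    => he escaped
}

def complement_entailment(verb: str, negated: bool) -> str:
    """Return the entailment symbol for the complement clause."""
    positive, negative = SIGNATURES[verb]
    return negative if negated else positive
```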
The SCI dataset complements the existing MULTINLI corpus, a crowd-sourced collection of 433k sentence pairs with textual entailment annotations, which is also used for experiments.
The newly introduced RRN models claim to be more modular and adaptable, in the sense that they can learn a variety of specialized modules that play different, or even orthogonal, roles in solving the inference task.
Closed Domain Question Answering (cdQA) is an end-to-end open-source software suite for Question Answering using classical IR methods and Transfer Learning with the pre-trained model BERT (PyTorch version by Hugging Face). It includes a Python package, a front-end interface, and an annotation tool.
Edouard Mehlman will be presenting two papers about Knowledge Distillation at the next NLP Breakfast.
A simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets.
The main idea behind knowledge distillation is to distill these large (teacher) models into smaller yet almost as efficient and more production-friendly (student) models.
Edouard will explain the main ideas behind Knowledge Distillation (teacher-student network, the temperature in softmax activation, dark knowledge, and softened probabilities) and showcase how they can be used to either reduce inference time of large neural network models or combine multiple individually trained tasks into a multi-task model.
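The temperature-scaled softmax and the combined distillation objective can be sketched as follows. This is a minimal NumPy sketch of the Hinton-style loss; the function names and the default alpha/T values are illustrative, not taken from any particular library.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: a higher T 'softens' the distribution,
    exposing the dark knowledge hidden in the teacher's small probabilities."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style distillation: cross-entropy against the hard labels plus
    cross-entropy against the teacher's softened probabilities.
    The T**2 factor keeps gradient magnitudes comparable across temperatures."""
    soft_teacher = softmax(teacher_logits, T)
    soft_student = softmax(student_logits, T)
    soft_loss = -np.sum(soft_teacher * np.log(soft_student + 1e-12), axis=-1)
    hard_probs = softmax(student_logits)
    hard_loss = -np.log(hard_probs[np.arange(len(labels)), labels] + 1e-12)
    return np.mean(alpha * hard_loss + (1 - alpha) * (T ** 2) * soft_loss)
```

The student is trained on this combined loss, so it learns both the ground-truth labels and the teacher's full output distribution.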
This week’s discussion is an overview of progress in language modeling; you can find the live-stream video here.
Language modeling is the task of predicting the next word in a sentence, based on the previous words. We already use it in lots of everyday products, from your phone’s auto-complete to search engines:
Formally, the task consists in estimating a probability distribution over the whole vocabulary, conditioned on the words (or tokens) that we have seen so far. We are predicting the future based on the past, but we can make the Markov assumption that the future only depends on a few items from the past; that is, we condition on a fixed-size window of past words to predict the probabilities for the next word. Using the definition of conditional probability, we get this modified estimation problem:
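Under an n-gram Markov assumption, the estimation problem reads:

```latex
P(w_t \mid w_1, \ldots, w_{t-1})
  \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})
  = \frac{P(w_{t-n+1}, \ldots, w_{t-1}, w_t)}{P(w_{t-n+1}, \ldots, w_{t-1})}
```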
A possible simplification is to use a quotient of counts: we pre-compute counts for all N-grams (chunks of words, characters, or other symbols of size N) and use those counts to approximate probabilities.
This is a really efficient technique at inference time: once the counts have been computed, probabilities are obtained in O(1) time using lookup tables. However, it suffers badly from sparsity: if the numerator corresponds to a sequence of words that has never been seen, the output probability is zero. Worse, if the denominator has never been seen, the probability is undefined! While there are partial fixes for these issues (smoothing, backoff), the method also suffers from a scalability problem with the window size N: the number of unique N-grams grows exponentially with N…
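A count-based model along these lines fits in a few lines; the function names below are illustrative. Note how both failure modes from above show up explicitly.

```python
from collections import Counter

def train_ngram_counts(tokens, n):
    """Pre-compute counts for all n-grams and (n-1)-gram contexts."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return ngrams, contexts

def ngram_prob(ngrams, contexts, context, word):
    """P(word | context) as a quotient of counts; O(1) dictionary lookups.
    Returns 0.0 for an unseen n-gram, and None when the context itself was
    never seen (the undefined-denominator case discussed above)."""
    denom = contexts[tuple(context)]
    if denom == 0:
        return None
    return ngrams[tuple(context) + (word,)] / denom
```

For example, with a bigram model trained on "the cat sat on the mat", P(cat | the) = 1/2, since "the" appears twice but is followed by "cat" only once.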
Neural networks for Language Modeling
To adapt the exact same setup to the neural network domain, we can take the (pre-trained) word embeddings of the words in the window, concatenate them, and use a linear projection layer to output a probability distribution. This solves both the sparsity and scalability issues discussed above. However, we still have a fixed-size window, and concatenating more words would make the input embedding too large. Moreover, the projection matrix W acts on each word vector’s slice completely independently, whereas we would like some parameter sharing between words.
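A forward pass of such a fixed-window model might look like this. This is a NumPy sketch with made-up dimensions and random weights; a real system learns E, W, and U by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, window, hidden = 100, 16, 4, 32      # vocab, embed dim, context size, hidden

E = rng.normal(size=(V, d))                # (pre-trained) word embeddings
W = rng.normal(size=(window * d, hidden))  # projection over the concatenation
U = rng.normal(size=(hidden, V))           # output layer over the vocabulary

def window_lm_probs(context_ids):
    """Fixed-window neural LM: embed the context words, concatenate,
    project, and output a distribution over the whole vocabulary.
    Note that W touches each word's slice with separate parameters:
    there is no weight sharing across positions."""
    x = E[context_ids].reshape(-1)         # concatenation -> (window * d,)
    h = np.tanh(x @ W)                     # hidden layer
    logits = h @ U
    e = np.exp(logits - logits.max())
    return e / e.sum()                     # softmax over the V words
```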
Using Recurrent Neural Networks, we remove the fixed-size context issue, and we can (hope to) learn long-term dependencies. A disadvantage of RNNs, however, is that their inherently sequential nature makes them difficult to parallelize. Also, even with the improvements brought by LSTMs and GRUs, long-term dependencies remain hard to capture.
Language Models for contextualized word embeddings
A limitation of current word embeddings is that they learn embeddings of word types, not of word tokens in context. That is, using word2vec, “jaguar” will have the same embedding in both “I just bought a new Jaguar” and “A Jaguar has attacked them”. A lot of progress has been made in this direction:
In the TagLM system, Peters et al. train a bi-LSTM language model over a huge dataset and use the concatenation of the forward and backward LMs to supplement an NER system with contextualized word representations, extracted from the last layer of the LMs.
The ELMo paper goes a step further, using a combination of all the different layers of the LM to build contextualized word embeddings:
The motivation behind this successful experiment is that lower layers might contain more information useful for low-level NLP tasks (POS tagging, syntactic parsing…), while higher layers are more useful for high-level NLP tasks (sentiment analysis, question answering…).
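The layer combination itself is simple and can be sketched in NumPy; in ELMo the per-layer weights s and the scalar gamma are learned jointly with the downstream task, whereas here they are just function arguments.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def elmo_combine(layer_reps, s, gamma=1.0):
    """ELMo-style contextual embedding: a task-specific weighted sum over
    ALL biLM layers (not just the top one), with softmax-normalized
    weights `s` and a scaling scalar `gamma`."""
    w = softmax(np.asarray(s, dtype=float))          # one weight per layer
    stacked = np.stack(layer_reps)                   # (L, seq_len, dim)
    return gamma * np.tensordot(w, stacked, axes=1)  # (seq_len, dim)
```

A downstream task that only needs syntax can learn to put most of its weight on the lower layers, and vice versa.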
Focusing on the text classification task, the ULMFiT paper from Howard and Ruder leverages the same transfer learning ideas:
In this setup, a language model is first trained on a big corpus of text. Then, given a classification dataset, the LM is carefully fine-tuned to the data: a different learning rate is set for each layer, and layers are gradually unfrozen so that the LM fine-tunes in a top-down fashion.
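These two tricks can be sketched as follows. The 2.6 decay factor comes from the ULMFiT paper; the function names and the one-layer-per-epoch schedule are illustrative simplifications.

```python
def discriminative_lrs(n_layers, base_lr=1e-3, decay=2.6):
    """ULMFiT-style discriminative fine-tuning: the top layer gets the base
    learning rate and each lower layer's rate is divided by a factor of 2.6,
    so general low-level features change slowly during fine-tuning."""
    return [base_lr / (decay ** (n_layers - 1 - i)) for i in range(n_layers)]

def unfreeze_schedule(n_layers):
    """Gradual unfreezing: at epoch e, the layers from the top down to
    n_layers - 1 - e are trainable; everything below stays frozen."""
    return [list(range(n_layers - 1 - e, n_layers)) for e in range(n_layers)]
```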
An important issue with RNN-based models is that they rely on sequential computation and are therefore hard to parallelize. Moreover, even with the LSTM/GRU improvements, they still struggle to learn long-term dependencies. Attention layers are often the solution for keeping long-term dependencies. But if attention layers keep these dependencies, and the sequentiality of RNNs prevents parallelization, why not just use attention layers and forget about sequential models? This is the motivation behind the Transformer architecture:
Using self-attention layers means that words attend to each other. Moreover, to let words attend to each other in a flexible way, the model builds several attention heads in parallel: this is multi-head attention.
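The core computation of a single attention head can be sketched in NumPy. This is an illustrative sketch; real implementations add masking, batching, and a learned output projection.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: each word builds a query, a key,
    and a value; the attention weights say how much every word attends to
    every other word. A multi-head layer runs several of these in parallel
    with different projection matrices and concatenates the results."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len)
    weights = softmax(scores)                # each row sums to 1
    return weights @ V, weights
```

Because every pair of positions is compared in one matrix multiplication, the whole sequence is processed in parallel, unlike an RNN.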
At the end of last year, Google AI published BERT, its transformer-based architecture, with a modification of the language modeling objective: there are now two different training objectives:
Masked language modeling: mask 15% of the tokens and predict them based on the whole text.
Next sentence prediction: given two sentences, predict whether the second one actually follows the first in the original text.
While previous LMs are inherently unidirectional (potentially using a forward and a backward version), BERT claims to bring a bidirectional version of transformers.
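The masking step of the first objective can be sketched as follows. This is a simplified sketch: real BERT additionally replaces some of the selected tokens with random words or leaves them unchanged, a refinement omitted here.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", p=0.15, seed=0):
    """Masked LM objective (simplified): hide ~15% of the tokens and ask the
    model to predict them from the full, bidirectional context.
    Returns the masked sequence and a map from position to original token."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            targets[i] = tok          # what the model must predict
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets
```

Because the prediction targets sit in the middle of the sequence rather than at its end, the model is free to condition on context from both directions, which is exactly the bidirectionality claim above.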
Future of Language Models
Language modeling holds a very powerful promise for pre-training models. It enables models to achieve competitive results in a zero-shot way, that is, without any supervision! This is therefore very useful for the many problems where supervised data is limited.
In the future, we expect language models, combined with multi-task learning, to become universal transfer learning techniques. Ideally, we would like to move from a “narrow expert” paradigm (learning a model that is very good on one dataset, sampled from one task) to a paradigm where a learner can transfer knowledge across different domains, datasets, tasks, and languages. Such a universal learner would provide a narrow expert model with contextualized knowledge and improve its performance.
OpenAI has recently raised the issue of ethics in natural language generation with the release of GPT-2, a language model claimed to be so good that the institution refrained from publishing the full version and the data. This will lead to many interesting discussions about the future of transfer learning in NLP and how we should use it.