NLP Breakfast 13: Building intuitions before building models

We are very excited to welcome Emmanuel Ameisen to our Feedly office this Thursday for the first NLP breakfast of 2020!

The discussion will focus on the intuitions that we should build before (and after) building models.

Emmanuel is a talented data scientist with solid industry experience as a machine learning engineer, as well as experience in education, where he led Insight Data Science’s AI program and directed 100+ ML projects.

His new book Building Machine Learning Powered Applications just made it to the top of the charts for new AI books!


Thursday, January 23rd at 9:30am PST


NLP Breakfast 12: An Improved and Affordable feature-based RNN model for Language Modeling

Over the last few years, pre-trained language modeling methods have yielded substantial improvements in Natural Language Processing tasks.

Most of these improvements came with ELMo, followed by extensive research on Transformer approaches like BERT or XLNet, despite the fact that transformers are expensive to train. Unfortunately, we haven’t seen much improvement on the more affordable ELMo, even though the limitations of this architecture have been discussed in many papers.

Gregory Senay, CSO at xBrain, will present a new architecture that improves on ELMo and reaches higher performance on GLUE tasks for a feature-based model, while keeping the computation cost limited.

This new architecture borrows some improvements from Transformers, along with some tricks to speed up training and improve its quality.

Whether you want to hear more about this new architecture or just want to participate in a nice discussion, feel free to join us!


Thursday, November 14th, 2019 at 9:30am PDT


NLP Breakfast 11: Structuring legal documents with Deep Learning

We are very happy to welcome Pauline Chavallard, Data Science manager at Doctrine, presenting a practical NLP project on structuring legal documents with deep learning.



Court decisions are traditionally long and complex documents. To make things worse, it is not uncommon for a lawyer to only be interested in the operative part of the judgment (for example, the outcome of the trial). In fact, in general, it is pretty standard to be looking for a specific legal aspect, which can quickly feel like looking for a needle in a haystack. As such, our goal was to detect the underlying structure of decisions on Doctrine (i.e. the table of contents) to help users navigate them more easily.

Decisions can be seen as small stories. While humans can understand them because they are naturally context-aware and have some expectations, how should an algorithm operate? To address this challenging issue, we trained a neural network (bi-LSTM with attention) using PyTorch to help us predict a suitable table of contents given a free-text decision. This talk goes into more detail about our methodology and results.


Thursday 17 October 2019 at 9:30am PDT


NLP Breakfast 10: Recursive Routing Networks

Edouard Mehlman will be presenting the Recursive Routing Networks paper from Stanford NLP Group.

This recent NLP paper from the Stanford NLP group presents an original approach to the Natural Language Inference problem, i.e., the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”.

The paper introduces a new dataset called SCI for “Stanford Corpus of Implicatives”, for which each sentence can be associated with a piece of metadata referred to as the “signature”. The signature of an implicative indicates the relation between the main clause and the complement clause. It is often inferred from a single verb in the sentence, like manage and fail, and other times from phrasal constructions like meet one’s duty or waste a chance.

A signature is composed of two symbols, each being +, - or o depending on the entailment relation. The first symbol corresponds to the sentence entailment in a positive environment, whereas the second corresponds to the sentence entailment in a negative environment.

The SCI dataset complements the existing MultiNLI corpus, a crowd-sourced collection of 433k sentence pairs with textual entailment annotations, which is also used for experiments.

The newly introduced RRN models claim to be more modular and adaptable, in the sense that they can learn a variety of specialized modules that play different – or even orthogonal – roles in solving the inference task.


Thursday, September 26th at 9:30am PST


NLP Breakfast 9: BERT for Question Answering Systems

We are thrilled to host Andre Farias, NLP Data Scientist at SouthPigalle, presenting an ambitious NLP project in collaboration with BNP PARIBAS Personal Finance and Telecom Paris.


Closed Domain Question Answering (cdQA) is an end-to-end open-source software suite for Question Answering using classical IR methods and Transfer Learning with the pre-trained model BERT (PyTorch version by HuggingFace). It includes a Python package, a front-end interface, and an annotation tool.


Thursday 8 August at 9:30am PDT


References: link to the project, HuggingFace Pytorch-Transformers library, BERT paper.

NLP Breakfast 8: Knowledge Distillation

Edouard Mehlman will be presenting two papers about Knowledge Distillation at the next NLP Breakfast.

A simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets.

The main idea behind knowledge distillation is to distill these large (teacher) models into smaller (student) models that are almost as effective and more production-friendly.

Edouard will explain the main ideas behind Knowledge Distillation (teacher-student network, the temperature in softmax activation, dark knowledge, and softened probabilities) and showcase how they can be used to either reduce inference time of large neural network models or combine multiple individually trained tasks into a multi-task model.
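To make the role of the temperature more concrete, here is a minimal sketch in plain Python. The logits, the temperature values, and the function names are all assumptions chosen for illustration; they are not taken from the talk or from the papers. It shows how a higher temperature softens the teacher's probabilities, and how a student can be scored against those softened targets.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperatures give softer probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Hypothetical logits for a 3-class problem (illustrative values only)
teacher_logits = [6.0, 2.0, 1.0]
student_logits = [4.0, 2.5, 0.5]

hard = softmax(teacher_logits, temperature=1.0)  # peaked distribution
soft = softmax(teacher_logits, temperature=4.0)  # softened distribution
loss = distillation_loss(student_logits, teacher_logits)

print("T=1:", [round(p, 3) for p in hard])
print("T=4:", [round(p, 3) for p in soft])
print("distillation loss:", round(loss, 4))
```

The softened distribution gives more weight to the non-top classes, which is the "dark knowledge" the student learns from. In practice this term is usually combined with the regular loss on the true labels.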


Thursday, July 18th at 9:30am PST



Distilling the Knowledge in a Neural Network (Hinton 2015) is the original paper from Google providing insights about the motivation of Knowledge Distillation.

This blog post illustrates how distilling BERT into a simple Logistic Regression improves the results on a Sentiment Classification Task.

BAM! Born-Again Multi-Task Networks for Natural Language Understanding is the latest Stanford paper that shows how the Multi-Task Student Network can become better than the Teacher using the annealing technique!

NLP Breakfast 7: XLNet explained

Welcome to the 7th edition of the Feedly NLP Breakfast, an online meetup to discuss everything around NLP!

This time, Stephane Egly will present XLNet, an unsupervised pre-training strategy that improves the state of the art in NLP benchmarks.



Thursday 27 June at 9:30 AM PST


NLP Breakfast 6: Transfer NLP

Welcome to the 6th edition of the Feedly NLP Breakfast, an online meetup to discuss everything NLP!

This time, we will present Transfer NLP, an open-source library we have released to promote reproducible experimentation and make it easy to transfer code and knowledge.

We are hosting the event in Redwood City, but you will be able to join the presentation online:


Thursday 23 May at 9:30 AM PDT


NLP Breakfast 5: Hierarchical Multi-Task Learning (HMTL)

For this edition, we are very grateful to have Victor Sanh, a research scientist at HuggingFace, presenting his AAAI 2019 paper: A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks, co-authored with Thomas Wolf and Sebastian Ruder.


Much effort has been devoted to evaluating whether multi-task learning can be leveraged to learn rich representations that can be used in various Natural Language Processing (NLP) down-stream applications. However, there is still a lack of understanding of the settings in which multi-task learning has a significant effect. In this work, we introduce a hierarchical model trained in a multi-task learning setup on a set of carefully selected semantic tasks. The model is trained in a hierarchical fashion to introduce an inductive bias by supervising a set of low-level tasks at the bottom layers of the model and more complex tasks at the top layers of the model. This model achieves state-of-the-art results on a number of tasks, namely Named Entity Recognition, Entity Mention Detection and Relation Extraction without hand-engineered features or external NLP tools like syntactic parsers. The hierarchical training supervision induces a set of shared semantic representations at lower layers of the model. We show that as we move from the bottom to the top layers of the model, the hidden states of the layers tend to represent more complex semantic information.


Thursday, April 14th at 9:30 AM PDT


NLP Breakfast 4: Graph Neural Networks

This is the 4th edition of the Feedly NLP Breakfast!

We will discuss Graph Neural Networks based on the slides from Stanford’s Network Representation Learning (NRL) group, adapted here. They offer an interesting high-level overview of the evolution of these algorithms, and focus on the Node2Vec and Gated Graph Neural Network algorithms.

In the first part of the presentation, we’ll explain the objective we want to achieve when generating these embeddings, and how to formulate the problem. First, we need to find proper graph similarity metrics, then define a good objective function, and finally use some smart optimization strategy, for instance, negative sampling used in Node2Vec to achieve better results at a lower computational cost.

We’ll then present why a neural network approach is appealing for generating these embeddings, and describe some interesting and successful architectures, e.g. Graph Convolutional Networks and Gated Graph Neural Networks.


Thursday, April 10th, 2019 at 9:30 AM PST


Additional Links: (Original slides) (Node2Vec) (Gated Graph Neural Network)

NLP Breakfast 3: Latest on Text Classification

This week’s NLP breakfast meetup in our Redwood City office will focus on Text Classification at scale.

We found a promising paper built on two current trends in NLP, Language Models and Transfer Learning:

An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models
by Alexandra Chronopoulou, Christos Baziotis, Alexandros Potamianos

The solution proposed caught our interest for a couple of reasons:

  • It performs well on hard tasks like Irony or Sarcasm detection that require Natural Language Understanding
  • It builds on the ideas presented in the ULMFiT paper, but at the same time simplifies the architecture, achieving better results while being more computationally efficient.

Edit: The paper got accepted at the NAACL 2019 conference.



Thursday, March 28, 2019, at 9:30 AM PST


NLP Breakfast 2: The Rise of Language Models

A few weeks ago, we started experimenting with opening up our internal paper discussions via live-streaming. The first NLP breakfast featured a discussion on the paper Accelerating Neural Transformer via an Average Attention Network, available on our NLP Breakfast YouTube channel.

This week’s discussion is an overview of progress in language modeling; you can find the live-stream video here.

Language Modeling

Language modeling is the task of predicting the next word in a sentence, based on the previous words. We already use it in lots of everyday products, from your phone’s auto-complete to search engines.

Formally, the task consists in estimating a probability distribution over the whole vocabulary, conditioned on the words (or tokens) that we have seen so far. We’re predicting the future based on the past, but we can make the Markov assumption that the future only depends on a few items from the past; that is, we condition on a fixed-size window of past words to predict the probabilities for the next word. We then use Bayes’ formula to rewrite this modified estimation problem as a ratio of probabilities.
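As a hedged reconstruction, in notation introduced here (w_t for the word at position t and n for the N-gram size, neither taken from the original figure), the Markov approximation and the resulting ratio can be written as:

```latex
P(w_t \mid w_1, \dots, w_{t-1})
  \approx P(w_t \mid w_{t-n+1}, \dots, w_{t-1})
  = \frac{P(w_{t-n+1}, \dots, w_{t-1}, w_t)}{P(w_{t-n+1}, \dots, w_{t-1})}
```

The numerator is the probability of the full N-gram and the denominator is the probability of its first n-1 words.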

A possible simplification is to use the quotient of counts: we pre-compute counts for all N-grams (chunks of [words, characters or other symbols] of size N) and we use those counts to approximate probabilities.

This is a really efficient technique for inference since, once the counts have been computed, probabilities are output in O(1) time using lookup tables. However, it suffers a lot from sparsity issues: if the numerator is a sequence of words that has never been seen before, the output probability will be zero. More seriously, if the denominator has never been seen, the output probability is undefined! Although there are some small fixes for those issues (smoothing, backoff), this method also suffers from a scalability problem with the window size N: the number of unique N-grams grows exponentially with N.
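The count-based approach and its sparsity problems can be illustrated with a small bigram (N = 2) sketch. The toy corpus and the function name are assumptions made up for this example; a real model would be trained on a much larger corpus and would add smoothing.

```python
from collections import Counter

# Tiny made-up corpus, for illustration only
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

bigram_counts = Counter()   # count(prev, next)
context_counts = Counter()  # count(prev) as a left context
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[(prev, nxt)] += 1
        context_counts[prev] += 1

def bigram_prob(prev, nxt):
    """Estimate P(nxt | prev) as count(prev, nxt) / count(prev)."""
    if context_counts[prev] == 0:
        return None  # denominator never seen: the estimate is undefined
    return bigram_counts[(prev, nxt)] / context_counts[prev]

p_cat = bigram_prob("the", "cat")        # seen bigram
p_unseen = bigram_prob("the", "bird")    # unseen numerator
p_undefined = bigram_prob("bird", "sat") # unseen denominator

print(p_cat, p_unseen, p_undefined)
```

Once the two counters are built, each probability is a couple of dictionary lookups. The unseen bigram gets probability zero and the unseen context has no estimate at all, which are the two sparsity issues described above.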

Neural networks for Language Modeling

To adapt the exact same setup to the neural network domain, we could just take the (pre-trained) word embeddings from the window, concatenate them, and then use a linear projection layer followed by a softmax to output a probability distribution. This solves both the sparsity and scalability issues discussed before. However, we still have a fixed-size window, and concatenating even more words would produce embeddings that are too big. Moreover, the weights of the projection matrix W applied to each word vector are completely independent of one another, whereas we would like some parameter sharing between words.

Using Recurrent Neural Networks, we remove the fixed-size context issue, and we can (hope to) learn long-term dependencies. A disadvantage of RNNs, however, is that their inherently sequential nature makes them difficult to parallelize. Also, even with improvements from LSTMs or GRUs, long-term dependencies are still hard to capture.

Language Models for contextualized word embeddings

A limitation of current word embeddings is that they learn embeddings of word types, and not of word tokens in context. That is, using word2vec, “jaguar” will have the same embedding in both “I just bought a new Jaguar” and “A Jaguar has attacked them”. A lot of progress has been made in this direction.

In the TagLM system, Peters et al. train a bi-LSTM language model over a huge dataset, and use the concatenation of forward and backward LMs to supplement an NER system with contextualized word representations, extracted from the last layer of the LMs.

The ELMo paper goes a step further, by using a combination of all the different layers of the LM to produce contextualized word embeddings.

The motivation behind this successful experiment is that lower layers might contain more information useful for low-level NLP tasks (POS tagging, syntactic parsing…) while higher layers might be more useful for high-level NLP tasks (sentiment analysis, question answering…).

Focusing on the Text Classification task, the ULMFiT paper from Howard and Ruder leverages the same transfer learning ideas.

In this setup, a language model is first trained on a big corpus of text. Then, given a classification dataset, the LM is carefully fine-tuned to the data: different learning rates are set for each layer, and layers are gradually unfrozen, starting from the last layer and moving down, to let the LM fine-tune in a top-down fashion.
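The two fine-tuning ideas above can be sketched without any deep learning library. The function names, the 4-layer setup, the base learning rate, and the decay factor are assumptions picked for illustration; this only computes the schedules and does not train anything.

```python
def discriminative_learning_rates(n_layers, top_lr=0.01, decay=2.6):
    """One learning rate per layer (index 0 = bottom layer).

    The top layer uses `top_lr`; each layer below it uses a rate divided
    by `decay` once more. Both values are illustrative assumptions.
    """
    return [top_lr / (decay ** (n_layers - 1 - i)) for i in range(n_layers)]

def gradual_unfreezing_schedule(n_layers):
    """For each epoch, the indices of the layers that are trainable.

    Epoch 0 trains only the last layer; one more layer is unfrozen
    at every following epoch until the whole model is trainable.
    """
    return [list(range(n_layers - 1 - epoch, n_layers)) for epoch in range(n_layers)]

rates = discriminative_learning_rates(4)
schedule = gradual_unfreezing_schedule(4)

for epoch, trainable in enumerate(schedule):
    print(f"epoch {epoch}: trainable layers {trainable}")
print("learning rates (bottom to top):", [round(r, 5) for r in rates])
```

Lower layers get smaller learning rates because they are assumed to hold more general knowledge that should change less during fine-tuning.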


An important issue for RNN-based models is that they rely on sequential modeling and are therefore hard to parallelize. Moreover, they still struggle to learn long-term dependencies, even with the improvements from LSTMs and GRUs. Attention layers are often the solution for keeping long-term dependencies. But if attention layers keep these dependencies, and the sequential nature of RNNs prevents parallelization, why not just use attention layers and forget about sequential models? This is the motivation behind the transformer architecture.

Using self-attention layers means that words will attend to each other. Moreover, to let words attend to each other in a flexible way, the model builds several attention heads: this is multi-head attention.
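As a rough illustration of words attending to each other, here is a single-head scaled dot-product attention sketch in plain Python. It is a simplification under stated assumptions: the token vectors are made up, and the same vectors are used directly as queries, keys, and values, whereas a real transformer first applies learned projection matrices and runs several such heads in parallel.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(queries, keys, values):
    """Scaled dot-product attention: every position attends to all positions."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical 2-dimensional token vectors (illustrative values)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)

for i, row in enumerate(weights):
    print(f"token {i} attends with weights", [round(w, 3) for w in row])
```

Each row of weights sums to one, and no step depends on the previous position's output, which is why all positions can be computed in parallel.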


At the end of last year, Google AI published BERT, its transformer-based architecture, which modifies the language modeling objective and now uses two different training objectives:

  • Masked language modeling: mask 15% of tokens and predict them based on the whole text.
  • Next sentence prediction: predict whether one sentence actually follows another in the text.
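The masking step of the first objective can be sketched as follows. This is a simplified illustration: the function name, the example sentence, and the fixed seed are assumptions, and every selected token is replaced by a mask symbol, whereas the BERT paper sometimes keeps the original token or substitutes a random one.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace roughly `mask_rate` of the tokens with a mask symbol.

    Returns the masked sequence and a dict {position: original token}
    holding the targets the model would have to predict.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK_TOKEN
    return masked, targets

sentence = ("the quick brown fox jumps over the lazy dog "
            "near the quiet river bank today").split()
masked, targets = mask_tokens(sentence)

print(" ".join(masked))
print(targets)
```

Because the masked positions can be anywhere in the sentence, the model has to use both the left and the right context to recover them.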

While previous LMs are inherently unidirectional (potentially using a forward and a backward version), BERT claims to bring a bidirectional version of transformers.

Future of Language Models

Language models hold very powerful promise for pre-training. They enable models to achieve competitive results in a zero-shot way, that is, without any task-specific supervision! This is therefore very useful for the many problems where supervised data is limited.

In the future, we expect language models, combined with multi-task learning, to be universal transfer learning techniques. Ideally, we would like to move from a “narrow expert” paradigm (learning a model that is very good on one dataset, sampled from one task) to a paradigm where a learner can transfer knowledge across different domains, different datasets, different tasks, and different languages. Such a universal learner would provide a narrow expert model with contextualized knowledge and improve its performance.

OpenAI has recently raised the issue of ethics in natural language generation with the release of GPT-2, a language model claimed to be so good that the institution refrained from publishing the full version and data. This will lead to lots of interesting discussions to define what the future of transfer learning in NLP will be and how we should use it.


Topic Classification Skill – Leo

Sometimes you want to follow high volume publications like The Verge, NY Times, or VentureBeat because you trust them but you are only interested in narrower topics like #AI, #MixedReality, or #Blockchain.

Reducing noise and information overload is a problem we care passionately about. We have been working over the last 12 months on a new feature called Leo. You can think of Leo as your non-black-box research assistant – an easy-to-control AI tool which helps you reduce noise in your feeds and never miss important articles.

At its core, Leo is composed of a set of NLP and ML skills that allow you to train him to read, understand, and prioritize the content of your feeds. One of the first skills we are building is Topic Classification.

Leo Topic Modeling Skill

What is Topic Classification?

This skill helps Leo classify articles across 1,000 common topics. Thanks to its topic classification skill, Leo will be able to determine that the article above is about the #retail and #china topics.

Why is this skill useful?

Some sources produce interesting and high-quality content across lots of different topics. Following those sources can result in a large number of articles being added to your feeds, some of which are about topics you do not care about. Let’s say that you like TechCrunch, Forbes, and Wired. Following those three sources means that you have to crunch through a thousand articles per week. With the Leo topic skill, you can train Leo to prioritize articles about #ai across those sources and get the ten articles which are the most relevant to you without having to skim through a large list of articles.

How is Leo learning?

We feed Leo with thousands of articles about each of the thousands of topics that seem to be the most relevant to the Feedly community and generate a terminology and topic model that best captures the nuances across these topics.

Leo then processes all the articles in your feeds and classifies them based on the topic model.

Which languages does Leo understand?

Right now our dataset is English only, so Leo only classifies articles written in English.

How can you participate?

Leo is in its infancy (Leo 0.1). We are going to need six to nine months of training and quality improvement for Leo to grow and become useful. As part of our Lab effort (more open and transparent development process), we are inviting the Feedly community to contribute to Leo’s training.

You might see, at the bottom of some articles, a Leo prompt asking you to train Leo about the topics associated with the article you just read. In a couple of clicks, you can pick the topics which are relevant and the topics which are wrong.

If you are not interested in participating in the Leo program and think that the footer is a waste of space, there is a preference knob to turn it off in the Leo settings page.

Join the Leo Beta

Thank you

We want to thank Quentin Lhoest for doing the preliminary ML research behind this Leo skill!

Dig deeper

How to do NLP and machine learning with Feedly

Named Entity Recognition Skill – Leo

Sometimes you want to follow publications like TechCrunch, The Verge, Forbes, Wired, … because of their high quality but you are only interested in articles mentioning a competitor, a product you are interested in, or a client you are trying to connect with.

Reducing noise and information overload is a problem we care passionately about. We have been working over the last 12 months on a new feature called Leo. You can think of Leo as your non-black-box research assistant – an easy-to-control AI tool which helps you reduce noise in your feeds and never miss important articles.

At its core, Leo is composed of a set of NLP and ML skills that allow you to train him to read, understand, and prioritize the content of your feeds. One of the first skills we are building is Named Entity Recognition (and Salience).

Leo Named Entity Recognition Skill

What is Named Entity Recognition?

This skill helps Leo detect people, companies, and products in articles, map them to the right entity (disambiguation), and determine their salience (which entity is the focus of the article).

Why is this skill useful?

There are two interesting use cases related to the Named Entity Recognition skill.

The first one is around the power of disambiguation. Disambiguation allows you to prioritize ICO (Initial Coin Offering) and avoid seeing articles related to ICO (Information Commissioner’s Office). When Leo sees ICO in an article, it will look at the context of the sentence and the article to determine if it is initial coin offering or information commissioner’s office and map the term to the right entity and refine its filtering. Apple the fruit versus Apple the company is another example of this.

The second is around the power of salience. Let’s imagine that you are doing some competitive watch on JP Morgan and you have created a keyword alert for JP Morgan. One of the limitations of keyword alerts is that they show you all the mentions of JP Morgan. So if you have an article about Amazon or Google and there is a short sentence about JP Morgan, you will see that article in your feed (even if that article is not about JP Morgan). With salience, you can train Leo to only show you articles which are truly about JP Morgan (and remove all the irrelevant mentions).

How is Leo learning?

We feed Leo (a deep learning model) with a set of article examples we have manually tagged with various entities. By reading through all these examples, Leo learns the structure of sentences and the occurrence of entities.

In a second module, we look at the extracted entities and try to map them to a knowledge base and disambiguate homonyms.

A third module reviews the list of entities across the article and determines the salience of each of the entities assigning an aboutness/importance score to each entity.

Which languages does Leo understand?

Right now our dataset is English only so Leo can perform named entity recognition on English articles only.

How can you participate?

Leo is in its infancy (Leo 0.1). We are going to need six to nine months of training and quality improvement for Leo to grow and become useful. As part of our Lab effort (more open and transparent development process), we are inviting the Feedly community to contribute to Leo’s training.

You might see, at the bottom of some articles, a Leo prompt asking you to train Leo about the entities (people, products, brands, etc.) associated with the article you just read. In a couple of clicks, you can pick the entities which are the most relevant.

If you are not interested in participating in the Leo program and think that the footer is a waste of space, there is a preference knob to turn it off in the Leo settings page.

Join the Leo Beta

Dig deeper

Explanation of NER (Coursera video)

Hands-on: play with a pre-trained model

Hands-on: build a model in Stanford Deep Learning course

Learning Context With Item2vec

Feedly’s discovery tool helps you explore topics and sources in different ways. If you search for a topic, we show you related topics to dive deeper. If you follow a new source, we recommend other sources to pair with it.

We produce those recommendations with a collaborative filtering technique called item2vec. This blog post gives you more information on how we achieved this.

If you want to give feedback, please join our growing Mobile+AI Lab initiative. There is a channel dedicated to machine learning conversations in the Lab Slack.

What is collaborative filtering?

A collaborative filtering model learns from past behaviors to make predictions. Knowing what someone already reads, the model will suggest other sources and topics that follow a similar pattern.

Indeed, there is a good chance that any given source is co-read with a number of other related sources. These connections form a rich web of content and help us make sense of the open web.

Since we’re trying to compute preferences based on how sources are co-read, this can be framed as a collaborative filtering problem.

Our model learns which sources are often co-read and recommends complementary sources.

There are many existing collaborative filtering techniques to use. We chose one called item2vec. Item2vec is inspired by the popular word2vec model. Let’s start with a short recap of word2vec.

Relationships between words

Suppose you are given a list of sentences such as:

  • The capital city of France is Paris.
  • Spain’s capital, Madrid, is a nice city.
  • Rome, the capital of Italy, is a major city.

Imagine extending this list with similar sentences about capital cities in various countries. Do you notice any patterns?

Particular words are likely to appear in the same context. In typical writing and speech, Paris, Rome, and Madrid often appear near words such as city, capital, and their respective country names.

Word2vec is a shallow neural network model that trains on a large number of example sentences. To learn these patterns, the model represents each word with an object called a vector. Vectors are mathematical objects that have interesting geometrical properties. For example, you can calculate how close two vectors are, or you can add or subtract vectors to produce a new vector.
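To make these vector operations concrete, here is a small sketch using cosine similarity as the closeness measure. The 3-dimensional vectors below are hand-made toy values, not real word2vec output, and the helper names are assumptions introduced for this example.

```python
import math

def cosine_similarity(u, v):
    """Closeness of two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def subtract(u, v):
    """Element-wise difference of two vectors, which is itself a vector."""
    return [a - b for a, b in zip(u, v)]

# Hand-made toy vectors (real embeddings have hundreds of dimensions)
vectors = {
    "paris":  [0.9, 0.8, 0.1],
    "rome":   [0.8, 0.9, 0.1],
    "banana": [0.1, 0.0, 0.9],
}

close = cosine_similarity(vectors["paris"], vectors["rome"])
far = cosine_similarity(vectors["paris"], vectors["banana"])
difference = subtract(vectors["paris"], vectors["rome"])

print("paris vs rome:  ", round(close, 3))
print("paris vs banana:", round(far, 3))
```

Words that share contexts end up with a similarity close to 1, while unrelated words score much lower.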

Word2vec is good at transposing the linguistic contexts of words in sentences into geometrical properties of the corresponding vectors. If two words often appear in the same context (e.g. Rome and Paris), their word2vec vector representations will be close in the vector space. More complex relationships between words are also captured by their vector representations, as shown below in a 2-dimensional space.

Source: Mikolov, Sutskever, Chen, Corrado, Dean. “Distributed Representations of Words and Phrases and their Compositionality” (2013)

You can learn more about word2vec in Jan Bussieck’s great blog post on Deep Learning Weekly.

A useful implementation of word2vec

For our models, we used the Python library gensim, which provides an implementation of word2vec. It makes training our model very fast and straightforward. It only needs a collection of sentences and a few important parameters, including:

  • The size of your embedding: the number of dimensions your vector representation will have
  • The window size: the number of adjacent words considered to be in the same context of a given word. That is, if a sentence is very long, the first and last words shouldn’t be considered part of the same context in the model.
  • The type of word2vec algorithm to use. There are two models: skip-gram and CBOW
  • The down-sample threshold: to give the training set more balance. Some words occur more frequently than others, and this parameter accounts for that.

Mapping items to vectors: item2vec

To build our recommender system, it’s helpful to have a vector representation of our “items” (sources or topics) along with their context. Indeed, by mapping every item to a vector, we can then use the vector distances between two items to know how similar they are. This gives us a base to make recommendations.

The next step is to capture the context of each source and topic. Going back to our example list of sentences, we saw that nearby words influence context.

Just like knowing nearby words helps us understand context while reading, the feeds created by Feedly users (groups of sources, sometimes called folders or collections) help us capture context for every source and topic. If a user has created a feed with three sources, A, B, and C, the model will learn that A often appears “near” B and C just like word2vec knows Paris often appears near city and France.

Using gensim’s word2vec model, we replace sentences of words with buckets of items. Since we want every pair of items in our buckets to be considered as part of the same context, no matter what the size of the bucket is, our window size is the maximum bucket size.
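The effect of that window choice can be sketched in plain Python, without calling gensim. The example feeds and the helper name are assumptions for illustration. With gensim, the buckets would roughly play the role of the sentences passed to `Word2Vec`, and the value computed below would be passed as its `window` argument.

```python
# Hypothetical feeds (buckets of sources), for illustration only
feeds = [
    ["The Verge", "Engadget", "TechCrunch"],
    ["The Verge", "Engadget"],
    ["NY Times", "Forbes", "Wired", "TechCrunch"],
]

# A window as large as the largest bucket puts every pair of items
# of a bucket in the same context, whatever the bucket size
window = max(len(feed) for feed in feeds)

def context_pairs(bucket, window):
    """(item, context item) pairs for items at most `window` positions apart."""
    pairs = []
    for i, item in enumerate(bucket):
        start = max(0, i - window)
        end = min(len(bucket), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((item, bucket[j]))
    return pairs

pairs = [pair for feed in feeds for pair in context_pairs(feed, window)]

print("window:", window)
print("number of training pairs:", len(pairs))
```

With a smaller window, the first and last sources of a long feed would never be paired, even though the order of sources inside a feed carries no meaning.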

The down-sample threshold is also important for our model. Some topics or feeds appear more frequently than others. Choosing a good down-sample parameter improves the quality of our recommendations.

If you have time to explore the new Discover experience, we’d love to hear your feedback.

Love the web? Love reading? To see more of this technology in action and join the community discussion, join the Feedly Mobile+AI Lab initiative.

The data science behind recommendations in Feedly

One of the web’s greatest strengths is its open and distributed nature. This is also a big challenge: With millions of sites publishing on thousands of topics, how do people navigate that content and discover new trustworthy sources?

Our solution to this challenge in Feedly involves using data science to organize all of those sources and help people navigate through topics.

This post covers some of the technology behind the new discovery experience in Feedly and what I’ve learned through this project.

Learning topics with user-generated data

When you follow a news site or blog in Feedly, you have to put it in a feed. Using anonymized data on how people name their feeds, I automated the process of creating our new English-language topic taxonomy.

So, if you’re one of the 45,000 people who have a feed called “tech” where you added both The Verge and Engadget, you helped create the “tech” topic.

Engadget and The Verge frequently appear in a feed called “tech.” 

There were still some problems with this list of topics, mainly duplicates and “trash topics.”

To understand how I trained the model to recognize topics, think of a matrix, or table, with data about topics and sources.

Number of times users added a given source to one of the feeds on the left

Did you notice “My favorites” in row 6 above? It’s a great example of a trash topic because it isn’t descriptive. You might also have noticed “tech” and “technology” are duplicates since they mean the same thing. If we expand our matrix to 10,000+ topics and 100,000+ sources, we would see many other trash topics and duplicates like these.

So how can we get rid of all of the trash and duplicate topics and highlight the good ones? This is where cleaning the data is important.

In the table above, each row has an array of numbers, also known as a vector. A row where all numbers are homogenous indicates a trash topic, whereas a row showing peaks for certain websites indicates a good topic.

Here’s a graph to illustrate the difference:

“My Favorites” is a trash topic. Notice how the blue line is flat compared to the others which show a distinct peak for certain websites.

We can detect those trash topics by measuring the peaks in the corresponding graph. Turning this into vector properties, we can for instance measure the ratio between the greatest number (height of the peak) and the number of non-zero values (footprint).
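To make the peak-versus-footprint idea concrete, here is a minimal sketch. The function name and the "My favorites" row are invented for illustration; only the "tech" counts come from the example in this post, and this is not the exact scoring used in production.

```python
def peak_to_footprint(counts):
    """Ratio of the highest count (peak) to the number of non-zero entries (footprint)."""
    footprint = sum(1 for c in counts if c > 0)
    if footprint == 0:
        return 0.0
    return max(counts) / footprint

# One count per source. The second row is an invented flat "trash topic".
tech = [50000, 30000, 5, 2]
my_favorites = [120, 110, 130, 125]

print(peak_to_footprint(tech))          # 12500.0
print(peak_to_footprint(my_favorites))  # 32.5
```

A peaked row scores far higher than a flat one. In practice you would probably normalize each row first so that the score does not depend on how popular the topic name is.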

Similarly, here is a graph showing duplicate topics:

“Tech” and “technology” are duplicate topics. Their distributions have a very similar shape.

To detect these duplicate topics, we also use vector properties. In our example, the values in the vectors for “Tech” [50000, 30000, 5, 2] and “Technology” [12000, 7500, 2, 0] are very similar after normalization (turning those absolute numbers into percentages). To find the similarity between two vectors, I used the Jensen-Shannon divergence method.
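As a rough illustration, the Jensen-Shannon divergence between two normalized rows can be computed in a few lines of plain Python. The helper names and the flat comparison row are assumptions made for this sketch; the "Tech" and "Technology" vectors are the ones quoted above.

```python
import math

def normalize(counts):
    """Turn absolute counts into proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(a, b):
    """Symmetric divergence between two count vectors, between 0 and 1 with log base 2."""
    p, q = normalize(a), normalize(b)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

tech = [50000, 30000, 5, 2]
technology = [12000, 7500, 2, 0]
flat_row = [100, 100, 100, 100]  # invented row, for contrast only

print(jensen_shannon_divergence(tech, technology))  # very close to 0: likely duplicates
print(jensen_shannon_divergence(tech, flat_row))    # much larger: different distributions
```

A value near 0 means the two topics distribute their sources almost identically. Where to put the merge threshold is a judgment call that this sketch does not try to answer.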

Now that we detected that these vectors are quite similar, we can safely merge both in our system and redirect users to the “tech” topic if they search for “technology.”

Thanks to the large community of English-language readers using Feedly, we’re able to transform all that data into a clean, de-duplicated list of over 2,500 good topics, which you can visualize on the graph below.

We’re happy to report that our taxonomy goes deep enough to contain “mycology,” the science of fungi!

Graph of Feedly’s 2,600 English topics. The strength of the connection between two topics is proportional to the number of sources that belong to both topics in Feedly. Here, we display labels for only a few of the bigger topics.

Topic tree: Building a hierarchy

Now that our sources have rich topic labels, the next challenge was to introduce a better ordering system to connect related topics.

Some topics are general (“tech”) whereas others are more specific (“iPad”). Having an internal representation of the hierarchy underlying the topics, where “iPad” is a subtopic of “Apple” and “Apple” is a subtopic of “Tech”, is useful to compute recommendations.

To build this hierarchy, we use pattern matching. The graph below shows connections between three topics (on the left side) and sources related to those topics (right side). The thicker the line, the more people added the source to a feed with that name in Feedly.

Using pattern matching, we see that the topic “Apple” is a subtopic of tech. Apple connects to a subset of sources included in the tech topic.

The pattern in this example confirms that people use “tech” and “technology” in much the same way. The lines from “Technology” are thinner because people use that term less. But these two topics are duplicates. Meanwhile, “Apple” appears to be a subtopic of “tech”: it connects to fewer sources and its connections are also related to “tech.”

By detecting those patterns, we are able to construct a tree structure for all of our topics and subtopics.
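One simple way to express this kind of pattern is a containment check: a topic looks like a subtopic when it has fewer sources and most of them also appear under the broader topic. The sketch below is only an illustration of that idea. The 0.8 threshold, the function names and the source lists (apart from The Verge and Engadget) are all made up.

```python
def containment(child_sources, parent_sources):
    """Share of the child's sources that also appear under the parent."""
    if not child_sources:
        return 0.0
    return len(child_sources & parent_sources) / len(child_sources)

def is_subtopic(child, parent, topics, threshold=0.8):
    """Hypothetical rule: the child is narrower and mostly contained in the parent."""
    c, p = topics[child], topics[parent]
    return len(c) < len(p) and containment(c, p) >= threshold

# Toy data: topic name -> set of sources filed under that name.
topics = {
    "tech": {"The Verge", "Engadget", "apple_blog_1", "apple_blog_2", "gadget_blog"},
    "apple": {"apple_blog_1", "apple_blog_2"},
    "cooking": {"recipe_blog_1", "recipe_blog_2"},
}

print(is_subtopic("apple", "tech", topics))    # True
print(is_subtopic("tech", "apple", topics))    # False
print(is_subtopic("cooking", "tech", topics))  # False
```

A real version would weight each connection by how many people made it (the line thickness in the graph) instead of treating every source equally.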

Today, when you visit Feedly’s Discover page, you’ll find a list of featured topics. Click on any of them to start exploring. The related topics are there to help you navigate deeper into this hierarchy.

Ranking recommended sources for each topic

Once we built our topics and arranged them into a hierarchy, we still needed to figure out which sources to recommend and in which order. There are three criteria we wanted to optimize for:

  • Relevance: the proportion of users who added the source in the topic versus users who added the source in another topic
  • Follower count: how many Feedly users connect to this source
  • Engagement: a proxy for quality and attention

The first two criteria are straightforward. People expect to see popular websites that are also relevant to the topic they are exploring, and there is often a trade-off to strike between the two metrics.
The third criterion is more subjective. It should reflect the quality of the website, independently of the absolute number of users reading it. Indeed, we believe some niche websites can have fewer readers but better content.

“Battle of the Sources” experiment

To compute an engagement score, we ran an experiment with the Feedly community. We chose a few sources related to the topic “tech” and asked users to vote for the one they enjoy more.

We collected 25,000 votes within a week and produced a ranking of those websites. We looked for features that were most correlated with what the users seemed to designate as the best websites.

In the table below, for example, we show the correlation between the score a source gets and the average time spent reading that source (“read_time”); that correlation is roughly 0.45. Since the correlation is positive, the higher the score, the longer people tend to spend on that source. All other features of interest in this example also show a positive correlation because they are all indicators of what a good source might be. Our approach enables us to select the features that are the most correlated with the results of the vote. We can then make a weighted combination of those to give the best sources a little boost.

Table of correlations. Score: the Bradley-Terry score derived from users’ votes. Ratio_long: the ratio of users doing long reads (>5 seconds) from the feed. Ratio_save: the ratio of users performing save actions from the feed. Read_time: the average time spent on each article. Users: the number of users reading the feed.
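For readers curious about the Bradley-Terry score mentioned in the caption, here is a small sketch of how such scores can be fitted from pairwise votes with the classic minorization-maximization update. The source names and votes are invented, and the fitting procedure shown is an assumption; the post does not say how the scores were actually computed.

```python
from collections import defaultdict

def bradley_terry(votes, iterations=100):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Assumes every item wins at least once; otherwise its strength
    collapses to zero and the update can divide by zero.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    items = set()
    for winner, loser in votes:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    scores = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            # Sum over every opponent j that i was compared against.
            denom = sum(
                pair_counts[frozenset((i, j))] / (scores[i] + scores[j])
                for j in items
                if j != i and frozenset((i, j)) in pair_counts
            )
            updated[i] = wins[i] / denom if denom > 0 else scores[i]
        total = sum(updated.values())
        scores = {item: value / total for item, value in updated.items()}
    return scores

# Invented votes: site_a usually beats site_b, which usually beats site_c.
votes = (
    [("site_a", "site_b")] * 3 + [("site_b", "site_a")]
    + [("site_b", "site_c")] * 3 + [("site_c", "site_b")]
    + [("site_a", "site_c")] * 3 + [("site_c", "site_a")]
)
scores = bradley_terry(votes)
for site in sorted(scores, key=scores.get, reverse=True):
    print(site, round(scores[site], 3))
```

The resulting scores sum to 1 and can then be correlated with features such as read time, as described above.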

Thank you to everyone who voted in the “Battle of the Sources” experiment. You can see the results of this work by exploring the featured topics on the Discover page, or by searching for your favorite topic.

Generate sources that ‘you might also like’ and more ‘related topics’

While the “related topics” list is partly derived from the hierarchy described above, we also complete it by fetching other related topics using a collaborative filtering technique. You can learn more about how we did it in this post.

Related topics suggestions are there to help you explore new sources in your niche.

We use the same technique to recommend sources “You Might Also Like” from sources you already follow.


Many thanks to the Feedly community for the direct or indirect contribution to the discovery project. Have fun exploring!


Transfer Learning in NLP

In our previous post we showed how we could use CNNs with transfer learning to build a classifier for our own pictures. Today, we present a recent trend of transfer learning in NLP and try it on a classification task, with a dataset of Amazon reviews to be classified as either positive or negative. Have a look at the notebook to reproduce the experiment on your own data!

The ideas of transfer learning in NLP are very well presented in the course and we encourage you to have a look at the forum. Our reference paper here is Howard, Ruder, “Universal Language Model Fine-tuning for Text Classification”.

So what is Transfer Learning?

Computer Vision is a field which has seen tremendous improvements because of transfer learning. Highly non-linear models with millions of parameters required massive datasets to train on, and often took days or weeks to train, just to be able to classify an image as containing a dog or a cat!


With the ImageNet challenge, teams competed each year to design the best image classifiers. It has been observed that the hidden layers of such models are able to capture general knowledge about images (edges, certain forms, style…). Hence, it was not necessary to retrain a model from scratch each time we wanted to change tasks.

Let’s take the example of the VGG-16 model (Simonyan, Karen, and Zisserman. “Very deep convolutional networks for large-scale image recognition.” (2014))


This architecture is relatively complex: there are many layers and the number of parameters is large. The authors claim a training time of 3 weeks using 4 powerful GPUs.

The idea of transfer learning is that, since the intermediary layers are thought to learn general knowledge about images, we could use them as one big featurizer! We would download a pre-trained model (trained for weeks on the ImageNet task), remove the last layer of the network (the fully-connected layer, which projects the features onto the 1000 classes of the ImageNet challenge), put in its place the classifier of our choice, adapted to our task (a binary classifier if we are interested in classifying cats and dogs), and finally train our classification layer only. And because the data we use may be different from the data the model was previously trained on, we can also do a fine-tuning step, where we train all layers for a reasonably short amount of time.

In addition to being quicker to train, transfer learning is particularly interesting since training only on the last layer enables us to use fewer labeled data, compared to the huge dataset required to train the full model end-to-end. Labeling data is expensive and building high quality models without requiring large data sets is very much appreciated.
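The recipe above can be sketched without any deep learning library. In this toy version, a fixed function stands in for the pre-trained layers and only a small logistic-regression "head" is trained. Everything here (the featurizer, the four-pixel "images", the learning rate) is invented for illustration and is far simpler than a real network.

```python
import math

def frozen_featurizer(pixels):
    """Stand-in for the pre-trained layers: fixed, never updated during training."""
    mean = sum(pixels) / len(pixels)
    contrast = max(pixels) - min(pixels)
    return [mean, contrast]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(examples, labels, epochs=500, lr=1.0):
    """Train only the new classification layer, using plain SGD on the log loss."""
    features = [frozen_featurizer(x) for x in examples]  # computed once, then reused
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            error = sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias) - y
            weights = [w - lr * error * fi for w, fi in zip(weights, f)]
            bias -= lr * error
    return weights, bias

def predict(pixels, weights, bias):
    f = frozen_featurizer(pixels)
    return int(sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias) > 0.5)

# Four tiny labeled "images": 0 = dark, 1 = bright.
examples = [
    [0.1, 0.2, 0.1, 0.0],
    [0.3, 0.1, 0.2, 0.2],
    [0.9, 0.8, 1.0, 0.9],
    [0.7, 0.9, 0.8, 0.8],
]
labels = [0, 0, 1, 1]

weights, bias = train_head(examples, labels)
print(predict([0.0, 0.1, 0.1, 0.0], weights, bias))  # 0
print(predict([1.0, 0.9, 1.0, 0.9], weights, bias))  # 1
```

Because the featurizer is never touched, the only parameters to learn are the two weights and the bias, which is why so few labeled examples are enough.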

And what about Transfer Learning in NLP?

Advances in deep learning for NLP are less mature than they are in Computer Vision. While it is quite conceivable that a machine is able to learn what edges, circles, squares, etc. are and then use this knowledge to do other things, the parallel is not straightforward with text data.

The first popular attempt at transfer learning in NLP came from word embedding models (widely popularized by word2vec and GloVe).

These word vector representations take advantage of the contexts in which words live, to represent them as vectors, where similar words should have similar word representations.

In this figure from the word2vec paper (Mikolov, Sutskever, Chen, Corrado, Dean. “Distributed Representations of Words and Phrases and their Compositionality” (2013)) we see that the model is able to learn the relation between countries and their capital cities.
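To get a feel for what “learning the relation” means, here is a toy sketch of the usual vector-arithmetic test. The five-dimensional vectors are hand-built so that the arithmetic works out exactly. Real word2vec embeddings are learned from text, have hundreds of dimensions, and only satisfy such relations approximately.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-built toy vectors: [is_country, is_capital, french, italian, german]
vectors = {
    "france":  [1, 0, 1, 0, 0],
    "paris":   [0, 1, 1, 0, 0],
    "italy":   [1, 0, 0, 1, 0],
    "rome":    [0, 1, 0, 1, 0],
    "germany": [1, 0, 0, 0, 1],
    "berlin":  [0, 1, 0, 0, 1],
}

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# "france is to paris as italy is to ...?"
print(analogy("france", "paris", "italy"))  # rome
```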

Including pre-trained word vectors has been shown to improve metrics in most NLP tasks, and thus has been widely adopted by the NLP community, which has continued to look for even better word/character/document representations. As in computer vision, pre-trained word vectors can be seen as a featurizer function, transforming each word into a set of features.

However, word embeddings represent only the first layer of most NLP models. After that, we still need to train from scratch all the RNNs/CNNs/custom layers.

Language Model Fine Tuning for Text classification

The ULMFit model was proposed by Howard and Ruder earlier this year as a way to go a step further in transfer learning for NLP.

The idea they are exploring is based on Language Models. A Language Model is a model which is able to predict the next word, based on the words already seen (think of your smartphone guessing the next words for you while you text). Just like an image classifier has gained intrinsic knowledge about images by classifying tons of them, if an NLP model is able to predict accurately the next word, it seems reasonable to say that it has learned a lot about how natural language is structured. This knowledge should provide a good initialization to then be trained on a custom task!
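The simplest possible language model just counts which word tends to follow which. The sketch below is a bigram counter on an invented four-sentence corpus. It has nothing to do with the neural architecture used in ULMFit, but it shows the “predict the next word” task in its most basic form.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Invented toy corpus.
corpus = [
    "i hate this computer",
    "i hate this keyboard",
    "i ate this sandwich",
    "i hate mondays",
]
model = train_bigram_model(corpus)

print(predict_next(model, "i"))      # hate
print(predict_next(model, "hate"))   # this
print(predict_next(model, "zebra"))  # None
```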

ULMFit proposes to train a language model on a very large corpus of text (a Wikipedia dump, for example) and use it as a backbone for any classifier! Because your text data might be written differently from Wikipedia, you would fine-tune the parameters of the language model to take these differences into account. Then, we would add a classifier layer on top of this language model, and train this layer only! The paper suggests gradually unfreezing the layers, and hence gradually training every layer. They also build upon previous work on learning rates (cyclical learning rates) and create their slanted triangular learning rates.
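For the curious, the slanted triangular schedule is short enough to write down. The sketch below follows the formula from the Howard and Ruder paper, with their suggested defaults (cut_frac=0.1, ratio=32, lr_max=0.01). The function and argument names are our own.

```python
import math

def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t: short linear warm-up, then long linear decay.

    Assumes total_steps * cut_frac >= 1, otherwise `cut` is 0 and the
    warm-up fraction is undefined.
    """
    cut = math.floor(total_steps * cut_frac)
    if t < cut:
        p = t / cut  # increasing phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decreasing phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

total = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, slanted_triangular_lr(step, total))
```

The rate starts at lr_max / ratio, peaks at lr_max after the first 10% of the steps, and then decays back to lr_max / ratio by the last step.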

Take away from the ULMfit paper

The amazing practical result from this paper is that using such a pre-trained language model enables us to train a classifier on much less labeled data! While unlabeled data is almost infinite on the web, labeled data is very expensive and time-consuming to get.

Here are results they report from the IMDb sentiment analysis task:

With only 100 examples they are able to reach the same error rate that the model reaches when trained from scratch on 20k examples!!

Moreover, they provide code to pre-train a language model in the language of your choice. Because Wikipedia exists in so many languages, this enables us to quickly move from one language to another using Wikipedia data. Public labeled datasets are known to be more difficult to access in languages other than English. Here, you could fine-tune the language model on your unlabeled data, spend a few hours manually annotating a few hundred or a few thousand data points, and adapt a classifier head to your pre-trained language model to perform your task!


Playground with Amazon Reviews

To deepen our understanding of this approach, we tried it on a public dataset not reported in their paper. We found this dataset on Kaggle. It contains 4 million reviews of products on Amazon and tags each of them with a sentiment, either positive or negative. We adapted the course’s ULMFit material to the task of classifying Amazon reviews as positive or negative. We find that with only 1000 examples the model is able to match the accuracy score obtained by training a FastText model from scratch on the full dataset, as reported on the Kaggle project home page. With only 100 labeled examples, the model still performs well.



To reproduce this experiment you can use this notebook. Having a GPU is encouraged to run the fine-tuning and classification parts.

Unsupervised vs supervised learning in NLP, discussion around meaning

Using ULMFit, we have used both unsupervised and supervised learning. Training an unsupervised language model is “cheap” since you have access to almost unlimited text data online. However, training a supervised model is expensive because you need labeled data.

While the language model is able to capture a lot of relevant information from how a natural language is structured, it is not clear whether it is able to capture the meaning of the text, which is “the information or concepts that a sender intends to convey, or does convey, in communication with a receiver”.

You might have followed the very interesting Twitter thread on meaning in NLP (If not, take a look at this summary from the Hugging Face team). In this thread, Emily Bender makes her argument against meaning capture with the “Thai room experiment”:  “Imagine [you] were given the sum total of all Thai literature in a huge library. (All in Thai, no translations.) Assuming you don’t already know Thai, you won’t learn it from that.”

So we could think that what a language model learns is more about syntax than meaning. However, language models do more than just predict syntactically valid sentences. For example, the sentences “I ate this computer” and “I hate this computer” are both syntactically correct, but a good language model should know that “I hate this computer” is “more correct” than the hungry alternative. So, while I would not be able to write in Thai even if I had seen the whole Thai Wikipedia, it’s easy to see that the language model does go beyond simple syntax/structure comprehension. So we could think of language models as learning quite a lot about the structure of natural language sentences, helping us in our quest to understand meaning.

We won’t go further into the notion of meaning here (this is an endless and fascinating topic/debate). If you are interested, we recommend Yejin Choi’s talk at ACL 2018 to dig further into the subject.

Future of transfer learning in NLP

The progress obtained by ULMFit has boosted research in transfer learning for NLP. This is an exciting time for NLP, as other fine-tuned language models also start to emerge, notably the FineTune Transformer LM. We also note that with the emergence of better language models, we will be able to improve this transfer of knowledge even further. The flair NLP framework is quite promising in addressing transfer learning from a language model trained at the character level, which makes it very relevant for languages with sub-word structure like German.

Fun With ConvNets (dataviz!)

In January, Jeremy Howard launched the newest iteration of his Practical Deep Learning course. Even if you are an experienced ML practitioner, I really recommend giving these a quick watch (hint: watch at 1.25/1.5 speed until you get confused, then rewind and watch at normal playback). There are always a few tips and tricks in each lecture that are worth adding to your toolkit.

An added bonus is that this iteration is the first with his pytorch-based deep learning library. If you haven’t heard of it, the 6000+ stars on GitHub are probably an indication that you should give it a look. Given we are fans of pytorch and Jeremy at Feedly, it seemed like a great time to watch the videos and start getting our hands on some code based on it.

Lesson 1 focuses on machine setup (key takeaway: get a machine with a GPU!) and convolutional neural networks (conv nets for short). These networks are great for image processing. The lesson gently introduces students to key concepts and the basics of the library. After building a basic classifier, he recommends students download their own images and train their own classifier. So that’s what I did!

Basketball vs Tennis

I decided to try to build a classifier that can differentiate between tennis players and basketball players. I downloaded 30 images of tennis players and 30 images of basketball players to play around with. Your data is probably the most important ingredient when developing your deep learning model, so I thought a bit and made a few key choices:

  1. Tennis and basketball at the professional level are not particularly diverse sports. I overweighted the underrepresented peoples in the images to avoid the model just picking up that most NBA basketball players are African American and most pro tennis players are not.
  2. I chose some images with the basketball and tennis ball in the frame and some with no ball. Same with a tennis racket. This was more for my own curiosity to see how the model would react.
  3. I only picked male players. This was just to save some time. I am almost certain I could have downloaded more images of women players with no loss in accuracy.
  4. I split the data evenly, 20 images of each class for training, 10 of each for validation. I got a bit creative for the test set, as you’ll see below.

The Amazing Power of Transfer Learning

There is a never-ending stream of amazing work in the field of deep learning, but probably my favorite development has been the advent of transfer learning. This is where very smart people train a very generalized model that consists of 2 parts: one that extracts “features” (properties) of the input data and another that uses these features to do classification. In transfer learning, you simply take this fully trained model, lop off the classification part and then plug in your own bit. Since researchers use a very general data set, it turns out the feature extraction part of these models is still quite applicable to most image classification problems.

This is great for a few reasons:

  1. Training a conv net from scratch is intense. Without doing anything exotic, it can take weeks.
  2. Training a conv net from scratch takes a lot of data.
  3. Conv nets are pretty deep models; you don’t want to write your own and then take a lot of time debugging your code.

In short, other people do all the hard work and then you swoop in and reap the rewards. Life is good.

This is why I had some hope of building a good model using only 60 images. A key thing to understand is that this is not a silly or contrived example. Researchers often work with standard and well-developed data sets, but in the industry, you’re often on your own. Building a good data set can actually be one of the biggest barriers to developing a model, so being able to train a sophisticated model using only a few data points is a massive win.

The Results

As it turns out, differentiating between tennis and basketball players is as easy for a conv net as it is for human beings. By simply putting my data in the right place and repeating the 3 lines of code it takes to run training, the model achieved perfect accuracy within minutes of training:

epoch      trn_loss   val_loss   accuracy              
    0      1.09754    0.572311   0.7       
    1      0.802709   0.405828   0.9                     
    2      0.581209   0.294582   0.95                    
    3      0.456703   0.215485   0.95                    
    4      0.371513   0.158434   1.0                     

You may wonder if the model is really this good or if it just got lucky. After all, we are only working with 60 images, and the model only had to get 20 validation examples correct. The answer is that it really is this good. The magic is the feature extraction layers that we stole. Those stayed unchanged during this process; I only allowed the model to optimize the classification section. This methodology allows the model to remain generalized.

It’s easy to test this claim. You can simply try to train an entire conv net from scratch. If you do this, you’ll quickly see training accuracy get to 100% and validation accuracy stay at 50%. You can think of convolutional networks as extremely powerful, but extremely lazy. With only 40 training examples to look at, they’ll simply memorize the right answers and not dig deeper to find any meaningful patterns in the data. It’s only when you have enough data such that memorization becomes impossible that the network is forced to get into gear and do good work. Through transfer learning, we lock in this nice generalization behavior.

As a side note, this model could get way, way better. Just running through the data a few more times leads to great improvement in terms of validation loss. And Lesson 2 has a few more steps that make things even better.

Test Time

Once you’ve validated your model, it’s time to release it into the wild. To do this, I had my wife snap 3 pictures of me. One in a basketball pose, one in a tennis pose, and finally one of me just standing there looking my usual goofy self. I ran my (supposedly) perfect model and it guessed tennis for all 3 pictures! What’s going on??? Maybe I just look tennis-y? I actually only do play tennis nowadays. Maybe the model has secretly been spying on me?

After a few deep breaths, I calmed down and ran a simple experiment. I didn’t have a basketball handy, so I just (badly) photoshopped one into the basketball picture. Lo and behold, the model completely reversed its opinion and predicted the image was a basketball player. Order has been restored in the universe.

My test images. The leftmost image is a photoshopped version of the one next to it.

So this is simply your run-of-the-mill model mystery. With tens or even hundreds of millions of learned parameters, conv nets are difficult to impossible to interpret directly. However, there are some techniques that can be used to generate some interesting visualizations.

Viewing Activations

A pretty basic thing to do is check the output activations of a convolutional layer. Basically, seeing how excited parts of the model get when they see something interesting:

Layer 1 activations. Lighter is more activated.


Here we see a few different outputs for the 2 basketball pose pictures (the ones on the left are with the photoshopped ball). It’s pretty fun to look at these. It’s always amazing to get insights into how your model “thinks”. The top image looks like some kind of skin filter. It really likes my arms and the basketball (which is leather). The next picture seems to like clothing. The 3rd seems to like left edges? The final one is also pretty cool. It also likes clothing, but actually picks up the textures in my shirt.
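If you have never looked at what an “activation” actually is, here is a tiny sketch in plain Python: one hand-written filter slid over a toy image, followed by a ReLU. The image, the filter and the function name are invented; the activations above come from the learned filters of a real pre-trained network.

```python
def conv2d_relu(image, kernel):
    """'Valid' 2D cross-correlation followed by ReLU, on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    activations = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(max(0, total))  # ReLU: negative responses become 0
        activations.append(row)
    return activations

# Toy 4x5 grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-written filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

activations = conv2d_relu(image, kernel)
for row in activations:
    print(row)
# [3, 3, 0]
# [3, 3, 0]
```

The filter “lights up” where the window straddles the edge and stays at 0 over the uniform bright area, which is the kind of pattern the lighter regions in the images above are showing.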

As you get deeper into the model, things get more, shall we say, abstract:

Layer 5 activations

We could posit some theories as to what’s going on, like the filter in the first row kind of likes circular type things. But honestly, these feel more like guesses. And more importantly, while it’s fun to poke around, these visualizations don’t really get us any closer to knowing what’s going on.

Viewing Filters

So instead of seeing how the convolutional filters processed the image, can we look at the filters themselves? It turns out the answer is yes. This is especially true of the first layer of the model. This layer processes the input image directly and must have a 3-dimensional volume in order to process each RGB channel. This means we can map each filter back into the RGB space:

mosaic of all 64 layer 1 filters.

This is pretty cool: we can see some filters like to detect vertical edges, others diagonal, others look a bit blobby. Unfortunately, layers deeper in the model are not necessarily 3 dimensional so we have to plot each dimension of the filter separately. Even worse, some layers have filters that are only 1 pixel big! These layers still do things (they collapse volumes), but it’s kind of a downer in terms of data viz. Other filters are bigger but still really small, like 3×3. You can visualize them, but it’s hard to know what’s going on:

Layer 3 filters


Occlusion is a really clever technique. It’s one of those things that simultaneously is really simple and obvious once you hear about it but also something you never would have thought up on your own.

If we assume that our model “looks” at images as we do, an interesting thing to do might be to cover up different parts of the image and see how the model reacts. We can do this by just slapping an uninteresting gray square over different parts of the image and running inference:

Check out this guy and his dumb gray box.

We can be a bit more systematic and slide the gray box over the image, generating a prediction at each position. Then at each pixel position, we can capture the average classification when the pixel is covered up. Then we can plot things as a heatmap:

Probably can’t drain a 15 footer to save his life.

Now we’re getting somewhere. Here blue areas indicate the model didn’t change its original basketball prediction when those areas were covered up. The “hotter” colors indicate where the occlusion caused the model to shift its prediction towards tennis. We can see that when the basketball is occluded, the model really reacts. This explains why even in my basketball pose, an actual basketball was really important to the model.
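The sliding-box procedure is easy to sketch in plain Python. In the version below, `predict` is a made-up stand-in for the model (it just reads one pixel), and the box size, fill value and tiny 4x4 “image” are arbitrary choices for illustration; this is not the notebook code.

```python
def occlusion_heatmap(image, predict, box=2, stride=1, fill=0.5):
    """Slide a gray box over the image and, for each pixel, average the
    model's score over every box position that covers that pixel."""
    height, width = len(image), len(image[0])
    totals = [[0.0] * width for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for top in range(0, height - box + 1, stride):
        for left in range(0, width - box + 1, stride):
            occluded = [row[:] for row in image]  # copy, then gray out the box
            for i in range(top, top + box):
                for j in range(left, left + box):
                    occluded[i][j] = fill
            score = predict(occluded)
            for i in range(top, top + box):
                for j in range(left, left + box):
                    totals[i][j] += score
                    counts[i][j] += 1
    # Pixels never covered (possible when stride > 1) are reported as None.
    return [
        [totals[i][j] / counts[i][j] if counts[i][j] else None for j in range(width)]
        for i in range(height)
    ]

def toy_predict(image):
    """Hypothetical model: the 'basketball' score is the pixel at row 1, column 1."""
    return image[1][1]

image = [[0.0] * 4 for _ in range(4)]
image[1][1] = 1.0  # the only pixel our toy model cares about

heatmap = occlusion_heatmap(image, toy_predict)
for row in heatmap:
    print(row)
```

Here a low value marks a pixel whose occlusion hurts the basketball score, so the important region shows up as the smallest numbers; the plot above uses the opposite color convention. Raising `stride` cuts the number of inference calls, at the cost of a coarser map.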

But But But

I was really careful when constructing my data set to include shots of players without a basketball. So what’s going on? Let’s try running the same process on Giannis:

Definitely a basketball player.

Aha! Here we see things start getting interesting in 2 basic areas: Giannis’s right shoulder (really his left shoulder but I’ll speak in terms of image layout) and the letters/numbers on his jersey. So it seems like our model not only likes basketballs, it likes basketball jerseys as well.

Let’s try things on a tennis image as well:

On this image, I was really expecting fireworks around the racket. But the model seems more interested around the right-most shoulder again. So it seems like our model has decided basketballs and right shoulders are critical areas of interest.

Is this OK? On one hand, it’s a bit bizarre. If you ask your buddy what’s the difference between basketball players and tennis players, she’s almost certainly not going to reply “His right shoulder of course!” But on the other hand, I chose all professional player pictures – so differences in attire seem like a perfectly reasonable criterion.

This is where your training data becomes critical. If I expect my deployed model to detect out of shape software engineers pretending to play basketball, I can’t just randomly train the model on NBA players. I might need some shots of people playing in their neighborhood or the local gym as well.

Luckily since we’re utilizing transfer learning, we only need a few examples of each situation to generalize our model in the ways we want.

DeConv Networks

There is a final technique I’ve seen a few times called Deconvolutional networks. This process is similar to back-propagation but adapted in a few key ways to produce really nice visualizations of what each filter is really “thinking”. I didn’t get around to implementing this, but I’ll try to someday. I see the same data viz over and over when I do online classes, it’d be nice for people to redo these so students don’t have to keep seeing the same ones! 🙂

Wrap Up

So that’s it. Data visualizations can be pretty useful at times, and they are always fun time wasters you can use to impress your friends. I’ve shared my notebook and data (executed print out here) so you should be able to generate your own data viz pretty easily. The code should essentially just work on any typical conv net architecture, just tweak some paths and settings. Everything runs pretty quickly even in a non-GPU environment except for the occlusion images. You can introduce a stride > 1 to speed those up. Enjoy!

If you enjoyed reading this, subscribe in Feedly to never miss another post.

Are Python Types Worth All The Extra Typing?

We finally took the plunge and upgraded to Python 3.6 a few weeks ago. One thing that kind of crept up on me was the addition of type hints via the typing module. I’ll admit to not being fully plugged into the python community, but I thought a language feature like this would have caused a bit of an uproar.

That being said, I’ve been quite pleased with the new features. Having spent most of my career working in the statically typed world of Java, it’s helped provide a bit of much needed structure that I’ve missed.

Typing in Python

Python is a dynamically typed language, specifically what the cool kids call “duck typing”. This is probably the sharpest of all the double edged swords when coding in python.

Essentially this refers to the fact that rather than checking that all the types being used are valid beforehand in a compilation step, your program will continue merrily along its way until it absolutely can’t anymore, at which point it will raise a TypeError. As an example, this bit of Java code won’t compile:

public static int add(int a, int b) { return a+b;}
add(1, "1");

The Java compiler will immediately complain about an incompatible type being passed. However its python equivalent:

def add(a,b):
    return a+b
add(1, "1")  # raises a TypeError, but only when this line actually runs

Your favorite python IDE will not complain one bit about any of this. You’ll have to run the code to get an error, as ints and strings can’t be added.

That stinks. But watch this:

>>> add('hello', 'world')
'helloworld'
>>> add([1,2,3], [4,5,6])
[1, 2, 3, 4, 5, 6]

Ok, so not all bad. This extends to classes as well, as long as an object has the right methods available, python code will just work. No interfaces required:

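class Dog(object):
    def talk(self, quietly):
        return 'woof' if quietly else 'BARK'
class Cat(object):
    def talk(self, quietly):
        return 'purr' if quietly else 'YOWL'
def speak(a, quietly):
    # Assumed body and call arguments: the original listing omits them.
    print(a.talk(quietly))
>>> speak(Dog(), True)
woof
>>> speak(Cat(), False)
YOWL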

Nice! Fewer keystrokes and very concise code.

The Problem

This is just great for the first few weeks of your python project. But then you get pulled away for a week to do an enhancement for some other code you wrote awhile ago. Or then you need to bring a coworker onto your project and start coding together.

You return and see a line of code like speak(x). x is rather vague, so you decide to jump to the definition of speak. Now you have to do some searches for talk to see where that’s implemented, and then go look at what those methods are doing to try to infer the general contract of the talk method that’s imposed by speak.

The Solution

Type hints allow you to provide a bit more context as to what’s going on. Let’s rewrite the code above.

from typing import Union

class Dog(object):
    def talk(self, quietly: bool) -> str:
        return 'woof' if quietly else 'BARK'

class Cat(object):
    def talk(self, quietly: bool) -> str:
        return 'purr' if quietly else 'YOWL'

def speak(a: Union[Dog, Cat], quietly: bool) -> None:
    print(a.talk(quietly))

If you didn’t notice, the type hints are the things after the colons and the arrows. Notice the extra information conveyed. We now know at a glance that speak expects a Dog or a Cat and a boolean flag indicating if it should speak quietly. Would this get cumbersome if we expected to have more types of speakers in the future? Yes, but that’s probably a good hint that a base class should be introduced.
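The base-class idea mentioned above could look something like the following. This is only a minimal sketch: the abstract class Animal and its use of the standard abc module are assumptions made for illustration, not something the original code defines.

```python
from abc import ABC, abstractmethod


class Animal(ABC):
    """Hypothetical base class: anything that can talk."""

    @abstractmethod
    def talk(self, quietly: bool) -> str:
        ...


class Dog(Animal):
    def talk(self, quietly: bool) -> str:
        return 'woof' if quietly else 'BARK'


class Cat(Animal):
    def talk(self, quietly: bool) -> str:
        return 'purr' if quietly else 'YOWL'


def speak(a: Animal, quietly: bool) -> None:
    # Any Animal subclass is accepted, so no Union is needed.
    print(a.talk(quietly))


speak(Dog(), quietly=True)
speak(Cat(), quietly=False)
```

With this shape, adding a new speaker means writing one more subclass, and the signature of speak never has to change.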

Essentially, without type hints it’s hard to write self-documenting code. You have to heavily rely on naming things really well, being meticulous about commenting, and general code cleanliness. This turns out to be a pretty high bar.

An added bonus is that many IDEs, such as PyCharm, support type hints and will display error squiggles when you don’t pass the right arguments.

I strongly recommend adopting type hints for your new code and then gradually updating your old code. They not only save you a bit of time when reading code, but really reduce the mental burden of remembering what type all your variables should be.

If you want to read more about type hints, check out the python documentation. We’ve been trying to emphasize hints recently at Feedly, and just the other day I got this slack message.


It’s nice when everything works out the way you wanted it to. 🙂

If you enjoyed reading this, subscribe in Feedly to never miss another post.

Demystifying the GPU

If you’re involved with Machine Learning, or happen to be an NVIDIA investor, you must have heard that GPUs are critical to the field. But for me, GPUs have always been a bit of a mystery. I had a pretty high level understanding of GPU == fast in my head, but not much more than that.

And honestly, that gets you pretty far. I started doing hands on ML work with Keras, which is an API on top of Google’s TensorFlow library. Keras is a beautifully designed API that’s made the very practical decision that most things should be simple, and hard things should be possible.

Because of this, I didn’t really need to learn anything about the GPU. I just ran a few magic OS commands and a couple magic lines of code and — boom! — massive model training speedup achieved. And I was perfectly happy with this situation.

However, on our latest project here at Feedly, we’ve decided to try PyTorch, basically because it’s been getting buzzier and buzzier. But there have been some good data points in its favor. The new (highly recommended) course sequence switched from Keras to PyTorch. Lots of recent Kaggle competitions have also been won by PyTorch models.

So we dug in and found that PyTorch makes all things possible through clear and consistent APIs. It’s more flexible than Keras, so the trade-off is that easy things get a bit harder but hard things get a bit easier. If you’re experimenting with new or unusual types of models (or you happen to be a control freak), PyTorch is definitely a better fit.

When the time came to GPU accelerate my PyTorch model and I googled for the magic GPU -> on line of code, I found out it didn’t exist! True to form, PyTorch makes this a bit harder than Keras, but provides APIs on how you should go about doing things. The upshot of all this is that I had to bite the bullet and build a mental model of how the GPU is actually being used to speed up model training.

Why GPUs are Fast

Models generally have many, many parameters. For example, the popular VGG image classification model has about 140 million parameters divided over 16 layers! When running inference (predictions), you need to pass your input data (image) through each layer, usually multiplying that data by the layer parameters. During training, you have to also tweak each parameter a little bit to better fit the data. That’s a lot of arithmetic!

CPUs are good at doing a few things really fast. This is usually fine: there is enough branching (if a user does this, do that) and other sequential constraints that massive parallelism isn’t really possible. GPUs are good at doing a lot of things “slow”. Since they were originally built for graphics, they expect to do a bunch of stuff at once (think converting all the pixels of an image to grayscale). So there’s a tradeoff here, and for ML the GPU wins big time because these huge arithmetic operations can be done in parallel.

To make things concrete, my macbook has a CPU that runs at 3.1 GHz and has 4 cores. An NVIDIA K80 GPU has almost 5000 cores, albeit running at a much slower 562 MHz. Although this is not really a fair comparison, you can see that the K80 has a clock speed that is about 6 times slower, but is 1250 times more parallel.
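For anyone who wants to check the back-of-the-envelope numbers, here is the arithmetic spelled out. The core count is rounded to 5000 (the text only says “almost 5000”), so treat the output as a rough illustration rather than a benchmark.

```python
cpu_clock_ghz = 3.1
cpu_cores = 4
gpu_clock_ghz = 0.562   # 562 MHz
gpu_cores = 5000        # rounded; the text says "almost 5000"

clock_ratio = cpu_clock_ghz / gpu_clock_ghz
core_ratio = gpu_cores / cpu_cores

print(f"GPU clock is about {clock_ratio:.1f}x slower")
print(f"GPU has about {core_ratio:.0f}x more cores")
```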

How to Think About GPUs

Instead of the GPU -> on line of code, PyTorch has “CUDA” tensors. CUDA is a library used to do things on GPUs. Essentially, PyTorch requires you to declare what you want to place on the GPU and then you can do operations as usual. So I thought, let’s try placing a layer on the GPU:

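import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        # Only the hidden layer is placed on the GPU (placement of .cuda() reconstructed)
        self.hidden = nn.Linear(784, 50).cuda()
        # The attribute name `output` is assumed; the original name was lost
        self.output = nn.Linear(50, 10)

    def forward(self, features):
        x = self.hidden(features.float().view(len(features), -1))
        x = self.output(x)
        return F.log_softmax(x, dim=1)

model = Model()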

I ran the code and immediately got an error at the hidden layer calculation:

Expected object of type Variable[torch.cuda.FloatTensor] but found type Variable[torch.FloatTensor] for argument #1 'mat1'

Why? I knew immediately it had something to do with the .cuda() bit of code I added but I wasn’t sure why. After a bit of thinking about how GPUs are supposed to speed things up, I realized, “Of course it doesn’t work, one tensor is on the GPU and another is still in main memory!”. Everything kind of snapped in place. PyTorch allows you to allocate tensors in GPU memory and then do operations on those tensors utilizing the GPU. But where do the results of those operations go? Let’s try another example:

import torch

t1 = torch.cuda.FloatTensor(20, 20)
t2 = torch.cuda.FloatTensor(20, 20)
t3 = t1.matmul(t2)
print(type(t3))

<class 'torch.cuda.FloatTensor'>

It’s another tensor in GPU memory! After more trial and error I found I had to change my way of thinking. Originally I thought of Memory, CPU, and the GPU all jumbled together:

What I realized is that I needed to think of things like this:


Essentially the CPU/main memory and GPU/GPU memory each live in their own little universes. Coming from a software engineering background, I started to think of GPU operations as a REST API. When you’re using a REST API, the real cost is sending the data back and forth. Doing stuff locally is fast (or is as fast as it can be) as is stuff done on a remote server. But what you want to avoid is a lot of data shipping back and forth as that is pure overhead.

So carrying that analogy forward, we can see that of course it makes sense that the PyTorch matmul result is a GPU tensor. That would make it easy to do further manipulations on the GPU without shipping the data to main memory and then back to the GPU again. So if we want to use the GPU, we really want all of our parameters on the GPU as those will be used over and over again to produce predictions in the forward pass and then updated in the backward pass. Each batch of features will have to be shipped into GPU memory. But then the intermediate and final results (e.g. the output of the hidden layer) can just live in GPU memory. All we need to do is keep sending commands to the GPU to tell it how to manipulate the parameters and weights.

So in the API analogy, we just make two “heavy” requests (starred in the drawing above), one to initialize the weights and one to get the final weights after training. But we make potentially millions of lightweight requests in between those 2 to train the model.
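To make the analogy concrete, here is a minimal sketch of that pattern in PyTorch. It is not the model from this post: the two-layer network, the random stand-in data, and the learning rate are all assumptions chosen to keep the example short, and it falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Fall back to the CPU so the sketch still runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# "Heavy" request #1: ship all parameters to the device once.
model = nn.Sequential(nn.Linear(784, 50), nn.Linear(50, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Random stand-in data (assumption): 3 batches of 32 flattened 28x28 images.
batches = [(torch.rand(32, 784), torch.randint(0, 10, (32,))) for _ in range(3)]

for features, labels in batches:
    # Each batch has to be shipped into device memory.
    features, labels = features.to(device), labels.to(device)
    optimizer.zero_grad()
    # Intermediate results and the loss live on the same device as the model.
    loss = F.nll_loss(F.log_softmax(model(features), dim=1), labels)
    loss.backward()
    optimizer.step()

# "Heavy" request #2: bring the trained weights back to main memory.
trained_weights = {name: p.detach().cpu() for name, p in model.state_dict().items()}
```

The only per-step traffic is the batch itself; the parameters, activations and gradients never leave the device until the final copy.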

GPU Performance Gains are Real and they are Spectacular

So what kind of speedup is possible? PyTorch has a nice little MNIST example we can use. Running this for 10 epochs took 153 seconds using the CPU only and 83 seconds using the GPU. And we could theorize that larger models could experience even bigger gains. Not bad, not bad at all.

Some Experiments

This was all great. After a little thinking and some terrible drawings, I understood GPUs much better and the extra coding required is not bad at all. But was my thinking correct? The best way to answer this question was to build an experiment. I wanted to prove to myself that shipping data is the “slow part” of the workflow. So I tried the following 3 things:

  1. Do a 200x200 matrix multiply in numpy, a highly optimized CPU linear algebra library.
  2. Do a 200x200 matrix multiply on the GPU using PyTorch cuda tensors.
  3. Do a 200x200 matrix multiply on the GPU using PyTorch cuda tensors, copying the data back and forth every time.

As expected, the GPU-only operations were faster, this time by about 6x. Interestingly, 1. and 3. took almost exactly the same amount of time. The efficiency of the GPU operations was balanced almost exactly by the inefficiency of the data shipping!
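An experiment along these lines could be sketched as follows. This is not the code used for the numbers above: the function names time_numpy and time_torch, the repeat count, and the use of random matrices are assumptions for illustration. The calls to torch.cuda.synchronize() are there because GPU work is queued asynchronously, so the clock should only stop once the work has actually finished.

```python
import time

import numpy as np
import torch


def time_numpy(n=200, repeats=1000):
    """Time repeated n x n matrix multiplies on the CPU with NumPy."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    return time.perf_counter() - start


def time_torch(device, copy_each_time, n=200, repeats=1000):
    """Time repeated n x n matrix multiplies with PyTorch on `device`."""
    a_cpu, b_cpu = torch.rand(n, n), torch.rand(n, n)
    a, b = a_cpu.to(device), b_cpu.to(device)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        if copy_each_time:
            # Ship the inputs over and bring the result back on every iteration.
            (a_cpu.to(device) @ b_cpu.to(device)).cpu()
        else:
            a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return time.perf_counter() - start


if torch.cuda.is_available():
    gpu = torch.device("cuda")
    print("numpy:           ", time_numpy())
    print("gpu only:        ", time_torch(gpu, copy_each_time=False))
    print("gpu with copying:", time_torch(gpu, copy_each_time=True))
```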


Tricks of the Trade: LogSumExp

There are a lot of neat little tricks in Machine Learning to make things work better. They can do many different things: make a network train faster, improve performance, etc. Today I’ll discuss LogSumExp, which is a pattern that comes up quite a bit in Machine Learning. First let’s define the expression:
$$LogSumExp(x_1…x_n) = \log\big( \sum_{i=1}^{n} e^{x_i} \big)$$
When would we see such a thing? Well, one common place is calculating the cross entropy loss of the softmax function. If that sounded like a bunch of gobbledygook: 1. get used to it, there’s a bunch of crazily named stuff in ML and 2. just realize it’s not that complicated. Follow that link to the excellent Stanford cs231n class notes for a good explanation, or just realize for the purposes of this post that the softmax function looks like this:
$$\frac{e^{x_j}}{\sum_{i=1}^{n} e^{x_i}}$$
where the $x_j$ in the numerator is one of the values (one of the $x_i$s) in the denominator. So what this is doing is essentially exponentiating a few values and then normalizing so the sum over all possible $x_j$ values is 1, as is required to produce a valid probability distribution.

So you can think of the softmax function as just a non-linear way to take any set of numbers and transform them into a probability distribution. And for the cross entropy bit, just accept that it involves taking the log of this function. This ends up producing the LogSumExp pattern since:
$$\begin{align}\log\left(\frac{e^{x_j}}{\sum_{i=1}^{n} e^{x_i}}\right) &= \log(e^{x_j}) \:-\: \log\left(\sum_{i=1}^{n} e^{x_i}\right) \\ &= x_j \:-\: \log\left(\sum_{i=1}^{n} e^{x_i}\right) & (1)\end{align}$$

It may seem a bit mysterious as to why this is a good way to produce a probability distribution, but just take it as an article of faith for now.

Numerical Stability

Now for why LogSumExp is a thing. First, in pure mathematics, it’s not a thing. You don’t have to treat LogSumExp expressions specially at all. But when we cross over into running math on computers, it does become a thing. The reason is based in how computers represent numbers. Computers use a fixed number of bits to represent numbers. This works fine almost all of the time, but sometimes it leads to errors since it’s impossible to accurately represent an infinite set of numbers with a fixed number of bits.

To illustrate the problem, let’s take 2 examples for our $x_i$ sequence of numbers: {1000, 1000, 1000} and {-1000, -1000, -1000}. Due to my amazing mathematical ability, I know that feeding either of these sequences into the softmax function will yield a probability distribution of {1/3, 1/3, 1/3} and the log of 1/3 is a reasonable negative number. Now let’s try to calculate one of the terms of the summation in python:

>>> import math
>>> math.e**1000
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: (34, 'Result too large')

Whoops. Maybe we’ll have better luck with -1000:

>>> math.e**-1000
0.0

That doesn’t look right either. So we’ve run into some numerical stability problems even with seemingly reasonable input values.
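The same failures show up if the whole LogSumExp expression is written out directly. The sketch below is a naive translation of the definition using the standard math module; the helper names naive_logsumexp and try_naive are made up for this illustration.

```python
import math


def naive_logsumexp(xs):
    """Direct translation of log(sum(exp(x_i))), with no safeguards."""
    return math.log(sum(math.exp(x) for x in xs))


def try_naive(xs):
    """Return the result, or the name of the exception the naive version raises."""
    try:
        return naive_logsumexp(xs)
    except (OverflowError, ValueError) as err:
        return type(err).__name__


print(try_naive([1.0, 2.0, 3.0]))  # moderate inputs are fine
print(try_naive([1000.0] * 3))     # exp(1000) is too large for a float
print(try_naive([-1000.0] * 3))    # exp(-1000) underflows to 0.0, and log(0) fails
```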

The Workaround

Luckily people have found a nice way to minimize these effects by relying on the fact that the product of exponentiations is equivalent to the exponentiation of the sum:
$$e^a \cdot e^b = e^{a+b}$$
and the logarithm of a product is equivalent to the sum of the logarithms:
$$\log(a \cdot b) = \log(a) + \log(b)$$
Let’s use these rules to start manipulating the LogSumExp expression.
$$\begin{align}
LogSumExp(x_1…x_n) &= \log\big( \sum_{i=1}^{n} e^{x_i} \big) \\
&= \log\big( \sum_{i=1}^{n} e^{x_i - c}e^{c} \big) \\
&= \log\big( e^{c} \sum_{i=1}^{n} e^{x_i - c} \big) \\
&= \log\big( \sum_{i=1}^{n} e^{x_i - c} \big) + \log(e^{c}) \\
&= \log\big( \sum_{i=1}^{n} e^{x_i - c} \big) + c & (2)
\end{align}$$

Ok! So first we introduced a constant $c$ into the expression (line 2) and used the exponentiation rule. Since $c$ is a constant, we can factor it out of the sum (line 3) and then use the log rule (line 4). Finally, log and exp are inverse functions, so those 2 operations just cancel out to produce $c$. Critically, we’ve been able to create a term that doesn’t involve a log or exp function. Now all that’s left is to pick a good value for c that works in all cases. It turns out $max(x_1…x_n)$ works really well.
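Equation 2 with $c = max(x_1…x_n)$ translates almost line for line into code. This is a minimal sketch in plain Python; it assumes a non-empty list of finite floats and makes no attempt to handle infinities or NaNs.

```python
import math


def logsumexp(xs):
    """LogSumExp via equation (2), shifting every value by c = max(xs)."""
    c = max(xs)
    # Every exponent x - c is <= 0, so exp() can never overflow, and the
    # largest term is exp(0) = 1, so the sum is never zero.
    return c + math.log(sum(math.exp(x - c) for x in xs))


print(logsumexp([1000.0, 1000.0, 1000.0]))     # about 1000 + log(3)
print(logsumexp([-1000.0, -1000.0, -1000.0]))  # about -1000 + log(3)
```

Choosing the maximum as the shift is what guarantees both properties noted in the comment, which is why it works in all cases.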

To convince ourselves of this, let’s construct a new expression for log softmax by plugging equation 2 into equation 1:
$$\begin{align}\log(Softmax(x_j, x_1…x_n)) &= x_j \:-\: LogSumExp(x_1…x_n) \\ &= x_j \:-\: \log\big( \sum_{i=1}^{n} e^{x_i - c} \big) \:-\: c\end{align}$$
and use this to calculate values for the 2 examples above. For {1000, 1000, 1000}, $c$ will be 1000 and $e^{x_i - c}$ will always be 1 as $x_i - c$ is always zero, so we’ll get:
$$\begin{align} \log(Softmax(1000, \left[1000,1000,1000\right])) &= 1000 \:-\: \log(3) \:-\: 1000 \\ &= \:- \log(3)\end{align}$$
log(3) is a very reasonable number computers have no problem calculating. So that example worked great. Hopefully it’s clear that {-1000,-1000,-1000} will also work fine.
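Both examples can be checked numerically with a small log softmax built on the same shift. This is a sketch under the same assumptions as before (non-empty list of finite floats); the function name log_softmax is chosen here for illustration.

```python
import math


def log_softmax(xs):
    """Log of the softmax of every element, using the shift c = max(xs)."""
    c = max(xs)
    log_sum_shifted = math.log(sum(math.exp(x - c) for x in xs))
    # x_j - LogSumExp(xs) = (x_j - c) - log(sum(exp(x_i - c)))
    return [(x - c) - log_sum_shifted for x in xs]


for xs in ([1000.0] * 3, [-1000.0] * 3):
    print(log_softmax(xs))  # each entry should be close to -log(3)
print(-math.log(3))
```

Subtracting $c$ from $x_j$ first, rather than last, keeps the two large numbers from ever being added to a small one, which avoids a little rounding error.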

The Takeaway

By thinking through a few examples, we can reason about what will happen in general:

  • If none of the $x_i$ values would cause any stability issues, the “naive” version of LogSumExp would work fine. But the “improved” version also works.
  • If at least one of the $x_i$ values is huge, the naive version bombs out. The improved version does not. For the other $x_i$ values that are similarly huge, we get a good calculation. For other $x_i$s that are not huge, we will essentially approximate them as zero.
  • For large negative numbers, the signs get flipped and things work the same way.

So while things aren’t perfect, we get some pretty reasonable behavior most of the time and nothing ever blows up. I’ve created a simple python example where you can play around with this to convince yourself that things actually work fine.
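For readers who want something to poke at right away, the three cases above can be reproduced in a few lines. This is a stand-in sketch written for this section, not the example referred to above, and the function names are assumptions.

```python
import math


def naive_logsumexp(xs):
    return math.log(sum(math.exp(x) for x in xs))


def stable_logsumexp(xs):
    c = max(xs)
    return c + math.log(sum(math.exp(x - c) for x in xs))


# Case 1: moderate values, both versions agree.
small = [0.5, 1.5, 2.5]
print(naive_logsumexp(small), stable_logsumexp(small))

# Case 2: one huge value. exp(0 - 1000) underflows to 0.0, so the small
# entries are effectively ignored and the result is the huge value itself.
print(stable_logsumexp([1000.0, 0.0, 0.0]))

# Case 3: large negative values are shifted by c = -1000 before exponentiating.
print(stable_logsumexp([-1000.0, -1000.0, -1001.0]))
```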

So that’s a wrap on LogSumExp! It’s a neat little trick that’s actually pretty easy to understand once the mechanics are deconstructed. Once you know about it and the general numerical stability problem, it should demystify some of the documentation in libraries and source code.

To cement this in your mind (and get some math practice), I would wait awhile and then try to work out the math yourself. Also think through various examples in your head and reason about what should happen. Then run my code (or rewrite the code yourself) and confirm your intuition.


Getting Started With Machine Learning

Hello! Welcome to the Machine Learning section of the Feedly Blog. We’ve been doing more and more ML work here in the company and thought it would be interesting to jot down some of our thoughts. My background is in software engineering so a lot of posts will be about how to approach ML from an outsider’s perspective, aka keeping up with the cool kids.

By the way, we are building out our ML team here, so please drop us a note if you’re passionate about NLP. Anyways, on to the first post!

I originally went to college thinking I was going to work in computer hardware, maybe ending up at Intel or some company like that. But when I got to Carnegie Mellon, I found the computer science classes to be much more interesting and, not coincidentally, I seemed to do better in those classes. I really enjoyed almost every CS class I took. In fact, all but one: Intro to Machine Learning! I was really interested going into the course, but unfortunately, the professor seemed equally disinterested in teaching the course and was just not very good. So machine learning dropped off my radar for a good long while.

But a couple years ago, I noticed machine learning really gaining traction and my curiosity perked up all over again. This time, I started with a MOOC by Andrew Ng, who is a fantastic professor. The difference was night and day. I was immediately hooked and started scouring the web for more classes I could take. Here’s some advice and tips I’ve learned along the way.

Is it a Good Idea to Hop on the ML Bandwagon?

There’s no question Machine Learning is here to stay. There’s been a prolonged buzz about the field for awhile now and having followed developments closely, I can say there is substance behind the hype. There are some problems that machines are just better at solving than humans.

But that doesn’t make it the right fit for everyone. Working in machine learning is quite a bit different than other software engineering fields. It’s more research-y, more speculative. If you like to be able to plan out your work piece by piece and then have everything nicely tied up in a bow after x weeks, maybe it’s not a great fit. But if you enjoy working with data, continuously learning new techniques, and enjoy math (really) it might be a great career move.

How long will it take to get up to speed?

There are so many answers to this question. The first one that comes to mind is “forever”. Machine Learning is quite broad and moves incredibly fast. If, like me, you happen to need sleep you probably won’t be able to keep up with every development in the field. But another more optimistic answer might be 4 months at 10 hours/week. That, for example, should give you enough time to get through both parts of the excellent courses.

This is not a trivial commitment as it’s probably going to be on top of your ongoing work and life commitments. But I can personally attest that it is possible, and not really that hard if you are willing to put the work in.

What are some good courses?

This really depends on how you like to learn. I personally love machine learning because of the elegant combination of so many fields of mathematics and computer science: probability, linear algebra, calculus, optimization, and much more. So I was naturally drawn to academic courses.

A terrific academic course is Stanford’s CS231n course. The videos I watched were done by Andrej Karpathy, who is a terrific lecturer. The course assignments were also well structured and doable remotely. Although the course is mostly about image problems and convolutional networks, they do start “from scratch” and also cover both feedforward and recurrent networks.

If you enjoy a more practical approach, the courses are well done. There is no pretense here. Jeremy Howard approaches everything from a very grounded, systematic perspective and has designed the course so anyone with a moderately technical background can participate. Plus they’ve built up a nice community on their forums.

The aforementioned Andrew Ng has a new Coursera course sequence up (the course I took is quite outdated at this point). I haven’t personally tried it, but I’m sure there’s a lot of good stuff there. I would assume that everything there is also taught from a more practical perspective, but you can infer some of the math going on behind the scenes.

My recommendation would be to try a couple courses and pick the one that most captures your attention. But I would encourage you to eventually complete at least one practical course and one theoretical one. They complement each other quite nicely. To understand papers (and be warned: you will need to read academic papers), the academic courses will help you get acclimatized to the verbiage. To do projects, the practical courses will provide intuition on the dozens of decisions you have to make in any ML project.

If you need a math refresher or are looking for bonus points, MIT has a pair of great courses. A good grasp of probability is absolutely a must if you want to do any ML. The MIT class by John Tsitsiklis is amazingly well taught. The way the professor breaks down complex problems step by step until the answer just seems obvious is pure artistry.

The linear algebra class is also a fun one. The professor here is also very good and has a unique style. This one is not really necessary though; for most ML tasks you can get by with just understanding matrix multiplication.

What if I don’t know how to code?

Learn how to.

Most ML work is done in Python, which fortunately is pretty easy to pick up. And you don’t really need to be a world-class programmer to do most ML work. But I would still do a quick online course before doing any ML specific work. Having to learn coding and machine learning concepts (not to mention re-learning a bunch of math you’ve probably forgotten) all at once is a recipe for disaster. Give yourself a chance and pace yourself.

I’ve got a basic grasp on things, now what?

Well, now it’s time to start modeling! There are generally 2 ways to go: find a project at work/on your own or find a Kaggle competition. This depends on your particular situation, but I would recommend the Kaggle option. The main reasons are:

  1. The problem is defined. Structuring a real-life ML problem properly can sometimes be tricky. With Kaggle, this isn’t an issue.
  2. Similarly, building a data set can involve several hard-to-diagnose pitfalls. A Kaggle competition will provide you with data.
  3. With Kaggle, you’ll get a built-in community all working on the same thing. If you get stuck or need a little guidance, there’s a place to go.

On the other hand, if you have a problem at work that is tailor-made for an ML solution (image classification for example) then maybe a work project is a quick way to impress your coworkers and convince your boss to let you invest more time on Machine Learning.

So if you have been thinking about digging into Machine Learning, take the plunge! One of the best things about Machine Learning is that people are really generous with their time and knowledge. Once you get started you’ll find a great support system online to help you along the way.
