Learning Context With Item2vec

Feedly’s discovery tool helps you explore topics and sources in different ways. If you search for a topic, we show you related topics to dive deeper. If you follow a new source, we recommend other sources to pair with it.

We produce those recommendations with a collaborative filtering technique called item2vec. This post explains how we built it.

If you want to give feedback, please join our growing Mobile+AI Lab initiative. There is a channel dedicated to machine learning conversations in the Lab Slack.

What is collaborative filtering?

A collaborative filtering model learns from past behaviors to make predictions. Based on what someone already reads, the model suggests other sources and topics that follow a similar pattern.

Indeed, there is a good chance that any given source is co-read with a number of other related sources. These connections form a rich web of content and help us make sense of the open web.

Since we’re trying to compute preferences based on how sources are co-read, this can be framed as a collaborative filtering problem.

Our model learns which sources are often co-read and recommends complementary sources.

There are many existing collaborative filtering techniques. We chose one called item2vec, which is inspired by the popular word2vec model. Let’s start with a short recap of word2vec.

Relationships between words

Suppose you are given a list of sentences such as:

  • The capital city of France is Paris.
  • Spain’s capital, Madrid, is a nice city.
  • Rome, the capital of Italy, is a major city.

Imagine extending this list with similar sentences about capital cities in various countries. Do you notice any patterns?

Particular words are likely to appear in the same context. In typical writing and speech, Paris, Rome, and Madrid often appear near words such as city, capital, and their respective country names.

Word2vec is a shallow neural network model that trains on a large number of example sentences. To learn these patterns, the model represents each word with an object called a vector. Vectors are mathematical objects with useful geometric properties: for example, you can calculate how close two vectors are, or add and subtract vectors to produce a new vector.

Word2vec is good at translating the linguistic contexts of words in sentences into geometric properties of the corresponding vectors. If two words often appear in the same context (e.g., Rome and Paris), their word2vec vector representations will be close in the vector space. More complex relationships between words are also captured by their vector representations, as shown below in a 2-dimensional space.

Source: Mikolov, Sutskever, Chen, Corrado, Dean. “Distributed Representations of Words and Phrases and their Compositionality” (2013)
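To make this concrete, here is a minimal sketch using gensim’s downloader API with a pretrained embedding. We load GloVe vectors purely as a convenient stand-in for word2vec-style vectors; the exact dataset is an assumption for illustration, not what powers Feedly:

```python
import gensim.downloader as api

# Load a small set of pretrained word vectors. GloVe vectors are used
# here as a stand-in for any word2vec-style embedding.
vectors = api.load("glove-wiki-gigaword-50")

# Words that share contexts end up close together in the vector space.
print(vectors.similarity("paris", "rome"))    # relatively high
print(vectors.similarity("paris", "banana"))  # relatively low

# Vector arithmetic captures relationships between words:
# paris - france + italy ≈ rome
print(vectors.most_similar(positive=["paris", "italy"],
                           negative=["france"], topn=1))
```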

You can learn more about word2vec in Jan Bussieck’s great blog post on Deep Learning Weekly.

A useful implementation of word2vec

For our models, we used the Python library gensim, which provides an implementation of word2vec. It makes training our model fast and straightforward. It only needs a collection of sentences and a few important parameters, including the ones below (a short training sketch follows the list):

  • The size of your embedding: the number of dimensions your vector representation will have.
  • The window size: the number of adjacent words considered to be in the same context as a given word. That is, if a sentence is very long, the first and last words shouldn’t be considered part of the same context.
  • The type of word2vec algorithm to use: there are two models, skip-gram and CBOW.
  • The down-sample threshold: gives the training set more balance. Some words occur far more frequently than others, and this parameter accounts for that.
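As a rough illustration, training looks like this with gensim 4.x. This is a minimal sketch with a toy corpus and illustrative parameter values, not the values we use in production:

```python
from gensim.models import Word2Vec

# Toy corpus standing in for a large collection of sentences.
sentences = [
    ["the", "capital", "city", "of", "france", "is", "paris"],
    ["spain's", "capital", "madrid", "is", "a", "nice", "city"],
    ["rome", "the", "capital", "of", "italy", "is", "a", "major", "city"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # size of the embedding (named `size` in gensim < 4.0)
    window=5,         # window size: how many neighbors count as context
    sg=1,             # algorithm: 1 = skip-gram, 0 = CBOW
    sample=1e-3,      # down-sample threshold for very frequent words
    min_count=1,      # keep every word in this tiny toy corpus
)

# Every word in the corpus now has a learned vector.
print(model.wv["paris"].shape)  # (100,)
```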

Mapping items to vectors: item2vec

To build our recommender system, it’s helpful to have a vector representation of our “items” (sources or topics) along with their context. Indeed, by mapping every item to a vector, we can then use the distance between two item vectors to know how similar they are. This gives us a base to make recommendations.
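One common way to measure this closeness is cosine similarity. Here is a minimal sketch with made-up vectors; in practice the vectors come from the trained model:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional item vectors for two sources.
source_a = np.array([0.9, 0.1, 0.3])
source_b = np.array([0.8, 0.2, 0.4])
print(cosine_similarity(source_a, source_b))  # close to 1.0: similar items
```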

The next step is to capture the context of each source and topic. Going back to our example list of sentences, we saw that nearby words influence context.

Just like knowing nearby words helps us understand context while reading, the feeds created by Feedly users (groups of sources, sometimes called folders or collections) help us capture context for every source and topic. If a user has created a feed with three sources, A, B, and C, the model will learn that A often appears “near” B and C just like word2vec knows Paris often appears near city and France.

Using gensim’s word2vec model, we replace sentences of words with buckets of items. Since we want every pair of items in a bucket to be considered part of the same context, no matter how large the bucket is, we set the window size to the maximum bucket size.
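Putting it together, here is a minimal item2vec sketch. The bucket contents are hypothetical stand-ins; real buckets come from the feeds our users curate:

```python
from gensim.models import Word2Vec

# Hypothetical buckets: each list of co-curated source IDs plays the
# role a sentence plays in word2vec. These names are made up.
buckets = [
    ["techcrunch", "the-verge", "wired"],
    ["techcrunch", "wired", "ars-technica"],
    ["eater", "bon-appetit", "serious-eats"],
]

# Every pair of items in a bucket should share a context, so the
# window must cover the largest bucket.
max_bucket_size = max(len(bucket) for bucket in buckets)

model = Word2Vec(
    buckets,
    vector_size=100,
    window=max_bucket_size,
    sg=1,
    sample=1e-3,   # down-sample very frequent items
    min_count=1,
)

# Recommend sources that are often co-curated with a given source.
print(model.wv.most_similar("techcrunch", topn=2))
```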

The down-sample threshold is also important for our model. Some sources and topics appear far more frequently than others, and choosing a good down-sample parameter improves the quality of our recommendations.

If you have time to explore the new Discover experience, we’d love to hear your feedback.

Love the web? Love reading? To see more of this technology in action and join the community discussion, join the Feedly Mobile+AI Lab initiative.

Happy #BookLoversDay!

Books have the power to inspire, connect, and educate. Today in honor of Book Lovers Day, here are some of the books that have inspired the Feedly team as lifelong learners.

What’s on your must-read list right now? What recent read inspired you to see the world in a new way? Tweet at us, or comment below. We always respond.

Ultramarathon Man: Confessions of an All-Night Runner by Dean Karnazes

Petr says, “I liked the story and how passionate one can be about running and endurance and pursuing dreams. It inspired me to run longer distances.”

Grandma Gatewood’s Walk by Ben Montgomery

Emily says, “I felt a connection to this 67-year-old woman who lived and worked on farms all her life before deciding she needed to hike the 2,050-mile Appalachian Trail. The suffering she happily endured on the trail must have been a welcome relief from the darkness of her past.”

Evicted by Matthew Desmond

Victoria says, “This is one of my faves because of the empathy and understanding it creates within you as you experience the loss of eviction through the eyes of the evicted. It’s a powerful piece on how to better take care of your neighbors.”

The Story of a Shipwrecked Sailor by Gabriel García Márquez

Eduardo says, “It’s easily one of my favorite books. The struggle of the guy who was adrift at sea … he never lost hope. You could almost feel what he was feeling. That’s the vividness of the writing.”

Barbarian Days by William Finnegan

Remi says, “Finnegan has a way of pulling his reader into what a life of pursuing their obsession and journeying all over the world really feels like. Bonus points for the years in South Africa, which bring it back to a moment in history … beautifully written, permeating passion all the way through.”

Les Fleurs Du Mal (The Flowers of Evil) by Charles Baudelaire

Guillaume says, “It has the best reread value of any book I know. Every piece is incredibly beautiful and well written, and the whole volume oozes a sort of calm melancholy that always gets me.”

Le Mythe de Sisyphe (The Myth of Sisyphus) by Albert Camus

David says, “This was one of the most pivotal books in my life.”

Thanks for reading!

Here are some of our most-loved books. What are yours?