Feedly’s discovery tool helps you explore topics and sources in different ways. If you search for a topic, we show you related topics to dive deeper. If you follow a new source, we recommend other sources to pair with it.
We produce those recommendations with a collaborative filtering technique called item2vec. This blog post gives you more information on how we achieved this.
What is collaborative filtering?
A collaborative filtering model learns from past behaviors to make predictions. Knowing what someone already reads, the model will suggest other sources and topics that follow a similar pattern.
Indeed, there is a good chance that any given source is co-read with a number of other related sources. These connections form a rich web of content and help us make sense of the open web.
Since we’re trying to compute preferences based on how sources are co-read, this can be framed as a collaborative filtering problem.
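To make "co-read" concrete, here is a minimal sketch that counts how often two sources appear in the same reader's collection. The feed data and the `co_read_counts` helper are hypothetical and only illustrate the signal a collaborative filtering model learns from; this is not the model described in this post.

```python
from collections import Counter
from itertools import combinations


def co_read_counts(feeds):
    """Count how many feeds contain each unordered pair of sources."""
    counts = Counter()
    for feed in feeds:
        # De-duplicate and sort so that (A, B) and (B, A) share one key.
        for pair in combinations(sorted(set(feed)), 2):
            counts[pair] += 1
    return counts


# Hypothetical feeds: each inner list is one reader's group of sources.
feeds = [
    ["A", "B", "C"],
    ["A", "B"],
    ["B", "D"],
]

counts = co_read_counts(feeds)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with high counts are sources that readers tend to group together, which is the kind of pattern the recommendations build on.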
Relationships between words
Suppose you are given a list of sentences such as:
- The capital city of France is Paris.
- Spain’s capital, Madrid, is a nice city.
- Rome, the capital of Italy, is a major city.
Imagine extending this list with similar sentences about capital cities in various countries. Do you notice any patterns?
Particular words are likely to appear in the same context. In typical writing and speech, Paris, Rome, and Madrid often appear near words such as city and capital, and near their respective country names.
Word2vec is a shallow neural network model that trains on a large number of example sentences. To learn these patterns, the model represents each word with an object called a vector. Vectors are mathematical objects that have interesting geometrical properties. For example, you can calculate how close two vectors are, or you can add or subtract vectors to produce a new vector.
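The vector operations mentioned above can be sketched with a few hand-picked 2-dimensional vectors. The numbers below are assumptions chosen for illustration, not output from a trained word2vec model, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between u and v (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def subtract(u, v):
    return [a - b for a, b in zip(u, v)]


# Toy vectors, made up for this example (not learned from text).
vectors = {
    "Paris": [1.0, 3.0],
    "France": [1.0, 1.0],
    "Rome": [3.0, 3.2],
    "Italy": [3.0, 1.0],
    "Madrid": [5.0, 3.1],
    "Spain": [5.0, 1.0],
    "banana": [-2.0, -1.0],
}

# "Paris is to France as ? is to Italy": Paris - France + Italy.
query = add(subtract(vectors["Paris"], vectors["France"]), vectors["Italy"])

# Exclude the three input words, then pick the closest remaining vector.
candidates = {w: v for w, v in vectors.items() if w not in {"Paris", "France", "Italy"}}
answer = max(candidates, key=lambda w: cosine_similarity(query, candidates[w]))
print(answer)
```

With these toy numbers the closest remaining word is Rome, and Paris is much closer to Rome than to an unrelated word such as banana.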
Word2vec is good at translating the linguistic contexts of words in sentences into geometrical properties of the corresponding vectors. If two words often appear in the same context (e.g., Rome and Paris), their word2vec vector representations will be close in the vector space. More complex relationships between words are also captured by their vector representations, as shown below in a 2-dimensional space.
You can learn more about word2vec in Jan Bussieck’s great blog post on Deep Learning Weekly.
A useful implementation of word2vec
For our models, we used the Python library gensim, which provides an implementation of word2vec. It makes training our model fast and straightforward. It only needs a collection of sentences and a few important parameters, including:
- The size of your embedding: the number of dimensions your vector representation will have
- The window size: the number of adjacent words considered to be in the same context as a given word. That is, if a sentence is very long, the first and last words shouldn’t be considered part of the same context in the model.
- The type of word2vec algorithm to use. There are two models: skip-gram and CBOW (continuous bag-of-words).
- The down-sample threshold: controls how often very frequent words are randomly dropped during training, which gives the training set more balance. Some words occur more frequently than others, and this parameter accounts for that.
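The parameters listed above map onto gensim's `Word2Vec` constructor roughly as follows. This is a minimal sketch assuming gensim 4.x argument names (older releases call `vector_size` `size`); the values are placeholders for illustration, not the settings we used, and `train_word2vec` is a hypothetical helper.

```python
# Placeholder values, chosen only to illustrate each parameter.
WORD2VEC_PARAMS = {
    "vector_size": 50,  # size of the embedding (number of dimensions)
    "window": 5,        # how many neighbouring words count as context
    "sg": 1,            # 1 = skip-gram, 0 = CBOW
    "sample": 1e-3,     # down-sample threshold for frequent words
    "min_count": 1,     # keep every word, since the toy corpus is tiny
    "workers": 1,
    "seed": 42,
}


def train_word2vec(sentences, params):
    """Train a gensim Word2Vec model on tokenised sentences."""
    # Imported lazily so the rest of the sketch runs without gensim installed.
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, **params)


# gensim expects each sentence as a list of tokens.
sentences = [
    "the capital city of france is paris".split(),
    "spain's capital madrid is a nice city".split(),
    "rome the capital of italy is a major city".split(),
]

if __name__ == "__main__":
    model = train_word2vec(sentences, WORD2VEC_PARAMS)
    print(model.wv["paris"].shape)
```

Three sentences are far too few to learn meaningful vectors; the point is only to show where each parameter goes.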
Mapping items to vectors: item2vec
To build our recommender system, it’s helpful to have a vector representation of our “items” (sources or topics) along with their context. Indeed, by mapping every item to a vector, we can then use the vector distances between two items to know how similar they are. This gives us a base to make recommendations.
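Once every item has a vector, recommending means ranking the other items by how close they are. Here is a minimal sketch using cosine similarity; the item names, the 2-dimensional vectors, and the `recommend` helper are all made up for illustration.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recommend(item, item_vectors, n=3):
    """Return the n items whose vectors are most similar to `item`."""
    target = item_vectors[item]
    others = [name for name in item_vectors if name != item]
    others.sort(key=lambda name: cosine_similarity(target, item_vectors[name]), reverse=True)
    return others[:n]


# Hypothetical item vectors; a real model would learn these from data.
item_vectors = {
    "tech_blog_a": [0.9, 0.1],
    "tech_blog_b": [0.8, 0.2],
    "cooking_blog": [0.1, 0.9],
    "food_magazine": [0.2, 0.8],
}

print(recommend("tech_blog_a", item_vectors, n=2))
```

In this toy data the other tech blog ranks first because its vector points in nearly the same direction as the query item.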
The next step is to capture the context of each source and topic. Going back to our example list of sentences, we saw that nearby words influence context.
Just like knowing nearby words helps us understand context while reading, the feeds created by Feedly users (groups of sources, sometimes called folders or collections) help us capture context for every source and topic. If a user has created a feed with three sources, A, B, and C, the model will learn that A often appears “near” B and C just like word2vec knows Paris often appears near city and France.
Using gensim’s word2vec model, we replace sentences of words with buckets of items. Since we want every pair of items in our buckets to be considered as part of the same context, no matter what the size of the bucket is, our window size is the maximum bucket size.
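The bucket idea can be sketched as follows: each feed becomes one "sentence" of item identifiers, and the window is set to the size of the largest bucket. This assumes gensim 4.x argument names; the feeds, the helper names, and the parameter values are hypothetical and are not our production settings.

```python
def max_bucket_size(buckets):
    """Size of the largest bucket, used as the word2vec window."""
    return max(len(bucket) for bucket in buckets)


def train_item2vec(buckets, vector_size=32, sample=1e-3):
    """Train word2vec on buckets of items instead of sentences of words."""
    # Imported lazily so the rest of the sketch runs without gensim installed.
    from gensim.models import Word2Vec

    return Word2Vec(
        sentences=buckets,
        vector_size=vector_size,
        window=max_bucket_size(buckets),  # whole bucket counts as context
        sg=1,           # skip-gram, an assumption made for this sketch
        sample=sample,  # down-sample threshold for very frequent items
        min_count=1,    # keep every item, since the toy data is tiny
        workers=1,
        seed=0,
    )


# Hypothetical feeds: each inner list is one user's group of sources.
feeds = [
    ["source_a", "source_b", "source_c"],
    ["source_a", "source_b"],
    ["source_c", "source_d", "source_e", "source_a"],
]

if __name__ == "__main__":
    model = train_item2vec(feeds)
    # most_similar ranks the other items by cosine similarity.
    print(model.wv.most_similar("source_a", topn=3))
```

With only three feeds the similarities are essentially noise; a useful model needs many buckets.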
The down-sample threshold is also important for our model. Some sources or topics appear far more frequently than others, and without down-sampling they would dominate the training data. Choosing a good down-sample parameter improves the quality of our recommendations.
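To give a feel for what the threshold does, here is a sketch of the down-sampling rule from the original word2vec paper, where an item with relative frequency f is discarded with probability 1 - sqrt(t / f) for a threshold t. This is an illustration under that assumption; the exact formula inside a given library may differ slightly, and `discard_probability` is a hypothetical helper.

```python
import math


def discard_probability(frequency, threshold=1e-3):
    """Probability of dropping one occurrence of an item during training.

    `frequency` is the item's share of all occurrences (between 0 and 1).
    Items rarer than the threshold are never dropped.
    """
    if frequency <= 0:
        raise ValueError("frequency must be positive")
    return max(0.0, 1.0 - math.sqrt(threshold / frequency))


# A very common item versus a rare one, with the same threshold.
print(discard_probability(0.1))
print(discard_probability(0.0001))
```

A lower threshold drops frequent items more aggressively, so very popular sources have less influence on the vectors of everything they appear next to.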
If you have time to explore the new Discover experience, we’d love to hear your feedback.
Love the web? Love reading? To see more of this technology in action and join the community discussion, join the Feedly Mobile+AI Lab initiative.