We love the Web because it is an open and distributed network that offers everyone the freedom and control to publish and follow what matters to them.
We also love the Web because it has enabled a new generation of content creators (Ben Thompson, Bruce Schneier, Tina Eisenberg, Seth Godin, Maria Popova, etc.). Those independent thinkers continuously explore the edge of the known and share insightful and inspiring ideas with their communities.
Connecting people to the best sources for the topics that matter to them has been core to our mission since the very start of Feedly.
But discovery is a hard problem. The web is organic, a reflection of the global community’s changing needs and priorities. There are millions of sources across thousands of topics and we all have a different appetite when it comes to feeding our minds.
About twelve months ago, we created a machine learning team to see if the latest progress in deep learning and natural language processing could help us crack this nut.
Today, we are excited to give you a preview of the result of that work with the release of the new discovery experience in the Feedly Lab app (Experience 06).
Two thousand topics
The first discovery challenge is to create a taxonomy of topics.
You can think of Feedly as a rich graph of people, topics, and sources. To build the right taxonomy, we started with the raw data on all of Feedly’s sources. We had to create a model to clean, enrich, and organize that data into a hierarchy of topics. Learn more about the data science behind this.
The result is a rich, interconnected network of two thousand English topics. And it maps well to how people expect to explore and read on the Web.
On the discovery homepage, we showcase thirty topics based on popular industries, trends, skills, or passions. You can access all of the topics in Feedly via the search box.
The fifty most interesting sources
The second discovery challenge is to find the fifty most interesting sources someone researching any topic might want to follow.
Ranking sources is hard because not all sources are equal. In tech, for example, you have mainstream publications like The Verge or TechCrunch, expert voices like Ben Thompson, and lots of noisy B-list sources that don’t add much value.
In addition, for niche topics like virtual reality, some sources are specific to VR while others cover a range of related topics.
To solve this challenge, we created a model which looks at sources through three different lenses:
- follower count
- relevance (how focused is the source on the given topic)
- engagement (a proxy for quality and attention)
The outcome is new search result cards. You can explore the fifty most interesting sources for a given topic and sort them using the lens that is most important to you.
One of the benefits of the new topic model is that the 2,000 topics are organized in a hierarchy. This makes it easy for you to zoom in or out and explore many different neighborhoods of the Web.
For example, from the cybersecurity topic, you can jump to a list of related topics that let you dig deeper into malware, forensics, or privacy.
One more thing…
We have done a lot of research over the last four years to understand how people discover new sources. One insight we learned is that people often co-read certain sources. For example, if you are interested in art, design, and pop culture and you follow Fubiz, there is a high chance that you also follow Designboom.
With that in mind, we spent some time creating a model that learns what sources are often co-read. The idea is that a user could enter a source that they love and discover another source they could pair it with.
You can learn more about the machine learning model (we call it feed2vec) powering this experience through the article Paul published here.
As a user, you can access this feature by searching in the discover page for a source you love to read. The result will be a list of sources which are often co-read with that source.
I would like to thank Paul, Michelle, Mathieu, and Aymeric for the great research work they did to take this project from zero to one. People who have tried to tackle discovery know that it is a very hard challenge and the results of this project have been very impressive.
We would also like to thank the community for participating in the Battle of the Sources experiment. Your input was key in helping us learn how to model the source ranking. We are going to continue to invest in discovery and we look forward to continuing to collaborate with you.
We would also like to thank Dan Newman, Daron Brewood, Enrico, Joey, Lior, Paul Adams, Ryan Murphy, and Joseph Thornley from the Lab for reviewing an earlier version of this article.
One of the web’s greatest strengths is its open and distributed nature. This is also a big challenge: With millions of sites publishing on thousands of topics, how do people navigate that content and discover new trustworthy sources?
Our solution to this challenge in Feedly involves using data science to organize all of those sources and help people navigate through topics.
This post covers some of the technology behind the new discovery experience in Feedly and what I’ve learned through this project.
Learning topics with user-generated data
When you follow a news site or blog in Feedly, you have to put it in a feed. Using anonymized data on how people name their feeds, I automated the process of creating our new English-language topic taxonomy.
So, if you’re one of the 45,000 people who have a feed called “tech” where you added both The Verge and Engadget, you helped create the “tech” topic.
There were still some problems with this list of topics, mainly duplicates and “trash topics.”
To understand how I trained the model to recognize topics, think of a matrix, or table, with data about topics and sources.
Did you notice “My favorites” in row 6 above? It’s a great example of a trash topic because it isn’t descriptive. You might also have noticed “tech” and “technology” are duplicates since they mean the same thing. If we expand our matrix to 10,000+ topics and 100,000+ sources, we would see many other trash topics and duplicates like these.
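That topic-by-source matrix can be sketched as a simple counting step over anonymized (feed name, source) pairs. The feed names and counts below are made up for illustration, not real Feedly data:

```python
from collections import Counter

# Hypothetical anonymized records: (feed name, source added to that feed).
records = [
    ("tech", "theverge.com"), ("tech", "engadget.com"),
    ("tech", "theverge.com"), ("technology", "theverge.com"),
    ("my favorites", "theverge.com"), ("my favorites", "xkcd.com"),
]

sources = sorted({source for _, source in records})
counts = Counter(records)

# One row per topic, one column per source; each cell counts how many
# people filed that source under a feed with that name.
matrix = {
    topic: [counts[(topic, source)] for source in sources]
    for topic in sorted({topic for topic, _ in records})
}
```

With the real data, each row becomes one of the topic vectors discussed below.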
So how can we get rid of all of the trash and duplicate topics and highlight the good ones? This is where cleaning the data is important.
In the table above, each row is an array of numbers, also known as a vector. A row where all of the numbers are homogeneous indicates a trash topic, whereas a row with peaks for certain websites indicates a good topic.
Here’s a graph to illustrate the difference:
We can detect those trash topics by measuring the peaks in the corresponding graph. Translating this into vector properties, we can, for instance, measure the ratio between the greatest value (the height of the peak) and the number of non-zero values (the footprint).
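Turning that peak heuristic into code might look like the sketch below; the `10.0` threshold is an illustrative assumption, not a production value:

```python
def peak_ratio(vector):
    """Ratio between the tallest peak and the number of non-zero cells."""
    footprint = sum(1 for v in vector if v > 0)
    return max(vector) / footprint if footprint else 0.0

def looks_like_trash(vector, threshold=10.0):
    # A flat, homogeneous row (low peak, wide footprint) scores low;
    # a topic with a few strong sources scores high. The threshold
    # here is illustrative only.
    return peak_ratio(vector) < threshold

good = [50000, 30000, 5, 2]      # a few strong peaks, like "tech"
trash = [3, 2, 3, 2, 2, 3]       # flat, like "My favorites"
```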
Similarly, here is a graph showing duplicate topics:
To detect these duplicate topics, we also use vector properties. In our example, the values in the vectors for “Tech” [50000, 30000, 5, 2] and “Technology” [12000, 7500, 2, 0] are very similar after normalization (turning those absolute numbers into percentages). To find the similarity between two vectors, I used the Jensen-Shannon divergence method.
Now that we’ve detected that these vectors are quite similar, we can safely merge the two topics in our system and redirect users to the “tech” topic if they search for “technology.”
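The normalization and similarity test can be sketched with a plain implementation of the Jensen-Shannon divergence, using the “Tech” and “Technology” vectors from the example above. A divergence near 0 means near-duplicates; the 0.01 merge threshold is an assumption for illustration:

```python
import math

def normalize(vector):
    """Turn raw follower counts into a probability distribution."""
    total = sum(vector)
    return [v / total for v in vector]

def kl(a, b):
    """Kullback-Leibler divergence, in bits."""
    return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

def jensen_shannon(p, q):
    """Symmetric, bounded divergence between two count vectors."""
    p, q = normalize(p), normalize(q)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return (kl(p, m) + kl(q, m)) / 2

tech = [50000, 30000, 5, 2]
technology = [12000, 7500, 2, 0]

# A very small divergence suggests the two topics should be merged.
should_merge = jensen_shannon(tech, technology) < 0.01
```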
Thanks to the large community of English-language readers using Feedly, we’re able to transform all that data into a clean, de-duplicated list of over 2,500 good topics, which you can visualize on the graph below.
We’re happy to report that our taxonomy goes deep enough to contain “mycology,” the science of fungi!
Topic tree: Building a hierarchy
Now that our sources have rich topic labels, the next challenge was to introduce a better ordering system to connect related topics.
Some topics are general (“tech”) whereas others are more specific (“iPad”). Having an internal representation of the hierarchy underlying the topics, where “iPad” is a subtopic of “Apple” and “Apple” is a subtopic of “Tech”, is useful to compute recommendations.
To build this hierarchy, we use pattern matching. The graph below shows connections between three topics (on the left side) and sources related to those topics (right side). The thicker the line, the more people added the source to a feed with that name in Feedly.
The pattern in this example confirms that people use “tech” and “technology” in much the same way. The lines from “Technology” are thinner because people use that term less. But these two topics are duplicates. Meanwhile, “Apple” appears to be a subtopic of “tech”: it connects to fewer sources and its connections are also related to “tech.”
By detecting those patterns, we are able to construct a tree structure for all of our topics and subtopics.
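One way to sketch the subset pattern is a containment score: a topic counts as a subtopic when most of its follower weight lands on sources the broader topic also covers. The topic dictionaries and the 0.8 cutoff below are illustrative assumptions:

```python
def containment(child, parent):
    """Fraction of the child topic's weight covered by the parent's sources."""
    total = sum(child.values())
    shared = sum(w for src, w in child.items() if src in parent)
    return shared / total if total else 0.0

def is_subtopic(child, parent, cutoff=0.8):
    # "Apple" counts as a subtopic of "tech" if most of the sources
    # people file under "Apple" are also filed under "tech".
    return containment(child, parent) >= cutoff and len(child) < len(parent)

# Hypothetical topic -> {source: follower weight} mappings.
tech = {"theverge.com": 50000, "engadget.com": 30000, "daringfireball.net": 8000}
apple = {"daringfireball.net": 6000, "theverge.com": 4000}
```

Applying this test pairwise across all topics yields the parent-child edges of the tree.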
Today, when you visit Feedly’s Discover page, you’ll find a list of featured topics. Click on any of them to start exploring. The related topics are there to help you navigate deeper into this hierarchy.
Ranking recommended sources for each topic
Once we built our topics and arranged them into a hierarchy, we still needed to figure out which sources to recommend and in which order. There are three criteria we wanted to optimize for:
- Relevance — the proportion of users who added the source in the topic versus users who added the source in another topic
- Follower count — how many Feedly users connect to this source
- Engagement — a proxy for quality and attention
The first two criteria are straightforward. People expect to see popular websites that are also relevant to the topic they are exploring, and there is often a trade-off to strike between the two metrics.
The third criterion is more subjective. It should reflect the quality of the website, independently of the absolute number of users reading it. Indeed, we believe some niche websites can have fewer readers but better content.
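A minimal sketch of mixing the three criteria into one ranking score, with each criterion normalized first so that raw follower counts don’t dominate. The weights and the candidate sources are hypothetical:

```python
def rank_sources(sources, weights=(0.4, 0.4, 0.2)):
    """Rank sources by a weighted mix of relevance, followers, engagement."""
    def normalize(values):
        # Scale each criterion to [0, 1] relative to its maximum.
        top = max(values) or 1
        return [v / top for v in values]

    relevance = normalize([s["relevance"] for s in sources])
    followers = normalize([s["followers"] for s in sources])
    engagement = normalize([s["engagement"] for s in sources])

    w_rel, w_fol, w_eng = weights
    scored = [
        (w_rel * r + w_fol * f + w_eng * e, s["name"])
        for r, f, e, s in zip(relevance, followers, engagement, sources)
    ]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical candidates for a niche topic.
candidates = [
    {"name": "niche-expert.blog", "relevance": 0.9, "followers": 4000, "engagement": 0.8},
    {"name": "big-mainstream.com", "relevance": 0.3, "followers": 900000, "engagement": 0.4},
    {"name": "b-list-noise.net", "relevance": 0.2, "followers": 8000, "engagement": 0.1},
]
```

The engagement weight lets a focused, well-read niche site rank close to a much larger mainstream one.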
“Battle of the Sources” experiment
To compute an engagement score, we ran an experiment with the Feedly community. We chose a few sources related to the topic “tech” and asked users to vote for the one they enjoy more.
We collected 25,000 votes within a week and produced a ranking of those websites. We looked for features that were most correlated with what the users seemed to designate as the best websites.
In the table below, for example, we show the correlation between the score a source received and the average time people spend reading it (“read_time”); that correlation is roughly 0.45. Since the correlation is positive, the higher the score, the longer people tend to spend on that source. All of the other features of interest in this example also show a positive correlation, because they are all indicators of what a good source might be. This approach lets us select the features that are most correlated with the results of the vote, and then combine them with weights to give the best sources a little boost.
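The correlation step itself is a plain Pearson coefficient; the vote scores and read times below are invented for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-source vote scores and average read times (seconds).
vote_score = [0.9, 0.7, 0.4, 0.2]
read_time = [120, 95, 60, 30]

# A strongly positive value means read time is a useful engagement feature.
correlation = pearson(vote_score, read_time)
```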
Thank you to everyone who voted in the “Battle of the Sources” experiment. You can see the results of this work by exploring the featured topics on the Discover page, or by searching for your favorite topic.
Generate sources that ‘you might also like’ and more ‘related topics’
While the “related topics” list is partly derived from the hierarchy described above, we also complete it with other related topics surfaced by a collaborative filtering technique. You can learn more about how we did this in this post.
We use the same technique to recommend sources “You Might Also Like” from sources you already follow.
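In the same spirit, a co-read recommender can be sketched as co-occurrence counting over follow lists. The follow lists here are hypothetical, and the production model (feed2vec) learns embeddings rather than raw counts:

```python
from collections import Counter

# Hypothetical follow lists, one set of sources per (anonymized) user.
follows = [
    {"fubiz.net", "designboom.com", "colossal.com"},
    {"fubiz.net", "designboom.com"},
    {"fubiz.net", "colossal.com"},
]

def also_read(source, follow_lists, top_n=2):
    """Sources most often followed alongside `source`."""
    co_counts = Counter()
    for user_sources in follow_lists:
        if source in user_sources:
            # Count every other source this reader follows.
            co_counts.update(user_sources - {source})
    return [s for s, _ in co_counts.most_common(top_n)]
```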
Many thanks to the Feedly community for the direct or indirect contribution to the discovery project. Have fun exploring!