Hey Reddit, let's make some recommendations!
TLDR: Just read the bold sections and skim the rest.
There is a feature for casual discovery that is missing from Reddit that you will find on just about every other regular news site: recommended articles.
Say you are reading about Putin on Ukraine and you are like, I want to see some other things that Putin has said about Ukraine. What do you do? Google it? Wouldn't it be nice if Reddit made topical related content like this easy to find?
This isn't a novel idea, so I am sure it has come up before at Reddit. Probably many times. So why don't they have this feature?
Because it's hard to do for a site like Reddit.
They can't just install a Wordpress plugin to get this functionality. Or a widget. Or even just throw everything into ElasticSearch/SOLR and use the related documents feature. Or use standard clustering techniques. I don't think there is anything they can use off-the-shelf at all.
Let's break it down into 2 standard techniques so we can better understand why this is a difficult problem.
Technique 1: Contextual analysis. Basically you analyze the terms in a document and look for other documents with those same terms. By it's nature you will have lots of duplicate or nearly duplicate articles submitted to Reddit, so this technique will produce articles that are exactly the same or almost exactly the same as the source. This is not helpful by itself. We want to see different content that is related by topic, not just more of the same.
Technique 2: Collaborative filtering. You analyze the browsing patterns of all users to show them highly correlated posts by click. Basically you show people posts that have had the highest (relative) number of clicks by users that viewed the current post. If the algorithm is tuned poorly we will just get content from the homepage. If tuned as well as you can, you will still not necessarily get posts related by topic. You will just get more interesting stuff that the same people found interesting. Which is ok, but not exactly what we want.
So it sounds like we just need to mix the two techniques, right? Well, yes. That's what you do. Good idea. So why haven't they done this?
It turns out this is pretty hard for a site like Reddit. There is a lot of content linked to and created on Reddit, and a lot is duplicated content. And there are a lot of users, roughly 200 million unique visitors per month visiting over 7 billion pages. That's a lot of data. And it never stops coming. You can't just Hadoop it because that takes too long. We need that data processed continuously and indexed so we can deliver results asap.
So how do we solve this problem?
We create 2 indexes. One for the T1 and one for T2. And we expire older data because A. we don't want old stuff and B. there is just too much to process if we look at all of it all the time. We get a list of results for each technique and then we combine the scores. Sounds easy enough, right?
T1 is actually not too bad. We can actually just use ElasticSearch or SOLR and index the content that people link to, and use their built in MLT (More Like This) functionality. This gives ok results, but I prefer to use my own homegrown algorithm that uses some NLP to pull out noun-phrases and indexes those rather than simply words. "United" is a much less descriptive term than "United States". It also helps to determine a relative importance of those terms using some sort of holistic analysis of the document (and all documents), but I won't go into those details. That's my secret sauce. But the built-in stuff will do an ok-ish job.
T2 is very tricky. We need something like Amazon's "people that bought this also bought these" algorithm, but we need it updated MUCH faster. Amazon probably updates this type of recommendation on the order of once or twice per day. And that's plenty because their products don't change that often. So they probably just Hadoop it. In order to be effective we need to update our recommendations roughly once per minute for all recent posts. To do this we need to be creative.
I am sure there are many approaches we can take to this problem, but here is what I would do given the relatively unique constraints.
We will have hundreds of millions of new data points each day to index. An ideal scenario is that we can compare the people that clicked/commented/voted/etc on a post to the people who did so on every other post. Basically we compare every document to every other document. This N^2 complexity is probably not going to work, but let's do some quick calculations with some hopefully correct assumptions to make sure. There have been 190 million posts to Reddit in its life up to June 2015, so let's make an assumption that they get roughly 250k per day. They get about 7 billion page views per month, or about 230 million per day. I am sure a lot of that is just to the homepage which isn't very helpful, so let's make an assumption that they get roughly 100 million valuable clicks/comments/votes/etc on posts each day. That is an average of 400 for each of the posts. So we need to compare a list (matrix) of 400 things to another 400 things 250,000 * 250,000 times. If each comparison takes one millisecond (a very rough guess) we are still talking about 62,500,000 seconds, or 723 days on a single core machine. And really we want to hold data for more than one day which causes our compute time to go up exponentially, so this method would take thousands of machines to be effective. Brute force is no the way to go.
So we need some serious shortcuts. My gut says we can efficiently precalculate enough lists on one or a handful of machines and then cache those lists in memcached for super quick final scoring. Let's see if I'm right.
First off, let's put everything into memory. 100 million document ids is only 800 megabytes if we store it as primitive longs. Add overhead for list objects and we are still very much ok. Maybe a handful of gigs. Let's think about what our document comparison is so we can find a more efficient way of finding documents to compare. Let's simplify things and say that each document represents a post, and the document is just a list of userids that have interacted with the document. Hmmm... smells ripe for an inverted index to me. Why compare every document to every other document when you really only need to compare the ones that will actually have a match?
So we create a second index that would consist of users with the posts that they interacted with. This doubles our memory footprint, but still we are still only dealing with a handful of gigabytes for one day. When we want to find the other documents that have the most relatively co-occurring users we start with the document which is a list of users, then look up all the user documents to find the lists of other posts they have interacted with. Add them all up, divide by the total number of times each post has been interacted with (kinda like tf-idf, for a relative similarity score) and voila, we have a list of similar posts with scores attached. The best part is, we have in-memory super fast lookups and probably on average produce a list of related posts for each post in single digit milliseconds (although it depends on how many days of data you want to hold of course). My gut says roughly 5 ms on average because most posts will have very few interactions. And my gut is usually good for its word.
Ok now I am feeling a little self conscious about my gut, so I wrote some code to validate my assumptions. I simulated up to 160 random document interactions for 30 million users for a total of about 166 million connections (it was slightly different every time I ran it). Then I wrote a simple, single-threaded version of the algorithm and ran it on a E5-2620 server. For this test the average calculation was done in 0.57 milliseconds. There will be some additional overhead for the more complex algorithm, but most of the work is done in this simple version. The amount of data will increase for the production version, and complexity will increase some, but I think my 5ms guess is pretty darn good.
If we get a sweet server (like a dual e5-2699 with 36 total cores!) we should be able to do a full compilation of recommendations for recent articles in less than a minute on one server.
Ok, now we have both T1 and T2 indexed, the next step is to run the final calculation and cache it in memcached. We want the front end to be able to grab this data as quickly as humanly possible, and that's what memcached is for. So we iterate through our post list and run the T1 query and grab the T2 data from memcached and combine the scores. How long will that take?
Well, for T1 it depends on how big our ElasticSearch/SOLR cluster is, but given we are really only indexing only roughly 1 million documents we should be able to tune the servers to return results in 1ms, and I expect it to scale well so we can hit it with 8 threads and not impact performance much. Especially since we will be doing the same queries over and over, and although the data will change fairly often, queries will be highly cacheable. This is a gut call again! There are alternatives to make things faster if this doesn't pan out, including writing our own algorithm to run in memory. This would be extremely fast, so I am not worried.
For T2 we will be able to retrieve data from memcached in sub millisecond time. No problem there. Especially if we hit it multithreaded so that the memcached protocol can combine multiple fetches into one request. Memcached rocks in this respect.
So then we have a machine run 8 threads to step through 1 million documents. Let's say we have an average of 1.5ms per request, so that means we can cache results for those 1 million documents in less than 90 seconds. Luckily we can easily make it faster by doubling our ElasticSearch cluster and doubling our threads. We actually can probably up the thread count beyond 8 without the ES cluster expansion because some of our time will be spent in network latency. So anyways, doubling up servers gives us a refreshed set of results in 45 seconds! We have hit our goal! Total number of servers required: 5ish to 10. And I think that is relatively conservative, I bet given some real time to optimize and not using ElasticSearch I could do it in 1 beefy server.
I am looking for opportunities. Reddit, I think you should hire me to build this. I think it would be fun. And when I am done with this I could overhaul your search. I have lots of ideas there.
Still, if you aren't Reddit and you have some similar opportunities, let's talk! I am a data engineer that likes tackling tough problems I also build big search engines like Twicsy (>5 billion pictures) and PersoniFind (>500 million people). I will science when necessary, and I can build systems like these to be highly effective and fast, but then to optimize the algorithm I would consult a data scientist. I have also co-founded a few startups and have been known to CTO.