Dumbfounder's Blog @dumbfounder - Tumblr Blog

How I targeted the Reddit CEO with Facebook ads to get an interview at Reddit

A bit over 2 years ago I was coming off a failed startup and out looking for new opportunities. I am not a fan of looking for jobs; my style is to target interesting companies and convince them to let me join them in their quest. So I decided to target Reddit because it looked like they were making some new moves. And, because, it’s Reddit!

I hatched the hare-brained scheme to write a blog article that might catch the attention of a particular person, get them to read it somehow, and then say “hey kid, I like the cut of your jib, do you want to come help us conquer the world?”. Sure, that person is actually younger than me, but you get the point.

I wanted to work at Reddit. Well, not just work at Reddit, but develop something very cool at Reddit that I think can make a huge impact on the business. So I wrote about it. I knew that the CEO of Reddit was a technical founder, so I put a great deal of effort into that blog entry to try to impress him. Now, how to get him to see it?

My first thought was email. I can just email him! But that’s boring. What about getting my article to the top of Hacker News? I bet the CEO (and founder) of Reddit still frequents Hacker News. It is an amazing community, built off an early version of Reddit, and run by Y-Combinator (which incubated Reddit way back when). The problem is, I didn’t think my article would be interesting enough to a large crowd to make it to the top of HN.

Then I remembered a trick a friend of mine taught me a while back. That friend (who shall remain nameless, unless he wants me to name him) was running a startup and was using a very interesting technique to increase his chances of closing deals. Who wants to buy from a company they have never heard of? Not many people. His strategy was to target prospects (as directly as possible) with Facebook ads about his product so that when he called or met with them, they would already know the name. They probably didn’t know why they knew, but they had seen the name before, and that meant a great deal when meeting or speaking for the first time.

So I decided I would target the CEO of Reddit with Facebook ads.

But how? I didn’t have a big budget so I needed to be clever.

It turns out the Reddit CEO had a public Facebook profile, so I could go there to see details about him. Where he lived. What he was interested in. I took that info to the Facebook platform to help narrow down the campaign. But I didn’t want everyone to click on it, just one person. So I custom tailored the ad to directly target the one person I wanted to read it.

“Steve: Reddit needs recommendations”

The ad reached 197 people. 4 people clicked on it. One of them was the CEO of Reddit. I spent a total of $10.62.

Steve Huffman, CEO of Reddit, saw my ad, clicked on it, read (probably skimmed) my article, and liked it well enough to send a note to Reddit HR to contact me about a position.

Mission accomplished.

Chris Seline

P.S. I am out again looking for new opportunities. Drop me a line at my first name at twicsy.com if you like the cut of my jib!

Hey Reddit, let's make some recommendations!

TLDR: Just read the bold sections and skim the rest.

There is a feature for casual discovery that is missing from Reddit that you will find on just about every other regular news site: recommended articles.

Say you are reading about Putin on Ukraine and you are like, I want to see some other things that Putin has said about Ukraine. What do you do? Google it? Wouldn't it be nice if Reddit made topical related content like this easy to find?

I vote yes!

This isn't a novel idea, so I am sure it has come up before at Reddit. Probably many times. So why don't they have this feature?

Because it's hard to do for a site like Reddit.

They can't just install a Wordpress plugin to get this functionality. Or a widget. Or even just throw everything into ElasticSearch/SOLR and use the related documents feature. Or use standard clustering techniques. I don't think there is anything they can use off-the-shelf at all.

Why is that?

Let's break it down into 2 standard techniques so we can better understand why this is a difficult problem.

Technique 1: Contextual analysis. Basically you analyze the terms in a document and look for other documents with those same terms. By its nature you will have lots of duplicate or nearly duplicate articles submitted to Reddit, so this technique will produce articles that are exactly the same or almost exactly the same as the source. This is not helpful by itself. We want to see different content that is related by topic, not just more of the same.

Technique 2: Collaborative filtering. You analyze the browsing patterns of all users to show them highly correlated posts by click. Basically you show people posts that have had the highest (relative) number of clicks by users that viewed the current post. If the algorithm is tuned poorly we will just get content from the homepage. If tuned as well as you can, you will still not necessarily get posts related by topic. You will just get more interesting stuff that the same people found interesting. Which is ok, but not exactly what we want.
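To make the duplicate problem with Technique 1 concrete, here is a minimal sketch of term-overlap scoring. It assumes a crude word-set tokenizer and Jaccard similarity, which is only one of many ways to compare terms; the function names and the three sample headlines are made up for illustration.

```python
import re

def terms(text):
    """Lowercase word set for a document (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def term_overlap(a, b):
    """Jaccard similarity: shared terms divided by all terms in either document."""
    ta, tb = terms(a), terms(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

# Hypothetical headlines: one resubmission and one genuinely related story.
source = "Putin comments on Ukraine ceasefire talks"
near_duplicate = "Putin comments on Ukraine ceasefire talks today"
related = "Ukraine responds to new Russian sanctions proposal"

dup_score = term_overlap(source, near_duplicate)
related_score = term_overlap(source, related)
print(f"near duplicate: {dup_score:.2f}, related: {related_score:.2f}")
```

The resubmitted headline scores far higher than the related story, which is exactly the "more of the same" behavior described above.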

So it sounds like we just need to mix the two techniques, right? Well, yes. That's what you do. Good idea. So why haven't they done this?

It turns out this is pretty hard for a site like Reddit. There is a lot of content linked to and created on Reddit, and a lot is duplicated content. And there are a lot of users, roughly 200 million unique visitors per month visiting over 7 billion pages. That's a lot of data. And it never stops coming. You can't just Hadoop it because that takes too long. We need that data processed continuously and indexed so we can deliver results asap.

So how do we solve this problem?

We create 2 indexes. One for the T1 and one for T2. And we expire older data because A. we don't want old stuff and B. there is just too much to process if we look at all of it all the time. We get a list of results for each technique and then we combine the scores. Sounds easy enough, right?
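Combining the two result lists could look something like the sketch below. It assumes each technique returns a dict of post id to raw score, normalizes each list by its own top score, and blends them with a weight. The normalization, the `t1_weight` parameter, and the function name are all assumptions for illustration, not a tuned formula.

```python
def combine_scores(t1_scores, t2_scores, t1_weight=0.5):
    """Blend two {post_id: score} dicts into one ranked list of (post_id, score).

    Each dict is scaled by its own maximum first so that neither technique
    dominates just because its raw scores happen to be larger.
    """
    def normalize(scores):
        top = max(scores.values(), default=0.0)
        return {k: v / top for k, v in scores.items()} if top > 0 else {}

    n1, n2 = normalize(t1_scores), normalize(t2_scores)
    combined = {}
    for post_id in set(n1) | set(n2):
        combined[post_id] = (t1_weight * n1.get(post_id, 0.0)
                             + (1 - t1_weight) * n2.get(post_id, 0.0))
    # Highest blended score first; ties broken by post id for stable output.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical scores: "b" shows up in both lists, so it should rank first.
t1 = {"a": 4.0, "b": 2.0}
t2 = {"b": 10.0, "c": 5.0}
ranked = combine_scores(t1, t2)
print(ranked)
```

A post that appears in both lists gets credit from both techniques, which is the point of mixing them.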

T1 is actually not too bad. We can actually just use ElasticSearch or SOLR and index the content that people link to, and use their built in MLT (More Like This) functionality. This gives ok results, but I prefer to use my own homegrown algorithm that uses some NLP to pull out noun-phrases and indexes those rather than simply words. "United" is a much less descriptive term than "United States". It also helps to determine a relative importance of those terms using some sort of holistic analysis of the document (and all documents), but I won't go into those details. That's my secret sauce. But the built-in stuff will do an ok-ish job.

T2 is very tricky. We need something like Amazon's "people that bought this also bought these" algorithm, but we need it updated MUCH faster. Amazon probably updates this type of recommendation on the order of once or twice per day. And that's plenty because their products don't change that often. So they probably just Hadoop it. In order to be effective we need to update our recommendations roughly once per minute for all recent posts. To do this we need to be creative.

I am sure there are many approaches we can take to this problem, but here is what I would do given the relatively unique constraints.

We will have hundreds of millions of new data points each day to index. An ideal scenario is that we can compare the people that clicked/commented/voted/etc on a post to the people who did so on every other post. Basically we compare every document to every other document. This N^2 complexity is probably not going to work, but let's do some quick calculations with some hopefully correct assumptions to make sure. There have been 190 million posts to Reddit in its life up to June 2015, so let's make an assumption that they get roughly 250k per day. They get about 7 billion page views per month, or about 230 million per day. I am sure a lot of that is just to the homepage which isn't very helpful, so let's make an assumption that they get roughly 100 million valuable clicks/comments/votes/etc on posts each day. That is an average of 400 for each of the posts. So we need to compare a list (matrix) of 400 things to another 400 things 250,000 * 250,000 times. If each comparison takes one millisecond (a very rough guess) we are still talking about 62,500,000 seconds, or 723 days on a single core machine. And really we want to hold data for more than one day, which causes our compute time to go up quadratically (twice the posts means four times the comparisons), so this method would take thousands of machines to be effective. Brute force is not the way to go.
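The back-of-the-envelope numbers are easy to reproduce. This sketch uses the same assumptions as the paragraph (250k posts per day, one millisecond per comparison, a single core); the function and constant names are made up for illustration.

```python
POSTS_PER_DAY = 250_000          # rough guess from the paragraph above
SECONDS_PER_COMPARISON = 0.001   # one millisecond per post-to-post comparison
SECONDS_PER_DAY = 86_400

def brute_force_days(posts, seconds_per_comparison=SECONDS_PER_COMPARISON):
    """Single-core days needed to compare every post against every post."""
    comparisons = posts * posts
    return comparisons * seconds_per_comparison / SECONDS_PER_DAY

one_day_of_posts = brute_force_days(POSTS_PER_DAY)
two_days_of_posts = brute_force_days(2 * POSTS_PER_DAY)
print(f"1 day of posts:  {one_day_of_posts:,.0f} core-days")
print(f"2 days of posts: {two_days_of_posts:,.0f} core-days")
```

Holding two days of posts instead of one quadruples the work, because the number of pairs grows with the square of the number of posts.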

Not good enough!

So we need some serious shortcuts. My gut says we can efficiently precalculate enough lists on one or a handful of machines and then cache those lists in memcached for super quick final scoring. Let's see if I'm right.

First off, let's put everything into memory. 100 million document ids is only 800 megabytes if we store it as primitive longs. Add overhead for list objects and we are still very much ok. Maybe a handful of gigs. Let's think about what our document comparison is so we can find a more efficient way of finding documents to compare. Let's simplify things and say that each document represents a post, and the document is just a list of userids that have interacted with the document. Hmmm... smells ripe for an inverted index to me. Why compare every document to every other document when you really only need to compare the ones that will actually have a match?
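The 800 megabyte figure is just 8 bytes per id. As a quick check in Python (an assumption on my part, the original test code is not shown), the standard `array` module with typecode `"q"` stores signed 8-byte integers, which is the closest thing to a primitive long:

```python
from array import array

# A small sample array of 8-byte signed integers ("q" typecode).
sample_ids = array("q", range(1_000))
bytes_per_id = sample_ids.itemsize

# Scale up to the 100 million ids from the estimate above.
TOTAL_IDS = 100_000_000
estimated_megabytes = TOTAL_IDS * bytes_per_id / 1_000_000
print(f"{bytes_per_id} bytes per id, about {estimated_megabytes:,.0f} MB total")
```

A plain Python list of ints would take several times more memory than this, so the estimate only holds for a packed representation.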

So we create a second index that would consist of users with the posts that they interacted with. This doubles our memory footprint, but we are still only dealing with a handful of gigabytes for one day. When we want to find the other documents that have the most relatively co-occurring users we start with the document which is a list of users, then look up all the user documents to find the lists of other posts they have interacted with. Add them all up, divide by the total number of times each post has been interacted with (kinda like tf-idf, for a relative similarity score) and voila, we have a list of similar posts with scores attached. The best part is, we have in-memory super fast lookups and probably on average produce a list of related posts for each post in single digit milliseconds (although it depends on how many days of data you want to hold of course). My gut says roughly 5 ms on average because most posts will have very few interactions. And my gut is usually good for its word.
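Here is a minimal sketch of the two-index lookup described above, using plain in-memory dicts of sets. The class name, method names, and sample data are hypothetical, and the scoring is the simplest reading of "add them up, divide by total interactions", not a tuned production formula.

```python
from collections import defaultdict

class InteractionIndex:
    """Forward index (post -> users) plus inverted index (user -> posts)."""

    def __init__(self):
        self.users_by_post = defaultdict(set)
        self.posts_by_user = defaultdict(set)

    def add(self, user_id, post_id):
        self.users_by_post[post_id].add(user_id)
        self.posts_by_user[user_id].add(post_id)

    def related(self, post_id, limit=10):
        """Score other posts by shared users divided by their total users."""
        shared = defaultdict(int)
        # Only visit posts that share at least one user with this post.
        for user_id in self.users_by_post.get(post_id, ()):
            for other in self.posts_by_user[user_id]:
                if other != post_id:
                    shared[other] += 1
        scored = [(other, count / len(self.users_by_post[other]))
                  for other, count in shared.items()]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:limit]

index = InteractionIndex()
for user in (1, 2, 3):
    index.add(user, "ukraine")
for user in (1, 2):
    index.add(user, "putin")
for user in (3, 4, 5, 6, 7, 8, 9, 10):
    index.add(user, "frontpage")

print(index.related("ukraine"))
```

The popular "frontpage" post shares a user with "ukraine" but scores low because so many other people touched it too, while "putin" scores high because all of its users overlap. One caveat with this naive division: a post with a single interaction can score a perfect 1.0, so a real version would probably want a minimum-interaction threshold or some smoothing.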

Ok now I am feeling a little self conscious about my gut, so I wrote some code to validate my assumptions. I simulated up to 160 random document interactions for 30 million users for a total of about 166 million connections (it was slightly different every time I ran it). Then I wrote a simple, single-threaded version of the algorithm and ran it on an E5-2620 server. For this test the average calculation was done in 0.57 milliseconds. There will be some additional overhead for the more complex algorithm, but most of the work is done in this simple version. The amount of data will increase for the production version, and complexity will increase some, but I think my 5ms guess is pretty darn good.
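A scaled-down version of that kind of test might look like the sketch below. This is not the original test code: the user count, post count, and interaction cap are much smaller placeholder values so it runs in a few seconds on a laptop, and the interactions are uniformly random, which real traffic certainly is not.

```python
import random
import time
from collections import defaultdict

def simulate(num_users=20_000, num_posts=5_000, max_interactions=20, seed=42):
    """Give each user a random number of interactions with random posts."""
    rng = random.Random(seed)
    users_by_post = defaultdict(set)
    posts_by_user = defaultdict(set)
    for user_id in range(num_users):
        for _ in range(rng.randint(0, max_interactions)):
            post_id = rng.randrange(num_posts)
            users_by_post[post_id].add(user_id)
            posts_by_user[user_id].add(post_id)
    return users_by_post, posts_by_user

def related_counts(post_id, users_by_post, posts_by_user):
    """Count shared users between one post and every post they also touched."""
    shared = defaultdict(int)
    for user_id in users_by_post[post_id]:
        for other in posts_by_user[user_id]:
            if other != post_id:
                shared[other] += 1
    return shared

users_by_post, posts_by_user = simulate()
sample = list(users_by_post)[:1000]

start = time.perf_counter()
for post_id in sample:
    related_counts(post_id, users_by_post, posts_by_user)
avg_ms = (time.perf_counter() - start) * 1000 / len(sample)
print(f"average lookup: {avg_ms:.3f} ms over {len(sample)} posts")
```

The absolute number will depend heavily on the machine, the language, and the data shape, so treat it as a sanity check on the approach rather than a benchmark.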

If we get a sweet server (like a dual e5-2699 with 36 total cores!) we should be able to do a full compilation of recommendations for recent articles in less than a minute on one server.

Ok, now we have both T1 and T2 indexed, the next step is to run the final calculation and cache it in memcached. We want the front end to be able to grab this data as quickly as humanly possible, and that's what memcached is for. So we iterate through our post list and run the T1 query and grab the T2 data from memcached and combine the scores. How long will that take?

Well, for T1 it depends on how big our ElasticSearch/SOLR cluster is, but given we are really only indexing roughly 1 million documents we should be able to tune the servers to return results in 1ms, and I expect it to scale well so we can hit it with 8 threads and not impact performance much. Especially since we will be doing the same queries over and over, and although the data will change fairly often, queries will be highly cacheable. This is a gut call again! There are alternatives to make things faster if this doesn't pan out, including writing our own algorithm to run in memory. This would be extremely fast, so I am not worried.

For T2 we will be able to retrieve data from memcached in sub millisecond time. No problem there. Especially if we hit it multithreaded so that the memcached protocol can combine multiple fetches into one request. Memcached rocks in this respect.

So then we have a machine run 8 threads to step through 1 million documents. Let's say we have an average of 1.5ms per request: that is 1,500 seconds of total request time, or roughly 190 seconds spread across 8 threads. Luckily we can easily make it faster by doubling our ElasticSearch cluster and doubling our threads. We can probably up the thread count beyond 8 even without the ES cluster expansion, because some of our time will be spent in network latency. So anyways, doubling up servers gives us a refreshed set of results in roughly 95 seconds, and doubling again brings it under a minute, which is our goal! Total number of servers required: 5ish to 10. And I think that is relatively conservative; I bet given some real time to optimize and not using ElasticSearch I could do it in 1 beefy server.

So, let's do it!

I am looking for opportunities. Reddit, I think you should hire me to build this. I think it would be fun. And when I am done with this I could overhaul your search. I have lots of ideas there.

Still, if you aren't Reddit and you have some similar opportunities, let's talk! I am a data engineer that likes tackling tough problems. I also build big search engines like Twicsy (>5 billion pictures) and PersoniFind (>500 million people). I will science when necessary, and I can build systems like these to be highly effective and fast, but then to optimize the algorithm I would consult a data scientist. I have also co-founded a few startups and have been known to CTO.

Email me: [email protected] (remove the little o's)

#recommendations #elasticsearch #solr #search #reddit

jeremypalmer-blog

I received an interesting email the other day from a company we linked to from one of our websites.

In short, the email was a request to remove links from our site to their site. We linked to this company on our own accord, with no prior solicitation, because we felt it would be useful to...

Google bot now messing up Google analytics?!?!

I noticed a significant uptick in traffic on Twicsy yesterday, which naturally caused me to become excited and curious. I am an analytics junkie and I open up the real-time view in Google analytics several times per day. Our traffic is largely predictable, and I have a good feel for the number of active users at different points in the day.

The number of active visitors on the site late yesterday was double the normal average. Which seems great, right? I quickly noticed that the spike seemed suspicious. Normally when we see spikes there is a particular page that is linked from a big site that is the cause. But the top active pages didn't show any pages that stood out. The top referrals, top social traffic, and top keywords all looked normal as well (generally they have long tail distributions for us). It was late though so I decided to do my digging in the morning.

Overall traffic was about 50-60,000 visits above normal yesterday, so I took to GA this morning to find numbers 50-60k out of the norm to explain things. The first thing I noticed is that the spike of traffic was clearly coming from the US, but when I drilled into the US geo "(not set)" was the number one state, it was about 9x the volume of the next biggest state, and totaled over 66k visits. The day before the "(not set)" was only 1745 visits. Very suspicious.

So I did some Googling about bots that affected Google analytics and I found an article saying that Bing sometimes causes this issue. So I did what they suggested and looked up the top networks sending me traffic to see if Bing was the culprit. I was shocked to see the number one network was "google inc." with over 65,000 visits. There's that (approximate) number again. I looked at the day before and "google inc." was way down there with only 853 visits.

So my question to the community (and to Google) is:

WTF?!?!?!

#google analytics #bots

The 10th year anniversary of my really successful, unfundable, piece of shit startup I hate to love

Today is the 10th anniversary of my startup. 10. Freaking. Years.

This was not a straightforward journey, even in comparison to most startups. I started working on the technology for general web search engine Dumbfind on November 1st, 2003. After a long 20 months of work I raised $500k from an angel investor to turn it into a real company.

But really it wasn't a real company. It was a bad idea to go head-to-head with Google directly. I knew I would never beat them, but at the time I thought if I built something interesting it would find a place, or get bought by Microsoft to help them beef up their tech.

That didn't pan out.

So I pivoted. I launched Searchles, a social search engine, that was basically a more social, more searchy version of Delicious. We had some modest successes, but revenue was not one of them. We landed deals with The Washington Post, The Denver Post, and Perez Hilton, but revenue was quite terrible. We launched a number of products based on the tech we had created over the years but nothing really worked.

From 2005-2008 I raised about $1.7m in angel funding. I hired and I fired. I built and I launched. Over and over. But nothing worked so I had to lay everyone off. In 2009 it was just me again trying to build some cool tech and figure out what to do with it.

Then I had the idea to create a Twitter picture search engine. A niche, to be sure, but I thought possibly an interesting one. I went from idea to launch in just over a week and unleashed Twicsy on the world on June 19th 2009. I sent one email about the launch to the TechCrunch tips line and they wrote about it that day. And then the next day they wrote about it again. Surely the investors would be clamoring to get a piece of this?

Not so much.

I did not have the funds to continue to work on Twicsy full time for much longer. I got a job and worked on Twicsy on the side. Really, all I did was keep it running.

4 years and more than 2.5 billion indexed pictures later Twicsy now has over 6 million unique visitors per month.

6 million unique visitors per month! That is quite a feat for a website mostly set on idle for 4 years. 4 years ago I would have been flabbergasted to achieve that level of success. Any startup would, right? Right? Right??!?!?!!?

But there is a problem.

My stupid, piece of shit startup still doesn't make any money. I barely make enough advertising money (from 9 different networks) to pay for the servers. And the servers aren't that expensive.

I have been working on Twicsy again full time since March to see if I could raise some money and turn it into a real business.

So far, I have failed.

I believe that Twicsy has amazing potential and I think it can change the way you search and discover pictures of what's happening in the world. It doesn't seem like anyone else believes that though.

Twicsy itself has been an insanely brutal journey, but to realize that this journey actually began 10 years ago is especially... I don't know what.

And so Twicsy is now officially going on the backburner again.

Twicsy, you son of a bitch. You frustrate me so. I love you, but oh how I hate to love you.

Toyota's idea of a good UX

Open up the Toyota Financial Services app on your iPhone.

Put in user name and password. Oops, wrong password. Hey, you deleted my username so I have to type it in again. Jerks.

Put in the name of your first car. I just logged in, is this necessary? It's case sensitive! Watch out! Ugh, what did I put?

Select whether this is your personal mobile device or a shared device. Why? I have done this before... why do I need to do it again?

Agree to these terms of service which is about 5 pages long of reading. Hit next. I think I did this last time too.

Agree to MORE terms of service. Another ~5 pages. Hit ok. I DID THIS ALREADY.

Ok! Finally in!

Select make a payment.

Select a car.

View payment info (car and payment amount), then click make a payment.

View MORE payment info (bank details). Click box here to select to save banking info. But I already did that last time? If I don't select it again will it lose my bank info? I have no idea.

Hit next.

Review all data together now. Enter in last 4 digits of social security number to verify.

Click Select box to say I am authorized to use this bank account TO VERIFY.

Click submit payment TO FUCKING VERIFY.

Transaction finally goes through. Send confirmation of payment to email address? Click AGAIN. Why do you need to ask? Can't you just send it?

OK DONE. JHC. Can I claw my eyes out now?

How it should work:

After I get my details into the app the first time...

Prompt user to login

Show previous payment and button to click to make a new payment.

Click make a payment.

Confirm payment? Click yes.

FUCKING DONE. 10 seconds TOTAL. It does not need to take longer than this.

If you want more functionality have a link to a settings page.

Going through this every month pisses me off. It is almost easier to find the bill, write a check, and then WALK TO JAPAN AND GIVE IT TO YOU ASSHOLES.

Top Twitter Pictures of 2011!

http://twicsy.com/top/2011

Dear Twitter, You Are Doing it Wrong

Dear Twitter,

We can only account for ourselves, but there is probably a lot of apprehension in the Twitter picture market over the impending launch of your new picture service.

We just wanted to let you know that you are doing it wrong.

First off, what's up with those massive URLs? Domains like Twitpic, flic.kr, and your own twimg.com have been chosen because the domains are short. It means you can fit more text before and after a link. This is Twitter 101. Your domain is 4 characters longer than Twitpic's, and 6 longer than your own twimg.com. These are precious characters that you have wasted. You have also wasted a few precious characters on the picture ids. Using A-Z,a-z,0-9 as your character set, you could have limited the id size to 6 characters, which would have given you about 56 billion ids to work with. You went right to 7 for some reason. Really you could have started with 5 and expanded from there when necessary. You have the money to buy domains like "t.co", why not spend a little more and get something for pics like "p.co"?
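The id math is easy to check. This sketch assumes a 62-symbol alphabet in A-Z, a-z, 0-9 order and a simple fixed-width base-62 encoding; the function names and the alphabet ordering are illustrative choices, not how any real service generates its ids.

```python
import string

# 26 uppercase + 26 lowercase + 10 digits = 62 symbols.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits

def id_space(length):
    """How many distinct ids fit in `length` characters."""
    return len(ALPHABET) ** length

def encode(number, length):
    """Fixed-width base-62 id for a non-negative integer."""
    if not 0 <= number < id_space(length):
        raise ValueError(f"{number} does not fit in {length} characters")
    chars = []
    for _ in range(length):
        number, remainder = divmod(number, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))

for length in (5, 6, 7):
    print(f"{length} characters: {id_space(length):,} ids")
print(encode(0, 6), encode(id_space(6) - 1, 6))
```

Five characters already covers over 900 million pictures, and six covers about 56.8 billion.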

Second, where are the thumbnails? Are you expecting 3rd parties to integrate with the photos on twitter.com? If so, a lot of them will need thumbnails. I hope this is coming soon, right now I don't see it anywhere, and we have been digging.

Third, all the other (good) Twitter pic sites have some very easy way to transform the "page" URL to the picture URL. For instance http://twitpic.com/5651z1 is the picture page, http://twitpic.com/show/large/5651z1 is a link to the actual image, and http://twitpic.com/show/thumb/5651z1 is a link to the thumbnail. This allows apps to integrate Twitpic pictures without making separate calls to an API to retrieve the picture URLs. For Twitter, you guys make us (1) get the status id of the tweet with the picture (2) make a call to your API (which is rate limited...), (3) get the media URL, and (4) cache that URL for future calls to avoid rate limiting. What a pain. And since you don't have thumbs, we can either be very inefficient and refer to the large pic, or download the pic and create thumbs ourselves. YUCK.
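To show how little work the Twitpic-style scheme needs, here is a small sketch that derives the image and thumbnail URLs from a page URL using only the pattern in the example above. The function name is made up, and it assumes that pattern holds for every picture id; no API call or rate limit is involved.

```python
from urllib.parse import urlparse

def twitpic_image_urls(page_url):
    """Derive large-image and thumbnail URLs from a Twitpic page URL."""
    parsed = urlparse(page_url)
    if parsed.netloc != "twitpic.com":
        raise ValueError(f"not a twitpic.com URL: {page_url}")
    picture_id = parsed.path.strip("/")
    if not picture_id or "/" in picture_id:
        raise ValueError(f"no picture id found in: {page_url}")
    base = f"{parsed.scheme}://{parsed.netloc}/show"
    return {
        "large": f"{base}/large/{picture_id}",
        "thumb": f"{base}/thumb/{picture_id}",
    }

urls = twitpic_image_urls("http://twitpic.com/5651z1")
print(urls["large"])
print(urls["thumb"])
```

That is pure string manipulation on the client side, compared to the four-step lookup-and-cache dance described for Twitter's own photos.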

Fourth, it doesn't work in Firefox. This is a nitpicky thing I guess, Firefox only has about 43% of the market: http://twicsy.com/i/cvnWF

Your video is very pretty in that it shows how pictures can integrate with Twitter and how we can search photos and everyone is happy. We already have all of that right now without your new service. There are numerous services to upload pics, and then there is Twicsy.com to search them. I fail to see how you are making Twitter any better by launching a service that clearly has not been thought out that well. We will do what we can to integrate with your Frankenstein, but wow, we hope you decide to pull back and rethink this whole thing before you do your official launch.

Sincerely,

Twicsy

Correction:

Thanks to @montabe_com we learned that they do have thumbnails. More info can be found here: https://dev.twitter.com/pages/tweet_entities#media
