Being at Funzio - Part 7 of N
After all the trouble we went through by using Flume, we realized that Flume did not satisfy our requirements. So we decided to build an in house alternative which did not require all the bells and whistles. This also meant that it would be simpler to implement. It would be super simple, easy to maintain and configure with a few parameters... all the good stuff that every engineering project wants to have. So was it as simple, easy to maintain etc as we originally wanted it to be? ... erm, maybe. We went through many iterations before we finally stabilized into a design which we were satisfied with.
Our initial design looked something like this:
We decided to pick NodeJS as a language to implement it in as it had inherent advantages in handling concurrent network requests. Our app server connected to the service which accepted tcp connection. It sends the data over the connection and once its done sending over, close the connection. Our service consisted of a bunch of "collectors" which accepted data from any app server that connected to it. It would initially log all the incoming data into a file and post process it at periodic intervals to copy over the data into vertica. This design meant that the game code was responsible for handling all failure cases like one of the collectors being down, the connection being too slow etc. In a full failure scenario when all the collectors were down, the app server would simply log the unsent data to a file locally which was then later copied into vertica by our existing daily batch import job.
In about a week, we developed it and shipped it to production for a couple of our low traffic games. I then named it Relay since it was all about message passing from one server to another. Cool name eh?
Over the next week we debugged it, handled all the bugs and once it was sufficiently stable, started rolling it out to the other games. We had rolled it out to all the games, and the system scaled beautifully... until it broke down as we rolled it out to the last game.
//DEVIL IS IN THE DETAILS
One of the best decisions we made was to make the game code fell back to the old behavior in case Relay failed. And it worked wonderfully, we never lost a single bit of data when we were making a transition. The data might have come in a bit late (batch imported once a day), but it would eventually be available.
The way we rolled out the code was to low traffic games first and then the higher traffic games which had more data. We also had knobs which would control what set of events were sent through relay, flume or daily import job. Our aim was to make ALL the traffic go through Relay eventually. And we started by just pumping as much as we can through Relay to see where it breaks.
So what happened when we rolled it out to the last game? The collector part of the process slowed down to a crawl because of the amount of data coming in. It simply couldn't keep up with processing the incoming stream.
The collector had two parts:
The listener handled the incoming tcp connection, accepted the data and logged it immediately to a file. This file was processed at regular intervals and then copied into vertica. Keep in mind that both parts were part of a single process.
The bottleneck was the second half of the collector. When it started to process the next file, the process would go into this massive slowdown and stop accepting tcp connections until it finished processing the file. It took quite a while to figure out why it was slowing down so much, it turned that the garbage collector hogged all the processing while deallocating and reallocating memory. Even though NodeJS is asynchronous, it is still a single process i.e. it can only execute one thread at a time. So if any one thing is executing, nothing else executes in parallel .. but anytime any thread is waiting for I/O nodejs goes off and handles something else giving it incredible speed and scalalbilty for applications which are continuously waiting for some I/O or are event based.
NodeJS also does not have a clean way to iterate through a file line by line. So what I did was just load the entire thing in memory and then process it. That was also (obviously) causing the process to be memory intensive. We fixed it later by using the 'lazy' module which helps in making processing things sequentially without all the crazy hoops that NodeJS really requires you to do.
Once we verified that the second part was causing the problem, it took me about 15 mins to split them up into two distinct nodejs processes and deploy them out separately. To do this fast, all I did was make two programs with one having the "copying to vertica" part disabled and vice versa. The code was still exactly the same and it allowed us to test it out really quickly.
So the new Relay setup looked:
This setup almost satisfied what we wanted, but not quite. We also wanted graceful handling of failures of various components of the system. This meant that if any collectors, copiers or the app server failing should not cause any effect on other parts of the system. With the current setup, if the collectors went down, the game would start throwing a lot of internal errors when its unable to connect to a collector (it would not affect the game itself but would lead to a slowdown in the performance). This was still unacceptable because we wanted to completely decouple the application code from analytics.
After working on it a little bit more we got the final design as shown:
In the new setup the app server would simply log to a file locally in a particular format. We introduced a relay sender on each app server which is continuously running. It picked up the logged data and sent it it to the collector. This allowed completely decoupling of the analytics systems and really helps us in maintaining it without touching the game code at all. Further, each module - sender, listener and copier are completely separate entities and so they can be used/maintained/upgraded/swapped in and out independently.
Once we had this design, we realized that we can use the individual components separately in other parts of the system which had similar needs. These new usecases also prodded us to redesign relay to be much more configurable and generic.
We think that other people may also find this system useful, and as part of our open source initiative, we have open sourced Relay here. It has the generic parts of sender and listener and they can be used independently. They are also configurable via just a few command line params.
This project has shown me the true value of modularizing your code than any project for a software engineering class that purports to teach the same principles.