How to troubleshoot distributed systems
More and more Internet services need a complex distributed architecture to handle a large number of clients and remain available under high load. These applications are usually a collection of software entities of various sizes, each with its own purpose, most often running on many different physical servers. Quite often, these components are written by different teams, using different frameworks and programming languages. Understanding the behavior of such a system requires observing the communication between the entities running on the different servers.
As an example, let’s consider a simple blogging platform architecture. When a user wants to see the latest posts, the browser generates a GET /posts/ request and sends it to the web server, which forwards the HTTP request to one of the web app servers. The web app queries the database for a list of the latest posts and encodes them in an HTTP response, which it sends back to the web server, which in turn relays it to the user. Everything looks clean and easy, but what happens if the user gets back an error instead of the list of posts? The issue can be anywhere on the path from the user’s browser down to the database.
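To make the request path concrete, here is a minimal Python sketch of the three hops. Everything in it is an assumption made for illustration: the function names (web_server, web_app, database), the db_healthy flag and the canned posts are hypothetical and do not come from any real platform. It only shows how a failure deep in the chain can reach the user as a generic error that says nothing about its origin.

```python
# Hypothetical model of the request path; all names are illustrative.

class DatabaseError(Exception):
    """Raised by the stand-in database when a query fails."""


def database(query, healthy=True):
    # Stand-in for the real database: returns canned posts or fails.
    if not healthy:
        raise DatabaseError("query failed: " + query)
    return ["Post 3", "Post 2", "Post 1"]


def web_app(path, db_healthy=True):
    # The web app queries the database and encodes the result
    # as a (status, body) pair.
    try:
        posts = database("SELECT latest posts", healthy=db_healthy)
    except DatabaseError:
        # The cause is dropped here; only a generic error goes upstream.
        return 500, "Internal Server Error"
    return 200, "\n".join(posts)


def web_server(method, path, db_healthy=True):
    # The web server only relays the request and the response.
    if method != "GET" or path != "/posts/":
        return 404, "Not Found"
    return web_app(path, db_healthy=db_healthy)


ok = web_server("GET", "/posts/")
failed = web_server("GET", "/posts/", db_healthy=False)
print(ok)
print(failed)
```

In the failing case the user sees only a status code and a generic message, which is why the investigation has to cover every hop.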
To track down the issue, your first idea is probably to check the logs of each server involved in getting the posts and look for exceptions. If you are unlucky and find nothing in the logs, you try reproducing the issue with debug-level logging and tracing enabled on each server. After hours of debugging, you find out that the issue was a race condition between the cleanup process and the database. This is indeed a nasty bug that requires a lot of investigation. But what if you had a system that showed you all the transactions between all the servers involved in getting the posts? What if this system showed you the root cause in a couple of seconds?
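The manual log check described above can be partly scripted. The sketch below is only an illustration under stated assumptions: the server names, the sample log lines and the "<ISO timestamp> <LEVEL> <message>" format are all hypothetical, and it assumes the servers' clocks are synchronized. It merges the per-server logs into one timeline and keeps only the errors.

```python
from datetime import datetime

# Hypothetical log lines, one list per server, in the assumed
# "<ISO timestamp> <LEVEL> <message>" format.
LOGS = {
    "web-server": [
        "2024-01-01T10:00:00 INFO GET /posts/ received",
        "2024-01-01T10:00:02 ERROR upstream returned 500",
    ],
    "web-app": [
        "2024-01-01T10:00:01 INFO querying latest posts",
        "2024-01-01T10:00:01 ERROR database query failed",
    ],
}


def parse(server, line):
    # Split into timestamp, level and the rest of the line.
    timestamp, level, message = line.split(" ", 2)
    return datetime.fromisoformat(timestamp), server, level, message


def merged_timeline(logs):
    # Flatten all servers' entries and order them by timestamp.
    entries = [
        parse(server, line)
        for server, lines in logs.items()
        for line in lines
    ]
    return sorted(entries, key=lambda entry: entry[0])


def errors(timeline):
    # Keep only the entries logged at ERROR level.
    return [entry for entry in timeline if entry[2] == "ERROR"]


timeline = merged_timeline(LOGS)
for ts, server, level, message in errors(timeline):
    print(ts.isoformat(), server, level, message)
```

Even with such a script, the approach only works if every component actually logs the failure, which is exactly what cannot be relied on in the scenario above.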













