Building a Production Service is Difficult
One of the major components of offering a service is just that: offering it. Keeping a system online with consistent data isn't as easy as spinning up a machine with a database and walking away. Datacenters experience latency, systems need updates, hard drives fail and, in some cases, entire companies get their networks destroyed overnight. One of the big goals for PassiveTotal was to ensure that we always had a way to recover from failure, especially since our machines were located in the cloud.
Prior to the redesign, PassiveTotal was running on a larger instance within Digital Ocean where backups were done locally and then copied on a scheduled basis. The biggest issue we had with this setup was the potential for node failure, either due to our processes or our hosting provider. Our initial redesign split the application logic from the database hosts and clustered the databases together, with one node in San Francisco and another in New York. Accounting for the changes in the code was easy, and our new design gave us two copies of the data in two different places, a decoupled application server and a solid recovery process.
Like good engineers, we tested extensively on local machines before deploying the new design to production, to account for any strange bugs or unexpected output. This mostly meant forcing failures between the two clustered database nodes and observing the election process that followed. While we tested, we also documented our processes and wrote a test suite to check that the servers had been set up according to those processes. Using our guides and test suite, we deployed the new design to production with no downtime. To do that, we ran the old version in parallel, pointed at the new database nodes, until the updated DNS records had reached all our clients.
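A minimal sketch of what a configuration check of this kind could look like, written in Python. It is not the actual PassiveTotal test suite: the setting names, the server names and the `check_server` helper are all hypothetical and exist only to show the idea of comparing each server against a documented baseline.

```python
# Hypothetical documented baseline. None of these keys come from the
# original setup; they only stand in for "what the guide says".
EXPECTED = {
    "replication_enabled": True,
    "backup_schedule": "daily",
    "firewall_default": "deny",
}


def check_server(name, actual, expected=EXPECTED):
    """Return a list of readable problems; an empty list means the server passes."""
    problems = []
    for key, want in expected.items():
        if key not in actual:
            problems.append(f"{name}: {key} is missing")
        elif actual[key] != want:
            problems.append(f"{name}: {key} is {actual[key]!r}, expected {want!r}")
    return problems


# One server that follows the guide and one that has drifted from it.
good = {"replication_enabled": True, "backup_schedule": "daily", "firewall_default": "deny"}
drifted = {"replication_enabled": True, "backup_schedule": "weekly"}

good_report = check_server("db-1", good)
drifted_report = check_server("db-2", drifted)

for line in drifted_report:
    print(line)
```

Returning a list of problems rather than stopping at the first mismatch means a single run shows everything that needs fixing on a node, which is handy when bringing up replacement servers in a hurry.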
For over two weeks, we experienced no issues with our new setup. Everything was working great: our primary database node was under minimal load, our secondary was ready and waiting for a failure, and backups were funneling up to the core application node. Then one morning, we woke up to find a couple of messages from newly-approved users saying they couldn't access their accounts. Oddly, when we queried the database, those accounts were no longer there. Looking at the cluster status, it was apparent that something had failed overnight, as our New York primary was now offline and San Francisco was serving as the primary.
The good news was that our site never went down (yay!). The bad news was that our databases were clearly out of sync. For the next couple of hours we went through the logs to try to figure out what had happened. To the best of our knowledge, the New York node became overwhelmed and its database process was killed, which sparked the election that promoted San Francisco to primary. The databases were out of sync for two reasons: latency between the datacenters (NY and SF) delayed replication, and our asynchronously replicated writes were not being verified (apparently verification is not enabled by default).
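To make the failure mode concrete, here is a toy Python simulation. It does not use any real database driver, and the `Node` class, the `write` function and the `required_acks` parameter are invented for illustration. It shows how a write that only the primary confirms can vanish after a failover, and how requiring acknowledgement from more nodes surfaces the problem at write time.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy database node; `reachable` models a slow or broken link."""
    name: str
    reachable: bool = True
    data: dict = field(default_factory=dict)

    def apply(self, key, value):
        """Store the write and return True, or return False if unreachable."""
        if not self.reachable:
            return False
        self.data[key] = value
        return True


def write(primary, secondaries, key, value, required_acks=1):
    """Write to the primary, replicate, and count acknowledgements.

    required_acks=1 means only the primary has to confirm (unverified
    replication). A higher value makes the write fail loudly when too
    few nodes confirm it. Nothing is rolled back on failure.
    """
    acks = 1 if primary.apply(key, value) else 0
    for node in secondaries:
        if node.apply(key, value):
            acks += 1
    if acks < required_acks:
        raise RuntimeError(
            f"write {key!r} acknowledged by {acks} node(s), needed {required_acks}"
        )
    return acks


ny = Node("new-york")
sf = Node("san-francisco", reachable=False)  # replication is not getting through

# Unverified write: reported as a success although San Francisco never got it.
write(ny, [sf], "user:alice", "approved", required_acks=1)
missing_after_failover = "user:alice" not in sf.data

# Verified write: the same situation is reported immediately.
try:
    write(ny, [sf], "user:bob", "approved", required_acks=2)
    verified_write_failed = False
except RuntimeError:
    verified_write_failed = True
```

If New York then goes offline and San Francisco is promoted, `user:alice` is simply absent, which matches what the newly-approved users saw. Real databases expose this trade-off under different names and with different defaults, so the settings for the database in use should be checked rather than assumed.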
Having identified the major problem to be latency between datacenters, we quickly spun up two new data nodes in New York, ran them through our process, verified they were configured properly and moved the daily backup snapshot from our original New York primary over to the new cluster. Within 10 minutes, we were able to point the application server at the new cluster, and business resumed as normal. Had we not had backups, documented our processes or written test suites to check our configurations, this node failure would have taken much longer to recover from.
We are pleased to say that our cluster has been functioning fine for several weeks. Our databases are always in sync, and we are now able to do rolling restarts after updates without any downtime. Not having to worry about our data nodes means we can focus on scaling up our application server to avoid any future failures there.














