Website Outages
(This post pertains to the website outages experienced on 1/15/2014.)
The website experienced two outages today: one this morning and another around midday. During these times, user logins succeeded only sporadically, and errors like the one shown in the image below appeared.
These errors and outages occurred because of a misconfiguration of the MongoDB driver that we use within node on our API servers. Specifically, neither connectTimeoutMS nor socketTimeoutMS was specified in the JSON config file that production uses for connections to our Mongo database.
connectTimeoutMS determines how long an application waits for the initial connection to be established, and socketTimeoutMS determines how long it waits for responses to subsequent requests.
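As a rough sketch of what setting both values explicitly could look like, the snippet below builds a MongoDB connection string from a config object. connectTimeoutMS, socketTimeoutMS, and replicaSet are standard MongoDB connection string options; the host names, database name, replica set name, and the buildMongoUri helper are hypothetical placeholders, not our actual production config.

```javascript
// Hypothetical config object: hosts, database, and replica set name are
// placeholders chosen for illustration only.
const mongoConfig = {
  hosts: ['db1.example.com:27017', 'db2.example.com:27017', 'db3.example.com:27017'],
  database: 'app',
  options: {
    replicaSet: 'rs0',
    connectTimeoutMS: 60000, // wait up to 60 s for the initial connection
    socketTimeoutMS: 60000   // wait up to 60 s for a response on an open socket
  }
};

// Hypothetical helper: turn the config object into a mongodb:// connection string.
function buildMongoUri(config) {
  const query = Object.keys(config.options)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(config.options[key])}`)
    .join('&');
  return `mongodb://${config.hosts.join(',')}/${config.database}?${query}`;
}

const mongoUri = buildMongoUri(mongoConfig);
console.log(mongoUri);
```

Keeping the timeouts in the config object next to the host list means a missing value is easy to spot in review, rather than silently falling back to whatever the driver does by default.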
An API server averages around 140 connections to the database, with a minimum of 32. Consider what happens in the middle of a deploy: 12 of our 24 API servers are taken out of rotation and brought up to date with whatever git branch we're deploying. When those 12 servers come back up, that is roughly 1,700 connections being made to our database. If the driver is unable to connect to our primary database server, it attempts to connect to the next server listed in the hosts array passed to it, which consists of the members of our database's replica set; that next server is a secondary. MongoDB does not allow writes to a secondary (other than replication writes), so when some of the connections end up being made to secondaries, writes over them produce the error seen in the aforementioned image.
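The connection figure above is simple multiplication, sketched here with the numbers from this post. The function and constant names are only illustrative.

```javascript
// Figures taken from the post: 12 servers restart during a deploy, and each
// server holds about 140 database connections on average (32 at minimum).
const SERVERS_RESTARTED = 12;
const AVG_CONNECTIONS_PER_SERVER = 140;
const MIN_CONNECTIONS_PER_SERVER = 32;

// Number of connections opened when a batch of servers comes back up.
function connectionBurst(servers, connectionsPerServer) {
  return servers * connectionsPerServer;
}

const typicalBurst = connectionBurst(SERVERS_RESTARTED, AVG_CONNECTIONS_PER_SERVER);
const minimumBurst = connectionBurst(SERVERS_RESTARTED, MIN_CONNECTIONS_PER_SERVER);

console.log(typicalBurst); // 1680, i.e. roughly 1,700
console.log(minimumBurst); // 384
```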
On a test server, we raised the driver's timeout from 1 second to 18 seconds, and timeouts dropped from an average of 87 over the course of ten minutes to an average of 4. After we deployed a 60-second timeout for both connectTimeoutMS and socketTimeoutMS, a survey of /var/log/node.log on six random API servers showed an average of 9 connections timing out over a period of ten minutes, down from the previous 87.
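To put those counts in relative terms, here is a small sketch that computes the percentage reduction in timeouts per ten minutes. It uses only the averages quoted above; the percentReduction helper is just an illustrative name.

```javascript
// Percentage drop from a "before" count to an "after" count.
function percentReduction(before, after) {
  return ((before - after) / before) * 100;
}

// Test server: 87 timeouts per ten minutes down to 4.
const testServerReduction = percentReduction(87, 4);
// Production survey: 87 timeouts per ten minutes down to 9.
const productionReduction = percentReduction(87, 9);

console.log(testServerReduction.toFixed(1) + '%'); // 95.4%
console.log(productionReduction.toFixed(1) + '%'); // 89.7%
```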












