Ringmaster - The Cassandra backed cluster resource manager
Today we introduce Ringmaster, a masterless, Cassandra backed cluster manager currently capable of running Spark applications on it but extendable to support other frameworks and applications.
A couple of years ago we started moving beyond Hadoop. We built the Tuplejump platform integrating Cassandra and Spark. In development we use Spark Standalone mode while in production we run Spark on Apache Mesos. This setup works great, but we want to take a step further towards our original goal of one click and fast Big Data cluster deployment. In current setup we still have additional moving parts, Spark server, mesos, zookeeper and tools to deploy, monitor and manage them. With Ringmaster we move towards removing these.
Ringmaster is a very simplified take on Mesos cluster manager which currently targets on running Spark jobs. We will be extending it to support Hydra, our stream processing engine, but we donât intend it to replace Mesos as a generic cluster management system and hence it will always remain a small and easy to manage code base. Moreover it builds on Cassandraâs infrastructure and functionality which is proven and battle hardened with years in production.
Before we take a dive into the workings of ringmaster, letâs take a quick look at Cassandra cluster. The figure below shows a typical Cassandra ring, with tokens assigned T-0, T-1, ⌠T11.
Cassandra ring with Tokens T-0 to T-11
In this ring of 12 cassandra will first calculate the hash token for a row and place it on the corresponding node. The row will be replicated to other nodes based on the replication factor and replication strategy. To read more about replication in Cassandra and understand how various strategies affect the nodes where the data is replicated read up here.
Let us take the simplest of the cases. In this ring let us suppose we are using the simple replication strategy and replication factor of 3, then for a row that hashes in the range between T-0 and T-1 then the row will first be persisted on the zeroth node and will be replicated on the first and second node.
Simple Replication for Factor 3 on Node T-0
To setup Spark in this cluster we move from simple replication strategy to network topology replication strategy placing the nodes that will run analytics tasks in a different virtual datacenter and installing Spark workers on them. The first data center (DC1) will be used by the high performance realtime read/write clients while the second (DC2) will exclusively be kept for Spark and Shark.
Let us assume we put every 4th node in DC2, the ring will look like this.
In this as we see The Row R1 which hashed to in the range of T-0, it is saved on node T-0 replicated clockwise till it gets replicated on node T-3 in DC2. This way all the data written to nodes in DC1 gets replicated to nodes in DC2.
Using this mechanism we ensure two things, first that the read/write performance of the clients is not affected by the analytic jobs and on analytics side we ensure data locality for the spark workers.
Now the complications arise when we add the Spark master and the backup and zookeeper ensemble to enable master switchover. Our current production clusters look somewhat like this (removing Mesos to get some clarity).
Current Tuplejump Cluster Setup
As you can see this requires us maintaining 5 additional servers with different software apart from our Cassandra ring and even in the cassandra ring we have different pieces to be installed in different nodes. In DC1 it is only Cassandra while DC2 requires Cassandra and Spark Workers to be installed and Spark workers configured to use the spark masters via zookeeper. This is where cluster deployment and configuration management systems business thrives and we for our needs use Puppet, Chef, Salt Stack, etc. depending on what the clientâs team is familiar with or uses internally.
Letâs quickly review the problems with our current setup,
Additional infrastructure
Additional software components and configuration
Removing zookeeper and backup server create a SPOF
Sometime back we through our partnership with Imaginea built a PoC of a masterless Spark Standalone cluster and have already created a pull request to Apache spark for the same (https://github.com/apache/incubator-spark/pull/172). This was the first cut of Ringmaster or project-rm. This uses Akka Cluster as the backbone and Actor messaging to communicate amongst the workers. The system is masterless as your application can connect to any of the nodes and it will create a dedicated scheduler for it. The state of the scheduler is replicated over the cluster and a cluster listener initiates another actor to take over if one of the nodes running a scheduler go down.
The scheduler is modified spark scheduler which communicates to the worker actors using akka messaging sending resource requests, receiving offers and triggering tasks on available workers.
What the diagram doesnât show is additional actors, Cluster monitor and State store which resply, monitor the cluster node membership messages and store and replicate the schedulerâs state. This resource manager solves 2 of our issues with the current setup,
There is no single point of failure, if a node dies the client can connect to another node and the scheduler will run there acting as the new master
Since the scheduler is co-located on the worker node and doesnât need ZK for reliability we donât require the additional infrastructure
We still need to deploy multiple software components and the deployment though simplified a lot than earlier (configuring Akka cluster is the only requirement in addition to configuring Cassandra) it is still more than just Cassandra. Also the functionality built into Cassandra and the functionality we use from Akka-Cluster (Cluster messaging, node membership, replicated state store, etc.) overlaps. And thus began the next phase for Ringmaster, we started replacing the functionality we built over akka cluster with the one found in Cassandra. We still use akka actors, but donât use the akka cluster. Now we run Ringmaster as a javaagent in the Cassandra JVM and so have access to all the internals of Cassandra. Major functionality we used from Cassandra was,
Akka cluster node membership replaced by cassandra peer membership. One akka actor on each node listens to the peer state messages in Cassandra and takes an action on it if it is the âclockwise adjacent nodeâ in the datacenter to the node going down.
The remote actor communication is done through custom defined Cassandra messages.
The scheduler state is stored to Cassandra and it takes care of replication.
Once Ringmaster goes in production, this is how our cluster will look. No additional moving parts to the engine, just the Cassandra cluster the javaagent on all of them. The agent gets activated only on the nodes with Data Center marked as analytics or nodes marked with Analytics role. More on that next time.
We will be soon releasing the early access to Ringmaster for you to dig your teeth in, followed by the codebase and general availability as we do with all our other components. In the meanwhile please let the comments and queries flow in.