[NoSQL] HBase and Cassandra
This is just a quick comparison of HBase and Cassandra for my first Tumblr post. It is somewhat biased toward making HBase look cooler and shinier because I've been studying HBase more lately. I'll follow up with more detailed research on each in later posts. Hope this helps.
Auto load balancing and fail-over
Google designed Bigtable for intercontinental datacenters and reliable replication
CP of CAP theorem (Consistency, Partition Tolerance)
Compression and Compaction for more efficient data cache performance
Automatic multiple shards per server
HDFS - data replication, end-to-end checksums, auto re-balancing
MapReduce - data processing and analytics
Distributes data load by row key in sorted order
If a region server is down, writes to its data are blocked until its regions are redistributed
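To make the sorted row key distribution concrete, here is a minimal sketch of how a key maps to a region. The region boundaries and server names are made up for illustration, and this is not HBase client code; it only shows the range-lookup idea.

```python
import bisect

# Hypothetical region boundaries: region i holds keys in
# [REGION_START_KEYS[i], REGION_START_KEYS[i + 1]).
REGION_START_KEYS = ["", "g", "n", "t"]
# Hypothetical region servers, one per region for simplicity.
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs4"]


def find_region(row_key: str) -> int:
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


def find_server(row_key: str) -> str:
    """Return the (hypothetical) server hosting the region for row_key."""
    return REGION_SERVERS[find_region(row_key)]


if __name__ == "__main__":
    for key in ["apple", "grape", "zebra", "2011-04-20T10:00"]:
        print(key, "->", find_server(key))
```

Because the keys are sorted, neighbouring keys land in the same region. That is great for range scans, but it also means every key starting with a timestamp goes to a single region.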
Replication: data+log backup method (log shipping)
The data is written to the HBase write-ahead log and buffered in RAM (the MemStore), then flushed to disk
The file on disk is automatically replicated, because the Hadoop filesystem (HDFS) replicates blocks by design
The data enters a "Replication Log", where it is piped to another data center
 When the application has a variable schema where each row is slightly different
 When data is stored in collections and keyed on the same value
 When you need key based access to data when storing or retrieving (random access)
 When an RDB can't add columns fast enough and most of them contain NULL in each row
 HBase uses ZooKeeper for various distributed coordination services such as master election
 Use proper NTP and DNS settings among all nodes
 Monitoring of the HBase cluster is very important (mem, cpu, disk, network on each node)
 Use a key prefix that distributes well (ex: do not use timestamp)
 Keep a reasonable number of regions based on MemStore and RAM size, and limit the RegionServer JVM heap to 12GB to minimize long GC (garbage collection) pauses
 HBase does not replace an RDB: no SQL, optimizer, transactions, or joins
 HBase is CPU and memory intensive with sporadic large sequential I/O; avoid sharing resources with other tasks such as MapReduce
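One common way to follow the "key prefix that distributes well" advice is to salt the row key with a short hash-derived bucket. The sketch below is only an illustration: the function name `salted_key`, the bucket count, and the `NN-` prefix format are my own assumptions, not anything HBase prescribes.

```python
import hashlib

# Assumed number of salt buckets; in practice this would be tuned
# to the number of regions or region servers.
NUM_BUCKETS = 16


def salted_key(row_key: str, buckets: int = NUM_BUCKETS) -> str:
    """Prefix row_key with a stable bucket derived from its MD5 hash."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    bucket = digest[0] % buckets  # same key always maps to the same bucket
    return f"{bucket:02d}-{row_key}"


if __name__ == "__main__":
    # Timestamp-like keys that would otherwise all sort next to each other.
    for minute in range(5):
        print(salted_key(f"2011-04-20T10:{minute:02d}"))
```

The trade-off is that a time-range scan now needs one scan per bucket, since consecutive timestamps no longer sit next to each other.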
Facebook Messages - replaced existing MySQL and Cassandra infrastructure
AP of CAP theorem (Availability, Partition Tolerance)
Distributes load across all nodes evenly
BigTable & Dynamo based hybrid architecture
Amazon designed Dynamo for closely located datacenters connected by high-speed fiber, for better performance
Writes never fail, for high availability
Relies on read for resolving conflicts (expected to be slow on reads)
Finds the most up-to-date data by comparing timestamps (Dynamo uses vector clocks; Cassandra uses per-column write timestamps)
Share state and replicate data using P2P sharing model (Gossip)
Hinted Handoff - basically a hint is given to other live nodes so they can replay write operations when a dead node comes back online
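Hinted handoff is easier to see in a toy model. The sketch below is a simplification with hypothetical `Node` and `Cluster` classes; it is not how Cassandra is actually implemented (it leaves out hint expiry, for example). It only shows the idea of parking a write for a dead replica and replaying it later.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}
        # Hints held on behalf of dead nodes: (target_name, key, value).
        self.hints = []


class Cluster:
    def __init__(self, names):
        self.nodes = {name: Node(name) for name in names}

    def write(self, coordinator, replicas, key, value):
        """Write to every live replica; store a hint for each dead one."""
        coord = self.nodes[coordinator]
        for name in replicas:
            target = self.nodes[name]
            if target.alive:
                target.data[key] = value
            else:
                coord.hints.append((name, key, value))

    def recover(self, name):
        """Bring a node back and replay any hints stored for it."""
        node = self.nodes[name]
        node.alive = True
        for other in self.nodes.values():
            remaining = []
            for target, key, value in other.hints:
                if target == name:
                    node.data[key] = value
                else:
                    remaining.append((target, key, value))
            other.hints = remaining


if __name__ == "__main__":
    cluster = Cluster(["a", "b", "c"])
    cluster.nodes["c"].alive = False
    cluster.write("a", ["a", "b", "c"], "user:1", "hello")
    cluster.recover("c")
    print(cluster.nodes["c"].data)
```

The write succeeds even though one replica is down, which is the availability side of the AP trade-off.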
The value is written to the "Coordinator" node
A duplicate value is written to another node in the same cluster
A third and fourth value are written from the Coordinator to another cluster across the high-speed fiber
A fifth and sixth value are written from the Coordinator to a third cluster across the fiber
Any conflicts are resolved in the cluster by examining timestamps and determining the "best" value
Cassandra was used to implement Facebook Inbox Search in 2008
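The timestamp-based conflict resolution in the last step can be sketched as a simple "latest timestamp wins" rule. The `Version` type and `resolve` function are names I made up for illustration; real systems also need a tie-breaking rule, which this sketch leaves to whichever replica comes first.

```python
from typing import Iterable, NamedTuple


class Version(NamedTuple):
    value: str
    timestamp: int  # e.g. microseconds since epoch, supplied at write time


def resolve(versions: Iterable[Version]) -> Version:
    """Pick the version with the highest timestamp (last write wins)."""
    return max(versions, key=lambda v: v.timestamp)


if __name__ == "__main__":
    # The same key as seen on three replicas after a partition heals.
    replicas = [
        Version("old", 1),
        Version("new", 3),
        Version("mid", 2),
    ]
    print(resolve(replicas).value)
```

Note that this depends on reasonably synchronized clocks across nodes, since a skewed clock can make a stale write look like the newest one.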
Quick diff of HBase and Cassandra
HBase has a simpler consistency model than Cassandra
HBase takes the consistency win
HBase has a SPOF on the HDFS NameNode; Cassandra does not
Cassandra has a simpler implementation and is easier to hack
Cassandra wins stress test (reliability) done by Adku
Cassandra wins performance test done by Adku
Cassandra has a larger developer community (starting to change since 2010)
Indeed Job Trend Comparison
The following is a fairly recent one from SimplyHired. It shows a similar curve to the data reported by Indeed. But it also shows that HBase has been growing steadily since Jan 11, whereas Cassandra shows lots of ups and downs. MongoDB is another big NoSQL solution you might want to keep in mind for your data warehousing.
http://www.facebook.com/note.php?note_id=454991608919 Nov 16, 2010
http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/ Apr 20, 2011
http://blog.adku.com/2011/02/hbase-vs-cassandra.html Feb 2, 2011
http://www.roadtofailure.com/2009/10/29/hbase-vs-cassandra-nosql-battle/ Oct 29, 2009