Real-world Repairs
Background
Anti-entropy in Cassandra is maintained via the repair operation. Repair is vital, yet poorly supported in the Cassandra world, so one needs to consider creative approaches to make it work at scale.
The following tools are available for running repairs on Cassandra 2.1 clusters.
Netflix Tickler: https://github.com/ckalantzis/cassTickler (continuously reads at CL.ALL via CQL to trigger read repair; see the sketch after this list :: Python)
Spotify Reaper: https://github.com/spotify/cassandra-reaper (subrange repair; exposes a REST endpoint and drives repairs through JMX :: Java)
List subranges: https://github.com/pauloricardomg/cassandra-list-subranges (lists the subranges owned by a given node :: Java)
Subrange Repair: https://github.com/BrianGallew/cassandra_range_repair (wrapper that performs subrange repairs :: Python)
Mutation-Based Repair (an idea that has since been dropped): https://issues.apache.org/jira/browse/CASSANDRA-8911
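For illustration, the tickler approach boils down to paging through a table at consistency level ALL, so that every read compares all replicas and any mismatch is fixed by a foreground read repair. Below is a minimal sketch using the DataStax Python driver; the contact point, keyspace, table, and column names are placeholders rather than part of any of the tools above.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # Placeholder contact point, keyspace, table, and key column.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")

    # Page through the whole table at CL=ALL. Every read touches all replicas,
    # and any digest mismatch triggers a foreground read repair.
    statement = SimpleStatement(
        "SELECT id FROM my_table",
        consistency_level=ConsistencyLevel.ALL,
        fetch_size=1000,
    )
    for _ in session.execute(statement):
        pass  # the rows are irrelevant; only the repair side effect matters

    cluster.shutdown()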
Problem Space
Office 365 uses Cassandra to build a deep understanding of its users. We need a reliable way to repair data when a cluster develops entropy (which happens for a variety of reasons), so we embarked on evaluating several approaches.
We can divide the datasets into the following buckets.
High Consistency
system_auth: this keyspace is consulted before every CQL query is executed and must be kept consistent.
Medium Consistency
These tables have the following characteristics:
- Long TTLs; otherwise running repairs is moot.
- Directly power end-user scenarios (e.g., tables backing APIs).
- Need to support explicit deletes (not just TTL-based expiry).
- Require something beyond "statistically correct" consistency (this ties back to the second point, and means "don't run repairs on gigantic tables holding raw data").
The stretch goal here is the ability to read at LOCAL_ONE.
Experimentation
The tests were run on a 300-node cluster; we deleted ~100 GB of data from one node before running the various repair approaches. The goal was to see how reliably and quickly the deleted data came back, and whether there was any performance impact on the rest of the cluster.
The default "nodetool repair" simply did not work, even functionally. The various subrange repair approaches brought the data back, but not reliably.
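For context, a subrange repair simply invokes "nodetool repair" over small, explicit token ranges instead of a node's entire range. A rough sketch of that loop is below; the token ranges, keyspace, and table are placeholders, and in practice the ranges come from a tool such as cassandra-list-subranges or "nodetool describering".

    import subprocess

    # Placeholder token ranges owned by the node being repaired.
    subranges = [
        ("-9223372036854775808", "-6148914691236517206"),
        ("-6148914691236517206", "-3074457345618258603"),
    ]

    for start_token, end_token in subranges:
        # Repair one small token range at a time rather than the whole ring.
        subprocess.check_call([
            "nodetool", "repair",
            "-st", start_token, "-et", end_token,
            "my_keyspace", "my_table",  # placeholder keyspace and table
        ])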
What worked best was a Spark job that read the entire table at consistency level ALL. With enough resources allocated to it, the job always completed successfully and repaired the data.
If Spark is already in your stack, we strongly recommend it as the means to repair.
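As a rough illustration only (not our exact job), the sketch below shows how such a read-at-ALL pass can be expressed with PySpark and the Spark Cassandra Connector: the connector's input consistency level is set to ALL, and a full count forces every partition to be read. The host, keyspace, and table names are placeholders, and the exact API varies with the connector version.

    from pyspark.sql import SparkSession

    # Assumes the Spark Cassandra Connector is available to the job.
    spark = (
        SparkSession.builder
        .appName("cassandra-read-repair")
        .config("spark.cassandra.connection.host", "127.0.0.1")  # placeholder
        # Read at CL=ALL so each row is compared across all replicas,
        # which repairs any replica that disagrees.
        .config("spark.cassandra.input.consistency.level", "ALL")
        .getOrCreate()
    )

    df = (
        spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="my_keyspace", table="my_table")  # placeholders
        .load()
    )

    # Force a full scan of the table; the count itself is throwaway.
    print(df.count())

    spark.stop()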