This blog post summarizes the many misunderstandings about Cassandra replication, high availability and consistency that I have seen from Cassandra users in the field.
I Replication and high availability
Recently I was talking to developers bootstrapping a new Cassandra project, and I realized that even seasoned Cassandra users still get confused about high availability. In their minds, the more nodes you have, the higher the availability you get, which is completely wrong. Let’s say you have a cluster of 100 nodes and replication factor (RF) = 1: if you lose 1 node, you lose access to 1/100th of your data, so you are no longer highly available in the strict sense. High availability depends far more on the replication factor: the more replicas you have, the higher the availability.
You also need to take into account the consistency level (CL) used for your queries. With RF=3 and CL=ONE you can afford to lose up to 2 nodes at the same time. With CL=QUORUM you can lose only 1 node, since QUORUM requires at least 2 replicas out of 3 to be up in order to execute the request.
To increase your failure tolerance, you can choose RF=5. In this case, with CL=ONE you can lose up to 4 nodes simultaneously without compromising your availability, or up to 2 nodes simultaneously if you are using CL=QUORUM.
Below is a matrix of high availability as a function of replication factor and consistency level:
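| RF | Used CL | Number of allowed simultaneous failed nodes without compromising HA |
|----|---------|---------------------------------------------------------------------|
| 3  | ONE     | 2 |
| 3  | QUORUM  | 1 |
| 5  | ONE     | 4 |
| 5  | QUORUM  | 2 |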
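The numbers in this matrix follow from simple arithmetic. Here is a minimal Python sketch of it; the function names `replicas_required` and `tolerated_failures` are my own for illustration and are not part of any Cassandra API.

```python
def replicas_required(cl: str, rf: int) -> int:
    """Number of replicas that must answer for a given consistency level."""
    if cl == "ONE":
        return 1
    if cl == "QUORUM":
        return rf // 2 + 1  # strict majority of the replicas
    if cl == "ALL":
        return rf
    raise ValueError(f"unsupported consistency level: {cl}")


def tolerated_failures(cl: str, rf: int) -> int:
    """Replicas of the same data that can be down while requests still succeed."""
    return rf - replicas_required(cl, rf)


for rf in (3, 5):
    for cl in ("ONE", "QUORUM"):
        print(f"RF={rf} CL={cl}: {tolerated_failures(cl, rf)} node(s) may be down")
```

Note that the count is per set of replicas: it is the worst case where all the failed nodes hold copies of the same data.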
Long story short, the replication factor is the key to high availability. The majority of users choose RF=3. For the rare cases where both immediate consistency and the ability to lose 2 nodes simultaneously are required, we need RF=5, but then the price in terms of storage is quite high.
RF=5 is rarely seen in production, but there is a real niche use case for it. Imagine that you are performing a rolling upgrade of your cluster, which implies having 1 node offline for a short period of time. If you are unlucky and another node goes down while you’re upgrading (with RF=3 and CL=QUORUM), you will lose availability. If you are a paranoid Cassandra ops, you may want RF=5 to shield yourself from this scenario, but it is not worth paying for 2 extra copies just for this edge case.
II Number of nodes and load balancing
If the replication factor determines your high availability, what is the total number of nodes good for?
In fact, the number of nodes in the cluster is still an important parameter: it determines how your workload is distributed. The smallest Cassandra setup in production is usually 3 nodes. If you want immediate consistency using QUORUM/LOCAL_QUORUM, you should have at least RF=3. But with 3 nodes this means that each node manages 100% of your data, so there is no real data distribution. If you have 4 nodes and keep RF=3, each node now manages 3/4 of your data, which is slightly better but not that great. Having 5 nodes means each node manages 3/5 of the data, which is better.
So as we can see, the more nodes you have, the better your workload is spread. Of course, this assumes that the token ranges are allocated evenly and that there is no data skew.
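The share of data managed by each node is simply RF divided by the number of nodes. A small Python sketch, assuming evenly allocated token ranges and no data skew as stated above (the helper name `data_share_per_node` is hypothetical):

```python
from fractions import Fraction


def data_share_per_node(rf: int, nodes: int) -> Fraction:
    """Fraction of the whole dataset each node holds, assuming even token ranges."""
    if nodes < rf:
        raise ValueError("the cluster needs at least RF nodes")
    return Fraction(rf, nodes)


for nodes in (3, 4, 5, 10, 100):
    share = data_share_per_node(3, nodes)
    print(f"{nodes} nodes, RF=3: each node manages {share} ({float(share):.0%}) of the data")
```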
III Consistency level
I said earlier that if you want immediate consistency using QUORUM/LOCAL_QUORUM, you should have at least RF=3. Some Cassandra beginners don’t understand this requirement: why at least RF=3 and not RF=2?
Let’s rule out RF=1 because it makes no sense for production. We have the choice between RF=2 and RF=3. Using QUORUM/LOCAL_QUORUM means a strict majority of replicas. If you have 3 replicas, the strict majority is 2, but if you have only 2 replicas, the strict majority of 2 is … 2. In this particular case QUORUM/LOCAL_QUORUM is equivalent to ALL, and consequently you give up high availability as soon as 1 node fails.
Interestingly enough, I have seen some experienced Cassandra folks use RF=2. They write at CL=ONE to keep high availability and read at CL=ONE, but they set the read_repair_chance parameter to 1 so that it is equivalent to reading at CL=ALL. In case of 1 node failure, the read repair will still succeed because there is no hard requirement to have all nodes online for read repair.
This is pretty much equivalent to using CL=ALL for reads together with the DowngradingConsistencyRetryPolicy at the driver level.
In general I would not recommend this kind of exotic usage of read_repair_chance. Having RF=3 is much safer, and the overhead in terms of data is just one extra replica, not a big deal.
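One common way to reason about these combinations is the overlap rule: a read sees the latest acknowledged write when the replicas contacted for the write and for the read must share at least one node, that is W + R > RF. I do not state this rule explicitly above, so treat the Python sketch below as an illustration under that assumption. The function names are hypothetical, and the last scenario models the RF=2 trick as a write at ONE plus a read at ALL.

```python
def replicas_required(cl: str, rf: int) -> int:
    """Replicas that must answer for a consistency level (ONE, QUORUM or ALL)."""
    levels = {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}
    return levels[cl]


def read_overlaps_write(write_cl: str, read_cl: str, rf: int) -> bool:
    """True when any read replica set must intersect any write replica set."""
    return replicas_required(write_cl, rf) + replicas_required(read_cl, rf) > rf


scenarios = [
    ("QUORUM", "QUORUM", 3),  # the usual recommendation
    ("ONE", "ONE", 3),        # fast, but no overlap guarantee
    ("QUORUM", "QUORUM", 2),  # overlaps, but QUORUM of 2 is 2, i.e. ALL
    ("ONE", "ALL", 2),        # the RF=2 setup modelled as write ONE / read ALL
]
for write_cl, read_cl, rf in scenarios:
    ok = read_overlaps_write(write_cl, read_cl, rf)
    print(f"RF={rf} write={write_cl} read={read_cl}: overlap guaranteed = {ok}")
```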
IV Consistency in multi-datacenter setup
Talking about multi-datacenter (multi-DC) setups, Cassandra beginners are again surprised during my presentations when I tell them that 100% of their data is replicated in each datacenter. Indeed, even if they belong to the same cluster, each datacenter has its own replication factor. That makes consistency levels a little more complicated to reason about.
Suppose we have 2 datacenters, DC1 and DC2. In DC1 we set RF=5 and in DC2, RF=3. Now, at the cluster level, you have 5 + 3 = 8 copies of your data in total. A request using CL=QUORUM computes the strict majority with regard to the total number of copies in the cluster, so in this case QUORUM of 8 is 5.
If your client connects to DC1, it is likely that all 5 replicas in this DC will reply faster than the replicas in DC2. But if your client connects to DC2, since there are at most 3 copies in this datacenter, each request always needs to contact at least 2 extra replicas in DC1 to honour CL=QUORUM, and this has a huge impact on latency.
The EACH_QUORUM consistency level is even trickier: it means that for each request we require a strict majority of replicas in each DC to reply. With the above example, we need 3 replicas out of 5 in DC1 and 2 replicas out of 3 in DC2. Using CL=EACH_QUORUM means that you give up availability whenever a whole datacenter goes down, but it guarantees consistency across datacenters.
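The arithmetic for the DC1/DC2 example can be sketched in a few lines of Python. The helper names below are mine, chosen for illustration; only the majority formula comes from the reasoning above.

```python
def quorum(replicas: int) -> int:
    """Strict majority of a number of replicas."""
    return replicas // 2 + 1


def cluster_quorum(rf_per_dc: dict) -> int:
    """CL=QUORUM: strict majority over all copies in the cluster."""
    return quorum(sum(rf_per_dc.values()))


def each_quorum(rf_per_dc: dict) -> dict:
    """CL=EACH_QUORUM: strict majority required inside every datacenter."""
    return {dc: quorum(rf) for dc, rf in rf_per_dc.items()}


def remote_replicas_needed(rf_per_dc: dict, local_dc: str) -> int:
    """Minimum replicas outside the local DC that CL=QUORUM has to wait for."""
    return max(0, cluster_quorum(rf_per_dc) - rf_per_dc[local_dc])


topology = {"DC1": 5, "DC2": 3}
print("QUORUM needs", cluster_quorum(topology), "replicas")
print("EACH_QUORUM needs", each_quorum(topology))
for dc in topology:
    print(f"client in {dc}: at least {remote_replicas_needed(topology, dc)} remote replica(s)")
```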
So is it possible to have both cross-DC consistency and high availability in case of one complete DC failure?
The answer is yes. Let’s do the reasoning from the beginning. In order to have strong consistency across datacenters, we cannot use CL=LOCAL_ONE or CL=LOCAL_QUORUM. If we use CL=EACH_QUORUM, we forfeit high availability in case of one datacenter failure, so it should be excluded. We only have CL=QUORUM left as an alternative. The trick is now to choose the correct number of datacenters and the replication factor in each of them.
If we go for 2 datacenters with, for example, RF=3 in each DC, CL=QUORUM on 3 + 3 = 6 replicas is 4. This means that when one DC goes down, CL=QUORUM cannot be achieved because only 3 replicas are left in the surviving DC. Not good.
If we choose RF=5 for DC1 and RF=3 for DC2, CL=QUORUM of 5 + 3 = 8 requires 5 replicas, so if DC1 goes down, only the 3 replicas of DC2 remain and we’re basically back to the previous situation.
Now if we have 3 datacenters with RF=2 in each of them, QUORUM of 2 + 2 + 2 = 6 is 4 replicas. Whenever one DC goes down, the 2 remaining DCs still provide 4 replicas and can satisfy the QUORUM consistency level.
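The three layouts above can be checked mechanically. Here is a minimal Python sketch, with hypothetical helper names, that tests whether CL=QUORUM is still reachable after losing each datacenter in turn:

```python
def quorum(replicas: int) -> int:
    """Strict majority of a number of replicas."""
    return replicas // 2 + 1


def survivable_dc_losses(rf_per_dc: dict) -> dict:
    """For each DC, tell whether CL=QUORUM still works once that DC is down."""
    total = sum(rf_per_dc.values())
    needed = quorum(total)
    return {dc: total - rf >= needed for dc, rf in rf_per_dc.items()}


def quorum_survives_any_dc_loss(rf_per_dc: dict) -> bool:
    """True only if QUORUM remains reachable whichever single DC fails."""
    return all(survivable_dc_losses(rf_per_dc).values())


layouts = {
    "2 DCs, RF=3 each": {"DC1": 3, "DC2": 3},
    "2 DCs, RF=5 and RF=3": {"DC1": 5, "DC2": 3},
    "3 DCs, RF=2 each": {"DC1": 2, "DC2": 2, "DC3": 2},
}
for name, layout in layouts.items():
    print(f"{name}: survives any single DC failure = {quorum_survives_any_dc_loss(layout)}")
```

Only the layout with 3 datacenters passes, which matches the reasoning above.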
Admittedly, having 3 datacenters and using CL=QUORUM will hurt you badly in terms of latency (because cross-DC latency is huge), but as long as your primary drivers are high availability and cross-DC consistency rather than performance, this architecture is a sensible solution.
Fantastic and clear post with good examples.
Can you please help me understand what “immediate consistency” is?
Also, can we say that the higher the consistency we configure, the higher the latency, and vice versa?
Thanks in advance.
What I call “immediate consistency” is defined in the literature as PRAM (Pipelined Random Access Memory). It is a combination of Monotonic Reads, Monotonic Writes and Read Your Writes.
QUORUM is the consistency level closest to the definition of PRAM.
For further reading, see slide 6 of the paper by Peter Bailis: http://www.bailis.org/papers/hat-vldb2014.pdf
“can we say that the higher the consistency we configure, the higher the latency, and vice versa” -> Yes, by virtue of the laws of physics.
Hi, thank you for this lecture, it’s very useful.
Let me explain my project; I need your advice.
I have two datacenters: the first one has 10 nodes for Cassandra and the second one has 20 nodes for Solr. I use DataStax for all nodes.
So I set RF=3 for dc1 and RF=5 for dc2, and I use CL=QUORUM for reads and writes.
Why do you put RF = 5 in the second DC with 20 nodes?
What are you trying to achieve?
It’s just written here: “To increase your failure tolerance, you can choose RF=5, … up to 2 nodes simultaneously if you are using CL=QUORUM”
Thanks for sharing your knowledge on this blog!
2 DCs with RF=3 and 6 nodes in total.
If I use: DCAwareRoundRobinPolicy + withUsedHostsPerRemoteDc + allowRemoteDCsForLocalConsistencyLevel (IPs of my 3 remote nodes)
– Requests @ QUORUM will need 6*0.51 = 4 acks among 6 possible responses
– Requests @ LOCAL_QUORUM will need 3*0.51 = 2 acks among 6 possible responses. <- not sure about this one?
I am considering this in a layout where DC2 is just a failover of DC1 in the same site.
I am also considering https://github.com/adejanovski/cassandra-dcaware-failover/tree/master/DCAwareFailoverRoundRobinPolicy