{"id":13216,"date":"2017-07-24T09:14:26","date_gmt":"2017-07-24T09:14:26","guid":{"rendered":"http:\/\/www.doanduyhai.com\/blog\/?p=13216"},"modified":"2017-08-10T16:44:53","modified_gmt":"2017-08-10T16:44:53","slug":"replication-high-availability-and-consistency-in-cassandra","status":"publish","type":"post","link":"https:\/\/www.doanduyhai.com\/blog\/?p=13216","title":{"rendered":"Replication, High Availability and Consistency in Cassandra"},"content":{"rendered":"<p>This blog post is a summary of many misunderstandings about Cassandra replication, high availability and consistency I have seen from Cassandra users on the fields<\/p>\n<p><!--more--><\/p>\n<h1>I Replication and high availability<\/h1>\n<p>Recently I had been talking to developers bootstrapping a new Cassandra project and I just realized that even seasoned Cassandra users still get confused about high availability. In their mind, <em>the more nodes you have the higher availability you get<\/em>, which is completely <strong>wrong<\/strong>. Let&#8217;s say you have a cluster of 100 nodes and replication factor (RF) = 1, if you loose 1 node, you loose 1\/100<sup>th<\/sup> of your data and therefore you are no longer highly available in absolute. Indeed high availability is more dependent on the replication factor. <strong>The more replicas you have, the higher the availability<\/strong>. <\/p>\n<p> You also need to take into account the <strong>consistency level (CL)<\/strong> to be used for your queries. 
With <strong>RF=3<\/strong> and <strong>CL=ONE<\/strong> you can afford to <strong>lose up to 2 nodes at the same time<\/strong>; if you are using <strong>CL=QUORUM<\/strong>, you can only <strong>lose 1 node<\/strong>, since QUORUM requires at least 2 replicas out of 3 to be up in order to execute the request.<\/p>\n<p> To increase your failure tolerance, you can choose <strong>RF=5<\/strong>; in this case with <strong>CL=ONE<\/strong> you can <strong>lose up to 4 nodes simultaneously<\/strong> without compromising your availability, or <strong>up to 2 nodes simultaneously<\/strong> if you are using <strong>CL=QUORUM<\/strong>.<\/p>\n<p> Below is a matrix of high availability as a function of replication factor and consistency level:<\/p>\n<table>\n<thead>\n<tr>\n<th>RF<\/th>\n<th>Used CL<\/th>\n<th>Number of allowed simultaneous failed nodes without compromising HA<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>3<\/td>\n<td>ONE\/LOCAL_ONE<\/td>\n<td>2 nodes<\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td>QUORUM\/LOCAL_QUORUM<\/td>\n<td>1 node<\/td>\n<\/tr>\n<tr>\n<td>5<\/td>\n<td>ONE\/LOCAL_ONE<\/td>\n<td>4 nodes<\/td>\n<\/tr>\n<tr>\n<td>5<\/td>\n<td>QUORUM\/LOCAL_QUORUM<\/td>\n<td>2 nodes<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p> Long story short, the replication factor is the key to high availability. The majority of users choose <strong>RF=3<\/strong>. For the rare cases where both immediate consistency and the ability to lose 2 nodes simultaneously are required, we need <strong>RF=5<\/strong>, but then the price in terms of storage is quite expensive.<\/p>\n<p> RF=5 is rarely seen in production, but there is indeed a real niche use-case for this requirement. Imagine that you are performing a <strong>rolling upgrade<\/strong> of your cluster, which implies having 1 node offline for a short period of time. If you are unlucky and one node goes down while you&#8217;re upgrading another one, you will lose availability. 
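<\/p>\n<p>The failure-tolerance matrix above boils down to simple arithmetic: a request at CL=ONE needs only 1 live replica, while QUORUM needs a strict majority of the RF. A minimal Python sketch of this bookkeeping (a hypothetical helper, not part of any Cassandra driver):<\/p>

```python
# Replicas that must be alive for a request at a given consistency level.
def required_replicas(rf: int, cl: str) -> int:
    if cl in ("ONE", "LOCAL_ONE"):
        return 1
    if cl in ("QUORUM", "LOCAL_QUORUM"):
        return rf // 2 + 1  # strict majority of the RF replicas
    if cl == "ALL":
        return rf
    raise ValueError(f"unknown consistency level: {cl}")

# Simultaneous replica failures tolerated without losing availability.
def tolerated_failures(rf: int, cl: str) -> int:
    return rf - required_replicas(rf, cl)

# Reproduces the matrix above.
for rf, cl in [(3, "ONE"), (3, "QUORUM"), (5, "ONE"), (5, "QUORUM")]:
    print(rf, cl, tolerated_failures(rf, cl))  # 2, 1, 4 and 2 nodes
```
<p>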
If you are a paranoid Cassandra ops, you may want to have <strong>RF=5<\/strong> to shield yourself from this scenario, but it is not worth paying the cost of 2 extra copies just for this edge case.<\/p>\n<h1>II Number of nodes and load balancing<\/h1>\n<p>If the replication factor determines your high availability, what is the total number of nodes good for?<\/p>\n<p>In fact, the number of nodes in the cluster is still an important parameter: it determines how your workload will be distributed. The smallest Cassandra setup in production is usually 3 nodes. If you want immediate consistency using QUORUM\/LOCAL_QUORUM, you should have at least <strong>RF=3<\/strong>. But then each node will manage 100% of your data, so there is no real data distribution in this case. If you have <strong>4 nodes<\/strong> and keep <strong>RF=3<\/strong>, each node will now manage <strong>3\/4<sup>th<\/sup><\/strong> of your data, which is slightly better but not that great. Having <strong>5 nodes<\/strong> means each node manages <strong>3\/5<sup>th<\/sup><\/strong> of the data, which is better still.<\/p>\n<p>So as we can see, the more nodes you have, the better your workload will be spread. Of course we make the assumption that the token ranges are allocated evenly and there is no data skew.<\/p>\n<h1>III Consistency level<\/h1>\n<p>I said earlier that <em>if you want immediate consistency using QUORUM\/LOCAL_QUORUM, you should have at least <strong>RF=3<\/strong><\/em>. Some Cassandra beginners don&#8217;t understand this requirement: why at least <strong>RF=3<\/strong> and not <strong>RF=2<\/strong>?<\/p>\n<p>Let&#8217;s rule out <strong>RF=1<\/strong> because it is nonsensical for production, which leaves the choice between <strong>RF=2<\/strong> and <strong>RF=3<\/strong>. Using <strong>QUORUM\/LOCAL_QUORUM<\/strong> means a strict majority of replicas. 
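<\/p>\n<p>The strict majority is simply floor(RF \/ 2) + 1, which is exactly why RF=2 degenerates into ALL. A quick sketch:<\/p>

```python
# Strict majority of a replica set: more than half of the replicas.
def quorum(replicas: int) -> int:
    return replicas // 2 + 1

print(quorum(3))  # 2 -> 1 replica may be down
print(quorum(2))  # 2 -> every replica is needed, i.e. the same as ALL
print(quorum(5))  # 3 -> 2 replicas may be down
```
<p>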
If you have 3 replicas, the strict majority is 2, but if you have only 2 replicas, the strict majority of 2 is &#8230; 2. So in this particular case, <strong>QUORUM\/LOCAL_QUORUM<\/strong> is equivalent to <strong>ALL<\/strong>, and consequently you give up on high availability in case of 1 node failure.<\/p>\n<p>Interestingly enough, I have seen some experienced Cassandra folks who use <strong>RF=2<\/strong>. They write at <strong>CL=ONE<\/strong> to have high availability and read with <strong>CL=ONE<\/strong>, but they set the <em>read_repair_chance<\/em> parameter to 1 so that reads are effectively equivalent to <strong>CL=ALL<\/strong>. In case of 1 node failure, the read repair will still succeed because there is no hard requirement to have all nodes online for read repair.<\/p>\n<p>This is pretty much equivalent to using <strong>CL=ALL<\/strong> for reads and using <strong>DowngradingConsistencyStrategy<\/strong> at the driver level.<\/p>\n<p>In general I would not recommend this kind of exotic usage of <em>read_repair_chance<\/em>. Having <strong>RF=3<\/strong> is much safer, and the overhead in terms of data is just one extra replica, not a big deal.<\/p>\n<h1>IV Consistency in multi-datacenter setup<\/h1>\n<p>Talking about multi-datacenter (multi-DC) setups, Cassandra beginners are again surprised during my presentations when I tell them that <em>100% of their data is replicated in each datacenter<\/em>. Indeed, even if they belong to the same cluster, each datacenter has its own replication factor. That makes it a little more complicated to reason about consistency levels.<\/p>\n<p>Suppose we have 2 datacenters, <strong>DC1<\/strong> and <strong>DC2<\/strong>. In <strong>DC1<\/strong> we set <strong>RF=5<\/strong> and in <strong>DC2<\/strong>, <strong>RF=3<\/strong>. Now, at the cluster level, you have <em>5 + 3 = 8<\/em> copies of your data in total. 
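<\/p>\n<p>Because each datacenter holds its own full set of replicas, cluster-level replica counts are just sums of the per-DC replication factors, and plain QUORUM is computed over that sum. A hedged Python sketch of this arithmetic (the datacenter names are just the ones from the example):<\/p>

```python
# Per-datacenter replication factors, as in the example above.
rf_per_dc = {"DC1": 5, "DC2": 3}

total_copies = sum(rf_per_dc.values())  # 5 + 3 = 8 copies cluster-wide
cluster_quorum = total_copies // 2 + 1  # QUORUM of 8 is 5

# A client in DC2 has at most 3 local replicas, so at least
# cluster_quorum - rf_local replicas must answer from the remote DC.
remote_needed = cluster_quorum - rf_per_dc["DC2"]

print(total_copies, cluster_quorum, remote_needed)  # 8 5 2
```
<p>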
A request using <strong>CL=QUORUM<\/strong> will compute the strict majority with regard to the total number of copies in the cluster, so in this case <strong>QUORUM of 8 is 5<\/strong>.<\/p>\n<p>If your client connects to <strong>DC1<\/strong>, it is likely that all 5 replicas in this DC will reply faster than the replicas in <strong>DC2<\/strong>. But if your client connects to <strong>DC2<\/strong>, since there are at most 3 copies in this datacenter, each request always needs to contact at least 2 extra replicas in <strong>DC1<\/strong> to honour <strong>CL=QUORUM<\/strong>, and this has a huge impact on latency.<\/p>\n<p> The <strong>EACH_QUORUM<\/strong> consistency level is even trickier: it means that for each request we require a <strong>strict majority of replicas in each DC to reply<\/strong>. With the above example, we need <em>3 replicas out of 5 in DC1 and 2 replicas out of 3 in DC2<\/em>. Using <strong>CL=EACH_QUORUM<\/strong> means that <em>you give up on availability whenever a whole datacenter goes down<\/em>, but it guarantees you <em>consistency across datacenters<\/em>.<\/p>\n<p><strong>So is it possible to have both cross-DC consistency and high availability in case of one complete DC failure?<\/strong><\/p>\n<p>The answer is yes; let&#8217;s do the reasoning from the beginning. In order to have strong consistency across datacenters, we cannot use <strong>CL=LOCAL_ONE<\/strong> or <strong>CL=LOCAL_QUORUM<\/strong>. If we use <strong>CL=EACH_QUORUM<\/strong>, we forfeit high availability in case of one datacenter failure, so it should be excluded. We only have <strong>CL=QUORUM<\/strong> left as an alternative. <strong>The trick is now to choose the correct number of datacenters and the replication factor in each of them<\/strong>.<\/p>\n<p>Suppose we go for 2 datacenters with, for example, <strong>RF=3<\/strong> in each DC. 
<strong>CL=QUORUM<\/strong> on 3 + 3 = 6 replicas is 4, which means that when one DC goes down, <strong>CL=QUORUM<\/strong> cannot be achieved because you only have 3 replicas left in the surviving DC. Not good.<\/p>\n<p>If we choose <strong>RF=5<\/strong> for <strong>DC1<\/strong> and <strong>RF=3<\/strong> for <strong>DC2<\/strong>, <strong>CL=QUORUM<\/strong> of 5 + 3 = 8 requires 5 replicas, so if <strong>DC1<\/strong> goes down, we&#8217;re basically back to the previous situation.<\/p>\n<p>Now if we have <strong>3 datacenters<\/strong> with <strong>RF=2<\/strong> in each of them, QUORUM of 2 + 2 + 2 = 6 is 4 replicas. Whenever one DC goes down, the 2 remaining DCs still provide 4 replicas and can satisfy the <strong>QUORUM<\/strong> consistency level.<\/p>\n<p> Indeed, having 3 datacenters and using <strong>CL=QUORUM<\/strong> will hurt you badly in terms of latency (because cross-DC latency is huge), but as long as high availability and cross-DC consistency, not raw performance, are your primary drivers, this architecture is a sensible solution.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This blog post is a summary of many misunderstandings about Cassandra replication, high availability and consistency that I have seen from Cassandra users in the 
field.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[57,10],"tags":[],"_links":{"self":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13216"}],"collection":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13216"}],"version-history":[{"count":7,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13216\/revisions"}],"predecessor-version":[{"id":13223,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13216\/revisions\/13223"}],"wp:attachment":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13216"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13216"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13216"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}