{"id":1930,"date":"2016-01-11T12:53:20","date_gmt":"2016-01-11T12:53:20","guid":{"rendered":"http:\/\/www.doanduyhai.com\/blog\/?p=1930"},"modified":"2017-08-10T16:45:40","modified_gmt":"2017-08-10T16:45:40","slug":"cassandra-3-0-materialized-views-technical-deep-dive","status":"publish","type":"post","link":"https:\/\/www.doanduyhai.com\/blog\/?p=1930","title":{"rendered":"Cassandra 3.0 Materialized Views Technical Deep Dive"},"content":{"rendered":"<p>In this blog post, we&#8217;ll dig into the brand new <strong>materialized view<\/strong> feature of <strong>Cassandra 3.0<\/strong>. We&#8217;ll see how it is implemented internally, how you should use it to get the most of its performance and which caveats to avoid.<\/p>\n<blockquote><p>For the remaining of this post Cassandra == Apache Cassandra\u2122<\/p><\/blockquote>\n<h1>Why materialized views ?<\/h1>\n<p>Being an evangelist for <strong>Apache Cassandra<\/strong> for more than a year, I&#8217;ve spent my time talking about the technology and especially providing advices and <strong>best practices<\/strong> for <strong>data modeling<\/strong>.<\/p>\n<p> One of the key point with Cassandra data model is <strong>denormalization<\/strong>, aka duplicate your data for faster access. You&#8217;re trading disk space for read latency.<\/p>\n<p> If your data is immutable by nature (like time series data\/sensor data) you&#8217;re good to go and it should work like a charm. <\/p>\n<p> However, mutable data that need to be denormalized are always a paint point. Generally people end up with the following strategies:<\/p>\n<ul>\n<li>denormalize immutable data<\/li>\n<li>for mutable data, either:\n<ul>\n<li>accept to normalize them and pay the price of extra-reads but don&#8217;t care about mutation<\/li>\n<li>denormalize but pay the price for read-before-write and manual handling of updates<\/li>\n<\/ul>\n<\/li>\n<li>since denormalization is required most of the time for different read patterns, you can rely on a 3<sup>rd<\/sup> party indexing solution (like <a href=\"http:\/\/docs.datastax.com\/en\/datastax_enterprise\/4.8\/datastax_enterprise\/srch\/srchIntro.html\" title=\"Datastax Enterprise Search\" target=\"_blank\"><strong>Datastax Enterprise Search<\/strong><\/a> or <a href=\"https:\/\/github.com\/Stratio\/cassandra-lucene-index\" title=\"Stratio Secondary Index\" target=\"_blank\"><strong>Stratio Lucene-based secondary index<\/strong><\/a> or more recently the <a href=\"https:\/\/github.com\/xedin\/sasi\" title=\"SASI\" target=\"_blank\"><strong>SASI secondary index<\/strong><\/a>) for the job<\/li>\n<\/ul>\n<p> Both solutions for mutable data are far from ideal because it incurs much overhead for developers (extra-read or sync updated data on the client side)<\/p>\n<p> The <strong>materialized views<\/strong> have been designed to <strong>alleviate the pain for developers<\/strong>, although it does not magically solve all the overhead of denormalization.<\/p>\n<h1>Materialized view creation syntax <\/h1>\n<p>Below is the syntax to create a materialized view:<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nCREATE MATERIALIZED VIEW [IF NOT EXISTS] keyspace_name.view_name AS \r\nSELECT column1, column2, ... \r\nFROM keyspace_name.base_table_name \r\nWHERE column1 IS NOT NULL AND column2 IS NOT NULL ... \r\nPRIMARY KEY(column1, column2, ...)\r\n<\/pre>\n<p>At first view, it is obvious that the materialized view needs a <strong>base table<\/strong>. A materialized view, conceptually, is just another way to present the data of the <strong>base table<\/strong>, with a <strong>different primary key<\/strong> for a <strong>different access pattern<\/strong>.<\/p>\n<p>The alert reader should remark the clause <strong>WHERE column1 IS NOT NULL AND column2 IS NOT NULL &#8230;<\/strong>. This clause guarantees that all columns that will be used as primary key for the view are not null, of course.<\/p>\n<p>Some notes on the constraints that apply to materialized views creation:<\/p>\n<ul>\n<li>The <strong>AS SELECT column1, column2, &#8230;<\/strong> clause lets you pick which columns of the <strong>base table<\/strong> you want to duplicate into the view. For now, you should pick <strong>at least all columns of the base table that are part of it&#8217;s primary key<\/strong><\/li>\n<li>The <strong>WHERE column1 IS NOT NULL AND column2 IS NOT NULL &#8230;<\/strong> clause guarantees that the primary key of the view has no null column<\/li>\n<li>The <strong>PRIMARY KEY(column1, column2, &#8230;)<\/strong> clause should contain <strong>all primary key columns of the base table, plus at most one column that is NOT part of the base table&#8217;s primary key<\/strong>.The order of the columns in the primary key does not matter, which allows us to access data by different patterns<\/li>\n<\/ul>\n<p>An example is worth a thousand words:<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nCREATE TABLE user(\r\n   id int PRIMARY KEY,\r\n   login text,\r\n   firstname text,\r\n   lastname text,\r\n   country text,\r\n   gender int\r\n);\r\n\r\nCREATE MATERIALIZED VIEW user_by_country \r\nAS SELECT *  \/\/denormalize ALL columns\r\nFROM user \r\nWHERE country IS NOT NULL AND id IS NOT NULL\r\nPRIMARY KEY(country, id);\r\n\r\nINSERT INTO user(id,login,firstname,lastname,country) VALUES(1, 'jdoe', 'John', 'DOE', 'US');\r\nINSERT INTO user(id,login,firstname,lastname,country) VALUES(2, 'hsue', 'Helen', 'SUE', 'US');\r\nINSERT INTO user(id,login,firstname,lastname,country) VALUES(3, 'rsmith', 'Richard', 'SMITH', 'UK');\r\nINSERT INTO user(id,login,firstname,lastname,country) VALUES(4, 'doanduyhai', 'DuyHai', 'DOAN', 'FR');\r\n\r\nSELECT * FROM user_by_country;\r\n\r\n country | id | firstname | lastname | login\r\n---------+----+-----------+----------+------------\r\n      FR |  4 |    DuyHai |     DOAN | doanduyhai\r\n      US |  1 |      John |      DOE |       jdoe\r\n      US |  2 |     Helen |      SUE |       hsue\r\n      UK |  3 |   Richard |    SMITH |     rsmith\r\n\r\nSELECT * FROM user_by_country WHERE country='US';\r\n\r\n country | id | firstname | lastname | login\r\n---------+----+-----------+----------+-------\r\n      US |  1 |      John |      DOE |  jdoe\r\n      US |  2 |     Helen |      SUE |  hsue\r\n<\/pre>\n<p> In the above example, we want to find users by their country code, thus the <strong>WHERE country IS NOT NULL<\/strong> clause. We also need to include the primary key of the original table (<strong>AND id IS NOT NULL<\/strong>)<\/p>\n<p> The primary key of the view is composed of the <strong>country as partition key<\/strong>. Since there can be many users in the same country, we need to add the <strong>user id as clustering column<\/strong> to distinguish them.<\/p>\n<p> The rationale for the clause <strong>WHERE xxx IS NOT NULL<\/strong> is to guarantee that null values in the base table will <strong>NOT<\/strong> be denormalized to the view. For example, an user who did not set his country won&#8217;t be copied to the view, mainly because <strong>SELECT * FROM user_by_country WHERE country = null<\/strong> doesn&#8217;t make sense since country is part of the primary key. Also, in the future, you may be able to use other clauses than the <strong>IS NOT NULL<\/strong>, mainly using User Defined Functions to filter data to be denormalized.<\/p>\n<p> The rationale for the constraint (<strong>all primary key columns of the base table, plus at most one column that is NOT part of the base table&#8217;s primary key<\/strong>) is to avoid null value for the primary key. <\/p>\n<p> Example:<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nCREATE MATERIALIZED VIEW user_by_country_and_gender \r\nAS SELECT *  \/\/denormalize ALL columns\r\nFROM user \r\nWHERE country IS NOT NULL AND gender IS NOT NULL AND id IS NOT NULL\r\nPRIMARY KEY((country, gender),id)\r\n\r\nINSERT INTO user(id,login,firstname,lastname,country,gender) VALUES(100,'nowhere','Ian','NOWHERE',null,1);\r\nINSERT INTO user(id,login,firstname,lastname,country,gender) VALUES(100,'nosex','Jean','NOSEX','USA',null);\r\n<\/pre>\n<p>  With the above example, both users &#8216;<em>nowhere<\/em>&#8216; and &#8216;<em>nosex<\/em>&#8216; cannot be denormalized into the view because at least one column that is part of the view primary key is null.<\/p>\n<p> In the future, <strong>null<\/strong> values may be considered as <em>yet-another-value<\/em> and this restriction may be lifted to allow more than 1 non primary key column of the <strong>base table<\/strong> to be used as key for the view.<\/p>\n<h1>Technical implementation<\/h1>\n<h3>A Materialized View update steps<\/h3>\n<p>Below is the sequence of operations when data are inserted\/updated\/deleted in the <strong>base table<\/strong> <\/p>\n<ol>\n<li>If the system property <em>cassandra.mv_enable_coordinator_batchlog<\/em> is set, the <strong>coordinator<\/strong> will create a batchlog for the operation<br \/>\n<div id=\"attachment_1949\" style=\"width: 955px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV_step1.png\"><img aria-describedby=\"caption-attachment-1949\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV_step1.png\" alt=\"MV Step1\" width=\"945\" height=\"404\" class=\"size-full wp-image-1949\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV_step1.png 945w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV_step1-300x128.png 300w\" sizes=\"(max-width: 945px) 100vw, 945px\" \/><\/a><p id=\"caption-attachment-1949\" class=\"wp-caption-text\">MV Step1<\/p><\/div><\/li>\n<li>the <strong>coordinator<\/strong> sends the mutation to all replicas and will wait for as many acknowledgement(s) as requested by <strong>Consistency Level<\/strong><div id=\"attachment_1951\" style=\"width: 955px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV_step2.png\"><img aria-describedby=\"caption-attachment-1951\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV_step2.png\" alt=\"MV Step2\" width=\"945\" height=\"538\" class=\"size-full wp-image-1951\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV_step2.png 945w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV_step2-300x171.png 300w\" sizes=\"(max-width: 945px) 100vw, 945px\" \/><\/a><p id=\"caption-attachment-1951\" class=\"wp-caption-text\">MV Step2<\/p><\/div><\/li>\n<li>each replica is acquiring a <strong>local lock<\/strong> on the partition to be<br \/>\ninserted\/updated\/deleted in the <strong>base table<\/strong> <div id=\"attachment_1953\" style=\"width: 955px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step3.png\"><img aria-describedby=\"caption-attachment-1953\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step3.png\" alt=\"MV Step3\" width=\"945\" height=\"430\" class=\"size-full wp-image-1953\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step3.png 945w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step3-300x137.png 300w\" sizes=\"(max-width: 945px) 100vw, 945px\" \/><\/a><p id=\"caption-attachment-1953\" class=\"wp-caption-text\">MV Step3<\/p><\/div><\/li>\n<li>each replica is performing a local read on the partition of the <strong>base table<\/strong><br \/>\n<div id=\"attachment_1954\" style=\"width: 955px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step4.png\"><img aria-describedby=\"caption-attachment-1954\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step4.png\" alt=\"MV Step4\" width=\"945\" height=\"393\" class=\"size-full wp-image-1954\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step4.png 945w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step4-300x125.png 300w\" sizes=\"(max-width: 945px) 100vw, 945px\" \/><\/a><p id=\"caption-attachment-1954\" class=\"wp-caption-text\">MV Step4<\/p><\/div><\/li>\n<li>each replica creates a <strong>local batchlog<\/strong> with the following statements:\n<ul>\n<li><strong>DELETE FROM user_by_country WHERE country = &#8216;old_value&#8217;<\/strong><\/li>\n<li><strong>INSERT INTO user_by_country(country, id, &#8230;) VALUES(&#8216;FR&#8217;, 1, &#8230;)<\/strong><\/li>\n<\/ul>\n<p>    <div id=\"attachment_1956\" style=\"width: 955px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step5.png\"><img aria-describedby=\"caption-attachment-1956\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step5.png\" alt=\"MV Step5\" width=\"945\" height=\"406\" class=\"size-full wp-image-1956\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step5.png 945w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step5-300x129.png 300w\" sizes=\"(max-width: 945px) 100vw, 945px\" \/><\/a><p id=\"caption-attachment-1956\" class=\"wp-caption-text\">MV Step5<\/p><\/div>\n  <\/li>\n<li>each replica executes the batchlog <strong>asynchronously<\/strong>. For each statement in the <strong>batchlog<\/strong>, it is executed against a <strong>paired view replica<\/strong> (explained later below) using <strong>CL = ONE<\/strong><br \/>\n  <div id=\"attachment_1958\" style=\"width: 955px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step6.png\"><img aria-describedby=\"caption-attachment-1958\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step6.png\" alt=\"MV Step6\" width=\"945\" height=\"466\" class=\"size-full wp-image-1958\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step6.png 945w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step6-300x148.png 300w\" sizes=\"(max-width: 945px) 100vw, 945px\" \/><\/a><p id=\"caption-attachment-1958\" class=\"wp-caption-text\">MV Step6<\/p><\/div>\n  <\/li>\n<li>each replica applies the mutation on the <strong>base table<\/strong> locally<br \/>\n  <div id=\"attachment_1961\" style=\"width: 955px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step7.png\"><img aria-describedby=\"caption-attachment-1961\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step7.png\" alt=\"MV Step7\" width=\"945\" height=\"399\" class=\"size-full wp-image-1961\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step7.png 945w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step7-300x127.png 300w\" sizes=\"(max-width: 945px) 100vw, 945px\" \/><\/a><p id=\"caption-attachment-1961\" class=\"wp-caption-text\">MV Step7<\/p><\/div>\n  <\/li>\n<li>each replica releases the <strong>local lock<\/strong> on the partition of the <strong>base table<\/strong><br \/>\n  <div id=\"attachment_1963\" style=\"width: 955px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step8.png\"><img aria-describedby=\"caption-attachment-1963\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step8.png\" alt=\"MV Step8\" width=\"945\" height=\"428\" class=\"size-full wp-image-1963\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step8.png 945w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step8-300x136.png 300w\" sizes=\"(max-width: 945px) 100vw, 945px\" \/><\/a><p id=\"caption-attachment-1963\" class=\"wp-caption-text\">MV Step8<\/p><\/div>\n  <\/li>\n<li>If the local mutation is successful, each replica sends an acknowledgement back to the <strong>coordinator<\/strong><br \/>\n  <div id=\"attachment_1964\" style=\"width: 955px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step9.png\"><img aria-describedby=\"caption-attachment-1964\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step9.png\" alt=\"MV Step9\" width=\"945\" height=\"555\" class=\"size-full wp-image-1964\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step9.png 945w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step9-300x176.png 300w\" sizes=\"(max-width: 945px) 100vw, 945px\" \/><\/a><p id=\"caption-attachment-1964\" class=\"wp-caption-text\">MV Step9<\/p><\/div>\n  <\/li>\n<li>if as many acknowledgement(s) as <strong>Consistency Level<\/strong> are received by the <strong>coordinator<\/strong>, the client is acknowledged that the mutation is successful<br \/>\n  <div id=\"attachment_1965\" style=\"width: 955px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step10.png\"><img aria-describedby=\"caption-attachment-1965\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step10.png\" alt=\"MV Step10\" width=\"945\" height=\"575\" class=\"size-full wp-image-1965\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step10.png 945w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step10-300x183.png 300w\" sizes=\"(max-width: 945px) 100vw, 945px\" \/><\/a><p id=\"caption-attachment-1965\" class=\"wp-caption-text\">MV Step10<\/p><\/div>\n  <\/li>\n<li>optionally, if the system property <em>cassandra.mv_enable_coordinator_batchlog<\/em> is set and if a <strong>QUORUM<\/strong> of acknowledgements are received by the <strong>coordinator<\/strong>, the coordinator-level <strong>batchlog<\/strong> is removed<br \/>\n  <div id=\"attachment_1966\" style=\"width: 955px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step11.png\"><img aria-describedby=\"caption-attachment-1966\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step11.png\" alt=\"MV Step11\" width=\"945\" height=\"401\" class=\"size-full wp-image-1966\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step11.png 945w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/MV-Step11-300x127.png 300w\" sizes=\"(max-width: 945px) 100vw, 945px\" \/><\/a><p id=\"caption-attachment-1966\" class=\"wp-caption-text\">MV Step11<\/p><\/div> <\/li>\n<\/ol>\n<h3>B Paired view replica definition<\/h3>\n<p> Before explaining in detail the rationale of some technical steps, let&#8217;s define what is a <strong>paired view replica<\/strong>. Below is the formal definition in the source code:<\/p>\n<blockquote><p>\nThe view natural endpoint is the endpoint which has the same cardinality as this node in the replication factor. <\/p>\n<p>The cardinality is the number at which this node would store a piece of data, given the change in replication factor. <\/p>\n<p>If the keyspace&#8217;s replication strategy is a <strong>NetworkTopologyStrategy<\/strong>, we filter the ring to contain only nodes in the local datacenter when calculating cardinality. <\/p>\n<p>For example, if we have the following ring: <\/p>\n<ul>\n<li>A, T1 -> B, T2 -> C, T3 -> A <\/li>\n<\/ul>\n<p>For the token T1, at RF=1, A would be included, so A&#8217;s cardinality for T1 is 1.<br \/>\nFor the token T1, at RF=2, B would be included, so B&#8217;s cardinality for token T1 is 2.<br \/>\nFor the token T3, at RF=2, A would be included, so A&#8217;s cardinality for T3 is 2. <\/p>\n<p>For a view whose base token is T1 and whose view token is T3, the pairings between the nodes would be: <\/p>\n<ul>\n<li>A writes to C (A&#8217;s cardinality is 1 for T1, and C&#8217;s cardinality is 1 for T3)<\/li>\n<li> B writes to A (B&#8217;s cardinality is 2 for T1, and A&#8217;s cardinality is 2 for T3)<\/li>\n<li>C writes to B (C&#8217;s cardinality is 3 for T1, and B&#8217;s cardinality is 3 for T3)<\/li>\n<\/ul>\n<\/blockquote>\n<h3>C Local lock on base table partition <\/h3>\n<p> The reader should wonder why each replica needs to acquire a <strong>local lock<\/strong> on the <strong>base table<\/strong> partition since locking is expensive. The reason of this lock is to guarantee view update consistency in case of <strong>concurrent updates<\/strong> on the <strong>base table<\/strong> partition.<\/p>\n<p> Let&#8217;s say we have 2 concurrent updates on an user (id=1) whose original country is <strong>UK<\/strong>:<\/p>\n<ol>\n<li><strong>UPDATE &#8230; SET country=&#8217;US&#8217; WHERE id=1<\/strong><\/li>\n<li><strong>UPDATE &#8230; SET country=&#8217;FR&#8217; WHERE id=1<\/strong><\/li>\n<\/ol>\n<p> Without the <strong>local lock<\/strong>, we&#8217;ll have interleaved mutations for the view<br \/>\n <div id=\"attachment_1973\" style=\"width: 955px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/Interleaved-Mutations.png\"><img aria-describedby=\"caption-attachment-1973\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/Interleaved-Mutations.png\" alt=\"Interleaved Mutations\" width=\"945\" height=\"443\" class=\"size-full wp-image-1973\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/Interleaved-Mutations.png 945w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/Interleaved-Mutations-300x141.png 300w\" sizes=\"(max-width: 945px) 100vw, 945px\" \/><\/a><p id=\"caption-attachment-1973\" class=\"wp-caption-text\">Interleaved Mutations<\/p><\/div><\/p>\n<p> The user (id=1) now has 2 entries in the view table (country=&#8217;<strong>US<\/strong>&#8216; and country=&#8217;<strong>FR<\/strong>&#8216;)<\/p>\n<p> This issue is fixed with the <strong>local lock<\/strong><\/p>\n<div id=\"attachment_1976\" style=\"width: 955px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/Serialized-Mutations.png\"><img aria-describedby=\"caption-attachment-1976\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/Serialized-Mutations.png\" alt=\"Serialized Mutations\" width=\"945\" height=\"489\" class=\"size-full wp-image-1976\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/Serialized-Mutations.png 945w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/01\/Serialized-Mutations-300x155.png 300w\" sizes=\"(max-width: 945px) 100vw, 945px\" \/><\/a><p id=\"caption-attachment-1976\" class=\"wp-caption-text\">Serialized Mutations<\/p><\/div>\n<p> Indeed, it is necessary that the sequence of operations <em>1) read base table data 2) remove view old partition 3) insert view new partition<\/em> is executed <strong>atomically<\/strong>, thus the need of locking<\/p>\n<h3>D Local batchlog for view asynchronous update<\/h3>\n<p>The <strong>local batchlog<\/strong> created on each replica for view update guarantees that, even in case of failure (because a view replica is temporarily down for example), the view update will <strong>eventually<\/strong> be committed.<\/p>\n<p> The consistency level <strong>ONE<\/strong> is used because each base table replica is responsible for the update of its <strong>paired view replica<\/strong>, thus consistency level <strong>ONE<\/strong> is sufficient.<\/p>\n<p> Furthermore, the update of each paired view replica is performed <strong>asynchronously<\/strong>, e.g. the replica will not block and wait for an acknowledgement before processing to <strong>base table<\/strong> mutation. The <strong>local batchlog<\/strong> guarantees automatic retries in case of error.<\/p>\n<h3>E View data consistency level<\/h3>\n<p> The consistency level requested by the client on the base table is respected, e.g. if <strong>QUORUM<\/strong> is required (<strong>RF=3<\/strong>), the coordinator will acknowledge a successful write only if it receives 2 acks from <strong>base table<\/strong> replicas. In this case, the client is sure that the <strong>base table<\/strong> update is made durable on <strong>at least 2 replicas out of 3<\/strong>.<\/p>\n<p> The consistency guarantee is <strong>weaker<\/strong> for view table. With the above example, we only have the guarantee that the view will be updated <strong>eventually<\/strong> on <strong>at least 2 view replicas out of 3<\/strong>.<\/p>\n<blockquote><p> The main difference in term of guarantee compared to <strong>base table<\/strong> lies in the <strong>eventually<\/strong> (asynchronous local batchlog). At the time the coordinator receives 2 acks from base table replicas, we are <strong>not<\/strong> sure that the view has been updated on at least 2 replicas.\n<\/p><\/blockquote>\n<h3>F Coordinator batchlog<\/h3>\n<p>The system property <em>cassandra.mv_enable_coordinator_batchlog<\/em> only helps in edge cases. Let&#8217;s consider below an example of such edge-case:<\/p>\n<ul>\n<li><strong>coordinator<\/strong> receives update, starts sending to base replicas<\/li>\n<li><strong>coordinator<\/strong> sends update to one <strong>base replica<\/strong><\/li>\n<li><strong>base replica<\/strong> receives the update and starts to process<\/li>\n<li><strong>coordinator<\/strong> dies before update is sent to any other base replica<\/li>\n<li><strong>base replica<\/strong> sends update to <strong>view replica<\/strong> through async local batch<\/li>\n<li><strong>base replica<\/strong> dies and cannot be brought back up<\/li>\n<li><strong>view replica<\/strong> processes update<\/li>\n<\/ul>\n<p> It&#8217;s very unlikely for all of those to happen, so protecting against that case while paying such a high penalty with <strong>coordinator batchlog<\/strong> doesn&#8217;t make sense in the general case and the parameter <em>cassandra.mv_enable_coordinator_batchlog<\/em> is <strong>disabled by default<\/strong>.<\/p>\n<h1>Performance consideration<\/h1>\n<p> Compared to a normal mutation, a mutation on a base table having materialized views will incur the following extra costs:<\/p>\n<ul>\n<li><strong>local lock<\/strong> on <strong>base table<\/strong> partition<\/li>\n<li>local <em>read-before-write<\/em> on <strong>base table<\/strong> partition<\/li>\n<li><strong>local batchlog<\/strong> for materialized view<\/li>\n<li><strong>optionally<\/strong>, coordinator batchlog<\/li>\n<\/ul>\n<p> In practice, most of the performance hits are incurred by the local <em>read-before-write<\/em> but this cost is only paid once and does not depends on the number of views associated with the table.<\/p>\n<p> However, increasing the number of views will have an impact on the <strong>cluster-wide write throughput<\/strong> because for each base table update, you&#8217;ll add an extra <em>(DELETE + INSERT) * nb_of_views<\/em> load to the cluster.<\/p>\n<p> That being said, it does not make sense to compare raw write throughput between a normal table and a table having views. It&#8217;s more sensible to compare write throughput between a <strong>manually denormalized table<\/strong> (using logged batch client-side) and the <strong>same table using materialized views<\/strong>. In this case, automatic server-side denormalization with materialized views clearly wins because:<\/p>\n<ol>\n<li>it saves network traffic for <em>read-before-write<\/em><\/li>\n<li>it saves network traffic for logged batch of denormalized table mutations<\/li>\n<li>it removes the pain for the developer from having to keep denormalized tables synced with base tables<\/li>\n<\/ol>\n<p>Another performance consideration worth mentioning is hot-spot. Similar to manual denormalization, if your view partition key is chosen poorly, you&#8217;ll end up with hot spots in your cluster. A simple example with our <strong>user<\/strong> table is to create a materialized <strong>view user_by_gender<\/strong><\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\n\/\/ THIS IS AN ANTI-PATTERN !!!!\r\nCREATE MATERIALIZED VIEW user_by_gender\r\nAS SELECT * FROM user\r\nWHERE id IS NOT NULL AND gender IS NOT NULL\r\nPRIMARY KEY(gender, id) \r\n<\/pre>\n<p> With the above view, all users will end up in only 2 partitions: MALE &amp; FEMALE. You certainly don&#8217;t want such hot-spots in your cluster.<\/p>\n<p> Now, how do materialized views compare to <strong>secondary index<\/strong> for <strong>read performance<\/strong> ?<\/p>\n<p> Depending on the implementation of your secondary index, the read performance may vary. If the  implementation performs a scatter-gather operation, the read performance will be closely bound to the number of nodes in the datacenter\/cluster.<\/p>\n<p> Even with a smart implementation of secondary index like <a href=\"https:\/\/github.com\/xedin\/sasi\" title=\"SASI\" target=\"_blank\"><strong>SASI<\/strong><\/a> that <a href=\"https:\/\/twitter.com\/doanduyhai\/status\/662392685706289152\" title=\"SASI does not perform scatter-gather\" target=\"_blank\">does not scan all the nodes<\/a>, a read operation always consist of 2 different read paths hitting disk:<\/p>\n<ul>\n<li>read the index on disk to find relevant primary keys<\/li>\n<li>read the source data from C*<\/li>\n<\/ul>\n<p> That being said, it&#8217;s pretty obvious that materialized views will give you better read performance since the read is straight-forward and done in 1 step. The idea is that you pay the cost at write time for a gain at read time.<\/p>\n<p> Still, materialized views loose against advanced secondary index implementation in term of querying because <strong>only exact match is allowed<\/strong>, ranged scans (give me user where country is between &#8216;UK&#8217; and &#8216;US&#8217;) will ruin your read performance.<\/p>\n<h1>Materialized views and operations<\/h1>\n<p>We do not forget our ops friends and in this chapter, we discuss the impact of having materialized views in term of operations.<\/p>\n<p><strong>Repair &#038; hints<\/strong> :<\/p>\n<ul>\n<li>it is possible to repair a view independently from its base table<\/li>\n<li>if the base table is repaired, the view will also be repaired thanks to the <strong>mutation-based repair<\/strong> (repair that goes through write path, unlike normal repair)<\/li>\n<li>read-repair on views behave like normal read-repair<\/li>\n<li>read-repair on base table will also repair views<\/li>\n<li>hints replay on base table will trigger mutations on associated views<\/li>\n<\/ul>\n<p><strong>Schema<\/strong> :<\/p>\n<ul>\n<li>materialized views can be tuned as any standard table (compaction, compression, &#8230;). Use the <strong>ALTER MATERIALIZED VIEW<\/strong> command<\/li>\n<li>you cannot drop a column from based table that is used by a materialized view, even if this column is not part of the view primary key<\/li>\n<li>you can add a new column to the base table, its initial value will be set to <strong>null<\/strong> in associated views<\/li>\n<li>you cannot drop the base table, you have to drop all associated views first<\/li>\n<\/ul>\n<h1>The shadowable view tombstone<\/h1>\n<blockquote><p>This section is purely technical for those who want to understand the deep internals. You can safely skip it\n<\/p><\/blockquote>\n<p> During the developement of materialized view some issues arose with tombstones and view timestamps. Let&#8217;s take this example:<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nCREATE TABLE base (a int, b int, c int, PRIMARY KEY (a));\r\n\r\nCREATE MATERIALIZED VIEW view AS\r\n    SELECT * FROM base\r\n    WHERE a IS NOT NULL\r\n    AND b IS NOT NULL\r\n    PRIMARY KEY (a, b);\r\n\r\n\/\/Insert initial data\r\nINSERT INTO base (a, b, c) VALUES (0, 0, 1) USING TIMESTAMP 0;\r\n\r\n\/\/1st update\r\nUPDATE base SET b = 1 USING TIMESTAMP 2 WHERE a = 0;\r\n\r\n\/\/2nd update\r\nUPDATE base SET b = 0 USING TIMESTAMP 3 WHERE a = 0;\r\n<\/pre>\n<p><\/p>\n<blockquote><p> ts is shortcut for timestamp<\/p><\/blockquote>\n<p> Upon initial data insertion, the view will contain this row: <strong>pk = (0,0), row_ts=0,  c=1@ts<sub>0<\/sub><\/strong><\/p>\n<p> On the 1<sup>st<\/sup> update, the view status is:<\/p>\n<ul>\n<li>pk=(0,0), row@<strong>ts<sub>0<\/sub><\/strong>, row_tombstone@<strong>ts<sub>2<\/sub><\/strong>, c=1@<strong>ts<sub>0<\/sub><\/strong> (DELETE FROM view WHERE a=0 AND b=0)<\/li>\n<li>pk=(0,1), row@ts<sub>2<\/sub>, c=1@ts<sub>0<\/sub> (INSERT INTO view &#8230; USING TIMESTAMP 2)<\/li>\n<\/ul>\n<p> The row (0,0) logically no longer exists because row tombstone timestamp > row timestamp, so far so good. On the 2<sup>nd<\/sup> update, the view status is:<\/p>\n<ul>\n<li>pk=(0,0), row@<strong>ts<sub>3<\/sub><\/strong>, row_tombstone@<strong>ts<sub>2<\/sub><\/strong>, c=1@<strong>ts<sub>0<\/sub><\/strong> (INSERT INTO view &#8230;)<\/li>\n<li>pk=(0,1), row@ts<sub>2<\/sub>, row_tombstone@ts<sub>3<\/sub>, c=1@ts<sub>0<\/sub> (DELETE FROM view WHERE a=0 AND b=1)<\/li>\n<\/ul>\n<p>Since we re-set b to 0, the view row (0,0) is <strong>re-inserted<\/strong> again but <strong>the timestamp for each column is different<\/strong>. (a,b) = (0,0)@<strong>ts<sub>3<\/sub><\/strong> but c=1@<strong>ts<sub>0<\/sub><\/strong> because column <strong>c<\/strong> was not modified. <\/p>\n<p>The problem is that now, if you read the view partition (0,0), <em>column <strong>c<\/strong> value will be shadowed by the old row tombstone@<strong>ts<sub>2<\/sub><\/strong><\/em> so <strong>SELECT * FROM view WHERE a=0 AND b=0 will<\/strong> return <strong>(0,0,null)<\/strong> which is wrong &#8230;<\/p>\n<p>A na\u00efve solution would be upgrading the column <strong>c<\/strong> timestamp to 3 after the second update, e.g. pk=(0,0), row@<strong>ts<sub>3<\/sub><\/strong>, row_tombstone@<strong>ts<sub>2<\/sub><\/strong>, c=1@<strong>ts<sub>3<\/sub><\/strong><\/p>\n<p>But then what should we do if there is another <strong>UPDATE base SET c=2 USING TIMESTAMP 1 WHERE a=0 AND b=0<\/strong> later? If we follow the previous rule, we will set the timestamp to 1 for column <strong>c<\/strong> in the view and it will be overriden by the previous value (c=1@<strong>ts<sub>3<\/sub><\/strong>)&#8230;<\/p>\n<p>The dev team came out with a solution: <strong>shadowable tombstone<\/strong>! See <strong><a href=\"https:\/\/issues.apache.org\/jira\/browse\/CASSANDRA-10261\" title=\"Materialized View Timestamp Issue\" target=\"_blank\">CASSANDRA-10261<\/a><\/strong> for more details.<\/p>\n<p>The formal definition of <strong>shadowable tombstone<\/strong> from the source code comments is:<\/p>\n<blockquote><p>A shadowable row tombstone only exists if the row timestamp (<em>primaryKeyLivenessInfo().timestamp()<\/em>) is lower than the deletion timestamp. That is, if a row has a shadowable tombstone with timestamp A and an update is made to that row with a timestamp B such that B > A, then the shadowable tombstone is &#8216;shadowed&#8217; by that update. Currently, the only use of shadowable row deletions is Materialized Views, see <strong>CASSANDRA-10261<\/strong>.\n<\/p><\/blockquote>\n<p>With this implemented, on the 1<sup>st<\/sup> update, the view status is:<\/p>\n<ul>\n<li>pk=(0,0), row@<strong>ts<sub>0<\/sub><\/strong>, <strong>shadowable_tombstone<\/strong>@<strong>ts<sub>2<\/sub><\/strong>, c=1@ts<sub>0<\/sub> (DELETE FROM view WHERE a=0 AND b=0)<\/li>\n<li>pk=(0,1), row@ts<sub>2<\/sub>, c=1@ts<sub>0<\/sub> (INSERT INTO view &#8230; USING TIMESTAMP 2)<\/li>\n<\/ul>\n<p>On the 2<sup>nd<\/sup> update, the view status becomes:<\/p>\n<ul>\n<li>pk=(0,0), row@<strong>ts<sub>3<\/sub><\/strong>, <strong>shadowable_tombstone<\/strong>@<strong>ts<sub>2<\/sub><\/strong>, c=1@<strong>ts<sub>0<\/sub><\/strong> (INSERT INTO view &#8230;)<\/li>\n<li>pk=(0,1), row@ts<sub>2<\/sub>, <strong>shadowable_tombstone@ts<sub>3<\/sub><\/strong>, c=1@ts<sub>0<\/sub> (DELETE FROM view WHERE a=0 AND b=1)<\/li>\n<\/ul>\n<p>Now, when reading the view partition (0,0), since the <strong>shadowable tombstone<\/strong>  (<strong>ts<sub>2<\/sub><\/strong>) is shadowed by the new row timestamp (<strong>ts<sub>3<\/sub><\/strong>), column <strong>c<\/strong> value is taken into account even if its timestamp (<strong>ts<sub>0<\/sub><\/strong>) is lower than the <strong>shadowable tombstone<\/strong> timestamp (<strong>ts<sub>2<\/sub><\/strong>)<\/p>\n<p> In a nutshell:<\/p>\n<ul>\n<li>If shadowable tombstone timestamp > row timestamp, shadowable tombstone behave like a normal tombstone<\/li>\n<li>If shadowable tombstone timestamp < row timestamp, ignore this shadowable tombstone for last-write-win reconciliation (as if it does not exists)<\/li>\n<\/ul>\n<p>And that&#8217;s it. I hope you enjoy this in-depth post. Many thanks to <strong>Carl Yeksigian<\/strong> for his technical help demystifying materialized views.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this blog post, we&#8217;ll dig into the brand new materialized view feature of Cassandra 3.0. We&#8217;ll see how it is implemented internally, how you should use it to get the most of its performance and which caveats to avoid&#8230;.<br \/><a class=\"read-more-button\" href=\"https:\/\/www.doanduyhai.com\/blog\/?p=1930\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[57,10],"tags":[56],"_links":{"self":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/1930"}],"collection":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1930"}],"version-history":[{"count":74,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/1930\/revisions"}],"predecessor-version":[{"id":13203,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/1930\/revisions\/13203"}],"wp:attachment":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1930"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1930"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1930"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}