{"id":13191,"date":"2016-04-18T02:00:38","date_gmt":"2016-04-18T02:00:38","guid":{"rendered":"http:\/\/www.planetcassandra.org\/?p=13191"},"modified":"2017-10-30T19:16:39","modified_gmt":"2017-10-30T19:16:39","slug":"cassandra-native-secondary-index-deep-dive","status":"publish","type":"post","link":"https:\/\/www.doanduyhai.com\/blog\/?p=13191","title":{"rendered":"Cassandra Native Secondary Index Deep Dive"},"content":{"rendered":"<p>The native secondary index is the least known and most misused feature of Cassandra.<\/p>\n<p>In this article we&#8217;ll thoroughly explain the technical implementation of the native secondary index<br \/>\nto highlight <strong>best use-cases<\/strong> and the <strong>worst anti-patterns<\/strong>.<\/p>\n<blockquote><p>For the remainder of this post Cassandra == Apache Cassandra\u2122<\/p><\/blockquote>\n<hr \/>\n<h2>A General architecture<\/h2>\n<p>Let&#8217;s say that we have the following <em>users<\/em> table:<\/p>\n<pre class=\"lang:sql\" title=\"Base user table\"> CREATE TABLE users(\r\n     user_id bigint,\r\n     firstname text,\r\n     lastname text,\r\n     ...\r\n     country text,\r\n     ...\r\n     PRIMARY KEY(user_id)  \r\n );\r\n<\/pre>\n<p>Such a table structure only allows you to look up users by <em>user_id<\/em>. If we create a secondary index on the column <em>country<\/em>, the index would be a hidden table with the following structure:<\/p>\n<pre class=\"lang:sql\" title=\"Country Index\">CREATE TABLE country_index(\r\n    country text,\r\n    user_id bigint,\r\n    PRIMARY KEY((country), user_id) \r\n);\r\n<\/pre>\n<p>The main difference with a <em>normal<\/em> Cassandra table is that the partitions of <em>country_index<\/em> are not distributed using the cluster-wide partitioner (e.g. 
<strong>Murmur3Partitioner<\/strong> by default).<\/p>\n<p>A secondary index in Cassandra, unlike <strong><a href=\"https:\/\/docs.datastax.com\/en\/cql\/3.3\/cql\/cql_using\/useCreateMV.html\" target=\"_blank\">Materialized Views<\/a><\/strong>, is a <strong>distributed index<\/strong>. This means that the index itself is co-located with the source data on the same node. See an example below:<\/p>\n<div id=\"attachment_13504\" style=\"width: 1411px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-13504\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Distributed-Index-3-1.png\" alt=\"Distributed Index\" width=\"1401\" height=\"736\" class=\"size-full wp-image-13504\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Distributed-Index-3-1.png 1401w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Distributed-Index-3-1-300x158.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Distributed-Index-3-1-768x403.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Distributed-Index-3-1-1024x538.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Distributed-Index-3-1-1170x615.png 1170w\" sizes=\"(max-width: 1401px) 100vw, 1401px\" \/><p id=\"caption-attachment-13504\" class=\"wp-caption-text\">Distributed Index<\/p><\/div>\n<p>The technical rationales for storing index data alongside the original data are:<\/p>\n<ol>\n<li>reduce index update latency and the chance of lost index updates<\/li>\n<li>avoid arbitrarily wide partitions<\/li>\n<\/ol>\n<p>Indeed, if the index data had to be distributed across the cluster like normal data using the configured partitioner, we would face the same issue as with <strong>Materialized Views<\/strong>, e.g. 
how to ensure that the index data has been written effectively to disk before acknowledging the mutation to the client.<\/p>\n<p>A synchronous write of index data would definitely kill write latency, and we&#8217;re not even bringing <strong>Consistency Level<\/strong> into the game.<\/p>\n<p>By co-locating the index data on the same node as the source data, a write to a table with an index just costs an extra local mutation when flushing the original data to SSTables (more details about it in the next chapter).<\/p>\n<p>The second advantage of a distributed index is to avoid arbitrarily wide partitions. If we were to store the country index in a single partition, there would be 60+ million cells for the single <strong>FR<\/strong> country (assuming that we index the entire <strong>FR<\/strong> population). Imagine how wide the <strong>CN<\/strong> partition would be &#8230;<\/p>\n<hr \/>\n<h2>B Local write path<\/h2>\n<p>The write path for a table with a native secondary index is exactly the same as for a normal table with respect to the commit log. Whenever a mutation is applied to the base table in memory (<strong>memtable<\/strong>), it is dispatched as a notification to all registered indices on this table so that each index implementation can apply the necessary processing.<\/p>\n<p>The native secondary index implementation just creates an inverted index for the hidden index table. It handles 3 types of operations:<\/p>\n<ol>\n<li>insert of a new CQL row<\/li>\n<li>update of a CQL row<\/li>\n<li>delete of a CQL row<\/li>\n<\/ol>\n<p>For scenario 1, the index just creates a new entry (<em>partition key + clustering columns<\/em>) in the index table.<\/p>\n<p>For scenario 2, it is a little more involved. This scenario occurs <strong>IF AND ONLY IF<\/strong> the new mutation replaces a value that is still contained in the memtable. In this case, because Cassandra still has the previous value to be indexed, it will pass the previous and new value to the secondary index. 
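As a hypothetical illustration of this scenario (reusing the <em>users<\/em> table and the <em>country<\/em> index from section A, and assuming both writes land in the same memtable):<\/p>\n<pre class=\"lang:sql\" title=\"Scenario 2 illustration (hypothetical)\">INSERT INTO users(user_id, country) VALUES (1, 'FR');\r\n-- the hidden index table gets an entry (country='FR', user_id=1)\r\n\r\nUPDATE users SET country = 'US' WHERE user_id = 1;\r\n-- 'FR' is still in the memtable, so both 'FR' and 'US' are passed to the index\r\n<\/pre>\n<p>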
The index manager will then remove the entry for the previous indexed value and add a new one for the new indexed value.<\/p>\n<p>Scenario 3 is pretty straightforward: the secondary index just writes a tombstone for the index entry.<\/p>\n<p>The index memtable and base memtable will generally be flushed to SSTables at the same time, but there is no strong guarantee on this behavior. Once flushed to disk, index data will have a different life-cycle than base data, e.g. the index table may be compacted independently of base table compaction.<\/p>\n<p>An interesting detail to know is that the <strong>compaction strategy<\/strong> of the secondary index table is inherited from the one chosen for the base table.<\/p>\n<hr \/>\n<h2>C Index data model<\/h2>\n<p>Now let&#8217;s look in more detail at how the schema for the inverted index is designed. Suppose we have a generic table:<\/p>\n<pre class=\"lang:sql\" title=\"Generic table\">CREATE TABLE base_table(\r\n    partition1 uuid,\r\n    ...\r\n    partitionN uuid,\r\n    static_column text static,\r\n    clustering1 uuid,\r\n    ...\r\n    clusteringM uuid,\r\n    regular text,\r\n    list_text list&lt;text&gt;,\r\n    set_text set&lt;text&gt;,\r\n    map_int_text map&lt;int, text&gt;,\r\n    PRIMARY KEY((partition1, ..., partitionN), clustering1, ..., clusteringM)\r\n);\r\n<\/pre>\n<h4>1) Index on regular column<\/h4>\n<p>Suppose that we create an index on the <em>regular text<\/em> column; the schema of the index table will be:<\/p>\n<pre class=\"lang:sql\" title=\"Regular Column Index\">CREATE TABLE regular_idx(\r\n    regular text,\r\n    partitionColumns blob,\r\n    clustering1 uuid,\r\n    ...\r\n    clusteringM uuid,\r\n    PRIMARY KEY((regular), partitionColumns, clustering1, ..., clusteringM)\r\n);\r\n<\/pre>\n<p>The partition key of <em>regular_idx<\/em> is the indexed value (<em>regular<\/em>) itself. 
The clustering columns are composed of:<\/p>\n<ul>\n<li>the <em>base_table<\/em> partition components collapsed into a single blob (<strong>partitionColumns<\/strong>)<\/li>\n<li>the <em>base_table<\/em> clustering columns<\/li>\n<\/ul>\n<p>The idea here is to store the entire PRIMARY KEY of the CQL row containing the indexed regular value.<\/p>\n<h4>2) Index on static column<\/h4>\n<p>Suppose that we create an index on the <em>static_column text<\/em> column; the schema of the index table will be:<\/p>\n<pre class=\"lang:sql\" title=\"Static Column Index\">CREATE TABLE static_idx(\r\n    static_column text,\r\n    partitionColumns blob,\r\n    PRIMARY KEY((static_column), partitionColumns)\r\n);\r\n<\/pre>\n<p>Indeed, since a static value is common to <strong>all CQL rows<\/strong> in the same partition, we only need to store a reference to the partition key of the <em>base_table<\/em>.<\/p>\n<h4>3) Index on Partition component<\/h4>\n<p>If we create an index on the <em>partitionK uuid<\/em> component, the schema of the index table will be:<\/p>\n<pre class=\"lang:sql\" title=\"Partition Component Index\">CREATE TABLE partition_idx(\r\n    partitionK uuid,\r\n    partitionColumns blob,\r\n    clustering1 uuid,\r\n    ...\r\n    clusteringM uuid,\r\n    PRIMARY KEY((partitionK), partitionColumns, clustering1, ..., clusteringM)\r\n);\r\n<\/pre>\n<p>Strangely enough, instead of just storing the <strong>partitionColumns<\/strong>, Cassandra also stores all the clustering columns of the base table.<\/p>\n<p>The reason is that the secondary index for static columns was implemented only recently. I have created <strong><a href=\"https:\/\/issues.apache.org\/jira\/browse\/CASSANDRA-11538\">CASSANDRA-11538<\/a><\/strong> to request the same treatment for the partition component index.<\/p>\n<h4>4) Index on Clustering column<\/h4>\n<p>It is possible to have an index on a clustering column. 
Let&#8217;s suppose that we index <em>clusteringJ uuid<\/em>, 1 \u2264 J \u2264 M. The corresponding clustering index schema will be:<\/p>\n<pre class=\"lang:sql\" title=\"Clustering Column Index\">CREATE TABLE clustering_idx(\r\n    clusteringJ uuid,\r\n    partitionColumns blob,\r\n    clustering1 uuid,\r\n    ...\r\n    clusteringI uuid,\r\n    clusteringK uuid,\r\n    ...\r\n    clusteringM uuid,\r\n    PRIMARY KEY((clusteringJ), partitionColumns, clustering1, ..., clusteringI, clusteringK, ..., clusteringM)\r\n);\r\n<\/pre>\n<p>Indeed, the index stores <em>clusteringJ<\/em> as the partition key, the complete <em>partitionColumns<\/em> as a single blob, and the original clustering columns of the row <strong>except <em>clusteringJ<\/em><\/strong>, because its value is already stored as the partition key.<\/p>\n<h4>5) Index on list value<\/h4>\n<p>Let&#8217;s say we want to index values of <em>list_text list&lt;text&gt;<\/em>; Cassandra will create the following index table:<\/p>\n<pre class=\"lang:sql\" title=\"List Value Index\">CREATE TABLE list_idx(\r\n    list_value text,\r\n    partitionColumns blob,\r\n    clustering1 uuid,\r\n    ...\r\n    clusteringM uuid,\r\n    list_position timeuuid,\r\n    PRIMARY KEY((list_value), partitionColumns, clustering1, ..., clusteringM, list_position)\r\n);\r\n<\/pre>\n<p>In addition to the complete primary key of the base table, the index table also stores the position of the indexed value within the list, i.e. its cell name = <em>list_position<\/em>. 
This cell name has the <strong>timeuuid<\/strong> type.<\/p>\n<h4>6) Index on Set value<\/h4>\n<p>If we index the <em>set_text set&lt;text&gt;<\/em> column, the corresponding index table would be:<\/p>\n<pre class=\"lang:sql\" title=\"Set Value Index\">CREATE TABLE set_idx(\r\n    set_value text,\r\n    partitionColumns blob,\r\n    clustering1 uuid,\r\n    ...\r\n    clusteringM uuid,\r\n    PRIMARY KEY((set_value), partitionColumns, clustering1, ..., clusteringM)\r\n);\r\n<\/pre>\n<p>We store the complete primary key of the base table + the cell name of the <em>set_text<\/em> set, which happens to be the indexed value itself.<\/p>\n<h4>7) Index on Map value<\/h4>\n<p>If we index the value of the <em>map_int_text map&lt;int, text&gt;<\/em> column, the corresponding index table would be:<\/p>\n<pre class=\"lang:sql\" title=\"Map Value Index\">CREATE TABLE map_value_idx(\r\n    map_value text,\r\n    partitionColumns blob,\r\n    clustering1 uuid,\r\n    ...\r\n    clusteringM uuid,\r\n    map_key int,\r\n    PRIMARY KEY((map_value), partitionColumns, clustering1, ..., clusteringM, map_key)\r\n);\r\n<\/pre>\n<p>This time, the cell name of the <em>map_int_text<\/em> column is the map key itself.<\/p>\n<h4>8) Index on Map Key and Entry<\/h4>\n<p>If you index on the map key, the index table would resemble:<\/p>\n<pre class=\"lang:sql\" title=\"Map Key Index\">CREATE TABLE map_key_idx(\r\n    map_key int,\r\n    partitionColumns blob,\r\n    clustering1 uuid,\r\n    ...\r\n    clusteringM uuid,\r\n    PRIMARY KEY((map_key), partitionColumns, clustering1, ..., clusteringM)\r\n);\r\n<\/pre>\n<p>An index created on a map entry (key\/value) would create:<\/p>\n<pre class=\"lang:sql\" title=\"Map Entry Index\">CREATE TABLE map_entry_idx(\r\n    map_entry blob,\r\n    partitionColumns blob,\r\n    clustering1 uuid,\r\n    ...\r\n    clusteringM uuid,\r\n    PRIMARY KEY((map_entry), partitionColumns, clustering1, ..., clusteringM)\r\n);\r\n<\/pre>\n<p>The 
<em>map_entry<\/em> column is just a <strong>blob<\/strong> containing the key\/value pair serialized together as <strong>byte[ ]<\/strong>.<\/p>\n<p>Please notice that for map key and map entry indices, the PRIMARY KEY of the index table does not contain the <em>map_key<\/em> column as the last clustering column, as opposed to the map value index implementation.<\/p>\n<hr \/>\n<h2>D Local Read Path<\/h2>\n<p>The local read path for a native secondary index is quite straightforward. First Cassandra reads the index table to retrieve the primary keys of all matching rows, and for each of them, it will read the original table to fetch the data.<\/p>\n<p>One na\u00efve approach would be to request the data from the original table for each entry in the index table. This approach, although correct, is <strong>horribly inefficient<\/strong>. The current implementation <em>groups the primary keys returned by the index by partition key and will scan the original table partition by partition to retrieve the source data<\/em>.<\/p>\n<hr \/>\n<h2 id=\"cluster_read_path\">E Cluster Read Path<\/h2>\n<p>Unlike many distributed search engines (<strong>ElasticSearch<\/strong> and <strong>Solr<\/strong> to name a few), Cassandra does not query all nodes in the cluster for secondary index searching. It has a special algorithm to optimize <strong>range queries<\/strong> (and thus secondary index search queries) on the cluster.<\/p>\n<p>Querying all nodes (or all primary replicas) in one query to search for data suffers from many problems:<\/p>\n<ul>\n<li>on a large cluster (1000 nodes), querying <strong>all primary replicas<\/strong> (N\/RF, N = number of nodes, RF = replication factor) is prohibitive in terms of network bandwidth<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<ul>\n<li>the coordinator will be overwhelmed quickly by the amount of returned data. 
Even if the client has specified a limit (ex: LIMIT 100), on a cluster of 100 nodes with RF=3, the coordinator will query 34 nodes in parallel, each returning 100 rows, so we end up with 3400 rows on the coordinator JVM heap<\/li>\n<\/ul>\n<p>To optimize the distributed search query, Cassandra implements a sophisticated algorithm to query data by range of partition keys (called a Range Scan). <strong>This algorithm is not specific to secondary indexes but is common to all range scans<\/strong>.<\/p>\n<p>The general idea of this algorithm is to query data in <strong>rounds<\/strong>. At each round Cassandra uses a <strong>CONCURRENCY_FACTOR<\/strong> which determines how many nodes need to be queried. If the first round does not return as many rows as requested by the client, a new round is started with an increased <strong>CONCURRENCY_FACTOR<\/strong>.<\/p>\n<blockquote><p>Remark: Cassandra will query the nodes following the token range, so there is no specific ordering to be expected from the returned results<\/p><\/blockquote>\n<p>Below is the exact algorithm:<\/p>\n<ul>\n<ul>\n<li>first select the index with the lowest estimated returned row count, i.e. the most restrictive index. Cassandra will filter down the result set using the other indices (if there are multiple indices in the query). The estimated returned row count for a native secondary index is equal to the estimated number of CQL rows in the index table (<em>estimate_rows<\/em>) because <em>each CQL row in the index table points to a single primary key of the base table<\/em>. 
<strong>This estimated row count is available as a standard histogram computed for all tables (index tables or normal ones)<\/strong><br \/>\n<div id=\"attachment_13505\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-13505\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Formula1-3.png\" alt=\"Estimated_rows_by_token_range\" width=\"800\" height=\"72\" class=\"size-full wp-image-13505\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Formula1-3.png 800w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Formula1-3-300x27.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Formula1-3-768x69.png 768w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/><p id=\"caption-attachment-13505\" class=\"wp-caption-text\">Estimated_rows_by_token_range<\/p><\/div><br \/>\nWe use local data for this estimate based on the <strong>strong assumption<\/strong> that the data is distributed evenly on the cluster, so that local data is representative of the data distribution on each node. If you have a bad data model where the data distribution is skewed, this estimate will be wrong.<\/li>\n<\/ul>\n<\/ul>\n<p>&nbsp;<\/p>\n<ul>\n<ul>\n<li>next, slightly underestimate the previous <em>estimate_rows_by_token_range<\/em> value by multiplying it by a <strong>CONCURRENT_SUBREQUEST_MARGIN<\/strong> (default = 0.1) to increase the likelihood that we fetch enough CQL rows in the first round of queries<\/li>\n<\/ul>\n<\/ul>\n<p>&nbsp;<\/p>\n<ul>\n<ul>\n<li>determine the <strong>CONCURRENCY_FACTOR<\/strong>, i.e. the number of nodes to contact for the <em>current round of queries<\/em>. 
The formula is given by:<br \/>\n<div id=\"attachment_13506\" style=\"width: 1384px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-13506\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Formula2-3.png\" alt=\"Concurrency-Factor-Formula\" width=\"1374\" height=\"85\" class=\"size-full wp-image-13506\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Formula2-3.png 1374w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Formula2-3-300x19.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Formula2-3-768x48.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Formula2-3-1024x63.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Formula2-3-1170x72.png 1170w\" sizes=\"(max-width: 1374px) 100vw, 1374px\" \/><p id=\"caption-attachment-13506\" class=\"wp-caption-text\">Concurrency-Factor-Formula<\/p><\/div><br \/>\n<em>requested_LIMIT<\/em> corresponds to the limit set by the query (`SELECT &#8230; WHERE &#8230; <strong>LIMIT<\/strong> xxx`) or by the query <strong>fetchSize<\/strong> when using server-side paging.<\/li>\n<\/ul>\n<\/ul>\n<p>&nbsp;<\/p>\n<ul>\n<ul>\n<li>then query as many nodes as the <em>CONCURRENCY_FACTOR<\/em><\/li>\n<\/ul>\n<\/ul>\n<p>&nbsp;<\/p>\n<ul>\n<ul>\n<li>if the row count of the first round satisfies the <em>requested_LIMIT<\/em>, return the results<\/li>\n<\/ul>\n<\/ul>\n<p>&nbsp;<\/p>\n<ul>\n<li>else update the <strong>CONCURRENCY_FACTOR<\/strong> with the algorithm below and start another round of queries\n<ul>\n<li>if returned_rows = 0 then<br \/>\n          <br \/> <br \/>\n          <strong>CONCURRENCY_FACTOR<\/strong> = <em>token_ranges_count &#8211; already_queried_token_ranges<\/em>\n        <\/li>\n<p> <\/p>\n<li>else <br \/>\n         <div id=\"attachment_13507\" style=\"width: 1100px\" 
class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-13507\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Formula4_v2-2.png\" alt=\"Updated-Concurrency-Factor\" width=\"1090\" height=\"286\" class=\"size-full wp-image-13507\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Formula4_v2-2.png 1090w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Formula4_v2-2-300x79.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Formula4_v2-2-768x202.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Native2i-Formula4_v2-2-1024x269.png 1024w\" sizes=\"(max-width: 1090px) 100vw, 1090px\" \/><p id=\"caption-attachment-13507\" class=\"wp-caption-text\">Updated-Concurrency-Factor<\/p><\/div>\n    <\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Below is an illustration of how it works on an 8-node cluster:<\/p>\n<div id=\"attachment_13508\" style=\"width: 655px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-13508\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/8nodecluster.jpg\" alt=\"Range-Query-Cinematics\" width=\"645\" height=\"747\" class=\"size-full wp-image-13508\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/8nodecluster.jpg 645w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/8nodecluster-259x300.jpg 259w\" sizes=\"(max-width: 645px) 100vw, 645px\" \/><p id=\"caption-attachment-13508\" class=\"wp-caption-text\">Range-Query-Cinematics<\/p><\/div>\n<p>&nbsp;<\/p>\n<p>The curious reader can refer to the class <em>`StorageProxy.RangeCommandIterator`<\/em> and the method <em>`StorageProxy::getRangeSlice()`<\/em> for the source code of this algorithm.<\/p>\n<hr \/>\n<h2>F Best use cases for native and other secondary index implementations<\/h2>\n<p>Because of how it is 
implemented cluster-wide, all secondary index implementations work best when Cassandra can narrow down the number of nodes to query (<em>e.g. narrow down the token ranges to query<\/em>). This can be achieved if the client query restricts the partition key:<\/p>\n<ul>\n<li>with a single value (<em>`WHERE partition = xxx`<\/em>). This is the most efficient way to use a secondary index because the coordinator only needs to query 1 node (+ replicas depending on the requested consistency level)<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<ul>\n<li>with a list of partition keys (<em>`WHERE partition IN (aaa, bbb, ccc)`<\/em>). This is still quite efficient because the number of nodes to be queried is bounded by the number of distinct values in the IN clause<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<ul>\n<li>with a token range (<em>`WHERE token(partition) \u2265 xxx AND token(partition) \u2264 yyy`<\/em>). This token range restriction avoids querying all primary replicas in the cluster. Of course, the narrower the token range restriction, the better<\/li>\n<\/ul>\n<hr \/>\n<h2>G Caveats<\/h2>\n<p>There are some well known anti-patterns to avoid when using the <strong>native secondary index<\/strong>:<\/p>\n<ul>\n<li>avoid very low cardinality indexes, i.e. indexes where the number of distinct values is very low. A good example is an index on the gender of a user. On each node, the whole user population will be distributed over only 2 different partitions for the index: MALE &amp; FEMALE. If the number of users per node is very dense (e.g. millions), we&#8217;ll have very wide partitions for the MALE &amp; FEMALE index, which is bad<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<ul>\n<li>avoid very high cardinality indexes. For example, indexing users by their email address is a very bad idea. Generally an email address is used by at most 1 user. So there are as many distinct index values (email addresses) as there are users. 
When searching for a user by email, in the best case the coordinator will hit 1 node and find the user by chance. The worst case is when the coordinator hits all primary replicas without finding any answer (0 rows after querying N\/RF nodes!)<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<ul>\n<li>avoid indexing a column which is updated (or removed then re-created) frequently. By design the index data is stored in a Cassandra table, and Cassandra&#8217;s data structures are designed for immutability. Indexing frequently updated data will increase write amplification (for the base table + for the index table)<\/li>\n<\/ul>\n<p>If you need to index a column whose cardinality is a <strong>1-to-1 relationship<\/strong> with the base row (for example an email address for a user), you can use <strong>Materialized Views<\/strong> instead. They can be seen as a global index and guarantee that the query will be executed on only one node (+ replicas depending on consistency level).<\/p>\n<hr\/>\n<p>Article originally published on www.planetcassandra.org<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The native secondary index is the least known and most misused feature of Cassandra. In this article we&#8217;ll thoroughly explain the technical implementation of the native secondary index to highlight best use-cases and the worst anti-patterns. 
For the remainder of this&#8230;<br \/><a class=\"read-more-button\" href=\"https:\/\/www.doanduyhai.com\/blog\/?p=13191\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[57,10],"tags":[],"_links":{"self":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13191"}],"collection":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13191"}],"version-history":[{"count":5,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13191\/revisions"}],"predecessor-version":[{"id":13509,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13191\/revisions\/13509"}],"wp:attachment":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13191"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13191"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13191"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}