{"id":2058,"date":"2016-04-25T20:17:52","date_gmt":"2016-04-25T20:17:52","guid":{"rendered":"http:\/\/www.doanduyhai.com\/blog\/?p=2058"},"modified":"2017-08-10T16:45:04","modified_gmt":"2017-08-10T16:45:04","slug":"cassandra-sasi-index-technical-deep-dive","status":"publish","type":"post","link":"https:\/\/www.doanduyhai.com\/blog\/?p=2058","title":{"rendered":"Cassandra SASI Index Technical Deep Dive"},"content":{"rendered":"<p>This blog post is a technical deep dive into the new cool <strong>SASI<\/strong> index that enables full text search as well as faster multi-criteria search in <strong>Cassandra<\/strong> (<em>introduced since <strong>Cassandra 3.4<\/strong> but I recommend <strong>Cassandra 3.5<\/strong> at least because of critical bugs being fixed<\/em>).<\/p>\n<blockquote><p>For the remaining of this post Cassandra == Apache Cassandra\u2122<\/p><\/blockquote>\n<p>First, a big thank to <strong>Sam Tunnicliffe<\/strong> of <strong>Datastax<\/strong> and <strong>Pavel Yaskevich<\/strong> without whom this post is not possible.<\/p>\n<p>The post divides itself into 10 parts. Below is the table of content:<\/p>\n<ul>\n<li>A) <a href=\"#what_is_sasi\">What is SASI ?<\/a><\/li>\n<li>B) <a href=\"#sasi_syntax_and_usage\">SASI Syntax and Usage<\/a><\/li>\n<ul>\n<li>1) Text data types<\/li>\n<li>2) Other data types<\/li>\n<li>3) All data types<\/li>\n<\/ul>\n<li>C) <a href=\"#sasi_life_cycle\">SASI Life-Cycle<\/a><\/li>\n<li>D) <a href=\"#sasi_write_path\">Write Path<\/a>\n<ul>\n<li>1) In Memory<\/li>\n<li>2) On Flush<\/li>\n<\/ul>\n<\/li>\n<li>E) <a href=\"#sasi_on_disk_format\">On Disk Data Format &#038; Layout<\/a><\/li>\n<ul>\n<li>1) Non SPARSE mode Layout<\/li>\n<li>2) Header Block<\/li>\n<li>3) Non SPARSE Data Block<\/li>\n<li>4) Non SPARSE Term Block<\/li>\n<li>5) Common TokenTree Block<\/li>\n<li>6) SPARSE mode Layout<\/li>\n<li>7) SPARSE Data Block<\/li>\n<li>8) SPARSE Term Block<\/li>\n<li>9) Pointer Block<\/li>\n<li>10) Meta Data Info<\/li>\n<\/ul>\n<li>F) <a href=\"#sasi_read_path\">Read Path<\/a>\n<ul>\n<li>1) Query Planner<\/li>\n<li>2) Cluster Read Path<\/li>\n<li>3) Local Read Path<\/li>\n<\/ul>\n<\/li>\n<li>G) <a href=\"#sasi_disk_space_usage\">Disk Space Usage<\/a><\/li>\n<li>H) <a href=\"#sasi_perf_benchmarks\">Some Performance Benchmarks<\/a><\/li>\n<li>I) <a href=\"#sasi_vs_search\">SASI vs Search Engines<\/a><\/li>\n<li>J) <a href=\"#sasi_trade_offs\">SASI Trade-Offs<\/a><\/li>\n<\/ul>\n<div id=\"what_is_sasi\" style=\"min-height: 90px\">&nbsp;<\/div>\n<h2>A) What is SASI ?<\/h2>\n<p><strong>SASI<\/strong> stands for <strong>S<\/strong>STable-<strong>A<\/strong>ttached <strong>S<\/strong>econdary <strong>I<\/strong>ndex, e.g. the life-cycle of <strong>SASI<\/strong> index files are the same as the one of corresponding SSTables. <strong>SASI<\/strong> is a contribution from a team of engineers, below is the list of all contributors:<\/p>\n<ul>\n<li>Pavel Yaskevich<\/li>\n<li>Jordan West<\/li>\n<li>Jason Brown<\/li>\n<li>Mikhail Stepura<\/li>\n<li>Michael Kjellman<\/li>\n<\/ul>\n<p><strong>SASI<\/strong> is not <em>yet-another-implementation<\/em> of <strong>Cassandra<\/strong> secondary index interface, it introduces a new idea: let the index file follows the life-cycle of the SSTable. It means that whenever an SSTable is created on disk, a corresponding <strong>SASI<\/strong> index file is also created. When are SSTables created ?<\/p>\n<ol>\n<li>during normal flush<\/li>\n<li>during compaction<\/li>\n<li>during streaming operations (node joining or being decommissioned)<\/li>\n<\/ol>\n<p>To enable this new architecture, the <strong>Cassandra<\/strong> source code had to be modified to introduce the new <code>SSTableFlushObserver<\/code> class whose goal is to intercept SSTable flushing and generates the corresponding <strong>SASI<\/strong> index file.<\/p>\n<div id=\"sasi_syntax_and_usage\" style=\"min-height: 90px\">&nbsp;<\/div>\n<h2>B) SASI Syntax and Usage<\/h2>\n<p><strong>SASI<\/strong> uses the standard CQL syntax to create a custom secondary index. Let&#8217;s see all the available index options.<\/p>\n<h4>1) For text data types (text, varchar &#038; ascii)<\/h4>\n<p><strong>Indexing <code>mode<\/code><\/strong>:<\/p>\n<ul>\n<li>\n     <strong>PREFIX<\/strong>: allows matching text value by:<\/p>\n<ul>\n<li>prefix using the <strong><code>LIKE 'prefix%'<\/code><\/strong> syntax<\/li>\n<li>exact match using equality (<strong>=<\/strong>)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li>\n    <strong>CONTAINS<\/strong>: allows matching text value by:<\/p>\n<ul>\n<li>prefix using the <strong><code>LIKE 'prefix%'<\/code><\/strong> syntax (if <strong><code>org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer<\/code><\/strong> is used)<\/li>\n<li>suffix using the <strong><code>LIKE '%suffix'<\/code><\/strong> syntax<\/li>\n<li>substring using the <strong><code>LIKE '%substring%'<\/code><\/strong> syntax<\/li>\n<li>exact match using equality (<strong>=<\/strong>) (if <strong><code>org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer<\/code><\/strong> is used)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><strong>Indexing <code>mode<\/code><\/strong>:<\/p>\n<ul>\n<li><code>analyzed<\/code> (true\/false): activate text analysis. Warning: lower-case\/upper-case normalization requires an analyzer<\/li>\n<\/ul>\n<p><strong>Analyzer class (<code>analyzer_class<\/code>)<\/strong>:<\/p>\n<ul>\n<li><strong><code>org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer<\/code><\/strong> with options:\n<ul>\n<li><code>case_sensitive<\/code> (true\/false): search using case sensitivity<\/li>\n<li><code>normalize_lowercase<\/code> (true\/false): store text as lowercase<\/li>\n<li><code>normalize_uppercase<\/code> (true\/false): store text as uppercase<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li><strong><code>org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer<\/code><\/strong> with options:\n<ul>\n<li><code>tokenization_locale<\/code>: locale to be used for tokenization, stemming and stop words skipping<\/li>\n<li><code>tokenization_enable_stemming<\/code> (true\/false): enable <a href=\"https:\/\/en.wikipedia.org\/wiki\/Stemming\" title=\"Stemming\" target=\"_blank\">stemming<\/a> (locale dependent)<\/li>\n<li><code>tokenization_skip_stop_words<\/code> (true\/false): skip indexing stop words (locale dependent)<\/li>\n<li><code>tokenization_normalize_lowercase<\/code> (true\/false): store text as lowercase<\/li>\n<li><code>tokenization_normalize_uppercase<\/code> (true\/false): store text as uppercase<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_182\">Example of text index<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_182\" class=\"accordion-body collapse in\">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\n\/\/ Full text search on albums title\r\nCREATE CUSTOM INDEX albums_title_idx ON music.albums(title) \r\nUSING 'org.apache.cassandra.index.sasi.SASIIndex'\r\nWITH OPTIONS = {\r\n    'mode': 'CONTAINS',\r\n    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',\r\n    'tokenization_enable_stemming': 'true',\r\n    'tokenization_locale': 'en',\r\n    'tokenization_skip_stop_words': 'true',\r\n    'analyzed': 'true',\r\n    'tokenization_normalize_lowercase': 'true'\r\n};\r\n\r\n\/\/ Full text search on artist name with neither Tokenization nor case sensitivity\r\nCREATE CUSTOM INDEX albums_artist_idx ON music.albums(artist) \r\nUSING 'org.apache.cassandra.index.sasi.SASIIndex'\r\nWITH OPTIONS = {\r\n     'mode': 'PREFIX', \r\n     'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',\r\n     'case_sensitive': 'false'\r\n};\r\n   <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<h4>1) For other data types (int, date, uuid &#8230;)<\/h4>\n<p><strong>Indexing <code>mode<\/code><\/strong>:<\/p>\n<ul>\n<li>\n     <strong>PREFIX<\/strong>: allows matching values by:<\/p>\n<ul>\n<li>equality (<strong>=<\/strong>)<\/li>\n<li>range (<strong> &lt;, &le;, &gt;, &ge; <\/strong>)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li>\n     <strong>SPARSE<\/strong>: allows matching sparse index values by:<\/p>\n<ul>\n<li>equality (<strong>=<\/strong>)<\/li>\n<li>range (<strong> &lt;, &le;, &gt;, &ge; <\/strong>)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<blockquote><p>There is an important remark about <strong>SPARSE<\/strong> mode. By <strong>sparse<\/strong>, it means that <strong>for each indexed value, there are very few (maximum 5 actually) matching rows<\/strong>. If there are more than 5 matching rows, an exception similar to the one below will be thrown:<\/p>\n<p><code>java.io.IOException: Term - 'xxx' belongs to more than 5 keys in SPARSE mode, which is not allowed.<\/code><\/p><\/blockquote>\n<p><strong>SPARSE<\/strong> mode has been designed primarily to index very unique values and allow efficient storage and efficient range query. For example, if you&#8217;re storing user account and creates an index on the <code>account_creation_date<\/code> column (millisecond precision), it&#8217;s likely that you&#8217;ll have very few matching user(s) for a given date. However, you&#8217;ll be able to search user whose account has been created between a wide range of date (<code>WHERE account_creation_date > xxx AND account_creation_date < yyy<\/code>) in a very efficient manner.<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_827\">Example of numeric index<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_827\" class=\"accordion-body collapse in\">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\n\/\/ Range search on numeric value\r\nCREATE CUSTOM INDEX albums_year_idx ON music.albums(year) \r\nUSING 'org.apache.cassandra.index.sasi.SASIIndex'\r\nWITH OPTIONS = {'mode': 'PREFIX'};\r\n   <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<h4>3) For all data types<\/h4>\n<ul>\n<li><code>max_compaction_flush_memory_in_mb<\/code>: defines the max size for the <code>OnDiskIndex<\/code> structure to be kept in memory during compaction. If the index exceeds this size, it will be flushed to disk in segments and merged together in a second pass to create the final <code>OnDiskIndex<\/code>  file<\/li>\n<\/ul>\n<div id=\"sasi_life_cycle\" style=\"min-height: 90px\">&nbsp;<\/div>\n<h2>C) SASI Life-Cycle<\/h2>\n<p>When a mutation is pushed to a node, first it is written into a <strong>CommitLog<\/strong> then put into <strong>MemTable<\/strong>. At the same time, the mutation is indexed into <strong>SASI<\/strong> in-memory index structure (<code><a href=\"https:\/\/github.com\/apache\/cassandra\/blob\/trunk\/src\/java\/org\/apache\/cassandra\/index\/sasi\/memory\/IndexMemtable.java\" target=\"_blank\">IndexMemtable<\/a><\/code>)<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_824\">IndexMemtable.index(DecoratedKey key, ByteBuffer value)<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_824\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: java; highlight: &#091;21&#093;; title: ; notranslate\" title=\"\">\r\n    public long index(DecoratedKey key, ByteBuffer value)\r\n    {\r\n        if (value == null || value.remaining() == 0)\r\n            return 0;\r\n\r\n        AbstractType&lt;?&gt; validator = index.columnIndex.getValidator();\r\n        if (!TypeUtil.isValid(value, validator))\r\n        {\r\n            int size = value.remaining();\r\n            if ((value = TypeUtil.tryUpcast(value, validator)) == null)\r\n            {\r\n                logger.error(&quot;Can't add column {} to index for key: {}, value size {}, validator: {}.&quot;,\r\n                             index.columnIndex.getColumnName(),\r\n                             index.columnIndex.keyValidator().getString(key.getKey()),\r\n                             FBUtilities.prettyPrintMemory(size),\r\n                             validator);\r\n                return 0;\r\n            }\r\n        }\r\n\r\n        return index.add(key, value);\r\n    }\r\n   <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<div id=\"attachment_2081\" style=\"width: 722px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_MemTable.png\"><img aria-describedby=\"caption-attachment-2081\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_MemTable.png\" alt=\"SASI IndexMemtable\" width=\"712\" height=\"305\" class=\"size-full wp-image-2081\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_MemTable.png 1424w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_MemTable-300x129.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_MemTable-1024x439.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_MemTable-1170x501.png 1170w\" sizes=\"(max-width: 712px) 100vw, 712px\" \/><\/a><p id=\"caption-attachment-2081\" class=\"wp-caption-text\">SASI IndexMemtable<\/p><\/div>\n<p>Later on, when MemTables are flushed to disk, <strong>SASI<\/strong> will create one <strong>OnDiskIndex<\/strong> file for each SSTable<\/p>\n<div id=\"attachment_2083\" style=\"width: 722px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_SSTable.png\"><img aria-describedby=\"caption-attachment-2083\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_SSTable.png\" alt=\"SASI OnDiskIndex\" width=\"712\" height=\"300\" class=\"size-full wp-image-2083\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_SSTable.png 1424w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_SSTable-300x126.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_SSTable-1024x431.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_SSTable-1170x492.png 1170w\" sizes=\"(max-width: 712px) 100vw, 712px\" \/><\/a><p id=\"caption-attachment-2083\" class=\"wp-caption-text\">SASI OnDiskIndex<\/p><\/div>\n<p>This write-path applies to:<\/p>\n<ul>\n<li>normal mutations<\/li>\n<li>read repairs<\/li>\n<li>normal repairs<\/li>\n<li>hints replays<\/li>\n<li>streaming operations (node joining, node decommissioned)<\/li>\n<\/ul>\n<p>If SSTables are compacted, the <strong>OnDiskIndex<\/strong> files will also follow the compaction cycle and will be merged into 1 big final <strong>OnDiskIndex<\/strong> file<\/p>\n<div id=\"attachment_2085\" style=\"width: 725px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Compaction.png\"><img aria-describedby=\"caption-attachment-2085\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Compaction.png\" alt=\"SASI OnDiskIndex Compaction\" width=\"715\" height=\"309\" class=\"size-full wp-image-2085\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Compaction.png 1429w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Compaction-300x130.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Compaction-1024x443.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Compaction-1170x506.png 1170w\" sizes=\"(max-width: 715px) 100vw, 715px\" \/><\/a><p id=\"caption-attachment-2085\" class=\"wp-caption-text\">SASI OnDiskIndex Compaction<\/p><\/div>\n<div id=\"sasi_write_path\" style=\"min-height: 90px\">&nbsp;<\/div>\n<h2>D) Write Path<\/h2>\n<h4>1) In Memory<\/h4>\n<p>When a mutation is appended into <strong>MemTable<\/strong>, the <code>AtomicBTreePartition.RowUpdater.apply()<\/code> methods will be invoked and the mutation is passed to the appropriate indexer<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_555\">AtomicBTreePartition.RowUpdater.apply(Row insert)<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_555\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: java; highlight: &#091;4&#093;; title: ; notranslate\" title=\"\">\r\n        public Row apply(Row insert)\r\n        {\r\n            Row data = Rows.copy(insert, builder(insert.clustering())).build();\r\n            indexer.onInserted(insert);\r\n\r\n            this.dataSize += data.dataSize();\r\n            this.heapSize += data.unsharedHeapSizeExcludingData();\r\n            if (inserted == null)\r\n                inserted = new ArrayList&lt;&gt;();\r\n            inserted.add(data);\r\n            return data;\r\n        }\r\n   <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_136\">AtomicBTreePartition.RowUpdater.apply(Row existing, Row update)<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_136\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: java; highlight: &#091;8&#093;; title: ; notranslate\" title=\"\">\r\n        public Row apply(Row existing, Row update)\r\n        {\r\n            Row.Builder builder = builder(existing.clustering());\r\n            colUpdateTimeDelta = Math.min(colUpdateTimeDelta, Rows.merge(existing, update, builder, nowInSec));\r\n\r\n            Row reconciled = builder.build();\r\n\r\n            indexer.onUpdated(existing, reconciled);\r\n\r\n            dataSize += reconciled.dataSize() - existing.dataSize();\r\n            heapSize += reconciled.unsharedHeapSizeExcludingData() - existing.unsharedHeapSizeExcludingData();\r\n            if (inserted == null)\r\n                inserted = new ArrayList&lt;&gt;();\r\n            inserted.add(reconciled);\r\n            discard(existing);\r\n\r\n            return reconciled;\r\n        }\r\n   <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<p>In the case of <strong>SASI<\/strong>, it will call <code>IndexMemtable.index()<\/code> method. Depending on the indexed column type and index mode, an appropriate data-structure is used to store the indexed values:<\/p>\n<table>\n<thead>\n<tr>\n<th>Index Mode<\/th>\n<th>Data Type<\/th>\n<th>Data Structure<\/th>\n<th>Usage syntax<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>PREFIX<\/strong><\/td>\n<td>text, ascii, varchar<\/td>\n<td>Guava <code>ConcurrentRadixTree<\/code><\/td>\n<td>\n          name LIKE 'John%'<br \/>\n          name = 'Johnathan'\n      <\/td>\n<\/tr>\n<tr>\n<td><strong>CONTAINS<\/strong><\/td>\n<td>text, ascii, varchar<\/td>\n<td>Guava <code>ConcurrentSuffixTree<\/code><\/td>\n<td>\n          name LIKE 'John%' <strong><sup>*<\/sup><\/strong><br \/>\n          name LIKE '%nathan'<br \/>\n          name LIKE '%nat%'<br \/>\n          name = 'Johnathan' <strong><sup>*<\/sup><\/strong>\n      <\/td>\n<\/tr>\n<tr>\n<td><strong>PREFIX<\/strong><\/td>\n<td>others (int, date, uuid ...)<\/td>\n<td>Modified JDK <code>ConcurrentSkipListSet<\/code><\/td>\n<td>\n        age = 20<br \/>\n        age >= 20 AND age <= 30\n      <\/td>\n<\/tr>\n<tr>\n<td><strong>SPARSE<\/strong><\/td>\n<td>others (int, date, uuid ...)<\/td>\n<td>Modified JDK <code>ConcurrentSkipListSet<\/code><\/td>\n<td>\n        event_date >= '2016-03-23 00:00:00+0000'<br \/>\n        AND<br \/>\n        event_date <= '2016-04-23 00:00:00+0000'\n      <\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong><sup>*<\/sup><\/strong> only if <strong><code>org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer<\/code><\/strong> is used<br \/>\n<br \/>\n&nbsp;<br \/>\n<\/p>\n<blockquote>\n<div style=\"font-size:1.3em !important\">Please note that <strong>SASI does not intercept DELETE for indexing<\/strong>. Indeed the resolution and reconciliation of deleted data is let to Cassandra at read time. <strong>SASI<\/strong> only indexes INSERT and UPDATE<\/div>\n<\/blockquote>\n<h4>2) On Flush<\/h4>\n<p>When <strong>Cassandra<\/strong> is ready to flush SSTables to disk, it will call <code>SSTableWriter.observers()<\/code> to get a list of all observers. Currently only <strong>SASI<\/strong> registers an observer and it is the <code>PerSSTableIndexWriter<\/code> class. Native secondary index doesn't implement any observer:<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_380\">SSTableWriter.observers(Descriptor descriptor,Collection<Index> indexes,OperationType operationType)<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_380\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: java; highlight: &#091;11,14&#093;; title: ; notranslate\" title=\"\">\r\n    private static Collection&lt;SSTableFlushObserver&gt; observers(Descriptor descriptor,\r\n                                                              Collection&lt;Index&gt; indexes,\r\n                                                              OperationType operationType)\r\n    {\r\n        if (indexes == null)\r\n            return Collections.emptyList();\r\n\r\n        List&lt;SSTableFlushObserver&gt; observers = new ArrayList&lt;&gt;(indexes.size());\r\n        for (Index index : indexes)\r\n        {\r\n            SSTableFlushObserver observer = index.getFlushObserver(descriptor, operationType);\r\n            if (observer != null)\r\n            {\r\n                observer.begin();\r\n                observers.add(observer);\r\n            }\r\n        }\r\n\r\n        return ImmutableList.copyOf(observers);\r\n    }\r\n   <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<p>Then, for each new partition to be written to disk, <code>BigTableWriter.append()<\/code> will call each observer <code>startPartition()<\/code> method, <strong>passing the offset of the current partition in the current SSTable<\/strong>:<br \/>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_871\">BigTableWriter.append(UnfilteredRowIterator iterator)<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_871\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: java; highlight: &#091;15&#093;; title: ; notranslate\" title=\"\">\r\n    public RowIndexEntry append(UnfilteredRowIterator iterator)\r\n    {\r\n        DecoratedKey key = iterator.partitionKey();\r\n\r\n        if (key.getKey().remaining() &gt; FBUtilities.MAX_UNSIGNED_SHORT)\r\n        {\r\n            logger.error(&quot;Key size {} exceeds maximum of {}, skipping row&quot;, key.getKey().remaining(), FBUtilities.MAX_UNSIGNED_SHORT);\r\n            return null;\r\n        }\r\n\r\n        if (iterator.isEmpty())\r\n            return null;\r\n\r\n        long startPosition = beforeAppend(key);\r\n        observers.forEach((o) -&gt; o.startPartition(key, iwriter.indexFile.position()));\r\n        \r\n        ...\r\n        \r\n    }\r\n   <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div><\/p>\n<p>For each row in the partition, the method <code>org.apache.cassandra.db.ColumnIndex.add()<\/code> is called and will notify each observer of the row content to be indexed<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_149\">ColumnIndex.add(Unfiltered unfiltered)<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_149\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: java; highlight: &#091;16&#093;; title: ; notranslate\" title=\"\">\r\n        private void add(Unfiltered unfiltered) throws IOException\r\n        {\r\n            long pos = currentPosition();\r\n\r\n            if (firstClustering == null)\r\n            {\r\n                \/\/ Beginning of an index block. Remember the start and position\r\n                firstClustering = unfiltered.clustering();\r\n                startPosition = pos;\r\n            }\r\n\r\n            UnfilteredSerializer.serializer.serialize(unfiltered, header, writer, pos - previousRowStart, version);\r\n\r\n            \/\/ notify observers about each new row\r\n            if (!observers.isEmpty())\r\n                observers.forEach((o) -&gt; o.nextUnfilteredCluster(unfiltered));\r\n            ...\r\n        }\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<p>When reaching the end of the <strong>MemTable<\/strong>, the method <code>SSTableWriter.finish()<\/code> is invoked to trigger the actual flush. This code also notifies any registered observer to finalize their work<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_534\">SSTableWriter.finish(boolean openResult)<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_534\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: java; highlight: &#091;5&#093;; title: ; notranslate\" title=\"\">\r\n    public SSTableReader finish(boolean openResult)\r\n    {\r\n        setOpenResult(openResult);\r\n        txnProxy.finish();\r\n        observers.forEach(SSTableFlushObserver::complete);\r\n        return finished();\r\n    }\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<p>From <strong>SASI<\/strong> side, the indexing part is done inside the class <code>PerSSTableIndexWriter<\/code>.<\/p>\n<p>All the indexing logic is done by method <code>PerSSTableIndexWriter.Index.add()<\/code>. For each indexed value (called <strong><code>term<\/code><\/strong> in the source code), the analyzer class will split it into multiple tokens (if <code>StandardAnalyzer<\/code> is used) and pass the <em>(term, partition key as token value, partition offset in SSTable)<\/em> triplet to the class <code>OnDiskIndexBuilder<\/code>.<\/p>\n<p>If the built <code>OnDiskIndex<\/code> size has not reach <strong>1Gb<\/strong>, the next <code>term<\/code> is processed otherwise <strong>SASI<\/strong> will schedule an asynchronous flush of this <em>partial segment<\/em> to disk and start building a new one.<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_825\">PerSSTableIndexWriter.Index.add(ByteBuffer term, DecoratedKey key, long keyPosition)<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_825\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: java; highlight: &#091;8,9,11,39,43,44,46&#093;; title: ; notranslate\" title=\"\">\r\npublic void add(ByteBuffer term, DecoratedKey key, long keyPosition)\r\n        {\r\n            if (term.remaining() == 0)\r\n                return;\r\n\r\n            boolean isAdded = false;\r\n\r\n            analyzer.reset(term);\r\n            while (analyzer.hasNext())\r\n            {\r\n                ByteBuffer token = analyzer.next();\r\n                int size = token.remaining();\r\n\r\n                if (token.remaining() &gt;= OnDiskIndexBuilder.MAX_TERM_SIZE)\r\n                {\r\n                    logger.info(&quot;Rejecting value (size {}, maximum {}) for column {} (analyzed {}) at {} SSTable.&quot;,\r\n                            FBUtilities.prettyPrintMemory(term.remaining()),\r\n                            FBUtilities.prettyPrintMemory(OnDiskIndexBuilder.MAX_TERM_SIZE),\r\n                            columnIndex.getColumnName(),\r\n                            columnIndex.getMode().isAnalyzed,\r\n                            descriptor);\r\n                    continue;\r\n                }\r\n\r\n                if (!TypeUtil.isValid(token, columnIndex.getValidator()))\r\n                {\r\n                    if ((token = TypeUtil.tryUpcast(token, columnIndex.getValidator())) == null)\r\n                    {\r\n                        logger.info(&quot;({}) Failed to add {} to index for key: {}, value size was {}, validator is {}.&quot;,\r\n                                    outputFile,\r\n                                    columnIndex.getColumnName(),\r\n                                    keyValidator.getString(key.getKey()),\r\n                                    FBUtilities.prettyPrintMemory(size),\r\n                                    columnIndex.getValidator());\r\n                        continue;\r\n                    }\r\n                }\r\n\r\n                currentBuilder.add(token, key, keyPosition);\r\n                isAdded = true;\r\n            }\r\n\r\n            if (!isAdded || currentBuilder.estimatedMemoryUse() &lt; maxMemorySize)\r\n                return; \/\/ non of the generated tokens were added to the index or memory size wasn't reached\r\n\r\n            segments.add(getExecutor().submit(scheduleSegmentFlush(false)));\r\n        }\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<p>The reason to flush index file by segments is to avoid <code>OutOfMemoryException<\/code>. Once all segments are flushed, they will be stitched together to create the final <code>OnDiskIndex<\/code> file.<\/p>\n<p>The memory threshold is defined in the method <code>PerSSTableIndexWriter.maxMemorySize()<\/code><\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_518\">PerSSTableIndexWriter.maxMemorySize(ColumnIndex columnIndex)<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_518\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: java; title: ; notranslate\" title=\"\">\r\n    protected long maxMemorySize(ColumnIndex columnIndex)\r\n    {\r\n        \/\/ 1G for memtable and configuration for compaction\r\n        return source == OperationType.FLUSH ? 1073741824L : columnIndex.getMode().maxCompactionFlushMemoryInMb;\r\n    }\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<p>When the SSTable flush is complete, the method <code>PerSSTableIndexWriter.complete()<\/code> is called, it will trigger the stitching of index segments together, if there are more than 1 segments.<\/p>\n<p>The stitching phase is necessary because the <strong>terms are sorted in each segment<\/strong> but not globally. The stitching process will help sorting the term globally and merge all the TokenTrees together to create the final index file.<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_216\">PerSSTableIndexWriter.complete(final CountDownLatch latch)<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_216\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: java; highlight: &#091;30,31,32,33,34,35,36,37,38,39,41,42,43,44,55,56,57,58,59,60,61,62,63&#093;; title: ; notranslate\" title=\"\">\r\n        public void complete(final CountDownLatch latch)\r\n        {\r\n            logger.info(&quot;Scheduling index flush to {}&quot;, outputFile);\r\n\r\n            getExecutor().submit((Runnable) () -&gt; {\r\n                long start1 = System.nanoTime();\r\n\r\n                OnDiskIndex[] parts = new OnDiskIndex[segments.size() + 1];\r\n\r\n                try\r\n                {\r\n                    \/\/ no parts present, build entire index from memory\r\n                    if (segments.isEmpty())\r\n                    {\r\n                        scheduleSegmentFlush(true).call();\r\n                        return;\r\n                    }\r\n\r\n                    \/\/ parts are present but there is something still in memory, let's flush that inline\r\n                    if (!currentBuilder.isEmpty())\r\n                    {\r\n                        @SuppressWarnings(&quot;resource&quot;)\r\n                        OnDiskIndex last = scheduleSegmentFlush(false).call();\r\n                        segments.add(Futures.immediateFuture(last));\r\n                    }\r\n\r\n                    int index = 0;\r\n                    ByteBuffer combinedMin = null, combinedMax = null;\r\n\r\n                    for (Future&lt;OnDiskIndex&gt; f : segments)\r\n                    {\r\n                        OnDiskIndex part = f.get();\r\n                        if (part == null)\r\n                            continue;\r\n\r\n                        parts[index++] = part;\r\n                        combinedMin = (combinedMin == null || keyValidator.compare(combinedMin, part.minKey()) &gt; 0) ? part.minKey() : combinedMin;\r\n                        combinedMax = (combinedMax == null || keyValidator.compare(combinedMax, part.maxKey()) &lt; 0) ? part.maxKey() : combinedMax;\r\n                    }\r\n\r\n                    OnDiskIndexBuilder builder = newIndexBuilder();\r\n                    builder.finish(Pair.create(combinedMin, combinedMax),\r\n                                   new File(outputFile),\r\n                                   new CombinedTermIterator(parts));\r\n                }\r\n                catch (Exception | FSError e)\r\n                {\r\n                    logger.error(&quot;Failed to flush index {}.&quot;, outputFile, e);\r\n                    FileUtils.delete(outputFile);\r\n                }\r\n                finally\r\n                {\r\n                    logger.info(&quot;Index flush to {} took {} ms.&quot;, outputFile, TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start1));\r\n\r\n                    for (int segment = 0; segment &lt; segmentNumber; segment++)\r\n                    {\r\n                        OnDiskIndex part = parts[segment];\r\n\r\n                        if (part != null)\r\n                            FileUtils.closeQuietly(part);\r\n\r\n                        FileUtils.delete(outputFile + &quot;_&quot; + segment);\r\n                    }\r\n\r\n                    latch.countDown();\r\n                }\r\n            });\r\n        }\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<div id=\"sasi_on_disk_format\" style=\"min-height: 90px\">&nbsp;<\/div>\n<h2>E) On Disk Data Format & Layout<\/h2>\n<h4>1) Non SPARSE mode Layout<\/h4>\n<p>All the format of <code>OnDiskIndex<\/code> is described in the class <code>OnDiskIndexBuilder<\/code>. From a higher point of view, the <code>OnDiskIndex<\/code> layout for non <strong>SPARSE<\/strong> mode is:<\/p>\n<div id=\"attachment_2203\" style=\"width: 726px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2203\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/NON_SPARSE_mode_OnDiskIndex_Layout.png\" alt=\"NON SPARSE mode OnDiskIndex\" width=\"716\" height=\"285\" class=\"size-full wp-image-2203\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/NON_SPARSE_mode_OnDiskIndex_Layout.png 1432w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/NON_SPARSE_mode_OnDiskIndex_Layout-300x119.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/NON_SPARSE_mode_OnDiskIndex_Layout-768x306.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/NON_SPARSE_mode_OnDiskIndex_Layout-1024x408.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/NON_SPARSE_mode_OnDiskIndex_Layout-1170x466.png 1170w\" sizes=\"(max-width: 716px) 100vw, 716px\" \/><p id=\"caption-attachment-2203\" class=\"wp-caption-text\">NON SPARSE mode OnDiskIndex<\/p><\/div>\n<p>The <strong>Header Block<\/strong> contains general meta data information. The <strong>Data Block<\/strong> contains indexed data with matching token value(s) and offset(s). The <strong>Pointer Block<\/strong> contains pointers to lower levels. It can be seen as a <em>binary tree<\/em> whose goal is to help performing binary search quickly on terms. <\/p>\n<p><strong>Levels Count<\/strong> indicates the number of levels the current binary tree of pointers has<\/p>\n<p><strong>Pointer Block Meta<\/strong> and <strong>Data Block Meta<\/strong> contain offsets to pointer and data blocks to speed up disk access.<\/p>\n<p><strong>Level Index Offset<\/strong> is the offset from the beginning of the file of the whole meta data info block<\/p>\n<blockquote><p>Please notice that Header, Data and Pointer blocks length are multiple of 4k. This is purposely designed to align with block size on disk.<\/p><\/blockquote>\n<h4>2) Header Block<\/h4>\n<div id=\"attachment_2229\" style=\"width: 667px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2229\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Header_Block.png\" alt=\"Header Block\" width=\"657\" height=\"185\" class=\"size-full wp-image-2229\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Header_Block.png 1314w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Header_Block-300x84.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Header_Block-768x216.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Header_Block-1024x288.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Header_Block-1170x329.png 1170w\" sizes=\"(max-width: 657px) 100vw, 657px\" \/><p id=\"caption-attachment-2229\" class=\"wp-caption-text\">Header Block<\/p><\/div>\n<p>The <strong>Descriptor Version<\/strong> is a currently hard-coded value: <strong><code>ab<\/code><\/strong>. <\/p>\n<p>The <strong>Term Size<\/strong> depends on the indexed column data type:<\/p>\n<table>\n<thead>\n<tr>\n<th>Data Type<\/th>\n<th>Term Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>int<\/strong>, <strong>float<\/strong><\/td>\n<td>4<\/td>\n<\/tr>\n<tr>\n<td><strong>bigint<\/strong>, <strong>double<\/strong>, <strong>timestamp<\/strong><\/td>\n<td>8<\/td>\n<\/tr>\n<tr>\n<td><strong>uuid<\/strong>, <strong>timeuuid<\/strong><\/td>\n<td>16<\/td>\n<\/tr>\n<tr>\n<td>all other types<\/td>\n<td>-1 (variable size)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The <strong>Min Term<\/strong> and <strong>Max Term<\/strong> represent respectively the minimum & maximum indexed value found in this index file. <strong>The indexed values are ordered by their type<\/strong> (text --> lexicographic ordering, timestamp --> date ordering , etc ...). Those min\/max terms are useful for <strong>range queries<\/strong> and allow <strong>SASI<\/strong> to skip the entire index file if the [min - max] range does not match the one of the query<\/p>\n<p>The <strong>Min Pk<\/strong> and <strong>Max Pk<\/strong> represent respectively the minimum & maximum partition keys of the matching partitions in this index files. Again they are used to skip index files if the search query specifies a partition key. <\/p>\n<p><strong>Index Mode<\/strong> is just the chosen index mode (<strong>PREFIX<\/strong>, <strong>CONTAINS<\/strong> or <strong>SPARSE<\/strong>)<\/p>\n<p><strong>Has Partial<\/strong> is a boolean flag introduced by <a href=\"https:\/\/issues.apache.org\/jira\/browse\/CASSANDRA-11434\" target=\"_blank\">CASSANDRA-11434<\/a> for backward compatibility and to enable prefix and equality match when using index mode <strong>CONTAINS<\/strong> with <strong><code>NonTokenizingAnalyzer<\/code><\/strong>. More details on this in the next chapter.<\/p>\n<h4>3) Non SPARSE Data Block<\/h4>\n<div id=\"attachment_2273\" style=\"width: 640px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2273\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Non_SPARSE_Data_Block-1-1024x430.png\" alt=\"Non SPARSE Data Block\" width=\"630\" height=\"265\" class=\"size-large wp-image-2273\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Non_SPARSE_Data_Block-1-1024x430.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Non_SPARSE_Data_Block-1-300x126.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Non_SPARSE_Data_Block-1-768x322.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Non_SPARSE_Data_Block-1-1170x491.png 1170w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Non_SPARSE_Data_Block-1.png 1444w\" sizes=\"(max-width: 630px) 100vw, 630px\" \/><p id=\"caption-attachment-2273\" class=\"wp-caption-text\">Non SPARSE Data Block<\/p><\/div>\n<p><strong>Terms Count<\/strong> represents the number of terms (indexed values) in the next Term Block.<\/p>\n<p><strong>Offsets Array<\/strong> is an array of relative offsets for each entry in the <strong>Term Block<\/strong> beginning from the current position<\/p>\n<p><strong>Term Block<\/strong> is a block containing terms and their metadata, it is described below.<\/p>\n<p><strong>TokenTree Block<\/strong> is a block containing a binary tree of token values, it is described below.<\/p>\n<p><strong>Padding<\/strong> is there to fill a <em>block worth of 4k<\/em><\/p>\n<h4>4) Non SPARSE Term Block<\/h4>\n<div id=\"attachment_2275\" style=\"width: 640px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2275\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Non_SPARSE_Term_Block-1024x326.png\" alt=\"Non SPARSE Term Block\" width=\"630\" height=\"200\" class=\"size-large wp-image-2275\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Non_SPARSE_Term_Block-1024x326.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Non_SPARSE_Term_Block-300x96.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Non_SPARSE_Term_Block-768x245.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Non_SPARSE_Term_Block-1170x373.png 1170w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Non_SPARSE_Term_Block.png 1385w\" sizes=\"(max-width: 630px) 100vw, 630px\" \/><p id=\"caption-attachment-2275\" class=\"wp-caption-text\">Non SPARSE Term Block<\/p><\/div>\n<p>Each entry in the <strong>Non SPARSE Term Block<\/strong> is composed of a <strong>Partial Bit<\/strong> which tells whether the current term represent the original term or is one of its suffixes. The term itself is then written, followed by a <strong>0x0 byte<\/strong> and then a TokenTree offset. This offset point to a node in the TokenTree Block that follow this <strong>Term Block<\/strong>. <\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_501\">Example of non SPARSE term block content (mode = CONTAINS, names = {Helen, Jonathan, Patrick})<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_501\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nTerms Count : 20, Offsets [0, 9, 21, 34, 43, 54, 63, 73, 85, 99, 109, 125, 133, 143, 151, 164, 179, 193, 204, 215]\r\nData Term (partial ? true)  : an. 0x0, TokenTree offset : 0\r\nData Term (partial ? true)  : athan. 0x0, TokenTree offset : 80\r\nData Term (partial ? true)  : atrick. 0x0, TokenTree offset : 160\r\nData Term (partial ? true)  : ck. 0x0, TokenTree offset : 240\r\nData Term (partial ? true)  : elen. 0x0, TokenTree offset : 320\r\nData Term (partial ? true)  : en. 0x0, TokenTree offset : 400\r\nData Term (partial ? true)  : han. 0x0, TokenTree offset : 480\r\nData Term (partial ? false) : helen. 0x0, TokenTree offset : 560\r\nData Term (partial ? true)  : hnathan. 0x0, TokenTree offset : 640\r\nData Term (partial ? true)  : ick. 0x0, TokenTree offset : 720\r\nData Term (partial ? false) : johnathan. 0x0, TokenTree offset : 800\r\nData Term (partial ? true)  : k. 0x0, TokenTree offset : 880\r\nData Term (partial ? true)  : len. 0x0, TokenTree offset : 960\r\nData Term (partial ? true)  : n. 0x0, TokenTree offset : 1040\r\nData Term (partial ? true)  : nathan. 0x0, TokenTree offset : 1136\r\nData Term (partial ? true)  : ohnathan. 0x0, TokenTree offset : 1216\r\nData Term (partial ? false) : patrick. 0x0, TokenTree offset : 1296\r\nData Term (partial ? true)  : rick. 0x0, TokenTree offset : 1376\r\nData Term (partial ? true)  : than. 0x0, TokenTree offset : 1456\r\nData Term (partial ? true)  : trick. 0x0, TokenTree offset : 1536\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<p>\n&nbsp;<br \/>\n<\/p>\n<blockquote><p>Please notice that <strong>terms are sorted<\/strong> inside each Term Block as well as between different Term Blocks.<\/p><\/blockquote>\n<h4>5) Common TokenTree Block<\/h4>\n<div id=\"attachment_2277\" style=\"width: 640px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2277\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/TokenTree_Block-1-1024x323.png\" alt=\"TokenTree Block\" width=\"630\" height=\"199\" class=\"size-large wp-image-2277\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/TokenTree_Block-1-1024x323.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/TokenTree_Block-1-300x94.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/TokenTree_Block-1-768x242.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/TokenTree_Block-1-1170x369.png 1170w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/TokenTree_Block-1.png 1362w\" sizes=\"(max-width: 630px) 100vw, 630px\" \/><p id=\"caption-attachment-2277\" class=\"wp-caption-text\">TokenTree Block<\/p><\/div>\n<p><strong>Node Header<\/strong>:<\/p>\n<ul>\n<li><strong>InfoByte<\/strong> is a flag. <strong>0<\/strong> means the current node is a Root node. <strong>1<\/strong> means the current node is the Leaf node and <strong>3<\/strong> means the current node is a last Leaf node or a Root node for a single node tree.<\/li>\n<li><strong>Token Count<\/strong> give the number of matching token values for the given term.<\/li>\n<li><strong>Min Token<\/strong> and <strong>Max Token<\/strong> are self-explanatory<\/li>\n<\/ul>\n<p>The <strong>Node Entries<\/strong> block contains a sequence of <em>(token value, offset(s))<\/em>. Because of possible (although extremely rare) hash collision, a single token value can refer to multiple  partition keys, thus multiple offsets in the <strong>SSTable<\/strong>.<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_232\">Example of TokenTree block content (1 Root\/Leaf node + 1 Root node with 3 Leaf nodes)<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_232\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nRoot Header -- Infobyte : 3, tokens count : 3, min token : -4628280296660234682, max token : 5209625165902544754\r\n                token : -4628280296660234682, offset data : 626062\r\n                token : -276633795570262675, offset data : 1236735\r\n                token : 5209625165902544754, offset data : 2004475\r\n\r\n...\r\nRoot Header -- Infobyte : 0, tokens count : 2, min token : 1002236203180810244, max token : 9166816315445099933\r\nChild offsets: [4096, 8192, 12288]\r\nLeaf Header -- Infobyte : 1, tokens count : 248, min token : -9120558309355192568, max token : 947122733220850512\r\n                token : -9120558309355192568, offset data : 13568\r\n                token : -9115699645219380894, offset data : 14118\r\n                token : -9110053775482800927, offset data : 15042\r\n                token : -9087332613408394714, offset data : 17704\r\n\r\n                ...\r\n\r\nLeaf Header -- Infobyte : 1, tokens count : 194, min token : 1002236203180810244, max token : 9139944811025517925\r\n                token : 1002236203180810244, offset data : 1416779\r\n                token : 1079301330783368458, offset data : 1427152\r\n                token : 1136093249390936984, offset data : 1434834\r\n                token : 1165503468422334041, offset data : 1438905 \r\n\r\n                ...\r\n\r\nLeaf Header -- Infobyte : 3, tokens count : 1, min token : 9166816315445099933, max token : 9166816315445099933\r\n                token : 9166816315445099933, offset data : 2567147\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<p>Inside the <strong>Term Block<\/strong>, there are TokenTree offsets that point to entries inside the <strong>TokenTree Block<\/strong>. With this layout, each term can refer to a list of partition offsets in the corresponding <strong>SSTable<\/strong> for lookup.<\/p>\n<div id=\"attachment_2170\" style=\"width: 543px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2170\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/TermBlock_TokenTreeBlock.png\" alt=\"Term - TokenTree Link\" width=\"533\" height=\"250\" class=\"size-full wp-image-2170\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/TermBlock_TokenTreeBlock.png 1066w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/TermBlock_TokenTreeBlock-300x140.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/TermBlock_TokenTreeBlock-768x360.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/TermBlock_TokenTreeBlock-1024x479.png 1024w\" sizes=\"(max-width: 533px) 100vw, 533px\" \/><p id=\"caption-attachment-2170\" class=\"wp-caption-text\">Term - TokenTree Link<\/p><\/div>\n<h4>6) SPARSE mode Layout<\/h4>\n<p>If you're choosing the index <strong>SPARSE<\/strong> mode, the layout is slightly different:<\/p>\n<div id=\"attachment_2201\" style=\"width: 726px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2201\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SPARSE_mode_OnDiskIndex_Layout.png\" alt=\"SPARSE mode OnDiskIndex\" width=\"716\" height=\"285\" class=\"size-full wp-image-2201\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SPARSE_mode_OnDiskIndex_Layout.png 1432w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SPARSE_mode_OnDiskIndex_Layout-300x119.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SPARSE_mode_OnDiskIndex_Layout-768x306.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SPARSE_mode_OnDiskIndex_Layout-1024x408.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SPARSE_mode_OnDiskIndex_Layout-1170x466.png 1170w\" sizes=\"(max-width: 716px) 100vw, 716px\" \/><p id=\"caption-attachment-2201\" class=\"wp-caption-text\">SPARSE mode OnDiskIndex<\/p><\/div>\n<p>There is a new <strong>Super Block Meta<\/strong> that is added to the end of the Meta Data Info zone.<\/p>\n<p>This <strong>Super Block Meta<\/strong> gives the <strong>number<\/strong> and <strong>offsets<\/strong> of all <strong>Super TokenTree Blocks<\/strong> described below<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_367\">Example of Super Block Meta content<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_367\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nSuper Block offset count : 12\r\nSuper Block offsets : [528384, 1220608, 1916928, 2609152, 3301376, 3997696, 4689920, 5382144, 6078464, 6770688, 7462912, 7995392]\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<h4>7) SPARSE Data Block<\/h4>\n<div id=\"attachment_2193\" style=\"width: 735px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2193\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SPARSE_Data_Block.png\" alt=\"SPARSE Data Block\" width=\"725\" height=\"303\" class=\"size-full wp-image-2193\" \/><p id=\"caption-attachment-2193\" class=\"wp-caption-text\">SPARSE Data Block<\/p><\/div>\n<p>The <strong>SPARSE Data Block<\/strong> contains a <strong>SPARSE Term Block<\/strong> (described below) and for each <strong>64 entries<\/strong>, adds an extra <strong>Super TokenTree Block<\/strong>. The latter is just a merge of the 64 previous small TokenTree Blocks.<\/p>\n<p>Because it is a <strong>SPARSE<\/strong> index, for each indexed value, there is maximum 5 matching rows. Most of the time there is only 1 matching row so indeed the <strong>TokenTree Block<\/strong> is very small and contains almost just 1 entry: <em>(token value, offset(s))<\/em>.<\/p>\n<blockquote><p>Thus, the <strong>Super TokenTree Block<\/strong> is there to aggregate all the <em>(token value, offset(s))<\/em> data into one super tree to accelerate queries that cover a wide range of values.<\/p><\/blockquote>\n<h4>8) SPARSE Term Block<\/h4>\n<div id=\"attachment_2302\" style=\"width: 640px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2302\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SPARSE_Term_Block-1024x326.png\" alt=\"SPARSE Term Block\" width=\"630\" height=\"200\" class=\"size-large wp-image-2302\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SPARSE_Term_Block-1024x326.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SPARSE_Term_Block-300x95.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SPARSE_Term_Block-768x244.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SPARSE_Term_Block-1170x372.png 1170w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SPARSE_Term_Block.png 1386w\" sizes=\"(max-width: 630px) 100vw, 630px\" \/><p id=\"caption-attachment-2302\" class=\"wp-caption-text\">SPARSE Term Block<\/p><\/div>\n<p>For <strong>SPARSE Term Block<\/strong>, instead of TokenTree offset, <strong>SASI<\/strong> just stores token  count and an array of token (for the case where there is hash collision). <\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_293\">Example of SPARSE term block content<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_293\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nTerm count : 151, Offsets [0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 525, 550, 575, 600, 625, 650, 675, 700, 725, 750, 775, 800, 825, 850, 875, 900, 925, 950, 975, 1000, 1025, 1050, 1075, 1100, 1125, 1150, 1175, 1200, 1225, 1250, 1275, 1300, 1325, 1350, 1375, 1400, 1425, 1450, 1475, 1500, 1525, 1550, 1575, 1600, 1625, 1650, 1675, 1700, 1725, 1750, 1775, 1800, 1825, 1850, 1875, 1900, 1925, 1950, 1975, 2000, 2025, 2050, 2075, 2100, 2125, 2150, 2175, 2200, 2225, 2250, 2275, 2300, 2325, 2350, 2375, 2400, 2425, 2450, 2475, 2500, 2525, 2550, 2575, 2600, 2625, 2650, 2675, 2700, 2725, 2750, 2775, 2800, 2825, 2850, 2875, 2900, 2925, 2950, 2975, 3000, 3025, 3050, 3075, 3100, 3125, 3150, 3175, 3200, 3225, 3250, 3275, 3300, 3325, 3350, 3375, 3400, 3425, 3450, 3475, 3500, 3525, 3550, 3575, 3600, 3625, 3650, 3675, 3700, 3725, 3750]\r\nSPARSE mode Data Term (partial ? false) : 00006d9c-2e82-4121-af62-4985ef049ab2. Token count : 1, Tokens [454478372476719604]\r\nSPARSE mode Data Term (partial ? false) : 0000b112-bd10-4b0f-b630-756d58a120f5. Token count : 1, Tokens [-4566353347737760613]\r\nSPARSE mode Data Term (partial ? false) : 0000c8a7-77a5-4556-aba9-7ae25484e1ac. Token count : 1, Tokens [7930016921529937694]\r\nSPARSE mode Data Term (partial ? false) : 00022bcc-d2c7-43b7-81e0-78e8cea743e6. Token count : 1, Tokens [1669390735346713894]\r\nSPARSE mode Data Term (partial ? false) : 0002aded-efc8-46ea-acb7-56839003eed9. Token count : 1, Tokens [8078947252161450449]\r\nSPARSE mode Data Term (partial ? false) : 0002ffe6-cb63-4055-a3ce-f40a4bc57b46. Token count : 1, Tokens [339460836208023232]\r\nSPARSE mode Data Term (partial ? false) : 0003b80b-3231-447f-a52c-0733cdcb4fc0. Token count : 1, Tokens [-3305941541833453269]\r\nSPARSE mode Data Term (partial ? false) : 000477ab-8965-4d79-9cab-a1257f794eeb. Token count : 1, Tokens [-471202335109983528]\r\nSPARSE mode Data Term (partial ? false) : 0005751e-327c-4c00-8a91-2ff78c41835f. Token count : 1, Tokens [7499979976904876222]\r\n\r\n...\r\n\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<h4>9) Pointer Block<\/h4>\n<p>Now, we describe how the Pointer Blocks are built and their layout.<\/p>\n<div id=\"attachment_2209\" style=\"width: 1460px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2209\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Building.png\" alt=\"Pointer Block Building\" width=\"1450\" height=\"658\" class=\"size-full wp-image-2209\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Building.png 1450w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Building-300x136.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Building-768x349.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Building-1024x465.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Building-1170x531.png 1170w\" sizes=\"(max-width: 1450px) 100vw, 1450px\" \/><p id=\"caption-attachment-2209\" class=\"wp-caption-text\">Pointer Block Building<\/p><\/div>\n<p>Every time that a <strong>Data Block<\/strong> reaches <strong>4k worth of data<\/strong>, it is flushed to disk and the last term is promoted to the upper level called <strong>Pointer Level<\/strong>. When this <strong>Pointer Block<\/strong> content reaches 4k worth of data again, it is flushed to disk and the last <strong>Pointer Term<\/strong> (described below) is promoted to the upper level and so on.<\/p>\n<p><strong>SASI<\/strong> builds the index data from bottom up e.g. first the <strong>Data Level<\/strong> and then all the <strong>Pointer Levels<\/strong> up to the <strong>Root Pointer Level<\/strong>. This bottom-up approach has the advantage not to require a lot of memory because data are flushed to disk for every block of 4k. Inside each <strong>Pointer Level<\/strong>, the same 4k worth of data rule applies and this end up by creating a kind of binary tree.<\/p>\n<p>Contrary to classical B+Tree, the <strong>Pointer Block<\/strong> tree adds up levels only on 4k block of data threshold so there is no guarantee about tree balance with regard to the content.<\/p>\n<p>Terms are sorted at the <strong>Data Level<\/strong> so the terms inside each <strong>Pointer Level<\/strong> are also sorted as a result.<\/p>\n<p>Now let's see the structure of each <strong>Pointer Block<\/strong>:<\/p>\n<div id=\"attachment_2284\" style=\"width: 640px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2284\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block-1024x369.png\" alt=\"Pointer Block\" width=\"630\" height=\"227\" class=\"size-large wp-image-2284\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block-1024x369.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block-300x108.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block-768x277.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block-1170x422.png 1170w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block.png 1220w\" sizes=\"(max-width: 630px) 100vw, 630px\" \/><p id=\"caption-attachment-2284\" class=\"wp-caption-text\">Pointer Block<\/p><\/div>\n<p>Again, the structure is very similar to a <strong>Data Block<\/strong>. The only difference is the <strong>Pointer Term Block<\/strong> instead of Term Block.<\/p>\n<div id=\"attachment_2285\" style=\"width: 640px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2285\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Term_Block-1024x322.png\" alt=\"Pointer Term Block\" width=\"630\" height=\"198\" class=\"size-large wp-image-2285\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Term_Block-1024x322.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Term_Block-300x94.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Term_Block-768x242.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Term_Block-1170x368.png 1170w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Term_Block.png 1172w\" sizes=\"(max-width: 630px) 100vw, 630px\" \/><p id=\"caption-attachment-2285\" class=\"wp-caption-text\">Pointer Term Block<\/p><\/div>\n<p>Inside each <strong>Pointer Term Block<\/strong>, each term is pointing to the Data Block Index e.g. the index position of the corresponding <strong>Data Block<\/strong> at the <strong>Data Level<\/strong>.<\/p>\n<p>This index is useful because <strong>SASI<\/strong> stores all the offsets of Data Blocks in an array (accessible by index) in the Data Block Meta we'll see below.<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_910\">Example of Pointer Block content<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_910\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nPOINTERS BLOCKS\r\nTerm count: 7, Offsets [0, 20, 40, 60, 80, 100, 120]\r\nPointer Term (partial ? false) : fdcff974-bddd-4c4a-a6ff-6615de31d2a1, Block number : 740.\r\nPointer Term (partial ? false) : fe20819f-393c-483e-9e2f-cbd8193fdd15, Block number : 741.\r\nPointer Term (partial ? false) : fe722e4d-25c0-49cd-a9b3-914191e36e9c, Block number : 742.\r\nPointer Term (partial ? false) : fed46ad8-f5a8-406b-a29e-70e71f1862fd, Block number : 743.\r\nPointer Term (partial ? false) : ff352093-c3e4-4e57-83f5-fb5b9101e3e9, Block number : 744.\r\nPointer Term (partial ? false) : ff8c2aab-23d4-4b6e-a706-17dda3a78319, Block number : 745.\r\nPointer Term (partial ? false) : ffeb113c-0bdc-4be5-b3cf-1e0449b37938, Block number : 746.\r\nTerm count : 4, Offsets [0, 20, 40, 60]\r\nPointer Term (partial ? false) : 3f207887-da39-40c0-833c-91547548700f, Block number : 0.\r\nPointer Term (partial ? false) : 7e6f890a-43a8-4021-a473-f18d575d5466, Block number : 1.\r\nPointer Term (partial ? false) : be7c4641-d198-4a97-a279-28f54a8e1cc0, Block number : 2.\r\nPointer Term (partial ? false) : fd711b21-de0c-4270-bb03-956286a2c36a, Block number : 3.\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<h4>10) Meta Data Info<\/h4>\n<div id=\"attachment_2226\" style=\"width: 547px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2226\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Meta_Data_Info.png\" alt=\"Meta Data Info\" width=\"537\" height=\"104\" class=\"size-full wp-image-2226\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Meta_Data_Info.png 1073w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Meta_Data_Info-300x58.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Meta_Data_Info-768x149.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Meta_Data_Info-1024x199.png 1024w\" sizes=\"(max-width: 537px) 100vw, 537px\" \/><p id=\"caption-attachment-2226\" class=\"wp-caption-text\">Meta Data Info<\/p><\/div>\n<p>The <strong>Meta Data Info<\/strong> block consists of:<\/p>\n<ul>\n<li><strong>Levels Count<\/strong>: number of Pointer Levels in <strong>Pointer Block<\/strong><\/li>\n<li><strong>Pointer Block Meta<\/strong>: Pointer Block count and offsets to those blocks<\/li>\n<li><strong>Data Block Meta<\/strong>: Data Block count and offsets to those blocks<\/li>\n<li><strong>Super Block Meta<\/strong> (for SPARSE mode only): Super TokenTree Block count and offsets to those blocks<\/li>\n<li><strong>Level Index Offset<\/strong>: offset from the beginning of the file to the <strong>Meta Data Info<\/strong> block<\/li>\n<\/ul>\n<div id=\"attachment_2220\" style=\"width: 520px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2220\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Meta.png\" alt=\"Pointer Block Meta\" width=\"510\" height=\"244\" class=\"size-full wp-image-2220\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Meta.png 1019w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Meta-300x144.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Meta-768x368.png 768w\" sizes=\"(max-width: 510px) 100vw, 510px\" \/><p id=\"caption-attachment-2220\" class=\"wp-caption-text\">Pointer Block Meta<\/p><\/div>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_766\">Example of Pointer Block Meta<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_766\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nLevels count : 2\r\n--------------\r\nPOINTER BLOCKS META\r\nBlock offset count : 1, Block offsets : [37830656]\r\nBlock offset count : 2, Block offsets : [22806528, 37826560]\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<div id=\"attachment_2222\" style=\"width: 378px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2222\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Data_Block_Super_Block_Meta.png\" alt=\"Data Block &amp; Super Block Meta\" width=\"368\" height=\"220\" class=\"size-full wp-image-2222\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Data_Block_Super_Block_Meta.png 736w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Data_Block_Super_Block_Meta-300x180.png 300w\" sizes=\"(max-width: 368px) 100vw, 368px\" \/><p id=\"caption-attachment-2222\" class=\"wp-caption-text\">Data Block & Super Block Meta<\/p><\/div>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_457\">Example of Data Block Meta<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_457\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nDATA BLOCKS META\r\nBlock offset count : 748, Block offsets : [4096, 12288, 20480, ...]\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_516\">Example of Super Block Meta<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_516\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nSuper Block offset count : 12, Super Block offsets : [528384, 1220608, 1916928, ...]\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<p>It was very hard to reverse-engineer <strong>SASI<\/strong> source code to understand the <strong>OnDiskIndex<\/strong> layout, even with some help from <strong>Pavel Yaskevich<\/strong>. The reason is that the source code is quite abstract (frequent use of generics and polymorphism to mutualise code, which is very good) and very low level (usage of bit operators for performance).<\/p>\n<p>To be able to have a clear understanding of the layout, I had to patch the source code to introduce debugging points through all the life-cycle of <strong>OnDiskIndex<\/strong> building and output the content to the file <em><code>\/tmp\/debug_SASI.txt<\/code><\/em>. If you want to look into the index structure and see how data are really organized on disk, just apply the <a href=\"https:\/\/gist.github.com\/doanduyhai\/7f0dc2dda6ba3fe381cdcf8b9a93ed89\" target=\"_blank\">SASI Debug Patch<\/a>. Warning, the patch has been created from <strong>Cassandra 3.6-SNAPSPHOT<\/strong>. Future updates to <strong>SASI<\/strong> source code may require manual merging when applying this patch.<\/p>\n<div id=\"sasi_read_path\" style=\"min-height: 90px\">&nbsp;<\/div>\n<h2>F) Read Path<\/h2>\n<h4>1) Query Planner<\/h4>\n<p>The integrated <strong>Query Planner<\/strong> is the real workhorse of <strong>SASI<\/strong>. It is responsible to:<\/p>\n<ol>\n<li>Create a Query Plan<\/li>\n<li>Analyze the query<\/li>\n<li>Build an Expressions tree<\/li>\n<li>Optimize the Expressions tree with predicates push-down and merge<\/li>\n<li>Execute the query<\/li>\n<\/ol>\n<p>First, the query expressions (predicates) are analyzed and grouped into a MultiMap (a map with multiple values). Expressions are sorted by column name and then by operator precedence.<\/p>\n<table>\n<thead>\n<tr>\n<th>Operator<\/th>\n<th>Priority (Higher value, Better Prioriry)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>=<\/strong><\/td>\n<td>5<\/td>\n<\/tr>\n<tr>\n<td><strong>LIKE<\/strong><\/td>\n<td>4<\/td>\n<\/tr>\n<tr>\n<td><strong>&gt;<\/strong>, <strong>&ge;<\/strong><\/td>\n<td>3<\/td>\n<\/tr>\n<tr>\n<td><strong>&lt;<\/strong>, <strong>&le;<\/strong><\/td>\n<td>2<\/td>\n<\/tr>\n<tr>\n<td><strong>!=<\/strong><\/td>\n<td>1<\/td>\n<\/tr>\n<tr>\n<td>other custom expressions<\/td>\n<td>0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Expressions using <strong>LIKE<\/strong> predicate are passed to the analyzer. If the <strong><code>StandardAnalyzer<\/code><\/strong> is used, the queried value is tokenized and each token is added as an alternation. A query like <em><code>WHERE title LIKE 'love sad'<\/code><\/em> will be turned into the equivalent of <em><code>WHERE title LIKE 'love' OR title LIKE 'sad'<\/code><\/em> (see <code>Operation.analyzeGroup()<\/code>)<\/p>\n<p>The result of the query optimization is an operation tree where predicates are merged and re-arranged.<\/p>\n<p>Let's consider the following query: <em><code>WHERE age < 100 AND fname = 'p*' AND first_name != 'pa*' AND age > 21<\/code><\/em><\/p>\n<div id=\"attachment_2233\" style=\"width: 383px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2233\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree1.png\" alt=\"SASI Operation Tree Step 1\" width=\"373\" height=\"314\" class=\"size-full wp-image-2233\" \/><p id=\"caption-attachment-2233\" class=\"wp-caption-text\">SASI Operation Tree Step 1<\/p><\/div>\n<p>Since <strong>AND<\/strong> clause is commutative and associative, <strong>SASI<\/strong> can merge <code>fname<\/code> predicate with <code>age<\/code> predicate.<\/p>\n<div id=\"attachment_2234\" style=\"width: 1415px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2234\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree2.png\" alt=\"SASI Operation Tree Step 2\" width=\"1405\" height=\"518\" class=\"size-full wp-image-2234\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree2.png 1405w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree2-300x111.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree2-768x283.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree2-1024x378.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree2-1170x431.png 1170w\" sizes=\"(max-width: 1405px) 100vw, 1405px\" \/><p id=\"caption-attachment-2234\" class=\"wp-caption-text\">SASI Operation Tree Step 2<\/p><\/div>\n<p>Now, not equal operator (<strong><code>!=<\/code><\/strong>) can be merged with the prefix search as an exclusion filter.<\/p>\n<div id=\"attachment_2235\" style=\"width: 1473px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2235\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree3.png\" alt=\"SASI Operation Tree Step 3\" width=\"1463\" height=\"506\" class=\"size-full wp-image-2235\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree3.png 1463w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree3-300x104.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree3-768x266.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree3-1024x354.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree3-1170x405.png 1170w\" sizes=\"(max-width: 1463px) 100vw, 1463px\" \/><p id=\"caption-attachment-2235\" class=\"wp-caption-text\">SASI Operation Tree Step 3<\/p><\/div>\n<p>Indeed, not equal predicate is implemented internally as range scan (scan on the range of tokens) with exclusion filter. If the query has only a not equal predicate, <strong>SASI<\/strong> needs to scan through all the <strong>OnDiskIndex<\/strong> file and remove un-wanted values. This is not very optimized but unavoidable.<\/p>\n<p>However, if not equal predicate is used in conjunction with other predicates (LIKE or inequality) then <strong>SASI<\/strong> will embed the former as exclusion filter while performing search on the latter.<\/p>\n<p>Finally, the predicates on <code>age<\/code> can be merged together again because <strong>AND<\/strong> is commutative and associative.<\/p>\n<div id=\"attachment_2236\" style=\"width: 1400px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2236\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree4.png\" alt=\"SASI Operation Tree Step 4\" width=\"1390\" height=\"439\" class=\"size-full wp-image-2236\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree4.png 1390w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree4-300x95.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree4-768x243.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree4-1024x323.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/SASI_Operation_Tree4-1170x370.png 1170w\" sizes=\"(max-width: 1390px) 100vw, 1390px\" \/><p id=\"caption-attachment-2236\" class=\"wp-caption-text\">SASI Operation Tree Step 4<\/p><\/div>\n<h4>2) Cluster Read Path<\/h4>\n<p>The read path for SASI query on the cluster is exactly the one implemented for normal range scan query. Please read my article on Native Secondary Index, chapter <a href=\"http:\/\/www.planetcassandra.org\/blog\/cassandra-native-secondary-index-deep-dive\/#cluster_read_path\" target=\"_blank\">E) Cluster Read Path<\/a> to have a clear understanding of how the coordinator issues queries across the cluster.<\/p>\n<p>Because <code>SASIIndex.getEstimatedResultRows()<\/code> returns <code>Long.MIN_VALUE<\/code> as a work-around to have higher precedence on native secondary index, the formula to compute the CONCURRENCY_FACTOR for the first round of query is completely ineffective and <strong>always return 1<\/strong>. <\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_860\">SASIIndex.getEstimatedResultRows()<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_860\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: java; highlight: &#091;3,4,5&#093;; title: ; notranslate\" title=\"\">\r\n    public long getEstimatedResultRows()\r\n    {\r\n        \/\/ this is temporary (until proper QueryPlan is integrated into Cassandra)\r\n        \/\/ and allows us to priority SASI indexes if any in the query since they\r\n        \/\/ are going to be more efficient, to query and intersect, than built-in indexes.\r\n        return Long.MIN_VALUE;\r\n    } \r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<blockquote>\n<div style=\"font-size: 1.3 em !important\">\nAs a result, every search with <strong>SASI<\/strong> currently always hit the same node, which is the node responsible for the first token range on the cluster. Subsequent rounds of query (if any) will spread out to other nodes eventually<\/div>\n<\/blockquote>\n<p>Let's hope that this temporary hack will be removed once the Query Plan get fully integrated into <strong>Cassandra<\/strong>.<\/p>\n<h4>3) Local Read Path<\/h4>\n<p>On each local node, <strong>SASI<\/strong> will load the <strong>OnDiskIndex<\/strong> files into system page cache using memory mapped buffer (<code>org.apache.cassandra.index.sasi.utils.MappedBuffer<\/code>) to speed up reading and search.<\/p>\n<p>First, on index file opening, <strong>SASI<\/strong> reads the last 8 bytes at the end of the file to retrieve the offset (<strong>Level Index Offset<\/strong>) of the <strong>Meta Data Info<\/strong> block (see data layout above).<\/p>\n<p>Then it loads all the <strong>Pointer Block Meta<\/strong> and <strong>Data Bloc Meta<\/strong> into memory.<\/p>\n<div id=\"attachment_2243\" style=\"width: 700px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2243\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Binary_Search.png\" alt=\"Pointer Block Binary Search\" width=\"690\" height=\"275\" class=\"size-full wp-image-2243\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Binary_Search.png 1379w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Binary_Search-300x119.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Binary_Search-768x306.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Binary_Search-1024x408.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Pointer_Block_Binary_Search-1170x466.png 1170w\" sizes=\"(max-width: 690px) 100vw, 690px\" \/><p id=\"caption-attachment-2243\" class=\"wp-caption-text\">Pointer Block Binary Search<\/p><\/div>\n<p>When searching for a term, <strong>SASI<\/strong> uses the <strong>Pointer Block<\/strong> to perform <em>binary search<\/em> from the <strong>Root Pointer Level<\/strong> down to the last <strong>Pointer Level<\/strong>. From this last <strong>Pointer Level<\/strong>, <strong>SASI<\/strong> knows in which <strong>Data Block<\/strong> (because the <strong>Pointer Term<\/strong> keeps a reference to the <strong>Data Block<\/strong> index) it should look for the actual matched value, if any.<\/p>\n<p>Inside each <strong>Data Block<\/strong>, since the terms are sorted, <strong>SASI<\/strong> can again use binary search to reach quickly the matching value.<\/p>\n<div id=\"attachment_2245\" style=\"width: 640px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2245\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Term_Block_Binary_Search-1024x443.png\" alt=\"Term Block Binary Search\" width=\"630\" height=\"272\" class=\"size-large wp-image-2245\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Term_Block_Binary_Search-1024x443.png 1024w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Term_Block_Binary_Search-300x130.png 300w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Term_Block_Binary_Search-768x332.png 768w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Term_Block_Binary_Search-1170x506.png 1170w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Term_Block_Binary_Search.png 1374w\" sizes=\"(max-width: 630px) 100vw, 630px\" \/><p id=\"caption-attachment-2245\" class=\"wp-caption-text\">Term Block Binary Search<\/p><\/div>\n<p>For <strong>prefix search<\/strong>, since all the text terms are stored in their original form, <strong>SASI<\/strong> will strip out the <strong><code>%<\/code><\/strong> character and compare the searched value with the stored term prefix having the same length as the former. <\/p>\n<p>For example, if the index contains the term <em><code>'Jonathan'<\/code><\/em> and the query is <em><code>LIKE 'John%'<\/code><\/em>, <strong>SASI<\/strong> will remove the last 4 characters of <em><code>'Jonathan'<\/code><\/em> and compare <em><code>'Jona'<\/code><\/em> to <em><code>'John'<\/code><\/em>. In this case, there is no match.<\/p>\n<p>If the index mode is <strong>CONTAINS<\/strong> and the user issues a prefix or equality search, <strong>SASI will only use stored terms that have their Partial Bit = false<\/strong> . Indeed, all stored terms whose Partial Bit = true mean that they are a suffix of a longer string and thus cannot be used for neither prefix nor equality search.<\/p>\n<p>Let's illustrate will a simple example. Suppose we index the following names using mode <strong>CONTAINS<\/strong> with <strong><code>NonTokenizingAnalyzer<\/code><\/strong>: <em>Helen<\/em>, <em>Johnathan<\/em> & <em>Patrick<\/em>:<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_799\">Stored terms for CONTAINS mode<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_799\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nData Term (partial ? true)  : an. 0x0, TokenTree offset : 0\r\nData Term (partial ? true)  : athan. 0x0, TokenTree offset : 80\r\nData Term (partial ? true)  : atrick. 0x0, TokenTree offset : 160\r\nData Term (partial ? true)  : ck. 0x0, TokenTree offset : 240\r\nData Term (partial ? true)  : elen. 0x0, TokenTree offset : 320\r\nData Term (partial ? true)  : en. 0x0, TokenTree offset : 400\r\nData Term (partial ? true)  : han. 0x0, TokenTree offset : 480\r\nData Term (partial ? false) : helen. 0x0, TokenTree offset : 560\r\nData Term (partial ? true)  : hnathan. 0x0, TokenTree offset : 640\r\nData Term (partial ? true)  : ick. 0x0, TokenTree offset : 720\r\nData Term (partial ? false) : johnathan. 0x0, TokenTree offset : 800\r\nData Term (partial ? true)  : k. 0x0, TokenTree offset : 880\r\nData Term (partial ? true)  : len. 0x0, TokenTree offset : 960\r\nData Term (partial ? true)  : n. 0x0, TokenTree offset : 1040\r\nData Term (partial ? true)  : nathan. 0x0, TokenTree offset : 1136\r\nData Term (partial ? true)  : ohnathan. 0x0, TokenTree offset : 1216\r\nData Term (partial ? false) : patrick. 0x0, TokenTree offset : 1296\r\nData Term (partial ? true)  : rick. 0x0, TokenTree offset : 1376\r\nData Term (partial ? true)  : than. 0x0, TokenTree offset : 1456\r\nData Term (partial ? true)  : trick. 0x0, TokenTree offset : 1536\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<p>If we now search by prefix with <em><code>LIKE 'John%'<\/code><\/em>, out of the 20 stored terms, only 3 of them have Partial Bit = false (<em>helen<\/em>, <em>johnathan<\/em> & <em>patrick<\/em>) and will be used for the prefix match.<\/p>\n<p>Once a match is found, <strong>SASI<\/strong> returns the token value of the partition and offset(s) from the beginning of the SSTable. This offset will be used by <em><code>SSTableIndex.DecoratedKeyFetcher.apply()<\/code><\/em> method to retrieve the <code>DecoratedKey<\/code> from the SSTable. This method is just delegating the work to <em><code>SSTableReader.keyAt()<\/code><\/em> method.<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_129\">SSTableReader.keyAt(long indexPosition)<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_129\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: java; highlight: &#091;14,15&#093;; title: ; notranslate\" title=\"\">\r\npublic DecoratedKey keyAt(long indexPosition) throws IOException\r\n    {\r\n        DecoratedKey key;\r\n        try (FileDataInput in = ifile.createReader(indexPosition))\r\n        {\r\n            if (in.isEOF())\r\n                return null;\r\n\r\n            key = decorateKey(ByteBufferUtil.readWithShortLength(in));\r\n\r\n            \/\/ hint read path about key location if caching is enabled\r\n            \/\/ this saves index summary lookup and index file iteration which whould be pretty costly\r\n            \/\/ especially in presence of promoted column indexes\r\n            if (isKeyCacheSetup())\r\n                cacheKey(key, rowIndexEntrySerializer.deserialize(in));\r\n        }\r\n\r\n        return key;\r\n    }\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<p>By chance (or was it intended), calling this method also pulls the entry into the Partition Key Cache so that subsequent access to this partition will leverage the cache to access the partition directly on disk.<\/p>\n<p>Once the <em><code>DecoratedKey<\/code><\/em> for the matching partition is found, <strong>SASI<\/strong> just hands over the data reading part to <strong>Cassandra<\/strong> <em><code>SingleReadCommand<\/code><\/em> which has the responsibility to fetch the matching row(s) and apply reconciliation logic (last-write-win, tombstone ...)<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_814\">QueryController.getPartition(DecoratedKey key, ReadExecutionController executionController)<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_814\" class=\"accordion-body collapse \">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: java; highlight: &#091;7,16&#093;; title: ; notranslate\" title=\"\">\r\n    public UnfilteredRowIterator getPartition(DecoratedKey key, ReadExecutionController executionController)\r\n    {\r\n        if (key == null)\r\n            throw new NullPointerException();\r\n        try\r\n        {\r\n            SinglePartitionReadCommand partition = SinglePartitionReadCommand.create(command.isForThrift(),\r\n                                                                                     cfs.metadata,\r\n                                                                                     command.nowInSec(),\r\n                                                                                     command.columnFilter(),\r\n                                                                                     command.rowFilter().withoutExpressions(),\r\n                                                                                     DataLimits.NONE,\r\n                                                                                     key,\r\n                                                                                     command.clusteringIndexFilter(key));\r\n\r\n            return partition.queryMemtableAndDisk(cfs, executionController.baseReadOpOrderGroup());\r\n        }\r\n        finally\r\n        {\r\n            checkpoint();\r\n        }\r\n    }\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<p>At that point the alert reader should realise that <strong>SASI<\/strong> does not fully optimize SSTable disk access. Indeed the index only stores the <strong>offset to the complete partition<\/strong>, not to the exact matching rows. If your schema has very wide partitions, <strong>Cassandra<\/strong> will have to full scan it to find the rows. Worst, unlike native secondary index where clustering values are also kept in the index data to help skipping blocks to the nearest position, <strong>SASI<\/strong> index only provides partition offsets.<\/p>\n<p>I asked <strong>Pavel Yaskevich<\/strong> why <strong>SASI<\/strong> team did not optimize further the read path. It turns out that they thought about it but decided intentionally to keep the current design.<\/p>\n<p>Indeed, to improve the read path, we could store <strong>the offset to the row itself<\/strong> instead of the partition. But problem is currently in the <strong>Cassandra<\/strong> SSTable code infrastructure, it is not possible to pass offset to access a row directly. And it would require substantial changes, at least, to introduce row offset.<\/p>\n<p>The second idea is to store clustering columns values in the <strong>OnDiskIndex<\/strong> to help skipping blocks of data. But again it would require storing more extra data in the index file and make the read path more complex.<\/p>\n<p>Anyway the current read path is not very fast for linear scanning over a huge amount of data, thus the JIRA epic <a href=\"https:\/\/issues.apache.org\/jira\/browse\/CASSANDRA-9259\" target=\"_blank\">CASSANDRA-9259<\/a> is opened to improve it and once done, <strong>SASI<\/strong> can naturally benefit from the performance improvement.<\/p>\n<div id=\"sasi_disk_space_usage\" style=\"min-height: 90px\">&nbsp;<\/div>\n<h2>G) Disk Space Usage<\/h2>\n<p>To be able to search with suffix, <strong>SASI<\/strong> has to compute all combinations of suffix from the original term so the longer the term, the more there are suffixes to be stored. The number of suffix is equal to <em><code>term_size - 1<\/code><\/em>.<\/p>\n<p>As a mean of comparison, I have a table <em><code>albums<\/code><\/em> with the following schema:<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_453\">Table Albums Schema<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_453\" class=\"accordion-body collapse in\">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nCREATE TABLE music.albums (\r\n    id uuid PRIMARY KEY,\r\n    artist text,\r\n    country text,\r\n    quality text,\r\n    status text,\r\n    title text,\r\n    year int\r\n)\r\n\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<p>The table contains \u2248 <strong>110 000 albums<\/strong> and the SSTable size on disk is about <strong>6.8Mb<\/strong>. I created some indices on this table. Below is an overview of the disk space usage for each index:<\/p>\n<table>\n<thead>\n<tr>\n<th>Index Name<\/th>\n<th>Index Mode<\/th>\n<th>Analyzer<\/th>\n<th>Index Size<\/th>\n<th>Index Size\/SSTable Size Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>albums_country_idx<\/td>\n<td><strong>PREFIX<\/strong><\/td>\n<td><strong><code>NonTokenizingAnalyzer<\/code><\/strong><\/td>\n<td>2Mb<\/td>\n<td>0.29<\/td>\n<\/tr>\n<tr>\n<td>albums_year_idx<\/td>\n<td><strong>PREFIX<\/strong><\/td>\n<td>N\/A<\/td>\n<td>2.3Mb<\/td>\n<td>0.34<\/td>\n<\/tr>\n<tr>\n<td>albums_artist_idx<\/td>\n<td><strong>CONTAINS<\/strong><\/td>\n<td><strong><code>NonTokenizingAnalyzer<\/code><\/strong><\/td>\n<td><strong>30Mb<\/strong><\/td>\n<td><strong>4.41<\/strong><\/td>\n<\/tr>\n<tr>\n<td>albums_title_idx<\/td>\n<td><strong>CONTAINS<\/strong><\/td>\n<td><strong><code>StandardAnalyzer<\/code><\/strong><\/td>\n<td><strong>41Mb<\/strong><\/td>\n<td><strong>6.03<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As we can see, using <strong>CONTAINS<\/strong> mode can increase the disk usage by <strong>x4 - x6<\/strong>. Since album titles tends to be a long text, the inflation rate is x6. It will be more if we chose the <strong><code>NonTokenizingAnalyzer<\/code><\/strong> because the <strong><code>StandardAnalyzer<\/code><\/strong> splits the text into tokens, remove stop words and perform stemming. All this help reducing the total size of the term. <\/p>\n<p>As a conclusion, use <strong>CONTAINS<\/strong> mode wisely and be ready to pay the price in term of disk space. There is no way to avoid it. Even with efficient search engines like <strong>ElasticSearch<\/strong> or <strong>Solr<\/strong>, it is officially recommended to avoid substring search (<em><code>LIKE %substring%<\/code><\/em>) for the sake of performance.<\/p>\n<div id=\"sasi_perf_benchmarks\" style=\"min-height: 90px\">&nbsp;<\/div>\n<h2>H) Some Performance Benchmarks<\/h2>\n<p>Below are the hardware specs used for the benchmark:<\/p>\n<ul>\n<li>13 bare metal machines<\/li>\n<li>6 CPU (HT) = 12 cores<\/li>\n<li>64Gb RAM<\/li>\n<li>4 SSD RAID 0 for a total of 1.5Tb<\/li>\n<\/ul>\n<p>Cassandra configuration:<\/p>\n<ul>\n<li>num token: 64<\/li>\n<li>concurrent_compactors: 2<\/li>\n<li>compaction_throughput_mb_per_sec: 256<\/li>\n<li>G1 GC with 32Gb heap<\/li>\n<\/ul>\n<p>Schema:<\/p>\n<div class=\"accordion \"  ><br \/>\n  <div class=\"accordion-group\">\r\n        <div class=\"accordion-heading\">\r\n        <a class=\"accordion-toggle\" data-toggle=\"collapse\" \r\n        data-parent=\"#accordion2\" href=\"#accordian_item_352\">Test Schema<\/a>\r\n        <\/div>\r\n         <div id=\"accordian_item_352\" class=\"accordion-body collapse in\">\r\n            <div class=\"accordion-inner\"><\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\n\r\nCREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '2'}  AND durable_writes = true;\r\n\r\ncreate table if not exists test.resource_bench ( \r\n dsr_id uuid,\r\n rel_seq bigint,\r\n seq bigint,\r\n dsp_code varchar,\r\n model_code varchar,\r\n media_code varchar,\r\n transfer_code varchar,\r\n commercial_offer_code varchar,\r\n territory_code varchar,\r\n period_end_month_int int,\r\n authorized_societies_txt text,\r\n rel_type text,\r\n status text,\r\n dsp_release_code text,\r\n title text,\r\n contributors_name list&lt;text&gt;,\r\n unic_work text,\r\n paying_net_qty bigint,\r\nPRIMARY KEY ((dsr_id, rel_seq), seq)\r\n) WITH CLUSTERING ORDER BY (seq ASC)\r\nAND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}; \r\n\r\nCREATE CUSTOM INDEX resource_period_end_month_int_idx ON sharon.resource_bench (period_end_month_int) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 'PREFIX'};\r\n\r\nCREATE CUSTOM INDEX resource_territory_code_idx ON sharon.resource_bench (territory_code) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'false'};\r\n\r\nCREATE CUSTOM INDEX resource_dsp_code_idx ON sharon.resource_bench (dsp_code) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'false'};\r\n\r\n    <\/pre>\n<p>  <\/div>\r\n         <\/div>\r\n        <\/div><br \/>\n<\/div>\n<p>The table has 1 numerical DENSE index (<code>resource_period_end_month_int_idx<\/code>) and 2 text DENSE indices (<code>resource_territory_code_idx<\/code> & <code>resource_dsp_code_idx<\/code>).<\/p>\n<p>The cardinality for each indexed columns are:<\/p>\n<ul>\n<li><code>period_end_month_int<\/code>: <strong>36<\/strong> distinct values<\/li>\n<li><code>territory_code<\/code>: <strong>7<\/strong> distinct values<\/li>\n<li><code>dsp_code<\/code>: <strong>7<\/strong> distinct values<\/li>\n<\/ul>\n<p>Then I deployed a co-located Spark installation on those machines and used a Spark script to inject <strong>1.3 billion rows<\/strong>.<\/p>\n<p>Without <strong>SASI<\/strong> index, the insert took \u2248 4h. With the above 3 indices, it took \u2248 6h. Clearly the index has an impact on the <strong>write<\/strong> and <strong>compaction<\/strong> throughput because of the overhead required to create and flush the index files.<\/p>\n<p>I also benchmarked the time it took to build <strong>SASI<\/strong> index from existing data:<\/p>\n<ul>\n<li><code>period_end_month_int<\/code>: 1h20<\/li>\n<li><code>territory_code<\/code>: 1h<\/li>\n<li><code>model_code<\/code>: (DENSE text index with only 2 distinc values): 1h34<\/li>\n<\/ul>\n<p>Next, I benchmarked the query latency. There are 2 distinct scenarios. First I used server-side paging to fetch all data matching some predicates. The second test adds a <strong>LIMIT<\/strong> clause with different value to see how it can impact response time.<\/p>\n<blockquote><p>Please note that when LIMIT is not set, <em>fetchSize = 10000<\/em> and a sleep time of <strong>20 ms<\/strong> for each page is used to let the cluster breath.<\/p><\/blockquote>\n<table>\n<thead>\n<tr>\n<th>Query<\/th>\n<th>Limit<\/th>\n<th>Fetched Rows<\/th>\n<th>Query Time<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><em><code>WHERE period_end_month_int=201401<\/code><\/em><\/td>\n<td>None<\/td>\n<td>36 109 986<\/td>\n<td>609 secs<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406 AND dsp_code='vevo'<\/code><\/em><\/td>\n<td>None<\/td>\n<td>2 781 492<\/td>\n<td>330 secs<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406 AND dsp_code='vevo' AND territory_code='FR'<\/code><\/em><\/td>\n<td>None<\/td>\n<td>1 044 547<\/td>\n<td>372 secs<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406 AND dsp_code='vevo'<br \/>\nAND territory_code='FR' AND model_code='AdFunded'<\/code><\/em><\/td>\n<td>None<\/td>\n<td>360 334<\/td>\n<td>116 secs<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406<\/code><\/em><\/td>\n<td>100<\/td>\n<td>100<\/td>\n<td>26 ms<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406<\/code><\/em><\/td>\n<td>1000<\/td>\n<td>1000<\/td>\n<td>143 ms<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406<\/code><\/em><\/td>\n<td>10000<\/td>\n<td>10000<\/td>\n<td>693 ms<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406<\/code><\/em><\/td>\n<td>100000<\/td>\n<td>100000<\/td>\n<td>5087 ms<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406 AND dsp_code='vevo'<\/code><\/em><\/td>\n<td>100<\/td>\n<td>100<\/td>\n<td>35 ms<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406 AND dsp_code='vevo'<\/code><\/em><\/td>\n<td>1000<\/td>\n<td>1000<\/td>\n<td>175 ms<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406 AND dsp_code='vevo'<\/code><\/em><\/td>\n<td>10000<\/td>\n<td>10000<\/td>\n<td>1375 ms<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406 AND dsp_code='vevo'<\/code><\/em><\/td>\n<td>100000<\/td>\n<td>100000<\/td>\n<td>16984 ms<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406 AND dsp_code='vevo' AND territory_code='FR'<\/code><\/em><\/td>\n<td>100<\/td>\n<td>100<\/td>\n<td>71 ms<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406 AND dsp_code='vevo' AND territory_code='FR'<\/code><\/em><\/td>\n<td>1000<\/td>\n<td>1000<\/td>\n<td>337 ms<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406 AND dsp_code='vevo' AND territory_code='FR'<\/code><\/em><\/td>\n<td>10000<\/td>\n<td>10000<\/td>\n<td>4548 ms<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406 AND dsp_code='vevo' AND territory_code='FR'<\/code><\/em><\/td>\n<td>100000<\/td>\n<td>100000<\/td>\n<td>8658 ms<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406 AND dsp_code='vevo'<br \/>\nAND territory_code='FR' AND model_code='AdFunded'<\/code><\/em><\/td>\n<td>100<\/td>\n<td>100<\/td>\n<td>378 ms<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406 AND dsp_code='vevo'<br \/>\nAND territory_code='FR' AND model_code='AdFunded'<\/code><\/em><\/td>\n<td>1000<\/td>\n<td>1000<\/td>\n<td>2952 ms<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406 AND dsp_code='vevo'<br \/>\nAND territory_code='FR' AND model_code='AdFunded'<\/code><\/em><\/td>\n<td>10000<\/td>\n<td>10000<\/td>\n<td>5026 ms<\/td>\n<\/tr>\n<tr>\n<td><em><code>WHERE period_end_month_int=201406 AND dsp_code='vevo'<br \/>\nAND territory_code='FR' AND model_code='AdFunded'<\/code><\/em><\/td>\n<td>100000<\/td>\n<td>100000<\/td>\n<td>16319 ms<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The results are quite interesting. When fetching all the data out of <strong>Cassandra<\/strong> using server-side paging, the more predicates we have to narrow down the result set, the faster it is because there are less rows to retrieve, which is quite intuitive.<\/p>\n<p>However, results of queries using <strong>LIMIT<\/strong> is more surprising. For small values of limit, we can see that the more we add predicates and the slower the query is ... until some threshold (around <em>10 000 rows<\/em>) where the latency look more similar to server-side paging queries.<\/p>\n<div id=\"attachment_2262\" style=\"width: 387px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2262\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Benchmark_Limit_100.png\" alt=\"Benchmark Limit 100\" width=\"377\" height=\"226\" class=\"size-full wp-image-2262\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Benchmark_Limit_100.png 753w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Benchmark_Limit_100-300x180.png 300w\" sizes=\"(max-width: 377px) 100vw, 377px\" \/><p id=\"caption-attachment-2262\" class=\"wp-caption-text\">Benchmark Limit 100<\/p><\/div>\n<div id=\"attachment_2263\" style=\"width: 387px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2263\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Benchmark_Limit_1000.png\" alt=\"Benchmark Limit 1000\" width=\"377\" height=\"226\" class=\"size-full wp-image-2263\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Benchmark_Limit_1000.png 753w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Benchmark_Limit_1000-300x180.png 300w\" sizes=\"(max-width: 377px) 100vw, 377px\" \/><p id=\"caption-attachment-2263\" class=\"wp-caption-text\">Benchmark Limit 1000<\/p><\/div>\n<div id=\"attachment_2264\" style=\"width: 387px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2264\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Benchmark_Limit_10000.png\" alt=\"Benchmark Limit 10000\" width=\"377\" height=\"226\" class=\"size-full wp-image-2264\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Benchmark_Limit_10000.png 753w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Benchmark_Limit_10000-300x180.png 300w\" sizes=\"(max-width: 377px) 100vw, 377px\" \/><p id=\"caption-attachment-2264\" class=\"wp-caption-text\">Benchmark Limit 10000<\/p><\/div>\n<div id=\"attachment_2265\" style=\"width: 387px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-2265\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Benchmark_Limit_100000.png\" alt=\"Benchmark Limit 100000\" width=\"377\" height=\"226\" class=\"size-full wp-image-2265\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Benchmark_Limit_100000.png 753w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2016\/04\/Benchmark_Limit_100000-300x180.png 300w\" sizes=\"(max-width: 377px) 100vw, 377px\" \/><p id=\"caption-attachment-2265\" class=\"wp-caption-text\">Benchmark Limit 100000<\/p><\/div>\n<p>One possible explanation is that the more predicates you add and the more index files <strong>SASI<\/strong> has to read for the query so for small <strong>LIMIT<\/strong> values it spends more time on index reading than on fetching raw data from <strong>Cassandra<\/strong>. But above a <strong>LIMIT<\/strong> threshold, adding more predicates is beneficial because you reduce the number of returned rows thus limit <strong>Cassandra<\/strong> sequential scans.<\/p>\n<p>Generally speaking, there is a limit of number of returned rows above which it is slower to query using <strong>SASI<\/strong> or any secondary index compared to a full table scan using <strong>ALLOW FILTERING<\/strong> and paging. Why is that ? Because reading the index files into memory has a cost and this cost only increases when the returned result set grows.<\/p>\n<div id=\"sasi_vs_search\" style=\"min-height: 90px\">&nbsp;<\/div>\n<h2>I) SASI vs Search Engines<\/h2>\n<p>Somehow one wants to compare <strong>SASI<\/strong> with classical search engines like <strong>ElasticSearch<\/strong>, <strong>Solr<\/strong> or <strong>Datastax Enterprise Search<\/strong>. The comparison is quite simple indeed. Despite its convenience and the fact that <strong>SASI<\/strong> is strongly integrated to <strong>Cassandra<\/strong> and CQL, it has a number of drawbacks when compared to real search engines.<\/p>\n<ul>\n<li><strong>SASI<\/strong> requires 2 passes on disk to fetch data: 1 pass to read the index files and 1 pass for the normal <strong>Cassandra<\/strong> read path whereas search engines retrieves the result in a single pass (<strong>DSE Search<\/strong> has a <em>singlePass<\/em> option too). By laws of physics, <strong>SASI<\/strong> will always be slower, even if we improve the sequential read path in <strong>Cassandra<\/strong><\/li>\n<li>Although <strong>SASI<\/strong> allows full text search with tokenization and <strong>CONTAINS<\/strong> mode, there is no scoring applied to matched terms<\/li>\n<li><strong>SASI<\/strong> returns result in token range order, which can be considered as random order from the user point of view. It is not possible to ask for total ordering of the result, even when <strong>LIMIT<\/strong> clause is used. Search engines don't have this limitation<\/li>\n<li>last but not least, it is not possible to perform aggregation (or faceting) with <strong>SASI<\/strong>. The <strong>GROUP BY<\/strong> clause may be introduced into CQL in a near future but it is done on <strong>Cassandra<\/strong> side, there is no pre-aggregation possible on <strong>SASI<\/strong> terms that can help speeding up aggregation queries<\/li>\n<\/ul>\n<p>That being said, if you don't need ordering, grouping or scoring, <strong>SASI<\/strong> is a very nice alternative to pulling a search engine into the game.<\/p>\n<p>I would never have though that I could one day use the <em><code>LIKE '%term%'<\/code><\/em> predicate with <strong>Cassandra<\/strong> so from this point of view it is already a great improvement over the limitations of the past.<\/p>\n<div id=\"sasi_trade_offs\" style=\"min-height: 90px\">&nbsp;<\/div>\n<h2>J) SASI Trade-Offs<\/h2>\n<p>You should use <strong>SASI<\/strong> if:<\/p>\n<ul>\n<li>you need multi criteria search and you don't need ordering\/grouping\/scoring<\/li>\n<li>you mostly need 100 to 1000 of rows for your search queries<\/li>\n<li>you always know the partition keys of the rows to be searched for (this one applies to native secondary index too)<\/li>\n<li>you want to index static columns (<strong>SASI<\/strong> has no penalty since it indexes the whole partition)<\/li>\n<\/ul>\n<p>You should avoid <strong>SASI<\/strong> if:<\/p>\n<ul>\n<li>you have very wide partitions to index, <strong>SASI<\/strong> only give the partition offset. The expensive linear scanning is still performed on <strong>Cassandra<\/strong> side, without the help of clustering column index for skipping blocks<\/li>\n<li>you have strong SLA on search latency, for example sub-second requirement<\/li>\n<li>you need search for analytics scenarios (<strong>SASI<\/strong> is not the right fit to fetch half of your table) unless you use <strong>SASI<\/strong> with co-located <strong>Apache Spark<\/strong> but even in this case, search engines win with 2 orders of magnitude for latency<\/li>\n<li>ordering of the search results is important for you<\/li>\n<\/ul>\n<p>If you decide to try <strong>SASI<\/strong> in production, please keep in mind that <strong>SASI<\/strong> does impact your write\/flush throughput, compaction throughput as well as repair and streaming operations. It is quite expected because <strong>SASI<\/strong> index files follow SSTable life-cycle.<\/p>\n<p>Also beware of the <strong>CONTAINS<\/strong> mode whose cost of disk space can be prohibitive.<\/p>\n<p>Avoid using (<strong><code>!=<\/code><\/strong>) alone because it will end up scanning entire token ranges, which is expensive. Use it in combination with other predicates.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This blog post is a technical deep dive into the new cool SASI index that enables full text search as well as faster multi-criteria search in Cassandra (introduced since Cassandra 3.4 but I recommend Cassandra 3.5 at least because of&#8230;<br \/><a class=\"read-more-button\" href=\"https:\/\/www.doanduyhai.com\/blog\/?p=2058\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[57,10],"tags":[],"_links":{"self":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2058"}],"collection":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2058"}],"version-history":[{"count":216,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2058\/revisions"}],"predecessor-version":[{"id":2318,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2058\/revisions\/2318"}],"wp:attachment":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2058"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2058"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2058"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}