{"id":13439,"date":"2017-08-31T16:42:01","date_gmt":"2017-08-31T16:42:01","guid":{"rendered":"http:\/\/www.doanduyhai.com\/blog\/?p=13439"},"modified":"2017-09-01T13:33:43","modified_gmt":"2017-09-01T13:33:43","slug":"gremlin-recipes-10-depth-first-vs-breadth-first","status":"publish","type":"post","link":"https:\/\/www.doanduyhai.com\/blog\/?p=13439","title":{"rendered":"Gremlin recipes: 10 Depth-first vs Breadth-first"},"content":{"rendered":"<p>This blog post is the 9<sup>th<\/sup> from the series <strong>Gremlin Recipes<\/strong>. It is recommended to read the previous blog posts first: <\/p>\n<ol>\n<li><a href=\"https:\/\/www.doanduyhai.com\/blog\/?p=13224\" target=\"_blank\"><strong>Gremlin as a Stream<\/strong><\/a><\/li>\n<li><a href=\"https:\/\/www.doanduyhai.com\/blog\/?p=13260\" target=\"_blank\"><strong>SQL to Gremlin<\/strong><\/a><\/li>\n<li><a href=\"https:\/\/www.doanduyhai.com\/blog\/?p=13285\" target=\"_blank\"><strong>Recommendation Engine traversal<\/strong><\/a><\/li>\n<li><a href=\"https:\/\/www.doanduyhai.com\/blog\/?p=13301\" target=\"_blank\"><strong>Recursive traversals<\/strong><\/a><\/li>\n<li><a href=\"https:\/\/www.doanduyhai.com\/blog\/?p=13320\" target=\"_blank\"><strong>Path object<\/strong><\/a><\/li>\n<li><a href=\"https:\/\/www.doanduyhai.com\/blog\/?p=13352\" target=\"_blank\"><strong>Projection and selection<\/strong><\/a><\/li>\n<li><a href=\"https:\/\/www.doanduyhai.com\/blog\/?p=13374\" target=\"_blank\"><strong>Variable handling<\/strong><\/a><\/li>\n<li><a href=\"https:\/\/www.doanduyhai.com\/blog\/?p=13404\" target=\"_blank\"><strong>sack() operator<\/strong><\/a><\/li>\n<li><a href=\"https:\/\/www.doanduyhai.com\/blog\/?p=13424\" target=\"_blank\"><strong>Pattern Matching<\/strong><\/a><\/li>\n<\/ol>\n<p><!--more--><\/p>\n<h1>I KillrVideo dataset<\/h1>\n<p>To illustrate this series of recipes, you need first to create the schema for <strong>KillrVideo<\/strong> and import the data. See <a href=\"https:\/\/www.doanduyhai.com\/blog\/?p=13224#killrvideo_dataset\" target=\"_blank\"><strong>here<\/strong><\/a> for more details.<\/p>\n<p>The graph schema of this dataset is :<\/p>\n<p><iframe loading=\"lazy\" src=\"https:\/\/s3.amazonaws.com\/datastax-graph-schema-viewer\/index.html#\/?schema=killr_video_small.json\" height=\"600px\" width=\"100%\"><\/iframe><\/p>\n<h1>II Evaluation strategy in <strong>Gremlin<\/strong><\/h1>\n<p>Familiar users of <strong>Gremlin<\/strong> know that there exist 2 distinct evaluation strategies for traversals:<\/p>\n<ol>\n<li><strong>depth-first<\/strong>: this strategy traverses down the entire path as specified by a the steps before turning to the next legal path. In theory there is a single traverser that explores each path but in practice for optimisation purposes, the vendors can implement concurrent traversers for distinct paths. The <strong>depth-first strategy<\/strong> is the default strategy in <strong>Gremlin<\/strong> unless specified otherwise.<\/li>\n<li><strong>breadth-first<\/strong>: this strategy tries to parallelize the traversal of each step as much as possible. On each step, <strong>Gremlin<\/strong> will explore all possible paths before moving to the next step in the traversal<\/li>\n<\/ol>\n<p>Let&#8217;s take a the recommendation engine traversal we have seen in previous posts:<\/p>\n<pre class=\"brush: java; title: ; wrap-lines: false; notranslate\" title=\"\">\r\ngremlin&gt;g.V().\r\n  has(&quot;movie&quot;, &quot;title&quot;, &quot;Blade Runner&quot;).                       \/\/ stage 1\r\n  inE(&quot;rated&quot;).filter(values(&quot;rating&quot;).is(gte(8.203))).outV(). \/\/ stage 2\r\n  outE(&quot;rated&quot;).filter(values(&quot;rating&quot;).is(gte(8.203))).inV()  \/\/ stage 3\r\n<\/pre>\n<p>The above traversal can be decomposed into 3 stages<\/p>\n<ol>\n<li><code><strong>has(\"movie\", \"title\", \"Blade Runner\")<\/strong><\/code><\/li>\n<li><code><strong>inE(\"rated\").filter(values(\"rating\").is(gte(8.203))).outV()<\/strong><\/code><\/li>\n<li><code><strong>outE(\"rated\").filter(values(\"rating\").is(gte(8.203))).inV()<\/strong><\/code><\/li>\n<\/ol>\n<p>Below is how a concurrent <strong>depth-first<\/strong> strategy would work<\/p>\n<div id=\"attachment_13441\" style=\"width: 1510px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-13441\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2017\/08\/depth-first.gif\" alt=\"Depth First strategy\" width=\"1500\" height=\"731\" class=\"size-full wp-image-13441\" \/><p id=\"caption-attachment-13441\" class=\"wp-caption-text\">Depth First strategy<\/p><\/div>\n<p>From the <em>Blade Runner<\/em> movie vertex, some traversers start exploring down the traversal by navigating to user vertices and then to other movie vertices. Note each possible path can be explored in parallel by a traverser and at any time, each traverser can be at different stage with regard to the global traversal.<\/p>\n<p>The same traversal using <strong>breadth-first<\/strong> strategy:<\/p>\n<div id=\"attachment_13444\" style=\"width: 1510px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-13444\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2017\/08\/breadth-first.gif\" alt=\"Breadth first strategy\" width=\"1500\" height=\"732\" class=\"size-full wp-image-13444\" \/><p id=\"caption-attachment-13444\" class=\"wp-caption-text\">Breadth first strategy<\/p><\/div>\n<p>With this strategy there are as many traversers as there are legal paths in each stage. And the traversers move to the next stage only when <strong>all<\/strong> paths of the current stage have been explored.<\/p>\n<h1>III OLTP vs OLAP mode<\/h1>\n<p>Very closely related to the evaluation strategies is the Gremlin execution mode: <strong>OLTP<\/strong> or <strong>OLAP<\/strong>. The below table gives a summary of the characteristics of each execution mode:<\/p>\n<table border=\"1\">\n<thead>\n<tr>\n<th>Features<\/th>\n<th>OLTP<\/th>\n<th>OLAP<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Workload type<\/td>\n<td>Real-time<\/td>\n<td>Batch<\/td>\n<\/tr>\n<tr>\n<td>Parallelism<\/td>\n<td>Serial or concurrent<\/td>\n<td>Parallel<\/td>\n<\/tr>\n<tr>\n<td>Memory requirement<\/td>\n<td>Low<\/td>\n<td>High<\/td>\n<\/tr>\n<tr>\n<td>Evaluation<\/td>\n<td>Lazy<\/td>\n<td>Eager<\/td>\n<\/tr>\n<tr>\n<td>Resources usage<\/td>\n<td>Low<\/td>\n<td>High<\/td>\n<\/tr>\n<tr>\n<td>Data access<\/td>\n<td>By primary or secondary indices<\/td>\n<td>Almost full dataset scan<\/td>\n<\/tr>\n<tr>\n<td>Concurrent traversals<\/td>\n<td>High because each traversal requires fewer resources<\/td>\n<td>Low because each traversal monopolizes lots of resrouces<\/td>\n<\/tr>\n<tr>\n<td>Response time<\/td>\n<td>Millisecs to secs<\/td>\n<td>Minutes to hours<\/td>\n<\/tr>\n<tr>\n<td>Traveral pattern<\/td>\n<td><em>Start from single\/few vertices and expand with filtering<\/em><\/td>\n<td><em>Process all vertices of some label<\/em><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>By default, <strong>Gremlin<\/strong> uses the <strong>OLTP<\/strong> execution mode which interacts with a limited set of data and respond on the order of milliseconds or seconds. If you need a batch processing, <strong>OLAP<\/strong> mode is more suitable. <\/p>\n<p>With <strong>Gremlin OLTP<\/strong>, the graph is walked by moving from vertex-to-vertex via incident edges. With <strong>Gremlin OLAP<\/strong>, all vertices are provided a <code>VertexProgram<\/code>. The programs send messages to one another with the topological structure of the graph acting as the communication network (though random message passing possible). In many respects, the messages passed are like the <strong>OLTP<\/strong> traversers moving from vertex-to-vertex. However, all messages are moving independent of one another, in parallel. Once a vertex program is finished computing, TinkerPop3\u2019s <strong>OLAP<\/strong> engine supports any number MapReduce jobs over the resultant graph.<\/p>\n<div id=\"attachment_13447\" style=\"width: 760px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-13447\" loading=\"lazy\" src=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2017\/08\/OLAP-Gremlin.png\" alt=\"OLAP Gremlin\" width=\"750\" height=\"586\" class=\"size-full wp-image-13447\" srcset=\"https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2017\/08\/OLAP-Gremlin.png 750w, https:\/\/www.doanduyhai.com\/blog\/wp-content\/uploads\/2017\/08\/OLAP-Gremlin-300x234.png 300w\" sizes=\"(max-width: 750px) 100vw, 750px\" \/><p id=\"caption-attachment-13447\" class=\"wp-caption-text\">OLAP Gremlin<\/p><\/div>\n<p>Indeed the <strong>OLAP<\/strong> execution mode is very similar in its concept to a <strong>Hadoop<\/strong> job (if you are familiar with <strong>Hadoop<\/strong> eco-system). In the above picture, we can see that there is a need to synchronize in bulk the different jobs executed on different machines to move from one stage of your traversal to another. This <em>bulk-synchronization<\/em> is similar to a reduce-step in <strong>Hadoop<\/strong> where all the data from different workers are merged together .<\/p>\n<p>Because of the differences in execution model, users of <strong>Gremlin OLAP<\/strong> should pay attention to the following points:<\/p>\n<h4>A. sideEffect scope<\/h4>\n<p>Since in OLAP mode the computation is distributed among different machines, you may not be able to have a global view of the side effect until a synchronization barrier is reached.<\/p>\n<p>Let&#8217;s consider the following traversal:<\/p>\n<pre class=\"brush: java; title: ; wrap-lines: false; notranslate\" title=\"\">\r\ngremlin&gt;g.V().\r\n  has(&quot;user&quot;, &quot;userId&quot;, &quot;u861&quot;).as(&quot;user&quot;)                 \/\/ Iterator&lt;User&gt;\r\n  outE(&quot;rated&quot;).filter(values(&quot;rating&quot;).is(gte(9))).inV(). \/\/ Iterator&lt;Movie&gt;\r\n  out(&quot;belongsTo&quot;).values(&quot;name&quot;).as(&quot;genres&quot;).            \/\/ Save all the genres of well rated movies by this user\r\n  ...\r\n<\/pre>\n<p>Because the traversal steps are executed locally on different machine, each machine may label different genres and the genres found on one machine may not be comprehensive. Consequently if you want to re-used the <em>genres<\/em> label and be sure it contains all the gathered genres, you have to force a synchronization barrier by calling <code><strong>barrier()<\/strong><\/code> before accessing the <em>genres<\/em> label.<\/p>\n<h4>B. Path information<\/h4>\n<p>With OLAP mode, since all the paths are explored at each stage, we face quickly the combinatorial explosion and thus keeping and processing path information is very expensive and is not practical since it requires sending path information data between machines involved in the computation of the traversal <\/p>\n<h4>C. Global ordering<\/h4>\n<p>Because <strong>Gremlin OLAP<\/strong> is using a Map\/Reduce approach, mid-traversal global ordering is useless and is a waste of resource. For example:<\/p>\n<pre class=\"brush: java; title: ; wrap-lines: false; notranslate\" title=\"\">\r\ngremlin&gt;g.V().\r\n  hasLabel(&quot;user&quot;).             \/\/ Step 1\r\n  order().by(&quot;age&quot;, decr).      \/\/ Step 2\r\n  out(&quot;rated&quot;).values(&quot;title&quot;)  \/\/ Step 3\r\n<\/pre>\n<p>At step 2, we perform a global ordering of all users by their age. It requires a bulk synchronization to gather all user data in a single machine to perform the global ordering, fair enough. But at step 3 (<code><strong>out(\"rated\")<\/strong><\/code>), since we continue the traversal, all the ordered user data are again re-broadcasted to multiple machines on the cluster for computation so we will loose the global ordering we just computed.<\/p>\n<p>The only meaningful global ordering in <strong>OLAP<\/strong> mode is when it is the <strong>last step in the traversal<\/strong>. <\/p>\n<h1>IV Note on barriers<\/h1>\n<p>All <strong>Gremlin<\/strong> steps can be sorted in one of the following categories:<\/p>\n<ul>\n<li><strong>map<\/strong>: <code><strong>map()<\/strong><\/code>, <code><strong>flatMap()<\/strong><\/code>&#8230;: transform a stream of data type into another data type<\/li>\n<li><strong>filter<\/strong>: <code><strong>filter()<\/strong><\/code>, <code><strong>where()<\/strong><\/code>: prune non matching traversers<\/li>\n<li><strong>sideEffect<\/strong>: <code><strong>store()<\/strong><\/code>, <code><strong>aggregate<\/strong><\/code> &#8230;: perform side effects on the current step and pass the traverser to the next step unchanged<\/li>\n<li><strong>terminal<\/strong>: <code><strong>next()<\/strong><\/code>, <code><strong>toList()<\/strong><\/code>&#8230;: materialize the lazy evaluation of the traversal into concrete results<\/li>\n<li><strong>branching<\/strong>: <code><strong>repeat()<\/strong><\/code>, <code><strong>choose()<\/strong><\/code>&#8230;: split the traverser to multiple paths\n<\/li>\n<li><strong>collecting barrier<\/strong>: <code><strong>order()<\/strong><\/code>, <code><strong>barrier()<\/strong><\/code>&#8230;: all of the traversers prior to the barrier are collected and then processed before passing to the next step <\/li>\n<li><strong>reducing barrier<\/strong>: <code><strong>count()<\/strong><\/code>, <code><strong>sum()<\/strong><\/code>..: all the traversers prior to the barrier are processed by a reduce function and a single <strong>reduced value<\/strong> traverser is passed to the next step <\/li>\n<\/ul>\n<p>That being said, when using <strong>Gremlin OLTP<\/strong> you should understand that reaching a barrier step will disable the lazy evaluation nature of your traversal. The traversers are blocked at this barrier until all of them are completely processed before moving to the next step.<\/p>\n<h1>V Practical considerations<\/h1>\n<p>On the practical side, when you are using <strong>Gremlin OLTP<\/strong> you should be aware of some classical caveats:<\/p>\n<h4>A. Traversal starting points<\/h4>\n<p>Usually for real-time graph exploration people start with an <strong>identified vertex<\/strong> (e.g. access vertex by <strong>id<\/strong> or with some <strong>indices<\/strong>) or a very <em>limited set of vertices<\/em> and then expand from there. Examples<\/p>\n<pre class=\"brush: java; title: OLTP Pattern from single vertex; wrap-lines: false; notranslate\" title=\"OLTP Pattern from single vertex\">\r\ngremlin&gt;g.V().\r\n  has(&quot;user&quot;, &quot;userId&quot;, &quot;u861&quot;). \/\/ Single user\r\n  out(&quot;rated&quot;).                  \/\/ All movies he rated\r\n  ...\r\n<\/pre>\n<pre class=\"brush: java; title: OLTP Pattern from limited set of vertices; wrap-lines: false; notranslate\" title=\"OLTP Pattern from limited set of vertices\">\r\ngremlin&gt;g.V().\r\n  hasLabel(&quot;user&quot;). \r\n  has(&quot;age&quot;,30).\r\n  has(&quot;gender&quot;,&quot;M&quot;).\r\n  has(&quot;city&quot;,&quot;Austin&quot;).\r\n  out(&quot;rated&quot;).                  \r\n  ...\r\n<\/pre>\n<p>By <em>limited set of vertices<\/em>, we mean a handful of vertices, not more than 100 by rule of thumb otherwise you&#8217;ll face combinatorial explosion very quickly!<\/p>\n<p>Now an example of <strong>OLAP<\/strong> traversal<\/p>\n<pre class=\"brush: java; title: OLAP Pattern; wrap-lines: false; notranslate\" title=\"OLAP Pattern\">\r\ngremlin&gt;g.V().\r\n    hasLabel(&quot;user&quot;).                       \/\/ Iterator&lt;User&gt;\r\n    order().by(outE(&quot;knows&quot;).count(),decr). \/\/ Global ordering on ALL users\r\n    limit(1)                                \/\/ Take the 1st in the list\r\n  ...\r\n<\/pre>\n<h4>B. Combinatorial explosion<\/h4>\n<p>Even if you manage to reduce the number of vertices as your traversal starting point, things can go quickly out of control pretty fast. Indeed, every time you navigate through an edge, it involves an <strong>1-to-N relationship<\/strong>, with N being arbitrary large. In worst case, where your vertex is a super node, the number of adjacent edges can be millions!<\/p>\n<p>So let&#8217;s take the previous OLTP example:<\/p>\n<pre class=\"brush: java; title: OLTP Pattern from single vertex; wrap-lines: false; notranslate\" title=\"OLTP Pattern from single vertex\">\r\ngremlin&gt;g.V().\r\n  has(&quot;user&quot;, &quot;userId&quot;, &quot;u861&quot;). \/\/ Single user\r\n  out(&quot;knows&quot;).                  \/\/ All users he knows\r\n  ...\r\n<\/pre>\n<p>How can you be sure that user <em>u861<\/em> has a limited\/reasonable number of outgoing <em>knows<\/em> edges ??? It can range from 0\/1  to 10 000. If you don&#8217;t know your dataset prior to the traversal, there can be some safeguards:<\/p>\n<ul>\n<li>count the adjacent edges cardinality for a given vertex<\/li>\n<li>limit the expansion with sampling<\/li>\n<li>limit the expansion with selective filtering<\/li>\n<\/ul>\n<pre class=\"brush: java; title: Counting edge cardinality; wrap-lines: false; notranslate\" title=\"Counting edge cardinality\">\r\ngremlin&gt;g.V().\r\n  has(&quot;user&quot;, &quot;userId&quot;, &quot;u861&quot;). \/\/ Single user\r\n  outE(&quot;knows&quot;).count()          \/\/ \r\n  ...\r\n<\/pre>\n<pre class=\"brush: java; title: Limiting expansion with sampling; wrap-lines: false; notranslate\" title=\"Limiting expansion with sampling\">\r\ngremlin&gt;g.V().\r\n  has(&quot;user&quot;, &quot;userId&quot;, &quot;u861&quot;). \/\/ Single user\r\n  out(&quot;knows&quot;).sample(100)       \/\/ Take maximum 100 friends\r\n  ...\r\n<\/pre>\n<p><code><strong>sample()<\/strong><\/code> is a convenient step to limit graph rapid explosion but it has a serious drawback that with sampling you loose any exhaustivity in your traversal. That&#8217;s the price to pay to limit resources usage.<\/p>\n<p>Another strategy is to use <strong>very restrictive filtering<\/strong> to avoid explosion:<\/p>\n<pre class=\"brush: java; title: Limiting expansion with sampling; wrap-lines: false; notranslate\" title=\"Limiting expansion with sampling\">\r\ngremlin&gt;g.V().\r\n  has(&quot;user&quot;, &quot;userId&quot;, &quot;u861&quot;). \/\/ Single user\r\n  out(&quot;knows&quot;).\r\n  has(&quot;age&quot;,30).\r\n  has(&quot;gender&quot;,&quot;M&quot;)\r\n  ...\r\n<\/pre>\n<p>Though, filtering is not a silver bullet since it require creating relevant <strong>indices<\/strong> before hand and even with filtering, if you don&#8217;t know the <strong>distribution<\/strong> of your dataset, you don&#8217;t know indeed how restrictive your filters are so with the above traversal with restriction on age and gender, we may end with millions of friends !!!<\/p>\n<p>As we can see, without extreme care, any <strong>Gremlin OLTP<\/strong> traversal can turn quickly into a full scan\/<strong>OLAP<\/strong> workload. This is the major trap for beginners.<\/p>\n<hr>\n<p>And that&#8217;s all folks! <strong>This is the last Gremlin recipe. I hope you had fun following this series.<\/strong>.<\/p>\n<p>If you have any question about <strong>Gremlin<\/strong>, find me on the <strong><a href=\"http:\/\/datastaxacademy.slack.com\" target=\"_blank\">datastaxacademy.slack.com<\/a><\/strong>, channel <strong>dse-graph<\/strong>. My id is <em>@doanduyhai<\/em>   <\/p>\n","protected":false},"excerpt":{"rendered":"<p>This blog post is the 9th from the series Gremlin Recipes. It is recommended to read the previous blog posts first: Gremlin as a Stream SQL to Gremlin Recommendation Engine traversal Recursive traversals Path object Projection and selection Variable handling&#8230;<br \/><a class=\"read-more-button\" href=\"https:\/\/www.doanduyhai.com\/blog\/?p=13439\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[58,10],"tags":[],"_links":{"self":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13439"}],"collection":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13439"}],"version-history":[{"count":18,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13439\/revisions"}],"predecessor-version":[{"id":13461,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13439\/revisions\/13461"}],"wp:attachment":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13439"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13439"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13439"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}