{"id":13285,"date":"2017-08-02T15:28:02","date_gmt":"2017-08-02T15:28:02","guid":{"rendered":"http:\/\/www.doanduyhai.com\/blog\/?p=13285"},"modified":"2017-08-08T21:00:49","modified_gmt":"2017-08-08T21:00:49","slug":"gremlin-recipes-3-recommendation-engine-traversal","status":"publish","type":"post","link":"https:\/\/www.doanduyhai.com\/blog\/?p=13285","title":{"rendered":"Gremlin recipes: 3 &#8211; Recommendation Engine traversal"},"content":{"rendered":"<p>This blog post is the 3<sup>rd<\/sup> in the <strong>Gremlin Recipes<\/strong> series. It is recommended to read the previous blog posts first: <\/p>\n<ol>\n<li><a href=\"https:\/\/www.doanduyhai.com\/blog\/?p=13224\" target=\"_blank\"><strong>Gremlin as a Stream<\/strong><\/a><\/li>\n<li><a href=\"https:\/\/www.doanduyhai.com\/blog\/?p=13260\" target=\"_blank\"><strong>SQL to Gremlin<\/strong><\/a><\/li>\n<\/ol>\n<p><!--more--><\/p>\n<h1>I KillrVideo dataset<\/h1>\n<p>To illustrate this series of recipes, you first need to create the schema for <strong>KillrVideo<\/strong> and import the data. See <a href=\"https:\/\/www.doanduyhai.com\/blog\/?p=13224#killrvideo_dataset\" target=\"_blank\"><strong>here<\/strong><\/a> for more details.<\/p>\n<p>The graph schema of this dataset is:<\/p>\n<p><iframe loading=\"lazy\" src=\"https:\/\/s3.amazonaws.com\/datastax-graph-schema-viewer\/index.html#\/?schema=killr_video_small.json\" height=\"600px\" width=\"100%\"><\/iframe><\/p>\n<h1>II Recommendation engine<\/h1>\n<p>In this post we want to build a simple collaborative filtering engine with a <strong>Gremlin<\/strong> traversal. Let&#8217;s say we want to find all movies in which <strong>Harrison Ford<\/strong> has acted and order them by their average rating. The traversal is:<\/p>\n<pre class=\"brush: java; title: ; wrap-lines: false; notranslate\" title=\"\">\r\ngremlin&gt;g.\r\n  V().\r\n  has(&quot;person&quot;, &quot;name&quot;, &quot;Harrison Ford&quot;).           
\/\/ Fetch Harrison Ford: Iterator&lt;Person&gt;\r\n  in(&quot;actor&quot;).                                      \/\/ acting as actor in movies: Iterator&lt;Movie&gt;\r\n  order().\r\n    by(inE(&quot;rated&quot;).values(&quot;rating&quot;).mean(), decr). \/\/ order movies by average ratings DESC\r\n  project(&quot;movie&quot;,&quot;average_rating&quot;).                \/\/ Display those movies\r\n    by(&quot;title&quot;).                                    \/\/ by title\r\n    by(inE(&quot;rated&quot;).values(&quot;rating&quot;).mean())        \/\/ by average rating: Iterator&lt;Tuple2&lt;String=title,Double=avg rating&gt;&gt;\r\n==&gt;{movie=Blade Runner, average_rating=8.20353982300885}\r\n==&gt;{movie=Star Wars. Episode V: The Empire Strikes Back, average_rating=8.121739130434783}\r\n==&gt;{movie=Star Wars. Episode VI: Return of the Jedi, average_rating=7.900990099009901}\r\n==&gt;{movie=Star Wars IV: A New Hope, average_rating=7.885964912280702}\r\n==&gt;{movie=Indiana Jones: Raiders of the Lost Ark, average_rating=7.872881355932203}\r\n==&gt;{movie=Indiana Jones and the Last Crusade, average_rating=7.809523809523809}\r\n==&gt;{movie=Indiana Jones and the Temple of Doom, average_rating=7.402061855670103}\r\n==&gt;{movie=The Fugitive, average_rating=7.0}\r\n==&gt;{movie=Indiana Jones and the Kingdom of the Crystal Skull (Indiana Jones 4), average_rating=5.5}\r\n==&gt;{movie=Six Days Seven Nights, average_rating=5.132075471698113}\r\n<\/pre>\n<p>I will not explain the above traversal in detail; if you have difficulty understanding it, please read the 2 previous blog posts.<\/p>\n<p>So it seems that the <strong>Harrison Ford<\/strong> movie with the best average rating is <strong>Blade Runner<\/strong> and not one of the Star Wars movies.<\/p>\n<p>Now suppose that you are a user of an online video club (\u00e0 la Netflix) and you have just watched <strong>Blade Runner<\/strong>. 
You really enjoyed it so you rated it 9\/10.<\/p>\n<p>The challenge for the video club is now to find a movie <em>similar<\/em> to <strong>Blade Runner<\/strong> that could potentially suit your taste.<\/p>\n<p>One of the classic recommendation engine techniques, called <strong>collaborative filtering<\/strong>, works as follows:<\/p>\n<blockquote><p>What has user XXX liked? Who else has liked those things? What have they liked that XXX hasn&#8217;t already liked?<\/p><\/blockquote>\n<p>With our graph schema, the verb <em>&#8220;like&#8221;<\/em> is translated into <em>&#8220;has rated with xxx rating&#8221;<\/em>.<\/p>\n<p>So our <strong>Gremlin<\/strong> traversal starts as:<\/p>\n<pre class=\"brush: java; title: ; wrap-lines: false; notranslate\" title=\"\">\r\ngremlin&gt;g.\r\n  V().\r\n  has(&quot;movie&quot;, &quot;title&quot;, &quot;Blade Runner&quot;). \/\/ Fetch Blade Runner: Iterator&lt;Movie&gt;\r\n  as(&quot;blade_runner&quot;)                     \/\/ Save the movie under the label &quot;blade_runner&quot;\r\n  ...\r\n<\/pre>\n<p>The <code><strong>as(\"blade_runner\")<\/strong> <\/code> step saves the movie under the alias <em>&#8220;blade_runner&#8221;<\/em> for later reuse. From this <strong>Blade Runner<\/strong> movie, we want to find <strong>all the users who rated this movie at least as high as its average rating, i.e. at least 8.203<\/strong> (rounded from 8.20353982300885). The corresponding traversal is:<\/p>\n<pre class=\"brush: java; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n  inE(&quot;rated&quot;).                              \/\/ Iterator&lt;Rated_Edge&gt;\r\n    where(values(&quot;rating&quot;).is(gte(8.203))).  
\/\/ iterator.filter(rated -&gt; rated.getRating() &gt;= 8.203)\r\n  outV()                                     \/\/ Iterator&lt;User&gt; \r\n<\/pre>\n<p>From those users (who rated <strong>Blade Runner<\/strong> at least 8.203) we want to find all the movies they rated at least 8.203 which are NOT <strong>Blade Runner<\/strong> itself:<\/p>\n<pre class=\"brush: java; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n  outE(&quot;rated&quot;).                             \/\/ Iterator&lt;Rated_Edge&gt;\r\n    where(values(&quot;rating&quot;).is(gte(8.203))).  \/\/ iterator.filter(rated -&gt; rated.getRating() &gt;= 8.203)\r\n  inV().                                     \/\/ Iterator&lt;Movie&gt; \r\n  where(neq(&quot;blade_runner&quot;)).                \/\/ iterator.filter(movie -&gt; !movie.eq(&quot;blade_runner&quot;)) \r\n  dedup()                                    \/\/ Remove possible duplicated movies\r\n<\/pre>\n<p>If you have trouble following the direction of the traversal (<code><strong>inE()<\/strong><\/code>, <code><strong>outV()<\/strong><\/code>, &#8230;), just take a look at the graph schema and pay attention to the direction of each edge (arrow).<\/p>\n<p>We then need to project the resulting movies by displaying their <em>title<\/em>, <em>average rating<\/em> and <em>genres<\/em>:<\/p>\n<pre class=\"brush: java; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n  project(&quot;title&quot;,&quot;average_rating&quot;,&quot;genres&quot;).\r\n    by(values(&quot;title&quot;)).\r\n    by(inE(&quot;rated&quot;).values(&quot;rating&quot;).mean()).\r\n    by(out(&quot;belongsTo&quot;).values(&quot;name&quot;).fold())\r\n<\/pre>\n<p>The complete traversal is then:<\/p>\n<pre class=\"brush: java; title: ; wrap-lines: false; notranslate\" title=\"\">\r\ngremlin&gt;g.\r\n  V().\r\n  has(&quot;movie&quot;, &quot;title&quot;, &quot;Blade Runner&quot;).        \/\/ Fetch Blade Runner: Iterator&lt;Movie&gt;\r\n  as(&quot;blade_runner&quot;).        
                   \/\/ Save Blade Runner under the label &quot;blade_runner&quot;\r\n  inE(&quot;rated&quot;).                                 \/\/ Iterator&lt;Rated_Edge&gt;\r\n    where(values(&quot;rating&quot;).is(gte(8.203))).     \/\/ iterator.filter(rated -&gt; rated.getRating() &gt;= 8.203)\r\n  outV().                                       \/\/ Iterator&lt;User&gt;\r\n  outE(&quot;rated&quot;).                                \/\/ Iterator&lt;Rated_Edge&gt;\r\n    where(values(&quot;rating&quot;).is(gte(8.203))).     \/\/ iterator.filter(rated -&gt; rated.getRating() &gt;= 8.203)\r\n  inV().                                        \/\/ Iterator&lt;Movie&gt; \r\n  where(neq(&quot;blade_runner&quot;)).                   \/\/ iterator.filter(movie -&gt; !movie.eq(&quot;blade_runner&quot;)) \r\n  dedup().                                      \/\/ Remove possible duplicated movies\r\n  project(&quot;title&quot;,&quot;average_rating&quot;,&quot;genres&quot;).   \/\/ Project on\r\n    by(values(&quot;title&quot;)).                        \/\/ movie's title\r\n    by(inE(&quot;rated&quot;).values(&quot;rating&quot;).mean()).   \/\/ movie average rating \r\n    by(out(&quot;belongsTo&quot;).values(&quot;name&quot;).fold()). 
\/\/ movie's genres\r\n  limit(10)\r\n==&gt;{title=Inglourious Basterds, average_rating=7.801526717557252, genres=[War, Action, Comedy]}\r\n==&gt;{title=Untouchable, average_rating=8.141176470588235, genres=[Comedy, Drama]}\r\n==&gt;{title=Batman, average_rating=6.801980198019802, genres=[Thriller, Action, Fantasy]}\r\n==&gt;{title=The Band Wagon, average_rating=7.8, genres=[Musical]}\r\n==&gt;{title=Silverado, average_rating=6.697916666666667, genres=[Western]}\r\n==&gt;{title=The Last Samurai, average_rating=6.767123287671233, genres=[Action, Adventure]}\r\n==&gt;{title=Life Is Beautiful, average_rating=8.502923976608187, genres=[Comedy, Drama]}\r\n==&gt;{title=Love's a Bitch, average_rating=7.656716417910448, genres=[Drama]}\r\n==&gt;{title=Requiem for a Dream, average_rating=7.79047619047619, genres=[Drama]}\r\n==&gt;{title=The Matrix, average_rating=7.895348837209302, genres=[Sci-Fi, Thriller, Action, Fantasy]}\r\n<\/pre>\n<p>If we look at the first 10 movies, we can see that they are not at all a good match for our <strong>Blade Runner<\/strong> fan.<\/p>\n<ol>\n<li>some of the movies have a poor average rating: 6.8 for <strong>Batman<\/strong><\/li>\n<li>some of the movies belong to neither the <strong>Sci-Fi<\/strong> nor the <strong>Action<\/strong> genre: Life Is Beautiful<\/li>\n<\/ol>\n<p>Indeed, our previous traversal has several shortcomings. First, the query <strong><code>outE(\"rated\").where(values(\"rating\").is(gte(8.203))).inV()<\/code><\/strong> means <em>&#8220;give me all the movies rated at least 8.203 by those users&#8221;<\/em> (those users == those who rated Blade Runner at least 8.203). The problem is that even if those users have rated those movies at least 8.203, it does not necessarily mean that those movies have received a good rating from all other users. We want to retain only the <strong>subset<\/strong> of those movies with an <strong>average rating >= 8.203<\/strong>. 
For this we need to filter the result set further with:        <\/p>\n<pre class=\"brush: java; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n  where(neq(&quot;blade_runner&quot;)).                                 \/\/ iterator.filter(movie -&gt; !movie.eq(&quot;blade_runner&quot;))  \r\n  where(inE(&quot;rated&quot;).values(&quot;rating&quot;).mean().is(gte(8.203))). \/\/ iterator.filter(movie -&gt; getAvgRating(movie) &gt;= 8.203)\r\n<\/pre>\n<p>But this is not sufficient. We can have movies with a good average rating (>= 8.203) whose genres do not match those of <strong>Blade Runner<\/strong> at all. We also want to enforce that the matching movies belong to either the <strong>Action<\/strong> or the <strong>Sci-Fi<\/strong> genre. Again, an extra filtering step is necessary:<\/p>\n<pre class=\"brush: java; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n  where(neq(&quot;blade_runner&quot;)).                                     \/\/ iterator.filter(movie -&gt; !movie.eq(&quot;blade_runner&quot;))  \r\n  where(inE(&quot;rated&quot;).values(&quot;rating&quot;).mean().is(gte(8.203))).     \/\/ iterator.filter(movie -&gt; getAvgRating(movie) &gt;= 8.203)\r\n  where(out(&quot;belongsTo&quot;).has(&quot;name&quot;, within(&quot;Sci-Fi&quot;, &quot;Action&quot;))). \/\/ iterator.filter(movie -&gt; getGenres(movie).contains(&quot;Sci-Fi&quot;) || getGenres(movie).contains(&quot;Action&quot;))\r\n<\/pre>\n<p>The correct traversal for this collaborative filtering is then:<\/p>\n<pre class=\"brush: java; title: ; wrap-lines: false; notranslate\" title=\"\">\r\ngremlin&gt;g.\r\n  V().\r\n  has(&quot;movie&quot;, &quot;title&quot;, &quot;Blade Runner&quot;).        \/\/ Fetch Blade Runner: Iterator&lt;Movie&gt;\r\n  as(&quot;blade_runner&quot;).                           \/\/ Save Blade Runner under the label &quot;blade_runner&quot;\r\n  inE(&quot;rated&quot;).                                 
\/\/ Iterator&lt;Rated_Edge&gt;\r\n    where(values(&quot;rating&quot;).is(gte(8.203))).     \/\/ iterator.filter(rated -&gt; rated.getRating() &gt;= 8.203)\r\n  outV().                                       \/\/ Iterator&lt;User&gt;\r\n  outE(&quot;rated&quot;).                                \/\/ Iterator&lt;Rated_Edge&gt;\r\n    where(values(&quot;rating&quot;).is(gte(8.203))).     \/\/ iterator.filter(rated -&gt; rated.getRating() &gt;= 8.203)\r\n  inV().                                        \/\/ Iterator&lt;Movie&gt; \r\n  where(neq(&quot;blade_runner&quot;)).                   \/\/ iterator.filter(movie -&gt; !movie.eq(&quot;blade_runner&quot;)) \r\n  where(inE(&quot;rated&quot;).values(&quot;rating&quot;).mean().\r\n        is(gte(8.203))).                        \/\/ iterator.filter(movie -&gt; getAvgRating(movie) &gt;= 8.203)\r\n  where(out(&quot;belongsTo&quot;).has(&quot;name&quot;, \r\n        within(&quot;Sci-Fi&quot;, &quot;Action&quot;))).           \/\/ iterator.filter(movie -&gt; getGenres(movie).contains(&quot;Sci-Fi&quot;) || getGenres(movie).contains(&quot;Action&quot;))  \r\n  dedup().                                      \/\/ Remove possible duplicated movies\r\n  project(&quot;title&quot;,&quot;average_rating&quot;,&quot;genres&quot;).   \/\/ Project on\r\n    by(values(&quot;title&quot;)).                        \/\/ movie's title\r\n    by(inE(&quot;rated&quot;).values(&quot;rating&quot;).mean()).   \/\/ movie average rating \r\n    by(out(&quot;belongsTo&quot;).values(&quot;name&quot;).fold()). \/\/ movie's genres\r\n  limit(10)\r\n==&gt;{title=Pulp Fiction, average_rating=8.581005586592179, genres=[Thriller, Action]}\r\n==&gt;{title=Seven Samurai, average_rating=8.470588235294118, genres=[Action, Adventure, Drama]}\r\n==&gt;{title=A Clockwork Orange, average_rating=8.215686274509803, genres=[Sci-Fi, Drama]}\r\n<\/pre>\n<p>The results are now more satisfactory. 
All 3 movies have an average rating >= 8.203 and each of them belongs to either the <strong>Action<\/strong> or the <strong>Sci-Fi<\/strong> genre.<\/p>\n<p>And that&#8217;s all folks! <strong>Do not miss the other Gremlin recipes in this series<\/strong>.<\/p>\n<p>If you have any questions about <strong>Gremlin<\/strong>, find me on the <strong><a href=\"http:\/\/datastaxacademy.slack.com\" target=\"_blank\">datastaxacademy.slack.com<\/a><\/strong>, channel <strong>dse-graph<\/strong>. My id is <em>@doanduyhai<\/em>   <\/p>\n","protected":false},"excerpt":{"rendered":"<p>This blog post is the 3rd from the series Gremlin Recipes. It is recommended to read the previous blog posts first: Gremlin as a Stream SQL to Gremlin<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[58,10],"tags":[],"_links":{"self":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13285"}],"collection":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13285"}],"version-history":[{"count":14,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13285\/revisions"}],"predecessor-version":[{"id":13349,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13285\/revisions\/13349"}],"wp:attachment":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13285"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories
&post=13285"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13285"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}