This blog post is the 3^rd from the series Gremlin Recipes. It is recommended to read the previous blog posts first:

I KillrVideo dataset

To illustrate this series of recipes, you need first to create the schema for KillrVideo and import the data. See here for more details.

The graph schema of this dataset is :

II Recommendation engine

In this post we want to build a simple collaborative filtering engine with Gremlin traversal. Let’s say we want to find all movies in which Harrison Ford has played as an actor and order them by their average rating. The traversal is:

gremlin>g.
  V().
  has("person", "name", "Harrison Ford").           // Fetch Harrison Ford: Iterator<Person>
  in("actor").                                      // acting as actor in movies: Iterator<Movie>
  order().
    by(inE("rated").values("rating").mean(), decr). // order movies by average ratings DESC
  project("movie","average_rating").                // Display those movies
    by("title").                                    // by title
    by(inE("rated").values("rating").mean())        // by average rating: Iterator<Tuple2<String=title,Double=avg rating>>
==>{movie=Blade Runner, average_rating=8.20353982300885}
==>{movie=Star Wars. Episode V: The Empire Strikes Back, average_rating=8.121739130434783}
==>{movie=Star Wars. Episode VI: Return of the Jedi, average_rating=7.900990099009901}
==>{movie=Star Wars IV: A New Hope, average_rating=7.885964912280702}
==>{movie=Indiana Jones: Raiders of the Lost Ark, average_rating=7.872881355932203}
==>{movie=Indiana Jones and the Last Crusade, average_rating=7.809523809523809}
==>{movie=Indiana Jones and the Temple of Doom, average_rating=7.402061855670103}
==>{movie=The Fugitive, average_rating=7.0}
==>{movie=Indiana Jones and the Kingdom of the Crystal Skull (Indiana Jones 4), average_rating=5.5}
==>{movie=Six Days Seven Nights, average_rating=5.132075471698113}

I will not explain into the detail the above traversal, if you have some difficulty to understand it, please read the 2 previous blog posts.

So it seems that the Harrison Ford movie having the best average rating is Blade Runner and not one of the Star Wars series.

Now suppose that you are an user of an online video club (à-la Netflix) and you have just watched Blade Runner. You really enjoyed it so you rated it 9/10.

The challenge for the video club is now to find a movie similar to Blade Runner that could potentially suit your taste.

One of the classical recommendation engine technique, also called collaborative filtering, works as follows:

What has user XXX liked? Who else has liked those things? What have they liked that XXX hasn’t already liked?

With our graph schema, the verb “like” is translated into “has rated with xxx rating”.

So we will have our Gremlin traversal starting as:

gremlin>g.
  V().
  has("movie", "title", "Blade Runner"). // Fetch Blade Runner: Iterator<Movie>
  as("blade_runner")                     // Save the movie under the label "blade_runner"
  ...

The as("blade_runner") step will save the movie using the alias “blade_runner” for later re-use. From this Blade Runner movie, we want to find all the users who rated this movie more than its average rating e.g. more than 8.203 (rounding from 8.20353982300885). The corresponding traversal is:

  inE("rated").                              // Iterator<Rated_Edge>
    where(values("rating").is(gte(8.203))).  // iterator.filter(rated -> rated.getRating() >= 8.2)
  outV()                                     // Iterator<User>

From those users (who rated Blade Runner more than 8.203) we want to find all the movies they rated more than 8.203 which are NOT Blade Runner itself :

  outE("rated").                             // Iterator<Rated_Edge>
    where(values("rating").is(gte(8.203))).  // iterator.filter(rated -> rated.getRating() >= 8.2)
  inV().                                     // Iterator<Movie> 
  where(neq("blade_runner")).                // iterator.filter(movie -> !movie.eq("blade_runner")) 
  dedup()                                    // Remove possible duplicated movies

If you have trouble to following the direction of the traversal (inE(), outV(), …) just throw an eye to the graph schema and pay attention to the direction of each edge(arrow).

We then need to project the resulting movies by displaying their title, average rating and genres:

  project("title","average_rating","genres").
    by(values("title").
    by(outE("rated").values("rating").mean()).
    by(out("belongsTo").values("name").fold())

The complete traversal is then:

gremlin>g.
  V().
  has("movie", "title", "Blade Runner").        // Fetch Blade Runner: Iterator<Movie>
  as("blade_runner").                           // Save Blade Runner under the label "blade_runner"
  inE("rated").                                 // Iterator<Rated_Edge>
    where(values("rating").is(gte(8.203))).     // iterator.filter(rated -> rated.getRating() >= 8.2)
  outV().                                       // Iterator<User>
  outE("rated").                                // Iterator<Rated_Edge>
    where(values("rating").is(gte(8.203))).     // iterator.filter(rated -> rated.getRating() >= 8.2)
  inV().                                        // Iterator<Movie> 
  where(neq("blade_runner")).                   // iterator.filter(movie -> !movie.eq("blade_runner")) 
  dedup().                                      // Remove possible duplicated movies
  project("title","average_rating","genres").   // Project on
    by(values("title")).                        // movie's title
    by(inE("rated").values("rating").mean()).   // movie average rating 
    by(out("belongsTo").values("name").fold()). // movie's genres
  limit(10)
==>{title=Inglourious Basterds, average_rating=7.801526717557252, genres=[War, Action, Comedy]}
==>{title=Untouchable, average_rating=8.141176470588235, genres=[Comedy, Drama]}
==>{title=Batman, average_rating=6.801980198019802, genres=[Thriller, Action, Fantasy]}
==>{title=The Band Wagon, average_rating=7.8, genres=[Musical]}
==>{title=Silverado, average_rating=6.697916666666667, genres=[Western]}
==>{title=The Last Samurai, average_rating=6.767123287671233, genres=[Action, Adventure]}
==>{title=Life Is Beautiful, average_rating=8.502923976608187, genres=[Comedy, Drama]}
==>{title=Love's a Bitch, average_rating=7.656716417910448, genres=[Drama]}
==>{title=Requiem for a Dream, average_rating=7.79047619047619, genres=[Drama]}
==>{title=The Matrix, average_rating=7.895348837209302, genres=[Sci-Fi, Thriller, Action, Fantasy]}

If we look at the first 10 movies, we can see that they are not at all a good match for our Blade Runner fan.

some of the movies have a poor average rating: 6.8 for Batman
some of the movies do not belong to neither Sci-Fi nor Action genres: Life Is Beautiful

Indeed our previous traversal contains several caveats. First, the query outE("rated").where(values("rating").is(gte(8.203)))inV() means “give me all the movies rated more than 8.203 by those users” (those users == those who rated Blade Runner more than 8.203). The problem is that even if those users may have rated those movie more than 8.203, it doesn’t mean necessarily that those movies have received a good rating from all other users. We want to retain only a subset of those movies with an average rating >= 8.203. For this we need to filter further the results set with:

  where(neq("blade_runner")).                                 // iterator.filter(movie -> !movie.eq("blade_runner"))  
  where(inE("rated").values("rating").mean().is(gte(8.203))). // iterator.filter(movie -> getAvgRating(movie) >= 8.203))

But this is not sufficient enough. We can have movies with good average rating (>= 8.203) but the genres do not match at all those of Blade Runner. We want also to enforce that the matching movies belong to either Action or Sci-Fi genres. Again, an extra filtering step is necessary:

  where(neq("blade_runner")).                                     // iterator.filter(movie -> !movie.eq("blade_runner"))  
  where(inE("rated").values("rating").mean().is(gte(8.203))).     // iterator.filter(movie -> getAvgRating(movie) >= 8.203))
  where(out("belongsTo").has("name", within("Sci-Fi", "Action"))). // iterator.filter(movie -> getGenres(movie).contains("Sci-Fi") || getGenres(movie).contains("Action"))

The correct traversal for this collaborative filtering is then:

gremlin>g.
  V().
  has("movie", "title", "Blade Runner").        // Fetch Blade Runner: Iterator<Movie>
  as("blade_runner").                           // Save Blade Runner under the label "blade_runner"
  inE("rated").                                 // Iterator<Rated_Edge>
    where(values("rating").is(gte(8.203))).     // iterator.filter(rated -> rated.getRating() >= 8.2)
  outV().                                       // Iterator<User>
  outE("rated").                                // Iterator<Rated_Edge>
    where(values("rating").is(gte(8.203))).     // iterator.filter(rated -> rated.getRating() >= 8.2)
  inV().                                        // Iterator<Movie> 
  where(neq("blade_runner")).                   // iterator.filter(movie -> !movie.eq("blade_runner")) 
  where(inE("rated").values("rating").mean().
        is(gte(8.203))).                        // iterator.filter(movie -> getAvgRating(movie) >= 8.203))
  where(out("belongsTo").has("name", 
        within("Sci-Fi", "Action"))).           // iterator.filter(movie -> getGenres(movie).contains("Sci-Fi") || getGenres(movie).contains("Action"))  
  dedup().                                      // Remove possible duplicated movies
  project("title","average_rating","genres").   // Project on
    by(values("title")).                        // movie's title
    by(inE("rated").values("rating").mean()).   // movie average rating 
    by(out("belongsTo").values("name").fold()). // movie's genres
  limit(10)
==>{title=Pulp Fiction, average_rating=8.581005586592179, genres=[Thriller, Action]}
==>{title=Seven Samurai, average_rating=8.470588235294118, genres=[Action, Adventure, Drama]}
==>{title=A Clockwork Orange, average_rating=8.215686274509803, genres=[Sci-Fi, Drama]}

The results are now more satisfactory. All 3 movies have average rating >= 8.203 and each of them belong to either Action or Sci-Fi genre.

And that’s all folks! Do not miss the other Gremlin recipes in this series.

If you have any question about Gremlin, find me on the datastaxacademy.slack.com, channel dse-graph. My id is @doanduyhai

2 Comments

Yang Li August 6, 2018 at 9:48 am

dear brother,
I have one small question
how I can round the value of the mean()
for example the mean() value is 2.33335
I want to show it as 2.3

thanks a lot in advance.

Reply ↓
1. doanduyhai (Post author)August 6, 2018 at 11:15 am
  
  Check the doc to see if there is a native function to round values. If not you’ll need to enable lambda to perform rounding using Groovy code directly
  
  Reply ↓

Gremlin recipes: 3 – Recommendation Engine traversal

I KillrVideo dataset

II Recommendation engine

2 Comments

Leave a Reply to Yang Li Cancel reply