Gremlin recipes: 2 – SQL to Gremlin

This blog post is the 2nd from the series Gremlin Recipes. It is recommended to read the previous recipe first: 1 – Gremlin as a Stream

I KillrVideo dataset

To illustrate this series of recipes, you need first to create the schema for KillrVideo and import the data. See here for more details.

The graph schema of this dataset is :

II Simple SQL queries translation

Let’s say we have a table movie with some properties:

  • country: Text
  • duration: Int
  • movieId: Text
  • production: Text
  • title: Text
  • year: Int

Now we want to translate the SQL query SELECT * FROM Movie LIMIT 10 into Gremlin traversal:

gremlin>g.
    V().                // Iterator<Vertex>    
    hasLabel("movie").  // Iterator<Movie>   
    valueMap().         // Iterator<Map<String == property label, Object = property value>>
    limit(10)
==>{duration=[104], country=[United States], year=[2003], production=[Paramount Pictures], movieId=[m75], title=[The Italian Job]}
==>{duration=[111], country=[United States], year=[1998], production=[20th Century Fox], movieId=[m617], title=[Great Expectations]}
==>{duration=[160], country=[United States], year=[2007], production=[Warner Bros. Pictures], movieId=[m337], title=[The Assassination of Jesse James By The Coward Robert Ford]}
==>{duration=[108], country=[United States], year=[1997], production=[20th Century Fox, Brandywine Productions], movieId=[m518], title=[Alien Resurrection]}
==>{duration=[175], country=[United States], year=[1954], production=[Warner Bros], movieId=[m777], title=[A Star Is Born]}
==>{duration=[109], country=[United States], year=[1990], production=[Carolco], movieId=[m917], title=[Total Recall]}
==>{duration=[127], country=[United States], year=[2012], production=[Fox 2000 Pictures, Haishang Films], movieId=[m546], title=[Life of Pi]}
==>{duration=[84], country=[United States], year=[1950], production=[20th Century Fox], movieId=[m468], title=[The Gunfighter]}
==>{duration=[115], country=[United States], year=[2000], production=[Franchise Pictures, Jonathan D. Krane, JTP Films], movieId=[m909], title=[Battlefield Earth]}
==>{duration=[108], country=[United States], year=[2004], production=[Focus Features], movieId=[m904], title=[Eternal Sunshine of the Spotless Mind]}

The step valueMap() is useful to get the properties label along with their values. If we had used the step values() we would only have the raw properties values:

gremlin>g.
    V().               // Iterator<Vertex>    
    hasLabel("movie"). // Iterator<Movie>   
    limit(2).
    values()           // iterator.flatMap(movie -> movies.properties()) == Iterator<Object>
==>m75
==>The Italian Job
==>2003
==>104
==>United States
==>Paramount Pictures
==>m617
==>Great Expectations
==>1998
==>111
==>United States
==>20th Century Fox

When using values() step, each movie vertex is transformed into a collection of its properties values and Gremlin implicitly applies a flatMap() operation so that the result is an Iterator<Object> instead of a nested structure Iterator<Collection<Object>>. Consequently we need to put the limit(2) before calling values() otherwise the limit will be applied on the collection of movies properties.

Now, instead of fetching all properties of the movies, what if we only want to read the title ? SELECT title FROM Movie LIMIT 10:

gremlin>g.
    V().               // Iterator<Vertex>    
    hasLabel("movie"). // Iterator<Movie>   
    values("title").   // iterator.map(movie -> movie.getTitle()) == Iterator<String>
    limit(10)          
==>The Italian Job
==>Great Expectations
==>The Assassination of Jesse James By The Coward Robert Ford
==>Alien Resurrection
==>A Star Is Born
==>Total Recall
==>Life of Pi
==>The Gunfighter
==>Battlefield Earth
==>Eternal Sunshine of the Spotless Mind

Easy. To select multiple columns SELECT country,duration,title,year FROM Movie LIMIT 10 we can use of course values("country","duration","title","year") step but then all the properties will be mixed into a stream collection. It is better to rely on valueMap() to have a stream of key/value pair:

gremlin>g.
    V().                                           // Iterator<Vertex>    
    hasLabel("movie").                             // Iterator<Movie>   
    valueMap("country","duration","title","year"). // Iterator<Map<String, Object>>
    limit(10)          
==>{duration=[104], country=[United States], year=[2003], title=[The Italian Job]}
==>{duration=[111], country=[United States], year=[1998], title=[Great Expectations]}
==>{duration=[160], country=[United States], year=[2007], title=[The Assassination of Jesse James By The Coward Robert Ford]}
==>{duration=[108], country=[United States], year=[1997], title=[Alien Resurrection]}
==>{duration=[175], country=[United States], year=[1954], title=[A Star Is Born]}
==>{duration=[109], country=[United States], year=[1990], title=[Total Recall]}
==>{duration=[127], country=[United States], year=[2012], title=[Life of Pi]}
==>{duration=[84], country=[United States], year=[1950], title=[The Gunfighter]}
==>{duration=[115], country=[United States], year=[2000], title=[Battlefield Earth]}
==>{duration=[108], country=[United States], year=[2004], title=[Eternal Sunshine of the Spotless Mind]}

The usage of valueMap() has been thoroughly explained in 1 – Gremlin as a Stream, please refer to it if you have any difficulty to understand the transformation.

III SQL WHERE clause in Gremlin traversal

Gremlin translates simply the WHERE clause of SQL language into the step filter() and has() steps. has() is a simple filter on vertex/label property whereas filter can be applied to a traversal. Let’s say we want the query SELECT title FROM Movie WHERE year = 2000 LIMIT 5. The Gremlin traversal would be:

gremlin>g.
    V().               // Iterator<Vertex>    
    hasLabel("movie"). // Iterator<Movie>   
    has("year", 2000). // iterator.filter(movie -> movie.getYear() == 2000)    
    values("title").   // Iterator<String>
    limit(5)
==>Cast Away
==>What Women Want
==>Dancer in the Dark
==>Crouching Tiger, Hidden Dragon
==>Snatch

For multiple filters, we can just chain the has() step as many time as needed, SELECT title,country,year FROM Movie WHERE year >= 2009 AND year <= 2011 AND country='United States' LIMIT 5:

gremlin>g.
    V().                                // Iterator<Vertex>    
    hasLabel("movie").                  // Iterator<Movie>   
    has("year", gte(2009)).             // iterator.filter(movie -> movie.getYear()>=2000)    
    has("year", lte(2011)).             // iterator.filter(movie -> movie.getYear()<=2011)
    has("country", "United States").    // iterator.filter(movie -> movie.getCountry().equals("United States"))    
    valueMap("title","country","year"). // Iterator<String>
    limit(5)
==>{country=[United States], year=[2010], title=[Alice in Wonderland]}
==>{country=[United States], year=[2010], title=[Blue valentine]}
==>{country=[United States], year=[2011], title=[Friends with Benefits]}
==>{country=[United States], year=[2010], title=[Tangled]}
==>{country=[United States], year=[2009], title=[Nine]}

Now if we want some more complex filtering using traversal, we can use the filter() step:

gremlin>g.
    V().
    hasLabel("movie").
    filter(out("director").has("name", "Woody Allen")).
    values("title").
    limit(5)
==>Love and Death
==>Match Point
==>Take the Money and Run
==>Scoop
==>Deconstructing Harry

IV GROUP BY and ORDER BY translation

Let’s start with ORDER BY clause. The SELECT title, duration FROM Movie ORDER BY duration DESC LIMIT 5 query translates into:

gremlin>g.
    V().
    hasLabel("movie").
    order().by("duration", decr).
    valueMap("title","duration").
    limit(5)
==>{duration=[238], title=[Gone With the Wind]}
==>{duration=[225], title=[Once Upon a Time in America]}
==>{duration=[222], title=[Lawrence of Arabia]}
==>{duration=[211], title=[Ben-Hur]}
==>{duration=[205], title=[Seven Samurai]}

The decr keyword means decrementing == DESCENDING. So far so good, it is also possible to order by multiple properties. Let’s say we want to order the movies first by their title and then their duration (the example is quite contrived but just good enough for explanation): SELECT title, duration FROM Movie ORDER BY title ASC, duration DESC LIMIT 5

gremlin>g.
    V().
    hasLabel("movie").
    order().
      by("title", incr).
      by("duration", decr).
    valueMap("title","duration").
    limit(5)
==>{duration=[96], title=[(500) Days of Summer]}
==>{duration=[97], title=[10 Things I Hate about You]}
==>{duration=[109], title=[10,000 B.C.]}
==>{duration=[79], title=[101 Dalmatians]}
==>{duration=[95], title=[12 Angry Men]}

Now let’s add GROUP BY into the mix. We want to have the top 2 longest movies for each country. The SQL query is:

SELECT country, max(duration) AS longest_duration
FROM Movie
GROUP BY country
ORDER BY max(duration) DESC
LIMIT 5

How would we translate this query into Gremlin traversal ? Grouping by country is a piece of cake, if you are not familiar with grouping in Gremlin, I recommend reading the previous post at section III Grouping the stream

So the obvious Gremlin traversal would be:

gremlin>g.
    V().                            // Iterator<Vertex>
    hasLabel("movie").              // Iterator<Movie>   
    group().
      by("country").                // Iterator<Map<String == country, Collection<Movie>>>
      by(values("duration").max()). // Iterator<Map<String == country, Long == duration of longest movie>>
    order().
      by(values, decr).       
    limit(5)
==>{Argentina=126, Hong Kong=120, United States=238, Japan=205, United Kingdom=222, Soviet Union (USSR)=165, Spain=143, Russia=114, New Zealand=121, Canada=96, South Korea=143, Austria=127, Ireland=135, China=89, Taiwan=119, Brazil=130, Germany - West Germany=94, Denmark=177, Italy=165, Mexico=150, South Africa=111, France=180, Australia=165, Germany=153}

The result is more than surprising, in fact it is completely wrong. Some explanation is needed.

The step order().by(values, decr) would order the stream of data (thus Iterator<Map<String,Long>> ) by its content e.g. order all the Maps inside the Iterator in descending order. The problem is that this ordering is useless and has no effect because inside the the Iterator, there is only a single Map. The same remark applies to the limit(5). Since there is only one Map any number for limit will yield the same result anyway.

To convince ourselves, we can just run the traversal to count the number of elements inside the the Iterator:

gremlin>g.
    V().                            // Iterator<Vertex>
    hasLabel("movie").              // Iterator<Movie>   
    group().
      by("country").                // Iterator<Map<String == country, Collection<Movie>>>
      by(values("duration").max()). // Iterator<Map<String == country, Long == duration of longest movie>>
    order().
      by(values, decr).       
    count()
==>1

To fix this problem, Gremlin offers 2 solutions

A Using “local” keyword

What we want is not to order the items inside the Iterator but to order the entries inside each Map. For that we need to use the local keyword

gremlin>g.
    V().                            // Iterator<Vertex>
    hasLabel("movie").              // Iterator<Movie>   
    group().
      by("country").                // Iterator<Map<String == country, Collection<Movie>>>
      by(values("duration").max()). // Iterator<Map<String == country, Long == duration of longest movie>>
    order(local).
      by(values, decr).       
    limit(local, 5)
==>{United States=238, United Kingdom=222, Japan=205, France=180, Denmark=177}

Now we have the correct result. order(local).by(values, decr) will order each nested Map by its value (please note that you can also decide to order by key using by(keys, decr) as well). limit(local, 5) will only take the first 5 elements inside the Map in the Iterator.

B Using “unfold()” step

The problem with our initial traversal (the one without the local keyword) was that we performed the ordering and limit on the wrong level. With the step unfold() which is quite similar to the Java stream flatMap() operation, we can extract all map entries from the internal Map and move them to the Iterator level so that the grouping and limit work again:

gremlin>g.
    V().                            // Iterator<Vertex>
    hasLabel("movie").              // Iterator<Movie>   
    group().
      by("country").                // Iterator<Map<String == country, Collection<Movie>>>
      by(values("duration").max()). // Iterator<Map<String == country, Long == duration of longest movie>>
    unfold().                       // Iterator<MapEntry<String == country, Long == duration of longest movie>>
    order().
      by(values, decr).       
    limit(5)
==>United States=238
==>United Kingdom=222
==>Japan=205
==>France=180
==>Denmark=177

And of course we have the same results as previously.

Please note that the display format does differ though.

The explanation is simple. By using limit(local, 5) Gremlin will display the first element of the Iterator with only 5 elements inside which is the Map truncated to its top 5 element, thus the JSON display of a map content {United States=238, United Kingdom=222, Japan=205, France=180, Denmark=177}.

Whereas with unfold(), the limit(5) is applied on the Iterator itself so Gremlin will show you only the first 5 map entries and that explains the display formatting:


==>United States=238
==>United Kingdom=222
==>Japan=205
==>France=180
==>Denmark=177

V JOIN translation

Let’s now tackle the biggest feature of relational databases: joins. With Gremlin, joins translate into simple traversals and you barely realise it. Let’s take the following join query:


SELECT Movie.title, Genre.name FROM Movie
INNER JOIN Movie_Genre ON Movie.movieId = Movie_Genre.movieId
INNER JOIN Genre ON Movie_Genre.genreId = Genre.genreId
WHERE Genre.name = "Sci-Fi"
LIMIT 10

We want to display the 10 movies title and genre name of all movies whose genres contain at least “Sci-Fi”.

A naive Gremlin traversal would be:

gremlin>g.
  V(). 
  hasLabel("movie").
  filter(out("belongsTo").has("name", "Sci-Fi")).   
  values("title", out("belongsTo").values("name")).
  limit(10)
No signature of method: org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.DefaultGraphTraversal.values() is applicable for argument types: (java.lang.String, org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.DefaultGraphTraversal) values: [title, [VertexStep(OUT,[belongsTo],vertex), PropertiesStep([name],value)]]
Type ':help' or ':h' for help.
Display stack trace? [yN]	

There are 2 interesting remarks. First the JOIN in SQL is achieved by the traversal filter(out("belongsTo").has("name", "Sci-Fi")). This traversal also performs the filtering on Genre name == Sci-Fi. Second, the traversal generates an exception, which basically says that the step values() does not support the arguments [title, [VertexStep(OUT,[belongsTo],vertex), PropertiesStep([name],value)]]. Decoded for mere mortals, it means that you cannot use inner traversal inside the values() step. Indeed values() can only be used to fetch the properties of a vertex/edge.

This lets us with the pending question: how to display the movie title alongside with its genre(s) ?

For this, we need to use the project() step. Projection in Gremlin is very similar to projection in SQL and this step can accept inner traversal, which is what we want.

gremlin>g.
  V(). 
  hasLabel("movie").
  filter(out("belongsTo").has("name", "Sci-Fi")). // JOIN ... WHERE Genre.name = "Sci-Fi"  
  project("title", "genres").                     // SELECT
    by(values("title")).                          //   Movie.Title
    by(out("belongsTo").values("name").fold()).   //   Genre.name
  limit(10)
==>{title=Alien Resurrection, genres=[Sci-Fi, Horror]}
==>{title=Total Recall, genres=[Sci-Fi, Action]}
==>{title=Battlefield Earth, genres=[Sci-Fi, Action]}
==>{title=Eternal Sunshine of the Spotless Mind, genres=[Romance, Sci-Fi, Comedy, Drama]}
==>{title=Back to the Future. Part III, genres=[Sci-Fi, Western, Adventure, Comedy, Fantasy]}
==>{title=The Avengers, genres=[Sci-Fi, Action, Fantasy]}
==>{title=Back to the Future, genres=[Sci-Fi, Adventure, Comedy, Fantasy]}
==>{title=El gran marciano, genres=[Sci-Fi, Comedy]}
==>{title=Source Code, genres=[Sci-Fi, Thriller, Action]}
==>{title=2001: A Space Odyssey, genres=[Sci-Fi]}	

The step project() needs some explanation. First the arguments to this step represent the alias/label of the values being projected (the equivalent in SQL land is SELECT xxx AS label). Here the labels are “title” and “genres” but you can put whatever String value you want as long as they remain meaningful for the reader.

Next the project() step is followed by some by() modulators. There should be as many by() modulators as there are arguments to the project() step. Here we need 2 by() modulators.

The first by() specifies on which value we want to project the label “title” to. Obviously we want to read the “title” property of the Movie vertex so by(values("title")) is straightforward.

The second by() specifies on which value we want to project the label “genres” to. Since the genres are not intrinsic properties of the Movie vertex we need a traversal: by(out("belongsTo").values("name").fold()). And because traversing the edge “belongsTo” can lead us to a 1-to-N relationship (a movie can belong to multiple genres) we need the fold() operator to concatenate all the genre names into a single collection to be displayed with each movie title.

And that’s all folks! Do not miss the other Gremlin recipes in this series.

If you have any question about Gremlin, find me on the datastaxacademy.slack.com, channel dse-graph. My id is @doanduyhai

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.