Advanced Regular Expressions: part 2 – greedy vs reluctant

In this post we will look at the greedy nature of common quantifiers (* and +).

II Greedy vs Reluctant

We’re so used with the * quantifier that we almost forget its fundamental greedy nature. A quantifier is considered “greedy” when it tries to match the longest text chain possible before bactracking if the pattern does not match. Its counterpart is a reluctant quantifier.

Example is better that words, let’s see how we can extract the root directory in an Unix system given the complete path of a file:

Input: /home/data/ebooks/JavaForDummies.pdf

Pattern: ^(.+)/.+$


String text = &quot;/home/data/books/JavaForDummies.pdf&quot;;

         Pattern pattern = Pattern.compile(&quot;^/(.+)(?:/.+)*$&quot;);
         Matcher matcher = pattern.matcher(text);
         if(matcher.matches())
         {
             System.out.println(&quot; Root folder : &quot; + matcher.group(1));
         }

Not surprisingly the console would display:

Matches : true

Group 1 : home/data/books/JavaForDummies.pdf

What the heck ? But we did clearly isolate the first part /(.+) of the pattern from the remaining trail, how comes ?

This is a very common mistake when you just happen to forget the greedy nature of the quantifier (greedy is their default behavior). First the regexp engine tries to match the pattern/(.+) with the longest possible text. The regexp engine starts by consuming all the text:

/(.+) : /home/data/books/JavaForDummies.pdf↑

The first part of the pattern does match, and the engine will stop here, reporting the entire text as the match. The last part of the pattern (?:/.+)* did indeed match with .. nothing because the * quantifier means nothing or more and it this case it matched nothing.

To circumvent this issue, one can thing to turn the * into + to force the regexp engine to match. The result is no even close to what we expect:

Matches : true

Group 1 : home/data/books

We did get rid of the last matching block (/JavaForDummies.pdf) but it’s still not satisfactory.

But ? Wasn’t the + in (?:/.+)+ supposed to be greedy ? Why didn’t it match /data/books/JavaForDummies.pdf but only the last part ?

Well, first come, first served. The regexp engine starts processing the /(.+) part first and since it is greedy it will consume all the string. But by doing so, there is nothing left to be matched with (?:/.+)+ so the regexp engine is forced to give up as many characters as needed so that the WHOLE pattern matches.

Step 1. /home/data/books/JavaForDummies.pdf↑

/(.+) matches /home/data/books/JavaForDummies.pdf
(?:/.+)+ matches nothing -> KO

Step 2. /home/data/books/JavaForDummies.pd↑f

/(.+) matches /home/data/books/JavaForDummies.pd
(?:/.+)+ matches f -> KO

Step 3. /home/data/books/JavaForDummies.p↑df

/(.+) matches /home/data/books/JavaForDummies.p
(?:/.+)+ matches df -> KO

…

Step 20. /home/data/books↑

/(.+) matches /home/data/books
(?:/.+)+ matches /JavaForDummies.pdf -> OK

As long as (?:/.+)+ can match anything, be it the shortest length, the regexp engine will stop processing and return the successful result. Indeed the greediness of the first part /(.+) has priority over the greediness of the second one (?:/.+)+.

Now, back to our issue, we still want to extract the root folder from the file path. Each folder level in the path is delimited by the slash /. The idea here would be to take the first group of text between slashes:

Patter: ^/([^/]+)/.+$

This time the output is:

Matches ? true
Group 1 : home

That’s what we wanted. The trick here is the use of a negation in a character class: [^/]. This means “anything except the a/“. Combined with the beginning / and the second /, we can isolate the root folder from the remaining path: /([^/]+)/

Alternatively we can use the reluctant version of our + quantifier: +?

Pattern: ^/(.+?)(?:/.+)+$

The output:

Matches ? true
Group 1 : home

This time, instead of starting by consuming all the string, the reluctant +? will consume character by character until the whole pattern matches, thus giving us the smallest text that can match the string match.

Advanced Regular Expressions: part 2 – greedy vs reluctant

II Greedy vs Reluctant

Leave a Comment Cancel reply