{"id":33,"date":"2011-07-30T00:45:32","date_gmt":"2011-07-29T22:45:32","guid":{"rendered":"http:\/\/doanduyhai.wordpress.com\/?p=33"},"modified":"2011-07-30T00:45:32","modified_gmt":"2011-07-29T22:45:32","slug":"advanced-regular-expressions-part-2-greedy-vs-reluctant","status":"publish","type":"post","link":"https:\/\/www.doanduyhai.com\/blog\/?p=33","title":{"rendered":"Advanced Regular Expressions: part 2 &#8211; greedy vs reluctant"},"content":{"rendered":"<p>In this post we will look at the greedy nature of common quantifiers (* and +).<br \/>\n<!--more--><br \/>\n&nbsp;<\/p>\n<h3><strong>II\u00a0 Greedy vs Reluctant<\/strong><\/h3>\n<p>We&#8217;re so used with the * quantifier that we almost forget its fundamental greedy nature.\u00a0 A quantifier is considered &#8220;greedy&#8221; when it tries to match the longest text chain possible before bactracking if the pattern does not match. Its counterpart is a reluctant quantifier.<\/p>\n<p>Example is better that words, let&#8217;s see how we can extract the root directory in an Unix system given the complete path of a file:<\/p>\n<p>Input: <span style=\"text-decoration:underline;\">\/home\/data\/ebooks\/JavaForDummies.pdf<\/span><\/p>\n<p>Pattern:\u00a0 <em><span style=\"font-size:large;\">^(.+)\/.+$<\/span><\/em><\/p>\n<pre class=\"brush: java; title: ; notranslate\" title=\"\">\n\nString text = &amp;quot;\/home\/data\/books\/JavaForDummies.pdf&amp;quot;;\n\n \u00a0\u00a0 \u00a0\u00a0\u00a0 \u00a0Pattern pattern = Pattern.compile(&amp;quot;^\/(.+)(?:\/.+)*$&amp;quot;);\n \u00a0\u00a0 \u00a0\u00a0\u00a0 \u00a0Matcher matcher = pattern.matcher(text);\n \u00a0\u00a0 \u00a0\u00a0\u00a0 \u00a0if(matcher.matches())\n \u00a0\u00a0 \u00a0\u00a0\u00a0 \u00a0{\n \u00a0\u00a0 \u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0 \u00a0System.out.println(&amp;quot; Root folder : &amp;quot; + matcher.group(1));\n \u00a0\u00a0 \u00a0\u00a0\u00a0 \u00a0}\n\n<\/pre>\n<p>Not surprisingly the console would display:<\/p>\n<blockquote><p>Matches : true<\/p>\n<p>Group 1 : home\/data\/books\/JavaForDummies.pdf<\/p><\/blockquote>\n<p>What the heck ? But we did clearly isolate the first part <em><span style=\"font-size:large;\">\/(.+)<\/span><\/em> of the pattern from the remaining trail, how comes ?<\/p>\n<p>This is a very common mistake when you just happen to forget the greedy nature of the quantifier (greedy is their default behavior). First the regexp engine tries to match the pattern<em><span style=\"font-size:large;\">\/(.+)<\/span><\/em> with the longest possible text.\u00a0The regexp engine starts by consuming all the text:<\/p>\n<p><em><span style=\"font-size:large;\">\/(.+)<\/span><\/em> :\u00a0\u00a0\u00a0\u00a0 <span style=\"text-decoration:underline;\">\/home\/data\/books\/JavaForDummies.pdf<\/span>\u2191<\/p>\n<p>The first part of the pattern does match, and the engine will stop here, reporting the entire text as the match. The last part of the pattern <em><span style=\"font-size:large;\">(?:\/.+)*<\/span><\/em> did indeed match with .. nothing because the * quantifier means nothing or more and it this case it matched nothing.<\/p>\n<p>To circumvent this issue, one can thing to turn the * into + to force the regexp engine to match. The result is no even close to what we expect:<\/p>\n<blockquote><p>Matches : true<\/p>\n<p>Group 1 : home\/data\/books<\/p><\/blockquote>\n<p>We did get rid of the last matching block (<span style=\"text-decoration:underline;\">\/JavaForDummies.pdf<\/span>) but it&#8217;s still not satisfactory.<\/p>\n<p>But ? Wasn&#8217;t the + in <em><span style=\"font-size:large;\">(?:\/.+)+<\/span><\/em>\u00a0supposed to be greedy ? Why didn&#8217;t it match <span style=\"text-decoration:underline;\">\/data\/books\/JavaForDummies.pdf<\/span> but only the last part ?<\/p>\n<p>Well, first come, first served. The regexp engine starts processing the <em><span style=\"font-size:large;\">\/(.+) <\/span><\/em>part first and since it is greedy it will consume all the string. But by doing so, there is nothing left to be matched with <em><span style=\"font-size:large;\">(?:\/.+)+<\/span><\/em> so the regexp engine is forced to give up as many characters as needed so that the WHOLE pattern matches.<\/p>\n<p>Step 1.\u00a0 <span style=\"text-decoration:underline;\">\/home\/data\/books\/JavaForDummies.pdf\u2191<\/span><\/p>\n<ul>\n<li>\u00a0<em><span style=\"font-size:large;\">\/(.+) <\/span><\/em> matches <span style=\"text-decoration:underline;\">\/home\/data\/books\/JavaForDummies.pdf<\/span><\/li>\n<li><em><span style=\"font-size:large;\">(?:\/.+)+<\/span><\/em> matches nothing -&gt; KO<\/li>\n<\/ul>\n<p>Step 2. <span style=\"text-decoration:underline;\">\/home\/data\/books\/JavaForDummies.pd\u2191f<\/span><\/p>\n<ul>\n<li><em><span style=\"font-size:large;\">\/(.+) <\/span><\/em> matches <span style=\"text-decoration:underline;\">\/home\/data\/books\/JavaForDummies.pd<\/span><\/li>\n<li><em><span style=\"font-size:large;\">(?:\/.+)+<\/span><\/em> matches\u00a0 <span style=\"text-decoration:underline;\">f<\/span> -&gt; KO<\/li>\n<\/ul>\n<p>Step 3. <span style=\"text-decoration:underline;\">\/home\/data\/books\/JavaForDummies.p\u2191df<\/span><\/p>\n<ul>\n<li><em><span style=\"font-size:large;\">\/(.+) <\/span><\/em> matches <span style=\"text-decoration:underline;\">\/home\/data\/books\/JavaForDummies.p<\/span><\/li>\n<li><em><span style=\"font-size:large;\">(?:\/.+)+<\/span><\/em> matches\u00a0 <span style=\"text-decoration:underline;\">df<\/span> -&gt; KO<\/li>\n<\/ul>\n<p>&#8230;<\/p>\n<p>Step 20. <span style=\"text-decoration:underline;\">\/home\/data\/books\u2191<\/span><\/p>\n<ul>\n<li><em><span style=\"font-size:large;\">\/(.+) <\/span><\/em> matches <span style=\"text-decoration:underline;\">\/home\/data\/books<\/span><\/li>\n<li><em><span style=\"font-size:large;\">(?:\/.+)+<\/span><\/em> matches\u00a0 <span style=\"text-decoration:underline;\">\/JavaForDummies.pdf<\/span> -&gt; OK<\/li>\n<\/ul>\n<p>As long as <em><span style=\"font-size:large;\">(?:\/.+)+<\/span><\/em> can match anything, be it the shortest length, the regexp engine will stop processing and return the successful result. Indeed the greediness of the first part\u00a0<em><span style=\"font-size:large;\">\/(.+) <\/span><\/em> has priority over the greediness of the second one <em><span style=\"font-size:large;\">(?:\/.+)+<\/span><\/em>.<\/p>\n<p>Now, back to our issue, we still want to extract the root folder from the file path. Each folder level in the path is delimited by the slash \/. The idea here would be to take the first group of text between slashes:<\/p>\n<p>Patter: <em><span style=\"font-size:large;\">^\/([^\/]+)\/.+$<\/span><\/em><\/p>\n<p>This time the output is:<\/p>\n<blockquote><p>\u00a0Matches ? true<br \/>\nGroup 1 : home<\/p><\/blockquote>\n<p>That&#8217;s what we wanted. The trick here is the use of a negation in a character class:\u00a0<em><span style=\"font-size:large;\">[^\/]<\/span><\/em>. This means &#8220;<em>anything except the a\/<\/em>&#8220;. Combined with the beginning \/ and the second \/, we can isolate the root folder from the remaining path: <em><span style=\"font-size:large;\">\/([^\/]+)\/<\/span><\/em><\/p>\n<p>Alternatively we can use the reluctant version of our + quantifier:\u00a0 +?<\/p>\n<p>Pattern:\u00a0 <em><span style=\"font-size:large;\">^\/(<span style=\"color:#ff0000;\">.+?<\/span>)(?:\/.+)+$<\/span><\/em><\/p>\n<p>The output:<\/p>\n<blockquote><p>\u00a0Matches ? true<br \/>\nGroup 1 : home<\/p><\/blockquote>\n<p>This time, instead of starting by consuming all the string, the reluctant +? will consume character by character until the whole pattern matches, thus giving us the smallest text that can match the string match.<\/p>\n<p>&nbsp;<\/p>\n<p>Recommended readings:<\/p>\n<ul>\n<li><a href=\"http:\/\/www.regular-expressions.info\/optional.html\" target=\"_blank\">http:\/\/www.regular-expressions.info\/optional.html<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>In this post we will look at the greedy nature of common quantifiers (* and +).<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[12],"tags":[45],"_links":{"self":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/33"}],"collection":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=33"}],"version-history":[{"count":0,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/33\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=33"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=33"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.doanduyhai.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=33"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}