Greedy and lazy quantifiers

Quantifiers are very simple from the first sight, but in fact they can be tricky.

We should understand how the search works very well if we plan to look for something more complex than /\d+/.

Let’s take the following task as an example.

We have a text and need to replace all quotes "..." with guillemet marks: «...». They are preferred for typography in many countries.

For instance: "Hello, world" should become «Hello, world».

Some countries prefer „Witam, świat!” (Polish) or even 「你好,世界」 (Chinese) quotes. For different locales we can choose different replacements, but that all works the same, so let’s start with «...».

To make replacements we first need to find all quoted substrings.

The regular expression can look like this: /".+"/g. That is: we look for a quote followed by one or more characters, and then another quote.

…But if we try to apply it, even in such a simple case…

let reg = /".+"/g;

let str = 'a "witch" and her "broom" is one';

alert( str.match(reg) ); // "witch" and her "broom"

…We can see that it works not as intended!

Instead of finding two matches "witch" and "broom", it finds one: "witch" and her "broom".

That can be described as “greediness is the cause of all evil”.

Greedy search

To find a match, the regular expression engine uses the following algorithm:

  • For every position in the string
    • Match the pattern at that position.
    • If there’s no match, go to the next position.

These common words do not make it obvious why the regexp fails, so let’s elaborate how the search works for the pattern ".+".

  1. The first pattern character is a quote ".

    The regular expression engine tries to find it at the zero position of the source string a "witch" and her "broom" is one, but there’s a there, so there’s immediately no match.

    Then it advances: goes to the next positions in the source string and tries to find the first character of the pattern there, and finally finds the quote at the 3rd position:

  2. The quote is detected, and then the engine tries to find a match for the rest of the pattern. It tries to see if the rest of the subject string conforms to .+".

    In our case the next pattern character is . (a dot). It denotes “any character except a newline”, so the next string letter 'w' fits:

  3. Then the dot repeats because of the quantifier .+. The regular expression engine builds the match by taking characters one by one while it is possible.

    …When it becomes impossible? All characters match the dot, so it only stops when it reaches the end of the string:

  4. Now the engine finished repeating for .+ and tries to find the next character of the pattern. It’s the quote ". But there’s a problem: the string has finished, there are no more characters!

    The regular expression engine understands that it took too many .+ and starts to backtrack.

    In other words, it shortens the match for the quantifier by one character:

    Now it assumes that .+ ends one character before the end and tries to match the rest of the pattern from that position.

    If there were a quote there, then that would be the end, but the last character is 'e', so there’s no match.

  5. …So the engine decreases the number of repetitions of .+ by one more character:

    The quote '"' does not match 'n'.

  6. The engine keep backtracking: it decreases the count of repetition for '.' until the rest of the pattern (in our case '"') matches:

  7. The match is complete.

  8. So the first match is "witch" and her "broom". The further search starts where the first match ends, but there are no more quotes in the rest of the string is one, so no more results.

That’s probably not what we expected, but that’s how it works.

In the greedy mode (by default) the quantifier is repeated as many times as possible.

The regexp engine tries to fetch as many characters as it can by .+, and then shortens that one by one.

For our task we want another thing. That’s what the lazy quantifier mode is for.

Lazy mode

The lazy mode of quantifier is an opposite to the greedy mode. It means: “repeat minimal number of times”.

We can enable it by putting a question mark '?' after the quantifier, so that it becomes *? or +? or even ?? for '?'.

To make things clear: usually a question mark ? is a quantifier by itself (zero or one), but if added after another quantifier (or even itself) it gets another meaning – it switches the matching mode from greedy to lazy.

The regexp /".+?"/g works as intended: it finds "witch" and "broom":

let reg = /".+?"/g;

let str = 'a "witch" and her "broom" is one';

alert( str.match(reg) ); // witch, broom

To clearly understand the change, let’s trace the search step by step.

  1. The first step is the same: it finds the pattern start '"' at the 3rd position:

  2. The next step is also similar: the engine finds a match for the dot '.':

  3. And now the search goes differently. Because we have a lazy mode for +?, the engine doesn’t try to match a dot one more time, but stops and tries to match the rest of the pattern '"' right now:

    If there were a quote there, then the search would end, but there’s 'i', so there’s no match.

  4. Then the regular expression engine increases the number of repetitions for the dot and tries one more time:

    Failure again. Then the number of repetitions is increased again and again…

  5. …Till the match for the rest of the pattern is found:

  6. The next search starts from the end of the current match and yield one more result:

In this example we saw how the lazy mode works for +?. Quantifiers +? and ?? work the similar way – the regexp engine increases the number of repetitions only if the rest of the pattern can’t match on the given position.

Laziness is only enabled for the quantifier with ?.

Other quantifiers remain greedy.

For instance:

alert( "123 456".match(/\d+ \d+?/g) ); // 123 4
  1. The pattern \d+ tries to match as many numbers as it can (greedy mode), so it finds 123 and stops, because the next character is a space ' '.

  2. Then there’s a space in pattern, it matches.

  3. Then there’s \d+?. The quantifier is in lazy mode, so it finds one digit 4 and tries to check if the rest of the pattern matches from there.

    …But there’s nothing in the pattern after \d+?.

    The lazy mode doesn’t repeat anything without a need. The pattern finished, so we’re done. We have a match 123 4.

  4. The next search starts from the character 5.

Optimizations

Modern regular expression engines can optimize internal algorithms to work faster. So they may work a bit different from the described algorithm.

But to understand how regular expressions work and to build regular expressions, we don’t need to know about that. They are only used internally to optimize things.

Complex regular expressions are hard to optimize, so the search may work exactly as described as well.

Alternative approach

With regexps, there’s often more then one way to do the same thing.

In our case we can find quoted strings without lazy mode using the regexp "[^"]+":

let reg = /"[^"]+"/g;

let str = 'a "witch" and her "broom" is one';

alert( str.match(reg) ); // witch, broom

The regexp "[^"]+" gives correct results, because it looks for a quote '"' followed by one or more non-quotes [^"], and then the closing quote.

When the regexp engine looks for [^"]+ it stops the repetitions when it meets the closing quote, and we’re done.

Please note, that this logic does not replace lazy quantifiers!

It is just different. There are times when we need one or another.

Let’s see one more example where lazy quantifiers fail and this variant works right.

For instance, we want to find links of the form <a href="..." class="doc">, with any href.

Which regular expression to use?

The first idea might be: /<a href=".*" class="doc">/g.

Let’s check it:

let str = '...<a href="link" class="doc">...';
let reg = /<a href=".*" class="doc">/g;

// Works!
alert( str.match(reg) ); // <a href="link" class="doc">

…But what if there are many links in the text?

let str = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';
let reg = /<a href=".*" class="doc">/g;

// Whoops! Two links in one match!
alert( str.match(reg) ); // <a href="link1" class="doc">... <a href="link2" class="doc">

Now the result is wrong for the same reason as our “witches” example. The quantifier .* took too many characters.

The match looks like this:

<a href="....................................." class="doc">
<a href="link1" class="doc">... <a href="link2" class="doc">

Let’s modify the pattern by making the quantifier .*? lazy:

let str = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';
let reg = /<a href=".*?" class="doc">/g;

// Works!
alert( str.match(reg) ); // <a href="link1" class="doc">, <a href="link2" class="doc">

Now it works, there are two matches:

<a href="....." class="doc">    <a href="....." class="doc">
<a href="link1" class="doc">... <a href="link2" class="doc">

Why it works – should be obvious after all explanations above. So let’s not stop on the details, but try one more text:

let str = '...<a href="link1" class="wrong">... <p style="" class="doc">...';
let reg = /<a href=".*?" class="doc">/g;

// Wrong match!
alert( str.match(reg) ); // <a href="link1" class="wrong">... <p style="" class="doc">

We can see that the regexp matched not just a link, but also a lot of text after it, including <p...>.

Why it happens?

  1. First the regexp finds a link start <a href=".

  2. Then it looks for .*?, we take one character, then check if there’s a match for the rest of the pattern, then take another one…

    The quantifier .*? consumes characters until it meets class="doc">.

    …And where can it find it? If we look at the text, then we can see that the only class="doc"> is beyond the link, in the tag <p>.

  3. So we have match:

    <a href="..................................." class="doc">
    <a href="link1" class="wrong">... <p style="" class="doc">

So the laziness did not work for us here.

We need the pattern to look for <a href="...something..." class="doc">, but both greedy and lazy variants have problems.

The correct variant would be: href="[^"]*". It will take all characters inside the href attribute till the nearest quote, just what we need.

A working example:

let str1 = '...<a href="link1" class="wrong">... <p style="" class="doc">...';
let str2 = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';
let reg = /<a href="[^"]*" class="doc">/g;

// Works!
alert( str1.match(reg) ); // null, no matches, that's correct
alert( str2.match(reg) ); // <a href="link1" class="doc">, <a href="link2" class="doc">

Summary

Quantifiers have two modes of work:

Greedy
By default the regular expression engine tries to repeat the quantifier as many times as possible. For instance, \d+ consumes all possible digits. When it becomes impossible to consume more (no more digits or string end), then it continues to match the rest of the pattern. If there’s no match then it decreases the number of repetitions (backtracks) and tries again.
Lazy
Enabled by the question mark ? after the quantifier. The regexp engine tries to match the rest of the pattern before each repetition of the quantifier.

As we’ve seen, the lazy mode is not a “panacea” from the greedy search. An alternative is a “fine-tuned” greedy search, with exclusions. Soon we’ll see more examples of it.

Tasks

What’s the match here?

"123 456".match(/\d+? \d+?/g) ); // ?

The result is: 123 4.

First the lazy \d+? tries to take as little digits as it can, but it has to reach the space, so it takes 123.

Then the second \d+? takes only one digit, because that’s enough.

Find all HTML comments in the text:

let reg = /your regexp/g;

let str = `... <!-- My -- comment
 test --> ..  <!----> ..
`;

alert( str.match(reg) ); // '<!-- My -- comment \n test -->', '<!---->'

We need to find the beginning of the comment <!--, then everything till the end of -->.

The first idea could be <!--.*?--> – the lazy quantifier makes the dot stop right before -->.

But a dot in Javascript means “any symbol except the newline”. So multiline comments won’t be found.

We can use [\s\S] instead of the dot to match “anything”:

let reg = /<!--[\s\S]*?-->/g;

let str = `... <!-- My -- comment
 test --> ..  <!----> ..
`;

alert( str.match(reg) ); // '<!-- My -- comment \n test -->', '<!---->'

Create a regular expression to find all (opening and closing) HTML tags with their attributes.

An example of use:

let reg = /your regexp/g;

let str = '<> <a href="/"> <input type="radio" checked> <b>';

alert( str.match(reg) ); // '<a href="/">', '<input type="radio" checked>', '<b>'

Let’s assume that may not contain < and > inside (in quotes too), that simplifies things a bit.

The solution is <[^<>]+>.

let reg = /<[^<>]+>/g;

let str = '<> <a href="/"> <input type="radio" checked> <b>';

alert( str.match(reg) ); // '<a href="/">', '<input type="radio" checked>', '<b>'
Tutorial map

Comments

read this before commenting…
  • You're welcome to post additions, questions to the articles and answers to them.
  • To insert a few words of code, use the <code> tag, for several lines – use <pre>, for more than 10 lines – use a sandbox (plnkr, JSBin, codepen…)
  • If you can't understand something in the article – please elaborate.