Greedy and Lazy

  1. The searching algorithm
  2. Lazy mode
  3. Alternative approach

Let’s get under the hood of regexp engine and see how the search is performed. The understanding is essential for writing anything more complex than /\d/.

As a starter, we take the following regexp meant to search for quoted strings:

showMatch( 'a "witch" and her "broom" is one',  /".*"/g )

Run the example above… Whoops! It doesn’t work! Actually it finds a single match: "witch" and her "broom", instead of two separate strings.

In this case the regexp greedyness spoils the search.

The searching algorithm

The regexp engine tries to find the regexp in the text starting from position 0 and then tries to match it and so on.

To make it solid, we’ll track the matching algorithm for the quoted strings pattern ".", used above (quotes are not decorative here).

  1. The first pattern character is quote '"'. The regexp engine matches it against the text and finds at 3rd position:
  2. Then the engine tries to match the next part of the regexp. It’s second character is a dot, which means *any character. The regexp engine matches it at 'w':
  3. The dot is repeated one or more times .. So the regexp engine matches it repeatedly *all the way it can:
  4. The text is finished, so the dot repeating stops. But there is still the rest of the pattern to match, the second quote '"'.

    So, the engine starts backtracking or, in other words, shortens the current match, by one character:

    After the match is shortened, it tries to match the rest of the pattern. But the quote '"' doesn’t match 'e'.

  5. So the engine reduces the number of repetitions one more time:


    The quote '"' doesn’t match 'n'. Failed again.

  6. The engine continues backtracking, it shortens the number of dot '.' repetitions one-by-one until the rest of the pattern is matched:

  7. We’ve got the result. Because the regexp is global, the search continues for more results. The continued search starts from right after the match. It doesn’t yield more results.

In greedy (default) mode, the regexp engine repeats a quantifier as many times as possible.

Lazy mode

The lazy mode of quantifiers is a counterpart to greedy. It can be enabled by putting a question mark '?' after the quantifier, so it becomes ? or +? or even ?? for '?'.

The example below works correctly:

showMatch( 'a "witch" and her "broom" is one', /".*?"/g ) // "witch", "broom"

To get a grasp over the lazy mode, we get into details of how ".?" works.

  1. The first step is same, the quote '"' is matched:
  2. The second step is also same, the dot '.' is matched:
  3. Now, the main difference from the greedy mode. The purpose of the regexp engine is to repeat the dot as minimum times as possible, so it tries the rest of pattern '"' immediately:

    No, there is no match. Here we have ‘t’ != ‘”’, but actually much more complicated regexp part may follow. The algorithm is same. If no match, then go further.
  4. The engine matches one dot repetition more and retries:


    No match again. The engine adds one more repetition and so on…

  5. Only the 5th step finally allows the engine to match the rest of the pattern:

  6. Because the global flag is on, the engine continues to search in the text after the match, giving the second result:

In lazy mode the engine tries to repeat as little as possible.

In the example above, it was demonstrated by dot '.' quantifier. Similar it works with '+' and
the question mark '?'.

By default, the engine tries to match as many repetitions as it can, so the default (greedy) match below returns 1a:

showMatch( " item 1a ", /\d\w?/ )  // 1a

But let’s switch the quantifier ‘?’ to lazy mode, adding the question mark:

showMatch( " item 1a ", /\d\w??/ )  // 1

Now the result is 1, because \w is repeated as little as possible (0 times).

The lazy switch is per-quantifier. In the regexp, there may be both greedy and lazy quantifiers.

For example:

showMatch( "123 456", /\d+ \d+?/g )  // 123 4

  1. The subpattern \d+ tries to match as many digits as possible, so it finds 123 and stops, because the space symbol ’ ’ does not match \d.
  2. Then the space is matched and \d+? comes into play. It matches one digit '4' and tries to match the rest of the pattern (after \d+?). Because there is nothing after it, the search is finished and 123 4 is the result.
  3. The search for new results is continued starting with '5', but doesn’t give anything.

Alternative approach

In our particular case, it is possible to match quoted strings, and still remain in greedy mode:

showMatch( 'a "witch" and her "broom" is one', /"[^"]*"/g )

The regexp "[^"]" gives two correct results, because it looks for a quote '"' followed by as many non-quotes as possible. So, the second quote stops [^"] and allows to match the closing quote '"'.

Usually, the lazy approach has wider range of application and gives more readable pattern.

Are "[^"]" and ".?" the same?

In other words, is it possible that they give different results? If yes, then provide the test string.

Open solution
Solution

In general, yes, they are almost the same. Both match from one quote to the other.

But the gotcha is dot symbol '.'. Remember, dot '.' is any character excepts a newline.

And [^"] is any character except a quote '"'.

So, the first regexp "[^"]" will match quoted strings with newlines, but not the second regexp ".?" will not.

showMatch( ' "multiline \n string " ', /"[^"]*"/g )

showMatch( ' "multiline \n string " ', /".*?"/g )

Find all HTML comments in the text:

str = '.. <!-- My -- comment \n test --> ..  <!----> .. '
re = /.. your regexp ../

str.match(re) // '<!-- My -- comment \n test -->', '<!---->'

Open solution
Solution

We need to match everything . from comment start <!-- to comment end -->. To let the comment end match, the quantifier should be lazy.

So, the first-glance solution for a task is <!--.?-->.

But we remember that JavaScript dot doesn’t match a newline. So let’s replace it by a more generic [\s\S].

Finally:

str = '.. <!-- Welcome -- comment \n test --> ..  <!----> .. '
re = /<!--[\s\S]*?-->/g

alert( str.match(re) )


Create a regexp to match opening HTML tags with attributes:

var re = /* your regexp */

var str = '<> <a href="/"> <input type="radio" checked> <b>'

alert(str.match(re)) // '<a href="/">', '<input type="radio" checked>', '<b>'

P.S. We know that there may not be < or > inside a tag (they are &lt; and &gt;).
P.P.S. <> is not a tag.

Open solution
Solution

First, we start with <, then we could add .+?> to match any chars until >. So the regexp is <.+?>.

Let’s see how it works:

var re = /<.+?>/g

var str = '<> <a href="/"> <input type="radio" checked> **'

alert(str.match(re)) // '<> <a href="/">'

Wrong! It matches <> <a href="/">,* because .+? is *”any char (except newline) repeating one or more times until the rest of the pattern matches (lazyness)”.

So, here’s what is does step by step:

  1. Match the first pattern symbol '<':
  2. Start matching .+ ungreedily. Match any symbol one time:
  3. Because the quantifier is lazy, try to match the rest of the pattern '>':

    No match.
  4. Repeat the quantifier and try to match the rest of the pattern again.


    No match.

  5. Repeat the quantifier until the match is found:
  6. We’ve got the result:

    Because the regexp is global, the new search starts right after the match.

So, the right solution is <[^>]+>. It won’t match <>, because needs at least non-'>' symbol:

var re = /<[^>]+>/g

var str = '<> <a href="/"> <input type="radio" checked> <b>'

alert(str.match(re)) // '<a href="/">', '<input type="radio" checked>', '<b>'

Tutorial

Donate

Donate to this project