Let’s get under the hood of the regexp engine and see how the search is performed. Understanding this is essential for writing anything more complex than the simplest patterns.
As a starter, we take the following regexp meant to search for quoted strings:
showMatch( 'a "witch" and her "broom" is one', /".*"/g )
Run the example above… Whoops! It doesn’t work! Actually it finds a single match:
"witch" and her "broom", instead of two separate strings.
In this case the regexp’s greediness spoils the search.
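To reproduce this outside the page, here is a minimal sketch that uses the standard String.prototype.match and console.log in place of the showMatch helper (whose implementation is not shown in this text). The variable names are only illustrative.

```javascript
// Standard String.prototype.match instead of the page's showMatch helper.
const text = 'a "witch" and her "broom" is one';

// Greedy: .* takes as much as it can, so both quoted words land in one match.
const greedyMatches = text.match(/".*"/g);

console.log(greedyMatches.length); // 1
console.log(greedyMatches[0]);     // "witch" and her "broom"
```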
The searching algorithm
The regexp engine tries to match the pattern at position 0 of the text. If that fails, it moves to the next position and tries again, and so on.
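The position-by-position search can be sketched with the sticky flag y, which makes exec() match only at exactly lastIndex. The helper name findFirst is hypothetical, and this only imitates the outer loop of the search; it is not how any real engine is implemented.

```javascript
// Hypothetical helper: try the pattern at position 0, then 1, then 2, ...
function findFirst(pattern, text) {
  // The sticky flag "y" anchors every attempt at lastIndex.
  const sticky = new RegExp(pattern.source, 'y');
  for (let pos = 0; pos <= text.length; pos++) {
    sticky.lastIndex = pos;
    const m = sticky.exec(text);
    if (m) return { index: pos, match: m[0] };
  }
  return null; // no position worked
}

const found = findFirst(/".*"/, 'a "witch" and her "broom" is one');
console.log(found.index); // 2
console.log(found.match); // "witch" and her "broom"
```

Positions 0 and 1 fail immediately because neither character is a quote; the attempt at index 2 (the 3rd position) succeeds.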
To make it concrete, we’ll trace the matching algorithm for the quoted-strings pattern ".*" used above (the quotes are part of the pattern, not decoration).
- The first pattern character is the quote '"'. The regexp engine tries it against the text and finds it at the 3rd position.
- Then the engine tries to match the next part of the regexp. Its second character is a dot, which means any character. The regexp engine matches it against the next character of the text, 'w'.
- The dot is repeated zero or more times because of the quantifier in .* so the regexp engine matches it repeatedly, as far as it can.
- The text is finished, so the dot stops repeating. But there is still the rest of the pattern to match: the second quote '"'.
So the engine starts backtracking or, in other words, shortens the current match by one character.
After the match is shortened, it tries to match the rest of the pattern. But the quote '"' doesn’t match 'e'.
- So the engine reduces the number of repetitions one more time and tries the quote '"' against 'n'. Failed again.
- The engine continues backtracking: it reduces the number of repetitions of the dot '.' one by one until the rest of the pattern matches.
- We’ve got the result. Because the regexp is global, the search continues for more results. The continued search starts from right after the match. It doesn’t yield more results.
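The steps above can be replayed with a small simulation. This is only a sketch for the specific pattern ".*" at one start position; the function name traceGreedy and the wording of the logged steps are made up for illustration and do not reflect how a real engine is built.

```javascript
// Hypothetical simulation of the greedy pattern ".*" at one start position.
function traceGreedy(text, start) {
  const steps = [];
  if (text[start] !== '"') return { steps, match: null };

  // Greedy: the dot takes every character up to the end of the line.
  let end = start + 1;
  while (end < text.length && text[end] !== '\n') end++;

  // Backtrack: give characters back one by one until the closing quote fits.
  for (; end > start; end--) {
    const next = text[end]; // what the closing quote is tested against
    if (next === '"') {
      steps.push('quote matches at index ' + end);
      return { steps, match: text.slice(start, end + 1) };
    }
    const label = next === undefined ? 'end of text' : "'" + next + "'";
    steps.push('quote vs ' + label + ': fail, shorten by one');
  }
  return { steps, match: null };
}

const greedyText = 'a "witch" and her "broom" is one';
const greedyTrace = traceGreedy(greedyText, 2);
console.log(greedyTrace.steps.join('\n'));
console.log(greedyTrace.match); // "witch" and her "broom"
```

The first logged failures are against the end of the text, then 'e', then 'n', in the same order as the walkthrough.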
In greedy (default) mode, the regexp engine repeats a quantifier as many times as possible.
The lazy mode of quantifiers is the counterpart to greedy mode. It is enabled by putting a question mark '?' after the quantifier, so that it becomes *? or +? or even ?? for '?'.
The example below works correctly:
showMatch( 'a "witch" and her "broom" is one', /".*?"/g ) // "witch", "broom"
To get a grasp of lazy mode, let’s go through how the search for ".*?" works.
- The first step is the same: the quote '"' is found at the 3rd position.
- The second step is also the same: the dot '.' matches 'w'.
- Now comes the main difference from greedy mode. The regexp engine aims to repeat the dot as few times as possible, so it tries to match the rest of the pattern, the quote '"', right away.
No, there is no match. Here we have 'i' != '"', but a much more complicated regexp part could follow; the algorithm is the same. If there is no match, the engine goes further.
- The engine adds one more dot repetition and retries.
No match again. The engine adds one more repetition, and so on…
- Only with the 5th repetition can the engine finally match the rest of the pattern, giving the first result "witch".
- Because the global flag is on, the engine continues the search in the text after the match, giving the second result "broom".
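The lazy walkthrough can be replayed the same way. Again this is only a sketch for the specific pattern ".*?"; the name traceLazy and the step messages are hypothetical. Strictly speaking, because * also allows zero repetitions, the sketch tests the closing quote against 'w' before the dot takes its first character; the walkthrough above skips that attempt.

```javascript
// Hypothetical simulation of the lazy pattern ".*?" at one start position.
function traceLazy(text, start) {
  const steps = [];
  if (text[start] !== '"') return { steps, match: null };

  // Lazy: test the closing quote first; the dot takes a character only on failure.
  for (let end = start + 1; end < text.length; end++) {
    const next = text[end];
    if (next === '"') {
      steps.push('quote matches after ' + (end - start - 1) + ' dot repetitions');
      return { steps, match: text.slice(start, end + 1) };
    }
    if (next === '\n') break; // the dot cannot take a newline
    steps.push("quote vs '" + next + "': fail, the dot takes it");
  }
  return { steps, match: null };
}

const lazyText = 'a "witch" and her "broom" is one';
const firstLazy = traceLazy(lazyText, 2);   // starts at the first quote
const secondLazy = traceLazy(lazyText, 18); // starts at the third quote
console.log(firstLazy.steps.join('\n'));
console.log(firstLazy.match);  // "witch"
console.log(secondLazy.match); // "broom"
```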
In lazy mode the engine tries to repeat as little as possible.
The example above demonstrated this with the quantifier on the dot '.'. It works similarly with the question mark quantifier '?'.
By default, the engine tries to match as many repetitions as it can, so the default (greedy) match below returns 1a:
showMatch( " item 1a ", /\d\w?/ ) // 1a
But let’s switch the quantifier '?' to lazy mode by adding a second question mark:
showMatch( " item 1a ", /\d\w??/ ) // 1
Now the result is 1, because
\w is repeated as little as possible (0 times).
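A quick check with the standard match method (console.log stands in for showMatch). The third pattern, with a trailing space, is an extra illustration that is not part of the original example: it shows that a lazy quantifier still takes a repetition when the rest of the pattern cannot match otherwise.

```javascript
const item = ' item 1a ';

const greedyOptional = item.match(/\d\w?/)[0];  // \w? takes 'a' if it can
const lazyOptional = item.match(/\d\w??/)[0];   // \w?? prefers zero repetitions
const lazyForced = item.match(/\d\w?? /)[0];    // the space forces \w?? to take 'a'

console.log(greedyOptional); // 1a
console.log(lazyOptional);   // 1
console.log(JSON.stringify(lazyForced)); // "1a "
```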
The lazy switch is per-quantifier. In the regexp, there may be both greedy and lazy quantifiers.
showMatch( "123 456", /\d+ \d+?/g ) // 123 4
- The subpattern \d+ tries to match as many digits as possible, so it finds 123 and stops, because the space ' ' does not match \d.
- Then the space is matched and \d+? comes into play. It matches one digit, '4', and tries to match the rest of the pattern (after \d+?). Because there is nothing after it, the search is finished and 123 4 is the result.
- The search for new results is continued starting with
'5', but doesn’t give anything.
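The same steps can be checked with the standard match method. The two extra patterns are added here only for comparison: one is fully greedy, and one makes the first quantifier lazy too, which changes nothing because the first \d+? must still reach the space.

```javascript
const numbers = '123 456';

const mixed = numbers.match(/\d+ \d+?/g);     // greedy, then lazy
const allGreedy = numbers.match(/\d+ \d+/g);  // both greedy
const allLazy = numbers.match(/\d+? \d+?/g);  // both lazy

console.log(mixed);     // [ '123 4' ]
console.log(allGreedy); // [ '123 456' ]
console.log(allLazy);   // [ '123 4' ]
```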
In our particular case, it is possible to match quoted strings, and still remain in greedy mode:
showMatch( 'a "witch" and her "broom" is one', /"[^"]*"/g )
The regexp "[^"]*" gives two correct results, because it looks for a quote '"' followed by as many non-quotes as possible. So the second quote stops [^"]* and lets the closing quote match.
Usually, the lazy approach has a wider range of application and gives a more readable pattern.
Are "[^"]*" and ".*?" the same?
In other words, is it possible that they give different results? If yes, then provide the test string.
In general, yes, they are almost the same. Both match from one quote to the other.
But the gotcha is the dot symbol '.'. Remember, the dot '.' is any character except a newline, while [^"] is any character except a quote '"'.
So the first regexp "[^"]*" will match quoted strings with newlines, but the second regexp ".*?" will not.
showMatch( ' "multiline \n string " ', /"[^"]*"/g )
showMatch( ' "multiline \n string " ', /".*?"/g )
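Here is the same comparison with the standard match method, so the results are visible without the showMatch helper. The last line is an addition that assumes an engine supporting the s (dotAll) flag from ES2018, which makes the dot match newlines as well.

```javascript
const multiline = ' "multiline \n string " ';

const withClass = multiline.match(/"[^"]*"/g);  // [^"] accepts the newline
const withLazyDot = multiline.match(/".*?"/g);  // the dot stops at the newline
const withDotAll = multiline.match(/".*?"/gs);  // assumption: the s flag is available

console.log(withClass.length); // 1
console.log(withLazyDot);      // null
console.log(withDotAll[0] === withClass[0]); // true
```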
Find all HTML comments in the text:
str = '.. <!-- My -- comment \n test --> .. <!----> .. '
re = /.. your regexp ../
str.match(re) // '<!-- My -- comment \n test -->', '<!---->'
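We need to match everything from the comment start <!-- to the comment end -->. For the match to stop at the first comment end, the quantifier should be lazy.
So, at first glance the solution for the task is <!--.*?-->. But the dot does not match a newline and a comment may span several lines, so we use [\s\S] in place of the dot: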
str = '.. <!-- Welcome -- comment \n test --> .. <!----> .. '
re = /<!--[\s\S]*?-->/g
alert( str.match(re) )
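To see why both the lazy quantifier and [\s\S] matter here, the sketch below compares three variants on the string from the task. The variable names are illustrative, and console.log replaces alert so that it also runs outside a browser.

```javascript
const html = '.. <!-- My -- comment \n test --> .. <!----> .. ';

// Lazy + [\s\S]: each comment is matched separately, newlines included.
const lazyComments = html.match(/<!--[\s\S]*?-->/g);

// Greedy + [\s\S]: one match from the first <!-- to the last -->.
const greedyComments = html.match(/<!--[\s\S]*-->/g);

// Lazy + dot: the multiline comment is missed, only the empty one is found.
const dotComments = html.match(/<!--.*?-->/g);

console.log(lazyComments.length);   // 2
console.log(greedyComments.length); // 1
console.log(dotComments);           // [ '<!---->' ]
```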
Create a regexp to match opening HTML tags with attributes:
var re = /* your regexp */
var str = '<> <a href="/"> <input type="radio" checked> <b>'
alert(str.match(re)) // '<a href="/">', '<input type="radio" checked>', '<b>'
P.S. We assume that there may be no > inside a tag, and that <> is not a tag.
First, we start with <, then we could add .+?> to match any chars until >. So the regexp is <.+?>.
Let’s see how it works:
var re = /<.+?>/g
var str = '<> <a href="/"> <input type="radio" checked> <b>'
alert(str.match(re)) // '<> <a href="/">', '<input type="radio" checked>', '<b>'
Wrong! The first match is <> <a href="/">, because .+? means “any character (except a newline), repeated one or more times until the rest of the pattern matches (laziness)”.
So, here’s what it does step by step:
- Match the first pattern symbol <.
- Start matching .+? lazily. Match any character one time: here it is >.
- Because the quantifier is lazy, try to match the rest of the pattern, the closing >. The next character is a space, so it fails.
- Repeat the quantifier and try to match the rest of the pattern again.
- Repeat the quantifier until the match is found.
- We’ve got the result: <> <a href="/">.
Because the regexp is global, the new search starts right after the match.
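To see where each search starts and what it returns, the sketch below lists the matches together with their start index, using the standard matchAll method (console.log replaces alert). The names are illustrative only.

```javascript
const tags = '<> <a href="/"> <input type="radio" checked> <b>';

// Collect every match of the flawed pattern together with its position.
const lazyTagMatches = [...tags.matchAll(/<.+?>/g)].map(m => ({
  index: m.index,
  text: m[0],
}));

for (const m of lazyTagMatches) {
  console.log(m.index, m.text);
}
// The first match starts at index 0 and swallows "<>" together with the <a> tag;
// the later searches start after it and find the remaining two tags.
```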
So, the right solution is <[^>]+>. It won’t match <>, because [^>]+ needs at least one character other than > between the brackets.
var re = /<[^>]+>/g
var str = '<> <a href="/"> <input type="radio" checked> <b>'
alert(str.match(re)) // '<a href="/">', '<input type="radio" checked>', '<b>'
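A small check of the edge case, as a sketch: the variant with * instead of + is not part of the solution and is shown only to make clear why + is needed to reject <>.

```javascript
const tagPlus = /<[^>]+>/; // at least one character between the brackets
const tagStar = /<[^>]*>/; // hypothetical variant that allows zero characters

console.log(tagPlus.test('<>')); // false
console.log(tagStar.test('<>')); // true
console.log('<> <b>'.match(/<[^>]+>/g)); // [ '<b>' ]
```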