Quantifiers are very simple from the first sight, but in fact they can be tricky.
We should understand how the search works very well if we plan to look for something more complex than
Let’s take the following task as an example.
We have a text and need to replace all quotes
"..." with guillemet marks:
«...». They are preferred for typography in many countries.
"Hello, world" should become
Some countries prefer
„Witam, świat!” (Polish) or even
「你好，世界」 (Chinese) quotes. For different locales we can choose different replacements, but that all works the same, so let’s start with
To make replacements we first need to find all quoted substrings.
The regular expression can look like this:
/".+"/g. That is: we look for a quote followed by one or more characters, and then another quote.
…But if we try to apply it, even in such a simple case…
…We can see that it works not as intended!
Instead of finding two matches
"broom", it finds one:
"witch" and her "broom".
That can be described as “greediness is the cause of all evil”.
To find a match, the regular expression engine uses the following algorithm:
- For every position in the string
- Match the pattern at that position.
- If there’s no match, go to the next position.
These common words do not make it obvious why the regexp fails, so let’s elaborate how the search works for the pattern
The first pattern character is a quote
The regular expression engine tries to find it at the zero position of the source string
a "witch" and her "broom" is one, but there’s
athere, so there’s immediately no match.
Then it advances: goes to the next positions in the source string and tries to find the first character of the pattern there, and finally finds the quote at the 3rd position:
The quote is detected, and then the engine tries to find a match for the rest of the pattern. It tries to see if the rest of the subject string conforms to
In our case the next pattern character is
.(a dot). It denotes “any character except a newline”, so the next string letter
Then the dot repeats because of the quantifier
.+. The regular expression engine builds the match by taking characters one by one while it is possible.
…When it becomes impossible? All characters match the dot, so it only stops when it reaches the end of the string:
Now the engine finished repeating for
.+and tries to find the next character of the pattern. It’s the quote
". But there’s a problem: the string has finished, there are no more characters!
The regular expression engine understands that it took too many
.+and starts to backtrack.
In other words, it shortens the match for the quantifier by one character:
Now it assumes that
.+ends one character before the end and tries to match the rest of the pattern from that position.
If there were a quote there, then that would be the end, but the last character is
'e', so there’s no match.
…So the engine decreases the number of repetitions of
.+by one more character:
'"'does not match
The engine keep backtracking: it decreases the count of repetition for
'.'until the rest of the pattern (in our case
The match is complete.
So the first match is
"witch" and her "broom". The further search starts where the first match ends, but there are no more quotes in the rest of the string
is one, so no more results.
That’s probably not what we expected, but that’s how it works.
In the greedy mode (by default) the quantifier is repeated as many times as possible.
The regexp engine tries to fetch as many characters as it can by
.+, and then shortens that one by one.
For our task we want another thing. That’s what the lazy quantifier mode is for.
The lazy mode of quantifier is an opposite to the greedy mode. It means: “repeat minimal number of times”.
We can enable it by putting a question mark
'?' after the quantifier, so that it becomes
+? or even
To make things clear: usually a question mark
? is a quantifier by itself (zero or one), but if added after another quantifier (or even itself) it gets another meaning – it switches the matching mode from greedy to lazy.
/".+?"/g works as intended: it finds
To clearly understand the change, let’s trace the search step by step.
The first step is the same: it finds the pattern start
'"'at the 3rd position:
The next step is also similar: the engine finds a match for the dot
And now the search goes differently. Because we have a lazy mode for
+?, the engine doesn’t try to match a dot one more time, but stops and tries to match the rest of the pattern
If there were a quote there, then the search would end, but there’s
'i', so there’s no match.
Then the regular expression engine increases the number of repetitions for the dot and tries one more time:
Failure again. Then the number of repetitions is increased again and again…
…Till the match for the rest of the pattern is found:
The next search starts from the end of the current match and yield one more result:
In this example we saw how the lazy mode works for
?? work the similar way – the regexp engine increases the number of repetitions only if the rest of the pattern can’t match on the given position.
Laziness is only enabled for the quantifier with
Other quantifiers remain greedy.
\d+tries to match as many numbers as it can (greedy mode), so it finds
123and stops, because the next character is a space
Then there’s a space in pattern, it matches.
\d+?. The quantifier is in lazy mode, so it finds one digit
4and tries to check if the rest of the pattern matches from there.
…But there’s nothing in the pattern after
The lazy mode doesn’t repeat anything without a need. The pattern finished, so we’re done. We have a match
The next search starts from the character
Modern regular expression engines can optimize internal algorithms to work faster. So they may work a bit different from the described algorithm.
But to understand how regular expressions work and to build regular expressions, we don’t need to know about that. They are only used internally to optimize things.
Complex regular expressions are hard to optimize, so the search may work exactly as described as well.
With regexps, there’s often more then one way to do the same thing.
In our case we can find quoted strings without lazy mode using the regexp
"[^"]+" gives correct results, because it looks for a quote
'"' followed by one or more non-quotes
[^"], and then the closing quote.
When the regexp engine looks for
[^"]+ it stops the repetitions when it meets the closing quote, and we’re done.
Please note, that this logic does not replace lazy quantifiers!
It is just different. There are times when we need one or another.
Let’s see one more example where lazy quantifiers fail and this variant works right.
For instance, we want to find links of the form
<a href="..." class="doc">, with any
Which regular expression to use?
The first idea might be:
/<a href=".*" class="doc">/g.
Let’s check it:
…But what if there are many links in the text?
Now the result is wrong for the same reason as our “witches” example. The quantifier
.* took too many characters.
The match looks like this:
<a href="....................................." class="doc"> <a href="link1" class="doc">... <a href="link2" class="doc">
Let’s modify the pattern by making the quantifier
Now it works, there are two matches:
<a href="....." class="doc"> <a href="....." class="doc"> <a href="link1" class="doc">... <a href="link2" class="doc">
Why it works – should be obvious after all explanations above. So let’s not stop on the details, but try one more text:
We can see that the regexp matched not just a link, but also a lot of text after it, including
Why it happens?
First the regexp finds a link start
Then it looks for
.*?, we take one character, then check if there’s a match for the rest of the pattern, then take another one…
.*?consumes characters until it meets
…And where can it find it? If we look at the text, then we can see that the only
class="doc">is beyond the link, in the tag
So we have match:
<a href="..................................." class="doc"> <a href="link1" class="wrong">... <p style="" class="doc">
So the laziness did not work for us here.
We need the pattern to look for
<a href="...something..." class="doc">, but both greedy and lazy variants have problems.
The correct variant would be:
href="[^"]*". It will take all characters inside the
href attribute till the nearest quote, just what we need.
A working example:
let str1 = '...<a href="link1" class="wrong">... <p style="" class="doc">...'; let str2 = '...<a href="link1" class="doc">... <a href="link2" class="doc">...'; let reg = /<a href="[^"]*" class="doc">/g; // Works! alert( str1.match(reg) ); // null, no matches, that's correct alert( str2.match(reg) ); // <a href="link1" class="doc">, <a href="link2" class="doc">
Quantifiers have two modes of work:
- By default the regular expression engine tries to repeat the quantifier as many times as possible. For instance,
\d+consumes all possible digits. When it becomes impossible to consume more (no more digits or string end), then it continues to match the rest of the pattern. If there’s no match then it decreases the number of repetitions (backtracks) and tries again.
- Enabled by the question mark
?after the quantifier. The regexp engine tries to match the rest of the pattern before each repetition of the quantifier.
As we’ve seen, the lazy mode is not a “panacea” from the greedy search. An alternative is a “fine-tuned” greedy search, with exclusions. Soon we’ll see more examples of it.