For instance, try the example below:
alert( '123456789012345678901234567890z'.match(/^(\d+)*$/) )
Depending on your browser, it may either give error (“Regexp is too complex” in Firefox), or hang up (Chrome, IE) or execute if the regexp engine is smart (Opera).
What’s so bad in the regexp? Actually, nothing. The bad is how the engine matches it. The regular exception is inherently simple, so we can see what’s going on.
To make a long story short, we take a shorter string:
- First, the engine starts to match
+is greedy, so it matches as far as it can.
So the greedy
+has to release a character. The regexp backtracks (green arrow).
- After backtracking, there’s
9on the way.
It is matched as the second instance of
\d+(remember, we have
As it doesn’t help, the regexp engine has to backtrack once more.
- The initial
\d+match now shortens to 8 digits. The rest is matched as next
Still, no match for
$. Now, *the last
\d+backtracks and shortens.
- The rest of the pattern is matched as 3rd instance of
Didn’t help again. Second and third
\d+backtracked to their minimum, so the first instance shortens.
- Now there are 7 charaters for first
\d+. The regexp engine matches the rest as another
There’s no match, so the second
- … Ultimately, it is easy to see that the engine will go through all possible combinations of
\d+in the number. That’s kind of a lot.
A swift-minded person may say: “Backtracking? Let’s be lazy and there will be no backtracking!”.
\d+? and see:
alert( '123456789012345678901234567890z'.match(/^(\d+?)*$/) )
There’s no effect. Lazy regexps will do the same, but in reverse order. Think about how it goes.
Backtracking may lead to very large number of tested cases, as in the example above.
The symphoms are either very slow or failing regexps. Such regexp may pop out surprizingly, profiling helps to detect it.
To tell the truth, not all regexp engines behave like that. Optimizations and heuristics are applicable. For example, Opera works well here. But their code is closed, so probably there are other cases when their improvements don’t work.
The task is to validate if the text consists of numbers delimited by double colons
'::'. There may be no number between double colons. Single colons are not alowed.
'', '::', '123::', '::123', '123', '::::', '::1234::5678901234567890::::1234567890::888'
':', '123:', ':::', ':::123'
A programmer used regexp
/^(\d+|::)*$/ to solve the task. But it takes too long or doesn’t work on the text below:
var str = '::1234::5678901234567890::::1234567890::888:' alert( str.match(/^(\d+|::)*$/) ) // should return null
Explain, why? Give a pattern which works.
P.S. The example is from an article in Perl Journal by Jeffrey Friedl. Nice reading btw.
The first half is “Why it happens”.
The reason is, of course, backtracking. If the string is valid, no backtracking occurs:
var str = '::1234::5678901234567890::::1234567890::888' alert( str.match(/^(\d+|::)*$/) ) // should return null
The backtracking happens if there’s no match at the last char, like below:
var str = '::1234::5678901234567890::::1234567890::888:' alert( str.match(/^(\d+|::)*$/) )
The regexp engine backtracks and tries every possible combination of multiple
\d+ in numbers composing the string.
In short, there’s no difference between:
var str = '::1234::5678901234567890::::1234567890::888:' str.match(/^(\d+|::)*$/)
…And the following simpler example:
var str = '123456789012345678901234567890888' str.match(/^(\d+)*$/) )
In both cases, the infinite backtracking occurs because of
Let’s try to shorten the string:
var str = '::1234567890::888:' alert( str.match(/^(\d+|::)*$/) ) // now works
The shorter string works, because there are much less combinations in it.
What’s the correct variant?
The pattern below has another strategy of matching.
var str = '::1234::5678901234567890::::1234567890::888:' alert( str.match(/^(\d*::)*\d*$/) )
Here we match as many numbers
\d followed by
'::' as possible. The numbers are optional, that’s why
We have to add an optional number
\d at the end, because otherwise, for
/^(\d::)/, the text would have to end with
This pattern also backtracks, of course, because there is a greedy quantifier inside. But the backtracking is much simpler and faster now. Think how it goes in your free time.