Infinite backtracking problem

The backtracking nature of regular expressions in JavaScript (and most other languages which use same type of regexp processing) may lead extremely long or almost infinite searching time.

For instance, try the example below:

alert( '123456789012345678901234567890z'.match(/^(\d+)*$/) )

Depending on your browser, it may either give error (“Regexp is too complex” in Firefox), or hang up (Chrome, IE) or execute if the regexp engine is smart (Opera).

What’s so bad in the regexp? Actually, nothing. The bad is how the engine matches it. The regular exception is inherently simple, so we can see what’s going on.

To make a long story short, we take a shorter string: 1234567890z:

  1. First, the engine starts to match \d+. That + is greedy, so it matches as far as it can.

    Then position $ doesn’t match.

    So the greedy + has to release a character. The regexp backtracks (green arrow).

  2. After backtracking, there’s 9 on the way.

    It is matched as the second instance of \d+ (remember, we have (\d+)):

    As it doesn’t help, the regexp engine has to backtrack once more.

  3. The initial \d+ match now shortens to 8 digits. The rest is matched as next \d+:

    Still, no match for $. Now, *the last \d+ backtracks and shortens.

  4. The rest of the pattern is matched as 3rd instance of \d+:

    Didn’t help again. Second and third \d+ backtracked to their minimum, so the first instance shortens.

  5. Now there are 7 charaters for first \d+. The regexp engine matches the rest as another \d+

    There’s no match, so the second \d+ steps back….

  6. … Ultimately, it is easy to see that the engine will go through all possible combinations of \d+ in the number. That’s kind of a lot.

A swift-minded person may say: “Backtracking? Let’s be lazy and there will be no backtracking!”.

Let’s repace \d+ with \d+? and see:

alert( '123456789012345678901234567890z'.match(/^(\d+?)*$/) )

There’s no effect. Lazy regexps will do the same, but in reverse order. Think about how it goes.

Backtracking may lead to very large number of tested cases, as in the example above.

The symphoms are either very slow or failing regexps. Such regexp may pop out surprizingly, profiling helps to detect it.

To tell the truth, not all regexp engines behave like that. Optimizations and heuristics are applicable. For example, Opera works well here. But their code is closed, so probably there are other cases when their improvements don’t work.

The task is to validate if the text consists of numbers delimited by double colons '::'. There may be no number between double colons. Single colons are not alowed.

Valid samples:

'', '::', '123::', '::123', '123', '::::',
'::1234::5678901234567890::::1234567890::888'

Invalid samples:

':', '123:', ':::', ':::123'

A programmer used regexp /^(\d+|::)*$/ to solve the task. But it takes too long or doesn’t work on the text below:

var str = '::1234::5678901234567890::::1234567890::888:'

alert( str.match(/^(\d+|::)*$/) ) // should return null

Explain, why? Give a pattern which works.

P.S. The example is from an article in Perl Journal by Jeffrey Friedl. Nice reading btw.

Open hint 1
Hint 1
Open solution
Solution

The first half is “Why it happens”.

The reason is, of course, backtracking. If the string is valid, no backtracking occurs:

var str = '::1234::5678901234567890::::1234567890::888'

alert( str.match(/^(\d+|::)*$/) ) // should return null

The backtracking happens if there’s no match at the last char, like below:

var str = '::1234::5678901234567890::::1234567890::888:'

alert( str.match(/^(\d+|::)*$/) )

The regexp engine backtracks and tries every possible combination of multiple \d+ in numbers composing the string.

In short, there’s no difference between:

var str = '::1234::5678901234567890::::1234567890::888:'
str.match(/^(\d+|::)*$/)

…And the following simpler example:
var str = '123456789012345678901234567890888'
str.match(/^(\d+)*$/) )

In both cases, the infinite backtracking occurs because of (\d+).

Let’s try to shorten the string:

var str = '::1234567890::888:'

alert( str.match(/^(\d+|::)*$/) )  // now works

The shorter string works, because there are much less combinations in it.

What’s the correct variant?

The pattern below has another strategy of matching.

var str = '::1234::5678901234567890::::1234567890::888:'

alert( str.match(/^(\d*::)*\d*$/) )

Here we match as many numbers \d followed by '::' as possible. The numbers are optional, that’s why .

We have to add an optional number \d at the end, because otherwise, for /^(\d::)/, the text would have to end with ::.

This pattern also backtracks, of course, because there is a greedy quantifier inside. But the backtracking is much simpler and faster now. Think how it goes in your free time.

In more advanced regexp engines, there are other ways to deal with backtracking, namely atomic groups and possessive quantifiers.

Tutorial

Donate

Donate to this project