Groups

A part of the regular expression can be grouped together in brackets (...).

A quantifier placed after a group applies to the whole group instead of just one character.

In the next example, pattern (go)+ matches one or more 'go':

showMatch( 'Gogogo now!', /(go)+/i )  // "Gogogo"

Without brackets, /go+/ would mean g, followed by one or more o, like goooo.
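
To see the difference side by side, the sketch below runs both patterns on the same strings. It uses plain console.log instead of showMatch, and the second sample string is made up for illustration.

```javascript
// Sample strings are for illustration only.
var grouped = /(go)+/i;   // the quantifier applies to the whole group "go"
var ungrouped = /go+/i;   // the quantifier applies to the single char "o"

var a = 'Gogogo now!'.match(grouped)[0];
var b = 'Gogogo now!'.match(ungrouped)[0];
var c = 'Goooo now!'.match(grouped)[0];
var d = 'Goooo now!'.match(ungrouped)[0];

console.log(a); // "Gogogo"
console.log(b); // "Go"
console.log(c); // "Go"
console.log(d); // "Goooo"
```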

Write a regexp to describe a web-color, which starts with ‘#’ followed by 3 or 6 hexadecimal chars.

var re = /*...your global regexp...*/

var subj = "color: #3f3; background-color: #AA00ef"

alert( subj.match(re) )  // #3f3,#AA00ef

Solution

The regular expression for 3-char colors is: /#[a-f0-9]{3}/i.

To make 6 characters also possible, we allow the color code part to repeat 1 or 2 times:
/#([a-f0-9]{3}){1,2}/i.

We have to put brackets around [a-f0-9]{3}, because the quantifier {1,2} must apply to this whole structure as a group.

The regexp above can be rephrased as a longer, but maybe simpler variant:
/#[a-f0-9]{3}([a-f0-9]{3})?/i

It reads as a 3-character color, optionally followed by 3 more characters.

Finally, the test:

var re = /#([a-f0-9]{3}){1,2}/gi

var subj = "color: #3f3; background-color: #AA00ef"

alert( subj.match(re) )  // #3f3,#AA00ef
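
As a quick check that the two variants from the solution agree, the sketch below runs both on the same subject string. The variable names reRepeat and reOptional are made up for this comparison.

```javascript
// reRepeat and reOptional are illustrative names for the two variants above.
var reRepeat = /#([a-f0-9]{3}){1,2}/gi;
var reOptional = /#[a-f0-9]{3}([a-f0-9]{3})?/gi;

var subj = "color: #3f3; background-color: #AA00ef";

var viaRepeat = subj.match(reRepeat);
var viaOptional = subj.match(reOptional);

console.log(viaRepeat.join(','));   // #3f3,#AA00ef
console.log(viaOptional.join(','));  // #3f3,#AA00ef
```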

The brackets are numbered left-to-right. The regexp engine captures the contents of each pair of brackets and makes it accessible in the result.

For example, an HTML tag can be matched (approximately) as <.*?>. To get the contents of the tag, we can enclose it in brackets: <(.*?)>

Method str.match returns different results for global and non-global regexps.

  • For a global regexp (with "g" flag) - it returns the array of matches. Groups are not returned:

    res = '<span> <p>'.match( /<(.*?)>/g ) 
    
    alert(res)   // [ '<span>', '<p>' ], full matches
    

  • For a non-global regexp - it finds the first match and returns an array: the full match becomes array item at index 0, the first group - at index 1, and so on.

    The following example searches for a tag, and captures its contents:

    var str = 'tag: <span>'
    var re = /<(.*?)>/ 
    
    alert( str.match(re) ) // [ '<span>', 'span' ], 1st match with groups
    

To find all matches with groups, one should use the re.exec method. We cover it in the next section.
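
Until then, here is a minimal sketch of the idea, assuming the standard behavior of exec with the "g" flag: each call continues from re.lastIndex and returns null when no matches are left. The found array and its field names are made up for illustration.

```javascript
var re = /<(.*?)>/g;
var str = '<span> <p>';

var found = []; // illustrative container for the results
var m;

// Each exec call returns the next match together with its groups,
// or null when the string is exhausted.
while ((m = re.exec(str)) !== null) {
  found.push({ full: m[0], group: m[1] });
}

console.log(found.length);   // 2
console.log(found[0].group); // "span"
console.log(found[1].group); // "p"
```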

In the example below, there are nested brackets.

res = ' <span class="my"> '.match( /<(([a-z])[a-z0-9]*).*?>/ ) 

alert(res)   // [ '<span class="my">', 'span', 's' ]

The whole match is stored as item 0 of the resulting array. Nested brackets are numbered left-to-right as usual, nothing changes here. Because they are nested, group 1 contains group 2, that’s all.

Even if a group is optional and doesn’t match anything, the corresponding array item exists (and is undefined). The array always has the same number of elements.

For example, a(z)?(c)? finds ac in the string back, but the group (z)? matches nothing:

match = 'back'.match( /a(z)?(c)?/ )  

alert(match) // ['ac', undefined, 'c']

In the example above:

  1. match[0] = 'ac', the whole pattern,
  2. match[1] = undefined, because the optional group (z)? did not match anything,
  3. match[2] = 'c', for the 2nd group (c)?.
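
The fixed length can be checked directly. In the sketch below, the second string 'bazc' is a made-up input where both optional groups do match.

```javascript
var re = /a(z)?(c)?/;

var m1 = 'back'.match(re); // (z)? matches nothing here
var m2 = 'bazc'.match(re); // both optional groups match

console.log(m1.length);           // 3
console.log(m2.length);           // 3
console.log(m1[1] === undefined); // true
console.log(m2[1], m2[2]);        // z c
```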

Another example of group usage is finding a tag together with its name and attributes.

str = 'tag <a href="...">link</a>'

match = str.match( /<([a-z]\w*)\s+(.*?)>/i )
 
alert(match)   // [ '<a href="...">', 'a', 'href="..."' ]

There are three values in the array:

  1. match[0] is the whole pattern match.
  2. match[1] is the 1st group, the tag name.
    It corresponds to ([a-z]\w*).
  3. match[2] is the rest of the tag: (.*?). It follows the whitespace \s+ and, because of the lazy quantifier, ends at the first '>'.
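
In practice the groups are usually read into named variables, with a check for the no-match case. The helper name describeTag below is hypothetical. Note that this pattern requires whitespace after the tag name, so a tag without attributes gives no match.

```javascript
// describeTag is a hypothetical helper, not part of the tutorial.
function describeTag(str) {
  var match = str.match(/<([a-z]\w*)\s+(.*?)>/i);
  if (!match) return null; // str.match returns null when nothing is found
  return { name: match[1], attributes: match[2] };
}

var withAttrs = describeTag('tag <a href="...">link</a>');
var withoutAttrs = describeTag('<b>bold</b>');

console.log(withAttrs.name);       // a
console.log(withAttrs.attributes); // href="..."
console.log(withoutAttrs);         // null, because \s+ needs at least one space
```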

A group can be excluded from capturing and numbering. To do so, put ?: right after its opening bracket:

alert( 'abc'.match( /(a)(b)(c)/ ) ) // ['abc', 'a', 'b', 'c'] 

alert( 'abc'.match( /(a)(?:b)(c)/ ) ) // ['abc', 'a', 'c']

There are two reasons why a group may be excluded:

  1. First, it improves performance, because the regexp engine does not have to remember the group contents. Most of the time, the gain is not worth the burden of two more chars '?:' in the pattern.
  2. The group may be used for syntax purposes only, to apply a quantifier. In this case we may want to omit it from the result, which keeps the array cleaner.
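
The second reason can be illustrated with the (go)+ pattern from the beginning of the section. A minimal sketch, comparing the result arrays of the capturing and the non-capturing form:

```javascript
var capturing = 'Gogogo now!'.match(/(go)+/i);
var nonCapturing = 'Gogogo now!'.match(/(?:go)+/i);

console.log(capturing.length);    // 2: the full match plus one group
console.log(capturing[1]);        // "go", the last repetition of the group
console.log(nonCapturing.length); // 1: only the full match
console.log(nonCapturing[0]);     // "Gogogo"
```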

Parse an arithmetic expression of 2 numbers and an operation: one of “+”, “-”, “*”, “/” and “^” (power).

The result should be an array of: the first argument, the operation, the second argument.

For example,

"12 + 1" /* your code => */ '12', '+', '1'
"5.1*2" /* your code => */ '5.1', '*', '2'

The expression may contain optional spaces.

Solution

The regexp is a number, followed by an operation and one more number.

A number including the optional decimal part is \d+(\.\d+)?.

An operation is [-+*/^]. Here, hyphen '-' and caret '^' are not escaped, because they are not special at these positions inside [...]. When we put it into a literal JavaScript regexp, slash '/' needs to be escaped.

The full regexp is \d+(\.\d+)?\s*[-+*/^]\s*\d+(\.\d+)?. Optional whitespace \s* is allowed between the operation and the numbers.

To capture each number and the operation, let’s add brackets around them:
(\d+(\.\d+)?)\s*([-+*/^])\s*(\d+(\.\d+)?).

Let’s see it in action:

var re = /(\d+(\.\d+)?)\s*([-+*\/^])\s*(\d+(\.\d+)?)/

var result = "12 + 1".match(re) 
alert(result)

The resulting array has the full match at index 0, followed by the groups:

  • result[0] == "12 + 1"
  • result[1] == "12"
  • result[2] == undefined
  • result[3] == "+"
  • result[4] == "1"
  • result[5] == undefined

Everything is fine, but we have extra undefineds. That’s because the optional decimal part (\.\d+)? is absent here, but still occupies an array item.

In fact, even if it exists, we don’t want to capture it. The whole number is enough. So, we put ?: at the start of the decimal part group:
(?:\.\d+)?

var re = /(\d+(?:\.\d+)?)\s*([-+*\/^])\s*(\d+(?:\.\d+)?)/

var result = "12 + 1".match(re) 
result.shift() // remove the full match, keep only the groups

alert(result) // 12,+,1
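
The solution can also be wrapped into a small function that handles the no-match case. This is a minimal sketch. The name parseExpression is hypothetical, and it uses slice(1) instead of shift() to drop the full match.

```javascript
// parseExpression is a hypothetical helper name, not part of the tutorial.
function parseExpression(expr) {
  var re = /(\d+(?:\.\d+)?)\s*([-+*\/^])\s*(\d+(?:\.\d+)?)/;
  var result = expr.match(re);
  if (!result) return null; // no arithmetic expression found
  return result.slice(1);   // drop the full match, keep the 3 groups
}

console.log(parseExpression("12 + 1")); // [ '12', '+', '1' ]
console.log(parseExpression("5.1*2"));  // [ '5.1', '*', '2' ]
console.log(parseExpression("hello"));  // null
```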

Groups can be backreferenced in the pattern.

For example, we’ve got to match a quoted string with its contents. A string can be either single-quoted '...' or double-quoted "...".

How to match both kinds of strings? The regexp ['"](.*?)['"] allows either kind of quote on each side, but it fails when the string contains a quote of the other kind inside. For the string "I'm the one" it gives the match "I' (and the group captures I).

str = " \"I'm the one\" "

reg = /['"](.*?)['"]/

alert( str.match(reg) ) // "I' , I

What we need is to make sure the closing quote is the same as the opening one. It can be done by making a group and back-referencing it with "\1":

str = " \"I'm the one\" "

reg = /(['"])(.*?)\1/

alert( str.match(reg) ) // "I'm the one" , ", I'm the one

The first group can be reused in the pattern as \1, the second group as \2, and so on.
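
A minimal sketch of the same regexp on a few made-up inputs: a single-quoted string, a double-quoted string with an apostrophe inside, and a string whose quotes do not pair up.

```javascript
var reg = /(['"])(.*?)\1/;

// Input strings are for illustration only.
var single = "say 'hi' now".match(reg);
var dbl = 'say "it\'s ok" now'.match(reg);
var mixed = "\"broken'".match(reg); // opening " has no matching closing "

console.log(single[0]); // 'hi'
console.log(dbl[0]);    // "it's ok"
console.log(dbl[2]);    // it's ok
console.log(mixed);     // null
```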

Create a regular expression re to match opening-closing HTML tag pairs from the text, including their contents.

The array of groups should contain the whole tag, its name and its contents, see below:

re = /* your regexp */

result = "the <b>bold</b>".match(re)

result[0] (whole tag) = '<b>bold</b>'
result[1] (name) = 'b'
result[2] (contents) = 'bold'

The solution should handle nested tags, like: the <b>bold <i>italic</i></b>. It should only match the outer tag in this case and ignore the nested one.

P.S. Use regexp <([a-z]\w*)> to match the opening tag.

Solution

So, the opening tag is <([a-z]\w*)>. Now we need to match the closing tag. We’ll use a backreference here.

The tag name is put into the group, so it can be referenced in closing tag:

str = "the <b>bold</b>"

re = /<([a-z]\w*)>(.*?)<\/\1>/

alert( str.match(re) )

In the regexp above, the backreference \1 refers to the first bracket group: ([a-z]\w*), which is the tag name.

Running the same regexp on the text with nested tags:

re = /<([a-z]\w*)>(.*?)<\/\1>/

result = "the <b>bold <i>italic</i></b>".match(re)

alert(result) 
// 0 => '<b>bold <i>italic</i></b>'
// 1 => 'b'
// 2 => 'bold <i>italic</i>'

It works.
