Capturing groups

A part of the pattern can be enclosed in parentheses (...). That’s called a “capturing group”.

That has two effects:

  1. It lets us get a part of the match as a separate array item when using the String#match or RegExp#exec methods.
  2. If we put a quantifier after the parentheses, it applies to the parentheses as a whole, not the last character.

Example

In the example below the pattern (go)+ finds one or more 'go':

alert( 'Gogogo now!'.match(/(go)+/i) ); // Gogogo,go (the full match, then the group)

Without parentheses, the pattern /go+/ means g, followed by o repeated one or more times. For instance, goooo or gooooooooo.

Parentheses group the word (go) together.
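
To see the difference side by side, here is a minimal sketch that runs both patterns against the same string. The sample text and variable names are made up for illustration, and console.log is used instead of alert so that it also runs outside a browser:

```javascript
const text = 'Gogogo gooo!';

// Without parentheses the quantifier applies only to the last character, "o".
const withoutGroup = text.match(/go+/gi);

// With parentheses the quantifier applies to the whole group "go".
const withGroup = text.match(/(go)+/gi);

console.log(withoutGroup); // [ 'Go', 'go', 'go', 'gooo' ]
console.log(withGroup);    // [ 'Gogogo', 'go' ]
```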

Let’s make something more complex – a regexp to match an email.

Examples of emails:

my@mail.com
john.smith@site.com.uk

The pattern: [-.\w]+@([\w-]+\.)+[\w-]{2,20}.

  • The first part before @ may include word characters, a dot and a dash: [-.\w]+, like john.smith.

  • Then @

  • And then the domain. It may be a second-level domain like site.com, or it may have subdomains, like host.site.com.uk. We can match it as “a word followed by a dot” repeated one or more times for subdomains: mail. or site.com., and then “a word” for the last part: com or uk.

    The word followed by a dot is (\w+\.)+ (repeated). The last word should not have a dot at the end, so it’s just \w{2,20}. The quantifier {2,20} limits the length: common domain zones such as .uk, .com or .museum are short, and the pattern assumes that none is longer than 20 characters.

    So the domain pattern is (\w+\.)+\w{2,20}. Now we replace \w with [\w-], because dashes are also allowed in domains, and we get the final result.
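
The effect of that last replacement can be checked with a small sketch. The host name my-site.com is an invented example, and only the domain part of the pattern is used here:

```javascript
const host = 'my-site.com';

// \w does not match a dash, so the match can only start after it.
const wordOnly = host.match(/(\w+\.)+\w{2,20}/)[0];

// [\w-] also accepts dashes, so the whole domain is matched.
const withDash = host.match(/([\w-]+\.)+[\w-]{2,20}/)[0];

console.log(wordOnly); // site.com
console.log(withDash); // my-site.com
```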

That regexp is not perfect, but it usually works. It’s short and good enough to catch errors or occasional mistypes.

For instance, here we can find all emails in the string:

let reg = /[-.\w]+@([\w-]+\.)+[\w-]{2,20}/g;

alert("my@mail.com @ his@site.com.uk".match(reg)); // my@mail.com,his@site.com.uk

Contents of parentheses

Parentheses are numbered from left to right. The search engine remembers the content of each group and lets us reference it in the pattern or in the replacement string.
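
As a brief illustration of both kinds of reference, the sketch below uses \1 inside a pattern and $1, $2 inside a replacement string. The sample strings are invented for the example:

```javascript
// \1 inside the pattern refers to whatever the first group matched,
// so this finds a word that is immediately repeated.
const repeated = 'hello hello world'.match(/(\w+) \1/);
console.log(repeated[0]); // hello hello

// $1 and $2 in the replacement string insert the contents of groups 1 and 2.
const swapped = 'John Smith'.replace(/(\w+) (\w+)/, '$2 $1');
console.log(swapped); // Smith John
```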

For instance, we can find an HTML tag using a (simplified) pattern <.*?>. Usually we’d want to do something with the result afterwards.

If we enclose the inner contents of <...> in parentheses, then we can access it like this:

let str = '<h1>Hello, world!</h1>';
let reg = /<(.*?)>/;

alert( str.match(reg) ); // Array: ["<h1>", "h1"]

The call to String#match returns groups only if the regexp has no g flag. With the flag, it returns only the full matches.

If we need all matches with their groups, then we can use the RegExp#exec method as described in Methods of RegExp and String:
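
A quick sketch of that difference, using the same tag pattern with and without the flag (the variable names are chosen for this example only):

```javascript
const html = '<h1>Hello, world!</h1>';

// No g flag: the first match plus its groups.
const single = html.match(/<(.*?)>/);
console.log(single[0], single[1]); // <h1> h1

// With the g flag: every full match, but no groups.
const all = html.match(/<(.*?)>/g);
console.log(all); // [ '<h1>', '</h1>' ]
```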

let str = '<h1>Hello, world!</h1>';

// two matches: opening <h1> and closing </h1> tags
let reg = /<(.*?)>/g;

let match;

while (match = reg.exec(str)) {
  // first shows the match: <h1>,h1
  // then shows the match: </h1>,/h1
  alert(match);
}

Here we have two matches for <(.*?)>. Each of them is an array with the full match and groups.
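
In environments that support ES2020, String#matchAll is another way to get every match together with its groups. It is not covered in this article, so treat the following as an optional sketch rather than the recommended approach:

```javascript
const markup = '<h1>Hello, world!</h1>';

// matchAll requires the g flag and returns an iterator of match arrays.
const tagNames = [];
for (const m of markup.matchAll(/<(.*?)>/g)) {
  tagNames.push(m[1]); // m[0] is the full match, m[1] is the first group
}

console.log(tagNames); // [ 'h1', '/h1' ]
```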

Nested groups

Parentheses can be nested. In this case the numbering also goes from left to right.

For instance, when searching for a tag in <span class="my"> we may be interested in:

  1. The tag content as a whole: span class="my".
  2. The tag name: span.
  3. The tag attributes: class="my".

Let’s add parentheses for them:

let str = '<span class="my">';

let reg = /<(([a-z]+)\s*([^>]*))>/;

let result = str.match(reg);
alert(result); // <span class="my">, span class="my", span, class="my"

Here’s how groups look:

At the zero index of the result is always the full match.

Then groups, numbered from left to right. Whichever opens first gives the first group result[1]. Here it encloses the whole tag content.

Then result[2] holds the group from the second opening ( to its corresponding ), which is the tag name. We don’t group the spaces, but we do group the attributes, which go into result[3].

If a group is optional and doesn’t exist in the match, the corresponding result index is present (and equals undefined).

For instance, let’s consider the regexp a(z)?(c)?. It looks for "a" optionally followed by "z" optionally followed by "c".

If we run it on the string with a single letter a, then the result is:

let match = 'a'.match(/a(z)?(c)?/);

alert( match.length ); // 3
alert( match[0] ); // a (whole match)
alert( match[1] ); // undefined
alert( match[2] ); // undefined

The array has the length of 3, but all groups are empty.

And here’s a more complex match for the string ack:

let match = 'ack'.match(/a(z)?(c)?/);

alert( match.length ); // 3
alert( match[0] ); // ac (whole match)
alert( match[1] ); // undefined, because there's nothing for (z)?
alert( match[2] ); // c

The array length is permanent: 3. But there’s nothing for the group (z)?, so the result is ["ac", undefined, "c"].

Non-capturing groups with ?:

Sometimes we need parentheses to correctly apply a quantifier, but we don’t want their contents in the array.

A group may be excluded by adding ?: in the beginning.

For instance, if we want to find (go)+, but don’t want to remember the contents (go) in a separate array item, we can write: (?:go)+.

In the example below we only get the name “John” as a separate member of the results array:

let str = "Gogo John!";
// exclude Gogo from capturing
let reg = /(?:go)+ (\w+)/i;

let result = str.match(reg);

alert( result.length ); // 2
alert( result[1] ); // John
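
For comparison, here is a sketch that runs the capturing and the non-capturing version of the pattern on the same string. The variable names are illustrative:

```javascript
const phrase = 'Gogo John!';

const capturing = phrase.match(/(go)+ (\w+)/i);
const nonCapturing = phrase.match(/(?:go)+ (\w+)/i);

// The capturing version reserves index 1 for (go), pushing the name to index 2.
console.log(capturing.length, capturing[1], capturing[2]); // 3 go John

// The non-capturing version leaves only the name as a group.
console.log(nonCapturing.length, nonCapturing[1]);         // 2 John
```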

Tasks

Write a regexp that matches colors in the format #abc or #abcdef. That is: # followed by 3 or 6 hexadecimal digits.

Usage example:

let reg = /your regexp/g;

let str = "color: #3f3; background-color: #AA00ef; and: #abcd";

alert( str.match(reg) ); // #3f3 #AA00ef

P.S. Should be exactly 3 or 6 hex digits: values like #abcd should not match.

A regexp to search 3-digit color #abc: /#[a-f0-9]{3}/i.

We can add exactly 3 more optional hex digits. We don’t need more or less. Either we have them or we don’t.

The simplest way to add them is to append them to the regexp: /#[a-f0-9]{3}([a-f0-9]{3})?/i

We can do it in a smarter way though: /#([a-f0-9]{3}){1,2}/i.

Here the regexp [a-f0-9]{3} is in parentheses to apply the quantifier {1,2} to it as a whole.

In action:

let reg = /#([a-f0-9]{3}){1,2}/gi;

let str = "color: #3f3; background-color: #AA00ef; and: #abcd";

alert( str.match(reg) ); // #3f3 #AA00ef #abc

There’s a minor problem here: the pattern found #abc in #abcd. To prevent that we can add \b to the end:

let reg = /#([a-f0-9]{3}){1,2}\b/gi;

let str = "color: #3f3; background-color: #AA00ef; and: #abcd";

alert( str.match(reg) ); // #3f3 #AA00ef

Create a regexp that looks for positive numbers, including those without a decimal point.

An example of use:

let reg = /your regexp/g;

let str = "1.5 0 12. 123.4.";

alert( str.match(reg) );   // 1.5, 0, 12, 123.4

An integer number is \d+.

A decimal part is: \.\d+.

Because the decimal part is optional, let’s put it in parentheses with quantifier '?'.

Finally we have the regexp: \d+(\.\d+)?:

let reg = /\d+(\.\d+)?/g;

let str = "1.5 0 12. 123.4.";

alert( str.match(reg) );   // 1.5, 0, 12, 123.4

Write a regexp that looks for all decimal numbers: integers, numbers with a decimal point, and negative ones.

An example of use:

let reg = /your regexp/g;

let str = "-1.5 0 2 -123.4.";

alert( str.match(reg) ); // -1.5, 0, 2, -123.4

A positive number with an optional decimal part is (per previous task): \d+(\.\d+)?.

Let’s add an optional - in the beginning:

let reg = /-?\d+(\.\d+)?/g;

let str = "-1.5 0 2 -123.4.";

alert( str.match(reg) );   // -1.5, 0, 2, -123.4

An arithmetical expression consists of 2 numbers and an operator between them, for instance:

  • 1 + 2
  • 1.2 * 3.4
  • -3 / -6
  • -2 - 2

The operator is one of: "+", "-", "*" or "/".

There may be extra spaces at the beginning, at the end or between the parts.

Create a function parse(expr) that takes an expression and returns an array of 3 items:

  1. The first number.
  2. The operator.
  3. The second number.

For example:

let [a, op, b] = parse("1.2 * 3.4");

alert(a); // 1.2
alert(op); // *
alert(b); // 3.4

A regexp for a number is: -?\d+(\.\d+)?. We created it in previous tasks.

An operator is [-+*/]. We put the dash - first, because in the middle it would denote a character range, which we don’t need.

Note that a slash should be escaped inside a JavaScript regexp /.../.
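
The sketch below shows why the position of the dash matters. The second pattern is a deliberately wrong variant, written only to demonstrate the accidental range:

```javascript
// Dash first: it is a literal dash. The slash is escaped for the /.../ literal.
const operator = /[-+*\/]/;
console.log(['+', '-', '*', '/'].every(ch => operator.test(ch))); // true
console.log(operator.test(',')); // false

// Dash in the middle: *-\/ is a range from "*" to "/",
// which also covers "," and "." by their character codes.
const accidentalRange = /[*-\/]/;
console.log(accidentalRange.test(',')); // true
```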

We need a number, an operator, and then another number. And optional spaces between them.

The full regular expression: -?\d+(\.\d+)?\s*[-+*/]\s*-?\d+(\.\d+)?.

To get a result as an array let’s put parentheses around the data that we need: numbers and the operator: (-?\d+(\.\d+)?)\s*([-+*/])\s*(-?\d+(\.\d+)?).

In action:

let reg = /(-?\d+(\.\d+)?)\s*([-+*\/])\s*(-?\d+(\.\d+)?)/;

alert( "1.2 + 12".match(reg) );

The result includes:

  • result[0] == "1.2 + 12" (full match)
  • result[1] == "1.2" (first parentheses, the whole first number)
  • result[2] == ".2" (second parentheses, the decimal part (\.\d+)?)
  • result[3] == "+" (…)
  • result[4] == "12" (…)
  • result[5] == undefined (the last decimal part is absent, so it’s undefined)

We need only numbers and the operator. We don’t need decimal parts.

So let’s remove the extra groups from capturing by adding ?:, for instance: (?:\.\d+)?.

The final solution:

function parse(expr) {
  let reg = /(-?\d+(?:\.\d+)?)\s*([-+*\/])\s*(-?\d+(?:\.\d+)?)/;

  let result = expr.match(reg);

  if (!result) return;
  result.shift();

  return result;
}

alert( parse("-1.23 * 3.45") );  // -1.23, *, 3.45
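
As a usage sketch, here is a variant of the solution with surrounding spaces and a non-matching input. The name parseExpression and the choice to return null on failure are assumptions made for this example, not part of the task:

```javascript
// Hypothetical variant of parse(expr) that returns null when nothing matches.
function parseExpression(expr) {
  const match = expr.match(/(-?\d+(?:\.\d+)?)\s*([-+*\/])\s*(-?\d+(?:\.\d+)?)/);
  // Drop the full match at index 0 and keep only the three groups.
  return match ? match.slice(1) : null;
}

const [left, op, right] = parseExpression('  -3 / -6  ');
console.log(left, op, right); // -3 / -6

// The groups are strings, so convert them before doing arithmetic.
console.log(Number(left) / Number(right)); // 0.5

console.log(parseExpression('no numbers here')); // null
```

Returning null keeps destructuring failures explicit: a caller has to check the result before unpacking it.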