The unicode flag `/.../u` enables the correct support of surrogate pairs.

Surrogate pairs are explained in the chapter Strings.

Let’s briefly review them here. In short, JavaScript normally encodes a character with 2 bytes, which gives at most 65536 characters. But there are more characters in the world.

So certain rare characters are encoded with 4 bytes, like `𝒳` (mathematical X) or `😄` (a smile).

Here are the unicode values to compare:

Character | Unicode | Bytes |
---|---|---|
`a` | 0x0061 | 2 |
`≈` | 0x2248 | 2 |
`𝒳` | 0x1d4b3 | 4 |
`𝒴` | 0x1d4b4 | 4 |
`😄` | 0x1f604 | 4 |

So characters like `a` and `≈` occupy 2 bytes, and those rare ones take 4.

Unicode is designed so that a 4-byte character has a meaning only as a whole.

In the past JavaScript did not know about that, and many string methods still have problems. For instance, `length` thinks that there are two characters here:

```
alert('😄'.length); // 2
alert('𝒳'.length); // 2
```

…But we can see that there’s only one, right? The point is that `length` treats 4 bytes as two 2-byte characters. That’s incorrect, because they must be considered only together (so-called “surrogate pair”).
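As a rough illustration of the difference, the sketch below counts both 2-byte units and whole characters. It relies on string iteration (the spread syntax) and `codePointAt`, which are standard but not discussed in this chapter. The helper name `countCodePoints` is made up for this example, and `console.log` stands in for `alert` so the snippet also runs outside a browser.

```javascript
// Hypothetical helper: counts whole characters instead of 2-byte units.
function countCodePoints(str) {
  // String iteration walks by code point, so a surrogate pair yields one item.
  return [...str].length;
}

const smile = '😄';

console.log(smile.length);                      // 2 (two 2-byte units)
console.log(countCodePoints(smile));            // 1 (one character)
console.log(smile.codePointAt(0).toString(16)); // 1f604, as in the table above
```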

Normally, regular expressions also treat “long characters” as two 2-byte ones.

That leads to odd results. For instance, let’s try to find `[𝒳𝒴]` in the string `𝒳`:

`alert( '𝒳'.match(/[𝒳𝒴]/) ); // odd result (wrong match actually, "half-character")`

The result is wrong, because by default the regexp engine does not understand surrogate pairs.

So it treats `[𝒳𝒴]` as four characters rather than two:

- the left half of `𝒳` `(1)`,
- the right half of `𝒳` `(2)`,
- the left half of `𝒴` `(3)`,
- the right half of `𝒴` `(4)`.

We can list them like this:

```
for (let i = 0; i < '𝒳𝒴'.length; i++) {
  alert('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500
}
```
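The four codes listed above follow from the standard UTF-16 arithmetic for characters beyond 0xFFFF. The following is only a sketch of that arithmetic: the function name `toSurrogates` is hypothetical, and `console.log` replaces `alert` so the snippet can run outside a browser.

```javascript
// Hypothetical helper: splits a code point above 0xFFFF into its two 2-byte halves.
function toSurrogates(codePoint) {
  const offset = codePoint - 0x10000;    // 20 bits remain after subtracting the base
  const high = 0xD800 + (offset >> 10);  // top 10 bits give the left half
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits give the right half
  return [high, low];
}

console.log(toSurrogates(0x1d4b3).join(', ')); // 55349, 56499 (𝒳)
console.log(toSurrogates(0x1d4b4).join(', ')); // 55349, 56500 (𝒴)
```

Both characters share the same left half, which is why the list contains 55349 twice.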

So the search finds only the “left half” of `𝒳`.

In other words, the search works like `'12'.match(/[1234]/)`: only `1` is returned.
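To make the “half-character” visible, this sketch inspects what the match actually contains. The variable names are illustrative only, and `console.log` is used in place of `alert`.

```javascript
const match = '𝒳'.match(/[𝒳𝒴]/); // no "u" flag

const half = match[0];

console.log(half.length);        // 1: only one 2-byte unit was matched
console.log(half.charCodeAt(0)); // 55349: the left half of 𝒳
console.log(half === '𝒳');       // false: the match is not the whole character
```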

## The “u” flag

The `/.../u` flag fixes that.

It enables surrogate pairs in the regexp engine, so the result is correct:

`alert( '𝒳'.match(/[𝒳𝒴]/u) ); // 𝒳`
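A quick way to check that the whole pair is matched is to look at the length of the result. The sketch below also tries the dot, which is not covered in this chapter but, as far as the standard behaviour goes, follows the same rule. Variable names are illustrative, and `console.log` replaces `alert`.

```javascript
const withU = '𝒳'.match(/[𝒳𝒴]/u)[0];

console.log(withU.length);  // 2: both halves of the pair were matched
console.log(withU === '𝒳'); // true: the match is the whole character

// The dot behaves the same way: with "u" it consumes a whole pair.
console.log(/^.$/.test('😄'));  // false: the string holds two 2-byte units
console.log(/^.$/u.test('😄')); // true: the pair counts as one character
```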

Let’s see one more example.

If we forget the `u` flag and accidentally use surrogate pairs, then we can get an error:

`'𝒳'.match(/[𝒳-𝒴]/); // SyntaxError: invalid range in character class`

Normally, regexps understand `[a-z]` as a “range of characters with codes between the codes of `a` and `z`”.

But without the `u` flag, surrogate pairs are treated as a “pair of independent characters”, so `[𝒳-𝒴]` is like `[<55349><56499>-<55349><56500>]` (each surrogate pair replaced with its two codes). Now we can clearly see that the range `56499-55349` is unacceptable, as the left range border must not be greater than the right one.

Using the `u` flag makes it work right:

`alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴`
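Both cases can be compared side by side. Because an invalid regexp literal stops the whole script from parsing, this sketch builds the patterns with the `RegExp` constructor and catches the error instead. The helper name `compiles` is hypothetical, and `console.log` is used in place of `alert`.

```javascript
// Hypothetical helper: reports whether a pattern compiles with the given flags.
function compiles(pattern, flags) {
  try {
    new RegExp(pattern, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

console.log(compiles('[𝒳-𝒴]', ''));  // false: the range goes from 56499 down to 55349
console.log(compiles('[𝒳-𝒴]', 'u')); // true: the range goes from 0x1d4b3 to 0x1d4b4
console.log(/[𝒳-𝒵]/u.test('𝒴'));     // true: 𝒴 lies between 𝒳 and 𝒵
```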
