The unicode flag
The unicode flag /.../u enables correct support of surrogate pairs in regular expressions.
Surrogate pairs are explained in the chapter Strings.
Let’s briefly recall them here. In short, JavaScript strings use UTF-16, where characters are normally encoded with 2 bytes. That gives us 65536 possible values at most. But there are more characters in the world.
So certain rare characters are encoded with 4 bytes, like 𝒳 (mathematical X) or 😄 (a smile).
Here are the Unicode code points to compare: a is U+0061, ≈ is U+2248, 𝒳 is U+1D4B3, 𝒴 is U+1D4B4 and 😄 is U+1F604.
So characters like a and ≈ occupy 2 bytes, and those rare ones take 4.
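To check these values ourselves, here is a small sketch. The describe helper is a hypothetical name introduced only for this illustration; it relies on the standard codePointAt method and on the fact that each UTF-16 code unit takes 2 bytes.

```javascript
// describe() is a hypothetical helper for this illustration, not a built-in.
function describe(ch) {
  const codePoint = ch.codePointAt(0); // full code point, surrogate-aware
  const bytes = ch.length * 2;         // length counts 2-byte UTF-16 code units
  return { ch, hex: '0x' + codePoint.toString(16), bytes };
}

const rows = ['a', '≈', '𝒳', '𝒴', '😄'].map(describe);

for (const row of rows) {
  console.log(row.ch, row.hex, row.bytes);
}
```

The first two rows report 2 bytes, the last three report 4.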
Unicode is designed so that the 4-byte characters have meaning only as a whole.
The length property does not take that into account: it counts 2-byte units, so it reports 2 for a single character such as 😄 or 𝒳.
…But we can see that there’s only one character, right? The point is that length treats the 4 bytes as two 2-byte characters. That’s incorrect, because they must be considered only together (a so-called “surrogate pair”).
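The mismatch is easy to see in a short sketch. The variable names are ours; the spread syntax uses the string iterator, which walks whole code points rather than 2-byte units.

```javascript
const smile = '😄';

const unitCount = smile.length;      // number of 2-byte UTF-16 code units
const charCount = [...smile].length; // the string iterator yields whole code points

// The two halves of the surrogate pair, as hexadecimal code units
const highHalf = smile.charCodeAt(0).toString(16);
const lowHalf = smile.charCodeAt(1).toString(16);

console.log(unitCount); // 2
console.log(charCount); // 1
console.log(highHalf, lowHalf); // d83d de04
```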
Normally, regular expressions also treat “long characters” as two 2-byte ones.
That leads to odd results. For instance, let’s try to find [𝒳𝒴] in the string 𝒳.
The result would be wrong, because by default the regexp engine does not understand surrogate pairs. It thinks that [𝒳𝒴] is not two, but four characters: the left half of 𝒳 (1), the right half of 𝒳 (2), the left half of 𝒴 (3), and the right half of 𝒴 (4).
So it finds the left half of 𝒳 in the string 𝒳, not the whole symbol.
In other words, the search works like '12'.match(/[1234]/): the 1 is returned (the left half of 𝒳).
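Here is a sketch of that behavior. Without the flag, the match is a single 2-byte unit, which may print as a broken character, so we inspect it rather than display it. The variable names are ours.

```javascript
// No u flag: the character class holds four separate 2-byte units
const halfMatch = '𝒳'.match(/[𝒳𝒴]/);

console.log(halfMatch[0].length);      // 1, only one 2-byte unit was matched
console.log(halfMatch[0] === '𝒳'[0]); // true, it is the left half of 𝒳
console.log(halfMatch[0] === '𝒳');    // false, not the whole symbol

// The same logic with ordinary characters
const analogy = '12'.match(/[1234]/);
console.log(analogy[0]); // 1
```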
The /.../u flag fixes that. It enables surrogate pairs in the regexp engine, so the search for [𝒳𝒴] in the string 𝒳 correctly returns the whole 𝒳.
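A minimal sketch of the same search with the flag added (the variable name is ours):

```javascript
// With the u flag, the character class holds two whole characters
const fullMatch = '𝒳'.match(/[𝒳𝒴]/u);

console.log(fullMatch[0] === '𝒳'); // true, the whole symbol
console.log(fullMatch[0].length);  // 2, both halves of the surrogate pair
```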
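There’s also an error that may happen if we forget the flag: a range such as [𝒳-𝒴] throws a SyntaxError about an invalid range.
Here the regexp [𝒳-𝒴] is treated as [12-34] (where 2 is the right part of 𝒳 and 3 is the left part of 𝒴), and the range between the two halves 2 and 3 is unacceptable.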
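We can observe the error without breaking the whole script by building the pattern with the RegExp constructor inside try..catch. A regexp literal with the same range would fail when the script is parsed. The compiles helper is a hypothetical name for this sketch, and the exact error message varies between engines.

```javascript
// compiles() is a hypothetical helper: does the pattern build with these flags?
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

const withoutFlag = compiles('[𝒳-𝒴]', '');
console.log(withoutFlag); // false, the range between two halves is rejected
```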
Using the flag makes it work right: with u, the range [𝒳-𝒴] is valid and matches 𝒴.
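A short sketch of the range with the flag (the variable names are ours):

```javascript
// With the u flag, the range runs between two whole characters
const range = /[𝒳-𝒴]/u;

const rangeMatch = '𝒴'.match(range);
console.log(rangeMatch[0] === '𝒴'); // true

const outside = 'a'.match(range);
console.log(outside); // null, a is not in the range
```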
Finally, let’s note that if we do not deal with surrogate pairs, then the flag changes nothing in the searches described here. But in the modern world we often meet them.