The internal format for strings is always UTF-16; it is not tied to the page encoding.
Let’s remember the kinds of quotes.
Strings can be enclosed in single quotes, double quotes, or backticks:
let single = 'single-quoted';
let double = "double-quoted";
let backticks = `backticks`;
Single and double quotes are essentially the same. Backticks allow us to embed any expression into the string, including function calls:
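As a minimal sketch of embedding an expression, the snippet below wraps a function call in ${…} inside backticks. The sum function is a hypothetical helper, and console.log stands in for alert so the code also runs outside a browser.

```javascript
// Hypothetical helper used only for this illustration
function sum(a, b) {
  return a + b;
}

// ${…} evaluates the expression and inserts the result into the string
const message = `1 + 2 = ${sum(1, 2)}.`;
console.log(message); // 1 + 2 = 3.
```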
Another advantage of using backticks is that they allow a string to span multiple lines:
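A short sketch of a multiline string in backticks; the guest names are made up for the illustration.

```javascript
// Line breaks inside backticks become part of the string
const guestList = `Guests:
 * John
 * Pete
 * Mary
`;

console.log(guestList); // printed as a list on several lines
```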
If we try to use single or double quotes the same way, there will be an error:
Single and double quotes date from the earliest days of the language, when the need for multiline strings was not taken into account. Backticks appeared much later and are therefore more versatile.
Backticks also allow us to specify a “template function” before the first backtick. The syntax is: func`string`. The function func is called automatically, receives the string and the embedded expressions, and can process them. You can read more in the docs. This is called “tagged templates”. The feature makes it easier to wrap strings into custom templating or other functionality, but it is rarely used.
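To make the idea concrete, here is a minimal sketch of a tag function. The name highlight and its bracket-wrapping behavior are hypothetical; the only language-defined part is that the function receives an array of literal parts followed by the values of the embedded expressions.

```javascript
// Hypothetical tag function: wraps every embedded value in square brackets.
function highlight(strings, ...values) {
  // strings: the literal parts; values: the results of the ${…} expressions
  return strings.reduce(
    (result, part, i) =>
      result + part + (i < values.length ? `[${values[i]}]` : ""),
    ""
  );
}

const user = "Ann";
const age = 30;

// No parentheses: the function name goes right before the first backtick
const tagged = highlight`${user} is ${age} years old`;
console.log(tagged); // [Ann] is [30] years old
```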
It is still possible to create multiline strings with single quotes, using the so-called “newline character”, written as \n, which denotes a line break.
For example, these two lines describe the same string:
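A small sketch of that equivalence, with a made-up two-word text:

```javascript
const viaEscape = "Hello\nWorld"; // line break written as \n

const viaBackticks = `Hello
World`;                           // line break typed literally

console.log(viaEscape === viaBackticks); // true
```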
There are other, less common “special” characters as well. Here’s the list:
|Character|Description|
|\uXXXX|A Unicode symbol with the hex code XXXX; it must be exactly 4 hex digits.|
|\u{X…XXXXXX}|Some rare characters are encoded with two Unicode symbols, taking up to 4 bytes. This long form requires braces around the code (1 to 6 hex digits).|
Examples with unicode:
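The following sketch uses both escape forms; the particular code points are chosen only for illustration.

```javascript
const copyright = "\u00A9"; // short form: exactly 4 hex digits
const rare = "\u{20331}";   // long form: a rare CJK character
const smiley = "\u{1F60D}"; // long form: a smiling face symbol

console.log(copyright); // ©
console.log(rare);
console.log(smiley);
```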
All special characters start with a backslash character \. It is also called an “escape character”.
We should also use it if we want to insert a quote into the string.
We have to prepend the inner quote with the backslash \', because otherwise it would indicate the end of the string.
Of course, that applies only to quotes that are the same as the enclosing ones. So, as a more elegant solution, we could switch to double quotes or backticks instead:
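Both approaches, sketched side by side with a sample phrase:

```javascript
const escaped = 'I\'m the Walrus!';      // inner single quote is escaped
const withBackticks = `I'm the Walrus!`; // no escaping needed

console.log(escaped);                   // I'm the Walrus!
console.log(escaped === withBackticks); // true
```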
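Note that the backslash \ only serves for the correct reading of the string by JavaScript and then disappears. The in-memory string has no \. You can clearly see that in the alert output of the examples above.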
But what if we need to show an actual backslash \ within the string?
That’s possible, but we need to double it, like \\:
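A minimal sketch; the sample text is arbitrary.

```javascript
const text = "The backslash: \\";

console.log(text);        // The backslash: \
console.log(text.length); // 16, the doubled backslash is a single character
```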
The length property has the string length. Note that \n is a single “special” character, so it adds just 1 to the length:
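For instance, with a short sample string:

```javascript
const str = "My\n";

console.log(str.length); // 3: "M", "y", and the newline character
```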
length is a property
People with a background in some other languages sometimes mistype by calling str.length() instead of just str.length. That doesn’t work.
Please note that str.length is a numeric property, not a function. There is no need to add parentheses after it.
To get a character at position pos, use square brackets [pos] or call the method str.charAt(pos). The first character is at position zero.
Square brackets are the modern way of getting a character, while charAt exists mostly for historical reasons.
The only difference between them is that if no character is found, [] returns undefined, and charAt returns an empty string:
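A short sketch of both ways of reading a character, including the out-of-range case:

```javascript
const str = "Hello";

console.log(str[0]);              // H
console.log(str.charAt(0));       // H
console.log(str[str.length - 1]); // o, the last character

console.log(str[1000]);           // undefined
console.log(str.charAt(1000));    // "" (an empty string)
```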
We can also iterate over characters using for..of:
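One way to iterate over the characters of a string is a for..of loop. A minimal sketch that collects them into an array:

```javascript
const chars = [];

for (const char of "Hello") {
  chars.push(char); // one character per iteration
}

console.log(chars.join(",")); // H,e,l,l,o
```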
Strings can’t be changed in JavaScript: it is impossible to change a character in place. Let’s try it to see that it doesn’t work.
The usual workaround is to create a whole new string and assign it to str instead of the old one.
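Both the failed attempt and the workaround can be sketched as follows. The try/catch is there only because the assignment is silently ignored in non-strict code but throws a TypeError in strict mode.

```javascript
let str = "Hi";

try {
  str[0] = "h"; // ignored in non-strict code, TypeError in strict mode
} catch (err) {
  console.log(err instanceof TypeError); // true (reached only in strict mode)
}
console.log(str[0]); // H, the string is unchanged

// Workaround: build a whole new string and assign it to the variable
str = "h" + str[1];
console.log(str); // hi
```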
In the following sections we’ll see more examples of that.
Methods toLowerCase() and toUpperCase() change the case of a string. Or, if we want a single character lowercased:
alert( 'Interface'[0].toLowerCase() ); // 'i'
There are multiple ways to look for a substring in a string.
The first method is str.indexOf(substr, pos).
It looks for substr in str, starting from the given position pos, and returns the position where the match was found, or -1 if nothing can be found.
The optional second parameter allows us to start the search from a given position.
For instance, the first occurrence of "id" is at position 1. To look for the next occurrence, let’s start the search from position 2:
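A minimal sketch, assuming the sample string "Widget with id" (chosen so that "id" first appears at position 1):

```javascript
const str = "Widget with id";

console.log(str.indexOf("Widget")); // 0, found at the beginning
console.log(str.indexOf("widget")); // -1, the search is case-sensitive
console.log(str.indexOf("id"));     // 1, "id" is found inside "Widget"
console.log(str.indexOf("id", 2));  // 12, the next occurrence
```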
If we’re interested in all occurrences, we can run indexOf in a loop. Every new call is made with the position after the previous match.
The same algorithm can be laid out more concisely:
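Both forms of the loop are sketched below on a made-up sentence; the variable names are illustrative.

```javascript
const str = "As sly as a fox, as strong as an ox";
const target = "as"; // the substring to look for

// Long form: restart the search right after the previous match
const positions = [];
let pos = 0;
while (true) {
  const foundPos = str.indexOf(target, pos);
  if (foundPos === -1) break;
  positions.push(foundPos);
  pos = foundPos + 1;
}

// Short form: assignment and comparison in the loop condition
const positionsShort = [];
let p = -1;
while ((p = str.indexOf(target, p + 1)) !== -1) {
  positionsShort.push(p);
}

console.log(positions.join(","));      // 7,17,27
console.log(positionsShort.join(",")); // 7,17,27
```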
There is also a similar method str.lastIndexOf(substr, pos) that searches from the end of the string to its beginning.
It would list the occurrences in reverse order.
There is a slight inconvenience with indexOf in the if test. We can’t put it in the if like this:
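A sketch of the problematic check, using a sample string where the match is at the very start. The shown flag is a stand-in for the alert call so the outcome can be inspected.

```javascript
const str = "Widget with id";
let shown = false;

if (str.indexOf("Widget")) {
  shown = true; // never runs: indexOf returned 0, which is falsy
}

console.log(shown); // false
```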
The alert in the example above doesn’t show, because indexOf returns 0 (meaning that it found the match at the starting position). That’s correct, but if considers 0 to be false.
So we should actually check for -1, like this:
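A sketch of the corrected check on the same kind of sample string; the found flag is illustrative.

```javascript
const str = "Widget with id";
let found = false;

if (str.indexOf("Widget") !== -1) {
  found = true; // runs: position 0 is a valid match
}

console.log(found); // true
```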
One of the old tricks used here is the bitwise NOT operator ~. It converts the number to a 32-bit integer (removing the decimal part, if any) and then reverses all bits in its binary representation.
For 32-bit integers, ~n means exactly the same as -(n+1) (a consequence of the two’s complement representation of integers).
So ~n is zero only if n == -1.
So the test if ( ~str.indexOf("...") ) is truthy only if the result of indexOf is not -1. In other words, when there is a match.
People use it to shorten indexOf checks:
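The sketch below first prints a few values of ~n and then applies the trick to a sample string.

```javascript
console.log(~2);  // -3, the same as -(2 + 1)
console.log(~1);  // -2
console.log(~0);  // -1
console.log(~-1); // 0, the only case that gives zero

const str = "Widget";
let found = false;

if (~str.indexOf("Widget")) {
  found = true; // runs: ~0 is -1, which is truthy
}

console.log(found); // true
```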
It is usually not recommended to use language features in a non-obvious way, but this particular trick is widely used in old code, so we should understand it.
Just remember: if (~str.indexOf(...)) reads as “if found”.
The more modern method str.includes(substr, pos) returns true/false depending on whether str contains substr.
It’s the right choice if we need to test for the match, but don’t need its position:
The optional second argument of str.includes is the position to start searching from:
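A brief sketch of includes with and without the start position, on sample strings:

```javascript
console.log("Widget with id".includes("Widget")); // true
console.log("Hello".includes("Bye"));             // false

console.log("Widget".includes("id"));    // true
console.log("Widget".includes("id", 3)); // false, no "id" from position 3 on
```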
str.slice(start [, end])
Returns the part of the string from start to (but not including) end.
If there is no second argument, then slice goes till the end of the string.
Negative values for start/end are also possible. They mean the position is counted from the string end:
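The three cases of slice can be sketched on a sample word:

```javascript
const str = "stringify";

console.log(str.slice(0, 5));   // strin, from 0 up to (not including) 5
console.log(str.slice(0, 1));   // s
console.log(str.slice(2));      // ringify, from position 2 to the end
console.log(str.slice(-4, -1)); // gif, from the 4th to the 1st position from the right
```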
str.substring(start [, end])
Returns the part of the string between start and end.
This is almost the same as slice, but it allows start to be greater than end.
Negative arguments are (unlike slice) not supported; they are treated as 0.
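A short comparison of substring and slice on a sample word:

```javascript
const str = "stringify";

// The argument order does not matter for substring
console.log(str.substring(2, 6)); // ring
console.log(str.substring(6, 2)); // ring

// ...but it does for slice
console.log(str.slice(2, 6)); // ring
console.log(str.slice(6, 2)); // "" (an empty string)

// A negative argument is treated as 0
console.log(str.substring(-3, 2)); // st
```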
str.substr(start [, length])
Returns the part of the string from start, with the given length.
In contrast with the previous methods, this one allows us to specify the length instead of the ending position.
The first argument may be negative, to count from the end:
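A minimal sketch of substr with a positive and a negative start, on a sample word:

```javascript
const str = "stringify";

console.log(str.substr(2, 4));  // ring, 4 characters starting at position 2
console.log(str.substr(-4, 2)); // gi, 2 characters starting 4 from the end
```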
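Let’s recap the methods to avoid any confusion:
|method|selects…|negatives|
|slice(start, end)|from start to end (not including end)|allows negatives|
|substring(start, end)|between start and end|negative values mean 0|
|substr(start, length)|from start, get length characters|allows negative start|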
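All of them can do the job. Formally, substr has a minor drawback: it is described not in the core JavaScript specification, but in Annex B, which covers browser-only features that exist mainly for historical reasons.
The author finds himself using slice almost all the time.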
As we know from the chapter Comparisons, strings are compared character by character, in alphabetical order.
However, there are some oddities.
A lowercase letter is always greater than an uppercase one.
Letters with diacritical marks are “out of order”.
That may lead to strange results if we sort country names. Usually people would expect Zealand to come after Österreich in the list.
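Both oddities, and their effect on sorting, can be sketched as follows. The list of names is made up; sort() without arguments compares strings by their character codes.

```javascript
console.log("a" > "Z");                // true
console.log("Österreich" > "Zealand"); // true

const countries = ["Zealand", "Österreich", "Austria"];
countries.sort(); // default comparison, by character codes
console.log(countries.join(", ")); // Austria, Zealand, Österreich
```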
All strings are encoded using UTF-16. That is, each character has a corresponding numeric code. There are special methods that allow us to get the character for a code and back.
str.codePointAt(pos)
Returns the code for the character at position pos.
String.fromCodePoint(code)
Creates a character by its numeric code.
We can also add Unicode characters by their codes using \u followed by the hex code:
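A short sketch of going from a character to its code and back, using the letters z and Z as samples:

```javascript
// Different case letters have different codes
console.log("z".codePointAt(0)); // 122
console.log("Z".codePointAt(0)); // 90

// Code to character
console.log(String.fromCodePoint(90)); // Z

// 90 is 5a in hexadecimal
console.log("\u005a"); // Z
```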
Now let’s see the characters with codes 65..220 (the Latin alphabet and a little bit extra) by making a string of them:
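One way to build that string is a simple loop over the codes. Note that the codes 127..159 are control characters, so part of the output may not be visible.

```javascript
let str = "";

for (let i = 65; i <= 220; i++) {
  str += String.fromCodePoint(i); // append the character with code i
}

console.log(str);
```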
See? Capital characters go first, then a few special ones, then lowercase characters.
Now it becomes obvious why a > Z.
The characters are compared by their numeric code. A greater code means that the character is greater. The code for a (97) is greater than the code for Z (90).
- All lowercase letters go after uppercase letters, because their codes are greater.
- Some letters like Ö stand apart from the main alphabet. Here, its code is greater than anything from a to z.
The “right” algorithm for string comparisons is more complex than it may seem, because alphabets differ between languages. The same-looking letter may be located differently in different alphabets.
So, the browser needs to know the language to compare.
It provides a special method to compare strings in different languages, following their rules.
The call str.localeCompare(str2):
- Returns a positive number if str is greater than str2 according to the language rules.
- Returns a negative number if str is less than str2.
- Returns 0 if they are equal.
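A minimal sketch of the three outcomes. It passes an explicit locale tag ("de", an assumption made for this example) as the second argument so that the result does not depend on the environment; only the sign of a non-zero result is guaranteed.

```javascript
const less = "Österreich".localeCompare("Zealand", "de");
const greater = "Zealand".localeCompare("Österreich", "de");
const equal = "Zealand".localeCompare("Zealand", "de");

console.log(less < 0);    // true: Österreich goes before Zealand
console.log(greater > 0); // true
console.log(equal);       // 0
```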
The method actually has two additional arguments specified in the documentation. They allow us to specify the language (by default taken from the environment) and set up additional rules, such as case sensitivity or whether "a" and "á" should be treated as the same.
This section goes deeper into string internals. The knowledge will be useful if you plan to deal with emoji, rare mathematical or hieroglyphic characters, or other rare symbols.
You can skip the section if you don’t plan to support them.
Most symbols have a 2-byte code. Letters of most European languages, numbers, and even most hieroglyphs have a 2-byte representation.
But 2 bytes only allow 65536 combinations, and that’s not enough for every possible symbol. So rare symbols are encoded with a pair of 2-byte characters called “a surrogate pair”.
The length of such symbols is 2:
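A short sketch with three sample symbols outside the 2-byte range, written as escapes so the code does not depend on the file encoding:

```javascript
const mathX = "\u{1D4B3}"; // mathematical script capital X
const joy = "\u{1F602}";   // face with tears of joy
const rare = "\u{29DF6}";  // a rare CJK character

console.log(mathX.length); // 2
console.log(joy.length);   // 2
console.log(rare.length);  // 2
```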
We actually have a single symbol in each of the strings above, but the length shows a length of 2.
String.fromCodePoint and str.codePointAt are among the few methods that deal with surrogate pairs correctly. They appeared in the language relatively recently. Before them, there were only String.fromCharCode and str.charCodeAt. These methods are actually the same as fromCodePoint/codePointAt, but don’t work with surrogate pairs.
But, for instance, getting a symbol can be tricky, because surrogate pairs are treated as two characters:
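A sketch of what happens when a sample surrogate pair is accessed by index:

```javascript
const mathX = "\u{1D4B3}"; // a single symbol stored as a surrogate pair

console.log(mathX[0]); // strange output: the first half of the pair
console.log(mathX[1]); // strange output: the second half of the pair

console.log(mathX[0] === mathX);            // false
console.log(mathX[0] + mathX[1] === mathX); // true, only both halves make the symbol
```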
Note that pieces of the surrogate pair have no meaning without each other. So, the alerts in the example above actually display garbage.
Technically, surrogate pairs are also detectable by their codes: if a character has a code in the interval 0xd800..0xdbff, then it is the first part of a surrogate pair. The next character (the second part) must have a code in the interval 0xdc00..0xdfff. These intervals are reserved exclusively for surrogate pairs by the standard.
In the case above:
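The codes of the two halves can be inspected with charCodeAt, which does not know about surrogate pairs. The helper isHighSurrogate is hypothetical and only encodes the first interval given above.

```javascript
const mathX = "\u{1D4B3}";

console.log(mathX.charCodeAt(0).toString(16)); // d835, between 0xd800 and 0xdbff
console.log(mathX.charCodeAt(1).toString(16)); // dcb3, between 0xdc00 and 0xdfff

// Hypothetical helper: is this code the first part of a surrogate pair?
function isHighSurrogate(code) {
  return code >= 0xd800 && code <= 0xdbff;
}

console.log(isHighSurrogate(mathX.charCodeAt(0))); // true
console.log(isHighSurrogate(mathX.charCodeAt(1))); // false
```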
You will find more ways to deal with surrogate pairs later in the chapter Iterables. Probably, there are special libraries for that too, but nothing famous enough to suggest here.
In many languages there are symbols that are composed of a base character and a mark above or under it.
For instance, the letter a can be the base character for: àáâäãåā. The most common “composite” characters have their own code in the UTF-16 table. But not all of them, because there are too many possible combinations.
To support arbitrary compositions, UTF-16 allows us to use several Unicode characters: the base character and one or many “mark” characters that “decorate” it.
For instance, if we have S followed by the special “dot above” character (code \u0307), it is shown as Ṡ.
If we need an additional mark over the letter (or below it), no problem: just add the necessary mark character.
For instance, if we append the character “dot below” (code \u0323), then we’ll have “S with dots above and below”.
This gives great flexibility, but also an interesting problem: two characters may look the same visually, yet be represented with different Unicode compositions.
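A sketch of the problem, using S with the two marks applied in a different order:

```javascript
const s1 = "S\u0307\u0323"; // S + dot above + dot below
const s2 = "S\u0323\u0307"; // S + dot below + dot above

console.log(s1); // both lines render as the same symbol
console.log(s2);

console.log(s1 === s2); // false, the underlying sequences differ
console.log(s1.length); // 3
```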
To solve it, there exists a “Unicode normalization” algorithm that brings each string to a single “normal” form.
It is implemented by str.normalize().
It’s funny that in our situation normalize() actually brings a sequence of 3 characters to one: \u1e68 (S with two dots).
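A sketch of normalization on the two orderings of the marks. normalize() is called without arguments, which uses the default form (NFC).

```javascript
const s1 = "S\u0307\u0323"; // S + dot above + dot below
const s2 = "S\u0323\u0307"; // S + dot below + dot above

console.log(s1.normalize() === s2.normalize()); // true
console.log(s1.normalize().length);             // 1
console.log(s1.normalize() === "\u1e68");       // true
```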
In reality, this is not always the case. The reason is that the symbol Ṩ is “common enough”, so the UTF-16 creators included it in the main table and gave it a code.
If you want to learn more about normalization rules and variants, they are described in the appendix to the Unicode standard: Unicode Normalization Forms. For most practical purposes, though, the information from this section is enough.
- There are 3 types of quotes. Backticks allow a string to span multiple lines and embed expressions.
- We can use special characters like \n and insert letters by their Unicode code using \u...
- To get a character: use [].
- To get a substring: use slice or substring.
- To lowercase/uppercase a string: use toLowerCase/toUpperCase.
- To look for a substring: use indexOf, or includes/startsWith/endsWith for simple checks.
- To compare strings according to the language, use localeCompare; otherwise they are compared by character codes.
There are several other helpful methods in strings:
- str.trim() – removes (“trims”) spaces from the beginning and end of the string.
- str.repeat(n) – repeats the string n times.
- …and others, see the manual for details.
Strings also have methods for doing search/replace with regular expressions. But that topic deserves a separate chapter, so we’ll return to it later.