| What about four backslashes `\\\\` in a string literal? Well, that's just two `\\` escape sequences next to each other, so it results in two adjacent backslashes (`\\`) in the underlying string value. You might recognize there's an odd/even rule pattern at play. You should thus be able to decipher any odd (`\\\\\`, `\\\\\\\\\`, etc) or even (`\\\\\\`, `\\\\\\\\\\`, etc) number of backslashes in a string literal. |
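As a quick check of the odd/even rule, here's a minimal sketch (the variable names are only illustrative) that inspects the `length` of a few literals:

```js
// Every `\\` pair in a string literal produces one backslash in the value.
const oneBackslash = "\\";
const twoBackslashes = "\\\\";

// Five backslashes followed by `n`: two `\\` pairs, then a lone `\`
// that combines with the `n` to form a new-line escape.
const twoPlusNewline = "\\\\\n";

console.log(oneBackslash.length);      // 1
console.log(twoBackslashes.length);    // 2
console.log(twoPlusNewline.length);    // 3
```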
301
301
302
+
#### Line Continuation
303
+
304
+
The `\` character followed by an actual new-line character (not just literal `n`) is a special case, and it creates what's called a line-continuation:
305
+
306
+
```js
greeting = "Hello \
Friends!";

console.log(greeting);
// Hello Friends!
```
313
+
314
+
As you can see, the new-line at the end of the `greeting =` line is immediately preceded by a `\`, which allows this string literal to continue onto the subsequent line. Without the escaping `\` before it, a new-line -- the actual new-line, not the `\n` character escape sequence -- appearing in a `"` or `'` delimited string literal would actually produce a JS syntax parsing error.
315
+
316
+
Because the end-of-line `\` turns the new-line character into a line continuation, the new-line character is omitted from the string, as shown by the `console.log(..)` output.
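To see the difference between a line continuation and the `\n` escape sequence side by side, consider this small sketch (the variable names are just for illustration):

```js
// Line continuation: the literal spans two source lines,
// but no new-line ends up in the value.
const continued = "Hello \
Friends!";

// Escape sequence: one source line, but the value contains a new-line.
const escaped = "Hello \nFriends!";

console.log(continued === "Hello Friends!");   // true
console.log(continued.includes("\n"));         // false
console.log(escaped.includes("\n"));           // true
```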
317
+
318
+
| NOTE: |
319
+
| :--- |
320
+
| This line-continuation feature is often referred to as "multi-line strings", but I think that's a confusing label. As you can see, the string value itself doesn't have multiple lines; it was only defined across multiple lines via the line continuations. A multi-line string would actually have multiple lines in the underlying value. We'll revisit this topic later in this chapter when we cover Template Literals. |
321
+
302
322
### Multi-Character Escapes
303
323
304
324
Multi-character escape sequences may be hexadecimal or Unicode sequences.
@@ -360,7 +380,7 @@ Even though JS doesn't care which way such a character is represented in your pr
360
380
361
381
##### Unicode Normalization
362
382
363
-
A further wrinkle in Unicode string handling is that even certain single BMP characters can be represented in different ways.
383
+
Another wrinkle in Unicode string handling is that even certain single BMP characters can be represented in different ways.
364
384
365
385
For example, the `"é"` character can either be represented as itself (code-point `233`, aka `\xe9` or `\u00e9` or `\u{e9}`), or as the combination of two code-points: the `"e"` character (code-point `101`, aka `\x65`, `\u0065`, `\u{65}`) and the *combining acute accent* (code-point `769`, aka `\u0301`, `\u{301}`).
One particular challenge is that you may copy-paste a string with an `"é"` character visible in it, and that character may have been in the *composed* or *decomposed* form. There's no visual way to tell which, and yet the underlying string value differs depending on the form:
411
+
412
+
```js
"é" === "é"; // false!!
```
415
+
390
416
This internal representation difference can be quite challenging if not carefully planned for. Fortunately, JS provides a `normalize(..)` utility method on strings to help:
The `"NFC"` normalization mode combines adjacent code-points into the *composed* code-point (if possible), whereas the `"NFD"` normalization mode splits a single code-point into its *decomposed* code-points (if possible).
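Since the two forms of `"é"` look identical on screen, the following sketch spells them out with escape sequences to show both normalization modes at work (the variable names are illustrative):

```js
const composed = "\u00e9";      // single code-point "é"
const decomposed = "e\u0301";   // "e" followed by the combining acute accent

console.log(composed === decomposed);                    // false
console.log(composed.length, decomposed.length);         // 1 2

// NFC composes, NFD decomposes
console.log(decomposed.normalize("NFC") === composed);   // true
console.log(composed.normalize("NFD") === decomposed);   // true
```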
402
428
403
-
And there can actually be more than two individual *decomposed* code-points that make up a single *composed* code-point; some international language symbols (Chinese, Japanese, etc) are *composed* of three or four code-points layered together!
429
+
And there can actually be more than two individual *decomposed* code-points that make up a single *composed* code-point -- for example, a single character could have several diacritical marks applied to it.
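As one illustration (the character chosen here is just an example, not one discussed above), the Vietnamese letter `"ệ"` is an `"e"` carrying both a circumflex and a dot below, so it decomposes into three code-points:

```js
const composedLetter = "\u1ec7";   // "e" with circumflex and dot below

const decomposedLetter = composedLetter.normalize("NFD");

console.log([ ...composedLetter ].length);     // 1
console.log([ ...decomposedLetter ].length);   // 3

// composing again yields the original single code-point
console.log(decomposedLetter.normalize("NFC") === composedLetter);   // true
```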
404
430
405
431
When dealing with Unicode strings that will be compared, sorted, or length analyzed, it's very important to keep Unicode normalization in mind, and use it where necessary.
406
432
407
-
### Line Continuation
433
+
##### Unicode Grapheme Clusters
408
434
409
-
The `\` followed by an actual new-line character (not just literal `n`) is a special case, and it creates what's called a line-continuation:
435
+
A final complication of Unicode string handling is the support for clustering of multiple adjacent code-points into a single visually distinct symbol, referred to as a *grapheme* (or a *grapheme cluster*).
436
+
437
+
An example would be a family emoji such as `"👩👩👦👦"`, which is actually made up of 7 code-points that all cluster/group together into a single visual symbol.
As you can see, the new-line at the end of the `greeting =` line is immediately preceded by a `\`, which allows this string literal to continue onto the subsequent line. Without the escaping `\` before it, a new-line -- the actual new-line, not the `\n` character escape sequence -- appearing in a `"` or `'` delimited string literal would actually produce a JS syntax parsing error.
420
-
421
-
Because the end-of-line `\` turns the new-line character into a line continuation, the new-line character is omitted from the string, as shown by the `console.log(..)` output.
447
+
This emoji is *not* a single registered Unicode code-point, and as such, there's no *normalization* that can be performed to compose these 7 separate code-points into a single entity. The visual rendering logic for such composite symbols is quite complex, well beyond what most JS developers want to embed into their programs. Libraries do exist for handling some of this logic, but they're often large and still don't necessarily cover all of the nuances/variations.
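To make those 7 code-points visible, this sketch writes the family emoji out as escape sequences: four person emoji joined by three zero-width joiners (`\u200D`). The variable name is illustrative:

```js
// woman + ZWJ + woman + ZWJ + boy + ZWJ + boy
const family = "\u{1F469}\u200D\u{1F469}\u200D\u{1F466}\u200D\u{1F466}";

console.log([ ...family ].length);                 // 7  (code-points)
console.log(family.length);                        // 11 (code-units)

// normalization has nothing to compose here
console.log(family.normalize("NFC") === family);   // true
```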
422
448
423
-
| NOTE: |
424
-
| :--- |
425
-
| This line-continuation feature is often referred to as "multi-line strings", but I think that's a confusing label. As you can see, the string value itself doesn't have multiple lines, it only was defined across multiple lines via the line continuations. A multi-line string would actually have multiple lines in the underlying value. |
449
+
This kind of complexity significantly affects length computations, comparison, sorting, and many other common string-oriented operations.
types-grammar/ch2.md (80 additions & 30 deletions)
@@ -132,22 +132,6 @@ yourAge; // 42 <-- unchanged
132
132
133
133
String values have a number of specific behaviors that every JS developer should be aware of.
134
134
135
-
### Length Computation
136
-
137
-
As mentioned in Chapter 1, string values have a `length` property that automatically exposes the length of the string; this property can only be accessed; attempts to set it are silently ignored.
138
-
139
-
The reported `length` value somewhat corresponds to the number of characters in the string (actually, code-units), but as we saw in Chapter 1, it's more complex when Unicode characters are involved.
140
-
141
-
Most people visually distinguish symbols as separate characters; this notion of an independent visual symbol is referred to as a *grapheme*. So when counting the "length" of a string, we typically mean that we're counting the number of graphemes.
142
-
143
-
But that's not how the computer deals with characters.
144
-
145
-
In JS, each *character* is a code-unit (16 bits), with a code-point value at or below `65535`. The `length` property of a string always counts the number of code-units in the string value, not code-points. A code-unit might represent a single character by itself, or it may be part of a surrogate pair, or it may be combined with an adjacent *combining* symbol. As such, `length` doesn't match the typical notion of counting graphemes.
146
-
147
-
To obtain a *grapheme length* for a string that matches typical expectations, the string value first needs to be normalized with `normalize("NFC")` (see "Normalizing Unicode" in Chapter 1) to produce *composed* code-units, in case any characters in it were originally stored *decomposed* as separate code-units.
148
-
149
-
// TODO
150
-
151
135
### String Character Access
152
136
153
137
Though strings are not actually arrays, JS allows `[ .. ]` array-style access of a character at a numeric (`0`-based) index:
@@ -170,20 +154,6 @@ If the value/expression resolves to a number outside the integer range of `0` -
170
154
| :--- |
171
155
| We'll cover coercion in-depth later in the book. |
172
156
173
-
### String Concatenation
174
-
175
-
Two or more string values can be concatenated (combined) into a new string value, using the `+` operator:
176
-
177
-
```js
greeting = "Hello, " + "Kyle!";

greeting;               // Hello, Kyle!
```
182
-
183
-
The `+` operator will act as a string concatenation if either of the two operands (values on left or right sides of the operator) are already a string.
184
-
185
-
If one operand is a string and the other is not, the one that's not a string will be coerced to its string representation for the purposes of the concatenation.
186
-
187
157
### Character Iteration
188
158
189
159
Strings are not arrays, but they certainly mimic arrays closely in many ways. One such behavior is that, like arrays, strings are iterables. This means that the characters (code-units) of a string can be iterated individually:
| The specifics of the iterator protocol, including the fact that the `{ value: "e" .. }` result still shows `done: false`, are covered in detail in the "Sync & Async" title of this series. |
223
193
194
+
### Length Computation
195
+
196
+
As mentioned in Chapter 1, string values have a `length` property that automatically exposes the length of the string. This property can only be read; attempts to set it are silently ignored.
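One caveat worth checking for yourself: the "silently ignored" behavior applies to non-strict code, whereas strict-mode code throws a `TypeError` on such an assignment. The sketch below uses the `Function` constructor only so that both modes can be shown in one snippet regardless of how the snippet itself is run; the helper names are hypothetical:

```js
// Function-constructor bodies are non-strict unless they opt in.
const setLengthSloppy = new Function(
    "str", "str.length = 1; return str.length;"
);
const setLengthStrict = new Function(
    "str", '"use strict"; str.length = 1; return str.length;'
);

console.log(setLengthSloppy("Hello"));       // 5 -- assignment ignored

let strictError = null;
try {
    setLengthStrict("Hello");
}
catch (err) {
    strictError = err;
}
console.log(strictError instanceof TypeError);   // true
```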
197
+
198
+
The reported `length` value somewhat corresponds to the number of characters in the string (actually, code-units), but as we saw in Chapter 1, it's more complex when Unicode characters are involved.
199
+
200
+
Most people visually distinguish symbols as separate characters; this notion of an independent visual symbol is referred to as a *grapheme*, or a *grapheme cluster*. So when counting the "length" of a string, we typically mean that we're counting the number of graphemes.
201
+
202
+
But that's not how the computer deals with characters.
203
+
204
+
In JS, each *character* is a code-unit (16 bits), with a code-point value at or below `65535`. The `length` property of a string always counts the number of code-units in the string value, not code-points. A code-unit might represent a single character by itself, or it may be part of a surrogate pair, or it may be combined with an adjacent *combining* symbol, or part of a grapheme cluster. As such, `length` doesn't match the typical notion of counting visual characters/graphemes.
205
+
206
+
To get closer to an expected/intuitive *grapheme length* for a string, the string value first needs to be normalized with `normalize("NFC")` (see "Unicode Normalization" in Chapter 1) to produce any *composed* code-units (where possible), in case any characters were originally stored *decomposed* as separate code-units.
207
+
208
+
For example:
209
+
210
+
```js
// displays as "teléfono"; the "é" is stored decomposed ("e" + "\u0301")
favoriteItem = "tele\u0301fono";
favoriteItem.length;            // 9 -- uh oh!

favoriteItem = favoriteItem.normalize("NFC");
favoriteItem.length;            // 8 -- phew!
```
217
+
218
+
Unfortunately, as we saw in Chapter 1, we'll still have the possibility of characters with a code-point greater than `65535`, which thus need a surrogate pair to be represented. Such characters will count double in the `length`:
219
+
220
+
```js
// "☎" === "\u260E"
oldTelephone = "☎";
oldTelephone.length;            // 1

// "📱" === "\u{1F4F1}" === "\uD83D\uDCF1"
cellphone = "📱";
cellphone.length;               // 2 -- oops!
```
229
+
230
+
So what do we do?
231
+
232
+
One fix is to use character iteration (via the `...` operator) as we saw in the previous section, since it automatically returns each combined character from a surrogate pair:
233
+
234
+
```js
cellphone = "📱";
cellphone.length;               // 2 -- oops!
[ ...cellphone ].length;        // 1 -- phew!
```
239
+
240
+
But, unfortunately, grapheme clusters (as explained in Chapter 1) throw yet another wrench into a string's length computation. For example, if we take the thumbs down emoji (`"\u{1F44E}"`) and add to it the skin-tone modifier for medium-dark skin (`"\u{1F3FE}"`), we get:
241
+
242
+
```js
// "👎🏾" = "\u{1F44E}\u{1F3FE}"
thumbsDown = "👎🏾";

thumbsDown.length;              // 4 -- oops!
[ ...thumbsDown ].length;       // 2 -- oops!
```
249
+
250
+
As you can see, these are two distinct code-points (not a surrogate pair) that, by virtue of their ordering and adjacency, cause the computer's Unicode rendering to draw the thumbs-down symbol but with a darker skin tone than its default. Each of the two code-points needs its own surrogate pair, so `length` reports `4` code-units, and even code-point iteration still counts `2`.
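To confirm which code-points are involved, a short sketch like this one (variable names are illustrative) lists them in hexadecimal:

```js
const thumbsDown = "\u{1F44E}\u{1F3FE}";

// iterate by code-point, then read each code-point's numeric value
const codePoints = [ ...thumbsDown ].map(
    (char) => char.codePointAt(0).toString(16)
);

console.log(codePoints.join(" "));   // 1f44e 1f3fe
console.log(thumbsDown.length);      // 4 (two surrogate pairs)
```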
251
+
252
+
| WARNING: |
253
+
| :--- |
254
+
| As a Twitter user, you might expect to be able to put 280 thumbs-down emoji into a single tweet, since each looks like a single character. But Twitter counts each such emoji as two characters, so you only get 140. Surprisingly, Twitter counts the `"👎"` (default thumbs-down), `"👎🏾"` (dark-skin tone thumbs-down), and even the `"👩‍👩‍👦‍👦"` (family emoji grapheme cluster) all as two characters each, even though their string lengths (from JS's perspective) are `2`, `4`, and `11`, respectively. Twitter must have some sort of custom Unicode handling implemented in its tools. |
255
+
256
+
It would take replicating most of a platform's complex Unicode rendering logic to be able to recognize such clusters of code-points as a single "character" for length-counting sake. There are libraries that purport to do so, but they're not necessarily perfect, and they come at a hefty cost in terms of extra code.
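Beyond third-party libraries, many current JS engines also ship the built-in `Intl.Segmenter` API, which can split a string into grapheme clusters. The sketch below assumes that API is available (hence the feature check), and `graphemeLength(..)` is a hypothetical helper name; results depend on the Unicode data bundled with the engine, so treat it as an approximation rather than a guarantee:

```js
// Hypothetical helper: counts grapheme clusters via Intl.Segmenter.
function graphemeLength(str) {
    const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
    let count = 0;
    for (const segment of segmenter.segment(str)) {
        count++;
    }
    return count;
}

const darkThumbsDown = "\u{1F44E}\u{1F3FE}";
const familyEmoji = "\u{1F469}\u200D\u{1F469}\u200D\u{1F466}\u200D\u{1F466}";

// Assumption: only run where the engine actually provides Intl.Segmenter
if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    console.log(graphemeLength(darkThumbsDown));   // 1
    console.log(graphemeLength(familyEmoji));      // 1
    console.log(graphemeLength("tele\u0301fono")); // 8
}
```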
257
+
258
+
Counting the "length" of a string to match our human intuitions is a remarkably challenging task. We can get acceptable approximations in many cases, but there are plenty of other cases that confound our programs.
259
+
260
+
### String Concatenation
261
+
262
+
Two or more string values can be concatenated (combined) into a new string value, using the `+` operator:
263
+
264
+
```js
greeting = "Hello, " + "Kyle!";

greeting;               // Hello, Kyle!
```
269
+
270
+
The `+` operator will act as a string concatenation if either of the two operands (values on left or right sides of the operator) is already a string.
271
+
272
+
If one operand is a string and the other is not, the one that's not a string will be coerced to its string representation for the purposes of the concatenation.
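Here's a brief sketch of that coercion in action (the variable names are illustrative). Note that `+` evaluates left to right, so operand order matters once a string enters the expression:

```js
const age = 42;

const label = "Age: " + age;      // number coerced to "42"
const numbersFirst = 1 + 2 + "3"; // 1 + 2 is numeric (3), then "3" is appended
const stringFirst = "1" + 2 + 3;  // string first, so everything concatenates

console.log(label);          // Age: 42
console.log(numbersFirst);   // 33
console.log(stringFirst);    // 123
```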
273
+
224
274
### String Methods
225
275
226
276
Strings provide a whole slew of additional string-specific methods (as properties):