types-grammar: fixing issues with how unicode and length computation are discussed

getify · getify · commit c1e01fc23900 · 2022-08-03T01:16:26.000-05:00
diff --git a/types-grammar/ch1.md b/types-grammar/ch1.md
@@ -299,6 +299,26 @@ console.log(windowsFontsPath);
 | :--- |
 | What about four backslashes `\\\\` in a string literal? Well, that's just two `\\` escape sequences next to each other, so it results in two adjacent backslashes (`\\`) in the underlying string value. You might recognize there's an odd/even rule pattern at play. You should thus be able to deciper any odd (`\\\\\`, `\\\\\\\\\`, etc) or even (`\\\\\\`, `\\\\\\\\\\`, etc) number of backslashes in a string literal. |
 
+#### Line Continuation
+
+The `\` character followed by an actual new-line character (not just literal `n`) is a special case, and it creates what's called a line-continuation:
+
+```js
+greeting = "Hello \
+Friends!";
+
+console.log(greeting);
+// Hello Friends!
+```
+
+As you can see, the new-line at the end of the `greeting = ` line is immediately preceded by a `\`, which allows this string literal to continue onto the subsequent line. Without the escaping `\` before it, a new-line -- the actual new-line, not the `\n` character escape sequence -- appearing in a `"` or `'` delimited string literal would actually produce a JS syntax parsing error.
+
+Because the end-of-line `\` turns the new-line character into a line continuation, the new-line character is omitted from the string, as shown by the `console.log(..)` output.
+
+| NOTE: |
+| :--- |
+| This line-continuation feature is often referred to as "multi-line strings", but I think that's a confusing label. As you can see, the string value itself doesn't have multiple lines; it was only defined across multiple lines via the line continuations. A multi-line string would actually have multiple lines in the underlying value. We'll revisit this topic later in this chapter when we cover Template Literals. |
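+
+To make the distinction concrete, here's a small sketch (the variable names are only illustrative) comparing a line-continued literal with one that uses the `\n` escape sequence:
+
+```js
+const continued = "Hello \
+Friends!";
+const escaped = "Hello\nFriends!";
+
+continued.includes("\n");       // false
+escaped.includes("\n");         // true
+
+continued.split("\n").length;   // 1
+escaped.split("\n").length;     // 2
+```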
+
 ### Multi-Character Escapes
 
 Multi-character escape sequences may be hexadecimal or Unicode sequences.
@@ -360,7 +380,7 @@ Even though JS doesn't care which way such a character is represented in your pr
 
 ##### Unicode Normalization
 
-A further wrinkle in Unicode string handling is that even certain single BMP characters can be represented in different ways.
+Another wrinkle in Unicode string handling is that even certain single BMP characters can be represented in different ways.
 
 For example, the `"é"` character can either be represented as itself (code-point `233`, aka `\xe9` or `\u00e9` or `\u{e9}`), or as the combination of two code-points: the `"e"` character (code-point `101`, aka `\x65`, `\u0065`, `\u{65}`) and the *combining tilde* (code-point `769`, aka `\u0301`, `\u{301}`).
 
@@ -387,10 +407,16 @@ eTilde1 === eTilde2;        // false
 eTilde1 === eTilde3;        // true
 ```
 
+One particular challenge is that you may copy-paste a string with an `"é"` character visible in it, and that character may have been in either the *composed* or *decomposed* form. There's no visual way to tell, yet the underlying string value differs depending on which form was copied:
+
+```js
+"é" === "é";           // false!!
+```
+
 This internal representation difference can be quite challenging if not carefully planned for. Fortunately, JS provides a `normalize(..)` utility method on strings to help:
 
 ```js
-eTilde1 = "é"
+eTilde1 = "é";
 eTilde2 = "\u{e9}";
 eTilde3 = "\u{65}\u{301}";
 
@@ -400,29 +426,27 @@ eTilde2.normalize("NFD") === eTilde3;
 
 The `"NFC"` normalization mode combines adjacent code-points into the *composed* code-point (if possible), whereas the `"NFD"` normalization mode splits a single code-point into its *decomposed* code-points (if possible).
 
-And there can actually be more than two individual *decomposed* code-points that make up a single *composed* code-point; some international language symbols (Chinese, Japanese, etc) are *composed* of three or four code-points layered together!
+And there can actually be more than two individual *decomposed* code-points that make up a single *composed* code-point -- for example, a single character could have several diacritical marks applied to it.
 
 When dealing with Unicode strings that will be compared, sorted, or length analyzed, it's very important to keep Unicode normalization in mind, and use it where necessary.
 
-### Line Continuation
+##### Unicode Grapheme Clusters
 
-The `\` followed by an actual new-line character (not just literal `n`) is a special case, and it creates what's called a line-continuation:
+A final complication of Unicode string handling is the support for clustering of multiple adjacent code-points into a single visually distinct symbol, referred to as a *grapheme* (or a *grapheme cluster*).
+
+An example would be a family emoji such as `"👩‍👩‍👦‍👦"`, which is actually made up of 7 code-points that all cluster/group together into a single visual symbol.
+
+Consider:
 
 ```js
-greeting = "Hello \
-Friends!";
+familyEmoji = "\u{1f469}\u{200d}\u{1f469}\u{200d}\u{1f466}\u{200d}\u{1f466}";
 
-console.log(greeting);
-// Hello Friends!
+familyEmoji;            // 👩‍👩‍👦‍👦
 ```
 
-As you can see, the new-line at the end of the `greeting = ` line is immediately preceded by a `\`, which allows this string literal to continue onto the subsequent line. Without the escaping `\` before it, a new-line -- the actual new-line, not the `\n` character escape sequence -- appearing in a `"` or `'` delimited string literal would actually produce a JS syntax parsing error.
-
-Because the end-of-line `\` turns the new-line character into a line continuation, the new-line character is omitted from the string, as shown by the `console.log(..)` output.
+This emoji is *not* a single registered Unicode code-point, and as such, there's no *normalization* that can be performed to compose these 7 separate code-points into a single entity. The visual rendering logic for such composite symbols is quite complex, well beyond what most JS developers want to embed into their programs. Libraries do exist for handling some of this logic, but they're often large and still don't necessarily cover all of the nuances/variations.
 
-| NOTE: |
-| :--- |
-| This line-continuation feature is often referred to as "multi-line strings", but I think that's a confusing label. As you can see, the string value itself doesn't have multiple lines, it only was defined across multiple lines via the line continuations. A multi-line string would actually have multiple lines in the underlying value. |
+This kind of complexity significantly affects length computations, comparison, sorting, and many other common string-oriented operations.
 
 ### Template Literals
 
diff --git a/types-grammar/ch2.md b/types-grammar/ch2.md
@@ -132,22 +132,6 @@ yourAge;            // 42 <-- unchanged
 
 String values have a number of specific behaviors that every JS developer should be aware of.
 
-### Length Computation
-
-As mentioned in Chapter 1, string values have a `length` property that automatically exposes the length of the string; this property can only be accessed; attempts to set it are silently ignored.
-
-The reported `length` value somewhat corresponds to the number of characters in the string (actually, code-units), but as we saw in Chapter 1, it's more complex when Unicode characters are involved.
-
-Most people visually distinguish symbols as separate characters; this notion of an independent visual symbol is referred to as a *grapheme*. So when counting the "length" of a string, we typically mean that we're counting the number of graphemes.
-
-But that's not how the computer deals with characters.
-
-In JS, each *character* is a code-unit (16 bits), with a code-point value at or below `65535`. The `length` property of a string always counts the number of code-units in the string value, not code-points. A code-unit might represent a single character by itself, or it may be part of a surrogate pair, or it may be combined with an adjacent *combining* symbol. As such, `length` doesn't match the typical notion of counting graphemes.
-
-To obtain a *grapheme length* for a string that matches typical expectations, the string value first needs to be normalized with `normalize("NFC")` (see "Normalizing Unicode" in Chapter 1) to produce *composed* code-units, in case any characters in it were originally stored *decomposed* as separate code-units.
-
-// TODO
-
 ### String Character Access
 
 Though strings are not actually arrays, JS allows `[ .. ]` array-style access of a character at a numeric (`0`-based) index:
@@ -170,20 +154,6 @@ If the value/expression resolves to a number outside the integer range of `0` -
 | :--- |
 |  We'll cover coercion in-depth later in the book. |
 
-### String Concatenation
-
-Two or more string values can be concatenated (combined) into a new string value, using the `+` operator:
-
-```js
-greeting = "Hello, " + "Kyle!";
-
-greeting;               // Hello, Kyle!
-```
-
-The `+` operator will act as a string concatenation if either of the two operands (values on left or right sides of the operator) are already a string.
-
-If one operand is a string and the other is not, the one that's not a string will be coerced to its string representation for the purposes of the concatenation.
-
 ### Character Iteration
 
 Strings are not arrays, but they certainly mimick arrays closely in many ways. One such behavior is that, like arrays, strings are iterables. This means that the characters (code-units) of a string can be iterated individually:
@@ -221,6 +191,86 @@ it.next();      // { value: undefined, done: true }
 | :--- |
 | The specifics of the iterator protocol, including the fact that the `{ value: "e" .. }` result still shows `done: false`, are covered in detail in the "Sync & Async" title of this series. |
 
+### Length Computation
+
+As mentioned in Chapter 1, string values have a `length` property that automatically exposes the length of the string. This property is read-only; attempts to set it are silently ignored (or throw a `TypeError` in strict mode).
+
+The reported `length` value somewhat corresponds to the number of characters in the string (actually, code-units), but as we saw in Chapter 1, it's more complex when Unicode characters are involved.
+
+Most people visually distinguish symbols as separate characters; this notion of an independent visual symbol is referred to as a *grapheme*, or a *grapheme cluster*. So when counting the "length" of a string, we typically mean that we're counting the number of graphemes.
+
+But that's not how the computer deals with characters.
+
+In JS, each *character* is a code-unit (16 bits), with a value at or below `65535`. The `length` property of a string always counts the number of code-units in the string value, not code-points. A code-unit might represent a single character by itself, or it may be part of a surrogate pair, or it may be combined with an adjacent *combining* symbol, or be part of a grapheme cluster. As such, `length` doesn't match the typical notion of counting visual characters/graphemes.
+
+To get closer to an expected/intuitive *grapheme length* for a string, the string value first needs to be normalized with `normalize("NFC")` (see "Unicode Normalization" in Chapter 1) to produce *composed* code-points (where possible), in case any characters were originally stored *decomposed* as separate code-points.
+
+For example:
+
+```js
+favoriteItem = "tele\u{301}fono";   // renders as "teléfono"
+favoriteItem.length;            // 9 -- uh oh!
+
+favoriteItem = favoriteItem.normalize("NFC");
+favoriteItem.length;            // 8 -- phew!
+```
+
+Unfortunately, as we saw in Chapter 1, we'll still have the possibility of characters with a code-point greater than `65535`, which thus need a surrogate pair to be represented. Such characters will count double in the `length`:
+
+```js
+// "☎" === "\u260E"
+oldTelephone = "☎";
+oldTelephone.length;            // 1
+
+// "📱" === "\u{1F4F1}" === "\uD83D\uDCF1"
+cellphone = "📱";
+cellphone.length;               // 2 -- oops!
+```
+
+So what do we do?
+
+One fix is to use character iteration (via the `...` operator) as we saw in the previous section, since it automatically returns each combined character from a surrogate pair:
+
+```js
+cellphone = "📱";
+cellphone.length;               // 2 -- oops!
+[ ...cellphone ].length;        // 1 -- phew!
+```
+
+But, unfortunately, grapheme clusters (as explained in Chapter 1) throw yet another wrench into a string's length computation. For example, if we take the thumbs-down emoji (`"\u{1F44E}"`) and add to it the skin-tone modifier for medium-dark skin (`"\u{1F3FE}"`), we get:
+
+```js
+// "👎🏾" === "\u{1F44E}\u{1F3FE}"
+thumbsDown = "👎🏾";
+
+thumbsDown.length;              // 4 -- oops!
+[ ...thumbsDown ].length;       // 2 -- oops!
+```
+
+As you can see, these are two distinct code-points (each stored as its own surrogate pair, but not forming a pair with each other) that, by virtue of their ordering and adjacency, cause the computer's Unicode rendering to draw the thumbs-down symbol but with a darker skin tone than its default. Iteration thus counts `2` code-points, while `length` reports `4` code-units.
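+
+To inspect what iteration actually sees, here's a brief sketch that lists each code-point of the string in hexadecimal (the `codePoints` name is just for illustration):
+
+```js
+const thumbsDown = "\u{1F44E}\u{1F3FE}";
+
+// one entry per code-point, not per code-unit
+const codePoints = [ ...thumbsDown ].map(
+    ch => ch.codePointAt(0).toString(16)
+);
+
+codePoints;                     // [ "1f44e", "1f3fe" ]
+thumbsDown.length;              // 4
+```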
+
+| WARNING: |
+| :--- |
+| As a Twitter user, you might expect to be able to put 280 thumbs-down emoji into a single tweet, since each looks like a single character. But Twitter counts each such emoji as two characters, so you only get 140. Surprisingly, Twitter counts the `"👎"` (default thumbs-down), `"👎🏾"` (dark-skin tone thumbs-down), and even the `"👩‍👩‍👦‍👦"` (family emoji grapheme cluster) all as two characters each, even though their string lengths (from JS's perspective) are `2`, `4`, and `11`, respectively. Twitter must have some sort of custom Unicode handling implemented in its tools. |
+
+It would take replicating most of a platform's complex Unicode rendering logic to be able to recognize such clusters of code-points as a single "character" for length-counting sake. There are libraries that purport to do so, but they're not necessarily perfect, and they come at a hefty cost in terms of extra code.
+
+Counting the "length" of a string to match our human intuitions is a remarkably challenging task. We can get acceptable approximations in many cases, but there are plenty of other cases that confound our programs.
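+
+As one possible approximation, newer JS environments expose a built-in `Intl.Segmenter` API that can split a string into grapheme clusters. The following is only a sketch: `graphemeLength(..)` is a hypothetical helper name, it assumes the runtime provides `Intl.Segmenter`, and the results depend on the Unicode data that runtime ships with:
+
+```js
+// assumption: Intl.Segmenter is available in this environment
+function graphemeLength(str) {
+    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
+    // each segment is one grapheme cluster
+    return [ ...segmenter.segment(str) ].length;
+}
+
+graphemeLength("tele\u{301}fono");              // 8
+graphemeLength("\u{1F44E}\u{1F3FE}");           // 1
+graphemeLength(
+    "\u{1f469}\u{200d}\u{1f469}\u{200d}\u{1f466}\u{200d}\u{1f466}"
+);                                              // 1
+```
+
+Even so, this counts clusters as defined by the Unicode segmentation rules, which may not always match what a particular font or platform actually draws.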
+
+### String Concatenation
+
+Two or more string values can be concatenated (combined) into a new string value, using the `+` operator:
+
+```js
+greeting = "Hello, " + "Kyle!";
+
+greeting;               // Hello, Kyle!
+```
+
+The `+` operator will act as a string concatenation if either of the two operands (values on left or right sides of the operator) are already a string.
+
+If one operand is a string and the other is not, the one that's not a string will be coerced to its string representation for the purposes of the concatenation.
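+
+As a quick illustration of that coercion (the variable names here are only illustrative), note how the result depends on where the string operand appears, since `+` is evaluated left to right:
+
+```js
+const age = 42;
+
+const label = "Age: " + age;    // number coerced to "42"
+label;                          // Age: 42
+
+const sumFirst = 1 + 2 + "3";   // 1 + 2 is numeric addition, then concatenation
+sumFirst;                       // 33
+
+const strFirst = "1" + 2 + 3;   // concatenation at every step
+strFirst;                       // 123
+```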
+
 ### String Methods
 
 Strings provide a whole slew of additional string-specific methods (as properties):