Commit 3260c68

types-grammar, ch1: filling out the rest of the discussion on string values
1 parent d1bb48a commit 3260c68

2 files changed

+115
-13
lines changed

types-grammar/ch1.md

Lines changed: 114 additions & 12 deletions
@@ -9,7 +9,7 @@ In Chapter 1 of the "Objects & Classes" book of this series, we confronted the c
99

1010
Here, we'll look at the core value types of JS, specifically the non-object types called *primitives*.
1111

12-
## Core Values
12+
## Built-in Values
1313

1414
JS provides seven built-in, primitive (non-object) value types:
1515

@@ -198,11 +198,35 @@ myName = "Kyle";
198198
199199
Strings can be delimited by double-quotes (`"`), single-quotes (`'`), or back-ticks (`` ` ``). The ending delimiter must always match the starting delimiter.
200200
201-
Strings have an intrinsic length which corresponds to how many code-points they contain. This does not necessarily correspond to the number of visible characters you type between the start and end delimiters (aka, the string literal). It can sometimes be a little confusing to keep straight the difference between a string literal and the underlying string value, so pay close attention.
201+
Strings have an intrinsic length which corresponds to how many code-units they contain.
202202
203-
If `"` or `'` are used to delimit a string literal, the contents are only parsed for *character-escape sequences*: `\` followed by one or more characters that JS recognizes and parses with special meaning. Any other characters in a string that don't parse as escape-sequences (single-character or multi-character), are inserted as-is into the string value.
203+
```js
204+
myName = "Kyle";
205+
206+
myName.length; // 4
207+
```
208+
209+
This does not necessarily correspond to the number of visible characters you type between the start and end delimiters (aka, the string literal). It can sometimes be a little confusing to keep straight the difference between a string literal and the underlying string value, so pay close attention.
210+
211+
#### JS Character Encodings
212+
213+
What type of character encoding does JS use for string characters?
204214
205-
#### Single-Character Escapes
215+
One might assume UTF-8 (8-bit) or UTF-16 (16-bit). It's actually more complicated, because you also need to consider UCS-2 (2-byte Universal Character Set), which is similar to UTF-16, but not quite the same. [^UTFUCS]
216+
217+
The first 65,536 code points in Unicode (`0` through `65535`, or `FFFF` in hexadecimal) are called the BMP (Basic Multilingual Plane). All the rest of the code points are grouped into 16 so-called "supplementary planes" or "astral planes". When representing Unicode characters from the BMP, it's pretty straightforward: each one fits in a single 16-bit code unit.
218+
219+
But when representing extended characters outside the BMP, JS actually represents each such character's code-point as a pairing of two separate code units, called *surrogate halves*.
220+
221+
For example, the Unicode code point `127878` (hexadecimal `1F386`) is `🎆` (fireworks symbol). JS stores this as two surrogate-half code units, `U+D83C` and `U+DF86`.
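To see where those two code units come from, here's a rough sketch of the UTF-16 surrogate arithmetic. The `toSurrogateHalves(..)` helper is a hypothetical name used only for illustration; it's not something JS provides, and it assumes its input is a code point above `FFFF`.

```js
// Hypothetical helper (not built into JS): splits a non-BMP code point
// into its two UTF-16 surrogate code units.
function toSurrogateHalves(codePoint) {
    const offset = codePoint - 0x10000;       // 20-bit value
    const high = 0xD800 + (offset >> 10);     // top 10 bits
    const low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits
    return [ high, low ];
}

const [ high, low ] = toSurrogateHalves(0x1F386);

high.toString(16);              // "d83c"
low.toString(16);               // "df86"

// the built-in string methods report the same code units
"🎆".charCodeAt(0) === high;    // true
"🎆".charCodeAt(1) === low;     // true
```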
222+
223+
This has implications for the length of strings, because a single visible character like the `🎆` fireworks symbol, when in a JS string, is counted as 2 characters for the purposes of the string length!
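A quick way to observe this is to compare `length` against a count taken by iterating the string, since string iteration (used by the `...` spread below) walks by code point rather than by code unit. The `fireworks` variable name is just for illustration:

```js
const fireworks = "🎆";

fireworks.length;             // 2 (code units)
[ ...fireworks ].length;      // 1 (code points)

fireworks.codePointAt(0);     // 127878
```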
224+
225+
We'll revisit Unicode characters shortly.
226+
227+
#### Escape Sequences
228+
229+
If `"` or `'` is used to delimit a string literal, the contents are only parsed for *character-escape sequences*: `\` followed by one or more characters that JS recognizes and parses with special meaning. Any other characters in a string that don't parse as escape-sequences (single-character or multi-character) are inserted as-is into the string value.
206230

207231
For single-character escape sequences, the following characters are recognized after a `\`: `bfnrtv0'"\`. For example, `\n` (new-line), `\t` (tab), etc.
208232
@@ -235,12 +259,55 @@ console.log(windowsDriveLocation);
235259
236260
Multi-character escape sequences may be hexadecimal or Unicode sequences.
237261
238-
Hexidecimal escape sequences are used to encode any of the base ASCII characters (codes 0-255), and look like `\x` followed by exactly two hexidecimal characters (`0-9` and `a-f` / `A-F` -- case insensitive). For example, the escape-sequence `\xA9` (or `\xa9`) corresponds to the ASCII character with code-point `169`: `©` (copyright symbol).
262+
Hexadecimal escape sequences are used to encode any of the characters with code-points 0-255 (base ASCII plus the extended Latin-1 range), and look like `\x` followed by exactly two hexadecimal characters (`0-9` and `a-f` / `A-F` -- case insensitive). For example, `A9` or `a9` is decimal value `169`, which corresponds to:
239263
240-
Unicode escape sequences encode any of the characters in the unicode set whose code-point values are from 0-65535, and look like `\u` followed by exactly four hexidecimal characters. For example, the escape-sequence `\u00A9` (or `\u00a9`) corresponds to that same `©` symbol, while `\u263A` (or `\u263a`) corresponds to the unicode character with code-point `9786`: `☺` (smiley face symbol).
264+
```js
265+
copyright = "\xA9"; // or "\xa9"
266+
267+
console.log(copyright); // ©
268+
```
269+
270+
For any normal character that can be typed on a keyboard, such as `"a"`, it's usually most readable to just specify the literal character, as opposed to a more obfuscated hexadecimal representation:
271+
272+
```js
273+
"a" === "\x61"; // true
274+
```
275+
276+
##### Unicode
277+
278+
Unicode escape sequences encode any of the characters in the Unicode set whose code-point values range from 0-65535, and look like `\u` followed by exactly four hexadecimal characters. For example, the escape-sequence `\u00A9` (or `\u00a9`) corresponds to that same `©` symbol, while `\u263A` (or `\u263a`) corresponds to the Unicode character with code-point `9786`: `☺` (smiley face symbol).
241279
242280
When any character-escape sequence (regardless of length) is recognized, the single character it represents is inserted into the string, rather than the original separate characters. So, in the string `"\u263A"`, there's only one (smiley) character, not six individual characters.
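As a small illustration (the variable names are just placeholders), we can check the length of such strings directly, and confirm that the two-digit and four-digit escape forms for `©` produce the same value:

```js
const smiley = "\u263A";

smiley.length;                              // 1
smiley.charCodeAt(0);                       // 9786
smiley === String.fromCharCode(0x263A);     // true

const copyright = "\xA9";

copyright.length;                           // 1
copyright === "\u00A9";                     // true
```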
243281
282+
Unicode code-points can go well above `65535` (`FFFF` in hexadecimal), up to a maximum of `1114111` (`10FFFF` in hexadecimal). For example, `1F4A9` is decimal code-point `128169`, which corresponds to the funny `💩` (pile of poo) character.
283+
284+
But `"\u1F4A9"` wouldn't work as expected, since it would be parsed as the Unicode escape sequence `\u1F4A`, followed by just the literal `9` character. To address this limitation, a variation of Unicode escape sequences was introduced in ES6, to allow an arbitrary number of hexadecimal characters after the `\u`, by surrounding them with `{ .. }` curly braces:
285+
286+
```js
287+
myReaction = "\u{1F4A9}";
288+
289+
console.log(myReaction);
290+
// 💩
291+
```
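To make the difference between the two forms concrete, here's a short comparison. The `broken` and `fixed` names are only for illustration:

```js
const broken = "\u1F4A9";

broken.length;                              // 2
broken === "\u1F4A" + "9";                  // true
broken[1];                                  // "9"

const fixed = "\u{1F4A9}";

fixed.codePointAt(0);                       // 128169
fixed === String.fromCodePoint(0x1F4A9);    // true
```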
292+
293+
Recall the earlier discussion of extended (non-BMP) Unicode characters and their *surrogate halves*? The same `💩` could also be defined with the explicit code-units:
294+
295+
```js
296+
myReaction = "\uD83D\uDCA9";
297+
298+
console.log(myReaction);
299+
// 💩
300+
```
301+
302+
All three representations of this same character are stored internally by JS identically and are indistinguishable:
303+
304+
```js
305+
"💩" === "\u{1F4A9}"; // true
306+
"\u{1F4A9}" === "\uD83D\uDCA9"; // true
307+
```
308+
309+
Even though JS doesn't care which way such a character is represented, consider the readability differences carefully when authoring your code.
310+
244311
#### Line Continuation
245312
246313
The `\` followed by an actual new-line character (not just literal `n`) is a special case, and it creates what's called a line-continuation:
@@ -250,15 +317,48 @@ greeting = "Hello \
250317
Friends!";
251318

252319
console.log(greeting);
253-
// Hello
254-
// Friends!
320+
// Hello Friends!
255321
```
256322
257-
As you can see, the new-line at the end of the `greeting = ` line is immediately preceded by a `\`, which allows this string literal to continue onto the subsequent line. Without the escaping `\` before it, a new-line appearing in a `"` or `'` delimited string literal would actually produce a JS syntax parsing error.
323+
As you can see, the new-line at the end of the `greeting = ` line is immediately preceded by a `\`, which allows this string literal to continue onto the subsequent line. Without the escaping `\` before it, a new-line -- the actual new-line, not the `\n` character escape sequence -- appearing in a `"` or `'` delimited string literal would actually produce a JS syntax parsing error.
258324
259-
The new-line itself is still in the string value.
325+
Because the end-of-line `\` turns the new-line character into a line continuation, the new-line character is omitted from the string, as shown by the `console.log(..)` output.
260326
261-
// TODO
327+
| NOTE: |
328+
| :--- |
329+
| This line-continuation feature is often referred to as "multi-line strings", but I think that's a confusing label. As you can see, the string value itself doesn't have multiple lines; it was only defined across multiple lines via the line continuations. A multi-line string would actually have multiple lines in the underlying value. |
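
To illustrate that distinction, here's a sketch comparing a line-continued string against one whose value really does contain a new-line. The `continued` and `multiLine` names are placeholders, and note that the second line of the continued literal starts at the very beginning of the line, since any leading whitespace there would end up in the string:

```js
const continued = "Hello \
Friends!";

const multiLine = "Hello\nFriends!";

continued === "Hello Friends!";     // true
continued.includes("\n");           // false

multiLine.includes("\n");           // true
multiLine.split("\n").length;       // 2
```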
330+
331+
#### Template Literals
332+
333+
I mentioned earlier that strings can alternately be delimited with `` `..` `` back-ticks:
334+
335+
```js
336+
myName = `Kyle`;
337+
```
338+
339+
All the same rules for character encodings, character escape sequences, and lengths apply to these types of strings.
340+
341+
However, the contents of these template (string) literals are additionally parsed for a special delimiter sequence `${ .. }`, which marks an expression to evaluate and interpolate into the string value at that location:
342+
343+
```js
344+
myName = `Kyle`;
345+
346+
greeting = `Hello, ${myName}!`;
347+
348+
console.log(greeting); // Hello, Kyle!
349+
```
350+
351+
Everything between the `{ .. }` in such a template literal is an arbitrary JS expression. It can be a simple variable, a complex expression such as a function call, or anything in between. Statements (like `if` or `for`) are not allowed there.
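For instance, here are a few different kinds of expressions being interpolated. The `items` array and the variable names are made up for this sketch:

```js
const myName = "Kyle";
const items = [ "apples", "pears" ];

const sum = `2 + 3 = ${ 2 + 3 }`;
const shout = `Hello, ${ myName.toUpperCase() }!`;
const count = `You have ${ items.length } item${ items.length === 1 ? "" : "s" }.`;

console.log(sum);       // 2 + 3 = 5
console.log(shout);     // Hello, KYLE!
console.log(count);     // You have 2 items.
```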
352+
353+
| TIP: |
354+
| :--- |
355+
| This feature is commonly called "template literals" or "template strings", but I think that's confusing. "Template" usually refers, in programming contexts, to a reusable definition that can be re-evaluated with different data. For example, *template engines* for pages, email templates for newsletter campaigns, etc. This JS feature is not re-usable. It's a literal, and it produces a single, immediate value (usually a string). You can put such a value in a function, and call the function multiple times. But then the function is acting as the template, not the literal itself. I prefer instead to refer to this feature as *interpolated literals*, or the funny, shortened *interpoliterals*, as I think this name is more accurately descriptive. |
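
Here's a minimal sketch of that point, using a hypothetical `greet(..)` function. The literal is evaluated anew on each call, so it's the function that provides the re-use:

```js
// the function acts as the "template"; the literal inside it
// just produces one immediate string value per call
function greet(name) {
    return `Hello, ${ name }!`;
}

greet("Kyle");          // "Hello, Kyle!"
greet("Friends");       // "Hello, Friends!"
```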
356+
357+
Some JS developers believe that this style of string literal is preferable to use for *all* strings, even if you're not doing any expression interpolation. I disagree. I think it should only be used when interpolating, and classic `".."` or `'..'` delimited strings should be used for non-interpolated string definitions.
358+
359+
Moreover, there are a few places where `` `..` `` style strings are disallowed. For example, the `"use strict"` pragma cannot use back-ticks, or the pragma will be silently ignored (and thus the program accidentally runs in non-strict mode). Also, this style of strings cannot be used in quoted property names of object literals, or in the ES Module `import .. from ..` module-specifier clause.
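The first two of those restrictions can be observed with a rough sketch like the following. It uses `new Function(..)` only so that each snippet is parsed in isolation, regardless of the strictness of the surrounding program; the variable names are placeholders. (The `import` restriction isn't shown, since it can only be demonstrated in a module file.)

```js
// a function body's own pragma decides its strictness
const quoted = new Function('"use strict"; return this === undefined;');
const backticked = new Function('`use strict`; return this === undefined;');

quoted();           // true  (strict mode: `this` is undefined)
backticked();       // false (pragma ignored: non-strict mode)

// back-ticks as a quoted property name fail to parse at all
let propertyNameError;
try {
    new Function('return { `a`: 1 };');
}
catch (err) {
    propertyNameError = err;
}

propertyNameError instanceof SyntaxError;   // true
```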
360+
361+
My advice: use `` `..` `` delimited strings where allowed, and only when interpolation is needed, but keep using `".."` or `'..'` delimited strings for all other strings.
262362
263363
### Number Values
264364
@@ -560,4 +660,6 @@ Here, the `myAge` and `yourAge` variables each have their own copy of the number
560660
561661
// TODO
562662
563-
[^IEEE754]: IEEE-754; https://en.wikipedia.org/wiki/IEEE_754; Accessed July 2022
663+
[^UTFUCS]: "JavaScript’s internal character encoding: UCS-2 or UTF-16?"; Mathias Bynens; January 20 2012; https://mathiasbynens.be/notes/javascript-encoding ; Accessed July 2022
664+
665+
[^IEEE754]: "IEEE-754"; https://en.wikipedia.org/wiki/IEEE_754 ; Accessed July 2022

types-grammar/toc.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@
99
* Foreword
1010
* Preface
1111
* Chapter 1: Primitives
12-
* Core Values
12+
* Built-in Values
1313
* Value Immutability
1414
* Assignments Are Value Copies
1515
* TODO
