@@ -60,7 +60,8 @@ The syntax specification takes into account the following design restrictions:
6060 control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters
6161 (U+FDD0 through U+FDEF and U+<i>n</i>FFFE and U+<i>n</i>FFFF where <i>n</i> is 0x0 through 0x10),
6262 private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and
63- U+100000 through U+10FFFD), unassigned code points, and other potentially confusing content.
63+ U+100000 through U+10FFFD), unassigned code points, unpaired surrogates (U+D800 through U+DFFF),
64+ and other potentially confusing content.
6465
6566## Messages and their Syntax
6667
@@ -113,6 +114,22 @@ A **_<dfn>local variable</dfn>_** is a _variable_ created as the result of a _lo
113114> In particular, it avoids using quote characters common to many file formats and formal languages
114115> so that these do not need to be escaped in the body of a _message_.
115116
117+ > [!NOTE]
118+ > _Text_ and _quoted literals_ allow unpaired surrogate code points
119+ > (`U+D800` to `U+DFFF`).
120+ > This is for compatibility with formats or data structures
121+ > that use the UTF-16 encoding
122+ > and do not check for unpaired surrogates.
123+ > (Strings in Java or JavaScript are examples of this.)
124+ > These code points SHOULD NOT be used in a _message_.
125+ > Unpaired surrogate code points are likely an indication of mistakes
126+ > or errors in the creation, serialization, or processing of the _message_.
127+ > Many processes will convert them to
128+ > � U+FFFD REPLACEMENT CHARACTER
129+ > during processing or display.
130+ > Implementations not based on UTF-16 might not be able to represent
131+ > a _message_ containing such code points.
132+
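The note above can be made concrete with a small sketch. Assuming a message is available as a sequence of UTF-16 code units (as it would be inside a Java or JavaScript string), the following Python finds unpaired surrogates and shows the kind of U+FFFD substitution that many processes apply. The function names `find_unpaired_surrogates` and `replace_unpaired_surrogates` are hypothetical and are not defined by the specification.

```python
from typing import List, Sequence


def find_unpaired_surrogates(units: Sequence[int]) -> List[int]:
    """Return the indices of unpaired surrogates in a sequence of UTF-16 code units."""
    unpaired = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: well-formed only if a low surrogate follows it.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # skip the complete pair
                continue
            unpaired.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate that was not consumed as the second half of a pair.
            unpaired.append(i)
        i += 1
    return unpaired


def replace_unpaired_surrogates(units: Sequence[int]) -> List[int]:
    """Replace each unpaired surrogate with U+FFFD REPLACEMENT CHARACTER."""
    bad = set(find_unpaired_surrogates(units))
    return [0xFFFD if i in bad else unit for i, unit in enumerate(units)]
```

A tool that produces or lints messages could run a check like this and warn when the returned list is not empty, since the syntax permits such code points but the note discourages them.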
116133> [!NOTE]
117134> In general (and except where required by the syntax), whitespace carries no meaning in the structure
118135> of a _message_. While many of the examples in this spec are written on multiple lines, the formatting
@@ -271,8 +288,8 @@ A _quoted pattern_ MAY be empty.
271288### Text
272289
273290**_<dfn>text</dfn>_** is the translatable content of a _pattern_.
274- Any Unicode code point is allowed, except for U+0000 NULL
275- and the surrogate code points U+D800 through U+DFFF inclusive.
291+ Any Unicode code point is allowed, except for U+0000 NULL.
292+
276293The characters U+005C REVERSE SOLIDUS `\`,
277294U+007B LEFT CURLY BRACKET `{`, and U+007D RIGHT CURLY BRACKET `}`
278295MUST be escaped as `\\`, `\{`, and `\}` respectively.
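As an illustration of the escaping rule for _text_ stated above, here is a minimal Python sketch. It covers only what these lines state: the three characters that MUST be escaped and the exclusion of U+0000 NULL. The helper name `escape_text` is hypothetical, and the sketch does not attempt to handle any other part of the syntax.

```python
def escape_text(raw: str) -> str:
    """Escape a raw string for use as text inside a pattern.

    Backslash, left curly bracket, and right curly bracket are prefixed
    with a backslash. U+0000 NULL is rejected because it is not allowed.
    """
    out = []
    for ch in raw:
        if ch == "\x00":
            raise ValueError("U+0000 NULL is not allowed in text")
        if ch in "\\{}":
            out.append("\\")  # add the escaping backslash
        out.append(ch)
    return "".join(out)
```

For example, the raw string `a{b}\c` becomes `a\{b\}\\c`. Under the revised rule, surrogate code points pass through unchanged.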
@@ -298,10 +315,14 @@ content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
298315 / %x41-5B ; omit \ (%x5C)
299316 / %x5D-7A ; omit { | } (%x7B-7D)
300317 / %x7E-2FFF ; omit IDEOGRAPHIC SPACE (%x3000)
301- / %x3001-D7FF ; omit surrogates
302- / %xE000-10FFFF
318+ / %x3001-10FFFF ; allowing surrogates is intentional
303319```
304320
321+ > [!NOTE]
322+ > Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive)
323+ > are allowed for compatibility with UTF-16 based implementations
324+ > that do not check for this encoding error.
325+
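To see what the ABNF change affects, the sketch below compares the old and new upper ranges of `content-char` shown in this hunk. It models only the ranges above U+3000 that the diff touches, not the full production. The function names `allowed_before` and `allowed_after` are hypothetical.

```python
def allowed_before(cp: int) -> bool:
    """Old upper ranges: %x3001-D7FF and %xE000-10FFFF (surrogates omitted)."""
    return 0x3001 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def allowed_after(cp: int) -> bool:
    """New upper range: %x3001-10FFFF (surrogates intentionally included)."""
    return 0x3001 <= cp <= 0x10FFFF


# Collect every code point whose status differs between the two grammars.
changed = [
    cp for cp in range(0x3001, 0x110000)
    if allowed_before(cp) != allowed_after(cp)
]
```

Under these assumptions, `changed` holds exactly the 2048 surrogate code points from U+D800 through U+DFFF, so the edit widens the grammar without altering the status of any other code point in that range.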
305326When a _pattern_ is quoted by embedding the _pattern_ in curly brackets, the
306327resulting _message_ can be embedded into
307328various formats regardless of the container's whitespace trimming rules.
@@ -688,8 +709,7 @@ A _literal_ can appear
688709as a _key_ value,
689710as the _operand_ of a _literal-expression_,
690711or in the value of an _option_.
691- A _literal_ MAY include any Unicode code point
692- except for U+0000 NULL or the surrogate code points U+D800 through U+DFFF.
712+ A _literal_ MAY include any Unicode code point except for U+0000 NULL.
693713
694714All code points are preserved.
695715
@@ -711,6 +731,11 @@ A **_<dfn>quoted literal</dfn>_** begins and ends with U+007C VERTICAL LINE `|`.
711731The characters `\` and `|` within a _quoted literal_ MUST be
712732escaped as `\\` and `\|`.
713733
734+ > [!NOTE]
735+ > Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive)
736+ > are allowed in _quoted literals_ for compatibility with UTF-16 based
737+ > implementations that do not check for this encoding error.
738+
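The quoting and escaping rules for a _quoted literal_ can be sketched in a few lines of Python. This is a minimal illustration that assumes the value arrives as an ordinary string. The helper name `quote_literal` is hypothetical and does not come from the specification.

```python
def quote_literal(value: str) -> str:
    """Serialize a value as a quoted literal.

    Backslash and vertical line are escaped with a backslash, and the
    result is wrapped in vertical lines. U+0000 NULL is rejected because
    a literal cannot contain it.
    """
    if "\x00" in value:
        raise ValueError("U+0000 NULL is not allowed in a literal")
    out = []
    for ch in value:
        if ch in "\\|":
            out.append("\\")  # add the escaping backslash
        out.append(ch)
    return "|" + "".join(out) + "|"
```

For example, the value `a|b\c` is serialized as `|a\|b\\c|`. Surrogate code points are passed through untouched, which matches the note above, although a stricter tool might choose to warn about them.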
714739An **_<dfn>unquoted literal</dfn>_** is a _literal_ that does not require the `|`
715740quotes around it to be distinct from the rest of the _message_ syntax.
716741An _unquoted literal_ MAY be used when the content of the _literal_