Commit 1baac19

mihnita and aphillips authored
Allow surrogates in content, issue #895 (#906)
* Allow surrogates in content, issue #895
* Grammar and typos, linkify terms, make into a note, and fix 2119 keywords

  Thanks Addison!

  Co-authored-by: Addison Phillips <[email protected]>
* Not using "localizable elements"

  Co-authored-by: Addison Phillips <[email protected]>
* Keep syntax.md in sync with message.abnf
* Added note about surrogates to quoted literals
* Moved the note about surrogates from Security Considerations to The Message
* Update spec/syntax.md
* Update spec/syntax.md
* Italicize in a couple of places
* Implemented more (all?) feedback from review

---------

Co-authored-by: Addison Phillips <[email protected]>
1 parent 0eb6109 commit 1baac19

3 files changed, +35 -13 lines changed

spec/appendices.md

Lines changed: 2 additions & 4 deletions
@@ -14,12 +14,10 @@ host environments, their serializations and resource formats,
 that might be sufficient to prevent most problems.
 However, MessageFormat itself does not supply such a restriction.

-MessageFormat _messages_ permit nearly all Unicode code points,
-with the exception of surrogates,
+MessageFormat _messages_ permit nearly all Unicode code points
 to appear in _literals_, including the text portions of a _pattern_.
 This means that it can be possible for a _message_ to contain invisible characters
-(such as bidirectional controls,
-ASCII control characters in the range U+0000 to U+001F,
+(such as bidirectional controls, ASCII control characters in the range U+0000 to U+001F,
 or characters that might be interpreted as escapes or syntax in the host format)
 that abnormally affect the display of the _message_
 when viewed as source code, or in resource formats or translation tools,
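
The paragraph changed above warns that a _message_ can carry invisible characters. As a rough illustration only, a tool could flag them before a message reaches translators. The sketch below is not part of the spec: the function name `find_suspicious` and the category labels are hypothetical, and the bidirectional-control set is the Unicode `Bidi_Control` list as I understand it.

```python
# Hypothetical helper, not defined by MessageFormat: report code points that
# the Security Considerations text calls out as potentially invisible.

# Unicode Bidi_Control characters: ALM, LRM, RLM, LRE..RLO, LRI..PDI.
BIDI_CONTROLS = {0x061C, 0x200E, 0x200F,
                 *range(0x202A, 0x202F),   # U+202A..U+202E
                 *range(0x2066, 0x206A)}   # U+2066..U+2069


def find_suspicious(message):
    """Return (index, 'U+XXXX', kind) for each flagged code point."""
    hits = []
    for i, ch in enumerate(message):
        cp = ord(ch)
        if cp <= 0x1F:
            kind = "ascii-control"      # range named in the spec text
        elif cp in BIDI_CONTROLS:
            kind = "bidi-control"
        elif 0xD800 <= cp <= 0xDFFF:
            kind = "surrogate"          # now permitted by this commit
        else:
            continue
        hits.append((i, f"U+{cp:04X}", kind))
    return hits


if __name__ == "__main__":
    for hit in find_suspicious("a\u202eb\td\ud800"):
        print(hit)
```

Flagging is deliberately separate from rejecting: the grammar allows these code points, so a checker like this can only advise.
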

spec/message.abnf

Lines changed: 1 addition & 2 deletions
@@ -76,8 +76,7 @@ content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
 / %x41-5B ; omit \ (%x5C)
 / %x5D-7A ; omit { | } (%x7B-7D)
 / %x7E-2FFF ; omit IDEOGRAPHIC SPACE (%x3000)
-/ %x3001-D7FF ; omit surrogates
-/ %xE000-10FFFF
+/ %x3001-10FFFF ; allowing surrogates is intentional

 ; Character escapes
 escaped-char = backslash ( backslash / "{" / "|" / "}" )
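
To check what the merged range actually admits, the two removed alternatives can be compared with the single replacement. This is a minimal sketch covering only the ranges visible in this hunk, not the full `content-char` rule; the names `OLD_TAIL`, `NEW_TAIL`, and `newly_allowed` are made up for the illustration.

```python
# Only the alternatives touched by this hunk, as (low, high) inclusive pairs.
OLD_TAIL = [(0x3001, 0xD7FF), (0xE000, 0x10FFFF)]  # before: surrogates omitted
NEW_TAIL = [(0x3001, 0x10FFFF)]                    # after: one merged range


def in_ranges(cp, ranges):
    """True if code point cp falls inside any inclusive (low, high) pair."""
    return any(low <= cp <= high for low, high in ranges)


def newly_allowed():
    """Code points matched by the new alternatives but not by the old ones."""
    return [cp for cp in range(0x3001, 0x110000)
            if in_ranges(cp, NEW_TAIL) and not in_ranges(cp, OLD_TAIL)]


if __name__ == "__main__":
    added = newly_allowed()
    # prints: U+D800..U+DFFF, 2048 code points
    print(f"U+{added[0]:04X}..U+{added[-1]:04X}, {len(added)} code points")
```

The difference is exactly the surrogate block, which matches the new ABNF comment.
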

spec/syntax.md

Lines changed: 32 additions & 7 deletions
@@ -60,7 +60,8 @@ The syntax specification takes into account the following design restrictions:
 control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters
 (U+FDD0 through U+FDEF and U+<i>n</i>FFFE and U+<i>n</i>FFFF where <i>n</i> is 0x0 through 0x10),
 private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and
-U+100000 through U+10FFFD), unassigned code points, and other potentially confusing content.
+U+100000 through U+10FFFD), unassigned code points, unpaired surrogates (U+D800 through U+DFFF),
+and other potentially confusing content.

 ## Messages and their Syntax

@@ -113,6 +114,22 @@ A **_<dfn>local variable</dfn>_** is a _variable_ created as the result of a _lo
 > In particular, it avoids using quote characters common to many file formats and formal languages
 > so that these do not need to be escaped in the body of a _message_.

+> [!NOTE]
+> _Text_ and _quoted literals_ allow unpaired surrogate code points
+> (`U+D800` to `U+DFFF`).
+> This is for compatibility with formats or data structures
+> that use the UTF-16 encoding
+> and do not check for unpaired surrogates.
+> (Strings in Java or JavaScript are examples of this.)
+> These code points SHOULD NOT be used in a _message_.
+> Unpaired surrogate code points are likely an indication of mistakes
+> or errors in the creation, serialization, or processing of the _message_.
+> Many processes will convert them to
+> &#xfffd; U+FFFD REPLACEMENT CHARACTER
+> during processing or display.
+> Implementations not based on UTF-16 might not be able to represent
+> a _message_ containing such code points.
+
 > [!NOTE]
 > In general (and except where required by the syntax), whitespace carries no meaning in the structure
 > of a _message_. While many of the examples in this spec are written on multiple lines, the formatting
@@ -271,8 +288,8 @@ A _quoted pattern_ MAY be empty.
 ### Text

 **_<dfn>text</dfn>_** is the translateable content of a _pattern_.
-Any Unicode code point is allowed, except for U+0000 NULL
-and the surrogate code points U+D800 through U+DFFF inclusive.
+Any Unicode code point is allowed, except for U+0000 NULL.
+
 The characters U+005C REVERSE SOLIDUS `\`,
 U+007B LEFT CURLY BRACKET `{`, and U+007D RIGHT CURLY BRACKET `}`
 MUST be escaped as `\\`, `\{`, and `\}` respectively.
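
The escaping rule in the context lines above is small enough to sketch directly. The helper names `escape_text` and `escape_quoted_literal` are hypothetical, and the second function relies on the _quoted literal_ rule quoted further down in the same file (`\` and `|` escaped as `\\` and `\|`). Neither function validates anything else about a _message_.

```python
def escape_text(raw):
    """Escape the three characters that MUST be escaped in text: \\ { }."""
    # The backslash goes first so the backslashes added for { and }
    # are not themselves doubled.
    return (raw.replace("\\", "\\\\")
               .replace("{", "\\{")
               .replace("}", "\\}"))


def escape_quoted_literal(raw):
    """Escape backslash and | for use between the | delimiters."""
    return raw.replace("\\", "\\\\").replace("|", "\\|")


if __name__ == "__main__":
    print(escape_text("a{b}\\c"))          # a\{b\}\\c
    print(escape_quoted_literal("a|b\\"))  # a\|b\\
```

Code points outside these few are passed through untouched, including any unpaired surrogates, consistent with the rule that only U+0000 NULL is excluded.
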
@@ -298,10 +315,14 @@ content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
 / %x41-5B ; omit \ (%x5C)
 / %x5D-7A ; omit { | } (%x7B-7D)
 / %x7E-2FFF ; omit IDEOGRAPHIC SPACE (%x3000)
-/ %x3001-D7FF ; omit surrogates
-/ %xE000-10FFFF
+/ %x3001-10FFFF ; allowing surrogates is intentional
 ```

+> [!NOTE]
+> Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive)
+> are allowed for compatibility with UTF-16 based implementations
+> that do not check for this encoding error.
+
 When a _pattern_ is quoted by embedding the _pattern_ in curly brackets, the
 resulting _message_ can be embedded into
 various formats regardless of the container's whitespace trimming rules.
@@ -688,8 +709,7 @@ A _literal_ can appear
 as a _key_ value,
 as the _operand_ of a _literal-expression_,
 or in the value of an _option_.
-A _literal_ MAY include any Unicode code point
-except for U+0000 NULL or the surrogate code points U+D800 through U+DFFF.
+A _literal_ MAY include any Unicode code point except for U+0000 NULL.

 All code points are preserved.

@@ -711,6 +731,11 @@ A **_<dfn>quoted literal</dfn>_** begins and ends with U+005E VERTICAL BAR `|`.
 The characters `\` and `|` within a _quoted literal_ MUST be
 escaped as `\\` and `\|`.

+> [!NOTE]
+> Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive)
+> are allowed in _quoted literals_ for compatibility with UTF-16 based
+> implementations that do not check for this encoding error.
+
 An **_<dfn>unquoted literal</dfn>_** is a _literal_ that does not require the `|`
 quotes around it to be distinct from the rest of the _message_ syntax.
 An _unquoted literal_ MAY be used when the content of the _literal_
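
The notes added in this commit say that unpaired surrogates come from UTF-16 data that was never checked, and that many processes replace them with U+FFFD. The sketch below shows one way that check could look over a list of UTF-16 code units. It is an illustration under my own assumptions, not behavior the spec requires: the function name `replace_unpaired` is hypothetical, and an implementation is equally free to preserve the code units as-is.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def replace_unpaired(units):
    """Return a copy of a UTF-16 code unit list with lone surrogates replaced.

    A high surrogate (D800..DBFF) immediately followed by a low surrogate
    (DC00..DFFF) is a valid pair and is kept; any other surrogate is unpaired.
    """
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if (0xD800 <= unit <= 0xDBFF and has_next
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            out.extend(units[i:i + 2])   # well-formed pair, keep both units
            i += 2
        elif 0xD800 <= unit <= 0xDFFF:
            out.append(REPLACEMENT)      # lone high or lone low surrogate
            i += 1
        else:
            out.append(unit)
            i += 1
    return out


if __name__ == "__main__":
    # 'a', a valid pair (U+1F600), a lone high, 'b', a lone low
    sample = [0x61, 0xD83D, 0xDE00, 0xD800, 0x62, 0xDC00]
    print([f"{u:04X}" for u in replace_unpaired(sample)])
```

Because the replacement is lossy, running it silently would conflict with "All code points are preserved", so a step like this fits display or export paths better than parsing.
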
