43 changes: 34 additions & 9 deletions spec/message.abnf
@@ -49,18 +49,43 @@ local = %s".local"
match = %s".match"

; Names and identifiers
; identifier matches https://www.w3.org/TR/REC-xml-names/#NT-QName
; name matches https://www.w3.org/TR/REC-xml-names/#NT-NCName but excludes U+FFFD and U+061C
identifier = [namespace ":"] name
namespace = name
name = [bidi] name-start *name-char [bidi]
name-start = ALPHA / "_"
/ %xC0-D6 / %xD8-F6 / %xF8-2FF
/ %x370-37D / %x37F-61B / %x61D-1FFF / %x200C-200D
/ %x2070-218F / %x2C00-2FEF / %x3001-D7FF
/ %xF900-FDCF / %xFDF0-FFFC / %x10000-EFFFF
name-char = name-start / DIGIT / "-" / "."
/ %xB7 / %x300-36F / %x203F-2040
name-start = ALPHA
/ %x2B ; «+» omit Cc: %x0-1F, Whitespace: « », Ascii: «!"#$%&'()*»
/ %x5F ; «_» omit Ascii: «,-./0123456789:;<=>?@» «[\]^»
/ %xA1-61B ; omit Cc: %x7F-9F, Whitespace: %xA0, Ascii: «`» «{|}~»
/ %x61D-167F ; omit BidiControl: %x61C
/ %x1681-1FFF ; omit Whitespace: %x1680
/ %x200B-200D ; omit Whitespace: %x2000-200A
/ %x2010-2027 ; omit BidiControl: %x200E-200F
/ %x2030-205E ; omit Whitespace: %x2028-2029 %x202F, BidiControl: %x202A-202E
/ %x2060-2065 ; omit Whitespace: %x205F
/ %x206A-2FFF ; omit BidiControl: %x2066-2069
/ %x3001-D7FF ; omit Whitespace: %x3000
/ %xE000-FDCF ; omit Cs: %xD800-DFFF
/ %xFDF0-FFFD ; omit NChar: %xFDD0-FDEF
/ %x10000-1FFFD ; omit NChar: %xFFFE-FFFF
/ %x20000-2FFFD ; omit NChar: %x1FFFE-1FFFF
/ %x30000-3FFFD ; omit NChar: %x2FFFE-2FFFF
/ %x40000-4FFFD ; omit NChar: %x3FFFE-3FFFF
/ %x50000-5FFFD ; omit NChar: %x4FFFE-4FFFF
/ %x60000-6FFFD ; omit NChar: %x5FFFE-5FFFF
/ %x70000-7FFFD ; omit NChar: %x6FFFE-6FFFF
/ %x80000-8FFFD ; omit NChar: %x7FFFE-7FFFF
/ %x90000-9FFFD ; omit NChar: %x8FFFE-8FFFF
/ %xA0000-AFFFD ; omit NChar: %x9FFFE-9FFFF
/ %xB0000-BFFFD ; omit NChar: %xAFFFE-AFFFF
/ %xC0000-CFFFD ; omit NChar: %xBFFFE-BFFFF
/ %xD0000-DFFFD ; omit NChar: %xCFFFE-CFFFF
/ %xE0000-EFFFD ; omit NChar: %xDFFFE-DFFFF
/ %xF0000-FFFFD ; omit NChar: %xEFFFE-EFFFF
/ %x100000-10FFFD ; omit NChar: %xFFFFE-FFFFF
; omit NChar: %x10FFFE-10FFFF

name-char = name-start / DIGIT
/ %x2D-2E ; «-.» omit Cc: %x0-1F, Whitespace: « », Ascii: «!"#$%&'()*+,»

; Restrictions on characters in various contexts
simple-start-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
73 changes: 61 additions & 12 deletions spec/syntax.md
@@ -786,11 +786,9 @@ has been applied to both.
> implementations can often substitute checking for actually applying normalization
> to _name_ values.

Valid content for _names_ is based on <cite>Namespaces in XML 1.0</cite>'s
[NCName](https://www.w3.org/TR/xml-names/#NT-NCName).
This is different from XML's [Name](https://www.w3.org/TR/xml/#NT-Name)
in that it MUST NOT contain a U+003A COLON `:`.
Otherwise, the set of characters allowed in a _name_ is large.
The _names_ are [immutable identifiers](https://www.unicode.org/reports/tr31/#Immutable_Identifier_Syntax).
They are similar to <cite>Namespaces in XML 1.0</cite>'s [NCName](https://www.w3.org/TR/xml-names/#NT-NCName),
but have been updated to be more consistent.
Collaborator

Given how far we've departed from that spec, this note seems only relevant within the history of the spec, and not with where we're ending up with this PR. I'd prefer dropping it, and moving the preceding sentence (if keeping) above the preceding note.

Member Author

ok.


> [!NOTE]
> _External variables_ can be passed in that are not valid _names_.
@@ -843,15 +841,66 @@ option = identifier o "=" o (literal / variable)
identifier = [namespace ":"] name
namespace = name
name = [bidi] name-start *name-char [bidi]
name-start = ALPHA / "_"
/ %xC0-D6 / %xD8-F6 / %xF8-2FF
/ %x370-37D / %x37F-61B / %x61D-1FFF / %x200C-200D
/ %x2070-218F / %x2C00-2FEF / %x3001-D7FF
/ %xF900-FDCF / %xFDF0-FFFC / %x10000-EFFFF
name-char = name-start / DIGIT / "-" / "."
/ %xB7 / %x300-36F / %x203F-2040
name-start = ALPHA
/ %x2B ; «+» omit Cc: %x0-1F, Whitespace: « », Ascii: «!"#$%&'()*»
/ %x5F ; «_» omit Ascii: «,-./0123456789:;<=>?@» «[\]^»
/ %xA1-61B ; omit Cc: %x7F-9F, Whitespace: %xA0, Ascii: «`» «{|}~»
/ %x61D-167F ; omit BidiControl: %x61C
/ %x1681-1FFF ; omit Whitespace: %x1680
/ %x200B-200D ; omit Whitespace: %x2000-200A
/ %x2010-2027 ; omit BidiControl: %x200E-200F
/ %x2030-205E ; omit Whitespace: %x2028-2029 %x202F, BidiControl: %x202A-202E
/ %x2060-2065 ; omit Whitespace: %x205F
/ %x206A-2FFF ; omit BidiControl: %x2066-2069
/ %x3001-D7FF ; omit Whitespace: %x3000
/ %xE000-FDCF ; omit Cs: %xD800-DFFF
/ %xFDF0-FFFD ; omit NChar: %xFDD0-FDEF
/ %x10000-1FFFD ; omit NChar: %xFFFE-FFFF
/ %x20000-2FFFD ; omit NChar: %x1FFFE-1FFFF
/ %x30000-3FFFD ; omit NChar: %x2FFFE-2FFFF
/ %x40000-4FFFD ; omit NChar: %x3FFFE-3FFFF
/ %x50000-5FFFD ; omit NChar: %x4FFFE-4FFFF
/ %x60000-6FFFD ; omit NChar: %x5FFFE-5FFFF
/ %x70000-7FFFD ; omit NChar: %x6FFFE-6FFFF
/ %x80000-8FFFD ; omit NChar: %x7FFFE-7FFFF
/ %x90000-9FFFD ; omit NChar: %x8FFFE-8FFFF
/ %xA0000-AFFFD ; omit NChar: %x9FFFE-9FFFF
/ %xB0000-BFFFD ; omit NChar: %xAFFFE-AFFFF
/ %xC0000-CFFFD ; omit NChar: %xBFFFE-BFFFF
/ %xD0000-DFFFD ; omit NChar: %xCFFFE-CFFFF
/ %xE0000-EFFFD ; omit NChar: %xDFFFE-DFFFF
/ %xF0000-FFFFD ; omit NChar: %xEFFFE-EFFFF
/ %x100000-10FFFD ; omit NChar: %xFFFFE-FFFFF
; omit NChar: %x10FFFE-10FFFF

name-char = name-start / DIGIT
/ %x2D-2E ; «-.» omit Cc: %x0-1F, Whitespace: « », Ascii: «!"#$%&'()*+,»
```

> [!NOTE]
> Syntactically, the definitions of `identifier` and `name-char` provide backwards compatibility over time
> by allowing a stable, wide range of characters.
> As a result, a character added in a later version of Unicode
> can be used in any conformant implementation of MessageFormat.
> The definition currently excludes:
> * Most ASCII except for letters and characters used for numbers
>   * This avoids conflicts with syntax characters, and reserves some characters for future syntax.
> * Bidirectional controls (`Bidi_C`)
> * Control characters (`GC=Cc`, but not Format characters: `GC=Cf`)
> * Whitespace characters (`WSpace`)
> * Surrogate code points (`GC=Cs`)
> * Non-Characters (`NChar`)
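
The `name-start` and `name-char` rules above can be checked mechanically. The following is a minimal sketch, not part of the spec: the function names are hypothetical, the ranges are transcribed by hand from the ABNF in this diff, and the optional `[bidi]` wrappers in `name` are deliberately left out.

```python
# Inclusive code point ranges transcribed from the name-start rule above.
NAME_START_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A),        # ALPHA (A-Z, a-z)
    (0x2B, 0x2B), (0x5F, 0x5F),        # "+" and "_"
    (0xA1, 0x61B), (0x61D, 0x167F), (0x1681, 0x1FFF),
    (0x200B, 0x200D), (0x2010, 0x2027), (0x2030, 0x205E),
    (0x2060, 0x2065), (0x206A, 0x2FFF), (0x3001, 0xD7FF),
    (0xE000, 0xFDCF), (0xFDF0, 0xFFFD),
]
# Planes 1 through 16: each plane minus its last two code points (noncharacters).
NAME_START_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 17)]


def is_name_start(ch: str) -> bool:
    """True if the single character ch matches name-start."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NAME_START_RANGES)


def is_name_char(ch: str) -> bool:
    """True if ch matches name-char: name-start, DIGIT, "-" or "."."""
    cp = ord(ch)
    return is_name_start(ch) or 0x30 <= cp <= 0x39 or cp in (0x2D, 0x2E)


def is_name_core(text: str) -> bool:
    """True if text matches name-start *name-char.

    Hypothetical helper: the optional [bidi] prefix and suffix allowed by
    the name rule are not handled here.
    """
    return (
        len(text) > 0
        and is_name_start(text[0])
        and all(is_name_char(ch) for ch in text[1:])
    )
```

Under these assumptions, `is_name_core("a-b.c")` and `is_name_core("+plus")` return `True`, while `is_name_core("1abc")`, `is_name_core("a b")`, and `is_name_core("a:b")` return `False`.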

This syntax allows a wide range of characters in _names_ and _identifiers_.
Implementers and authors of _functions_ and _messages_
SHOULD avoid creating _names_ that could produce confusion or harm usability
by choosing names consistent with the guidelines listed below.
This applies to the names of _functions_, _options_, and _operands_ (variable names).
MessageFormat tools, such as linters, SHOULD warn when _names_ chosen by users
violate these guidelines:

1. [Unicode Default Identifier Syntax](https://www.unicode.org/reports/tr31/#Default_Identifier_Syntax)
2. [Unicode General Security Profile for Identifiers](https://www.unicode.org/reports/tr39/#General_Security_Profile)
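
A linter warning of the kind described above might look like the following sketch. It is only an approximation: `lint_name` is a hypothetical helper, and it relies on Python's built-in `str.isidentifier()`, which is based on `XID_Start` and `XID_Continue` but also accepts a leading underscore and follows the Unicode version bundled with the Python build. The General Security Profile check needs the UTS #39 data files and is not attempted here.

```python
def lint_name(name: str) -> list:
    """Return advisory warnings for a name; an empty list means no warning.

    Assumption: str.isidentifier() stands in for Unicode Default Identifier
    Syntax. It is close to, but not identical with, the UAX #31 definition.
    """
    warnings = []
    if not name.isidentifier():
        warnings.append(
            f"name {name!r} does not follow Unicode Default Identifier Syntax"
        )
    return warnings
```

Such a check is advisory only: a name like `a-b` is valid under the grammar but would still produce a warning here, because a hyphen is not part of the default identifier syntax.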

### Escape Sequences

An **_<dfn>escape sequence</dfn>_** is a two-character sequence starting with