40 changes: 33 additions & 7 deletions spec/message.abnf
@@ -54,13 +54,39 @@ match = %s".match"
identifier = [namespace ":"] name
namespace = name
name = [bidi] name-start *name-char [bidi]
name-start = ALPHA / "_"
/ %xC0-D6 / %xD8-F6 / %xF8-2FF
/ %x370-37D / %x37F-61B / %x61D-1FFF / %x200C-200D
/ %x2070-218F / %x2C00-2FEF / %x3001-D7FF
/ %xF900-FDCF / %xFDF0-FFFC / %x10000-EFFFF
name-char = name-start / DIGIT / "-" / "."
/ %xB7 / %x300-36F / %x203F-2040
name-start = ALPHA
/ %x2B ; 【+】 omit Cc %x0-1F, Whitespace %x20, Ascii 【!"#$%&'()*】
/ %x5F ; 【_】 omit Ascii 【,-./0123456789:;<=>?@】 【[\]^】
/ %xA1-61B ; omit Cc %x7F-9F, Whitespace %xA0, Ascii 【`】 【{|}~】
/ %x61D-167F ; omit BidiControl %x61C
/ %x1681-1FFF ; omit Whitespace %x1680
Collaborator

In the other character range definitions, the "omit" comments are offset by one compared to how they're here, as in (showing these three lines only as an example):

Suggested change
- / %xA1-61B ; omit Cc %x7F-9F, Whitespace %xA0, Ascii 【`】 【{|}~】
- / %x61D-167F ; omit BidiControl %x61C
- / %x1681-1FFF ; omit Whitespace %x1680
+ / %xA1-61B ; omit BidiControl %x61C
+ / %x61D-167F ; omit Whitespace %x1680
+ / %x1681-1FFF ; omit Whitespace %x2000-200A

The same style should be used in all these comments.

Member Author

I don't see a difference on my screen. Do you want an additional space before or after 'omit', or a space deleted before or after 'omit'?

Member Author

I changed the EA brackets to guillemets, since they line up better for monospace.

Collaborator

I meant offset in a vertical direction, so a comment like "omit BidiControl %x61C" should follow the range %xA1-61B, rather than the range %x61D-167F.

Member Author

Ok, will work on that.

/ %x200B-200D ; omit Whitespace %x2000-200A
Collaborator

This set is ZWSP, ZWNJ, and ZWJ. Should they really be included in name-start? That seems surprising to me, and with no positive utility.

We will need ZWNJ and ZWJ within names, though, so maybe it's fine for them to be here. But why ZWSP?

Member

We still have name-char. Why not put the joiners in there?

I kind of also question ZWSP.

Member Author (@macchiati, Feb 16, 2025)

There is no particular utility to having ZWSP start a name-char, nor any utility to having it end a name-char.

It doesn't hurt to move that one (ZWSP) to name-char, but it doesn't really make a dent either — and we really wouldn't want to go too far down the very long and slippery slope. That's for linters and guidance.

That being said, if people want it out I can remove it.

Member

The question here is why put these characters in name-start, where they have no utility? At least in name-char they would be enclosed or at the end?

Member Author

What I'm afraid of is that if we move that to name-char, it will just open it up to people endlessly complaining:

"ZWSP is in name-char instead of name-start: why is XXX in name-start when it should also just be in name-char???"

Member Author

That is, the basic difference between name-char and name-start is that

  • name is used in identifiers and variables, and can't start with a digit, -, or .
  • name-char is used in literals, and can start with a digit, -, or .

The syntactic motivation is clear: to make sure that identifiers and variables are distinguishable from numbers. That is a clear syntactic need.

ZWSP certainly isn't needed at the start of an identifier or variable, but there is a large and complicated list of characters that are also not needed at the start of identifiers and variables, and plucking just one of those characters out, without any syntactic need, doesn't actually provide much value.
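
For readers following the thread, here is a minimal Python sketch of the two rules as they appear in this diff. The ranges are transcribed from the proposed `name-start` rule. The helper names (`is_name_start`, `is_name_char`, `is_name`) are hypothetical and not part of the spec, and the optional `[bidi]` marks around a name are ignored.

```python
# Inclusive code point ranges transcribed from the proposed name-start rule.
NAME_START_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A),  # ALPHA
    (0x2B, 0x2B),                # +
    (0x5F, 0x5F),                # _
    (0xA1, 0x61B),
    (0x61D, 0x167F),
    (0x1681, 0x1FFF),
    (0x200B, 0x200D),            # ZWSP, ZWNJ, ZWJ (the range debated in this thread)
    (0x2010, 0x2027),
    (0x2030, 0x205E),
    (0x2060, 0x2065),
    (0x206A, 0x2FFF),
    (0x3001, 0xD7FF),
    (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD),
]
# Planes 1 through 14 (%x10000-1FFFD ... %xE0000-EFFFD); each range stops
# before the two noncharacters that end the plane.
NAME_START_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 15)]


def is_name_start(ch: str) -> bool:
    """True if the single character ch matches name-start."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NAME_START_RANGES)


def is_name_char(ch: str) -> bool:
    """name-char = name-start / DIGIT / %x2D-2E"""
    return is_name_start(ch) or "0" <= ch <= "9" or ch in ("-", ".")


def is_name(s: str) -> bool:
    """name-start followed by zero or more name-char (bidi marks ignored)."""
    return bool(s) and is_name_start(s[0]) and all(is_name_char(c) for c in s[1:])
```

Under this sketch, `is_name("x-1.2")` holds while `is_name("1st")` and `is_name("-x")` do not, which is the number-versus-identifier distinction described above. `is_name_start("\u200b")` also holds, since the draft ranges include ZWSP.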

Collaborator

Given that we do not allow space characters or control characters, I'd prefer not allowing zero-width spaces in names or unquoted literals.

Member Author

A zero-width space is not a space; that is just a name used for familiarity. It is a Format character, like many others.

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7Bgc%3Dformat%7D&g=&i=
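
The classification can also be checked locally. This short sketch uses Python's standard `unicodedata` module, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

ZWSP = "\u200b"

print(unicodedata.name(ZWSP))      # ZERO WIDTH SPACE
print(unicodedata.category(ZWSP))  # Cf (Format)
print(ZWSP.isspace())              # False
print(unicodedata.category(" "))   # Zs (Space_Separator), for comparison
```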

/ %x2010-2027 ; omit BidiControl %x200E-200F
/ %x2030-205E ; omit Whitespace %x2028-2029 %x202F, BidiControl %x202A-202E
/ %x2060-2065 ; omit Whitespace %x205F
/ %x206A-2FFF ; omit BidiControl %x2066-2069
/ %x3001-D7FF ; omit Whitespace %x3000
/ %xF900-FDCF ; omit Cs %xD800-DFFF, Co %xE000-F8FF
/ %xFDF0-FFFD ; omit NChar %xFDD0-FDEF
/ %x10000-1FFFD ; omit NChar %xFFFE-FFFF
/ %x20000-2FFFD ; omit NChar %x1FFFE-1FFFF
/ %x30000-3FFFD ; omit NChar %x2FFFE-2FFFF
/ %x40000-4FFFD ; omit NChar %x3FFFE-3FFFF
/ %x50000-5FFFD ; omit NChar %x4FFFE-4FFFF
/ %x60000-6FFFD ; omit NChar %x5FFFE-5FFFF
/ %x70000-7FFFD ; omit NChar %x6FFFE-6FFFF
/ %x80000-8FFFD ; omit NChar %x7FFFE-7FFFF
/ %x90000-9FFFD ; omit NChar %x8FFFE-8FFFF
/ %xA0000-AFFFD ; omit NChar %x9FFFE-9FFFF
/ %xB0000-BFFFD ; omit NChar %xAFFFE-AFFFF
/ %xC0000-CFFFD ; omit NChar %xBFFFE-BFFFF
/ %xD0000-DFFFD ; omit NChar %xCFFFE-CFFFF
/ %xE0000-EFFFD ; omit NChar %xDFFFE-DFFFF,
; omit NChar %xEFFFE-EFFFF %xFFFFE-FFFFF %x10FFFE-10FFFF,
; omit Co %xF0000-FFFFD %x100000-10FFFD

name-char = name-start / DIGIT
/ %x2D-2E ; 【-.】 omit Cc %x0-1F, Whitespace 【 】, Ascii 【!"#$%&'()*+,】

; Restrictions on characters in various contexts
simple-start-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
61 changes: 54 additions & 7 deletions spec/syntax.md
@@ -843,15 +843,62 @@ option = identifier o "=" o (literal / variable)
identifier = [namespace ":"] name
namespace = name
name = [bidi] name-start *name-char [bidi]
name-start = ALPHA / "_"
/ %xC0-D6 / %xD8-F6 / %xF8-2FF
/ %x370-37D / %x37F-61B / %x61D-1FFF / %x200C-200D
/ %x2070-218F / %x2C00-2FEF / %x3001-D7FF
/ %xF900-FDCF / %xFDF0-FFFC / %x10000-EFFFF
name-char = name-start / DIGIT / "-" / "."
/ %xB7 / %x300-36F / %x203F-2040
name-start = ALPHA
/ %x2B ; 【+】 omit Cc %x0-1F, Whitespace %x20, Ascii 【!"#$%&'()*】
/ %x5F ; 【_】 omit Ascii 【,-./0123456789:;<=>?@】 【[\]^】
/ %xA1-61B ; omit Cc %x7F-9F, Whitespace %xA0, Ascii 【`】 【{|}~】
/ %x61D-167F ; omit BidiControl %x61C
/ %x1681-1FFF ; omit Whitespace %x1680
/ %x200B-200D ; omit Whitespace %x2000-200A
/ %x2010-2027 ; omit BidiControl %x200E-200F
/ %x2030-205E ; omit Whitespace %x2028-2029 %x202F, BidiControl %x202A-202E
/ %x2060-2065 ; omit Whitespace %x205F
/ %x206A-2FFF ; omit BidiControl %x2066-2069
/ %x3001-D7FF ; omit Whitespace %x3000
/ %xF900-FDCF ; omit Cs %xD800-DFFF, Co %xE000-F8FF
/ %xFDF0-FFFD ; omit NChar %xFDD0-FDEF
/ %x10000-1FFFD ; omit NChar %xFFFE-FFFF
/ %x20000-2FFFD ; omit NChar %x1FFFE-1FFFF
/ %x30000-3FFFD ; omit NChar %x2FFFE-2FFFF
/ %x40000-4FFFD ; omit NChar %x3FFFE-3FFFF
/ %x50000-5FFFD ; omit NChar %x4FFFE-4FFFF
/ %x60000-6FFFD ; omit NChar %x5FFFE-5FFFF
/ %x70000-7FFFD ; omit NChar %x6FFFE-6FFFF
/ %x80000-8FFFD ; omit NChar %x7FFFE-7FFFF
/ %x90000-9FFFD ; omit NChar %x8FFFE-8FFFF
/ %xA0000-AFFFD ; omit NChar %x9FFFE-9FFFF
/ %xB0000-BFFFD ; omit NChar %xAFFFE-AFFFF
/ %xC0000-CFFFD ; omit NChar %xBFFFE-BFFFF
/ %xD0000-DFFFD ; omit NChar %xCFFFE-CFFFF
/ %xE0000-EFFFD ; omit NChar %xDFFFE-DFFFF,
; omit NChar %xEFFFE-EFFFF %xFFFFE-FFFFF %x10FFFE-10FFFF,
; omit Co %xF0000-FFFFD %x100000-10FFFD
name-char = name-start / DIGIT
/ %x2D-2E ; 【-.】 omit Cc %x0-1F, Whitespace 【 】, Ascii 【!"#$%&'()*+,】
```

> [!NOTE]
> Syntactically, the definitions of `identifier` and `name-char` provide backwards compatibility over time by allowing a stable,
> wide range of characters.
> So when there is a new character in a version of Unicode, it can be used in any conformant implementation of Message Format.
> The definition currently excludes:
> * Most ASCII except for letters and characters used for numbers
>   * This avoids conflicts with syntax characters, and reserves some characters for future syntax.
> * Bidirectional controls (`Bidi_C`)
> * Control characters (`GC=Cc`, but not Format characters: `GC=Cf`)
> * Whitespace characters (`WSpace`)
> * Isolated Surrogate characters (`GC=Cs`)
> * Private use characters (`GC=Co`)
> * Non-Characters (`NChar`)
>
> Although syntactically a wide range of characters is included,
> when implementations, function authors, and message authors create new identifiers (for functions, options, variables, …),
> it is strongly recommended that they conform to the following to minimize confusion.
> These are also recommended for Message Format linter implementations.
>
> 1. [Unicode Default Identifier Syntax](https://www.unicode.org/reports/tr31/#Default_Identifier_Syntax)
> 2. [Unicode General Security Profile for Identifiers](https://www.unicode.org/reports/tr39/#General_Security_Profile)
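
Reviewer aid, not part of the spec text: a rough sketch of what a linter check along these lines might look like. It assumes Python's `str.isidentifier()` as a stand-in for the UAX #31 default identifier syntax. That method is based on `XID_Start`/`XID_Continue` but is not identical to it (for example, it accepts a leading `_`). It does not implement the UTS #39 security profile, which needs data not in the standard library. The function name `lint_identifier` is hypothetical.

```python
import unicodedata


def lint_identifier(name: str) -> list[str]:
    """Return advisory warnings for a syntactically valid name (hypothetical helper)."""
    warnings = []
    # Approximation of UAX #31 default identifiers via Python's own identifier rule.
    if not name.isidentifier():
        warnings.append("does not look like a UAX #31 default identifier")
    # Format (Cf) characters such as ZWSP are invisible and easy to miss in review.
    if any(unicodedata.category(c) == "Cf" for c in name):
        warnings.append("contains an invisible Format (Cf) character")
    return warnings
```

With this approximation a name such as `my-option` is also flagged, because `-` is not an identifier character in UAX #31. A real linter would need to decide whether to allow it.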

### Escape Sequences

An **_<dfn>escape sequence</dfn>_** is a two-character sequence starting with