Skip to content

Conversation

@bhaible
Copy link
Contributor

@bhaible bhaible commented Mar 15, 2024

In the 'reserved-statement' nonterminal, there is an ambiguity if there is more than one whitespace character between the 'reserved-keyword' and the first non-whitespace character of the 'reserved-body', because these whitespace characters can be seen as part of the 's' nonterminal or as part of the 'reserved-body' nonterminal.

According to the principles explained in #725 and the proposed resolution of #721, it is not desired that a 'reserved-body' starts with a whitespace character; rather, such a whitespace character is meant to be interpreted as part of the preceding 's' nonterminal.

Test case:

.regex   /foo/{xyz}{{hello}}

This patch removes this ambiguity, by disallowing whitespace as the first character of a 'reserved-body' in a reserved-statement.

It thus fixes the first part of #721.

Details:

  • In the other occurrences of 'resolved-body' as well (in a 'reserved-annotation' or 'private-use-annotation') the leading whitespace is separated as well. This has no influence on the set of inputs that the 'reserved-annotation' and 'private-use-annotation' nonterminals can match, but highlights that the parser should better trim off this leading whitespace in these places before entering the 'resolved-body' into the data model.
  • Two nonterminals 'reserved-body-start' and 'resolved-body-part' are introduced, each referenced once. The purpose is clarity and to follow the common *-start / *-part idiom.

Copy link
Collaborator

@eemeli eemeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to encode in the syntax that the value of reserved-body does not include its leading whitespace, then this seems like a better way of doing so than #730.

Approving, but note my inline suggested simplification, which ought to be applied syntax.md as well.

This does introduce one concern that should be noted:

If someone introduces private syntax {&foo} for which they do not allow for space between the & and foo, a standard MF2 processor, when encountering a privately invalid expression {& foo}, could reserialize it as {&foo}, a privately valid expression.

Similarly, this change would effectively disallow private and future syntax from requiring a space immediately after the initial sigil, as an MF2.0 processor would likely dismiss the space as insignificant.

I'm okay with accepting these limitations.

Comment on lines 70 to 72
reserved-body = reserved-body-start *reserved-body-part
reserved-body-start = reserved-char / reserved-escape / quoted
reserved-body-part = [s] 1*(reserved-char / reserved-escape / quoted)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified:

Suggested change
reserved-body = reserved-body-start *reserved-body-part
reserved-body-start = reserved-char / reserved-escape / quoted
reserved-body-part = [s] 1*(reserved-char / reserved-escape / quoted)
reserved-body = reserved-body-part *([s] reserved-body-part)
reserved-body-part = reserved-char / reserved-escape / quoted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. I like it, and I'll apply it if there is consensus. (I had introduced the reserved-body-start nonterminal following @aphillips comment #730 (comment) .)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's apply this here and below.

@aphillips aphillips added syntax Issues related with syntax or ABNF Agenda+ Requested for upcoming teleconference labels Mar 16, 2024
Copy link
Collaborator

@catamorphism catamorphism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The grammar change looks good. It took me ages to understand what this is doing; maybe I'm just slow, but I think it might help future readers to add a note in syntax.md as I suggested.

```abnf
reserved-annotation = reserved-annotation-start reserved-body
reserved-annotation = reserved-annotation-start [[s] reserved-body]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth adding some prose here explaining why the grammar is structured this way. It wasn't obvious immediately that it had to be, so it won't be obvious to the reader either. Including the example would be good.

I also think a note should be added to the "Reserved Statements" section of the same doc that explains what the limitations are of reserved-statement that arise because of this change. You could use your example from #721.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use a separate PR to do any text changes.

@aphillips
Copy link
Member

I think the group is settled on this one over #730. Let's close #730 and concentrate on finishing this one off.

Adding a note to syntax.md is probably useful.

@aphillips aphillips added fast-track Editorial change permitted to use fast-track merge rules LDML45 and removed Agenda+ Requested for upcoming teleconference labels Mar 18, 2024
…served-statement.

In the 'reserved-statement' nonterminal, there is an ambiguity if there is more
than one whitespace character between the 'reserved-keyword' and the first
non-whitespace character of the 'reserved-body', because these whitespace
characters can be seen as part of the 's' nonterminal or as part of the
'reserved-body' nonterminal.

According to the principles explained in unicode-org#725 and the proposed resolution
of unicode-org#721, it is not desired that a 'reserved-body' starts with a whitespace
character; rather, such a whitespace character is meant to be interpreted
as part of the preceding 's' nonterminal.

Test case:
```
.regex   /foo/{xyz}{{hello}}
```

This patch removes this ambiguity, by disallowing whitespace as the first
character of a 'reserved-body' in a reserved-statement.

It thus fixes the first part of unicode-org#721.

Details:
  - In the other occurrences of 'resolved-body' as well (in a
    'reserved-annotation' or 'private-use-annotation') the leading
    whitespace is separated as well. This has no influence on the set
    of inputs that the 'reserved-annotation' and 'private-use-annotation'
    nonterminals can match, but highlights that the parser should better
    trim off this leading whitespace in these places before entering the
    resolved-body into the data model.
  - A nonterminal 'resolved-body-part' is introduced.
@bhaible bhaible force-pushed the fix-syntax-ambiguity-before-resolve-body-alt branch from a238850 to dcf6f32 Compare March 25, 2024 15:44
@bhaible
Copy link
Contributor Author

bhaible commented Mar 25, 2024

The simplification suggested by @eemeli is now included.

bhaible added a commit to bhaible/message-format-wg that referenced this pull request Mar 25, 2024
This assumes that PR unicode-org#731 is applied.

In detail:

* Since the reserved keywords start with a `.`, we need to talk about the keyword `.match`, not `match`.

* Saying that "A _reserved annotation_ starts with a reserved character" and "A _reserved annotation_ MAY be empty" is a contradiction. Therefore, we need to distinguish a _reserved annotation_ from a _reserved body_. This distinction is also useful because the _reserved body_ occurs in two other places as well.

* The statement that a _reserved body_ contains "arbitrary text" is not true, since we have now decided that (unless empty) it must start and end with a non-whitespace character.

* A nonterminal `reserved-start` does not exist. What is meant is `reserved-annotation-start`.

* The statement "Implementations MUST NOT remove or alter the contents of a _reserved annotation_." needs to be constrained, because now implementations are SUPPOSED to trim whitespace around the _reserved body_.
@aphillips aphillips merged commit 20a61b4 into unicode-org:main Mar 25, 2024
aphillips added a commit that referenced this pull request Mar 25, 2024
* A few proposed tweaks to syntax.md.

This assumes that PR #731 is applied.

In detail:

* Since the reserved keywords start with a `.`, we need to talk about the keyword `.match`, not `match`.

* Saying that "A _reserved annotation_ starts with a reserved character" and "A _reserved annotation_ MAY be empty" is a contradiction. Therefore, we need to distinguish a _reserved annotation_ from a _reserved body_. This distinction is also useful because the _reserved body_ occurs in two other places as well.

* The statement that a _reserved body_ contains "arbitrary text" is not true, since we have now decided that (unless empty) it must start and end with a non-whitespace character.

* A nonterminal `reserved-start` does not exist. What is meant is `reserved-annotation-start`.

* The statement "Implementations MUST NOT remove or alter the contents of a _reserved annotation_." needs to be constrained, because now implementations are SUPPOSED to trim whitespace around the _reserved body_.

* Subtle wording regarding whitespace around a reserved-body.

Co-authored-by: Addison Phillips <[email protected]>

---------

Co-authored-by: Addison Phillips <[email protected]>
@eemeli eemeli added this to the LDML 45 milestone Jul 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fast-track Editorial change permitted to use fast-track merge rules syntax Issues related with syntax or ABNF

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants