-
-
Notifications
You must be signed in to change notification settings - Fork 35
Disallow whitespace as the first character of a reserved-body in a reserved-statement. #731
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disallow whitespace as the first character of a reserved-body in a reserved-statement. #731
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to encode in the syntax that the value of reserved-body does not include its leading whitespace, then this seems like a better way of doing so than #730.
Approving, but note my inline suggested simplification, which ought to be applied syntax.md as well.
This does introduce one concern that should be noted:
If someone introduces private syntax {&foo} for which they do not allow for space between the & and foo, a standard MF2 processor, when encountering a privately invalid expression {& foo}, could reserialize it as {&foo}, a privately valid expression.
Similarly, this change would effectively disallow private and future syntax from requiring a space immediately after the initial sigil, as an MF2.0 processor would likely dismiss the space as insignificant.
I'm okay with accepting these limitations.
spec/message.abnf
Outdated
| reserved-body = reserved-body-start *reserved-body-part | ||
| reserved-body-start = reserved-char / reserved-escape / quoted | ||
| reserved-body-part = [s] 1*(reserved-char / reserved-escape / quoted) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified:
| reserved-body = reserved-body-start *reserved-body-part | |
| reserved-body-start = reserved-char / reserved-escape / quoted | |
| reserved-body-part = [s] 1*(reserved-char / reserved-escape / quoted) | |
| reserved-body = reserved-body-part *([s] reserved-body-part) | |
| reserved-body-part = reserved-char / reserved-escape / quoted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. I like it, and I'll apply it if there is consensus. (I had introduced the reserved-body-start nonterminal following @aphillips comment #730 (comment) .)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's apply this here and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The grammar change looks good. It took me ages to understand what this is doing; maybe I'm just slow, but I think it might help future readers to add a note in syntax.md as I suggested.
| ```abnf | ||
| reserved-annotation = reserved-annotation-start reserved-body | ||
| reserved-annotation = reserved-annotation-start [[s] reserved-body] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth adding some prose here explaining why the grammar is structured this way. It wasn't obvious immediately that it had to be, so it won't be obvious to the reader either. Including the example would be good.
I also think a note should be added to the "Reserved Statements" section of the same doc that explains what the limitations are of reserved-statement that arise because of this change. You could use your example from #721.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use a separate PR to do any text changes.
…served-statement. In the 'reserved-statement' nonterminal, there is an ambiguity if there is more than one whitespace character between the 'reserved-keyword' and the first non-whitespace character of the 'reserved-body', because these whitespace characters can be seen as part of the 's' nonterminal or as part of the 'reserved-body' nonterminal. According to the principles explained in unicode-org#725 and the proposed resolution of unicode-org#721, it is not desired that a 'reserved-body' starts with a whitespace character; rather, such a whitespace character is meant to be interpreted as part of the preceding 's' nonterminal. Test case: ``` .regex /foo/{xyz}{{hello}} ``` This patch removes this ambiguity, by disallowing whitespace as the first character of a 'reserved-body' in a reserved-statement. It thus fixes the first part of unicode-org#721. Details: - In the other occurrences of 'resolved-body' as well (in a 'reserved-annotation' or 'private-use-annotation') the leading whitespace is separated as well. This has no influence on the set of inputs that the 'reserved-annotation' and 'private-use-annotation' nonterminals can match, but highlights that the parser should better trim off this leading whitespace in these places before entering the resolved-body into the data model. - A nonterminal 'resolved-body-part' is introduced.
a238850 to
dcf6f32
Compare
|
The simplification suggested by @eemeli is now included. |
This assumes that PR unicode-org#731 is applied. In detail: * Since the reserved keywords start with a `.`, we need to talk about the keyword `.match`, not `match`. * Saying that "A _reserved annotation_ starts with a reserved character" and "A _reserved annotation_ MAY be empty" is a contradiction. Therefore, we need to distinguish a _reserved annotation_ from a _reserved body_. This distinction is also useful because the _reserved body_ occurs in two other places as well. * The statement that a _reserved body_ contains "arbitrary text" is not true, since we have now decided that (unless empty) it must start and end with a non-whitespace character. * A nonterminal `reserved-start` does not exist. What is meant is `reserved-annotation-start`. * The statement "Implementations MUST NOT remove or alter the contents of a _reserved annotation_." needs to be constrained, because now implementations are SUPPOSED to trim whitespace around the _reserved body_.
* A few proposed tweaks to syntax.md. This assumes that PR #731 is applied. In detail: * Since the reserved keywords start with a `.`, we need to talk about the keyword `.match`, not `match`. * Saying that "A _reserved annotation_ starts with a reserved character" and "A _reserved annotation_ MAY be empty" is a contradiction. Therefore, we need to distinguish a _reserved annotation_ from a _reserved body_. This distinction is also useful because the _reserved body_ occurs in two other places as well. * The statement that a _reserved body_ contains "arbitrary text" is not true, since we have now decided that (unless empty) it must start and end with a non-whitespace character. * A nonterminal `reserved-start` does not exist. What is meant is `reserved-annotation-start`. * The statement "Implementations MUST NOT remove or alter the contents of a _reserved annotation_." needs to be constrained, because now implementations are SUPPOSED to trim whitespace around the _reserved body_. * Subtle wording regarding whitespace around a reserved-body. Co-authored-by: Addison Phillips <[email protected]> --------- Co-authored-by: Addison Phillips <[email protected]>
In the 'reserved-statement' nonterminal, there is an ambiguity if there is more than one whitespace character between the 'reserved-keyword' and the first non-whitespace character of the 'reserved-body', because these whitespace characters can be seen as part of the 's' nonterminal or as part of the 'reserved-body' nonterminal.
According to the principles explained in #725 and the proposed resolution of #721, it is not desired that a 'reserved-body' starts with a whitespace character; rather, such a whitespace character is meant to be interpreted as part of the preceding 's' nonterminal.
Test case:
This patch removes this ambiguity, by disallowing whitespace as the first character of a 'reserved-body' in a reserved-statement.
It thus fixes the first part of #721.
Details: