Merged
Changes from 3 commits
55 changes: 40 additions & 15 deletions spec/syntax.md
@@ -684,6 +684,17 @@ except for U+0000 NULL or the surrogate code points U+D800 through U+DFFF.

All code points are preserved.

Two _literals_ are considered equal if they consist of the same sequence of Unicode
code points.
Collaborator

This seems a bit excessive. While most name comparisons are internal to the spec, AFAIK the only literal comparison in the spec is for duplicate variant key lists, which tbh I'd prefer to be done with normalization.

All other literal value handling is done by functions, which we should not restrict from applying normalization in their internal processing.

Member

I agree. I'd go a bit further and encourage function specifiers to normalize when matching, and we should make the standard functions do so. That is, something along the lines of:


An NFC comparison (aka Unicode canonically equivalent comparison) produces the same results as if each string value being compared were converted to the Unicode Normalization Form C (NFC). For example, with an NFC comparison against the literal |U\x{308}|, the same result is obtained as if the literal were |\x{DC}|. For more examples, see the Unicode Standard.

When determining whether two variant key lists are duplicates, NFC comparison MUST be used for literals.

When a selector function evaluates matches to literal keys, the matches SHOULD use NFC comparison. Moreover, the implementation of the standard selector functions MUST use NFC comparison. Thus the standard :string selector function MUST match a string input parameter of "U\x{308}" with the literal |\x{DC}|.
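
To make the proposed wording concrete, here is a minimal Python sketch of an NFC comparison using the standard `unicodedata` module. The helper name `nfc_equal` is hypothetical and not part of the spec; it only illustrates the "as if converted to NFC" behavior described above.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings as if both had been converted to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


decomposed = "U\u0308"   # U followed by U+0308 COMBINING DIAERESIS
precomposed = "\u00DC"   # U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS

print(decomposed == precomposed)           # False: code point comparison
print(nfc_equal(decomposed, precomposed))  # True: NFC comparison
```

Neither input is modified; only the comparison behaves as if both sides were normalized.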


BTW: Some selector functions, such as the standard numeric selectors, only match literals with all ASCII characters. ASCII literals never change when converted to NFC, and there are only three non-ASCII characters that change to ASCII. So selector functions whose literals don't include ";", "`", or "K" don't need to use NFC comparison; that includes our numeric selectors.

| Char. | Code Point | Name |
|-------|------------|------|
| ; | U+037E | GREEK QUESTION MARK |
| ` | U+1FEF | GREEK VARIA |
| K | U+212A | KELVIN SIGN |
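
As a quick check of the three characters listed above, the following Python sketch prints what each one becomes under NFC. It relies only on the standard `unicodedata` module; the helper name `normalizes_to_ascii` is made up for this illustration.

```python
import unicodedata


def normalizes_to_ascii(ch: str) -> bool:
    """True if a non-ASCII character becomes pure ASCII under NFC."""
    return not ch.isascii() and unicodedata.normalize("NFC", ch).isascii()


for cp in (0x037E, 0x1FEF, 0x212A):
    ch = chr(cp)
    nfc = unicodedata.normalize("NFC", ch)
    # Each of these has a singleton canonical decomposition to one ASCII character.
    print(f"U+{cp:04X} {unicodedata.name(ch)} -> U+{ord(nfc):04X} ({nfc!r})")
    assert normalizes_to_ascii(ch)
```

An all-ASCII literal such as `one` is unchanged by NFC, so only an input containing one of these three characters could compare differently against an ASCII key.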

Member Author

First, do we agree that literals MAY be non-normalized/denormalized?

If a literal can be a non-normalized string, then we should define when two literals match inside MF2. Literal comparison is for duplicate key lists, but also for matching between the sorted results of a selector and the keys in the message (the sorting is done by a function, but not the matching after sorting). This text says nothing about what functions do or are allowed to do with (possibly not normalized) literal values. All it says is when MF2 considers two literals to be equal. I could add text allowing functions to have greater restriction on equality. @macchiati suggests requiring it for :string.

> When determining whether two variant key lists are duplicates, NFC comparison MUST be used for literals.

This is the opposite of what @eemeli is saying? If we allow normalization (but don't require it) we also allow the lack of it.

By not normalizing literals, we allow non-normalized sequences to be used in expressions, option values, or keys. This has positive consequences (for people who know what they're doing when working with combining marks or certain characters) and negative consequences (for people who don't).

> When a selector function evaluates matches to literal keys, the matches SHOULD use NFC comparison. Moreover, the implementation of the standard selector functions MUST use NFC comparison. Thus the standard :string selector function MUST match a string input parameter of "U\x{308}" with the literal |\x{DC}|.

Why?

.local $angstromsAreCool = {Å :string}
.match $angstromsAreCool
Å {{U+212B is the only way to be cool}}
Å {{I'm U+00C5, so almost cool}}
Å {{I'm A + U+030A, so I combine with cool}}
* {{I'm not cool}}
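
For reference, a small Python sketch of how the three keys in this message compare under the two rules being discussed. The escapes spell out the three forms named in the patterns, since they look identical when rendered; the dictionary labels are only for this illustration.

```python
import unicodedata

forms = {
    "ANGSTROM SIGN": "\u212B",
    "precomposed A WITH RING ABOVE": "\u00C5",
    "A + COMBINING RING ABOVE": "A\u030A",
}

# Code point comparison: every key is distinct, so all three variants are reachable.
distinct_by_code_point = len(set(forms.values()))

# NFC comparison: all three collapse to U+00C5, so the key lists would be duplicates.
distinct_by_nfc = len({unicodedata.normalize("NFC", s) for s in forms.values()})

print(distinct_by_code_point, distinct_by_nfc)  # 3 1
```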

I understand that this doesn't illustrate a compelling use case. Most of the time the sets of valid keys should be rational, sane, highly-normalized enumerated values and not just random text... in fact, I have a note cautioning people about this right 👇

Member

I think it is far, far more likely that people will make mistakes with non-NFC literals (or input) than the really, really obscure edge case of someone wanting to match non-normalized text.

Collaborator

> First, do we agree that literals MAY be non-normalized/denormalized?

As with pattern text, I agree that we should not require the normalization of literal values.

> Literal comparison is [...] also for matching between the sorted results of a selector and the keys in the message (the sorting is done by a function, but not the matching after sorting).

Regarding the latter, we say this:

> The method MatchSelectorKeys is determined by the implementation.
> It takes as arguments a resolved _selector_ value `rv` and a list of string keys `keys`,
> and returns a list of string keys in preferential order.
> The returned list MUST contain only unique elements of the input list `keys`.

That MUST is requiring the processing to not normalise any of the values, even if it did so for its internal processing.

I'd be completely fine with us normalising the keys before they're passed to the function, or at least allowing an implementation to do so.

> > When determining whether two variant key lists are duplicates, NFC comparison MUST be used for literals.
>
> This is the opposite of what @eemeli is saying? If we allow normalization (but don't require it) we also allow the lack of it.

I'm aligned with @macchiati here. We don't need to normalize key values, but we should do their comparison when checking for duplicate key lists as if they were normalized.
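
A rough Python sketch of that approach, assuming key lists are represented as plain lists of strings (the function name and the data shape are hypothetical, not taken from the spec): the stored keys are left untouched, and only the duplicate check works on NFC copies.

```python
import unicodedata
from typing import Iterable, Sequence


def has_duplicate_key_lists(variants: Iterable[Sequence[str]]) -> bool:
    """Report whether two key lists are equal when compared as if in NFC."""
    seen = set()
    for keys in variants:
        # Normalize a copy for comparison only; the original keys are not changed.
        normalized = tuple(unicodedata.normalize("NFC", k) for k in keys)
        if normalized in seen:
            return True
        seen.add(normalized)
    return False


variants = [["\u00C5", "one"], ["A\u030A", "one"], ["*", "*"]]
print(has_duplicate_key_lists(variants))  # True: the first two lists match under NFC
```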


> [!IMPORTANT]
> _Literal_ equality is different from _name_ equality in that
> Unicode Normalization is not applied to _literal_ values before comparison.
> Users are cautioned to ensure that they use the same character sequences
> for equivalent values.
> The use of [Normalization Form C](https://unicode.org/reports/tr15/) for all
> _literal_ values is RECOMMENDED.

A **_<dfn>quoted literal</dfn>_** begins and ends with U+007C VERTICAL BAR `|`.
The characters `\` and `|` within a _quoted literal_ MUST be
escaped as `\\` and `\|`.
@@ -708,27 +719,26 @@ number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT] [%i"e" ["-" / "

### Names and Identifiers

An **_<dfn>identifier</dfn>_** is a character sequence that
identifies a _function_, _markup_, or _option_.
Each _identifier_ consists of a _name_ optionally preceded by
a _namespace_.
When present, the _namespace_ is separated from the _name_ by a
U+003A COLON `:`.
Built-in _functions_ and their _options_ do not have a _namespace_ identifier.

The _namespace_ `u` (U+0075 LATIN SMALL LETTER U)
is reserved for future standardization.

_Function_ _identifiers_ are prefixed with `:`.
_Markup_ _identifiers_ are prefixed with `#` or `/`.
_Option_ _identifiers_ have no prefix.

A **_<dfn>name</dfn>_** is a character sequence used in an _identifier_
or as the name for a _variable_
or the value of an _unquoted literal_.

_Variable_ names are prefixed with `$`.

A _name_ is identical to another name if both consist of the same sequence of
Unicode code points after
[Unicode Normalization Form C](https://unicode.org/reports/tr15/) (NFC)
has been applied to both.

> [!NOTE]
> Implementations are not required to normalize _names_.
> Comparisons of _name_ values only need be done "as-if" normalization
> has occurred.
> Since most text in the wild is already in NFC
> and since checking for NFC is fast and efficient,
> implementations can often substitute checking for actually applying normalization
> to _name_ values.
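
The "as-if" strategy in the note could look roughly like the following Python sketch. It assumes Python 3.8 or later for `unicodedata.is_normalized`; the helper names `to_nfc` and `names_equal` are hypothetical and only illustrate checking before normalizing.

```python
import unicodedata


def to_nfc(s: str) -> str:
    """Return s in NFC, skipping the conversion when s is already normalized."""
    if unicodedata.is_normalized("NFC", s):
        return s  # common case: most text is already in NFC
    return unicodedata.normalize("NFC", s)


def names_equal(a: str, b: str) -> bool:
    """Compare two names as if NFC had been applied to both."""
    return to_nfc(a) == to_nfc(b)


print(names_equal("caf\u00E9", "cafe\u0301"))  # True
print(names_equal("cafe", "caf\u00E9"))        # False
```

The stored names are never rewritten; normalization happens only on the values being compared, and only when the check says it is needed.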

Valid content for _names_ is based on <cite>Namespaces in XML 1.0</cite>'s
[NCName](https://www.w3.org/TR/xml-names/#NT-NCName).
This is different from XML's [Name](https://www.w3.org/TR/xml/#NT-Name)
@@ -740,6 +750,21 @@ Otherwise, the set of characters allowed in a _name_ is large.
> Such variables cannot be referenced in a _message_,
> but are not otherwise errors.

An **_<dfn>identifier</dfn>_** is a character sequence that
identifies a _function_, _markup_, or _option_.
Each _identifier_ consists of a _name_ optionally preceded by
a _namespace_.
When present, the _namespace_ is separated from the _name_ by a
U+003A COLON `:`.
Built-in _functions_ and their _options_ do not have a _namespace_ identifier.

The _namespace_ `u` (U+0075 LATIN SMALL LETTER U)
is reserved for future standardization.

_Function_ _identifiers_ are prefixed with `:`.
_Markup_ _identifiers_ are prefixed with `#` or `/`.
_Option_ _identifiers_ have no prefix.

Examples:
> A variable:
>```