diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index f3bb6b6e4b..3f70ed7002 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -54,34 +54,42 @@ the plain-text of the message and the Unicode Bidirectional Algorithm (UBA, UAX# can interact in ways that make the _message_ unintelligible or difficult to parse visually. Machines do not have a problem parsing _messages_ that contain RTL characters, -but users need to be able to discern what a _message_ does, -what _variant_ will be selected, -or what a _placeholder_ will evaluate to. +but users need to be able to discern what a _message_ does. +For example, users need to be able to match _keys_ in a _variant_ to _selectors_ +in a `.match` statement. +Or they want to know how a _pattern_ will be evaluated, +such as understanding the _options_ and _values_ in a _placeholder_. In addition, it is possible to construct messages that use bidi characters to spoof users into believing that a _message_ does something different than what it actually does. The current syntax does not permit bidi controls in _name_ tokens, -_unquoted_ literals, -or in the whitespace portions of a _message_. +_unquoted literals_, +or in the non-pattern whitespace portions of a _message_. -Permitting the **isolate** controls and the standalone strongly-directional markers +Permitting the Unicode bidi **isolate** characters and the standalone strongly-directional markers would enable tools, including translation tools, and users who are writing in RTL languages to format a _message_ so that its plain-text representation and its function are unambiguous. -The isolate controls are paired invisible control characters inserted around a portion of a string. -The start of an isolate sequence is one of: +The isolates are paired invisible characters inserted around a portion of a string. 
+The start of an isolated sequence is one of: - U+2066 LEFT-TO-RIGHT ISOLATE (LRI) - U+2067 RIGHT-TO-LEFT ISOLATE (RLI) - U+2068 FIRST-STRONG ISOLATE (FSI) -The end of an isolate sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI). +The end of an isolated sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI). -The characters inside an isolate sequence have the initial string (paragraph) direction -corresponding to the starting control (LTR for LRI, RTL for RLI, auto for FSI). -The isolate sequence is **isolated** from surrounding text. -This means that the surrounding text treats it as-if the sequence were a single neutral character. +The characters inside an isolated sequence have the initial string direction +corresponding to the starting character ( +left-to-right for `LRI`, +right-to-left for `RLI`, +or auto for `FSI`). +They are called "isolates" because the enclosed text is **isolated** from surrounding text +while being processed using the Unicode Bidirectional Algorithm (UBA). +The surrounding text treats the sequence as-if it were a single neutral character, +while the interior sequence is processed using the base direction specified by the isolate +starting character. > [!NOTE] > One of the side-effects of using `{`/`}` and `{{`/`}}` to delimit _expressions_ @@ -96,11 +104,26 @@ These include: - U+200F RIGHT-TO-LEFT MARK (RLM) - U+061C ARABIC LETTER MARK (ALM) -These characters are invisible strongly-directional characters used in bidirectional +These characters are invisible strongly-directional characters. +They are used in bidirectional text to coerce certain directional behavior (usually to mark the end of a sequence of characters that would otherwise be ambiguous or interact with neutrals or opposite direction runs in an unhelpful way). +### Strictness and Abuse + +We want the syntax to be somewhat permissive, particularly when it comes to paired isolates. 
+The isolates and strongly-directional marks are invisible except in certain specialized editing environments. +While users and tools should be strict about using well-formed isolate sequences, +we don't want to have invisible characters or whitespace generate additional syntax errors except where necessary. +Therefore, it should not be a syntax error if a user, editor, or tool fails to match opening/closing isolates. + +It is possible to generate a "strict" version of the ABNF that is more restrictive about isolate pairing. +Such an ABNF might be used by message serializers to ensure high-quality message generation. + +Unfortunately, permitting a "relaxed" handling of isolates/marks, when mixed with whitespace, +could produce the various Trojan Source effects described in [[UTS55]](https://www.unicode.org/reports/tr55/#Usability-bidi)) + ## Use-Cases _What use-cases do we see? Ideally, quote concrete examples._ @@ -135,7 +158,7 @@ You have {$م1صر :م2صر م3صر=م4صر} <- no controls You have {$م1صر‎ :م2صر‎ م3صر‎=م4صر‎} <- LRM after each RTL token ``` -3. As a developer or translator, I want to make RTL literal or names appear correctly +3. As a developer or translator, I want to make unquoted RTL literals or names appear correctly in my plain-text editing environment. I don't want to have to manage a lot of paired controls, when I can get the right effect using strongly directional mark characters (LRM, RLM, ALM) @@ -209,6 +232,12 @@ Newlines inside of messages should not harm later syntax. ن}}‎ 123 456 {{ LRM }} ``` + +Naive text editors, when operating in a right-to-left context, +might display a _message_ with an RTL base direction. +While the display of the _message_ might be somewhat damaged by this, +it should still produce results that are as reasonable as possible. 
+ ## Constraints _What prior decisions and existing conditions limit the possible design?_ @@ -230,86 +259,110 @@ The workaround in #763 was to permit these characters _before_ or _after_ whites using the various whitespace productions. This works at the cost of allowing spurious markers. +We want isolate characters to be _outside_ of patterns. +There is an open question about how best to place them. +One option would be to place them adjacent to the "pattern quote" character sequences `{{`/`}}`. +Another option would be to place them _inside_ the pattern quotes, e.g. `{\u2066{`/`}\u2069}`. + +Bidi isolates and marks are invisible characters. +Whitespace is also invisible. +Mixing these may be problematic. +Not allowing these to mix could produce annoying parse errors. + ## Proposed Design _Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._ -Editing and display of a _message_ SHOULD always use a left-to-right base direction +The syntax of a _message_ assumes a left-to-right base direction both for the complete text of the _message_ as well as for each line (paragraph) -contained therein. - -We use LTR display because the syntax of a _message_ depends on LTR word tokens, +contained therein. +We prefer LTR display because human understanding of a _message_ depends on LTR word tokens, as well as token ordering (as in a placeholder or with variant keys). +Note that LTR display is **_not_** a requirement, because that is beyond the scope of MF2 itself. +However, tool and editor implementers ought to pay attention to this assumption. 
-This is not the disadvantage to right-to-left languages that it might first appear: -- Bidi inside of _patterns_ works normally -- _Placeholders_ and _markup_ are isolated (treated as neutrals) so that they appear +Preferring LTR display is not the disadvantage to right-to-left languages that it might first appear: +- Bidi inside of _patterns_ works normally (we go to great lengths to make the interior + of _patterns_ work as plain text) +- _Placeholders_ and _markup_ can be isolated (treated as neutrals) so that they appear in the correct location in an RTL _pattern_ - _Expressions_ use isolates and directional marks to display internal tokens in the correct order and without spillover effects +- The syntax uses enclosing marks (specifically curly brackets) which the Unicode Bidirectional Algorithm + pairs up for shaping purposes, resulting in a weak form of isolation in the syntax itself. -Permit isolating bidi controls to be used on the **outside** of the following: +The syntax permits (but does not require) isolating bidi controls to be used on the +**outside** of the following: - unquoted literals - quoted literals - quoted patterns -We permit any of the isolate starting controls (LRI, RLI, FSI) because we want to allow +We permit any of the isolate starting characters (LRI, RLI, FSI) because we want to allow the user to set the base direction of a _literal_ or _pattern_ according to its respective actual contents. +> [!IMPORTANT] +> This change adds a "lookahead" to the process of determining if a given _message_ is +> "simple" or "complex", as LRI, RLI, and FSI are all valid starters for a simple message +> as well as being allowed before a quoted pattern, declaration, or selector. 
+ This would change the ABNF as follows: (Notice that this change includes a production `bidi` described further down in this document) ```abnf -literal = ( open-isolate (quoted / (unquoted [bidi])) close-isolate) - / (quoted / (unquoted [bidi])) -quoted-pattern = ( open-isolate "{{" pattern "}}" close-isolate) - / ("{{" pattern "}}") +literal = [open-isolate] (quoted-literal / (unquoted-literal [bidi])) [close-isolate] +quoted-pattern = [open-isolate] "{{" pattern "}}" [close-isolate] open-isolate = %x2066-2068 close-isolate = %x2069 ``` > [!IMPORTANT] -> The isolating controls go on the **_outside_** of the various _literal_ and _pattern_ +> The isolating characters go on the **_outside_** of the various _literal_ and _pattern_ > productions because characters on the **_inside_** of these are part of the _literal_'s > or _pattern_'s textual content. -> We need to allow users to include bidi controls in the output of MF2. - -Permit **left-to-right** isolating bidi controls (`U+2066`...`U+2069`) to be used **immediately inside** the following: -- expressions -- markup - -We only permit the LTR isolates because the contents of an _expression_ -or _markup_ must be laid out left-to-right. -_Literal_ values can be right-to-left isolated within that or use strongly -directional marks to ensure correct display. +> We need to allow users to include bidi characters, including isolates and strongly directional marks, +> in the output of MF2. + +- Permit **left-to-right** isolates + (starting with LRI `U+2066` and ending with PDI `U+2069`) + to be used **immediately inside** the following: + - expressions + - markup + +- Permit any type of isolate sequence + (starting with LRI `U+2066`, RLI `U+2067`, or FSI `U+2068` and ending with PDI `U+2069`) + around any token inside of an expression or markup. + +- Permit the use of LRM, RLM, or ALM strongly directional marks immediately following any of the items that + **end** with the `name` production in the ABNF. 
+ This includes _identifiers_ found in the names of + _functions_ + and _options_, + plus the names of _variables_, + as well as the contents of _unquoted_ literals. This would change the ABNF as follows (assuming the above changes are also incorporated): ```abnf -expression = "{" LRI (literal-expression / variable-expression / annotation-expression) close-isolate "}" - / "{" (literal-expression / variable-expression / annotation-expression) "}" +expression = "{" [LRI] (literal-expression / variable-expression / annotation-expression) [close-isolate] "}" literal-expression = [s] literal [s annotation] *(s attribute) [s] variable-expression = [s] variable [s annotation] *(s attribute) [s] annotation-expression = [s] annotation *(s attribute) [s] -markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone - / "{" [s] "/" identifier *(s option) *(s attribute) [s] "}" ; close - / "{" LRI [s] "#" identifier *(s option) *(s attribute) [s] ["/"] close-isolate "}" ; open and standalone - / "{" LRI [s] "/" identifier *(s option) *(s attribute) [s] close-isolate "}" ; close +markup = "{" [LRI] [s] "#" identifier *(s option) *(s attribute) [s] ["/"] [close-isolate] "}" ; open and standalone + / "{" [LRI] [s] "/" identifier *(s option) *(s attribute) [s] [close-isolate] "}" ; close LRI = %x2066 ``` -Permit the use of LRM, RLM, or ALM stronly directional marks immediately following any of the items that -**end** with the `name` production in the ABNF. -This includes _identifiers_ found in the names of -_functions_ -and _options_, -plus the names of _variables_, -as well as the contents of _unquoted_ literals. +> [!NOTE] +> This design only permits LTR isolates at the expression level because the contents of an _expression_ +> or _markup_ must be laid out left-to-right. +> _Literal_ values can be right-to-left isolated within that or use strongly +> directional marks to ensure correct display. 
> [!NOTE] -> Notice that _unquoted_ literals can also be surrounded by bidi isolates +> Notice that _unquoted literals_ can also be surrounded by bidi isolates > using the previous syntax modification just above. +> The isolates are **not** a part of the literal! > [!NOTE] > Notice that `reserved-annotation` is not in the ABNF changes because it already @@ -321,14 +374,25 @@ as well as the contents of _unquoted_ literals. ```abnf variable-expression = "{" [s] variable [bidi] [s annotation] *(s attribute) [s] "}" function = ":" identifier [bidi] *(s option) -option = identifier [bidi] [s] "=" [s] (literal / variable) [bidi] -attribute = "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])] -markup = "{" [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone - / "{" [s] "/" identifier [bidi] *(s option) *(s attribute) [s] "}" ; close -identifier = [(namespace [bidi] ":")] name +option = [LRI] identifier [bidi] [s] "=" [s] (literal / variable) [bidi] [close-isolate] +attribute = [LRI] "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])] [close-isolate] +markup = "{" [LRI] [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] [close-isolate] "}" ; open and standalone + / "{" [LRI] [s] "/" identifier [bidi] *(s option) *(s attribute) [s] [close-isolate] "}" ; close +identifier = [(namespace ns-separator)] name +ns-separator = [bidi] ":" bidi = [ %x200E-200F / %x061C ] ``` +### Open Issues with Proposed Design + +The ABNF changes found above put isolates and strongly directional marks into specific locations, +such as directly next to `{`/`}`/`{{`/`}}` markers +or directly following "tokens" such as `name`. +This makes it a syntax error for whitespace to appear around the isolates or marks. 
+A more permissive design would add the isolates and strongly directional marks to required and optional +whitespace in the syntax and depend on users/editors to appropriately pair or position the marks +to get optimal display. + ## Alternatives Considered _What other solutions are available?_ @@ -348,69 +412,112 @@ the results or debug what is wrong with their messages. By contrast, if users insert too many or the wrong controls using the recommended design, the _message_ would still be functional and would emit no undesired characters. +### Super-loose isolation -### Loose isolation +Add isolates and strongly directional marks to required and optional whitespace in the syntax. +This would permit users to get the effects described by the above design, +as long as they use isolates/marks in a "responsible" way. -Apply bidi isolates in a slightly different way. -The main differences to the proposed solution are: -1. The open/close isolate characters are not syntactically required to be paired. - This avoids introducing parse errors for missing or required invisible characters, - which would lead to bad user experiences. -2. Rather than patching the `name` rule with an optional trailing LRM/RLM/ALM, - allow for its proper isolation. +(Omitting other changes found in #673) -Quoted patterns, quoted literals, and names may be isolated by LRI/RLI/FSI...PDI. -For names and quoted literals, the isolate characters are outside the body of the token, -but for quoted patterns, the isolates are in the middle of the `{{` and `}}` characters. -This avoids adding a lookahead requirement for detecting a `complex-message` start, -and differentiates a `quoted-pattern` from a `quoted` `key` in a `variant`. +```abnf +; strongly directional marks and bidi isolates +; ALM / LRM / RLM / LRI / RLI / FSI / PDI +bidi = %x061C / %x200E / %x200F / %x2066-2069 -Expressions and markup may be isolated by LRI...PDI immediately within the `{` and `}`. 
+; optional whitespace +owsp = *( s / bidi ) -An LRI is allowed immediately after a newline outside patterns and within expressions. -This is intended to allow left-to-right representation for "code" -even if it contains a newline followed by content -that could otherwise prompt the paragraph direction to be detected as right-to-left. +; required whitespace +wsp = [ owsp ] 1*s [ owsp ] -```abnf -name = [open-isolate] name-start *name-char [close-isolate] -quoted = [open-isolate] "|" *(quoted-char / quoted-escape) "|" [close-isolate] -quoted-pattern = "{" [open-isolate] "{" pattern "}" [close-isolate] "}" +; whitespace characters +s = ( SP / HTAB / CR / LF / %x3000 ) +``` -literal-expression = "{" [LRI] [s] literal [s annotation] *(s attribute) [s] [close-isolate] "}" -variable-expression = "{" [LRI] [s] variable [s annotation] *(s attribute) [s] [close-isolate] "}" -annotation-expression = "{" [LRI] [s] annotation *(s attribute) [s] [close-isolate] "}" +**Pros** +- Avoids problems with syntax errors that users and tools might find difficult to debug. +- Effective if used carefully. +- Addresses need to comply with UAX#31 -markup = "{" [LRI] [s] "#" identifier *(s option) *(s attribute) [s] ["/"] [close-isolate] "}" - / "{" [LRI] [s] "/" identifier *(s option) *(s attribute) [s] [close-isolate] "}" +**Cons** +- Syntax does not prevent poor display outcomes, including enabling some Trojan Source cases (UTS#55); + note that tooling or linting can help ameliorate these issues. -s = 1*( SP / HTAB / CR / LF [LRI] / %x3000 ) -LRI = %x2066 -open-isolate = %x2066-2068 -close-isolate = %x2069 +### Strict isolation all the time + +Apply bidi isolates in a strict way. +The main difference from the proposed solution is: +1. The open/close isolate characters are syntactically required to be paired. + This introduces parse errors for unpaired invisible characters, + which could lead to bad user experiences. 
+ +As noted above, the "strict" version of the ABNF should be adopted by serializers and for +message normalization. + +```abnf +variable-expression = "{" [s] variable [bidi] [s annotation] *(s attribute) [s] "}" +function = ":" identifier [bidi] *(s option) +option = identifier [bidi] [s] "=" [s] (literal / variable) [bidi] + / LRI identifier [bidi] [s] "=" [s] (literal / variable) [bidi] close-isolate +attribute = "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])] + / LRI "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])] close-isolate +markup = "{" [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone + / "{" LRI [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] close-isolate "}" + / "{" [s] "/" identifier [bidi] *(s option) *(s attribute) [s] "}" ; close + / "{" LRI [s] "/" identifier [bidi] *(s option) *(s attribute) [s] close-isolate "}" ; close +identifier = [(namespace ns-separator)] name +ns-separator = [bidi] ":" +bidi = [ %x200E-200F / %x061C ] ``` + +### Isolate `name` rather than `unquoted-literal` + Isolating rather than marking `name` helps ensure that its directionality does not spill over to adjoining syntax. + +The following replaces the proposed design's changes to `literal` and the `[bidi]` additions to +`variable-expression`, `function`, `option`, `attribute`, `markup`, and `ns-separator`: +```abnf +name = [open-isolate] name-start *name-char [close-isolate] +quoted-literal = [open-isolate] "|" *(quoted-char / quoted-escape) "|" [close-isolate] +``` + For example, this allows for the proper rendering of the expression ``` {⁦:⁧אחת⁩:⁧שתיים⁩⁩} ``` where "אחת" is the `namespace` of the `identifier`. -Without `name` isolation, this would render as +Without `name` isolation, this would (misleadingly) render as ``` {⁦:אחת:שתיים⁩} ``` -In the syntax, it's much simpler to include the changes to `name` in that rule, -rather than patching every place where `name` is used. 
-Either way, the parsed value of the name should not include the open/close isolates, -just as they're not included in the parsed values of quoted literals or quoted patterns. +Note that the parsed value of the `name` does not include the open/close isolates, +just as they're not included in the parsed values of quoted literals or quoted patterns, +even though the production includes the characters. +We could accomplish this by adding additional productions to manage `name`, at the cost +of a more complex ABNF. + +**Pros** +- In the syntax, it's much simpler to include the changes to `name` in the `name` rule, + rather than patching every place where `name` is used. +**Cons** +- Implementations need to remove isolates from the `name` token before comparing + the value to other values (such as comparing `function` or `variable` names). + Because of namespacing, this requires looking _inside_ the token. +- Implementations might need to insert isolates when generating names upon serialization. + The current data model does not separate `namespace` and `name`, + so this might be more complicated. +- `unquoted-literal` values appear as keys, as operands, and as option values. + If not isolated, these can cause spillover effects, so we might need both `name` + and `unquoted-literal` isolation. ### Deeper Syntax Changes We could alter the syntax to make it more "bidi robust", -such as by using strongly directional instead of neutrals. +such as by using strongly directional characters instead of neutrals. ### Forbid RTL characters in `name` and/or `unquoted` We could alter the syntax to forbid using RTL characters in names and unquoted literals. 
@@ -425,7 +532,7 @@ Cons: - This is not friendly to non-English/non-Latin users and represents a usability restriction in environments in which names can be non-ASCII values -### Allow more permissive use of bidi controls +### Permit LRI, RLI, and FSI inside expressions and markup We could permit RLI/FSI to be used inside _expressions_ and _markup_. This would be an advantage for simple _expressions_ containing only or primarily @@ -468,3 +575,45 @@ complex sets of controls. - Requires complex sets of bidi controls - RTL editing/display is mostly a special case; we already afford the ability to edit RTL in _patterns_ and _literals_ + +### Hybrid approaches + +Strict syntactical requirements produce better _display_ outcomes +that solve the various problems enumerated in this design document. +However, the strictness comes with a cost: otherwise-valid messages, +including messages that display completely as expected and are not in any way misleading, +can produce syntax errors. +These errors can be difficult to debug, since the characters are invisible. +Syntax errors are generally treated as fatal by processors. + +Semi-strict or super-loose strategies can be used to avoid producing these types of syntax error. +However, valid messages using these approaches can have stray (e.g. unpaired isolates), +malformed (e.g. PDI before LRI/RLI/FSI), +or badly formatted character sequences (wrapping the wrong things), +unless the user or the user's tools are careful. +This can include deliberate abuse, such as Trojan Source attacks (see UTS#55), +in which Bad Actors create messages that have a misleading appearance vs. their runtime interpretation. 
+ +A hybrid ("Postel's Law") approach would be to permit the use of isolates and strongly directional marks +in whitespace in a permissive way (see: "super-loose isolation"), +particularly in runtime formatting operations +but strongly encourage tools to implement message normalization on a strictly-defined grammar +(see: "strict isolation all the time") +and to encourage users to use the strict version of the grammar when writing or serializing messages. + +The hybrid approach would include tests to allow implementations to claim +adherence to the stricter grammar. + +**Pros** +- Messages can be written that solve all display problems +- Stray, unpaired, repeated, or other invisible typos do not produce spurious + syntax errors +- Provides a foundation for tools to claim strict conformance and message normalization + as well as guidance to implementers to make them want to adopt it + +**Cons** +- Requires additional effort to maintain the grammar +- Requires additional effort to maintain tests +- Valid messages can contain Trojan Source and other negative display consequences; + messages can be checked, however, using the strict grammar, so tools could warn + users of potential abuse
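
The "Hybrid approaches" section says that tools could check a loosely parsed message and warn about stray or malformed isolates. The sketch below illustrates that kind of warning pass in Python. It is not part of the patch or of any MessageFormat 2 implementation: the function name `find_isolate_problems` and its message strings are hypothetical, and it only checks pairing of LRI/RLI/FSI with PDI across the whole string. It ignores paragraph boundaries and does not check where in the syntax the isolates sit, so it is much weaker than validating against a strict ABNF.

```python
# Hypothetical lint helper, for illustration only.
OPEN_ISOLATES = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
POP_ISOLATE = "\u2069"  # PDI


def find_isolate_problems(message: str) -> list[tuple[int, str]]:
    """Return (index, description) pairs for stray or unpaired isolates."""
    problems = []
    open_stack = []  # indexes of isolate openers that are not yet closed
    for index, char in enumerate(message):
        if char in OPEN_ISOLATES:
            open_stack.append(index)
        elif char == POP_ISOLATE:
            if open_stack:
                open_stack.pop()  # this PDI closes the most recent opener
            else:
                problems.append((index, "PDI without a preceding LRI/RLI/FSI"))
    # Any opener still on the stack was never closed.
    problems.extend(
        (index, "isolate opener without a closing PDI") for index in open_stack
    )
    return sorted(problems)
```

A tool following the hybrid approach might run a check like this after a permissive parse and report the results as warnings rather than syntax errors.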