unicode-org · aphillips · Aug 12, 2024 · Jun 24, 2024 · Jun 25, 2024 · Jul 2, 2024
diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md
@@ -70,18 +70,23 @@ would enable tools, including translation tools, and users who are writing in RT
 to format a _message_ so that its plain-text representation and its function
 are unambiguous.
 
-The isolate controls are paired invisible control characters inserted around a portion of a string.
-The start of an isolate sequence is one of:
+The isolates are paired invisible characters inserted around a portion of a string.
+The start of an isolated sequence is one of:
 - U+2066 LEFT-TO-RIGHT ISOLATE (LRI)
 - U+2067 RIGHT-TO-LEFT ISOLATE (RLI)
 - U+2068 FIRST STRONG ISOLATE (FSI)
 
-The end of an isolate sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI).
+The end of an isolated sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI).
 
-The characters inside an isolate sequence have the initial string (paragraph) direction
-corresponding to the starting control (LTR for LRI, RTL for RLI, auto for FSI).
-The isolate sequence is **isolated** from surrounding text.
-This means that the surrounding text treats it as-if the sequence were a single neutral character.
+The characters inside an isolated sequence have the initial string direction
+corresponding to the starting control (
+left-to-right for `LRI`, 
+right-to-left for `RLI`, 
+or <a href="https://www.w3.org/TR/i18n-glossary#auto-direction">auto</a> for `FSI`).
+The isolated sequence is **isolated** from surrounding text:
+it is processed using the Unicode Bidirectional Algorithm (UBA)
+separately from the rest of the string and
+the surrounding text treats the sequence as if it were a single neutral character.
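
As a rough illustration of the paragraph above, the following Python sketch wraps a string in an isolate pair and prints the bidi class that Python's `unicodedata` module reports for each control. The helper name `isolate` is hypothetical and is not part of MF2 or of this proposal.

```python
import unicodedata

# Code points taken from the list above.
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"


def isolate(text: str, opener: str = FSI) -> str:
    """Wrap text in an isolate pair (hypothetical helper, for illustration only)."""
    if opener not in (LRI, RLI, FSI):
        raise ValueError("opener must be LRI, RLI, or FSI")
    return f"{opener}{text}{PDI}"


# Show the code point, bidi class, and character name of each control.
for ch in (LRI, RLI, FSI, PDI):
    print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch)} {unicodedata.name(ch)}")
```

Defaulting to `FSI` is only a choice made for this sketch; a caller that knows the direction of the content would pass `LRI` or `RLI` instead.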
 
 > [!NOTE]
 > One of the side-effects of using `{`/`}` and `{{`/`}}` to delimit _expressions_
@@ -96,11 +101,26 @@ These include:
 - U+200F RIGHT-TO-LEFT MARK (RLM)
 - U+061C ARABIC LETTER MARK (ALM)
 
-These characters are invisible strongly-directional characters used in bidirectional
+These characters are invisible strongly-directional characters.
+They are used in bidirectional
 text to coerce certain directional behavior (usually to mark the end of 
 a sequence of characters that would otherwise be ambiguous or interact with
 neutrals or opposite direction runs in an unhelpful way).
 
+### Strictness and Abuse
+
+We want the syntax to be somewhat permissive, particularly when it comes to paired isolates.
+The isolates and strongly-directional marks are invisible except in certain specialized editing environments.
+While users and tools should be strict about using well-formed isolate sequences,
+we don't want invisible characters or whitespace to generate additional syntax errors except where necessary.
+Therefore, it should not be a syntax error if a user, editor, or tool fails to match opening/closing isolates.
+
+It is possible to generate a "strict" version of the ABNF that is more restrictive about isolate pairing.
+Such an ABNF might be used by message serializers to ensure high-quality message generation.
+
+Unfortunately, permitting a "relaxed" handling of isolates/marks, when mixed with whitespace, 
+could produce the various Trojan Source effects described in [[UTS55]](https://www.unicode.org/reports/tr55/#Usability-bidi).
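
The kind of pairing check that a "strict" serializer might apply can be sketched in Python as follows. This is only an approximation under stated assumptions: it checks balance across a whole string rather than per token, which is looser than a strict ABNF would be, and the function name `isolates_balanced` is hypothetical.

```python
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"
OPEN_ISOLATES = {LRI, RLI, FSI}


def isolates_balanced(text: str) -> bool:
    """Return True if every opening isolate has a later PDI and vice versa."""
    depth = 0
    for ch in text:
        if ch in OPEN_ISOLATES:
            depth += 1
        elif ch == PDI:
            if depth == 0:
                # A PDI with no matching opener.
                return False
            depth -= 1
    # Any remaining depth means an opener was never closed.
    return depth == 0
```

A permissive parser would accept input for which this returns `False`, while a serializer or normalizer could use a check like this to avoid emitting such input.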
+
 ## Use-Cases
 
 _What use-cases do we see? Ideally, quote concrete examples._
@@ -209,6 +229,12 @@ Newlines inside of messages should not harm later syntax.
 ن}}‎ 123 456 {{ LRM }}
 ```
 
+
+Naive text editors, when operating in a right-to-left context, 
+might display a _message_ with an RTL base direction.
+While the display of the _message_ might be somewhat damaged by this,
+it should still produce results that are as reasonable as possible.
+
 ## Constraints
 
 _What prior decisions and existing conditions limit the possible design?_
@@ -230,72 +256,90 @@ The workaround in #763 was to permit these characters _before_ or _after_ whites
 using the various whitespace productions.
 This works at the cost of allowing spurious markers.
 
+We want isolate characters to be _outside_ of patterns.
+There is an open question about how best to place them.
+One option would be to place them adjacent to the "pattern quote" character sequences `{{`/`}}`.
+Another option would be to place them _inside_ the pattern quotes, e.g. `{\u2066{`/`}\u2069}`.
+
+Bidi isolates and marks are invisible characters.
+Whitespace is also invisible.
+Mixing these may be problematic.
+Not allowing these to mix could produce annoying parse errors.
+
 ## Proposed Design
 
 _Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._
 
-Editing and display of a _message_ SHOULD always use a left-to-right base direction
+The syntax of a _message_ assumes a left-to-right base direction
 both for the complete text of the _message_ as well as for each line (paragraph)
-contained therein.
-
-We use LTR display because the syntax of a _message_ depends on LTR word tokens,
+contained therein. 
+We prefer LTR display because human understanding of a _message_ depends on LTR word tokens,
 as well as token ordering (as in a placeholder or with variant keys).
+Note that LTR display is **_not_** a requirement, because that is beyond the scope of MF2 itself.
+However, tool and editor implementers ought to pay attention to this assumption.
 
-This is not the disadvantage to right-to-left languages that it might first appear:
-- Bidi inside of _patterns_ works normally
-- _Placeholders_ and _markup_ are isolated (treated as neutrals) so that they appear
+Preferring LTR display is not the disadvantage to right-to-left languages that it might at first appear to be:
+- Bidi inside of _patterns_ works normally (we go to great lengths to make the interior
+  of _patterns_ work as plain text)
+- _Placeholders_ and _markup_ can be isolated (treated as neutrals) so that they appear
   in the correct location in an RTL _pattern_
 - _Expressions_ use isolates and directional marks to display internal tokens in the
   correct order and without spillover effects
+- The syntax uses enclosing marks (specifically curly brackets) which the Unicode Bidirectional Algorithm
+  pairs up for shaping purposes, resulting in a weak form of isolation in the syntax itself.
 
-Permit isolating bidi controls to be used on the **outside** of the following:
+The syntax permits (but does not require) isolating bidi controls to be used on the 
+**outside** of the following:
 - unquoted literals
 - quoted literals
 - quoted patterns
 
-We permit any of the isolate starting controls (LRI, RLI, FSI) because we want to allow
+We permit any of the isolate starting characters (LRI, RLI, FSI) because we want to allow
 the user to set the base direction of a _literal_ or _pattern_ according to its respective 
 actual contents.
 
+> [!IMPORTANT]
+> This change adds a "lookahead" to the process of determining if a given _message_ is
+> "simple" or "complex", as LRI, RLI, and FSI are all valid starters for a simple message
+> as well as being allowed before a quoted pattern.
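
The lookahead mentioned in the note above might look roughly like this in Python. The sketch makes several assumptions that are not settled by this document: that a complex message starts, after optional whitespace, with either a `.` keyword or a quoted pattern; that at most one opening isolate may precede `{{`; and that isolates are not permitted before keywords. The function name `looks_complex` is hypothetical.

```python
OPEN_ISOLATES = "\u2066\u2067\u2068"  # LRI, RLI, FSI
WHITESPACE = " \t\r\n\u3000"  # SP / HTAB / CR / LF / %x3000


def looks_complex(message: str) -> bool:
    """Guess whether a message is complex by inspecting only its first characters."""
    i = 0
    # Skip leading whitespace.
    while i < len(message) and message[i] in WHITESPACE:
        i += 1
    # Assumption: a "." here begins a declaration or matcher keyword.
    if message.startswith(".", i):
        return True
    # The lookahead: an opening isolate alone does not decide the question,
    # so step over it and inspect what follows.
    if i < len(message) and message[i] in OPEN_ISOLATES:
        i += 1
    return message.startswith("{{", i)
```

For example, an RLI followed by `{{` is treated as the start of a quoted pattern, while an RLI followed by ordinary text is treated as the start of a simple message.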
+
 This would change the ABNF as follows:
 (Notice that this change includes a production `bidi` described further down
 in this document)
 ```abnf
-literal        = ( open-isolate (quoted / (unquoted [bidi])) close-isolate)
-               / (quoted / (unquoted [bidi]))
-quoted-pattern = ( open-isolate "{{" pattern "}}" close-isolate)
-               / ("{{" pattern "}}")
+literal        = [open-isolate] (quoted / (unquoted [bidi])) [close-isolate]
+quoted-pattern = [open-isolate] "{{" pattern "}}" [close-isolate]
 
 open-isolate   = %x2066-2068
 close-isolate  = %x2069
 ```
 
 > [!IMPORTANT]
-> The isolating controls go on the **_outside_** of the various _literal_ and _pattern_
+> The isolating characters go on the **_outside_** of the various _literal_ and _pattern_
 > productions because characters on the **_inside_** of these are part of the _literal_'s
 > or _pattern_'s textual content.
-> We need to allow users to include bidi controls in the output of MF2.
+> We need to allow users to include bidi characters, including isolates and strongly directional marks,
+> in the output of MF2.
 
-Permit **left-to-right** isolating bidi controls (`U+2066`...`U+2069`) to be used **immediately inside** the following:
+Permit **left-to-right** isolates (`U+2066` LRI, closed by `U+2069` PDI) to be used **immediately inside** the following:
 - expressions
 - markup
 
+Permit isolates around any token inside of an expression or markup.
+
 We only permit the LTR isolates because the contents of an _expression_
 or _markup_ must be laid out left-to-right.
 _Literal_ values can be right-to-left isolated within that or use strongly
 directional marks to ensure correct display.
 
 This would change the ABNF as follows (assuming the above changes are also incorporated):
 ```abnf
-expression            = "{" LRI (literal-expression / variable-expression / annotation-expression) close-isolate "}"
-                      / "{" (literal-expression / variable-expression / annotation-expression) "}"
+expression            = "{" [LRI] (literal-expression / variable-expression / annotation-expression) [close-isolate] "}"
 literal-expression    = [s] literal [s annotation] *(s attribute) [s]
 variable-expression   = [s] variable [s annotation] *(s attribute) [s]
 annotation-expression = [s] annotation *(s attribute) [s]
-markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}"                    ; open and standalone
-       / "{" [s] "/" identifier *(s option) *(s attribute) [s] "}"                          ; close
-       / "{" LRI [s] "#" identifier *(s option) *(s attribute) [s] ["/"] close-isolate "}"  ; open and standalone
-       / "{" LRI [s] "/" identifier *(s option) *(s attribute) [s] close-isolate "}"        ; close
+markup = "{" [LRI] [s] "#" identifier *(s option) *(s attribute) [s] ["/"] [close-isolate] "}"  ; open and standalone
+       / "{" [LRI] [s] "/" identifier *(s option) *(s attribute) [s] [close-isolate] "}"        ; close
 LRI = %x2066
 ```
 
@@ -349,46 +393,19 @@ By contrast, if users insert too many or the wrong controls using the recommende
 the _message_ would still be functional and would emit no undesired characters.
 
 
-### Loose isolation
-
-Apply bidi isolates in a slightly different way.
-The main differences to the proposed solution are:
-1. The open/close isolate characters are not syntactically required to be paired.
-   This avoids introducing parse errors for missing or required invisible characters,
-   which would lead to bad user experiences.
-2. Rather than patching the `name` rule with an optional trailing LRM/RLM/ALM,
-   allow for its proper isolation.
+### Strict isolation all the time
 
-Quoted patterns, quoted literals, and names may be isolated by LRI/RLI/FSI...PDI.
-For names and quoted literals, the isolate characters are outside the body of the token,
-but for quoted patterns, the isolates are in the middle of the `{{` and `}}` characters.
-This avoids adding a lookahead requirement for detecting a `complex-message` start,
-and differentiates a `quoted-pattern` from a `quoted` `key` in a `variant`.
+Apply bidi isolates in a strict way.
+The main difference from the proposed solution is:
+1. The open/close isolate characters are syntactically required to be paired.
+   This introduces parse errors for unpaired invisible characters,
+   which could lead to bad user experiences.
 
-Expressions and markup may be isolated by LRI...PDI immediately within the `{` and `}`.
+As noted above, the "strict" version of the ABNF should be adopted by serializers and for 
+message normalization.
 
-An LRI is allowed immediately after a newline outside patterns and within expressions.
-This is intended to allow left-to-right representation for "code"
-even if it contains a newline followed by content
-that could otherwise prompt the paragraph direction to be detected as right-to-left.
-
-```abnf
-name           = [open-isolate] name-start *name-char [close-isolate]
-quoted         = [open-isolate] "|" *(quoted-char / quoted-escape) "|" [close-isolate]
-quoted-pattern = "{" [open-isolate] "{" pattern "}" [close-isolate] "}"
+// TODO put ABNF here
 
-literal-expression    = "{" [LRI] [s] literal [s annotation] *(s attribute) [s] [close-isolate] "}"
-variable-expression   = "{" [LRI] [s] variable [s annotation] *(s attribute) [s] [close-isolate] "}"
-annotation-expression = "{" [LRI] [s] annotation *(s attribute) [s] [close-isolate] "}"
-
-markup = "{" [LRI] [s] "#" identifier *(s option) *(s attribute) [s] ["/"] [close-isolate] "}"
-       / "{" [LRI] [s] "/" identifier *(s option) *(s attribute) [s] [close-isolate] "}"
-
-s = 1*( SP / HTAB / CR / LF [LRI] / %x3000 )
-LRI = %x2066
-open-isolate  = %x2066-2068
-close-isolate = %x2069
-```
 
 Isolating rather than marking `name` helps ensure
 that its directionality does not spill over to adjoining syntax.
@@ -397,7 +414,7 @@ For example, this allows for the proper rendering of the expression
 {⁦:⁧אחת⁩:⁧שתיים⁩⁩}
 ```
 where "אחת" is the `namespace` of the `identifier`.
-Without `name` isolation, this would render as
+Without `name` isolation, this would (misleadingly) render as
 ```
 {⁦:אחת:שתיים⁩}
 ```
@@ -410,7 +427,7 @@ just as they're not included in the parsed values of quoted literals or quoted p
 
 ### Deeper Syntax Changes
 We could alter the syntax to make it more "bidi robust", 
-such as by using strongly directional instead of neutrals.
+such as by using strongly directional characters instead of neutrals.
 
 ### Forbid RTL characters in `name` and/or `unquoted`
 We could alter the syntax to forbid using RTL characters in names and unquoted literals.