Merged
Changes from 6 commits
153 changes: 85 additions & 68 deletions exploration/bidi-usability.md
@@ -70,18 +70,23 @@ would enable tools, including translation tools, and users who are writing in RT
to format a _message_ so that its plain-text representation and its function
are unambiguous.

The isolate controls are paired invisible control characters inserted around a portion of a string.
The start of an isolate sequence is one of:
The isolates are paired invisible characters inserted around a portion of a string.
The start of an isolated sequence is one of:
- U+2066 LEFT-TO-RIGHT ISOLATE (LRI)
- U+2067 RIGHT-TO-LEFT ISOLATE (RLI)
- U+2068 FIRST-STRONG ISOLATE (FSI)

The end of an isolate sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI).
The end of an isolated sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI).

The characters inside an isolate sequence have the initial string (paragraph) direction
corresponding to the starting control (LTR for LRI, RTL for RLI, auto for FSI).
The isolate sequence is **isolated** from surrounding text.
This means that the surrounding text treats it as-if the sequence were a single neutral character.
The characters inside an isolated sequence have the initial string direction
corresponding to the starting control (
left-to-right for `LRI`,
right-to-left for `RLI`,
or <a href="https://www.w3.org/TR/i18n-glossary#auto-direction">auto</a> for `FSI`).
The isolated sequence is **isolated** from surrounding text:
it is processed using the Unicode Bidirectional Algorithm (UBA)
separately from the rest of the string and
the surrounding text treats the sequence as-if it were a single neutral character.
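The pairing described above can be sketched in a few lines of Python. This is only an illustration: the helper name `isolate` and the `"ltr"`/`"rtl"`/`"auto"` keys are hypothetical and not part of the proposal; only the code points come from the text.

```python
# Code points named in the text above.
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

# Hypothetical mapping from a direction keyword to its opening isolate.
_OPENERS = {"ltr": LRI, "rtl": RLI, "auto": FSI}


def isolate(text, direction="auto"):
    """Wrap text in the opening isolate for `direction` and a closing PDI."""
    return _OPENERS[direction] + text + PDI


# An RTL word isolated so that it cannot reorder the surrounding syntax.
wrapped = isolate("אחת", "rtl")
```

The wrapped string is two characters longer than the input, and both added characters are invisible in most editors.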

> [!NOTE]
> One of the side-effects of using `{`/`}` and `{{`/`}}` to delimit _expressions_
@@ -96,11 +101,26 @@ These include:
- U+200F RIGHT-TO-LEFT MARK (RLM)
- U+061C ARABIC LETTER MARK (ALM)

These characters are invisible strongly-directional characters used in bidirectional
These characters are invisible strongly-directional characters.
They are used in bidirectional
text to coerce certain directional behavior (usually to mark the end of
a sequence of characters that would otherwise be ambiguous or interact with
neutrals or opposite direction runs in an unhelpful way).
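As a rough check of the "strongly-directional" claim, Python's standard `unicodedata` module reports the bidi class of each mark. The dictionary and function names here are illustrative only; LRM (U+200E) is included alongside the two marks listed above.

```python
import unicodedata

# The three invisible strongly-directional marks.
MARKS = {"LRM": "\u200e", "RLM": "\u200f", "ALM": "\u061c"}


def bidi_classes(chars):
    """Map each name to the Unicode bidi class of its character."""
    return {name: unicodedata.bidirectional(ch) for name, ch in chars.items()}


classes = bidi_classes(MARKS)
```

Each mark reports a strong class (`L`, `R`, or `AL`), unlike the isolates, which report their own dedicated classes (`LRI`, `RLI`, `FSI`, `PDI`).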

### Strictness and Abuse

We want the syntax to be somewhat permissive, particularly when it comes to paired isolates.
The isolates and strongly-directional marks are invisible except in certain specialized editing environments.
While users and tools should be strict about using well-formed isolate sequences,
we don't want invisible characters or whitespace to generate additional syntax errors except where necessary.
Therefore, it should not be a syntax error if a user, editor, or tool fails to match opening/closing isolates.

It is possible to generate a "strict" version of the ABNF that is more restrictive about isolate pairing.
Such an ABNF might be used by message serializers to ensure high-quality message generation.

Unfortunately, permitting a "relaxed" handling of isolates/marks, when mixed with whitespace,
could produce the various Trojan Source effects described in [[UTS55]](https://www.unicode.org/reports/tr55/#Usability-bidi).
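A tool that wants to warn about unmatched isolates, without treating them as syntax errors, might count them as in the sketch below. The function name and the idea of returning a pair of counts are assumptions for illustration, not part of the proposal.

```python
OPEN_ISOLATES = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
CLOSE_ISOLATE = "\u2069"  # PDI


def unpaired_isolates(text):
    """Return (unclosed_openers, stray_closers) found in text."""
    depth = 0  # openers still waiting for a PDI
    stray = 0  # PDIs seen with no opener to close
    for ch in text:
        if ch in OPEN_ISOLATES:
            depth += 1
        elif ch == CLOSE_ISOLATE:
            if depth:
                depth -= 1
            else:
                stray += 1
    return depth, stray
```

A linter could report a warning whenever either count is non-zero, while a relaxed parser would still accept the _message_.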

## Use-Cases

_What use-cases do we see? Ideally, quote concrete examples._
@@ -209,6 +229,12 @@ Newlines inside of messages should not harm later syntax.
ن}}‎ 123 456 {{ LRM }}
```


Naive text editors, when operating in a right-to-left context,
might display a _message_ with an RTL base direction.
While the display of the _message_ might be somewhat damaged by this,
it should still produce results that are as reasonable as possible.

## Constraints

_What prior decisions and existing conditions limit the possible design?_
@@ -230,72 +256,90 @@ The workaround in #763 was to permit these characters _before_ or _after_ whites
using the various whitespace productions.
This works at the cost of allowing spurious markers.

We want isolate characters to be _outside_ of patterns.
There is an open question about how best to place them.
One option would be to place them adjacent to the "pattern quote" character sequences `{{`/`}}`.
Another option would be to place them _inside_ the pattern quotes, e.g. `{\u2066{`/`}\u2068}`.

Bidi isolates and marks are invisible characters.
Whitespace is also invisible.
Mixing these may be problematic.
Not allowing these to mix could produce annoying parse errors.

## Proposed Design

_Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._

Editing and display of a _message_ SHOULD always use a left-to-right base direction
The syntax of a _message_ assumes a left-to-right base direction
both for the complete text of the _message_ as well as for each line (paragraph)
contained therein.

We use LTR display because the syntax of a _message_ depends on LTR word tokens,
contained therein.
We prefer LTR display because human understanding of a _message_ depends on LTR word tokens,
as well as token ordering (as in a placeholder or with variant keys).
Note that LTR display is **_not_** a requirement, because that is beyond the scope of MF2 itself.
However, tool and editor implementers ought to pay attention to this assumption.

This is not the disadvantage to right-to-left languages that it might first appear:
- Bidi inside of _patterns_ works normally
- _Placeholders_ and _markup_ are isolated (treated as neutrals) so that they appear
Preferring LTR display is not the disadvantage to right-to-left languages that it might first appear:
- Bidi inside of _patterns_ works normally (we go to great lengths to make the interior
of _patterns_ work as plain text)
- _Placeholders_ and _markup_ can be isolated (treated as neutrals) so that they appear
in the correct location in an RTL _pattern_
- _Expressions_ use isolates and directional marks to display internal tokens in the
correct order and without spillover effects
- The syntax uses paired enclosing marks that the Unicode Bidirectional Algorithm pairs
for shaping purposes and these offer a poor person's form of isolation.
Collaborator

I can't tell whether to read this as "paired enclosing marks (that the Unicode Bidirectional Algorithm pairs for shaping purposes)" or "uses paired enclosing marks (that the Unicode Bidirectional Algorithm pairs) for shaping purposes."

Member Author

It's the former.

Suggested change
for shaping purposes and these offer a poor person's form of isolation.
- The syntax uses enclosing marks (specifically curly brackets) which the Unicode Bidirectional Algorithm
pairs up for shaping purposes, resulting in a weak form of isolation in the syntax itself.

Member Author

Done.


Permit isolating bidi controls to be used on the **outside** of the following:
The syntax permits (but does not require) isolating bidi controls to be used on the
**outside** of the following:
- unquoted literals
- quoted literals
- quoted patterns

We permit any of the isolate starting controls (LRI, RLI, FSI) because we want to allow
We permit any of the isolate starting characters (LRI, RLI, FSI) because we want to allow
the user to set the base direction of a _literal_ or _pattern_ according to its respective
actual contents.

> [!IMPORTANT]
> This change adds a "lookahead" to the process of determining if a given _message_ is
> "simple" or "complex", as LRI, RLI, and FSI are all valid starters for a simple message
> as well as being allowed before a quoted pattern.

This would change the ABNF as follows:
(Notice that this change includes a production `bidi` described further down
in this document)
```abnf
literal = ( open-isolate (quoted / (unquoted [bidi])) close-isolate)
/ (quoted / (unquoted [bidi]))
quoted-pattern = ( open-isolate "{{" pattern "}}" close-isolate)
/ ("{{" pattern "}}")
literal = [open-isolate] (quoted / (unquoted [bidi])) [close-isolate]
quoted-pattern = [open-isolate] "{{" pattern "}}" [close-isolate]

open-isolate = %x2066-2068
close-isolate = %x2069
```
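To illustrate the relaxed `literal` rule for the quoted case, the following sketch accepts optional isolates outside the `|` delimiters and returns only the text between them. The regular expression is a simplified stand-in for `quoted-char / quoted-escape` (it accepts any backslash escape and does not unescape), and the names `QUOTED_LITERAL` and `quoted_value` are hypothetical.

```python
import re

QUOTED_LITERAL = re.compile(
    "[\u2066-\u2068]?"        # optional open-isolate (LRI, RLI, or FSI)
    r"\|((?:[^|\\]|\\.)*)\|"  # simplified quoted body, captured as group 1
    "\u2069?"                 # optional close-isolate (PDI)
)


def quoted_value(source):
    """Return the raw text between the pipes, or None if source does not match."""
    match = QUOTED_LITERAL.fullmatch(source)
    return match.group(1) if match else None
```

Isolates outside the pipes are dropped from the returned value, while isolates inside the pipes are kept because they are part of the literal's content.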

> [!IMPORTANT]
> The isolating controls go on the **_outside_** of the various _literal_ and _pattern_
> The isolating characters go on the **_outside_** of the various _literal_ and _pattern_
> productions because characters on the **_inside_** of these are part of the _literal_'s
> or _pattern_'s textual content.
> We need to allow users to include bidi controls in the output of MF2.
> We need to allow users to include bidi characters, including isolates and strongly directional marks
> in the output of MF2.

Permit **left-to-right** isolating bidi controls (`U+2066`...`U+2069`) to be used **immediately inside** the following:
Permit **left-to-right** isolates (`U+2066`...`U+2069`) to be used **immediately inside** the following:
- expressions
- markup

Permit isolates around any token inside of an expression or markup.

We only permit the LTR isolates because the contents of an _expression_
or _markup_ must be laid out left-to-right.
_Literal_ values can be right-to-left isolated within that or use strongly
directional marks to ensure correct display.

This would change the ABNF as follows (assuming the above changes are also incorporated):
```abnf
expression = "{" LRI (literal-expression / variable-expression / annotation-expression) close-isolate "}"
/ "{" (literal-expression / variable-expression / annotation-expression) "}"
expression = "{" [LRI] (literal-expression / variable-expression / annotation-expression) [close-isolate] "}"
literal-expression = [s] literal [s annotation] *(s attribute) [s]
variable-expression = [s] variable [s annotation] *(s attribute) [s]
annotation-expression = [s] annotation *(s attribute) [s]
markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone
/ "{" [s] "/" identifier *(s option) *(s attribute) [s] "}" ; close
/ "{" LRI [s] "#" identifier *(s option) *(s attribute) [s] ["/"] close-isolate "}" ; open and standalone
/ "{" LRI [s] "/" identifier *(s option) *(s attribute) [s] close-isolate "}" ; close
markup = "{" [LRI] [s] "#" identifier *(s option) *(s attribute) [s] ["/"] [close-isolate] "}" ; open and standalone
/ "{" [LRI] [s] "/" identifier *(s option) *(s attribute) [s] [close-isolate] "}" ; close
LRI = %x2066
```

@@ -349,46 +393,19 @@ By contrast, if users insert too many or the wrong controls using the recommende
the _message_ would still be functional and would emit no undesired characters.


### Loose isolation

Apply bidi isolates in a slightly different way.
The main differences to the proposed solution are:
1. The open/close isolate characters are not syntactically required to be paired.
This avoids introducing parse errors for missing or required invisible characters,
which would lead to bad user experiences.
2. Rather than patching the `name` rule with an optional trailing LRM/RLM/ALM,
allow for its proper isolation.
### Strict isolation all the time

Quoted patterns, quoted literals, and names may be isolated by LRI/RLI/FSI...PDI.
For names and quoted literals, the isolate characters are outside the body of the token,
but for quoted patterns, the isolates are in the middle of the `{{` and `}}` characters.
This avoids adding a lookahead requirement for detecting a `complex-message` start,
and differentiates a `quoted-pattern` from a `quoted` `key` in a `variant`.
Apply bidi isolates in a strict way.
The main differences to the proposed solution is:
Collaborator

I only see one difference?

Member Author

I should do the TODO, which makes it clearer. I'll do one here in the comment for clarity and then go back and fix the PR.

The current design has expression thusly:

expression = "{" [LRI] (literal-expression / variable-expression / annotation-expression) [close-isolate] "}"

This alternative would turn that into:

expression = "{" (literal-expression / variable-expression / annotation-expression) "}"
           / "{" LRI (literal-expression / variable-expression / annotation-expression) close-isolate "}"

In this formulation, you cannot have unpaired opening (or closing) isolates without a syntax error, nor can you have multiples of open or close.

Rinse and repeat for markup, option, attribute, and literals.

Make sense?

Member Author

Fixed the TODO

1. The open/close isolate characters are syntactically required to be paired.
This introduces parse errors for unpaired invisible characters,
which could lead to bad user experiences.

Expressions and markup may be isolated by LRI...PDI immediately within the `{` and `}`.
As noted above, the "strict" version of the ABNF should be adopted by serializers and for
message normalization.
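The practical difference between the relaxed and strict forms of `expression` can be sketched with two regular expressions. `BODY` is a hypothetical stand-in for the real expression contents (it simply excludes braces and isolate characters), so this shows only how the outer isolates are handled.

```python
import re

# Hypothetical stand-in for the expression body: no braces, no isolates.
BODY = r"[^{}\u2066-\u2069]*"

# Relaxed: LRI and PDI are each independently optional.
RELAXED = re.compile(r"\{\u2066?" + BODY + r"\u2069?\}")

# Strict: either both isolates are present or neither is.
STRICT = re.compile(r"\{(?:\u2066" + BODY + r"\u2069|" + BODY + r")\}")

paired = "{\u2066$x\u2069}"
plain = "{$x}"
unpaired = "{\u2066$x}"

relaxed_ok = [bool(RELAXED.fullmatch(s)) for s in (paired, plain, unpaired)]
strict_ok = [bool(STRICT.fullmatch(s)) for s in (paired, plain, unpaired)]
```

Under these assumptions the relaxed pattern accepts all three inputs, while the strict pattern rejects the one with an unpaired LRI, which is the behavior a serializer would want to guarantee in its output.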

An LRI is allowed immediately after a newline outside patterns and within expressions.
This is intended to allow left-to-right representation for "code"
even if it contains a newline followed by content
that could otherwise prompt the paragraph direction to be detected as right-to-left.

```abnf
name = [open-isolate] name-start *name-char [close-isolate]
quoted = [open-isolate] "|" *(quoted-char / quoted-escape) "|" [close-isolate]
quoted-pattern = "{" [open-isolate] "{" pattern "}" [close-isolate] "}"
// TODO put ABNF here

literal-expression = "{" [LRI] [s] literal [s annotation] *(s attribute) [s] [close-isolate] "}"
variable-expression = "{" [LRI] [s] variable [s annotation] *(s attribute) [s] [close-isolate] "}"
annotation-expression = "{" [LRI] [s] annotation *(s attribute) [s] [close-isolate] "}"

markup = "{" [LRI] [s] "#" identifier *(s option) *(s attribute) [s] ["/"] [close-isolate] "}"
/ "{" [LRI] [s] "/" identifier *(s option) *(s attribute) [s] [close-isolate] "}"

s = 1*( SP / HTAB / CR / LF [LRI] / %x3000 )
LRI = %x2066
open-isolate = %x2066-2068
close-isolate = %x2069
```

Isolating rather than marking `name` helps ensure
that its directionality does not spill over to adjoining syntax.
@@ -397,7 +414,7 @@ For example, this allows for the proper rendering of the expression
{⁦:⁧אחת⁩:⁧שתיים⁩⁩}
```
where "אחת" is the `namespace` of the `identifier`.
Without `name` isolation, this would render as
Without `name` isolation, this would (misleadingly) render as
```
{⁦:אחת:שתיים⁩}
```
@@ -410,7 +427,7 @@ just as they're not included in the parsed values of quoted literals or quoted p

### Deeper Syntax Changes
We could alter the syntax to make it more "bidi robust",
such as by using strongly directional instead of neutrals.
such as by using strongly directional characters instead of neutrals.

### Forbid RTL characters in `name` and/or `unquoted`
We could alter the syntax to forbid using RTL characters in names and unquoted literals.