[DESIGN] Implement changes to bidi to permit non-strict formation #811
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
|
|
@@ -54,34 +54,42 @@ the plain-text of the message and the Unicode Bidirectional Algorithm (UBA, UAX# | |||||||
| can interact in ways that make the _message_ unintelligible or difficult to parse visually. | ||||||||
|
|
||||||||
| Machines do not have a problem parsing _messages_ that contain RTL characters, | ||||||||
| but users need to be able to discern what a _message_ does, | ||||||||
| what _variant_ will be selected, | ||||||||
| or what a _placeholder_ will evaluate to. | ||||||||
| but users need to be able to discern what a _message_ does. | ||||||||
| For example, users need to be able to match _keys_ in a _variant_ to _selectors_ | ||||||||
| in a `.match` statement. | ||||||||
| Or they want to know how a _pattern_ will be evaluated, | ||||||||
| such as understanding the _options_ and _values_ in a _placeholder_. | ||||||||
|
|
||||||||
| In addition, it is possible to construct messages that use bidi characters to spoof | ||||||||
| users into believing that a _message_ does something different than what it actually does. | ||||||||
|
|
||||||||
| The current syntax does not permit bidi controls in _name_ tokens, | ||||||||
| _unquoted_ literals, | ||||||||
| or in the whitespace portions of a _message_. | ||||||||
| _unquoted literals_, | ||||||||
| or in the non-pattern whitespace portions of a _message_. | ||||||||
|
|
||||||||
| Permitting the **isolate** controls and the standalone strongly-directional markers | ||||||||
| Permitting the Unicode bidi **isolate** characters and the standalone strongly-directional markers | ||||||||
| would enable tools, including translation tools, and users who are writing in RTL languages | ||||||||
| to format a _message_ so that its plain-text representation and its function | ||||||||
| are unambiguous. | ||||||||
|
|
||||||||
| The isolate controls are paired invisible control characters inserted around a portion of a string. | ||||||||
| The start of an isolate sequence is one of: | ||||||||
| The isolates are paired invisible characters inserted around a portion of a string. | ||||||||
| The start of an isolated sequence is one of: | ||||||||
| - U+2066 LEFT-TO-RIGHT ISOLATE (LRI) | ||||||||
| - U+2067 RIGHT-TO-LEFT ISOLATE (RLI) | ||||||||
| - U+2068 FIRST-STRONG ISOLATE (FSI) | ||||||||
|
|
||||||||
| The end of an isolate sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI). | ||||||||
| The end of an isolated sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI). | ||||||||
|
|
||||||||
| The characters inside an isolate sequence have the initial string (paragraph) direction | ||||||||
| corresponding to the starting control (LTR for LRI, RTL for RLI, auto for FSI). | ||||||||
| The isolate sequence is **isolated** from surrounding text. | ||||||||
| This means that the surrounding text treats it as-if the sequence were a single neutral character. | ||||||||
| The characters inside an isolated sequence have the initial string direction | ||||||||
| corresponding to the starting character ( | ||||||||
| left-to-right for `LRI`, | ||||||||
| right-to-left for `RLI`, | ||||||||
| or <a href="https://www.w3.org/TR/i18n-glossary#auto-direction">auto</a> for `FSI`). | ||||||||
| They are called "isolates" because the enclosed text is **isolated** from surrounding text | ||||||||
| while being processed using the Unicode Bidirectional Algorithm (UBA). | ||||||||
| The surrounding text treats the sequence as-if it were a single neutral character, | ||||||||
| while the interior sequence is processed using the base direction specified by the isolate | ||||||||
| starting character. | ||||||||
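As a rough illustration of the isolate description above, here is a minimal Python sketch. The `isolate` helper and its `direction` argument names are hypothetical and are not part of the design; only the four code points and their bidi classes come from Unicode.

```python
import unicodedata

# Isolate characters named in the text above.
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

# Hypothetical mapping from a requested base direction to its opening isolate.
_OPENERS = {"ltr": LRI, "rtl": RLI, "auto": FSI}


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in a paired isolate sequence with the given base direction."""
    return _OPENERS[direction] + text + PDI


# An RTL token wrapped so that surrounding text treats it as a single neutral.
wrapped = isolate("\u0645\u0635\u0631", "rtl")

# The Unicode Character Database gives each isolate its own bidi class.
classes = [unicodedata.bidirectional(ch) for ch in (LRI, RLI, FSI, PDI)]
print(classes)
```

The wrapper always emits a matched opener and `PDI`, which is the well-formed shape that tools and serializers are expected to produce.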
|
|
||||||||
| > [!NOTE] | ||||||||
| > One of the side-effects of using `{`/`}` and `{{`/`}}` to delimit _expressions_ | ||||||||
|
|
@@ -96,11 +104,26 @@ These include: | |||||||
| - U+200F RIGHT-TO-LEFT MARK (RLM) | ||||||||
| - U+061C ARABIC LETTER MARK (ALM) | ||||||||
|
|
||||||||
| These characters are invisible strongly-directional characters used in bidirectional | ||||||||
| These characters are invisible strongly-directional characters. | ||||||||
| They are used in bidirectional | ||||||||
| text to coerce certain directional behavior (usually to mark the end of | ||||||||
| a sequence of characters that would otherwise be ambiguous or interact with | ||||||||
| neutrals or opposite direction runs in an unhelpful way). | ||||||||
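To make "strongly-directional" concrete, the following sketch (an illustration only, with a hypothetical `bidi_classes` helper) looks up the bidi class of each mark using Python's standard `unicodedata` module, and then does the same for an RTL token followed by LRM, similar to the tokens used in the examples in this document.

```python
import unicodedata

LRM, RLM, ALM = "\u200e", "\u200f", "\u061c"


def bidi_classes(text: str) -> list[str]:
    """Return the Unicode bidi class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Each mark carries a strong class even though it has no visible glyph.
mark_classes = bidi_classes(LRM + RLM + ALM)

# An Arabic-script token containing a digit, terminated by LRM: the trailing
# strong L character ends the RTL run before any following neutrals.
token_classes = bidi_classes("\u0645" + "1" + "\u0635\u0631" + LRM)
print(mark_classes, token_classes)
```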
|
|
||||||||
| ### Strictness and Abuse | ||||||||
|
|
||||||||
| We want the syntax to be somewhat permissive, particularly when it comes to paired isolates. | ||||||||
| The isolates and strongly-directional marks are invisible except in certain specialized editing environments. | ||||||||
| While users and tools should be strict about using well-formed isolate sequences, | ||||||||
| we don't want invisible characters or whitespace to generate additional syntax errors except where necessary. | ||||||||
aphillips marked this conversation as resolved.
|
||||||||
| Therefore, it should not be a syntax error if a user, editor, or tool fails to match opening/closing isolates. | ||||||||
|
|
||||||||
| It is possible to generate a "strict" version of the ABNF that is more restrictive about isolate pairing. | ||||||||
| Such an ABNF might be used by message serializers to ensure high-quality message generation. | ||||||||
|
|
||||||||
| Unfortunately, permitting a "relaxed" handling of isolates/marks, when mixed with whitespace, | ||||||||
| could produce the various Trojan Source effects described in [[UTS55]](https://www.unicode.org/reports/tr55/#Usability-bidi). | ||||||||
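Because these characters are invisible in most editors, a reviewer may want a way to see where they occur. The sketch below is one possible approach, not something the design specifies: the `reveal` function and the `<NAME>` notation are assumptions, and it covers only the isolates and marks discussed here rather than every control relevant to UTS #55.

```python
# Abbreviations for the isolates and strongly-directional marks in this document.
BIDI_NAMES = {
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
    "\u200e": "LRM", "\u200f": "RLM", "\u061c": "ALM",
}


def reveal(message: str) -> str:
    """Replace each invisible bidi character with a visible <NAME> tag."""
    return "".join(
        f"<{BIDI_NAMES[ch]}>" if ch in BIDI_NAMES else ch for ch in message
    )


print(reveal("{\u2066$x\u2069} and {$y\u200e }"))
```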
|
|
||||||||
| ## Use-Cases | ||||||||
|
|
||||||||
| _What use-cases do we see? Ideally, quote concrete examples._ | ||||||||
|
|
@@ -135,7 +158,7 @@ You have {$م1صر :م2صر م3صر=م4صر} <- no controls | |||||||
| You have {$م1صر :م2صر م3صر=م4صر} <- LRM after each RTL token | ||||||||
| ``` | ||||||||
|
|
||||||||
| 3. As a developer or translator, I want to make RTL literal or names appear correctly | ||||||||
| 3. As a developer or translator, I want to make unquoted RTL literals or names appear correctly | ||||||||
| in my plain-text editing environment. | ||||||||
| I don't want to have to manage a lot of paired controls, when I can get the right effect using | ||||||||
| strongly directional mark characters (LRM, RLM, ALM) | ||||||||
|
|
@@ -209,6 +232,12 @@ Newlines inside of messages should not harm later syntax. | |||||||
| ن}} 123 456 {{ LRM }} | ||||||||
| ``` | ||||||||
|
|
||||||||
|
|
||||||||
| Naive text editors, when operating in a right-to-left context, | ||||||||
| might display a _message_ with an RTL base direction. | ||||||||
| While the display of the _message_ might be somewhat damaged by this, | ||||||||
| it should still produce results that are as reasonable as possible. | ||||||||
|
|
||||||||
| ## Constraints | ||||||||
|
|
||||||||
| _What prior decisions and existing conditions limit the possible design?_ | ||||||||
|
|
@@ -230,72 +259,90 @@ The workaround in #763 was to permit these characters _before_ or _after_ whites | |||||||
| using the various whitespace productions. | ||||||||
| This works at the cost of allowing spurious markers. | ||||||||
|
|
||||||||
| We want isolate characters to be _outside_ of patterns. | ||||||||
| There is an open question about how best to place them. | ||||||||
| One option would be to place them adjacent to the "pattern quote" character sequences `{{`/`}}`. | ||||||||
| Another option would be to place them _inside_ the pattern quotes, e.g. `{\u2066{`/`}\u2069}`. | ||||||||
|
|
||||||||
| Bidi isolates and marks are invisible characters. | ||||||||
| Whitespace is also invisible. | ||||||||
| Mixing these may be problematic. | ||||||||
| Not allowing these to mix could produce annoying parse errors. | ||||||||
|
|
||||||||
| ## Proposed Design | ||||||||
|
|
||||||||
| _Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._ | ||||||||
|
|
||||||||
| Editing and display of a _message_ SHOULD always use a left-to-right base direction | ||||||||
| The syntax of a _message_ assumes a left-to-right base direction | ||||||||
| both for the complete text of the _message_ as well as for each line (paragraph) | ||||||||
| contained therein. | ||||||||
|
|
||||||||
| We use LTR display because the syntax of a _message_ depends on LTR word tokens, | ||||||||
| contained therein. | ||||||||
| We prefer LTR display because human understanding of a _message_ depends on LTR word tokens, | ||||||||
| as well as token ordering (as in a placeholder or with variant keys). | ||||||||
| Note that LTR display is **_not_** a requirement, because that is beyond the scope of MF2 itself. | ||||||||
| However, tool and editor implementers ought to pay attention to this assumption. | ||||||||
|
|
||||||||
| This is not the disadvantage to right-to-left languages that it might first appear: | ||||||||
| - Bidi inside of _patterns_ works normally | ||||||||
| - _Placeholders_ and _markup_ are isolated (treated as neutrals) so that they appear | ||||||||
| Preferring LTR display is not the disadvantage to right-to-left languages that it might first appear: | ||||||||
| - Bidi inside of _patterns_ works normally (we go to great lengths to make the interior | ||||||||
| of _patterns_ work as plain text) | ||||||||
| - _Placeholders_ and _markup_ can be isolated (treated as neutrals) so that they appear | ||||||||
| in the correct location in an RTL _pattern_ | ||||||||
| - _Expressions_ use isolates and directional marks to display internal tokens in the | ||||||||
| correct order and without spillover effects | ||||||||
| - The syntax uses paired enclosing marks that the Unicode Bidirectional Algorithm pairs | ||||||||
| for shaping purposes and these offer a poor person's form of isolation. | ||||||||
|
||||||||
| for shaping purposes and these offer a poor person's form of isolation. | |
| - The syntax uses enclosing marks (specifically curly brackets) which the Unicode Bidirectional Algorithm | |
| pairs up for shaping purposes, resulting in a weak form of isolation in the syntax itself. |
Done.
eemeli marked this conversation as resolved.
aphillips marked this conversation as resolved.
macchiati marked this conversation as resolved.
I only see one difference?
I should do the TODO, which makes it clearer. I'll do one here in the comment for clarity and then go back and fix the PR.
The current design has expression thusly:

    expression = "{" [LRI] (literal-expression / variable-expression / annotation-expression) [close-isolate] "}"

This alternative would turn that into:

    expression = "{" (literal-expression / variable-expression / annotation-expression) "}"
               / "{" LRI (literal-expression / variable-expression / annotation-expression) close-isolate "}"

In this formulation, you cannot have unpaired opening (or closing) isolates without a syntax error, nor can you have multiples of open or close.
Rinse and repeat for markup, option, attribute, and literals.
Make sense?
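
The difference between the two formulations of `expression` in this comment can be sketched with a pair of regular expressions. This is only an approximation under stated assumptions: `BODY` is a simplified stand-in for the three expression alternatives, `close-isolate` is assumed to be PDI, and the function names are hypothetical.

```python
import re

LRI, PDI = "\u2066", "\u2069"

# Stand-in for (literal-expression / variable-expression / annotation-expression):
# any run of characters that are neither braces nor isolates.
BODY = r"[^{}\u2066-\u2069]+"

# Relaxed form: the opening and closing isolates are independently optional.
RELAXED = re.compile(r"\{" + LRI + "?" + BODY + PDI + r"?\}")

# Strict form: either no isolates at all, or exactly one matched pair.
STRICT = re.compile(r"\{(?:" + BODY + "|" + LRI + BODY + PDI + r")\}")


def is_relaxed(expr: str) -> bool:
    return RELAXED.fullmatch(expr) is not None


def is_strict(expr: str) -> bool:
    return STRICT.fullmatch(expr) is not None


for sample in ["{$x}", "{\u2066$x\u2069}", "{\u2066$x}", "{$x\u2069}"]:
    print(repr(sample), is_relaxed(sample), is_strict(sample))
```

All four samples pass the relaxed check, while only the first two (no isolates, or one matched pair) pass the strict one.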
Fixed the TODO