[DESIGN] Bidi usability #754
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,236 @@ | ||||||||||
| # Bidi Usability | ||||||||||
|
|
||||||||||
| Status: **Proposed** | ||||||||||
|
|
||||||||||
| <details> | ||||||||||
| <summary>Metadata</summary> | ||||||||||
| <dl> | ||||||||||
| <dt>Contributors</dt> | ||||||||||
| <dd>@aphillips</dd> | ||||||||||
| <dt>First proposed</dt> | ||||||||||
| <dd>2024-03-27</dd> | ||||||||||
| <dt>Pull Requests</dt> | ||||||||||
| <dd>#000</dd> | ||||||||||
| </dl> | ||||||||||
| </details> | ||||||||||
|
|
||||||||||
| ## Objective | ||||||||||
|
|
||||||||||
| _What is this proposal trying to achieve?_ | ||||||||||
|
|
||||||||||
| The MessageFormat v2 syntax uses whitespace as a required delimiter | ||||||||||
| and also permits whitespace to be used to make _messages_ easier to read. | ||||||||||
| In addition, a _message_ can include bidirectional text in identifiers and literal values. | ||||||||||
|
|
||||||||||
| MessageFormat's syntax also uses a variety of "sigils" and markers to form the structure of a _message_. | ||||||||||
| These sigils are ASCII punctuation characters that have neutral directionality. | ||||||||||
| This means that the inclusion of right-to-left ("RTL") identifiers or literals in a _message_ | ||||||||||
| can result in the syntax looking "scrambled" or, in extreme cases, appearing to have a different meaning | ||||||||||
| due to [spillover](https://www.w3.org/TR/i18n-glossary/#dfn-spillover-effects). | ||||||||||
|
|
||||||||||
| To prevent spillover effects and to allow users (particularly RTL language users) | ||||||||||
| to author _messages_ in a straightforward way, we want to allow the syntax to include appropriate | ||||||||||
| bidirectional support and to recommend to tool and translation technology implementers | ||||||||||
| mechanisms to make _messages_ that include RTL characters easy to work with | ||||||||||
| without introducing spoofing or "Trojan Source" attack vectors. | ||||||||||
|
|
||||||||||
| ## Background | ||||||||||
|
|
||||||||||
| _What context is helpful to understand this proposal?_ | ||||||||||
|
|
||||||||||
| If you are unfamiliar with bidirectional or right-to-left text, there is a basic introduction | ||||||||||
| [here](https://www.w3.org/International/articles/inline-bidi-markup/uba-basics). | ||||||||||
|
|
||||||||||
| MessageFormat _message_ strings are created and edited primarily by humans. | ||||||||||
| The original _message_ is often written by a software developer or user experience designer. | ||||||||||
| Translators need to work with the target-language versions of each _message_. | ||||||||||
| Like many templating or domain-specific languages, MFv2 uses neutrally-directional symbols | ||||||||||
| to form portions of the syntax. | ||||||||||
| When the _message_ contains right-to-left (RTL) translations or uses values that are RTL, | ||||||||||
| the plain text of the _message_ and the Unicode Bidirectional Algorithm (UBA, UAX#9) | ||||||||||
| interact in ways that make the _message_ unintelligible or difficult to parse visually. | ||||||||||
|
|
||||||||||
| Machines do not have a problem parsing _messages_ that contain RTL characters, | ||||||||||
| but users need to be able to discern what a _message_ does, | ||||||||||
| what _variant_ will be selected, | ||||||||||
| or what a _placeholder_ will evaluate to. | ||||||||||
|
||||||||||
|
|
||||||||||
| In addition, it is possible to construct messages that use bidi characters to spoof | ||||||||||
| users into believing that a _message_ does something different than what it actually does. | ||||||||||
|
|
||||||||||
| The current syntax does not permit bidi controls in _name_ tokens, | ||||||||||
| _unquoted_ literal values, | ||||||||||
| or in the whitespace portions of a _message_. | ||||||||||
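A minimal Python sketch of how a tool might locate bidi controls in a _message_ string, for example to flag them where the syntax does not permit them. The set below is the twelve characters with the Unicode `Bidi_Control` property; `find_bidi_controls` is a hypothetical helper name, not part of any MessageFormat implementation.

```python
# Characters with the Unicode Bidi_Control property.
BIDI_CONTROLS = frozenset(
    "\u061c\u200e\u200f"              # ALM, LRM, RLM
    "\u202a\u202b\u202c\u202d\u202e"  # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"        # LRI, RLI, FSI, PDI
)


def find_bidi_controls(text):
    """Return (index, 'U+XXXX') pairs for every bidi control in text.

    Hypothetical helper: a real parser would check positions against
    the grammar rather than scanning the raw string.
    """
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]
```

Because these characters are invisible in most editors, reporting their positions and code points gives a reviewer something concrete to inspect.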
|
|
||||||||||
| Permitting the **isolate** controls and the standalone strongly-directional markers | ||||||||||
| would enable tools, including translation tools, and users who speak RTL languages | ||||||||||
|
||||||||||
| to format a _message_ so that its plain-text representation and its function | ||||||||||
|
||||||||||
| are unambiguous. | ||||||||||
|
|
||||||||||
| The isolate controls are paired invisible control characters inserted around a portion of a string. | ||||||||||
| The start of an isolate sequence is one of: | ||||||||||
| - U+2066 LEFT-TO-RIGHT ISOLATE (LRI) | ||||||||||
| - U+2067 RIGHT-TO-LEFT ISOLATE (RLI) | ||||||||||
|
||||||||||
| - U+2068 FIRST-STRONG ISOLATE (FSI) | ||||||||||
|
|
||||||||||
| The end of an isolate sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI). | ||||||||||
|
|
||||||||||
| The characters inside an isolate sequence have the initial string (paragraph) direction | ||||||||||
| corresponding to the starting control (LTR for LRI, RTL for RLI, auto for FSI). | ||||||||||
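As a rough illustration of the isolate sequence described above, the following Python sketch wraps a string in a starting control and a closing PDI. Only the code points come from the text; the `isolate` function and its `direction` argument are hypothetical names chosen for this example.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST-STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Map a requested base direction to the control that starts the sequence.
_STARTING_CONTROL = {"ltr": LRI, "rtl": RLI, "auto": FSI}


def isolate(text, direction="auto"):
    """Wrap text in an isolate sequence (hypothetical helper).

    direction is one of "ltr", "rtl", or "auto"; anything else
    raises KeyError.
    """
    return _STARTING_CONTROL[direction] + text + PDI


# An RTL literal wrapped so that it cannot interact with its surroundings.
wrapped = isolate("مصر", "rtl")
```

The wrapped string contains two invisible characters in addition to the original text, which is why tools (rather than people typing by hand) are the likely producers of such sequences.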
|
Comment on lines +79 to +80

Do all editors reset the paragraph direction after a newline? For example, if there's a newline between an LRI and an FSI, how is the paragraph direction of the second line determined?

The normal application of the bidi algorithm requires a reset on each paragraph, wherein a newline breaks paragraphs.

> "The algorithm reorders text only within a paragraph; characters in one paragraph have no effect on characters in a different paragraph. Paragraphs are divided by the Paragraph Separator or appropriate Newline Function (for guidelines on the handling of CR, LF, and CRLF, see Section 4.4, Directionality, and Section 5.8, Newline Guidelines of [Unicode]). Paragraphs may also be determined by higher-level protocols: for example, the text in two different cells of a table will be in different paragraphs."

@macchiati is correct. That's why it's called "paragraph direction". Note that newlines don't help us that much: they are optional in our syntax (outside literals) and technically normalize to space (or nothing). That is, the newline doesn't help us if we end up writing the message as a single-line.

Ok, so given that we allow for newlines within "code" and, specifically, expressions, I think we need to account for that so that we can keep the direction of the code as left-to-right, even when the first strongly directional character on the line is RTL.

As I understand it, not even an LRI/FSI pair inside the braces is always enough to keep the

    a = 'אחד'
    b = 'שתיים'
    s = a + '{\u2066\n$' + b + '\u2069}'

That's correct. Getting the sigils to stay on the left side needs a base direction of LTR. An LRM doesn't help in your example either (except to prevent spillover with the following annotation if there were any).

My proposal is not 100% bulletproof (and requires some action on the part of tools or users). A bulletproof design would require more isolates and would probably be limited to using LRI/PDI pairs. It would be difficult to work with, given that there would be a lot of invisible control characters inside subcomponents of an expression, e.g.: |
||||||||||
| The isolate sequence is **isolated** from surrounding text. | ||||||||||
| This means that the surrounding text treats it as if the sequence were a single neutral character. | ||||||||||
|
|
||||||||||
| > [!NOTE] | ||||||||||
| > One of the side-effects of using `{`/`}` and `{{`/`}}` to delimit _expressions_ | ||||||||||
| > and _patterns_ is that these paired enclosing punctuations provide a measure of | ||||||||||
| > isolation in UBA. | ||||||||||
| > This is an additional reason not to change over to quote marks (which are not enclosing) | ||||||||||
| > around patterns. | ||||||||||
|
|
||||||||||
| ## Use-Cases | ||||||||||
|
|
||||||||||
| _What use-cases do we see? Ideally, quote concrete examples._ | ||||||||||
|
|
||||||||||
| Presentation of keys can change if values are not isolated: | ||||||||||
| ``` | ||||||||||
| .match {$م2صر :string}{$num :integer} | ||||||||||
| م2صر 0 {{The {$م2صر} is actually the first key}} | ||||||||||
| م2صر * {{This one appears okay}} | ||||||||||
|
||||||||||
| ``` | ||||||||||
|
|
||||||||||
| Presentation in an _expression_ can change if values are not isolated or if LTR order is not restored: | ||||||||||
| > In the following example, we use the same string with a number inserted into the middle of | ||||||||||
| > the string to make the bidi effects visible. | ||||||||||
| > The numbers correspond to: | ||||||||||
| > 1. operand | ||||||||||
| > 2. function | ||||||||||
| > 3. option name | ||||||||||
| > 4. option value | ||||||||||
|
|
||||||||||
| ``` | ||||||||||
| You have {$م1صر :م2صر م3صر=م4صر} <- no controls | ||||||||||
| You have {$م1صر :م2صر م3صر=م4صر} <- LRM after each RTL token | ||||||||||
|
||||||||||
| ``` | ||||||||||
|
|
||||||||||
| ## Requirements | ||||||||||
|
|
||||||||||
| _What properties does the solution have to manifest to enable the use-cases above?_ | ||||||||||
|
|
||||||||||
| To prevent RTL _literals_ from having spillover effects with surrounding syntax, | ||||||||||
| it should be possible to bidi isolate a _quoted_ or _unquoted_ _literal_. | ||||||||||
|
|
||||||||||
| >``` | ||||||||||
| > .local $title = {|البحرين مصر الكويت!|} | ||||||||||
|
||||||||||
| > .local $egypt = {مصر :string} | ||||||||||
| >``` | ||||||||||
|
|
||||||||||
| To prevent _patterns_ from having spillover effects with other parts of a _message_, | ||||||||||
| particularly with _keys_ in a _variant_, | ||||||||||
| it should be possible to bidi isolate a _quoted-pattern_. | ||||||||||
|
||||||||||
|
|
||||||||||
| >``` | ||||||||||
| > .match {$foo :string} | ||||||||||
| > isolate {{البحرين مصر الكويت!}} | ||||||||||
|
||||||||||
| >``` | ||||||||||
|
|
||||||||||
| To prevent _markup_, _placeholders_, or _expressions_ from having spillover effects | ||||||||||
|
The spillover can also occur in declarations and the |
||||||||||
| with other parts of a _message_, | ||||||||||
| it should be possible to bidi isolate the contents of a _markup_ or an _expression_. | ||||||||||
|
|
||||||||||
| >``` | ||||||||||
| > You can find it in {$مصر}. | ||||||||||
| >``` | ||||||||||
|
|
||||||||||
| To prevent RTL identifiers from having spillover effects with other parts of an _expression_, | ||||||||||
| it should be possible to include "local effect" bidi controls following an _identifier_, | ||||||||||
|
Can't we omit "identifier" since an identifier ends with a name?

Identifiers end with names, but also contain names in the namespace position. I wanted to be clear that we meant the end of an identifier in this case. |
||||||||||
| _name_, | ||||||||||
| _option value_, | ||||||||||
| or _literal_. | ||||||||||
| These controls must not be included in the _identifier_, _name_, _option value_, or _literal_, | ||||||||||
| that is, it must be possible to distinguish these characters from the value in question. | ||||||||||
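A minimal Python sketch of this requirement, under stated assumptions: the "local effect" controls are taken to be LRM (U+200E) and RLM (U+200F), the two marks raised in the review discussion, and `split_trailing_marks` is a hypothetical helper rather than part of any MessageFormat parser. It shows trailing marks being separated from a token so that they do not end up in the parsed value.

```python
# Assumption: the "local effect" controls are LRM and RLM only.
LOCAL_EFFECT_MARKS = "\u200e\u200f"  # LRM, RLM


def split_trailing_marks(token):
    """Split a token into (value, trailing_marks).

    Hypothetical helper: a real implementation would do this in the
    grammar/tokenizer instead of post-processing a string.
    """
    value = token.rstrip(LOCAL_EFFECT_MARKS)
    return value, token[len(value):]


# A name followed by an LRM: the mark is kept apart from the name.
name, marks = split_trailing_marks("مصر\u200e")
```

Keeping the marks as a separate component means two messages that differ only in these display aids would yield the same names and literals.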
|
||||||||||
| that is, it must be possible to distinguish these characters from the value in question. | |
| that is, it must be possible to distinguish these characters from the term in question. |
"Value" is very confusing since there are also "option values". I suggest "term" since _names_, _option values_, etc. are all terms in the grammar.
Or, it might be more precise to say something like "the characters should not appear in parsed output" (i.e. the relevant nodes in the data model).
Should we also allow for an LRI/FSI pair immediately inside expressions and markup, or is there a reason not to do so?
We could do that also. It doesn't solve the problem of expression/markup internal bidi, though.
I'm mostly here thinking of content like:

    a = 'אחד'
    b = 'שתיים'
    s = a + '{$' + b + '}'

where we have an RTL variable name inside a placeholder in an RTL pattern.
How, except with an LRI/FSI pair inside the braces, can we get that to render so that the $ is to the left of the name?
See #discussion_r1542105763
For those implementations, RLM/LRM are the best one can do.
| > productions because characters on the **_inside_** of these are part of the normal text. | |
| > productions because characters on the **_inside_** of these are uninterpreted. |
I disagree with this change. Perhaps omit "normal" or perhaps:
| > productions because characters on the **_inside_** of these are part of the normal text. | |
| > productions because characters on the **_inside_** of these are not part of the MF2 syntax. |
Why LRM/RLM rather than isolates?
Why allow for RLM?
- Some implementations don't handle bidi isolates well yet.
- The LRM and RLM are not stateful, and may be preferred in some circumstances.
We might allow paired isolates inside expressions. It might not be an either/or. Allowing LRM makes it easy to clean up the contents of an expression:
.input {$م1صر :م2صر} <- no bidi controls
.input {$م1صر :م2صر} <- one LRM right after the id with 1 in it
I included RLM (and probably should have included ALM U+061C) so that an RTL literal or name that ends with a neutral can display correctly:
{م123+ :foo} <- with no RLM
{م123+ :foo} <- with RLM after +
Given that we allow a number of neutral direction characters also in name-start, doesn't the same apply to the beginning of the name as well?
From an automation PoV, using FSI/PDI to wrap names seems like it would "just work", whereas LRM/RLM/ALM would require inspecting the contents of the string to figure out what might be needed.
If we are strictly talking about machines doing the "bidi annotation", tightly wrapping with isolates will generally work.
FSI is not always going to get the right results, as not all tokens have a strongly directional character of the correct direction nearest the front. There is an element of judgement (machines generally don't have enough information to decide this, although sometimes they do).
But going back to my initial statement: humans write these strings and create translations of these strings. Sometimes the easiest way for them to make the message look correct is to add a strongly directional mark vs. wrapping. (Note that ICU produces marks on some number and date formats to coerce proper display).
You are correct that the design document should call out these use cases separately so that the reader (and ultimately the WG) can weigh supporting these mechanisms appropriately.
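To illustrate the point about FSI above, here is a simplified Python sketch of first-strong detection using the standard `unicodedata` module. `first_strong_direction` is a hypothetical helper; unlike the full UBA rules, it does not skip over nested isolate sequences while scanning, so treat it as an approximation.

```python
import unicodedata


def first_strong_direction(text):
    """Return "ltr", "rtl", or None based on the first strong character.

    Simplified: the UBA skips characters inside nested isolates when
    looking for the first strong character; this sketch does not.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None  # no strong character found (e.g. digits and punctuation)


# A mostly-Arabic token that starts with a Latin letter resolves as LTR,
# which may not be what the author of the token intended.
direction = first_strong_direction("v2مصر")
```

This is the kind of case where a person or tool with more context might choose RLI or a directional mark instead of relying on FSI.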
> Some implementations don't handle bidi isolates well yet.

Is there a list available anywhere of software that does not yet support bidi isolates?
| - name (note that this includes _identifiers_ as well as names of | |
| _functions_, _variables_, and _unquoted_ literals | |
| - name (note that this includes _unquoted_ literals, _identifiers_, and _variables_; | |
| and that _identifiers_ include the names of _functions_.) |
(It's a bit confusing to say that an unquoted literal has a name.)
Perhaps, but that's how unquoted is defined:
`unquoted = name / number-literal`