Commit 6024aeb
macchiati, aphillips, and eemeli authored
Update why_mf_next.md (#607)
* Update why_mf_next.md
  Make some spot fixes to the doc to update for reviewers. Some of the text is outdated, and/or needed some clarifications.
* Update docs/why_mf_next.md
  Co-authored-by: Addison Phillips <[email protected]>
* Update docs/why_mf_next.md
  Co-authored-by: Addison Phillips <[email protected]>
* Update docs/why_mf_next.md
  Co-authored-by: Addison Phillips <[email protected]>
* Update docs/why_mf_next.md
  Co-authored-by: Addison Phillips <[email protected]>
* Update docs/why_mf_next.md
  Co-authored-by: Addison Phillips <[email protected]>
* Update why_mf_next.md
  Added "well" to qualify "Unable to support gender selection"
* Update why_mf_next.md
* Update docs/why_mf_next.md
* More edits for clarity
* Apply suggestions from code review
  Co-authored-by: Eemeli Aro <[email protected]>
* Update docs/why_mf_next.md
  Co-authored-by: Eemeli Aro <[email protected]>
* Address deletion
* Add link to UTW preso
* address comments

---------

Co-authored-by: Addison Phillips <[email protected]>
Co-authored-by: Eemeli Aro <[email protected]>
1 parent 20edaea commit 6024aeb

1 file changed

+58
-40
lines changed

docs/why_mf_next.md

Lines changed: 58 additions & 40 deletions
@@ -1,5 +1,10 @@
 # Why `MessageFormat` needs a successor ([issue #49](https://github.com/unicode-org/message-format-wg/issues/49))
 
+Check out the [YouTube video](https://www.youtube.com/watch?v=-DlS6KNopoU)
+of the Unicode Technical Workshop (UTW)
+presentation about MessageFormat 2.0 which includes a discussion of
+why MessageFormat is important and why MessageFormat 2.0 is needed.
+
 ## Intro
 
 The `MessageFormat` API and syntax have been around for a long time.
@@ -8,8 +13,9 @@ Intro
 
 - `MessageFormat` is the Unicode API for software localization
 - It is a 20-year-old, well-designed, proven solution
-- Its design is optimized for the software development model of 20y ago and its
-shortcomings result in mixed reception and adoption by the industry.
+Its design was optimized for the software development model
+of twenty years ago.
+Implementers, developers, and translators struggle with its shortcomings.
 
 The current wave of software development uses dynamic languages, modern UI
 frameworks and new forms of user interactions (voice, VR, etc.).
@@ -21,7 +27,7 @@ suitable for current generation of software, and adoption by Web Standards.
 Other efforts: [Fluent](https://projectfluent.org/),
 [FBT](https://facebook.github.io/fbt/)
 
-## Core problems with the current `MessageFormat`
+## Core problems with the current `MessageFormat` (aka "MessageFormat 1.0")
 
 1. The design is not modular enough
 - Does not have any “extension points”
@@ -44,20 +50,25 @@ It also means most tools used to process these messages are built rigidly,
 and are unprepared to handle changes
 (think localization tools, linters, friendly UIs, etc.).
 
-The most basic functionality would be adding a new formatter. Meantime ICU
-added other formatters: time intervals, measurement, lists. But MessageFormat
-did not keep up. And adding support for any of these new formats risks to break
-existing tools.
+The most basic functionality would be adding a new formatting function.
+MessageFormat 1.0 only supported a small number of basic formatting functions,
+while over the years ICU added many new capabilities: date and time intervals,
+measurement units, lists, person names, and many more.
+Developers also sometimes want to define their own formatting functions.
+Supporting additional formats risks breaking interoperability or compatibility
+with existing tools.
 
 ### 1.2. Can't deprecate anything, even if now we know better
 
 ICU is old, but also very popular (right now it is the core i18n library
-for all major operating systems, and many products).
+for all major operating systems, browsers, and many other products).
 
-This is how he have both numeric and named parameters, partial strings in
-plural / select (technically concatenation, which is bad i18n), date / time
-patterns (bad i18n, when skeletons are the better way), nesting selectors,
-unfriendly escaping (think doubling the apostrophe `''` ), `#` in plurals.
+As a result of its age and design,
+MessageFormat 1.0 has both numeric (positional) and named parameters.
+It still provides date and time patterns (picture strings), when skeletons or option
+bags provide far superior results.
+It allows selectors (such as plural and select) around only part of the overall message,
+which is a form of non-internationalized string concatenation.
 
 Most of it can't be “blamed” directly on a bad decision; it is just time
 teaching us what works (for instance, skeletons did not exist when the
@@ -68,34 +79,45 @@ But the stability requirements prevent any major cleanup.
 ### 2. Some existing problems
 
 - ICU added new formatters, but MessageFormat does not support them
-- Combined selectors (select + plural) results in unreadable and error
-prone nesting
-- Select and plurals inside the message are difficult to translate because of
-grammatical agreement requires words outside select / plural to change.
+- Messages with selectors (`select` and/or `plural`) are difficult to create
+and edit because of the complex nesting requirements of the syntax.
+- `select` and `plural` placeholders inside a message are difficult to translate as
+grammatical agreement may require words _outside_ the `select`/`plural` to change.
 See https://en.wikipedia.org/wiki/Agreement_(linguistics)
-- Patterns in the date / time / number placeholders are bad i18n, should use skeletons
-- No official support for gender. It can be done with `select`, but it
-is not the same thing (same as the difference between an `enum` and integer/strings). Developers can use masculine/feminine, masc/fem, male/female, etc.
-- Formatting for “parameters” known at compile time
+- Placeholders for `date`, `time`, and `number` can include picture strings
+that require translators to alter the "code" portion of a message
+and to understand arcane software-developer oriented syntaxes.
+While more-modern solutions such as skeletons have been added,
+there are no guardrails to keep people from using these poorly
+internationalized features.
+- Unable to support grammatical or personal gender selection well.
+Existing selectors such as `select` cannot account
+for the grammatical needs of different gender categories across languages.
+Tools have no way to know what modifications are needed
+and developers have to understand the needs of current and future languages to succeed.
 - Escaping with apostrophe is error-prone. There is no reliable way to tell if
 it has to be doubled or not.
-- The # is used in plural format instead of {...}, but does not work for nesting unless the plural is the innermost selector. But named placeholders don't work
+- The `#` is used in plural format instead of `{...}`, but does not work for nesting unless the plural is the innermost selector. But named placeholders don't work
 properly for plurals with offset. So there are 2 ways to do the same thing that work in 98% of cases, but in special situations only one of the ways works.
 - Does not support inflections, and it would be hard to add without breaking existing tools.
 
 ### 3. Hard to map to the existing localization core structures
 
-The format is not well supported by any major localization system. \
+While MessageFormat 1.0 and its syntax are widely supported by runtime environments,
+the same cannot be said for localization tooling.
 The root cause of that is not that it is difficult to parse.
 Because it is not. And ICU4J has a public API for parsing.
 
 Most translation tools take a string (with placeholders) in a source language
-and gives back a translated string, usually with the same placeholders
+and give back a translated string, usually with the same placeholders
 (with some degree of flexibility).
 
-It makes it very difficult to translate things like plurals, where the input
-has (for example) 2 “message variants” (English, 1 / many, singular / plural),
-and return 4 message variants for Russian, for example.
+To get the right results, translation software needs to understand the message syntax.
+For example, it needs to adjust the number of translated "patterns" to match the
+grammatical needs of the target language.
+Where the English input might have only two patterns (singular and plural),
+the Arabic translator needs to supply six message variants,
+and the Japanese translator only one.
 
 This is not a superficial problem. It affects most steps in the normal
 localization flow:
@@ -108,23 +130,19 @@ localization flow:
 
 ### 4. Designed to be API only, plain text, UI, “imperative style”
 
-The main (only?) use case for `MessageFormat` is: load the string from resources,
-replace placeholders, and return the string result with placeholders replaced. \
+The typical use case for `MessageFormat` is: load the string from resources,
+replace placeholders, and return the string result with placeholders replaced.
 An i18n-aware `printf`, basically.
 
-It does not play well with binding, formatting tags (think `html`),
+It does not account for formatting tags (such as HTML),
 or “document-like” content (for example templating systems like
 [freemarker](https://freemarker.apache.org/),
 [mustache](https://mustache.github.io/), even JSP, PHP, etc.)
 
-Because it is API only it has no standard way to store the stings in a
-serialized format and to carry info or directives for translators or
-localization tools. \
-So there is no way for a message to reference another message, or to fallback
-to a different locale. That is all left to the "host resource manager"
-(whatever that is for the given tech stack)
-
-There is also no metadata: comments, length limits, example, links,
+The format defined by the API provides no standard way to carry the necessary
+information or directives for translators or localization tools,
+so implementations have had to develop their own, with no interoperability.
+There is also no metadata “packaging”: comments, length limits, example, links,
 protecting non-translatable sections of text, etc.
 
 But this is also an advantage.
@@ -136,9 +154,9 @@ Applications don't need to migrate all the strings to a new format and resource
 resolution only to support some more advanced features in a few messages.
 
 And since the string loading is left to the underlying tech stack it means that
-the locale resolution and fallback is consistent with everything else. \
-For example in Android there is locale based selection (with fallback) for
-styles, images, sounds, any kind of assets. \
+the locale resolution and fallback is consistent with everything else.
+For example in Android there is locale-based selection (with fallback) for
+styles, images, sounds, any kind of assets.
 So there is no risk that the string fallback is different than the sound
 fallback, for example.
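
As a rough illustration of the point in section 3 about the number of patterns differing between languages, the sketch below shows how a localization tool might expand a source message's plural variants into the set a target language needs. The `PLURAL_CATEGORIES` table and the `expand_variants` helper are hypothetical names invented for this example and are not part of ICU or MessageFormat. The category lists are the CLDR cardinal plural categories for English, Arabic, and Japanese, hard-coded here rather than read from CLDR data.

```python
# Hypothetical lookup table: CLDR cardinal plural categories per locale,
# hard-coded for illustration (a real tool would read them from CLDR data).
PLURAL_CATEGORIES = {
    "en": ["one", "other"],
    "ar": ["zero", "one", "two", "few", "many", "other"],
    "ja": ["other"],
}


def expand_variants(source_variants, target_locale):
    """Build the set of patterns a translator must fill in for target_locale.

    Each target category is pre-filled with the source pattern of the same
    name when one exists, otherwise with the source "other" pattern.
    """
    fallback = source_variants["other"]
    return {
        category: source_variants.get(category, fallback)
        for category in PLURAL_CATEGORIES[target_locale]
    }


# Source message variants as an English author would write them.
english = {"one": "You have # message", "other": "You have # messages"}

for locale in ("en", "ar", "ja"):
    variants = expand_variants(english, locale)
    print(locale, len(variants), sorted(variants))
```

The loop reports two patterns for English, six for Arabic, and one for Japanese. A tool that assumes the translated string has the same shape as the source string has no place to put the extra Arabic patterns, which is the mismatch section 3 describes.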
 