`proposals/002-text-utf-default.md` (81 additions, 7 deletions)
no longer know beforehand where the *n*-th code point starts. One must *parse* all of them from the very beginning to learn which ones are 2 bytes long and which are 4 bytes long.

But once we abandon the requirement of constant indexing, an even better option arises. Let's encode the first 128 characters as 1 byte, some others as 2 bytes, and the rest as 3 or 4 bytes. This is UTF-8. The killer feature of this encoding is that it's fully backwards compatible with ASCII. This meant that all existing ASCII documents were automatically valid UTF-8 documents as well, and that 50-year-old executables could often parse UTF-8 data without
If in the future (see an upcoming "Unifying vector-like types" proposal) `ByteString` is backed by unpinned memory, we'd be able to eliminate copying entirely.
`Text` is also often used in contexts that involve mostly ASCII characters. This often prompts developers to use `ByteString` instead of `Text` to save 2x space, in the (false) hope that their data will never happen to contain anything non-ASCII. Backing `Text` by UTF-8 removes this source of inefficiency, reduces memory consumption by up to 2x, and promotes convergence on using `Text` for all stringy things (as opposed to binary data).

Modern computer science research is focused on developing faster algorithms for UTF-8 data, e. g., an ultra-fast JSON decoder [simdjson](https://github.com/simdjson/simdjson). There is much less work on (and demand for) UTF-16 algorithms. Switching `text` to UTF-8 will open a way for us to accommodate and benefit from future developments in rapid text processing.
The importance of the UTF-16 to UTF-8 transition was recognised long ago, and at least two attempts have been made: [in 2011](https://github.com/jaspervdj/text/tree/utf8) and five years later [in 2016](https://github.com/text-utf8/text-utf8). Unfortunately, they did not get merged into the main `text` package. Today, five more years later, it seems suitable
to make another attempt.

- Ensure stakeholders (e.g. GHC, Cabal, Stack, boot libs) have ample time to migrate and address any bugs.
- Performance satisfies the targets listed in the "Performance impact" section below.
# People
- Performers:
  - Leader: Andrew Lelechenko (Bodigrim)
  - Support: Emily Pillmore (emilypi)
possible. This candidate should be shared publicly and loudly.

- TBD: There is a straightforward implementation, but this one is left up to Andrew for comment.
**Performance impact**
A common misunderstanding is that switching to UTF-8 makes everything half the size and twice as fast. That's not quite so.
While UTF-8-encoded English text is half the size of its UTF-16 counterpart, this is not exactly true even for other Latin-based languages, which frequently use vowels with diacritics. For non-Latin scripts (Russian, Hebrew, Greek) the difference between UTF-8 and UTF-16 is almost negligible: one saves on spaces and punctuation, but letters still take two bytes. On the bright side, programs rarely transfer sheer walls of text, and for a typical markup language (JSON, HTML, XML), even if the payload is non-ASCII, savings from UTF-8 easily reach \~30%.
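As a rough illustration of these size differences, here is a sketch using the current `text` API (the sample strings are illustrative; exact figures depend on content):

```haskell
-- Compare UTF-8 vs UTF-16 encoded sizes for ASCII and Cyrillic strings.
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Byte lengths of a string under UTF-8 and UTF-16LE respectively.
sizes :: String -> (Int, Int)
sizes s = (BS.length (TE.encodeUtf8 t), BS.length (TE.encodeUtf16LE t))
  where t = T.pack s

main :: IO ()
main = do
  print (sizes "hello world") -- ASCII: UTF-8 is half the size: (11,22)
  print (sizes "привет мир")  -- Cyrillic: nearly equal: (19,20)
```

Only the spaces shrink in the Cyrillic case; every letter still takes two bytes in both encodings.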
As a Haskell value, `Text` involves a significant constant overhead: there is a constructor tag, then offset and length, plus the bytearray header and length. Altogether 5 machine words = 40 bytes. So for short texts, even if they are ASCII-only, the difference in memory allocations is not very pronounced.
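Back-of-envelope, on a 64-bit machine (a sketch using the figures from the paragraph above, not a measurement):

```haskell
-- Approximate heap footprint of a Text value on a 64-bit machine:
-- constructor tag, offset, length, plus the bytearray header and length.
wordSize :: Int
wordSize = 8

overheadBytes :: Int
overheadBytes = 5 * wordSize  -- 40 bytes of constant overhead

-- Approximate total heap bytes for a Text holding n payload bytes.
approxHeapBytes :: Int -> Int
approxHeapBytes n = overheadBytes + n

main :: IO ()
main = print (approxHeapBytes 5)  -- a 5-byte ASCII Text still costs ~45 bytes
```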
Further, traversing UTF-8 text is not necessarily faster than UTF-16. Both are variable-length encodings, so indexing a certain element requires parsing everything in front of it. But in UTF-16 there are only two options: a code point takes either 2 or 4 bytes, and the vast majority of frequently used characters are 2 bytes long. So traversing UTF-16 keeps branch prediction happy. With UTF-8 we have all four options: a code point can take from 1 to 4 bytes, and most non-English texts constantly alternate between 1-byte (e. g., spaces) and 2-byte characters. Having more branches and, more importantly, bad branch prediction is a serious penalty. This is to a certain degree mitigated by better cache locality.
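The branching involved can be sketched as follows (an illustrative helper, not `text`'s actual decoder):

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- How many bytes a UTF-8 sequence occupies, judging by its leading byte.
-- A UTF-16 traversal needs only two such branches; UTF-8 needs four.
utf8SeqLength :: Word8 -> Int
utf8SeqLength w
  | w .&. 0x80 == 0x00 = 1  -- 0xxxxxxx: ASCII
  | w .&. 0xE0 == 0xC0 = 2  -- 110xxxxx
  | w .&. 0xF0 == 0xE0 = 3  -- 1110xxxx
  | otherwise          = 4  -- 11110xxx (assuming valid UTF-8 input)
```

Text alternating between 1-byte and 2-byte sequences takes a different branch here on almost every code point, which is exactly what frustrates the branch predictor.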
Existing `text` benchmarks arguably favour the UTF-16 encoding: most of them are huge walls of Russian, Greek, Korean, etc. text without any markup. So encoding them in UTF-8 does not save any space, but we have to pay extra for the more elaborate encoding. Our goal here is nevertheless to stay roughly on par with the existing implementation.
Benchmarks for `decodeUtf8` / `encodeUtf8` should improve significantly by virtue of avoiding conversion between UTF-8 and UTF-16. Fast validation of UTF-8 is not a trivial task, but we intend to employ [`simdjson::validate_utf8`](https://arxiv.org/pdf/2010.03090.pdf) for it.
Another important aspect of `text` performance is fusion. We are finalising an `inspection-testing`-based [test suite](https://github.com/haskell/text/pull/337) to check that pipelines which used to fuse before are fusing after the UTF-8 transition as well. Fusion is an incredibly fragile matter: for example, of 100 tests which fuse in GHC 8.10.4, 40 do not fuse in GHC 9.0.1, 30 do not fuse in GHC 8.4.4, etc. In such an environment we cannot bet on retaining all fusion capabilities, but we aim to thoroughly investigate and explain all regressions.
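For reference, a typical pipeline of the kind such a test suite checks (a sketch; whether it actually fuses depends on the GHC version, as noted above):

```haskell
import Data.Char (isAlpha, toUpper)
import qualified Data.Text as T

-- When fusion fires, filter and map are combined into a single pass and
-- no intermediate Text is allocated between the two stages.
shout :: T.Text -> T.Text
shout = T.map toUpper . T.filter isAlpha
```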
We expect that switching to UTF-8 will be beneficial for clients of `text`, both libraries and applications. They'll be able to save memory on storage, save time on encoding/decoding inputs and outputs, and use state-of-the-art text algorithms developed for UTF-8. Parsers often benefit from UTF-8 encoding, because if a grammar does not have specific rules for non-ASCII characters (which is most often the case), the parser can operate on a `ByteArray` without bothering about multibyte encodings at all.
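For example (a hypothetical sketch): because no byte of a multibyte UTF-8 sequence falls in the ASCII range, a parser whose delimiters are ASCII can split raw UTF-8 bytes directly:

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC

-- Split comma-separated fields without decoding: a ',' byte (0x2C) can
-- never occur inside a multibyte UTF-8 sequence, so splitting on raw
-- bytes is safe even if the fields themselves contain non-ASCII text.
fields :: BS.ByteString -> [BS.ByteString]
fields = BC.split ','
```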
We will seek clients' feedback as early as possible, and will act on it if it arrives before the end of the project. However, since our clients are external actors, often unpaid volunteers, we cannot expect them to provide feedback by a given date. Thus, to keep the targets of this project time-bound, we cannot include a goal of waiting for the approval of an indefinite number of parties for an indefinitely long time. Such a goal or sentiment, in our opinion, contributed significantly to the failure of the two previous attempts.
To sum up:

* `decodeUtf8` and `encodeUtf8` become at least 2x faster.
* Geometric mean of existing benchmarks (which favor UTF-16) decreases.
* Fusion (as per our test suite) does not regress beyond at most several cases.
**Stakeholders:**
- Library authors will need to be made aware of changes and adjust accordingly. HF will provide a git reference to a complete MVP as