Commit 7b79d75

Update Motivation and state Performance Impact
1 parent cd12a3a commit 7b79d75

File tree

1 file changed: +81 −7 lines


proposals/002-text-utf-default.md

Lines changed: 81 additions & 7 deletions
@@ -57,7 +57,7 @@ no longer know beforehand, where *n*-th code point starts. One must *parse* all
 from the very beginning to learn which ones are 2-byte long and which are 4-byte long.
 
 But once we abandon requirement of constant indexing, even better option arises. Let's
-encode first 128 characters as 1 byte, some others as 2 bytes, and the rest as 4 bytes.
+encode first 128 characters as 1 byte, some others as 2 bytes, and the rest as 3 or 4 bytes.
 This is UTF-8. The killer feature of this encoding is that it's fully backwards compatible
 with ASCII. This meant that all existing ASCII documents were automatically valid UTF-8
 documents as well, and that 50-years-old executables could often parse UTF-8 data without
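The 1-to-4-byte scheme described in this hunk can be sketched in a few lines of Haskell. `utf8Length` is a hypothetical helper written for illustration, not part of the `text` API:

```haskell
import Data.Char (ord)

-- How many bytes a code point occupies in UTF-8 (illustrative only).
utf8Length :: Char -> Int
utf8Length c
  | o < 0x80    = 1  -- ASCII: the very same single byte as in ASCII itself
  | o < 0x800   = 2
  | o < 0x10000 = 3
  | otherwise   = 4
  where o = ord c

main :: IO ()
main = print (map utf8Length "aé€😀")  -- [1,2,3,4]
```

The first guard is the backwards-compatibility property the text above relies on: code points below 128 are encoded exactly as in ASCII.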
@@ -90,11 +90,19 @@ If in future (see an upcoming "Unifying vector-like types" proposal) `ByteString`
 is backed by unpinned memory, we'd be able to eliminate copying entirely.
 
 `Text` is also often used in contexts, which involve mostly ASCII characters.
-For such applications storing data in UTF-8 means using up to 2x less space,
-which could be important to reduce memory pressure.
+This often prompts developers to use `ByteString` instead of `Text` to save 2x space,
+in the (false) hope that their data will never happen to contain anything non-ASCII.
+Backing `Text` by UTF-8 removes this source of inefficiency, reduces memory consumption
+by up to 2x, and promotes convergence on `Text` for all stringy things
+(as opposed to binary data).
+
+Modern computer science research is focused on developing faster algorithms
+for UTF-8 data, e.g., an ultra-fast JSON decoder [simdjson](https://github.com/simdjson/simdjson). There is much less work on (and demand for) UTF-16 algorithms.
+Switching `text` to UTF-8 will open a way for us to accommodate and benefit
+from future developments in rapid text processing.
 
 The importance of UTF-16 to UTF-8 transition was recognised long ago, and at least
-two attempts has been made:
+two attempts have been made:
 [in 2011](https://github.com/jaspervdj/text/tree/utf8) and five years later
 [in 2016](https://github.com/text-utf8/text-utf8). Unfortunately, they did not get
 merged into main `text` package. Today, five more years later it seems suitable
@@ -108,14 +116,13 @@ to make another attempt.
 
 - Ensure stakeholders (e.g. GHC, Cabal, Stack, boot libs) have ample time to migrate and address any bugs.
 
-- Implementation should not significantly alter the performance characteristics of the base `text` library within some tolerance
-  threshold.
+- Performance satisfies the targets listed below in the "Performance impact" section.
 
 # People
 
 - Performers:
 
-  - Leader: Andrew Lelechenko (bodigrim)
+  - Leader: Andrew Lelechenko (Bodigrim)
 
   - Support: Emily Pillmore (emilypi)

@@ -173,6 +180,73 @@ possible. This candidate should be shared publicly and loudly.
 
 - TBD: There is a straightforward implementation, but this one is left up to Andrew for comment.
 
+**Performance impact**
+
+A common misunderstanding is that switching to UTF-8 makes everything twice as small and
+twice as fast. That's not quite so.
+
+While UTF-8-encoded English text is half the size of its UTF-16 counterpart,
+this does not quite hold even for other Latin-based languages, which frequently
+use vowels with diacritics. For non-Latin scripts (Russian, Hebrew, Greek)
+the difference between UTF-8 and UTF-16 is almost negligible: one saves on
+spaces and punctuation, but letters still take two bytes. On the bright side, programs rarely
+transfer sheer walls of text, and for a typical markup language (JSON, HTML, XML),
+even if the payload is non-ASCII, savings from UTF-8 easily reach \~30%.
+
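These size figures are easy to check with the current `text` API; `encodeUtf16LE` is an existing function in `Data.Text.Encoding`, and `sizes` is just a throwaway helper for the comparison:

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- (UTF-8 bytes, UTF-16 bytes) for a given Text.
sizes :: T.Text -> (Int, Int)
sizes t = (BS.length (TE.encodeUtf8 t), BS.length (TE.encodeUtf16LE t))

main :: IO ()
main = do
  print (sizes (T.pack "hello world"))  -- (11, 22): ASCII text halves
  print (sizes (T.pack "привет мир"))   -- (19, 20): Cyrillic saves only on the space
```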
+As a Haskell value, `Text` involves a significant constant overhead: there is
+a constructor tag, then offset and length, plus a bytearray header and length.
+Altogether 5 machine words = 40 bytes. So for short texts, even if they are ASCII-only,
+the difference in memory allocations is not very pronounced.
+
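The 40-byte figure is back-of-envelope arithmetic for a 64-bit machine; the field breakdown below is descriptive, not GHC's internal naming:

```haskell
-- One machine word on a 64-bit platform.
wordSize :: Int
wordSize = 8

-- Text constructor: header/tag word, offset field, length field.
textWords :: Int
textWords = 3

-- Underlying byte array: heap-object header plus size field.
byteArrayWords :: Int
byteArrayWords = 2

overheadBytes :: Int
overheadBytes = (textWords + byteArrayWords) * wordSize

main :: IO ()
main = print overheadBytes  -- 40
```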
+Further, traversing UTF-8 text is not necessarily faster than UTF-16. Both are
+variable-length encodings, so indexing a certain element requires parsing everything
+in front of it. But in UTF-16 there are only two options: a code point
+takes either 2 or 4 bytes, and the vast majority of frequently used characters are
+2 bytes long. So traversing UTF-16 keeps branch prediction happy. With UTF-8
+we have all four options: a code point can take from 1 to 4 bytes, and most non-English
+texts constantly alternate between 1-byte (e.g., spaces) and 2-byte characters.
+Having more branches and, more importantly, worse branch prediction is a serious penalty.
+This is to a certain degree mitigated by better cache locality.
+
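The branching in question shows up in the very first step of any scalar decoder: classifying a leading byte. A sketch, assuming the leading byte is valid:

```haskell
import Data.Word (Word8)

-- Length of a UTF-8 sequence from its (assumed valid) leading byte.
-- Four possible outcomes, versus two in UTF-16 (surrogate pair or not),
-- which is what makes branch prediction harder for UTF-8.
seqLen :: Word8 -> Int
seqLen b
  | b < 0x80  = 1  -- 0xxxxxxx: ASCII
  | b < 0xE0  = 2  -- 110xxxxx
  | b < 0xF0  = 3  -- 1110xxxx
  | otherwise = 4  -- 11110xxx

main :: IO ()
main = print (map seqLen [0x41, 0xC3, 0xE2, 0xF0])  -- [1,2,3,4]
```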
+Existing `text` benchmarks arguably favor the UTF-16 encoding: most of them are huge
+walls of Russian, Greek, Korean, etc. texts without any markup. So encoding them in UTF-8
+does not save any space, but we have to pay extra for the more elaborate encoding. Our goal
+here is nevertheless to stay roughly on par with the existing implementation.
+
+Benchmarks for `decodeUtf8` / `encodeUtf8` should improve significantly by virtue
+of avoiding conversion between UTF-8 and UTF-16.
+Fast validation of UTF-8 is not a trivial task, but we intend to employ
+[`simdjson::validate_utf8`](https://arxiv.org/pdf/2010.03090.pdf) for this task.
+
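For intuition only, here is a naive scalar validator of the UTF-8 byte structure. The plan above is SIMD-based validation à la `simdjson::validate_utf8`; this sketch is slower and also skips the overlong-form and surrogate checks a full validator must perform:

```haskell
import qualified Data.ByteString as BS
import Data.Word (Word8)

-- Naive structural UTF-8 check (does NOT reject overlong forms or surrogates).
validUtf8 :: BS.ByteString -> Bool
validUtf8 = go . BS.unpack
  where
    go [] = True
    go (b : bs)
      | b < 0x80              = go bs      -- ASCII
      | b >= 0xC2 && b < 0xE0 = cont 1 bs  -- 2-byte sequence
      | b >= 0xE0 && b < 0xF0 = cont 2 bs  -- 3-byte sequence
      | b >= 0xF0 && b < 0xF5 = cont 3 bs  -- 4-byte sequence
      | otherwise             = False      -- stray continuation or invalid byte
    cont n bs =
      let (pre, post) = splitAt n bs
      in length pre == n && all isCont pre && go post
    isCont :: Word8 -> Bool
    isCont b = b >= 0x80 && b < 0xC0

main :: IO ()
main = do
  print (validUtf8 (BS.pack [0x68, 0xC3, 0xA9]))  -- True: "hé"
  print (validUtf8 (BS.pack [0xC3, 0x28]))        -- False: truncated sequence
```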
+Another important aspect of `text` performance is fusion. We are finalising
+an `inspection-testing`-based [test suite](https://github.com/haskell/text/pull/337) to check that
+pipelines, which used to fuse before, are fusing after the UTF-8 transition as well.
+Fusion is an incredibly fragile matter: for example, of 100 tests which fuse in GHC 8.10.4,
+40 do not fuse in GHC 9.0.1, 30 do not fuse in GHC 8.4.4, etc. In such an environment we cannot
+bet on retaining all fusion capabilities, but we aim to thoroughly investigate
+and explain all regressions.
+
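A typical pipeline the fusion framework is expected to compile into a single loop, with no intermediate `Text` allocated between the two stages (the `shout` name is just for illustration):

```haskell
import Data.Char (isAlpha, toUpper)
import qualified Data.Text as T

-- Two traversals that fusion should merge into one pass over the buffer.
shout :: T.Text -> T.Text
shout = T.map toUpper . T.filter isAlpha

main :: IO ()
main = print (shout (T.pack "ab1 c!"))  -- "ABC"
```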
+We expect that switching to UTF-8 will be beneficial for clients of `text`, both
+libraries and applications. They'll be able to save memory for storage,
+save time on encoding/decoding inputs and outputs, and use state-of-the-art text algorithms
+developed for UTF-8. Parsers often benefit from UTF-8 encoding, because if a grammar
+does not have specific rules for non-ASCII characters (which is most often the case),
+a parser can operate on a `ByteArray` without bothering about multibyte encodings at all.
+
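The parser-friendliness rests on a UTF-8 invariant: every byte of a multibyte sequence is ≥ 0x80, so an ASCII delimiter byte can never occur inside a character. A sketch using the existing `bytestring` API:

```haskell
import qualified Data.ByteString as BS

main :: IO ()
main = do
  -- "a,é,b" encoded as UTF-8: 'é' is the two bytes 0xC3 0xA9.
  let input = BS.pack [0x61, 0x2C, 0xC3, 0xA9, 0x2C, 0x62]
  -- Splitting on the raw comma byte (0x2C) is safe without decoding:
  -- no continuation byte can ever equal an ASCII code.
  print (length (BS.split 0x2C input))  -- 3 fields
```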
+We will seek clients' feedback as early as possible, and will act on it if it arrives
+before the end of the project. However, since our clients are external actors, often
+unpaid volunteers, we cannot expect them to provide feedback by a given date.
+Thus, to keep the targets of this project time-bound, we cannot include a goal
+of waiting for the approval of an indefinite number of parties for an indefinitely long time.
+Such a goal or sentiment, in our opinion, contributed significantly to the failure
+of the two previous attempts.
+
+To sum up:
+
+* `decodeUtf8` and `encodeUtf8` become at least 2x faster.
+* The geometric mean of existing benchmarks (which favor UTF-16) decreases.
+* Fusion (as per our test suite) does not regress beyond at most several cases.
+
 **Stakeholders:**
 
 - Library authors will need to be made aware of changes and adjust accordingly. HF will provide a git reference to a complete MVP as
