@@ -18,7 +18,7 @@ this project.
# What is Unicode?

Representing a text via bytes requires choosing a character encoding.
- One of the most old encodings is ASCII: it has 2⁷=128 code points,
+ One of the oldest encodings is ASCII: it has 2⁷=128 code points,
representing Latin letters, digits and punctuation. These code points
are trivially mapped to 7-bit sequences and stored byte by byte (with leading zero).

@@ -67,7 +67,7 @@ the vast majority of data is stored and sent between processes in UTF-8 encoding
To sum up:

* Both UTF-8 and UTF-16 support exactly the same range of characters.
- * For ASCII data UTF-8 takes twice less space.
+ * For ASCII data UTF-8 takes half the space.
* UTF-8 is vastly more popular for serialization and storage.

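As a rough illustration of the size claims in the summary above, the following Haskell sketch counts the bytes a string would occupy in each encoding. It uses only `base`; the names `utf8Length`, `utf16Length` and `encodedSizes` are made up for this example and are not part of `text`. It assumes the input contains no lone surrogate code points, which a real encoder would reject.

```haskell
import Data.Char (ord)

-- | Bytes needed to encode one code point in UTF-8 (1 to 4).
utf8Length :: Char -> Int
utf8Length c
  | n < 0x80    = 1   -- ASCII
  | n < 0x800   = 2   -- e.g. Latin with diacritics, Cyrillic, Greek, Hebrew
  | n < 0x10000 = 3   -- rest of the Basic Multilingual Plane
  | otherwise   = 4   -- supplementary planes
  where
    n = ord c

-- | Bytes needed to encode one code point in UTF-16 (2 or 4).
utf16Length :: Char -> Int
utf16Length c
  | ord c < 0x10000 = 2   -- single 16-bit code unit
  | otherwise       = 4   -- surrogate pair

-- | (UTF-8 size, UTF-16 size) of a string, in bytes.
encodedSizes :: String -> (Int, Int)
encodedSizes s = (sum (map utf8Length s), sum (map utf16Length s))

main :: IO ()
main = do
  print (encodedSizes "hello")   -- (5,10): ASCII takes half the space in UTF-8
  print (encodedSizes "привет")  -- (12,12): Cyrillic letters take 2 bytes in both
  print (encodedSizes "日本語")   -- (9,6): these characters take 3 bytes in UTF-8
```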
# Motivation
@@ -86,14 +86,14 @@ The very `instance Binary Text` serializes `Text` in UTF-8 encoding.
If we switch the internal representation of `Text` from UTF-16 to UTF-8,
all such conversions would be made redundant and we'll be able to just check that
a `ByteString` is valid UTF-8 (which is most often the case) and copy it into `Text`.
- If in future `ByteString` switch to be
+ If in the future `ByteString` switches to be
backed by unpinned memory, we'd be able to eliminate copying entirely.

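For reference, a minimal sketch of the validate-then-convert step using the existing `decodeUtf8'` function from `Data.Text.Encoding`. The helper name `fromUtf8` is hypothetical. With a UTF-16 backed `Text` this call has to transcode its input; the expectation stated above is that with a UTF-8 backed `Text` it could reduce to validation plus a copy.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

-- | Convert a ByteString to Text if and only if it is valid UTF-8.
-- (Hypothetical helper; it only wraps the library's decodeUtf8'.)
fromUtf8 :: B.ByteString -> Maybe T.Text
fromUtf8 = either (const Nothing) Just . decodeUtf8'

main :: IO ()
main = do
  print (fromUtf8 (B.pack [0x68, 0x69]))  -- valid ASCII bytes for "hi"
  print (fromUtf8 (B.pack [0xC3, 0xA9]))  -- valid two-byte sequence (U+00E9)
  print (fromUtf8 (B.pack [0xFF, 0xFE]))  -- invalid UTF-8, yields Nothing
```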
`Text` is also often used in contexts that involve mostly ASCII characters.
- This often prompts developers to use `ByteString` instead of `Text` to save 2x space
+ This often prompts developers to use `ByteString` instead of `Text` to save half the space
in a (false) hope that their data would never happen to contain anything non-ASCII.
- Backing `Text` by UTF-8 removes this source of inefficiency, reduces memory consumption
- up to 2x and promote more convergence to use `Text` for all stringy things
+ Backing `Text` by UTF-8 removes this source of inefficiency, reduces memory consumption by
+ up to half and promotes more convergence to use `Text` for all stringy things
(as opposed to binary data).

Modern computer science research is focused on developing faster algorithms
@@ -180,31 +180,31 @@ The work came to a halt in 2018 and at the moment is \~200 commits behind `maste

It would be extremely challenging to rebase this work on top of current `text`,
audit and verify decade-old changes, then fix remaining
- issues and pass a review. As one can imagine, reviewers are not usually quite happy
+ issues and pass a review. As one can imagine, reviewers are usually not happy
to review 300 commits of vague provenance. Moreover, we discovered that benchmarks
regressed severely in `text-utf8` and that fusion is broken on several occasions.
It's unclear where exactly the problem lurks there. Finally, we'd like to explore different
approaches to tackle potential performance issues.

- We decided that the safest bet is to reimplement UTF-8 transition from the scratch,
+ We decided that the safest bet is to reimplement UTF-8 transition from scratch,
paying close attention to tests and benchmarks step by step. This way we'll be able
to gain enough confidence and understanding of the nature of changes, and provide
reviewers with a clean sequence of commits, facilitating timely merge.

Talking about developments in a wider ecosystem, one must mention
the `text-short` package, which provides a data structure, similar in characteristics
to `ShortByteString`, but interpreted as UTF-8 encoded data. It was argued that
- this type is worth inclusion into main `text` package to mirror `ShortByteString`,
+ this type is worth including in the main `text` package to mirror `ShortByteString`,
exposed from `bytestring`. While such acquisition is out of scope for this project,
it will be easier to do so when the `text` package itself switches to UTF-8, opening
possibilities for an even better String story in Haskell.

**Compatibility issues**

- `text` is a very old package, deeply ingrained in Haskell ecosystem.
+ `text` is a very old package, deeply ingrained in the Haskell ecosystem.
A change of internal representation is necessarily a breaking change.
Our strategy to tackle compatibility issues is guided by a desire
- to finish this project in a time-bound fashion with realistic expectations
+ to finish this project in a time-bounded fashion with realistic expectations
about available resources.

Current `text` HEAD supports GHCs back to GHC 8.0.
@@ -217,7 +217,7 @@ or working around a bug), we may decide to shrink the compatibility window.
Such a decision would not be taken lightly, but we believe that getting
things done for the bright future should not be hindered by old unsupported luggage.

- One suggestion to improve compatibility story was to keep both UTF-16 and UTF-8
+ One suggestion to improve the compatibility story was to keep both UTF-16 and UTF-8
implementations in `text` and switch between them via Cabal flag. It seems,
however, that such a strategy will put an undue, indefinitely long burden
on `text` maintainers, and bring little benefit to downstream packages, because
@@ -234,11 +234,11 @@ modules unchanged, except `Word16` replaced by `Word8` where appropriate.
Such a promise unfortunately cannot be made for `Internal` modules,
due to their nature: even though we'll strive to keep as much untouched as possible,
the semantics of internal functions is due to change drastically. This kind of breakage
- should not come as a big surprise, because `Internal` modules have a disclaimer about
- unstable API.
+ should not come as a big surprise, because `Internal` modules have a disclaimer about the API
+ being unstable.

There are two places where `text` leaks details of internal representation.
- First of them is `Data.Text.Array`, which provides an access to an underlying bytearray.
+ The first of them is `Data.Text.Array`, which provides access to an underlying bytearray.
Not only is its API to change from `Word16` to `Word8`, but also the semantics
of the array switches from UTF-16 to UTF-8. This will cause breakage of several packages
such as `unicode-transforms` and `unicode-collation`. We intend to communicate with
@@ -254,7 +254,7 @@ to reach to them as soon as we have an MVP.
Since fixing downstream compatibility issues is up to external counterparties,
most of which are unpaid volunteers, we cannot expect them to do it in a limited
time frame. We are devoted to having a smooth migration story and will provide
- as much guidance as possible, but to keep our targets time-bound we cannot tie the success
+ as much guidance as possible, but to keep our targets time-bounded we cannot tie the success
of this project to actions of third parties. We will not block this project
because of unmigrated packages downstream.

@@ -263,29 +263,29 @@ To sum up, we plan to:
* Keep `text` compatible with GHCs back to 8.0, unless it imposes an undue cost (more than 50 lines of code per major release).
* Keep signatures of non-`Internal` modules compatible modulo `Word16`/`Word8` change.
* Provide migration guidance to clients of `Data.Text.{Array,Foreign}`.
- * Facilitate a community project to keep UTF16-based legacy fork alive, if there is such demand.
+ * Facilitate a community project to keep UTF16-based legacy fork alive, if there is such a demand.

**Performance impact**

- A common misunderstanding is that switching to UTF-8 makes everything twice smaller and
- twice faster. That's not quite so.
+ A common misunderstanding is that switching to UTF-8 makes everything smaller by half and
+ twice as fast. That's not quite so.

- While UTF-8 encoded English text is twice smaller than UTF-16,
+ While UTF-8 encoded English text is half as big as UTF-16,
this is not exactly true even for other Latin-based languages, which frequently
use vowels with diacritics. For non-Latin scripts (Russian, Hebrew, Greek)
the difference between UTF-8 and UTF-16 is almost negligible: one saves on
spaces and punctuation, but letters still take two bytes. On the bright side, programs rarely
transfer sheer walls of text, and for a typical markup language (JSON, HTML, XML),
- even if payload is non-ASCII, savings from UTF-8 easily reach ~30%.
+ even if the payload is non-ASCII, savings from UTF-8 easily reach ~30%.

As a Haskell value, `Text` involves a significant constant overhead: there is
a constructor tag, then offset and length, plus bytearray header and length.
Altogether 5 machine words = 40 bytes. So for short texts, even if they are ASCII only,
- difference in memory allocations is not very pronounced.
+ the difference in memory allocations is not very pronounced.

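The effect of the constant overhead can be estimated with a back-of-the-envelope calculation. The sketch below takes the 40-byte figure from the paragraph above and assumes ASCII-only content and no alignment padding of the payload; `footprint` and `savingPercent` are illustrative names, not part of any library.

```haskell
-- | Fixed per-value overhead quoted above: 5 machine words of 8 bytes each.
overheadBytes :: Int
overheadBytes = 5 * 8

-- | Approximate footprint in bytes of an ASCII-only text of n characters,
-- given the code unit width in bytes (2 for UTF-16, 1 for UTF-8).
-- Assumption: payload alignment padding is ignored.
footprint :: Int -> Int -> Int
footprint unitBytes n = overheadBytes + unitBytes * n

-- | Saving of UTF-8 over UTF-16 in percent, rounded down.
savingPercent :: Int -> Int
savingPercent n = 100 * (utf16 - utf8) `div` utf16
  where
    utf16 = footprint 2 n
    utf8  = footprint 1 n

main :: IO ()
main = mapM_ report [4, 16, 64, 1024]
  where
    report n =
      putStrLn (show n ++ " chars: " ++ show (savingPercent n) ++ "% saved")
-- Under these assumptions the savings are 8%, 22%, 38% and 49%:
-- the 2x figure is only approached for long texts.
```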
Further, traversing UTF-8 text is not necessarily faster than UTF-16. Both are
variable length encodings, so indexing a certain element requires parsing everything
- in front. But in UTF-16 there are only two options: a code point
+ up front. But in UTF-16 there are only two options: a code point
takes either 2 or 4 bytes, and the vast majority of frequently used characters are
2 bytes long. So traversing UTF-16 keeps branch prediction happy. Now with UTF-8
we have all four options: a code point can take from 1 to 4 bytes, and most non-English