Commit 296151c

some style and language fixes for 002-text-utf-default.md (#4)
1 parent dcfac9f commit 296151c

1 file changed

+23
-23
lines changed

proposals/002-text-utf-default.md

Lines changed: 23 additions & 23 deletions
@@ -18,7 +18,7 @@ this project.
1818
# What is Unicode?
1919

2020
Representing a text via bytes requires to choose a character enconding.
21-
One of the most old encodings is ASCII: it has 2⁷=128 code points,
21+
One of the oldest encodings is ASCII: it has 2⁷=128 code points,
2222
representing Latin letters, digits and punctuations. These code points
2323
are trivially mapped to 7-bit sequences and stored byte by byte (with leading zero).
2424

@@ -67,7 +67,7 @@ the vast majority of data is stored and sent between processes in UTF-8 encoding
6767
To sum up:
6868

6969
* Both UTF-8 and UTF-16 support exactly the same range of characters.
70-
* For ASCII data UTF-8 takes twice less space.
70+
* For ASCII data UTF-8 takes half the space.
7171
* UTF-8 is vastly more popular for serialization and storage.
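
The size comparison in the summary above can be checked with a minimal sketch, assuming the `text` and `bytestring` packages are available. The helper `encodedSizes` is a hypothetical name introduced only for illustration; `encodeUtf8` and `encodeUtf16LE` are existing functions from `Data.Text.Encoding`.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Hypothetical helper: size in bytes of a text serialized
-- as UTF-8 and as UTF-16 (little endian, no byte order mark).
encodedSizes :: T.Text -> (Int, Int)
encodedSizes t =
  ( B.length (TE.encodeUtf8 t)
  , B.length (TE.encodeUtf16LE t)
  )

main :: IO ()
main = do
  -- Pure ASCII: one byte per character in UTF-8, two in UTF-16.
  print (encodedSizes (T.pack "hello"))   -- (5,10)
  -- Cyrillic letters: two bytes per character in both encodings.
  print (encodedSizes (T.pack "привет"))  -- (12,12)
```

The ASCII sample takes half the space in UTF-8, while the Cyrillic sample is the same size in both encodings.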
7272

7373
# Motivation
@@ -86,14 +86,14 @@ The very `instance Binary Text` serializes `Text` in UTF-8 encoding.
8686
If we switch the internal representation of `Text` from UTF-16 to UTF-8,
8787
all such conversions would be made redundant and we'll be able just check that
8888
a `ByteString` is a valid UTF-8 (which is most often the case) and copy it into `Text`.
89-
If in future `ByteString` switch to be
89+
If in the future `ByteString` switches to be
9090
backed by unpinned memory, we'd be able to eliminate copying entirely.
9191

9292
`Text` is also often used in contexts, which involve mostly ASCII characters.
93-
This often prompts developers to use `ByteString` instead of `Text` to save 2x space
93+
This often prompts developers to use `ByteString` instead of `Text` to save half the space
9494
in a (false) hope that their data would never happen to contain anything non-ASCII.
95-
Backing `Text` by UTF-8 removes this source of inefficiency, reduces memory consumption
96-
up to 2x and promote more convergence to use `Text` for all stringy things
95+
Backing `Text` by UTF-8 removes this source of inefficiency, reduces memory consumption by
96+
up to half and promote more convergence to use `Text` for all stringy things
9797
(as opposed to binary data).
9898

9999
Modern computer science research is focused on developing faster algorithms
@@ -180,31 +180,31 @@ The work came to a halt in 2018 and at the moment is \~200 commits behind `maste
180180

181181
It would be extremely challenging to rebase this work on top of current `text`,
182182
audit and verify decade-old changes, then fix remaining
183-
issues and pass a review. As one can imagine, reviewers are not usually quite happy
183+
issues and pass a review. As one can imagine, reviewers are usually not happy
184184
to review 300 commits of vague provenance. Moreover, we discovered that benchmarks
185185
regressed severely in `text-utf8` and that fusion is broken on several occasions.
186186
It's unclear where exactly the problem lurks there. Finally, we'd like to explore different
187187
approaches to tackle potential performance issues.
188188

189-
We decided that the safest bet is to reimplement UTF-8 transition from the scratch,
189+
We decided that the safest bet is to reimplement UTF-8 transition from scratch,
190190
paying close attention to tests and benchmarks step by step. This way we'll be able
191191
to gain enough confidence and understanding of the nature of changes, and provide
192192
reviewers with a clean sequence of commits, facilitating timely merge.
193193

194194
Talking about developments in a wider ecosystem, one must mention
195195
`text-short` package, which provides a data structure, similar in characteristics
196196
to `ShortByteString`, but interpreted as a UTF-8 encoded data. It was argued that
197-
this type is worth inclusion into main `text` package to mirror `ShortByteString`,
197+
this type is worth including in main `text` package to mirror `ShortByteString`,
198198
exposed from `bytestring`. While such acquisition is out of scope for this project,
199199
it will be easier to do so when `text` package itself switches to UTF-8, opening
200200
possibilities for even better String story in Haskell.
201201

202202
**Compatibility issues**
203203

204-
`text` is a very old package, deeply ingrained in Haskell ecosystem.
204+
`text` is a very old package, deeply ingrained in the Haskell ecosystem.
205205
A change of internal representation is necessarily a breaking change.
206206
Our strategy to tackle compatibility issues is guided by a desire
207-
to finish this project in a time-bound fashion with realistic expectations
207+
to finish this project in a time-bounded fashion with realistic expectations
208208
about available resources.
209209

210210
Current `text` HEAD supports GHCs back to GHC 8.0.
@@ -217,7 +217,7 @@ or working around a bug), we may decide to shrink the compatibility window.
217217
Such decision would not to be taken lightly, but we believe that getting
218218
things done for the bright future should not be hindered by old unsupported luggage.
219219

220-
One suggestion to improve compatibility story was to keep both UTF-16 and UTF-8
220+
One suggestion to improve the compatibility story was to keep both UTF-16 and UTF-8
221221
implementations in `text` and switch between them via Cabal flag. It seems,
222222
however, that such strategy will put an undue, indefinitely long burden
223223
on `text` maintainers, and brings little benefits to downstream packages, because
@@ -234,11 +234,11 @@ modules unchanged, except `Word16` replaced by `Word8` where appropriate.
234234
Such promise unfortunately cannot be made for `Internal` modules,
235235
due to their nature: even while we'll strive to keep as much untouched as possible,
236236
the semantics of internal functions is due to change drastically. This kind of breakage
237-
should not come as a big surprise, because `Internal` modules have a disclaimer about
238-
unstable API.
237+
should not come as a big surprise, because `Internal` modules have a disclaimer about the API
238+
being unstable.
239239

240240
There are two places where `text` leaks details of internal representation.
241-
First of them is `Data.Text.Array`, which provides an access to an underlying bytearray.
241+
First of them is `Data.Text.Array`, which provides access to an underlying bytearray.
242242
Not only its API is to change from `Word16` to `Word8`, but also the semantics
243243
of array switches from UTF-16 to UTF-8. This will cause breakage of several packages
244244
such as `unicode-transforms` and `unicode-collation`. We intend to communicate with
@@ -254,7 +254,7 @@ to reach to them as soon as we have an MVP.
254254
Since fixing downstream compatibility issues is up to external counterparties,
255255
most of which are unpaid volunteers, we cannot expect them to do it in a limited
256256
time frame. We are devoted to having a smooth migration story and will provide
257-
as much guidance as possible, but to keep our targets time-bound we cannot tie the success
257+
as much guidance as possible, but to keep our targets time-bounded we cannot tie the success
258258
of this project to actions of third parties. We will not block this project
259259
because of unmigrated packages downstream.
260260

@@ -263,29 +263,29 @@ To sum up, we plan to:
263263
* Keep `text` compatible with GHCs back to 8.0, unless it puts an undue cost (more than 50 lines of code per major release).
264264
* Keep signatures of non-`Internal` modules compatible modulo `Word16`/`Word8` change.
265265
* Provide migration guidance to clients of `Data.Text.{Array,Foreign}`.
266-
* Facilitate a community project to keep UTF16-based legacy fork alive, if there is such demand.
266+
* Facilitate a community project to keep UTF16-based legacy fork alive, if there is such a demand.
267267

268268
**Performance impact**
269269

270-
A common misunderstanding is that switching to UTF-8 makes everything twice smaller and
271-
twice faster. That's not quite so.
270+
A common misunderstanding is that switching to UTF-8 makes everything smaller by half and
271+
twice as fast. That's not quite so.
272272

273-
While UTF-8 encoded English text is twice smaller than UTF-16,
273+
While UTF-8 encoded English text is half as big as UTF-16,
274274
this is not exactly true even for other Latin-based languages, which frequently
275275
use vowels with diacritics. For non-Latin scripts (Russian, Hebrew, Greek)
276276
the difference between UTF-8 and UTF-16 is almost negligible: one saves on
277277
spaces and punctuation, but letters still take two bytes. On a bright side, programs rarely
278278
transfer sheer walls of text, and for a typical markup language (JSON, HTML, XML),
279-
even if payload is non-ASCII, savings from UTF-8 easily reach \~30%.
279+
even if the payload is non-ASCII, savings from UTF-8 easily reach \~30%.
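
To make the markup argument concrete, here is a rough sketch that counts encoded bytes per code point using the standard UTF-8 and UTF-16 length rules. The function names and the sample JSON string are assumptions chosen for illustration, not part of the proposal.

```haskell
import Data.Char (ord)

-- | Bytes needed to encode one code point in UTF-8.
utf8Len :: Char -> Int
utf8Len c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- | Bytes needed to encode one code point in UTF-16:
-- one 16-bit unit, or a surrogate pair outside the BMP.
utf16Len :: Char -> Int
utf16Len c = if ord c < 0x10000 then 2 else 4

-- | Total (UTF-8, UTF-16) size of a string in bytes.
sizes :: String -> (Int, Int)
sizes s = (sum (map utf8Len s), sum (map utf16Len s))

main :: IO ()
main =
  -- A small JSON-like payload with a Cyrillic value:
  -- 12 ASCII characters and 4 Cyrillic letters.
  print (sizes "{\"name\": \"Иван\"}")  -- (20,32)
```

In this toy payload the ASCII syntax dominates, so UTF-8 is noticeably smaller even though every Cyrillic letter still takes two bytes. Real savings depend on the ratio of markup to payload.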
280280

281281
As a Haskell value, `Text` involves a significant constant overhead: there is
282282
a constructor tag, then offset and length, plus bytearray header and length.
283283
Altogether 5 machine word = 40 bytes. So for short texts, even if they are ASCII only,
284-
difference in memory allocations is not very pronounced.
284+
the difference in memory allocations is not very pronounced.
285285

286286
Further, traversing UTF-8 text is not necessarily faster than UTF-16. Both are
287287
variable length encodings, so indexing a certain element requires parsing everything
288-
in front. But in UTF-16 there are only two options: a code point
288+
up front. But in UTF-16 there are only two options: a code point
289289
takes either 2 or 4 bytes, and the vast majority of frequently used characters are
290290
2-byte long. So traversing UTF-16 keeps branch prediction happy. Now with UTF-8
291291
we have all four options: a code point can take from 1 to 4 bytes, and most non-English
