Commit cd12a3a

Rework Motivation section
1 parent 666518d commit cd12a3a

1 file changed

+30
-10
lines changed

proposals/002-text-utf-default.md

Lines changed: 30 additions & 10 deletions
@@ -45,9 +45,9 @@ a character (as of Unicode 13.0). Now comes a tricky part: Unicode defines how t
 characters to code points (basically, integers), but how would you serialise lists of
 integers to bytes? The simplest encoding is just allocate 32 bits (4 bytes) per code
 point, and write them one by one. This is UTF-32. Its main benefit is that since
-all code points take the same size, so you can still index characters in a constant time.
+all code points take the same size, you can still index characters in a constant time.
 However, memory requirements are 4x comparing to ASCII, and in a world of ASCII and UCS-2
-there was little appetite to embrace one more, completely new encoding.
+there was little appetite to embrace one more incompatible encoding.
 
 Next option on the list is to encode some code points as 2 bytes and some others,
 less lucky ones, as 4 bytes. This is UTF-16. This encoding allowed to retain a decent
@@ -60,7 +60,7 @@ But once we abandon requirement of constant indexing, even better option arises.
 encode first 128 characters as 1 byte, some others as 2 bytes, and the rest as 4 bytes.
 This is UTF-8. The killer feature of this encoding is that it's fully backwards compatible
 with ASCII. This meant that all existing ASCII documents were automatically valid UTF-8
-documents as well, and that 50-years-old executables could often parse UTF08 data without
+documents as well, and that 50-years-old executables could often parse UTF-8 data without
 knowing a bit about it. This property appeared so important that in a modern environment
 the vast majority of data is stored and sent between processes in UTF-8 encoding.
 
@@ -72,13 +72,33 @@ To sum up:
 
 # Motivation
 
-- UTF-16 by default requires that all Text values pay a premium for serialization. Arguably, the performance impact of Text is flipped
-upside-down: most text is UTF-8, and Haskell devs pay an undue cost when working with the wrong default.
-
-- UTF-8 is the industry standard and by far the most common text encoding, with roughly 97% of web pages existing in UTF-8. The
-existing UTF-16 default imposes an additional hurdle to working with the vast majority of web content on earth.
-
-- Many systems in Haskell are UTF-8 by default (e.g. Haddock)
+`text` is a standard Haskell library for Unicode strings. Internally it stores
+Unicode code points in UTF-16, so any character takes either 2 or 4 bytes.
+In a modern environment this is a suboptimal choice: usually data
+is stored (e.g., on a disk or in a DB) and transferred between agents (e.g., via the web)
+in UTF-8 encoding. So `text` needs to convert (UTF-8 to UTF-16) all inputs and usually
+ends up converting outputs as well (this time UTF-16 to UTF-8).
+
+Even within the Haskell ecosystem UTF-16 is rarely used
+for interprocess communication or as a component of binary formats.
+The very `instance Binary Text` serializes `Text` in UTF-8 encoding.
+
+If we switch the internal representation of `Text` from UTF-16 to UTF-8,
+all such conversions would be made redundant and we'll be able to just check that
+a `ByteString` is valid UTF-8 (which is most often the case) and copy it into `Text`.
+If in the future (see an upcoming "Unifying vector-like types" proposal) `ByteString`
+is backed by unpinned memory, we'd be able to eliminate copying entirely.
+
+`Text` is also often used in contexts that involve mostly ASCII characters.
+For such applications storing data in UTF-8 means using up to 2x less space,
+which could be important to reduce memory pressure.
+
+The importance of UTF-16 to UTF-8 transition was recognised long ago, and at least
+two attempts have been made:
+[in 2011](https://github.com/jaspervdj/text/tree/utf8) and five years later
+[in 2016](https://github.com/text-utf8/text-utf8). Unfortunately, they did not get
+merged into the main `text` package. Today, five more years later, it seems suitable
+to make another attempt.
 
 # Goals
 