# Introduction

This proposal is for the migration of the `text` package from its
default UTF-16 encoding to UTF-8. The lack of UTF-8 as a default in the
`text` package is a pain point raised by the Haskell Community and
appetite for breakage, and what stakeholders would be affected by the
changes. Andrew Lelechenko has offered to lead the implementation of
this project.

# What is Unicode?

Representing text as bytes requires choosing a character encoding.
One of the oldest encodings is ASCII: it has 2⁷=128 code points,
representing Latin letters, digits and punctuation. These code points
are trivially mapped to 7-bit sequences and stored byte by byte (with a leading zero bit).

Unfortunately, almost no alphabet other than English fits into ASCII:
even Western European languages need code points for ñ, ß, ø, etc.
Since a byte contains 8 bits on the majority of architectures, enough to encode
up to 256 code points, various incompatible ASCII extensions proliferated.
These include ISO 8859-1, covering additional symbols for Latin scripts, several
encodings for Cyrillic letters, etc.

Not only was it error-prone to guess the encoding of a particular bytestring, it was also
impossible, for example, to mix French and Russian letters in a single text. This
prompted work on a universal encoding, Unicode, which would be capable of representing
all letters one can think of. At early development stages it was thought that
“64K ought to be enough for anybody”, and a uniform 16-bit encoding, UCS-2, was proposed.
Some programming languages developed around that period (Java, JavaScript)
chose this encoding for the internal representation of strings. It is
only twice as long as ASCII and, since all code points are
represented by 2 bytes, constant-time indexing is possible.

Soon enough, however, the Unicode Consortium discovered more than 64K letters.
The Unicode standard defines almost 17·2¹⁶ code points, of which \~ 143 000 are assigned
a character (as of Unicode 13.0). Now comes the tricky part: Unicode defines how to map
characters to code points (basically, integers), but how would you serialise lists of
integers to bytes? The simplest encoding is to just allocate 32 bits (4 bytes) per code
point and write them one by one. This is UTF-32. Its main benefit is that, since
all code points take the same size, you can still index characters in constant time.
However, memory requirements are 4x compared to ASCII, and in a world of ASCII and UCS-2
there was little appetite to embrace yet another, completely new encoding.
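
As a quick illustration (just a sketch assuming GHC with the `text` and
`bytestring` packages; not part of the proposal), every code point occupies
exactly 4 bytes in UTF-32:

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main =
  -- Two code points ('a' and a non-BMP emoji) always take 2 * 4 = 8 bytes in UTF-32.
  print $ BS.length (TE.encodeUtf32LE (T.pack "a🌍"))  -- prints 8
```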

The next option on the list is to encode some code points as 2 bytes and some others,
less lucky ones, as 4 bytes. This is UTF-16. This encoding retains decent
backward compatibility with UCS-2, so, for instance, modern Java and JavaScript
stick to UTF-16 as their default internal representation. The biggest downside is that you
no longer know beforehand where the *n*-th code point starts. One must *parse* all characters
from the very beginning to learn which ones are 2 bytes long and which are 4 bytes long.
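
For example (again a small sketch using the `text` and `bytestring` packages),
both widths can be observed directly:

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  -- 'é' (U+00E9) lies in the Basic Multilingual Plane: one 16-bit unit, 2 bytes.
  print $ BS.length (TE.encodeUtf16LE (T.pack "é"))   -- prints 2
  -- '🌍' (U+1F30D) lies outside the BMP: a surrogate pair, 4 bytes.
  print $ BS.length (TE.encodeUtf16LE (T.pack "🌍"))  -- prints 4
```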

But once we abandon the requirement of constant-time indexing, an even better option arises. Let's
encode the first 128 characters as 1 byte, some others as 2 or 3 bytes, and the rest as 4 bytes.
This is UTF-8. The killer feature of this encoding is that it's fully backwards compatible
with ASCII. This meant that all existing ASCII documents were automatically valid UTF-8
documents as well, and that 50-year-old executables could often process UTF-8 data without
knowing anything about it. This property proved so important that in a modern environment
the vast majority of data is stored and sent between processes in UTF-8 encoding.
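
To illustrate (another sketch, not part of the proposal), encoding pure ASCII
text to UTF-8 yields exactly the same bytes as the ASCII encoding itself:

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main =
  -- For ASCII input the UTF-8 bytes coincide with the raw ASCII bytes.
  print $ TE.encodeUtf8 (T.pack "hello world") == BC.pack "hello world"  -- prints True
```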

To sum up:

* Both UTF-8 and UTF-16 support exactly the same range of characters.
* For ASCII data, UTF-8 takes half as much space (see the sketch below).
* UTF-8 is vastly more popular for serialization and storage.
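
These points can be checked with one more sketch (the exact byte counts below
are for this particular input):

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  let ascii = T.pack "The quick brown fox"        -- 19 ASCII characters
  -- For ASCII text UTF-8 needs half as many bytes as UTF-16.
  print $ BS.length (TE.encodeUtf8 ascii)         -- prints 19
  print $ BS.length (TE.encodeUtf16LE ascii)      -- prints 38
  -- Both encodings cover the same range of characters, e.g. a non-BMP emoji.
  print $ BS.length (TE.encodeUtf8 (T.pack "🌍"))     -- prints 4
  print $ BS.length (TE.encodeUtf16LE (T.pack "🌍"))  -- prints 4
```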

# Motivation

UTF-16 by default requires that all Text values pay a premium for serialization. Arguably, the performance impact of Text is flipped