
Commit 666518d

Add technical introduction to Unicode
1 parent c184a9e commit 666518d

File tree

1 file changed: +57 −0 lines changed

proposals/002-text-utf-default.md

Lines changed: 57 additions & 0 deletions
@@ -1,3 +1,5 @@
# Introduction

This proposal is for the migration of the `text` package from its
default UTF-16 encoding to UTF-8. The lack of UTF-8 as a default in the
`text` package is a pain point raised by the Haskell Community and
@@ -13,6 +15,61 @@ appetite for breakage, and what stakeholders would be affected by the
changes. Andrew Lelechenko has offered to lead the implementation of
this project.

# What is Unicode?

Representing text as bytes requires choosing a character encoding.
One of the oldest encodings is ASCII: it has 2⁷ = 128 code points,
representing Latin letters, digits and punctuation. These code points
are trivially mapped to 7-bit sequences and stored byte by byte (with a leading zero).
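
As a sketch of what that mapping amounts to (the helper `encodeAscii` below is
illustrative, not a library function), ASCII encoding is just a check that every
code point fits in 7 bits:

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Map each character to a single byte, failing on anything outside ASCII.
encodeAscii :: String -> Maybe [Word8]
encodeAscii = traverse toByte
  where
    toByte c
      | ord c < 128 = Just (fromIntegral (ord c))
      | otherwise   = Nothing  -- no ASCII representation exists

-- >>> encodeAscii "Hi!"
-- Just [72,105,33]
-- >>> encodeAscii "año"
-- Nothing
```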

Unfortunately, almost no language other than English fits into ASCII:
even Western European languages need code points for ñ, ß, ø, etc.
Since on the majority of architectures a byte contains 8 bits, capable of encoding
up to 256 code points, various incompatible ASCII extensions proliferated.
These include ISO 8859-1, which covers additional symbols for Latin scripts;
several encodings for Cyrillic letters; and so on.

Not only was it error-prone to guess the encoding of a particular bytestring, it was
also impossible, for example, to mix French and Russian letters in a single text. This
prompted work on a universal encoding, Unicode, which would be capable of representing
all letters one can think of. In the early stages of development it was thought that
“64K ought to be enough for anybody”, and a uniform 16-bit encoding, UCS-2, was proposed.
Some programming languages developed around that period (Java, JavaScript)
chose this encoding for the internal representation of strings. It is
only twice as long as ASCII and, since all code points are
represented by 2 bytes, constant-time indexing is possible.

Soon enough, however, the Unicode Consortium discovered more than 64K letters.
The Unicode standard defines almost 17·2¹⁶ code points, of which ~143 000 are assigned
a character (as of Unicode 13.0). Now comes the tricky part: Unicode defines how to map
characters to code points (basically, integers), but how would you serialise lists of
integers to bytes? The simplest encoding is to allocate 32 bits (4 bytes) per code
point and write them one by one. This is UTF-32. Its main benefit is that, since
all code points take the same size, you can still index characters in constant time.
However, memory requirements are 4x compared to ASCII, and in a world of ASCII and UCS-2
there was little appetite to embrace one more, completely new encoding.
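
For illustration, a minimal UTF-32 (big-endian) encoder really is this simple,
because every code point occupies exactly four bytes (`utf32BE` is a hypothetical
helper, not part of any library):

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8)

-- Write the code point as four big-endian bytes.
utf32BE :: Char -> [Word8]
utf32BE c = [fromIntegral ((ord c `shiftR` s) .&. 0xFF) | s <- [24, 16, 8, 0]]

-- >>> concatMap utf32BE "A€"
-- [0,0,0,65,0,0,32,172]
```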

The next option on the list is to encode some code points as 2 bytes and others,
less lucky ones, as 4 bytes. This is UTF-16. This encoding retains decent
backward compatibility with UCS-2, so, for instance, modern Java and JavaScript
stick to UTF-16 as the default internal representation. The biggest downside is that you
no longer know beforehand where the *n*-th code point starts. One must *parse* all characters
from the very beginning to learn which ones are 2 bytes long and which are 4.
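
A sketch of the encoding side shows where the variable length comes from: code
points up to U+FFFF occupy one 16-bit unit, while the rest are split into a
surrogate pair (again, `utf16Units` is illustrative only):

```haskell
import Data.Char (ord)
import Data.Word (Word16)

-- One unit for the Basic Multilingual Plane, a surrogate pair otherwise.
utf16Units :: Char -> [Word16]
utf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + m `div` 0x400)    -- high surrogate
                  , fromIntegral (0xDC00 + m `mod` 0x400) ]  -- low surrogate
  where
    n = ord c
    m = n - 0x10000

-- >>> utf16Units '€'   -- U+20AC: one unit (2 bytes)
-- [8364]
-- >>> utf16Units '𝕏'   -- U+1D54F: surrogate pair (4 bytes)
-- [55349,56655]
```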

But once we abandon the requirement of constant-time indexing, an even better option arises. Let's
encode the first 128 characters as 1 byte, some others as 2 or 3 bytes, and the rest as 4 bytes.
This is UTF-8. The killer feature of this encoding is that it is fully backwards compatible
with ASCII. This meant that all existing ASCII documents were automatically valid UTF-8
documents as well, and that 50-year-old executables could often parse UTF-8 data without
knowing anything about it. This property proved so important that in a modern environment
the vast majority of data is stored and sent between processes in UTF-8 encoding.
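
The bit layout can be sketched directly; note how the first branch emits plain
ASCII bytes unchanged (`utf8Bytes` is a toy encoder for exposition, not the
production one in `Data.Text.Encoding`):

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one code point into 1–4 bytes following the UTF-8 bit layout.
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80    = [fromIntegral n]                  -- ASCII stays a single byte
  | n < 0x800   = [0xC0 .|. hd 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hd 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hd 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    hd s   = fromIntegral (n `shiftR` s)                      -- leading byte payload
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)  -- continuation byte

-- >>> concatMap utf8Bytes "Aé€"
-- [65,195,169,226,130,172]
```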

To sum up:

* Both UTF-8 and UTF-16 support exactly the same range of characters.
* For ASCII data, UTF-8 takes half the space.
* UTF-8 is vastly more popular for serialization and storage.
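
The size difference is easy to observe with the `text` package itself (a small
demo, assuming `text` and `bytestring` are available):

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE)

main :: IO ()
main = do
  let ascii = T.pack "Hello, world!"       -- 13 ASCII characters
  print (BS.length (encodeUtf8 ascii))     -- 13 bytes in UTF-8
  print (BS.length (encodeUtf16LE ascii))  -- 26 bytes in UTF-16
```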
# Motivation

- UTF-16 by default requires that all Text values pay a premium for serialization. Arguably, the performance impact of Text is flipped
