UTF8

UTF8 是一种兼容 ASCII 编码的变长编码。

UTF8编码的细节：

Bytes count	Bits	Range	Bytes
1	7	U+0000 - U+007F	0xxxxxxx
2	11	U+0080 - U+07FF	110xxxxx 10xxxxxx
3	16	U+0800 - U+FFFF	1110xxxx 10xxxxxx 10xxxxxx
4	21	U+10000 - U+10FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

转换成十进制，即：字节 0xxxxxxx 范围是：0 - 127 字节 10xxxxxx 范围是：128 - 191 字节 110xxxxx 范围是：192 - 223 字节 1110xxxx 范围是：224 - 239 字节 11110xxx 范围是：240 - 247

Character	Octal	Binary code point	Binary UTF-8
$ U+0024	044	010 0100	0-0100100
¢ U+00A2	0242	000 10-10 0010	110-00010 10-100010
€ U+20AC	020254	0010 -0000 10-10 1100	1110-0010 10-000010 10-101100
𐍈 U+10348	0201510	0 00-01 0000 -0011 01-00 1000	11110-000 10-010000 10-001101 10-001000

即将 Unicode codepoint 转换为符合 UTF8 编码的字节。

已知一个字符的Unicode Codepoint是 n，求其UTF8储存的字节。

fn encode(codepoint: u32) -> Vec<u8>

二进制转换基本表格：

具体的算法参照源代码：encode.ts 或 encode.cpp。

已知一序列的UTF8二进制码，将其解码为Unicode codepoint。

fn decode(stream: Vec<u8>) -> Vec<u32>

具体算法参照源代码：decode.cpp

对于长度为 N 的UTF8字符串，查找到第n个字符，其时间复杂度为 O(n)。

由于UTF8是变长编码，所以索引字符不是数组索引的时间复杂度 O(1)，而是和解码的时间复杂度相等同。

（Mozilla的Rust就因为这个问题，在 charAt类方法上龃龉不前，时有反覆。）