Commit 2c3853e

Adopt GB18030-2022
This implements the Unicode Technical Committee recommendation around GB18030-2022 in a manner suitable for this standard, taking into account existing practice and the closeness between GBK and gb18030. In particular, using the text file attached to https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf this does the following:

1. Merges the first set of 18 mappings, which are bidirectional, directly into index gb18030, replacing existing PUA entries. This ends up impacting GBK and gb18030.
2. Encodes the second set of 18 mappings (from PUA to bytes) as an encoder-only table, for both GBK and gb18030.
3. Ignores the third set of 18 mappings (from bytes to code points), as they are already covered by index gb18030 ranges. (Presumably they are included because the recommendation covers the transition from "Previous Mappings" to "Current Mappings" to "Recommended Mappings", whereas we are going directly from "Previous Mappings" to "Recommended Mappings".)

GBK is changed as well because Chromium and WebKit already have code in the wild that impacts GBK to some degree, and no fallout has been recorded. (They currently exclude the encoder-only table for GBK only, but including it makes the most sense compatibility-wise.) Additionally, GBK is already positioned as a rough subset of gb18030 in this standard, with the decoder being shared completely.

Tests: encoding/legacy-mb-schinese has some GB18030-2022 coverage already. This is completed with web-platform-tests/wpt#48239 and web-platform-tests/wpt#48240.

This supersedes #335. This fixes #27 and fixes #312.

This also updates the description of index gb18030 ranges to account for #22 (the change from GB18030-2000 to -2005), which it did not do until now.
1 parent e20f586 commit 2c3853e

36 files changed: 129 additions and 58 deletions

encoding.bs

Lines changed: 75 additions & 4 deletions
@@ -832,7 +832,7 @@ specification, excluding <a>index single-byte</a>, which have their own table:
 <td><a href=index-gb18030.txt>index-gb18030.txt</a>
 <td><a href=gb18030.html>index gb18030 visualization</a>
 <td><a href=gb18030-bmp.html>index gb18030 BMP coverage</a>
-<td>This matches the GB18030-2005 standard for code points encoded as two bytes, except for
+<td>This matches the GB18030-2022 standard for code points encoded as two bytes, except for
 0xA3 0xA0 which maps to U+3000 to be compatible with deployed content. This index covers the
 CJK Unified Ideographs block of Unicode in its entirety. Entries from that block that are above or
 to the left of (the first) U+3000 in the visualization are in the Unicode order.
@@ -845,9 +845,13 @@ specification, excluding <a>index single-byte</a>, which have their own table:
 <td colspan=3><a href=index-gb18030-ranges.txt>index-gb18030-ranges.txt</a>
 <td>This <a>index</a> works different from all others. Listing all code points would result
 in over a million items whereas they can be represented neatly in 207 ranges combined with trivial
-limit checks. It therefore only superficially matches the GB18030-2005 standard for code points
-encoded as four bytes. See also <a>index gb18030 ranges code point</a> and
-<a>index gb18030 ranges pointer</a> below.
+limit checks. It therefore only superficially matches the GB18030-2000 standard for code points
+encoded as four bytes. The change for the GB18030-2005 revision is handled inline by the
+<a>index gb18030 ranges code point</a> and <a>index gb18030 ranges pointer</a> algorithms below
+that accompany this index. And the changes for the GB18030-2022 revision are handled differently
+again to not further increase the number of byte sequences mapping to Private Use code points. The
+relevant Private Use code points are mapped in the <a>gb18030 encoder</a> directly through a side
+table to preserve compatibility with how they were mapped before.
 <tr>
 <td><dfn export>index jis0208</dfn>
 <td><a href=index-jis0208.txt>index-jis0208.txt</a>
@@ -2434,6 +2438,73 @@ consumers of content generated with <a>GBK</a>'s <a for=/>encoder</a>.
 <li><p>If <a>is GBK</a> is true and <var>code point</var> is
 U+20AC, return byte 0x80.
 
+<li>
+<p>If there is a row in the table below whose first column is <var>code point</var>, then return
+the two bytes on the same row listed in the second column:
+
+<table>
+<tr>
+<th>Code point
+<th>Bytes
+<tr>
+<td>U+E78D
+<td>0xA6 0xD9
+<tr>
+<td>U+E78E
+<td>0xA6 0xDA
+<tr>
+<td>U+E78F
+<td>0xA6 0xDB
+<tr>
+<td>U+E790
+<td>0xA6 0xDC
+<tr>
+<td>U+E791
+<td>0xA6 0xDD
+<tr>
+<td>U+E792
+<td>0xA6 0xDE
+<tr>
+<td>U+E793
+<td>0xA6 0xDF
+<tr>
+<td>U+E794
+<td>0xA6 0xEC
+<tr>
+<td>U+E795
+<td>0xA6 0xED
+<tr>
+<td>U+E796
+<td>0xA6 0xF3
+<tr>
+<td>U+E81E
+<td>0xFE 0x59
+<tr>
+<td>U+E826
+<td>0xFE 0x61
+<tr>
+<td>U+E82B
+<td>0xFE 0x66
+<tr>
+<td>U+E82C
+<td>0xFE 0x67
+<tr>
+<td>U+E832
+<td>0xFE 0x6D
+<tr>
+<td>U+E843
+<td>0xFE 0x7E
+<tr>
+<td>U+E854
+<td>0xFE 0x90
+<tr>
+<td>U+E864
+<td>0xFE 0xA0
+</table>
+
+<p class=note>This asymmetric encoder table preserves compatibility with the GB18030-2005
+standard. See also the explanation at <a>index gb18030 ranges</a>.
+
 <li><p>Let <var>pointer</var> be the <a>index pointer</a> for
 <var>code point</var> in <a>index gb18030</a>.
 
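The encoder step added in this hunk can be read as a plain dictionary lookup that runs before the regular search in index gb18030. A minimal Python sketch, assuming the 18 rows in the diff form the complete side table; the names `ENCODER_ONLY` and `encode_via_side_table` are hypothetical and do not appear in the standard:

```python
# Encoder-only side table, transcribed from the table added in this diff.
# Keys are Private Use code points; values are the two bytes to emit.
ENCODER_ONLY = {
    0xE78D: (0xA6, 0xD9), 0xE78E: (0xA6, 0xDA), 0xE78F: (0xA6, 0xDB),
    0xE790: (0xA6, 0xDC), 0xE791: (0xA6, 0xDD), 0xE792: (0xA6, 0xDE),
    0xE793: (0xA6, 0xDF), 0xE794: (0xA6, 0xEC), 0xE795: (0xA6, 0xED),
    0xE796: (0xA6, 0xF3), 0xE81E: (0xFE, 0x59), 0xE826: (0xFE, 0x61),
    0xE82B: (0xFE, 0x66), 0xE82C: (0xFE, 0x67), 0xE832: (0xFE, 0x6D),
    0xE843: (0xFE, 0x7E), 0xE854: (0xFE, 0x90), 0xE864: (0xFE, 0xA0),
}


def encode_via_side_table(code_point):
    """Return the two bytes for a listed Private Use code point, else None.

    Hypothetical helper: a caller would fall through to the regular
    index gb18030 lookup when this returns None.
    """
    pair = ENCODER_ONLY.get(code_point)
    return bytes(pair) if pair is not None else None
```

Because the table is encoder-only, nothing here affects decoding: the same byte pairs decode to the non-Private Use code points that this commit merges into index gb18030.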
index-big5.txt

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 # https://encoding.spec.whatwg.org/
 #
 # Identifier: 8dfc771062e7be0810919082c2c06baa2236147909e0ecc235b1cb9ad782ac82
-# Date: 2018-01-06
+# Date: 2024-09-18
 
 942 0x43F0 䏰 (<CJK Ideograph Extension A>)
 943 0x4C32 䰲 (<CJK Ideograph Extension A>)

index-euc-kr.txt

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 # https://encoding.spec.whatwg.org/
 #
 # Identifier: 1d97134cbf187263585bc8f593ca4196654ed4c7a673f5672eaad4f5d9fdc4ba
-# Date: 2018-01-06
+# Date: 2024-09-18
 
 0 0xAC02 갂 (HANGUL SYLLABLE GAGG)
 1 0xAC03 갃 (HANGUL SYLLABLE GAGS)

index-gb18030-ranges.txt

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 # https://encoding.spec.whatwg.org/
 #
 # Identifier: f963aaa1653f630c523e7b04729fb4e4458f35806c45eb5c179445623138f0c0
-# Date: 2018-01-06
+# Date: 2024-09-18
 
 0 0x0080
 36 0x00A5
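Each row of this file pairs a pointer with the first code point of a range, so a lookup finds the last row whose pointer is at or below the requested one and adds the difference. A rough Python sketch using only the two rows visible in this hunk (the published file has 207), so results are only meaningful for pointers 0 through 37. The limit values and the pointer 7457 special case are taken from the standard's "index gb18030 ranges code point" algorithm rather than from this diff, and should be treated as assumptions here. The names `RANGES` and `ranges_code_point` are hypothetical:

```python
from bisect import bisect_right

# Only the first two rows shown in this diff; the real index has 207 rows.
RANGES = [(0, 0x0080), (36, 0x00A5)]
_POINTERS = [pointer for pointer, _ in RANGES]


def ranges_code_point(pointer):
    """Map a four-byte pointer to a code point, or None if out of range."""
    # Limit checks (assumed from the standard's algorithm, not this diff).
    if pointer < 0 or 39419 < pointer < 189000 or pointer > 1237575:
        return None
    # Inline handling of the GB18030-2005 change (assumed, see lead-in).
    if pointer == 7457:
        return 0xE7C7
    # Last row whose pointer is less than or equal to the requested pointer.
    i = bisect_right(_POINTERS, pointer) - 1
    offset, code_point_offset = RANGES[i]
    return code_point_offset + pointer - offset
```

With the two rows above, pointer 35 lands in the first range (U+00A3) and pointer 36 starts the second (U+00A5), which is how the index skips U+00A4.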

index-gb18030.txt

Lines changed: 20 additions & 20 deletions
@@ -1,8 +1,8 @@
 # For details on index index-gb18030.txt see the Encoding Standard
 # https://encoding.spec.whatwg.org/
 #
-# Identifier: 715f084846f5c6fc9dd31046d0a4d604bd2d88bfe3a22833cea048415e413c70
-# Date: 2018-01-06
+# Identifier: ff1c9a923b5d24f9761b3a2de2c0f07b395f9f6f36519508944de4f0415be81c
+# Date: 2024-09-18
 
 0 0x4E02 丂 (<CJK Ideograph>)
 1 0x4E04 丄 (<CJK Ideograph>)
@@ -7186,13 +7186,13 @@
 7179 0x03C7 χ (GREEK SMALL LETTER CHI)
 7180 0x03C8 ψ (GREEK SMALL LETTER PSI)
 7181 0x03C9 ω (GREEK SMALL LETTER OMEGA)
-7182 0xE78D  (<Private Use>)
-7183 0xE78E  (<Private Use>)
-7184 0xE78F  (<Private Use>)
-7185 0xE790  (<Private Use>)
-7186 0xE791  (<Private Use>)
-7187 0xE792  (<Private Use>)
-7188 0xE793  (<Private Use>)
+7182 0xFE10 ︐ (PRESENTATION FORM FOR VERTICAL COMMA)
+7183 0xFE12 ︒ (PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP)
+7184 0xFE11 ︑ (PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA)
+7185 0xFE13 ︓ (PRESENTATION FORM FOR VERTICAL COLON)
+7186 0xFE14 ︔ (PRESENTATION FORM FOR VERTICAL SEMICOLON)
+7187 0xFE15 ︕ (PRESENTATION FORM FOR VERTICAL EXCLAMATION MARK)
+7188 0xFE16 ︖ (PRESENTATION FORM FOR VERTICAL QUESTION MARK)
 7189 0xFE35 ︵ (PRESENTATION FORM FOR VERTICAL LEFT PARENTHESIS)
 7190 0xFE36 ︶ (PRESENTATION FORM FOR VERTICAL RIGHT PARENTHESIS)
 7191 0xFE39 ︹ (PRESENTATION FORM FOR VERTICAL LEFT TORTOISE SHELL BRACKET)
@@ -7205,14 +7205,14 @@
 7198 0xFE42 ﹂ (PRESENTATION FORM FOR VERTICAL RIGHT CORNER BRACKET)
 7199 0xFE43 ﹃ (PRESENTATION FORM FOR VERTICAL LEFT WHITE CORNER BRACKET)
 7200 0xFE44 ﹄ (PRESENTATION FORM FOR VERTICAL RIGHT WHITE CORNER BRACKET)
-7201 0xE794  (<Private Use>)
-7202 0xE795  (<Private Use>)
+7201 0xFE17 ︗ (PRESENTATION FORM FOR VERTICAL LEFT WHITE LENTICULAR BRACKET)
+7202 0xFE18 ︘ (PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET)
 7203 0xFE3B ︻ (PRESENTATION FORM FOR VERTICAL LEFT BLACK LENTICULAR BRACKET)
 7204 0xFE3C ︼ (PRESENTATION FORM FOR VERTICAL RIGHT BLACK LENTICULAR BRACKET)
 7205 0xFE37 ︷ (PRESENTATION FORM FOR VERTICAL LEFT CURLY BRACKET)
 7206 0xFE38 ︸ (PRESENTATION FORM FOR VERTICAL RIGHT CURLY BRACKET)
 7207 0xFE31 ︱ (PRESENTATION FORM FOR VERTICAL EM DASH)
-7208 0xE796  (<Private Use>)
+7208 0xFE19 ︙ (PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS)
 7209 0xFE33 ︳ (PRESENTATION FORM FOR VERTICAL LOW LINE)
 7210 0xFE34 ︴ (PRESENTATION FORM FOR VERTICAL WAVY LOW LINE)
 7211 0xE797  (<Private Use>)
@@ -23779,27 +23779,27 @@
 23772 0x3447 㑇 (<CJK Ideograph Extension A>)
 23773 0x2E88 ⺈ (CJK RADICAL KNIFE ONE)
 23774 0x2E8B ⺋ (CJK RADICAL SEAL)
-23775 0xE81E  (<Private Use>)
+23775 0x9FB4 龴 (<CJK Ideograph>)
 23776 0x359E 㖞 (<CJK Ideograph Extension A>)
 23777 0x361A 㘚 (<CJK Ideograph Extension A>)
 23778 0x360E 㘎 (<CJK Ideograph Extension A>)
 23779 0x2E8C ⺌ (CJK RADICAL SMALL ONE)
 23780 0x2E97 ⺗ (CJK RADICAL HEART TWO)
 23781 0x396E 㥮 (<CJK Ideograph Extension A>)
 23782 0x3918 㤘 (<CJK Ideograph Extension A>)
-23783 0xE826  (<Private Use>)
+23783 0x9FB5 龵 (<CJK Ideograph>)
 23784 0x39CF 㧏 (<CJK Ideograph Extension A>)
 23785 0x39DF 㧟 (<CJK Ideograph Extension A>)
 23786 0x3A73 㩳 (<CJK Ideograph Extension A>)
 23787 0x39D0 㧐 (<CJK Ideograph Extension A>)
-23788 0xE82B  (<Private Use>)
-23789 0xE82C  (<Private Use>)
+23788 0x9FB6 龶 (<CJK Ideograph>)
+23789 0x9FB7 龷 (<CJK Ideograph>)
 23790 0x3B4E 㭎 (<CJK Ideograph Extension A>)
 23791 0x3C6E 㱮 (<CJK Ideograph Extension A>)
 23792 0x3CE0 㳠 (<CJK Ideograph Extension A>)
 23793 0x2EA7 ⺧ (CJK RADICAL COW)
 23794 0xE831  (<Private Use>)
-23795 0xE832  (<Private Use>)
+23795 0x9FB8 龸 (<CJK Ideograph>)
 23796 0x2EAA ⺪ (CJK RADICAL BOLT OF CLOTH)
 23797 0x4056 䁖 (<CJK Ideograph Extension A>)
 23798 0x415F 䅟 (<CJK Ideograph Extension A>)
@@ -23816,7 +23816,7 @@
 23809 0x44D6 䓖 (<CJK Ideograph Extension A>)
 23810 0x4661 䙡 (<CJK Ideograph Extension A>)
 23811 0x464C 䙌 (<CJK Ideograph Extension A>)
-23812 0xE843  (<Private Use>)
+23812 0x9FB9 龹 (<CJK Ideograph>)
 23813 0x4723 䜣 (<CJK Ideograph Extension A>)
 23814 0x4729 䜩 (<CJK Ideograph Extension A>)
 23815 0x477C 䝼 (<CJK Ideograph Extension A>)
@@ -23833,7 +23833,7 @@
 23826 0x499B 䦛 (<CJK Ideograph Extension A>)
 23827 0x49B7 䦷 (<CJK Ideograph Extension A>)
 23828 0x49B6 䦶 (<CJK Ideograph Extension A>)
-23829 0xE854  (<Private Use>)
+23829 0x9FBA 龺 (<CJK Ideograph>)
 23830 0xE855  (<Private Use>)
 23831 0x4CA3 䲣 (<CJK Ideograph Extension A>)
 23832 0x4C9F 䲟 (<CJK Ideograph Extension A>)
@@ -23849,7 +23849,7 @@
 23842 0x4D18 䴘 (<CJK Ideograph Extension A>)
 23843 0x4D19 䴙 (<CJK Ideograph Extension A>)
 23844 0x4DAE 䶮 (<CJK Ideograph Extension A>)
-23845 0xE864  (<Private Use>)
+23845 0x9FBB 龻 (<CJK Ideograph>)
 23846 0xE468  (<Private Use>)
 23847 0xE469  (<Private Use>)
 23848 0xE46A  (<Private Use>)
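The 18 pointers whose entries change in this file line up with the byte pairs in the encoder-only table in encoding.bs: the side table keeps each replaced Private Use code point encoding to the bytes its pointer used to produce. A small Python check of that correspondence, assuming the two-byte pointer arithmetic of the standard's gb18030 encoder (lead is pointer / 190 + 0x81; trail is pointer % 190 plus 0x40 or 0x41). That formula is not part of this diff, and the names `pointer_to_bytes`, `REPLACED` and `mismatches` are hypothetical:

```python
def pointer_to_bytes(pointer):
    """Two-byte gb18030 sequence for an index pointer (assumed formula)."""
    lead = pointer // 190 + 0x81
    trail = pointer % 190
    offset = 0x40 if trail < 0x3F else 0x41
    return bytes((lead, trail + offset))


# pointer -> (replaced Private Use code point, bytes from the side table),
# both transcribed from this commit's diff.
REPLACED = {
    7182: (0xE78D, "A6 D9"), 7183: (0xE78E, "A6 DA"), 7184: (0xE78F, "A6 DB"),
    7185: (0xE790, "A6 DC"), 7186: (0xE791, "A6 DD"), 7187: (0xE792, "A6 DE"),
    7188: (0xE793, "A6 DF"), 7201: (0xE794, "A6 EC"), 7202: (0xE795, "A6 ED"),
    7208: (0xE796, "A6 F3"), 23775: (0xE81E, "FE 59"), 23783: (0xE826, "FE 61"),
    23788: (0xE82B, "FE 66"), 23789: (0xE82C, "FE 67"), 23795: (0xE832, "FE 6D"),
    23812: (0xE843, "FE 7E"), 23829: (0xE854, "FE 90"), 23845: (0xE864, "FE A0"),
}

# Pointers whose computed bytes disagree with the side table.
mismatches = [
    pointer
    for pointer, (_code_point, hex_bytes) in REPLACED.items()
    if pointer_to_bytes(pointer) != bytes.fromhex(hex_bytes)
]
```

An empty `mismatches` list indicates that encoding the old Private Use code points still yields the same bytes as before the change, while decoding those bytes now yields the new code points.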

index-ibm866.txt

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 # https://encoding.spec.whatwg.org/
 #
 # Identifier: db6fe14a559d1601a7667338d83704773d5708dbc641e1ad3c5e21405770f05e
-# Date: 2018-01-06
+# Date: 2024-09-18
 
 0 0x0410 А (CYRILLIC CAPITAL LETTER A)
 1 0x0411 Б (CYRILLIC CAPITAL LETTER BE)

index-iso-2022-jp-katakana.txt

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 # https://encoding.spec.whatwg.org/
 #
 # Identifier: 6ffc12c11f6eab1ccb3dada740d9b0db096ef0b0783c3bd5ec951dcb4a44b95e
-# Date: 2018-01-06
+# Date: 2024-09-18
 
 0 0x3002 。 (IDEOGRAPHIC FULL STOP)
 1 0x300C 「 (LEFT CORNER BRACKET)

index-iso-8859-10.txt

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 # https://encoding.spec.whatwg.org/
 #
 # Identifier: 02c2b5590d8ccda9931008c471f6ee2c590b2c8fe5e6ccb3b08638115d778507
-# Date: 2018-01-06
+# Date: 2024-09-18
 
 0 0x0080 € (<control>)
 1 0x0081  (<control>)

index-iso-8859-13.txt

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 # https://encoding.spec.whatwg.org/
 #
 # Identifier: 40736338e964ab520407cebcb01329f8d450abf6ce12bf88b74b655b60e43300
-# Date: 2018-01-06
+# Date: 2024-09-18
 
 0 0x0080 € (<control>)
 1 0x0081  (<control>)

index-iso-8859-14.txt

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 # https://encoding.spec.whatwg.org/
 #
 # Identifier: 2c8651cfc08b1f35b17919ee5379f2fa006af3ec809f11b3b7f470785580542b
-# Date: 2018-01-06
+# Date: 2024-09-18
 
 0 0x0080 € (<control>)
 1 0x0081  (<control>)
