Commit f286608

UCA 16.0 delta 6
From Ken:

O.k., a new day. Time to start in on the meat of the matter -- the whole script additions in primary order. I start with the unicameral script additions. Those are a bit simpler than any new bicameral scripts (Garay for 16.0).

(See above in the Delta 3 discussion for the detailed rationale for the placement of the new scripts -- I won't replicate that discussion here, but just proceed, based on the placement decisions already made.)

1. Kirat Rai

I start with this one, even though it has some complexities, because it is fresh in our minds from the extended discussion of the unicodetools PR.

First move the entire range of Kirat Rai alphabetic records (in code point order) into unidata-16.0.0d6.txt, starting after Tangsa. This includes all the consonants, all the vowel signs, which in Kirat Rai are actually full standalone letters, and the two killers:

16D40;KIRAT RAI SIGN ANUSVARA;Lm;;;;;;
...
16D6C;KIRAT RAI SIGN SAAT;Lm;;;;;;

I omit the three punctuation marks for now. Those end up elsewhere in DUCET, and it is more efficient to deal with all the punctuation additions in a separate delta later on, after the alphabetic runs have been established.

As noted in discussion, the ANUSVARA, TONPI (bindu), and VISARGA for Kirat Rai are encoded as standalone modifier letters, rather than as combining marks, and we've already decided not to try equating them to the various combining candrabindu, visarga, etc., that use generic weights shared with the Devanagari archetypes of the combining marks. The proposal suggests that they be given primary distinctions and be left in code point order. It might be better to move them to the end of the list of consonants, but without further rationale provided, the simplest solution here is to do as the proposal suggests. Indeed, the proposal states that "This sort order with anusvara, tonpi, and visarga sorting first has been approved by AKRS." So we just leave it that way for DUCET.

The next complication results from canonical equivalences for three Kirat Rai vowels: AI, O, AU. In cases like this, the way to get the sifter to introduce contractions is to surround the records in question with the CONTRACTION pragma, to wit:

CONTRACTION
16D68;KIRAT RAI VOWEL SIGN AI;Lo;16D67 16D67;;;;;
16D69;KIRAT RAI VOWEL SIGN O;Lo;16D63 16D67;;;;;
16D6A;KIRAT RAI VOWEL SIGN AU;Lo;16D69 16D67;;;;;
DEFAULT

In cases like this, as for Tamil, Kannada, etc., I also drop a comment into unidata.txt with a somewhat redundant explanation, to remind everybody what is going on here.

The effect of the CONTRACTION pragma is to tell the sifter that for the range of entries where it is in effect, the sifter is to go ahead and assign a primary weight to the code point and *also* generate a contraction entry from the decomposition, giving it the same weight as the atomic character code point. In the absence of the CONTRACTION pragma, such an entry is instead just entered into allkeys.txt with the sequence of weights from the decomposition, and does not have its own primary weight.

But wait, there's more. Because of the strange encoding of Kirat Rai vowel signs, we have a canonical closure problem for 16D6A. The full decomposition for 16D6A is <16D63, 16D67, 16D67>, and we need to weight that sequence with the same primary weight via contraction.

Fortunately, this is not the first time this problem has been encountered for the sifter. A similar problem of canonical closure for a recursive canonical decomposition occurs for 0CCB in Kannada and 0DDD in Sinhala. The mechanism baked into the sifter to deal with this is a "secondary decomposition", which can be added to the input entry in the decomposition field. As a first step for handling 16D6A, I'm putting the following entry into unidata-16.0.0d6.txt:

16D6A;KIRAT RAI VOWEL SIGN AU;Lo;16D69 16D67, 16D63 16D67 16D67;;;;;

The comma delimitation in the decomposition allows for adding a secondary decomposition.
When enclosed within the CONTRACTION pragma, this generates a second contraction using the secondary decomposition information. This *almost* solves the problem for Kirat Rai, as it was solved for Kannada and Sinhala.

Unfortunately, the way the Kirat Rai vowels work, there is yet *another* sequence that is canonically equivalent: <16D63, 16D68>. That sequence is equivalent to <16D63, 16D67, 16D67>, so it also needs a contraction that is weighted with the same primary weight. The sifter code currently treats the Kannada and Sinhala cases essentially as the extent of the problem -- only a *single* secondary contraction is allowed for in the code. The syntax for this currently is

decompfield := decomp (, decomp)?

rather than

decompfield := decomp (, decomp)*

because the code was written to just look for the comma and then process a *single* secondary decomposition value, rather than being written to expect and process an indefinite *list* of secondary decompositions.

To fix this for Kirat Rai (and future-proof against any similar cases), I'm going to have to do some considerable refactoring of the relevant decomposition handling code in the sifter, which is a tricky and sensitive part of the code. For now I am just going to postpone that code work until all the rest of the 16.0 input for UCA has been taken care of.

But in the meantime, while adding the first relevant secondary decomposition for 16D6A to unidata.txt, I also reversed the order of the secondary decomposition for the 0CCB and 0DDD entries in unidata.txt. Now the field with the secondary decomposition better matches what the code expects: the *first* entry is the formal canonical decomposition string from the UCD (which is then recursively decomposed internal to the sifter processing), followed by a secondary decomposition, which in the Kannada and Sinhala cases is the full decomposition.
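
The two forms of the decomposition-field grammar quoted above can be contrasted in a short sketch. This is only an illustration in Python, not the sifter's actual code: the function name `parse_decomp_field` and its `max_secondary` parameter are hypothetical, and the layout assumed is the one used in this delta (canonical decomposition first, secondary decompositions after commas).

```python
def parse_decomp_field(field, max_secondary=None):
    """Split a unidata.txt decomposition field into a list of decompositions.

    Hypothetical helper for illustration only.
    The first item is the canonical decomposition; any items after a comma
    are secondary decompositions.
    max_secondary=1 models `decomp (, decomp)?`;
    max_secondary=None models `decomp (, decomp)*`.
    """
    if not field.strip():
        return []  # record has no decomposition at all
    parts = [part.split() for part in field.split(",")]
    if any(not part for part in parts):
        raise ValueError(f"empty decomposition in field: {field!r}")
    if max_secondary is not None and len(parts) - 1 > max_secondary:
        raise ValueError(
            f"{len(parts) - 1} secondary decompositions, "
            f"at most {max_secondary} supported"
        )
    # Each decomposition becomes a tuple of integer code points.
    return [tuple(int(cp, 16) for cp in part) for part in parts]


# The 16D6A record from this delta: one secondary decomposition.
record = "16D6A;KIRAT RAI VOWEL SIGN AU;Lo;16D69 16D67, 16D63 16D67 16D67;;;;;"
primary, *secondary = parse_decomp_field(record.split(";")[3], max_secondary=1)
```

Under the `(, decomp)?` form, a field that also listed 16D63 16D68 as a second secondary decomposition would be rejected; under the `(, decomp)*` form it would simply yield a three-item list.
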
The impact on the output in allkeys.txt is just to invert the order of two contraction lines, but it does not affect any of the weighting per se. The other side effect is that the output log will stop warning about encountering a "Non-binary" canonical decomposition for 0CCB and 0DDD in the recursive decomposition.

Generate allkeys.txt and verify that Kirat Rai weights are as expected, with special attention to the results for 16D68, 16D69, and 16D6A. Also examine the impact of the secondary decomposition change for 0CCB and 0DDD.

O.k., this discussion for Kirat Rai and its implications is hairy enough that I'm going to make this its own delta, without introducing more scripts into this set of changes.

45 more down, 1004 to go.

Archive this delta 6: unidata-16.0.0d6.txt (1559396 bytes, 10/08/2023)
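
When checking the 16D6A results, it helps to know which sequences should share its primary weight. The Python sketch below enumerates them from the three decomposition mappings given in this delta. It is not part of the sifter, and `DECOMP` and `equivalent_sequences` are names introduced here for illustration. It expands composites and recomposes adjacent pairs until nothing new appears, which is adequate for these three records but is not a general implementation of canonical closure (it ignores combining classes and composition exclusions, for instance).

```python
# Canonical decomposition mappings for the three Kirat Rai vowels,
# copied from the records in this delta.
DECOMP = {
    0x16D68: (0x16D67, 0x16D67),  # AI -> E E
    0x16D69: (0x16D63, 0x16D67),  # O  -> AA E
    0x16D6A: (0x16D69, 0x16D67),  # AU -> O E
}


def equivalent_sequences(cp):
    """Return every sequence reachable from cp by expanding or recomposing."""
    compose = {pair: c for c, pair in DECOMP.items()}
    seen = {(cp,)}
    todo = [(cp,)]
    while todo:
        seq = todo.pop()
        neighbours = []
        for i, c in enumerate(seq):
            if c in DECOMP:  # expand one composite in place
                neighbours.append(seq[:i] + DECOMP[c] + seq[i + 1:])
        for i in range(len(seq) - 1):
            pair = seq[i:i + 2]
            if pair in compose:  # recompose one adjacent pair
                neighbours.append(seq[:i] + (compose[pair],) + seq[i + 2:])
        for n in neighbours:
            if n not in seen:
                seen.add(n)
                todo.append(n)
    return seen - {(cp,)}  # exclude the atomic character itself


for seq in sorted(equivalent_sequences(0x16D6A)):
    print(" ".join(f"{c:04X}" for c in seq))
# prints:
# 16D63 16D67 16D67
# 16D63 16D68
# 16D69 16D67
```

The first and third lines correspond to the two decompositions already present in the 16D6A entry; the middle line is the additional sequence that a single secondary decomposition cannot cover.
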
1 parent b303a4c commit f286608

2 files changed

+4202
-4092
lines changed

c/uca/sifter/unidata.txt

Lines changed: 71 additions & 10 deletions
@@ -1,5 +1,5 @@
 # unidata-16.0.0.txt
-# Date: 2023-10-07, 00:00:00 GMT [KW]
+# Date: 2023-10-08, 00:00:00 GMT [KW]
 # © 2023 Unicode®, Inc.
 # For terms of use, see https://www.unicode.org/terms_of_use.html
 #
@@ -9,7 +9,7 @@
 # Default Unicode Collation Element Table (DUCET) for
 # the Unicode Collation Algorithm.
 #
-# Version 16.0.0 draft 5 (Unicode Version: 16.0.0)
+# Version 16.0.0 draft 6 (Unicode Version: 16.0.0)
 # based on Unicode data file UnicodeData-16.0.0d7.txt
 # Ordering for Unicode 16.0
 #
@@ -21399,16 +21399,16 @@ DEFAULT
 
 # Kannada two-part vowels collate as units, not
 # by their decompositions.
-# Added a second decomposition for 0CCB, to deal with
-# the canonical equivalence of 0CCA --> 0CC6 0CC2
 
 CONTRACTION
 
 0CC7;KANNADA VOWEL SIGN EE;Mc;0CC6 0CD5;;;;;
 0CC8;KANNADA VOWEL SIGN AI;Mc;0CC6 0CD6;;;;;
 0CCA;KANNADA VOWEL SIGN O;Mc;0CC6 0CC2;;;;;
-#0CCB;KANNADA VOWEL SIGN OO;Mc;0CC6 0CC2 0CD5;;;;;
-0CCB;KANNADA VOWEL SIGN OO;Mc;0CC6 0CC2 0CD5, 0CCA 0CD5;;;;;
+# Added a second decomposition for 0CCB, to deal with
+# the canonical equivalence of 0CCA --> 0CC6 0CC2
+#0CCB;KANNADA VOWEL SIGN OO;Mc;0CCA 0CD5;;;;;
+0CCB;KANNADA VOWEL SIGN OO;Mc;0CCA 0CD5, 0CC6 0CC2 0CD5;;;;;
 
 DEFAULT
 
@@ -21636,14 +21636,15 @@ DEFAULT
 
 # Sinhala two-part vowels collate as units, not
 # by their decompositions.
-#
-# A second decomposition is added for 0DDD, to deal
-# with the canonical equivalence of 0DDC -> 0DD9 0DCF
 
 CONTRACTION
 
 0DDC;SINHALA VOWEL SIGN KOMBUVA HAA AELA-PILLA;Mc;0DD9 0DCF;;;;;
-0DDD;SINHALA VOWEL SIGN KOMBUVA HAA DIGA AELA-PILLA;Mc;0DD9 0DCF 0DCA, 0DDC 0DCA;;;;;
+# A second decomposition is added for 0DDD, to deal
+# with the canonical equivalence of 0DDC -> 0DD9 0DCF
+# in order to maintain canonical closure.
+# 0DDD;SINHALA VOWEL SIGN KOMBUVA HAA DIGA AELA-PILLA;Mc;0DDC 0DCA;;;;;
+0DDD;SINHALA VOWEL SIGN KOMBUVA HAA DIGA AELA-PILLA;Mc;0DDC 0DCA, 0DD9 0DCF 0DCA;;;;;
 0DDE;SINHALA VOWEL SIGN KOMBUVA HAA GAYANUKITTA;Mc;0DD9 0DDF;;;;;
 
 DEFAULT
@@ -33774,6 +33775,66 @@ A4F7;LISU LETTER OE;Lo;;;;;;
 16ABD;TANGSA LETTER CHA;Lo;;;;;;
 16ABE;TANGSA LETTER ZA;Lo;;;;;;
 
+# Kirat Rai script starts here
+
+16D40;KIRAT RAI SIGN ANUSVARA;Lm;;;;;;
+16D41;KIRAT RAI SIGN TONPI;Lm;;;;;;
+16D42;KIRAT RAI SIGN VISARGA;Lm;;;;;;
+16D43;KIRAT RAI LETTER A;Lo;;;;;;
+16D44;KIRAT RAI LETTER KA;Lo;;;;;;
+16D45;KIRAT RAI LETTER KHA;Lo;;;;;;
+16D46;KIRAT RAI LETTER GA;Lo;;;;;;
+16D47;KIRAT RAI LETTER GHA;Lo;;;;;;
+16D48;KIRAT RAI LETTER NGA;Lo;;;;;;
+16D49;KIRAT RAI LETTER CA;Lo;;;;;;
+16D4A;KIRAT RAI LETTER CHA;Lo;;;;;;
+16D4B;KIRAT RAI LETTER JA;Lo;;;;;;
+16D4C;KIRAT RAI LETTER JHA;Lo;;;;;;
+16D4D;KIRAT RAI LETTER NYA;Lo;;;;;;
+16D4E;KIRAT RAI LETTER TTA;Lo;;;;;;
+16D4F;KIRAT RAI LETTER TTHA;Lo;;;;;;
+16D50;KIRAT RAI LETTER DDA;Lo;;;;;;
+16D51;KIRAT RAI LETTER DDHA;Lo;;;;;;
+16D52;KIRAT RAI LETTER TA;Lo;;;;;;
+16D53;KIRAT RAI LETTER THA;Lo;;;;;;
+16D54;KIRAT RAI LETTER DA;Lo;;;;;;
+16D55;KIRAT RAI LETTER DHA;Lo;;;;;;
+16D56;KIRAT RAI LETTER NA;Lo;;;;;;
+16D57;KIRAT RAI LETTER PA;Lo;;;;;;
+16D58;KIRAT RAI LETTER PHA;Lo;;;;;;
+16D59;KIRAT RAI LETTER BA;Lo;;;;;;
+16D5A;KIRAT RAI LETTER BHA;Lo;;;;;;
+16D5B;KIRAT RAI LETTER MA;Lo;;;;;;
+16D5C;KIRAT RAI LETTER YA;Lo;;;;;;
+16D5D;KIRAT RAI LETTER RA;Lo;;;;;;
+16D5E;KIRAT RAI LETTER LA;Lo;;;;;;
+16D5F;KIRAT RAI LETTER VA;Lo;;;;;;
+16D60;KIRAT RAI LETTER SA;Lo;;;;;;
+16D61;KIRAT RAI LETTER SHA;Lo;;;;;;
+16D62;KIRAT RAI LETTER HA;Lo;;;;;;
+16D63;KIRAT RAI VOWEL SIGN AA;Lo;;;;;;
+16D64;KIRAT RAI VOWEL SIGN I;Lo;;;;;;
+16D65;KIRAT RAI VOWEL SIGN U;Lo;;;;;;
+16D66;KIRAT RAI VOWEL SIGN UE;Lo;;;;;;
+16D67;KIRAT RAI VOWEL SIGN E;Lo;;;;;;
+
+# Kirat Rai two-part and three-part vowels collate as units, not
+# by their decompositions.
+
+CONTRACTION
+
+16D68;KIRAT RAI VOWEL SIGN AI;Lo;16D67 16D67;;;;;
+16D69;KIRAT RAI VOWEL SIGN O;Lo;16D63 16D67;;;;;
+# The vowel sign au has a complex decomposition that recurses.
+# Add a secondary decomposition to 16D6A for canonical closure.
+# 16D6A;KIRAT RAI VOWEL SIGN AU;Lo;16D69 16D67;;;;;
+16D6A;KIRAT RAI VOWEL SIGN AU;Lo;16D69 16D67, 16D63 16D67 16D67;;;;;
+
+DEFAULT
+
+16D6B;KIRAT RAI SIGN VIRAMA;Lm;;;;;;
+16D6C;KIRAT RAI SIGN SAAT;Lm;;;;;;
+
 # Aegean syllabic scripts start here
 
 # Linear B script starts here
