Commit 709e4c4

Consistency of InSC and Alpha/Dia/Ext (#672)
1 parent 804018d commit 709e4c4

2 files changed: +35 -11 lines changed

unicodetools/data/ucd/dev/PropList.txt

Lines changed: 20 additions & 4 deletions
@@ -1,5 +1,5 @@
 # PropList-16.0.0.txt
-# Date: 2024-01-10, 17:59:20 GMT
+# Date: 2024-01-27, 00:32:10 GMT
 # © 2023 Unicode®, Inc.
 # Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
 # For terms of use, see https://www.unicode.org/terms_of_use.html
@@ -948,6 +948,7 @@ FA70..FAD9 ; Ideographic # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70..CJK COM
 0D3B..0D3C ; Diacritic # Mn [2] MALAYALAM SIGN VERTICAL BAR VIRAMA..MALAYALAM SIGN CIRCULAR VIRAMA
 0D4D ; Diacritic # Mn MALAYALAM SIGN VIRAMA
 0DCA ; Diacritic # Mn SINHALA SIGN AL-LAKUNA
+0E3A ; Diacritic # Mn THAI CHARACTER PHINTHU
 0E47..0E4C ; Diacritic # Mn [6] THAI CHARACTER MAITAIKHU..THAI CHARACTER THANTHAKHAT
 0E4E ; Diacritic # Mn THAI CHARACTER YAMAKKAN
 0EBA ; Diacritic # Mn LAO SIGN PALI VIRAMA
@@ -971,9 +972,11 @@ FA70..FAD9 ; Ideographic # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70..CJK COM
 135D..135F ; Diacritic # Mn [3] ETHIOPIC COMBINING GEMINATION AND VOWEL LENGTH MARK..ETHIOPIC COMBINING GEMINATION MARK
 1714 ; Diacritic # Mn TAGALOG SIGN VIRAMA
 1715 ; Diacritic # Mc TAGALOG SIGN PAMUDPOD
+1734 ; Diacritic # Mc HANUNOO SIGN PAMUDPOD
 17C9..17D3 ; Diacritic # Mn [11] KHMER SIGN MUUSIKATOAN..KHMER SIGN BATHAMASAT
 17DD ; Diacritic # Mn KHMER SIGN ATTHACAN
 1939..193B ; Diacritic # Mn [3] LIMBU SIGN MUKPHRENG..LIMBU SIGN SA-I
+1A60 ; Diacritic # Mn TAI THAM SIGN SAKOT
 1A75..1A7C ; Diacritic # Mn [8] TAI THAM SIGN TONE-1..TAI THAM SIGN KHUEN-LUE KARAN
 1A7F ; Diacritic # Mn TAI THAM COMBINING CRYPTOGRAMMIC DOT
 1AB0..1ABD ; Diacritic # Mn [14] COMBINING DOUBLED CIRCUMFLEX ACCENT..COMBINING PARENTHESES BELOW
@@ -984,6 +987,8 @@ FA70..FAD9 ; Ideographic # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70..CJK COM
 1B6B..1B73 ; Diacritic # Mn [9] BALINESE MUSICAL SYMBOL COMBINING TEGEH..BALINESE MUSICAL SYMBOL COMBINING GONG
 1BAA ; Diacritic # Mc SUNDANESE SIGN PAMAAEH
 1BAB ; Diacritic # Mn SUNDANESE SIGN VIRAMA
+1BE6 ; Diacritic # Mn BATAK SIGN TOMPI
+1BF2..1BF3 ; Diacritic # Mc [2] BATAK PANGOLAT..BATAK PANONGONAN
 1C36..1C37 ; Diacritic # Mn [2] LEPCHA SIGN RAN..LEPCHA SIGN NUKTA
 1C78..1C7D ; Diacritic # Lm [6] OL CHIKI MU TTUDDAG..OL CHIKI AHAD
 1CD0..1CD2 ; Diacritic # Mn [3] VEDIC TONE KARSHANA..VEDIC TONE PRENKHA
@@ -1022,6 +1027,8 @@ A720..A721 ; Diacritic # Sk [2] MODIFIER LETTER STRESS AND HIGH TONE..MODIF
 A788 ; Diacritic # Lm MODIFIER LETTER LOW CIRCUMFLEX ACCENT
 A789..A78A ; Diacritic # Sk [2] MODIFIER LETTER COLON..MODIFIER LETTER SHORT EQUALS SIGN
 A7F8..A7F9 ; Diacritic # Lm [2] MODIFIER LETTER CAPITAL H WITH STROKE..MODIFIER LETTER SMALL LIGATURE OE
+A806 ; Diacritic # Mn SYLOTI NAGRI SIGN HASANTA
+A82C ; Diacritic # Mn SYLOTI NAGRI SIGN ALTERNATE HASANTA
 A8C4 ; Diacritic # Mn SAURASHTRA SIGN VIRAMA
 A8E0..A8F1 ; Diacritic # Mn [18] COMBINING DEVANAGARI DIGIT ZERO..COMBINING DEVANAGARI SIGN AVAGRAHA
 A92B..A92D ; Diacritic # Mn [3] KAYAH LI TONE PLOPHU..KAYAH LI TONE CALYA PLOPHU
@@ -1055,6 +1062,8 @@ FFE3 ; Diacritic # Sk FULLWIDTH MACRON
 10780..10785 ; Diacritic # Lm [6] MODIFIER LETTER SMALL CAPITAL AA..MODIFIER LETTER SMALL B WITH HOOK
 10787..107B0 ; Diacritic # Lm [42] MODIFIER LETTER SMALL DZ DIGRAPH..MODIFIER LETTER SMALL V WITH RIGHT HOOK
 107B2..107BA ; Diacritic # Lm [9] MODIFIER LETTER SMALL CAPITAL Y..MODIFIER LETTER SMALL S WITH CURL
+10A38..10A3A ; Diacritic # Mn [3] KHAROSHTHI SIGN BAR ABOVE..KHAROSHTHI SIGN DOT BELOW
+10A3F ; Diacritic # Mn KHAROSHTHI VIRAMA
 10AE5..10AE6 ; Diacritic # Mn [2] MANICHAEAN ABBREVIATION MARK ABOVE..MANICHAEAN ABBREVIATION MARK BELOW
 10D22..10D23 ; Diacritic # Lo [2] HANIFI ROHINGYA MARK SAKIN..HANIFI ROHINGYA MARK NA KHONNA
 10D24..10D27 ; Diacritic # Mn [4] HANIFI ROHINGYA SIGN HARBAHAY..HANIFI ROHINGYA SIGN TASSI
@@ -1073,12 +1082,13 @@ FFE3 ; Diacritic # Sk FULLWIDTH MACRON
 11235 ; Diacritic # Mc KHOJKI SIGN VIRAMA
 11236 ; Diacritic # Mn KHOJKI SIGN NUKTA
 112E9..112EA ; Diacritic # Mn [2] KHUDAWADI SIGN NUKTA..KHUDAWADI SIGN VIRAMA
-1133C ; Diacritic # Mn GRANTHA SIGN NUKTA
+1133B..1133C ; Diacritic # Mn [2] COMBINING BINDU BELOW..GRANTHA SIGN NUKTA
 1134D ; Diacritic # Mc GRANTHA SIGN VIRAMA
 11366..1136C ; Diacritic # Mn [7] COMBINING GRANTHA DIGIT ZERO..COMBINING GRANTHA DIGIT SIX
 11370..11374 ; Diacritic # Mn [5] COMBINING GRANTHA LETTER A..COMBINING GRANTHA LETTER PA
 113CE ; Diacritic # Mn TULU-TIGALARI SIGN VIRAMA
 113CF ; Diacritic # Mc TULU-TIGALARI SIGN LOOPED VIRAMA
+113D0 ; Diacritic # Mn TULU-TIGALARI CONJOINER
 113D2 ; Diacritic # Mn TULU-TIGALARI GEMINATION MARK
 113D3 ; Diacritic # Lo TULU-TIGALARI SIGN PLUTA
 113E1..113E2 ; Diacritic # Mn [2] TULU-TIGALARI VEDIC TONE SVARITA..TULU-TIGALARI VEDIC TONE ANUDATTA
@@ -1102,11 +1112,14 @@ FFE3 ; Diacritic # Sk FULLWIDTH MACRON
 11D42 ; Diacritic # Mn MASARAM GONDI SIGN NUKTA
 11D44..11D45 ; Diacritic # Mn [2] MASARAM GONDI SIGN HALANTA..MASARAM GONDI VIRAMA
 11D97 ; Diacritic # Mn GUNJALA GONDI VIRAMA
+11F41 ; Diacritic # Mc KAWI SIGN KILLER
+11F42 ; Diacritic # Mn KAWI CONJOINER
 11F5A ; Diacritic # Mn KAWI SIGN NUKTA
 13447..13455 ; Diacritic # Mn [15] EGYPTIAN HIEROGLYPH MODIFIER DAMAGED AT TOP START..EGYPTIAN HIEROGLYPH MODIFIER DAMAGED
 1612F ; Diacritic # Mn GURUNG KHEMA SIGN THOLHOMA
 16AF0..16AF4 ; Diacritic # Mn [5] BASSA VAH COMBINING HIGH TONE..BASSA VAH COMBINING HIGH-LOW TONE
 16B30..16B36 ; Diacritic # Mn [7] PAHAWH HMONG MARK CIM TUB..PAHAWH HMONG MARK CIM TAUM
+16D6B..16D6C ; Diacritic # Lm [2] KIRAT RAI SIGN VIRAMA..KIRAT RAI SIGN SAAT
 16F8F..16F92 ; Diacritic # Mn [4] MIAO TONE RIGHT..MIAO TONE BELOW
 16F93..16F9F ; Diacritic # Lm [13] MIAO LETTER TONE-2..MIAO LETTER REFORMED TONE-8
 16FF0..16FF1 ; Diacritic # Mc [2] VIETNAMESE ALTERNATE READING MARK CA..VIETNAMESE ALTERNATE READING MARK NHAY
@@ -1129,14 +1142,16 @@ FFE3 ; Diacritic # Sk FULLWIDTH MACRON
 1E944..1E946 ; Diacritic # Mn [3] ADLAM ALIF LENGTHENER..ADLAM GEMINATION MARK
 1E948..1E94A ; Diacritic # Mn [3] ADLAM CONSONANT MODIFIER..ADLAM NUKTA

-# Total code points: 1160
+# Total code points: 1178

 # ================================================

 00B7 ; Extender # Po MIDDLE DOT
 02D0..02D1 ; Extender # Lm [2] MODIFIER LETTER TRIANGULAR COLON..MODIFIER LETTER HALF TRIANGULAR COLON
 0640 ; Extender # Lm ARABIC TATWEEL
 07FA ; Extender # Lm NKO LAJANYALAN
+0A71 ; Extender # Mn GURMUKHI ADDAK
+0AFB ; Extender # Mn GUJARATI SIGN SHADDA
 0B55 ; Extender # Mn ORIYA SIGN OVERLINE
 0E46 ; Extender # Lm THAI CHARACTER MAIYAMOK
 0EC6 ; Extender # Lm LAO KO LA
@@ -1161,6 +1176,7 @@ FF70 ; Extender # Lm HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND
 10D4E ; Extender # Lm GARAY VOWEL LENGTH MARK
 10D6A ; Extender # Mn GARAY CONSONANT GEMINATION MARK
 10D6F ; Extender # Lm GARAY REDUPLICATION MARK
+11237 ; Extender # Mn KHOJKI SIGN SHADDA
 1135D ; Extender # Lo GRANTHA SIGN PLUTA
 113D2 ; Extender # Mn TULU-TIGALARI GEMINATION MARK
 113D3 ; Extender # Lo TULU-TIGALARI SIGN PLUTA
@@ -1173,7 +1189,7 @@ FF70 ; Extender # Lm HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND
 1E5EF ; Extender # Mn OL ONAL SIGN IKIR
 1E944..1E946 ; Extender # Mn [3] ADLAM ALIF LENGTHENER..ADLAM GEMINATION MARK

-# Total code points: 56
+# Total code points: 59

 # ================================================
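The two updated "Total code points" comments can be cross-checked against the data lines this commit adds and removes. The following is a minimal Python sketch, not part of the commit: the helper names `count_code_points` and `net_change` are hypothetical, and the `ADDED` and `REMOVED` lists are copied from the changed lines of the PropList.txt diff above.

```python
import re

# Matches a PropList-style data line: "XXXX[..YYYY] ; Property # comment".
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def count_code_points(lines, prop):
    """Sum the sizes of all code point ranges listed for the given property."""
    total = 0
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group(3) != prop:
            continue
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)
        total += last - first + 1
    return total


# Data lines added by this commit (copied from the diff).
ADDED = [
    "0E3A ; Diacritic # Mn THAI CHARACTER PHINTHU",
    "1734 ; Diacritic # Mc HANUNOO SIGN PAMUDPOD",
    "1A60 ; Diacritic # Mn TAI THAM SIGN SAKOT",
    "1BE6 ; Diacritic # Mn BATAK SIGN TOMPI",
    "1BF2..1BF3 ; Diacritic # Mc [2] BATAK PANGOLAT..BATAK PANONGONAN",
    "A806 ; Diacritic # Mn SYLOTI NAGRI SIGN HASANTA",
    "A82C ; Diacritic # Mn SYLOTI NAGRI SIGN ALTERNATE HASANTA",
    "10A38..10A3A ; Diacritic # Mn [3] KHAROSHTHI SIGN BAR ABOVE..KHAROSHTHI SIGN DOT BELOW",
    "10A3F ; Diacritic # Mn KHAROSHTHI VIRAMA",
    "1133B..1133C ; Diacritic # Mn [2] COMBINING BINDU BELOW..GRANTHA SIGN NUKTA",
    "113D0 ; Diacritic # Mn TULU-TIGALARI CONJOINER",
    "11F41 ; Diacritic # Mc KAWI SIGN KILLER",
    "11F42 ; Diacritic # Mn KAWI CONJOINER",
    "16D6B..16D6C ; Diacritic # Lm [2] KIRAT RAI SIGN VIRAMA..KIRAT RAI SIGN SAAT",
    "0A71 ; Extender # Mn GURMUKHI ADDAK",
    "0AFB ; Extender # Mn GUJARATI SIGN SHADDA",
    "11237 ; Extender # Mn KHOJKI SIGN SHADDA",
]

# Data lines removed by this commit (copied from the diff).
REMOVED = [
    "1133C ; Diacritic # Mn GRANTHA SIGN NUKTA",
]


def net_change(prop):
    """Net number of code points gained by the property in this commit."""
    return count_code_points(ADDED, prop) - count_code_points(REMOVED, prop)


print(1160 + net_change("Diacritic"))  # 1178
print(56 + net_change("Extender"))     # 59
```

The results agree with the new totals in the diff: Diacritic gains 18 code points (1160 to 1178) and Extender gains 3 (56 to 59).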

unicodetools/src/main/resources/org/unicode/text/UCD/UnicodeInvariantTest.txt

Lines changed: 15 additions & 7 deletions
@@ -523,6 +523,10 @@ Show [\u20b9]
 # [\p{Alphabetic}] ∥ \p{Script=Common}
 # & [\p{Decomposition_Type=None} \p{Decomposition_Type=Canonical}]

+## Alphabetic, Diacritic, and Extender.
+
+# Consistency with Indic_Syllabic_Category.
+
 # The UTC 172 script ad hoc report (L2/22-128) item VII 27 “Nonalphabetic bindus” points out that
 # “Most characters with InSC=Bindu have Alphabetic=Yes”
 # and adjustments have been made so that all current bindus have Alphabetic=Yes.
@@ -535,17 +539,21 @@ Let $nonAlphabeticBindus = []
 Let $nonAlphabeticDependentVowels = [\N{ORIYA SIGN OVERLINE}\N{THAI CHARACTER MAITAIKHU}\N{LIMBU SIGN KEMPHRENG}\N{SHARADA VOWEL MODIFIER MARK}\N{SHARADA EXTRA SHORT VOWEL MARK}]
 [\p{InSC=Vowel_Dependent} - \p{Alphabetic}] = $nonAlphabeticDependentVowels

+# Several invariants from L2/24-009 item 2.2.
+\p{InSC=Gemination_Mark} ⊆ \p{Extender}
+\p{InSC=Nukta} ⊆ \p{Diacritic}
+[\p{InSC=Virama}\p{InSC=Pure_Killer}\p{InSC=Reordering_Killer}] ⊆ \p{Diacritic}
+\p{InSC=Invisible_Stacker} ⊆ \p{Diacritic}
+Let $nonAlphabeticAvagrahas = [\N{TIBETAN MARK PALUTA}] # A punctuation mark.
+[\p{InSC=Avagraha} - $nonAlphabeticAvagrahas] ⊆ \p{Alphabetic}
+
+# Name-based checks.
+
 # Combining letters are often alphabetic (medievalist abbreviations).
 # The others are diacritic (cantillation marks, phonetics).
-# See 177-CXX.
+# See 177-C52.
 \p{name=/COMBINING .* LETTER/} ⊆ [\p{Alphabetic}\p{Diacritic}]

-# Nuktas should probably be diacritic, but as of 15.1 this is only the case of
-# those that have NUKTA in their name.
-# See https://github.com/unicode-org/properties/issues/195#issuecomment-1804962555.
-Let $nonDiacriticNuktas = [\u1BE6\U00010A38\U00010A39\U00010A3A\U0001133B]
-[\p{InSc=Nukta} - \p{Diacritic}] = $nonDiacriticNuktas
-
 ## Joining_Type and Joining_Group
 # Where defined, the Joining_Group refines the Joining_Type.
 EquivalencesOf \P{Joining_Group=No_Joining_Group} Joining_Group ⇒ Joining_Type
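
The new invariant `\p{InSC=Nukta} ⊆ \p{Diacritic}` replaces the removed `$nonDiacriticNuktas` exception list. As a rough illustration of why the exception list is no longer needed, the Python sketch below checks that every code point in the old list is covered by a Diacritic range added in this commit. It is only a sketch: the real check is evaluated by the invariant test against the full property data, and the names `expand` and `violations` are hypothetical.

```python
def expand(first, last=None):
    """Return the set of code points in the inclusive range first..last."""
    return set(range(first, (first if last is None else last) + 1))


# Former exceptions, copied from the removed $nonDiacriticNuktas line.
OLD_NON_DIACRITIC_NUKTAS = {0x1BE6, 0x10A38, 0x10A39, 0x10A3A, 0x1133B}

# The Diacritic ranges from this commit that are relevant to those code points.
ADDED_DIACRITIC = (
    expand(0x1BE6)              # BATAK SIGN TOMPI
    | expand(0x10A38, 0x10A3A)  # KHAROSHTHI SIGN BAR ABOVE..DOT BELOW
    | expand(0x1133B, 0x1133C)  # COMBINING BINDU BELOW..GRANTHA SIGN NUKTA
)


def violations(subset, superset):
    """Code points that break the claim 'subset ⊆ superset', sorted."""
    return sorted(subset - superset)


missing = violations(OLD_NON_DIACRITIC_NUKTAS, ADDED_DIACRITIC)
print([f"U+{cp:04X}" for cp in missing])  # []
```

An empty result means each former exception is now Diacritic, which is consistent with the stricter subset invariant replacing the equality against an exception list.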
