Commit e381710

More Unicode 17.0 Confusables (#1168)
* Arabic confusables for Hindko
* New confusables data based on L2/22-108
* Confusable sequences proposed in L2/22-107
* New confusable data based on L2/22-114
* Confusable sequences mentioned in L2/22-081
* Missing Coptic confusables
* Confusable Katakana-Han pair
* Uncomment commented additions and fix their issues

---------

Co-authored-by: Roozbeh Pournader <[email protected]>
1 parent ff30035 commit e381710

11 files changed: 1508 additions and 293 deletions

unicodetools/data/security/dev/confusables.txt

Lines changed: 252 additions & 57 deletions
Large diffs are not rendered by default.

unicodetools/data/security/dev/confusablesSummary.txt

Lines changed: 527 additions & 163 deletions
Large diffs are not rendered by default.

unicodetools/data/security/dev/data/confusablesSummaryIdentifier.txt

Lines changed: 334 additions & 51 deletions
Large diffs are not rendered by default.

unicodetools/data/security/dev/data/source/confusables-intentional.txt

Lines changed: 0 additions & 1 deletion
@@ -39,7 +39,6 @@
 0259 ; 04D9 # schwa ; schwa
 0068 ; 04BB # h ; shha
 0069 ; 0456 # i ; i
-026A ; 04CF # smallcap i ; small palochka (arguable)
 03B9 ; 0269 # iota ; iota
 03F3 ; 006A ; 0458 # yot ; j ; je
 0138 ; 043A # kra ; ka

unicodetools/data/security/dev/data/source/confusables-macFonts.txt

Lines changed: 0 additions & 1 deletion
@@ -336,7 +336,6 @@

 0131 ; 026A # ( ı ~ ɪ ) LATIN SMALL LETTER DOTLESS I ~ LATIN LETTER SMALL CAPITAL I
 0131 ; 03B9 # ( ı ~ ι ) LATIN SMALL LETTER DOTLESS I ~ GREEK SMALL LETTER IOTA
-0131 ; 04CF # ( ı ~ ӏ ) LATIN SMALL LETTER DOTLESS I ~ CYRILLIC SMALL LETTER PALOCHKA

 0138 ; 1D0B # ( ĸ ~ ᴋ ) LATIN SMALL LETTER KRA ~ LATIN LETTER SMALL CAPITAL K
 0138 ; 03BA # ( ĸ ~ κ ) LATIN SMALL LETTER KRA ~ GREEK SMALL LETTER KAPPA

unicodetools/data/security/dev/data/source/confusables-source.txt

Lines changed: 154 additions & 4 deletions
@@ -72,7 +72,6 @@
 018A ; 02BD 0044 # ( Ɗ → ʽD) LATIN CAPITAL LETTER D WITH HOOK → MODIFIER LETTER REVERSED COMMA, LATIN CAPITAL LETTER D
 018C ; 0064 0304 # ( ƌ → d̄) LATIN SMALL LETTER D WITH TOPBAR → LATIN SMALL LETTER D, COMBINING MACRON
 0191 ; 0046 0321 # ( Ƒ → F̡) LATIN CAPITAL LETTER F WITH HOOK → LATIN CAPITAL LETTER F, COMBINING PALATALIZED HOOK BELOW
-0192 ; 0066 0321 # ( ƒ → f̡) LATIN SMALL LETTER F WITH HOOK → LATIN SMALL LETTER F, COMBINING PALATALIZED HOOK BELOW
 0193 ; 0047 02BD # ( Ɠ → Gʽ) LATIN CAPITAL LETTER G WITH HOOK → LATIN CAPITAL LETTER G, MODIFIER LETTER REVERSED COMMA
 0196 ; 006C # ( Ɩ → l) LATIN CAPITAL LETTER IOTA → LATIN SMALL LETTER L
 0197 ; 0049 0335 # ( Ɨ → I̵) LATIN CAPITAL LETTER I WITH STROKE → LATIN CAPITAL LETTER I, COMBINING SHORT STROKE OVERLAY
@@ -431,7 +430,6 @@
 09BC ; 093C # ( ় → ़) BENGALI SIGN NUKTA → DEVANAGARI SIGN NUKTA
 09E0 ; 098B 09C3 # ( ৠ → ঋৃ) BENGALI LETTER VOCALIC RR → BENGALI LETTER VOCALIC R, BENGALI VOWEL SIGN VOCALIC R
 09E1 ; 098C 09E2 # ( ৡ → ঌৢ) BENGALI LETTER VOCALIC LL → BENGALI LETTER VOCALIC L, BENGALI VOWEL SIGN VOCALIC L
-09E6 ; 0030 # ( ০ → 0) BENGALI DIGIT ZERO → DIGIT ZERO
 09EA ; 0038 # ( ৪ → 8) BENGALI DIGIT FOUR → DIGIT EIGHT
 09ED ; 0039 # ( ৭ → 9) BENGALI DIGIT SEVEN → DIGIT NINE
 0A02 ; 0307 # ( ਂ → ̇) GURMUKHI SIGN BINDI → COMBINING DOT ABOVE
@@ -478,7 +476,6 @@
 0B3C ; 0323 # ( ଼ → ̣) ORIYA SIGN NUKTA → COMBINING DOT BELOW
 0B3C ; 0A3C # ( ଼ → ਼) ORIYA SIGN NUKTA → GURMUKHI SIGN NUKTA
 0B3C ; 0ABC # ( ଼ → ઼) ORIYA SIGN NUKTA → GUJARATI SIGN NUKTA
-0B66 ; 0030 # ( ୦ → 0) ORIYA DIGIT ZERO → DIGIT ZERO
 0B68 ; 0039 # ( ୨ → 9) ORIYA DIGIT TWO → DIGIT NINE
 0B82 ; 030A # ( ஂ → ̊) TAMIL SIGN ANUSVARA → COMBINING RING ABOVE
 0B8A ; 0B89 0BB3 # ( ஊ → உள) TAMIL LETTER UU → TAMIL LETTER U, TAMIL LETTER LLA
@@ -539,7 +536,6 @@
 0CB1 ; 0C31 # ( ಱ → ఱ) KANNADA LETTER RRA → TELUGU LETTER RRA
 0CB2 ; 0C32 # ( ಲ → ల) KANNADA LETTER LA → TELUGU LETTER LA
 0CE1 ; 0C8C 0CBE # ( ೡ → ಌಾ) KANNADA LETTER VOCALIC LL → KANNADA LETTER VOCALIC L, KANNADA VOWEL SIGN AA
-0CE6 ; 0C66 # ( ೦ → ౦) KANNADA DIGIT ZERO → TELUGU DIGIT ZERO
 0CE7 ; 0C67 # ( ೧ → ౧) KANNADA DIGIT ONE → TELUGU DIGIT ONE
 0CE8 ; 0C68 # ( ೨ → ౨) KANNADA DIGIT TWO → TELUGU DIGIT TWO
 0CEF ; 0C6F # ( ೯ → ౯) KANNADA DIGIT NINE → TELUGU DIGIT NINE
@@ -5574,3 +5570,157 @@ A7F1 ; 02E2 # ( ꟱ → ˢ ) MODIFIER LETTER CAPITAL S → MODIFIER LETTER SMAL
 # High priority confusables for Tibetan (PAG ref #402)
 0F7B ; 0F7A 0F7A # ( ཻ → ེེ ) TIBETAN VOWEL SIGN EE → TIBETAN VOWEL SIGN E, TIBETAN VOWEL SIGN E
 0F7D ; 0F7C 0F7C # ( ཽ → ོོ ) TIBETAN VOWEL SIGN OO → TIBETAN VOWEL SIGN O, TIBETAN VOWEL SIGN O
+
+# Arabic confusables for Hindko (PAG ref #412)
+08BE ; 067E 065A
+08BF ; 062A 065A
+08C0 ; 0679 065A
+08C1 ; 0686 065A
+08C2 ; 06A9 065A
+
+# New confusables data based on L2/22-108 (PAG ref #415)
+0945 ; 0306
+0949 ; 093E 0945
+0972 ; 0905 0945
+0BB6 ; 0BB8
+105A ; 1004
+178F ; 178A
+
+# An analysis of confusable sequences proposed in L2/22-107 (PAG ref #416)
+0974 ; 0905 093B
+093B ; 093E 093A
+0D7A ; 0D23 0D4D
+0D7D ; 0D32 0D4D
+0D7E ; 0D33 0D4D
+
+# New confusable data based on L2/22-114 (PAG ref #417)
+09E6 ; 006F
+0B20 ; 004F
+0B66 ; 006F
+0CE6 ; 004F
+17E0 ; 006F
+0AE9 ; 0033
+1004 ; 0063
+0192 ; 0066
+0572 ; 03B7
+01BF ; 0070
+0D1F ; 0073
+014B ; 03B7
+0448 ; 0077
+045F ; 0075 0329
+067A ; 062A
+0752 ; 067E
+068F ; 068E
+0973 ; 0905 093A
+0AB0 ; 0AE8
+0A24 ; 0909
+0975 ; 0905 094F
+0A1F ; 091F
+0A20 ; 0920
+0A2B ; 0922
+0A1C ; 0924 094D 0924
+0A27 ; 092A
+0A72 ; 092A 094D 091F
+0A2E ; 092D
+0A38 ; 092E
+0A15 ; 0935
+0A35 ; 0939
+0A3F ; 093F
+0A48 ; 0948
+0A41 ; 0956
+0A42 ; 0957
+09BF ; 093F
+0A47 ; 0947
+09F0 ; 09B0
+1031 ; 0B47
+0B94 ; 0B92 0BB3
+0D25 ; 0BAE
+0D16 ; 0BB5
+0D46 ; 0BC6
+0D47 ; 0BC7
+0C90 ; 0C10
+0C16 ; 0C96 0323
+0C97 ; 0C17
+0C9D ; 0C1D
+0C9F ; 0C1F
+0CA6 ; 0C26
+0CA8 ; 0C28
+0CB0 ; 0C30
+0CB3 ; 0C33
+0CBF ; 0C3F
+0CC1 ; 0C41
+0CC3 ; 0C43
+1002 ; 0D31
+10D8 ; 0D31
+0D8D ; 0DC3 0DD8
+0DB5 ; 0D91
+0D92 ; 0D91 0DCA
+0D93 ; 0D91 0DD9
+0DB9 ; 0D94
+0DB6 ; 0D9B
+0DC0 ; 0DA0
+0DC4 ; 0DB7
+0AEB ; 0AAA
+0AEB ; 0447
+1023 ; 1000 1039 1000
+1061 ; 101B 103E
+10D7 ; 1010
+723F ; 1102 1171
+535F ; 1106 1161
+4ECA ; 1109 1173 11A8
+5408 ; 1109 1173 11B7
+4E1B ; 110A 1173
+4E15 ; 110C 1169
+9577 ; 1110 1173 11AC
+
+# An analysis of confusable sequences mentioned in L2/22-081 (PAG ref #418)
+256A ; 01C2
+0582 ; 0131
+04CF ; 0049
+
+# Missing Coptic confusables (PAG ref #419)
+03E4 ; 0427
+03E5 ; 0447
+03EC ; 0036
+03ED ; 03C3
+2C82 ; 0042
+2C83 ; 0432
+2C8B ; 03C2
+2C8C ; 2C6B
+2C8D ; 2C6C
+2C8F ; 043D
+2C90 ; 03F4
+2C91 ; 0275
+2C93 ; 0131
+2C97 ; 028C
+2C99 ; 1D0D
+2C9B ; 0274
+2C9C ; 01B7
+2C9D ; 0293
+2CA1 ; 043F
+2CA7 ; 0442
+2CA9 ; 03B3
+2CAF ; 03C8
+2CB0 ; A64C
+2CB2 ; FB29
+2CB3 ; FB29
+2CB5 ; 22D6
+2CB6 ; 039E
+2CB7 ; 2261
+2CBB ; 2212
+2CC0 ; 20BD
+2CC1 ; 03FC
+2CC4 ; 01B7
+2CC5 ; 0292
+2CC7 ; 002F
+2CCB ; 0039
+2CCE ; 0050
+2CCF ; 0070
+2CD3 ; 0036
+2CDD ; 03B4
+2CE0 ; 03C6
+2CE1 ; 03C6
+2CE8 ; 20BD
+
+# Confusable Katakana-Han pair (PAG ref #442)
+1B122 ; 4E8E
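
Note on the data format: the entries in the source files touched by this commit are semicolon-separated fields of hexadecimal code points, optionally followed by a `#` comment. The following is a minimal Python sketch of how such a line could be read. It is not the unicodetools implementation; the helper name `parse_confusable_line` and its return shape are assumptions made only for illustration.

```python
from typing import List, Optional, Tuple


def parse_confusable_line(line: str) -> Optional[Tuple[List[str], str]]:
    """Parse one line such as '08BE ; 067E 065A' (hypothetical helper).

    Returns (fields, comment), where each field is the string built from
    that field's space-separated hex code points, or None when the line
    is blank or holds only a comment.
    """
    # Split off the comment first, because comments may contain ';'.
    data, _, comment = line.partition("#")
    data = data.strip()
    if not data:
        return None
    fields = []
    for field in data.split(";"):
        # Each field is a sequence of code points written in hexadecimal.
        fields.append("".join(chr(int(cp, 16)) for cp in field.split()))
    return fields, comment.strip()


if __name__ == "__main__":
    for sample in ("08BE ; 067E 065A", "1B122 ; 4E8E", "# comment only"):
        parsed = parse_confusable_line(sample)
        if parsed is None:
            print("skipped:", sample)
        else:
            fields, _ = parsed
            # Show the code points of each field rather than the raw glyphs.
            print([[f"U+{ord(ch):04X}" for ch in f] for f in fields])
```

The comment is split off before the fields because lines such as `03F3 ; 006A ; 0458 # yot ; j ; je` carry semicolons inside the comment as well.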

unicodetools/data/security/dev/data/source/confusables-winFonts.txt

Lines changed: 0 additions & 2 deletions
@@ -1037,8 +1037,6 @@

 07F5 ; 2018 # ( ‎ߵ‎ ~ ‘ ) NKO LOW TONE APOSTROPHE ~ LEFT SINGLE QUOTATION MARK

-0B20 ; 0B66 # ( ଠ ~ ୦ ) ORIYA LETTER TTHA ~ ORIYA DIGIT ZERO
-
 0B85 ; 0BEE # ( அ ~ ௮ ) TAMIL LETTER A ~ TAMIL DIGIT EIGHT

 0B95 ; 0BE7 # ( க ~ ௧ ) TAMIL LETTER KA ~ TAMIL DIGIT ONE

unicodetools/data/security/dev/data/source/formatted-macFonts.txt

Lines changed: 1 addition & 2 deletions
@@ -1,5 +1,5 @@
 # formatted-macFonts.txt
-# Date: 2025-05-02, 14:36:22 GMT
+# Date: 2025-07-22, 05:49:36 GMT
 # © 2025 Unicode®, Inc.
 # Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
 # For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -283,7 +283,6 @@

 0131 ; 026A # ( ı ~ ɪ ) LATIN SMALL LETTER DOTLESS I ~ LATIN LETTER SMALL CAPITAL I
 0131 ; 03B9 # ( ı ~ ι ) LATIN SMALL LETTER DOTLESS I ~ GREEK SMALL LETTER IOTA
-0131 ; 04CF # ( ı ~ ӏ ) LATIN SMALL LETTER DOTLESS I ~ CYRILLIC SMALL LETTER PALOCHKA

 0138 ; 1D0B # ( ĸ ~ ᴋ ) LATIN SMALL LETTER KRA ~ LATIN LETTER SMALL CAPITAL K
 0138 ; 03BA # ( ĸ ~ κ ) LATIN SMALL LETTER KRA ~ GREEK SMALL LETTER KAPPA
