Commit 3cbc70a

Fix and extend confusables for Indic abbreviation marks
This fixes the problems described in unicode-org/properties#455.
1 parent e5b3407 commit 3cbc70a

4 files changed: 67 additions & 21 deletions

unicodetools/data/security/dev/confusables.txt

Lines changed: 17 additions & 7 deletions
@@ -1,5 +1,5 @@
 # confusables.txt
-# Date: 2025-10-09, 03:26:38 GMT
+# Date: 2025-10-09, 06:52:02 GMT
 # © 2025 Unicode®, Inc.
 # Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
 # For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -378,6 +378,10 @@ FF65 ; 00B7 ; MA # ( ・ → · ) HALFWIDTH KATAKANA MIDDLE DOT → MIDDLE DOT #
 10101 ; 00B7 ; MA #* ( 𐄁 → · ) AEGEAN WORD SEPARATOR DOT → MIDDLE DOT #
 2022 ; 00B7 ; MA #* ( • → · ) BULLET → MIDDLE DOT #
 2027 ; 00B7 ; MA #* ( ‧ → · ) HYPHENATION POINT → MIDDLE DOT #
+0A76 ; 00B7 ; MA #* ( ੶ → · ) GURMUKHI ABBREVIATION SIGN → MIDDLE DOT #
+11174 ; 00B7 ; MA #* ( 𑅴 → · ) MAHAJANI ABBREVIATION SIGN → MIDDLE DOT # →੶→
+111C7 ; 00B7 ; MA #* ( 𑇇 → · ) SHARADA ABBREVIATION SIGN → MIDDLE DOT # →੶→
+116B9 ; 00B7 ; MA #* ( 𑚹 → · ) TAKRI ABBREVIATION SIGN → MIDDLE DOT # →੶→
 2219 ; 00B7 ; MA #* ( ∙ → · ) BULLET OPERATOR → MIDDLE DOT #
 22C5 ; 00B7 ; MA #* ( ⋅ → · ) DOT OPERATOR → MIDDLE DOT #
 A78F ; 00B7 ; MA # ( ꞏ → · ) LATIN LETTER SINOLOGICAL DOT → MIDDLE DOT #
@@ -1042,11 +1046,6 @@ FE68 ; 005C ; MA #* ( ﹨ → \ ) SMALL REVERSE SOLIDUS → REVERSE SOLIDUS #

 A778 ; 0026 ; MA # ( ꝸ → & ) LATIN SMALL LETTER UM → AMPERSAND #

-0AF0 ; 0970 ; MA #* ( ૰ → ॰ ) GUJARATI ABBREVIATION SIGN → DEVANAGARI ABBREVIATION SIGN #
-110BB ; 0970 ; MA #* ( 𑂻 → ॰ ) KAITHI ABBREVIATION SIGN → DEVANAGARI ABBREVIATION SIGN #
-111C7 ; 0970 ; MA #* ( 𑇇 → ॰ ) SHARADA ABBREVIATION SIGN → DEVANAGARI ABBREVIATION SIGN #
-26AC ; 0970 ; MA #* ( ⚬ → ॰ ) MEDIUM SMALL WHITE CIRCLE → DEVANAGARI ABBREVIATION SIGN #
-
 111DB ; A8FC ; MA #* ( 𑇛 → ꣼ ) SHARADA SIGN SIDDHAM → DEVANAGARI SIGN SIDDHAM #

 17D9 ; 0E4F ; MA #* ( ៙ → ๏ ) KHMER SIGN PHNAEK MUAN → THAI CHARACTER FONGMAN #
@@ -1088,10 +1087,21 @@ A714 ; 02EB ; MA #* ( ꜔ → ˫ ) MODIFIER LETTER MID LEFT-STEM TONE BAR → MO
 3002 ; 02F3 ; MA #* ( 。 → ˳ ) IDEOGRAPHIC FULL STOP → MODIFIER LETTER LOW RING #

 2E30 ; 00B0 ; MA #* ( ⸰ → ° ) RING POINT → DEGREE SIGN # →∘→
+0970 ; 00B0 ; MA #* ( ॰ → ° ) DEVANAGARI ABBREVIATION SIGN → DEGREE SIGN #
+09FD ; 00B0 ; MA #* ( ৽ → ° ) BENGALI ABBREVIATION SIGN → DEGREE SIGN # →॰→
+0AF0 ; 00B0 ; MA #* ( ૰ → ° ) GUJARATI ABBREVIATION SIGN → DEGREE SIGN # →॰→
+110BB ; 00B0 ; MA #* ( 𑂻 → ° ) KAITHI ABBREVIATION SIGN → DEGREE SIGN # →॰→
+1123D ; 00B0 ; MA #* ( 𑈽 → ° ) KHOJKI ABBREVIATION SIGN → DEGREE SIGN # →॰→
+1144F ; 00B0 ; MA #* ( 𑑏 → ° ) NEWA ABBREVIATION SIGN → DEGREE SIGN # →॰→
+114C6 ; 00B0 ; MA #* ( 𑓆 → ° ) TIRHUTA ABBREVIATION SIGN → DEGREE SIGN # →॰→
+11643 ; 00B0 ; MA #* ( 𑙃 → ° ) MODI ABBREVIATION SIGN → DEGREE SIGN # →॰→
+1183B ; 00B0 ; MA #* ( 𑠻 → ° ) DOGRA ABBREVIATION SIGN → DEGREE SIGN # →॰→
+1E5FF ; 00B0 ; MA #* ( 𞗿 → ° ) OL ONAL ABBREVIATION SIGN → DEGREE SIGN # →॰→
 02DA ; 00B0 ; MA #* ( ˚ → ° ) RING ABOVE → DEGREE SIGN #
 2218 ; 00B0 ; MA #* ( ∘ → ° ) RING OPERATOR → DEGREE SIGN #
 25CB ; 00B0 ; MA #* ( ○ → ° ) WHITE CIRCLE → DEGREE SIGN # →◦→→∘→
 25E6 ; 00B0 ; MA #* ( ◦ → ° ) WHITE BULLET → DEGREE SIGN # →∘→
+26AC ; 00B0 ; MA #* ( ⚬ → ° ) MEDIUM SMALL WHITE CIRCLE → DEGREE SIGN # →॰→

 10ED0 ; 00B0 0332 ; MA #* ( 𐻐 → °̲ ) ARABIC BIBLICAL END OF VERSE → DEGREE SIGN, COMBINING LOW LINE # →⍜→→○̲→
 235C ; 00B0 0332 ; MA #* ( ⍜ → °̲ ) APL FUNCTIONAL SYMBOL CIRCLE UNDERBAR → DEGREE SIGN, COMBINING LOW LINE # →○̲→
@@ -9989,5 +9999,5 @@ FACE ; 9F9C ; MA # ( 龜 → 龜 ) CJK COMPATIBILITY IDEOGRAPH-FACE → CJK UNIF

 2FD5 ; 9FA0 ; MA #* ( ⿕ → 龠 ) KANGXI RADICAL FLUTE → CJK UNIFIED IDEOGRAPH-9FA0 #

-# total: 6567
+# total: 6578
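Each data line in confusables.txt above has the shape `source ; target ; type # comment`, with code points written in hexadecimal and multi-code-point targets separated by spaces. A minimal Python sketch of reading such lines follows. The function name `parse_confusable_line` is hypothetical, and this is not the unicodetools parser.

```python
def parse_confusable_line(line):
    """Return (source, target) strings for a data line, or None for blank/comment lines."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 2:
        return None
    # Each field is a space-separated list of hexadecimal code points.
    source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
    target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
    return source, target


# Sample lines copied from the hunks above.
SAMPLE = [
    "# total: 6578",
    "0A76 ; 00B7 ; MA #* ( ੶ → · ) GURMUKHI ABBREVIATION SIGN → MIDDLE DOT #",
    "10ED0 ; 00B0 0332 ; MA #* ( 𐻐 → °̲ ) ARABIC BIBLICAL END OF VERSE → DEGREE SIGN, COMBINING LOW LINE #",
]
parsed = [parse_confusable_line(line) for line in SAMPLE]
```

The comment line yields `None`, and the third line shows a single source code point mapping to a two-code-point target.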

unicodetools/data/security/dev/confusablesSummary.txt

Lines changed: 19 additions & 11 deletions
@@ -1,5 +1,5 @@
 # confusablesSummary.txt
-# Date: 2025-10-09, 03:26:38 GMT
+# Date: 2025-10-09, 06:52:01 GMT
 # © 2025 Unicode®, Inc.
 # Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
 # For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -4450,12 +4450,23 @@
 ← (‎ ˉb ‎) 02C9 0062 MODIFIER LETTER MACRON, LATIN SMALL LETTER B
 ← (‎ ъ ‎) 044A CYRILLIC SMALL LETTER HARD SIGN

-# ° ∘ ○ ◦ ˚
+# ° ॰ ৽ 𑠻 𞗿 ૰ ∘ ○ ◦ ⚬ ⸰ 𑂻 𑈽 𑓆 𑙃 𑑏 ˚
 (‎ ° ‎) 00B0 DEGREE SIGN
+← (‎ ॰ ‎) 0970 DEVANAGARI ABBREVIATION SIGN
+← (‎ ৽ ‎) 09FD BENGALI ABBREVIATION SIGN # →॰→
+← (‎ 𑠻 ‎) 1183B DOGRA ABBREVIATION SIGN # →॰→
+← (‎ 𞗿 ‎) 1E5FF OL ONAL ABBREVIATION SIGN # →॰→
+← (‎ ૰ ‎) 0AF0 GUJARATI ABBREVIATION SIGN # →॰→
 ← (‎ ∘ ‎) 2218 RING OPERATOR
 ← (‎ ○ ‎) 25CB WHITE CIRCLE # →◦→→∘→
 ← (‎ ◦ ‎) 25E6 WHITE BULLET # →∘→
+← (‎ ⚬ ‎) 26AC MEDIUM SMALL WHITE CIRCLE # →॰→
 ← (‎ ⸰ ‎) 2E30 RING POINT # →∘→
+← (‎ 𑂻 ‎) 110BB KAITHI ABBREVIATION SIGN # →॰→
+← (‎ 𑈽 ‎) 1123D KHOJKI ABBREVIATION SIGN # →॰→
+← (‎ 𑓆 ‎) 114C6 TIRHUTA ABBREVIATION SIGN # →॰→
+← (‎ 𑙃 ‎) 11643 MODI ABBREVIATION SIGN # →॰→
+← (‎ 𑑏 ‎) 1144F NEWA ABBREVIATION SIGN # →॰→
 ← (‎ ˚ ‎) 02DA RING ABOVE

 # °C ℃
@@ -4491,8 +4502,10 @@
 (‎ ¶ ‎) 00B6 PILCROW SIGN
 ← (‎ ⸿ ‎) 2E3F CAPITULUM

-# · ・ ᐧ ‧ ᛫ • ∙ ⋅ ⸱ 𐄁 ꞏ · ・
+# · ੶ 𑚹 ・ ᐧ ‧ ᛫ • ∙ ⋅ ⸱ 𐄁 𑇇 𑅴 ꞏ · ・
 (‎ · ‎) 00B7 MIDDLE DOT
+← (‎ ੶ ‎) 0A76 GURMUKHI ABBREVIATION SIGN
+← (‎ 𑚹 ‎) 116B9 TAKRI ABBREVIATION SIGN # →੶→
 ← (‎ ・ ‎) 30FB KATAKANA MIDDLE DOT # →•→
 ← (‎ ᐧ ‎) 1427 CANADIAN SYLLABICS FINAL MIDDLE DOT
 ← (‎ ‧ ‎) 2027 HYPHENATION POINT
@@ -4502,6 +4515,8 @@
 ← (‎ ⋅ ‎) 22C5 DOT OPERATOR
 ← (‎ ⸱ ‎) 2E31 WORD SEPARATOR MIDDLE DOT
 ← (‎ 𐄁 ‎) 10101 AEGEAN WORD SEPARATOR DOT
+← (‎ 𑇇 ‎) 111C7 SHARADA ABBREVIATION SIGN # →੶→
+← (‎ 𑅴 ‎) 11174 MAHAJANI ABBREVIATION SIGN # →੶→
 ← (‎ ꞏ ‎) A78F LATIN LETTER SINOLOGICAL DOT
 ← (‎ · ‎) 0387 GREEK ANO TELEIA
 ← (‎ ・ ‎) FF65 HALFWIDTH KATAKANA MIDDLE DOT # →•→
@@ -8719,13 +8734,6 @@
 (‎ ८ ‎) 096E DEVANAGARI DIGIT EIGHT
 ← (‎ ૮ ‎) 0AEE GUJARATI DIGIT EIGHT

-# ॰ ૰ ⚬ 𑂻 𑇇
-(‎ ॰ ‎) 0970 DEVANAGARI ABBREVIATION SIGN
-← (‎ ૰ ‎) 0AF0 GUJARATI ABBREVIATION SIGN
-← (‎ ⚬ ‎) 26AC MEDIUM SMALL WHITE CIRCLE
-← (‎ 𑂻 ‎) 110BB KAITHI ABBREVIATION SIGN
-← (‎ 𑇇 ‎) 111C7 SHARADA ABBREVIATION SIGN
-
 # ঃ ః ಃ ഃ ඃ း ਃ 𑓁
 (‎ ঃ ‎) 0983 BENGALI SIGN VISARGA
 ← (‎ ః ‎) 0C03 TELUGU SIGN VISARGA # →ਃ→
@@ -17834,5 +17842,5 @@
 (‎ 𪘀 ‎) 2A600 CJK UNIFIED IDEOGRAPH-2A600
 ← (‎ 𪘀 ‎) 2FA1D CJK COMPATIBILITY IDEOGRAPH-2FA1D

-# total : 7582
+# total : 7593
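The `# total` lines in confusables.txt and confusablesSummary.txt both move by 11, which can be cross-checked against the data lines added and removed in the confusables.txt hunks. A small Python sketch of that arithmetic follows. The lists are transcribed by hand from the diff and the variable names are illustrative.

```python
# Source code points of data lines added to confusables.txt in this commit.
added = [
    0x0A76, 0x11174, 0x111C7, 0x116B9,            # new mappings to MIDDLE DOT
    0x0970, 0x09FD, 0x0AF0, 0x110BB, 0x1123D,     # new mappings to DEGREE SIGN
    0x1144F, 0x114C6, 0x11643, 0x1183B, 0x1E5FF,
    0x26AC,
]
# Source code points of data lines removed (each previously mapped to U+0970).
removed = [0x0AF0, 0x110BB, 0x111C7, 0x26AC]

net = len(added) - len(removed)

assert net == 11
assert 6567 + net == 6578  # "# total" in confusables.txt
assert 7582 + net == 7593  # "# total" in confusablesSummary.txt
```

Four of the removed sources (U+0AF0, U+110BB, U+111C7, U+26AC) reappear in the added list with a new target, so the net growth comes from the eleven code points that had no entry in the removed lines.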

unicodetools/data/security/dev/data/source/confusables-source.txt

Lines changed: 17 additions & 1 deletion
@@ -4427,7 +4427,6 @@ A4EF ; 2C6F # [email protected]
 # Additions from UTC

 0AF0 ; 0970 # ( ૰ ) GUJARATI ABBREVIATION SIGN => U+0970 ( ॰ ) DEVANAGARI ABBREVIATION SIGN
-111C7 ; 0970 # ( 𑇇 ) SHARADA ABBREVIATION SIGN => U+0970 ( ॰ ) DEVANAGARI ABBREVIATION SIGN

 A792 ; 0404 # ( Ꞓ ) LATIN CAPITAL LETTER C WITH BAR =>U+0404 ( Є ) CYRILLIC CAPITAL LETTER UKRAINIAN IE
 A793 ; 0454 # ( ꞓ ) LATIN SMALL LETTER C WITH BAR =>U+0454 ( є ) CYRILLIC SMALL LETTER UKRAINIAN IE
@@ -5728,3 +5727,20 @@ A7F1 ; 02E2 # ( ꟱ → ˢ ) MODIFIER LETTER CAPITAL S → MODIFIER LETTER SMAL
 # Confusables for Devanagari UE and UUE (PAG ref #449)
 0956 ; 032E
 0957 ; 032E 032E
+
+# Indic abbreviation marks (PAG ref #455)
+0970 ; 00B0
+09FD ; 0970
+0AF0 ; 0970
+110BB ; 0970
+1123D ; 0970
+1144F ; 0970
+114C6 ; 0970
+11643 ; 0970
+1183B ; 0970
+1E5FF ; 0970
+
+0A76 ; 00B7
+11174 ; 0A76
+111C7 ; 0A76
+116B9 ; 0A76
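The source pairs above are chained: U+09FD is mapped to U+0970, which is in turn mapped to U+00B0. That matches the generated confusables.txt, which lists U+09FD directly against DEGREE SIGN with the intermediate step recorded as `→॰→`. A minimal Python sketch of following such chains is shown below. It only walks the links as written in this hunk; the actual unicodetools generator may choose prototypes by a different procedure, and the names `SOURCE_PAIRS` and `resolve` are hypothetical.

```python
# Pairs added under "Indic abbreviation marks (PAG ref #455)" in this hunk.
SOURCE_PAIRS = {
    0x0970: 0x00B0,
    0x09FD: 0x0970, 0x0AF0: 0x0970, 0x110BB: 0x0970, 0x1123D: 0x0970,
    0x1144F: 0x0970, 0x114C6: 0x0970, 0x11643: 0x0970, 0x1183B: 0x0970,
    0x1E5FF: 0x0970,
    0x0A76: 0x00B7,
    0x11174: 0x0A76, 0x111C7: 0x0A76, 0x116B9: 0x0A76,
}


def resolve(cp, pairs):
    """Follow source-to-target links until a code point has no further mapping."""
    path = [cp]
    while path[-1] in pairs:
        nxt = pairs[path[-1]]
        if nxt in path:
            raise ValueError(f"cycle at U+{nxt:04X}")
        path.append(nxt)
    return path


chain = resolve(0x09FD, SOURCE_PAIRS)
print(" → ".join(f"U+{cp:04X}" for cp in chain))  # U+09FD → U+0970 → U+00B0
```

Under this reading, moving U+111C7 from the U+0970 group to the U+0A76 group is what shifts SHARADA ABBREVIATION SIGN from the degree-sign set to the middle-dot set in the generated files.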

unicodetools/data/security/dev/data/source/formatted-source.txt

Lines changed: 14 additions & 2 deletions
@@ -1,5 +1,5 @@
 # formatted-source.txt
-# Date: 2025-10-09, 03:26:35 GMT
+# Date: 2025-10-09, 06:52:00 GMT
 # © 2025 Unicode®, Inc.
 # Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
 # For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -890,10 +890,12 @@

 00AF 0062 ; 044A #* ( ¯b ~ ъ ) MACRON, LATIN SMALL LETTER B ~ CYRILLIC SMALL LETTER HARD SIGN

+00B0 ; 0970 #* ( ° ~ ॰ ) DEGREE SIGN ~ DEVANAGARI ABBREVIATION SIGN
 00B0 ; 02DA #* ( ° ~ ˚ ) DEGREE SIGN ~ RING ABOVE

 00B6 ; 2E3F #* ( ¶ ~ ⸿ ) PILCROW SIGN ~ CAPITULUM

+00B7 ; 0A76 # ( · ~ ੶ ) MIDDLE DOT ~ GURMUKHI ABBREVIATION SIGN
 00B7 ; 1427 # ( · ~ ᐧ ) MIDDLE DOT ~ CANADIAN SYLLABICS FINAL MIDDLE DOT
 00B7 ; 2027 # ( · ~ ‧ ) MIDDLE DOT ~ HYPHENATION POINT
 00B7 ; 16EB # ( · ~ ᛫ ) MIDDLE DOT ~ RUNIC SINGLE PUNCTUATION
@@ -2035,10 +2037,16 @@

 096E ; 0AEE # ( ८ ~ ૮ ) DEVANAGARI DIGIT EIGHT ~ GUJARATI DIGIT EIGHT

+0970 ; 09FD #* ( ॰ ~ ৽ ) DEVANAGARI ABBREVIATION SIGN ~ BENGALI ABBREVIATION SIGN
+0970 ; 1183B #* ( ॰ ~ 𑠻 ) DEVANAGARI ABBREVIATION SIGN ~ DOGRA ABBREVIATION SIGN
+0970 ; 1E5FF #* ( ॰ ~ 𞗿 ) DEVANAGARI ABBREVIATION SIGN ~ OL ONAL ABBREVIATION SIGN
 0970 ; 0AF0 #* ( ॰ ~ ૰ ) DEVANAGARI ABBREVIATION SIGN ~ GUJARATI ABBREVIATION SIGN
 0970 ; 26AC #* ( ॰ ~ ⚬ ) DEVANAGARI ABBREVIATION SIGN ~ MEDIUM SMALL WHITE CIRCLE
 0970 ; 110BB #* ( ॰ ~ 𑂻 ) DEVANAGARI ABBREVIATION SIGN ~ KAITHI ABBREVIATION SIGN
-0970 ; 111C7 #* ( ॰ ~ 𑇇 ) DEVANAGARI ABBREVIATION SIGN ~ SHARADA ABBREVIATION SIGN
+0970 ; 1123D #* ( ॰ ~ 𑈽 ) DEVANAGARI ABBREVIATION SIGN ~ KHOJKI ABBREVIATION SIGN
+0970 ; 114C6 #* ( ॰ ~ 𑓆 ) DEVANAGARI ABBREVIATION SIGN ~ TIRHUTA ABBREVIATION SIGN
+0970 ; 11643 #* ( ॰ ~ 𑙃 ) DEVANAGARI ABBREVIATION SIGN ~ MODI ABBREVIATION SIGN
+0970 ; 1144F #* ( ॰ ~ 𑑏 ) DEVANAGARI ABBREVIATION SIGN ~ NEWA ABBREVIATION SIGN

 0971 ; 02D9 # ( ॱ ~ ˙ ) DEVANAGARI SIGN HIGH SPACING DOT ~ DOT ABOVE

@@ -2142,6 +2150,10 @@

 0A73 0A42 ; 0A0A # ( ੳੂ ~ ਊ ) GURMUKHI URA, GURMUKHI VOWEL SIGN UU ~ GURMUKHI LETTER UU

+0A76 ; 116B9 #* ( ੶ ~ 𑚹 ) GURMUKHI ABBREVIATION SIGN ~ TAKRI ABBREVIATION SIGN
+0A76 ; 111C7 #* ( ੶ ~ 𑇇 ) GURMUKHI ABBREVIATION SIGN ~ SHARADA ABBREVIATION SIGN
+0A76 ; 11174 #* ( ੶ ~ 𑅴 ) GURMUKHI ABBREVIATION SIGN ~ MAHAJANI ABBREVIATION SIGN
+
 0A85 0ABE ; 0A86 # ( અા ~ આ ) GUJARATI LETTER A, GUJARATI VOWEL SIGN AA ~ GUJARATI LETTER AA

 0A85 0AC5 ; 0A8D # ( અૅ ~ ઍ ) GUJARATI LETTER A, GUJARATI VOWEL SIGN CANDRA E ~ GUJARATI VOWEL CANDRA E
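One way to sanity-check the regenerated files against each other is to confirm that both sides of every `left ~ right` pair added to formatted-source.txt end up with the same target in confusables.txt. The Python sketch below assumes that `~` marks two members of the same confusable set, which is an interpretation of the file layout and not something the commit states. All names are illustrative and the data is transcribed by hand from the hunks in this commit.

```python
# Source -> target mappings added to confusables.txt in this commit.
PROTOTYPE = {
    cp: 0x00B0
    for cp in (0x0970, 0x09FD, 0x0AF0, 0x110BB, 0x1123D, 0x1144F,
               0x114C6, 0x11643, 0x1183B, 0x1E5FF, 0x26AC)
}
PROTOTYPE.update({cp: 0x00B7 for cp in (0x0A76, 0x11174, 0x111C7, 0x116B9)})

# "left ~ right" pairs added to formatted-source.txt in this commit.
ADDED_PAIRS = [
    (0x00B0, 0x0970), (0x00B7, 0x0A76),
    (0x0970, 0x09FD), (0x0970, 0x1183B), (0x0970, 0x1E5FF),
    (0x0970, 0x1123D), (0x0970, 0x114C6), (0x0970, 0x11643), (0x0970, 0x1144F),
    (0x0A76, 0x116B9), (0x0A76, 0x111C7), (0x0A76, 0x11174),
]


def prototype(cp):
    """A code point with no entry is treated as its own prototype (assumption)."""
    return PROTOTYPE.get(cp, cp)


mismatches = [(a, b) for a, b in ADDED_PAIRS if prototype(a) != prototype(b)]
assert mismatches == []
```

The removed pair `0970 ; 111C7` would fail this check against the new mappings, since U+0970 now resolves to U+00B0 while U+111C7 resolves to U+00B7.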
