Commit 37a3b8b

make data files refer to core spec sections by name not number (#1178)
1 parent 23b7631 commit 37a3b8b

18 files changed: +84 -62 lines changed

unicodetools/data/security/dev/IdentifierStatus.txt

Lines changed: 5 additions & 2 deletions
@@ -1,5 +1,5 @@
 # IdentifierStatus.txt
-# Date: 2025-07-29, 02:36:52 GMT
+# Date: 2025-08-01, 18:11:48 GMT
 # © 2025 Unicode®, Inc.
 # Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
 # For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -12,7 +12,10 @@
 # Format
 #
 # Field 0: code point
-# Field 1: Identifier_Status value (see Table 1 of http://www.unicode.org/reports/tr39)
+# Field 1: Identifier_Status value
+# See the "Identifier_Status and Identifier_Type" table of UTS #39:
+# https://www.unicode.org/reports/tr39/#Identifier_Status_and_Type
+
 #
 # For the purpose of regular expressions, the property Identifier_Status is defined as
 # an enumerated property of code points.

unicodetools/data/security/dev/IdentifierType.txt

Lines changed: 5 additions & 2 deletions
@@ -1,5 +1,5 @@
 # IdentifierType.txt
-# Date: 2025-07-29, 02:36:51 GMT
+# Date: 2025-08-01, 18:11:44 GMT
 # © 2025 Unicode®, Inc.
 # Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
 # For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -12,7 +12,10 @@
 # Format
 #
 # Field 0: code point
-# Field 1: set of Identifier_Type values (see Table 1 of http://www.unicode.org/reports/tr39)
+# Field 1: set of Identifier_Type values
+# See the "Identifier_Status and Identifier_Type" table of UTS #39:
+# https://www.unicode.org/reports/tr39/#Identifier_Status_and_Type
+
 #
 # For the purpose of regular expressions, the property Identifier_Type is defined as
 # mapping each code point to a set of enumerated values.

unicodetools/data/ucd/dev/BidiCharacterTest.txt

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# BidiCharacterTest-17.0.0.txt
2-
# Date: 2024-02-02
3-
# © 2024 Unicode®, Inc.
2+
# Date: 2025-07-30
3+
# © 2025 Unicode®, Inc.
44
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
55
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
66
#
@@ -39,7 +39,8 @@
3939
################################################################################
4040
# Examples from UAX #9
4141

42-
# Examples from Section 3.3.5
42+
# Examples from the "Resolving Neutral and Isolate Formatting Types" section of UAX #9
43+
# (https://www.unicode.org/reports/tr9/#Resolving_Neutral_Types)
4344
05D0 05D1 0028 05D2 05D3 005B 0026 0065 0066 005D 002E 0029 0067 0068;0;0;1 1 0 1 1 0 0 0 0 0 0 0 0 0;1 0 2 4 3 5 6 7 8 9 10 11 12 13
4445
05D0 05D1 0028 05D2 05D3 005B 0026 0065 0066 005D 002E 0029 0067 0068;1;1;1 1 1 1 1 1 1 2 2 1 1 1 2 2;12 13 11 10 9 7 8 6 5 4 3 2 1 0
4546
0061 0062 0063 0020 0028 0064 0065 0066 0020 0627 0628 062C 0029 0020 05D0 05D1 05D2;0;0;0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1;0 1 2 3 4 5 6 7 8 11 10 9 12 13 16 15 14
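
For readers unfamiliar with these test lines, the following Python sketch splits one of the context lines above into its semicolon-separated fields. The field meanings used here (code points; paragraph direction; resolved paragraph embedding level; resolved levels, with `x` for removed characters; visual order indices) come from the header of BidiCharacterTest.txt, which this diff does not show, so treat them as an assumption. The names `BidiTestCase` and `parse_bidi_test_line` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BidiTestCase:
    code_points: List[int]        # field 0: hex code points of the input text
    paragraph_direction: int      # field 1 (assumed: 0 = LTR, 1 = RTL, 2 = auto)
    paragraph_level: int          # field 2: resolved paragraph embedding level
    levels: List[Optional[int]]   # field 3: resolved level per code point, None for "x"
    visual_order: List[int]       # field 4: logical indices in left-to-right visual order


def parse_bidi_test_line(line: str) -> BidiTestCase:
    """Split one data line into its five fields (assumed layout, see lead-in)."""
    fields = [field.strip() for field in line.split(";")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    code_points, direction, paragraph_level, levels, order = fields
    return BidiTestCase(
        code_points=[int(cp, 16) for cp in code_points.split()],
        paragraph_direction=int(direction),
        paragraph_level=int(paragraph_level),
        levels=[None if lv == "x" else int(lv) for lv in levels.split()],
        visual_order=[int(i) for i in order.split()],
    )


# First example line from the hunk above.
LINE = (
    "05D0 05D1 0028 05D2 05D3 005B 0026 0065 0066 005D 002E 0029 0067 0068;"
    "0;0;1 1 0 1 1 0 0 0 0 0 0 0 0 0;1 0 2 4 3 5 6 7 8 9 10 11 12 13"
)
case = parse_bidi_test_line(LINE)
```

A consumer would typically run its own bidi implementation on `case.code_points` with `case.paragraph_direction` and compare the result against `case.levels` and `case.visual_order`.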

unicodetools/data/ucd/dev/CaseFolding.txt

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
# CaseFolding-17.0.0.txt
2-
# Date: 2025-05-02, 21:48:45 GMT
2+
# Date: 2025-07-30, 23:54:36 GMT
33
# © 2025 Unicode®, Inc.
44
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
55
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -25,8 +25,8 @@
2525
# NOTE: case folding does not preserve normalization formats!
2626
#
2727
# For information on case folding, including how to have case folding
28-
# preserve normalization formats, see Section 3.13 Default Case Algorithms in
29-
# The Unicode Standard.
28+
# preserve normalization formats, see the
29+
# "Conformance" / "Default Case Algorithms" section of the core specification.
3030
#
3131
# ================================================================================
3232
# Format

unicodetools/data/ucd/dev/DerivedAge.txt

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
# DerivedAge-17.0.0.txt
2-
# Date: 2025-07-24, 00:12:18 GMT
2+
# Date: 2025-07-30, 23:54:38 GMT
33
# © 2025 Unicode®, Inc.
44
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
55
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -15,7 +15,8 @@
1515
# - The term 'assigned' means that a previously reserved code point was assigned
1616
# to be a character (graphic, format, control, or private-use);
1717
# a noncharacter code point; or a surrogate code point.
18-
# For more information, see The Unicode Standard Section 2.4
18+
# For more information, see the
19+
# "General Structure" / "Code Points and Characters" section of the core specification.
1920
#
2021
# - Versions are only tracked from 1.1 onwards, since version 1.0
2122
# predated changes required by the ISO 10646 merger.

unicodetools/data/ucd/dev/DerivedCoreProperties.txt

Lines changed: 11 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
# DerivedCoreProperties-17.0.0.txt
2-
# Date: 2025-07-24, 00:12:47 GMT
2+
# Date: 2025-07-30, 23:55:08 GMT
33
# © 2025 Unicode®, Inc.
44
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
55
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -3553,7 +3553,8 @@ E0100..E01EF ; Case_Ignorable # Mn [240] VARIATION SELECTOR-17..VARIATION SELEC
35533553

35543554
# Derived Property: Changes_When_Lowercased (CWL)
35553555
# Characters whose normalized forms are not stable under a toLowercase mapping.
3556-
# For more information, see D139 in Section 3.13, "Default Case Algorithms".
3556+
# For more information, see the definition of "isLowercase(X)"
3557+
# in the "Conformance" / "Default Case Algorithms" section of the core specification.
35573558
# Changes_When_Lowercased(X) is true when toLowercase(toNFD(X)) != toNFD(X)
35583559

35593560
0041..005A ; Changes_When_Lowercased # L& [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
@@ -4181,7 +4182,8 @@ FF21..FF3A ; Changes_When_Lowercased # L& [26] FULLWIDTH LATIN CAPITAL LETTE
41814182

41824183
# Derived Property: Changes_When_Uppercased (CWU)
41834184
# Characters whose normalized forms are not stable under a toUppercase mapping.
4184-
# For more information, see D140 in Section 3.13, "Default Case Algorithms".
4185+
# For more information, see the definition of "isUppercase(X)"
4186+
# in the "Conformance" / "Default Case Algorithms" section of the core specification.
41854187
# Changes_When_Uppercased(X) is true when toUppercase(toNFD(X)) != toNFD(X)
41864188

41874189
0061..007A ; Changes_When_Uppercased # L& [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
@@ -4825,7 +4827,8 @@ FF41..FF5A ; Changes_When_Uppercased # L& [26] FULLWIDTH LATIN SMALL LETTER
48254827

48264828
# Derived Property: Changes_When_Titlecased (CWT)
48274829
# Characters whose normalized forms are not stable under a toTitlecase mapping.
4828-
# For more information, see D141 in Section 3.13, "Default Case Algorithms".
4830+
# For more information, see the definition of "isTitlecase(X)"
4831+
# in the "Conformance" / "Default Case Algorithms" section of the core specification.
48294832
# Changes_When_Titlecased(X) is true when toTitlecase(toNFD(X)) != toNFD(X)
48304833

48314834
0061..007A ; Changes_When_Titlecased # L& [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
@@ -5468,7 +5471,8 @@ FF41..FF5A ; Changes_When_Titlecased # L& [26] FULLWIDTH LATIN SMALL LETTER
54685471

54695472
# Derived Property: Changes_When_Casefolded (CWCF)
54705473
# Characters whose normalized forms are not stable under case folding.
5471-
# For more information, see D142 in Section 3.13, "Default Case Algorithms".
5474+
# For more information, see the definition of "isCasefolded(X)"
5475+
# in the "Conformance" / "Default Case Algorithms" section of the core specification.
54725476
# Changes_When_Casefolded(X) is true when toCasefold(toNFD(X)) != toNFD(X)
54735477

54745478
0041..005A ; Changes_When_Casefolded # L& [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
@@ -6108,7 +6112,8 @@ FF21..FF3A ; Changes_When_Casefolded # L& [26] FULLWIDTH LATIN CAPITAL LETTE
61086112

61096113
# Derived Property: Changes_When_Casemapped (CWCM)
61106114
# Characters whose normalized forms are not stable under case mapping.
6111-
# For more information, see D143 in Section 3.13, "Default Case Algorithms".
6115+
# For more information, see the definition of "isCased(X)"
6116+
# in the "Conformance" / "Default Case Algorithms" section of the core specification.
61126117
# Changes_When_Casemapped(X) is true when CWL(X), or CWT(X), or CWU(X)
61136118

61146119
0041..005A ; Changes_When_Casemapped # L& [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
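
The derivation rules quoted in these comments can be sketched in Python as follows. This is only an approximation under stated assumptions: `str.lower`, `str.upper`, `str.title`, and `str.casefold` stand in for toLowercase, toUppercase, toTitlecase, and toCasefold, they follow whichever Unicode version the interpreter ships (not necessarily 17.0), and `str.title` is a word-based operation that is used here only on single code points. The function names are hypothetical.

```python
import unicodedata


def _nfd(text: str) -> str:
    """toNFD(X): canonical decomposition."""
    return unicodedata.normalize("NFD", text)


def changes_when_lowercased(ch: str) -> bool:
    # CWL(X): toLowercase(toNFD(X)) != toNFD(X)
    decomposed = _nfd(ch)
    return decomposed.lower() != decomposed


def changes_when_uppercased(ch: str) -> bool:
    # CWU(X): toUppercase(toNFD(X)) != toNFD(X)
    decomposed = _nfd(ch)
    return decomposed.upper() != decomposed


def changes_when_titlecased(ch: str) -> bool:
    # CWT(X): toTitlecase(toNFD(X)) != toNFD(X)
    # str.title() is an approximation of toTitlecase (assumption).
    decomposed = _nfd(ch)
    return decomposed.title() != decomposed


def changes_when_casefolded(ch: str) -> bool:
    # CWCF(X): toCasefold(toNFD(X)) != toNFD(X)
    decomposed = _nfd(ch)
    return decomposed.casefold() != decomposed


def changes_when_casemapped(ch: str) -> bool:
    # CWCM(X): CWL(X), or CWT(X), or CWU(X)
    return (
        changes_when_lowercased(ch)
        or changes_when_titlecased(ch)
        or changes_when_uppercased(ch)
    )
```

For example, `changes_when_lowercased("A")` is true and `changes_when_uppercased("a")` is true, matching the `0041..005A` and `0061..007A` ranges listed above. For authoritative values, read the data file itself rather than relying on this sketch.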

unicodetools/data/ucd/dev/DoNotEmit.txt

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
# DoNotEmit-17.0.0.txt
2-
# Date: 2025-05-13, 04:43:00 GMT
2+
# Date: 2025-07-30
33
# © 2025 Unicode®, Inc.
44
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
55
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -332,7 +332,7 @@
332332
# Deprecated characters and other discouraged characters and sequences
333333
# ================================================
334334

335-
# Latin, from text of Section 7.1, the NamesList, and the uppercase mapping
335+
# Latin, from text in the "Latin" section of the core specification, the NamesList, and the uppercase mapping
336336
0140; 006C 00B7; Preferred_Spelling # LATIN SMALL LETTER L WITH MIDDLE DOT; LATIN SMALL LETTER L, MIDDLE DOT
337337
0149; 2019 006E; Deprecated # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE; RIGHT SINGLE QUOTATION MARK, LATIN SMALL LETTER N
338338
0131 0307; 0069 0307; Dotless_Form # LATIN SMALL LETTER DOTLESS I, COMBINING DOT ABOVE; LATIN SMALL LETTER I, COMBINING DOT ABOVE
@@ -432,7 +432,7 @@
432432
107A6 0322; 107A7; Precomposed_Form # MODIFIER LETTER SMALL TURNED R WITH LONG LEG, COMBINING RETROFLEX HOOK BELOW; MODIFIER LETTER SMALL TURNED R WITH LONG LEG AND RETROFLEX HOOK
433433
107AC 0322; 107AD; Precomposed_Form # MODIFIER LETTER SMALL TS DIGRAPH, COMBINING RETROFLEX HOOK BELOW; MODIFIER LETTER SMALL TS DIGRAPH WITH RETROFLEX HOOK
434434

435-
# Arabic, from text of Section 9.2 and the NamesList
435+
# Arabic, from text in the "Arabic" section of the core specification, and the NamesList
436436
0649 0654; 0626; Hamza_Form # ARABIC LETTER ALEF MAKSURA, ARABIC HAMZA ABOVE; ARABIC LETTER YEH WITH HAMZA ABOVE
437437
064E 064E; 064B; Arabic_Tashkil # ARABIC FATHA, ARABIC FATHA; ARABIC FATHATAN
438438
0650 0650; 064D; Arabic_Tashkil # ARABIC KASRA, ARABIC KASRA; ARABIC KASRATAN
@@ -442,11 +442,11 @@
442442
0677; 0674 06C7; Preferred_Spelling # ARABIC LETTER U WITH HAMZA ABOVE; ARABIC LETTER HIGH HAMZA, ARABIC LETTER U
443443
0678; 0674 0649; Preferred_Spelling # ARABIC LETTER HIGH HAMZA YEH; ARABIC LETTER HIGH HAMZA, ARABIC LETTER ALEF MAKSURA
444444

445-
# Devanagari, from Section 12.1 and the NamesList
445+
# Devanagari, from the "Devanagari" section of the core specification, and the NamesList
446446
0953; 0300; Discouraged # DEVANAGARI GRAVE ACCENT; COMBINING GRAVE ACCENT
447447
0954; 0301; Discouraged # DEVANAGARI ACUTE ACCENT; COMBINING ACUTE ACCENT
448448

449-
# Bengali, from Section 12.2
449+
# Bengali, from the "Bengali (Bangla)" section of the core specification
450450
09A4 09CD 200D; 09CE; Bengali_Khanda_Ta # BENGALI LETTER TA, BENGALI SIGN VIRAMA, ZERO WIDTH JOINER; BENGALI LETTER KHANDA TA
451451

452452
# Gujarati, from the NamesList
@@ -462,11 +462,11 @@
462462
0D32 0D4D 200D; 0D7D; Malayalam_Chillu # MALAYALAM LETTER LA, MALAYALAM SIGN VIRAMA, ZERO WIDTH JOINER; MALAYALAM LETTER CHILLU L
463463
0D33 0D4D 200D; 0D7E; Malayalam_Chillu # MALAYALAM LETTER LLA, MALAYALAM SIGN VIRAMA, ZERO WIDTH JOINER; MALAYALAM LETTER CHILLU LL
464464

465-
# Tibetan, from text of Section 13.4, the NamesList, and the decompositions
465+
# Tibetan, from text in the "Tibetan" section of the core specification, the NamesList, and the decompositions
466466
0F77; 0FB2 0F71 0F80; Deprecated # TIBETAN VOWEL SIGN VOCALIC RR; TIBETAN SUBJOINED LETTER RA, TIBETAN VOWEL SIGN AA, TIBETAN VOWEL SIGN REVERSED I
467467
0F79; 0FB3 0F71 0F80; Deprecated # TIBETAN VOWEL SIGN VOCALIC LL; TIBETAN SUBJOINED LETTER LA, TIBETAN VOWEL SIGN AA, TIBETAN VOWEL SIGN REVERSED I
468468

469-
# Khmer, from text of Section 16.4 and the NamesList
469+
# Khmer, from text in the "Khmer" section of the core specification, and the NamesList
470470
17A3; 17A2; Deprecated # KHMER INDEPENDENT VOWEL QAQ; KHMER LETTER QA
471471
17A4; 17A2 17B6; Deprecated # KHMER INDEPENDENT VOWEL QAA; KHMER LETTER QA, KHMER VOWEL SIGN AA
472472
17D8; 17D4 179B 17D4; Discouraged # KHMER SIGN BEYYAL; KHMER SIGN KHAN, KHMER LETTER LO, KHMER SIGN KHAN

unicodetools/data/ucd/dev/Jamo.txt

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Jamo-17.0.0.txt
2-
# Date: 2024-02-02
3-
# © 2024 Unicode®, Inc.
2+
# Date: 2025-07-30
3+
# © 2025 Unicode®, Inc.
44
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
55
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
66
#
@@ -9,7 +9,7 @@
99
#
1010
# This file defines the Jamo_Short_Name property.
1111
#
12-
# See Section 3.12 of the core specification of the Unicode Standard
12+
# See the "Conformance" / "Conjoining Jamo Behavior" section of the core specification
1313
# for more information.
1414
#
1515
# Each line contains two fields, separated by a semicolon.

unicodetools/data/ucd/dev/NamedSequences.txt

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# NamedSequences-17.0.0.txt
2-
# Date: 2024-02-02
3-
# © 2024 Unicode®, Inc.
2+
# Date: 2025-07-30
3+
# © 2025 Unicode®, Inc.
44
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
55
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
66
#
@@ -209,8 +209,8 @@ BENGALI LETTER KHINYA;0995 09CD 09B7
209209
# Provisional 2008-02-08, Approved 2009-08-14
210210
#
211211
# A visual display of the Tamil named character sequences is available
212-
# in the documentation for the Unicode Standard. See Section 12.6, Tamil in
213-
# https://www.unicode.org/versions/latest/
212+
# in the documentation for the Unicode Standard.
213+
# See the "Tamil" section in the core specification.
214214

215215
TAMIL CONSONANT K; 0B95 0BCD
216216
TAMIL CONSONANT NG; 0B99 0BCD

unicodetools/data/ucd/dev/SpecialCasing.txt

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
# SpecialCasing-17.0.0.txt
2-
# Date: 2025-07-29, 22:01:10 GMT
2+
# Date: 2025-07-31, 22:11:55 GMT
33
# © 2025 Unicode®, Inc.
44
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
55
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -47,8 +47,8 @@
4747
#
4848
# A language ID is defined by BCP 47, with '-' and '_' treated equivalently.
4949
#
50-
# A casing context for a character is defined by Section 3.13 Default Case Algorithms
51-
# of The Unicode Standard.
50+
# A casing context for a character is defined in the
51+
# "Conformance" / "Default Case Algorithms" section of the core specification.
5252
#
5353
# Parsers of this file must be prepared to deal with future additions to this format:
5454
# * Additional contexts
