Commit 53a6ebf

Fix segmentation tests (#1075)
* Never return false from an override of applyPropertyAlias
* Work around deficient symbol table
* Regenerate UCD
* More =true
* Revert "Never return false from an override of applyPropertyAlias"
  This reverts commit 9b45e96.
* Revert "More =true"
  This reverts commit 0360f99.
1 parent 1ad8248 commit 53a6ebf

6 files changed: 24 additions, 24 deletions

unicodetools/data/ucd/dev/auxiliary/GraphemeBreakTest.html

Lines changed: 6 additions & 6 deletions
@@ -7,7 +7,7 @@
<body bgcolor='#FFFFFF'>
<h2>Grapheme_Cluster_Break Chart</h2>
<p><b>Unicode Version:</b> 17.0.0</p>
-<p><b>Date:</b> 2025-02-14, 00:14:44 GMT</p>
+<p><b>Date:</b> 2025-03-24, 14:45:55 GMT</p>
<p>This page illustrates the application of the Grapheme_Cluster_Break specification. The material here is informative, not normative.</p> <p>The first chart shows where breaks would appear between different sample characters or strings. The sample characters are chosen mechanically to represent the different properties used by the specification.</p><p>Each cell shows the break-status for the position between the character(s) in its row header and the character(s) in its column header. The × symbol indicates no break, while the ÷ symbol indicates a break. The cells with × are also shaded to make it easier to scan the table. For example, in the cell at the intersection of the row headed by “CR” and the column headed by “LF”, there is a × symbol, indicating that there is no break between CR and LF.</p>
<p>After the heavy blue line in the table are additional rows, either with different sample characters or for sequences. </p><p>In the row and column headers of the <a href='#table'>Table</a>, in the <a href='#rules'>Rules</a>, when hovering over characters in the <a href='#samples'>Samples</a>, and in the comments in the associated list of test cases <a href='GraphemeBreakTest.txt'>GraphemeBreakTest.txt</a>:</p>
<ol><li>The following sets are used:<ul>
@@ -24,7 +24,7 @@ <h2>Grapheme_Cluster_Break Chart</h2>
<li>
ExtPict
=
-\p{Extended_Pictographic}
+\p{Extended_Pictographic=True}
</li>
<li>
LinkingConsonant
@@ -232,15 +232,15 @@ <h3><a href='#samples' name='samples'>Sample Strings</a></h3>

</font></td></tr>
<tr><th style='text-align:right'><a href='#s23' name='s23'>23</a></th><td><font size='5'>
-<span title='0.2'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span><span title='U+2701 UPPER BLADE SCISSORS (ExtPict)'>&#x2701;</span><span title='9.0'><span>&nbsp;</span>&nbsp;</span>
-<span title='U+200D ZERO WIDTH JOINER (ZWJ)'>&#x25A1;</span><span title='11.0'><span>&nbsp;</span>&nbsp;</span>
-<span title='U+2701 UPPER BLADE SCISSORS (ExtPict)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
+<span title='0.2'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span><span title='U+2701 UPPER BLADE SCISSORS (XXmLinkingConsonantmExtPict)'>&#x2701;</span><span title='9.0'><span>&nbsp;</span>&nbsp;</span>
+<span title='U+200D ZERO WIDTH JOINER (ZWJ)'>&#x25A1;</span><span title='999.0'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
+<span title='U+2701 UPPER BLADE SCISSORS (XXmLinkingConsonantmExtPict)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>

</font></td></tr>
<tr><th style='text-align:right'><a href='#s24' name='s24'>24</a></th><td><font size='5'>
<span title='0.2'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span><span title='U+0061 LATIN SMALL LETTER A (XXmLinkingConsonantmExtPict)'>a</span><span title='9.0'><span>&nbsp;</span>&nbsp;</span>
<span title='U+200D ZERO WIDTH JOINER (ZWJ)'>&#x25A1;</span><span title='999.0'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
-<span title='U+2701 UPPER BLADE SCISSORS (ExtPict)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
+<span title='U+2701 UPPER BLADE SCISSORS (XXmLinkingConsonantmExtPict)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>

</font></td></tr>
<tr><th style='text-align:right'><a href='#s25' name='s25'>25</a></th><td><font size='5'>

unicodetools/data/ucd/dev/auxiliary/GraphemeBreakTest.txt

Lines changed: 3 additions & 3 deletions
@@ -1,5 +1,5 @@
# GraphemeBreakTest-17.0.0.txt
-# Date: 2025-02-14, 00:14:44 GMT
+# Date: 2025-03-24, 14:45:55 GMT
# © 2025 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -768,8 +768,8 @@
÷ 1F476 × 1F3FF × 0308 × 200D × 1F476 × 1F3FF ÷ # ÷ [0.2] BABY (ExtPict) × [9.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend_ConjunctExtendermConjunctLinker) × [9.0] COMBINING DIAERESIS (Extend_ConjunctExtendermConjunctLinker) × [9.0] ZERO WIDTH JOINER (ZWJ) × [11.0] BABY (ExtPict) × [9.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend_ConjunctExtendermConjunctLinker) ÷ [0.3]
÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [9.0] ZERO WIDTH JOINER (ZWJ) × [11.0] OCTAGONAL SIGN (ExtPict) ÷ [0.3]
÷ 0061 × 200D ÷ 1F6D1 ÷ # ÷ [0.2] LATIN SMALL LETTER A (XXmLinkingConsonantmExtPict) × [9.0] ZERO WIDTH JOINER (ZWJ) ÷ [999.0] OCTAGONAL SIGN (ExtPict) ÷ [0.3]
-÷ 2701 × 200D × 2701 ÷ # ÷ [0.2] UPPER BLADE SCISSORS (ExtPict) × [9.0] ZERO WIDTH JOINER (ZWJ) × [11.0] UPPER BLADE SCISSORS (ExtPict) ÷ [0.3]
-÷ 0061 × 200D ÷ 2701 ÷ # ÷ [0.2] LATIN SMALL LETTER A (XXmLinkingConsonantmExtPict) × [9.0] ZERO WIDTH JOINER (ZWJ) ÷ [999.0] UPPER BLADE SCISSORS (ExtPict) ÷ [0.3]
+÷ 2701 × 200D ÷ 2701 ÷ # ÷ [0.2] UPPER BLADE SCISSORS (XXmLinkingConsonantmExtPict) × [9.0] ZERO WIDTH JOINER (ZWJ) ÷ [999.0] UPPER BLADE SCISSORS (XXmLinkingConsonantmExtPict) ÷ [0.3]
+÷ 0061 × 200D ÷ 2701 ÷ # ÷ [0.2] LATIN SMALL LETTER A (XXmLinkingConsonantmExtPict) × [9.0] ZERO WIDTH JOINER (ZWJ) ÷ [999.0] UPPER BLADE SCISSORS (XXmLinkingConsonantmExtPict) ÷ [0.3]
÷ 0915 ÷ 0924 ÷ # ÷ [0.2] DEVANAGARI LETTER KA (LinkingConsonant) ÷ [999.0] DEVANAGARI LETTER TA (LinkingConsonant) ÷ [0.3]
÷ 0915 × 094D × 0924 ÷ # ÷ [0.2] DEVANAGARI LETTER KA (LinkingConsonant) × [9.0] DEVANAGARI SIGN VIRAMA (Extend_ConjunctLinker) × [9.3] DEVANAGARI LETTER TA (LinkingConsonant) ÷ [0.3]
÷ 0915 × 094D × 094D × 0924 ÷ # ÷ [0.2] DEVANAGARI LETTER KA (LinkingConsonant) × [9.0] DEVANAGARI SIGN VIRAMA (Extend_ConjunctLinker) × [9.0] DEVANAGARI SIGN VIRAMA (Extend_ConjunctLinker) × [9.3] DEVANAGARI LETTER TA (LinkingConsonant) ÷ [0.3]
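
For readers unfamiliar with the test-file syntax: each data line lists hexadecimal code points separated by ÷ (break) or × (no break), followed by a `#` comment. The following is a minimal Python sketch for reading such a line; it is not part of this commit or of unicodetools, and the function name `parse_break_test_line` is hypothetical. It compares the old and new expectations for the U+2701 U+200D U+2701 sequence changed in this hunk.

```python
def parse_break_test_line(line):
    """Split one test line into code points and expected boundary flags.

    Hypothetical helper for illustration; not part of unicodetools.
    """
    data = line.split("#", 1)[0].split()  # drop the trailing comment
    if not data:
        return None  # blank or comment-only line
    code_points = [int(tok, 16) for tok in data if tok not in ("÷", "×")]
    # One flag per boundary, including start and end of text: True means ÷.
    breaks = [tok == "÷" for tok in data if tok in ("÷", "×")]
    if len(breaks) != len(code_points) + 1:
        raise ValueError(f"malformed test line: {line!r}")
    return code_points, breaks


old = parse_break_test_line("÷ 2701 × 200D × 2701 ÷ # old expectation")
new = parse_break_test_line("÷ 2701 × 200D ÷ 2701 ÷ # new expectation")

# Indices of boundaries whose expected status differs between the two versions.
changed = [i for i, (a, b) in enumerate(zip(old[1], new[1])) if a != b]
print(changed)  # [2]: the boundary between U+200D and the second U+2701
```

Boundary index `i` is the position before the `i`-th code point, so index 2 is the position after the ZERO WIDTH JOINER.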

unicodetools/data/ucd/dev/auxiliary/LineBreakTest.html

Lines changed: 2 additions & 2 deletions
@@ -7,7 +7,7 @@
<body bgcolor='#FFFFFF'>
<h2>Line_Break Chart</h2>
<p><b>Unicode Version:</b> 17.0.0</p>
-<p><b>Date:</b> 2025-02-14, 17:30:27 GMT</p>
+<p><b>Date:</b> 2025-03-24, 14:45:57 GMT</p>
<p>This page illustrates the application of the Line_Break specification. The material here is informative, not normative.</p> <p>The first chart shows where breaks would appear between different sample characters or strings. The sample characters are chosen mechanically to represent the different properties used by the specification.</p><p>Each cell shows the break-status for the position between the character(s) in its row header and the character(s) in its column header. The symbol × indicates a prohibited break, even with intervening spaces; the ÷ symbol indicates a (direct) break; the symbol ∻ indicates a break only in the presence of an intervening space (an indirect break).The cells with × or ∻ are also shaded to make it easier to scan the table. For example, in the cell at the intersection of the row headed by “CR” and the column headed by “LF”, there is a × symbol, indicating that there is no break between CR and LF.</p>
<p></p><p>In the row and column headers of the <a href='#table'>Table</a>, in the <a href='#rules'>Rules</a>, when hovering over characters in the <a href='#samples'>Samples</a>, and in the comments in the associated list of test cases <a href='LineBreakTest.txt'>LineBreakTest.txt</a>:</p>
<ol><li>The following sets are used:<ul>
@@ -49,7 +49,7 @@ <h2>Line_Break Chart</h2>
<li>
ExtPictUnassigned
=
-[\p{Extended_Pictographic}&\p{gc=Cn}]
+[\p{Extended_Pictographic=True}&\p{gc=Cn}]
</li>
<li>
NS

unicodetools/data/ucd/dev/auxiliary/WordBreakTest.html

Lines changed: 7 additions & 7 deletions
@@ -7,7 +7,7 @@
<body bgcolor='#FFFFFF'>
<h2>Word_Break Chart</h2>
<p><b>Unicode Version:</b> 17.0.0</p>
-<p><b>Date:</b> 2024-11-27, 17:44:59 GMT</p>
+<p><b>Date:</b> 2025-03-24, 14:46:35 GMT</p>
<p>This page illustrates the application of the Word_Break specification. The material here is informative, not normative.</p> <p>The first chart shows where breaks would appear between different sample characters or strings. The sample characters are chosen mechanically to represent the different properties used by the specification.</p><p>Each cell shows the break-status for the position between the character(s) in its row header and the character(s) in its column header. The × symbol indicates no break, while the ÷ symbol indicates a break. The cells with × are also shaded to make it easier to scan the table. For example, in the cell at the intersection of the row headed by “CR” and the column headed by “LF”, there is a × symbol, indicating that there is no break between CR and LF.</p>
<p>After the heavy blue line in the table are additional rows, either with different sample characters or for sequences, such as “ALetter MidLetter”. </p><p>In the row and column headers of the <a href='#table'>Table</a>, in the <a href='#rules'>Rules</a>, when hovering over characters in the <a href='#samples'>Samples</a>, and in the comments in the associated list of test cases <a href='WordBreakTest.txt'>WordBreakTest.txt</a>:</p>
<ol><li>The following sets are used:<ul>
@@ -19,7 +19,7 @@ <h2>Word_Break Chart</h2>
<li>
ExtPict
=
-\p{Extended_Pictographic}
+\p{Extended_Pictographic=True}
</li>
<li>
MidNumLetQ
@@ -292,15 +292,15 @@ <h3><a href='#samples' name='samples'>Sample Strings</a></h3>

</font></td></tr>
<tr><th style='text-align:right'><a href='#s27' name='s27'>27</a></th><td><font size='5'>
-<span title='0.2'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span><span title='U+2701 UPPER BLADE SCISSORS (ExtPictmALetter)'>&#x2701;</span><span title='4.0'><span>&nbsp;</span>&nbsp;</span>
-<span title='U+200D ZERO WIDTH JOINER (ZWJ)'>&#x25A1;</span><span title='3.3'><span>&nbsp;</span>&nbsp;</span>
-<span title='U+2701 UPPER BLADE SCISSORS (ExtPictmALetter)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
+<span title='0.2'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span><span title='U+2701 UPPER BLADE SCISSORS (XXmExtPict)'>&#x2701;</span><span title='4.0'><span>&nbsp;</span>&nbsp;</span>
+<span title='U+200D ZERO WIDTH JOINER (ZWJ)'>&#x25A1;</span><span title='999.0'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
+<span title='U+2701 UPPER BLADE SCISSORS (XXmExtPict)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>

</font></td></tr>
<tr><th style='text-align:right'><a href='#s28' name='s28'>28</a></th><td><font size='5'>
<span title='0.2'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span><span title='U+0061 LATIN SMALL LETTER A (ALettermExtPict)'>a</span><span title='4.0'><span>&nbsp;</span>&nbsp;</span>
-<span title='U+200D ZERO WIDTH JOINER (ZWJ)'>&#x25A1;</span><span title='3.3'><span>&nbsp;</span>&nbsp;</span>
-<span title='U+2701 UPPER BLADE SCISSORS (ExtPictmALetter)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
+<span title='U+200D ZERO WIDTH JOINER (ZWJ)'>&#x25A1;</span><span title='999.0'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
+<span title='U+2701 UPPER BLADE SCISSORS (XXmExtPict)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>

</font></td></tr>
<tr><th style='text-align:right'><a href='#s29' name='s29'>29</a></th><td><font size='5'>

unicodetools/data/ucd/dev/auxiliary/WordBreakTest.txt

Lines changed: 3 additions & 3 deletions
@@ -1,5 +1,5 @@
# WordBreakTest-17.0.0.txt
-# Date: 2025-01-27, 18:09:43 GMT
+# Date: 2025-03-24, 14:46:35 GMT
# © 2025 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -1850,8 +1850,8 @@
÷ 1F476 × 1F3FF ÷ 1F476 ÷ # ÷ [0.2] BABY (ExtPictmALetter) × [4.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend) ÷ [999.0] BABY (ExtPictmALetter) ÷ [0.3]
÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPictmALetter) × [4.0] ZERO WIDTH JOINER (ZWJ) × [3.3] OCTAGONAL SIGN (ExtPictmALetter) ÷ [0.3]
÷ 0061 × 200D × 1F6D1 ÷ # ÷ [0.2] LATIN SMALL LETTER A (ALettermExtPict) × [4.0] ZERO WIDTH JOINER (ZWJ) × [3.3] OCTAGONAL SIGN (ExtPictmALetter) ÷ [0.3]
-÷ 2701 × 200D × 2701 ÷ # ÷ [0.2] UPPER BLADE SCISSORS (ExtPictmALetter) × [4.0] ZERO WIDTH JOINER (ZWJ) × [3.3] UPPER BLADE SCISSORS (ExtPictmALetter) ÷ [0.3]
-÷ 0061 × 200D × 2701 ÷ # ÷ [0.2] LATIN SMALL LETTER A (ALettermExtPict) × [4.0] ZERO WIDTH JOINER (ZWJ) × [3.3] UPPER BLADE SCISSORS (ExtPictmALetter) ÷ [0.3]
+÷ 2701 × 200D ÷ 2701 ÷ # ÷ [0.2] UPPER BLADE SCISSORS (XXmExtPict) × [4.0] ZERO WIDTH JOINER (ZWJ) ÷ [999.0] UPPER BLADE SCISSORS (XXmExtPict) ÷ [0.3]
+÷ 0061 × 200D ÷ 2701 ÷ # ÷ [0.2] LATIN SMALL LETTER A (ALettermExtPict) × [4.0] ZERO WIDTH JOINER (ZWJ) ÷ [999.0] UPPER BLADE SCISSORS (XXmExtPict) ÷ [0.3]
÷ 1F476 × 1F3FF × 0308 × 200D × 1F476 × 1F3FF ÷ # ÷ [0.2] BABY (ExtPictmALetter) × [4.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend) × [4.0] COMBINING DIAERESIS (Extend) × [4.0] ZERO WIDTH JOINER (ZWJ) × [3.3] BABY (ExtPictmALetter) × [4.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend) ÷ [0.3]
÷ 1F6D1 × 1F3FF ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPictmALetter) × [4.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend) ÷ [0.3]
÷ 200D × 1F6D1 × 1F3FF ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ) × [3.3] OCTAGONAL SIGN (ExtPictmALetter) × [4.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend) ÷ [0.3]
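
The comment part of each test line pairs every boundary symbol with a bracketed number such as `[4.0]` or `[999.0]`; these appear to identify the rule that decided the boundary (the HTML charts carry the same numbers in `title` attributes), though that reading is an assumption here. A small Python sketch for pulling those pairs out of a line follows; the helper name `boundary_annotations` is hypothetical and not part of unicodetools.

```python
import re

# Matches a boundary symbol followed by a bracketed number such as "× [4.0]".
BOUNDARY = re.compile(r"([÷×])\s*\[(\d+(?:\.\d+)?)\]")


def boundary_annotations(line):
    """Return (symbol, number) pairs from the comment part of a test line.

    Hypothetical helper for illustration; not part of unicodetools.
    """
    _, _, comment = line.partition("#")
    return BOUNDARY.findall(comment)


new_line = (
    "÷ 2701 × 200D ÷ 2701 ÷ # ÷ [0.2] UPPER BLADE SCISSORS (XXmExtPict) "
    "× [4.0] ZERO WIDTH JOINER (ZWJ) ÷ [999.0] UPPER BLADE SCISSORS (XXmExtPict) ÷ [0.3]"
)
annotations = boundary_annotations(new_line)
print(annotations)
```

Run against the old and new versions of a line, this makes it easy to see that the annotation after the ZERO WIDTH JOINER changed from `× [3.3]` to `÷ [999.0]`.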

unicodetools/src/main/resources/org/unicode/tools/SegmenterDefault.txt

Lines changed: 3 additions & 3 deletions
@@ -20,7 +20,7 @@ $ConjunctLinker=\p{Indic_Conjunct_Break=Linker}
$LinkingConsonant=\p{Indic_Conjunct_Break=Consonant}
## $E_Base=\p{Grapheme_Cluster_Break=E_Base}
## $E_Modifier=\p{Grapheme_Cluster_Break=E_Modifier}
-$ExtPict=\p{Extended_Pictographic}
+$ExtPict=\p{Extended_Pictographic=True}
$ConjunctExtender=[\p{Indic_Conjunct_Break=Linker}\p{Indic_Conjunct_Break=Extend}]
## $EBG=\p{Grapheme_Cluster_Break=E_Base_GAZ}
## $Glue_After_Zwj=\p{Grapheme_Cluster_Break=Glue_After_Zwj}
@@ -124,7 +124,7 @@ $DottedCircle = [◌]
$CPmEastAsian=[$CP-$EastAsian]
$OPmEastAsian=[$OP-$EastAsian]

-$ExtPictUnassigned=[\p{Extended_Pictographic}&\p{gc=Cn}]
+$ExtPictUnassigned=[\p{Extended_Pictographic=True}&\p{gc=Cn}]

# Some rules refer to the start and end of text. We could just use a literal ^ for sot, but naming
# it as in the spec makes it easier to compare. The parser will eat (and choke on) $, so we play a
@@ -364,7 +364,7 @@ $Single_Quote=\p{Word_Break=Single_Quote}
## $E_Modifier=\p{Word_Break=E_Modifier}
$ZWJ=\p{Word_Break=ZWJ}
# Note: The following may overlap with the above
-$ExtPict=\p{Extended_Pictographic}
+$ExtPict=\p{Extended_Pictographic=True}
## $EBG=\p{Word_Break=E_Base_GAZ}
## $Glue_After_Zwj=\p{Word_Break=Glue_After_Zwj}
$WSegSpace=\p{Word_Break=WSegSpace}
