Commit 27875cb

Generate Java monkeys (#1006)
* meow
* Greedy context before in LB20a
* Regenerate UCD
1 parent 65c1aa5 commit 27875cb

4 files changed: 64 additions and 7 deletions

unicodetools/data/ucd/dev/auxiliary/LineBreakTest.html

Lines changed: 4 additions & 4 deletions
@@ -7,7 +7,7 @@
 <body bgcolor='#FFFFFF'>
 <h2>Line_Break Chart</h2>
 <p><b>Unicode Version:</b> 17.0.0</p>
-<p><b>Date:</b> 2024-11-28, 01:27:49 GMT</p>
+<p><b>Date:</b> 2025-01-28, 00:21:01 GMT</p>
 <p>This page illustrates the application of the Line_Break specification. The material here is informative, not normative.</p> <p>The first chart shows where breaks would appear between different sample characters or strings. The sample characters are chosen mechanically to represent the different properties used by the specification.</p><p>Each cell shows the break-status for the position between the character(s) in its row header and the character(s) in its column header. The symbol × indicates a prohibited break, even with intervening spaces; the ÷ symbol indicates a (direct) break; the symbol ∻ indicates a break only in the presence of an intervening space (an indirect break).The cells with × or ∻ are also shaded to make it easier to scan the table. For example, in the cell at the intersection of the row headed by “CR” and the column headed by “LF”, there is a × symbol, indicating that there is no break between CR and LF.</p>
 <p></p><p>In the row and column headers of the <a href='#table'>Table</a>, in the <a href='#rules'>Rules</a>, when hovering over characters in the <a href='#samples'>Samples</a>, and in the comments in the associated list of test cases <a href='LineBreakTest.txt'>LineBreakTest.txt</a>:</p>
 <ol><li>The following sets are used:<ul>
@@ -226,7 +226,7 @@ <h3><a href='#rules' name='rules'>Rules</a></h3>
 <tr><th style='text-align:right'><a href='#r13.03' name='r13.03'>13.03</a></th><td style='text-align:right'></td><td>×</td><td> CP</td></tr>
 <tr><th style='text-align:right'><a href='#r13.04' name='r13.04'>13.04</a></th><td style='text-align:right'></td><td>×</td><td> SY</td></tr>
 <tr><th style='text-align:right'><a href='#r14.0' name='r14.0'>14.0</a></th><td style='text-align:right'>OP SP* </td><td>×</td><td></td></tr>
-<tr><th style='text-align:right'><a href='#r15.11' name='r15.11'>15.11</a></th><td style='text-align:right'>( sot | BK | CR | LF | NL | OP | QU | GL | SP | ZW ) QU_Pi SP* </td><td>×</td><td></td></tr>
+<tr><th style='text-align:right'><a href='#r15.11' name='r15.11'>15.11</a></th><td style='text-align:right'>( BK | CR | LF | NL | OP | QU | GL | SP | ZW | sot ) QU_Pi SP* </td><td>×</td><td></td></tr>
 <tr><th style='text-align:right'><a href='#r15.21' name='r15.21'>15.21</a></th><td style='text-align:right'></td><td>×</td><td> QU_Pf ( SP | GL | WJ | CL | QU | CP | EX | IS | SY | BK | CR | LF | NL | ZW | eot )</td></tr>
 <tr><th style='text-align:right'><a href='#r15.3' name='r15.3'>15.3</a></th><td style='text-align:right'>SP </td><td>÷</td><td> IS NU</td></tr>
 <tr><th style='text-align:right'><a href='#r15.4' name='r15.4'>15.4</a></th><td style='text-align:right'></td><td>×</td><td> IS</td></tr>
@@ -238,10 +238,10 @@ <h3><a href='#rules' name='rules'>Rules</a></h3>
 <tr><th style='text-align:right'><a href='#r19.1' name='r19.1'>19.1</a></th><td style='text-align:right'>[^EastAsian] </td><td>×</td><td> QU</td></tr>
 <tr><th style='text-align:right'><a href='#r19.11' name='r19.11'>19.11</a></th><td style='text-align:right'></td><td>×</td><td> QU ( [^EastAsian] | eot )</td></tr>
 <tr><th style='text-align:right'><a href='#r19.12' name='r19.12'>19.12</a></th><td style='text-align:right'>QU </td><td>×</td><td> [^EastAsian]</td></tr>
-<tr><th style='text-align:right'><a href='#r19.13' name='r19.13'>19.13</a></th><td style='text-align:right'>( sot | [^EastAsian] ) QU </td><td>×</td><td></td></tr>
+<tr><th style='text-align:right'><a href='#r19.13' name='r19.13'>19.13</a></th><td style='text-align:right'>( [^EastAsian] | sot ) QU </td><td>×</td><td></td></tr>
 <tr><th style='text-align:right'><a href='#r20.01' name='r20.01'>20.01</a></th><td style='text-align:right'></td><td>÷</td><td> CB</td></tr>
 <tr><th style='text-align:right'><a href='#r20.02' name='r20.02'>20.02</a></th><td style='text-align:right'>CB </td><td>÷</td><td></td></tr>
-<tr><th style='text-align:right'><a href='#r20.1' name='r20.1'>20.1</a></th><td style='text-align:right'>( sot | BK | CR | LF | NL | SP | ZW | CB | GL ) ( HY | Hyphen ) </td><td>×</td><td> AL</td></tr>
+<tr><th style='text-align:right'><a href='#r20.1' name='r20.1'>20.1</a></th><td style='text-align:right'>( BK | CR | LF | NL | SP | ZW | CB | GL | sot ) ( HY | Hyphen ) </td><td>×</td><td> AL</td></tr>
 <tr><th style='text-align:right'><a href='#r21.01' name='r21.01'>21.01</a></th><td style='text-align:right'></td><td>×</td><td> BA</td></tr>
 <tr><th style='text-align:right'><a href='#r21.02' name='r21.02'>21.02</a></th><td style='text-align:right'></td><td>×</td><td> HY</td></tr>
 <tr><th style='text-align:right'><a href='#r21.03' name='r21.03'>21.03</a></th><td style='text-align:right'></td><td>×</td><td> NS</td></tr>

unicodetools/src/main/java/org/unicode/text/UCD/GenerateBreakTest.java

Lines changed: 31 additions & 0 deletions
@@ -480,6 +480,7 @@ value, new ParsePosition(0), IUP.getXSymbolTable()))) {

         generateTest(false, path, outFilename, propertyName);
         generateCppOldMonkeys(extraPath, outFilename);
+        generateJavaOldMonkeys(extraPath, outFilename);
     }

     private void generateCppOldMonkeys(String path, String outFilename) throws IOException {
@@ -512,6 +513,36 @@ private void generateCppOldMonkeys(String path, String outFilename) throws IOExc
         fc.close();
     }

+    private void generateJavaOldMonkeys(String path, String outFilename) throws IOException {
+        final UnicodeDataFile fc = UnicodeDataFile.openAndWriteHeader(path, outFilename + ".java");
+        final PrintWriter out = fc.out;
+        out.println();
+        out.println("####### Instructions ###################################");
+        out.println("# Copy the following lines into RBBITestMonkey.java in #");
+        out.println(
+                "# ICU4J, in the constructor of RBBIMeowMonkey, replacing #"
+                        .replace("Meow", outFilename.substring(0, 4).replace("Graph", "Char")));
+        out.println("# the existing block of generated code. #");
+        out.println("########################################################");
+        out.println();
+        out.println(" // --- NOLI ME TANGERE ---");
+        out.println(" // Generated by GenerateBreakTest.java in the Unicode tools.");
+        for (Segmenter.Builder.NamedRefinedSet part : segmenter.getPartitionDefinition()) {
+            out.println(
+                    " partition.add(new NamedSet(\""
+                            + part.getName().replace("\\", "\\\\").replace("\"", "\\\"")
+                            + "\", new UnicodeSet(\""
+                            + part.getDefinition().replace("\\", "\\\\").replace("\"", "\\\"")
+                            + "\")));");
+        }
+        out.println();
+        for (Segmenter.SegmentationRule rule : segmenter.getRules()) {
+            out.println(" rules.add(" + rule.toJavaOldMonkeyString() + ");");
+        }
+        out.println(" // --- End of generated code. ---");
+        fc.close();
+    }
+
     private void generateTest(
             boolean shortVersion, String path, String outFilename, String propertyName)
             throws IOException {
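
The generated Java lines above embed set names and definitions in string literals by escaping backslashes first and double quotes second. The following is a minimal sketch of that replace chain in isolation; the class name, the helper names, and the sample set definition are hypothetical and do not appear in the commit.

```java
final class JavaLiteralEscaper {
    /** Escapes backslashes first, then double quotes, like the replace chain in the diff. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Wraps the escaped value in quotes so it can be pasted into Java source. */
    static String toLiteral(String s) {
        return "\"" + escape(s) + "\"";
    }

    public static void main(String[] args) {
        // Hypothetical set definition, not taken from SegmenterDefault.txt.
        System.out.println(toLiteral("[\\p{Line_Break=QU}\"]"));
    }
}
```

The order of the two replacements matters: if quotes were escaped first, the backslashes introduced for them would be doubled by the second step.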

unicodetools/src/main/java/org/unicode/tools/Segmenter.java

Lines changed: 26 additions & 0 deletions
@@ -283,6 +283,8 @@ public String toString() {
         }

         public abstract String toCppOldMonkeyString();
+
+        public abstract String toJavaOldMonkeyString();
     }

     /** A « treat as » rule. */
@@ -390,6 +392,17 @@ public String toCppOldMonkeyString() {
                     + replacement
                     + ")\")";
         }
+
+        @Override
+        public String toJavaOldMonkeyString() {
+            return "new RemapRule(\""
+                    + name.replace("\\", "\\\\").replace("\"", "\\\"")
+                    + "\", \""
+                    + patternDefinition.replace("\\", "\\\\").replace("\"", "\\\"")
+                    + "\", \""
+                    + replacement.replace("\\", "\\\\").replace("\"", "\\\"")
+                    + "\")";
+        }
     }

     /** A rule that determines the status of an offset. */
@@ -487,6 +500,19 @@ public String toCppOldMonkeyString() {
                     + ")\")";
         }

+        @Override
+        public String toJavaOldMonkeyString() {
+            return "new RegexRule(\""
+                    + name.replace("\\", "\\\\").replace("\"", "\\\"")
+                    + "\", \""
+                    + beforeDefinition.replace("\\", "\\\\").replace("\"", "\\\"")
+                    + "\", Resolution."
+                    + breaks.name()
+                    + ", \""
+                    + afterDefinition.replace("\\", "\\\\").replace("\"", "\\\"")
+                    + "\")";
+        }
+
         // ============== Internals ================
         // We cannot use a single regex of the form "(?<= before) after" because
         // (RI RI)* RI × RI would require unbounded lookbehind.
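
The comment above explains why the "before" and "after" contexts are kept as separate patterns. One way to apply such a pair without lookbehind is sketched below: the "before" pattern is matched against the text preceding the offset and anchored at its end, and the "after" pattern is matched from the offset onward. This is an illustration under stated assumptions, not the Segmenter's actual internals; the class `SplitContextRule`, its method names, and the use of the letter `R` as a stand-in for a regional indicator are all hypothetical.

```java
import java.util.regex.Pattern;

final class SplitContextRule {
    private final Pattern before; // must match a suffix of the text preceding the offset
    private final Pattern after;  // must match a prefix of the text following the offset

    SplitContextRule(String beforeRegex, String afterRegex) {
        // \z anchors the "before" context at the end of the prefix, so no lookbehind is needed.
        this.before = Pattern.compile("(?:" + beforeRegex + ")\\z");
        this.after = Pattern.compile(afterRegex);
    }

    /** Returns true when both contexts match around the given offset. */
    boolean appliesAt(CharSequence text, int offset) {
        CharSequence prefix = text.subSequence(0, offset);
        CharSequence suffix = text.subSequence(offset, text.length());
        return before.matcher(prefix).find() && after.matcher(suffix).lookingAt();
    }

    public static void main(String[] args) {
        // "R" stands in for a regional indicator: an odd run of R before the offset, then R.
        SplitContextRule rule = new SplitContextRule("(?:^|[^R])(?:RR)*R", "R");
        System.out.println(rule.appliesAt("RRR", 1)); // true: one R before, R after
        System.out.println(rule.appliesAt("RRR", 2)); // false: two R before the offset
    }
}
```

Because the prefix is an ordinary input to the matcher, the `(?:RR)*` repetition can be as long as needed, which a bounded lookbehind could not express.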

unicodetools/src/main/resources/org/unicode/tools/SegmenterDefault.txt

Lines changed: 3 additions & 3 deletions
@@ -183,7 +183,7 @@ $NS=[$NSorig $CJ]
 # LB 15a Do not break after an unresolved initial punctuation that lies at the start of the line,
 # after a space, after opening punctuation, or after an unresolved quotation mark, even after
 # spaces.
-15.11) ( $sot | $BK | $CR | $LF | $NL | $OP | $QU | $GL | $SP | $ZW ) $QU_Pi $SP* ×
+15.11) ( $BK | $CR | $LF | $NL | $OP | $QU | $GL | $SP | $ZW | $sot ) $QU_Pi $SP* ×
 # LB 15b Do not break before an unresolved final punctuation that lies at the end of the line, before
 # a space, before a prohibited break, or before an unresolved quotation mark, even before spaces.
 15.21) × $QU_Pf ( $SP | $GL | $WJ | $CL | $QU | $CP | $EX | $IS | $SY | $BK | $CR | $LF | $NL | $ZW | $eot )
@@ -204,12 +204,12 @@ $NS=[$NSorig $CJ]
 19.10) [^$EastAsian] × $QU
 19.11) × $QU ( [^$EastAsian] | $eot )
 19.12) $QU × [^$EastAsian]
-19.13) ( $sot | [^$EastAsian] ) $QU ×
+19.13) ( [^$EastAsian] | $sot ) $QU ×
 # LB 20 Break before and after unresolved CB.
 20.01) ÷ $CB
 20.02) $CB ÷
 # LB 20a Do not break after a hyphen that follows break opportunity, a space, or the start of text.
-20.10) ( $sot | $BK | $CR | $LF | $NL | $SP | $ZW | $CB | $GL ) ( $HY | $Hyphen ) × $AL
+20.10) ( $BK | $CR | $LF | $NL | $SP | $ZW | $CB | $GL | $sot ) ( $HY | $Hyphen ) × $AL
 # LB 21 Do not break before hyphen-minus, other hyphens, fixed-width spaces, small kana and other non-starters, or after acute accents.
 21.01) × $BA
 21.02) × $HY
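
The three rule changes above only move `$sot` from the first to the last position of an alternation. In a backtracking regex engine such as `java.util.regex`, alternatives are tried left to right, so a zero-width alternative listed first can win over one that consumes a character. The sketch below shows that general property only. Whether it is the mechanism behind the "Greedy context before" change is an assumption, and `^` and a literal space are stand-ins for `$sot` and `$SP`, not the tool's syntax; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationOrder {
    /** Returns the length of the match anchored at index 0, or -1 if there is none. */
    static int prefixMatchLength(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        // Zero-width alternative first: the match stops before the space.
        System.out.println(prefixMatchLength("(?:^| )", " -a")); // prints 0
        // Consuming alternative first: the space becomes part of the match.
        System.out.println(prefixMatchLength("(?: |^)", " -a")); // prints 1
    }
}
```

When the alternation is followed by further required context, backtracking usually makes both orders accept the same inputs, so the difference is visible mainly where the extent of the matched context is used.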
