ICU-23358 Make RBNF lenient parsing of ignorables more consistent between locales by grhoten · Pull Request #3926 · unicode-org/icu

grhoten · 2026-04-04T07:30:42Z

Right now, the %%lenient-parse rule is used inconsistently between locales in RBNF. The vast majority of locales define the same set of characters as ignorable. The exception is German, which adds equivalences for the vowels with an umlaut. We should change lenient parsing so that it does not use the collator by default, and uses it only when the locale defines the %%lenient-parse rule. Letter folding and ignorable characters should be handled separately from the collator.

This improves parsing time, and it usually reduces overall heap usage when parsing leniently. All existing tests continue to pass with these changes.
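
To illustrate the idea of handling ignorable characters and letter folding without a collator, here is a minimal self-contained sketch. It is not the code in this PR: the function names are hypothetical, the ignorable set is simply the four characters listed in the old shared rule (space, comma, hyphen, soft hyphen), and the folding is ASCII-only as a stand-in for real Unicode case folding.

```cpp
#include <cstddef>
#include <string_view>

// Assumption: a fixed ignorable set taken from the old shared
// %%lenient-parse rule: space, comma, hyphen, and soft hyphen (U+00AD).
constexpr bool isIgnorable(char32_t c) {
    return c == U' ' || c == U',' || c == U'-' || c == U'\u00AD';
}

// Assumption: ASCII-only folding, a simplified stand-in for full
// Unicode case folding.
constexpr char32_t foldAscii(char32_t c) {
    return (c >= U'A' && c <= U'Z')
        ? static_cast<char32_t>(c + (U'a' - U'A'))
        : c;
}

// Returns the number of code points of `text` consumed when `prefix`
// matches leniently at the start of `text`, or 0 when it does not match.
// Ignorables are skipped on both sides; all other characters are compared
// after folding. Ignorables in `text` that follow the matched prefix are
// not consumed.
std::size_t lenientPrefixLength(std::u32string_view text,
                                std::u32string_view prefix) {
    std::size_t t = 0;
    std::size_t p = 0;
    while (true) {
        while (p < prefix.size() && isIgnorable(prefix[p])) ++p;
        if (p == prefix.size()) return t;  // whole prefix matched
        while (t < text.size() && isIgnorable(text[t])) ++t;
        if (t == text.size() || foldAscii(text[t]) != foldAscii(prefix[p])) {
            return 0;  // ran out of text, or a real mismatch
        }
        ++t;
        ++p;
    }
}
```

With this sketch, `lenientPrefixLength(U"Twenty-Three", U"twenty three")` consumes all 12 code points, because the hyphen and the space are both skipped and the capital letters fold to lowercase. No collator object is created at any point, which is where the time and heap savings described above would be expected to come from.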

Checklist

Required: Issue filed: ICU-23358
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable
Approver: Feel free to merge on my behalf

grhoten · 2026-04-04T15:50:00Z

icu4c/source/data/rbnf/de.txt

        SpelloutRules{
            "%%lenient-parse:"
-            "&ue=\u00FC&ae=\u00E4&oe=\u00F6&[last primary ignorable ] << ' ' << ',' << '-' << '\u00AD';"
+            "&ue=\u00FC&ae=\u00E4&oe=\u00F6;"


This is the only %%lenient-parse rule left. This one is hard to get rid of.
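
One possible reason this rule is harder to replace than the ignorables: `ue=\u00FC` maps one character to two, which a simple per-character fold cannot express. The sketch below shows the effect of the three equivalences by expanding each umlauted vowel to its digraph before comparing. This is an illustrative assumption with hypothetical function names, not how the collator implements the rule and not code from this PR.

```cpp
#include <string>
#include <string_view>

// Expands the three lowercase umlauted vowels named in the German rule
// into their two-letter spellings. Everything else is copied unchanged.
std::u32string expandUmlauts(std::u32string_view s) {
    std::u32string out;
    for (char32_t c : s) {
        switch (c) {
            case U'\u00FC': out += U"ue"; break;  // u with diaeresis
            case U'\u00E4': out += U"ae"; break;  // a with diaeresis
            case U'\u00F6': out += U"oe"; break;  // o with diaeresis
            default: out += c; break;
        }
    }
    return out;
}

// Two spellings are treated as equivalent when their expansions are equal.
bool germanEquivalent(std::u32string_view a, std::u32string_view b) {
    return expandUmlauts(a) == expandUmlauts(b);
}
```

Under this sketch, `germanEquivalent(U"f\u00FCnf", U"fuenf")` is true, while `U"funf"` and `U"f\u00FCnf"` remain distinct.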

grhoten · 2026-04-04T15:51:18Z

icu4c/source/data/rbnf/it.txt

            "%spellout-numbering:"
            "-x: meno >>;"
            "x.x: << virgola >>;"
+            "Inf: infinito;"


This change got pulled in from the latest version of CLDR. Only the latest RBNF data is being brought in from CLDR.

grhoten · 2026-04-04T15:55:23Z

icu4c/source/i18n/nfrule.cpp

+        }
+
+        // go through all this grief if we're in lenient-parse mode
+#if !UCONFIG_NO_COLLATION


The UCONFIG_NO_COLLATION option has been narrowed down to this smaller segment in both ICU4C and ICU4J. If collation is disabled, only German would be affected at this time. All of the mixing of ignorable characters and case folding is happening above.
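
A rough sketch of the resulting shape, under stated assumptions: `SKETCH_NO_COLLATION` is a hypothetical stand-in for ICU's real `UCONFIG_NO_COLLATION` switch, and the enum and function are invented for illustration. The point is only that the ignorable and folding path is always compiled, while the collator path sits behind the build switch and is selected only for a locale that defines a %%lenient-parse rule.

```cpp
// Hypothetical stand-in for ICU's build-time switch. Define it to 1 to
// simulate a build with collation disabled.
#ifndef SKETCH_NO_COLLATION
#define SKETCH_NO_COLLATION 0
#endif

enum class MatchPath { Exact, IgnorableAndFolding, Collator };

// Chooses how a prefix would be matched. Only the collator branch is
// guarded, so a build without collation still parses leniently with the
// shared ignorable and folding logic.
MatchPath chooseMatchPath(bool lenient, bool localeHasLenientParseRule) {
    if (!lenient) {
        return MatchPath::Exact;
    }
#if !SKETCH_NO_COLLATION
    if (localeHasLenientParseRule) {
        return MatchPath::Collator;  // e.g. German, per the comment above
    }
#else
    (void)localeHasLenientParseRule;  // rule is ignored without collation
#endif
    return MatchPath::IgnorableAndFolding;
}
```

With the switch at its default of 0, `chooseMatchPath(true, true)` selects the collator path and `chooseMatchPath(true, false)` selects the collator-free path.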

grhoten · 2026-04-04T15:56:22Z

icu4c/source/i18n/nfrule.cpp

        fprintf(stderr, "prefix length: %d\n", result);
 #endif
        return result;
-#if 0


Remove permanently dead collation code. It can be resurrected from the history if needed.

grhoten mentioned this pull request Apr 4, 2026

CLDR-19372 Remove %%lenient-parse rules from RBNF unicode-org/cldr#5550

Open

1 task

grhoten force-pushed the 23358 branch from bee87ea to 7868161 Compare April 4, 2026 08:01

unicode-org deleted a comment from jira-pull-request-webhook bot Apr 4, 2026

grhoten force-pushed the 23358 branch from 7868161 to 6bc726a Compare April 4, 2026 15:45

unicode-org deleted a comment from jira-pull-request-webhook bot Apr 4, 2026

ICU-23358 Make RBNF lenient parsing of ignorables more consistent between locales

c849f9d

grhoten force-pushed the 23358 branch from 6bc726a to c849f9d Compare April 4, 2026 15:59

unicode-org deleted a comment from jira-pull-request-webhook bot Apr 4, 2026

grhoten requested review from markusicu and richgillam April 4, 2026 16:04

grhoten wants to merge 1 commit into unicode-org:main from grhoten:23358
