ICU-23358 Make RBNF lenient parsing of ignorables more consistent between locales #3926
Open
grhoten wants to merge 1 commit into unicode-org:main from
grhoten
commented
Apr 4, 2026
Comment on lines 6 to +8

    SpelloutRules{
    "%%lenient-parse:"
    "&ue=\u00FC&ae=\u00E4&oe=\u00F6&[last primary ignorable ] << ' ' << ',' << '-' << '\u00AD';"
    "&ue=\u00FC&ae=\u00E4&oe=\u00F6;"
This is the only %%lenient-parse left. This one is hard to get rid of.
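To illustrate what the remaining German rule expresses, here is a minimal sketch, not ICU's implementation. The rule `&ue=\u00FC&ae=\u00E4&oe=\u00F6` tailors collation so that each umlaut vowel compares equal to its two-letter spelling. The helper names `expandUmlauts` and `equalUnderRule` are hypothetical, and the sketch approximates the equivalence by rewriting the text instead of using a collator.

```cpp
#include <string>

// Hypothetical helper: rewrite each umlaut vowel to the two-letter spelling
// that the rule treats as equal to it. Covers lowercase forms only.
std::u32string expandUmlauts(const std::u32string& s) {
    std::u32string out;
    for (char32_t c : s) {
        switch (c) {
            case 0x00FC: out += U"ue"; break;  // u with diaeresis
            case 0x00E4: out += U"ae"; break;  // a with diaeresis
            case 0x00F6: out += U"oe"; break;  // o with diaeresis
            default:     out += c;     break;
        }
    }
    return out;
}

// Hypothetical helper: two strings are equal under the rule if their
// expanded forms are identical.
bool equalUnderRule(const std::u32string& a, const std::u32string& b) {
    return expandUmlauts(a) == expandUmlauts(b);
}
```

With this sketch, `equalUnderRule(U"f\u00FCnf", U"fuenf")` is true, while `equalUnderRule(U"f\u00FCnf", U"funf")` is false. A real collator also handles uppercase forms and normalization, which is part of why this rule is hard to replace.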
    "%spellout-numbering:"
    "-x: meno >>;"
    "x.x: << virgola >>;"
    "Inf: infinito;"
This change got pulled in from the latest version of CLDR. Only the latest RBNF data is being brought in from CLDR.
    }

    // go through all this grief if we're in lenient-parse mode
    #if !UCONFIG_NO_COLLATION
The UCONFIG_NO_COLLATION option has been narrowed down to this smaller segment in both ICU4C and ICU4J. If collation is disabled, only German would be affected at this time. All of the mixing of ignorable characters and case folding is happening above.
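The narrowing described here can be pictured with a small dispatch sketch. This is an illustration under assumptions, not the code in the PR: `MatchPath` and `chooseMatchPath` are hypothetical names, and the only fact taken from ICU is that `uconfig.h` defaults `UCONFIG_NO_COLLATION` to 0.

```cpp
// ICU's uconfig.h defaults this switch to 0; it is repeated here only so
// the sketch compiles on its own.
#ifndef UCONFIG_NO_COLLATION
#define UCONFIG_NO_COLLATION 0
#endif

// Hypothetical: the two ways a lenient prefix match could be carried out.
enum class MatchPath { Folding, Collator };

// Hypothetical dispatcher. The collator path is compiled in only when
// collation is enabled, and is taken only for a locale that defines a
// %%lenient-parse rule (currently German). Every other case uses the
// folding-and-ignorables path.
MatchPath chooseMatchPath(bool localeHasLenientParseRule) {
#if !UCONFIG_NO_COLLATION
    if (localeHasLenientParseRule) {
        return MatchPath::Collator;
    }
#else
    (void)localeHasLenientParseRule;  // collation is compiled out
#endif
    return MatchPath::Folding;
}
```

In a build with `UCONFIG_NO_COLLATION` set to 1, the function always returns `MatchPath::Folding`, which matches the note that only German would lose behavior.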
    fprintf(stderr, "prefix length: %d\n", result);
    #endif
    return result;
    #if 0
Remove permanently dead collation code. It can be resurrected from the history if needed.
Right now, usage of the %%lenient-parse rule is inconsistent between locales in RBNF. The vast majority of locales define the same set of characters as ignorable; the exception is German, which adds extra rules for vowels with an umlaut. We should change lenient parsing so that it does not use the collator by default, and uses it only when the locale defines the %%lenient-parse rule. Letter folding and ignorable characters should be handled separately from a collator.

This improves parsing time, and it usually reduces overall heap usage when parsing leniently. All existing tests continue to pass with these changes.
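As a rough illustration of handling ignorable characters and letter folding without a collator, the sketch below matches a prefix while skipping ignorables on both sides. It is not the code in this PR. The names `isIgnorable`, `foldCase`, and `lenientPrefixLength` are hypothetical, the ignorable set is taken from the rule text being removed (space, comma, hyphen, soft hyphen), and case folding is reduced to ASCII for brevity.

```cpp
#include <cstddef>
#include <string>

// Hypothetical: the ignorable set listed in the removed rule.
bool isIgnorable(char32_t c) {
    return c == U' ' || c == U',' || c == U'-' || c == 0x00AD;  // 0x00AD: soft hyphen
}

// Simplified stand-in for Unicode case folding: ASCII letters only.
char32_t foldCase(char32_t c) {
    return (c >= U'A' && c <= U'Z') ? static_cast<char32_t>(c + (U'a' - U'A')) : c;
}

// Returns how many code units of `text` are consumed when `prefix` matches
// leniently at the start of `text`, or 0 when it does not match.
// An empty (or all-ignorable) prefix also yields 0.
std::size_t lenientPrefixLength(const std::u32string& text,
                                const std::u32string& prefix) {
    std::size_t t = 0;
    std::size_t p = 0;
    while (true) {
        // Skip ignorables in the prefix first, so trailing ignorables in
        // the text are not consumed once the prefix is exhausted.
        while (p < prefix.size() && isIgnorable(prefix[p])) ++p;
        if (p == prefix.size()) return t;
        while (t < text.size() && isIgnorable(text[t])) ++t;
        if (t == text.size() || foldCase(text[t]) != foldCase(prefix[p])) return 0;
        ++t;
        ++p;
    }
}
```

For example, `lenientPrefixLength(U"Twenty-One apples", U"twenty one")` returns 10: the hyphen in the text and the space in the prefix are both skipped, and the case difference is folded away. No collator object is built, which is where the time and heap savings described above would come from.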
Checklist