ICU-23358 Make RBNF lenient parsing of ignorables more consistent between locales #3926

Open
grhoten wants to merge 1 commit into unicode-org:main from grhoten:23358
grhoten (Member) commented Apr 4, 2026

Right now, locales in RBNF use the %%lenient-parse rule inconsistently. The vast majority of them define the same set of ignorable characters; German is the exception, adding equivalences for vowels with an umlaut. Lenient parsing should not use the collator by default, only when the locale defines a %%lenient-parse rule. Letter folding and ignorable characters should be handled separately from the collator.

This improves the parsing time, and it usually reduces the overall heap usage when parsing leniently. All existing tests continue to pass with these changes.
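
For illustration, collator-free lenient matching of the kind described above could look roughly like the sketch below. This is not the code in this PR. The helper names (`isIgnorable`, `foldCase`, `lenientPrefixLength`) are hypothetical, the ignorable set is an assumption copied from the German rule shown in the diff (space, comma, hyphen, soft hyphen), and the case folding is ASCII-only for brevity, where real code would use full Unicode case folding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: characters skipped during lenient matching.
// Assumed set: space, comma, hyphen, soft hyphen (U+00AD).
static bool isIgnorable(char32_t c) {
    return c == U' ' || c == U',' || c == U'-' || c == 0x00AD;
}

// Simplified ASCII-only folding; a stand-in for full Unicode case folding.
static char32_t foldCase(char32_t c) {
    return (c >= U'A' && c <= U'Z') ? static_cast<char32_t>(c + 32) : c;
}

// Returns how many code points of `text` are consumed when `prefix`
// matches leniently at the start of `text`, or 0 when it does not match.
// No collator is involved: ignorables are skipped on both sides and the
// remaining code points are compared after case folding.
static std::size_t lenientPrefixLength(const std::u32string& text,
                                       const std::u32string& prefix) {
    std::size_t t = 0;
    std::size_t p = 0;
    while (p < prefix.size()) {
        if (isIgnorable(prefix[p])) {
            ++p;  // ignorable in the rule text: skip it
            continue;
        }
        while (t < text.size() && isIgnorable(text[t])) {
            ++t;  // ignorable in the input: skip it
        }
        if (t == text.size() || foldCase(text[t]) != foldCase(prefix[p])) {
            return 0;  // ran out of input or found a real mismatch
        }
        ++t;
        ++p;
    }
    return t;
}
```

With these assumptions, `lenientPrefixLength(U"Twenty-Three apples", U"twenty three")` consumes the 12 code points of "Twenty-Three", while `lenientPrefixLength(U"twelve", U"twenty")` returns 0. A single pass like this allocates nothing, which is consistent with the parsing time and heap improvements noted above, though the sketch itself proves nothing about the actual implementation.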

Checklist

  • Required: Issue filed: ICU-23358
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable
  • Approver: Feel free to merge on my behalf

Comment on lines 6 to +8
SpelloutRules{
"%%lenient-parse:"
"&ue=\u00FC&ae=\u00E4&oe=\u00F6&[last primary ignorable ] << ' ' << ',' << '-' << '\u00AD';"
"&ue=\u00FC&ae=\u00E4&oe=\u00F6;"
grhoten (Member, Author) commented:

This is the only %%lenient-parse rule left. It is hard to get rid of.

"%spellout-numbering:"
"-x: meno >>;"
"x.x: << virgola >>;"
"Inf: infinito;"
grhoten (Member, Author) commented:

This change was pulled in with the latest version of CLDR. Only the latest RBNF data is being brought in from CLDR.

}

// go through all this grief if we're in lenient-parse mode
#if !UCONFIG_NO_COLLATION
grhoten (Member, Author) commented:

The UCONFIG_NO_COLLATION guard has been narrowed to this smaller segment in both ICU4C and ICU4J. If collation is disabled, only German would be affected at this time. All of the combined handling of ignorable characters and case folding happens above this block.
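
As a rough sketch of what narrowing the guard means in practice, the collator path can be confined to one small conditional block while the folding path stays available in every build. The names `MatchPath` and `choosePath` are hypothetical and do not appear in the PR; the fallback `#define` only mirrors ICU's default of `UCONFIG_NO_COLLATION` being 0 so that the snippet compiles on its own.

```cpp
// Stand-in for the value normally provided by unicode/uconfig.h.
#ifndef UCONFIG_NO_COLLATION
#define UCONFIG_NO_COLLATION 0
#endif

// Hypothetical: which matching strategy a lenient parse would use.
enum class MatchPath { Folding, Collator };

// Hypothetical: pick the strategy for a locale. Only locales that define
// a %%lenient-parse rule (currently German) take the collator path, and
// only when collation is compiled in.
static MatchPath choosePath(bool localeHasLenientParseRule) {
#if !UCONFIG_NO_COLLATION
    if (localeHasLenientParseRule) {
        return MatchPath::Collator;
    }
#else
    (void)localeHasLenientParseRule;  // collation compiled out
#endif
    // Ignorable skipping and case folding need no collator.
    return MatchPath::Folding;
}
```

In a build with collation disabled, every locale in this sketch falls through to the folding path, which matches the observation that only German would behave differently.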

fprintf(stderr, "prefix length: %d\n", result);
#endif
return result;
#if 0
grhoten (Member, Author) commented:

Remove permanently dead collation code. It can be resurrected from the history if needed.
