Skip to content

ICU-23347 Reduce construction time of RuleBasedNumberFormat#3917

Open
grhoten wants to merge 1 commit intounicode-org:mainfrom
grhoten:23347
Open

ICU-23347 Reduce construction time of RuleBasedNumberFormat#3917
grhoten wants to merge 1 commit intounicode-org:mainfrom
grhoten:23347

Conversation

@grhoten
Copy link
Copy Markdown
Member

@grhoten grhoten commented Mar 29, 2026

There are some changes that can be done to reduce the construction time and heap usage of the RuleBasedNumberFormat.

The changes include:

  • Use a single string for the ruleset resource bundle instead of an array that is turned into a single string.
  • If the rules don’t need to strip the whitespace, then don’t copy the string.
  • If the the plural format used in the rules don’t format a number, don’t construct a NumberFormat. This saves about 55% of the peak heap usage in intltest’s rbnf/TestAllLocales. That’s about 6.5 MB. Some RBNF rulesets use a lot of plural format objects, and most won’t be referenced.

Checklist

  • Required: Issue filed: ICU-23347
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable
  • Approver: Feel free to merge on my behalf

Comment on lines -7 to +9
"%digits-ordinal:",
"-x: \u2212>>;",
"0: =#,##0=$(ordinal,few{de}other{ste})$;",
"%digits-ordinal:"
"-x: \u2212>>;"
"0: =#,##0=$(ordinal,few{de}other{ste})$;"
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the RBNF resource bundles are being changed from an array to a single string. These were generated from a recent copy of CLDR. Some recent changes & fixes in the main branch were included. I ignored all other changes from CLDR for this pull request.

offset(0) {
init(nullptr, UPLURAL_TYPE_CARDINAL, status);
: PluralFormat(Locale::getDefault(), status)
{
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reuse existing constructors to default specific fields.

Comment on lines -273 to 284
auto *decFmt = dynamic_cast<DecimalFormat *>(numberFormat);
if(decFmt != nullptr) {
const number::LocalizedNumberFormatter* lnf = decFmt->toNumberFormatter(status);
if (U_FAILURE(status)) {
return appendTo;
}
lnf->formatImpl(&data, status); // mutates &data
if (U_FAILURE(status)) {
return appendTo;
}
numberString = data.getStringRef().toUnicodeString();
} else {
if (offset == 0) {
numberFormat->format(numberObject, numberString, status);
if (numberFormat != nullptr) {
auto *decFmt = dynamic_cast<DecimalFormat *>(numberFormat);
if(decFmt != nullptr) {
const number::LocalizedNumberFormatter* lnf = decFmt->toNumberFormatter(status);
if (U_FAILURE(status)) {
return appendTo;
}
lnf->formatImpl(&data, status); // mutates &data
if (U_FAILURE(status)) {
return appendTo;
}
numberString = data.getStringRef().toUnicodeString();
} else {
numberFormat->format(numberMinusOffset, numberString, status);
if (offset == 0) {
numberFormat->format(numberObject, numberString, status);
} else {
numberFormat->format(numberMinusOffset, numberString, status);
}
}
}
Copy link
Copy Markdown
Member Author

@grhoten grhoten Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an expensive formatting operation. If you're not going to actually use the formatted number, then why construct it at all! So we make numberFormat optional with the changes to this class.

Comment on lines +169 to 179
void
PluralFormat::initializeNumberFormat(UErrorCode& status) {
if (U_SUCCESS(status) && numberFormat == nullptr
&& (msgPattern.countParts() == 0 || msgPattern.getPatternString().indexOf(u'#') >= 0))
{
// The pattern may need to format a number later.
// Let's cache this expensive to use number format.
numberFormat = NumberFormat::createInstance(locale, status);
}
// either we failed, we use the existing one, or the pattern doesn't format a number.
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an important optimization. It saves construction time, and formatting time. RBNF uses a lot of these objects, which uses a lot of heap. None of them use a plural format that uses a number. So don't construct it if it's impossible to use.

Comment on lines -750 to -770
UResourceBundle* rbnfRules = ures_getByKeyWithFallback(nfrb, rules_tag, nullptr, &status);
if (U_FAILURE(status)) {
ures_close(nfrb);
}
UResourceBundle* ruleSets = ures_getByKeyWithFallback(rbnfRules, fmt_tag, nullptr, &status);
if (U_FAILURE(status)) {
ures_close(rbnfRules);
ures_close(nfrb);
return;
}

UnicodeString desc;
while (ures_hasNext(ruleSets)) {
desc.append(ures_getNextUnicodeString(ruleSets,nullptr,&status));
nfrb = ures_getByKeyWithFallback(nfrb, rules_tag, nfrb, &status);
nfrb = ures_getByKeyWithFallback(nfrb, fmt_tag, nfrb, &status);
if (U_SUCCESS(status)) {
UParseError perror;
init(ures_getUnicodeString(nfrb, &status), locinfo, perror, status);
}
UParseError perror;

init(desc, locinfo, perror, status);

ures_close(ruleSets);
ures_close(rbnfRules);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reuse the resource bundle object, and use a single string instead of an array.

Comment on lines -1600 to -1603
// iterate through the characters...
UnicodeString result;
// Find the first semicolon followed by whitespace or another semicolon,
// or leading whitespace/semicolons. Everything before that point is already clean.
int32_t len = description.length();
int32_t start = 0;
if (len > 0) {
UChar ch = description.charAt(0);
if (!PatternProps::isWhiteSpace(ch) && ch != gSemiColon) {
for (int32_t i = 0; i < len - 1; ++i) {
if (description.charAt(i) == gSemiColon) {
UChar next = description.charAt(i + 1);
if (PatternProps::isWhiteSpace(next) || next == gSemiColon) {
start = i + 1;
break;
}
}
}
if (start == 0) {
return; // No whitespace to strip anywhere.
}
}
}

// Copy the clean prefix, then strip whitespace from the rest.
UnicodeString result(description, 0, start);

int start = 0;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't copy a string, if it's already in an optimized format.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Languages with a lot of grammatical cases tend to have longer strings.

Copy link
Copy Markdown
Member

@macchiati macchiati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants