ICU-23347 Reduce construction time of RuleBasedNumberFormat by grhoten · Pull Request #3917 · unicode-org/icu

grhoten · 2026-03-29T17:48:28Z

Several changes can reduce the construction time and heap usage of RuleBasedNumberFormat.

The changes include:

Store the rulesets in the resource bundle as a single string instead of an array that is concatenated into a single string at load time.
If the rules contain no whitespace that needs stripping, don’t copy the string.
If the plural format used in the rules doesn’t format a number, don’t construct a NumberFormat. This saves about 55% of the peak heap usage in intltest’s rbnf/TestAllLocales, which is about 6.5 MB. Some RBNF rulesets use a lot of plural format objects, and most won’t be referenced.

Checklist

Required: Issue filed: ICU-23347
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable
Approver: Feel free to merge on my behalf

grhoten · 2026-03-29T17:50:42Z

icu4c/source/data/rbnf/af.txt

-            "%digits-ordinal:",
-            "-x: \u2212>>;",
-            "0: =#,##0=$(ordinal,few{de}other{ste})$;",
+            "%digits-ordinal:"
+            "-x: \u2212>>;"
+            "0: =#,##0=$(ordinal,few{de}other{ste})$;"


All of the RBNF resource bundles are being changed from an array to a single string. These were generated from a recent copy of CLDR. Some recent changes & fixes in the main branch were included. I ignored all other changes from CLDR for this pull request.

grhoten · 2026-03-29T17:52:26Z

icu4c/source/i18n/plurfmt.cpp

-          offset(0) {
-    init(nullptr, UPLURAL_TYPE_CARDINAL, status);
+        : PluralFormat(Locale::getDefault(), status)
+{
 }


Reuse existing constructors to default specific fields.
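
For readers less familiar with the pattern, here is a minimal sketch of constructor delegation in plain C++. The `Formatter` class, its fields, and `defaultLocale()` are hypothetical stand-ins, not the actual PluralFormat members. The point is only that one primary constructor sets every field and the others forward to it.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for a formatter class with several constructors.
class Formatter {
public:
    // Primary constructor: the only place that initializes every field.
    Formatter(std::string locale, int type)
        : locale_(std::move(locale)), type_(type), offset_(0) {}

    // Delegating constructors supply defaults and forward to the primary one.
    explicit Formatter(std::string locale) : Formatter(std::move(locale), 0) {}
    Formatter() : Formatter(defaultLocale()) {}

    const std::string& locale() const { return locale_; }
    int type() const { return type_; }
    int offset() const { return offset_; }

private:
    // Assumption for this sketch; real code would query the runtime default.
    static std::string defaultLocale() { return "en"; }

    std::string locale_;
    int type_;
    int offset_;
};
```

With this shape, a newly added field only needs a default in one constructor.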

grhoten · 2026-03-29T17:55:07Z

icu4c/source/i18n/plurfmt.cpp

-    auto *decFmt = dynamic_cast<DecimalFormat *>(numberFormat);
-    if(decFmt != nullptr) {
-        const number::LocalizedNumberFormatter* lnf = decFmt->toNumberFormatter(status);
-        if (U_FAILURE(status)) {
-            return appendTo;
-        }
-        lnf->formatImpl(&data, status); // mutates &data
-        if (U_FAILURE(status)) {
-            return appendTo;
-        }
-        numberString = data.getStringRef().toUnicodeString();
-    } else {
-        if (offset == 0) {
-            numberFormat->format(numberObject, numberString, status);
+    if (numberFormat != nullptr) {
+        auto *decFmt = dynamic_cast<DecimalFormat *>(numberFormat);
+        if(decFmt != nullptr) {
+            const number::LocalizedNumberFormatter* lnf = decFmt->toNumberFormatter(status);
+            if (U_FAILURE(status)) {
+                return appendTo;
+            }
+            lnf->formatImpl(&data, status); // mutates &data
+            if (U_FAILURE(status)) {
+                return appendTo;
+            }
+            numberString = data.getStringRef().toUnicodeString();
        } else {
-            numberFormat->format(numberMinusOffset, numberString, status);
+            if (offset == 0) {
+                numberFormat->format(numberObject, numberString, status);
+            } else {
+                numberFormat->format(numberMinusOffset, numberString, status);
+            }
        }
    }


This is an expensive formatting operation. If the formatted number is never going to be used, there is no reason to construct it. So the changes to this class make numberFormat optional.
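
The idea can be sketched outside ICU as follows. This is an illustration under assumptions, not the PluralFormat code: `NumberFormatter` and `formatMessage` are hypothetical names, and an empty `std::function` plays the role of a null numberFormat.

```cpp
#include <functional>
#include <string>

// Hypothetical stand-in: an empty std::function models a null numberFormat.
using NumberFormatter = std::function<std::string(double)>;

// Replaces each '#' in the selected message with the formatted number.
// When no formatter is present, the number is never converted to text.
// The caller is assumed to pass an empty formatter only for messages without '#'.
std::string formatMessage(const std::string& message, double number,
                          const NumberFormatter& formatter) {
    std::string numberString;
    if (formatter) {
        numberString = formatter(number);  // the expensive step, skipped when unused
    }
    std::string result;
    for (char ch : message) {
        if (ch == '#') {
            result += numberString;
        } else {
            result += ch;
        }
    }
    return result;
}
```

A message such as `ste` never touches the formatter, so callers that only need plural selection pay nothing for number formatting.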

grhoten · 2026-03-29T17:57:20Z

icu4c/source/i18n/plurfmt.cpp

+void
+PluralFormat::initializeNumberFormat(UErrorCode& status) {
+    if (U_SUCCESS(status) && numberFormat == nullptr
+        && (msgPattern.countParts() == 0 || msgPattern.getPatternString().indexOf(u'#') >= 0))
+    {
+        // The pattern may need to format a number later.
+        // Let's cache this expensive to use number format.
+        numberFormat = NumberFormat::createInstance(locale, status);
+    }
+    // either we failed, we use the existing one, or the pattern doesn't format a number.
 }


This is an important optimization. It saves both construction time and formatting time. RBNF uses a lot of these objects, which consume a lot of heap. None of them use a plural format that formats a number, so don't construct the NumberFormat when it's impossible to use.
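
The condition in the diff can be read as a small predicate. Below is a sketch using only the standard library. `mayFormatNumber` and the `patternApplied` flag are hypothetical, and mapping `countParts() == 0` to "no pattern applied yet" is my reading of the diff.

```cpp
#include <string>

// Returns true when a plural pattern might need to format a number later:
// either no pattern has been applied yet, or the pattern text contains '#'.
// A '#' inside quoted literal text also counts, so the check errs on the
// side of creating a formatter that turns out to be unused.
bool mayFormatNumber(const std::u16string& pattern, bool patternApplied) {
    if (!patternApplied) {
        return true;  // a later pattern may still contain '#'
    }
    return pattern.find(u'#') != std::u16string::npos;
}
```

For an ordinal pattern like `few{de}other{ste}` from af.txt, this returns false, which is the case where the number format can be skipped.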

grhoten · 2026-03-29T17:58:04Z

icu4c/source/i18n/rbnf.cpp

-        UResourceBundle* rbnfRules = ures_getByKeyWithFallback(nfrb, rules_tag, nullptr, &status);
-        if (U_FAILURE(status)) {
-            ures_close(nfrb);
-        }
-        UResourceBundle* ruleSets = ures_getByKeyWithFallback(rbnfRules, fmt_tag, nullptr, &status);
-        if (U_FAILURE(status)) {
-            ures_close(rbnfRules);
-            ures_close(nfrb);
-            return;
-        }
-
-        UnicodeString desc;
-        while (ures_hasNext(ruleSets)) {
-           desc.append(ures_getNextUnicodeString(ruleSets,nullptr,&status));
+        nfrb = ures_getByKeyWithFallback(nfrb, rules_tag, nfrb, &status);
+        nfrb = ures_getByKeyWithFallback(nfrb, fmt_tag, nfrb, &status);
+        if (U_SUCCESS(status)) {
+            UParseError perror;
+            init(ures_getUnicodeString(nfrb, &status), locinfo, perror, status);
        }
-        UParseError perror;
-
-        init(desc, locinfo, perror, status);
-
-        ures_close(ruleSets);
-        ures_close(rbnfRules);


Reuse the resource bundle object, and use a single string instead of an array.

grhoten · 2026-03-29T17:58:35Z

icu4c/source/i18n/rbnf.cpp

-    // iterate through the characters...
-    UnicodeString result;
+    // Find the first semicolon followed by whitespace or another semicolon,
+    // or leading whitespace/semicolons. Everything before that point is already clean.
+    int32_t len = description.length();
+    int32_t start = 0;
+    if (len > 0) {
+        UChar ch = description.charAt(0);
+        if (!PatternProps::isWhiteSpace(ch) && ch != gSemiColon) {
+            for (int32_t i = 0; i < len - 1; ++i) {
+                if (description.charAt(i) == gSemiColon) {
+                    UChar next = description.charAt(i + 1);
+                    if (PatternProps::isWhiteSpace(next) || next == gSemiColon) {
+                        start = i + 1;
+                        break;
+                    }
+                }
+            }
+            if (start == 0) {
+                return;  // No whitespace to strip anywhere.
+            }
+        }
+    }
+
+    // Copy the clean prefix, then strip whitespace from the rest.
+    UnicodeString result(description, 0, start);

-    int start = 0;


Don't copy the string if it's already in an optimized format.

Languages with a lot of grammatical cases tend to have longer strings.
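
As a rough illustration of the fast path, the scan can be written against `std::u16string`. This is a simplified sketch, not the ICU implementation: `isPatternSpace` is a hypothetical approximation of `PatternProps::isWhiteSpace`, and an empty description is reported as clean here.

```cpp
#include <cstddef>
#include <string>

// Simplified whitespace test for this sketch; ICU's PatternProps::isWhiteSpace
// recognizes more characters than these four.
bool isPatternSpace(char16_t ch) {
    return ch == u' ' || ch == u'\t' || ch == u'\n' || ch == u'\r';
}

// Returns the index at which stripping must begin, or npos when the
// description has nothing to strip and can be used without copying.
std::size_t firstDirtyIndex(const std::u16string& description) {
    if (description.empty()) {
        return std::u16string::npos;
    }
    char16_t first = description[0];
    if (isPatternSpace(first) || first == u';') {
        return 0;  // leading whitespace or semicolon: strip from the start
    }
    for (std::size_t i = 0; i + 1 < description.size(); ++i) {
        if (description[i] == u';') {
            char16_t next = description[i + 1];
            if (isPatternSpace(next) || next == u';') {
                return i + 1;  // everything before this point is already clean
            }
        }
    }
    return std::u16string::npos;
}
```

When the result is npos the caller can return immediately. Otherwise it copies the prefix up to the returned index and strips only the remainder.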

macchiati

LGTM

ICU-23347 Reduce construction time of RuleBasedNumberFormat

fb6f207


macchiati reviewed Mar 29, 2026

grhoten wants to merge 1 commit into unicode-org:main from grhoten:23347
