ICU-23394 Validate binary RBBI data offsets in RBBIDataWrapper::init() by TristanInSec · Pull Request #3961 · unicode-org/icu

TristanInSec · 2026-04-29T19:11:23Z

Validate all offset+length pairs (fFTable, fRTable, fTrie, fRuleSource,
fStatusTable) against the total data length in RBBIDataWrapper::init()
before computing any pointers. Malformed input now returns
U_INVALID_FORMAT_ERROR instead of producing wild pointers.

Checklist

Required: Issue filed: ICU-23394
Required: The PR title must be prefixed with a JIRA Issue number.
Required: Each commit message must be prefixed with a JIRA Issue number.
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

CLAassistant · 2026-04-29T19:26:13Z

All committers have signed the CLA.

Add bounds checking for all offset+length pairs (fFTable, fRTable, fTrie, fRuleSource, fStatusTable) against the total data length in the RBBI binary data header. Without this validation, crafted binary data with out-of-range offsets causes an out-of-bounds read when passed to RuleBasedBreakIterator(const uint8_t*, uint32_t, UErrorCode&). The overflow-safe checks verify that each offset does not exceed totalLen, and that the corresponding length does not exceed the remaining space.
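The overflow-safe form of the check described in this commit message can be sketched as follows. This is a minimal illustration, not the code from the PR: `rangeFits` and `naiveRangeFits` are hypothetical helper names, and the parameters only mirror the wording above. The point is that the naive test `offset + length <= totalLen` can wrap around in 32-bit arithmetic, whereas comparing the length against the remaining space cannot.

```cpp
#include <cstdint>

// Hypothetical helper (not ICU API): true if [offset, offset + length)
// lies entirely within a buffer of totalLen bytes.
constexpr bool rangeFits(uint32_t offset, uint32_t length, uint32_t totalLen) {
    // Check the offset first so that totalLen - offset cannot underflow,
    // then compare the length against the space that remains.
    return offset <= totalLen && length <= totalLen - offset;
}

// The naive form, shown only for contrast: offset + length may wrap.
constexpr bool naiveRangeFits(uint32_t offset, uint32_t length, uint32_t totalLen) {
    return offset + length <= totalLen;
}

// A crafted pair whose 32-bit sum wraps to 0x10 passes the naive check
// but is rejected by the overflow-safe one.
static_assert(naiveRangeFits(0xFFFFFFF0u, 0x20u, 0x100u), "sum wraps around");
static_assert(!rangeFits(0xFFFFFFF0u, 0x20u, 0x100u), "offset is out of range");
static_assert(rangeFits(0x50u, 0xB0u, 0x100u), "range exactly fills the buffer");
```

Applying the same predicate to each of the five offset+length pairs before any pointer is computed is presumably what allows malformed input to be rejected with an error code rather than dereferenced.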

jira-pull-request-webhook · 2026-04-30T12:12:29Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

markusicu · 2026-04-30T16:09:45Z

Hi @TristanInSec,

It is true that if you give ICU corrupted binary data files -- via API or via constructing data files to be loaded -- there are all kinds of ways that ICU will "go into the weeds". Your additional checks here are the most basic, and might be reasonable, but it's not clear that it makes sense to do this partial validation. We definitely don't plan to do full validation.

If we were going forward, we would want to see

- a Jira ticket explaining the issue, and making the case why garbage-in-garbage-out is not acceptable
  - the ticket you used is for generic release work
- restore and fill out the pull request template
- unit test code that fails before these changes and passes with them
- Java port

Thanks,
markus

TristanInSec · 2026-05-04T18:39:03Z

Hi @markusicu,

Thank you for the detailed review. I'll address each item.

Jira tickets: I created two dedicated tickets:

- ICU-23394: Deserialization offset validation, covering RBBI (#3961), uspoof (#3962), and utrie2/ucptrie (#3964)
- ICU-23395: u_shapeArabic off-by-one (#3963), which is a different bug class (runtime text processing, not binary deserialization)

PR template: Will restore and fill out on all four PRs.

Unit tests: I have crash inputs from libFuzzer that reproduce each issue under ASan. I'll wrap them as C++ test cases that fail before the fix (OOB read / SEGV) and pass after (returning U_INVALID_FORMAT_ERROR).

Java port: I'll look at the ICU4J equivalents for RBBIDataWrapper, SpoofData, and the trie APIs and prepare matching validation.

On partial validation vs. GIGO:

I understand the concern about incomplete coverage. The case for these checks:

- These are public C/C++ APIs accepting const uint8_t* + length with no documented precondition that callers must pre-validate the data. The API contract implies graceful error handling on invalid input.
- ICU already validates in comparable paths: uresdata.cpp checks format, size, and root type on bundle loading. The RBBI, uspoof, and trie paths are missing equivalent checks, which is an inconsistency.
- Practical attack surface: ICU is embedded in Chrome, Firefox, Node.js, Android, and system libraries. Binary data can be corrupted on disk, tampered with in transit, or received from an untrusted source in a sandboxed architecture (e.g., a renderer process loading locale data).
- These deserialization paths had zero OSS-Fuzz coverage. Custom harnesses crashed the RBBI constructor in approximately 131K iterations (a matter of seconds). The crash inputs are 20 to 116 bytes.
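For readers unfamiliar with the kind of harness mentioned in the last point, its general shape can be sketched as below. This is not the harness used for the PR: `parseUntrusted` and its 8-byte layout are hypothetical stand-ins so the example builds without ICU. A harness for this issue would presumably pass `data` and `size` to the `RuleBasedBreakIterator(const uint8_t*, uint32_t, UErrorCode&)` constructor instead. `LLVMFuzzerTestOneInput` is the entry point that libFuzzer calls for each generated input.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the deserializer under test (not ICU code).
// Layout assumed for illustration: bytes 0-3 = offset, bytes 4-7 = length,
// both in host byte order.
bool parseUntrusted(const uint8_t* data, size_t size, uint32_t* checksum) {
    uint32_t offset = 0;
    uint32_t length = 0;
    if (size < 8) {
        return false;  // truncated header
    }
    std::memcpy(&offset, data, 4);
    std::memcpy(&length, data + 4, 4);
    if (offset > size || length > size - offset) {
        return false;  // range falls outside the buffer
    }
    // Read every byte in the range. Without the check above, a crafted
    // offset would typically trigger an out-of-bounds read report under ASan.
    uint32_t sum = 0;
    for (uint32_t i = 0; i < length; ++i) {
        sum += data[static_cast<size_t>(offset) + i];
    }
    *checksum = sum;
    return true;
}

// Entry point that libFuzzer calls once per generated input.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
    uint32_t checksum = 0;
    (void)parseUntrusted(data, size, &checksum);
    return 0;  // rejected and accepted inputs alike return 0; only crashes count
}
```

Such a file is normally built with `clang++ -fsanitize=fuzzer,address`, which supplies `main` and the mutation loop.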

The validation is minimal and targeted: it checks that header offsets fall within the buffer bounds, consistent with how uresdata.cpp already works. It does not claim full validation.

I'll update the PRs with the items above.

Best regards,
Tristan

TristanInSec force-pushed the fix-rbbi-binary-validation branch from 058350a to b2757e0 April 30, 2026 12:12

markusicu self-assigned this Apr 30, 2026

markusicu added the jira-needed and need-tests (Needs unit test code that demonstrates the bug and the fix) labels Apr 30, 2026

This was referenced Apr 30, 2026

ICU-23394 Validate serialized spoof data in SpoofData deserialization #3962

Open

ICU-23394 Validate serialized trie data in utrie2/ucptrie deserialization #3964

Open

TristanInSec mentioned this pull request May 4, 2026

ICU-23395 Fix out-of-bounds read in expandCompositCharAtNear() #3963

Closed

6 tasks

ICU-23394 Add unit test for malformed binary RBBI data validation

619a008

Test that RuleBasedBreakIterator returns U_INVALID_FORMAT_ERROR when given binary data with out-of-bounds offsets or a truncated header, rather than crashing with a SEGV.
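A test along these lines might be structured like the sketch below. It does not use ICU: `FakeHeader`, `validateHeader`, `craft`, and the `Status` enum are hypothetical stand-ins (the real RBBI header has more fields and a different layout), chosen so the example compiles on its own. It covers the two cases the commit message names: a buffer shorter than the header, and a header whose offset points past the end of the data.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-ins, not ICU types.
enum class Status { Ok, InvalidFormat };

struct FakeHeader {
    uint32_t fLength;     // total size of the data in bytes
    uint32_t fFTable;     // offset of one table
    uint32_t fFTableLen;  // length of that table
};

// Validate a serialized FakeHeader without dereferencing any offset.
Status validateHeader(const uint8_t* data, uint32_t size) {
    FakeHeader h;
    if (data == nullptr || size < sizeof(h)) {
        return Status::InvalidFormat;  // truncated header
    }
    std::memcpy(&h, data, sizeof(h));  // copy rather than cast: no alignment assumptions
    if (h.fLength > size) {
        return Status::InvalidFormat;  // header claims more data than was supplied
    }
    if (h.fFTable > h.fLength || h.fFTableLen > h.fLength - h.fFTable) {
        return Status::InvalidFormat;  // table range falls outside the data
    }
    return Status::Ok;
}

// Build a buffer of at least sizeof(FakeHeader) bytes that starts with the
// given header fields and is zero-padded up to fLength.
std::vector<uint8_t> craft(uint32_t fLength, uint32_t offset, uint32_t len) {
    FakeHeader h{fLength, offset, len};
    std::size_t bufSize = fLength < sizeof(h) ? sizeof(h) : fLength;
    std::vector<uint8_t> buf(bufSize, 0);
    std::memcpy(buf.data(), &h, sizeof(h));
    return buf;
}
```

A test body would then call `validateHeader` on a well-formed buffer from `craft(64, 12, 52)`, on the same buffer with a size of 8 (truncated), and on `craft(64, 0x7FFFFFF0u, 16)` (out-of-bounds offset), expecting `Status::Ok` only for the first.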

TristanInSec changed the title ~~ICU-23250 Validate binary RBBI data offsets in RBBIDataWrapper::init()~~ ICU-23394 Validate binary RBBI data offsets in RBBIDataWrapper::init() May 4, 2026

ICU-23394 Java port: validate RBBI data offsets in RBBIDataWrapper.get()

8419432

Add comprehensive offset+length bounds checking for all header fields (fFTable, fRTable, fTrie, fRuleSource, fStatusTable) against fLength before using them. Includes unit test with crafted data.

TristanInSec wants to merge 3 commits into unicode-org:main from TristanInSec:fix-rbbi-binary-validation
