Skip to content

ICU-23394 Validate binary RBBI data offsets in RBBIDataWrapper::init()#3961

Open
TristanInSec wants to merge 3 commits into
unicode-org:mainfrom
TristanInSec:fix-rbbi-binary-validation
Open

ICU-23394 Validate binary RBBI data offsets in RBBIDataWrapper::init()#3961
TristanInSec wants to merge 3 commits into
unicode-org:mainfrom
TristanInSec:fix-rbbi-binary-validation

Conversation

@TristanInSec
Copy link
Copy Markdown

@TristanInSec TristanInSec commented Apr 29, 2026

Validate all offset+length pairs (fFTable, fRTable, fTrie, fRuleSource,
fStatusTable) against the total data length in RBBIDataWrapper::init()
before computing any pointers. Malformed input now returns
U_INVALID_FORMAT_ERROR instead of producing wild pointers.

Checklist

  • Required: Issue filed: ICU-23394
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 29, 2026

CLA assistant check
All committers have signed the CLA.

Add bounds checking for all offset+length pairs (fFTable, fRTable, fTrie,
fRuleSource, fStatusTable) against the total data length in the RBBI binary
data header. Without this validation, crafted binary data with out-of-range
offsets causes an out-of-bounds read when passed to
RuleBasedBreakIterator(const uint8_t*, uint32_t, UErrorCode&).

The overflow-safe checks verify that each offset does not exceed totalLen,
and that the corresponding length does not exceed the remaining space.
@TristanInSec TristanInSec force-pushed the fix-rbbi-binary-validation branch from 058350a to b2757e0 Compare April 30, 2026 12:12
@jira-pull-request-webhook
Copy link
Copy Markdown

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@markusicu markusicu self-assigned this Apr 30, 2026
@markusicu
Copy link
Copy Markdown
Member

Hi @TristanInSec,

It is true that if you give ICU corrupted binary data files -- via API or via constructing data files to be loaded -- there are all kinds of ways that ICU will "go into the weeds". Your additional checks here are the most basic, and might be reasonable, but it's not clear that it makes sense to do this partial validation. We definitely don't plan to do full validation.

If we were going forward, we would want to see

  • a Jira ticket explaining the issue, and making the case by garbage-in-garbage-out is not acceptable
    • the ticket you used is for generic release work
  • restore and fill out the pull request template
  • unit test code that fails before these changes and passes with them
  • Java port

Thanks,
markus

@TristanInSec
Copy link
Copy Markdown
Author

TristanInSec commented May 4, 2026

Hi @markusicu,

Thank you for the detailed review. I'll address each item.

Jira tickets: I created two dedicated tickets:

PR template: Will restore and fill out on all four PRs.

Unit tests: I have crash inputs from libFuzzer that reproduce each issue under ASan. I'll wrap them as C++ test cases that fail before the fix (OOB read / SEGV) and pass after (returning U_INVALID_FORMAT_ERROR).

Java port: I'll look at the ICU4J equivalents for RBBIDataWrapper, SpoofData, and the trie APIs and prepare matching validation.

On partial validation vs. GIGO:

I understand the concern about incomplete coverage. The case for these checks:

  • These are public C/C++ APIs accepting const uint8_t* + length with no documented precondition that callers must pre-validate the data. The API contract implies graceful error handling on invalid input.
  • ICU already validates in comparable paths: uresdata.cpp checks format, size, and root type on bundle loading. The RBBI, uspoof, and trie paths are missing equivalent checks, which is an inconsistency.
  • Practical attack surface: ICU is embedded in Chrome, Firefox, Node.js, Android, and system libraries. Binary data can be corrupted on disk, tampered in transit, or received from an untrusted source in a sandboxed architecture (e.g., a renderer process loading locale data).
  • These deserialization paths had zero OSS-Fuzz coverage. Custom harnesses crashed the RBBI constructor in approximately 131K iterations (seconds). The crash inputs are 20 to 116 bytes.

The validation is minimal and targeted: it checks that header offsets fall within the buffer bounds, consistent with how uresdata.cpp already works. It does not claim full validation.

I'll update the PRs with the items above.

Best regards,
Tristan

Test that RuleBasedBreakIterator returns U_INVALID_FORMAT_ERROR when
given binary data with out-of-bounds offsets or a truncated header,
rather than crashing with a SEGV.
@TristanInSec TristanInSec changed the title ICU-23250 Validate binary RBBI data offsets in RBBIDataWrapper::init() ICU-23394 Validate binary RBBI data offsets in RBBIDataWrapper::init() May 4, 2026
Add comprehensive offset+length bounds checking for all header fields
(fFTable, fRTable, fTrie, fRuleSource, fStatusTable) against fLength
before using them. Includes unit test with crafted data.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira-needed need-tests Needs unit test code that demonstrates the bug and the fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants