Create AsyncByteBufferEscapedSurrogateInFieldName1581Test.java#1583

Open
pjfanning wants to merge 1 commit intoFasterXML:2.21from
pjfanning:non-blocking-test

Conversation

@pjfanning
Member

@pjfanning pjfanning commented Mar 27, 2026

backported #1582
issue #1581

Copilot suggests:

Here is an analysis of why the test in PR #1583 works on the 3.x branch but fails on 2.21, and what fix is needed. It is based on the actual source of the bug, the _addName method in NonBlockingJsonParserBase on the 2.21 branch, and on the corresponding fix in the 3.x branch.

Root Cause

The 3.x branch introduced a dedicated field _pendingSurrogateInName in NonBlockingJsonParserBase (since 3.1), along with a _finishPropertyWithEscape() method that explicitly handles surrogate pairs split across chunk boundaries:

/**
 * High surrogate code point awaiting matching low surrogate during
 * property name parsing, or 0 if none pending.
 *
 * @since 3.1
 */
protected int _pendingSurrogateInName;

protected final JsonToken _finishPropertyWithEscape() throws JacksonException
{
    int ch;

    // [jackson-core#1541]: Check if we have a pending high surrogate
    if (_pendingSurrogateInName != 0) {
        // We have a high surrogate saved, now need to decode the low surrogate escape
        if (_quotedDigits == -2) {
            // Need to read the backslash first
            if (_inputPtr >= _inputEnd) {
                return JsonToken.NOT_AVAILABLE;
            }
            int b = getNextUnsignedByteFromBuffer();
            if (b != INT_BACKSLASH) {
                _reportError("...");
            }
            _quotedDigits = -1;
            _quoted32 = 0;
        }
        ch = _decodeSplitEscaped(_quoted32, _quotedDigits);
        if (ch < 0) {
            _minorState = MINOR_PROPERTY_NAME_ESCAPE;
            return JsonToken.NOT_AVAILABLE;
        }
        ch = _decodeSurrogate(_pendingSurrogateInName, ch);
        _pendingSurrogateInName = 0;
    } else {
        ch = _decodeSplitEscaped(_quoted32, _quotedDigits);
        if (ch < 0) {
            _minorState = MINOR_PROPERTY_NAME_ESCAPE;
            return JsonToken.NOT_AVAILABLE;
        }
        // [jackson-core#1541]: Check if decoded value is a high surrogate
        if (ch >= 0xD800 && ch <= 0xDBFF) {
            _pendingSurrogateInName = ch;
            _quoted32 = 0;
            _quotedDigits = -2;
            _minorState = MINOR_PROPERTY_NAME_ESCAPE;
            return _finishPropertyWithEscape();
        } else if (ch >= 0xDC00 && ch <= 0xDFFF) {
            _reportUnexpectedLowSurrogate(ch);
        }
    }
    // ... encode ch into quads and continue name parsing

The 2.21 branch has none of this. Its _finishFieldWithEscape() in NonBlockingUtf8JsonParserBase decodes the escape character and directly re-encodes it as UTF-8 bytes into the quad buffer — but it has no concept of a pending high surrogate. When a \uD834\uDD1e surrogate pair appears in a field name, the first escape (\uD834) is decoded and the code attempts to re-encode it as a 3-byte UTF-8 sequence. This hits the surrogate check in _addName() at line 773:

// [jackson-core#363]: Surrogates (0xD800 - 0xDFFF) are illegal in UTF-8 for 3-byte sequences
if (needed == 2) {
    if (ch >= 0xD800 && ch <= 0xDFFF) {
        _reportInvalidUTF8Surrogate(ch);
    }

...throwing Invalid UTF-8: Illegal surrogate character.
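
To see why the first escape on its own cannot be re-encoded, here is a small JDK-only sketch (no Jackson classes involved). The class name LoneSurrogateDemo and its helpers are made up for illustration; isSurrogate only mirrors the range test in the guard quoted above.

```java
import java.nio.charset.StandardCharsets;

final class LoneSurrogateDemo {
    /** Same range test as the quoted guard: true for any UTF-16 surrogate code unit. */
    static boolean isSurrogate(int ch) {
        return ch >= 0xD800 && ch <= 0xDFFF;
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The first escape alone is a high surrogate, so it trips the guard.
        System.out.println(isSurrogate(0xD834));   // true
        // The code point formed by the full pair is outside the surrogate range.
        System.out.println(isSurrogate(0x1D11E));  // false
        // The JDK encodes the full pair as one 4-byte UTF-8 sequence...
        System.out.println(hex("\uD834\uDD1E".getBytes(StandardCharsets.UTF_8))); // F0 9D 84 9E
        // ...but has no valid encoding for a lone surrogate and substitutes '?'.
        System.out.println(hex("\uD834".getBytes(StandardCharsets.UTF_8)));       // 3F
    }
}
```

So the 2.21 code is not wrong to reject a surrogate in a 3-byte sequence; the problem is that it never waits for the second half of the pair.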

Suggested Fix for 2.21

You need to backport the same surrogate-pair handling logic from 3.x into NonBlockingUtf8JsonParserBase on the 2.21 branch. The key changes are:

1. Add _pendingSurrogateInName field to NonBlockingJsonParserBase

// After _quotedDigits:
protected int _quotedDigits;

// ADD THIS:
/**
 * High surrogate code point awaiting matching low surrogate during
 * field name parsing, or 0 if none pending.
 * [jackson-core#1581]
 */
protected int _pendingSurrogateInName;

2. Fix _finishFieldWithEscape() in NonBlockingUtf8JsonParserBase

The existing _finishFieldWithEscape() (called from the MINOR_FIELD_NAME_ESCAPE state for both NonBlockingByteArrayParser and NonBlockingByteBufferParser) must be updated to:

  • When a decoded char is a high surrogate (0xD800–0xDBFF): save it in _pendingSurrogateInName, set _quotedDigits = -2 to signal that the next byte to read is the \ of the low surrogate escape, then try to read that escape right away, returning NOT_AVAILABLE only if the input runs out first.
  • When resuming with _pendingSurrogateInName != 0: read/decode the low surrogate escape, then combine the pair using UTF-16 surrogate math into a supplementary code point, then encode that as a 4-byte UTF-8 sequence into the quad buffer.
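
The "UTF-16 surrogate math" in the second bullet can be checked in isolation. This is a minimal sketch, assuming a hypothetical helper class SurrogateMath that is not part of jackson-core; it compares the hand-written formula against the JDK's Character.toCodePoint.

```java
final class SurrogateMath {
    /** Combines a high and a low surrogate into one supplementary code point. */
    static int combine(int high, int low) {
        if (high < 0xD800 || high > 0xDBFF) {
            throw new IllegalArgumentException("not a high surrogate: 0x" + Integer.toHexString(high));
        }
        if (low < 0xDC00 || low > 0xDFFF) {
            throw new IllegalArgumentException("not a low surrogate: 0x" + Integer.toHexString(low));
        }
        // Each surrogate carries 10 payload bits; the result is offset by 0x10000.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int cp = combine(0xD834, 0xDD1E);
        System.out.println(Integer.toHexString(cp)); // 1d11e
        // Cross-check against the JDK's own implementation.
        System.out.println(cp == Character.toCodePoint((char) 0xD834, (char) 0xDD1E)); // true
    }
}
```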

Here's the essential structure of the fix (mirroring the 3.x _finishPropertyWithEscape()):

protected final JsonToken _finishFieldWithEscape() throws IOException
{
    int ch;

    // [jackson-core#1581]: Handle pending high surrogate
    if (_pendingSurrogateInName != 0) {
        if (_quotedDigits == -2) {
            // Waiting for the backslash of the low surrogate escape
            if (_inputPtr >= _inputEnd) {
                return JsonToken.NOT_AVAILABLE;
            }
            int b = getNextUnsignedByteFromBuffer();
            if (b != INT_BACKSLASH) {
                _reportError("Broken surrogate pair in field name: expected '\\' to start low surrogate, got 0x"
                        + Integer.toHexString(b));
            }
            _quotedDigits = -1;
            _quoted32 = 0;
        }
        ch = _decodeSplitEscaped(_quoted32, _quotedDigits);
        if (ch < 0) {
            _minorState = MINOR_FIELD_NAME_ESCAPE;
            return JsonToken.NOT_AVAILABLE;
        }
        // Combine high + low surrogate into supplementary code point
        int highSurrogate = _pendingSurrogateInName;
        _pendingSurrogateInName = 0;
        if (ch < 0xDC00 || ch > 0xDFFF) {
            _reportError("Broken surrogate pair in field name: high surrogate 0x"
                    + Integer.toHexString(highSurrogate)
                    + " not followed by valid low surrogate, got 0x"
                    + Integer.toHexString(ch));
        }
        ch = 0x10000 + ((highSurrogate - 0xD800) << 10) + (ch - 0xDC00);
    } else {
        ch = _decodeSplitEscaped(_quoted32, _quotedDigits);
        if (ch < 0) {
            _minorState = MINOR_FIELD_NAME_ESCAPE;
            return JsonToken.NOT_AVAILABLE;
        }
        // [jackson-core#1581]: high surrogate — save and wait for low surrogate
        if (ch >= 0xD800 && ch <= 0xDBFF) {
            _pendingSurrogateInName = ch;
            _quoted32 = 0;
            _quotedDigits = -2;
            _minorState = MINOR_FIELD_NAME_ESCAPE;
            return _finishFieldWithEscape(); // recurse to attempt reading low surrogate
        } else if (ch >= 0xDC00 && ch <= 0xDFFF) {
            _reportError("Unexpected low surrogate character (0x"
                    + Integer.toHexString(ch) + ") in field name");
        }
    }

    // Now encode `ch` into quads as UTF-8
    // ... (existing quad-encoding logic, extended to handle 4-byte / supplementary code points)
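
The pending-surrogate idea can also be tried outside the parser. Below is a toy decoder, a sketch only: the class SplitEscapeDecoder, its State enum and its methods are all hypothetical, it accepts nothing but a run of \uXXXX escapes, and it throws where the real code would call _reportError. What it shares with the proposed fix is that the saved high surrogate survives between feed() calls, so a pair may be split at any byte.

```java
import java.nio.charset.StandardCharsets;

final class SplitEscapeDecoder {
    private enum State { NEED_BACKSLASH, NEED_U, HEX }

    private State state = State.NEED_BACKSLASH;
    private int value;        // hex digits accumulated so far
    private int digits;       // number of hex digits read for the current escape
    private int pendingHigh;  // high surrogate awaiting its low half, or 0 if none
    private final StringBuilder out = new StringBuilder();

    /** Consumes one chunk; all state persists, so an escape may straddle chunks. */
    void feed(byte[] chunk) {
        for (byte raw : chunk) {
            int b = raw & 0xFF;
            switch (state) {
            case NEED_BACKSLASH:
                if (b != '\\') {
                    throw new IllegalStateException("expected backslash, got 0x" + Integer.toHexString(b));
                }
                state = State.NEED_U;
                break;
            case NEED_U:
                if (b != 'u') {
                    throw new IllegalStateException("expected 'u', got 0x" + Integer.toHexString(b));
                }
                value = 0;
                digits = 0;
                state = State.HEX;
                break;
            case HEX:
                int d = Character.digit(b, 16);
                if (d < 0) {
                    throw new IllegalStateException("bad hex digit 0x" + Integer.toHexString(b));
                }
                value = (value << 4) | d;
                if (++digits == 4) {
                    onCodeUnit(value);
                    state = State.NEED_BACKSLASH;
                }
                break;
            }
        }
    }

    /** Handles one decoded UTF-16 code unit, pairing surrogates as needed. */
    private void onCodeUnit(int ch) {
        if (pendingHigh != 0) {
            if (ch < 0xDC00 || ch > 0xDFFF) {
                throw new IllegalStateException("high surrogate not followed by low surrogate");
            }
            out.appendCodePoint(0x10000 + ((pendingHigh - 0xD800) << 10) + (ch - 0xDC00));
            pendingHigh = 0;
        } else if (ch >= 0xD800 && ch <= 0xDBFF) {
            pendingHigh = ch; // save and wait for the low surrogate
        } else if (ch >= 0xDC00 && ch <= 0xDFFF) {
            throw new IllegalStateException("unexpected low surrogate");
        } else {
            out.appendCodePoint(ch);
        }
    }

    /** True when no escape is half-read and no high surrogate is pending. */
    boolean isComplete() {
        return state == State.NEED_BACKSLASH && pendingHigh == 0;
    }

    String result() {
        return out.toString();
    }

    public static void main(String[] args) {
        SplitEscapeDecoder dec = new SplitEscapeDecoder();
        byte[] input = "\\uD834\\uDD1E".getBytes(StandardCharsets.US_ASCII);
        for (byte b : input) {
            dec.feed(new byte[] { b }); // worst case: one byte per chunk
        }
        System.out.println(Integer.toHexString(dec.result().codePointAt(0))); // 1d11e
    }
}
```

Feeding one byte per chunk covers every possible split point, which is roughly what a byte-at-a-time async test exercises.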

3. Extend the quad UTF-8 encoding to handle supplementary characters

In 2.21's _finishFieldWithEscape(), the existing UTF-8 re-encoding only handles up to 3-byte sequences (i.e., BMP characters). You need to add the 4-byte case for supplementary code points (>= 0x10000):

if (ch > 127) {
    if (currQuadBytes >= 4) {
        quads[qlen++] = currQuad;
        currQuad = 0;
        currQuadBytes = 0;
    }
    if (ch < 0x800) { // 2-byte
        currQuad = (currQuad << 8) | (0xc0 | (ch >> 6));
        ++currQuadBytes;
    } else if (ch < 0x10000) { // 3 bytes (BMP, non-surrogate)
        currQuad = (currQuad << 8) | (0xe0 | (ch >> 12));
        ++currQuadBytes;
        if (currQuadBytes >= 4) { quads[qlen++] = currQuad; currQuad = 0; currQuadBytes = 0; }
        currQuad = (currQuad << 8) | (0x80 | ((ch >> 6) & 0x3f));
        ++currQuadBytes;
    } else { // 4 bytes (supplementary, from surrogate pair)
        currQuad = (currQuad << 8) | (0xf0 | (ch >> 18));
        ++currQuadBytes;
        if (currQuadBytes >= 4) { quads[qlen++] = currQuad; currQuad = 0; currQuadBytes = 0; }
        currQuad = (currQuad << 8) | (0x80 | ((ch >> 12) & 0x3f));
        ++currQuadBytes;
        if (currQuadBytes >= 4) { quads[qlen++] = currQuad; currQuad = 0; currQuadBytes = 0; }
        currQuad = (currQuad << 8) | (0x80 | ((ch >> 6) & 0x3f));
        ++currQuadBytes;
    }
    ch = 0x80 | (ch & 0x3f); // last byte, falls through to common output below
}
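
The bit arithmetic in the 4-byte branch can be verified against the JDK encoder. This is a standalone sketch under stated assumptions: QuadUtf8Sketch, encodeUtf8 and packQuads are hypothetical names, and the packing (big-endian, four bytes per int, a short last quad kept in the low bytes) only imitates the currQuad/currQuadBytes loop shown above rather than reusing any parser code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class QuadUtf8Sketch {
    /** Encodes one non-surrogate code point as UTF-8 (1 to 4 bytes). */
    static byte[] encodeUtf8(int cp) {
        if (cp < 0x80) {
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {
            return new byte[] { (byte) (0xc0 | (cp >> 6)), (byte) (0x80 | (cp & 0x3f)) };
        }
        if (cp < 0x10000) {
            return new byte[] { (byte) (0xe0 | (cp >> 12)),
                    (byte) (0x80 | ((cp >> 6) & 0x3f)), (byte) (0x80 | (cp & 0x3f)) };
        }
        return new byte[] { (byte) (0xf0 | (cp >> 18)), (byte) (0x80 | ((cp >> 12) & 0x3f)),
                (byte) (0x80 | ((cp >> 6) & 0x3f)), (byte) (0x80 | (cp & 0x3f)) };
    }

    /** Packs bytes into ints, four per quad, flushing a full quad before adding more. */
    static int[] packQuads(byte[] bytes) {
        int[] quads = new int[(bytes.length + 3) / 4];
        int qlen = 0;
        int currQuad = 0;
        int currQuadBytes = 0;
        for (byte b : bytes) {
            if (currQuadBytes >= 4) {
                quads[qlen++] = currQuad;
                currQuad = 0;
                currQuadBytes = 0;
            }
            currQuad = (currQuad << 8) | (b & 0xFF);
            ++currQuadBytes;
        }
        if (currQuadBytes > 0) {
            quads[qlen++] = currQuad;
        }
        return quads;
    }

    public static void main(String[] args) {
        byte[] mine = encodeUtf8(0x1D11E);
        byte[] jdk = new String(Character.toChars(0x1D11E)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(mine, jdk));                    // true
        System.out.println(Integer.toHexString(packQuads(mine)[0]));     // f09d849e
    }
}
```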

Also make sure _parseEscapedName() (the inline fast path) has the same surrogate-pair detection — check if the decoded escape char is a high surrogate, save it in _pendingSurrogateInName, and suspend — mirroring what _finishFieldWithEscape() does.

Summary

Item | 3.x | 2.21 (current) | 2.21 (after fix)
_pendingSurrogateInName field | ✅ | ❌ | add it
_finishPropertyWithEscape (_finishFieldWithEscape in 2.21) handles surrogate pairs | ✅ | ❌ | backport logic
4-byte UTF-8 encoding in quad buffer | ✅ | ❌ | add 4-byte case
_parseEscapedName inline surrogate detection | ✅ | ❌ | add guard

The test in PR #1583 will pass on 2.21 once these production-code changes are in place. The test itself is correct and complete — it just needs the underlying parser fix to land alongside it.
