Skip to content

Conversation

Shaurya0108
Copy link
Contributor

@Shaurya0108 Shaurya0108 commented Oct 8, 2025

Issue #, if available:
#954

Description of changes:
The AnyEncoding reader was failing to read binary Ion data when processing it during streaming scenarios where bytes would arrive incrementally. detect_encoding() function only handled complete 4-byte IVM patterns, not the intermediate states that occur during streaming.

The fix required adding a modification in detect_encoding() to recognize incomplete binary IVM patterns:

[0xE0, 0x01, 0x00] | [0xE0, 0x01, 0x01] => IonEncoding::Binary_1_0,
[0xE0, 0x01] => IonEncoding::Binary_1_0,
[0xE0] => IonEncoding::Binary_1_0,
  • Binary readers now handle incomplete data properly by returning IonError::Incomplete
  • The choice of Binary_1_0 is defaulted

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Copy link

codecov bot commented Oct 8, 2025

Codecov Report

❌ Patch coverage is 94.87179% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.35%. Comparing base (3aaccc3) to head (22fccfd).

Files with missing lines Patch % Lines
src/lazy/any_encoding.rs 94.87% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1005      +/-   ##
==========================================
+ Coverage   78.32%   78.35%   +0.02%     
==========================================
  Files         138      138              
  Lines       34108    34147      +39     
  Branches    34108    34147      +39     
==========================================
+ Hits        26716    26755      +39     
- Misses       5317     5319       +2     
+ Partials     2075     2073       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

[0xE0, 0x01, 0x01, 0xEA, ..] => IonEncoding::Binary_1_1,

// We should try binary first since it can handle incomplete data better when we have incomplete data
[0xE0, 0x01, 0x00] | [0xE0, 0x01, 0x01] => IonEncoding::Binary_1_0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately we can't assume an encoding off of anything less than the full 4 byte binary IVM. Consider having 2 services communicating via ion. Suppose service A is sending ion data to service B, and service B is parsing it while it is being sent. If service B receives the first 2 bytes [0xE0, 0x01], and defaults to Binary ion 1.0, creating a reader in preparation for the next bytes, but the next bytes are [0x01, 0xEA]. In this event the reader is going to interpret the remaining bytes as ion 1.0 which will have very different meaning than the intended 1.1 opcodes.

In order to claim that an encoding is detected, we need to eliminate ambiguity and ensure we have enough data to make the decision. So something like [0xE0, 0x01] should signal that we need more data to proceed. detect_encoding will need to trigger that signal and ultimately should lead to an IonResult<..>::Incomplete bubbling up if data contains only a partial IVM (0xE0, 0xE0 0x01, or 0xE0 0x01 0x00). the Incomplete will inform either the reader, or the user, that more data needs to be buffered before we can continue.

detect_encoding gets called in 2 spots: LazyRawAnyReader::new and LazyRawAnyReader::resume. Both of these methods will need to return an IonResult, and if detect_encoding is unable to determine the encoding based on the IVM, they'll need to bubble up an IonResult::incomplete("incomplete IVM read", offset) or similar.

The only allowed "ambiguous" result is if the data provided does not start with an IVM or consist entirely of an IVM prefix, then we assume it is ion text.

#[test]
fn test_detect_encoding_incomplete_patterns() {
// Test that incomplete binary IVM patterns are handled correctly
let test_cases = vec![
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In all of these cases detect_encoding should signal that an encoding is unable to be determined. It could return an Option<IonEncoding>, and None could indicate an encoding wasn't determined. The only reason I'd say Option over IonResult, is because detect_encoding does not have reader state to supply offset, but its callers do.

for input_byte in data {
input = Box::new(input.chain(Cursor::new([input_byte])));
}
let _values: Vec<_> = Reader::new(AnyEncoding, input)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this is working because the reader is defaulting to a 1.0 reader upon not having enough IVM data, and once it gets to reading values (where incomplete data is handled) it ends up reading 1.0 data. If instead of a 3 character 1.0 string, we had 1.1 data, that resulted in a detectable problem when interpreted as 1.0.. then I would expect this to fail.

@Shaurya0108
Copy link
Contributor Author

Proposed fixes for a long term fix:

  • Change LazyRawReader::new() and ::resume() to return IonResult<Self>
  • Update detect_encoding() to return Incomplete for partial IVM patterns [0xE0, ..]
  • Update trait implementations to handle new error-returning signatures
  • Modify callers to handle Incomplete errors (and potentially buffer more data?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants