Shauvy/detect encoding in ivm #1005

Shaurya0108 · 2025-10-08T17:29:23Z

Issue #, if available:
#954

Description of changes:
The AnyEncoding reader was failing to read binary Ion data when processing it during streaming scenarios where bytes would arrive incrementally. detect_encoding() function only handled complete 4-byte IVM patterns, not the intermediate states that occur during streaming.

The fix required adding a modification in detect_encoding() to recognize incomplete binary IVM patterns:

[0xE0, 0x01, 0x00] | [0xE0, 0x01, 0x01] => IonEncoding::Binary_1_0,
[0xE0, 0x01] => IonEncoding::Binary_1_0,
[0xE0] => IonEncoding::Binary_1_0,

Binary readers now handle incomplete data properly by returning IonError::Incomplete
The choice of Binary_1_0 is defaulted

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov · 2025-10-08T17:35:02Z

Codecov Report

❌ Patch coverage is 94.87179% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.35%. Comparing base (3aaccc3) to head (22fccfd).

Files with missing lines	Patch %	Lines
src/lazy/any_encoding.rs	94.87%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1005      +/-   ##
==========================================
+ Coverage   78.32%   78.35%   +0.02%     
==========================================
  Files         138      138              
  Lines       34108    34147      +39     
  Branches    34108    34147      +39     
==========================================
+ Hits        26716    26755      +39     
- Misses       5317     5319       +2     
+ Partials     2075     2073       -2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

nirosys · 2025-10-08T19:53:52Z

src/lazy/any_encoding.rs

            [0xE0, 0x01, 0x01, 0xEA, ..] => IonEncoding::Binary_1_1,
+
+            // We should try binary first since it can handle incomplete data better when we have incomplete data
+            [0xE0, 0x01, 0x00] | [0xE0, 0x01, 0x01] => IonEncoding::Binary_1_0,


Unfortunately we can't assume an encoding off of anything less than the full 4 byte binary IVM. Consider having 2 services communicating via ion. Suppose service A is sending ion data to service B, and service B is parsing it while it is being sent. If service B receives the first 2 bytes [0xE0, 0x01], and defaults to Binary ion 1.0, creating a reader in preparation for the next bytes, but the next bytes are [0x01, 0xEA]. In this event the reader is going to interpret the remaining bytes as ion 1.0 which will have very different meaning than the intended 1.1 opcodes.

In order to claim that an encoding is detected, we need to eliminate ambiguity and ensure we have enough data to make the decision. So something like [0xE0, 0x01] should signal that we need more data to proceed. detect_encoding will need to trigger that signal and ultimately should lead to an IonResult<..>::Incomplete bubbling up if data contains only a partial IVM (0xE0, 0xE0 0x01, or 0xE0 0x01 0x00). the Incomplete will inform either the reader, or the user, that more data needs to be buffered before we can continue.

detect_encoding gets called in 2 spots: LazyRawAnyReader::new and LazyRawAnyReader::resume. Both of these methods will need to return an IonResult, and if detect_encoding is unable to determine the encoding based on the IVM, they'll need to bubble up an IonResult::incomplete("incomplete IVM read", offset) or similar.

The only allowed "ambiguous" result is if the data provided does not start with an IVM or consist entirely of an IVM prefix, then we assume it is ion text.

nirosys · 2025-10-08T19:56:55Z

src/lazy/any_encoding.rs

+    #[test]
+    fn test_detect_encoding_incomplete_patterns() {
+        // Test that incomplete binary IVM patterns are handled correctly
+        let test_cases = vec![


In all of these cases detect_encoding should signal that an encoding is unable to be determined. It could return an Option<IonEncoding>, and None could indicate an encoding wasn't determined. The only reason I'd say Option over IonResult, is because detect_encoding does not have reader state to supply offset, but its callers do.

nirosys · 2025-10-08T20:00:58Z

src/lazy/any_encoding.rs

+        for input_byte in data {
+            input = Box::new(input.chain(Cursor::new([input_byte])));
+        }
+        let _values: Vec<_> = Reader::new(AnyEncoding, input)


I suspect this is working because the reader is defaulting to a 1.0 reader upon not having enough IVM data, and once it gets to reading values (where incomplete data is handled) it ends up reading 1.0 data. If instead of a 3 character 1.0 string, we had 1.1 data, that resulted in a detectable problem when interpreted as 1.0.. then I would expect this to fail.

Shaurya0108 · 2025-10-09T17:42:32Z

Proposed fixes for a long term fix:

Change LazyRawReader::new() and ::resume() to return IonResult<Self>
Update detect_encoding() to return Incomplete for partial IVM patterns [0xE0, ..]
Update trait implementations to handle new error-returning signatures
Modify callers to handle Incomplete errors (and potentially buffer more data?)

Shaurya Dwivedi and others added 6 commits October 1, 2025 14:42

Move typed null handling from ion_type() to read_null() methods

1be4596

Merge branch 'main' into shauvy/reflect_typed_nulls

bff0bca

Removed read_null

de99793

Fix ion_type() to correctly return typed null types in Ion 1.1 binary

26e7ecc

Fix encoding detection for incomplete binary IVM in streaming readers

49f160e

Merge branch 'main' into shauvy/detect-encoding-in-ivm

22fccfd

nirosys reviewed Oct 8, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Shauvy/detect encoding in ivm #1005

Shauvy/detect encoding in ivm #1005

Uh oh!

Shaurya0108 commented Oct 8, 2025 •

edited

Loading

Uh oh!

codecov bot commented Oct 8, 2025

Uh oh!

nirosys Oct 8, 2025

Uh oh!

nirosys Oct 8, 2025

Uh oh!

nirosys Oct 8, 2025

Uh oh!

Shaurya0108 commented Oct 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Shauvy/detect encoding in ivm #1005

Are you sure you want to change the base?

Shauvy/detect encoding in ivm #1005

Uh oh!

Conversation

Shaurya0108 commented Oct 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Oct 8, 2025

Codecov Report

Uh oh!

nirosys Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

nirosys Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

nirosys Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

Shaurya0108 commented Oct 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Shaurya0108 commented Oct 8, 2025 •

edited

Loading