Shauvy/detect encoding in ivm #1005
base: main
Changes from all commits
1be4596
bff0bca
de99793
26e7ecc
49f160e
22fccfd
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -459,6 +459,11 @@ impl LazyRawAnyReader<'_> { | |
match *data { | ||
[0xE0, 0x01, 0x00, 0xEA, ..] => IonEncoding::Binary_1_0, | ||
[0xE0, 0x01, 0x01, 0xEA, ..] => IonEncoding::Binary_1_1, | ||
|
||
// Try binary first, since it handles incomplete data better | ||
[0xE0, 0x01, 0x00] | [0xE0, 0x01, 0x01] => IonEncoding::Binary_1_0, | ||
[0xE0, 0x01] => IonEncoding::Binary_1_0, | ||
[0xE0] => IonEncoding::Binary_1_0, | ||
_ => IonEncoding::Text_1_0, | ||
} | ||
} | ||
|
@@ -1992,4 +1997,57 @@ mod tests { | |
|
||
Ok(()) | ||
} | ||
|
||
#[test] | ||
fn test_detect_encoding_from_stream() { | ||
use std::io::{self, Cursor, Read}; | ||
use crate::{Reader, AnyEncoding}; | ||
|
||
let data = [ | ||
0xE0u8, 0x01, 0x00, 0xEA, // IVM | ||
0x83, 65, 66, 67, // String: "ABC" | ||
]; | ||
|
||
let mut input: Box<dyn Read> = Box::new(io::empty()); | ||
for input_byte in data { | ||
input = Box::new(input.chain(Cursor::new([input_byte]))); | ||
} | ||
let _values: Vec<_> = Reader::new(AnyEncoding, input) | ||
Review comment: I suspect this works because the reader defaults to a 1.0 reader when it does not have enough IVM data, and once it gets to reading values (where incomplete data is handled) it ends up reading 1.0 data. If, instead of a 3-character 1.0 string, we had 1.1 data that produced a detectable problem when interpreted as 1.0, then I would expect this test to fail. |
||
.expect("a reader") | ||
.collect::<IonResult<_>>() | ||
.expect("values should be read successfully"); | ||
} | ||
|
||
#[test] | ||
fn test_detect_encoding_incomplete_patterns() { | ||
// Test that incomplete binary IVM patterns are handled correctly | ||
let test_cases = vec![ | ||
Review comment: In all of these cases |
||
vec![0xE0], | ||
vec![0xE0, 0x01], | ||
vec![0xE0, 0x01, 0x00], | ||
vec![0xE0, 0x01, 0x01], | ||
]; | ||
|
||
for incomplete_data in test_cases { | ||
let encoding = LazyRawAnyReader::detect_encoding(&incomplete_data); | ||
assert_eq!(encoding, IonEncoding::Binary_1_0, | ||
"Failed for data: {:?}", incomplete_data); | ||
} | ||
} | ||
|
||
#[test] | ||
fn test_detect_encoding_complete_patterns() { | ||
let test_cases = vec![ | ||
(vec![0xE0, 0x01, 0x00, 0xEA], IonEncoding::Binary_1_0), | ||
(vec![0xE0, 0x01, 0x01, 0xEA], IonEncoding::Binary_1_1), | ||
(vec![0xE0, 0x01, 0x00, 0xEA, 0x21, 0x01], IonEncoding::Binary_1_0), // with extra data | ||
(vec![0xE0, 0x01, 0x01, 0xEA, 0x21, 0x01], IonEncoding::Binary_1_1), // with extra data | ||
]; | ||
|
||
for (data, expected_encoding) in test_cases { | ||
let encoding = LazyRawAnyReader::detect_encoding(&data); | ||
assert_eq!(encoding, expected_encoding, | ||
"Failed for data: {:?}", data); | ||
} | ||
} | ||
} |
Unfortunately, we can't assume an encoding from anything less than the full 4-byte binary IVM. Consider two services communicating via Ion. Suppose service A is sending Ion data to service B, and service B is parsing it while it is being sent. If service B receives the first 2 bytes, `[0xE0, 0x01]`, defaults to binary Ion 1.0, and creates a reader in preparation for the next bytes, but the next bytes are `[0x01, 0xEA]`, then the reader is going to interpret the remaining bytes as Ion 1.0, which will have a very different meaning than the intended 1.1 opcodes.

In order to claim that an encoding is detected, we need to eliminate ambiguity and ensure we have enough data to make the decision. So something like `[0xE0, 0x01]` should signal that we need more data to proceed.

`detect_encoding` will need to trigger that signal, which should ultimately lead to an `IonResult<..>::Incomplete` bubbling up if the data contains only a partial IVM (`0xE0`, `0xE0 0x01`, or `0xE0 0x01 0x00`). The `Incomplete` will inform either the reader or the user that more data needs to be buffered before we can continue.

`detect_encoding` gets called in 2 spots: `LazyRawAnyReader::new` and `LazyRawAnyReader::resume`. Both of these methods will need to return an `IonResult`, and if `detect_encoding` is unable to determine the encoding based on the IVM, they'll need to bubble up an `IonResult::incomplete("incomplete IVM read", offset)` or similar.

The only allowed "ambiguous" result is when the data provided neither starts with an IVM nor consists entirely of an IVM prefix; in that case we assume it is Ion text.
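
To make the suggestion concrete, here is a minimal, standalone sketch of the three-way decision described above. It does not use ion-rust's types: the `Detection` enum and the `detect` and `first_decision` functions are hypothetical names invented for illustration, and `NeedMoreData` only stands in for whatever incomplete-data error the real reader would return. Treating empty input as text mirrors the existing `_` fallback arm and is an assumption, not something the review settles.

```rust
/// Hypothetical outcome type for this sketch; not part of ion-rust's API.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Detection {
    Binary_1_0,
    Binary_1_1,
    Text,
    /// The input is a strict prefix of a binary IVM, so no decision is possible yet.
    NeedMoreData,
}

const IVM_1_0: [u8; 4] = [0xE0, 0x01, 0x00, 0xEA];
const IVM_1_1: [u8; 4] = [0xE0, 0x01, 0x01, 0xEA];

/// Decides an encoding only when the bytes seen so far are unambiguous.
fn detect(data: &[u8]) -> Detection {
    if data.starts_with(&IVM_1_0) {
        return Detection::Binary_1_0;
    }
    if data.starts_with(&IVM_1_1) {
        return Detection::Binary_1_1;
    }
    // A non-empty input shorter than 4 bytes that matches the start of either
    // IVM could still turn out to be 1.0 or 1.1, so we refuse to guess.
    // Assumption: empty input falls through to text, like the original `_` arm.
    let is_partial_ivm = !data.is_empty()
        && data.len() < IVM_1_0.len()
        && (IVM_1_0.starts_with(data) || IVM_1_1.starts_with(data));
    if is_partial_ivm {
        Detection::NeedMoreData
    } else {
        Detection::Text
    }
}

/// Simulates service B buffering a stream one byte at a time. Returns how many
/// bytes were buffered when a decision was first reached, plus that decision.
/// If the stream ends (or is empty) before then, it reports `NeedMoreData`.
fn first_decision(stream: &[u8]) -> (usize, Detection) {
    for end in 1..=stream.len() {
        let detection = detect(&stream[..end]);
        if detection != Detection::NeedMoreData {
            return (end, detection);
        }
    }
    (stream.len(), Detection::NeedMoreData)
}
```

With this shape, a stream that begins `0xE0 0x01 0x01 0xEA` is not classified until all four IVM bytes are buffered, so the service A / service B scenario above can no longer commit to a 1.0 reader after two bytes. In the real code, the `NeedMoreData` case would presumably map to the incomplete error that `LazyRawAnyReader::new` and `LazyRawAnyReader::resume` bubble up.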