Skip to content
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
58 changes: 58 additions & 0 deletions src/lazy/any_encoding.rs
Original file line number Diff line number Diff line change
Expand Up @@ -459,6 +459,11 @@ impl LazyRawAnyReader<'_> {
match *data {
[0xE0, 0x01, 0x00, 0xEA, ..] => IonEncoding::Binary_1_0,
[0xE0, 0x01, 0x01, 0xEA, ..] => IonEncoding::Binary_1_1,

// We should try binary first since it can handle incomplete data better when we have incomplete data
[0xE0, 0x01, 0x00] | [0xE0, 0x01, 0x01] => IonEncoding::Binary_1_0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately we can't assume an encoding off of anything less than the full 4 byte binary IVM. Consider having 2 services communicating via ion. Suppose service A is sending ion data to service B, and service B is parsing it while it is being sent. If service B receives the first 2 bytes [0xE0, 0x01], and defaults to Binary ion 1.0, creating a reader in preparation for the next bytes, but the next bytes are [0x01, 0xEA]. In this event the reader is going to interpret the remaining bytes as ion 1.0 which will have very different meaning than the intended 1.1 opcodes.

In order to claim that an encoding is detected, we need to eliminate ambiguity and ensure we have enough data to make the decision. So something like [0xE0, 0x01] should signal that we need more data to proceed. detect_encoding will need to trigger that signal and ultimately should lead to an IonResult<..>::Incomplete bubbling up if data contains only a partial IVM (0xE0, 0xE0 0x01, or 0xE0 0x01 0x00). the Incomplete will inform either the reader, or the user, that more data needs to be buffered before we can continue.

detect_encoding gets called in 2 spots: LazyRawAnyReader::new and LazyRawAnyReader::resume. Both of these methods will need to return an IonResult, and if detect_encoding is unable to determine the encoding based on the IVM, they'll need to bubble up an IonResult::incomplete("incomplete IVM read", offset) or similar.

The only allowed "ambiguous" result is if the data provided does not start with an IVM or consist entirely of an IVM prefix, then we assume it is ion text.

[0xE0, 0x01] => IonEncoding::Binary_1_0,
[0xE0] => IonEncoding::Binary_1_0,
_ => IonEncoding::Text_1_0,
}
}
Expand Down Expand Up @@ -1992,4 +1997,57 @@ mod tests {

Ok(())
}

#[test]
fn test_detect_encoding_from_stream() {
use std::io::{self, Cursor, Read};
use crate::{Reader, AnyEncoding};

let data = [
0xE0u8, 0x01, 0x00, 0xEA, // IVM
0x83, 65, 66, 67, // String: "ABC"
];

let mut input: Box<dyn Read> = Box::new(io::empty());
for input_byte in data {
input = Box::new(input.chain(Cursor::new([input_byte])));
}
let _values: Vec<_> = Reader::new(AnyEncoding, input)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this is working because the reader is defaulting to a 1.0 reader upon not having enough IVM data, and once it gets to reading values (where incomplete data is handled) it ends up reading 1.0 data. If instead of a 3 character 1.0 string, we had 1.1 data, that resulted in a detectable problem when interpreted as 1.0.. then I would expect this to fail.

.expect("a reader")
.collect::<IonResult<_>>()
.expect("values should be read successfully");
}

#[test]
fn test_detect_encoding_incomplete_patterns() {
// Test that incomplete binary IVM patterns are handled correctly
let test_cases = vec![
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In all of these cases detect_encoding should signal that an encoding is unable to be determined. It could return an Option<IonEncoding>, and None could indicate an encoding wasn't determined. The only reason I'd say Option over IonResult, is because detect_encoding does not have reader state to supply offset, but its callers do.

vec![0xE0],
vec![0xE0, 0x01],
vec![0xE0, 0x01, 0x00],
vec![0xE0, 0x01, 0x01],
];

for incomplete_data in test_cases {
let encoding = LazyRawAnyReader::detect_encoding(&incomplete_data);
assert_eq!(encoding, IonEncoding::Binary_1_0,
"Failed for data: {:?}", incomplete_data);
}
}

#[test]
fn test_detect_encoding_complete_patterns() {
let test_cases = vec![
(vec![0xE0, 0x01, 0x00, 0xEA], IonEncoding::Binary_1_0),
(vec![0xE0, 0x01, 0x01, 0xEA], IonEncoding::Binary_1_1),
(vec![0xE0, 0x01, 0x00, 0xEA, 0x21, 0x01], IonEncoding::Binary_1_0), // with extra data
(vec![0xE0, 0x01, 0x01, 0xEA, 0x21, 0x01], IonEncoding::Binary_1_1), // with extra data
];

for (data, expected_encoding) in test_cases {
let encoding = LazyRawAnyReader::detect_encoding(&data);
assert_eq!(encoding, expected_encoding,
"Failed for data: {:?}", data);
}
}
}
Loading