
feature: add support for slicing above i32 where u64 can safely be coerced to f64 losslessly (#861)

Open
mobiusklein wants to merge 7 commits into kylebarron:main from mobiusklein:main

Conversation

@mobiusklein

Right now, if I try to read a Parquet file larger than 2**31 bytes, WrappedFile::get_bytes panics because it cannot coerce the u64 byte range to i32. This change relaxes the limit: it checks that the indices round-trip from u64 -> f64 -> u64 without loss of precision, and then uses the f64 representation to slice the Blob.

This might also be doable by checking that the value is less than Number.MAX_SAFE_INTEGER, but I didn't see a good way to query that constant from WASM, though I doubt its value can change.
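The round-trip check described above can be sketched as a small standalone helper. The function name here is illustrative, not the PR's actual code:

```rust
/// Return Some(f64) only if `index` survives the u64 -> f64 -> u64
/// round trip without losing precision (i.e. it is exactly
/// representable as an f64 integer). Hypothetical helper, not the
/// crate's real API.
fn to_exact_f64(index: u64) -> Option<f64> {
    let as_f64 = index as f64;
    if as_f64 as u64 == index {
        Some(as_f64)
    } else {
        None
    }
}

fn main() {
    // 2^31 is above i32::MAX but still exactly representable in f64.
    assert_eq!(to_exact_f64(1u64 << 31), Some(2147483648.0));
    // 2^53 + 1 cannot round-trip: f64 has only 53 bits of mantissa.
    assert_eq!(to_exact_f64((1u64 << 53) + 1), None);
}
```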

 spawn_local(async move {
-    let subset_blob = file
-        .slice_with_i32_and_i32(
+    let subset_blob = if (range.start <= i32::MAX as u64) && (range.end <= i32::MAX as u64)
Owner

Instead of having two branches for i32 and f64, we can probably first check that the range start and end are representable as integers in f64, and then use only the f64 slice API.
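A minimal sketch of that single-branch approach. The helper name `checked_slice_bounds` and the constant are illustrative stand-ins, not the crate's real API:

```rust
// Largest u64 that is exactly representable in f64: 2^53 - 1,
// matching JavaScript's Number.MAX_SAFE_INTEGER.
const MAX_EXACT_INTEGER: u64 = (1u64 << f64::MANTISSA_DIGITS) - 1;

/// Reject any range whose endpoints exceed the exact-integer limit;
/// otherwise hand back f64 bounds suitable for the f64 slice API.
/// Hypothetical helper for illustration only.
fn checked_slice_bounds(start: u64, end: u64) -> Result<(f64, f64), String> {
    if start > MAX_EXACT_INTEGER || end > MAX_EXACT_INTEGER {
        return Err(format!("{start}..{end} exceeds the largest exact f64 integer"));
    }
    Ok((start as f64, end as f64))
}

fn main() {
    // 2^32 is well above i32::MAX but safely representable.
    assert!(checked_slice_bounds(0, 1u64 << 32).is_ok());
    // Past 2^53 the conversion is no longer lossless.
    assert!(checked_slice_bounds(0, (1u64 << 53) + 2).is_err());
}
```

With this shape there is a single code path: every range goes through the same validation and the same f64-based slice call.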

} else {
    let start = range.start as f64;
    if start as u64 != range.start {
        panic!("Cannot safely convert start index of {range:?}");
Owner

Instead of panicking, we should change this method to be fallible and return an Err

@kylebarron
Owner

As noted in rust-lang/rust#152466, you can just define top-level constants for the max safe integer:

pub const MAX_EXACT_INTEGER: f64 = ((1u64 << f64::MANTISSA_DIGITS) - 1) as f64;
pub const MIN_EXACT_INTEGER: f64 = -MAX_EXACT_INTEGER;

@mobiusklein
Author

I made the requested changes and piped the failure along to the AsyncFileReader implementation.

I also have a separate branch where I'm working on exposing more of the Parquet file indices and statistics to JS. Is that appropriate to add in a separate PR?

}

-pub async fn get_bytes(&mut self, range: Range<u64>) -> Vec<u8> {
+pub async fn get_bytes(&mut self, range: Range<u64>) -> io::Result<Vec<u8>> {
Owner

We already have a result type in this crate

pub type Result<T> = std::result::Result<T, ParquetWasmError>;
pub type WasmResult<T> = std::result::Result<T, JsError>;

-    panic!("Cannot safely convert start index of {range:?}");
-}
-file.slice_with_f64_and_f64(start, end).unwrap()
+sender.send(Err(io::Error::new(
+    io::ErrorKind::Unsupported,
+    format!("{range:?} is too large to convert into a Blob slice"),
+))).unwrap();
Owner

This can be a PlatformSupportError

#[error("Platform error: `{0}`")]
PlatformSupportError(String),

since JS on the web doesn't support a slice index larger than a Number
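A hedged sketch of what that suggestion might look like in isolation. The enum below is a stand-in for the crate's real ParquetWasmError (which presumably derives Display via the #[error(...)] attribute shown above):

```rust
use std::fmt;

// Stand-in error type for illustration; the real crate's enum has
// more variants and derives its Display impl from thiserror.
#[derive(Debug)]
enum ParquetWasmError {
    PlatformSupportError(String),
}

// Hand-written equivalent of #[error("Platform error: `{0}`")].
impl fmt::Display for ParquetWasmError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::PlatformSupportError(msg) => write!(f, "Platform error: `{msg}`"),
        }
    }
}

fn main() {
    let range = 0u64..(1u64 << 54);
    let err = ParquetWasmError::PlatformSupportError(format!(
        "{range:?} is too large to convert into a Blob slice"
    ));
    assert!(err.to_string().starts_with("Platform error:"));
}
```

Sending this domain error through the channel (instead of a raw io::Error) keeps the "too large for a JS Number" failure distinguishable from ordinary I/O failures.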

Comment on lines +351 to +354
let result = match file.get_bytes(range).await {
    Ok(result) => Ok(Bytes::from(result)),
    Err(e) => Err(e),
};
Owner

Just use ? after the await

 let mut file = self.file.clone();
 spawn_local(async move {
-    let result: Bytes = file.get_bytes(range).await.into();
+    let result = file.get_bytes(range).await.map(Bytes::from);
Owner

Like you did here

@kylebarron
Owner

I also have a separate branch where I'm working on exposing more of the Parquet file indices and statistics to JS. Is that appropriate to add in a separate PR?

Perhaps create an issue first to discuss what you want to achieve. Let's keep this PR focused on this fix.
