Support Shredded Lists/Array in variant_get#8354
sdf-jkl wants to merge 49 commits into apache:main
scovich left a comment
I'm not sure I understand how these unit tests will translate to variant_get?
Could you elaborate please? I am currently trying to build just the Shredded …
No worries -- the current iteration does look like it produces a correct shredded variant containing a list, so I should probably just be patient and let you finish!
My question is: does …
The reason I am asking is that since we use the output of …
The only way to work with list arrays I came up with so far is to build new arrays with …
    // Build the list of indices to take
    let mut take_indices = Vec::with_capacity(list_len);
    for i in 0..list_len {
        let start = offsets[i] as usize;
        let end = offsets[i + 1] as usize;
        let len = end - start;

        if *index < len {
            take_indices.push(Some((start + index) as u32));
        } else {
            take_indices.push(None);
        }
    }

    let index_array = UInt32Array::from(take_indices);

    // Use Arrow compute kernel to gather elements
    let taken = take(field_array, &index_array, None)?;
You can see the basic idea here
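The index-building step in the snippet above can be tried without any Arrow types. The following is only a minimal sketch: `build_take_indices` is a hypothetical helper name (not from the PR), and it assumes `offsets` has the usual list-array layout where row `i` spans `offsets[i]..offsets[i + 1]` in the flattened child values.

```rust
/// Hypothetical helper (not from the PR): for each list row described by
/// `offsets`, return the position of element `index` in the flattened child
/// values, or `None` when the row is too short to contain that element.
fn build_take_indices(offsets: &[i32], index: usize) -> Vec<Option<u32>> {
    offsets
        // Each adjacent pair of offsets delimits one list row.
        .windows(2)
        .map(|w| {
            let start = w[0] as usize;
            let end = w[1] as usize;
            let len = end - start;
            if index < len {
                Some((start + index) as u32)
            } else {
                // Out of bounds for this row: emit a null take index.
                None
            }
        })
        .collect()
}
```

For offsets `[0, 3, 3, 5]` and `index = 1`, the rows have lengths 3, 0 and 2, so only the empty middle row yields `None`. The resulting vector is the kind of input the snippet passes to `UInt32Array::from` before calling `take`.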
Hey @scovich, I made it work for one of the simple tests, but the second one does not pass because Variant to Arrow does not support utf8 yet. Do we have an issue tracking variant_to_arrow type support? If not, I can make one.
I'm not sure we have a tracking issue for utf8 support in variant_to_arrow, but I've also noticed that it's an annoying gap for unit testing (we all seem to reach for string values...)
Looking for an early pass? Or is this …
@scovich an early pass to check the test coverage should do. Once the test coverage is complete and the code passes, it will be …
scovich left a comment
Thanks for taking a stab at this. Several comments.
    return Err(ArrowError::CastError(format!(
        "Cannot access index '{}' for row {} with list length {}",
        index, row, len
    )));
We need to decide what the out-of-bounds semantics should be. For example, Spark just returns NULL.
By way of comparison, Spark and arrow-rs both return NULL for non-existent struct fields, which could be argued to be analogous. Or maybe it's considered different and we want the error. Or maybe missing struct fields are also handled wrong?
(I'm comfortable with following Spark semantics, but would love to hear others' thoughts)
I support following the Spark semantics too.
Do we need to add this behavior to the documentation or somewhere else?
The latest behavior is to error out on OOB access unless safe casting is enabled.
Spark semantics would just return NULL regardless of that flag.
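To make the two behaviors concrete, here is a minimal sketch over a plain slice rather than an Arrow array. The function name `list_element` and the `safe` parameter are hypothetical stand-ins for the PR's code path and the safe-casting option; only the error message format is taken from the diff above.

```rust
/// Hypothetical sketch: `safe` stands in for the safe-casting option.
/// With `safe == true` an out-of-bounds index yields a null (`None`);
/// with `safe == false` it yields an error, as described in the thread.
fn list_element(
    row: usize,
    values: &[i64],
    index: usize,
    safe: bool,
) -> Result<Option<i64>, String> {
    match values.get(index) {
        Some(v) => Ok(Some(*v)),
        // Safe mode: treat a missing element as NULL.
        None if safe => Ok(None),
        // Strict mode: surface the out-of-bounds access as an error.
        None => Err(format!(
            "Cannot access index '{}' for row {} with list length {}",
            index,
            row,
            values.len()
        )),
    }
}
```

Under Spark-style semantics the `safe` flag would not matter for this case: the last arm would go away and every out-of-bounds access would return `Ok(None)`.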
Actually tho -- I think Spark is just implementing jsonpath semantics:
A syntactically valid segment MUST NOT produce errors when executing the query. This means that some operations that might be considered erroneous, such as using an index lying outside the range of an array, simply result in fewer nodes being selected.
Here, "syntactically valid" is referring to the previous section (2.1):
A JSONPath implementation MUST raise an error for any query that is not well-formed and valid. The well-formedness and the validity of JSONPath queries are independent of the JSON value the query is applied to. No further errors relating to the well-formedness and the validity of a JSONPath query can be raised during application of the query to a value. This clearly separates well-formedness/validity errors in the query from mismatches that may actually stem from flaws in the data.
Note: Integer overflow in an index is well-formed but not valid, so it's allowed to produce an error.
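The split the spec draws between query validation and query evaluation can be sketched in a few lines. This is a simplified illustration, not arrow-rs code: `parse_index` and `select_index` are hypothetical names, negative indices are ignored, and overflow is detected against `usize` rather than the integer range the spec itself defines.

```rust
/// Query-time validation (hypothetical helper): an index that does not fit
/// the integer type is rejected here, before any data is examined.
fn parse_index(text: &str) -> Result<usize, String> {
    text.parse::<usize>()
        .map_err(|e| format!("invalid index '{}': {}", text, e))
}

/// Evaluation (hypothetical helper): a valid index never produces an error.
/// An index outside the array simply selects nothing.
fn select_index<'a, T>(items: &'a [T], index: usize) -> Option<&'a T> {
    items.get(index)
}
```

With this shape, every error comes from `parse_index`, independent of the data, while `select_index` can only return fewer results.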
    // Gather both typed and fallback values at the requested element index.
    let taken_value = value_array
        .map(|value| take(value, &index_array, None))
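The excerpt above is cut off, so how the PR unwraps the result is not visible here. One common Rust pattern for applying a fallible operation to an optional column is `Option::map` followed by `Option::transpose`, sketched below with plain slices. Both `gather` (a stand-in for a take-style kernel) and `gather_optional` are hypothetical and are not the Arrow API.

```rust
/// Stand-in for a fallible gather kernel (hypothetical, not the Arrow API):
/// null indices produce null outputs, invalid indices produce an error.
fn gather(values: &[i64], indices: &[Option<usize>]) -> Result<Vec<Option<i64>>, String> {
    indices
        .iter()
        .map(|idx| match idx {
            None => Ok(None),
            Some(i) => values
                .get(*i)
                .copied()
                .map(Some)
                .ok_or_else(|| format!("index {} out of bounds", i)),
        })
        .collect()
}

/// Apply `gather` only when the column is present. `map` yields
/// `Option<Result<..>>`; `transpose` flips it to `Result<Option<..>>`
/// so the caller can use `?` once.
fn gather_optional(
    column: Option<&[i64]>,
    indices: &[Option<usize>],
) -> Result<Option<Vec<Option<i64>>>, String> {
    column.map(|values| gather(values, indices)).transpose()
}
```

A missing column stays `None`, a present column is gathered, and any gather error propagates to the caller unchanged.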
Everything looks good, code-wise -- nice and clean. But there's still an open question of whether we intend to follow the jsonpath spec in our path step logic, as e.g. Spark does. The jsonpath spec requires …
In contrast, our struct handling code currently returns an error if safe casting is disabled and: …
@alamb -- any opinions about supporting jsonpath semantics or not? Or ideas on who we should seek input from?
Which issue does this PR close?
Rationale for this change
We should be able to variant_get using indices to path through VariantArrays.
What changes are included in this PR?
Are these changes tested?
Yes, unit tested.
Are there any user-facing changes?