Support Shredded Lists/Array in variant_get #8354

Open
sdf-jkl wants to merge 49 commits into apache:main from sdf-jkl:shredded_list_support

Conversation

@sdf-jkl
Contributor

@sdf-jkl sdf-jkl commented Sep 16, 2025

Which issue does this PR close?

Rationale for this change

variant_get should be able to follow Index path elements through shredded list VariantArrays.

What changes are included in this PR?

Are these changes tested?

Yes, unit tested.

Are there any user-facing changes?

Contributor

@scovich scovich left a comment

Left a couple comments that are hopefully helpful.

Also, we should (eventually) support nesting -- arrays and structs inside arrays.
Let's get simple lists of primitives working first, tho!

Contributor

@scovich scovich left a comment

I'm not sure I understand how these unit tests will translate to variant_get?

@sdf-jkl
Contributor Author

sdf-jkl commented Sep 19, 2025

> I'm not sure I understand how these unit tests will translate to variant_get?

Could you elaborate please?

I am currently trying to build just the Shredded List VariantArray test case, and while doing so I am learning how we could build them in shred_variant later. Once I have a good way of building a simple Shredded List VariantArray, it will be easy to work on the rest of the unit tests for variant_get.

@scovich
Contributor

scovich commented Sep 19, 2025

> > I'm not sure I understand how these unit tests will translate to variant_get?
>
> Could you elaborate please?
>
> I am currently trying to build just the Shredded List VariantArray test case, and while doing so I am learning how we could build them in shred_variant later. Once I have a good way of building a simple Shredded List VariantArray, it will be easy to work on the rest of the unit tests for variant_get.

No worries -- the current iteration does look like it produces a correct shredded variant containing a list, so I should probably just be patient and let you finish!

@sdf-jkl
Contributor Author

sdf-jkl commented Sep 23, 2025

Hey @scovich, I see that in your current implementation of follow_shredded_path_element, when following the shredded path for a VariantPathElement::Field succeeds, it returns ShreddedPathStep::Success(field.shredding_state()). That holds a ShreddingState::Typed, which in turn holds a reference to the typed_value array (which we later use for the next steps).

My question is: does ShreddedPathStep::Success() have to take the ShreddingState by reference?

The reason I am asking is that since we use the output of follow_shredded_path_element to get the values from the shredded VariantArray, shouldn't we be free to drop the outer array once we extract the relevant typed_value?

The only way to work with list arrays that I have come up with so far is to build new arrays with arrow_select::take, combining the path index and the GenericListArray offsets.
But with this method we create new arrays within the scope of the function, so we can't put a reference to the array in ShreddedPathStep::Success.
(I just pushed a commit with a non-working implementation of the idea)

Should we instead look for another way to represent a resulting array consisting of slices?

I just saw #8392.

Comment on lines +135 to +152
// Build the list of indices to take
let mut take_indices = Vec::with_capacity(list_len);
for i in 0..list_len {
    let start = offsets[i] as usize;
    let end = offsets[i + 1] as usize;
    let len = end - start;

    if *index < len {
        take_indices.push(Some((start + index) as u32));
    } else {
        take_indices.push(None);
    }
}

let index_array = UInt32Array::from(take_indices);

// Use Arrow compute kernel to gather elements
let taken = take(field_array, &index_array, None)?;
Contributor Author

You can see the basic idea here
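
For readers without the diff at hand, the index computation in the snippet above can be sketched with the standard library alone. This is only an illustration: `take_indices` and `gather` are hypothetical helpers that stand in for the offsets loop and for a `take`-style gather over plain slices, and they are not part of arrow-rs.

```rust
/// Hypothetical helper (not an arrow-rs API): for each list row described by
/// `offsets`, compute the position in the flattened child values of the
/// element at `index`, or `None` when the row is too short.
fn take_indices(offsets: &[i32], index: usize) -> Vec<Option<u32>> {
    offsets
        .windows(2)
        .map(|w| {
            let (start, end) = (w[0] as usize, w[1] as usize);
            if index < end - start {
                Some((start + index) as u32)
            } else {
                None
            }
        })
        .collect()
}

/// Stand-in for a `take`-style gather over a plain slice:
/// a `None` index produces a `None` (NULL) output slot.
fn gather<T: Clone>(values: &[T], indices: &[Option<u32>]) -> Vec<Option<T>> {
    indices
        .iter()
        .map(|&slot| slot.map(|pos| values[pos as usize].clone()))
        .collect()
}
```

With offsets `[0, 3, 3, 5]` (rows of length 3, 0 and 2), asking for element 1 selects flattened positions 1 and 4 and leaves the empty row as NULL.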

@sdf-jkl
Contributor Author

sdf-jkl commented Sep 25, 2025

Hey @scovich, I made it work for one of the simple tests. The second one fails because variant_to_arrow does not support utf8 yet.

Do we have an issue tracking variant_to_arrow types support? If not, I can make one.

@scovich
Contributor

scovich commented Sep 26, 2025

> I made it work for one of the simple tests. The second one fails because variant_to_arrow does not support utf8 yet.
>
> Do we have an issue tracking variant_to_arrow types support? If not, I can make one.

I'm not sure we have a tracking issue for utf8 support in variant_to_arrow, but I've also noticed that it's an annoying gap for unit testing (we all seem to reach for string values...)

@sdf-jkl sdf-jkl deleted the shredded_list_support branch January 3, 2026 00:14
@sdf-jkl sdf-jkl restored the shredded_list_support branch February 19, 2026 22:16
@sdf-jkl
Contributor Author

sdf-jkl commented Feb 19, 2026

I want to continue working here on #9443.

@klion26 I'll start addressing your old comments first.

@sdf-jkl sdf-jkl reopened this Feb 19, 2026
@sdf-jkl sdf-jkl marked this pull request as draft February 19, 2026 23:04
@sdf-jkl
Contributor Author

sdf-jkl commented Feb 23, 2026

@klion26 @scovich please review when available. thanks!

@scovich
Contributor

scovich commented Feb 23, 2026

Looking for an early pass? Or is this Ready for review now?

@sdf-jkl
Contributor Author

sdf-jkl commented Feb 23, 2026

@scovich an early pass to check the test coverage should do.

Once the test coverage is complete and the code passes, it will be Ready for review

Contributor

@scovich scovich left a comment

Thanks for taking a stab at this. Several comments.

Comment on lines +84 to +87
return Err(ArrowError::CastError(format!(
    "Cannot access index '{}' for row {} with list length {}",
    index, row, len
)));
Contributor

We need to decide what the out of bounds semantics should be. For example, spark just returns NULL.

By way of comparison, spark and arrow-rs both return NULL for non-existent struct fields, which could be argued as analogous. Or maybe it's considered different and we want the error. Or maybe missing struct fields are also handled wrong?

(I'm comfortable with following spark semantics, but would love to hear others' thoughts)

Contributor Author

I support following the spark semantics too

Member

Do we need to add the behavior to the document or somewhere else?

Contributor Author

@sdf-jkl sdf-jkl Feb 25, 2026

Added to docs here 91589ad

Contributor

The latest behavior is to error out on OOB access unless safe casting is enabled.
Spark semantics would just return NULL regardless of that flag.
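
The difference between the two policies can be sketched on a plain slice. This is a toy model under stated assumptions, not the PR's code: `element_at` and `element_at_spark_like` are hypothetical names, and the `safe` parameter merely stands in for the safe-casting flag.

```rust
/// Hypothetical model of the behavior described above: an out-of-bounds
/// index is an error unless `safe` (standing in for safe casting) is set.
fn element_at<T: Clone>(list: &[T], index: usize, safe: bool) -> Result<Option<T>, String> {
    match list.get(index) {
        Some(v) => Ok(Some(v.clone())),
        // Out of bounds with safe casting enabled: NULL.
        None if safe => Ok(None),
        // Out of bounds with safe casting disabled: error.
        None => Err(format!(
            "Cannot access index '{}' with list length {}",
            index,
            list.len()
        )),
    }
}

/// Spark-like alternative: out of bounds yields NULL regardless of any flag.
fn element_at_spark_like<T: Clone>(list: &[T], index: usize) -> Option<T> {
    list.get(index).cloned()
}
```

The first function makes the result depend on a flag that is otherwise about casting; the second never fails on data-dependent misses.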

Contributor

@scovich scovich Feb 27, 2026

Actually tho -- I think spark is just implementing jsonpath semantics:

> A syntactically valid segment MUST NOT produce errors when executing the query. This means that some operations that might be considered erroneous, such as using an index lying outside the range of an array, simply result in fewer nodes being selected.

Here, "syntactically valid" is referring to the previous section (2.1):

> A JSONPath implementation MUST raise an error for any query that is not well-formed and valid. The well-formedness and the validity of JSONPath queries are independent of the JSON value the query is applied to. No further errors relating to the well-formedness and the validity of a JSONPath query can be raised during application of the query to a value. This clearly separates well-formedness/validity errors in the query from mismatches that may actually stem from flaws in the data.

Note: Integer overflow in an index is well-formed but not valid, so it's allowed to produce an error.


// Gather both typed and fallback values at the requested element index.
let taken_value = value_array
    .map(|value| take(value, &index_array, None))
Contributor

aside: TIL about take. Very helpful here.

@sdf-jkl sdf-jkl changed the title [WIP] Support Shredded Lists/Array in variant_get Support Shredded Lists/Array in variant_get Feb 24, 2026
@sdf-jkl sdf-jkl marked this pull request as ready for review February 26, 2026 04:10
@scovich
Contributor

scovich commented Mar 2, 2026

Everything looks good, code-wise -- nice and clean.

But there's still an open question of whether we intend to follow the jsonpath spec in our path step logic, as e.g. spark does?
#8354 (comment)

The jsonpath spec requires foo[100] to return NULL if foo is not an array, and also requires returning NULL if foo has fewer than 101 elements. Similarly, foo.bar should return NULL if foo is not a struct and should also return NULL if foo has no field named bar. Safe casting would only influence actual casting decisions, e.g. a variant_get call that specifically requests a string and the requested path points to a struct.

In contrast, our path step handling code currently returns an error if safe casting is disabled and:

  • a Field path step encounters a "wrong" type (L169)
  • an Index path step encounters a "wrong" type (L224)
  • an Index path step is out of bounds (L99)
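
To make the jsonpath-style alternative concrete, here is a minimal sketch over a toy value model. `Value`, `Step` and `get_lenient` are hypothetical names invented for this illustration, they only loosely mirror the Field and Index path elements, and none of this is the arrow-rs implementation. Every data-dependent mismatch resolves to NULL instead of an error.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a variant value (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    List(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

/// Toy path step, loosely mirroring Field and Index path elements.
enum Step<'a> {
    Field(&'a str),
    Index(usize),
}

/// JSONPath-style lookup: a step that does not match the data selects
/// nothing (NULL here) rather than raising an error.
fn get_lenient(root: &Value, path: &[Step<'_>]) -> Value {
    let mut current = root;
    for step in path {
        let next = match (step, current) {
            // Missing field: `get` returns None.
            (Step::Field(name), Value::Object(fields)) => fields.get(*name),
            // Out-of-bounds index: `get` returns None.
            (Step::Index(i), Value::List(items)) => items.get(*i),
            // "Wrong" type for this step: no match, not an error.
            _ => None,
        };
        match next {
            Some(v) => current = v,
            None => return Value::Null,
        }
    }
    current.clone()
}
```

Under this model, foo[100] on a two-element list, foo[0] on a non-list, and foo.bar on a non-struct all yield NULL, which covers the three cases listed above. Casting errors would be a separate concern handled after the path has been resolved.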

@scovich
Contributor

scovich commented Mar 2, 2026

@alamb -- any opinions about supporting jsonpath semantics or not? Or ideas on who we should seek input from?

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Support VariantPathElement::Index for Variant Arrays for variant_get

4 participants