Commit 0c82ade
authored
Fix struct casts to align fields by name (prevent positional mis-casts) (apache#19674)
## Which issue does this PR close?
* Closes apache#17285.
## Rationale for this change
DataFusion’s struct casting and some coercion paths were effectively
positional: when two structs had the same field types but different
field *orders*, casting could silently swap values. This is surprising
to users and can lead to silent data corruption (e.g. `{b: 3, a:
4}::STRUCT(a INT, b INT)` yielding `{a: 3, b: 4}`).
The goal of this PR is to make struct casting behavior match user
expectations by matching fields by **name** (case-sensitive) and
recursively applying the same logic to nested structs, while keeping a
compatible fallback for structs with **no** shared field names.
## What changes are included in this PR?
* **Name-based struct casting implementation** in
`datafusion_common::nested_struct`:
* Match struct fields by **name**, reorder to match target schema,
recursively cast nested structs.
* Fill **missing target fields** with null arrays.
* Ignore **extra source fields**.
* **Positional mapping fallback** when there is *no name overlap*
**and** field counts match (avoids breaking `struct(1, 'x')::STRUCT(a
INT, b VARCHAR)` style casts).
* Improved handling for **NULL / all-null struct inputs** by producing a
correctly typed null struct array.
* Centralized validation via `validate_field_compatibility` and helper
`fields_have_name_overlap`.
* **Ensure struct casting paths use the name-based logic**:
* `ScalarValue::cast_to_with_options`: route `Struct` casts through
`nested_struct::cast_column`.
* `ColumnarValue::cast_to`: for `Struct` targets, cast via
`nested_struct::cast_column`; non-struct casts still use Arrow’s
standard casting.
* **Type coercion improvements for structs in binary operators / CASE**:
* When two structs have at least one overlapping name, coerce **by
name**.
* Otherwise, preserve prior behavior by coercing **positionally**.
* **Planning-time cast validation for struct-to-struct**:
* `physical-expr` CAST planning now validates struct compatibility using
the same rules as runtime (`validate_struct_compatibility`) to fail
fast.
* `ExprSchemable` allows struct-to-struct casts to pass type checking;
detailed compatibility is enforced by the runtime / planning-time
validator.
* **Optimizer safety**:
* Avoid const-folding struct casts when field counts differ.
* Avoid const-folding casts of **0-row** struct literals due to
evaluation batch dimension mismatches.
* **Tests and SQL logic tests**:
* New unit tests covering:
* name-based reordering
* missing fields (nullable vs non-nullable)
* null struct fields and nested nulls
* positional fallback with no overlap
* coercion behavior and simplifier behavior
* Updated/added `.slt` cases to reflect the new semantics and to add
coverage for struct casts and nested struct reordering.
* **Minor docs/maintenance**:
* Adjusted doc comment referencing `ParquetWriterOptions` so it doesn’t
break when the `parquet` feature is disabled.
## Are these changes tested?
Yes.
* Added/updated Rust unit tests in:
* `datafusion/common/src/nested_struct.rs`
* `datafusion/expr-common/src/columnar_value.rs`
* `datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs`
* Added/updated SQL logic tests in:
* `datafusion/sqllogictest/test_files/case.slt`
* `datafusion/sqllogictest/test_files/struct.slt`
These tests cover:
* correct value mapping when struct field order differs
* nested struct reordering
* insertion of nulls for missing nullable fields
* erroring on missing non-nullable target fields
* positional mapping fallback when there is no name overlap
* planning-time validation vs runtime behavior alignment
## Are there any user-facing changes?
Yes.
* **Struct casts are now name-based** (case-sensitive): fields are
matched by name, reordered to the target schema, missing fields are
null-filled (if nullable), and extra fields are ignored.
* **Fallback behavior**: if there is *no* name overlap and field counts
match, casting proceeds **positionally**.
* **Potential behavior change** in queries relying on the prior
positional behavior when structs shared names but were out of order
(previously could yield swapped values). This PR changes that to the
safer, expected behavior.
No public API changes are introduced, but this is a semantic change in
struct casting.
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated
content has been manually reviewed and tested.1 parent c2f3d65 commit 0c82ade
File tree
12 files changed
+2075
-115
lines changed- datafusion
- common/src
- scalar
- expr-common/src
- type_coercion
- expr/src
- optimizer/src/simplify_expressions
- physical-expr/src/expressions
- sqllogictest/test_files
- docs/source/user-guide/sql
12 files changed
+2075
-115
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2248 | 2248 | | |
2249 | 2249 | | |
2250 | 2250 | | |
2251 | | - | |
| 2251 | + | |
2252 | 2252 | | |
2253 | 2253 | | |
2254 | 2254 | | |
| |||
0 commit comments