Skip to content

Commit 0c82ade

Browse files
authored
Fix struct casts to align fields by name (prevent positional mis-casts) (apache#19674)
## Which issue does this PR close? * Closes apache#17285. ## Rationale for this change DataFusion’s struct casting and some coercion paths were effectively positional: when two structs had the same field types but different field *orders*, casting could silently swap values. This is surprising to users and can lead to silent data corruption (e.g. `{b: 3, a: 4}::STRUCT(a INT, b INT)` yielding `{a: 3, b: 4}`). The goal of this PR is to make struct casting behavior match user expectations by matching fields by **name** (case-sensitive) and recursively applying the same logic to nested structs, while keeping a compatible fallback for structs with **no** shared field names. ## What changes are included in this PR? * **Name-based struct casting implementation** in `datafusion_common::nested_struct`: * Match struct fields by **name**, reorder to match target schema, recursively cast nested structs. * Fill **missing target fields** with null arrays. * Ignore **extra source fields**. * **Positional mapping fallback** when there is *no name overlap* **and** field counts match (avoids breaking `struct(1, 'x')::STRUCT(a INT, b VARCHAR)` style casts). * Improved handling for **NULL / all-null struct inputs** by producing a correctly typed null struct array. * Centralized validation via `validate_field_compatibility` and helper `fields_have_name_overlap`. * **Ensure struct casting paths use the name-based logic**: * `ScalarValue::cast_to_with_options`: route `Struct` casts through `nested_struct::cast_column`. * `ColumnarValue::cast_to`: for `Struct` targets, cast via `nested_struct::cast_column`; non-struct casts still use Arrow’s standard casting. * **Type coercion improvements for structs in binary operators / CASE**: * When two structs have at least one overlapping name, coerce **by name**. * Otherwise, preserve prior behavior by coercing **positionally**. * **Planning-time cast validation for struct-to-struct**: * `physical-expr` CAST planning now validates struct compatibility using the same rules as runtime (`validate_struct_compatibility`) to fail fast. * `ExprSchemable` allows struct-to-struct casts to pass type checking; detailed compatibility is enforced by the runtime / planning-time validator. * **Optimizer safety**: * Avoid const-folding struct casts when field counts differ. * Avoid const-folding casts of **0-row** struct literals due to evaluation batch dimension mismatches. * **Tests and SQL logic tests**: * New unit tests covering: * name-based reordering * missing fields (nullable vs non-nullable) * null struct fields and nested nulls * positional fallback with no overlap * coercion behavior and simplifier behavior * Updated/added `.slt` cases to reflect the new semantics and to add coverage for struct casts and nested struct reordering. * **Minor docs/maintenance**: * Adjusted doc comment referencing `ParquetWriterOptions` so it doesn’t break when the `parquet` feature is disabled. ## Are these changes tested? Yes. * Added/updated Rust unit tests in: * `datafusion/common/src/nested_struct.rs` * `datafusion/expr-common/src/columnar_value.rs` * `datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs` * Added/updated SQL logic tests in: * `datafusion/sqllogictest/test_files/case.slt` * `datafusion/sqllogictest/test_files/struct.slt` These tests cover: * correct value mapping when struct field order differs * nested struct reordering * insertion of nulls for missing nullable fields * erroring on missing non-nullable target fields * positional mapping fallback when there is no name overlap * planning-time validation vs runtime behavior alignment ## Are there any user-facing changes? Yes. * **Struct casts are now name-based** (case-sensitive): fields are matched by name, reordered to the target schema, missing fields are null-filled (if nullable), and extra fields are ignored. * **Fallback behavior**: if there is *no* name overlap and field counts match, casting proceeds **positionally**. * **Potential behavior change** in queries relying on the prior positional behavior when structs shared names but were out of order (previously could yield swapped values). This PR changes that to the safer, expected behavior. No public API changes are introduced, but this is a semantic change in struct casting. ## LLM-generated code disclosure This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.
1 parent c2f3d65 commit 0c82ade

File tree

12 files changed

+2075
-115
lines changed

12 files changed

+2075
-115
lines changed

datafusion/common/src/config.rs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2248,7 +2248,7 @@ impl TableOptions {
22482248
/// Options that control how Parquet files are read, including global options
22492249
/// that apply to all columns and optional column-specific overrides
22502250
///
2251-
/// Closely tied to [`ParquetWriterOptions`](crate::file_options::parquet_writer::ParquetWriterOptions).
2251+
/// Closely tied to `ParquetWriterOptions` (see `crate::file_options::parquet_writer::ParquetWriterOptions` when the "parquet" feature is enabled).
22522252
/// Properties not included in [`TableParquetOptions`] may not be configurable at the external API
22532253
/// (e.g. sorting_columns).
22542254
#[derive(Clone, Default, Debug, PartialEq)]

0 commit comments

Comments
 (0)