fix: generated column expr with SchemaMode::Merge handles missing columns#4223
Conversation
Test Results |
68ce8fc to
d742a83
Compare
ethan-tyler
left a comment
There was a problem hiding this comment.
Thanks for the fix here @veeceey. The direction looks right. Requesting changes for the Err(_) catch all in with_generated_columns swallows all parse/resolve failures.
The e2e test is a nice to have.
| } | ||
| e | ||
| } | ||
| Err(_) => { |
There was a problem hiding this comment.
This is too broad. We should only catch errors during the schema merge and everything else should surface.
| e | ||
| } | ||
| Err(_) => { | ||
| debug!( |
There was a problem hiding this comment.
Bind the error and include it in the log
| /// does not fail, but instead produces a NULL placeholder. | ||
| /// This is the core fix for #4169. | ||
| #[test] | ||
| fn test_generated_column_referencing_missing_column_uses_null_placeholder() { |
There was a problem hiding this comment.
Nice test. It would be nice to have an e2e test through WriteBuilder &
SchemaMode::Merge that asserts actual write time values and not just
schema shape.
|
Thanks for the changes, lgtm. Just fix the DCO sign off and we should be good to go. Thanks again @veeceey Looks like you need to fmt as well |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4223 +/- ##
=======================================
Coverage 76.81% 76.82%
=======================================
Files 167 167
Lines 48849 48874 +25
Branches 48849 48874 +25
=======================================
+ Hits 37524 37546 +22
- Misses 9421 9425 +4
+ Partials 1904 1903 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…aMode::Merge When using SchemaMode::Merge, the input batch may omit nullable columns that a generated column expression references. Previously, with_generated_columns() called parse_predicate_expression() against the pre-evolution plan schema, which failed because the column didn't exist yet -- schema evolution hadn't run to add it as NULL. Now when expression resolution fails, we fall back to a typed NULL placeholder. Schema evolution will later add the missing base columns, and DataValidationExec will see NULL IS NOT DISTINCT FROM NULL = true, which correctly passes validation. Closes delta-io#4169 Signed-off-by: Varun Chawla <varun_6april@hotmail.com>
d568488 to
70b49bd
Compare
… resolution errors only Address review feedback: - Replace broad Err(_) catch-all with a guard that only matches column resolution errors (No field named / Schema error), letting parse and type errors propagate normally. - Bind the error and include it in the debug log message. - Add e2e test through WriteBuilder & SchemaMode::Merge that writes with and without the referenced column, then asserts actual values.
70b49bd to
c82fb8a
Compare
Fixes #4169
When appending with
SchemaMode::Merge, generated column expressions can reference nullable columns that aren't in the input batch. The problem was thatwith_generated_columns()resolves expressions against the plan schema before schema evolution runs -- so any column that schema evolution would add as NULL doesn't exist yet, andparse_predicate_expressionfails.The fix catches expression resolution errors in
with_generated_columns()and falls back to a typed NULL placeholder. This lets the pipeline continue through schema evolution, which adds the missing columns. TheDataValidationExecthen seesNULL IS NOT DISTINCT FROM NULL = true, correctly passing validation.This mirrors the approach already used in
add_missing_generated_columns()(the merge path), which also inserts NULL placeholders for missing generated columns.Test plan
test_generated_column_referencing_missing_column_uses_null_placeholdertest that reproduces the exact failure described in the issuecargo test -p deltalake-core --lib --features datafusion -- "generated_columns::tests"passes cleanly