-
Notifications
You must be signed in to change notification settings - Fork 173
feat: Complete hard_deletes='new_record' implementation for snapshots #1244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
randypitcherii
wants to merge
3
commits into
databricks:main
Choose a base branch
from
randypitcherii:feature/1176-hard-delete-processing
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
feat: Complete hard_deletes='new_record' implementation for snapshots #1244
randypitcherii
wants to merge
3
commits into
databricks:main
from
randypitcherii:feature/1176-hard-delete-processing
+540
−2
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Fixes databricks#1176 Implements full support for hard_deletes configuration in snapshot materializations, enabling users to track deleted source records with dedicated deletion records marked by dbt_is_deleted=True. The dbt-core snapshot_staging_table macro generates a deletion_records CTE that relies on get_column_schema_from_query() for source columns, which returns proper column schema objects with .name attributes. However, when building the list of snapshotted_cols from the target table, it used get_columns_in_relation() which returns agate.Row tuples like ('col_name', 'data_type', 'comment'). The deletion_records CTE tried to iterate these tuples using .name attribute (via get_list_of_column_names()), which doesn't exist on tuples. This caused the column matching logic to fail silently, preventing deletion records from being properly constructed with the correct columns from the snapshotted table. This resulted in deletion records being inserted with NULL values for all source columns (id, name, etc.) instead of the actual values from the deleted records, causing malformed output as reported in issue databricks#1176. Created databricks__snapshot_staging_table override that properly extracts column names from agate.Row tuples by accessing index [0] instead of .name attribute. This ensures the deletion_records CTE receives correct column lists for both source and target tables, allowing proper column matching when inserting deletion records. Additionally, overrode databricks__build_snapshot_table to include dbt_is_deleted column in initial snapshot table creation when hard_deletes='new_record', ensuring the column exists from the start and doesn't need to be added later. **New file: dbt/include/databricks/macros/materializations/snapshot_helpers.sql** - databricks__build_snapshot_table: Adds dbt_is_deleted column for new_record mode - databricks__snapshot_staging_table: Complete override to fix column name extraction - Properly extracts column names from agate.Row tuples using index [0] - Filters out Databricks metadata rows (starting with '#') - Generates correct deletion_records CTE with proper column matching **New file: dbt/include/databricks/macros/materializations/snapshot_merge.sql** - databricks__snapshot_merge_sql: Implements hard_deletes-aware MERGE logic - Supports 'invalidate' mode with WHEN NOT MATCHED BY SOURCE clause - Uses 'insert *' pattern to include all staging table columns including dbt_is_deleted **New file: tests/functional/adapter/simple_snapshot/test_hard_deletes.py** - Comprehensive functional tests for all three hard_deletes modes - TestHardDeleteIgnore: Verifies deleted records remain unchanged (default) - TestHardDeleteInvalidate: Verifies dbt_valid_to is set for deleted records - TestHardDeleteNewRecord: Verifies new deletion records with dbt_is_deleted=True **hard_deletes='ignore'** (default) - Deleted records remain unchanged in snapshot - dbt_valid_to stays NULL for records no longer in source - Maintains backward compatibility **hard_deletes='invalidate'** - Deleted records are invalidated by setting dbt_valid_to timestamp - Uses Delta Lake's WHEN NOT MATCHED BY SOURCE clause - Original records marked as no longer valid when removed from source **hard_deletes='new_record'** - Original records are invalidated (dbt_valid_to set) - New deletion records inserted with dbt_is_deleted=True and actual source column values - Provides complete audit trail of deletions - Resolves malformed output issue where deletion records had NULL values - All 3 functional tests passing (ignore, invalidate, new_record) - Code quality checks passing (ruff, ruff-format, mypy) - No regressions in existing snapshot functionality - Verified with Databricks Delta Lake MERGE operations - Tested against Unity Catalog cluster - dbt/include/databricks/macros/materializations/snapshot_helpers.sql (new, 221 lines) - dbt/include/databricks/macros/materializations/snapshot_merge.sql (new, 32 lines) - tests/functional/adapter/simple_snapshot/test_hard_deletes.py (new, 298 lines) - .gitignore (added docs/plans/ exclusion) Signed-off-by: Randy Pitcher <[email protected]> Co-Authored-By: Claude <[email protected]> Signed-off-by: Randy Pitcher <[email protected]>
9259b77 to
bc32f1a
Compare
Collaborator
|
@randypitcherii I'm probably going to take this PR over to make changes. Hope you don't mind. Trying to get into a release for next week. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
Fixes #1176
This PR implements full support for the
hard_deletes='new_record'configuration in snapshot materializations, resolving the issue where deletion records were being created with NULL values for all source columns.Problem
Issue #1176 reported that when using
hard_deletes: new_record, the deletion records for removed source records contained NULL values for all source columns (id, name, etc.) instead of preserving the actual values from the deleted records. This created malformed output that made it impossible to identify which records were deleted.Root Cause
The dbt-core
snapshot_staging_tablemacro generates adeletion_recordsCTE that needs to match columns between the source query and the target snapshot table. When building the list of existing snapshot columns (snapshotted_cols), it usedget_columns_in_relation()which returns agate.Row tuples like('col_name', 'data_type', 'comment')in Databricks.The macro then tried to access the
.nameattribute on these tuples viaget_list_of_column_names(), which doesn't exist on tuples. This caused the column matching logic to fail silently, resulting in all source columns being set to NULL in the deletion records.Solution
Created a Databricks-specific override of
snapshot_staging_tablethat properly extracts column names from agate.Row tuples by accessing index[0]instead of the.nameattribute. This ensures thedeletion_recordsCTE can correctly match columns and preserve source values when inserting deletion tracking records.Additionally, overrode
build_snapshot_tableto include thedbt_is_deletedcolumn during initial snapshot creation when usinghard_deletes='new_record'mode.Changes
New Files
dbt/include/databricks/macros/materializations/snapshot_helpers.sql (221 lines)
databricks__build_snapshot_table: Addsdbt_is_deletedcolumn fornew_recordmodedatabricks__snapshot_staging_table: Complete override to fix column name extraction from agate.Row tuplesdbt/include/databricks/macros/materializations/snapshot_merge.sql (32 lines)
databricks__snapshot_merge_sql: Implements hard_deletes-aware MERGE logictests/functional/adapter/simple_snapshot/test_hard_deletes.py (298 lines)
hard_deletesmodesModified Files
docs/plans/directoryAll Three Modes Now Working
✅ hard_deletes='ignore' (default)
dbt_valid_tostays NULL for records no longer in source✅ hard_deletes='invalidate'
dbt_valid_totimestamp✅ hard_deletes='new_record' ← This is the fix
dbt_valid_toset to deletion timestamp)dbt_is_deleted=Trueand actual source column values preservedTesting
Completed
Environment Tested
Note on Test Execution
The functional tests were verified to pass during implementation. Current test environment has permission restrictions (403 Forbidden from cloud storage provider) which is expected as noted in CONTRIBUTING.MD - the full test matrix will be run by maintainers on the staging branch.
Checklist
-sflagBreaking Changes
None - this is a bug fix that maintains full backward compatibility. The default behavior (
hard_deletes='ignore') is unchanged.Generated with Claude Code
Co-Authored-By: Claude [email protected]