fix(table): correct deduplication logic for data files in MaintenanceTable
The deduplicate_data_files() method was not properly removing duplicate
data file references from Iceberg tables. After deduplication, multiple
references to the same data file remained instead of the expected single
reference.
Root causes:
1. _get_all_datafiles() scanned all snapshots instead of only the current one
2. Incorrect transaction API usage that did not go through snapshot updates
3. No overwrite step to produce a clean, deduplicated snapshot
Key fixes:
- Modified _get_all_datafiles() to scan only current snapshot manifests
- Implemented proper transaction pattern using update_snapshot().overwrite()
- Added explicit delete_data_file() calls for duplicates + append_data_file() for unique files
- Removed unused helper methods _get_all_datafiles_with_context() and _detect_duplicates()
Technical details:
- Deduplication now operates on ManifestEntry objects from the current snapshot only
- Files are grouped by basename, and the first occurrence is kept as the canonical reference
- A new snapshot is created that atomically replaces the current snapshot with the deduplicated file list
- Proper Iceberg transaction semantics ensure data consistency
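Illustration of the grouping rule described above (a minimal sketch, not the code in this commit): it works on plain path strings rather than ManifestEntry objects, and the helper name split_duplicates is hypothetical. Duplicates are the paths that would be passed to delete_data_file(); kept paths are the ones that would be re-appended.

```python
import posixpath


def split_duplicates(file_paths):
    """Group paths by basename and keep the first occurrence of each.

    Returns (kept, duplicates), both in input order.
    Hypothetical helper for illustration only.
    """
    kept = {}          # basename -> first path seen with that basename
    duplicates = []    # later paths that share an already-seen basename
    for path in file_paths:
        name = posixpath.basename(path)
        if name in kept:
            duplicates.append(path)
        else:
            kept[name] = path
    # dicts preserve insertion order, so kept paths stay in first-seen order
    return list(kept.values()), duplicates


if __name__ == "__main__":
    paths = [
        "s3://bucket/data/a.parquet",
        "s3://bucket/data/b.parquet",
        "s3://bucket/data/a.parquet",   # exact repeat of the first entry
        "s3://bucket/other/b.parquet",  # same basename, different directory
    ]
    kept, duplicates = split_duplicates(paths)
    print("kept:", kept)
    print("duplicates:", duplicates)
```

Note that grouping by basename treats files with the same name in different directories as duplicates, as in the last entry of the example.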
Tests: All deduplication tests now pass, including the previously failing
test_deduplicate_data_files_removes_duplicates_in_current_snapshot
Fixes: Table maintenance deduplication functionality