fix: global eq delete matching should apply to only strictly older files, and fix partition scoped matching to consider spec id#1758
Conversation
…n seq number are applied
| self.global_deletes | ||
| .iter() | ||
| // filter that returns true if the provided delete file's sequence number is **greater than or equal to** `seq_num` | ||
| // filter that returns true if the provided delete file's sequence number is **greater than** `seq_num` |
There was a problem hiding this comment.
I could just be missing something in the implementation, but right now seems like we're over applying in a way that would technically be incorrect. If a snapshot has some added data files and an equality delete, the equality deletes should only be applied to older seq numbers. We do this correctly for partition scoped eq deletes below, but for some reason the global equality delete matching is different; the seq number criteria for eq deletes applies regardless of scope of the delete.
There was a problem hiding this comment.
I noticed this when I was working on some optimizations for equality delete matching (eliminating eq deletes to apply based on bounds)
There was a problem hiding this comment.
Thanks for catching this. I think "greater than" is correct here. The global_deletes variable is used for gathering all the eq delete files for an unpartitioned table, but the same logic should still apply.
Looking at past commits, i did find this was previously discussed here
Also from https://iceberg.apache.org/spec/#scan-planning

This implies that we should not apply eq delete files that have equivalent sequence number
There was a problem hiding this comment.
Yeah, thanks for finding that github thread, looks like the confusion was what global_deletes meant, but really it's global equality deletes. Regardless any equality delete must be applied for files strictly older. I also fixed some of the partition scoped delete matching logic because it's technically possible to have tuples match but the spec IDs don't match; currently we only go off the tuple which isn't quite correct for matching as well.
kevinjqliu
left a comment
There was a problem hiding this comment.
LGTM!
global_deletes (eq delete for unpartitioned table) should only apply if sequence number is strictly greater than the data files' sequence number
…alues so that it also considers spec IDs
eca75bf to
3df3b6c
Compare
kevinjqliu
left a comment
There was a problem hiding this comment.
LGTM!
Thanks for adding the tests
| seq_num | ||
| .map(|seq_num| delete.manifest_entry.sequence_number() >= Some(seq_num)) | ||
| .unwrap_or_else(|| true) | ||
| && data_file.partition_spec_id == delete.partition_spec_id |
There was a problem hiding this comment.
Exactly. We could also define a key which is tupled off the <spec ID, partition value> pair in the map (this is what the java implementation does).
There was a problem hiding this comment.
lets see if we can refactor this in a follow up PR and keep the changes here straightforward :)
There was a problem hiding this comment.
Pull Request Overview
This PR fixes two bugs in delete file index matching logic during scan planning:
- Global equality deletes now only match against strictly older data files (changed from
>=to>comparison) - Partition-scoped deletes now correctly validate that both the partition value and spec ID match
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| seq_num | ||
| .map(|seq_num| delete.manifest_entry.sequence_number() >= Some(seq_num)) | ||
| .unwrap_or_else(|| true) | ||
| && data_file.partition_spec_id == delete.partition_spec_id |
There was a problem hiding this comment.
lets see if we can refactor this in a follow up PR and keep the changes here straightforward :)

Which issue does this PR close?
What changes are included in this PR?
This changes scan planning of equality deletes so that:
we only match against eq deletes which are strictly older than the data file. Right now it looks like we incorrectly over-apply based on the seq number for global equality deletes.
Partition scoped deletes (both equality and position) are compared correctly by also factoring in the spec ID. It's not quite enough to compare just off the tuple, we should also compare based off the spec ID as well.
Are these changes tested?
Added unit tests which are scoped to testing delete index matching logic.
If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
-->