Skip to content

Fix: use new snapshot id in deleted manifest entry unless is existing entry #2266

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

lliangyu-lin
Copy link

@lliangyu-lin lliangyu-lin commented Jul 31, 2025

Rationale for this change

Based on iceberg spec, when a manifest entry is marked as deleted, the snapshot id for when the entry was deleted should be used.
https://iceberg.apache.org/spec/?h=deletes#manifest-entry-fields

When a file is replaced or deleted from the dataset, its manifest entry fields store the snapshot ID in which the file was deleted and status 2 (deleted). The file may be deleted from the file system when the snapshot in which it was deleted is garbage collected, assuming that older snapshots have also been garbage collected [1].

https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L178-L179

Incorrect snapshot id could lead to data being deleted during garbage collection when not supposed to.

Are these changes tested?

Added in test_deletes.py

Are there any user-facing changes?

No


def assert_manifest_entry(expected_status: ManifestEntryStatus, expected_snapshot_id: int) -> None:
current_snapshot = table.refresh().current_snapshot()
manifest_files = current_snapshot.manifests(table.io)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run a make lint here the snapshot calls are a union with null and require an assertion or check if they exist before calling them

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking! I see the CI failed with lint as well, I'll fix it in the next commit

pass


def test_manifest_entry_after_deletes(catalog: Catalog) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably add this test to the existing test_deletes.py suite

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lliangyu-lin for raising this PR, it looks indeed that we have a bug here 👍

Co-authored-by: Fokko Driesprong <[email protected]>
@kevinjqliu kevinjqliu added this to the PyIceberg 0.10.0 milestone Aug 11, 2025
@lliangyu-lin
Copy link
Author

@kevinjqliu Hi Kevin, could you help re-enable the CI for this PR again?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants