geruh
Contributor

@geruh geruh commented Jul 29, 2025

Closes #1210

Summary

This work was primarily done by @rutb327 while I provided guidance!

This PR adds equality delete read support to PyIceberg by implementing the delete file indexing system that matches delete files to data files, mimicking the behavior found in Iceberg Core. With this implementation we are able to index files and now read equality deletes during table scans.

Design details

Delete File Index

The new DeleteFileIndex class centralizes handling of all delete file types: positional deletes, equality deletes, and deletion vectors. It organizes deletes by type (equality vs. positional), partition (using PartitionMap for spec-aware grouping), and path (for path-specific positional deletes). This enables efficient lookup during table scans, reducing unnecessary delete file processing.
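To make the grouping concrete, here is a minimal sketch of an index keyed by delete type, partition, and path. It is not the PR's implementation: `DeleteFile`, `SimpleDeleteIndex`, and all of their fields are hypothetical, simplified stand-ins, a plain `(spec_id, partition)` tuple stands in for `PartitionMap`, and deletion vectors and sequence-number filtering are left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional, Tuple


# Hypothetical, simplified stand-in; not PyIceberg's actual data file class.
@dataclass(frozen=True)
class DeleteFile:
    path: str
    kind: str  # "equality" or "position"
    spec_id: int
    partition: Tuple  # partition values as a tuple
    referenced_data_file: Optional[str] = None  # set for path-specific positional deletes


class SimpleDeleteIndex:
    """Groups delete files by type, then by partition or by referenced data file path."""

    def __init__(self) -> None:
        self._eq_by_partition = defaultdict(list)
        self._pos_by_partition = defaultdict(list)
        self._pos_by_path = defaultdict(list)

    def add(self, delete: DeleteFile) -> None:
        # Keying on (spec_id, partition) keeps partitions from different specs apart.
        key = (delete.spec_id, delete.partition)
        if delete.kind == "equality":
            self._eq_by_partition[key].append(delete)
        elif delete.referenced_data_file is not None:
            self._pos_by_path[delete.referenced_data_file].append(delete)
        else:
            self._pos_by_partition[key].append(delete)

    def for_data_file(self, path: str, spec_id: int, partition: Tuple) -> List[DeleteFile]:
        # Only deletes in the same partition, or targeting this exact path, are returned.
        key = (spec_id, partition)
        return (
            self._eq_by_partition.get(key, [])
            + self._pos_by_partition.get(key, [])
            + self._pos_by_path.get(path, [])
        )
```

A scan would then call `for_data_file` once per data file and open only the delete files it returns, rather than every delete file in the snapshot.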

Equality Delete support

Equality delete files are loaded as PyArrow tables together with the equality IDs that identify which schema fields they match on. Tables that share the same set of equality IDs are grouped together to reduce the number of anti-join operations.
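The grouping idea can be illustrated with a small sketch. This is plain Python standing in for the PyArrow-based code in the PR: rows are dicts, column names stand in for equality field IDs, and the helper names `group_by_equality_fields` and `apply_equality_deletes` are hypothetical. The names for ids 2 and 5 in the sample data are made up, since the output in the Testing section only shows the surviving rows.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

Row = Dict[str, object]


def group_by_equality_fields(
    delete_sets: Iterable[Tuple[Iterable[str], List[Row]]],
) -> Dict[FrozenSet[str], List[Row]]:
    """Merge delete rows that share the same set of equality fields."""
    grouped = defaultdict(list)
    for fields, rows in delete_sets:
        grouped[frozenset(fields)].extend(rows)
    return dict(grouped)


def apply_equality_deletes(data: List[Row], grouped: Dict[FrozenSet[str], List[Row]]) -> List[Row]:
    """Run one anti-join pass per distinct field set instead of one per delete file."""
    result = list(data)
    for fields, delete_rows in grouped.items():
        cols = sorted(fields)
        deleted_keys = {tuple(row[c] for c in cols) for row in delete_rows}
        # Anti-join: keep only the rows whose key has no match among the delete rows.
        result = [row for row in result if tuple(row[c] for c in cols) not in deleted_keys]
    return result


data = [
    {"id": 1, "data": "Alice"},
    {"id": 2, "data": "Bob"},  # assumed name, not shown in the PR output
    {"id": 3, "data": "Charlie"},
    {"id": 4, "data": "David"},
    {"id": 5, "data": "Eve"},  # assumed name, not shown in the PR output
    {"id": 6, "data": "Frank"},
]
# Two separate delete files, both keyed on "id", collapse into a single group.
grouped = group_by_equality_fields([(["id"], [{"id": 2}]), (["id"], [{"id": 5}])])
remaining = apply_equality_deletes(data, grouped)
```

With both delete files keyed on the same field set, only one filtering pass runs over the data, and `remaining` holds the rows with ids 1, 3, 4, and 6.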

Testing

Added tests from the Iceberg Core DeleteFileIndex test suite, plus some tests with dummy files. Also did some manual testing with a Flink setup:

table_eq with only equality deletes on id=2, id=5
+---+-------+
| id|   data|
+---+-------+
|  1|  Alice|
|  3|Charlie|
|  4|  David|
|  6|  Frank|
+---+-------+

table_eq_pos with equality deletes and positional delete at position 3
+---+-----+
| id| data|
+---+-----+
|  1|Alice|
|  4|David|
|  6|Frank|
+---+-----+

Are there any user-facing changes?

Yes. PyIceberg can now read tables with equality deletes.

@gabeiglio
Contributor

I noticed that this PR addresses the same issue/feature as the one I was working on in here. However, your implementation is more complete (by supporting reading equality deletes and deletion vectors), so I think it makes sense to move forward with this one instead. (cc: @sungwy, since you reviewed my PR)

@kevinjqliu
Contributor

kevinjqliu commented Jul 31, 2025

Oops, sorry @gabeiglio. I was searching for positional deletes in GitHub search and didn't see that you were already working on it in that PR. It looks like some parts of that PR are still very useful to get merged, like the validation.

@gabeiglio
Contributor

Yeah, exactly. I should have been clearer in my message: my implementation of DeleteFileIndex was scope creep to achieve the validation, so that PR can now cover only the validation instead of partition maps, the delete file index, etc. :) @kevinjqliu

Collaborator

@sungwy sungwy left a comment

Hi @geruh - thanks for working on this PR, and sorry for the delayed review.

I've added some review feedback. Let me know your thoughts!

@rutb327
Contributor

rutb327 commented Aug 14, 2025

@sungwy Thanks a lot! I have made the suggested changes. Could you take another look?

Collaborator

@sungwy sungwy left a comment

Hi @rutb327, thank you for continuing to work on the PR!

I've added a few more suggestions after spending more time reading your implementation and the test suite. I hope you find them helpful!

Collaborator

This is a very comprehensive test suite! I think we could also benefit from introducing some integration tests, because if we get this wrong, we have the potential to introduce data integrity issues to our users.

There are some great examples of integration tests in tests/integration/test_writes/test_writes.py that invoke a set of actions in either Spark or PyIceberg and then read the result in both to assert that it is the same through either.

In our case, I think we could point a Spark session and PyIceberg at the same catalog and:

  • create a positional delete through PySpark
  • read the result in PyIceberg
  • read the result in Spark
  • assert that the two results are the same

I think it would be good to cover a range of cases as we did in this unit test suite. WDYT?
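As a rough sketch of the last step in the list above, the comparison could be an order-insensitive check over rows. The helper name `assert_same_rows` is hypothetical, and the Spark and PyIceberg reads themselves are not shown; the sketch assumes both results have already been converted to lists of dicts with hashable values (for example via `pyarrow.Table.to_pylist()` on the PyIceberg side and `Row.asDict()` on the Spark side).

```python
from collections import Counter
from typing import Dict, List

Row = Dict[str, object]


def assert_same_rows(left: List[Row], right: List[Row]) -> None:
    """Assert that two result sets hold the same rows, ignoring row order."""

    def as_multiset(rows: List[Row]) -> Counter:
        # A Counter keeps duplicate rows, so a row missing from one side is detected.
        return Counter(tuple(sorted(row.items())) for row in rows)

    left_rows, right_rows = as_multiset(left), as_multiset(right)
    assert left_rows == right_rows, (
        f"only in left: {dict(left_rows - right_rows)}, "
        f"only in right: {dict(right_rows - left_rows)}"
    )


# Stand-ins for the rows read through Spark and through PyIceberg.
spark_rows = [{"id": 1, "data": "Alice"}, {"id": 4, "data": "David"}, {"id": 6, "data": "Frank"}]
pyiceberg_rows = [{"id": 6, "data": "Frank"}, {"id": 1, "data": "Alice"}, {"id": 4, "data": "David"}]
assert_same_rows(spark_rows, pyiceberg_rows)
```

Comparing multisets rather than sorted lists avoids depending on scan order, which can differ between engines.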

Contributor

I agree we need to catch potential data integrity issues. I'll look into adding tests that cover these different cases.

Contributor

I have added some tests. Let me know if we should add more cases.

Collaborator

The tests look good to me @rutb327 - could we resolve the conflict on tests/conftest.py?

@Fokko Fokko self-requested a review August 26, 2025 16:41

Successfully merging this pull request may close these issues.

[feature request] Support reading equality delete files