feat: Support Delta deletion vectors in scan_delta#26867
feat: Support Delta deletion vectors in scan_delta#26867kdn36 wants to merge 8 commits intopola-rs:mainfrom
scan_delta#26867Conversation
2573ed3 to
386cff3
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #26867 +/- ##
==========================================
+ Coverage 81.30% 81.66% +0.36%
==========================================
Files 1802 1807 +5
Lines 246972 248308 +1336
Branches 3086 3137 +51
==========================================
+ Hits 200810 202793 +1983
+ Misses 45371 44710 -661
- Partials 791 805 +14 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| self | ||
| } | ||
|
|
||
| pub fn call(&self) -> PolarsResult<Option<DataFrame>> { |
There was a problem hiding this comment.
Could we have this function directly return a PlHashMap<usize, BooleanChunked>?
There was a problem hiding this comment.
In terms of implementation, I think we -
Build a hashsetHashSet<usize>of the selected indicesThen buildHashMap<usize, BooleanChunekd>by iterating over the rows of the DataFrame we get from delta -If theidxof that row is not present in theHashSet<usize>, don't add it to the HashMap
There was a problem hiding this comment.
Return PolarsResult<Series> instead
Then, check if Series.null_count() == Series.len() and filter out if that's the case
|
|
||
| impl std::fmt::Display for DeltaDeletionVectorProvider { | ||
| fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { | ||
| f.write_str("DeltaOLDDeletionVectorCallbackOLD") |
| credential_provider: Option<Py<PyAny>>, | ||
| deletion_files: Option<Wrap<DeletionFilesList>>, | ||
| deletion_files: Option<Wrap<PyDeletionFilesList>>, | ||
| deletion_vector_callback: Option<Py<PyAny>>, |
There was a problem hiding this comment.
We also shouldn't to introduce a new variant here - we should be able to re-use deletion_files instead - that was designed to be extensible -
You should be able to extend it to support passing ("delta-deletion-vector", <object>) on the existing parameter
| try_parse_dates: try_parse_hive_dates, | ||
| }; | ||
|
|
||
| // Unify the two distinct Python variants into one unified Rust type. |
There was a problem hiding this comment.
|
|
||
| # The engine (polars) must explicitly manage this: | ||
| if os.getenv("POLARS_DELTA_READER_FEATURE_DV") == "1": | ||
| SUPPORTED_READER_FEATURES.add("deletionVectors") |
There was a problem hiding this comment.
I'd think we could just enable this by default?
There was a problem hiding this comment.
Given the high impact of any correctness issue (that may be left undetected), we prefer to gate the functionality until we pass an initial round of field testing. Open to either option.

Closes #26369
This PR introduces read support for Deltalake Deletion Vectors. The feature is unstable and gated behind an environment variable (see code and test for details).
AI was used to seed draft code in select areas, primarily for the generation of deletion vector write helpers for CI (python). This code was updated to meet correctness and code style expectations. I have updated and reviewed all changes myself, and I believe they are relevant and correct.
Pending: feature gating.