feat: Support Delta deletion vectors in scan_delta #26867

Draft
kdn36 wants to merge 8 commits into pola-rs:main from kdn36:feat_deletion_vectors

Collaborator

@kdn36 kdn36 commented Mar 9, 2026

Closes #26369

This PR introduces read support for Delta Lake deletion vectors. The feature is unstable and gated behind an environment variable (see the code and tests for details).

AI was used to seed draft code in select areas, primarily for the generation of deletion vector write helpers for CI (Python). This code was updated to meet correctness and code style expectations. I have updated and reviewed all changes myself, and I believe they are relevant and correct.

Pending: feature gating.

@kdn36 kdn36 marked this pull request as draft March 9, 2026 20:19
@github-actions github-actions bot added the enhancement, python, and rust labels Mar 9, 2026
@kdn36 kdn36 force-pushed the feat_deletion_vectors branch from 2573ed3 to 386cff3 March 9, 2026 20:56
@github-actions github-actions bot added the changes-dsl label Mar 9, 2026
codecov bot commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 75.48077% with 51 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.66%. Comparing base (9f1a742) to head (6a23298).
⚠️ Report is 55 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| py-polars/src/polars/io/delta/_dataset.py | 30.55% | 25 Missing ⚠️ |
| ...plan/src/dsl/file_scan/python_delta_dv_provider.rs | 78.57% | 12 Missing ⚠️ |
| ...rates/polars-python/src/delta/dv_provider_funcs.rs | 64.70% | 6 Missing ⚠️ |
| crates/polars-plan/src/dsl/file_scan/deletion.rs | 50.00% | 3 Missing ⚠️ |
| ...rates/polars-python/src/lazyframe/visitor/nodes.rs | 0.00% | 2 Missing ⚠️ |
| .../io_sources/multi_scan/components/row_deletions.rs | 96.36% | 2 Missing ⚠️ |
| crates/polars-python/src/io/scan_options.rs | 90.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #26867      +/-   ##
==========================================
+ Coverage   81.30%   81.66%   +0.36%     
==========================================
  Files        1802     1807       +5     
  Lines      246972   248308    +1336     
  Branches     3086     3137      +51     
==========================================
+ Hits       200810   202793    +1983     
+ Misses      45371    44710     -661     
- Partials      791      805      +14     


self
}

pub fn call(&self) -> PolarsResult<Option<DataFrame>> {
Collaborator


Could we have this function directly return a PlHashMap<usize, BooleanChunked>?

Collaborator

@nameexhaustion nameexhaustion Mar 11, 2026


In terms of implementation, I think we -

  • Build a hashset HashSet<usize> of the selected indices
  • Then build HashMap<usize, BooleanChunked> by iterating over the rows of the DataFrame we get from delta -
    • If the idx of that row is not present in the HashSet<usize>, don't add it to the HashMap
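The steps in the list above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name `build_deletion_masks`, the `(idx, mask)` row shape, and the use of plain lists of booleans in place of `BooleanChunked` are hypothetical stand-ins, not the PR's actual code.

```python
from __future__ import annotations

from typing import Iterable


def build_deletion_masks(
    selected_indices: Iterable[int],
    delta_rows: Iterable[tuple[int, list[bool]]],
) -> dict[int, list[bool]]:
    """Keep only the deletion masks whose file index was selected.

    Hypothetical stand-ins: `delta_rows` mimics the rows of the DataFrame
    returned from delta as (idx, mask) pairs, and list[bool] stands in for
    BooleanChunked.
    """
    # Step 1: hash set of the selected indices for O(1) membership checks.
    selected = set(selected_indices)

    # Step 2: walk the rows and skip any idx that was not selected.
    masks: dict[int, list[bool]] = {}
    for idx, mask in delta_rows:
        if idx not in selected:
            continue
        masks[idx] = mask
    return masks


rows = [(0, [True, False]), (1, [False, False, True]), (2, [True])]
masks = build_deletion_masks([0, 2], rows)
```

The set keeps each membership check constant-time, so the whole pass stays linear in the number of rows returned from delta.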

Collaborator


Return PolarsResult<Series> instead

Then, check if Series.null_count() == Series.len() and filter out if that's the case
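One possible reading of this check, sketched in Python. The assumptions are mine: a plain list with `None` for nulls stands in for the `Series`, and "filter out" is taken to mean "treat an all-null result as no deletions". Both helper names are hypothetical.

```python
from __future__ import annotations

from typing import Optional


def all_null(values: list[Optional[bool]]) -> bool:
    # Mirrors `null_count() == len()`; note this is also true for an empty list.
    null_count = sum(1 for v in values if v is None)
    return null_count == len(values)


def keep_if_any_valid(
    values: list[Optional[bool]],
) -> Optional[list[Optional[bool]]]:
    # Assumed interpretation: an all-null result carries no deletions,
    # so it is dropped (None) instead of being passed downstream.
    return None if all_null(values) else values
```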


impl std::fmt::Display for DeltaDeletionVectorProvider {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.write_str("DeltaOLDDeletionVectorCallbackOLD")
Collaborator

@nameexhaustion nameexhaustion Mar 11, 2026


It's already old?


credential_provider: Option<Py<PyAny>>,
deletion_files: Option<Wrap<DeletionFilesList>>,
deletion_files: Option<Wrap<PyDeletionFilesList>>,
deletion_vector_callback: Option<Py<PyAny>>,
Collaborator

@nameexhaustion nameexhaustion Mar 11, 2026


We also shouldn't introduce a new variant here - we should be able to re-use deletion_files instead - that was designed to be extensible -

_deletion_files=("iceberg-position-delete", self.deletion_files),

You should be able to extend it to support passing ("delta-deletion-vector", <object>) on the existing parameter
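A rough sketch of how a tagged `(kind, payload)` tuple could be dispatched on the receiving side. The two tag strings come from this comment; the function name `parse_deletion_files` and the returned dict shape are hypothetical and only illustrate the extensibility argument, not the actual Polars internals.

```python
from __future__ import annotations

from typing import Any

# First tag is quoted from the existing call site; second is the proposed one.
ICEBERG_TAG = "iceberg-position-delete"
DELTA_DV_TAG = "delta-deletion-vector"


def parse_deletion_files(
    deletion_files: tuple[str, Any] | None,
) -> dict[str, Any] | None:
    """Dispatch on the tag so one parameter can carry several variants."""
    if deletion_files is None:
        return None
    kind, payload = deletion_files
    if kind == ICEBERG_TAG:
        return {"variant": "iceberg", "payload": payload}
    if kind == DELTA_DV_TAG:
        return {"variant": "delta", "payload": payload}
    # Unknown tags fail loudly instead of silently ignoring deletions.
    raise ValueError(f"unsupported deletion_files kind: {kind!r}")
```

Adding a variant then only needs a new tag and branch; the parameter list of the caller stays unchanged.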

try_parse_dates: try_parse_hive_dates,
};

// Unify the two distinct Python variants into one unified Rust type.
Collaborator


# The engine (polars) must explicitly manage this:
if os.getenv("POLARS_DELTA_READER_FEATURE_DV") == "1":
    SUPPORTED_READER_FEATURES.add("deletionVectors")
Collaborator


I'd think we could just enable this by default?

Collaborator Author


Given the high impact of any correctness issue (that may be left undetected), we prefer to gate the functionality until we pass an initial round of field testing. Open to either option.
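For reference, the opt-in gate under discussion behaves roughly like the sketch below. The environment variable name and the `"deletionVectors"` feature string are taken from the diff above; the function name `supported_reader_features` and passing the base set in as an argument are assumptions made to keep the example self-contained.

```python
from __future__ import annotations

import os
from typing import Mapping

ENV_VAR = "POLARS_DELTA_READER_FEATURE_DV"


def supported_reader_features(
    base: set[str],
    env: Mapping[str, str] = os.environ,
) -> set[str]:
    """Return the reader features to advertise, honouring the opt-in gate."""
    features = set(base)  # copy so the caller's set is not mutated
    # Only the exact value "1" enables the feature, matching the diff.
    if env.get(ENV_VAR) == "1":
        features.add("deletionVectors")
    return features
```

With this shape, enabling the feature by default would amount to adding `"deletionVectors"` to the base set and dropping the environment check.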

@nameexhaustion nameexhaustion removed the RC label Mar 11, 2026

Labels

changes-dsl (Do not merge if this label is present and red.), enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars), rust (Related to Rust Polars)

Development

Successfully merging this pull request may close these issues.

Support deletion vectors (delta)

2 participants