Skip to content

Conversation

fvaleye
Copy link
Collaborator

@fvaleye fvaleye commented Oct 14, 2025

Description

  • Add max_temp_directory_size parameter to the Python bindings to optimize DataFusion operations.
  • Add max_spill_size to the Python binding (available in Rust)

Related Issue(s)

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Oct 14, 2025
Copy link

codecov bot commented Oct 14, 2025

Codecov Report

❌ Patch coverage is 29.85075% with 47 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.56%. Comparing base (cb672ac) to head (6909bff).

Files with missing lines Patch % Lines
crates/core/src/delta_datafusion/session.rs 35.29% 22 Missing ⚠️
python/src/lib.rs 0.00% 14 Missing ⚠️
crates/core/src/operations/optimize.rs 47.05% 9 Missing ⚠️
python/src/query.rs 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3847      +/-   ##
==========================================
- Coverage   73.72%   70.56%   -3.16%     
==========================================
  Files         151      151              
  Lines       39250    38645     -605     
  Branches    39250    38645     -605     
==========================================
- Hits        28936    27271    -1665     
- Misses       9005    10123    +1118     
+ Partials     1309     1251      -58     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

preserve_insertion_order: false,
max_concurrent_tasks: num_cpus::get(),
max_spill_size: 20 * 1024 * 1024 * 1024, // 20 GB.
max_temp_directory_size: 100 * 1024 * 1024 * 1024, // 100 GB.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not expose it here if we immediately deprecate it, but rather create a proper configured Session in the python binding when calling optimize and passing that around

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the simple approach.
I can take more time to create a properly configured session, something like this:

fn create_optimize_session(
    max_spill_size: usize,
    max_temp_directory_size: u64,
) -> SessionState {

WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okey dokey!
I will refactor this to remove depreciation and configure a proper session. I will ping you when it's ready.

@fvaleye fvaleye force-pushed the feature/configure-disk-manager-for-datafusion-from-Python branch from c82d4ba to 332179c Compare October 14, 2025 09:15
@abhiaagarwal
Copy link
Contributor

I looked into this similarly and unfortunately, there doesn't seem to be a way to allow passing in a datafusion SessionState from the python side to the rust side, which would give python callers the maximum flexibility :/

@ion-elgreco
Copy link
Collaborator

I looked into this similarly and unfortunately, there doesn't seem to be a way to allow passing in a datafusion SessionState from the python side to the rust side, which would give python callers the maximum flexibility :/

Not a state, but you can consume FFISessionConfig, not sure if it has all the some buttons and knobs as state/context

@abhiaagarwal
Copy link
Contributor

I looked into this similarly and unfortunately, there doesn't seem to be a way to allow passing in a datafusion SessionState from the python side to the rust side, which would give python callers the maximum flexibility :/

Not a state, but you can consume FFISessionConfig, not sure if it has all the some buttons and knobs as state/context

It accomplishes the goal on the per-operation level, but if you wanted to share a SessionContext across multiple operations (say, to limit the total amount of ram for working with multiple tables concurrently) you're out of luck.

I guess the rust side could store a SessionContext as an optional in RawDeltaTable and pass that in every operation to achieve the same goal, though. Definitely not done in this ticket, though :D

@fvaleye fvaleye force-pushed the feature/configure-disk-manager-for-datafusion-from-Python branch 4 times, most recently from b0d47cf to 79dcc10 Compare October 14, 2025 14:01
@fvaleye
Copy link
Collaborator Author

fvaleye commented Oct 14, 2025

I looked into this similarly and unfortunately, there doesn't seem to be a way to allow passing in a datafusion SessionState from the python side to the rust side, which would give python callers the maximum flexibility :/

Not a state, but you can consume FFISessionConfig, not sure if it has all the some buttons and knobs as state/context

Something like:

config = SessionConfig()
config.set_config("datafusion.execution.target_partitions", "8")
config.set_config("datafusion.optimizer.skip_failed_rules", "false")

dt.optimize.z_order(
    columns=["id"],
    session_config=config,  # Advanced DataFusion config
)

We can have a follow-up PR for this, WDYT?

@ion-elgreco
Copy link
Collaborator

I looked into this similarly and unfortunately, there doesn't seem to be a way to allow passing in a datafusion SessionState from the python side to the rust side, which would give python callers the maximum flexibility :/

Not a state, but you can consume FFISessionConfig, not sure if it has all the some buttons and knobs as state/context

Something like:

config = SessionConfig()
config.set_config("datafusion.execution.target_partitions", "8")
config.set_config("datafusion.optimizer.skip_failed_rules", "false")

dt.optimize.z_order(
    columns=["id"],
    session_config=config,  # Advanced DataFusion config
)

We can have a follow-up PR for this, WDYT?

Yeah might be nice for interopability with datafusion-python

@roeap
Copy link
Collaborator

roeap commented Oct 17, 2025

@fvaleye @ion-elgreco @abhiaagarwal - I just opened #3860 to create a more narrow waist when creating datafusion sessions. Going forward we also need to wire our object-store and log store factories with the datafusion session.

Would it be possible to base this PR on the one above to properly configure the session?

@fvaleye
Copy link
Collaborator Author

fvaleye commented Oct 19, 2025

@fvaleye @ion-elgreco @abhiaagarwal - I just opened #3860 to create a more narrow waist when creating datafusion sessions. Going forward we also need to wire our object-store and log store factories with the datafusion session.

Would it be possible to base this PR on the one above to properly configure the session?

Yes, I will do that @roeap 👍

@fvaleye fvaleye force-pushed the feature/configure-disk-manager-for-datafusion-from-Python branch 2 times, most recently from 2fc4046 to 6652cef Compare October 20, 2025 08:14
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

ion-elgreco
ion-elgreco previously approved these changes Oct 20, 2025
@ion-elgreco
Copy link
Collaborator

@fvaleye can you resolve the conflict, then we are good to go :)

@fvaleye fvaleye force-pushed the feature/configure-disk-manager-for-datafusion-from-Python branch from 16e6111 to 6652cef Compare October 20, 2025 08:19
@fvaleye
Copy link
Collaborator Author

fvaleye commented Oct 20, 2025

@fvaleye can you resolve the conflict, then we are good to go :)

I'm on it!

@fvaleye fvaleye force-pushed the feature/configure-disk-manager-for-datafusion-from-Python branch from 4648d1b to 5c83b44 Compare October 20, 2025 08:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

binding/python Issues for the Python package binding/rust Issues for the Rust crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Asked to increase max_temp_directory_size in the disk manager configuration when optimizing large table

4 participants