@gabotechs
Collaborator

Improves the default task estimator so that it chooses the task count based on the number of file partitions rather than the number of individual files.
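To illustrate the difference between the two counting strategies, here is a minimal sketch. It is not the code from this PR: the `FileGroups` alias, both function names, and the `units_per_task` divisor are hypothetical, and the real estimator may derive the task count differently.

```rust
/// Hypothetical stand-in for a scan's file layout: each inner Vec is one
/// file partition (file group) holding the names of the files it reads.
type FileGroups = Vec<Vec<String>>;

/// File-based estimate (assumed old behavior): every physical file counts
/// as one unit of work, regardless of how files are grouped.
fn task_count_by_files(groups: &FileGroups, units_per_task: usize) -> usize {
    let files: usize = groups.iter().map(|g| g.len()).sum();
    // Round up, and always schedule at least one task.
    files.div_ceil(units_per_task.max(1)).max(1)
}

/// Partition-based estimate (assumed new behavior): every file partition
/// counts as one unit of work, however many files it contains.
fn task_count_by_partitions(groups: &FileGroups, units_per_task: usize) -> usize {
    groups.len().div_ceil(units_per_task.max(1)).max(1)
}
```

With two partitions of four files each and `units_per_task = 2`, the file-based count is 4 tasks while the partition-based count is 1, which matches the parallelism the scan can actually expose.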

@gabotechs gabotechs force-pushed the gabrielmusat/improve-default-task-estimator branch 2 times, most recently from 97d388c to da7997e Compare December 30, 2025 10:29
@gabotechs gabotechs force-pushed the gabrielmusat/improve-default-task-estimator branch from da7997e to e5cd77d Compare December 30, 2025 15:49
Collaborator

@LiaCastaneda LiaCastaneda left a comment

This makes sense to me. Just out of curiosity, do you know what DataFusion's threshold is for splitting a physical file into multiple PartitionedFiles?

@gabotechs
Collaborator Author

I think it's this one: https://github.com/apache/datafusion/blob/main/datafusion/common/src/config.rs#L937-L937, so if I'm reading it correctly, it's 10 MB.
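A rough sketch of how such a threshold could behave, assuming it acts as a minimum size below which a file is left whole. This is not DataFusion's implementation: `MIN_SPLIT_SIZE`, `split_ranges`, and the even split across `target_partitions` are all hypothetical simplifications.

```rust
/// Assumed threshold of 10 MB, taken from the reading of the config above.
const MIN_SPLIT_SIZE: u64 = 10 * 1024 * 1024;

/// Returns half-open byte ranges `(start, end)` covering a file.
/// Files below the threshold stay as a single range; larger files are
/// divided evenly across `target_partitions`.
fn split_ranges(file_size: u64, target_partitions: u64) -> Vec<(u64, u64)> {
    if file_size < MIN_SPLIT_SIZE || target_partitions <= 1 {
        return vec![(0, file_size)];
    }
    // Round the chunk size up so the ranges cover the whole file.
    let chunk = file_size.div_ceil(target_partitions);
    (0..target_partitions)
        .map(|i| (i * chunk, ((i + 1) * chunk).min(file_size)))
        // Drop empty trailing ranges that rounding can produce.
        .filter(|(start, end)| start < end)
        .collect()
}
```

Under these assumptions a 5 MB file yields one range, while a 40 MB file with `target_partitions = 4` yields four 10 MB ranges, each of which would count as its own unit when the estimator looks at partitions rather than files.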

@gabotechs gabotechs merged commit 48910a6 into main Jan 2, 2026
7 checks passed
@gabotechs gabotechs deleted the gabrielmusat/improve-default-task-estimator branch January 2, 2026 07:13

3 participants