After #216, we are in a better position regarding how we decide how many tasks each stage uses.
There is still room for improvement, for example:
- The number of tasks in a leaf stage is decided based on the number of files we are going to read, but it does not take into account how big those files are, how many rows we are going to read from them, how many columns, the DataType of the columns, how many NULLs there are, the estimated bytes we are going to ingest, etc.
- The number of tasks in intermediate stages is decided based on the `CardinalityEffect` of intermediate nodes in the previous stage, but this is a very coarse guess. If we had actual statistics about how many rows/bytes are going to flow through the nodes, we could better estimate the scale up/down factor for the number of tasks in intermediate stages (a rough sketch of both heuristics follows this list).
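As a minimal sketch of what statistics-driven sizing could look like: the function names, the byte/row targets, and the clamp to `1..=max_tasks` are all illustrative assumptions, not an existing API in this project.

```rust
/// Leaf stages: size by estimated bytes to ingest rather than by file count.
/// `target_bytes_per_task` is a hypothetical tuning knob.
fn leaf_task_count(estimated_input_bytes: u64, target_bytes_per_task: u64, max_tasks: u64) -> u64 {
    // Ceiling division so a trailing partial chunk still gets a task.
    let tasks = estimated_input_bytes.div_ceil(target_bytes_per_task.max(1));
    tasks.clamp(1, max_tasks)
}

/// Intermediate stages: scale the previous stage's task count by estimated
/// row selectivity instead of the coarse CardinalityEffect.
fn intermediate_task_count(
    prev_tasks: u64,
    estimated_input_rows: u64,
    estimated_output_rows: u64,
    max_tasks: u64,
) -> u64 {
    // Fraction of input rows expected to flow out of this stage.
    let selectivity = estimated_output_rows as f64 / estimated_input_rows.max(1) as f64;
    let tasks = (prev_tasks as f64 * selectivity).ceil() as u64;
    tasks.clamp(1, max_tasks)
}
```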
When we are reading from Parquet, most of this metadata is already available in the file footers, so I think we have all the information needed to make more informed guesses about the number of tasks.
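For illustration, row counts and byte sizes can be read from the Parquet footer with the Rust `parquet` crate without touching any data pages; the `FileEstimate` struct and the aggregation below are hypothetical, meant only to show how these numbers could feed the sizing heuristics above.

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

/// Hypothetical per-file summary to feed into task-count heuristics.
struct FileEstimate {
    rows: i64,
    uncompressed_bytes: i64,
}

fn estimate_from_footer(path: &str) -> parquet::errors::Result<FileEstimate> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let mut est = FileEstimate { rows: 0, uncompressed_bytes: 0 };
    // Row-group metadata in the footer already carries row counts and
    // byte sizes, so this only reads the footer, not the data.
    for rg in reader.metadata().row_groups() {
        est.rows += rg.num_rows();
        est.uncompressed_bytes += rg.total_byte_size();
    }
    Ok(est)
}
```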