After #216, we are in a better position regarding how we decide how many tasks each stage uses.
There is still room for improvement, for example:
- The number of tasks in a leaf stage is decided based on the number of files we are going to read, but it does not take into account how big those files are, how many rows we are going to read from them, how many columns, the DataType of the columns, how many NULLs there are, the estimated bytes we are going to ingest, etc.
- The number of tasks in intermediate stages is decided based on the `CardinalityEffect` of intermediate nodes in the previous stage, but this is a very coarse guess. If we had actual statistics about how many rows/bytes are going to flow through the nodes, we could better estimate the scale up/down factor for the number of tasks in intermediate stages (a rough sketch of both heuristics follows this list).
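As a minimal sketch of what statistics-driven sizing could look like: the function names, the byte/row targets, and the clamp to `1..=max_tasks` are all illustrative assumptions, not an existing API in this project.

```rust
/// Leaf stages: size by estimated bytes to ingest rather than by file count.
/// `target_bytes_per_task` is a hypothetical tuning knob.
fn leaf_task_count(estimated_input_bytes: u64, target_bytes_per_task: u64, max_tasks: u64) -> u64 {
    // Ceiling division so a trailing partial chunk still gets a task.
    let tasks = estimated_input_bytes.div_ceil(target_bytes_per_task.max(1));
    tasks.clamp(1, max_tasks)
}

/// Intermediate stages: scale the previous stage's task count by estimated
/// row selectivity instead of the coarse CardinalityEffect.
fn intermediate_task_count(
    prev_tasks: u64,
    estimated_input_rows: u64,
    estimated_output_rows: u64,
    max_tasks: u64,
) -> u64 {
    // Fraction of input rows expected to flow out of this stage.
    let selectivity = estimated_output_rows as f64 / estimated_input_rows.max(1) as f64;
    let tasks = (prev_tasks as f64 * selectivity).ceil() as u64;
    tasks.clamp(1, max_tasks)
}
```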
When we are reading from Parquet, most of this metadata is already available in the file footers, so I think we have all the information needed to make more informed guesses about the number of tasks.
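For illustration, row counts and byte sizes can be read from the Parquet footer with the Rust `parquet` crate without touching any data pages; the `FileEstimate` struct and the aggregation below are hypothetical, meant only to show how these numbers could feed the sizing heuristics above.

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

/// Hypothetical per-file summary to feed into task-count heuristics.
struct FileEstimate {
    rows: i64,
    uncompressed_bytes: i64,
}

fn estimate_from_footer(path: &str) -> parquet::errors::Result<FileEstimate> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let mut est = FileEstimate { rows: 0, uncompressed_bytes: 0 };
    // Row-group metadata in the footer already carries row counts and
    // byte sizes, so this only reads the footer, not the data.
    for rg in reader.metadata().row_groups() {
        est.rows += rg.num_rows();
        est.uncompressed_bytes += rg.total_byte_size();
    }
    Ok(est)
}
```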