Metadata-based TaskEstimator implementation for leaf nodes. #252

@gabotechs

Description

After #216, we are in a better position regarding how we decide how many tasks each stage uses.

There is still room for improvement, for example:

  • The number of tasks in a leaf stage is decided based on the number of files we are going to read, but it does not take into account how big those files are, how many rows we are going to read from them, how many columns, the DataType of each column, how many NULLs there are, the estimated bytes we are going to ingest, etc.
  • The number of tasks in intermediate stages is decided based on the CardinalityEffect of the intermediate nodes in the previous stage, but this is a very coarse guess. If we had actual statistics about how many rows/bytes flow through the nodes, we could better estimate the scale up/down factor for the number of tasks in intermediate stages (see the sketch after this list).
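
A rough sketch of that second point (the function names here are hypothetical, not anything in the codebase): derive the scale factor from the ratio between the rows estimated to leave the previous stage and the rows it ingested, instead of a fixed CardinalityEffect-based guess.

```rust
// Hypothetical sketch: scale an intermediate stage's task count by the ratio of
// estimated output rows to input rows of the previous stage.
fn intermediate_scale_factor(input_rows: u64, estimated_output_rows: u64) -> f64 {
    if input_rows == 0 {
        return 1.0; // no statistics available, keep the previous task count
    }
    estimated_output_rows as f64 / input_rows as f64
}

fn scale_tasks(previous_tasks: usize, factor: f64) -> usize {
    // Never go below a single task.
    ((previous_tasks as f64 * factor).round() as usize).max(1)
}
```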

When reading from Parquet, we actually have most of this metadata already available, so all the information needed to make more informed guesses about the number of tasks should already be there.
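
As a rough illustration of what a metadata-based TaskEstimator for leaf stages could look like (just a sketch: `estimate_leaf_tasks` and the `TARGET_BYTES_PER_TASK` knob are made-up names, only the `parquet` crate calls are real), the Parquet footer already exposes per-row-group byte sizes and row counts without reading any data pages:

```rust
use std::fs::File;
use std::path::Path;

use parquet::errors::Result;
use parquet::file::reader::{FileReader, SerializedFileReader};

/// Hypothetical tuning knob: aim for roughly this many uncompressed bytes per task.
const TARGET_BYTES_PER_TASK: i64 = 256 * 1024 * 1024;

/// Estimate the number of tasks for a leaf stage from the Parquet footers of
/// its input files, instead of just counting the files.
fn estimate_leaf_tasks(files: &[&Path]) -> Result<usize> {
    let mut total_bytes: i64 = 0;
    for path in files {
        let reader = SerializedFileReader::new(File::open(path)?)?;
        // Each row group in the footer reports its uncompressed byte size
        // (plus row counts and per-column statistics) without touching data pages.
        for rg in reader.metadata().row_groups() {
            total_bytes += rg.total_byte_size();
        }
    }
    // One task per TARGET_BYTES_PER_TASK worth of data, rounded up, at least 1.
    Ok(((total_bytes + TARGET_BYTES_PER_TASK - 1) / TARGET_BYTES_PER_TASK).max(1) as usize)
}
```

The same footer also carries per-column null counts and min/max statistics, so an estimator like this could additionally weigh column DataTypes and NULL density if that turns out to matter.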
