Improving The Scalability Of Pipelines #1723

@eveleighoj

Description

The listed building collection has started failing due to disk size limits, so we need to hurry on increasing the scalability of the pipeline. Each of the following steps will contribute to getting the pipeline running, but only when all of them are done will I be confident that the pipelines are as scalable as we want. There will still be iterative work to keep costs down, but scalable they will be.

  • Address where state is read; make it an output of the collection step with a transform count
  • Move transformed parquets into the transformed directory instead of the cache
  • Separate out each phase into sequential tasks in Airflow
  • Separate transform onto separate machines, with a mapped number of transformations to run on each
  • Replace the assemble step with Spark and add a separate Datasette supporting step

This will be deployed after @eveleighoj is back from AL. The PRs below may close beforehand or be waiting, but the list here is useful for reference.

PR:

Additional PRs:

Technical debt:

  • Need to bring more data into parquet datasets; this will support packaging a lot more easily.
  • Need to add input schemas for packaging of several tables; this will speed up package building for larger datasets. The assumption for now is that each dataset is small enough that this isn't an issue.
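An input schema for package building could be as simple as an explicit column-to-type mapping checked before assembly. The sketch below is hypothetical: the column names and the `check_row` helper are illustrative, and the real pipeline would presumably express schemas as parquet/pyarrow schemas alongside the dataset specification.

```python
# Hypothetical input schema for a packaged table: column name -> type.
# A real schema would likely be a pyarrow schema read from the spec.
ENTITY_SCHEMA = {
    "entity": int,
    "name": str,
    "entry-date": str,
}

def check_row(row, schema=ENTITY_SCHEMA):
    """Return a list of schema violations for one row (empty if valid)."""
    problems = []
    for column, expected in schema.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected):
            problems.append(f"{column}: expected {expected.__name__}")
    return problems

good = check_row({"entity": 1, "name": "a", "entry-date": "2024-01-01"})
bad = check_row({"entity": "1", "name": "a"})
```

Declaring the expected columns up front means a large dataset can be validated and packaged column-by-column instead of being inferred in memory, which is where the speed-up for bigger datasets would come from.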
