The listed building collection has started failing due to disk size limits, which means we need to prioritise increasing the scalability of the pipeline. Each of the following steps will contribute to getting the pipeline running, but only when all of them are done will I be confident that the pipelines are as scalable as we want. There will still be iterative work to keep costs down, but they will be scalable.
- Address where state is read: make it an output of the collection step, together with a transform count
- Move transformed parquet files into the transformed directory instead of the cache
- Separate each phase into sequential tasks in Airflow
- Separate the transform phase onto separate machines, with a mapped number of transformations to run
- Replace the assemble step with Spark and add a separate Datasette supporting step
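The first and fourth steps above can be sketched together: the collection step writes its state (including how many transformations are needed) as an explicit output, and that count is then mapped across a number of transform machines. This is a minimal illustration only; all function, file, and resource names here are hypothetical, not the pipeline's real ones.

```python
import json
import tempfile
from pathlib import Path

def collect(resources: list[str], out_dir: Path) -> dict:
    """Collection phase: write its state (including the number of
    transformations to run) as an explicit output for later phases,
    instead of having them read state from elsewhere."""
    state = {"resources": resources, "transform_count": len(resources)}
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "state.json").write_text(json.dumps(state))
    return state

def map_transforms(state: dict, workers: int) -> list[list[str]]:
    """Split the transformations round-robin across a fixed number of
    transform machines, each machine getting its own batch."""
    batches: list[list[str]] = [[] for _ in range(workers)]
    for i, resource in enumerate(state["resources"]):
        batches[i % workers].append(resource)
    return batches

out_dir = Path(tempfile.mkdtemp())
state = collect(["res-a", "res-b", "res-c", "res-d", "res-e"], out_dir)
print(state["transform_count"])   # 5
print(map_transforms(state, 2))   # [['res-a', 'res-c', 'res-e'], ['res-b', 'res-d']]
```

In Airflow itself this fan-out would be done with dynamic task mapping (`task.expand`) rather than a hand-rolled batching function, with each mapped task instance running on its own worker.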
This will be deployed after @eveleighoj is back from AL. The PRs below may close beforehand or be left waiting, but the list here is useful for reference.
PRs:
- Fix pyspark code - Fix/incorrect processing pyspark-jobs#55
- Update 'new' collection DAGs (this flow can replace the existing one after testing) - use emr serverless operator airflow-dags#67
- Update digital-land-python to consider dataset-resource when calculating the number of transformations - Feat/improve log digital-land-python#503
- Update collection-task to consider dataset-resource when calculating the number of transformations, and add a package script - set up a library for src code collection-task#45
- Add look-up rules to digital-land-python; this is separate and only needed when we actually add title boundaries - add lookup rules and tests digital-land-python#477
- Update the look-up rule CSV in config for title boundary
Additional PRs:
- Sort out testing error on Mac - finally a fix for seg faults digital-land-python#505
Technical debt:
- Need to bring more data into parquet datasets; this will make packaging much easier.
- Need to add input schemas for packaging of several tables; this will speed up package building for larger datasets. The assumption for now is that each dataset is small enough that it's not an issue.
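To illustrate the input-schema idea in the last item: declaring each table's columns and types up front means a parquet library such as pyarrow can be given a fixed schema instead of inferring one by scanning the data, which is what slows down package building on larger datasets. This is a sketch under assumed names; the table and column names below are illustrative, not the real package tables.

```python
# Illustrative input schemas only; the real packaging tables may differ.
INPUT_SCHEMAS = {
    "fact": {"fact": "string", "entity": "int64", "field": "string", "value": "string"},
    "fact_resource": {"fact": "string", "resource": "string", "entry_number": "int64"},
}

def validate_columns(table: str, row: dict) -> bool:
    """True when a row only uses columns declared in the table's input schema."""
    return set(row) <= set(INPUT_SCHEMAS[table])

print(validate_columns("fact", {"fact": "abc", "entity": 42}))  # True
print(validate_columns("fact", {"unexpected": 1}))              # False
```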