The listed building collection has started failing due to disk size limits, which means we need to prioritise increasing the scalability of the pipeline. Each of the following steps will contribute to getting the pipeline running, but only when all of them are done will I be confident that the pipelines are as scalable as we want. There will still be iterative work to keep costs down, but they will be scalable.
- Address where state is read: make it an output of the collection step, together with a transform count
- Move transformed parquet files into the transformed directory instead of the cache
- Separate each phase into sequential tasks in Airflow
- Separate the transform phase onto separate machines, with a mapped number of transformations to run
- Replace the assemble step with Spark and add a separate Datasette supporting step
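The first and fourth steps above can be sketched together: the collection step writes its state (including how many transformations are needed) as an explicit output, and that count is then mapped across a number of transform machines. This is a minimal illustration only; all function, file, and resource names here are hypothetical, not the pipeline's real ones.

```python
import json
import tempfile
from pathlib import Path

def collect(resources: list[str], out_dir: Path) -> dict:
    """Collection phase: write its state (including the number of
    transformations to run) as an explicit output for later phases,
    instead of having them read state from elsewhere."""
    state = {"resources": resources, "transform_count": len(resources)}
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "state.json").write_text(json.dumps(state))
    return state

def map_transforms(state: dict, workers: int) -> list[list[str]]:
    """Split the transformations round-robin across a fixed number of
    transform machines, each machine getting its own batch."""
    batches: list[list[str]] = [[] for _ in range(workers)]
    for i, resource in enumerate(state["resources"]):
        batches[i % workers].append(resource)
    return batches

out_dir = Path(tempfile.mkdtemp())
state = collect(["res-a", "res-b", "res-c", "res-d", "res-e"], out_dir)
print(state["transform_count"])   # 5
print(map_transforms(state, 2))   # [['res-a', 'res-c', 'res-e'], ['res-b', 'res-d']]
```

In Airflow itself this fan-out would be done with dynamic task mapping (`task.expand`) rather than a hand-rolled batching function, with each mapped task instance running on its own worker.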
This will be deployed after @eveleighoj is back from AL. The PRs below may close beforehand or be left waiting, but the list here is useful for reference.
PRs:
- Fix pyspark code - Fix/incorrect processing pyspark-jobs#55
- Update 'new' collection DAGs (this flow can replace the existing one after testing) - use emr serverless operator airflow-dags#67
- Update digital-land-python to consider dataset-resource when calculating the number of transformations - Feat/improve log digital-land-python#503
- Update collection-task to consider dataset-resource when calculating the number of transformations, and add a package script - set up a library for src code collection-task#45
- Add look-up rules to digital-land-python; this is separate and only needed when we actually add title boundaries - add lookup rules and tests digital-land-python#477
- Update the look-up rule CSV in config for title boundary
Additional PRs:
- Sort out testing error on Mac - finally a fix for seg faults digital-land-python#505
Technical debt:
- Need to bring more data into parquet datasets; this will make packaging much easier.
- Need to add input schemas for packaging of several tables; this will speed up package building for larger datasets. The assumption for now is that each dataset is small enough that it's not an issue.
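To illustrate the input-schema idea in the last item: declaring each table's columns and types up front means a parquet library such as pyarrow can be given a fixed schema instead of inferring one by scanning the data, which is what slows down package building on larger datasets. This is a sketch under assumed names; the table and column names below are illustrative, not the real package tables.

```python
# Illustrative input schemas only; the real packaging tables may differ.
INPUT_SCHEMAS = {
    "fact": {"fact": "string", "entity": "int64", "field": "string", "value": "string"},
    "fact_resource": {"fact": "string", "resource": "string", "entry_number": "int64"},
}

def validate_columns(table: str, row: dict) -> bool:
    """True when a row only uses columns declared in the table's input schema."""
    return set(row) <= set(INPUT_SCHEMAS[table])

print(validate_columns("fact", {"fact": "abc", "entity": 42}))  # True
print(validate_columns("fact", {"unexpected": 1}))              # False
```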