Roadmap
This is a long-term vision to improve Data Engineering infrastructure and operations without major disruptions to our data product releases.
These ideas and notes inform some of the tasks in our team project.
- Our current data engineering infrastructure
- extracts and transforms source data before storing it as a sql dump file
- often uses a temporary database to build a dataset
- relies heavily on long bash scripts that call python and sql files
- is spread across dozens of repos
- Pros
- rigorous versioning of archived source data
- flexibility/stability of isolated build processes
- python and sql are mature, popular languages
- already very cloud-based
- Cons
- maintenance of many repos
- cognitive costs of inconsistency across pipelines
- difficult to test pipelines before production runs
- long bash scripts
- post-build questions are hard to answer
- varied dataset handoff processes to GIS, OSE, Capital Planning, etc.
- lack of data lineage and end-to-end testing
We want data engineering infrastructure that:
- reduces time spent building and maintaining data pipelines
- reduces time spent performing updates and QA of datasets
- standardizes the approaches and code used to build datasets
The data platform we imagine may do the following (a rough code sketch follows this list):
- extract and store unstructured source data
- load source data to a persistent data warehouse
- transform data in the data warehouse to build data products
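As a rough illustration, here is a minimal Python sketch of that extract/load/transform flow. Everything in it is hypothetical: the bucket, connection string, endpoint, and function names are placeholders, assuming an S3-compatible object store (like Digital Ocean Spaces) for raw source data and a Postgres warehouse.

```python
# Hypothetical sketch of the imagined platform, not an actual implementation.
# All names (bucket, DSN, endpoint, tables) are illustrative placeholders.
import io
import urllib.request

import boto3
import psycopg2

RAW_BUCKET = "edm-raw"  # hypothetical bucket for unmodified source data
WAREHOUSE_DSN = "postgresql://etl@warehouse:5432/edm"  # hypothetical warehouse

def extract(source_url: str, key: str) -> bytes:
    """Extract: pull source data and archive it, unmodified, in object storage."""
    data = urllib.request.urlopen(source_url).read()
    s3 = boto3.client("s3", endpoint_url="https://nyc3.digitaloceanspaces.com")
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=data)
    return data

def load(csv_bytes: bytes, table: str) -> None:
    """Load: copy archived source data into the persistent warehouse."""
    with psycopg2.connect(WAREHOUSE_DSN) as conn, conn.cursor() as cur:
        cur.copy_expert(f"COPY {table} FROM STDIN WITH CSV HEADER", io.BytesIO(csv_bytes))

def transform(sql_path: str) -> None:
    """Transform: build a data product entirely inside the warehouse."""
    with psycopg2.connect(WAREHOUSE_DSN) as conn, conn.cursor() as cur:
        cur.execute(open(sql_path).read())
```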
So far, our ideas for changes and features fall into these general buckets of work:
- refactor of data products
- bash -> python (see the first sketch after this list)
- reduce inconsistencies across data products using dbt, etc.
- expansion of QA tools
- especially source data QA, but that's possibly blocked by the reworking of data library
- design of our "data platform"
- including build engine, prod/dev environments, and maybe data storage architecture in general
- rework or replacement of data library
- use of orchestration and compute resources
- e.g. airflow (see the second sketch after this list)
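For the bash -> python bucket, a hedged sketch of what a refactored build driver might look like: today's long bash scripts chain calls to sql files, and the same orchestration could be a small Python entry point. The dataset layout, connection string, and names below are illustrative only.

```python
# Hypothetical sketch of the "bash -> python" refactor: a long bash driver
# that chains psql calls, re-expressed as a small Python entry point.
# Dataset layout, DSN, and names are placeholders, not our actual code.
import pathlib
import sys

import psycopg2

BUILD_DSN = "postgresql://etl@localhost:5432/temp_build"  # temporary build db

def run_sql(cur, path: pathlib.Path) -> None:
    """Execute one sql file, logging progress as the bash scripts do today."""
    print(f"running {path} ...", file=sys.stderr)
    cur.execute(path.read_text())

def build(dataset_dir: str) -> None:
    """Run every .sql file in a dataset's sql/ folder, in sorted order."""
    with psycopg2.connect(BUILD_DSN) as conn, conn.cursor() as cur:
        for sql_file in sorted(pathlib.Path(dataset_dir, "sql").glob("*.sql")):
            run_sql(cur, sql_file)

if __name__ == "__main__":
    build(sys.argv[1])
```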
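And for the orchestration bucket, a sketch of what a dataset build might look like as an Airflow DAG, assuming Airflow 2.x; the DAG id, schedule, and task bodies are placeholders.

```python
# Hypothetical sketch of orchestrating a dataset build with Airflow 2.x.
# DAG id, schedule, and task callables are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("archive raw source data")

def build():
    print("run the transformations that build the product")

def qa():
    print("run QA checks before promoting the build")

with DAG(
    dag_id="build_example_product",  # hypothetical product name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_build = PythonOperator(task_id="build", python_callable=build)
    t_qa = PythonOperator(task_id="qa", python_callable=qa)
    t_extract >> t_build >> t_qa
```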
A rough sequence of milestones:
- start the mono repo
- build all primary data products in the mono repo
- use some amount of common code in all builds
- standardize build output folder structure in Digital Ocean
- implement a Build, QA, Promote workflow for data products (see the sketch after this list)
- use the QA app to inspect source data
- build an MVP data warehouse
- Celebrate! 🎊
- ...
- implement a standardized Extract and Load process
- implement a standardized Transform process
- Celebrate! 🎊
- ...
- Celebrate! 🎊
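One possible shape for the Promote step of that Build, QA, Promote workflow, assuming build outputs live under versioned prefixes in an S3-compatible Digital Ocean Spaces bucket; the bucket, prefixes, and product names are placeholders.

```python
# Hypothetical sketch of a Promote step: after QA approves a build, copy its
# outputs from a draft/ prefix to a publish/ prefix in Digital Ocean Spaces.
# Bucket, prefixes, and product/version names are illustrative only.
import boto3

s3 = boto3.client("s3", endpoint_url="https://nyc3.digitaloceanspaces.com")
BUCKET = "edm-publishing"  # hypothetical bucket

def promote(product: str, version: str) -> None:
    """Copy every object for a QA-approved build from draft/ to publish/."""
    src_prefix = f"draft/{product}/{version}/"
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            dest_key = obj["Key"].replace("draft/", "publish/", 1)
            s3.copy_object(
                Bucket=BUCKET,
                CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
                Key=dest_key,
            )

promote("example-product", "23v3")  # example invocation; names hypothetical
```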
[Diagrams: current state and desired state of our data infrastructure]