Roadmap

Overview

This page lays out our long-term vision for improving our infrastructure and operations without major disruptions to our data products.

The ideas and notes here inform the tasks found in the Data Engineering project within this repo.

So far, our ideas for changes and features fall into these general buckets:

- data warehouse
  - including the build engine, prod/dev environments, etc., and maybe data storage architecture in general
- expanding QAQC tools
  - source data QAQC would be especially valuable, but may be blocked by our desired reworking of the data library
- refactoring of data products
  - bash -> python
  - reduce inconsistencies across data products, dbt, etc.
- reworking or replacement of the data library
- orchestration and compute resources

Milestones

Desired state

We want data engineering infrastructure that:
- reduces time spent building and maintaining data pipelines
- reduces time spent performing updates and QA of datasets
- standardizes the approaches and code used to build datasets

The data platform we imagine may do the following:
- extract source data and store it
- load source data to a persistent data warehouse
- transform data in the data warehouse to build data products
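
The three steps above can be sketched as a small extract-load-transform pipeline. This is only an illustration under stated assumptions, not our actual design: SQLite stands in for the persistent data warehouse, and the table names, column names, and sample rows are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from an archived source file.
SOURCE_CSV = "id,borough,units\n1,Brooklyn,10\n2,Queens,5\n3,Brooklyn,7\n"


def extract(raw_text):
    """Parse the raw source data into rows without changing any values."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def load(conn, rows):
    """Load the untransformed source rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS source_housing "
        "(id INTEGER, borough TEXT, units INTEGER)"
    )
    conn.executemany(
        "INSERT INTO source_housing VALUES (:id, :borough, :units)", rows
    )
    conn.commit()


def transform(conn):
    """Build a data product from the loaded source table, inside the warehouse."""
    conn.execute("DROP TABLE IF EXISTS product_units_by_borough")
    conn.execute(
        "CREATE TABLE product_units_by_borough AS "
        "SELECT borough, SUM(units) AS total_units "
        "FROM source_housing GROUP BY borough"
    )
    conn.commit()


def run_pipeline(conn, raw_text):
    """Run extract, load, and transform in order and return the product rows."""
    load(conn, extract(raw_text))
    transform(conn)
    return conn.execute(
        "SELECT borough, total_units FROM product_units_by_borough ORDER BY borough"
    ).fetchall()
```

Calling `run_pipeline(sqlite3.connect(":memory:"), SOURCE_CSV)` builds the product in a throwaway database. The point of the ordering is that source data lands in the warehouse untransformed, so the transform step can be rerun or inspected without extracting again.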

Current state

Our current data engineering infrastructure:
- extracts and transforms source data before storing it as a sql dump file
- often uses a temporary database to build a dataset
- relies heavily on long bash scripts which call python and sql files
- is spread across dozens of repos

Pros
- rigorous versioning of archived source data
- flexibility/stability of isolated build processes
- python and sql are mature, popular languages
- already very cloud-based

Cons
- maintenance of many repos
- cognitive costs of inconsistency across pipelines
- difficult to test pipelines before production runs
- long bash scripts
- post-build questions are hard to answer
- inconsistent dataset handoff processes to GIS, OSE, Capital Planning, etc.
- lack of data lineage and end-to-end testing
