
Ingest NGED primary live data using Dagster #12

@JackKelly

Description

  • Maybe NgedCkanClient doesn't need to be a class?
  • Replace the Sensor with an Asset that checks the NGED CKAN API for new data and saves the JSON response (see the CKAN asset sketch below).
  • Don't de-dupe all rows. Instead, crop the new dataframe to the start of the old dataframe, de-dupe the cropped new dataframe, then append (see the merge sketch below).
  • PAUSE to see if we can learn more about NGED's planned new data structures for S3.
  • Convert from "raw" Parquet to Delta Lake (see the Delta Lake sketch below).
  • Set the job to run once a day by defining a Schedule (see the schedule sketch below).
  • Implement a Dagster pipeline to download and merge substation locations. Move the logic out of packages/dashboard/main.py.
  • Maybe the Dagster substation partition should be keyed on substation number? (See the partitioning sketch below.)
  • Parquet files should use Hive partitioning and be partitioned by month and substation number (also covered in the partitioning sketch below).
  • Use a separate function for checking data, with an asset-check decorator? (See the asset-check sketch below.)
  • Configure where CSVs and Parquet files go, perhaps using an IOManager. We'll want different paths in dev, prod, and local, and all the code in nged-substation-forecast needs access to these paths. Maybe just use .env? Maybe create a .env.template? (See the config sketch below.)
  • Write unit tests that run Dagster (see the test sketch at the end).
  • Retry any network GETs (see the retry sketch below).
  • Define a Dagster asset spec for NGED's API??
  • Use dependency injection (using Resources?) to separate the CKAN client, so it's easy to mock in tests (see the resource sketch below).
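
CKAN asset sketch. A minimal version of the Asset that polls CKAN and saves the JSON; the base URL and dataset id are placeholders, and the standard CKAN `package_show` endpoint is assumed:

```python
import json
from pathlib import Path

import requests
from dagster import AssetExecutionContext, asset

# Placeholder values: the real CKAN base URL and dataset id need checking.
CKAN_BASE_URL = "https://connecteddata.nationalgrid.co.uk"
DATASET_ID = "example-primary-live-dataset"


@asset
def nged_ckan_metadata(context: AssetExecutionContext) -> dict:
    """Poll the CKAN package_show endpoint and save the raw JSON response."""
    response = requests.get(
        f"{CKAN_BASE_URL}/api/3/action/package_show",
        params={"id": DATASET_ID},
        timeout=30,
    )
    response.raise_for_status()
    metadata = response.json()

    out_path = Path("data/raw") / f"{DATASET_ID}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(metadata, indent=2))

    context.log.info(f"Saved CKAN metadata to {out_path}")
    return metadata
```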
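
Merge sketch. One reading of the crop-and-append task, assuming both dataframes share a sorted DatetimeIndex; the exact crop boundary still needs deciding:

```python
import pandas as pd


def append_new_rows(old: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Append only genuinely new rows, without de-duping the whole history.

    Crop `new` to rows beyond the data we already hold, de-dupe just that
    cropped frame, then append it to `old`.
    """
    if old.empty:
        return new.sort_index()
    cropped = new[new.index > old.index.max()]
    cropped = cropped[~cropped.index.duplicated(keep="first")]
    return pd.concat([old, cropped])
```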
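
Delta Lake sketch, assuming the `deltalake` package (delta-rs) is the conversion route:

```python
import pandas as pd
from deltalake import write_deltalake


def raw_parquet_to_delta(parquet_path: str, delta_path: str) -> None:
    """Append the rows of a "raw" Parquet file to a Delta Lake table."""
    df = pd.read_parquet(parquet_path)
    # mode="append" adds rows to an existing table, and creates the table
    # on the first write.
    write_deltalake(delta_path, df, mode="append")
```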
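
Schedule sketch. The 06:00 run time and the job/asset names are placeholders:

```python
from dagster import Definitions, ScheduleDefinition, asset, define_asset_job


@asset
def nged_primary_readings() -> None:
    """Stand-in for the real ingest asset."""


daily_ingest_job = define_asset_job("daily_ingest_job", selection="*")

# Cron: run once a day at 06:00 (UTC by default); the time is a guess.
daily_schedule = ScheduleDefinition(job=daily_ingest_job, cron_schedule="0 6 * * *")

defs = Definitions(
    assets=[nged_primary_readings],
    jobs=[daily_ingest_job],
    schedules=[daily_schedule],
)
```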
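
Partitioning sketch, covering both partition bullets: a Dagster partition per substation number, and Hive-partitioned Parquet by month and substation number. The substation numbers and the loader are placeholders:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from dagster import AssetExecutionContext, StaticPartitionsDefinition, asset

# Placeholder substation numbers; the real list would come from the
# substation-locations data (or a DynamicPartitionsDefinition).
substation_partitions = StaticPartitionsDefinition(["123456", "234567"])


def load_substation_readings(substation: str) -> pd.DataFrame:
    """Placeholder loader; the real one would read NGED's raw data."""
    index = pd.date_range("2024-01-01", periods=3, freq="30min")
    return pd.DataFrame({"load_mw": [1.0, 2.0, 3.0]}, index=index)


@asset(partitions_def=substation_partitions)
def substation_parquet(context: AssetExecutionContext) -> None:
    substation = context.partition_key
    df = load_substation_readings(substation)
    df["month"] = df.index.to_period("M").astype(str)
    df["substation"] = substation
    # partition_cols yields Hive-style key=value directories, e.g.
    # data/parquet/month=2024-01/substation=123456/<file>.parquet
    pq.write_to_dataset(
        pa.Table.from_pandas(df),
        root_path="data/parquet",
        partition_cols=["month", "substation"],
    )
```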
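
Asset-check sketch, with the check living in its own function under the `@asset_check` decorator. The duplicate-timestamp check is just an example of what we might verify:

```python
import pandas as pd
from dagster import AssetCheckResult, asset, asset_check


@asset
def primary_readings() -> pd.DataFrame:
    """Stand-in asset body for illustration."""
    index = pd.date_range("2024-01-01", periods=3, freq="30min")
    return pd.DataFrame({"load_mw": [1.0, 2.0, 3.0]}, index=index)


@asset_check(asset=primary_readings)
def no_duplicate_timestamps(primary_readings: pd.DataFrame) -> AssetCheckResult:
    """Separate checking function, attached via the asset-check decorator."""
    n_dupes = int(primary_readings.index.duplicated().sum())
    return AssetCheckResult(
        passed=(n_dupes == 0),
        metadata={"n_duplicate_timestamps": n_dupes},
    )
```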
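
Config sketch using a `.env` file via python-dotenv; the variable names and default paths are made up. `dagster dev` also loads a local `.env` automatically, and an IOManager reading these variables could replace the module-level constants:

```python
import os
from pathlib import Path

from dotenv import load_dotenv  # pip install python-dotenv

# .env (and a .env.template checked into the repo) might contain, e.g.:
#   RAW_DATA_DIR=/data/nged/raw
#   PARQUET_DIR=/data/nged/parquet
load_dotenv()

RAW_DATA_DIR = Path(os.environ.get("RAW_DATA_DIR", "data/raw"))
PARQUET_DIR = Path(os.environ.get("PARQUET_DIR", "data/parquet"))
```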
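
Retry sketch using tenacity for per-request backoff. Dagster's own `RetryPolicy` (e.g. `@asset(retry_policy=RetryPolicy(max_retries=3))`) retries the whole asset, whereas this retries individual GETs:

```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential


@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, max=60),
    reraise=True,
)
def get_with_retries(url: str, **kwargs) -> requests.Response:
    """GET with exponential backoff; re-raises after the final attempt."""
    response = requests.get(url, timeout=30, **kwargs)
    response.raise_for_status()
    return response
```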
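
Resource sketch: the CKAN client as a `ConfigurableResource`, so tests can inject a stub. The base URL default and the method shape are assumptions:

```python
import requests
from dagster import ConfigurableResource, asset


class NgedCkanResource(ConfigurableResource):
    """CKAN client as a Dagster resource, so tests can swap in a stub."""

    base_url: str = "https://connecteddata.nationalgrid.co.uk"  # assumed URL

    def package_show(self, dataset_id: str) -> dict:
        response = requests.get(
            f"{self.base_url}/api/3/action/package_show",
            params={"id": dataset_id},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()


@asset
def ckan_metadata(ckan: NgedCkanResource) -> dict:
    return ckan.package_show("example-primary-live-dataset")  # placeholder id
```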
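
Test sketch, reusing `NgedCkanResource` and `ckan_metadata` from the resource sketch above: `materialize()` runs the asset in-process with a fake client, so no network access is needed:

```python
from dagster import materialize


class FakeCkanResource(NgedCkanResource):
    """Stub that returns canned metadata instead of hitting the network."""

    def package_show(self, dataset_id: str) -> dict:
        return {"result": {"resources": []}}


def test_ckan_metadata_asset() -> None:
    result = materialize([ckan_metadata], resources={"ckan": FakeCkanResource()})
    assert result.success
```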
