
Delta lake structure #31


Description

@JackKelly

As we're doing now: for each substation, grab the CSVs once a day and save the entire CSV to disk. Read the latest Parquet file to find the end date. Convert the CSV to a pl.DataFrame, crop the dataframe so it starts where the Parquet ends, drop duplicate rows, and sort.
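A minimal Polars sketch of that current flow (the `timestamp` column name and file paths are assumptions, not the real layout):

```python
import polars as pl

# Hypothetical paths for one substation.
csv_path = "raw/substation_123/2024-06-01.csv"
parquet_path = "parquet/substation_123.parquet"

# Find where the existing Parquet ends.
end_date = (
    pl.scan_parquet(parquet_path)
    .select(pl.col("timestamp").max())
    .collect()
    .item()
)

# Load today's CSV, keep only rows newer than the Parquet's end,
# drop duplicates, and sort by time.
new_rows = (
    pl.read_csv(csv_path, try_parse_dates=True)
    .filter(pl.col("timestamp") > end_date)
    .unique()
    .sort("timestamp")
)
```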

New: Save just the new data to a new Parquet. We'll keep a week of these daily files in a hot directory. No more overwriting Parquets.
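Roughly, the new write path would look like this (paths and column names are hypothetical; a stand-in frame is built so the sketch runs on its own):

```python
import datetime
import pathlib

import polars as pl

# Stand-in for the cropped, de-duplicated, sorted frame from the step above.
new_rows = pl.DataFrame(
    {"timestamp": [datetime.datetime(2024, 6, 1, 0, 30)], "power_kw": [1.2]}
)

# One new Parquet per retrieval day in the "hot" directory; nothing is overwritten.
hot_dir = pathlib.Path("hot/substation_123")
hot_dir.mkdir(parents=True, exist_ok=True)
new_rows.write_parquet(hot_dir / f"{datetime.date.today().isoformat()}.parquet")
```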

Once a month, Dagster will compact these daily files into monthly files. We'll need to change the last_modified_dates DynamicPartitionsDefinition to a DailyPartitionsDefinition (this might be a nice change anyway, because it might actually make more sense to partition by the date we retrieve the CSVs rather than by the CKAN modified_date). The compaction job will use a MonthlyPartitionsDefinition, and a TimeWindowPartitionMapping will map each monthly compaction partition onto that month's daily partitions, so Dagster only runs the compaction job at the end of each month.
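A sketch of how that partitioning could be wired up in Dagster (asset names and start dates are assumptions; the real assets live in the pipeline code):

```python
from dagster import (
    AssetDep,
    DailyPartitionsDefinition,
    MonthlyPartitionsDefinition,
    TimeWindowPartitionMapping,
    asset,
)

daily_partitions = DailyPartitionsDefinition(start_date="2024-01-01")
monthly_partitions = MonthlyPartitionsDefinition(start_date="2024-01-01")


@asset(partitions_def=daily_partitions)
def daily_parquet(context) -> None:
    # Download the day's CSVs, crop to new rows, and write one daily Parquet
    # per substation into the hot directory (see sketches above).
    ...


@asset(
    partitions_def=monthly_partitions,
    deps=[
        # TimeWindowPartitionMapping maps each monthly compaction partition
        # onto the daily partitions whose time windows fall inside that month.
        AssetDep(daily_parquet, partition_mapping=TimeWindowPartitionMapping()),
    ],
)
def monthly_parquet(context) -> None:
    # Compact that month's daily Parquet files into a single monthly file,
    # then clear them out of the hot directory.
    ...
```

A month-end schedule (or automation condition) on `monthly_parquet` would then trigger the compaction once each month's daily partitions have been materialised.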

Note to self: See this Gemini conversation (which I think I've mostly summarised above)
