
Delta lake structure #31


Description

@JackKelly

As we're doing now: for each substation, grab the CSVs once a day and save the entire CSV to disk. Read the latest Parquet file to find the end date. Convert the CSV to a pl.DataFrame, crop the dataframe so it starts where the Parquet ends, drop duplicate rows, and sort.
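A minimal Polars sketch of that current flow (the `timestamp` column name and file paths are assumptions, not the real layout):

```python
import polars as pl

# Hypothetical paths for one substation.
csv_path = "raw/substation_123/2024-06-01.csv"
parquet_path = "parquet/substation_123.parquet"

# Find where the existing Parquet ends.
end_date = (
    pl.scan_parquet(parquet_path)
    .select(pl.col("timestamp").max())
    .collect()
    .item()
)

# Load today's CSV, keep only rows newer than the Parquet's end,
# drop duplicates, and sort by time.
new_rows = (
    pl.read_csv(csv_path, try_parse_dates=True)
    .filter(pl.col("timestamp") > end_date)
    .unique()
    .sort("timestamp")
)
```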

New: Save just the new data to a new Parquet. We'll keep a week of these daily files in a hot directory. No more overwriting Parquets.
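Roughly, the new write path would look like this (paths and column names are hypothetical; a stand-in frame is built so the sketch runs on its own):

```python
import datetime
import pathlib

import polars as pl

# Stand-in for the cropped, de-duplicated, sorted frame from the step above.
new_rows = pl.DataFrame(
    {"timestamp": [datetime.datetime(2024, 6, 1, 0, 30)], "power_kw": [1.2]}
)

# One new Parquet per retrieval day in the "hot" directory; nothing is overwritten.
hot_dir = pathlib.Path("hot/substation_123")
hot_dir.mkdir(parents=True, exist_ok=True)
new_rows.write_parquet(hot_dir / f"{datetime.date.today().isoformat()}.parquet")
```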

Once a month, Dagster will compact these daily files into monthly files. We'll need to change the last_modified_dates DynamicPartitionsDefinition to a DailyPartitionsDefinition (this might be a nice change anyway, because it might actually make more sense to partition by the date we retrieve the CSVs rather than by the CKAN modified_date). The compaction job will use a MonthlyPartitionsDefinition, and a TimeWindowPartitionMapping will map each monthly compaction partition onto that month's daily partitions, so Dagster only runs the compaction job at the end of each month.
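A sketch of how that partitioning could be wired up in Dagster (asset names and start dates are assumptions; the real assets live in the pipeline code):

```python
from dagster import (
    AssetDep,
    DailyPartitionsDefinition,
    MonthlyPartitionsDefinition,
    TimeWindowPartitionMapping,
    asset,
)

daily_partitions = DailyPartitionsDefinition(start_date="2024-01-01")
monthly_partitions = MonthlyPartitionsDefinition(start_date="2024-01-01")


@asset(partitions_def=daily_partitions)
def daily_parquet(context) -> None:
    # Download the day's CSVs, crop to new rows, and write one daily Parquet
    # per substation into the hot directory (see sketches above).
    ...


@asset(
    partitions_def=monthly_partitions,
    deps=[
        # TimeWindowPartitionMapping maps each monthly compaction partition
        # onto the daily partitions whose time windows fall inside that month.
        AssetDep(daily_parquet, partition_mapping=TimeWindowPartitionMapping()),
    ],
)
def monthly_parquet(context) -> None:
    # Compact that month's daily Parquet files into a single monthly file,
    # then clear them out of the hot directory.
    ...
```

A month-end schedule (or automation condition) on `monthly_parquet` would then trigger the compaction once each month's daily partitions have been materialised.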

Note to self: See this Gemini conversation (which I think I've mostly summarised above)
