Description
As we're doing now: for each substation, grab the CSV once a day and save the entire CSV to disk. Read the latest Parquet file to find its end date. Convert the CSV to a pl.DataFrame, crop the dataframe so it starts where the Parquet ends, ensure it's unique, and sort.
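For concreteness, a minimal Polars sketch of that flow. The paths and the "timestamp" column name are placeholders, not the repo's real names:

```python
from pathlib import Path

import polars as pl

CSV_PATH = Path("data/raw/substation_x_latest.csv")      # hypothetical
PARQUET_PATH = Path("data/parquet/substation_x.parquet")  # hypothetical


def crop_to_new_rows(csv_path: Path, parquet_path: Path) -> pl.DataFrame:
    """Keep only the CSV rows that are newer than the existing Parquet."""
    existing = pl.read_parquet(parquet_path)
    end_date = existing["timestamp"].max()  # end of the data we already have

    return (
        pl.read_csv(csv_path, try_parse_dates=True)
        .filter(pl.col("timestamp") > end_date)  # crop to rows after the Parquet ends
        .unique()
        .sort("timestamp")
    )
```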
New: Save just the new data to a new Parquet. We'll keep a week of these daily files in a hot directory. No more overwriting Parquets.
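One possible shape for the new write path, again with a hypothetical directory layout and file naming: one Parquet per retrieval day in a hot directory, pruned once files are older than a week.

```python
import datetime as dt
from pathlib import Path

import polars as pl

HOT_DIR = Path("data/hot/substation_x")  # hypothetical


def write_daily_parquet(new_rows: pl.DataFrame, run_date: dt.date) -> Path:
    """Write only the newly fetched rows; existing Parquets are never rewritten."""
    HOT_DIR.mkdir(parents=True, exist_ok=True)
    out_path = HOT_DIR / f"{run_date.isoformat()}.parquet"
    new_rows.write_parquet(out_path)
    return out_path


def prune_hot_dir(today: dt.date, keep_days: int = 7) -> None:
    """Drop daily files older than a week (once they've been compacted)."""
    cutoff = today - dt.timedelta(days=keep_days)
    for path in HOT_DIR.glob("*.parquet"):
        if dt.date.fromisoformat(path.stem) < cutoff:
            path.unlink()
```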
Once a month, Dagster will compact these daily files into monthly files. We'll need to change the last_modified_dates DynamicPartitionsDefinition to a DailyPartitionsDefinition (this might be a nice change anyway, because it probably makes more sense to partition by the date we retrieve the CSVs rather than by the CKAN modified_date). The compaction job will be partitioned monthly (MonthlyPartitionsDefinition), and we'll use a TimeWindowPartitionMapping to tell Dagster that each monthly partition depends on that month's daily partitions, so the compaction only runs at the end of each month.
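A rough Dagster sketch of how the partitions might wire together. The asset names and start dates are illustrative, not the repo's real ones; with its defaults, TimeWindowPartitionMapping maps each monthly partition onto the daily partitions that fall inside that month.

```python
from dagster import (
    AssetExecutionContext,
    AssetIn,
    DailyPartitionsDefinition,
    MonthlyPartitionsDefinition,
    TimeWindowPartitionMapping,
    asset,
)

daily_partitions = DailyPartitionsDefinition(start_date="2024-01-01")    # illustrative
monthly_partitions = MonthlyPartitionsDefinition(start_date="2024-01-01")


@asset(partitions_def=daily_partitions)
def daily_parquet(context: AssetExecutionContext) -> None:
    """Fetch the day's CSV and write just the new rows to the hot directory."""
    ...


@asset(
    partitions_def=monthly_partitions,
    ins={
        # Each monthly partition depends on all of that month's daily partitions.
        "daily_parquet": AssetIn(partition_mapping=TimeWindowPartitionMapping()),
    },
)
def monthly_parquet(context: AssetExecutionContext, daily_parquet) -> None:
    """Compact the month's daily files into a single monthly Parquet."""
    ...
```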
Note to self: See this Gemini conversation (which I think I've mostly summarised above)