ingest - recording timelines #1074
will review today/tmw
So per a few in-person discussions, it seems we're leaning towards:
There are some interesting edge cases to think about.

**Version does not match "latest" folder**

This is a little odd - it'd be a bit weird if we pull data that is older than some other version that we're currently using... but it seems like there could be some edge cases of re-archiving old data or something like that in which this would somehow come up. Regardless of how it would come up, it's an edge case that should be handled. What we do maybe depends a bit on whether the "latest" flag is supplied when running ingest.
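A minimal sketch of one way that decision could go, purely as an illustration - the function, argument, and flag names here (`latest_flag`, `pulled_version`, etc.) are hypothetical, not our actual CLI:

```python
# Hypothetical handling of a pulled version that is older than the version
# currently marked "latest" in the data lake. Versions are assumed to be
# date strings like "20240101", so string comparison works.
def resolve_version_conflict(pulled_version: str, latest_version: str, latest_flag: bool) -> str:
    """Decide what to do when the version we just looked up predates 'latest'."""
    if pulled_version >= latest_version:
        # Normal case: the source is as new or newer than what we have.
        return "archive"
    if latest_flag:
        # Caller explicitly asked for "latest" - failing loudly seems safer than
        # silently pointing "latest" at older data.
        raise ValueError(
            f"Pulled version {pulled_version!r} is older than current latest {latest_version!r}"
        )
    # No "latest" flag: archive under its own version folder, but leave the
    # existing "latest" pointer untouched.
    return "archive_without_updating_latest"
```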
**Code/Data changes**

Say we add a new data cleaning preprocessing step, but we keep checking the same socrata dataset that hasn't changed in years. I still think that our data should be immutable, but if the data changes, we shouldn't be throwing away our new data (which maybe has a new preprocessing step or something). It seems like maybe the flow here would be something like
The one p
Just in thinking about what I'd like to know about, say, a socrata dataset in edm-recipes, we've talked a bit about what events are logged/captured somewhere. For a socrata dataset, we could have a timeline like this:
- 20240105 - dataset created in edm-recipes
- 20240119 - we check the dataset's version/freshness again

We go to build on 1/21 - what bits of information would actually be useful here? Obviously the "actual" version - the date the rows were updated. This part currently is the most explicit, since it's the actual version. I also think that both 1/5 and 1/19 are useful dates - 1/5 maybe not for a production build specifically, but I think it's important to be able to quickly tell when a dataset entered our system - being able to look back programmatically at "snapshots" of our data lake just by looking at configs/logs would be nice. 1/19 is much more relevant - when preparing for a build, we want to know how recently we checked freshness.
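Just to make those "bits of information" concrete, something roughly like this (names are hypothetical, not an existing class) is what I'd want to be able to see per dataset at build time:

```python
from dataclasses import dataclass
from datetime import date, datetime

# Hypothetical summary of a dataset's status in our pipelines.
@dataclass
class DatasetStatus:
    name: str               # e.g. "dof_dtm"
    version: str            # the "actual" version - date rows were updated at the source
    created: date           # when this version first entered edm-recipes (the 1/5 above)
    last_checked: datetime  # when ingest last verified freshness against the source (the 1/19 above)
```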
There are a couple of ways to note these things. I'm going to assume we're in ingest, and not talk about library.
**s3 file structure is the record of events**
Assuming that we, at a minimum, check the version before pulling data, then archive raw data, then EITHER overwrite the existing version under `datasets/` or just pass without pushing anything to `datasets` (or running preprocessing steps, etc).

Basically, the thought here would be to take advantage of the presence of `raw_datasets` - the subfolders of any dataset there are just the timestamps of when `ingest` was invoked. And if there are some weird inconsistencies between versions, we have the log here.

Slightly open question then whether it would be best to overwrite what's under `datasets`. Overwriting ensures that if we had a "silent" change (the dataset changes but somehow the version does not), we actually have the latest, and that the config has the latest timestamp, since it's the most recent date. This also feels a little dangerous though - I like these datasets being immutable, and knowing that if dof_dtm "20240101" was used in ZTL and PLUTO, we know with 100% certainty that the exact same data was used in each.

The problem in the immutable case - we can't tell from the config of a dataset (and so we can't tell with our most performant utils, which don't need to scrape s3 folders) how recently it was checked.
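For reference, a rough sketch of the layout this approach leans on and the kind of s3 scraping we'd need to answer "when was this last checked?" - which is exactly the part that isn't cheap. The bucket name and prefixes here are assumptions for illustration, not necessarily our actual layout:

```python
# Assumed layout (hypothetical):
#   raw_datasets/dof_dtm/2024-01-19T06:00:00/...   <- one subfolder per ingest invocation
#   datasets/dof_dtm/20240101/...                  <- immutable, versioned outputs
import boto3

def last_checked(dataset: str, bucket: str = "edm-recipes") -> str:
    """Derive 'last checked' by listing raw_datasets subfolders (timestamps of ingest runs)."""
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=f"raw_datasets/{dataset}/", Delimiter="/")
    # Each CommonPrefix is one invocation folder; take the most recent timestamp.
    timestamps = [p["Prefix"].rstrip("/").split("/")[-1] for p in resp.get("CommonPrefixes", [])]
    return max(timestamps)  # ISO-style timestamps sort lexicographically
```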
So another option would be:

**update config file (`datasets/dof_dtm/20240101/config.json`)**

Basically, the idea here is that when starting the ingest process, for a dataset of this type (where we programmatically look up the version), if the version already exists, we do not overwrite the existing dataset but we do update the config.json. We could essentially add a log of timestamps here - we have the created date, and we have "verified" or "checked" (or "last_checked", etc.) dates, so that at build time, or in a dashboard, we can very readily see, in a single place, the full status of this dataset in terms of our automated pipelines.
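A minimal sketch of that config update, assuming a config.json shaped roughly like the comment below - the `checked` key, helper name, and bucket are made up for illustration, not our actual schema:

```python
import json
from datetime import datetime, timezone

import boto3

# Hypothetical config shape:
# { "name": "dof_dtm", "version": "20240101",
#   "created": "2024-01-05T06:00:00+00:00",
#   "checked": ["2024-01-12T06:00:00+00:00", "2024-01-19T06:00:00+00:00"] }

def record_check(dataset: str, version: str, bucket: str = "edm-recipes") -> None:
    """Append a 'checked' timestamp to an existing version's config.json instead of re-archiving."""
    s3 = boto3.client("s3")
    key = f"datasets/{dataset}/{version}/config.json"
    config = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    config.setdefault("checked", []).append(datetime.now(timezone.utc).isoformat())
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(config, indent=2).encode())
```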
In this case, we could either still archive raw data each time, or simply see that the version exists, update the config, and gracefully exit. I'm open to either - it seems nice to be able to confirm that data doesn't mutate, but maybe that's something we want to do either more automatically (verify consistency of data when running a process like this) or just fully manually (only archive data when we actually want to manually compare it to the production version).
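If we did want the "more automatic" consistency check, a rough sketch would be to compare a hash of the freshly pulled raw file against the previously archived one (paths and function names here are hypothetical):

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a local file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def data_unchanged(newly_pulled: Path, previously_archived: Path) -> bool:
    """True if the raw data we just pulled matches what we archived last time."""
    return file_digest(newly_pulled) == file_digest(previously_archived)

# Usage sketch: if data_unchanged(...), just update the config's "checked" log and exit;
# otherwise we've found a "silent" change worth surfacing (version unchanged, data changed).
```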
**logging**
Another way to do a similar thing would essentially be the first approach, with no overwrite, simply logging all these events and using utilities that interact with the database. There's a nice simplicity in this, but it's also more overhead. We have no "production" processes currently that rely on persisted db tables, and that's a significant change for us. It's one I think we want to move towards, but lightweight solutions are nice.
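For completeness, a lightweight sketch of what that logging might look like - the table and event names are invented, and in practice this would presumably be postgres rather than the sqlite used here just to keep the example self-contained:

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical event log - one row per ingest-related event for a dataset version.
conn = sqlite3.connect("ingest_events.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS ingest_events (
        dataset TEXT, version TEXT, event TEXT, timestamp TEXT
    )"""
)

def log_event(dataset: str, version: str, event: str) -> None:
    """Record events like 'checked', 'archived', or 'version_mismatch'."""
    conn.execute(
        "INSERT INTO ingest_events VALUES (?, ?, ?, ?)",
        (dataset, version, event, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

# e.g. log_event("dof_dtm", "20240101", "checked")
```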