ingest - recording timelines #1074
will review today/tmw
So per a few in-person discussions, it seems we're leaning towards:
There are some interesting edge cases to think about.

**Version does not match "latest" folder**

This is a little odd - it'd be a bit weird if we pull data that is older than some other version that we're currently using... but it seems like there could be some edge cases of re-archiving old data or something like that in which this would somehow come up. Regardless of how it would come up, it's an edge case that should be handled. What we do maybe depends a bit on whether the "latest" flag is supplied when running ingest.
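A minimal sketch of one way that decision could go, purely as an illustration - the function, argument, and flag names here (`latest_flag`, `pulled_version`, etc.) are hypothetical, not our actual CLI:

```python
# Hypothetical handling of a pulled version that is older than the version
# currently marked "latest" in the data lake. Versions are assumed to be
# date strings like "20240101", so string comparison works.
def resolve_version_conflict(pulled_version: str, latest_version: str, latest_flag: bool) -> str:
    """Decide what to do when the version we just looked up predates 'latest'."""
    if pulled_version >= latest_version:
        # Normal case: the source is as new or newer than what we have.
        return "archive"
    if latest_flag:
        # Caller explicitly asked for "latest" - failing loudly seems safer than
        # silently pointing "latest" at older data.
        raise ValueError(
            f"Pulled version {pulled_version!r} is older than current latest {latest_version!r}"
        )
    # No "latest" flag: archive under its own version folder, but leave the
    # existing "latest" pointer untouched.
    return "archive_without_updating_latest"
```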
**Code/Data changes**

Say we add a new data cleaning preprocessing step, but we keep checking the same socrata dataset that hasn't changed in years. I still think that our data should be immutable, but if the data changes, we shouldn't be throwing away our new data (which maybe has a new preprocessing step or something). It seems like maybe the flow here would be something like
The one p
Just in thinking about what I'd like to know about, say, a socrata dataset in edm-recipes, we've talked a bit about what events are logged/captured somewhere. For a socrata dataset, we could have a timeline like this:
- 20240105 - dataset created in edm-recipes
- 20240119 - we check the dataset's version/freshness again

We go to build on 1/21 - what bits of information would actually be useful here? Obviously the "actual" version - the date the rows were updated. This part currently is the most explicit, since it's the actual version. I also think that both 1/5 and 1/19 are useful dates - 1/5 maybe not for a production build specifically, but I think it's important to be able to quickly tell when a dataset entered our system - being able to look back programmatically at "snapshots" of our data lake just by looking at configs/logs would be nice. 1/19 is much more relevant - when preparing for a build, we want to know how recently we checked freshness.
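Just to make those "bits of information" concrete, something roughly like this (names are hypothetical, not an existing class) is what I'd want to be able to see per dataset at build time:

```python
from dataclasses import dataclass
from datetime import date, datetime

# Hypothetical summary of a dataset's status in our pipelines.
@dataclass
class DatasetStatus:
    name: str               # e.g. "dof_dtm"
    version: str            # the "actual" version - date rows were updated at the source
    created: date           # when this version first entered edm-recipes (the 1/5 above)
    last_checked: datetime  # when ingest last verified freshness against the source (the 1/19 above)
```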
There are a couple of ways to note these things. I'm going to assume we're in ingest, and not talk about library.
**s3 file structure is the record of events**
Assuming that we, at a minimum, check the version before pulling data, then archive raw data, then EITHER overwrite the existing version under `datasets/` or just pass without pushing anything to `datasets` (or running preprocessing steps, etc).

Basically, the thought here would be to take advantage of the presence of `raw_datasets` - the subfolders of any dataset there are just the timestamps of when `ingest` was invoked. And if there are some weird inconsistencies between versions, we have the log here.

Slightly open question then whether it would be best to overwrite what's under `datasets`. Overwriting ensures that if we had a "silent" change (the dataset changes but somehow the version does not), we actually have the latest, and that the config has the latest timestamp, since it's the most recent date. This also feels a little dangerous though - I like these datasets being immutable, and knowing that if dof_dtm "20240101" was used in ZTL and PLUTO, we know with 100% certainty that the exact same data was used in each.

The problem in the immutable case - we can't tell from the config of a dataset (and so we can't tell with our most performant utils, which don't need to scrape s3 folders) how recently it was checked.
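For reference, a rough sketch of the layout this approach leans on and the kind of s3 scraping we'd need to answer "when was this last checked?" - which is exactly the part that isn't cheap. The bucket name and prefixes here are assumptions for illustration, not necessarily our actual layout:

```python
# Assumed layout (hypothetical):
#   raw_datasets/dof_dtm/2024-01-19T06:00:00/...   <- one subfolder per ingest invocation
#   datasets/dof_dtm/20240101/...                  <- immutable, versioned outputs
import boto3

def last_checked(dataset: str, bucket: str = "edm-recipes") -> str:
    """Derive 'last checked' by listing raw_datasets subfolders (timestamps of ingest runs)."""
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=f"raw_datasets/{dataset}/", Delimiter="/")
    # Each CommonPrefix is one invocation folder; take the most recent timestamp.
    timestamps = [p["Prefix"].rstrip("/").split("/")[-1] for p in resp.get("CommonPrefixes", [])]
    return max(timestamps)  # ISO-style timestamps sort lexicographically
```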
So another option would be:

**update config file (`datasets/dof_dtm/20240101/config.json`)**

Basically, the idea here is that when starting the ingest process, for a dataset of this type (where we programmatically look up the version), if the version already exists, we do not overwrite the existing dataset but we do update the config.json. We could essentially add a log of timestamps here - we have the created date, and we have "verified" or "checked" (or "last_checked", etc.) dates, so that at build time, or in a dashboard, we can very readily see, in a single place, the full status of this dataset in terms of our automated pipelines.
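A minimal sketch of that config update, assuming a config.json shaped roughly like the comment below - the `checked` key, helper name, and bucket are made up for illustration, not our actual schema:

```python
import json
from datetime import datetime, timezone

import boto3

# Hypothetical config shape:
# { "name": "dof_dtm", "version": "20240101",
#   "created": "2024-01-05T06:00:00+00:00",
#   "checked": ["2024-01-12T06:00:00+00:00", "2024-01-19T06:00:00+00:00"] }

def record_check(dataset: str, version: str, bucket: str = "edm-recipes") -> None:
    """Append a 'checked' timestamp to an existing version's config.json instead of re-archiving."""
    s3 = boto3.client("s3")
    key = f"datasets/{dataset}/{version}/config.json"
    config = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    config.setdefault("checked", []).append(datetime.now(timezone.utc).isoformat())
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(config, indent=2).encode())
```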
In this case, we could either still archive raw data each time, or simply see that the version exists, update the config, and gracefully exit. I'm open to either - it seems nice to be able to confirm that data doesn't mutate, but maybe that's something we want to do either more automatically (verify consistency of data when running a process like this) or just fully manually (only archive data when we actually want to manually compare it to the production version).
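If we did want the "more automatic" consistency check, a rough sketch would be to compare a hash of the freshly pulled raw file against the previously archived one (paths and function names here are hypothetical):

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a local file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def data_unchanged(newly_pulled: Path, previously_archived: Path) -> bool:
    """True if the raw data we just pulled matches what we archived last time."""
    return file_digest(newly_pulled) == file_digest(previously_archived)

# Usage sketch: if data_unchanged(...), just update the config's "checked" log and exit;
# otherwise we've found a "silent" change worth surfacing (version unchanged, data changed).
```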
**logging**
Another way to do a similar thing would essentially be the first approach, with no overwrite, simply logging all these events and using utilities that interact with the database. There's a nice simplicity in this, but it's also more overhead. We have no "production" processes currently that rely on persisted db tables, and that's a significant change for us. It's one I think we want to move towards, but lightweight solutions are nice.
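For completeness, a lightweight sketch of what that logging might look like - the table and event names are invented, and in practice this would presumably be postgres rather than the sqlite used here just to keep the example self-contained:

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical event log - one row per ingest-related event for a dataset version.
conn = sqlite3.connect("ingest_events.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS ingest_events (
        dataset TEXT, version TEXT, event TEXT, timestamp TEXT
    )"""
)

def log_event(dataset: str, version: str, event: str) -> None:
    """Record events like 'checked', 'archived', or 'version_mismatch'."""
    conn.execute(
        "INSERT INTO ingest_events VALUES (?, ?, ?, ?)",
        (dataset, version, event, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

# e.g. log_event("dof_dtm", "20240101", "checked")
```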