Description
In building FacDB, I ran into a few issues which were trivial to fix, but nonetheless took time to diagnose. They might provide a nice framework to discuss potential improvements.
Here's a sampling of issues I encountered while building FacDB:
- USECODEs were corrupted on Bytes: they should have been 4-digit numeric strings, but were cast to numbers. E.g. "0211" -> 211.
  - This is an interesting case, because this is data that could benefit from our automated checks, but modifications after `edm-publishing` are out of our control.
- Columns were changed or removed in source data (e.g. `borough` in `dsny_electronicsdrop`).
- Columns had mixed data after being imported into our python dataframes (`zip_code` in `dsny_fooddrop` was sometimes being read in as a float in pandas).
- `dot_parking` ballooned in size and changed completely. It would have been nice to catch this prior to import to EDM Recipes, rather than noticing the discrepancy in the output, and then having to purge the bad data from S3 (I could easily have forgotten to do that).
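The USECODE and `zip_code` problems look like the same failure mode: type inference turning code-like columns into numbers. Here is a minimal sketch of guarding against that at read time, assuming the data arrives as CSV and is read with pandas. The sample data and column names are made up for illustration and are not the real source files.

```python
import io

import pandas as pd

# Illustrative sample only: one code with a leading zero, one missing zip code.
raw = io.StringIO("usecode,zip_code\n0211,10001\n0432,\n")

# Default inference: usecode is parsed as an integer (dropping the leading
# zero) and zip_code as a float, because the empty cell becomes NaN.
inferred = pd.read_csv(raw)

raw.seek(0)

# Declaring the code-like columns as strings keeps the values verbatim.
as_text = pd.read_csv(raw, dtype={"usecode": str, "zip_code": str})

print(inferred.dtypes)
print(as_text["usecode"].tolist())
```

This only helps at the points where we do the reading ourselves; it would not cover modifications made after `edm-publishing`.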
What I'm thinking for next steps: perhaps model out the required/important fields in FacDB for 1) a subset of recipes and 2) the output to `edm-publishing`. The output of this exercise would be some declarative format describing our expectations (probably fields with types modeled in yml), which we'd then use to write automations that detect and potentially coerce out-of-spec data into a usable format. Modeling might also be a nice way to indicate which columns actually matter at the periphery. E.g. at ingestion time, if we no longer have the `boro` field in `dsny_electronicsdrop`, does that actually matter?
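To make the idea concrete, here is a rough sketch of what declarative expectations plus a check could look like. Everything in it is hypothetical: the `EXPECTATIONS` structure, the `check_rows` helper, and which columns are marked required are assumptions for illustration, not a proposal for the final format. A plain dict stands in for the yml file to keep the sketch dependency-free.

```python
import re

# Hypothetical expectations; in practice these might be loaded from yml.
# Which fields are required here is an assumption, not a decision.
EXPECTATIONS = {
    "dsny_electronicsdrop": {
        "borough": {"required": False},
    },
    "dsny_fooddrop": {
        "zip_code": {"required": True, "pattern": r"\d{5}"},
    },
}


def check_rows(dataset, rows):
    """Return a list of human-readable problems found in `rows`.

    `rows` is a list of dicts, one per record, keyed by column name.
    """
    spec = EXPECTATIONS[dataset]
    problems = []
    columns = set(rows[0]) if rows else set()
    for name, rules in spec.items():
        if name not in columns:
            # Only complain about absent columns that are marked required.
            if rules.get("required"):
                problems.append(f"missing required column: {name}")
            continue
        pattern = rules.get("pattern")
        if pattern is None:
            continue
        for i, row in enumerate(rows):
            value = row[name]
            # Values must be strings that fully match the declared pattern.
            if not isinstance(value, str) or not re.fullmatch(pattern, value):
                problems.append(f"row {i}: {name}={value!r} is out of spec")
    return problems


# A float zip code (as pandas sometimes produced) is flagged as out of spec.
sample = [{"zip_code": "10001"}, {"zip_code": 10001.0}]
print(check_rows("dsny_fooddrop", sample))
```

Marking `borough` as not required is how the model would answer the "does that actually matter?" question: a missing optional column produces no problem, while a missing required one does.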
There are some neat libraries (e.g. Cerberus) that we might make use of, though I think it would be nice to get our feet wet before making a decision about them.
Thoughts, @fvankrieken, @damonmcc? Would love to hear about your pain points as well. If it'd be easier, we could just huddle and jot down some notes.