Improved Automated QA for Recipes and Products #116

In building FacDB, I ran into a few issues which were trivial to fix, but nonetheless took time to diagnose. They might provide a nice framework to discuss potential improvements.

Here's a sampling of issues I encountered building FacDB:
- `USECODE`s were corrupted on Bytes: they should have been 4-digit numeric strings, but were cast to numbers, e.g. "0211" -> 211.
  - This is an interesting case, because this is data that could benefit from our automated checks, but modifications after `edm-publishing` are out of our control.
- Columns were changed or removed in source data (e.g. `borough` in `dsny_electronicsdrop`)
- Columns had mixed data types after being imported into our Python dataframes (`zip_code` in `dsny_fooddrop` was sometimes being read in as a float in pandas)
- `dot_parking` ballooned in size and changed completely. It would have been nice to catch this prior to import to EDM Recipes, rather than noticing the discrepancy in the output, and then having to purge the bad data from S3 (I could easily have forgotten to do that).
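
To make the column-drift and `dot_parking` cases concrete, here is a rough sketch of a pre-import check that compares an incoming snapshot against the last accepted one. Everything in it is an assumption for illustration: the function names, the in-memory CSV strings, and the `max_ratio` threshold are hypothetical, not anything that exists in our repos today.

```python
import csv
import io


def load_snapshot(text: str) -> tuple[list[str], int]:
    """Return (column names, row count) for a CSV snapshot held in a string."""
    reader = csv.DictReader(io.StringIO(text))
    row_count = sum(1 for _ in reader)
    return list(reader.fieldnames or []), row_count


def compare_snapshots(previous: str, incoming: str, max_ratio: float = 2.0) -> list[str]:
    """List drift warnings between the last accepted snapshot and a new one."""
    prev_cols, prev_rows = load_snapshot(previous)
    new_cols, new_rows = load_snapshot(incoming)
    warnings = []

    removed = sorted(set(prev_cols) - set(new_cols))
    if removed:
        warnings.append(f"columns removed: {removed}")
    added = sorted(set(new_cols) - set(prev_cols))
    if added:
        warnings.append(f"columns added: {added}")

    # Flag large swings in either direction; max_ratio is an arbitrary threshold.
    if prev_rows and (new_rows > prev_rows * max_ratio or new_rows * max_ratio < prev_rows):
        warnings.append(f"row count changed from {prev_rows} to {new_rows}")
    return warnings


# Made-up data: a column disappears and the row count more than doubles.
previous = "name,borough\nA,BK\nB,QN\n"
incoming = "name\nA\nB\nC\nD\nE\n"
warnings = compare_snapshots(previous, incoming)
```

A check like this would run before anything is written to S3, so a bad snapshot gets flagged instead of needing to be purged afterwards.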

What I'm thinking for next steps: perhaps model out the required/important fields in FacDB 1) for a subset of recipes and 2) for the output to edm-publishing. The output of this exercise would be a declarative description of our expectations (probably fields with types modeled in yml), which we'd then use to write automations that detect and potentially coerce out-of-spec data into a usable format. Modeling might also be a nice way to indicate which columns actually matter at the periphery. E.g. at ingestion time, if we no longer have the `boro` field in `dsny_electronicsdrop`, does that actually matter?
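
As a strawman for what the modeled expectations and the coercion step might look like, here is a minimal sketch. The spec is written as a Python dict standing in for the yml file, and the keys (`required`, `zero_pad`, `pattern`), the lowercase field names, and `check_record` are all hypothetical. It also assumes the only damage to a value was a numeric cast that dropped leading zeros, which would not hold in every case.

```python
import re

# Hypothetical spec; in practice this would probably be loaded from yml.
SPEC = {
    "usecode": {"required": True, "zero_pad": 4, "pattern": r"\d{4}"},
    "zip_code": {"required": False, "zero_pad": 5, "pattern": r"\d{5}"},
}


def check_record(record: dict, spec: dict) -> tuple[dict, list[str]]:
    """Coerce fields toward the spec and collect errors for what can't be fixed."""
    coerced = dict(record)
    errors = []
    for field, rules in spec.items():
        value = record.get(field)
        if value is None:
            if rules.get("required"):
                errors.append(f"{field}: required field is missing")
            continue
        # Undo numeric casts: 211 -> "211", 11201.0 -> "11201".
        if isinstance(value, float) and value.is_integer():
            value = int(value)
        # Restore leading zeros, then validate the result against the pattern.
        text = str(value).zfill(rules.get("zero_pad", 0))
        if re.fullmatch(rules["pattern"], text):
            coerced[field] = text
        else:
            errors.append(f"{field}: {value!r} does not match {rules['pattern']}")
    return coerced, errors


fixed, problems = check_record({"usecode": 211, "zip_code": 11201.0}, SPEC)
```

Marking a field as `required` (or not) in the spec is also where we'd record whether a missing column actually matters for a given dataset.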

There are some neat libraries (e.g. Cerberus) that we might make use of, though I think it would be nice to get our feet wet before making a decision about them.

Thoughts, @fvankrieken, @damonmcc? Would love to hear about your pain points as well. If it'd be easier, we could just huddle and jot down some notes.
