Improved Automated QA for Recipes and Products #116

@alexrichey

Description

In building FacDB, I ran into a few issues which were trivial to fix, but nonetheless took time to diagnose. They might provide a nice framework to discuss potential improvements.

Here's a sampling of issues I encountered while building FacDB:

  • USECODEs were corrupted on Bytes: they should have been four-digit numeric strings, but were cast to numbers, which dropped the leading zeros. E.g. "0211" -> 211.
    • This is an interesting case, because this is data that could benefit from our automated checks, but modifications after edm-publishing are out of our control.
  • Columns were changed or removed in source data (e.g. borough in dsny_electronicsdrop)
  • Columns had mixed data types after being imported into our Python dataframes (zip_code in dsny_fooddrop was sometimes being read in as a float in pandas)
  • dot_parking ballooned in size and changed completely. It would have been nice to catch this prior to import to EDM Recipes, rather than noticing the discrepancy in the output, and then having to purge the bad data from S3 (I could easily have forgotten to do that).
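
As a rough illustration of the kind of pre-import check that could catch removed columns or a dataset ballooning in size, here is a minimal sketch using only the standard library. The function name `compare_to_previous`, its parameters, and the 2x growth threshold are all hypothetical and not part of our current tooling. The reader returns every value as a string, so leading zeros survive; in pandas, passing `dtype=str` to `read_csv` has a similar effect.

```python
import csv
import io


def compare_to_previous(new_csv, prev_columns, prev_row_count, max_change=2.0):
    """Return warnings about drift between a new source file and the last version.

    new_csv: CSV text of the new source data (hypothetical input).
    prev_columns: set of column names seen in the previous version.
    prev_row_count: number of data rows in the previous version.
    max_change: assumed threshold; warn if rows grow or shrink by this factor.
    """
    reader = csv.DictReader(io.StringIO(new_csv))
    rows = list(reader)  # all values are kept as strings
    new_columns = set(reader.fieldnames or [])

    warnings = []
    missing = prev_columns - new_columns
    if missing:
        warnings.append(f"columns removed: {sorted(missing)}")
    added = new_columns - prev_columns
    if added:
        warnings.append(f"columns added: {sorted(added)}")

    if prev_row_count:
        if len(rows) > prev_row_count * max_change:
            warnings.append(f"row count grew from {prev_row_count} to {len(rows)}")
        elif len(rows) < prev_row_count / max_change:
            warnings.append(f"row count shrank from {prev_row_count} to {len(rows)}")
    return warnings
```

Running something like this before archiving to S3 would mean a failed check blocks the import, rather than the discrepancy being noticed later in the output.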

What I'm thinking for next steps: perhaps model out the required/important fields in FacDB 1) for a subset of recipes and 2) for the output to edm-publishing. The output of this exercise would be a declarative description of our expectations (probably fields with types, modeled in YAML), which we'd then use to write automations that detect out-of-spec data and potentially coerce it into a usable format. Modeling might also be a nice way to indicate which columns actually matter at the periphery. E.g. at ingestion time, if we no longer have the boro field in dsny_electronicsdrop, does that actually matter?
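
To make the idea concrete, here is a minimal sketch of what "declarative expectations plus detection/coercion" might look like. Everything in it is an assumption for illustration: the `SPEC` layout, the `"code"` type, and the field names `usecode`, `zip_code`, and `borough` stand in for whatever we'd actually model, and the dict stands in for a YAML file. Note that zero-padding a number back to a fixed width assumes the only corruption was dropped leading zeros.

```python
import re

# Hypothetical spec; in practice this could be loaded from a YAML file.
SPEC = {
    "usecode": {"type": "code", "width": 4, "required": True},
    "zip_code": {"type": "code", "width": 5, "required": True},
    "borough": {"type": "text", "required": False},
}


def coerce_code(value, width):
    """Coerce values like 211 or 11201.0 to a zero-padded digit string, else None."""
    if isinstance(value, float) and value.is_integer():
        value = int(value)  # e.g. 11201.0 -> 11201
    text = str(value).strip()
    if not re.fullmatch(r"[0-9]+", text) or len(text) > width:
        return None
    return text.zfill(width)


def validate_record(record, spec=SPEC):
    """Return (clean_record, errors) for one row, given the declarative spec."""
    clean, errors = {}, []
    for field, rule in spec.items():
        value = record.get(field)
        if value is None or value == "":
            if rule["required"]:
                errors.append(f"{field}: missing")
            continue
        if rule["type"] == "code":
            coerced = coerce_code(value, rule["width"])
            if coerced is None:
                errors.append(f"{field}: {value!r} is not a {rule['width']}-digit code")
                continue
            value = coerced
        else:
            value = str(value)
        clean[field] = value
    return clean, errors


clean, errors = validate_record({"usecode": 211, "zip_code": 11201.0})
print(clean, errors)
```

Marking a field as `required: False` is one way the spec could record that a column doesn't matter at the periphery, so its disappearance is not treated as an error.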

There are some neat libraries (e.g. Cerberus) that we might make use of, though I think it would be nice to get our feet wet before making a decision about them.

Thoughts, @fvankrieken, @damonmcc? Would love to hear about your pain points as well. If it'd be easier, we could just huddle and jot down some notes.

Metadata

Assignees: No one assigned
Labels: discussion (Discussion or information request)
Status: Done
Milestone: No milestone