External input validator #1243
Conversation
```python
class CityOwnedPropertiesOutputValidator(BaseValidator):
```
I think these should be going on the CityOwnedPropertiesInputValidator since this is defining a schema for the incoming data. The ...OutputValidators are used to monitor the accumulated, constructed dataset in the pipeline after each service is called, so they should have an evolving schema that depends on the particular columns that are being changed or added progressively by the services.
I see. Where exactly should CityOwnedPropertiesInputValidator be called? I see in your framework setup you installed decorators with OutputValidators for each data_util, but InputValidators aren't being called.
Yeah, the input validators need to be called slightly differently since the data is coming in through the data loader. The input validators should be passed into the respective loader class for each of the data services seen in the class definition here. Then validation is called whenever data loading occurs here, and the result is passed out.
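A hypothetical sketch of that wiring, to illustrate the call pattern being described; the real loader classes in `src/classes/loaders.py` will differ, and `BaseLoader`, `fetch`, and `city_owned_input_validator` here are stand-in names:

```python
from typing import Callable, Optional

class BaseLoader:
    """Illustrative loader that accepts a per-service input validator."""

    def __init__(self, input_validator: Optional[Callable] = None):
        # Each data service's loader receives its own input validator.
        self.input_validator = input_validator

    def fetch(self):
        # Placeholder for the real data fetch (GIS query, CSV download, etc.).
        return [{"opa_id": "123456789", "agency": "Land Bank"}]

    def load_data(self):
        data = self.fetch()
        # Validation happens as part of loading; the result is passed out
        # alongside the data rather than swallowed.
        errors = self.input_validator(data) if self.input_validator else []
        return data, errors

def city_owned_input_validator(records):
    # Toy check standing in for a real schema validation: flag records
    # that are missing the opa_id key.
    return [r for r in records if "opa_id" not in r]

loader = BaseLoader(input_validator=city_owned_input_validator)
data, errors = loader.load_data()
```

The key point is only that validation runs inside `load_data` and its result is surfaced to the caller, not where the toy check lives.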
```python
pin: Optional[pa.typing.Series[int]]
mapreg_1: pa.typing.Series[str]
agency: pa.typing.Series[str]
opabrt: Optional[pa.typing.Series[str]]
```
The input validation is being called after several normalizations have occurred on the incoming data in src/classes/loaders.py (in the normalize_columns, standardize_opa functions, etc.), so the schema here should already reflect some of those earlier changes; for instance, opabrt should already have been renamed to opa_id. We'll need to take that into account when defining the incoming schema for the different services, and I would double-check to make sure those match.
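A small sketch of why the order matters, with a stand-in for normalize_columns (the rename mapping and column names here are assumptions for illustration, not the project's actual normalization):

```python
import pandas as pd

# Assumed rename mapping, standing in for what the real
# normalize_columns/standardize_opa in src/classes/loaders.py do.
RENAMES = {"opabrt": "opa_id"}

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Lowercase/strip the raw source column names first...
    df.columns = [c.strip().lower() for c in df.columns]
    # ...then apply the project-specific renames.
    return df.rename(columns=RENAMES)

raw = pd.DataFrame({"OPABRT": ["123456789"], "Agency": ["Land Bank"]})
normalized = normalize_columns(raw)
# Because validation runs after normalization, the input schema must
# describe `normalized` (expect "opa_id"), not `raw` (which had "OPABRT").
```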
```python
self.errors.append(
    f"'sideyardeligible' column must contain only 'Yes' or 'No', but got: {actual}"
)
```
I think a lot of these checks can already be accommodated by the pandera schema checking when we define the class schema, including type-checking strings, nullable constraints, and various statistical checks. Pandera has a ton of options in its documentation and is really powerful, so I would default to looking there first to see if it can accomplish the desired check before adding it to the custom validation.
Anything that doesn't comply with the schema's built-in validation should be caught by this step in the BaseValidator.
data/src/validation/council_dists.py (outdated)
```python
if len(gdf) != 10:
    self.errors.append(f"Expected 10 council district records, got {len(gdf)}")

# Check "district" values are exactly "1" through "10"
```
Same here to see if some of this can be accomplished in pandera.
@nlebovits @cfreedman I added updated schemas. I'm not sure what the expected schema for the output of each pipeline stage should be, but I put my best guess given a print of the df head and columns. Please let me know if there are specific checks I should implement in the output schemas that aren't currently in there. I basically moved all the custom_validator functionality into the schemas themselves.
@gabecano4308 @cfreedman is this ready to go?
Can we get that formatted with
External Input Validator -- Council Districts and City-Owned Properties
Checklist:
Before submitting your PR, please confirm that you have done the following:
- This PR is against the staging branch, NOT against main

Description
As part of the larger data pipeline validation implementation, this PR creates custom schemas/validators for external data sources, namely council districts and city-owned properties. When these two stages of the data pipeline run, their base validation functions will call the custom validations and schema checks that make up the content of these two files.
Note that the geometry validator is already being called by the base validator, so we don't need to include that in the new code.
Related Issue(s)
This PR addresses issues #1237 and #1238
Type of change
How Has This Been Tested?
I reran the pipeline with the changes to ensure no new errors or warnings were displayed for the affected pipeline stages. I also created a local script with fake data to confirm the validations were flagging errors as expected. A dedicated directory for unit tests would be ideal, but I'm unsure whether we want to commit to doing that for every validator.