Skip to content

Conversation

@gabecano4308
Copy link
Contributor

@gabecano4308 gabecano4308 commented Jun 16, 2025

External Input Validator -- Council Districts and City-Owned Properties

Checklist:

Before submitting your PR, please confirm that you have done the following:

  • I have opened my PR against the staging branch, NOT against main
  • I've run the relevant formatting and linting tools listed in the setup docs
  • I have commented hard-to-understand areas in my code
  • I've reviewed any merge conflicts to make sure they are resolved
  • My changes generate no new warnings

Description

As part of the larger data pipeline validation implementation, this PR creates custom schemas/validators for external data sources, namely council districts and city owned properties. When running these two stages of the data pipeline, their base validation functions will call their custom validations and schema checks, which make up the content of these two files.
Note that geometry validator is already being called by the base validator so we don't need to include that in the new code.

Related Issue(s)

This PR addresses issues #1237 and #1238

Type of change

  • Bug fix
  • New feature
  • Breaking change

How Has This Been Tested?

I reran the pipeline with changes to ensure no new errors or warnings were being displayed for the affected pipeline stages. I also created a local script with fake data to ensure the validations were flagging errors as expected. A dedicated directory for unit tests would be ideal, but unsure if we want to commit to do that for every validator.

@vercel
Copy link

vercel bot commented Jun 16, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
vacant-lots-proj ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jun 27, 2025 9:31pm

pass


class CityOwnedPropertiesOutputValidator(BaseValidator):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these should be going on the CityOwnedPropertiesInputValidator since this is defining a schema for the incoming data. The ...OutputValidators are used to monitor the accumulated, constructed dataset in the pipeline after each service is called, so they should have an evolving schema that depends on the particular columns that are being changed or added progressively by the services.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Where exactly should CityOwnedPropertiesInputValidator be called? I see in your framework setup you installed decorators with OutputValidators for each data_util, but InputValidators aren't being called.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea the input validators need to be called slightly differently since they're coming in on the data loader. The input validators should be passed in to the respective loader class for each of the data services seen in the class definition here. Then the validation is called whenever data loading occurs here and the result is passed out.

pin: Optional[pa.typing.Series[int]]
mapreg_1: pa.typing.Series[str]
agency: pa.typing.Series[str]
opabrt: Optional[pa.typing.Series[str]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The input validation is being called after several normalizations have occurred on the incoming data in src/classes/loaders.py in the normalize_columns, standardize_opa functions etc. so, for instance, the schema here should already reflect some of the those earlier changes like opabrt should already be substituted to opa_id. We'll need to be careful to take that into account when defining the incoming schema for the different services, and I would be careful to double-check that to make sure those match.

self.errors.append(
f"'sideyardeligible' column must contain only 'Yes' or 'No', but got: {actual}"
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a lot of these checks should already/can be accommodated by the pandera schema checking when we define the class schema, including the type checking strings, nullable constraints, different statistical checks. It has a ton of options in the documentation and is really powerful, so I would always default to looking there so see if it can accomplish that desired check first before adding it into the custom validation.

Anything that doesn't comply with all the schema included validation should be caught by this step in the BaseValidator.

if len(gdf) != 10:
self.errors.append(f"Expected 10 council district records, got {len(gdf)}")

# Check "district" values are exactly "1" through "10"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here to see if some of this can be accomplished in pandera.

@gabecano4308
Copy link
Contributor Author

gabecano4308 commented Jun 20, 2025

@nlebovits @cfreedman I added updated schemas. I'm not sure what the expected schema for the output of each pipeline stage should be, but I put my best guess given a print of the df head and columns. Please let me know if there are specific checks I should implement in the output schemas that aren't currently in there.

I basically moved all the custom_validator functionality into the schemas themselves.

@nlebovits
Copy link
Collaborator

@gabecano4308 @cfreedman is this ready to go?

@cfreedman
Copy link
Contributor

Can we get that formatted with docker compose --rm vacant-lots-proj sh -c ruff format --check. Also double-check the tests after you pull and resolve with staging and what I just PRed. I think they should resolve, but just check that the schema is being defined correctly as an instance of DataFrameSchema like the base validator requires.

…witched some data types in schema that were off.
@gabecano4308 gabecano4308 force-pushed the external-input-validator branch from a366112 to c53e739 Compare June 27, 2025 22:18
@cfreedman cfreedman self-requested a review June 29, 2025 13:13
@cfreedman cfreedman merged commit 331c6c5 into CodeForPhilly:staging Jun 29, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants