External input validator #1243
Conversation
```python
class CityOwnedPropertiesOutputValidator(BaseValidator):
```
I think these should be going on the CityOwnedPropertiesInputValidator since this is defining a schema for the incoming data. The ...OutputValidators are used to monitor the accumulated, constructed dataset in the pipeline after each service is called, so they should have an evolving schema that depends on the particular columns that are being changed or added progressively by the services.
I see. Where exactly should CityOwnedPropertiesInputValidator be called? I see in your framework setup you installed decorators with OutputValidators for each data_util, but InputValidators aren't being called.
Yeah, the input validators need to be called slightly differently since the data is coming in through the data loader. The input validators should be passed into the respective loader class for each of the data services seen in the class definition here. Then validation is called whenever data loading occurs here, and the result is passed out.
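A hypothetical sketch of that wiring, to illustrate the call pattern being described; the real loader classes in `src/classes/loaders.py` will differ, and `BaseLoader`, `fetch`, and `city_owned_input_validator` here are stand-in names:

```python
from typing import Callable, Optional

class BaseLoader:
    """Illustrative loader that accepts a per-service input validator."""

    def __init__(self, input_validator: Optional[Callable] = None):
        # Each data service's loader receives its own input validator.
        self.input_validator = input_validator

    def fetch(self):
        # Placeholder for the real data fetch (GIS query, CSV download, etc.).
        return [{"opa_id": "123456789", "agency": "Land Bank"}]

    def load_data(self):
        data = self.fetch()
        # Validation happens as part of loading; the result is passed out
        # alongside the data rather than swallowed.
        errors = self.input_validator(data) if self.input_validator else []
        return data, errors

def city_owned_input_validator(records):
    # Toy check standing in for a real schema validation: flag records
    # that are missing the opa_id key.
    return [r for r in records if "opa_id" not in r]

loader = BaseLoader(input_validator=city_owned_input_validator)
data, errors = loader.load_data()
```

The key point is only that validation runs inside `load_data` and its result is surfaced to the caller, not where the toy check lives.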
```python
pin: Optional[pa.typing.Series[int]]
mapreg_1: pa.typing.Series[str]
agency: pa.typing.Series[str]
opabrt: Optional[pa.typing.Series[str]]
```
The input validation is being called after several normalizations have occurred on the incoming data in src/classes/loaders.py (in the normalize_columns, standardize_opa functions, etc.), so the schema here should already reflect some of those earlier changes; for instance, opabrt should already have been renamed to opa_id. We'll need to take that into account when defining the incoming schema for the different services, and I would double-check to make sure those match.
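A small sketch of why the order matters, with a stand-in for normalize_columns (the rename mapping and column names here are assumptions for illustration, not the project's actual normalization):

```python
import pandas as pd

# Assumed rename mapping, standing in for what the real
# normalize_columns/standardize_opa in src/classes/loaders.py do.
RENAMES = {"opabrt": "opa_id"}

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Lowercase/strip the raw source column names first...
    df.columns = [c.strip().lower() for c in df.columns]
    # ...then apply the project-specific renames.
    return df.rename(columns=RENAMES)

raw = pd.DataFrame({"OPABRT": ["123456789"], "Agency": ["Land Bank"]})
normalized = normalize_columns(raw)
# Because validation runs after normalization, the input schema must
# describe `normalized` (expect "opa_id"), not `raw` (which had "OPABRT").
```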
```python
self.errors.append(
    f"'sideyardeligible' column must contain only 'Yes' or 'No', but got: {actual}"
)
```
I think a lot of these checks can already be accommodated by the pandera schema checking when we define the class schema, including type-checking strings, nullable constraints, and various statistical checks. Pandera has a ton of options in its documentation and is really powerful, so I would default to looking there first to see if it can accomplish the desired check before adding it to the custom validation.
Anything that doesn't comply with the schema's built-in validation should be caught by this step in the BaseValidator.
data/src/validation/council_dists.py (outdated)
```python
if len(gdf) != 10:
    self.errors.append(f"Expected 10 council district records, got {len(gdf)}")

# Check "district" values are exactly "1" through "10"
```
Same here to see if some of this can be accomplished in pandera.
@nlebovits @cfreedman I added updated schemas. I'm not sure what the expected schema for the output of each pipeline stage should be, but I put my best guess given a print of the df head and columns. Please let me know if there are specific checks I should implement in the output schemas that aren't currently in there. I basically moved all the custom_validator functionality into the schemas themselves.
@gabecano4308 @cfreedman is this ready to go?
Can we get that formatted with
External Input Validator -- Council Districts and City-Owned Properties
Checklist:
Before submitting your PR, please confirm that you have done the following:
- This PR is against the staging branch, NOT against main

Description
As part of the larger data pipeline validation implementation, this PR creates custom schemas/validators for external data sources, namely council districts and city-owned properties. When these two stages of the data pipeline run, their base validation functions will call the custom validations and schema checks that make up the content of these two files.
Note that the geometry validator is already being called by the base validator, so we don't need to include that in the new code.
Related Issue(s)
This PR addresses issues #1237 and #1238
Type of change
How Has This Been Tested?
I reran the pipeline with the changes to ensure no new errors or warnings were displayed for the affected pipeline stages. I also created a local script with fake data to confirm the validations were flagging errors as expected. A dedicated directory for unit tests would be ideal, but I'm unsure whether we want to commit to doing that for every validator.