feat: validate PDP uploads with repo schemas, write normalized output to validated/ #202

chapmanhk · 2026-02-06T18:46:32Z

feat: validate PDP uploads with repo schemas, write normalized output to validated/

changes

Validation flow
- PDP (single STUDENT or COURSE): Uses edvise as the single source of truth. Flow: encoding sniff → resolve input to a path (temp file if file-like) → read_raw_pdp_cohort_data or read_raw_pdp_course_data (same steps as the pipeline: CSV read, column mapping such as study_id → student_id, duplicate handling for course, datetime format tries). This path has no JSON-based header pass and no API normalizers; the edvise schemas (RawPDPCohortDataSchema / RawPDPCourseDataSchema) validate the DataFrame. Pandera SchemaErrors from edvise are converted to HardValidationError for the API formatter. Uploads must include every column the repo schema requires (e.g. per-year credit columns for cohort); a single credit column is not expanded into per-year columns.
- Edvise: Uses JSON-based validation only (different column shape; no repo schema).
- Custom institutions: Unchanged; JSON extension merge + Pandera.
- Errors: Pandera failures are converted to HardValidationError with normalized failure_cases; the existing formatter and PDP check messages produce the user-facing 400 responses.
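The "resolve input to a path (temp file if file-like)" step above could look roughly like the sketch below. It is illustrative only: `path_for_read` and the `Src` alias are hypothetical names, not the actual `_path_for_edvise_read` from this PR, and it assumes file-like input is an `io.StringIO`.

```python
import contextlib
import io
import os
import tempfile
from typing import Iterator, Union

# Assumed shape of the accepted input types (hypothetical alias).
Src = Union[str, os.PathLike, io.StringIO]


@contextlib.contextmanager
def path_for_read(src: Src) -> Iterator[str]:
    """Yield a filesystem path for src, removing any temp file afterwards."""
    if isinstance(src, (str, os.PathLike)):
        # Already a path: pass it through, nothing to clean up.
        yield os.fspath(src)
        return
    # File-like: spill the contents to a temp CSV so a path-based reader can open it.
    tmp = tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", encoding="utf-8", newline="", delete=False
    )
    try:
        with tmp:
            tmp.write(src.getvalue())
        yield tmp.name
    finally:
        # Runs on success and on reader failure alike.
        os.unlink(tmp.name)
```

A context manager keeps the cleanup next to the creation, so a read failure inside the `with` block still removes the temp file.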
GCS on success
- Copy blob from unvalidated/{file_name} → raw/{file_name} (archive of original).
- Write normalized DataFrame (canonical columns, repo-shaped for PDP) to validated/{file_name} as UTF-8 CSV.
- Delete blob from unvalidated/. Downstream consumes validated/ only; raw/ is kept for record.
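As a rough illustration of the ordering above, the sketch below replays the three steps against a plain dict standing in for the bucket. The function name and the dict-based "bucket" are assumptions made for the example; the real code uses GCS blobs via gcsutil._archive_raw_and_write_validated.

```python
from typing import Dict


def archive_and_write_validated(
    bucket: Dict[str, bytes], file_name: str, normalized_csv: str
) -> None:
    """Move an upload out of unvalidated/ once validation has succeeded."""
    src_key = f"unvalidated/{file_name}"
    original = bucket[src_key]  # KeyError if the upload is missing
    # 1. Archive the original bytes unchanged.
    bucket[f"raw/{file_name}"] = original
    # 2. Write the normalized output as UTF-8 CSV.
    bucket[f"validated/{file_name}"] = normalized_csv.encode("utf-8")
    # 3. Delete the source last.
    del bucket[src_key]


# Example: header renamed during normalization (study_id -> student_id).
bucket = {"unvalidated/cohort.csv": b"study_id\n1\n"}
archive_and_write_validated(bucket, "cohort.csv", "student_id\n1\n")
```

Deleting the source only after both writes means that, in this sketch at least, a failure in step 1 or 2 leaves the upload in unvalidated/ for a retry.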
Refactors and quality
- validation.py: Added PDP-path helpers _path_for_edvise_read (path vs file-like, temp cleanup), _read_pdp_course_edvise (datetime formats + converter fallback for older edvise), and _validate_pdp_with_edvise_read. Extracted JSON-path helpers _header_missing_and_extra, _get_csv_read_kwargs, and _validate_optional_columns_json, so validate_dataset and related functions are now under 50 lines. Src type extended with io.StringIO for file-like support.
- gcsutil.py: Extracted _archive_raw_and_write_validated; added type hints to rename_file; validate_file is now under 50 lines.
- Tests added: PDP path: validation_pdp_read_path_test.py (validate_file_reader PDP routing, _path_for_edvise_read path/file-like/cleanup/read-failure, _validate_pdp_with_edvise_read success/SchemaErrors→HardValidationError/invalid model set/file-like, _read_pdp_course_edvise success/all-fail/TypeError school_type fallback). Existing: PDP rename_pdp_dataframe_to_repo_schema (program_of_study fallback, course unchanged), validate_dataframe_with_edvise_schema (empty raises, invalid raises HardValidationError); validation CSV read failure → HardValidationError; gcsutil ValueError/UnicodeError propagation; data_test assertion that validate_file is called with institution_identifier for Edvise.
- Schema: pdp_schema_extension.json — special_program is string type (required).
- Ruff format applied; mypy fixes in validation_pdp_edvise and tests (Optional model_list, cast for schema return, dict annotations).
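The "datetime format tries" behaviour mentioned for _read_pdp_course_edvise can be pictured as a loop over candidate formats that returns the first successful read and fails only when every format has been rejected. The sketch below is a generic stand-in: the helper name and the format strings are assumptions, not the formats edvise actually tries.

```python
from datetime import datetime
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def read_with_format_tries(read: Callable[[str], T], formats: Sequence[str]) -> T:
    """Call read(fmt) for each format in order and return the first success."""
    last_error: Optional[Exception] = None
    for fmt in formats:
        try:
            return read(fmt)
        except ValueError as exc:
            # Remember the failure and move on to the next candidate format.
            last_error = exc
    raise ValueError(f"no datetime format matched: {list(formats)}") from last_error


def parse_begin_date(fmt: str) -> datetime:
    # Stand-in for a CSV read that parses date columns with the given format.
    return datetime.strptime("2024-08-26", fmt)


parsed = read_with_format_tries(parse_begin_date, ["%m/%d/%Y", "%Y%m%d", "%Y-%m-%d"])
```

In the API, the final "all formats failed" error would be the point to surface a HardValidationError to the caller.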

context

PDP alignment: PDP uploads are validated with the same rules as the edvise repo (cohort/course schemas) so behavior is consistent with pipelines and audits. The API calls edvise’s read layer (read_raw_pdp_cohort_data / read_raw_pdp_course_data) directly so there is a single source of truth; no API-side normalizers or header pass for PDP.
Normalized output: Writing the normalized (canonical-column, repo-shaped) DataFrame to validated/ lets downstream use a consistent schema without re-normalizing; keeping the original in raw/ preserves an audit trail.
Edvise: Edvise has a different column shape and stays on JSON-based validation only; only PDP uses the repo schemas.

SQL: PDP extension schema update

Before or after deploying, ensure the PDP extension in schema_registry is the one used by this branch (e.g. from src/webapp/validation_schemas/pdp_schema_extension.json).

Update the existing active PDP row (the single row with is_pdp = 1 and is_active = 1):

UPDATE schema_registry
SET json_doc = :json_doc
WHERE is_pdp = 1 AND is_active = 1
LIMIT 1;

Bind :json_doc to the full PDP extension JSON (e.g. contents of pdp_schema_extension.json).
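For anyone running the update from Python, a minimal sketch of binding :json_doc as a named parameter follows. It uses an in-memory SQLite table purely as a stand-in for the real database; the id column and the placeholder JSON are assumptions, and LIMIT 1 is omitted because stock SQLite builds may not accept LIMIT on UPDATE.

```python
import json
import sqlite3

# In-memory stand-in for schema_registry; only is_pdp, is_active and json_doc
# come from the PR, the id column is assumed for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE schema_registry ("
    "id INTEGER PRIMARY KEY, is_pdp INTEGER, is_active INTEGER, json_doc TEXT)"
)
conn.execute(
    "INSERT INTO schema_registry (is_pdp, is_active, json_doc) VALUES (1, 1, '{}')"
)

# Placeholder document; in practice this would be the parsed contents of
# pdp_schema_extension.json.
extension = {"placeholder": True}

cur = conn.execute(
    "UPDATE schema_registry SET json_doc = :json_doc "
    "WHERE is_pdp = 1 AND is_active = 1",
    {"json_doc": json.dumps(extension)},
)
conn.commit()
updated_rows = cur.rowcount

stored = json.loads(
    conn.execute(
        "SELECT json_doc FROM schema_registry WHERE is_pdp = 1 AND is_active = 1"
    ).fetchone()[0]
)
```

Checking that updated_rows is exactly 1 is a cheap guard against the registry holding zero or several active PDP rows.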

questions

None.

To see the specific tasks where the Asana app for GitHub is being used, see below:
- https://app.asana.com/0/0/1213120059770216

…o alignment - Add PDP edvise schema validation path (validation_pdp_edvise) - Add Edvise-to-PDP normalization (validation_edvise_normalize) - Integrate repo schemas into validation pipeline and error formatter - Update pdp_schema_extension and lockfile; add tests Co-authored-by: Cursor <cursoragent@cursor.com>

… raw/ - On validation success: archive original to raw/{filename}, write normalized (canonical columns, coerced dtypes) DataFrame to validated/{filename}, delete from unvalidated/ - Validation layer always returns normalized_df on success; storage serializes to UTF-8 CSV and uploads to validated/ - Add input validation and helpers in gcsutil (under 50 lines); catch specific exceptions; TYPE_CHECKING for HardValidationError in validation_pdp_edvise - Add gcsutil_test.py: validate_file input/error/success paths, _run_validation_and_get_normalized_df, _write_dataframe_to_gcs_as_csv - Add validation_test: empty-schema short-circuit returns normalized_df None - Ruff/black formatting and lint fixes; mypy-clean for touched files Co-authored-by: Cursor <cursoragent@cursor.com>

… types and format - Extract validation helpers to meet 50-line rule (_header_missing_and_extra, _get_csv_read_kwargs, _validate_optional_columns_json) - Extract gcsutil._archive_raw_and_write_validated; add type hints to rename_file - Add tests: PDP rename/validate_dataframe, CSV read failure, gcsutil error propagation, edvise institution_identifier in validate_file call - Remove unused validation_edvise_normalize and its tests - Fix mypy in validation_pdp_edvise and tests (Optional[List], cast, annotations) - Apply ruff format Co-authored-by: Cursor <cursoragent@cursor.com>

- Route PDP cohort/course through edvise read (read_raw_pdp_*); remove API-side normalizers for PDP so pipeline and API share one source of truth - Add _path_for_edvise_read, _read_pdp_course_edvise, _validate_pdp_with_edvise_read - Convert Pandera SchemaErrors to HardValidationError in PDP path - Add validation_pdp_read_path_test.py (routing, path cleanup, SchemaErrors, course converter fallback); extend Src type with io.StringIO for file-like Co-authored-by: Cursor <cursoragent@cursor.com>

chapmanhk and others added 3 commits February 6, 2026 08:58

chapmanhk requested a review from vishpillai123 February 6, 2026 18:48
