@chapmanhk chapmanhk commented Feb 6, 2026

feat: validate PDP uploads with repo schemas, write normalized output to validated/

changes

  • Validation flow

    • PDP (single STUDENT or COURSE): Uses edvise as single source of truth. Encoding sniff → resolve input to a path (temp file if file-like) → read_raw_pdp_cohort_data or read_raw_pdp_course_data (same as pipeline: CSV read, column mapping e.g. study_id → student_id, duplicate handling for course, datetime format tries). No JSON-based header pass or API normalizers for this path; edvise schemas (RawPDPCohortDataSchema / RawPDPCourseDataSchema) validate the DataFrame. Pandera SchemaErrors from edvise are converted to HardValidationError for the API formatter. Uploads must include all required columns per repo schema (e.g. per-year credit columns for cohort); no single-credit-column expansion.
    • Edvise: Uses JSON-based validation only (different shape; no repo schema).
    • Custom institutions: Unchanged; JSON extension merge + Pandera.
    • Errors: Pandera failures are converted to HardValidationError with normalized failure_cases; existing formatter and PDP check messages used for user-facing 400 responses.
  • GCS on success

    • Copy blob from unvalidated/{file_name} → raw/{file_name} (archive of original).
    • Write normalized DataFrame (canonical columns, repo-shaped for PDP) to validated/{file_name} as UTF-8 CSV.
    • Delete blob from unvalidated/. Downstream consumes validated/ only; raw/ is kept for record.
  • Refactors and quality

    • validation.py: PDP path: helpers _path_for_edvise_read (path vs file-like, temp cleanup), _read_pdp_course_edvise (datetime formats + converter fallback for older edvise), _validate_pdp_with_edvise_read. Extracted helpers for JSON path: _header_missing_and_extra, _get_csv_read_kwargs, _validate_optional_columns_json; validate_dataset and related functions under 50 lines. Src type extended with io.StringIO for file-like support.
    • gcsutil.py: Extracted _archive_raw_and_write_validated; added type hints to rename_file; validate_file under 50 lines.
    • Tests added: PDP path: validation_pdp_read_path_test.py (validate_file_reader PDP routing, _path_for_edvise_read path/file-like/cleanup/read-failure, _validate_pdp_with_edvise_read success/SchemaErrors→HardValidationError/invalid model set/file-like, _read_pdp_course_edvise success/all-fail/TypeError school_type fallback). Existing: PDP rename_pdp_dataframe_to_repo_schema (program_of_study fallback, course unchanged), validate_dataframe_with_edvise_schema (empty raises, invalid raises HardValidationError); validation CSV read failure → HardValidationError; gcsutil ValueError/UnicodeError propagation; data_test assertion that validate_file is called with institution_identifier for Edvise.
    • Schema: in pdp_schema_extension.json, special_program is a string type (required).
    • Ruff format applied; mypy fixes in validation_pdp_edvise and tests (Optional model_list, cast for schema return, dict annotations).
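The path-vs-file-like handling described for _path_for_edvise_read can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the name `path_for_read`, the `Src` alias, and the context-manager shape are assumptions made for the example. It only shows the idea of passing a path through unchanged, spilling a file-like object to a temp file, and cleaning that temp file up afterwards.

```python
import contextlib
import io
import os
import tempfile
from typing import Iterator, Union

# Hypothetical alias for this sketch; the PR's actual Src type may differ.
Src = Union[str, os.PathLike, io.StringIO, io.BytesIO]


@contextlib.contextmanager
def path_for_read(src: Src, encoding: str = "utf-8") -> Iterator[str]:
    """Yield a filesystem path for src; any temp file is removed on exit."""
    if isinstance(src, (str, os.PathLike)):
        # Already a path: hand it through, nothing to clean up.
        yield os.fspath(src)
        return

    # File-like input: copy its contents to a temp file so that a
    # path-based reader can open it.
    data = src.getvalue()
    if isinstance(data, str):
        data = data.encode(encoding)
    fd, tmp_path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        yield tmp_path
    finally:
        # Runs on success and on read failure alike.
        os.unlink(tmp_path)
```

A caller would wrap the read in `with path_for_read(upload) as p:` and pass `p` to the path-based reader, so the temp file never outlives the validation call.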

context

  • PDP alignment: PDP uploads are validated with the same rules as the edvise repo (cohort/course schemas) so behavior is consistent with pipelines and audits. The API calls edvise’s read layer (read_raw_pdp_cohort_data / read_raw_pdp_course_data) directly so there is a single source of truth; no API-side normalizers or header pass for PDP.
  • Normalized output: Writing the normalized (canonical-column, repo-shaped) DataFrame to validated/ lets downstream use a consistent schema without re-normalizing; keeping the original in raw/ preserves an audit trail.
  • Edvise: Edvise has a different column shape and stays on JSON-based validation only; only PDP uses the repo schemas.
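The archive, write, delete ordering from "GCS on success" can be sketched as below. This is an assumption-laden illustration rather than the PR's _archive_raw_and_write_validated: the function name and signature are hypothetical, `bucket` is assumed to behave like a google-cloud-storage `Bucket` (`blob`, `copy_blob`) with blobs exposing `upload_from_string` and `delete`, and `csv_text` is assumed to be the already-serialized normalized DataFrame.

```python
from typing import Any


def archive_raw_and_write_validated(bucket: Any, file_name: str, csv_text: str) -> None:
    """Archive the original upload, write the normalized CSV, drop the original.

    Hypothetical sketch: `bucket` is duck-typed against the
    google-cloud-storage Bucket interface.
    """
    source = bucket.blob(f"unvalidated/{file_name}")

    # 1. Keep the original bytes under raw/ for the audit trail.
    bucket.copy_blob(source, bucket, f"raw/{file_name}")

    # 2. Write the normalized output as UTF-8 CSV under validated/.
    bucket.blob(f"validated/{file_name}").upload_from_string(
        csv_text.encode("utf-8"), content_type="text/csv"
    )

    # 3. Only after both writes succeed, remove the unvalidated blob.
    source.delete()
```

Deleting last means a failure in either write leaves the upload in unvalidated/, so the file is not lost if a step raises partway through.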

SQL: PDP extension schema update

Before or after deploying, ensure the PDP extension in schema_registry is the one used by this branch (e.g. from src/webapp/validation_schemas/pdp_schema_extension.json).

Update existing active PDP row (one row with is_pdp = 1 and is_active = 1):

UPDATE schema_registry
SET json_doc = :json_doc
WHERE is_pdp = 1 AND is_active = 1
LIMIT 1;

Bind :json_doc to the full PDP extension JSON (e.g. contents of pdp_schema_extension.json).
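One way to prepare the :json_doc bind value is sketched below. The helper name `load_json_doc_param` is hypothetical, and how the parameters reach the database depends on the driver in use; the sketch only reads the extension file, checks that it parses as JSON, and returns a parameter mapping keyed by `json_doc`.

```python
import json
from pathlib import Path
from typing import Dict, Union


def load_json_doc_param(path: Union[str, Path]) -> Dict[str, str]:
    """Return bind parameters for the UPDATE, keyed by the :json_doc name."""
    text = Path(path).read_text(encoding="utf-8")
    # Fail early on a malformed file instead of storing invalid JSON.
    json.loads(text)
    return {"json_doc": text}
```

The returned mapping can then be passed as the named parameters of the UPDATE statement above through whichever database client the deployment uses.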

questions

None.


chapmanhk and others added 3 commits February 6, 2026 08:58
…o alignment

- Add PDP edvise schema validation path (validation_pdp_edvise)
- Add Edvise-to-PDP normalization (validation_edvise_normalize)
- Integrate repo schemas into validation pipeline and error formatter
- Update pdp_schema_extension and lockfile; add tests

Co-authored-by: Cursor <cursoragent@cursor.com>
… raw/

- On validation success: archive original to raw/{filename}, write
  normalized (canonical columns, coerced dtypes) DataFrame to
  validated/{filename}, delete from unvalidated/
- Validation layer always returns normalized_df on success; storage
  serializes to UTF-8 CSV and uploads to validated/
- Add input validation and helpers in gcsutil (under 50 lines); catch
  specific exceptions; TYPE_CHECKING for HardValidationError in
  validation_pdp_edvise
- Add gcsutil_test.py: validate_file input/error/success paths,
  _run_validation_and_get_normalized_df, _write_dataframe_to_gcs_as_csv
- Add validation_test: empty-schema short-circuit returns normalized_df None
- Ruff/black formatting and lint fixes; mypy-clean for touched files

Co-authored-by: Cursor <cursoragent@cursor.com>
… types and format

- Extract validation helpers to meet 50-line rule (_header_missing_and_extra,
  _get_csv_read_kwargs, _validate_optional_columns_json)
- Extract gcsutil._archive_raw_and_write_validated; add type hints to rename_file
- Add tests: PDP rename/validate_dataframe, CSV read failure, gcsutil error
  propagation, edvise institution_identifier in validate_file call
- Remove unused validation_edvise_normalize and its tests
- Fix mypy in validation_pdp_edvise and tests (Optional[List], cast, annotations)
- Apply ruff format

Co-authored-by: Cursor <cursoragent@cursor.com>
- Route PDP cohort/course through edvise read (read_raw_pdp_*); remove
  API-side normalizers for PDP so pipeline and API share one source of truth
- Add _path_for_edvise_read, _read_pdp_course_edvise, _validate_pdp_with_edvise_read
- Convert Pandera SchemaErrors to HardValidationError in PDP path
- Add validation_pdp_read_path_test.py (routing, path cleanup, SchemaErrors,
  course converter fallback); extend Src type with io.StringIO for file-like

Co-authored-by: Cursor <cursoragent@cursor.com>