feat: validate PDP uploads with repo schemas, write normalized output to validated/ #202
changes
Validation flow
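As a rough illustration of the flow this section describes (read the CSV, apply the column mapping, check required columns, raise a hard error with failure cases), here is a minimal standard-library sketch. It is not the edvise reader or the Pandera schema: `COLUMN_MAP`, `REQUIRED_COLUMNS`, `read_and_validate`, the `cohort` column, and the simplified `HardValidationError` class are assumptions made for illustration only.

```python
import csv
import io

# Hypothetical stand-ins: the real mapping and required columns live in the
# edvise readers and schemas, not in this sketch.
COLUMN_MAP = {"study_id": "student_id"}
REQUIRED_COLUMNS = {"student_id", "cohort"}


class HardValidationError(Exception):
    """Simplified stand-in for the API's hard validation error."""

    def __init__(self, failure_cases):
        super().__init__(f"{len(failure_cases)} validation failure(s)")
        self.failure_cases = failure_cases


def read_and_validate(text):
    """Read CSV text, apply the column mapping, and check required columns."""
    reader = csv.DictReader(io.StringIO(text))
    rows = [
        {COLUMN_MAP.get(name, name): value for name, value in row.items()}
        for row in reader
    ]
    header = {COLUMN_MAP.get(name, name) for name in (reader.fieldnames or [])}
    missing = sorted(REQUIRED_COLUMNS - header)
    if missing:
        # One failure case per missing required column.
        raise HardValidationError(
            [{"column": name, "check": "column_in_dataframe"} for name in missing]
        )
    return rows
```

A caller would catch `HardValidationError` and pass `failure_cases` to whatever formatter builds the 400 response.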
- PDP uploads are read with `read_raw_pdp_cohort_data` or `read_raw_pdp_course_data` (same as pipeline: CSV read, column mapping e.g. `study_id` → `student_id`, duplicate handling for course, datetime format tries). No JSON-based header pass or API normalizers for this path; edvise schemas (`RawPDPCohortDataSchema` / `RawPDPCourseDataSchema`) validate the DataFrame. Pandera `SchemaErrors` from edvise are converted to `HardValidationError` for the API formatter.
- Uploads must include all required columns per repo schema (e.g. per-year credit columns for cohort); no single-credit-column expansion.
- `HardValidationError` with normalized `failure_cases`; the existing formatter and PDP check messages are used for user-facing 400 responses.

GCS on success
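The success path in this section (archive the original under `raw/`, write the normalized output under `validated/` as UTF-8 CSV) could look roughly like the sketch below. It uses a local directory as a stand-in for the bucket and does not call any GCS client; the function name mirrors `_archive_raw_and_write_validated` from this PR, but the signature and body are assumptions.

```python
import csv
import shutil
from pathlib import Path


def archive_raw_and_write_validated(root, file_name, rows, fieldnames):
    """Move the original upload to raw/ and write normalized rows to validated/.

    `root` is a local directory standing in for the bucket (an assumption).
    """
    root = Path(root)
    raw_dir = root / "raw"
    validated_dir = root / "validated"
    raw_dir.mkdir(parents=True, exist_ok=True)
    validated_dir.mkdir(parents=True, exist_ok=True)

    # unvalidated/{file_name} -> raw/{file_name}: keep the original as an archive.
    shutil.move(str(root / "unvalidated" / file_name), str(raw_dir / file_name))

    # validated/{file_name}: normalized output as UTF-8 CSV.
    validated_path = validated_dir / file_name
    with validated_path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return validated_path
```

Moving the original before writing the normalized copy means a failed write still leaves the upload recoverable from `raw/`; whether the real helper orders the steps this way is not stated in the PR.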
- `unvalidated/{file_name}` → `raw/{file_name}` (archive of original).
- Normalized output is written to `validated/{file_name}` as UTF-8 CSV.
- `unvalidated/`. Downstream consumes `validated/` only; `raw/` is kept for record.

Refactors and quality
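One of the helpers in this section, `_path_for_edvise_read`, is described as handling "path vs file-like, temp cleanup". A minimal sketch of that pattern, assuming the reader only accepts filesystem paths, is a context manager that passes paths through and spills file-like input to a temporary file. The name `path_for_read` and its details are hypothetical, not the PR's code.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def path_for_read(src):
    """Yield a filesystem path for src, which is a path or a file-like object."""
    if isinstance(src, (str, os.PathLike)):
        # Already a path: hand it through, nothing to clean up.
        yield os.fspath(src)
        return
    # File-like: spill the contents to a temporary file for path-only readers.
    data = src.read()
    mode = "w" if isinstance(data, str) else "wb"
    kwargs = {"encoding": "utf-8", "newline": ""} if mode == "w" else {}
    tmp = tempfile.NamedTemporaryFile(mode, suffix=".csv", delete=False, **kwargs)
    try:
        with tmp:
            tmp.write(data)
        yield tmp.name
    finally:
        # Remove the temporary file even if the caller's read fails.
        os.unlink(tmp.name)
```

Usage would be along the lines of `with path_for_read(upload) as path: df = reader(path)`, where `reader` stands for whichever read function is in use.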
- `_path_for_edvise_read` (path vs file-like, temp cleanup), `_read_pdp_course_edvise` (datetime formats + converter fallback for older edvise), `_validate_pdp_with_edvise_read`. Extracted helpers for JSON path: `_header_missing_and_extra`, `_get_csv_read_kwargs`, `_validate_optional_columns_json`; `validate_dataset` and related functions under 50 lines.
- `Src` type extended with `io.StringIO` for file-like support.
- `_archive_raw_and_write_validated`; added type hints to `rename_file`; `validate_file` under 50 lines.
- `validation_pdp_read_path_test.py` (`validate_file_reader` PDP routing; `_path_for_edvise_read` path/file-like/cleanup/read-failure; `_validate_pdp_with_edvise_read` success/`SchemaErrors` → `HardValidationError`/invalid model set/file-like; `_read_pdp_course_edvise` success/all-fail/`TypeError` school_type fallback). Existing: PDP `rename_pdp_dataframe_to_repo_schema` (program_of_study fallback, course unchanged), `validate_dataframe_with_edvise_schema` (empty raises, invalid raises `HardValidationError`); validation CSV read failure → `HardValidationError`; gcsutil `ValueError`/`UnicodeError` propagation; data_test assertion that `validate_file` is called with `institution_identifier` for Edvise.
- `pdp_schema_extension.json`: `special_program` is string type (required).
- `validation_pdp_edvise` and tests (Optional model_list, cast for schema return, dict annotations).

context
- PDP validation uses the same readers as the pipeline (`read_raw_pdp_cohort_data` / `read_raw_pdp_course_data`) directly so there is a single source of truth; no API-side normalizers or header pass for PDP.
- Writing to `validated/` lets downstream use a consistent schema without re-normalizing; keeping the original in `raw/` preserves an audit trail.

SQL: PDP extension schema update
Before or after deploying, ensure the PDP extension in `schema_registry` is the one used by this branch (e.g. from `src/webapp/validation_schemas/pdp_schema_extension.json`).

Update existing active PDP row (one row with `is_pdp = 1` and `is_active = 1`):

Bind `:json_doc` to the full PDP extension JSON (e.g. contents of `pdp_schema_extension.json`).

questions
None.