Commit 97cf4c3: Move final check documentation in python docstring
Parent: 1c64d2b

5 files changed: +98 / -72 lines

README.md (3 additions & 3 deletions)

@@ -186,14 +186,14 @@ The data processing pipeline consists of modular steps that transform raw survey
 4. **[Link Trips](https://bayareametro.github.io/travel-diary-survey-tools/pipeline_steps/link_trips/)** - Aggregates individual trip segments into complete journey records by detecting mode changes and transfers
 5. **[Detect Joint Trips](https://bayareametro.github.io/travel-diary-survey-tools/pipeline_steps/detect_joint_trips/)** - Identifies shared household trips using spatial-temporal similarity matching
 6. **[Extract Tours](https://bayareametro.github.io/travel-diary-survey-tools/pipeline_steps/extract_tours/)** - Builds hierarchical tour structures (home-based tours and work-based subtours) from linked trips
-7. **Weighting** *(placeholder)* - Calculates expansion weights to match survey sample to population targets
+7. **[Weighting](https://bayareametro.github.io/travel-diary-survey-tools/pipeline_steps/weighting/)** *(placeholder)* - Calculates expansion weights to match survey sample to population targets
 8. **Format Output** - Transforms canonical data to model-specific formats (DaySim, ActivitySim, etc.)
    - **[DaySim Format](https://bayareametro.github.io/travel-diary-survey-tools/pipeline_steps/format_output/daysim/)** - Formats data for DaySim model input
    - **[CT-RAMP Format](https://bayareametro.github.io/travel-diary-survey-tools/pipeline_steps/format_output/ctramp/)** - Formats data for CT-RAMP model input
-9. **[Final Check](src/processing/final_check/README.md)** - Validates complete dataset against canonical schemas before export
+9. **[Final Check](https://bayareametro.github.io/travel-diary-survey-tools/pipeline_steps/final_check/)** - Validates complete dataset against canonical schemas before export
 10. **[Write Data](https://bayareametro.github.io/travel-diary-survey-tools/pipeline_steps/read_write/)** - Writes processed tables to output files with optional validation
 
-Each step README provides detailed documentation on:
+Each step links to documentation generated by the step's docstring, and provides detailed documentation on:
 - Input/output data requirements
 - Core algorithm and processing logic
 - Configuration parameters

docs/pipeline_steps/final_check.md (11 additions & 0 deletions)

@@ -0,0 +1,11 @@
+# Final Check
+
+::: processing.final_check.final_check
+    options:
+      show_root_heading: true
+      show_root_toc_entry: false
+      members:
+        - final_check
+      filters:
+        - "!^logger$"
+        - "!^_"
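The `filters` entries above tell mkdocstrings to hide the module-level `logger` and any member whose name starts with an underscore. As a rough illustration of how such exclusion filters behave (a simplified re-implementation; the real matching is done by mkdocstrings, and `is_documented` is a hypothetical helper):

```python
import re

# Filters copied from the mkdocstrings options above: a leading "!" means
# "exclude members matching this regex".
FILTERS = [r"!^logger$", r"!^_"]


def is_documented(name: str, filters: list[str]) -> bool:
    """Return True if `name` survives the exclusion filters (illustration only)."""
    for f in filters:
        if f.startswith("!") and re.search(f[1:], name):
            return False
    return True


members = ["final_check", "logger", "_helper", "__all__"]
visible = [m for m in members if is_documented(m, FILTERS)]
# visible == ["final_check"]: only the public step function is rendered.
```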

mkdocs.yml (1 addition & 0 deletions)

@@ -72,6 +72,7 @@ nav:
       - Format Output:
           - DaySim Format: pipeline_steps/format_output/daysim.md
           - CT-RAMP Format: pipeline_steps/format_output/ctramp.md
+      - Final Check: pipeline_steps/final_check.md
 
 markdown_extensions:
   - admonition

src/processing/final_check/README.md (4 additions & 59 deletions)

@@ -2,64 +2,9 @@
 
 # Final Check Pipeline Step
 
-This module performs final validation checks on the complete processed dataset to ensure data quality and schema compliance. It is basically a dummy module to run Pydantic validation on all tables at the end of the pipeline.
+This module performs final validation checks on the complete processed dataset to ensure data quality and schema compliance before export.
 
-## Pipeline Steps
+For detailed API documentation including validation algorithm, error handling, and implementation notes, see: [Final Check API Documentation](https://bayareametro.github.io/travel-diary-survey-tools/pipeline_steps/final_check/)
 
-### `final_check`
-
-Runs comprehensive validation on all canonical survey tables at the end of the pipeline.
-
-**Inputs:**
-- `households`: Processed household table (pl.DataFrame)
-- `persons`: Processed person table (pl.DataFrame)
-- `days`: Processed person-day table (pl.DataFrame)
-- `unlinked_trips`: Processed unlinked trip records (pl.DataFrame)
-- `linked_trips`: Processed linked trip records (pl.DataFrame)
-- `tours`: Processed tour records (pl.DataFrame)
-
-**Outputs:**
-- Dictionary containing the same validated tables:
-  - `households`
-  - `persons`
-  - `days`
-  - `unlinked_trips`
-  - `linked_trips`
-  - `tours`
-
-**Core Algorithm:**
-
-**Pydantic Model Validation:**
-1. This step is decorated with `@step(validate_input=True, validate_output=True)`
-2. The pipeline framework automatically validates all input/output against Pydantic data models
-3. Validation checks:
-   - **Schema Compliance:** All required columns present with correct data types
-   - **Value Constraints:** Numeric ranges, categorical values, enum memberships
-   - **Referential Integrity:** Foreign keys match (person_id → persons, hh_id → households, etc.)
-   - **Business Rules:** Domain-specific constraints (e.g., depart_time < arrive_time)
-
-**Custom Validation Space:**
-- The function body is intentionally simple (pass-through)
-- Pydantic handles validation automatically at model instantiation
-- This space *could* be extended with additional custom checks not covered by models:
-  - Cross-table consistency checks
-  - Statistical outlier detection
-  - Survey-specific business rules
-  - Data quality metrics logging
-- However, validation logic should ideally be implemented in Pydantic models themselves for reusability
-
-**Validation Failure Handling:**
-- If validation fails, raises `DataValidationError` with detailed error messages
-- Error messages indicate:
-  - Which table failed validation
-  - Which rows/columns have issues
-  - What constraint was violated
-- Pipeline execution halts on validation failure
-
-**Notes:**
-- This is the last checkpoint before data export
-- Ensures output meets canonical data specifications
-- Validation errors caught here prevent invalid data from reaching models/analyses
-- Pydantic models defined in `src/data_canon/models/` provide the validation rules
-- Comprehensive logging helps diagnose data quality issues
-- Pass-through design allows validation to occur transparently
+The documentation includes:
+- `final_check()` - Comprehensive validation pass-through for all canonical tables
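The referential-integrity and business-rule checks described in this README can be sketched in plain Python. This is illustrative only: in the pipeline these rules live in the Pydantic models, and the tables are polars DataFrames rather than the dicts of lists used here.

```python
def dangling_hh_ids(persons: dict, households: dict) -> list:
    """Referential integrity: every person's hh_id must exist in households."""
    known = set(households["hh_id"])
    return [hh for hh in persons["hh_id"] if hh not in known]


def invalid_trip_times(trips: dict) -> list[int]:
    """Business rule: depart_time must be earlier than arrive_time."""
    pairs = zip(trips["depart_time"], trips["arrive_time"])
    return [i for i, (dep, arr) in enumerate(pairs) if not dep < arr]


households = {"hh_id": [1, 2]}
persons = {"person_id": [10, 11, 12], "hh_id": [1, 2, 3]}  # hh_id 3 has no household
trips = {"depart_time": [480, 540], "arrive_time": [500, 530]}  # trip 1 arrives early

orphans = dangling_hh_ids(persons, households)  # [3]
bad_rows = invalid_trip_times(trips)            # [1]
```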

src/processing/final_check/final_check.py (79 additions & 10 deletions)

@@ -1,4 +1,53 @@
-"""Final validation step for the entire dataset."""
+"""Final Validation Step.
+
+Performs final validation checks on the complete processed dataset to ensure data
+quality and schema compliance before export. This is a pass-through validation step
+that leverages the `@step()` decorator's automatic Pydantic model validation.
+
+!!! Algorithm
+
+    # Pydantic Model Validation
+
+    1. This step is decorated with `@step(validate_input=True, validate_output=True)`
+    2. The pipeline framework automatically validates all input/output against Pydantic
+       data models
+    3. Validation checks:
+       - **Schema Compliance:** All required columns present with correct data types
+       - **Value Constraints:** Numeric ranges, categorical values, enum memberships
+       - **Referential Integrity:** Foreign keys match (person_id → persons,
+         hh_id → households, etc.)
+       - **Business Rules:** Domain-specific constraints (e.g., depart_time < arrive_time)
+
+    # Custom Validation Space
+
+    - The function body is intentionally simple (pass-through)
+    - Pydantic handles validation automatically at model instantiation
+    - This space *could* be extended with additional custom checks not covered by models:
+      - Cross-table consistency checks
+      - Statistical outlier detection
+      - Survey-specific business rules
+      - Data quality metrics logging
+    - However, validation logic should ideally be implemented in Pydantic models
+      themselves for reusability
+
+    # Validation Failure Handling
+
+    - If validation fails, raises `DataValidationError` with detailed error messages
+    - Error messages indicate:
+      - Which table failed validation
+      - Which rows/columns have issues
+      - What constraint was violated
+    - Pipeline execution halts on validation failure
+
+!!! Notes
+
+    - This is the last checkpoint before data export
+    - Ensures output meets canonical data specifications
+    - Validation errors caught here prevent invalid data from reaching models/analyses
+    - Pydantic models defined in `src/data_canon/models/` provide the validation rules
+    - Comprehensive logging helps diagnose data quality issues
+    - Pass-through design allows validation to occur transparently
+"""
 
 import logging
 
@@ -18,21 +67,41 @@ def final_check(
     linked_trips: pl.DataFrame,
     tours: pl.DataFrame,
 ) -> dict[str, pl.DataFrame]:
-    """Run validation checks on the entire dataset.
+    """Run comprehensive validation on all canonical survey tables.
+
+    This is a pass-through function that relies on the `@step()` decorator to
+    perform automatic Pydantic model validation on both inputs and outputs.
+    Validation checks schema compliance, value constraints, referential integrity,
+    and business rules.
 
     Args:
-        households: The households dataframe
-        persons: The persons dataframe
-        days: The days dataframe
-        unlinked_trips: The unlinked trips dataframe
-        linked_trips: The linked trips dataframe
-        tours: The tours dataframe
+        households: Processed household table with all required fields
+        persons: Processed person table with all required fields
+        days: Processed person-day table with all required fields
+        unlinked_trips: Processed unlinked trip records with all required fields
+        linked_trips: Processed linked trip records (journey-level) with all
+            required fields
+        tours: Processed tour records with all required fields
 
     Returns:
-        The validated dataset
+        Dictionary containing the same validated tables:
+
+        - households: Validated household table
+        - persons: Validated person table
+        - days: Validated person-day table
+        - unlinked_trips: Validated unlinked trip records
+        - linked_trips: Validated linked trip records
+        - tours: Validated tour records
 
     Raises:
-        DataValidationError: If pydantic validation fails
+        DataValidationError: If pydantic validation fails on any table. Error
+            message indicates which table, row, column, and constraint failed.
+
+    Notes:
+        - Pydantic handles validation automatically at model instantiation
+        - This is the final quality checkpoint before data export
+        - Custom validation logic can be added here if needed, but should
+          ideally be implemented in Pydantic models for reusability
     """
     logger.info("Starting final validation checks")

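The pass-through design described in the docstring can be sketched with a minimal stand-in. Everything below is illustrative, not the pipeline's actual code: the real `@step()` decorator validates polars DataFrames against the Pydantic models in `src/data_canon/models/`, while this sketch checks required columns on plain dicts.

```python
from functools import wraps


class DataValidationError(Exception):
    """Stand-in for the pipeline's validation error."""


# Hypothetical required columns per canonical table (for illustration).
SCHEMAS = {
    "households": {"hh_id"},
    "persons": {"person_id", "hh_id"},
}


def _validate(tables: dict) -> None:
    """Raise DataValidationError naming the offending table and columns."""
    for name, table in tables.items():
        missing = SCHEMAS.get(name, set()) - set(table)
        if missing:
            raise DataValidationError(f"{name}: missing columns {sorted(missing)}")


def step(validate_input: bool = False, validate_output: bool = False):
    """Simplified analogue of the pipeline's @step() decorator."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(**tables):
            if validate_input:
                _validate(tables)   # halt before the body runs
            result = fn(**tables)
            if validate_output:
                _validate(result)   # last checkpoint before export
            return result
        return wrapper
    return decorator


@step(validate_input=True, validate_output=True)
def final_check(households: dict, persons: dict) -> dict:
    """Pass-through: all validation happens in the decorator."""
    return {"households": households, "persons": persons}


tables = final_check(
    households={"hh_id": [1, 2]},
    persons={"person_id": [10, 11], "hh_id": [1, 2]},
)
```

Passing a malformed table, e.g. `final_check(households={}, persons=...)`, raises `DataValidationError` before the function body executes, which mirrors the failure handling the docstring describes.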