|
6 | 6 | [](https://www.gnu.org/licenses/agpl-3.0) |
7 | 7 | [](https://github.com/yourusername/egon-validation) |
8 | 8 |
|
9 | | -A SQL-first validation framework for PostgreSQL/PostGIS databases, designed for large-scale energy system data pipelines. Execute validation rules directly in the database, generate interactive reports, and integrate seamlessly with Airflow workflows. |
| 9 | +SQL-first validation framework for PostgreSQL/PostGIS databases. Execute validation rules directly in the database, generate interactive reports, and integrate with Airflow workflows. |
10 | 10 |
|
11 | 11 | ## Features |
12 | 12 |
|
13 | | -- **SQL-First Execution** - Push validation logic to the database for optimal performance |
14 | | -- **PostGIS Support** - Native geometry and SRID validation for spatial data |
15 | | -- **Extensible Rules** - Combine built-in formal rules with custom domain logic |
16 | | -- **Rich Reports** - Interactive HTML reports with filtering and coverage analysis |
17 | | -- **Airflow Ready** - Resume-safe execution with unique run tracking |
18 | | -- **Parallel Processing** - Thread-safe multi-rule execution |
| 13 | +- **SQL-First Execution** - Push validation logic to the database |
| 14 | +- **PostGIS Support** - Geometry and SRID validation |
| 15 | +- **Extensible Rules** - Built-in + custom rules |
| 16 | +- **HTML Reports** - Interactive reports with filtering |
| 17 | +- **Airflow Ready** - Pipeline integration |
| 18 | +- **Parallel Processing** - Multi-threaded execution |
19 | 19 |
|
20 | 20 | ## Quick Start |
21 | 21 |
|
22 | | -### Installation |
23 | | - |
24 | 22 | ```bash |
25 | 23 | pip install -e . |
26 | 24 |
|
27 | | -# With test dependencies |
28 | | -pip install -e ".[test]" |
29 | | -``` |
30 | | - |
31 | | -### Basic Usage |
32 | | - |
33 | | -1. **Configure database connection:** |
34 | | - |
35 | | -```bash |
36 | 25 | export DB_URL="postgresql://user:password@host:port/database" |
37 | | -``` |
38 | | - |
39 | | -2. **Run validation:** |
40 | | - |
41 | | -```bash |
42 | | -# Generate run ID |
43 | | -RUNID="validation-$(date +%Y%m%dT%H%M%S)" |
44 | | - |
45 | | -# Execute validation rules |
46 | | -egon-validation run-task --run-id $RUNID --task validation-test |
47 | | - |
48 | | -# Generate HTML report |
49 | | -egon-validation final-report --run-id $RUNID |
50 | | -``` |
51 | | - |
52 | | -3. **View results:** |
53 | | - |
54 | | -```bash |
55 | | -open validation_runs/$RUNID/final/report.html |
56 | | -``` |
57 | 26 |
|
58 | | -## Writing Rules |
59 | | - |
60 | | -### SQL Rule |
61 | | - |
62 | | -```python |
63 | | -from egon_validation.rules.base import SqlRule, RuleResult |
64 | | -from egon_validation.rules.registry import register |
65 | | - |
66 | | -@register( |
67 | | - task="data_quality", |
68 | | - dataset="public.generators", |
69 | | - rule_id="CAPACITY_RANGE", |
70 | | - kind="formal", |
71 | | - column="capacity_mw", |
72 | | - min_val=0, |
73 | | - max_val=10000 |
74 | | -) |
75 | | -class CapacityRangeCheck(SqlRule): |
76 | | - def sql(self, ctx): |
77 | | - col = self.params['column'] |
78 | | - min_v = self.params['min_val'] |
79 | | - max_v = self.params['max_val'] |
80 | | - |
81 | | - return f""" |
82 | | - SELECT |
83 | | - COUNT(*) as total, |
84 | | - COUNT(CASE WHEN {col} < {min_v} OR {col} > {max_v} |
85 | | - THEN 1 END) as invalid |
86 | | - FROM {self.dataset} |
87 | | - """ |
88 | | - |
89 | | - def postprocess(self, row, ctx): |
90 | | - return RuleResult( |
91 | | - rule_id=self.rule_id, |
92 | | - task=self.task, |
93 | | - dataset=self.dataset, |
94 | | - success=row['invalid'] == 0, |
95 | | - observed=row['invalid'], |
96 | | - expected=0 |
97 | | - ) |
| 27 | +egon-validation run-task --run-id my-run --task validation-test |
| 28 | +egon-validation final-report --run-id my-run |
98 | 29 | ``` |
99 | 30 |
|
100 | | -### Built-in Rules |
101 | | - |
102 | | -| Rule | Purpose | |
103 | | -|------|---------| |
104 | | -| `NotNullAndNotNaNValidation` | Validates no NULL/NaN values in one or more specified columns | |
105 | | -| `WholeTableNotNullAndNotNaNValidation` | Validates no NULL/NaN values in all table columns (auto-discovery) | |
106 | | -| `DataTypeValidation` | Verifies data types for one or more columns | |
107 | | -| `GeometryContainmentValidation` | PostGIS geometry validity and containment | |
108 | | -| `SRIDUniqueNonZero` | PostGIS SRID consistency (unique, non-zero) | |
109 | | -| `SRIDSpecificValidation` | PostGIS SRID validation against expected value | |
110 | | -| `ReferentialIntegrityValidation` | Foreign key validation | |
111 | | -| `RowCountValidation` | Row count boundaries | |
112 | | -| `ValueSetValidation` | Enum/allowed values | |
113 | | -| `ArrayCardinalityValidation` | Array length constraints | |
114 | | - |
115 | | -## Configuration |
116 | | - |
117 | | -Configure via environment variables or `.env` file: |
118 | | - |
119 | | -```bash |
120 | | -# Database |
121 | | -DB_HOST=localhost |
122 | | -DB_PORT=5432 |
123 | | -DB_NAME=egon-data |
124 | | -DB_USER=postgres |
125 | | -DB_PASS=secret |
126 | | - |
127 | | -# SSH Tunnel (optional) |
128 | | -SSH_HOST=gateway.example.com |
129 | | -SSH_USER=username |
130 | | -SSH_KEY_FILE=~/.ssh/id_rsa |
131 | | - |
132 | | -# Execution |
133 | | -MAX_WORKERS=6 |
134 | | -OUTPUT_DIR=./validation_runs |
135 | | -DEFAULT_TOLERANCE=0.0 |
136 | | -``` |
| 31 | +Output: `validation_runs/my-run/final/report.html` |
137 | 32 |
|
138 | | -## Airflow Integration |
| 33 | +## Documentation |
139 | 34 |
|
140 | | -```python |
141 | | -from airflow import DAG |
142 | | -from airflow.operators.bash import BashOperator |
143 | | -from datetime import datetime |
| 35 | +See [docs/](docs/) for full documentation: |
144 | 36 |
|
145 | | -with DAG('data_validation', start_date=datetime(2024, 1, 1)) as dag: |
146 | | - |
147 | | - validate = BashOperator( |
148 | | - task_id='validate_data', |
149 | | - bash_command=''' |
150 | | - RUNID="{{ ds }}_{{ ts_nodash }}" |
151 | | - egon-validation run-task --run-id $RUNID --task validation-test --with-tunnel |
152 | | - egon-validation final-report --run-id $RUNID |
153 | | - ''' |
154 | | - ) |
155 | | -``` |
| 37 | +- [Installation & Configuration](docs/installation.md) |
| 38 | +- [CLI Reference](docs/cli.md) |
| 39 | +- [Built-in Rules](docs/rules.md) |
| 40 | +- [Custom Rules](docs/custom-rules.md) |
| 41 | +- [Pipeline Integration](docs/pipeline-integration.md) |
156 | 42 |
|
157 | 43 | ## Project Structure |
158 | 44 |
|
159 | 45 | ``` |
160 | 46 | egon_validation/ |
161 | 47 | ├── cli.py # Command-line interface |
162 | 48 | ├── config.py # Configuration management |
163 | | -├── context.py # Run tracking |
164 | 49 | ├── db.py # Database connections |
165 | 50 | ├── rules/ |
166 | | -│ ├── base.py # Base rule classes |
167 | | -│ ├── registry.py # Rule registration |
168 | | -│ ├── formal/ # Built-in rules |
169 | | -│ └── custom/ # Domain-specific rules |
| 51 | +│ ├── base.py # Base rule classes |
| 52 | +│ ├── formal/ # Built-in rules |
| 53 | +│ └── custom/ # Domain-specific rules |
170 | 54 | ├── runner/ |
171 | | -│ ├── execute.py # Task execution |
172 | | -│ └── aggregate.py # Result aggregation |
| 55 | +│ └── execute.py # Task execution |
173 | 56 | └── report/ |
174 | | - ├── generate.py # HTML report generation |
175 | | - └── assets/ # Report templates |
| 57 | + └── generate.py # HTML report generation |
176 | 58 | ``` |
177 | 59 |
|
178 | 60 | ## Development |
179 | 61 |
|
180 | 62 | ```bash |
181 | | -# Run tests |
182 | | -pytest |
183 | | - |
184 | | -# With coverage |
185 | | -pytest --cov=egon_validation --cov-report=html |
186 | | - |
187 | | -# Format code |
188 | | -black egon_validation/ |
189 | | - |
190 | | -# Lint |
191 | | -flake8 egon_validation/ |
| 63 | +pytest # Run tests |
| 64 | +pytest --cov=egon_validation # With coverage |
| 65 | +black egon_validation/ # Format |
| 66 | +flake8 egon_validation/ # Lint |
192 | 67 | ``` |
193 | 68 |
|
194 | 69 | ## License |
195 | 70 |
|
196 | | -AGPL-3.0 - see [LICENSE](LICENSE) |
197 | | - |
198 | | -## Contributing |
199 | | - |
200 | | -Contributions welcome! Please ensure: |
201 | | -- Tests pass (`pytest`) |
202 | | -- Code is formatted (`black`) |
203 | | -- New rules include examples and tests |
| 71 | +AGPL-3.0 - see [LICENSE](LICENSE) |
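
Review note on the Quick Start hunk: the removed lines built a timestamped run ID (`validation-$(date +%Y%m%dT%H%M%S)`), while the added lines hard-code `my-run`, so repeated runs would presumably write to the same `validation_runs/my-run/` directory. A minimal Python sketch of the removed behavior follows, in case it is worth keeping in the docs. The helper names `make_run_id` and `report_path` are hypothetical and not part of `egon_validation`; the path layout is taken from the `Output:` line in this diff.

```python
from datetime import datetime
from pathlib import PurePosixPath


def make_run_id(now: datetime, prefix: str = "validation") -> str:
    """Build a run ID in the same shape as the removed shell snippet."""
    # Mirrors: RUNID="validation-$(date +%Y%m%dT%H%M%S)"
    return f"{prefix}-{now.strftime('%Y%m%dT%H%M%S')}"


def report_path(run_id: str, output_dir: str = "validation_runs") -> str:
    """Return the report location as described in the README diff."""
    # Assumed layout: <output_dir>/<run_id>/final/report.html
    return str(PurePosixPath(output_dir) / run_id / "final" / "report.html")


if __name__ == "__main__":
    run_id = make_run_id(datetime.now())
    # These are the two CLI calls shown in the new Quick Start.
    print(f"egon-validation run-task --run-id {run_id} --task validation-test")
    print(f"egon-validation final-report --run-id {run_id}")
    print(f"Output: {report_path(run_id)}")
```

The script only prints the commands rather than executing them, so it makes no assumption about how the CLI behaves beyond the flags already shown in the diff.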