Commit ed97994

Merge branch 'dev'
2 parents 73cade8 + 2dc8c41 commit ed97994

28 files changed: +877 -4789 lines changed

README.md

Lines changed: 27 additions & 159 deletions
@@ -6,198 +6,66 @@
66
[![License: AGPL-3.0](https://img.shields.io/badge/License-AGPL%203.0-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)
77
[![Version](https://img.shields.io/badge/version-1.1.1-green.svg)](https://github.com/yourusername/egon-validation)
88

9-
A SQL-first validation framework for PostgreSQL/PostGIS databases, designed for large-scale energy system data pipelines. Execute validation rules directly in the database, generate interactive reports, and integrate seamlessly with Airflow workflows.
9+
SQL-first validation framework for PostgreSQL/PostGIS databases. Execute validation rules directly in the database, generate interactive reports, and integrate with Airflow workflows.
1010

1111
## Features
1212

13-
- **SQL-First Execution** - Push validation logic to the database for optimal performance
14-
- **PostGIS Support** - Native geometry and SRID validation for spatial data
15-
- **Extensible Rules** - Combine built-in formal rules with custom domain logic
16-
- **Rich Reports** - Interactive HTML reports with filtering and coverage analysis
17-
- **Airflow Ready** - Resume-safe execution with unique run tracking
18-
- **Parallel Processing** - Thread-safe multi-rule execution
13+
- **SQL-First Execution** - Push validation logic to the database
14+
- **PostGIS Support** - Geometry and SRID validation
15+
- **Extensible Rules** - Built-in + custom rules
16+
- **HTML Reports** - Interactive reports with filtering
17+
- **Airflow Ready** - Pipeline integration
18+
- **Parallel Processing** - Multi-threaded execution
1919

2020
## Quick Start
2121

22-
### Installation
23-
2422
```bash
2523
pip install -e .
2624

27-
# With test dependencies
28-
pip install -e ".[test]"
29-
```
30-
31-
### Basic Usage
32-
33-
1. **Configure database connection:**
34-
35-
```bash
3625
export DB_URL="postgresql://user:password@host:port/database"
37-
```
38-
39-
2. **Run validation:**
40-
41-
```bash
42-
# Generate run ID
43-
RUNID="validation-$(date +%Y%m%dT%H%M%S)"
44-
45-
# Execute validation rules
46-
egon-validation run-task --run-id $RUNID --task validation-test
47-
48-
# Generate HTML report
49-
egon-validation final-report --run-id $RUNID
50-
```
51-
52-
3. **View results:**
53-
54-
```bash
55-
open validation_runs/$RUNID/final/report.html
56-
```
5726

58-
## Writing Rules
59-
60-
### SQL Rule
61-
62-
```python
63-
from egon_validation.rules.base import SqlRule, RuleResult
64-
from egon_validation.rules.registry import register
65-
66-
@register(
67-
    task="data_quality",
68-
    dataset="public.generators",
69-
    rule_id="CAPACITY_RANGE",
70-
    kind="formal",
71-
    column="capacity_mw",
72-
    min_val=0,
73-
    max_val=10000
74-
)
75-
class CapacityRangeCheck(SqlRule):
76-
    def sql(self, ctx):
77-
        col = self.params['column']
78-
        min_v = self.params['min_val']
79-
        max_v = self.params['max_val']
80-
81-
        return f"""
82-
        SELECT
83-
            COUNT(*) as total,
84-
            COUNT(CASE WHEN {col} < {min_v} OR {col} > {max_v}
85-
                THEN 1 END) as invalid
86-
        FROM {self.dataset}
87-
        """
88-
89-
    def postprocess(self, row, ctx):
90-
        return RuleResult(
91-
            rule_id=self.rule_id,
92-
            task=self.task,
93-
            dataset=self.dataset,
94-
            success=row['invalid'] == 0,
95-
            observed=row['invalid'],
96-
            expected=0
97-
        )
27+
egon-validation run-task --run-id my-run --task validation-test
28+
egon-validation final-report --run-id my-run
9829
```
9930

100-
### Built-in Rules
101-
102-
| Rule | Purpose |
103-
|------|---------|
104-
| `NotNullAndNotNaNValidation` | Validates no NULL/NaN values in one or more specified columns |
105-
| `WholeTableNotNullAndNotNaNValidation` | Validates no NULL/NaN values in all table columns (auto-discovery) |
106-
| `DataTypeValidation` | Verifies data types for one or more columns |
107-
| `GeometryContainmentValidation` | PostGIS geometry validity and containment |
108-
| `SRIDUniqueNonZero` | PostGIS SRID consistency (unique, non-zero) |
109-
| `SRIDSpecificValidation` | PostGIS SRID validation against expected value |
110-
| `ReferentialIntegrityValidation` | Foreign key validation |
111-
| `RowCountValidation` | Row count boundaries |
112-
| `ValueSetValidation` | Enum/allowed values |
113-
| `ArrayCardinalityValidation` | Array length constraints |
114-
115-
## Configuration
116-
117-
Configure via environment variables or `.env` file:
118-
119-
```bash
120-
# Database
121-
DB_HOST=localhost
122-
DB_PORT=5432
123-
DB_NAME=egon-data
124-
DB_USER=postgres
125-
DB_PASS=secret
126-
127-
# SSH Tunnel (optional)
128-
SSH_HOST=gateway.example.com
129-
SSH_USER=username
130-
SSH_KEY_FILE=~/.ssh/id_rsa
131-
132-
# Execution
133-
MAX_WORKERS=6
134-
OUTPUT_DIR=./validation_runs
135-
DEFAULT_TOLERANCE=0.0
136-
```
31+
Output: `validation_runs/my-run/final/report.html`
13732

138-
## Airflow Integration
33+
## Documentation
13934

140-
```python
141-
from airflow import DAG
142-
from airflow.operators.bash import BashOperator
143-
from datetime import datetime
35+
See [docs/](docs/) for full documentation:
14436

145-
with DAG('data_validation', start_date=datetime(2024, 1, 1)) as dag:
146-
147-
    validate = BashOperator(
148-
        task_id='validate_data',
149-
        bash_command='''
150-
        RUNID="{{ ds }}_{{ ts_nodash }}"
151-
        egon-validation run-task --run-id $RUNID --task validation-test --with-tunnel
152-
        egon-validation final-report --run-id $RUNID
153-
        '''
154-
    )
155-
```
37+
- [Installation & Configuration](docs/installation.md)
38+
- [CLI Reference](docs/cli.md)
39+
- [Built-in Rules](docs/rules.md)
40+
- [Custom Rules](docs/custom-rules.md)
41+
- [Pipeline Integration](docs/pipeline-integration.md)
15642

15743
## Project Structure
15844

15945
```
16046
egon_validation/
16147
├── cli.py # Command-line interface
16248
├── config.py # Configuration management
163-
├── context.py # Run tracking
16449
├── db.py # Database connections
16550
├── rules/
166-
│   ├── base.py # Base rule classes
167-
│   ├── registry.py # Rule registration
168-
│   ├── formal/ # Built-in rules
169-
│   └── custom/ # Domain-specific rules
51+
│   ├── base.py # Base rule classes
52+
│   ├── formal/ # Built-in rules
53+
│   └── custom/ # Domain-specific rules
17054
├── runner/
171-
│   ├── execute.py # Task execution
172-
│   └── aggregate.py # Result aggregation
55+
│   └── execute.py # Task execution
17356
└── report/
174-
    ├── generate.py # HTML report generation
175-
    └── assets/ # Report templates
57+
    └── generate.py # HTML report generation
17658
```
17759

17860
## Development
17961

18062
```bash
181-
# Run tests
182-
pytest
183-
184-
# With coverage
185-
pytest --cov=egon_validation --cov-report=html
186-
187-
# Format code
188-
black egon_validation/
189-
190-
# Lint
191-
flake8 egon_validation/
63+
pytest # Run tests
64+
pytest --cov=egon_validation # With coverage
65+
black egon_validation/ # Format
66+
flake8 egon_validation/ # Lint
19267
```
19368

19469
## License
19570

196-
AGPL-3.0 - see [LICENSE](LICENSE)
197-
198-
## Contributing
199-
200-
Contributions welcome! Please ensure:
201-
- Tests pass (`pytest`)
202-
- Code is formatted (`black`)
203-
- New rules include examples and tests
71+
AGPL-3.0 - see [LICENSE](LICENSE)

docs/cli.md

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
# CLI Reference
2+
3+
## run-task
4+
5+
Execute validation rules for a task.
6+
7+
```bash
8+
egon-validation run-task --run-id <ID> --task <TASK> [OPTIONS]
9+
```
10+
11+
| Option | Description |
12+
|--------|-------------|
13+
| `--run-id` | Unique identifier for this run (required) |
14+
| `--task` | Task name to execute (required) |
15+
| `--db-url` | Database URL (or set the `DB_URL` environment variable) |
16+
| `--out` | Output directory (default: `./validation_runs`) |
17+
| `--with-tunnel` | Use SSH tunnel from env config |
18+
| `--echo-sql` | Print SQL queries for debugging |
19+
20+
Example:
21+
```bash
22+
egon-validation run-task --run-id validation-20260116 --task data_quality
23+
```
24+
25+
## final-report
26+
27+
Aggregate results and generate HTML report.
28+
29+
```bash
30+
egon-validation final-report --run-id <ID> [OPTIONS]
31+
```
32+
33+
| Option | Description |
34+
|--------|-------------|
35+
| `--run-id` | Run ID to aggregate (required) |
36+
| `--out` | Output directory (default: `./validation_runs`) |
37+
| `--list-rules` | Print registered rules |
38+
39+
Example:
40+
```bash
41+
egon-validation final-report --run-id validation-20260116
42+
```
43+
44+
Output: `validation_runs/<run-id>/final/report.html`
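
The two commands above are typically run back to back. The following is a minimal sketch of a Python wrapper that does so, using only the flags listed in the tables above. The function names (`build_commands`, `report_path`, `run_validation`) are hypothetical and not part of egon-validation, and the sketch assumes the report lands at `<out>/<run-id>/final/report.html` when `--out` is set explicitly.

```python
import subprocess
from pathlib import Path


def build_commands(run_id, task, out_dir="./validation_runs"):
    """Return the argv lists for run-task and final-report (hypothetical helper)."""
    run_task = [
        "egon-validation", "run-task",
        "--run-id", run_id,
        "--task", task,
        "--out", out_dir,
    ]
    final_report = [
        "egon-validation", "final-report",
        "--run-id", run_id,
        "--out", out_dir,
    ]
    return [run_task, final_report]


def report_path(run_id, out_dir="./validation_runs"):
    """Assumed report location: <out>/<run-id>/final/report.html."""
    return Path(out_dir) / run_id / "final" / "report.html"


def run_validation(run_id, task, out_dir="./validation_runs"):
    """Run both commands in order and stop at the first non-zero exit code."""
    for cmd in build_commands(run_id, task, out_dir):
        subprocess.run(cmd, check=True)
    return report_path(run_id, out_dir)
```

Calling `run_validation("validation-20260116", "data_quality")` would require the `egon-validation` CLI on `PATH` and a configured database connection. Passing argument lists rather than a shell string avoids quoting problems in run IDs.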

docs/custom-rules.md

Lines changed: 65 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,65 @@
1+
# Custom Rules
2+
3+
## SQL Rule
4+
5+
Push validation logic to the database:
6+
7+
```python
8+
from egon_validation.rules.base import SqlRule
9+
from egon_validation.rules.registry import register
10+
11+
@register(
12+
    task="my_task",
13+
    table="schema.my_table",
14+
    rule_id="POSITIVE_VALUES",
15+
    kind="custom",
16+
    column="amount"
17+
)
18+
class PositiveValuesCheck(SqlRule):
19+
    def sql(self, ctx):
20+
        col = self.params["column"]
21+
        return f"""
22+
        SELECT
23+
            COUNT(*) as total,
24+
            COUNT(CASE WHEN {col} < 0 THEN 1 END) as invalid
25+
        FROM {self.table}
26+
        """
27+
28+
    def postprocess(self, row, ctx):
29+
        return self.create_result(
30+
            success=row["invalid"] == 0,
31+
            observed=row["invalid"],
32+
            expected=0
33+
        )
34+
```
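
The counting pattern in the query above can be tried without a PostgreSQL instance. The following is a minimal sketch that runs the same `COUNT(CASE WHEN ...)` aggregate against an in-memory SQLite table. The table name `my_table`, the sample values, and the helper `count_invalid` are illustrative assumptions and not part of egon-validation.

```python
import sqlite3


def count_invalid(conn, table, column):
    """Return (total, invalid) using the same aggregate as the rule above.

    Hypothetical helper for illustration. Identifiers are interpolated
    directly into the query, so only pass trusted table and column names.
    """
    query = f"""
        SELECT
            COUNT(*) AS total,
            COUNT(CASE WHEN {column} < 0 THEN 1 END) AS invalid
        FROM {table}
    """
    total, invalid = conn.execute(query).fetchone()
    return total, invalid


# Sample data: one negative value, one zero, one positive value, one NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (amount REAL)")
conn.executemany(
    "INSERT INTO my_table (amount) VALUES (?)",
    [(10.0,), (-2.5,), (0.0,), (None,)],
)

total, invalid = count_invalid(conn, "my_table", "amount")
print(total, invalid)
```

With this data the query reports 4 rows in total and 1 invalid row. The NULL row is not counted as invalid because `NULL < 0` does not evaluate to true, so a rule of this shape may need to be paired with a separate not-null check.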
35+
36+
## DataFrame Rule
37+
38+
For complex Python-based validation:
39+
40+
```python
41+
from egon_validation.rules.base import DataFrameRule
from egon_validation.rules.registry import register
42+
43+
@register(
44+
    task="my_task",
45+
    table="schema.my_table",
46+
    rule_id="COMPLEX_CHECK",
47+
    kind="custom"
48+
)
49+
class ComplexCheck(DataFrameRule):
50+
    def sql(self, ctx):
51+
        return f"SELECT * FROM {self.table}"
52+
53+
    def validate(self, df, ctx):
54+
        # Custom pandas logic
55+
        invalid = df[df["value"] < df["threshold"]].shape[0]
56+
        return self.create_result(
57+
            success=invalid == 0,
58+
            observed=invalid,
59+
            expected=0
60+
        )
61+
```
62+
63+
## File Location
64+
65+
Place custom rules in `egon_validation/rules/custom/`. They are auto-discovered on import.
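
Decorator-based discovery of this kind generally works because importing a module executes its `@register(...)` decorators. The following is a generic sketch of that pattern, not egon-validation's actual implementation: `REGISTRY`, this simplified `register`, and `import_submodules` are hypothetical stand-ins written for illustration.

```python
import importlib
import pkgutil

# Hypothetical stand-in for a rule registry: rule_id -> (class, params).
REGISTRY = {}


def register(rule_id, **params):
    """Class decorator that records a rule class under its rule_id."""
    def decorator(cls):
        REGISTRY[rule_id] = (cls, params)
        return cls
    return decorator


def import_submodules(package_name):
    """Import every module below a package so its decorators run."""
    package = importlib.import_module(package_name)
    imported = []
    prefix = package.__name__ + "."
    for info in pkgutil.walk_packages(package.__path__, prefix):
        importlib.import_module(info.name)
        imported.append(info.name)
    return imported


# Defining (or importing) a decorated class is enough to register it.
@register("POSITIVE_VALUES", column="amount")
class PositiveValuesCheck:
    pass


# In a real project the package would be something like:
# import_submodules("egon_validation.rules.custom")
```

Under this pattern a rule file that is never imported is never registered, so a rule missing from a run is often a sign that its module sits outside the scanned package.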

docs/index.md

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
# eGon Validation - User Guide
2+
3+
SQL-first validation framework for PostgreSQL/PostGIS databases.
4+
5+
## Documentation
6+
7+
- [Installation & Configuration](installation.md)
8+
- [CLI Reference](cli.md)
9+
- [Built-in Rules](rules.md)
10+
- [Custom Rules](custom-rules.md)
11+
- [Pipeline Integration](pipeline-integration.md)
12+
13+
## Quick Example
14+
15+
```bash
16+
# Set database connection
17+
export DB_URL="postgresql://user:pass@host:5432/db"
18+
19+
# Run validation
20+
egon-validation run-task --run-id my-run --task validation-test
21+
22+
# Generate report
23+
egon-validation final-report --run-id my-run
24+
```
25+
26+
Output: `validation_runs/my-run/final/report.html`
