A robust, production-style ETL (Extract, Transform, Load) pipeline for processing IDA ICE building energy simulation data. This pipeline extracts simulation outputs, standardizes and validates them, and loads them into an analytics-ready star schema (DuckDB / Parquet) for downstream HVAC analytics.
- Synthetic Data Generation: Generate realistic IDA ICE simulation data for testing
- Mock Cloud API: Simulates IDA ICE Cloud API for data extraction
- Star Schema Design: Organized dimension and fact tables for analytics
- Data Validation: Comprehensive quality checks including:
  - Schema validation
  - Value range validation
  - Temporal coverage checks
  - Energy plausibility checks
- Multiple Storage Formats: Exports to both Parquet files and DuckDB database
- Analytical Views: Pre-built views for common analyses
```
.
├── data/
│   ├── raw/
│   │   └── simulations/              # Simulation ZIP files
│   └── processed/
│       ├── parquet/                  # Parquet format output
│       └── duckdb/                   # DuckDB database
├── src/
│   ├── generate_synthetic_idaice.py  # Synthetic data generator
│   ├── idaice_client.py              # Mock IDA ICE Cloud API
│   └── etl/
│       ├── extract/                  # Data extraction module
│       ├── transform/                # Star schema transformation
│       ├── load/                     # Parquet & DuckDB loading
│       └── validate/                 # Data validation
├── notebooks/                        # Jupyter notebooks for analysis
├── tests/                            # Unit tests
├── run_pipeline.py                   # Main pipeline runner
├── requirements.txt                  # Python dependencies
└── README.md
```
Get started in just a few commands:
```bash
# Install dependencies and generate sample data
make all

# Run the pipeline
make pipeline
```

Or run the steps manually:

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Generate sample data
python src/generate_synthetic_idaice.py

# 3. Run the pipeline
python run_pipeline.py
```

That's it! Your data is now processed and ready for analysis.
Run the full pipeline and export data for the frontend demo:
```bash
# 1. Set up and activate a virtual environment
python -m venv .venv
source .venv/bin/activate

# 2. Install dependencies and generate sample data
make all

# 3. Run the ETL pipeline
make pipeline

# 4. Export canonical JSON for the frontend
make export-json

# 5. Validate the JSON schema
make validate-json
```

Artifact paths:
| Output | Path | Size |
|---|---|---|
| Raw simulation ZIPs | `data/raw/simulations/run_*/` | ~50 KB each |
| Parquet tables | `data/processed/parquet/*.parquet` | ~100 KB total |
| DuckDB database | `data/processed/duckdb/simulations.duckdb` | ~200 KB |
| Frontend JSON | `artifacts/json/ida_ice_simulation_summary.json` | ~2 KB |
Sync to frontend demo:
```bash
cd ../energy-pipeline-demo && ./sync-data.sh
```

Common Make targets:

```bash
# Generate sample data (one-line command)
make sample-data

# Run tests
make test

# Clean up generated files
make clean

# See all available commands
make help
```

Manual setup:

```bash
# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt
```

Generate realistic IDA ICE simulation data with a single command:
```bash
python src/generate_synthetic_idaice.py
```

This generates 6 simulation runs (3 buildings × 2 scenarios) with:
- 168 hours (7 days) of hourly data
- Zone conditions (temperature, CO₂, humidity)
- HVAC system data (power, heating, cooling)
- Building meters (electric, heating, cooling energy)
- Weather data (temperature, humidity, solar)
Custom generation options:
```bash
# Generate data for specific buildings and scenarios
python src/generate_synthetic_idaice.py --buildings BLDG_01 BLDG_02 --scenarios BASE

# Change duration and start date
python src/generate_synthetic_idaice.py --days 14 --start-date "2024-06-01 00:00"

# Custom output directory
python src/generate_synthetic_idaice.py --output data/custom/simulations
```

Run the pipeline:

```bash
# Run full pipeline with validation
python run_pipeline.py

# Skip validation (faster)
python run_pipeline.py --skip-validation

# Custom paths
python run_pipeline.py --simulations data/raw/simulations \
    --parquet data/output/parquet \
    --duckdb data/output/db/simulations.duckdb
```

- `dim_building`: Building metadata (ID, name, location, floor area)
- `dim_scenario`: Scenario descriptions (baseline, retrofit, etc.)
- `dim_zone`: HVAC zones (zone ID, name, building)
- `dim_ahu`: Air handling units (AHU ID, building)
- `dim_time`: Time dimension (timestamp, year, month, day, hour, day of week)
- `fact_zone_conditions`: Zone-level conditions (temperature, CO₂, humidity)
  - Grain: one row per (timestamp, zone, scenario)
- `fact_hvac`: HVAC system performance (power, heating, cooling, COP proxy)
  - Grain: one row per (timestamp, AHU, scenario)
- `fact_meters`: Building-level energy meters (electric, heating, cooling)
  - Grain: one row per (timestamp, building, scenario)
- `fact_weather`: Weather conditions (outdoor temp, humidity, solar)
  - Grain: one row per (timestamp, building)
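Each grain can be spot-checked after a load. A minimal sketch using DuckDB's Python API (the column names `timestamp`, `zone_id`, and `scenario_id` are assumptions based on the schema description above; the database path is the default pipeline output):

```python
import duckdb

# Open the pipeline's DuckDB output; read-only is enough for checks
con = duckdb.connect('data/processed/duckdb/simulations.duckdb', read_only=True)

# The grain of fact_zone_conditions is (timestamp, zone, scenario):
# any group with more than one row violates it.
dupes = con.execute("""
    SELECT timestamp, zone_id, scenario_id, COUNT(*) AS n
    FROM fact_zone_conditions
    GROUP BY timestamp, zone_id, scenario_id
    HAVING COUNT(*) > 1
""").fetchdf()

assert dupes.empty, f"grain violated for {len(dupes)} groups"
```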
Pre-built DuckDB views for common analyses:
- `vw_zone_with_weather`: Zone conditions joined with weather data
- `vw_hvac_with_meters`: HVAC performance with building meters and weather
- `vw_energy_summary`: Energy consumption summary by building and scenario
```python
from src.etl.extract import extract_runs
from src.etl.transform import transform_all
from src.etl.load import load_to_duckdb, query_duckdb
from src.etl.validate import validate_all, print_validation_results

# Extract simulation runs
runs = extract_runs('data/raw/simulations')

# Transform to star schema
schema = transform_all(runs)

# Validate data quality
results = validate_all(schema)
print_validation_results(results)

# Load to DuckDB
load_to_duckdb(schema, 'data/processed/duckdb/simulations.duckdb')

# Query the database
df = query_duckdb('data/processed/duckdb/simulations.duckdb', '''
    SELECT building_id, scenario_id,
           SUM(electric_kwh) AS total_electric,
           SUM(heating_kwh) AS total_heating
    FROM vw_energy_summary
    GROUP BY building_id, scenario_id
''')
print(df)
```

Example SQL queries:

```sql
-- Compare energy use between scenarios
SELECT
SELECT
building_id,
scenario_id,
total_electric_kwh,
total_heating_kwh,
total_cooling_kwh
FROM vw_energy_summary
ORDER BY building_id, scenario_id;
-- Zone temperature drift analysis
SELECT
building_id,
zone_id,
scenario_id,
COUNT(*) as total_hours,
SUM(CASE WHEN ABS(temp_deviation) > 1.0 THEN 1 ELSE 0 END) as hours_out_of_bounds,
AVG(temp_deviation) as avg_deviation
FROM vw_zone_with_weather
GROUP BY building_id, zone_id, scenario_id;
-- COP proxy vs outdoor temperature
SELECT
ROUND(outdoor_temp_C, 0) as temp_bin,
AVG(cop_proxy) as avg_cop,
COUNT(*) as num_observations
FROM vw_hvac_with_meters
WHERE power_kw > 0
GROUP BY temp_bin
ORDER BY temp_bin;
```

The pipeline includes comprehensive validation checks:
- Schema validation:
  - Ensures all required tables and columns exist
  - Checks for null values in critical columns
- Value range validation:
  - Zone air temperature: 10-35°C
  - CO₂ levels: 400-2500 ppm
  - Relative humidity: 0-100%
  - Outdoor temperature: -30 to 40°C
  - All energy values: non-negative
- Temporal coverage:
  - Verifies hourly time intervals
  - Checks for missing time periods
  - Ensures all fact tables cover the full time range
- Energy plausibility:
  - Total energy consumption > 0
  - Electric energy reasonable relative to heating/cooling
  - Basic energy balance checks
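As an illustration of the value-range checks, here is a minimal pandas sketch (the column names are assumptions for illustration; the actual checks live in `src/etl/validate/`):

```python
import pandas as pd

# Illustrative bounds taken from the list above; column names are assumed
BOUNDS = {
    'temp_air_C': (10.0, 35.0),        # zone air temperature, °C
    'co2_ppm': (400.0, 2500.0),        # CO₂ concentration, ppm
    'rel_humidity_pct': (0.0, 100.0),  # relative humidity, %
}

def check_ranges(df: pd.DataFrame, bounds: dict) -> dict:
    """Return, per column, how many values fall outside the allowed range."""
    violations = {}
    for col, (lo, hi) in bounds.items():
        if col in df.columns:
            violations[col] = int(((df[col] < lo) | (df[col] > hi)).sum())
    return violations
```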
```bash
# Run all tests using make
make test

# Or use unittest directly
python -m unittest discover tests/ -v

# Run specific test file
python -m unittest tests/test_extract.py -v
```

Code quality tools:

```bash
# Format code with black
make format

# Run linter
make lint

# Type checking (manual)
mypy src/
```

Run `make help` to see all available commands:
- Setup: `make install`, `make all`
- Data Generation: `make sample-data`
- Pipeline: `make pipeline`, `make pipeline-fast`
- Testing: `make test`, `make lint`, `make format`
- Cleanup: `make clean`, `make clean-all`
- Export: `make export-json`
Export processed energy data to JSON for integration with the energy-pipeline-demo frontend:
```bash
# Create and activate virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt
```

Then run the export:

```bash
# Export energy data to JSON
make export-json
# Output: artifacts/json/ida_ice_simulation_summary.json (canonical)
```

The canonical export `ida_ice_simulation_summary.json` provides a rich, production-ready schema:
```json
{
  "pipeline": "ida_ice_energy_simulation",
  "generated_at": "2025-11-28T12:00:00Z",
  "scenario": {
    "name": "Baseline Energy Analysis",
    "building_type": "Office Complex",
    "location": "Tallinn, Estonia",
    "floor_area_m2": 2170,
    "simulation_period": "2024-01-01 to 2024-12-31"
  },
  "annual": {
    "total_kwh": 185000,
    "heating_kwh": 92000,
    "cooling_kwh": 18500,
    "lighting_kwh": 32400,
    "equipment_kwh": 42100
  },
  "monthly_breakdown": [
    { "month": "Jan", "heating_kwh": 16500, "cooling_kwh": 0, "total_kwh": 22300 },
    ...
  ],
  "kpis": {
    "energy_intensity_kwh_m2": 85.2,
    "heating_intensity_kwh_m2": 42.4,
    "cooling_intensity_kwh_m2": 8.5,
    "annual_co2_tons": 42.3,
    "co2_intensity_kg_m2": 19.5,
    "peak_demand_kw": 156.8,
    "comfort_hours_percent": 94.2
  }
}
```

| Section | Fields | Description |
|---|---|---|
| `pipeline` | string | Pipeline identifier |
| `generated_at` | ISO 8601 UTC | Export timestamp |
| `scenario` | object | Simulation scenario metadata |
| `annual` | object | Annual energy totals by category |
| `monthly_breakdown` | array | Monthly energy breakdown |
| `kpis` | object | Key performance indicators |
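A minimal sketch of consuming the export in Python, with a few sanity checks against the schema above (the checks are illustrative, not part of the pipeline; a full-year export is assumed, as in the sample):

```python
import json

# Load the canonical export produced by `make export-json`
with open('artifacts/json/ida_ice_simulation_summary.json') as f:
    summary = json.load(f)

# Illustrative sanity checks against the documented schema
assert summary['pipeline'] == 'ida_ice_energy_simulation'
assert len(summary['monthly_breakdown']) == 12  # assumes a full-year run

# Recompute a KPI from the raw totals and compare with the reported value
intensity = summary['annual']['total_kwh'] / summary['scenario']['floor_area_m2']
print(f"Energy intensity: {intensity:.1f} kWh/m² "
      f"(reported: {summary['kpis']['energy_intensity_kwh_m2']})")
```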
After export, copy to the frontend demo:
```bash
# From workspace root
cp IDA-ICE-Energy-Simulation-ETL-Pipeline/artifacts/json/ida_ice_simulation_summary.json \
   energy-pipeline-demo/public/data/

# Or use the sync script
cd energy-pipeline-demo && ./sync-data.sh
```

This pipeline is designed for easy transition to production databases:
The canonical JSON schema maps directly to a star schema:
| JSON Section | Database Table | Notes |
|---|---|---|
| `scenario` | `dim_scenarios` | Dimension table for simulation metadata |
| `annual` | `fact_annual_energy` | Annual aggregates fact table |
| `monthly_breakdown` | `fact_monthly_energy` | Monthly time-series fact table |
| `kpis` | `fact_kpis` | Computed KPI metrics |
```sql
-- Dimension table for scenarios
CREATE TABLE dim_scenarios (
    scenario_id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    building_type VARCHAR(100),
    location VARCHAR(255),
    floor_area_m2 NUMERIC,
    simulation_start DATE,
    simulation_end DATE
);

-- Fact table for monthly energy
CREATE TABLE fact_monthly_energy (
    id SERIAL PRIMARY KEY,
    scenario_id INT REFERENCES dim_scenarios(scenario_id),
    month_num INT,
    month_name VARCHAR(3),
    heating_kwh NUMERIC,
    cooling_kwh NUMERIC,
    lighting_kwh NUMERIC,
    equipment_kwh NUMERIC,
    total_kwh NUMERIC
);

-- Fact table for KPIs
CREATE TABLE fact_kpis (
    id SERIAL PRIMARY KEY,
    scenario_id INT REFERENCES dim_scenarios(scenario_id),
    energy_intensity_kwh_m2 NUMERIC,
    annual_co2_tons NUMERIC,
    peak_demand_kw NUMERIC,
    comfort_hours_percent NUMERIC,
    calculated_at TIMESTAMPTZ DEFAULT NOW()
);
```

Suggested path from demo to production:

- Development: Use JSON exports via `make export-json`
- Staging: Add a database loader using the same schema structure
- Production: Integrate with DuckDB/Parquet for analytics, write summaries to a relational DB

The star schema design in this pipeline (`dim_building`, `fact_meters`, etc.) already mirrors production patterns.
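As a sketch of the staging step, the monthly breakdown could be loaded into `fact_monthly_energy` roughly like this (assuming psycopg2 and a reachable PostgreSQL instance; the DSN is a placeholder, and the scenario dates are omitted for brevity):

```python
import json
import psycopg2  # assumed driver; any DB-API client works similarly

with open('artifacts/json/ida_ice_simulation_summary.json') as f:
    summary = json.load(f)

conn = psycopg2.connect('dbname=energy user=etl')  # placeholder DSN
with conn, conn.cursor() as cur:
    # Insert the scenario and fetch its surrogate key
    s = summary['scenario']
    cur.execute(
        "INSERT INTO dim_scenarios (name, building_type, location, floor_area_m2) "
        "VALUES (%s, %s, %s, %s) RETURNING scenario_id",
        (s['name'], s['building_type'], s['location'], s['floor_area_m2']),
    )
    scenario_id = cur.fetchone()[0]

    # One row per month into the fact table
    cur.executemany(
        "INSERT INTO fact_monthly_energy "
        "(scenario_id, month_num, month_name, heating_kwh, cooling_kwh, total_kwh) "
        "VALUES (%s, %s, %s, %s, %s, %s)",
        [(scenario_id, i + 1, m['month'], m['heating_kwh'],
          m['cooling_kwh'], m['total_kwh'])
         for i, m in enumerate(summary['monthly_breakdown'])],
    )
conn.close()
```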
Each ZIP contains:
- `metadata.json`: Building and scenario information
- `zones.csv`: Zone-level conditions (hourly)
- `hvac.csv`: HVAC system data (hourly)
- `meters.csv`: Building energy meters (hourly)
- `weather.csv`: Weather conditions (hourly)
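A minimal sketch of reading one table directly from a run ZIP with the standard library and pandas (the ZIP file name is a placeholder; actual run names depend on the generator output):

```python
import zipfile
import pandas as pd

# Placeholder path: actual run names depend on the generator
zip_path = 'data/raw/simulations/run_BLDG_01_BASE.zip'

with zipfile.ZipFile(zip_path) as zf:
    with zf.open('zones.csv') as f:
        zones = pd.read_csv(f)  # hourly zone-level conditions

print(zones.head())
```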
- Parquet files: One file per table in `data/processed/parquet/`
- DuckDB database: Single file `simulations.duckdb` with all tables and views
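Either output can be queried directly; for example, reading a Parquet table with pandas (the file name assumes one file per table, named after the table):

```python
import pandas as pd

# Assumes one Parquet file per table, named after the table
meters = pd.read_parquet('data/processed/parquet/fact_meters.parquet')
print(meters.columns.tolist())
```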
Common analyses supported by this pipeline:
- Zone Temperature Drift: Identify hours where zones deviate from setpoint
- HVAC Energy Performance: Analyze COP proxy vs outdoor conditions
- Scenario Comparison: Compare energy use between baseline and retrofit
- Occupancy Patterns: Analyze energy use during occupied vs unoccupied hours
- Peak Demand Analysis: Identify peak electric and thermal loads
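For instance, a peak demand analysis might look like the following sketch, using the pipeline's `query_duckdb` helper (the `electric_kwh` column on hourly meter data is an assumption; an hourly kWh reading approximates average kW over that hour):

```python
from src.etl.load import query_duckdb

# Peak hourly electric load per building and scenario.
# Assumes fact_meters holds hourly electric_kwh values, so each
# reading approximates average kW over that hour.
df = query_duckdb('data/processed/duckdb/simulations.duckdb', '''
    SELECT building_id, scenario_id,
           MAX(electric_kwh) AS peak_electric_kw
    FROM fact_meters
    GROUP BY building_id, scenario_id
    ORDER BY peak_electric_kw DESC
''')
print(df)
```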
- Python 3.12+
- pandas >= 2.0.0
- numpy >= 1.24.0
- duckdb >= 0.9.0
- pyarrow >= 14.0.0
See `requirements.txt` for the full list of dependencies.
MIT License - see LICENSE file for details.
Contributions are welcome! Feel free to fork and submit pull requests.
For questions or issues, please open an issue in the repository.