
Added Germany PV & GFS data download pipelines #124

Draft
Sharkyii wants to merge 10 commits into openclimatefix:main from Sharkyii:data/Germany

Conversation

@Sharkyii

Description

This PR adds automated data download pipelines for Germany, including:

  • Solar PV generation data using the Bundesnetzagentur SMARD API (15-minute resolution); see the sketch below
  • GFS meteorological data download notebook for Germany using NOAA public datasets
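For reviewers unfamiliar with SMARD, here is a minimal sketch of the kind of request the first bullet describes. The chart_data URL pattern, the filter ID 4068 (realised photovoltaic generation), region "DE", and resolution "quarterhour" are assumptions about SMARD's public API, not necessarily what this PR's script does:

import requests
import pandas as pd

BASE = "https://www.smard.de/app/chart_data"
FILTER_ID = 4068            # assumed SMARD filter ID: realised PV generation
REGION = "DE"
RESOLUTION = "quarterhour"  # 15-minute values

def download_pv_generation(n_chunks: int = 4) -> pd.DataFrame:
    # The index endpoint lists the start timestamps (epoch ms) of available data chunks.
    index_url = f"{BASE}/{FILTER_ID}/{REGION}/index_{RESOLUTION}.json"
    timestamps = requests.get(index_url, timeout=30).json()["timestamps"]
    frames = []
    for ts in timestamps[-n_chunks:]:  # only the most recent chunks, for brevity
        data_url = f"{BASE}/{FILTER_ID}/{REGION}/{FILTER_ID}_{REGION}_{RESOLUTION}_{ts}.json"
        series = requests.get(data_url, timeout=30).json()["series"]  # [[epoch_ms, MW], ...]
        df = pd.DataFrame(series, columns=["time_ms", "pv_mw"])
        df["time"] = pd.to_datetime(df["time_ms"], unit="ms", utc=True)
        frames.append(df[["time", "pv_mw"]])
    return pd.concat(frames, ignore_index=True).dropna()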

I will proceed with Step 5 (Model Training Pipeline Integration).

Relates #121

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

@Sharkyii Sharkyii marked this pull request as ready for review January 22, 2026 17:44
@Sharkyii Sharkyii marked this pull request as draft January 22, 2026 17:44
@Sharkyii
Author

@peterdudfield So far I have downloaded one year of data and uploaded the generation data for Germany here:
Hugging Face: https://huggingface.co/datasets/Shark26/germany_pv_data
S3: s3://germany_pv_data
I will soon test the data by training the model.
Could you please check it once, @siddharth7113?
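For anyone who wants to pull the uploaded files for a quick look, a minimal sketch using huggingface_hub (the on-disk format of the dataset is not stated in this thread, so this only fetches the raw files):

from huggingface_hub import snapshot_download

# Download the raw dataset files from the Hub into the local cache and print the path.
local_dir = snapshot_download(repo_id="Shark26/germany_pv_data", repo_type="dataset")
print(local_dir)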

Removed hardcoded zarr paths for GSP and GFS data in Germany configuration.
@Sharkyii
Author

Sharkyii commented Feb 3, 2026

@peterdudfield @siddharth7113
I added scripts to download the data, but I am still not able to run save_samples and train the model, because of this error:

python save_samples.py +datamodule.sample_output_dir="./samples" +datamodule.num_train_samples=10 +datamodule.num_val_samples=5

CONFIG
├── trainer
│ └── _target_: lightning.pytorch.trainer.trainer.Trainer
│ accelerator: auto
│ devices: auto
│ min_epochs: null
│ max_epochs: null
│ reload_dataloaders_every_n_epochs: 0
│ num_sanity_val_steps: 8
│ fast_dev_run: false
│ accumulate_grad_batches: 4
│ log_every_n_steps: 50

├── model
│ └── _target_: pvnet.models.multimodal.multimodal.Model
│ output_quantiles:
│ - 0.02
│ - 0.1
│ - 0.25
│ - 0.5
│ - 0.75
│ - 0.9
│ - 0.98
│ nwp_encoders_dict:
│ gfs:
│ _target_: pvnet.models.multimodal.encoders.encoders3d.ResConv3DNet2
│ _partial_: true
│ in_channels: 14
│ out_features: 32
│ n_res_blocks: 1
│ hidden_channels: 6
│ image_size_pixels: 2
│ output_network:
│ _target_: pvnet.models.multimodal.linear_networks.networks.ResFCNet2
│ _partial_: true
│ fc_hidden_features: 128
│ n_res_blocks: 6
│ res_block_layers: 2
│ dropout_frac: 0.0
│ embedding_dim: 16
│ include_sun: true
│ include_gsp_yield_history: false
│ include_site_yield_history: false
│ forecast_minutes: 480
│ history_minutes: 60
│ nwp_history_minutes:
│ gfs: 180
│ nwp_forecast_minutes:
│ gfs: 540
│ nwp_interval_minutes:
│ gfs: 180
│ optimizer:
│ _target_: pvnet.optimizers.EmbAdamWReduceLROnPlateau
│ lr: 0.0001
│ weight_decay: 0.01
│ amsgrad: true
│ patience: 5
│ factor: 0.1
│ threshold: 0.002

├── datamodule
│ └── _target_: pvnet.data.DataModule
│ configuration: C:\Users\SNEH\open-data-pvnet\src\open_data_pvnet\configs\PVNet_configs\datamodule\configuration\germany_configuration.yaml
│ num_workers: 8
│ prefetch_factor: 2
│ batch_size: 8
│ train_period:
│ - null
│ - '2023-06-30'
│ val_period:
│ - '2023-07-01'
│ - '2023-12-31'
│ sample_output_dir: ./samples
│ num_train_samples: 10
│ num_val_samples: 5

├── callbacks
│ └── early_stopping:
│ _target_: lightning.pytorch.callbacks.EarlyStopping
│ monitor: ${resolve_monitor_loss:${model.output_quantiles}}
│ mode: min
│ patience: 10
│ min_delta: 0
│ learning_rate_monitor:
│ _target_: lightning.pytorch.callbacks.LearningRateMonitor
│ logging_interval: epoch
│ model_summary:
│ _target_: lightning.pytorch.callbacks.ModelSummary
│ max_depth: 3
│ model_checkpoint:
│ _target_: lightning.pytorch.callbacks.ModelCheckpoint
│ monitor: ${resolve_monitor_loss:${model.output_quantiles}}
│ mode: min
│ save_top_k: 1
│ save_last: true
│ every_n_epochs: 1
│ verbose: false
│ filename: epoch={epoch}-step={step}
│ dirpath: PLACEHOLDER/${model_name}
│ auto_insert_metric_name: false
│ save_on_train_epoch_end: false

├── logger
│ └── wandb:
│ _target_: lightning.pytorch.loggers.wandb.WandbLogger
│ project: GFS_TEST_RUN
│ name: ${model_name}
│ save_dir: PLACEHOLDER
│ offline: false
│ id: ${oc.env:WANDB_RUN_ID}
│ log_model: true
│ prefix: ''
│ job_type: train
│ group: ''
│ tags: []

└── seed
└── 2727831
----- Saving val samples -----
Error executing job with overrides: ['+datamodule.sample_output_dir=./samples', '+datamodule.num_train_samples=10', '+datamodule.num_val_samples=5']
Traceback (most recent call last):
File "C:\Users\open-data-pvnet\src\open_data_pvnet\scripts\save_samples.py", line 171, in main
val_dataset = get_dataset(
^^^^^^^^^^^^
File "C:\Users\open-data-pvne\src\open_data_pvnet\scripts\save_samples.py", line 106, in get_dataset
return dataset_cls(config_path, start_time=start_time, end_time=end_time)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\SNEH\AppData\Local\Programs\Python\Python312\Lib\site-packages\ocf_data_sampler\torch_datasets\datasets\pvnet_uk.py", line 254, in init
super().init(config_filename, start_time, end_time, gsp_ids)
File "C:\Users\SNEH\AppData\Local\Programs\Python\Python312\Lib\site-packages\ocf_data_sampler\torch_datasets\datasets\pvnet_uk.py", line 100, in init
datasets_dict = get_dataset_dict(config.input_data, gsp_ids=gsp_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\SNEH\AppData\Local\Programs\Python\Python312\Lib\site-packages\ocf_data_sampler\load\load_dataset.py", line 24, in get_dataset_dict
da_gsp = open_gsp(
^^^^^^^^^
File "C:\Users\SNEH\AppData\Local\Programs\Python\Python312\Lib\site-packages\ocf_data_sampler\load\gsp.py", line 60, in open_gsp
raise ValueError(
ValueError: Some GSP IDs in the GSP generation data are not available in the locations file.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
PS C:\Users\open-data-pvnet\src\open_data_pvnet
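The ValueError means the gsp_id values in the generation data and the locations file disagree. A quick way to list the offending IDs, with hypothetical paths standing in for the ones configured in germany_configuration.yaml (and assuming the locations file is a CSV with a gsp_id column):

import pandas as pd
import xarray as xr

# Hypothetical paths: substitute the GSP zarr_path and locations file from germany_configuration.yaml.
gen = xr.open_zarr("germany_pv_data.zarr")
locations = pd.read_csv("germany_gsp_locations.csv")

gen_ids = set(gen["gsp_id"].values.tolist())
loc_ids = set(locations["gsp_id"].tolist())

# Any IDs printed here must be added to the locations file, or dropped from the
# generation data, before save_samples.py can build its dataset.
print("In generation data but missing from locations:", sorted(gen_ids - loc_ids))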

Updated zarr_path to an empty string for flexibility.
Refactor process_grib function to handle multiple levels and improve error handling.
Contributor

We only put notebooks in this directory. For the solar data, I would recommend using scripts/nwp.

Author

Sure, will modify it.

Contributor

Please avoid using emojis in the code; also, this file is missing its extension.

@Sharkyii
Author

@siddharth7113 how could I overcome the above error? Should I use my own model temporarily?

@siddharth7113
Contributor

Hi @Sharkyii ,

I have left some comments on current code.

My advice is to do the following things in separate PRs, preferably in this order:

  1. Write a script to download the data; make sure the script is clean, with docstrings and tests (something that has also been missing from the old scripts), and put it in scripts/generation (see the sketch after this list).

  2. Once this is done, I would recommend getting the NWP data using the GFS script and then writing the model configs. Test these and open a draft PR for them.
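As a rough illustration of the docstrings-and-tests shape point 1 asks for, with invented file and function names (nothing here is prescribed by the repo):

# scripts/generation/download_germany_pv.py  (hypothetical module)
import pandas as pd

def save_generation(df: pd.DataFrame, output_path: str) -> None:
    """Write a PV generation dataframe to CSV at the given output path."""
    df.to_csv(output_path, index=False)

# tests/test_download_germany_pv.py  (hypothetical pytest module)
def test_save_generation(tmp_path):
    df = pd.DataFrame({"time": ["2023-01-01T00:00"], "pv_mw": [0.0]})
    out = tmp_path / "pv.csv"
    save_generation(df, str(out))
    assert out.exists()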

@siddharth7113
Contributor

> @peterdudfield @siddharth7113 I added scripts to download the data, but I am still not able to run save_samples and train the model, because of this error [...]

save_samples.py is for making samples once the dataset is downloaded and in a format compatible with ocf-data-sampler; it is used to create samples in the format expected for PVNet training.

@siddharth7113
Contributor

> @siddharth7113 how could I overcome the above error? Should I use my own model temporarily?

refer to this
