Commit 5412b7c

docs: address PR feedback for new country guide
1 parent 33f2aa5 commit 5412b7c

1 file changed

+22
-239
lines changed

docs/training_model_new_country.md

Lines changed: 22 additions & 239 deletions
@@ -83,52 +83,32 @@ Generation data represents the actual solar PV power output for your target coun
8383
3. **Manual Data Collection**
8484
- Download from national energy/grid operator websites
8585
- Use data from research institutions or universities
86-
- Format: CSV, JSON, or database exports
86+
- **Format**: Must be converted to Zarr format
8787

88-
### Data Requirements
88+
### Generation Data Schema
8989

90-
- **Time Resolution**: 30-minute intervals (recommended) or hourly
91-
- **Coverage**: National or regional level
92-
- **Time Period**: At least 1 year of historical data (more is better)
93-
- **Format**: Should be convertible to Zarr format
90+
The data **must** be saved in Zarr format with the following schema:
9491

95-
### Converting to Zarr Format
92+
- **Dimensions**: `(time_utc, location_id)`
93+
- **Data Variables**:
94+
- `generation_mw`: (float32) Generation in MW
95+
- `capacity_mwp`: (float32) Capacity in MW peak
96+
- **Coordinates**:
97+
- `time_utc`: (datetime64[ns]) Time in UTC
98+
- `location_id`: (int) Unique identifier for each location
99+
- `longitude`: (float) Longitude of the location
100+
- `latitude`: (float) Latitude of the location
96101

97-
Once you have generation data, convert it to Zarr format:
102+
> [!TIP]
103+
> If `capacity_mwp` is not available, you can approximate it by taking the maximum generation value observed for that location over the time period.
98104
99-
```python
100-
import pandas as pd
101-
import xarray as xr
102-
import zarr
103-
104-
# Load your data (example with CSV)
105-
df = pd.read_csv('generation_data.csv', parse_dates=['datetime'])
106-
107-
# Convert to xarray Dataset
108-
# See https://github.com/openclimatefix/ocf-data-sampler/blob/main/ocf_data_sampler/load/generation.py
109-
ds = xr.Dataset(
110-
{
111-
'generation': (['time', 'location_id'], df.pivot_table(
112-
index='datetime',
113-
columns='location_id',
114-
values='generation'
115-
).values)
116-
},
117-
coords={
118-
'time': df['datetime'].unique(),
119-
'location_id': df['location_id'].unique()
120-
}
121-
)
122-
123-
# Save to Zarr
124-
ds.to_zarr('generation_data.zarr', mode='w')
125-
```
105+
For reference on how generation data is loaded, see [ocf-data-sampler/load/generation.py](https://github.com/openclimatefix/ocf-data-sampler/blob/main/ocf_data_sampler/load/generation.py).
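
The schema added in this hunk can be illustrated with a minimal sketch. It assumes long-format input records and a separate table of site coordinates. The column layout of both inputs, the function name `build_generation_dataset`, and the choice to give `capacity_mwp` both dimensions are assumptions made for illustration and are not confirmed by the guide or by ocf-data-sampler. Capacity is approximated from the maximum observed generation, as the tip above suggests.

```python
import numpy as np
import pandas as pd
import xarray as xr


def build_generation_dataset(df: pd.DataFrame, locations: pd.DataFrame) -> xr.Dataset:
    """Pivot long-format records into the (time_utc, location_id) schema.

    Hypothetical inputs:
      df columns:        time_utc (tz-naive UTC), location_id, generation_mw
      locations columns: location_id, longitude, latitude
    """
    # One row per timestamp, one column per location; both axes come out sorted.
    wide = df.pivot_table(index="time_utc", columns="location_id", values="generation_mw")
    # Reorder the site table so it lines up with the pivoted columns.
    locations = locations.set_index("location_id").loc[wide.columns]

    generation = wide.to_numpy(dtype="float32")
    # capacity_mwp is not available here, so approximate it with each location's
    # maximum observed generation, repeated along the time axis.
    capacity = np.broadcast_to(np.nanmax(generation, axis=0), generation.shape)

    return xr.Dataset(
        data_vars={
            "generation_mw": (("time_utc", "location_id"), generation),
            "capacity_mwp": (("time_utc", "location_id"), capacity.astype("float32")),
        },
        coords={
            "time_utc": wide.index.to_numpy().astype("datetime64[ns]"),
            "location_id": wide.columns.to_numpy().astype("int64"),
            "longitude": ("location_id", locations["longitude"].to_numpy(dtype="float64")),
            "latitude": ("location_id", locations["latitude"].to_numpy(dtype="float64")),
        },
    )


# Tiny made-up example: two sites, four half-hourly timestamps.
times = pd.date_range("2023-01-01", periods=4, freq="30min")
records = pd.DataFrame(
    {
        "time_utc": times.tolist() * 2,
        "location_id": np.repeat([1, 2], len(times)),
        "generation_mw": [0.0, 1.5, 3.0, 2.0, 0.0, 4.0, 8.0, 6.0],
    }
)
sites = pd.DataFrame(
    {"location_id": [1, 2], "longitude": [13.4, 11.6], "latitude": [52.5, 48.1]}
)

ds = build_generation_dataset(records, sites)
# To persist it (requires the zarr package), something like:
# ds.to_zarr("./data/germany/generation/2023.zarr", mode="w")
```

The maximum-generation fallback tends to underestimate capacity for sites that never reach full output in the period covered, so measured capacity is preferable when it exists.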
126106

127107
### Storage Location
128108

129109
Store your generation data in a location accessible to the training pipeline:
130110
- **Local**: `./data/{country}/generation/{year}.zarr`
131-
- **S3**: `s3://ocf-open-data-pvnet/data/{country}/generation/{year}.zarr`
111+
- **S3**: `s3://ocf-open-data-pvnet/data/{country}/generation/{year}.zarr` (Contact @peterdudfield to upload data here after model verification)
132112
- **Hugging Face**: Upload to a Hugging Face dataset
133113

134114
---
@@ -144,42 +124,6 @@ Numerical Weather Prediction (NWP) data provides weather forecasts that the mode
144124
- **Good Resolution**: 0.25° (~25km) resolution
145125
- **Multiple Variables**: Includes all necessary weather parameters
146126

147-
### Downloading GFS Data
148-
149-
#### Option A: Using the CLI (Recommended)
150-
151-
```bash
152-
# Download GFS data for a specific date range
153-
# Note: GFS data is available globally, so no region parameter needed
154-
open-data-pvnet gfs archive --year 2023 --month 1 --day 1
155-
```
156-
157-
#### Option B: Direct from AWS S3
158-
159-
GFS data is available on AWS S3:
160-
161-
```bash
162-
# List available data
163-
aws s3 ls s3://noaa-gfs-bdp-pds/gfs.20230101/00/atmos/ --no-sign-request
164-
165-
# Download specific files
166-
aws s3 sync s3://noaa-gfs-bdp-pds/gfs.20230101/00/atmos/ ./gfs_data/ --no-sign-request
167-
```
168-
169-
#### Option C: Using Python
170-
171-
```python
172-
import xarray as xr
173-
import s3fs
174-
175-
# Access GFS data from S3
176-
s3 = s3fs.S3FileSystem(anon=True)
177-
gfs_path = 's3://noaa-gfs-bdp-pds/gfs.20230101/00/atmos/'
178-
179-
# Open dataset
180-
ds = xr.open_dataset(s3.open(gfs_path + 'gfs.t00z.pgrb2.0p25.f000'))
181-
```
182-
183127
### Required GFS Variables
184128

185129
The model needs these weather variables (channels):
@@ -207,9 +151,10 @@ Example:
207151

208152
```python
209153
import xarray as xr
154+
import s3fs
210155

211-
# Load GFS data
212-
gfs_ds = xr.open_zarr('gfs_global.zarr')
156+
# Load GFS data from S3 (no need to download all of it)
157+
gfs_ds = xr.open_zarr('s3://ocf-open-data-pvnet/data/gfs_global.zarr') # Update with actual S3 path if different
213158

214159
# Define bounding box for your country (example: Germany)
215160
lat_min, lat_max = 47.0, 55.0
@@ -243,105 +188,9 @@ Create a new configuration file: `src/open_data_pvnet/configs/PVNet_configs/data
243188

244189
Example for Germany (`germany_configuration.yaml`):
245190

246-
```yaml
247-
general:
248-
description: Configuration for {Country} solar forecasting
249-
name: {country}_config
250-
251-
input_data:
252-
gsp:
253-
# Path to your generation data in zarr format
254-
zarr_path: "s3://ocf-open-data-pvnet/data/{country}/generation/2023.zarr"
255-
# Or local path: "./data/{country}/generation/2023.zarr"
256-
interval_start_minutes: -60 # 1 hour before forecast time
257-
interval_end_minutes: 480 # 8 hours after forecast time
258-
time_resolution_minutes: 30 # Match your data resolution
259-
dropout_timedeltas_minutes: []
260-
dropout_fraction: 0.0
261-
public: True # Set to False if using private S3 bucket
262-
263-
nwp:
264-
gfs:
265-
time_resolution_minutes: 180 # GFS resolution (3 hours)
266-
interval_start_minutes: -180 # 3 hours before
267-
interval_end_minutes: 540 # 9 hours after
268-
dropout_fraction: 0.0
269-
dropout_timedeltas_minutes: []
270-
# Path to your GFS data for the country
271-
zarr_path: "s3://ocf-open-data-pvnet/data/{country}/gfs/2023.zarr"
272-
provider: "gfs"
273-
# Adjust based on your cropped region size
274-
image_size_pixels_height: 32 # Adjust for your country
275-
image_size_pixels_width: 40 # Adjust for your country
276-
public: True
277-
channels:
278-
- dlwrf
279-
- dswrf
280-
- hcc
281-
- mcc
282-
- lcc
283-
- prate
284-
- r
285-
- t
286-
- tcc
287-
- u10
288-
- u100
289-
- v10
290-
- v100
291-
- vis
292-
# Normalization constants (calculate from your data)
293-
# IMPORTANT: You must calculate these from YOUR actual GFS data for your country
294-
# The values below are examples from UK data - replace with your country's values
295-
normalisation_constants:
296-
dlwrf:
297-
mean: 298.342
298-
std: 96.305916
299-
dswrf:
300-
mean: 168.12321
301-
std: 246.18533
302-
hcc:
303-
mean: 35.272
304-
std: 42.525383
305-
lcc:
306-
mean: 43.578342
307-
std: 44.3732
308-
mcc:
309-
mean: 33.738823
310-
std: 43.150745
311-
prate:
312-
mean: 2.8190969e-05
313-
std: 0.00010159573
314-
r:
315-
mean: 18.359747
316-
std: 25.440672
317-
t:
318-
mean: 278.5223
319-
std: 22.825893
320-
tcc:
321-
mean: 66.841606
322-
std: 41.030598
323-
u10:
324-
mean: -0.0022310058
325-
std: 5.470838
326-
u100:
327-
mean: 0.0823025
328-
std: 6.8899174
329-
v10:
330-
mean: 0.06219831
331-
std: 4.7401133
332-
v100:
333-
mean: 0.0797807
334-
std: 6.076132
335-
vis:
336-
mean: 19628.32
337-
std: 8294.022
338-
# Note: Calculate these from your GFS data using the script in section 3.2
339-
340-
solar_position:
341-
interval_start_minutes: -60
342-
interval_end_minutes: 480
343-
time_resolution_minutes: 30
344-
```
191+
Please refer to the [example_configuration.yaml](https://github.com/openclimatefix/open-data-pvnet/blob/main/src/open_data_pvnet/configs/PVNet_configs/datamodule/configuration/example_configuration.yaml) for the most up-to-date structure and zarr path examples.
192+
193+
You can copy this file and adapt it for your country.
345194

346195
### 3.2 Calculate Normalization Constants
347196

@@ -602,73 +451,7 @@ After training your first model:
602451

603452
---
604453

605-
## Example: Complete Workflow for a New Country
606-
607-
This example uses **Germany** as a reference, but the same process applies to other countries like USA, Netherlands, Belgium, or France.
608-
609-
### Step-by-Step Example: Germany
610-
611-
```bash
612-
# 1. Get generation data
613-
# For Germany: Check ENTSO-E Transparency Platform or Bundesnetzagentur
614-
# Save as: ./data/germany/generation/2023.zarr
615-
616-
# 2. Get GFS data for Germany region
617-
open-data-pvnet gfs archive --year 2023 --month 1 --day 1
618-
# Crop to Germany bounding box (lat: 47-55°N, lon: 5-15°E)
619-
# Save as: ./data/germany/gfs/2023.zarr
620-
621-
# 3. Create configuration file
622-
# Create: src/open_data_pvnet/configs/PVNet_configs/datamodule/configuration/germany_configuration.yaml
623-
# Use the template in Step 3.1, replacing {country} with "germany"
624-
625-
# 4. Calculate normalization constants
626-
# Run the script from Step 3.2 on your GFS data
627-
# Update the normalisation_constants in your config file
628-
629-
# 5. Update training configuration
630-
# Edit: streamed_batches.yaml with path to germany_configuration.yaml
631-
# Set train_period and val_period based on your data availability
632-
633-
# 6. Generate training samples (optional but recommended)
634-
python src/open_data_pvnet/scripts/save_samples.py
635-
636-
# 7. Train model
637-
python run.py
638-
639-
# 8. Save and share model weights
640-
# Model saved in: outputs/{timestamp}/checkpoints/best.ckpt
641-
# Upload to Hugging Face or S3 (see Step 5)
642-
```
643-
644-
### Country-Specific Data Sources
645-
646-
#### United States
647-
- **Generation Data**: EIA (Energy Information Administration), CAISO, ERCOT
648-
- **Data Format**: Often available via APIs or CSV downloads
649-
- **Coverage**: Regional (ISO regions) or national level
650-
651-
#### Netherlands
652-
- **Generation Data**: ENTSO-E Transparency Platform, TenneT (Dutch TSO)
653-
- **Data Format**: API access via ENTSO-E, CSV exports
654-
- **Coverage**: National level
655-
656-
#### Belgium
657-
- **Generation Data**: ENTSO-E Transparency Platform, Elia (Belgian TSO)
658-
- **Data Format**: API or CSV
659-
- **Coverage**: National level
660-
661-
#### France
662-
- **Generation Data**: ENTSO-E Transparency Platform, RTE (French TSO)
663-
- **Data Format**: API access, data portal
664-
- **Coverage**: National level
665-
666-
#### Germany
667-
- **Generation Data**: ENTSO-E Transparency Platform, Bundesnetzagentur
668-
- **Data Format**: API or CSV exports
669-
- **Coverage**: National and regional level
670454

671-
**Note**: For all European countries, the [ENTSO-E Transparency Platform](https://transparency.entsoe.eu/) is a valuable resource for generation data.
672455

673456
---
674457
