This repository contains the code and data for the paper "Observation-driven correction of numerical weather prediction for marine winds" submitted to JGR: Machine Learning and Computation.
The paper presents a transformer-based approach that reformulates marine wind forecasting as observation-informed correction of numerical weather prediction. Rather than forecasting winds directly, the model learns local correction patterns by assimilating the latest in-situ observations to adjust Global Forecast System (GFS) outputs. The architecture handles irregular and time-varying observation sets through masking and set-based attention mechanisms, conditions predictions on recent observation–forecast pairs via cross-attention, and employs cyclical time embeddings and coordinate-aware location representations to enable single-pass inference at arbitrary spatial coordinates.
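The cyclical time embeddings mentioned above can be illustrated with a minimal sketch. It assumes a plain sine/cosine encoding of hour of day and day of year; the function names `cyclical_encode` and `time_features` and the chosen periods are hypothetical and are not taken from the repository, whose actual embedding may differ.

```python
import math
from datetime import datetime
from typing import List, Tuple


def cyclical_encode(value: float, period: float) -> Tuple[float, float]:
    """Map a periodic quantity onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def time_features(ts: datetime) -> List[float]:
    """Illustrative features: hour of day and day of year (assumed periods)."""
    hour = ts.hour + ts.minute / 60.0
    day_of_year = ts.timetuple().tm_yday - 1  # zero-based day index
    hour_sin, hour_cos = cyclical_encode(hour, 24.0)
    day_sin, day_cos = cyclical_encode(day_of_year, 365.25)
    return [hour_sin, hour_cos, day_sin, day_cos]
```

With this kind of encoding, 23:00 and 00:00 map to neighbouring points on the circle, so the model does not see an artificial jump at midnight or at the turn of the year.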
The model is evaluated over the Atlantic Ocean using collocated observations from the International Comprehensive Ocean-Atmosphere Data Set (ICOADS). It reduces GFS 10-meter wind root-mean-square error at all lead times up to 48 hours, achieving 45% improvement at 1-hour lead time and 13% improvement at 48-hour lead time. The tokenized architecture naturally accommodates heterogeneous observing platforms (ships, buoys, tide gauges, and coastal stations) and produces both site-specific predictions and basin-scale gridded products in a single forward pass.
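For readers who want to reproduce this kind of comparison, the following sketch shows one way to compute a wind RMSE and the relative improvement over a baseline. It assumes a vector RMSE over the u and v components; the paper's exact metric definition is not stated here, and the helper names `vector_rmse` and `rmse_improvement` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Wind = Tuple[float, float]  # (u, v) components in m/s


def vector_rmse(pred: Sequence[Wind], obs: Sequence[Wind]) -> float:
    """Root-mean-square vector error between predicted and observed winds."""
    if len(pred) != len(obs) or len(obs) == 0:
        raise ValueError("pred and obs must be non-empty and of equal length")
    squared = sum(
        (pu - ou) ** 2 + (pv - ov) ** 2
        for (pu, pv), (ou, ov) in zip(pred, obs)
    )
    return math.sqrt(squared / len(obs))


def rmse_improvement(rmse_baseline: float, rmse_corrected: float) -> float:
    """Percentage reduction of RMSE relative to the baseline (e.g. raw GFS)."""
    return 100.0 * (1.0 - rmse_corrected / rmse_baseline)
```

Under this definition, a 45% improvement means the corrected RMSE is 0.55 times the baseline RMSE.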
Use the following citation when the code or data are used:
Peduto, M.; Yang, Q.; Giezendanner, J.; Tuia, D.; Wang, S. Observation-driven correction of numerical weather prediction for marine winds. Submitted to JGR: Machine Learning and Computation, 2025.
The data for training, testing and validation can be found on Zenodo. The files are already processed and ready to be used in the model.
For ICOADS, ERA5 and GFS, the following variables are available:
- u and v components of the wind vector at 10 meters above ground
- additional variables for ERA5 and GFS
The code is organised as follows (in offshore-wind-forecasting/):
- launch_global_models.py is a launcher pointing at train_global_models.py (the arguments of the parser need to be given)
- train_global_models.py contains the main code loop with the arguments --lead_time (lead time hours), --type_data (global or subset), --global_position_embedding (global), --absolute_time_embedding (absolute)
- inference_gridded.py contains the code for the gridded evaluation of the model
- the folder Dataloader/ contains the data loaders for the models and for the gridded inference
- models/ contains the code for the model, the cross-attention, the activations, the location encoder, and the early stopping
- common_functiony.py contains utility functions for the pipeline
- loading_files/ contains the pipeline to load the values from ERA5 and GFS
- processing_files/ contains the pipeline to process the data from ICOADS, ERA5 and GFS into the files used for training the models
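As an illustration of how the training script might be launched, the sketch below assembles a command line from the arguments listed above. It assumes each flag takes a value as shown in the parentheses; whether the flags are value-taking options or boolean switches is not confirmed here, and the helper `build_train_command` and the example lead time of 48 are hypothetical.

```python
import shlex
import sys
from typing import List


def build_train_command(lead_time: int, type_data: str = "global") -> List[str]:
    """Assemble an argument list for train_global_models.py (assumed CLI shape)."""
    if type_data not in {"global", "subset"}:
        raise ValueError("type_data must be 'global' or 'subset'")
    return [
        sys.executable,
        "train_global_models.py",
        "--lead_time", str(lead_time),
        "--type_data", type_data,
        # The two values below are the ones named in the list above.
        "--global_position_embedding", "global",
        "--absolute_time_embedding", "absolute",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_train_command(48)))
```

The resulting list can be passed to `subprocess.run` once the argument parser in train_global_models.py has been checked against these assumptions.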
Once the processed training files are in the appropriate folders, the code only needs the appropriate arguments to be passed when executing the main scripts. The training files follow the layout below, where n is the requested lead time:
Data/
└── training_files
    └── lead_time_n
        ├── train.parquet
        ├── test.parquet
        └── validation.parquet
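The following sketch shows how the three split files for a given lead time could be located and loaded. It assumes the files are nested as Data/training_files/lead_time_n/ and that n is replaced by the integer lead time (for example lead_time_48); the helpers `training_file_paths` and `load_split` are hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import Dict, Union


def training_file_paths(data_root: Union[str, Path], lead_time: int) -> Dict[str, Path]:
    """Return the expected parquet path for each split (assumed folder naming)."""
    folder = Path(data_root) / "training_files" / f"lead_time_{lead_time}"
    return {
        split: folder / f"{split}.parquet"
        for split in ("train", "test", "validation")
    }


def load_split(data_root: Union[str, Path], lead_time: int, split: str):
    """Load one split as a DataFrame; needs pandas with a parquet engine installed."""
    import pandas as pd  # imported lazily so the path helper works without pandas

    return pd.read_parquet(training_file_paths(data_root, lead_time)[split])
```

For example, `training_file_paths("Data", 48)["train"]` points at Data/training_files/lead_time_48/train.parquet under these assumptions.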