Gauge Interest & Use Cases: Support creation of anemoi-datasets using Forecast Data #412

anaprietonem · 2025-09-04T06:37:00Z

anaprietonem
Sep 4, 2025
Maintainer

Hello!

anemoi-datasets currently focuses on generating datasets for analysis, usually with one timestamp per valid time. We’re exploring the idea of also supporting forecast data. This could be useful, but it’s trickier because:

Forecasts often have many timestamps for the same valid time.
We’re not sure yet what the main use cases would be.

We need to decide how the data should be stored and shaped before we build anything.
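To make the "many timestamps for the same valid time" point concrete, here is a minimal sketch using only the Python standard library. The initialisation schedule and lead times are made-up values for illustration and do not come from any real dataset or from anemoi-datasets itself.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical schedule: a forecast is initialised every 12 h,
# and each forecast has lead times 0, 12, 24, 36 and 48 h.
init_times = [datetime(2024, 1, 1) + timedelta(hours=12 * i) for i in range(4)]
lead_times = [timedelta(hours=h) for h in range(0, 49, 12)]

# Group every (init_time, lead_time) pair by the valid time it lands on.
by_valid_time = defaultdict(list)
for init in init_times:
    for lead in lead_times:
        by_valid_time[init + lead].append((init, lead))

# Several forecasts verify at the same valid time.
target = datetime(2024, 1, 2, 12)
pairs = by_valid_time[target]
for init, lead in pairs:
    print(f"init={init:%Y-%m-%d %H:%M} lead={lead}")
```

With an analysis dataset each valid time would have exactly one entry; here a single valid time is reached by several (initialisation, lead) combinations, so valid time alone cannot serve as a unique index.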

👉 We’d love your input:

Would this be useful for you?
What would your use cases look like, and how would you expect the data to be structured?

Your feedback will help us decide if and how to move forward with this feature. Thanks!

JoffreyDumontLeBrazidec · 2025-09-04T08:22:52Z

JoffreyDumontLeBrazidec
Sep 4, 2025
Collaborator

Current reforecast datasets for downscaling have specific particularities.
Each sample maps to a tuple (reference time, lead time, model time), rather than to a single time as in analysis or reanalysis.

Training on ensemble forecasts could also be useful—not as an ensemble like AIFS-CRPS, but simply to increase sample size. This would add a fourth dimension to the tuple. Other use cases may introduce further dimensions. The result is multiple timestamps for the same valid time.

To address this, the feature/fake-hindcasts branch in anemoi-datasets (thanks Baudouin) implements abstract indices or “fake dates” for indexing. These indices do not map directly to valid times but to tuples of information. The approach works but is hacky.

A cleaner implementation of abstract indexing would be great. It should ensure safe pairing of data across datasets, e.g. matching low- and high-resolution inputs. At training time, one should be able to retrieve the correct tuple—for example (reference time = 2015-08-02, lead time = 24h, model time = 2023-08-02)—consistently across datasets.
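As a rough illustration of tuple-based pairing, the sketch below matches samples from two datasets by a shared key instead of by valid time. It is not how the feature/fake-hindcasts branch works; `SampleKey`, `build_index` and `pair_positions` are hypothetical names introduced here, and the key lists are invented.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class SampleKey(NamedTuple):
    """Hypothetical abstract index: one key per sample."""
    reference_time: datetime
    lead_time: timedelta
    model_time: datetime


def build_index(keys):
    """Map each key to its position along a dataset's sample axis."""
    index = {}
    for position, key in enumerate(keys):
        if key in index:
            raise ValueError(f"duplicate key: {key}")
        index[key] = position
    return index


def pair_positions(low_res_keys, high_res_keys):
    """Return (key, low_res_position, high_res_position) for keys present in both."""
    low = build_index(low_res_keys)
    high = build_index(high_res_keys)
    common = sorted(low.keys() & high.keys())
    return [(key, low[key], high[key]) for key in common]


reference = datetime(2015, 8, 2)
model = datetime(2023, 8, 2)
k24 = SampleKey(reference, timedelta(hours=24), model)
k48 = SampleKey(reference, timedelta(hours=48), model)
k72 = SampleKey(reference, timedelta(hours=72), model)

low_res_keys = [k24, k48, k72]
high_res_keys = [k48, k24]  # different order, and the 72 h sample is missing

pairs = pair_positions(low_res_keys, high_res_keys)
for key, i_low, i_high in pairs:
    print(key.lead_time, i_low, i_high)
```

Pairing on the full tuple means a sample is only used when both datasets contain it, regardless of how each dataset orders its samples. Rejecting duplicate keys is one possible safety check; a real implementation might handle them differently.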

1 reply

gideonite Sep 8, 2025

Interesting, I'm not familiar with this "fake dates" concept. How is that integrated into anemoi-datasets?

OpheliaMiralles · 2025-09-04T13:59:22Z

OpheliaMiralles
Sep 4, 2025
Collaborator

For nowcasting, I had to flatten the forecast data into slices of valid times [t_0, t_0+6h] when creating anemoi-datasets, because the objective was to interpolate between t_0 and t_0+6h every 10 min. I would second what was said above and adopt a convention like (reference_time, lead_time, validity_time).
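The slicing described above might look roughly like the following. This is only a sketch of the valid-time grid for one window, under the assumption that both endpoints are included; `window_valid_times` is a hypothetical helper, not an anemoi-datasets function.

```python
from datetime import datetime, timedelta


def window_valid_times(t0, length=timedelta(hours=6), step=timedelta(minutes=10)):
    """List the valid times in [t0, t0 + length], both ends included."""
    n_steps = length // step  # integer number of whole steps in the window
    return [t0 + i * step for i in range(n_steps + 1)]


t0 = datetime(2024, 1, 1, 0, 0)  # made-up start time
times = window_valid_times(t0)
print(len(times), times[0], times[-1])
```

Each window is then a flat sequence of valid times, at the cost of dropping the explicit (reference_time, lead_time) information that the tuple convention would keep.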

0 replies

gideonite · 2025-09-08T21:51:10Z

gideonite
Sep 8, 2025

Thanks for opening this discussion @anaprietonem.

Emulating a forecasting model and predicting (re-)analysis-to-(re-)analysis changes are slightly different tasks. Predicting analysis-to-analysis changes essentially tries to improve upon existing forecasting models, whereas forecasting model emulation seeks to answer the question, "What would the forecasting model have predicted, had it seen this initial state?"

> How would you expect the data to be structured?

We have been thinking of this data as indexed by a tuple of two times (init_time, lead_time). So far, this approach has made a lot of sense to me but I'm also interested in improvements.
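One way to picture an (init_time, lead_time) index is as a two-dimensional grid that can be flattened to a single sample axis and recovered again. The sketch below assumes every initialisation has the same set of lead times; the function names and sizes are hypothetical.

```python
def to_flat(i_init, i_lead, n_lead):
    """Flatten a (init index, lead index) pair, with lead varying fastest."""
    if not 0 <= i_lead < n_lead:
        raise IndexError("lead index out of range")
    return i_init * n_lead + i_lead


def from_flat(flat, n_lead):
    """Recover the (init index, lead index) pair from a flat sample index."""
    return divmod(flat, n_lead)


n_lead = 5  # made-up number of lead times per forecast
flat = to_flat(i_init=3, i_lead=2, n_lead=n_lead)
print(flat, from_flat(flat, n_lead))
```

If forecasts have differing numbers of lead times, this regular mapping no longer holds and an explicit lookup table would be needed instead.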

0 replies

bastien-francois · 2025-09-11T08:27:52Z

bastien-francois
Sep 11, 2025

Hello!

Thanks for starting the discussion!

I’m very interested in this functionality for producing data for forecast verification. At KNMI, we work with UWC-West reforecast data, including both forecasts and analyses. The analyses and NWP forecasts are initially stored as GRIB2 files (one file per date and lead time) on a remote machine in Iceland with no internet access. For AI model inference, the NetCDF format has been chosen so far.

This functionality would allow us to build all the necessary NWP forecasts into Zarr files, which is ideal since the verification package we use (developed by RMI and mainly maintained by @mpvginde) already supports the Zarr format (it uses xarray and dask). You can find the verification package here: rmai-verification on GitHub.

Using indexes like those already suggested (reference_time, lead_time, valid_time) seems to me the most logical choice.

Thanks!

0 replies
