Commit da02fe7
fix(training)!: Refactor configuration by introducing system schema with hardware, paths, and files subschemas (#598)
# Description

This PR reorganizes the configuration structure by introducing a new top-level schema called `system`, which groups the hardware and filesystem subschemas (see [issue #513](#513)). Previously, `paths` and `files` were defined under `hardware`, which was confusing since they describe filesystem layout rather than compute resources, and it was unclear whether `paths` referred to directories or files. The filesystem settings are now split into `input` and `output` subschemas, clarifying their role as the definition of the directory structure for inputs, outputs, and logs:

```
system/
├── hardware.yaml
├── input.yaml
└── output.yaml
```

The PR also isolates the path-concatenation logic in the pydantic schema, so the full paths for outputs, logs, etc. no longer have to be spelled out in every field. This concatenation used to happen both in code throughout the framework and inside the nested configuration files, which was brittle; it now happens in a single place.

📚 Documentation preview 📚: https://anemoi-training--598.org.readthedocs.build/en/598/

📚 Documentation preview 📚: https://anemoi-graphs--598.org.readthedocs.build/en/598/

📚 Documentation preview 📚: https://anemoi-models--598.org.readthedocs.build/en/598/

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: anaprietonem <[email protected]>
Co-authored-by: Ana Prieto Nemesio <[email protected]>
Co-authored-by: Dieter Van den Bleeken <[email protected]>
Co-authored-by: Mario Santa Cruz <[email protected]>
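The "concatenation in one place" idea described above can be sketched with plain dataclass properties. This is a minimal, dependency-free sketch of the pattern, not the actual anemoi-training pydantic schema; `OutputSchema`, `root`, and the subdirectory names are illustrative:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class OutputSchema:
    """Illustrative sketch: all derived paths computed from a single root."""

    root: Path  # base output directory, the only path the user supplies

    @property
    def logs(self) -> Path:
        # Derived in one place instead of concatenated throughout the
        # framework and nested config files.
        return self.root / "logs"

    @property
    def checkpoints(self) -> Path:
        return self.root / "checkpoints"


out = OutputSchema(root=Path("/experiments/run1"))
print(out.logs)         # e.g. /experiments/run1/logs on POSIX
print(out.checkpoints)  # e.g. /experiments/run1/checkpoints on POSIX
```

The real implementation uses pydantic models, but the design choice is the same: fields hold only the root, and every full path is computed from it.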
1 parent 84e5882 commit da02fe7


69 files changed: +770 −600 lines
Lines changed: 71 additions & 0 deletions (new file)

```python
# (C) Copyright 2025 Anemoi contributors.
#
# This software is licensed under the terms of the Apache Licence Version 2.0
# which can be obtained at http://www.apache.org/licenses/LICENSE-2.0.
#
# In applying this licence, ECMWF does not waive the privileges and immunities
# granted to it by virtue of its status as an intergovernmental organisation
# nor does it submit to any jurisdiction.

from anemoi.models.migrations import CkptType
from anemoi.models.migrations import MigrationContext
from anemoi.models.migrations import MigrationMetadata

# DO NOT CHANGE -->
metadata = MigrationMetadata(
    versions={
        "migration": "1.0.0",
        "anemoi-models": "%NEXT_ANEMOI_MODELS_VERSION%",
    },
)
# <-- END DO NOT CHANGE


def migrate_setup(context: MigrationContext) -> None:
    """Migrate setup callback to be run before loading the checkpoint.

    Parameters
    ----------
    context : MigrationContext
        A MigrationContext instance
    """
    context.move_attribute(
        "anemoi.training.schemas.hardware.HardwareSchema", "anemoi.training.schemas.system.HardwareSchema"
    )
    context.move_attribute("anemoi.training.schemas.hardware.FilesSchema", "anemoi.training.schemas.system.InputSchema")
    context.move_attribute(
        "anemoi.training.schemas.hardware.PathsSchema", "anemoi.training.schemas.system.OutputSchema"
    )
    context.move_module("anemoi.training.schemas.hardware", "anemoi.training.schemas.system")


def migrate(ckpt: CkptType) -> CkptType:
    """Migrate the checkpoint.

    Parameters
    ----------
    ckpt : CkptType
        The checkpoint dict.

    Returns
    -------
    CkptType
        The migrated checkpoint dict.
    """
    return ckpt


def rollback(ckpt: CkptType) -> CkptType:
    """Rollback the checkpoint.

    Parameters
    ----------
    ckpt : CkptType
        The checkpoint dict.

    Returns
    -------
    CkptType
        The rolled-back checkpoint dict.
    """
    return ckpt
```
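The `move_attribute` calls in the migration script amount to a rename table that redirects old schema references to their new locations when an existing checkpoint is loaded. The mapping below is reproduced from the script; the `redirect` helper itself is illustrative and not part of anemoi-models:

```python
# Old dotted paths -> new dotted paths, as declared in migrate_setup above.
RENAMES = {
    "anemoi.training.schemas.hardware.HardwareSchema": "anemoi.training.schemas.system.HardwareSchema",
    "anemoi.training.schemas.hardware.FilesSchema": "anemoi.training.schemas.system.InputSchema",
    "anemoi.training.schemas.hardware.PathsSchema": "anemoi.training.schemas.system.OutputSchema",
}


def redirect(dotted_path: str) -> str:
    """Return the new dotted path for an old schema reference, if renamed."""
    return RENAMES.get(dotted_path, dotted_path)


print(redirect("anemoi.training.schemas.hardware.PathsSchema"))
# anemoi.training.schemas.system.OutputSchema
```

Note that `FilesSchema` becomes `InputSchema` and `PathsSchema` becomes `OutputSchema`, i.e. the old file/path split is replaced by an input/output split.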

training/docs/user-guide/distributed.rst

Lines changed: 14 additions & 14 deletions

```diff
@@ -40,13 +40,13 @@ shown in the figure below
 
 
    Model Sharding (source: `Jacobs et al. (2023) <https://arxiv.org/pdf/2309.14509>`_)
 
-To use model sharding, set ``config.hardware.num_gpus_per_model`` to the
-number of GPUs you wish to shard the model across. Set ``config.model.
-keep_batch_sharded=True`` to also keep batches fully sharded throughout
-training, reducing memory usage for large inputs or long rollouts. It is
-recommended to only shard if the model does not fit in GPU memory, as
-data distribution is a much more efficient way to parallelise the
-training.
+To use model sharding, set ``config.system.hardware.num_gpus_per_model``
+to the number of GPUs you wish to shard the model across. Set
+``config.model. keep_batch_sharded=True`` to also keep batches fully
+sharded throughout training, reducing memory usage for large inputs or
+long rollouts. It is recommended to only shard if the model does not fit
+in GPU memory, as data distribution is a much more efficient way to
+parallelise the training.
 
 Anemoi Training provides different sharding strategies depending if the
 model task is deterministic or ensemble based.
@@ -57,7 +57,7 @@ For deterministic models, the ``DDPGroupStrategy`` is used:
 
   strategy:
     _target_: anemoi.training.distributed.strategy.DDPGroupStrategy
-    num_gpus_per_model: ${hardware.num_gpus_per_model}
+    num_gpus_per_model: ${system.hardware.num_gpus_per_model}
     read_group_size: ${dataloader.read_group_size}
 
 When using model sharding, ``config.dataloader.read_group_size`` allows
@@ -72,20 +72,20 @@ across GPUs:
 
   strategy:
     _target_: anemoi.training.distributed.strategy.DDPEnsGroupStrategy
-    num_gpus_per_model: ${hardware.num_gpus_per_model}
+    num_gpus_per_model: ${system.hardware.num_gpus_per_model}
     read_group_size: ${dataloader.read_group_size}
 
-This requires setting ``config.hardware.num_gpus_per_ensemble`` to the
-number of GPUs you wish to parallelise the ensemble members across and
-``config.training.ensemble_size_per_device`` to the number of ensemble
-members per GPU.
+This requires setting ``config.system.hardware.num_gpus_per_ensemble``
+to the number of GPUs you wish to parallelise the ensemble members
+across and ``config.training.ensemble_size_per_device`` to the number of
+ensemble members per GPU.
 
 *********
  Example
 *********
 
 Suppose the job is running on 2 nodes each with 4 GPUs and that
-``config.hardware.num_gpus_per_model=2`` and
+``config.system.hardware.num_gpus_per_model=2`` and
 ``config.dataloader.batch_size.training=4``. Then each model will be
 sharded across 2 GPUs and the data sharded across ``total number of
 GPUs/num_gpus_per_model=4``. This means the effective batch size is 16.
```
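The arithmetic in the documentation's example can be sketched as a small helper (the function name is illustrative; anemoi-training computes this internally):

```python
def effective_batch_size(
    num_nodes: int,
    num_gpus_per_node: int,
    num_gpus_per_model: int,
    batch_size_training: int,
) -> int:
    """Effective global batch size under model sharding.

    Each model instance is sharded across `num_gpus_per_model` GPUs, so
    the number of data-parallel model groups is
    total GPUs / num_gpus_per_model, and each group consumes one batch.
    """
    total_gpus = num_nodes * num_gpus_per_node
    if total_gpus % num_gpus_per_model != 0:
        raise ValueError("total GPU count must be divisible by num_gpus_per_model")
    num_model_groups = total_gpus // num_gpus_per_model
    return num_model_groups * batch_size_training


# The documented example: 2 nodes x 4 GPUs, num_gpus_per_model=2, batch size 4
print(effective_batch_size(2, 4, 2, 4))  # 16
```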

training/docs/user-guide/models.rst

Lines changed: 1 addition & 1 deletion

```diff
@@ -55,7 +55,7 @@ al. (2023).
 
 The physical data is encoded on to a multi-mesh latent space of
 decreasing resolution. This multi-mesh is defined by the graph given in
-``config.hardware.files.graph``.
+``config.system.input.graph``.
 
 .. figure:: ../images/gnn-encoder-decoder-multimesh.jpg
    :width: 500
```

training/docs/user-guide/tracking.rst

Lines changed: 1 addition & 1 deletion

```diff
@@ -113,7 +113,7 @@ manually installed:
 To enable offline logging, set
 ``config.diagnostics.logger.mlflow.offline`` to ``True`` and run the
 training as usual. Logs will be saved to the directory specified in
-``config.hardware.paths.logs``
+``config.system.output.logs``
 
 When training is done, use the ``mlflow sync`` command to sync the
 offline logs to a server:
```

training/docs/user-guide/training.rst

Lines changed: 6 additions & 16 deletions

```diff
@@ -502,18 +502,8 @@ finished training. It's also possible to restart the model training from
 a specific checkpoint. This can either be a checkpoint from the same run
 or a checkpoint from a different run that you have run in the past or
 that you using for transfer learning. To do this, set
-``config.hardware.files.warm_start`` to be the checkpoint filename they
-want to restart from and ``config.hardware.paths.warm_start`` to be the
-path to the checkpoint. See the example below.
-
-.. code:: yaml
-
-   # This is a sample YAML block
-   hardware:
-     files:
-       warm_start: checkpoint_epoch_10.ckpt
-     paths:
-       warm_start: /path/to/checkpoint/folder/
+``config.system.input.warm_start`` to be the path to the checkpoint they
+want to restart from.
 
 The above can be adapted depending on the use case and taking advantage
 of hydra, you can also reuse ``config.training.run_id`` or
@@ -540,10 +530,10 @@ flag to True in the configuration file.
     transfer_learning: True
 
 When this flag is active and a checkpoint path is specified in
-config.hardware.files.warm_start or self.last_checkpoint, the system
-loads the pre-trained weights using the `transfer_learning_loading`
-function. This approach ensures only compatible weights are loaded and
-mismatched layers are handled appropriately.
+config.system.input.warm_start or self.last_checkpoint, the system loads
+the pre-trained weights using the `transfer_learning_loading` function.
+This approach ensures only compatible weights are loaded and mismatched
+layers are handled appropriately.
 
 For example, transfer learning might be used to adapt a weather
 forecasting model trained on one geographic region to another region
```
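Under the new schema, the two-field warm-start example removed in this diff collapses to a single path. A sketch, reusing the filename and folder from the old example:

```yaml
system:
  input:
    warm_start: /path/to/checkpoint/folder/checkpoint_epoch_10.ckpt
```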
Lines changed: 4 additions & 4 deletions

```diff
@@ -1,11 +1,11 @@
-dataset: ${hardware.paths.data}/${hardware.files.dataset}
+dataset: ${system.input.dataset}
 
 training:
   dataset: ${dataloader.dataset}
   start: null
   end: 2020
   frequency: ${data.frequency}
-  drop: []
+  drop: []
 
 validation_rollout: 1 # number of rollouts to use for validation, must be equal or greater than rollout expected by callbacks
 
@@ -14,11 +14,11 @@ validation:
   start: 2021-01-01
   end: 2021
   frequency: ${data.frequency}
-  drop: []
+  drop: []
 
 test:
   dataset: ${dataloader.dataset}
   start: 2022-01
   end: null
   frequency: ${data.frequency}
-  drop: []
+  drop: []
```

training/docs/user-guide/yaml/example_crps_config.yaml

Lines changed: 10 additions & 9 deletions

```diff
@@ -2,7 +2,7 @@ defaults:
   - data: zarr
   - dataloader: native_grid
   - diagnostics: evaluation
-  - hardware: example
+  - system: example
   - graph: encoder_decoder_only
   - model: transformer_ens
   - training: default
@@ -11,14 +11,15 @@ defaults:
 config_validation: True
 
 # Changes in hardware
-hardware:
+system:
   files:
     truncation: ${data.resolution}-O32-linear.mat.npz
     truncation_inv: O32-${data.resolution}-linear.mat.npz
-  num_gpus_per_ensemble: 1
-  num_gpus_per_node: 1
-  num_nodes: 1
-  num_gpus_per_model: 1
+  hardware:
+    num_gpus_per_ensemble: 1
+    num_gpus_per_node: 1
+    num_nodes: 1
+    num_gpus_per_model: 1
 
 data:
   resolution: o96
@@ -32,13 +33,13 @@ training:
 # Changes in strategy
 strategy:
   _target_: anemoi.training.distributed.strategy.DDPEnsGroupStrategy
-  num_gpus_per_ensemble: ${hardware.num_gpus_per_ensemble}
-  num_gpus_per_model: ${hardware.num_gpus_per_model}
+  num_gpus_per_ensemble: ${system.hardware.num_gpus_per_ensemble}
+  num_gpus_per_model: ${system.hardware.num_gpus_per_model}
 
 # Changes in training loss
 training_loss:
   _target_: anemoi.training.losses.kcrps.AlmostFairKernelCRPS
-  scalars: ['variable']
+  scalars: ["variable"]
   ignore_nans: False
   alpha: 1.0
```

training/src/anemoi/training/config/config.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -2,7 +2,7 @@ defaults:
   - data: zarr
   - dataloader: native_grid
   - diagnostics: evaluation
-  - hardware: example
+  - system: example
   - graph: multi_scale
   - model: gnn
   - training: default
```

training/src/anemoi/training/config/dataloader/native_grid.yaml

Lines changed: 5 additions & 5 deletions

```diff
@@ -10,7 +10,7 @@ pin_memory: True
 # The number of GPUs per model must be divisible by read_group_size.
 # To disable, set to 1.
 # ============
-read_group_size: ${hardware.num_gpus_per_model}
+read_group_size: ${system.hardware.num_gpus_per_model}
 
 num_workers:
   training: 8
@@ -50,14 +50,14 @@ grid_indices:
 # See https://anemoi-datasets.readthedocs.io
 # ============
 
-dataset: ${hardware.paths.data}/${hardware.files.dataset}
+dataset: ${system.input.dataset}
 
 training:
   dataset: ${dataloader.dataset}
   start: null
   end: 2020
   frequency: ${data.frequency}
-  drop: []
+  drop: []
 
 validation_rollout: 1 # number of rollouts to use for validation, must be equal or greater than rollout expected by callbacks
 
@@ -66,11 +66,11 @@ validation:
   start: 2021
   end: 2021
   frequency: ${data.frequency}
-  drop: []
+  drop: []
 
 test:
   dataset: ${dataloader.dataset}
   start: 2022
   end: null
   frequency: ${data.frequency}
-  drop: []
+  drop: []
```

training/src/anemoi/training/config/debug.yaml

Lines changed: 11 additions & 10 deletions

```diff
@@ -2,7 +2,7 @@ defaults:
   - data: zarr
   - dataloader: native_grid
   - diagnostics: evaluation
-  - hardware: example
+  - system: example
   - graph: multi_scale
   - model: gnn
   - training: default
@@ -14,20 +14,21 @@ config_validation: True
 ## When you commit your changes, assign the new features and keywords
 ## to the correct defaults.
 # For example to change from default GPU count:
-# hardware:
-#   num_gpus_per_node: 1
+# system:
+#   hardware:
+#     num_gpus_per_node: 1
 
 diagnostics:
   plot:
     callbacks: []
-hardware:
-  files:
+system:
+  input:
     graph: ???
-  accelerator: auto
-  num_gpus_per_node: 1
-  num_nodes: 1
-  num_gpus_per_model: 1
-
+  hardware:
+    accelerator: auto
+    num_gpus_per_node: 1
+    num_nodes: 1
+    num_gpus_per_model: 1
 
 model:
   num_channels: 128
```
