
Commit cd777fb

gabrieloks, pre-commit-ci[bot], ssmmnn11, HCookie and anaprietonem authored
feat(training): Refactor optimizer creation to support custom and torch optimizers (#588)
## Description

Refactoring of the optimizer creation to support custom and torch optimizers.

## What problem does this change solve?

This PR enables users to select different PyTorch optimizers and also allows the use of custom ones, such as AdEMAMix.

***As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/***

By opening this pull request, I affirm that all authors agree to the [Contributor License Agreement](https://github.com/ecmwf/codex/blob/main/Legal/contributor_license_agreement.md).

📚 Documentation previews 📚:

- https://anemoi-training--588.org.readthedocs.build/en/588/
- https://anemoi-graphs--588.org.readthedocs.build/en/588/
- https://anemoi-models--588.org.readthedocs.build/en/588/

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Simon Lang <[email protected]>
Co-authored-by: Harrison Cook <[email protected]>
Co-authored-by: Ana Prieto Nemesio <[email protected]>
1 parent ca6f732 commit cd777fb

File tree · 15 files changed: +657 −54 lines


LICENCES/APPLE_ML_ACKNOWLEDGEMENTS

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
Acknowledgements
Portions of our AdEMAMix implementation may utilize the following copyrighted
material, the use of which is hereby acknowledged.

_____________________

The Pytorch team (Pytorch)

From PyTorch:

Copyright (c) 2016- Facebook, Inc (Adam Paszke)
Copyright (c) 2014- Facebook, Inc (Soumith Chintala)
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)

From Caffe2:

Copyright (c) 2016-present, Facebook Inc. All rights reserved.

All contributions by Facebook:
Copyright (c) 2016 Facebook Inc.

All contributions by Google:
Copyright (c) 2015 Google Inc.
All rights reserved.

All contributions by Yangqing Jia:
Copyright (c) 2015 Yangqing Jia
All rights reserved.

All contributions by Kakao Brain:
Copyright 2019-2020 Kakao Brain

All contributions by Cruise LLC:
Copyright (c) 2022 Cruise LLC.
All rights reserved.

All contributions by Arm:
Copyright (c) 2021, 2023-2024 Arm Limited and/or its affiliates

All contributions from Caffe:
Copyright(c) 2013, 2014, 2015, the respective contributors
All rights reserved.

All other contributions:
Copyright(c) 2015, 2016 the respective contributors
All rights reserved.

Caffe2 uses a copyright model similar to Caffe: each contributor holds
copyright over their contributions to Caffe2. The project versioning records
all such contribution and copyright details. If a contributor wants to further
mark their specific copyright on a particular contribution, they should
indicate their copyright solely in the commit message of the change when it is
committed.

All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in the
   documentation and/or other materials provided with the distribution.

3. Neither the names of Facebook, Deepmind Technologies, NYU, NEC Laboratories America
   and IDIAP Research Institute nor the names of its contributors may be
   used to endorse or promote products derived from this software without
   specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.


Google Deepmind (Jax)
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


Google Deepmind (Optax)
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

LICENCES/APPLE_ML_ADEMAMIX_LICENSE

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
MIT License

Copyright © 2024 Apple Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


-------------------------------------------------------------------------------
SOFTWARE DISTRIBUTED WITH ADEMAMIX:

The AdEMAMix provided code includes a number of subcomponents with separate
copyright notices and license terms - please see the file ACKNOWLEDGEMENTS.
-------------------------------------------------------------------------------

NOTICE.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
This product includes third-party software components developed by Apple Inc.
Specifically, it incorporates the "AdEMAMix" optimizer implementation,
which is made available under the MIT License.
Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
############
 Optimizers
############

Optimizers are responsible for updating the model parameters during
training based on the gradients computed from the loss function. In
``anemoi-training``, optimizers are configured in the training
configuration file under ``config.training.optimizer``. By default,
optimizers are instantiated using a Hydra-style ``_target_`` entry,
allowing full flexibility to specify both standard PyTorch optimizers
and custom implementations.

The optimizer configuration is handled internally by the
``BaseGraphModule`` class through its ``_create_optimizer_from_config``
method, which reads the provided configuration and creates the
corresponding optimizer object. Additional settings, such as learning
rate schedulers and warm-up phases, are also defined and managed within
the same module.
**************************
 Configuring an Optimizer
**************************

An optimizer can be defined in the training configuration file using its
Python import path as the ``_target_``. For example, to use the standard
Adam optimizer:

.. code:: yaml

   optimizer:
     _target_: torch.optim.Adam
     betas: [0.9, 0.95]
     weight_decay: 0.1

The ``BaseGraphModule`` automatically injects the learning rate from
``config.training.lr``, so the optimizer configuration can focus on
algorithm-specific parameters.

**************************
 Learning Rate Schedulers
**************************

Learning rate schedulers can be attached to any optimizer to control the
evolution of the learning rate during training. By default,
``anemoi-training`` uses a ``CosineLRScheduler`` from ``timm.scheduler``
with optional warm-up steps and a minimum learning rate.

The scheduler is created by ``BaseGraphModule._create_scheduler`` and
returned to the trainer together with the optimizer in
``configure_optimizers``. The scheduler is currently hard-coded to
``CosineLRScheduler``; making it configurable is planned.

The scheduler is returned in a dictionary of the form:

.. code:: python

   {
       "optimizer": optimizer,
       "lr_scheduler": {
           "scheduler": scheduler,
           "interval": "step",
       },
   }
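As a rough illustration of how the pieces fit together, the helper below builds such a dictionary around a ``timm`` ``CosineLRScheduler``. The function name and the specific arguments shown (``t_initial``, ``warmup_t``, ``lr_min``) are illustrative placeholders, not the exact values or code used by ``anemoi-training``.

.. code:: python

   import torch
   from timm.scheduler import CosineLRScheduler


   def build_scheduler_dict(
       optimizer: torch.optim.Optimizer,
       total_steps: int,
       warmup_steps: int,
       lr_min: float,
   ) -> dict:
       """Illustrative sketch of the dictionary returned by ``configure_optimizers``."""
       scheduler = CosineLRScheduler(
           optimizer,
           t_initial=total_steps,   # length of the cosine schedule, in steps
           warmup_t=warmup_steps,   # linear warm-up steps
           lr_min=lr_min,           # floor for the learning rate
       )
       return {
           "optimizer": optimizer,
           "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
       }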
********************
 AdEMAMix Optimizer
********************

``AdEMAMix`` is a custom optimizer implemented in the
``anemoi.training.optimizers.AdEMAMix`` module and taken from the `Apple ML
AdEMAMix project <https://github.com/apple/ml-ademamix>`_. It combines
elements of Adam and exponential moving average (EMA) mixing for
improved stability and generalization.

The optimizer maintains **three exponential moving averages (EMAs)**: a
fast and a slow EMA of the gradients, plus an EMA of the squared
gradients. See the `AdEMAMix paper <https://arxiv.org/abs/2409.03137>`_
for more details.
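For reference, the update rule from the paper can be sketched as a single-tensor step as follows. This is a schematic reading of the algorithm, not the packaged implementation (which is documented in the reference section below), and it omits the warm-up schedules applied to ``alpha`` and ``beta3``.

.. code:: python

   import torch


   def ademamix_step_sketch(
       param: torch.Tensor,
       grad: torch.Tensor,
       m1: torch.Tensor,  # fast EMA of gradients (Adam-like first moment)
       m2: torch.Tensor,  # slow EMA of gradients (the extra AdEMAMix moment)
       nu: torch.Tensor,  # EMA of squared gradients (second moment)
       step: int,
       lr: float = 1e-3,
       betas: tuple[float, float, float] = (0.9, 0.999, 0.9999),
       alpha: float = 2.0,
       weight_decay: float = 0.0,
       eps: float = 1e-8,
   ) -> None:
       """Schematic AdEMAMix update for one parameter tensor."""
       beta1, beta2, beta3 = betas
       m1.mul_(beta1).add_(grad, alpha=1 - beta1)            # fast EMA
       m2.mul_(beta3).add_(grad, alpha=1 - beta3)            # slow EMA
       nu.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # squared-gradient EMA

       m1_hat = m1 / (1 - beta1**step)  # bias correction (m2 is left uncorrected)
       nu_hat = nu / (1 - beta2**step)

       update = (m1_hat + alpha * m2) / (nu_hat.sqrt() + eps)
       param.add_(-lr * (update + weight_decay * param))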
79+
***********************
80+
Configuration in YAML
81+
***********************
82+
83+
An example configuration for using ``AdEMAMix`` is shown below:
84+
85+
.. code:: yaml
86+
87+
optimizer:
88+
_target_: anemoi.training.optimizers.AdEMAMix.AdEMAMix
89+
betas: [0.9, 0.999, 0.9999]
90+
alpha: 2.0
91+
weight_decay: 0.01
92+
beta3_warmup: 1000
93+
alpha_warmup: 1000
94+
95+
**************************
96+
Implementation Reference
97+
**************************
98+
99+
.. automodule:: anemoi.training.optimizers.AdEMAMix
100+
:members:
101+
:no-undoc-members:
102+
:show-inheritance:
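Outside the Hydra configuration path, the optimizer can in principle also be constructed directly, like any other ``torch.optim`` optimizer. The snippet below is a hypothetical standalone usage whose keyword arguments mirror the YAML fields above; the exact constructor signature should be checked against the module documented in the reference section.

.. code:: python

   import torch
   from anemoi.training.optimizers.AdEMAMix import AdEMAMix

   model = torch.nn.Linear(8, 1)

   # Keyword arguments mirror the YAML example; defaults may differ.
   optimizer = AdEMAMix(
       model.parameters(),
       lr=1e-3,
       betas=(0.9, 0.999, 0.9999),
       alpha=2.0,
       weight_decay=0.01,
       beta3_warmup=1000,
       alpha_warmup=1000,
   )

   loss = model(torch.randn(4, 8)).pow(2).mean()
   loss.backward()
   optimizer.step()
   optimizer.zero_grad()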

training/src/anemoi/training/config/training/default.yaml

Lines changed: 33 additions & 4 deletions
@@ -37,11 +37,40 @@ swa:
  enabled: False
  lr: 1.e-4

-# Optimizer settings
+# =====================================================================
+# Optimizer configuration
+# =====================================================================
 optimizer:
-  zero: False # use ZeroRedundancyOptimizer ; saves memory for larger models
-  kwargs:
-    betas: [0.9, 0.95]
+  # ---------------------------------------------------------------
+  # Choose optimizer type (_target_ approach)
+  # ---------------------------------------------------------------
+  # Default optimizer: AdamW
+  _target_: torch.optim.AdamW
+
+  # ---------------------------------------------------------------
+  # Common optimizer parameters
+  # ---------------------------------------------------------------
+  # Learning rate is defined elsewhere
+  #kwargs:
+  betas: [0.9, 0.95] # β₁, β₂ for Adam-style optimizers
+
+  # ---------------------------------------------------------------
+  # Optional: configuration for AdEMAMix (custom optimizer)
+  # Uncomment the lines below to enable it
+  # ---------------------------------------------------------------
+  # _target_: anemoi.training.optimizers.AdEMAMix.AdEMAMix # Custom optimizer
+  # betas: [0.9, 0.95, 0.9999] # β₁, β₂, β₃
+  # alpha: 8.0 # Mixing factor controlling EMA fusion
+  # beta3_warmup: 260000 # Warm-up steps for β₃ (in iterations)
+  # alpha_warmup: 260000 # Warm-up steps for α (in iterations)
+  # weight_decay: 0.01
+
+  # Optional: configuration for ZeroRedundancyOptimizer
+  # _target_: torch.distributed.optim.ZeroRedundancyOptimizer
+  # optimizer_class:
+  #   _target_: torch.optim.AdamW
+  #   _partial_: true
+  # betas: [0.9, 0.95]

 # select model
 model_task: anemoi.training.train.tasks.GraphForecaster
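The commented-out ``ZeroRedundancyOptimizer`` block above relies on Hydra's nested instantiation: ``_partial_: true`` turns the inner entry into a ``functools.partial`` of ``torch.optim.AdamW``, which the wrapper then receives as ``optimizer_class``. A minimal illustration of that mechanism (standalone, not the anemoi code path) is sketched below.

.. code:: python

   import functools

   from hydra.utils import instantiate
   from omegaconf import OmegaConf

   # The nested entry from the commented-out block above.
   inner_cfg = OmegaConf.create({"_target_": "torch.optim.AdamW", "_partial_": True})

   optimizer_class = instantiate(inner_cfg)
   assert isinstance(optimizer_class, functools.partial)  # partial(torch.optim.AdamW)

   # With an initialized torch.distributed process group, the full optimizer
   # config would then resolve to (roughly):
   #   ZeroRedundancyOptimizer(model.parameters(), optimizer_class=optimizer_class,
   #                           lr=..., betas=[0.9, 0.95])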

training/src/anemoi/training/config/training/diffusion.yaml

Lines changed: 4 additions & 5 deletions
@@ -39,11 +39,10 @@ swa:

 # Optimizer settings
 optimizer:
-  zero: False
-  kwargs:
-    weight_decay: 0.1
-    betas: [0.9, 0.95]
-    eps: 1e-7
+  _target_: torch.optim.AdamW
+  weight_decay: 0.1
+  betas: [0.9, 0.95]
+  eps: 1e-7

 # select model
 model_task: anemoi.training.train.tasks.GraphDiffusionForecaster

training/src/anemoi/training/config/training/ensemble.yaml

Lines changed: 2 additions & 3 deletions
@@ -39,9 +39,8 @@ swa:

 # Optimizer settings
 optimizer:
-  zero: False # use ZeroRedundancyOptimizer ; saves memory for larger models
-  kwargs:
-    betas: [0.9, 0.95]
+  _target_: torch.optim.AdamW
+  betas: [0.9, 0.95]

 # select model
 model_task: anemoi.training.train.tasks.GraphEnsForecaster

training/src/anemoi/training/config/training/interpolator.yaml

Lines changed: 2 additions & 3 deletions
@@ -39,9 +39,8 @@ swa:

 # Optimizer settings
 optimizer:
-  zero: False # use ZeroRedundancyOptimizer ; saves memory for larger models
-  kwargs:
-    betas: [0.9, 0.95]
+  _target_: torch.optim.AdamW
+  betas: [0.9, 0.95]

 # select model
 model_task: anemoi.training.train.tasks.GraphInterpolator

training/src/anemoi/training/config/training/lam.yaml

Lines changed: 2 additions & 3 deletions
@@ -39,9 +39,8 @@ swa:

 # Optimizer settings
 optimizer:
-  zero: False # use ZeroRedundancyOptimizer ; saves memory for larger models
-  kwargs:
-    betas: [0.9, 0.95]
+  _target_: torch.optim.AdamW
+  betas: [0.9, 0.95]

 # select model
 model_task: anemoi.training.train.tasks.GraphForecaster

training/src/anemoi/training/config/training/stretched.yaml

Lines changed: 2 additions & 3 deletions
@@ -39,9 +39,8 @@ swa:

 # Optimizer settings
 optimizer:
-  zero: False # use ZeroRedundancyOptimizer ; saves memory for larger models
-  kwargs:
-    betas: [0.9, 0.95]
+  _target_: torch.optim.AdamW
+  betas: [0.9, 0.95]

 # select model
 model_task: anemoi.training.train.tasks.GraphForecaster
