-
Was wondering the same thing! Does anyone have any examples of proper usage of optax.ema?
-
What do you mean by proper usage? You could create some variant of Adam by chaining rmsprop and ema, for example; that would probably work well.
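For illustration, here is a minimal sketch of what such a chain could look like; the learning rate and decay values are placeholders, not recommendations:

import optax

# An Adam-like optimizer assembled from smaller transformations:
# scale_by_rms provides the RMSProp-style second-moment scaling, and
# ema smooths the resulting updates, acting like a first-moment term.
learning_rate = 1e-3  # placeholder value
optimizer = optax.chain(
    optax.scale_by_rms(),         # divide updates by a running RMS of the gradients
    optax.ema(decay=0.9),         # exponential moving average of the scaled updates
    optax.scale(-learning_rate),  # flip sign and scale for gradient descent
)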
-
This is my final implementation of an exponential moving average (EMA) over the parameters of my neural network:

from typing import Any, NamedTuple

import jax
import optax

PyTree = Any  # any nested structure of arrays
OptState = optax.OptState

class State(NamedTuple):
    params: PyTree
    opt_state: OptState

grads = jax.grad(loss_fn)(state.params)
updates, new_opt_state = optimizer.update(grads, state.opt_state)
new_params = optax.incremental_update(
    new_tensors=optax.apply_updates(state.params, updates),
    old_tensors=state.params,
    step_size=0.999,
)  # exponential moving average between the previous params and the params proposed by the optimizer
new_state = State(params=new_params, opt_state=new_opt_state)

This avoids needing to separately track an EMA copy of the parameters.
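For completeness, a minimal sketch of the surrounding setup this snippet assumes; loss_fn, the optimizer choice, and the initial parameters below are placeholders rather than anything from the original post:

import jax.numpy as jnp
import optax

def loss_fn(params):
    # placeholder loss; replace with the real model loss
    return jnp.sum(params["w"] ** 2)

init_params = {"w": jnp.ones(3)}
optimizer = optax.adam(1e-3)  # placeholder optimizer
state = State(params=init_params, opt_state=optimizer.init(init_params))

Each training step then recomputes grads, updates, new_params, and new_state as above, carrying the new State forward.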
-
Hello, everyone! I'm training my model with the adamw optimizer and an exponential moving average, and I used optax.chain to combine gradient clipping, the optimizer, and optax.ema. However, when I started training, the loss seems hard to converge:

Did I use optax.ema properly? The version of optax is 0.1.3.
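For reference, a sketch of the kind of chain the post describes; the clipping threshold, learning rate, and decay are placeholder values, since the actual settings only appear in the screenshot:

import optax

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),   # gradient clipping (placeholder threshold)
    optax.adamw(learning_rate=1e-3),  # placeholder learning rate
    optax.ema(decay=0.999),           # EMA over the updates emitted by adamw
)

Note that optax.ema placed in a chain averages the updates produced by the preceding transformations; it does not track an exponential moving average of the parameters themselves, which is what the incremental_update approach above achieves.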