Reliably differentiating an L-BFGS optimizer without floating point error? #1204

lankef · 2025-02-22T03:28:21Z

lankef
Feb 22, 2025

Hello all,

I am fairly new to optax and am trying to call jacfwd on an L-BFGS routine, but it occasionally gives me nan.

import jax.numpy as jnp
import optax
import optax.tree_utils as otu
from jax import jacfwd

# A loop version of while_loop for debugging
def wl_debug(cond_fun, body_fun, init_val):
    val = init_val
    while cond_fun(val):
        val = body_fun(val)
    return val

def run_opt_optax(init_params, fun, max_iter, ftol, xtol, gtol, opt):
    value_and_grad_fun = optax.value_and_grad_from_state(fun)
    # Carry is params, update, value, dx, du, df, state1
    def step(carry):
        params1, updates1, value1, _, _, _, state1 = carry
        value2, grad2 = value_and_grad_fun(params1, state=state1)
        updates2, state2 = opt.update(
            grad2, state1, params1, value=value2, grad=grad2, value_fn=fun
        )
        params2 = optax.apply_updates(params1, updates2)
        return(
            params2, updates2, value2, 
            jnp.linalg.norm(params2 - params1), 
            jnp.linalg.norm(updates2 - updates1), 
            jnp.abs(value2 - value1), 
            state2
        )
  
    def continuing_criterion(carry):
        params, _, _, dx, du, df, state = carry
        iter_num = otu.tree_get(state, 'count')
        grad = otu.tree_get(state, 'grad')
        err = otu.tree_l2_norm(grad)
        # Stopping condition based on gradient and changes in x, u and f.
        return (iter_num == 0) | (
            (iter_num < max_iter) 
            & (err >= gtol)
            & (dx >= xtol)
            & (du >= xtol) # not using a separate tol for now
            & (df >= ftol)
        )
  
    init_carry = (
        init_params, 
        jnp.zeros_like(init_params, dtype=jnp.float32),
        0., 0., 0., 0.,
        opt.init(init_params)
    )
    final_params, _, _, _, _, _, final_state = wl_debug(
        continuing_criterion, step, init_carry
    )
    return(otu.tree_get(final_state, 'params'))

# x are n optimization dofs
# a are additional params we will try to differentiate wrt
f = lambda x, a: <some function of x and a>
opt = optax.lbfgs()
def f_opt(a):
    # Close over a so that the optimizer sees a function of x only.
    f_a = lambda x: f(x, a)
    x_opt = run_opt_optax(
        init_params=jnp.zeros(n),
        fun=f_a,
        max_iter=1000,
        ftol=1e-7,
        xtol=1e-7,
        gtol=1e-7,
        opt=opt
    )
    return f_a(x_opt)

da_df_opt = jacfwd(f_opt)
da_df_opt(<some value of a>)

Edit: Evaluating da_df_opt(<some value of a>) occasionally gives nan. To see what's going on, I added some print statements in scale_by_lbfgs. It appears that the tangent of updates is the first thing to overflow, inside scale_by_lbfgs in transform.py.

updates Traced<ShapedArray(float32[40])>with<JVPTrace> with
  primal = Array([ 0.03709674, -0.1340442 ,  0.15852556,  0.3838429 ,  1.6163986 ,
        0.7686366 , -0.1452667 ,  0.03442389,  0.45168406,  0.46901456,
        0.3512962 , -0.7028064 , -0.85333073,  0.48626453, -1.5292846 ,
       -0.30536592,  0.3962326 ,  0.0483312 ,  0.3313378 , -0.78124094,
        0.43221253,  1.764327  , -0.9597864 ,  0.00219985,  0.6203904 ,
        0.02596491,  0.34479657, -0.34165007,  0.10995138,  0.5534703 ,
       -0.59125775, -0.84366196,  0.09906857, -0.23047997, -0.00866063,
       -0.09492865, -0.16971865, -0.43064022,  0.17965697, -0.38844582],      dtype=float32)
  tangent = Traced<ShapedArray(float32[40])>with<BatchTrace> with
    val = Array([[ 4.75278335e+32, -7.45372518e+32, -2.32712286e+33,
         1.54883661e+33,  1.09408994e+33,  2.01539068e+33,
        -1.20603562e+33, -5.61383796e+32, -6.11215063e+30,
         4.31446040e+31, -1.49859064e+33, -8.95668296e+30,
         2.48097606e+33,  2.96554958e+32, -1.30273421e+32,
        -1.62834161e+33,  2.03013520e+32, -1.42886630e+31,
        -2.49652447e+32, -8.71307384e+32,  2.05811338e+33,
         1.74479741e+33,  9.50509937e+32, -8.07137679e+32,
        -6.72502029e+32,  7.57263474e+32, -7.38074963e+30,
         1.11680064e+32,  5.58054356e+32,  2.51651391e+33,
        -1.59012160e+33, -3.34282767e+32, -1.53292042e+33,
         5.53874335e+32,  6.27522135e+32,  3.46913837e+30,
         1.82462903e+32,  6.30126761e+32, -9.51484099e+31,
        -4.04623704e+33],
        ...,
       [            nan,             nan,             nan,
                    nan,             nan,             nan,
                    nan,             nan,             nan,
                    nan,             nan,             nan,
                    nan,             nan,             nan,
                    nan,             nan,             nan,
                    nan,             nan,             nan,
                    nan,             nan,             nan,
                    nan,             nan,             nan,
                    nan,             nan,             nan,
                    nan,             nan,             nan,
                    nan,             nan,             nan,
                    nan,             nan,             nan,
                    nan],
       ...

l-bfgs works robustly on the problem, and much better than sgd or adam, so I'd rather not switch to another algorithm.

Since it's the tangent, not the primal value overflowing, I don't quite know what to do. Is it possible to add a stopping criterion, or modify the lbfgs somehow to prevent this?

Thanks a lot in advance!🙏🙏🙏


rdyro · 2025-02-23T20:45:00Z

rdyro
Feb 23, 2025
Maintainer

Oh, this is such an interesting failure mode! I'd love to take a look!

This might be related to #1189, but that's just a guess.

Thank you for the repro, I'll try it out. In the meantime if you want to push on it yourself as well, maybe useful here: https://docs.jax.dev/en/latest/debugging/flags.html

3 replies

lankef Feb 23, 2025
Author

I really appreciate the interest!
I actually tried running with the jax_debug_nans flag a few days ago. It seems to hit a false positive very early on due to

  File "/home/***/code/optax/optax/_src/transform.py", line 1680, in update_fn
    vdot_diff_params_updates == 0.0, 0.0, 1.0 / vdot_diff_params_updates
                                          ~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
FloatingPointError: invalid value (nan) encountered in jit(true_divide). Because jax_config.debug_nans.value and/or config.jax_debug_infs is set, the de-optimized function (i.e., the function as if the `jit` decorator were removed) was called in an attempt to get a more precise error message. However, the de-optimized function did not produce invalid values during its execution. This behavior can result from `jit` optimizations causing the invalid value to be produced. It may also arise from having nan/inf constants as outputs, like `jax.jit(lambda ...: jax.numpy.nan)(...)`. 

It may be possible to avoid the invalid value by removing the `jit` decorator, at the cost of losing optimizations. 

If you see this error, consider opening a bug report at https://github.com/jax-ml/jax.

I'm not super sure if this is related to jax-ml/jax#22291, which has been closed already.

Because of this, the code stops in the first iteration when this flag is set to True. The tangent of updates, though, takes many iterations to become nan.

I apologize for not posting a code snippet that reproduces it. The code that generates this problem is pretty long. I don't want to trouble you for looking for issues in my code, but I can post a git link if you can't reproduce it.

rdyro Feb 24, 2025
Maintainer

Oh, I just noticed you left some parts out. I attempted to reproduce it myself, but couldn't so far.

One possible workaround for debugging nans is manually using jax.debug.callback conditionally as shown here: #1185 - this should be able to match the actual jit numerics as closely as possible.
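A minimal sketch of that kind of check, assuming the goal is only to learn which intermediate value first contains a NaN while keeping the computation jitted. The names `check_nan`, `_record`, `nan_reports` and `toy_fn` are hypothetical and not taken from #1185; here the callback is always invoked and the filtering happens on the host.

```python
from functools import partial

import jax
import jax.numpy as jnp

nan_reports = []  # filled on the host whenever a NaN is seen


def _record(name, has_nan, value):
    # Runs on the host with concrete arrays; only offending values are kept.
    if bool(has_nan):
        nan_reports.append((name, value))


def check_nan(name, x):
    """Report x to the host if it contains a NaN; returns x unchanged."""
    has_nan = jnp.any(jnp.isnan(x))
    jax.debug.callback(partial(_record, name), has_nan, x)
    return x


@jax.jit
def toy_fn(x):
    y = check_nan("x", x)
    # sqrt of a negative number is nan, so this trips for x < 1.
    return check_nan("sqrt(x - 1)", jnp.sqrt(y - 1.0))


toy_fn(jnp.array([1.0, 2.0]))  # no NaN anywhere
toy_fn(jnp.array([1.0, 0.0]))  # NaN appears in the sqrt
jax.effects_barrier()          # wait for pending callbacks to finish
print([name for name, _ in nan_reports])
```

Because the check sits inside the jitted function, it sees the values the compiled program actually produces, not those of a de-optimized rerun.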

I apologize for not posting a code snippet that reproduces it. The code that generates this problem is pretty long. I don't want to trouble you for looking for issues in my code, but I can post a git link if you can't reproduce it.

If you can post the full repro, I'd really appreciate it!

L-BFGS should avoid dividing by a small number in the forward pass, but maybe the backward pass becomes contaminated, e.g. here:

optax/optax/_src/transform.py

Line 1710 in 9b682ab

capped_inv_norm = jnp.minimum(1.0, 1.0/otu.tree_l2_norm(updates))

referenced in Stop gradient in linesearch identity scaling #1190 contributed by @younik

We should have this fix merged this week.
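To illustrate the general failure mode (a guarded division whose forward value is fine but whose reverse-mode gradient is not), here is a small standalone sketch. It does not use optax, and `naive_inv` and `safe_inv` are hypothetical names; whether this is what actually happens inside the linesearch is not established here.

```python
import jax
import jax.numpy as jnp


def naive_inv(x):
    # The forward value is guarded, but the unselected 1/x branch is still
    # differentiated, and its derivative at x == 0 is infinite.
    return jnp.where(x == 0.0, 0.0, 1.0 / x)


def safe_inv(x):
    # "Double where": sanitize the input before dividing, then select.
    is_zero = x == 0.0
    safe_x = jnp.where(is_zero, 1.0, x)
    return jnp.where(is_zero, 0.0, 1.0 / safe_x)


x0 = jnp.float32(0.0)
print(naive_inv(x0), safe_inv(x0))  # both forward values are zero
print(jax.grad(naive_inv)(x0))      # nan: a zero cotangent times an infinite derivative
print(jax.grad(safe_inv)(x0))       # zero, no nan
```

Wrapping the scaling factor in jax.lax.stop_gradient is another way to avoid the problem when its derivative is not needed.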

Regardless, let's keep digging into this!

lankef Feb 24, 2025
Author

Hi rdyro,

This should reproduce the error.

Archive.zip

I ran it on a V100. I have also included the following new lines in transform.py to track the variables:

Edit: I've noticed that #1190 is merged to the main branch, so I gave it a try, but the behavior seems to persist.

  def update_fn(
      updates: base.Updates, state: ScaleByLBFGSState, params: base.Params
  ) -> tuple[base.Updates, ScaleByLBFGSState]:
    # Essentially memory_idx is the iteration k (modulo the memory size)
    # and prev_memory_idx is k-1 (modulo the memory size).
    memory_idx = state.count % memory_size
    prev_memory_idx = (state.count - 1) % memory_size

    # We first update the preconditioner and then preconditon the updates.
    # That way, we can chain this function with a linesearch to update the
    # preconditioner only once a valid stepsize has been found by the linesearch
    # and the step has been done.

    # 1. Updates the memory buffers given fresh params and gradients/updates
    diff_params = otu.tree_sub(params, state.params)
    diff_updates = otu.tree_sub(updates, state.updates)
    vdot_diff_params_updates = otu.tree_real(
        otu.tree_vdot(diff_updates, diff_params)
    )
    weight = jnp.where(
        vdot_diff_params_updates == 0.0, 0.0, 1.0 / vdot_diff_params_updates
    )
    print('params', params) # lankef: NEW LINE
    print('updates', updates) # lankef: NEW LINE
    print('diff_params', diff_params) # lankef: NEW LINE
    print('diff_updates', diff_updates) # lankef: NEW LINE
    print('vdot_diff_params_updates', vdot_diff_params_updates) # lankef: NEW LINE
    print('weight', weight) # lankef: NEW LINE

The full output is very long so I won't include it here. The nan first appears at iteration 108 in the lbfgs loop.

rdyro · 2025-03-02T03:02:00Z

rdyro
Mar 2, 2025
Maintainer

I was able to run your repro. Unfortunately, the NaN debugging utilities in jax are still a work in progress.

I think in your case the linesearch is what causes the NaNs in the gradients. I tried replacing the default zoom linesearch in lbfgs with linesearch = _linesearch.scale_by_backtracking_linesearch(2, store_grad=True), and the gradients are then non-NaN. With steps >= 2, they become NaN even with the backtracking line search.

One possible temporary solution would be to implement your own linesearch in a way that's gradient-safe. I'll add investigating the linesearch gradient stability to my TODO list.

5 replies

rdyro Mar 2, 2025
Maintainer

Oh, another bonus might be that a custom linesearch based on jax.lax.scan instead of jax.lax.while_loop might be backwards differentiable!
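A rough sketch of what such a scan-based search could look like, assuming a plain backtracking rule with a fixed number of trial steps. This is not optax's linesearch; `scan_backtracking`, `num_trials`, `shrink` and `one_step` are hypothetical names. Because jax.lax.scan runs a fixed number of iterations, reverse-mode differentiation through it is supported, which is not the case for jax.lax.while_loop.

```python
import jax
import jax.numpy as jnp


def scan_backtracking(fun, x, direction, num_trials=20, shrink=0.5):
    """Fixed-length backtracking: keep the first trial step that lowers fun."""
    f0 = fun(x)

    def body(carry, _):
        step, accepted_step, done = carry
        improves = fun(x + step * direction) < f0
        # Record the first improving step, then freeze it.
        accepted_step = jnp.where(~done & improves, step, accepted_step)
        done = done | improves
        return (step * shrink, accepted_step, done), None

    init = (jnp.ones_like(f0), jnp.zeros_like(f0), jnp.array(False))
    (_, accepted_step, _), _ = jax.lax.scan(body, init, None, length=num_trials)
    # If no trial improved, accepted_step stays 0 and x is returned unchanged.
    return x + accepted_step * direction


def one_step(a):
    # Toy problem: one steepest-descent step on sum((x - a)**2) from x = 0.
    fun = lambda x: jnp.sum((x - a) ** 2)
    x0 = jnp.zeros_like(a)
    direction = -jax.grad(fun)(x0)
    return scan_backtracking(fun, x0, direction)


a = jnp.array([1.0, 2.0])
jac = jax.jacrev(one_step)(a)  # reverse mode works through the scan
print(jac)
```

The trial step sizes here do not depend on the parameters, so the derivative only flows through x and the search direction; the accept/reject comparisons contribute no gradient.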

lankef Mar 4, 2025
Author

I see. I am still getting nans using linesearch=optax.scale_by_backtracking_linesearch(2, store_grad=True), but thank you for narrowing the issue down to the line search at least. Let me see if I can figure out which step in the line search is causing the issue.

rdyro Mar 4, 2025
Maintainer

The local minimization procedures in the linesearch might be to blame given it does a lot of divisions. You could try implementing a simple scale-by-0.7-until-fn-value-is-lower-than-last-value for numerical stability.

Another possibility could be using implicit differentiation like here or here
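A minimal sketch of the implicit-differentiation idea, assuming the optimizer has converged so that grad_x f(x*, a) = 0 and the Hessian in x is invertible there. Differentiating that condition with respect to a gives dx*/da = -H_xx^{-1} d(grad_x f)/da, so the optimizer's iterations never need to be differentiated. The helper `implicit_jacobian` and the toy objective are hypothetical; in the toy problem the exact minimizer is computed directly and stands in for the L-BFGS result.

```python
import jax
import jax.numpy as jnp


def implicit_jacobian(f, x_star, a):
    """dx*/da from the optimality condition grad_x f(x*, a) = 0."""
    grad_x = jax.grad(f, argnums=0)
    hess_xx = jax.jacfwd(grad_x, argnums=0)(x_star, a)   # shape (n, n)
    cross_xa = jax.jacfwd(grad_x, argnums=1)(x_star, a)  # shape (n, m)
    return -jnp.linalg.solve(hess_xx, cross_xa)


# Toy objective: f(x, a) = 0.5 x^T A x - a . x, minimized at x* = A^{-1} a.
A = jnp.array([[2.0, 1.0], [1.0, 3.0]])


def f(x, a):
    return 0.5 * x @ A @ x - a @ x


a = jnp.array([1.0, -1.0])
x_star = jnp.linalg.solve(A, a)  # stand-in for the optimizer's output
dx_da = implicit_jacobian(f, x_star, a)
print(dx_da)                     # should match the inverse of A
```

If only the optimal value f(x*(a), a) is needed, its derivative with respect to a reduces to the partial derivative of f in a evaluated at x*, since grad_x f vanishes there.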

Answer selected by lankef

lankef Mar 5, 2025
Author

Thank you for the suggestion. Now that I have thought more about it, implicit differentiation is probably a much better idea for this specific problem! Again, I really appreciate the time you spent looking at this with me.

rdyro Mar 5, 2025
Maintainer

Awesome, thanks for the repro, it was really useful for me to understand linesearch differentiability. Good luck!
