Why is grad(f) ~2 orders of magnitude more expensive to evaluate than f in this example #10284

fbartolic · 2022-04-14T18:07:29Z

Here is a reproducible example:

import jax
import jax.numpy as jnp
from jax import jit, grad

# Use double precision and run on CPU
jax.config.update("jax_enable_x64", True)
jax.config.update("jax_platform_name", "cpu")

@jit
def _poly_coeffs_binary(w, a, e1):
    wbar = jnp.conjugate(w)

    p_0 = -(a ** 2) + wbar ** 2
    p_1 = a ** 2 * w - 2 * a * e1 + a - w * wbar ** 2 + wbar
    p_2 = (
        2 * a ** 4
        - 2 * a ** 2 * wbar ** 2
        + 4 * a * wbar * e1
        - 2 * a * wbar
        - 2 * w * wbar
    )
    p_3 = (
        -2 * a ** 4 * w
        + 4 * a ** 3 * e1
        - 2 * a ** 3
        + 2 * a ** 2 * w * wbar ** 2
        - 4 * a * w * wbar * e1
        + 2 * a * w * wbar
        + 2 * a * e1
        - a
        - w
    )
    p_4 = (
        -(a ** 6)
        + a ** 4 * wbar ** 2
        - 4 * a ** 3 * wbar * e1
        + 2 * a ** 3 * wbar
        + 2 * a ** 2 * w * wbar
        + 4 * a ** 2 * e1 ** 2
        - 4 * a ** 2 * e1
        + 2 * a ** 2
        - 4 * a * w * e1
        + 2 * a * w
    )
    p_5 = (
        a ** 6 * w
        - 2 * a ** 5 * e1
        + a ** 5
        - a ** 4 * w * wbar ** 2
        - a ** 4 * wbar
        + 4 * a ** 3 * w * wbar * e1
        - 2 * a ** 3 * w * wbar
        + 2 * a ** 3 * e1
        - a ** 3
        - 4 * a ** 2 * w * e1 ** 2
        + 4 * a ** 2 * w * e1
        - a ** 2 * w
    )

    p = jnp.stack([p_0, p_1, p_2, p_3, p_4, p_5])

    return jnp.moveaxis(p, 0, -1)

@jit
def lens_eq_binary(z, a, e1):
    zbar = jnp.conjugate(z)
    return z - e1 / (zbar - a) - (1.0 - e1) / (zbar + a)

@jit
def lens_eq_jac_det_binary(z, a, e1):
    zbar = jnp.conjugate(z)
    return 1.0 - jnp.abs(e1 / (zbar - a) ** 2 + (1.0 - e1) / (zbar + a) ** 2) ** 2

@jit
def images_point_source_binary(w, a, e1):
    # Compute complex polynomial coefficients for each element of w
    coeffs = _poly_coeffs_binary(w, a, e1)

    # Compute roots
    roots = jnp.roots(coeffs, strip_zeros=False)
    roots = jnp.moveaxis(roots, -1, 0)

    # Evaluate the lens equation at the roots
    lens_eq_eval = lens_eq_binary(roots, a, e1) - w

    # Mask out roots which don't satisfy the lens equation
    mask_solutions = jnp.abs(lens_eq_eval) < 1e-5

    return roots, mask_solutions

@jit
def mag_point_source_binary(w, a, e1):
    images, mask = images_point_source_binary(
        w, a, e1, 
    )
    det = lens_eq_jac_det_binary(images, a, e1)
    mag = (1.0 / jnp.abs(det)) * mask

    return mag.sum(axis=0).reshape(w.shape)


w = 0. + 0.3j 
f = lambda w: mag_point_source_binary(w, 0.5*0.9, 0.8)

%%timeit
f(w).block_until_ready()

%%timeit
grad(f)(w).block_until_ready()

Calling grad(f)(w) is at least 2 orders of magnitude more expensive than calling f(w). Does anyone know why that is the case here?

Answered by mattjj

Apr 14, 2022


mattjj · 2022-04-14T18:25:19Z

mattjj
Apr 14, 2022
Maintainer

Try putting jit on the outside of grad so that we can push more of the computation to XLA:

In [4]: %timeit f(w).block_until_ready()
24.1 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: timeit grad(f)(w).block_until_ready()
3.27 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: grad_f = jit(grad(f))

In [7]: timeit grad_f(w).block_until_ready()
32.2 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
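For anyone timing this outside IPython, the same comparison can be sketched with the standard-library timeit module. This is only an illustration: toy_f and time_per_call are made-up stand-ins, not the lens-equation code from the question, and the absolute numbers will differ by machine and JAX version.

```python
import timeit

import jax
import jax.numpy as jnp


@jax.jit
def toy_f(x):
    # Hypothetical stand-in for the jitted function in the question.
    return jnp.sum(jnp.sin(x) ** 2)


x = jnp.linspace(0.0, 1.0, 16)

grad_unjitted = jax.grad(toy_f)         # only the inner function is jitted
grad_jitted = jax.jit(jax.grad(toy_f))  # the whole gradient computation is jitted

# Warm up both callables so one-time compilation is excluded from the timings.
grad_unjitted(x).block_until_ready()
grad_jitted(x).block_until_ready()


def time_per_call(fn, number=200):
    # Average wall-clock seconds per call, waiting for the result each time.
    total = timeit.timeit(lambda: fn(x).block_until_ready(), number=number)
    return total / number


t_unjitted = time_per_call(grad_unjitted)
t_jitted = time_per_call(grad_jitted)
print(f"grad(f):      {t_unjitted * 1e6:.1f} µs per call")
print(f"jit(grad(f)): {t_jitted * 1e6:.1f} µs per call")
```

Both callables should return the same gradient values; only the per-call overhead is expected to differ.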

4 replies

fbartolic Apr 14, 2022
Author

Ah yes, I forgot that I need a jit outside of grad. Thank you!

Out of curiosity, why is the snippet below not equivalent to what you did above?

%%timeit
jit(grad(f))(w).block_until_ready()

mattjj Apr 14, 2022
Maintainer

That's due to a caching detail, which we could actually revise. Every time grad(f) is evaluated, it returns a new Python callable object. That means each jit(grad(f)) gets a fresh cache, so evaluating jit(grad(f))(x) multiple times never gets a compilation cache hit. On the other hand, if we let grad_f = grad(f) and then evaluate jit(grad_f)(x) multiple times, we do get cache hits. Safer still is to let grad_f = jit(grad(f)) and then call grad_f(x) multiple times.
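One way to see this behaviour is to count how often the Python body of the function runs, since it only runs while JAX is tracing. The sketch below uses a hypothetical toy_f and a plain list as a counter. It assumes the caching behaviour described in this reply, under which rebuilding the callable each time should trigger more traces than building it once.

```python
import jax
import jax.numpy as jnp

trace_log = []


def toy_f(x):
    # This Python body only runs while JAX traces the function, so the
    # log records traces, not executions of already-compiled code.
    trace_log.append("traced")
    return jnp.sin(x) ** 2


x = jnp.array(1.0)

# Pattern 1: rebuild jit(grad(f)) on every call.
trace_log.clear()
for _ in range(3):
    jax.jit(jax.grad(toy_f))(x)
traces_rebuilt = len(trace_log)

# Pattern 2: build the jitted gradient once and reuse it.
trace_log.clear()
grad_f = jax.jit(jax.grad(toy_f))
for _ in range(3):
    grad_f(x)
traces_reused = len(trace_log)

print("traces when rebuilt each call:", traces_rebuilt)
print("traces when built once:       ", traces_reused)
```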

We could make a cache for grad so that grad(f) always returns the same Python callable when given the same callable f, but that might be too much caching. So the safest thing to do is not to redefine callables.
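A cache of this kind can also be approximated in user code. The following is a minimal sketch, not something JAX provides: cached_jit_grad is a hypothetical helper that memoizes on the identity of the function object using functools.lru_cache, and toy_f is a made-up example function.

```python
import functools

import jax
import jax.numpy as jnp


@functools.lru_cache(maxsize=None)
def cached_jit_grad(f):
    # Hypothetical helper: return the same jitted gradient callable whenever
    # the same function object `f` is passed in. Note that the cache holds a
    # strong reference to `f`, so `f` is never garbage collected.
    return jax.jit(jax.grad(f))


trace_log = []


def toy_f(x):
    # Runs only during tracing, so the log counts traces.
    trace_log.append("traced")
    return jnp.sin(x) ** 2


x = jnp.array(1.0)
for _ in range(3):
    cached_jit_grad(toy_f)(x)

print("same callable:", cached_jit_grad(toy_f) is cached_jit_grad(toy_f))
print("number of traces:", len(trace_log))
```

This only helps when the same function object is passed each time. A fresh lambda on every call is a new object and would still miss the cache, which is one reason the simpler advice of defining grad_f = jit(grad(f)) once remains the safer default.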

fbartolic Apr 14, 2022
Author

Thank you for the clear explanation!

patrick-kidger May 6, 2022

Incidentally it is possible to rerun JIT-grad without recompiling, and without needing to cache the grad'd function. Using Equinox:

import equinox as eqx
import jax.numpy as jnp

def f(x):
    print("compiling!")
    return x

eqx.filter_jit(eqx.filter_grad(f))(jnp.array(1.))  # compiling!
eqx.filter_jit(eqx.filter_grad(f))(jnp.array(1.))  # nothing

This is a combination of two improvements relative to jax.jit and jax.grad:

1. filter_grad doesn't return a brand-new function. Instead it returns a PyTree, with f on one of its leaves, and a __call__ method defined for computing the gradient. As a result each call to filter_grad(f) produces an object with the same PyTree structure.
2. filter_jit can wrap any callable, not just functions. In particular: just like how function arguments can be PyTrees, the callable itself can be a PyTree as well.
