I think the short answer is that lax.fori_loop is designed for truly sequential cases: each iterate depends on the previous one, and the step function is reasonably complex (think one step of gradient descent, or one action in a reinforcement-learning context). If jit were to statically unroll, say, 1000 steps of gradient descent, XLA would take a huge amount of time to compile the function. In the fori_loop context, XLA is encouraged (possibly required? Disclaimer: I'm not an XLA expert) to treat each iterate as its own black box, so it can't apply simplifications or speedups such as batching individual multiplies into a single faster matrix multiply.

In your…
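To make the trade-off concrete, here's a minimal sketch contrasting the two styles. The step function and constants are made up for illustration; the point is that the fori_loop version compiles the body once, while the Python-level loop is unrolled at trace time, so its compile time grows with the iteration count:

```python
import jax
import jax.numpy as jnp
from jax import lax

def gd_step(i, w):
    # One illustrative gradient-descent step on f(w) = 0.5 * ||w||^2,
    # whose gradient is simply w. Learning rate 0.1 is arbitrary.
    return w - 0.1 * w

@jax.jit
def train_fori(w0):
    # XLA compiles gd_step once and loops at runtime: compile time is
    # independent of the number of iterations, but each iterate is a
    # black box to the optimizer.
    return lax.fori_loop(0, 100, gd_step, w0)

@jax.jit
def train_unrolled(w0):
    # jit traces all 100 steps into one graph: compile time grows with
    # the iteration count, but XLA can optimize across steps.
    w = w0
    for i in range(100):
        w = gd_step(i, w)
    return w

w0 = jnp.ones(3)
print(train_fori(w0))      # same value as train_unrolled(w0)
```

Both functions compute the same result; the difference shows up in compilation time (try raising 100 to 10000 and timing the first call to each).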

Answer selected by gaspardbb