Slow compilation and OOM gradient computation #8365
Replies: 1 comment 9 replies
Hi all,
First of all, thank you so much for the awesome project. I was inspired by both Brax and a talk I saw @mattjj give recently, and thought that I could use JAX for my soft-structure differentiable simulation needs. I tried to port the diffmpm module from the difftaichi paper to JAX. Linked here is my attempt, where I've commented in two questions at the bottom (Line 294/298). I want to note that this is still not working perfectly - I know it has correctness bugs, but my questions here are specifically about performance.
I am extremely happy with the forward evaluation speed. It runs at approximately the speed I was hoping for, without too many software-engineering tricks. However, I am finding two aspects - compilation speed and gradient memory usage - to be a bit problematic.
I created a simulation function by jit-ing a few subroutines and executing them in a Python loop. This loop is executed n times, after which a loss value is returned.
Question 1: Is there any way to then jit the entire simulation, and is there even a point in doing so? I was unsure a) whether jits can be nested into one another, b) whether it's good practice to do so, and c) if it is allowed, why compilation takes so long. I have given up on compiling the full simulation, as it runs for minutes without terminating, which is not particularly useful.
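For reference, the pattern described above can be sketched with a toy step function (everything here is a hypothetical stand-in, not the linked diffmpm port). Nested jits are allowed - an inner jitted function is simply traced inline into the outer one - and the long compile time typically comes from a Python `for` loop unrolling into n copies of the step; `lax.scan` compiles the step body once instead:

```python
from functools import partial

import jax
import jax.numpy as jnp


@jax.jit
def step(state):
    # Toy stand-in for one simulation substep (not the real diffmpm kernel).
    return state * 0.99 + jnp.sin(state) * 0.01


# n_steps must be static because it fixes the scan length.
@partial(jax.jit, static_argnums=1)
def simulate(state, n_steps):
    # jit calls may be nested: `step` is traced inline here.
    # A Python `for` loop inside jit would unroll into n_steps copies of
    # the step, which is what makes compilation run for minutes;
    # lax.scan keeps the loop as a single XLA op with one step body.
    def body(carry, _):
        return step(carry), None

    final, _ = jax.lax.scan(body, state, None, length=n_steps)
    return jnp.sum(final)


loss = simulate(jnp.ones(8), 2000)
```

With this structure, recompilation happens only when the state shape or n_steps changes, not on every call.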
Question 2: One of the reasons JAX is so exciting to me is that it is differentiable and easily parallelizable, and higher-order gradients appear to be simple to compute and accelerated. Unfortunately, attempting to compute grad of my large function leads to out-of-memory issues on my 8GB GPU. Even accounting for stored intermediates, I'm genuinely not sure this should run out of memory: if my back-of-the-envelope math is right, each state variable is under 100kB, so even keeping all of them over 2,000 steps shouldn't take more than a few GB of memory - and that's without any compiler optimizations. Right now, I can only handle a few hundred steps. I could, of course, compute the grad of each simulation step and perform backpropagation manually, but this seems to defeat part of the purpose of JAX, and would make it impossible to efficiently compute higher-order derivatives. Is there a reason my simulation is using so much memory? If this is the expected memory consumption, then - since each loop iteration of my simulation is identical in structure - is there anything I can do to simplify the compilation procedure?
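One knob that may be relevant here is `jax.checkpoint` (a.k.a. `jax.remat`), which trades compute for memory in reverse-mode AD by recomputing a function's intermediates on the backward pass instead of storing them. A minimal sketch, again with a hypothetical toy step rather than the actual simulation:

```python
import jax
import jax.numpy as jnp


def step(state):
    # Hypothetical substep; the real state would be the MPM grid/particle arrays.
    return state * 0.99 + jnp.sin(state) * 0.01


def simulate(state, n_steps):
    # jax.checkpoint makes reverse-mode AD recompute this step's internal
    # intermediates during the backward pass instead of keeping them all
    # live, which can sharply reduce peak memory for long scans.
    checkpointed_step = jax.checkpoint(step)

    def body(carry, _):
        return checkpointed_step(carry), None

    final, _ = jax.lax.scan(body, state, None, length=n_steps)
    return jnp.sum(final)


grad_fn = jax.jit(jax.grad(simulate), static_argnums=1)
g = grad_fn(jnp.ones(8), 1000)
```

Under scan, the per-step carries are still saved for the backward pass, but the per-step intermediates inside `step` are not, so memory scales with the state size times the step count rather than with everything XLA would otherwise keep.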
If either issue is a bug as opposed to user error, please let me know and I will file a bug.