Replies: 2 comments 1 reply
-
I don't really know the reason. One thing you could try is dumping the optimized HLO that XLA produces and comparing it between the two variants:

`print(jax.jit(f).lower(example_args).compile().compiler_ir()[0].to_string())`
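Expanding that one-liner a bit, here is a minimal sketch (the functions and shapes are made-up stand-ins for the two variants, not code from the thread) of dumping both programs so they can be diffed:

```python
import jax
import jax.numpy as jnp

def f_plain(coords):
    return jnp.sin(coords).sum(axis=-1)

def f_extra_vmap(coords):
    # Numerically equivalent to f_plain, but vmapped over the leading axis.
    return jax.vmap(lambda c: jnp.sin(c).sum(axis=-1))(coords)

example_args = (jnp.ones((16, 1024, 3)),)

for name, f in [("plain", f_plain), ("extra_vmap", f_extra_vmap)]:
    compiled = jax.jit(f).lower(*example_args).compile()
    print(f"===== {name} =====")
    # As in the one-liner above; newer JAX versions also expose compiled.as_text().
    print(compiled.compiler_ir()[0].to_string())
```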
-
This is an interesting example - thanks for sharing! Regarding your questions about `vmap` – it's hard to say much in general. The way `vmap` works is by transforming the computation one operation at a time: each primitive operation in JAX that is compatible with `vmap` has an associated batching rule describing how to evaluate it along an extra batch axis. Some batching rules are trivial; other batching rules are more complicated, and the batched computation they emit isn't always identical to what you would write by hand, which can in turn lead XLA to optimize it differently.

Perhaps one action item: if we're able to drill down and find out which particular operation accounts for the difference between your two variants, we could say something more concrete.

Does that answer your question?
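As a small illustrative sketch (just a toy, not your code), one way to see the batching rules at work is to compare jaxprs before and after `vmap`:

```python
import jax
import jax.numpy as jnp

def f(x, idx):
    # A toy function mixing an elementwise op with an indexing op.
    return jnp.sin(x)[idx]

x = jnp.ones(5)
idx = jnp.array(2)
print(jax.make_jaxpr(f)(x, idx))             # unbatched primitives

xs = jnp.ones((4, 5))
idxs = jnp.arange(4)
# vmap replaces each primitive according to its batching rule; the batched
# jaxpr is not always what you'd write by hand for the batched case.
print(jax.make_jaxpr(jax.vmap(f))(xs, idxs))
```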
-
Hi!
I'm hoping to get some advice on understanding some `vmap`-related performance peculiarities that I'm observing, for two numerically-equivalent variants of some model training code.

Here are plots of the training losses for the two variants (blue and pink), with step count as the x-axis on the left and wall-clock time as the x-axis on the right:

[Plot: training loss curves for the two variants vs. step count (left) and vs. wall-clock time (right)]
Depending on hyperparameters, there's a speed difference of around 3~4x when I run training in single-precision mode, and ~1.5x with mixed-precision.
The speedup is of course well-appreciated, but the problem is that it's the result of adding a seemingly unnecessary `vmap`, which gets applied to one of two batch axes in coordinates that are ultimately passed into `jax.scipy.ndimage.map_coordinates`... which is already vectorized over an arbitrary number of trailing batch axes.
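For reference, a stripped-down sketch of the kind of change I mean (the shapes and names here are made up for illustration, not taken from the repo linked below) looks roughly like:

```python
import jax
import jax.numpy as jnp
from jax.scipy.ndimage import map_coordinates

grid = jnp.zeros((64, 64, 64))      # some 3D feature volume (made-up shape)
coords = jnp.zeros((3, 16, 4096))   # (ndim, batch_axis_0, batch_axis_1)

def interp_direct(grid, coords):
    # map_coordinates is already vectorized over the trailing batch axes of
    # the coordinate arrays, so this handles both batch axes at once.
    return map_coordinates(grid, list(coords), order=1)

def interp_extra_vmap(grid, coords):
    # Numerically equivalent, but with a seemingly redundant vmap over the
    # first batch axis of the coordinates.
    inner = lambda c: map_coordinates(grid, list(c), order=1)
    return jax.vmap(inner, in_axes=1, out_axes=0)(coords)
```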
It's not crazy to me that this would happen — maybe the extra `vmap` impacts memory layout, or cache coherency, or how XLA ends up parallelizing underlying operations — but it makes me uncomfortable because (a) the throughput change is massive and (b) I stumbled into it completely by accident. My faith in my code and my competence is ultimately shaken; maybe there are other places where I can slide in seemingly unnecessary `vmap`s to get large performance boosts? Maybe there's much more to gain by reshaping, applying a `vmap`, and then reverting the reshape?

And it raises some questions, which I'm hoping I could get some high-level thoughts on from somebody who knows what they're doing:
- Are there tools or workflows for understanding why an extra `vmap` triggers such a drastic change? Maybe via `jax.profiler` or `jax.make_jaxpr`? (A rough sketch of the profiler route is at the end of this post.)
- Is there any intuition for predicting when adding a `vmap` like this would impact performance? The same applies to the ordering of `vmap`s, which I've found myself guessing and checking on to improve speed by a few percentage points.

Apologies for the lack of a compact example for reproducing this (I had trouble creating one), but here are the short few lines that get us from the blue curve to the pink curve above: https://github.com/brentyi/tensorf-jax/blob/8a9deba130b62bef4fdaae1db17382bc225014cd/tensorf/tensor_vm.py#L59-L73
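To make the profiling question concrete, here is a rough sketch of what I mean (`train_step_a` / `train_step_b` are placeholder names, not my actual code): capture a `jax.profiler` trace for each variant and compare them in TensorBoard's profiler plugin.

```python
import jax

def profile_variant(name, train_step, state, batch, n_steps=20):
    # Run one step first so compilation happens outside the trace.
    state = train_step(state, batch)
    jax.block_until_ready(state)
    # Capture a trace that can be opened in TensorBoard's profiler plugin.
    with jax.profiler.trace(f"/tmp/jax-trace/{name}"):
        for _ in range(n_steps):
            state = train_step(state, batch)
        jax.block_until_ready(state)
    return state

# Hypothetical usage with jitted training steps for the two variants:
# profile_variant("blue", train_step_a, state, batch)
# profile_variant("pink", train_step_b, state, batch)
```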
Thanks for reading!!