I am writing a multi-GPU particle simulation code with jax + CUDA-FFI. Particles are ordered along a space-filling curve, so any GPU may need data from any other GPU, but in general most communication happens between nearby GPUs and only a little with far-away GPUs.
However, today I wrote a little benchmark to understand its performance... and I am shocked. Below you can find a plot of the time it takes to finish a simple jitted+shard-mapped function with a perfectly balanced all-to-all communication on 64 GPUs (16 nodes, 4 GPUs per node). The x-axis shows the amount of data that is created on every GPU for the communication (so e.g. at 2GB each GPU sends around 32MB to each other GPU). The lines compare the performance of jax.lax.ragged_all_to_all versus jax.lax.all_to_all. Obviously jax.lax.all_to_all is more optimized for this scenario, so it is understandable that jax.lax.ragged_all_to_all would be notably slower. What puzzles me, however, is how large the gap is: the communication time for essentially empty messages is about 10 times worse than that of a fixed-size jax.lax.all_to_all. An empty ragged all-to-all communication is about as slow as a fixed all-to-all where each node sends ~8MB to each other node...
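In case it helps to see what I am timing, the two jitted+shard-mapped bodies look roughly like the sketch below. This is a simplified illustration, not the full ragged_all2all.py: the mesh/axis names and function names are made up, and the exact shard_map import path and keyword names differ a bit between jax versions (older versions may also need replication checking disabled).

```python
from functools import partial

import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P

mesh = Mesh(jax.devices(), axis_names=("i",))
n = mesh.size  # 64 GPUs in the benchmark setup

@jax.jit
@partial(jax.shard_map, mesh=mesh, in_specs=P("i"), out_specs=P("i"))
def fixed_a2a(x):
    # x is the local shard; its leading axis holds n equal chunks, one per peer.
    return jax.lax.all_to_all(x, "i", split_axis=0, concat_axis=0, tiled=True)

@jax.jit
@partial(jax.shard_map, mesh=mesh, in_specs=P("i"), out_specs=P("i"))
def ragged_a2a(x):
    # The same balanced exchange expressed through the ragged primitive:
    # every peer sends/receives a chunk of `rows` leading-axis rows.
    rows = x.shape[0] // n
    me = jax.lax.axis_index("i")
    out = jnp.zeros_like(x)  # preallocated receive buffer (balanced, so same size)
    input_offsets = jnp.arange(n, dtype=jnp.int32) * rows       # chunk starts in x
    send_sizes = jnp.full((n,), rows, dtype=jnp.int32)           # rows sent to each peer
    output_offsets = jnp.full((n,), me * rows, dtype=jnp.int32)  # where our chunk lands remotely
    recv_sizes = jnp.full((n,), rows, dtype=jnp.int32)           # rows received from each peer
    return jax.lax.ragged_all_to_all(
        x, out, input_offsets, send_sizes, output_offsets, recv_sizes, axis_name="i")
```

Both take a global array sharded along the mesh axis; in the balanced case the ragged variant is handed exactly the same chunk layout as the fixed one, which is why I expected roughly the same cost.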
Here is the script that produces this plot: ragged_all2all.py (run on a slurm system with 1 process per GPU). You can find the system specs here under compute nodes / booster partition. I am using CUDA 12.9 and jax 0.8.2. Does anyone have suggestions for getting more out of the ragged_all_to_all communication? I also wonder how to explain the extreme difference between jax.lax.all_to_all and jax.lax.ragged_all_to_all; I would have expected them to roughly need to do the same thing here, and understanding this might help me make the right decisions moving forward. On another note: does anyone know whether it is an option to implement custom communication kernels inside jax's FFI with NCCL?
Replies: 1 comment


With the help of a colleague, we found the problem. My slurm job was allocated with 1 CPU per task (the default on my cluster) and -- as recommended by jax -- with 1 task per GPU. I had subconsciously assumed that jax only uses the CPUs for compiling code and that the number of CPUs is therefore irrelevant. However, it turns out that the CPUs actually do a lot of work during communication! Simply by adding `#SBATCH --cpus-per-task=8` or `#SBATCH --exclusive` to my slurm job script, performance jumps up dramatically -- making the performance consistent between ragged and fixed all-to-all and also improving the fixed all-to-all notably:
For the low-data limit this is app…
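For reference, the relevant part of a job script with this fix could look like the sketch below. The node/GPU counts match the 16-node, 4-GPUs-per-node setup from the benchmark, but the gres line and exact resource flags are cluster-specific assumptions, not copied from my script.

```bash
#!/bin/bash
#SBATCH --nodes=16               # 16 nodes x 4 GPUs = 64 GPUs, as in the benchmark
#SBATCH --ntasks-per-node=4      # 1 task per GPU, as recommended by jax
#SBATCH --gres=gpu:4             # assumption: how GPUs are requested varies per cluster
#SBATCH --cpus-per-task=8        # the fix: give each process several CPU cores
# ...or alternatively request whole nodes:
##SBATCH --exclusive

srun python ragged_all2all.py
```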