Replies: 3 comments 8 replies
-
I'm not sure what's happening either! What jax + jaxlib versions are you using? I'm trying to repro.
-
I think the computations aren't running in parallel because they're so short that the dispatch time is longer than the time it takes them to run. You can see that every GPU kernel begins and finishes execution during its corresponding `JaxCompiledFunction(mean)` call on the host, which is what dispatches the computation and prepares the result. I believe if you were to jit a larger function, you would begin to see overlapping execution.
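For example, something along these lines (a minimal sketch; the function, matrix size, and iteration counts are illustrative, not from your code) should show kernels overlapping across devices:

```python
import jax
import jax.numpy as jnp

devices = jax.devices()

@jax.jit
def heavy(x):
    # Enough work per call that kernel runtime exceeds host dispatch time.
    for _ in range(10):
        x = x @ x / x.shape[0]
    return jnp.mean(x)

# Commit one input to each device; jit runs on its input's device.
xs = [jax.device_put(jnp.ones((4096, 4096)), d) for d in devices]

# Each call returns as soon as the work is dispatched, so the kernels
# on different devices can execute concurrently.
results = [heavy(x) for x in xs]
for r in results:
    r.block_until_ready()
```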
-
@skye friendly ping :)
-
Hi JAX team,
I expect a jitted function to be dispatched asynchronously: the calling thread should return immediately and be free to make other calls. One could leverage this property to run GPU kernels on different devices in parallel.
e.g.
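Roughly the following (the array shape and loop count are illustrative; `device=` pins each jitted `jnp.mean` to one device):

```python
import jax
import jax.numpy as jnp

devices = jax.devices()

# One jitted mean per device.
means = [jax.jit(jnp.mean, device=d) for d in devices]

s = jnp.ones((1000, 1000))
for idx in range(4 * len(devices)):
    # With async dispatch, each call should return immediately,
    # letting kernels queue up on all devices at once.
    m = means[idx % len(devices)](s)
```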
I expect the CUDA kernels from the above code to overlap across the different devices. However, in the profiler trace the jitted functions appear to run synchronously, one after another. In fact, writing
`m = means[idx % len(devices)](s).block_until_ready()`
produces a timeline similar to the version without `block_until_ready`.
Is my understanding of async dispatch wrong?