Overhead of host_callback vs jax primitives? #7022

lukepfister · 2021-06-18T20:14:28Z

lukepfister
Jun 18, 2021

I'm in a situation where I need to repeatedly call external code inside of a loop. I need to take in a DeviceArray, send it to external code, and return the result to a DeviceArray. The external code is itself Cython code that eventually calls CUDA code, though I don't believe it is written to accept CUDA arrays by default... I believe I have to pass through the host. Not entirely sure about that.

As a first pass, I just copied from DeviceArray -> ndarray and then called the function:

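import jax
import numpy as np

def crude_version(x):
    x_np = np.asarray(x)  # copy the DeviceArray to a host ndarray
    return jax.device_put(extern_func(x_np))  # copy the result back to the device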

This has obvious drawbacks: I can't jit any function that contains crude_version inside of it, pmap/vmap don't work, etc.
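A minimal sketch of the jit limitation, assuming a hypothetical NumPy stand-in for extern_func (the real routine is external Cython/CUDA code and is not shown in this thread). The eager call works, while the jitted call fails because the argument is an abstract tracer with no host values to copy:

```python
import jax
import jax.numpy as jnp
import numpy as np

def extern_func(x_np):
    # Hypothetical stand-in for the external Cython/CUDA routine.
    return np.sin(x_np)

def crude_version(x):
    x_np = np.asarray(x)  # needs a concrete array on the host
    return jax.device_put(extern_func(x_np))

x = jnp.arange(3.0)
eager_result = crude_version(x)  # fine outside of jit

jit_failed = False
try:
    jax.jit(crude_version)(x)
except jax.errors.TracerArrayConversionError:
    # Under jit, x is an abstract tracer, so np.asarray cannot read its values.
    jit_failed = True
```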

I tried a second version using host_callback:

from jax.experimental import host_callback as hcb

def hcb_version(x):
    return hcb.call(extern_func, x, result_shape=out_shape)  # out_shape: shape/dtype of the result

This works and lets me jit/vmap/pmap, but there is a drawback: it is over 10x slower than crude_version. Timing crude_version gives ~2 ms per call; hcb_version gives ~20 ms per call.

Two questions:

Is this level of overhead expected?
If so, would writing a jax Primitive be faster?

Answered by mattjj

mattjj · 2021-06-18T23:35:18Z

mattjj
Jun 18, 2021
Maintainer

Thanks for the questions!

For your first question, the host_callback mechanism is indeed inefficient and is being revised. So I guess I'd say yes it's expected, but temporary. Ultimately its overheads should be made low (or the overheads of the API that replaces it). But that doesn't help you now!

For the second question, writing a Primitive isn't necessarily an alternative; IIUC you'd still need a way for, say, a jit-compiled GPU program to call back onto the host to run your extern_func. So your Primitive's translation rule would need to solve the same problem that host_callback's machinery needs to solve. That's what I mean by it's not really an alternative: we still have the question of how to rig up the callback in the translation rule.

Do you have bindings to extern_func in Cython, or just in Python? That would determine whether the callback is into a Python function or a Cython (i.e. a C/C++) function. It wouldn't make a big difference either way, though.

On CPU, where you wouldn't need to transfer buffers to and from the GPU, if you have Cython bindings it's not too hard to rig up the kind of CustomCall mechanism you'd want underneath a Primitive for good performance. (host_callback doesn't yet use CustomCalls, but it's being revised to use them!) You can see examples in lapack.pyx.

If your external code accepted GPU arrays then you could do something like in cuda_prng_kernels.cc, cuda_prng_kernels.cu.cc, and cuda_rng.py, which use a GPU CustomCall. But if you need to transfer data to and from the GPU inside the call, I don't think we have a good example yet. The revised host_callback would be a good example, but with that you might be happy just using host_callback itself.

See also this amazing "Extending JAX with custom C++ and CUDA code" tutorial.

So to summarize:

host_callback should get faster soon (a couple weeks?)
you can try to rig up a new Primitive with a CustomCall translation rule, maybe following this tutorial, but it'll take some work

1 reply

lukepfister Jun 21, 2021
Author

Thanks Matt! I'll report back if we wind up going the CustomCall route.

nickmcgreivy · 2023-02-21T23:13:35Z

nickmcgreivy
Feb 21, 2023

Hi Matt and co,

I'm finding that host_callback still takes about 20ms per call. Is there any hope that host_callback will get faster soon? Or should I not count on it?

Thanks,
Nick

6 replies

nickmcgreivy Feb 22, 2023

On CPU each host_callback takes 125 microseconds. On GPU each host_callback takes 20ms. I'm running on an NVIDIA A100 GPU.

I've put simple example code below. I'm using jax.__version__ == 0.4.4, python version 3.10.9.

import jax.experimental.host_callback as hcb
import jax
import jax.numpy as jnp
from time import time

fn_sin = lambda x: jnp.sin(x)

@jax.jit
def f_cb(x):
	y = x**2 - 2 * x + 11
	return hcb.call(fn_sin, y, result_shape=y)

@jax.jit
def f(x):
	y = x**2 - 2 * x + 11
	return jnp.sin(y)

N = 100
nx = 10

@jax.jit
def create():
	x = jnp.linspace(0, 100, nx)
	return f(x)

@jax.jit
def create_cb():
	x = jnp.linspace(0, 100,  nx)
	return f_cb(x)

_ = create().block_until_ready()
_ = create_cb().block_until_ready()
t0 = time()
for j in range(N):
	_ = create().block_until_ready()
t1 = time()
for j in range(N):
	_ = create_cb().block_until_ready()
t2 = time()
print("Average time to run normal function: {}".format((t1 - t0)/N))
print("Average time to run callback function: {}".format((t2 - t1)/N))

sharadmv Feb 22, 2023
Collaborator

Could you try using np.sin in fn_sin? HCB will pass the callback CPU-backed NumPy arrays, and I'm not sure jnp will behave well inside the callback. Also, I suggest you try the newer callback APIs in general (jax.debug.callback, jax.pure_callback, jax.experimental.io_callback), though they will have a similar performance profile to HCB.
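A minimal sketch of that suggestion, assuming the same arithmetic as the example above. The names host_sin and f_pure_cb are illustrative and do not come from the thread. The callback uses np.sin on the host, and jax.pure_callback is told the result's shape and dtype up front so the function can still be jitted:

```python
import jax
import jax.numpy as jnp
import numpy as np

def host_sin(x):
    # Runs on the host; np.asarray makes sure we operate on a NumPy array.
    return np.sin(np.asarray(x))

@jax.jit
def f_pure_cb(x):
    y = x**2 - 2 * x + 11
    # Describe the callback's output so JAX can trace without calling it.
    out_spec = jax.ShapeDtypeStruct(y.shape, y.dtype)
    return jax.pure_callback(host_sin, out_spec, y)

x = jnp.linspace(0.0, 1.0, 10)
result = f_pure_cb(x)
expected = jnp.sin(x**2 - 2 * x + 11)
```

The callback has to return an array matching out_spec exactly; here np.sin preserves the input's shape and dtype, so the spec can be copied from y.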

nickmcgreivy Feb 22, 2023

Switching to jax.pure_callback speeds things up dramatically. Glory to Sharad, you've saved me many hours of trouble.
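For anyone who wants to measure this on their own setup, here is a rough timing sketch under the same assumptions as the benchmark above (10 elements, 100 repetitions). The helper average_seconds and the function names are illustrative, and the numbers will depend on the backend and JAX version:

```python
from time import perf_counter

import jax
import jax.numpy as jnp
import numpy as np

def host_sin(x):
    # Host-side callback body; uses NumPy rather than jnp.
    return np.sin(np.asarray(x))

@jax.jit
def f_plain(x):
    y = x**2 - 2 * x + 11
    return jnp.sin(y)

@jax.jit
def f_pure_cb(x):
    y = x**2 - 2 * x + 11
    return jax.pure_callback(host_sin, jax.ShapeDtypeStruct(y.shape, y.dtype), y)

def average_seconds(fn, x, n=100):
    fn(x).block_until_ready()  # warm-up call so compilation is not timed
    t0 = perf_counter()
    for _ in range(n):
        fn(x).block_until_ready()
    return (perf_counter() - t0) / n

x = jnp.linspace(0.0, 100.0, 10)
t_plain = average_seconds(f_plain, x)
t_cb = average_seconds(f_pure_cb, x)
print(f"plain jit:     {t_plain:.6f} s per call")
print(f"pure_callback: {t_cb:.6f} s per call")
```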

sharadmv Feb 22, 2023
Collaborator

Glad to see it helped, though I'm surprised you saw that big of a speedup from pure_callback.

sharadmv Feb 22, 2023
Collaborator

Looking more closely, I think HCB uses an infeed/outfeed-based callback instead of a CustomCall, so perhaps the speedup is actually reasonable.
