Replies: 1 comment
See the answer at #7991 (comment). In the future, there is no need to post your question twice.
Hi all,
When I run the code below without `config.update("jax_enable_x64", True)`, i.e. using 32-bit numbers, I see parallel performance: running a batch of 1000 `test_fn` evaluations with `vmap` takes the same time as a single chain. However, when I use 64-bit numbers, the runtime increases. The increase is not large (1000 chains take about twice as long as 1 chain), but I do not understand why it happens at all. I reproduced the same result in Google Colab; this is the exact printout (from a run AFTER the first one, to exclude compilation time):
```
1 parallel batches takes 0.127 seconds.
101 parallel batches takes 0.135 seconds.
201 parallel batches takes 0.147 seconds.
301 parallel batches takes 0.158 seconds.
401 parallel batches takes 0.159 seconds.
501 parallel batches takes 0.171 seconds.
601 parallel batches takes 0.195 seconds.
701 parallel batches takes 0.207 seconds.
801 parallel batches takes 0.212 seconds.
901 parallel batches takes 0.231 seconds.
1001 parallel batches takes 0.242 seconds.
```
Here is my code.
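For reference, a minimal version of the setup looks roughly like this; `test_fn`, the array shapes, and the timing loop here are cheap stand-ins, not my exact code:

```python
import time

import jax
import jax.numpy as jnp

# Toggle this line to compare the 32-bit vs 64-bit behavior.
jax.config.update("jax_enable_x64", True)

def test_fn(x):
    # Hypothetical stand-in for the real per-chain computation.
    return jnp.sum(jnp.sin(x) ** 2)

batched = jax.jit(jax.vmap(test_fn))
key = jax.random.PRNGKey(0)

for n in range(1, 1002, 100):  # 1, 101, ..., 1001 chains, as in the printout
    xs = jax.random.normal(key, (n, 100))
    batched(xs).block_until_ready()  # warm-up run to exclude compile time
    start = time.perf_counter()
    batched(xs).block_until_ready()
    print(f"{n} parallel batches takes {time.perf_counter() - start:.3f} seconds.")
```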
Sidenote: even using only 32-bit numbers, I observe the same less-than-parallel performance when I `vmap` over functions (MCMC chains via `lax.scan`) that involve large matrix multiplications, neural networks, etc. The exact same MCMC implementation yields parallel performance when sampling a 100-dimensional Gaussian, which requires no more than `jnp.sum` and its `jax.grad` gradient. Could complex operations also ruin parallelism?