Replies: 6 comments 2 replies
-
Thanks for the question! The issue is your use of nested for-loops within JIT. These are unrolled by the JIT compiler, meaning your program ends up sending a sequence of over 16000 individual instructions to XLA, which accounts for the slow compile times. Rather than nested for-loops, you should try expressing your program's logic in terms of vectorized array computations (similar to how NumPy achieves fast performance) or using JAX-specific tools like vmap. I'd show an example based on your function, but it seems to be a bit over-simplified. Note also that if you call the jitted function a second time on similar input (so that compilation time is not included), execution will be very fast, because XLA can optimize away these 16000 no-ops.
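For instance (a minimal sketch with a made-up function, not the original code), compare a Python loop that gets unrolled at trace time with a single vectorized reduction:

```python
import jax
import jax.numpy as jnp

# Loop version: jit unrolls the Python loop at trace time,
# emitting one XLA operation per iteration.
@jax.jit
def sum_squares_loop(x):
    total = 0.0
    for i in range(x.shape[0]):
        total = total + x[i] ** 2
    return total

# Vectorized version: a single reduction op, whatever the array size.
@jax.jit
def sum_squares_vec(x):
    return jnp.sum(x ** 2)

x = jnp.arange(4.0)
print(sum_squares_loop(x))  # 14.0
print(sum_squares_vec(x))   # 14.0
```

Both return the same value, but the vectorized version compiles in roughly constant time, while the loop version's compile time grows with the array length.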
-
I just don't see how to use vmap in my function foo, because I don't know JAX very well. Could you please help me in that case? Thank you.
-
There's no way to use vmap in your function

```python
def foo(a):
    return 0
```

I assume that your real function is more complex than this.
-
Is the benchmark example trying to measure dispatch time? It might be useful to figure out what you are trying to measure here.
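If the goal is to measure steady-state execution time rather than compilation or dispatch, one common pattern (a sketch with a made-up function f) is to warm up once and call block_until_ready(), since JAX dispatches work asynchronously and the call would otherwise return before the computation finishes:

```python
import time
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    return jnp.sum(x ** 2)

x = jnp.ones(1000)
f(x).block_until_ready()  # warm-up: trigger compilation once, outside the timing

start = time.perf_counter()
for _ in range(100):
    f(x).block_until_ready()  # block so async dispatch doesn't skew the timing
elapsed = time.perf_counter() - start
print(f"mean per call: {elapsed / 100:.2e} s")
```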
-
Actually, I'd like to improve the performance of the following functions:

```python
import jax.numpy as jnp

def f1(x):
    res = 0
    for e in x:
        res += (3.5 * e) ** 3
    return res

def f2(x):
    res = jnp.zeros(sorties)  # sorties: defined elsewhere
    for i in range(len(res)):
        res = res.at[i].set((x * i) ** 3)  # .at[].set() returns a new array
    return res
```
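For reference, vectorized equivalents of these loops might look like the following sketch (the names f1_vec and f2_vec are made up, and sorties is taken as an explicit argument here rather than a free variable):

```python
import jax.numpy as jnp

def f1_vec(x):
    # Sum of (3.5 * e) ** 3 over all elements, as one elementwise op
    # plus one reduction, instead of a Python loop.
    return jnp.sum((3.5 * x) ** 3)

def f2_vec(x, sorties):
    # (x * i) ** 3 for i = 0 .. sorties - 1, computed by broadcasting
    # x against an index array instead of filling entries one by one.
    return (x * jnp.arange(sorties)) ** 3
```

Both versions trace to a fixed number of XLA operations regardless of the input size, so jit compile times stay constant.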
-
Thank you for answering me. I have another question: inside a function, I have a JAX tracer object like this: Traced<ConcreteArray([1. 2. 3. 4. 5.])>with<JVPTrace(level=2/0)>
-
Hello,
For my internship, I need to compare performance between JAX and Autograd.
But when I try to run this code (for example), I get worse performance with JAX than with Autograd:
The outputs I got are:
I don't know where this issue comes from. I work in Visual Studio Code and run my program on a GPU server running Ubuntu. The CUDA version is the following:
I also tried the code in a Jupyter notebook and got the same results.
Thanks in advance for your help.