-
I've translated Facebook Research's original Llama implementation to JAX. Mostly it was a straightforward port, except that the JAX version runs 40 times slower. If I interpret the profiler correctly, JAX spends most of the time on CPU, while the prevalent operation on GPU is memory copy. If we break down the CPU section, the numbers add up: attention takes most of the time (33 ms), followed by feedforward (11 ms). The JAX variables include parameters and cache values; I checked that both are allocated on GPU before the test.
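Roughly, that placement check is of this shape (a minimal sketch; `params` and `cache` below are just placeholders for the actual model state, and `.device()` follows the JAX API used elsewhere in this thread):

```python
import jax
import jax.numpy as jnp

# Placeholders for the actual parameter and cache pytrees of the model.
params = {"attention": {"wq": jnp.zeros((4096, 4096))}}
cache = {"layer_0": {"k": jnp.zeros((1, 2048, 32, 128))}}

# Every leaf should report a GPU device, e.g. cuda(id=0).
for leaf in jax.tree_util.tree_leaves((params, cache)):
    print(leaf.device())
```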
Any idea why this may be happening?
Replies: 2 comments
-
I must be misunderstanding how JAX/Flax work, but I'm getting super strange results even on simpler examples.

```python
import jax

rng = jax.random.PRNGKey(0)
x = jax.random.normal(rng, (8, 4096))
w = jax.random.normal(rng, (4096, 1024))
x.device()  # cuda(id=0)
w.device()  # cuda(id=0)

import torch

pt_x = torch.randn((8, 4096)).to(torch.device("cuda"))
pt_w = torch.randn((4096, 1024)).to(torch.device("cuda"))

import timeit

N = 10_000
timeit.timeit(lambda: (x @ w).block_until_ready(), number=N)  # 0.95 seconds
timeit.timeit(lambda: pt_x @ pt_w, number=N)                  # 0.24 seconds
```

So a simple matrix multiplication is ~4 times slower in JAX than in PyTorch. I also tested it with Flax modules, and there the difference is almost 100x!

```python
import flax.linen as nn

dense = nn.Dense(1024, use_bias=False)
variables = dense.init(rng, x)
jax.tree_util.tree_leaves(variables)[0].device()  # cuda(id=0)
timeit.timeit(lambda: dense.apply(variables, x).block_until_ready(), number=N)  # 25.1 seconds (!)

import torch.nn as tnn

pt_dense = tnn.Linear(4096, 1024, bias=False).to(torch.device("cuda"))
timeit.timeit(lambda: pt_dense(pt_x), number=N)  # 0.3 seconds
```

Am I testing this the right way? If so, what can be the reason for such a difference in performance?

System:
-
As discussed in another thread, JIT-compiling the code resolves the issue. With JIT, the JAX version actually runs slightly faster than the PyTorch one.
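For concreteness, a minimal sketch of what that looks like for the `nn.Dense` benchmark above (the first call compiles, so it is excluded from the timing; actual numbers will vary by hardware):

```python
import timeit

import jax
import flax.linen as nn

rng = jax.random.PRNGKey(0)
x = jax.random.normal(rng, (8, 4096))

dense = nn.Dense(1024, use_bias=False)
variables = dense.init(rng, x)

# JIT-compile the forward pass once; later calls reuse the compiled
# executable instead of dispatching each op from Python.
apply_jit = jax.jit(dense.apply)
apply_jit(variables, x).block_until_ready()  # warm-up / compilation

N = 10_000
print(timeit.timeit(lambda: apply_jit(variables, x).block_until_ready(), number=N))
```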