First of all, thanks JAX team. It's a great tool enabling much GPU-accelerated computing. Currently, I am trying to split some custom calculations over multiple GPUs to achieve pipeline parallelism in the style of GPipe and PipeDream, so I made a small experiment as follows:

```python
import jax
import jax.numpy as jnp

KEY = jax.random.PRNGKey(42)
GPUS = jax.devices("gpu")
print(GPUS)  # yields [GpuDevice(id=0, process_index=0), GpuDevice(id=1, process_index=0)]

# Stage 0: weights and matmul pinned to GPU 0.
w0 = jax.device_put(jax.random.normal(KEY, shape=(100, 100)), device=GPUS[0])
fn0 = jax.jit(lambda x: jnp.dot(w0, x), device=GPUS[0])

# Stage 1: weights and matmul pinned to GPU 1.
w1 = jax.device_put(jax.random.normal(KEY, shape=(100, 100)), device=GPUS[1])
fn1 = jax.jit(lambda x: jnp.dot(w1, x), device=GPUS[1])

def version1(x):
    y1 = fn0(x)   # computed on GPU 0
    y2 = fn1(y1)  # y1 has to reach GPU 1 first
    return y2

with jax.profiler.trace("./_tmp"):
    inputs = jax.random.normal(KEY, shape=(100,))
    result = version1(inputs)
```

I expected the intermediate `y1` to be transferred directly from GPU 0 to GPU 1. Is it possible to explicitly control the transfer between devices? Or is it possible to make JAX do a more efficient memory transfer? I noticed `pjit`, but the GSPMD paper suggests that it is basically for homogeneous model splitting and not suited to my needs. I also noticed #6014, but I am not sure whether my case is related to it.

I use Conda to manage the environment on an Ubuntu 20.04 machine with multiple RTX 5000 GPUs. The environment includes:
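For concreteness, here is the kind of explicit control I have in mind, as a minimal sketch (the `version2` name is mine, and I don't know whether this actually changes the transfer path):

```python
# Hypothetical variant: force the intermediate onto GPU 1 ourselves
# before invoking the second stage, instead of relying on JAX's
# automatic handling of the committed result.
def version2(x):
    y1 = fn0(x)                              # produced on GPU 0
    y1 = jax.device_put(y1, device=GPUS[1])  # explicit GPU 0 -> GPU 1 copy
    return fn1(y1)                           # consumed on GPU 1
```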
Here is the raw trace_viewer output in TensorBoard and a detailed examination:
I also made a few other attempts, such as jitting `version1` or avoiding the intermediate `y1` and `y2`. But none of them gave the desired result:
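Roughly, the variants I tried look like this (a minimal sketch; the exact code of those attempts is not shown above, so the names are mine):

```python
# Attempt 1: jit the whole pipeline and let XLA decide placement.
version1_jitted = jax.jit(version1)

# Attempt 2: chain the stages directly so y1 and y2 are never
# bound to names on the host side.
def version1_chained(x):
    return fn1(fn0(x))
```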