What does the performance of lax.pmean depend on? #19651
Replies: 1 comment
-
This has been resolved: I was testing with very large matrices, and NCCL's ring all-reduce handles matrices of that size poorly. I don't know how JAX deals with this, but it caused a significant performance degradation. In practice, gradients consist of a set of smaller matrices, so this kind of slowdown should not affect real usage. When I instead distributed four 128M matrices across the 4 GPUs and ran pmean, I observed a significant speed-up from NVLink.
[screenshots: pmean results for server 0 and server 1, not reproduced here]
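For reference, a minimal sketch of the "real usage" case described above, with a made-up toy model (the names and sizes are illustrative, not the original test setup): the gradients form a pytree of modest-sized arrays, and lax.pmean averages the whole pytree across devices in one call.

```python
import functools

import jax
import jax.numpy as jnp
from jax import lax

# Toy parameters: a pytree of several modest-sized arrays, the shape gradients
# usually take in practice (rather than one huge matrix).
params = {
    "w1": jnp.ones((1024, 1024)) * 0.01,
    "b1": jnp.zeros((1024,)),
    "w2": jnp.ones((1024, 10)) * 0.01,
}

def loss_fn(p, x):
    h = jnp.tanh(x @ p["w1"] + p["b1"])
    return jnp.mean((h @ p["w2"]) ** 2)

@functools.partial(jax.pmap, axis_name="batch")
def ddp_grads(p, x):
    grads = jax.grad(loss_fn)(p, x)
    # lax.pmean accepts a whole pytree: every leaf (each gradient array)
    # is averaged across the devices participating in the pmap.
    return lax.pmean(grads, axis_name="batch")

n_dev = jax.local_device_count()
x = jnp.ones((n_dev, 32, 1024))  # one batch shard per device
# Replicate the parameters by stacking along a leading device axis.
params_repl = jax.tree_util.tree_map(lambda a: jnp.stack([a] * n_dev), params)
grads = ddp_grads(params_repl, x)  # averaged gradients, identical on every device
```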
-
I use lax.pmean to implement DDP. As I understand it, lax.pmean computes the average of the gradients across all GPUs, so speeding up lax.pmean is key to efficient parallel training. I tested the performance of lax.pmean on two servers: one is an AWS G5 instance with 4 A10G GPUs, and the other is a server with 4 V100 SXM2 GPUs connected by NVLink.
To my surprise, lax.pmean performed better on the 4×A10G instance without NVLink. I wonder why this is, and why NVLink did not help as expected.
The test code is as follows:
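(The original code listing was not preserved in this export. The sketch below is a stand-in, not the original test code: one minimal way to time lax.pmean on a single large matrix per device, with arbitrary sizes and iteration counts.)

```python
import time

import jax
import jax.numpy as jnp
from jax import lax

n_dev = jax.local_device_count()

# pmean over a named axis averages its argument across all devices in the pmap.
allreduce_mean = jax.pmap(lambda x: lax.pmean(x, axis_name="i"), axis_name="i")

# One large float32 matrix per device (shape chosen arbitrarily for illustration).
x = jnp.ones((n_dev, 4096, 4096), dtype=jnp.float32)

# Compile and warm up once, then time repeated collectives.
allreduce_mean(x).block_until_ready()
n_iters = 20
t0 = time.perf_counter()
for _ in range(n_iters):
    out = allreduce_mean(x)
out.block_until_ready()
print(f"avg lax.pmean time: {(time.perf_counter() - t0) / n_iters * 1e3:.2f} ms")
```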
The software environment is as follows (identical on both servers):
[screenshot: software environment]
The hardware of the two servers is as follows:
[screenshots: hardware of server 0 and server 1]
The results are as follows:
[screenshots: pmean timings on server 0 and server 1]
I also checked NCCL and confirmed that JAX used NVLink to run pmean. Additionally, I tried disabling NVLink on server 0, with the result as follows:
[screenshot: pmean timing on server 0 with NVLink disabled]
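One common way to see which transport NCCL picks, and to force it off NVLink for an A/B test, is through NCCL's standard environment variables, set before JAX (and therefore NCCL) initialises. These are generic NCCL settings, not JAX-specific flags; a minimal sketch:

```python
import os

# NCCL_DEBUG=INFO makes NCCL log which transport (P2P/NVLink, shared memory,
# or socket) each communicator uses, so you can confirm NVLink is in play.
os.environ["NCCL_DEBUG"] = "INFO"

# NCCL_P2P_DISABLE=1 turns off peer-to-peer transfers (including NVLink), which
# is one way to rerun the same collective without NVLink for comparison.
# os.environ["NCCL_P2P_DISABLE"] = "1"

import jax  # import after setting the variables so NCCL picks them up
from jax import lax
```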
I would like to know why the 4×V100 server shows no advantage for parallel training, and how I can improve it. Thank you very much for any answers!