Replies: 1 comment 1 reply
-
Hi Roeland, I was looking through here for MPI content and saw this old question; sorry if you already got an answer.

In my experience this is doable. Many MPI job launchers set per-process environment variables that enable exactly this; try, for example, Open MPI's `OMPI_COMM_WORLD_LOCAL_RANK`. In my own code, I discover the local rank with mpi4py and then use that local rank to target the device I want specifically, rather than by hand. If you need a hint along those lines, the sketches below show how to get the local rank, how to turn that rank into a JAX device, and how to use that device in tensor creation calls; for the last step in my own code, see https://github.com/Nuclear-Physics-with-Machine-Learning/JAX_QMC_Public/blob/main/bin/sr.py#L179-L180
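First, the launcher-environment-variable route. This is a minimal sketch, not code from the repository above; it assumes Open MPI, which sets `OMPI_COMM_WORLD_LOCAL_RANK` for each process:

```python
import os

# Mask the GPUs before JAX initializes: each process keeps only the GPU
# whose index matches its local rank on the node.
local_rank = os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0")
os.environ["CUDA_VISIBLE_DEVICES"] = local_rank

import jax  # imported after the masking, so this process sees one GPU

print(jax.devices())
```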
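Getting the local rank with mpi4py (again a sketch of the idea, not my exact code): split the world communicator by shared-memory domain, so the rank within the resulting communicator is the rank on the node:

```python
from mpi4py import MPI

def get_local_rank():
    # Processes on the same node share a memory domain, so splitting by
    # MPI.COMM_TYPE_SHARED yields one communicator per node; the rank
    # inside it is this process's index among the processes on its node.
    local_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
    return local_comm.Get_rank()
```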
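Turning that rank into a JAX device (the function name here is hypothetical):

```python
import jax

def local_rank_to_device(local_rank):
    # Without any masking, every process sees all GPUs on the node, so
    # indexing the visible-device list by local rank gives each process
    # a distinct device.
    return jax.local_devices()[local_rank]
```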
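And finally, using the device in tensor creation calls, building on the two helpers above:

```python
import jax
import jax.numpy as jnp

device = local_rank_to_device(get_local_rank())

# Pin an individual array to the chosen device...
x = jax.device_put(jnp.zeros((4, 4)), device)

# ...or make that device the default for everything created in a block.
with jax.default_device(device):
    y = jnp.ones((4, 4))
```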
-
Hey,
I am working with some code that uses MPI + JAX and I'm having trouble setting the visible devices for JAX. I know that you can do multi-GPU setups with JAX without MPI, but the codebase I'm using forces me to use MPI.
Setup:
Python 3.9.6
CUDA=11.4
NetKet==3.8
jax==0.4.9
jaxlib==0.4.7+cuda11.cudnn82
My MPI configuration is CUDA-enabled, although I don't think this is an MPI problem.
For this MWE, I am working with 2 GPUs on a single node. There is an old issue that discusses how to place devices manually (#2965).
test.py
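(The original script is not shown above; a minimal sketch consistent with the description would just report which devices each rank sees:)

```python
from mpi4py import MPI
import jax

# Each MPI process prints its rank and the JAX devices it can see.
rank = MPI.COMM_WORLD.Get_rank()
print(f"rank {rank}: {jax.devices()}")
```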
Running `mpirun -np 2 python test.py` then shows that both processes see both GPUs. While this is fine, I cannot set the visible devices manually to force process 0 to use GPU:0 by default and process 1 to use GPU:1.
Note that placing things manually works as expected (see the sketch below), but the larger codebase that I'm using expects the placement to be done automatically (so each process has to see only one device).
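A sketch of what manual placement can look like (hypothetical, along the lines of #2965):

```python
from mpi4py import MPI
import jax
import jax.numpy as jnp

# Explicitly pin this process's arrays to one of the two visible GPUs.
rank = MPI.COMM_WORLD.Get_rank()
x = jax.device_put(jnp.zeros((4, 4)), jax.devices()[rank])
```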
Could someone help me understand what is going on here?
Best,
Roeland