We have some code that looks like this:
dist.init_process_group(
    backend="nccl",
    rank=rank,
    world_size=world_size,
    init_method="tcp://127.0.0.1:29500",
)
All such calls should instead look like this, passing an explicit device_id:
dist.init_process_group(
    backend="nccl",
    rank=rank,
    world_size=world_size,
    init_method="tcp://127.0.0.1:29500",
    # Binding the process group to this process's GPU lets NCCL
    # initialize against the right device instead of guessing it.
    device_id=torch.device(f"cuda:{device_id}"),
)
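For context, here is a minimal sketch of how the call above might sit in a per-process setup helper. The setup_distributed/cleanup_distributed names and the use of the LOCAL_RANK environment variable (set by launchers such as torchrun) are assumptions for illustration, not part of the rule itself:

import os

import torch
import torch.distributed as dist


def setup_distributed(rank: int, world_size: int) -> None:
    # Hypothetical helper: derive the GPU index from LOCAL_RANK when a
    # launcher sets it; fall back to the global rank for single-node runs.
    device_id = int(os.environ.get("LOCAL_RANK", rank))
    torch.cuda.set_device(device_id)
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        init_method="tcp://127.0.0.1:29500",
        device_id=torch.device(f"cuda:{device_id}"),
    )


def cleanup_distributed() -> None:
    dist.destroy_process_group()

Deriving device_id from LOCAL_RANK rather than the global rank keeps multi-node launches correct, since on every node after the first the global rank no longer matches the local GPU index.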