Commit 72ed4b2
[Distributed] Fix broadcast_module_parameter for CPU-resident models
Use each rank's own GPU device for the NCCL broadcast instead of the module's execution device, which may be CPU or shared across ranks when the model is not GPU-resident.

Signed-off-by: Itay Etlis <itayetlis@gmail.com>
1 parent: 44d65e3

File tree

1 file changed (+3, −2 lines)

src/llmcompressor/utils/distributed.py

Lines changed: 3 additions & 2 deletions
```diff
@@ -6,7 +6,7 @@
 from typing import Dict, List, Tuple
 
 import torch
-from compressed_tensors.utils import get_execution_device, update_offload_parameter
+from compressed_tensors.utils import update_offload_parameter
 from loguru import logger
 from torch import distributed as dist
 from torch.nn import Module
@@ -147,7 +147,8 @@ def broadcast_module_parameter(
     if param is None:
         return
 
-    device = get_execution_device(module)
+    # NCCL requires each rank to use its own GPU
+    device = torch.device(f"cuda:{dist.get_rank()}")
     tensor = param.data.to(device)
     dist.broadcast(tensor, src=src_rank)
     update_offload_parameter(module, param_name, tensor)
```
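The fix can be sketched outside the diff context. The snippet below is a minimal, hypothetical illustration (the lookup via `Module.get_parameter` and the final copy-back stand in for the real helpers; only the device choice and the broadcast mirror the commit): with the NCCL backend, each rank must drive its own GPU, so the broadcast buffer is staged on `cuda:<rank>` even when the module itself lives on CPU.

```python
def nccl_device_for_rank(rank: int) -> str:
    """NCCL requires every rank to use its own GPU, so the broadcast
    buffer must live on cuda:<rank> regardless of where the module lives."""
    return f"cuda:{rank}"


def broadcast_module_parameter_sketch(module, param_name: str, src_rank: int = 0) -> None:
    # torch is imported lazily so the sketch stays importable without a GPU stack.
    import torch
    from torch import distributed as dist

    # Hypothetical parameter lookup; the real function's retrieval may differ.
    param = module.get_parameter(param_name)
    if param is None:
        return

    # The heart of the fix: target this rank's own GPU, not the module's
    # execution device (which may be CPU or shared across ranks).
    device = torch.device(nccl_device_for_rank(dist.get_rank()))
    tensor = param.data.to(device)          # stage on this rank's GPU
    dist.broadcast(tensor, src=src_rank)    # every rank receives src_rank's copy
    # Write back to wherever the parameter actually lives (possibly CPU);
    # this stands in for compressed_tensors' update_offload_parameter.
    param.data.copy_(tensor.to(param.data.device))
```

Before the fix, a CPU-resident module would hand NCCL a CPU tensor (or several ranks would hand it the same device), which NCCL rejects; staging on each rank's own `cuda:<rank>` satisfies NCCL's one-GPU-per-rank requirement.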
