
Commit 0c224e4

nzmora-nvidia and Gal Agam authored
Change the all-reduce strategy to NCCL (#99)
* Change the all-reduce strategy to NCCL

  When the strategy is set to AUTO and world_size > 1 we experience hangs and CUDA memory errors.
  * This is the same issue as https://nvbugspro.nvidia.com/bug/5331013
  * Without this change, test_ad_build_small_multi.py fails (tp==2).
  * This is a temporary change until we understand why this hang is happening.
  * On dllcuster this issue does not manifest.

  Signed-off-by: Neta Zmora <[email protected]>

* Re-enable test_ad_build_small_multi.py
  (tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py)

  Signed-off-by: Neta Zmora <[email protected]>

* Fix the KV-cache memory-size computation (convert free memory from MB to bytes)

  Signed-off-by: Gal Agam <[email protected]>

---------

Signed-off-by: Neta Zmora <[email protected]>
Signed-off-by: Gal Agam <[email protected]>
Co-authored-by: Gal Agam <[email protected]>
1 parent a9e227e · commit 0c224e4

3 files changed (+3, −5 lines)

tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py

Lines changed: 2 additions & 1 deletion
@@ -17,7 +17,8 @@ def trtllm_allreduce(tensor, op, all_reduce_params=None):
     rank, world_size = get_rank_world_size()
     assert op == ReduceOp.SUM, "TRT-LLM all reduce only supports SUM op."
     p_config = Mapping(world_size=world_size, tp_size=world_size, rank=rank)
-    torch_op = AllReduce(mapping=p_config, strategy=AllReduceStrategy.AUTO)
+    # Use Strategy.NCCL until https://nvbugspro.nvidia.com/bug/5331013 is fixed, then change to Strategy.AUTO
+    torch_op = AllReduce(mapping=p_config, strategy=AllReduceStrategy.NCCL)
     return torch_op(tensor, all_reduce_params=all_reduce_params)


 @torch.library.custom_op(
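
For context, this is how the function reads after the change. Only the lines in the hunk are guaranteed by the diff; the names Mapping, AllReduce, AllReduceStrategy, ReduceOp, and get_rank_world_size are provided elsewhere in the module, and their import paths are not shown here.

    # Sketch of the post-change function; surrounding imports/helpers are assumed, not part of the diff.
    def trtllm_allreduce(tensor, op, all_reduce_params=None):
        rank, world_size = get_rank_world_size()
        assert op == ReduceOp.SUM, "TRT-LLM all reduce only supports SUM op."

        # Build a tensor-parallel mapping covering the whole world.
        p_config = Mapping(world_size=world_size, tp_size=world_size, rank=rank)
        # Use Strategy.NCCL until https://nvbugspro.nvidia.com/bug/5331013 is fixed, then change to Strategy.AUTO
        torch_op = AllReduce(mapping=p_config, strategy=AllReduceStrategy.NCCL)
        return torch_op(tensor, all_reduce_params=all_reduce_params)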

tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py

Lines changed: 1 addition & 1 deletion
@@ -174,7 +174,7 @@ def _get_mem_info_in_mb():
     memory_for_forward_pass = free_mem_pre - free_mem_post
     ad_logger.info(f"Memory for forward pass (MB): {memory_for_forward_pass}")

-    new_cache_size = free_mem_post * free_mem_ratio + current_cache_size
+    new_cache_size = free_mem_post * 1024 * 1024 * free_mem_ratio + current_cache_size
     new_num_pages = int(new_cache_size // (current_cache_size // current_num_pages))

     # Need to sync all the GPUs
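
This one-line fix is a unit correction: _get_mem_info_in_mb() reports free memory in MB, while the cache size it is added to appears to be tracked in bytes, so the old formula under-sized the new cache by roughly a factor of a million. A worked example with made-up numbers (free_mem_ratio, cache size, and page count below are hypothetical, not taken from the commit):

    free_mem_post = 40_000               # free GPU memory after the forward pass, in MB
    free_mem_ratio = 0.8                 # hypothetical fraction of free memory given to the KV cache
    current_cache_size = 2 * 1024**3     # hypothetical existing cache: 2 GiB, in bytes
    current_num_pages = 1024             # hypothetical -> 2 MiB per page

    # Old (buggy): MB mixed with bytes, so the cache barely grows.
    old_cache_size = free_mem_post * free_mem_ratio + current_cache_size
    old_num_pages = int(old_cache_size // (current_cache_size // current_num_pages))   # -> 1024

    # Fixed: convert MB to bytes before adding.
    new_cache_size = free_mem_post * 1024 * 1024 * free_mem_ratio + current_cache_size
    new_num_pages = int(new_cache_size // (current_cache_size // current_num_pages))   # -> 17024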

tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py

Lines changed: 0 additions & 3 deletions
@@ -19,9 +19,6 @@
     ],
 )
 def test_build_ad(world_size: int, experiment_config: Dict):
-    if world_size > 1:
-        pytest.skip("https://nvbugspro.nvidia.com/bug/5331013")
-
     experiment_config["args"]["world_size"] = world_size
     experiment_config["args"]["runtime"] = "trtllm"  # Default runtime set to trtllm
     experiment_config = ExperimentConfig(**experiment_config)
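
With the unconditional skip removed, the world_size > 1 cases run again wherever enough GPUs are present. As an aside (not part of this commit), a device-count guard is a common way to keep such parametrized tests green on single-GPU machines; a minimal sketch:

    import pytest
    import torch

    def require_gpus(world_size: int) -> None:
        """Skip the calling test when fewer GPUs are available than requested."""
        available = torch.cuda.device_count()
        if available < world_size:
            pytest.skip(f"needs {world_size} GPUs, found {available}")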
