Commit 3057d15
[rocm7.0_internal_testing] skip test_all_gather_extensions_train_parity if world_size less than 2 (#2265)
The base class forces `world_size=2` even when only 1 GPU is available. NCCL then fails with:

```
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device c000
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device c000
```

This PR skips the FSDP tests if `world_size > torch.cuda.device_count()`:

```
HIP_VISIBLE_DEVICES=0 pytest -v distributed/_composable/fsdp/test_fully_shard_extensions.py::TestFullyShardAllGatherExtensionsMultiProcess::test_all_gather_extensions_train_parity
dist init r=0, world=2
dist init r=1, world=2
SKIPPED [15.5507s] (Need at least 2 CUDA devices)
```

Fixes SWDEV-535767
1 parent 13a8686 commit 3057d15

File tree: 1 file changed, +2 −0 lines changed

torch/testing/_internal/common_fsdp.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -1187,6 +1187,8 @@ def _run(cls, rank, test_name, file_name, pipe, **kwargs):
         fake_pg = kwargs.get("fake_pg", False)

         print(f"dist init r={self.rank}, world={self.world_size}")
+        if torch.cuda.device_count() < self.world_size:
+            sys.exit(TEST_SKIPS[f"multi-gpu-{self.world_size}"].exit_code)

         # Specify gloo backend to make 'init_process_group()' succeed,
         # Actual tests will be skipped if there is no enough GPUs.
```
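The diff above adds the check before process-group initialization: each spawned rank compares the required world size against the visible device count and exits with the skip exit code instead of letting NCCL hit "Duplicate GPU detected". A minimal standalone sketch of that decision (the `should_skip` helper name is hypothetical, introduced here for illustration; the real code calls `sys.exit` with `TEST_SKIPS[...].exit_code` directly):

```python
def should_skip(device_count: int, world_size: int) -> bool:
    """Return True when a multi-process GPU test should skip:
    there are fewer visible devices than ranks, so at least two
    ranks would land on the same GPU (which NCCL rejects)."""
    return device_count < world_size

# With HIP_VISIBLE_DEVICES=0 only one device is visible, so a
# world_size=2 test skips rather than crashing in NCCL init.
print(should_skip(1, 2))  # → True  (1 GPU, 2 ranks: skip)
print(should_skip(2, 2))  # → False (enough GPUs: run the test)
```

Exiting with the dedicated skip exit code (rather than failing) lets the parent test harness report the case as SKIPPED, as seen in the pytest output in the commit message.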
