Commit 3057d15
[rocm7.0_internal_testing] skip test_all_gather_extensions_train_parity if world_size less than 2 (#2265)
The base class forces `world_size=2` even when only one GPU is available, so both ranks land on the same device and NCCL fails with:
```
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device c000
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device c000
```
This PR skips the FSDP tests when `world_size > torch.cuda.device_count()`:
```
HIP_VISIBLE_DEVICES=0 pytest -v distributed/_composable/fsdp/test_fully_shard_extensions.py::TestFullyShardAllGatherExtensionsMultiProcess::test_all_gather_extensions_train_parity
dist init r=0, world=2
dist init r=1, world=2
SKIPPED [15.5507s] (Need at least 2 CUDA devices)
```
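The skip condition described above can be sketched roughly as follows. This is an illustration only, not the two lines the commit adds. The decorator name `skip_if_world_size_exceeds_devices` and its `device_count_fn` parameter are hypothetical; in a real PyTorch test the callable would presumably be `torch.cuda.device_count`. It is injected here so the sketch runs without a GPU.

```python
import functools
import unittest


def skip_if_world_size_exceeds_devices(world_size, device_count_fn):
    """Skip the decorated test when fewer devices than ranks are visible.

    device_count_fn is a zero-argument callable returning the number of
    visible devices (hypothetical parameter; assumed to stand in for
    torch.cuda.device_count in a real test).
    """
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            available = device_count_fn()
            if world_size > available:
                # unittest records this as a skip, not a failure.
                raise unittest.SkipTest(
                    f"Need at least {world_size} CUDA devices, found {available}"
                )
            return test_fn(*args, **kwargs)
        return wrapper
    return decorator


class ExampleTest(unittest.TestCase):
    # Simulates HIP_VISIBLE_DEVICES=0: only one device is visible.
    @skip_if_world_size_exceeds_devices(2, lambda: 1)
    def test_needs_two_devices(self):
        self.fail("should have been skipped")

    # Simulates a machine with two visible devices: the test body runs.
    @skip_if_world_size_exceeds_devices(2, lambda: 2)
    def test_has_two_devices(self):
        self.assertTrue(True)


def run_example():
    """Run ExampleTest and return the unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Checking the device count before the test body runs means the single-GPU case is reported as a skip instead of reaching NCCL initialization and failing with the duplicate-GPU error.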
Fixes SWDEV-5357671

Parent commit: 13a8686
1 file changed: 2 additions, 0 deletions. The two added lines are new lines 1190-1191, inserted after original line 1189.