Commit 6112357

TL/CUDA: guard team creation when device info is incomplete
ucc_tl_cuda_team_topo_create() relies on per-rank GPU device information (PCI IDs, NVLink matrices) that is populated only when every rank has at least one visible GPU. Without a check, the topology init code dereferenced uninitialised or invalid device info, causing silent failures or incorrect topology matrices. Add a ucc_topo_has_device_info() guard before the ucc_tl_cuda_team_topo_create() call so that TL/CUDA team creation returns UCC_ERR_NOT_SUPPORTED when device info is missing for any rank, letting UCC fall back to another TL.
1 parent a1e8344 commit 6112357

1 file changed: +8 -0 lines changed

src/components/tl/cuda/tl_cuda_team.c

Lines changed: 8 additions & 0 deletions
@@ -341,6 +341,14 @@ ucc_status_t ucc_tl_cuda_team_create_test(ucc_base_team_t *tl_team)
         team->scratch.rem[i] = NULL;
     }
 
+    if (!ucc_topo_has_device_info(UCC_TL_CORE_TEAM(team)->topo)) {
+        tl_debug(tl_team->context->lib,
+                 "not all ranks have visible GPU device info; "
+                 "skipping TL/CUDA team creation");
+        status = UCC_ERR_NOT_SUPPORTED;
+        goto exit_err;
+    }
+
     status = ucc_tl_cuda_team_topo_create(&team->super, &team->topo);
     if (status != UCC_OK) {
         goto exit_err;
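
The guard-then-fall-back pattern in the hunk above can be illustrated in isolation. The following is a minimal sketch, not UCC code: every `sketch_*` name is hypothetical, and the predicate's semantics (true only if every rank reports at least one visible GPU) are an assumption taken from the commit message rather than from the real `ucc_topo_has_device_info()` implementation, which this commit does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the real UCC definitions. */
typedef enum {
    SKETCH_OK                = 0,
    SKETCH_ERR_NOT_SUPPORTED = -1
} sketch_status_t;

typedef struct {
    int n_visible_gpus; /* 0 means the rank published no device info */
} sketch_rank_info_t;

typedef struct {
    const sketch_rank_info_t *ranks;
    size_t                    n_ranks;
} sketch_topo_t;

/* Assumed semantics: true only if every rank has at least one visible GPU. */
static bool sketch_topo_has_device_info(const sketch_topo_t *topo)
{
    assert(topo != NULL);
    for (size_t i = 0; i < topo->n_ranks; i++) {
        if (topo->ranks[i].n_visible_gpus < 1) {
            return false;
        }
    }
    return true;
}

/* Check the precondition first, then do the work that depends on it. */
static sketch_status_t sketch_cuda_team_create(const sketch_topo_t *topo)
{
    if (!sketch_topo_has_device_info(topo)) {
        return SKETCH_ERR_NOT_SUPPORTED; /* caller may try another layer */
    }
    /* ... topology matrices would be built here ... */
    return SKETCH_OK;
}

/* Caller-side fallback: use the GPU layer if it is supported, else a generic one. */
static const char *sketch_select_tl(const sketch_topo_t *topo)
{
    return (sketch_cuda_team_create(topo) == SKETCH_OK) ? "cuda" : "fallback";
}
```

Returning a distinct "not supported" status, rather than a generic error, is what lets the caller tell "this layer cannot serve this team" apart from a real failure and pick another layer.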
