Commit a1e8344
TL/CUDA: fix NVLS rank-0 error propagation via handle status field
When cuMulticastCreate fails on rank 0, the error was stored in the
per-process-local nvls->status_supported field only. Non-root ranks
checked their own copy (always UCC_OK after memset) and proceeded to
call cudaIpcOpenEventHandle on an uninitialised handle, relying on the
import failure as the error signal. That approach is fragile and emits
confusing CUDA error messages.
Fix: add a status field to ucc_tl_cuda_nvls_handle_t so rank 0 can
embed the error code in the allgathered handle. Non-root ranks read
share_data[0].status in STATE_IMPORT_HANDLE and bail out immediately
with a clear warning when rank 0 reported a failure.
1 parent 607a507 commit a1e8344
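The pattern described in the commit message can be sketched as below. This is a minimal illustration, not the actual UCC code: the type and function names (`sketch_status_t`, `sketch_nvls_handle_t`, `sketch_fill_root_handle`, `sketch_import_handle`) and the `pid`/`fd` fields are hypothetical stand-ins. Only the idea of a `status` field carried in the allgathered handle, checked via `share_data[0].status`, comes from the message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for UCC status codes (assumption, not UCC's values). */
typedef enum {
    SKETCH_OK                = 0,
    SKETCH_ERR_NOT_SUPPORTED = -1
} sketch_status_t;

/* Hypothetical handle layout. Only the status field reflects the commit;
 * pid and fd are assumed placeholders for whatever the real handle carries. */
typedef struct {
    int             pid;
    int             fd;
    sketch_status_t status; /* rank 0 embeds its creation result here */
} sketch_nvls_handle_t;

/* Rank 0 side: record the outcome of multicast object creation in the
 * handle itself, so it travels with the allgather. */
static void sketch_fill_root_handle(sketch_nvls_handle_t *h,
                                    sketch_status_t       create_status)
{
    memset(h, 0, sizeof(*h));
    h->status = create_status;
    if (create_status == SKETCH_OK) {
        h->pid = 1234; /* placeholder values for illustration only */
        h->fd  = 7;
    }
}

/* Non-root side: inspect rank 0's slot before attempting any import. */
static sketch_status_t
sketch_import_handle(const sketch_nvls_handle_t *share_data, int team_size)
{
    assert(team_size > 0);
    if (share_data[0].status != SKETCH_OK) {
        fprintf(stderr,
                "warning: rank 0 reported NVLS failure (%d), skipping import\n",
                (int)share_data[0].status);
        return share_data[0].status;
    }
    /* The real code would import the shared handle at this point. */
    return SKETCH_OK;
}
```

Because the status lives in the exchanged data rather than in a process-local field, every rank sees the same value and can bail out before touching an uninitialised handle.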
2 files changed, +20 -5 lines changed
File 1 (name not preserved): hunks at original lines 421-434 (4 lines removed, 6 added) and 467-474 (1 line removed, 10 added).
File 2 (name not preserved): hunk at original lines 22-27 (4 lines added).
[Diff line contents not preserved.]