Commit 4e47506
TL/CUDA: raise MAX_NVLS_PEERS to 576 and guard nvls_init
GB200/NVL systems support up to 72 GPUs per NVSwitch domain with up
to 8 NVLink partitions per domain, yielding a theoretical maximum of
576 participants in a single NVLS multicast group. Raise
UCC_TL_CUDA_MAX_NVLS_PEERS from 144 to 72*8=576 to accommodate these
configurations.
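To make the arithmetic concrete, here is a minimal C sketch of how a limit like this can be spelled as a product and checked at compile time. The factor names NVLS_GPUS_PER_DOMAIN and NVLS_PARTITIONS_PER_DOMAIN and the type nvls_peer_table_t are hypothetical, not taken from UCC. The commit itself may define the macro as a plain literal.

```c
#include <stdint.h>

/* Hypothetical factor names that make the arithmetic explicit; the commit
 * message only states the product 72 * 8 = 576. */
#define NVLS_GPUS_PER_DOMAIN        72
#define NVLS_PARTITIONS_PER_DOMAIN   8

/* Upper bound on participants in one NVLS multicast group. */
#define UCC_TL_CUDA_MAX_NVLS_PEERS \
    (NVLS_GPUS_PER_DOMAIN * NVLS_PARTITIONS_PER_DOMAIN)

/* Fails the build if the factors ever stop multiplying out to 576. */
_Static_assert(UCC_TL_CUDA_MAX_NVLS_PEERS == 576,
               "UCC_TL_CUDA_MAX_NVLS_PEERS must equal 72 * 8 = 576");

/* Hypothetical per-peer table sized at compile time: it has exactly
 * UCC_TL_CUDA_MAX_NVLS_PEERS slots, so a team with more ranks than
 * that would index past the end of the array. */
typedef struct {
    uint64_t peer_slot[UCC_TL_CUDA_MAX_NVLS_PEERS];
} nvls_peer_table_t;
```

Any storage sized this way grows with the constant. That is why raising it from 144 is needed before larger teams can use NVLS at all.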
Add a guard in the STATE_INIT stage of nvls_init that rejects teams
larger than this limit with UCC_ERR_NOT_SUPPORTED and a clear warning,
preventing out-of-bounds accesses in the NVLS allgather buffer.

1 parent: e60cc40
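The guard can be sketched as a size check that runs before any per-peer buffer is indexed by rank. This is a minimal C illustration under stated assumptions: nvls_check_team_size is a hypothetical helper, the ucc_status_t enum is a local stand-in with illustrative values so the sketch builds without UCC headers, and fprintf stands in for whatever warning macro the real code uses.

```c
#include <stdio.h>

#define UCC_TL_CUDA_MAX_NVLS_PEERS 576

/* Stand-ins for two ucc_status_t codes so the sketch compiles on its
 * own; the numeric values are illustrative only. */
typedef enum {
    UCC_OK                = 0,
    UCC_ERR_NOT_SUPPORTED = -1
} ucc_status_t;

/* Hypothetical helper mirroring the STATE_INIT check: reject the team
 * before any per-peer NVLS buffer is indexed by rank. */
ucc_status_t nvls_check_team_size(unsigned int team_size)
{
    if (team_size > UCC_TL_CUDA_MAX_NVLS_PEERS) {
        /* fprintf stands in for the TL's warning macro. */
        fprintf(stderr,
                "NVLS: team size %u exceeds the maximum of %d peers; "
                "NVLS is not supported for this team\n",
                team_size, UCC_TL_CUDA_MAX_NVLS_PEERS);
        return UCC_ERR_NOT_SUPPORTED;
    }
    return UCC_OK;
}
```

With this check, a 576-rank team is accepted and a 577-rank team returns UCC_ERR_NOT_SUPPORTED instead of writing past the end of a 576-slot buffer. Returning "not supported" rather than a hard error lets the caller fall back to another algorithm.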
2 files changed: 9 lines added, 1 line removed.

First file: line 35 replaced (1 line removed, 1 line added).
Second file: 8 lines added after original line 331 (new lines 332-339).