Commit 78a8d13
TL/CUDA: use nvmlDeviceGetGpuFabricInfoV for GB200+ NVL partition support
GB200 NVLink systems introduce NVL partitions (sub-fabric cliques):
multiple logical NVLink domains may share the same cliqueId but have
different partitionIds, and only GPUs within the same partition can
form a multicast group. The older nvmlDeviceGetGpuFabricInfo API
(v1) does not expose partitionId.
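
The eligibility rule described above can be sketched in a few lines of C. This is only an illustration: the `gpu_fabric_id_t` struct and `can_share_multicast()` helper are hypothetical names, not part of NVML or UCC, and the sketch assumes that both the clique and the partition must match for two GPUs to join one multicast group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one GPU's fabric identity.
 * The field names are illustrative and do not mirror any NVML struct. */
typedef struct {
    uint32_t clique_id;    /* NVLink clique the GPU belongs to */
    uint32_t partition_id; /* NVL partition inside that clique */
} gpu_fabric_id_t;

/* Two GPUs are treated as multicast-compatible only when they share
 * both the clique and the partition. Comparing cliqueId alone would
 * wrongly accept GPUs that sit in different partitions. */
static bool can_share_multicast(const gpu_fabric_id_t *a,
                                const gpu_fabric_id_t *b)
{
    assert(a != NULL && b != NULL);
    return a->clique_id == b->clique_id &&
           a->partition_id == b->partition_id;
}
```

With this helper, two GPUs reporting clique 7 but partitions 1 and 2 are rejected, which is the case a cliqueId-only comparison cannot distinguish.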
Changes:
- config/m4/cuda.m4: add AC_CHECK_DECL for nvmlDeviceGetGpuFabricInfoV
  to define HAVE_NVML_GPU_FABRIC_INFO_V when the versioned API is
  available (NVML r525+).
- utils/ucc_proc_info.h: add fabric_partition_id field to ucc_gpu_info_t.
- topo/cuda/ucc_sysinfo_cuda.c: use nvmlDeviceGetGpuFabricInfoV when
  HAVE_NVML_GPU_FABRIC_INFO_V is defined, populating partitionId;
  fall back to v1 (partitionId=0) when the new API is unavailable.
  Add debug-level logging so admins can diagnose fabric detection.
- topo/ucc_topo.c: ucc_topo_is_single_nvlink_domain() now also checks
  that all ranks share the same non-zero partitionId when the v2 API
  populated it; a partitionId of 0 skips the partition check (v1
  compat). Add per-rank debug messages for each failure case.
- tl/cuda/tl_cuda_nvls.c: expand the NVLS domain warning to mention
  partition mismatch and direct users to DEBUG logging for details.

1 parent 4e47506 commit 78a8d13
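
As a rough illustration of the single-domain check described in the change list, the sketch below compares every rank against rank 0. It is not the UCC implementation: `rank_gpu_info_t` and `is_single_nvlink_domain()` are hypothetical stand-ins, and the choice to skip the partition comparison whenever either side reports 0 is an assumption about how the v1 compatibility path might behave.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-rank record; names are illustrative only. */
typedef struct {
    uint32_t fabric_clique_id;
    uint32_t fabric_partition_id; /* 0 = not populated (v1 fallback) */
} rank_gpu_info_t;

/* Returns true when all ranks appear to share one NVLink domain.
 * Assumption: a partition id of 0 on either side means the versioned
 * API was unavailable, so only the clique id is compared for that pair. */
static bool is_single_nvlink_domain(const rank_gpu_info_t *ranks, size_t n)
{
    assert(ranks != NULL && n > 0);
    const rank_gpu_info_t *ref = &ranks[0];

    for (size_t i = 1; i < n; i++) {
        if (ranks[i].fabric_clique_id != ref->fabric_clique_id) {
            return false; /* different clique */
        }
        if (ref->fabric_partition_id == 0 ||
            ranks[i].fabric_partition_id == 0) {
            continue;     /* v1 compat: partition unknown, skip check */
        }
        if (ranks[i].fabric_partition_id != ref->fabric_partition_id) {
            return false; /* same clique, different partition */
        }
    }
    return true;
}
```

A real implementation would also log which rank failed and why, as the commit describes; that part is omitted here to keep the comparison logic visible.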
File tree
5 files changed, +83 -12 lines changed
- config/m4
- src
  - components
    - tl/cuda
    - topo
      - cuda
  - utils