
chore: update gb200 alltoall & sanity check#624

Open
wenpengw-nv wants to merge 5 commits into main from alltoall_sanity_check

Conversation

wenpengw-nv (Contributor) commented Mar 20, 2026

Overview:

Fix the AlltoAll SOL model for TRT-LLM to correctly account for per-rank deduplication in dispatch and combine operations, add NVLinkOneSided collected data for GB200 on trtllm 1.2.0rc6 (ep8/ep16/ep32), and add a sanity check notebook for AlltoAll data validation.

Details:

  1. Update GB200 AlltoAll collected data (trtllm_alltoall_perf.txt)
    Added NVLinkOneSided silicon measurements for GB200 on trtllm 1.2.0rc6, covering ep8, ep16, and ep32 configurations.
  2. Add AlltoAll sanity check (validate_database.ipynb)
    Added visualization and validation cells in the sanity check notebook to compare AlltoAll SOL predictions against silicon measurements.

Summary by CodeRabbit

  • New Features

    • Added an interactive visualization for inspecting TRT-LLM All2All performance data with multi-panel matplotlib charts.
  • Bug Fixes

    • Refined All2All communication bandwidth and byte-count estimation for more accurate reported times.
    • Adjusted SOL reporting to change how overhead is accounted for in results.
  • Chores

    • Updated performance database records and associated metadata.

Signed-off-by: wenpengw-nv <wenpengw@nvidia.com>
copy-pr-bot bot commented Mar 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the chore label Mar 20, 2026

github-actions bot commented Mar 20, 2026

Sanity Check Chart Generation Report

New perf data files were detected in this PR. Please use the link above to
download sanity check charts for the new perf data to compare the collected
perf data vs SOL (theoretical max performance).

Below is a report of whether the chart generation was successful for each op.
It does not validate whether the perf data itself is sane.

Chart Generation Report for system: gb200, backend: trtllm, backend_version: 1.2.0rc6

  • CLI smoke test ✅

Signed-off-by: wenpengw-nv <wenpengw@nvidia.com>
wenpengw-nv (Contributor, Author) commented:

Attached sanity check results:
[two sanity check chart images]

coderabbitai bot commented Mar 20, 2026

Walkthrough

Adjusts TRT‑LLM All2All SOL communication byte‑count model and removes p2p_latency_ms from SOL timing; updates Git LFS pointer for alltoall perf data; adds visualize_trtllm_alltoall(...) and minor formatting/plotting tweaks in the notebook.

Changes

  • All2All perf model (src/aiconfigurator/sdk/perf_database.py)
    Reworked the SOL-time All2All bandwidth/byte-count estimation: derives remote_ranks = min(topk, num_experts, moe_ep_size - 1); sets phase-specific formulas for alltoall_prepare, *combine*, and the remaining dispatch ops; removes the (1 - 1/moe_ep_size) factor; removes p2p_latency_ms from the returned SOL time (now returns (sol_comm, sol_comm, 0.0)).
  • AlltoAll perf data pointer (src/aiconfigurator/systems/data/gb200/trtllm/1.2.0rc6/trtllm_alltoall_perf.txt)
    Updated Git LFS pointer metadata (OID and size changed) for the TRT-LLM All2All performance binary blob.
  • Notebook: visualization & formatting (tools/sanity_check/validate_database.ipynb)
    Added visualize_trtllm_alltoall(database) to inspect database._trtllm_alltoall_data, derive ops/ep_sizes/quant_modes, and plot latency and SOL% across token counts; collapsed and reformatted several plotting/query expressions; updated the notebook kernel metadata and database instantiation to system="gb200", backend="trtllm", version="1.2.0rc6".
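The reworked byte-count model can be sketched as follows. This is an illustrative reconstruction from the change summary, not the actual perf_database.py code: the function name, phase handling, and bandwidth units are assumptions.

```python
def sol_alltoall_comm_ms(op_name: str, num_tokens: int, hidden_size: int,
                         topk: int, num_experts: int, moe_ep_size: int,
                         bw_gb_per_s: float) -> float:
    """Sketch of the SOL communication-time model after this PR."""
    # Per-rank deduplication: a token reaches at most one copy per remote
    # rank, capped by topk, the expert count, and the number of peers.
    remote_ranks = min(topk, num_experts, moe_ep_size - 1)
    if "combine" in op_name:
        # Low-precision combine packs NVFP4 (0.5 byte/elem); standard
        # combine moves BF16 (2 bytes/elem).
        bytes_per_element = 0.5 if "low_precision" in op_name else 2
    else:
        # Dispatch phases move BF16 activations in this sketch.
        bytes_per_element = 2
    data_bytes = num_tokens * remote_ranks * hidden_size * bytes_per_element
    sol_comm = data_bytes / (bw_gb_per_s * 1e9) * 1e3  # milliseconds
    # Note: p2p_latency_ms is no longer added to the returned SOL time.
    return sol_comm
```

For example, with topk=8, 256 experts, and ep16, remote_ranks is min(8, 256, 15) = 8, and the low-precision combine moves a quarter of the BF16 combine's bytes.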

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibble bytes and count each tiny thread,
I hop through ranks where remote_ranks are led,
I tuck p2p latency out of sight,
And plot SOL stars in the moonlit night.
Hooray for charts, for bytes, for tuned design!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check — Passed: The title 'chore: update gb200 alltoall & sanity check' is directly relevant and accurately summarizes the main changes: updating GB200 AlltoAll performance data and the sanity check notebook.
  • Description check — Passed: The PR description covers Overview and Details with specific numbered points about the changes made, but is missing the 'Where should the reviewer start?' and 'Related Issues' sections specified in the template.
  • Docstring Coverage — Passed: Docstring coverage is 100.00%, which exceeds the required threshold of 80.00%.


coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/aiconfigurator/sdk/perf_database.py`:
- Around line 5716-5728: remote_ranks is incorrectly computed as min(topk,
moe_ep_size) - 1 and ignores num_experts and whether any selected expert is
local; update the calculation to use num_experts and moe_ep_size and make local
selection explicit: compute remote_ranks = min(topk, num_experts, moe_ep_size)
and do NOT unconditionally subtract 1, and/or add an explicit parameter (e.g.,
local_selected or num_local_experts) to the enclosing helper so callers can
subtract local selections when appropriate; then use that remote_ranks value in
the data_bytes branches that reference remote_ranks (where op_name, num_tokens,
hidden_size are used).
- Around line 5720-5722: The combine branch treats all "*combine*" ops as BF16
(2 bytes/element); detect the low-precision path by checking op_name (e.g.,
op_name == "alltoall_combine_low_precision" or "alltoall_combine_low_precision"
in op_name) and set bytes_per_element = 0.5 for that case, otherwise keep 2 for
standard BF16, then compute data_bytes = num_tokens * remote_ranks * hidden_size
* bytes_per_element (update the assignment to data_bytes in the block containing
op_name and num_tokens/remote_ranks/hidden_size).

In `@tools/sanity_check/validate_database.ipynb`:
- Around line 1113-1123: The code unions token counts across all (node_num,
hidden_size, topk, num_experts) tuples into a single token_set while only using
the last tuple's (hs, tk, ne_val, nn_val) when querying kernel_data, which mixes
incompatible shapes; change the logic in the kernel_data traversal (the block
using qm, nn, h, t, ne, target_ep, token_set, hs, tk, ne_val, nn_val) to collect
token sets keyed by the full shape tuple (nn, h, t, ne) — e.g., build a mapping
shape_key -> token_set and record the corresponding (hs, tk, ne_val, nn_val) per
shape — then iterate over that mapping when building/plotting series so each
plotted line uses its own exact kernel_data lookup; alternatively, if you do not
want multiple shapes per (op_name, qm, target_ep), detect when more than one
distinct shape_key exists and raise an error (fail fast) rather than merging
them.
- Line 1174: The notebook call visualize_trtllm_alltoall(database) is using the
global `database` initialized earlier as h200_sxm + 1.2.0rc5, so update the
sanity-check invocation to point at the GB200 dataset version introduced in this
PR (GB200 / 1.2.0rc6) instead of reusing `database`; locate the visualization
call `visualize_trtllm_alltoall(database)` and replace its argument with the
GB200/1.2.0rc6 dataset reference (or construct a local variable for that
dataset) so a full "Run All" exercises the new NVLinkOneSided data.
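The shape-keyed grouping suggested for the notebook can be sketched as below. The nesting order of kernel_data (node_num → hidden_size → topk → num_experts → token counts) is an assumption about the notebook's data layout, and the helper name is hypothetical.

```python
from collections import defaultdict

def group_tokens_by_shape(kernel_data: dict) -> dict:
    """Collect token counts keyed by the full shape tuple instead of
    unioning them across incompatible shapes."""
    shape_map: dict[tuple, set] = defaultdict(set)
    for nn, by_h in kernel_data.items():          # node_num
        for h, by_t in by_h.items():              # hidden_size
            for t, by_ne in by_t.items():         # topk
                for ne, tokens in by_ne.items():  # num_experts
                    shape_map[(nn, h, t, ne)].update(tokens)
    return dict(shape_map)
```

Each plotted series then iterates shape_map.items(), so every lookup uses its own exact (node_num, hidden_size, topk, num_experts) tuple rather than the last tuple seen during traversal.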

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a37e89a5-dd6f-4bab-883e-4ceb26391451

📥 Commits

Reviewing files that changed from the base of the PR and between 0b9726d and bcd1f70.

📒 Files selected for processing (3)
  • src/aiconfigurator/sdk/perf_database.py
  • src/aiconfigurator/systems/data/gb200/trtllm/1.2.0rc6/trtllm_alltoall_perf.txt
  • tools/sanity_check/validate_database.ipynb

Arsene12358 (Contributor) commented:
@coderabbitai review the recent changes and see whether previous issues are resolved

coderabbitai bot commented Mar 20, 2026

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get previous review comments on PR `#624`
gh pr view 624 --repo ai-dynamo/aiconfigurator --json reviews,comments

# Get review comments on the PR
gh api repos/ai-dynamo/aiconfigurator/pulls/624/reviews
gh api repos/ai-dynamo/aiconfigurator/pulls/624/comments

Length of output: 59430


🏁 Script executed:

#!/bin/bash
# Check current state of perf_database.py around the changed lines
sed -n '5710,5745p' src/aiconfigurator/sdk/perf_database.py

echo "---"
# Check validate_database.ipynb for the visualization call and shape handling
python3 -c "
import json
with open('tools/sanity_check/validate_database.ipynb') as f:
    nb = json.load(f)
# Find the cells with visualize_trtllm_alltoall
for i, cell in enumerate(nb['cells']):
    src = ''.join(cell.get('source', []))
    if 'visualize_trtllm_alltoall' in src:
        print(f'=== Cell {i} ===')
        print(src[:3000])
        print()
"

Length of output: 4420


🏁 Script executed:

#!/bin/bash
python3 -c "
import json
with open('tools/sanity_check/validate_database.ipynb') as f:
    nb = json.load(f)
for i, cell in enumerate(nb['cells']):
    src = ''.join(cell.get('source', []))
    if 'visualize_trtllm_alltoall' in src:
        print(f'=== Cell {i} (full) ===')
        print(src)
        print()
"

Length of output: 7025


@Arsene12358 All four previously flagged issues have been resolved in the current code. Here's the status:

  1. perf_database.py — remote_ranks ignored num_experts (was min(topk, moe_ep_size) - 1). Fixed: now min(topk, num_experts, moe_ep_size - 1).
  2. perf_database.py — alltoall_combine_low_precision incorrectly used 2 bytes/element (BF16) instead of 0.5 (NVFP4). Fixed: now bytes_per_element = 0.5 if "low_precision" in op_name else 2.
  3. validate_database.ipynb — token counts were unioned into one token_set across incompatible (node_num, hidden_size, topk, num_experts) tuples, with only the last tuple's parameters used for queries. Fixed: now uses shape_map: dict[(nn, h, t, ne) -> token_set] so each plotted line queries its own exact shape parameters.
  4. validate_database.ipynb — visualize_trtllm_alltoall(database) reused the global H200/1.2.0rc5 database instead of exercising the new GB200 data. Fixed: now constructs alltoall_database = get_database(system="gb200", backend="trtllm", version="1.2.0rc6") and passes it explicitly.

The sanity check charts attached by the author also visually confirm the SOL model is producing sensible results against the GB200 NVLinkOneSided silicon measurements. The PR looks good to merge.

✅ Actions performed

Full review triggered.

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/sanity_check/validate_database.ipynb`:
- Around line 857-865: The subplot title and axis labels use values from the
wrong loop scope and mislabeled axes: update the set_title call in the MoE
plotting block (where ax[workload_distribution_id * 2, i].set_title is called)
to use the current loop's quant_mode-specific values (current topk and
num_experts variables) rather than the last iterated ones, and fix the xlabel
calls so the top row uses the correct x-axis variable name (use "m" when
plotting over m_list and "s" when plotting over s_list) instead of always
setting "s"; ensure any other axis label (e.g., ylabel for memory) remains
appropriate for the corresponding subplot.
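A minimal, self-contained sketch of the fix: set each subplot's title and labels inside the loop that owns the per-mode values, so no subplot inherits the last iteration's topk/num_experts. The configs dict is hypothetical, not the notebook's actual data.

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering for non-interactive runs
import matplotlib.pyplot as plt

# Hypothetical MoE configs: (topk, num_experts) differ per quant mode.
configs = {"fp8": (8, 256), "nvfp4": (4, 128)}

fig, axes = plt.subplots(1, len(configs), figsize=(8, 3), squeeze=False)
for i, (quant_mode, (topk, num_experts)) in enumerate(configs.items()):
    ax = axes[0, i]
    # Title uses this iteration's values, not leftovers from a prior loop.
    ax.set_title(f"{quant_mode}: topk={topk} e={num_experts}")
    ax.set_xlabel("num_tokens")  # name the actual x variable, not "s"
    ax.set_ylabel("math sol %")
fig.tight_layout()
```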

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 509e2f19-a023-4d9b-9084-f8023e0cf793

📥 Commits

Reviewing files that changed from the base of the PR and between 0b9726d and 38f5cec.

📒 Files selected for processing (3)
  • src/aiconfigurator/sdk/perf_database.py
  • src/aiconfigurator/systems/data/gb200/trtllm/1.2.0rc6/trtllm_alltoall_perf.txt
  • tools/sanity_check/validate_database.ipynb

Comment on lines +857 to +865 of tools/sanity_check/validate_database.ipynb:

            ax[workload_distribution_id * 2, i].set_title(
                f"{workload_distribution_title} \ntopk={topk} e={num_experts} tp={tp} ep={ep}"
            )
            ax[workload_distribution_id * 2, i].set_xlabel("s")
            ax[workload_distribution_id * 2, i].set_ylabel("math sol %")
            # ax[0,i].set_ylim(0,1)
            ax[workload_distribution_id * 2, i].legend()
            ax[workload_distribution_id * 2 + 1, i].set_xlabel("s")
            ax[workload_distribution_id * 2 + 1, i].set_ylabel("mem sol %")
            ax[workload_distribution_id * 2, i].legend()
            ax[workload_distribution_id * 2 + 1, i].set_xlabel("s")
            ax[workload_distribution_id * 2 + 1, i].set_ylabel("mem sol %")
coderabbitai bot commented Mar 20, 2026

⚠️ Potential issue | 🟡 Minor

Fix the MoE subplot metadata.

The title on Line 857 uses topk / num_experts from the last quant_mode iterated, so it can misdescribe mixed-series subplots, and Lines 860 and 864 label m_list as s.

💡 Suggested fix
-            ax[workload_distribution_id * 2, i].set_title(
-                f"{workload_distribution_title} \ntopk={topk} e={num_experts} tp={tp} ep={ep}"
-            )
-            ax[workload_distribution_id * 2, i].set_xlabel("s")
+            ax[workload_distribution_id * 2, i].set_title(
+                f"{workload_distribution_title}\ntp={tp} ep={ep}"
+            )
+            ax[workload_distribution_id * 2, i].set_xlabel("num_tokens")
             ax[workload_distribution_id * 2, i].set_ylabel("math sol %")
             # ax[0,i].set_ylim(0,1)
             ax[workload_distribution_id * 2, i].legend()
-            ax[workload_distribution_id * 2 + 1, i].set_xlabel("s")
+            ax[workload_distribution_id * 2 + 1, i].set_xlabel("num_tokens")
             ax[workload_distribution_id * 2 + 1, i].set_ylabel("mem sol %")

A contributor commented:

@wenpengw-nv could you take a look at this

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
