chore: update gb200 alltoall & sanity check (#624)
Conversation
Signed-off-by: wenpengw-nv <wenpengw@nvidia.com>
Sanity Check Chart Generation Report
New perf data files were detected in this PR. Below is a report of whether chart generation was successful for each op.
Chart Generation Report for system: gb200, backend: trtllm, backend_version: 1.2.0rc6
Walkthrough: Adjusts the TRT‑LLM All2All SOL communication byte-count model and removes p2p_latency_ms from SOL timing; updates the Git LFS pointer for the alltoall perf data; adds GB200 NVLinkOneSided perf data and sanity-check notebook cells.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/aiconfigurator/sdk/perf_database.py`:
- Around line 5716-5728: remote_ranks is incorrectly computed as min(topk,
moe_ep_size) - 1 and ignores num_experts and whether any selected expert is
local; update the calculation to use num_experts and moe_ep_size and make local
selection explicit: compute remote_ranks = min(topk, num_experts, moe_ep_size)
and do NOT unconditionally subtract 1, and/or add an explicit parameter (e.g.,
local_selected or num_local_experts) to the enclosing helper so callers can
subtract local selections when appropriate; then use that remote_ranks value in
the data_bytes branches that reference remote_ranks (where op_name, num_tokens,
hidden_size are used).
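The suggested fix can be sketched as follows. This is a minimal illustration of the corrected rank-count logic, not the actual helper in `perf_database.py`; the function name `estimate_remote_ranks` and the `local_selected` flag are hypothetical:

```python
def estimate_remote_ranks(topk: int, num_experts: int, moe_ep_size: int,
                          local_selected: bool = False) -> int:
    """Upper bound on distinct remote EP ranks reached by a token's experts.

    A token can touch at most min(topk, num_experts, moe_ep_size) distinct
    ranks; one is subtracted only when a selected expert is known to be
    local, rather than unconditionally as in min(topk, moe_ep_size) - 1.
    """
    remote_ranks = min(topk, num_experts, moe_ep_size)
    if local_selected:
        remote_ranks -= 1
    return max(remote_ranks, 0)
```

For example, with topk=8 but only 4 experts spread over 16 EP ranks, at most 4 remote ranks can be reached, which the old `min(topk, moe_ep_size) - 1` formula would miss.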
- Around line 5720-5722: The combine branch treats all "*combine*" ops as BF16
(2 bytes/element); detect the low-precision path by checking op_name (e.g.,
op_name == "alltoall_combine_low_precision" or "alltoall_combine_low_precision"
in op_name) and set bytes_per_element = 0.5 for that case, otherwise keep 2 for
standard BF16, then compute data_bytes = num_tokens * remote_ranks * hidden_size
* bytes_per_element (update the assignment to data_bytes in the block containing
op_name and num_tokens/remote_ranks/hidden_size).
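A sketch of the suggested precision-aware byte count (the 0.5 bytes/element figure comes from the comment above; `combine_data_bytes` is an illustrative name, not the real function):

```python
def combine_data_bytes(op_name: str, num_tokens: int, remote_ranks: int,
                       hidden_size: int) -> float:
    """Bytes moved by an alltoall combine op under the SOL model above.

    The low-precision combine path uses 0.5 bytes/element; the standard
    path stays BF16 at 2 bytes/element.
    """
    if "alltoall_combine_low_precision" in op_name:
        bytes_per_element = 0.5
    else:
        bytes_per_element = 2
    return num_tokens * remote_ranks * hidden_size * bytes_per_element
```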
In `@tools/sanity_check/validate_database.ipynb`:
- Around line 1113-1123: The code unions token counts across all (node_num,
hidden_size, topk, num_experts) tuples into a single token_set while only using
the last tuple's (hs, tk, ne_val, nn_val) when querying kernel_data, which mixes
incompatible shapes; change the logic in the kernel_data traversal (the block
using qm, nn, h, t, ne, target_ep, token_set, hs, tk, ne_val, nn_val) to collect
token sets keyed by the full shape tuple (nn, h, t, ne) — e.g., build a mapping
shape_key -> token_set and record the corresponding (hs, tk, ne_val, nn_val) per
shape — then iterate over that mapping when building/plotting series so each
plotted line uses its own exact kernel_data lookup; alternatively, if you do not
want multiple shapes per (op_name, qm, target_ep), detect when more than one
distinct shape_key exists and raise an error (fail fast) rather than merging
them.
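The shape-keyed collection could look like the sketch below. The nesting order `kernel_data[nn][h][t][ne][num_tokens]` is an assumption about the notebook's dict layout, not the verified structure:

```python
from collections import defaultdict

def collect_token_sets(kernel_data):
    """Group token counts by the full (nn, h, t, ne) shape tuple.

    Instead of unioning all token counts into one set, key each token
    set by its exact shape so every plotted series maps back to a valid
    kernel_data lookup (callers can also fail fast if more than one
    shape appears where only one is expected).
    """
    shape_tokens = defaultdict(set)
    for nn, by_h in kernel_data.items():             # node_num
        for h, by_t in by_h.items():                 # hidden_size
            for t, by_ne in by_t.items():            # topk
                for ne, by_tokens in by_ne.items():  # num_experts
                    shape_tokens[(nn, h, t, ne)].update(by_tokens)
    return dict(shape_tokens)
```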
- Line 1174: The notebook call visualize_trtllm_alltoall(database) is using the
global `database` initialized earlier as h200_sxm + 1.2.0rc5, so update the
sanity-check invocation to point at the GB200 dataset version introduced in this
PR (GB200 / 1.2.0rc6) instead of reusing `database`; locate the visualization
call `visualize_trtllm_alltoall(database)` and replace its argument with the
GB200/1.2.0rc6 dataset reference (or construct a local variable for that
dataset) so a full "Run All" exercises the new NVLinkOneSided data.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a37e89a5-dd6f-4bab-883e-4ceb26391451
📒 Files selected for processing (3)
- src/aiconfigurator/sdk/perf_database.py
- src/aiconfigurator/systems/data/gb200/trtllm/1.2.0rc6/trtllm_alltoall_perf.txt
- tools/sanity_check/validate_database.ipynb
Signed-off-by: wenpengw-nv <wenpengw@nvidia.com>
@coderabbitai review the recent changes and see whether previous issues are resolved
The sanity check charts attached by the author also visually confirm that the SOL model produces sensible results against the GB200 NVLinkOneSided silicon measurements. The PR looks good to merge.
✅ Actions performed: Full review triggered.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tools/sanity_check/validate_database.ipynb`:
- Around line 857-865: The subplot title and axis labels use values from the
wrong loop scope and mislabeled axes: update the set_title call in the MoE
plotting block (where ax[workload_distribution_id * 2, i].set_title is called)
to use the current loop's quant_mode-specific values (current topk and
num_experts variables) rather than the last iterated ones, and fix the xlabel
calls so the top row uses the correct x-axis variable name (use "m" when
plotting over m_list and "s" when plotting over s_list) instead of always
setting "s"; ensure any other axis label (e.g., ylabel for memory) remains
appropriate for the corresponding subplot.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 509e2f19-a023-4d9b-9084-f8023e0cf793
📒 Files selected for processing (3)
- src/aiconfigurator/sdk/perf_database.py
- src/aiconfigurator/systems/data/gb200/trtllm/1.2.0rc6/trtllm_alltoall_perf.txt
- tools/sanity_check/validate_database.ipynb
| " ax[workload_distribution_id * 2, i].set_title(\n", | ||
| " f\"{workload_distribution_title} \\ntopk={topk} e={num_experts} tp={tp} ep={ep}\"\n", | ||
| " )\n", | ||
| " ax[workload_distribution_id * 2, i].set_xlabel(\"s\")\n", | ||
| " ax[workload_distribution_id * 2, i].set_ylabel(\"math sol %\")\n", | ||
| " # ax[0,i].set_ylim(0,1)\n", | ||
| " ax[workload_distribution_id*2, i].legend()\n", | ||
| " ax[workload_distribution_id*2+1, i].set_xlabel(\"s\")\n", | ||
| " ax[workload_distribution_id*2+1, i].set_ylabel(\"mem sol %\")\n", | ||
| " ax[workload_distribution_id * 2, i].legend()\n", | ||
| " ax[workload_distribution_id * 2 + 1, i].set_xlabel(\"s\")\n", | ||
| " ax[workload_distribution_id * 2 + 1, i].set_ylabel(\"mem sol %\")\n", |
Fix the MoE subplot metadata.
The title on Line 857 uses topk / num_experts from the last quant_mode iterated, so it can misdescribe mixed-series subplots, and Lines 860 and 864 label m_list as s.
💡 Suggested fix
- ax[workload_distribution_id * 2, i].set_title(
- f"{workload_distribution_title} \ntopk={topk} e={num_experts} tp={tp} ep={ep}"
- )
- ax[workload_distribution_id * 2, i].set_xlabel("s")
+ ax[workload_distribution_id * 2, i].set_title(
+ f"{workload_distribution_title}\ntp={tp} ep={ep}"
+ )
+ ax[workload_distribution_id * 2, i].set_xlabel("num_tokens")
ax[workload_distribution_id * 2, i].set_ylabel("math sol %")
# ax[0,i].set_ylim(0,1)
ax[workload_distribution_id * 2, i].legend()
- ax[workload_distribution_id * 2 + 1, i].set_xlabel("s")
+ ax[workload_distribution_id * 2 + 1, i].set_xlabel("num_tokens")
  ax[workload_distribution_id * 2 + 1, i].set_ylabel("mem sol %")
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!


Overview:
Fix the AlltoAll SOL model for TRT-LLM to correctly account for per-rank deduplication in dispatch and combine operations, add NVLinkOneSided collected data for GB200 on trtllm 1.2.0rc6 (ep8/ep16/ep32), and add a sanity check notebook for AlltoAll data validation.
Details:
Added NVLinkOneSided silicon measurements for GB200 on trtllm 1.2.0rc6, covering ep8, ep16, and ep32 configurations.
Added visualization and validation cells in the sanity check notebook to compare AlltoAll SOL predictions against silicon measurements.