feat: Extend sanity check tests by ilyasher · Pull Request #605 · ai-dynamo/aiconfigurator

ilyasher · 2026-03-17T18:05:31Z

Overview:

1. Add tests that check the validate_database.ipynb sanity check notebook works with every system+backend combination.
2. Fix some issues in validate_database.ipynb so that it works with every system+backend combo:
   a. Add squeeze=False to fix a matplotlib plot IndexError when an axis has size 1
   b. Small fixes for the MoE sanity check
   c. Wrap MLA in try/except, since it's not needed for many models
3. For SGLang, if custom_allreduce_perf.txt is not available, use TRTLLM's custom_allreduce_perf.txt instead. (We are already using TRTLLM allreduce data for SGLang; this just means we don't need to copy the file.)
4. If a file called INCOMPLETE is in the database dir for a certain backend version, that version will be ignored.
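
The INCOMPLETE marker rule can be illustrated with a minimal sketch. The helper name `usable_versions` and the assumption that each backend directory holds one subdirectory per version are hypothetical and are not taken from the PR's code.

```python
from pathlib import Path


def usable_versions(backend_dir):
    """Return version directory names that do not contain an INCOMPLETE marker.

    Hypothetical helper: assumes one subdirectory per backend version.
    """
    versions = []
    for version_dir in sorted(Path(backend_dir).iterdir()):
        if not version_dir.is_dir():
            continue  # ignore stray files next to the version directories
        if (version_dir / "INCOMPLETE").exists():
            continue  # marker file present: skip this version entirely
        versions.append(version_dir.name)
    return versions
```

A marker file keeps the opt-out next to the data it describes, so no separate allow-list has to be maintained.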

Summary by CodeRabbit

New Features
- Added fallback mechanism for custom_allreduce performance data when using SGLang backend, automatically loading alternative data when needed.
Improvements
- Enhanced database validation with comprehensive error handling and environment-driven configuration support.
- Improved data visualization with robust safeguards to gracefully handle unavailable performance data.

copy-pr-bot · 2026-03-17T18:05:35Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-03-17T18:17:57Z

Walkthrough

Three files updated to enhance performance data fallback handling and validation robustness: a fallback mechanism in perf_database.py loads TensorRT-LLM data for SGLang when unavailable, test_sanity_check.py switches from fixture-based to subprocess-parametrized validation, and validate_database.ipynb adds environment-driven configuration with comprehensive error handling.
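
As a rough sketch of the environment-driven handoff between the test and the notebook: the variable names below come from the change summary, while the helper names, the `Agg` value for MPLBACKEND, and the `None` defaults are assumptions made for illustration.

```python
import os


def build_validation_env(system, backend, version, base_env=None):
    """Return a copy of the environment with the validation target set (test side)."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "AIC_VALIDATE_SYSTEM": system,
            "AIC_VALIDATE_BACKEND": backend,
            "AIC_VALIDATE_VERSION": version,
            "MPLBACKEND": "Agg",  # assumed value: a non-interactive matplotlib backend
        }
    )
    return env


def read_validation_target(env=None):
    """Read (system, backend, version) from the environment (notebook side)."""
    env = os.environ if env is None else env
    return (
        env.get("AIC_VALIDATE_SYSTEM"),
        env.get("AIC_VALIDATE_BACKEND"),
        env.get("AIC_VALIDATE_VERSION"),
    )
```

Passing the target through environment variables lets the same notebook run unchanged both interactively and from a parametrized test.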

Changes

Cohort / File(s)	Summary
Performance Data Fallback `src/aiconfigurator/sdk/perf_database.py`	Implements SGLang fallback mechanism: when SGLang backend lacks custom_allreduce data, code locates latest TensorRT-LLM database version, loads its custom_allreduce data file, and wraps it in LoadedOpData with debug logging.
Test Refactoring `tests/e2e/tools/test_sanity_check.py`	Replaces fixture-based CWD handling with parametrized subprocess approach; enumerates supported system/backend/version combinations, executes notebook validation in separate process with environment variables (AIC_VALIDATE_SYSTEM, AIC_VALIDATE_BACKEND, AIC_VALIDATE_VERSION, MPLBACKEND), and validates exit codes.
Validation Notebook `tools/sanity_check/validate_database.ipynb`	Adds environment-driven database lookup (system, backend, version parameters), imports PerfDataNotAvailableError, wraps database operations in try/except blocks, adds squeeze=False to subplots for consistent axes shape, includes pre-visualization guards via raise_if_not_loaded() calls, and validates data availability before plotting.
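
The effect of squeeze=False can be seen in a small standalone example. It is not the notebook's code; the grid size is chosen only to show the case where one axis has size 1.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Default behaviour: a 2x1 grid is squeezed to a 1-D array,
# so indexing it as ax[row, col] raises IndexError.
fig_squeezed, ax_squeezed = plt.subplots(2, 1)
print(ax_squeezed.shape)  # (2,)

# With squeeze=False the result is always 2-D, whatever the grid size.
fig_grid, ax_grid = plt.subplots(2, 1, squeeze=False)
print(ax_grid.shape)  # (2, 1)
ax_grid[0, 0].plot([1, 2, 3], [1, 4, 9])

plt.close("all")
```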

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A fallback hops where data should be,
SGLang whispers "TensorRT, help me!"
Tests now bound for subprocess lands,
With env vars held in gentle hands,
Notebooks guard their plots with care,
Errors caught in try/except snare.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main objective of the PR: extending sanity check tests to cover all system/backend combinations.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check	✅ Passed	The pull request description is well-structured and covers all main objectives with clear technical details about fixes and features added.


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

tests/e2e/tools/test_sanity_check.py (1)

48-54: Bound the notebook subprocess with a timeout.

A notebook regression can hang this child process forever and block the whole e2e shard until the outer CI timeout. Adding a local timeout makes failures deterministic and much easier to diagnose.

Suggested fix

     result = sp.run(
         [sys.executable, "-c", "import import_ipynb; import validate_database"],
         cwd=SANITY_CHECK_DIR,
         env=env,
         capture_output=True,
         text=True,
+        timeout=600,
     )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/tools/test_sanity_check.py` around lines 48 - 54, The subprocess
call that runs the notebook using sp.run (assigning to result) needs a bounded
timeout to avoid hangs; add a timeout argument (e.g., timeout=60 or an
appropriate constant) to the sp.run invocation when calling [sys.executable,
"-c", "import import_ipynb; import validate_database"] with cwd=SANITY_CHECK_DIR
and env=env, and wrap the call in a try/except for subprocess.TimeoutExpired to
fail the test with a clear error message and capture/print any partial
stdout/stderr for diagnostics.
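
A minimal sketch of the timeout handling described above, using only the standard library. The helper name `run_child` and its return convention are hypothetical; in a pytest test the timeout branch would more likely call `pytest.fail` with the captured output.

```python
import subprocess as sp
import sys


def run_child(code: str, timeout_s: float) -> tuple[str, str]:
    """Run `code` in a fresh interpreter; return (status, captured output)."""
    try:
        result = sp.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except sp.TimeoutExpired as exc:
        # Partial output may be None, bytes, or str depending on platform/version.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return "timeout", partial
    status = "ok" if result.returncode == 0 else "failed"
    return status, result.stdout + result.stderr
```

`subprocess.run` kills the child and waits for it before raising `TimeoutExpired`, so a hung notebook does not leave a stray process behind.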

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/aiconfigurator/sdk/perf_database.py`:
- Around line 2052-2068: The code builds trtllm_custom_allreduce_path under the
current systems_root even though get_latest_database_version(self.system,
"trtllm") may have discovered the version in a different configured root; update
the logic to use the actual database root where that TRTLLM version was found
(rather than always using systems_root). Concretely, modify
get_latest_database_version or add a helper (e.g.,
get_latest_database_root_and_version or find_database_root_for_version) so you
can obtain both the root and version, then construct
trtllm_custom_allreduce_path using that discovered root together with
self.system_spec["data_dir"], "trtllm", the returned version and
PerfDataFilename.custom_allreduce; then call load_custom_allreduce_data and wrap
into LoadedOpData as before so query_custom_allreduce() can succeed when TRTLLM
lives in a different configured systems path.

In `@tools/sanity_check/validate_database.ipynb`:
- Around line 857-888: The try/except around database.query_moe in the
validate_database notebook swallows all exceptions and treats failures as
successes; change it to only catch the specific PerfDataNotAvailableError (or
whatever sentinel the database uses) and re-raise any other Exception, or better
yet pre-filter invalid (moe_tp_size=tp, moe_ep_size=ep, quant_mode) combinations
before calling database.query_moe to avoid calling it for unsupported configs;
update the exception handler around the database.query_moe call to catch
PerfDataNotAvailableError and continue, but let other exceptions propagate (or
explicitly raise/log and fail the notebook) so real regressions are not masked.
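
For the `perf_database.py` comment, one possible shape for a helper that returns both the root and the version is sketched below. The function name, the `root/data_dir/backend/version` layout, and the naive digit-based version ordering are all assumptions; the real `get_latest_database_version` may work differently.

```python
import re
from pathlib import Path


def _version_key(name: str) -> tuple[int, ...]:
    """Naive ordering: compare the integer runs in the directory name."""
    return tuple(int(part) for part in re.findall(r"\d+", name))


def find_latest_root_and_version(roots, data_dir, backend):
    """Return (root, version) for the newest version found under any root, or None."""
    best = None
    for root in roots:
        backend_dir = Path(root) / data_dir / backend
        if not backend_dir.is_dir():
            continue  # this root has no data for the backend
        for version_dir in backend_dir.iterdir():
            if not version_dir.is_dir():
                continue
            key = _version_key(version_dir.name)
            if best is None or key > best[0]:
                best = (key, Path(root), version_dir.name)
    return None if best is None else (best[1], best[2])
```

Returning the root together with the version means the caller builds the file path from the location where the version was actually found. Pre-release suffixes such as `rc` tags would need a proper version parser rather than this digit-based key.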


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 15356501-4712-4e9d-8403-0948a0befb68

📥 Commits

Reviewing files that changed from the base of the PR and between 46a3f23 and 7bf61d9.

📒 Files selected for processing (3)

src/aiconfigurator/sdk/perf_database.py
tests/e2e/tools/test_sanity_check.py
tools/sanity_check/validate_database.ipynb

src/aiconfigurator/sdk/perf_database.py

coderabbitai · 2026-03-17T18:18:01Z

tools/sanity_check/validate_database.ipynb

+    "                    try:\n",
+    "                        db_time = database.query_moe(\n",
+    "                            num_tokens=m,\n",
+    "                            hidden_size=hidden_size,\n",
+    "                            inter_size=inter_size,\n",
+    "                            topk=topk,\n",
+    "                            num_experts=num_experts,\n",
+    "                            moe_tp_size=tp,\n",
+    "                            moe_ep_size=ep,\n",
+    "                            quant_mode=quant_mode,\n",
+    "                            workload_distribution=workload_distribution,\n",
+    "                            database_mode=DatabaseMode.SILICON,\n",
+    "                        )\n",
+    "                    except Exception as e:\n",
+    "                        print(f\"Error querying moe: {e}\")\n",
+    "                        break\n",
    "                    percentage_of_math = sol_math / db_time\n",
    "                    percentage_of_mem = sol_mem / db_time\n",
    "                    sol_math_list.append(percentage_of_math)\n",
    "                    sol_mem_list.append(percentage_of_mem)\n",
    "\n",
-    "                ax[workload_distribution_id*2, i].plot(\n",
-    "                    m_list, sol_math_list, color=color_list[color_id], label=f\"{quant_mode} math\"\n",
-    "                )\n",
-    "                ax[workload_distribution_id*2+1, i].plot(\n",
-    "                    m_list,\n",
-    "                    sol_mem_list,\n",
-    "                    color=color_list[color_id],\n",
-    "                    linestyle=\"--\",\n",
-    "                    label=f\"{quant_mode} mem\",\n",
-    "                )\n",
+    "                if len(m_list) == len(sol_math_list) and len(m_list) == len(sol_mem_list):\n",
+    "                    ax[workload_distribution_id*2, i].plot(\n",
+    "                        m_list, sol_math_list, color=color_list[color_id], label=f\"{quant_mode} math\"\n",
+    "                    )\n",
+    "                    ax[workload_distribution_id*2+1, i].plot(\n",
+    "                        m_list,\n",
+    "                        sol_mem_list,\n",
+    "                        color=color_list[color_id],\n",
+    "                        linestyle=\"--\",\n",
+    "                        label=f\"{quant_mode} mem\",\n",
+    "                    )\n",


⚠️ Potential issue | 🟠 Major

Don’t turn arbitrary query_moe() failures into a passing sanity check.

This notebook is now part of the e2e gate, but the bare except Exception only prints and skips plotting. That masks real regressions in query_moe() and still lets the notebook pass. If the intent is to skip unsupported cases, catch PerfDataNotAvailableError specifically or pre-filter invalid (tp, ep, quant_mode) combinations before the query.

Suggested fix

-    except Exception as e:
-        print(f"Error querying moe: {e}")
+    except PerfDataNotAvailableError as e:
+        print(f"Skipping unsupported MoE case: {e}")
         break

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

    "                    try:\n",
    "                        db_time = database.query_moe(\n",
    "                            num_tokens=m,\n",
    "                            hidden_size=hidden_size,\n",
    "                            inter_size=inter_size,\n",
    "                            topk=topk,\n",
    "                            num_experts=num_experts,\n",
    "                            moe_tp_size=tp,\n",
    "                            moe_ep_size=ep,\n",
    "                            quant_mode=quant_mode,\n",
    "                            workload_distribution=workload_distribution,\n",
    "                            database_mode=DatabaseMode.SILICON,\n",
    "                        )\n",
    "                    except PerfDataNotAvailableError as e:\n",
    "                        print(f\"Skipping unsupported MoE case: {e}\")\n",
    "                        break\n",
    "                    percentage_of_math = sol_math / db_time\n",
    "                    percentage_of_mem = sol_mem / db_time\n",
    "                    sol_math_list.append(percentage_of_math)\n",
    "                    sol_mem_list.append(percentage_of_mem)\n",
    "\n",
    "                if len(m_list) == len(sol_math_list) and len(m_list) == len(sol_mem_list):\n",
    "                    ax[workload_distribution_id*2, i].plot(\n",
    "                        m_list, sol_math_list, color=color_list[color_id], label=f\"{quant_mode} math\"\n",
    "                    )\n",
    "                    ax[workload_distribution_id*2+1, i].plot(\n",
    "                        m_list,\n",
    "                        sol_mem_list,\n",
    "                        color=color_list[color_id],\n",
    "                        linestyle=\"--\",\n",
    "                        label=f\"{quant_mode} mem\",\n",
    "                    )\n",

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tools/sanity_check/validate_database.ipynb` around lines 857 - 888, The try/except around database.query_moe in the validate_database notebook swallows all exceptions and treats failures as successes; change it to only catch the specific PerfDataNotAvailableError (or whatever sentinel the database uses) and re-raise any other Exception, or better yet pre-filter invalid (moe_tp_size=tp, moe_ep_size=ep, quant_mode) combinations before calling database.query_moe to avoid calling it for unsupported configs; update the exception handler around the database.query_moe call to catch PerfDataNotAvailableError and continue, but let other exceptions propagate (or explicitly raise/log and fail the notebook) so real regressions are not masked.

ilyasher · 2026-03-17

That's true, but I can't fix query_moe right now. I think it's better to still generate the rest of the charts even if some combinations fail for MoE.

coderabbitai · 2026-03-17

@ilyasher, understood. That's a reasonable trade-off for a sanity-check notebook where partial output is better than no output.

One lightweight improvement that doesn't require fixing query_moe: you could log the exception type alongside the message so unexpected errors (e.g. AttributeError, KeyError) are still clearly visible in CI output and distinguishable from expected data-availability gaps:

    except Exception as e:
        print(f"Skipping MoE case (tp={tp}, ep={ep}, quant_mode={quant_mode}): [{type(e).__name__}] {e}")
        break

That way, if a real regression surfaces, it won't be silently swallowed: the error class will appear in the logs and someone can act on it. The existing `# TODO: fix query_moe for all combos and remove try/except` comment already signals that this is temporary, which is good.
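
The trade-off discussed in this thread can be sketched as a small standalone loop. This is not the notebook's code: `PerfDataNotAvailableError` is redefined here as a local stand-in, and `collect_points` is a hypothetical helper. Unlike the notebook, which uses `break` plus a length check before plotting, the sketch stores x and y as pairs so the plotted lists cannot get out of step.

```python
class PerfDataNotAvailableError(Exception):
    """Local stand-in for illustration; the real class lives in aiconfigurator."""


def collect_points(query, inputs):
    """Query each input, keep successful (x, y) pairs, and record every skip."""
    points = []
    skipped = []
    for x in inputs:
        try:
            y = query(x)
        except Exception as exc:
            # Log the exception class so unexpected errors stand out in CI output.
            expected = isinstance(exc, PerfDataNotAvailableError)
            print(f"Skipping x={x}: [{type(exc).__name__}] {exc}")
            skipped.append((x, type(exc).__name__, expected))
            continue
        points.append((x, y))
    return points, skipped
```

A test that wanted to stay strict could assert that every entry in `skipped` has `expected` set, while the notebook itself still renders the remaining charts.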

( ˶ˆᗜˆ˵ )

Signed-off-by: Ilya Sherstyuk <isherstyuk@nvidia.com>

github-actions bot added the feat label Mar 17, 2026

ilyasher marked this pull request as ready for review March 17, 2026 18:06

ilyasher requested review from a team, AichenF, Arsene12358, Harrilee, YijiaZhao, jasonqinzhou, simone-chen, tianhaox and xutizhou as code owners March 17, 2026 18:06

coderabbitai bot reviewed Mar 17, 2026

ilyasher force-pushed the dev-isherstyuk-extend-sanity-check-tests branch from 7bf61d9 to e4654a8 Compare March 17, 2026 20:13

ai-dynamo deleted a comment from github-actions bot Mar 17, 2026

ilyasher added 2 commits March 17, 2026 16:16

extend sanity check tests

ffff648

Signed-off-by: Ilya Sherstyuk <isherstyuk@nvidia.com>

remove sglang fallback

4cd27d3

Signed-off-by: Ilya Sherstyuk <isherstyuk@nvidia.com>

ilyasher force-pushed the dev-isherstyuk-extend-sanity-check-tests branch from c413194 to 4cd27d3 Compare March 17, 2026 23:17

ai-dynamo deleted a comment from github-actions bot Mar 18, 2026

Arsene12358 approved these changes Mar 19, 2026

tianhaox approved these changes Mar 19, 2026

ilyasher wants to merge 2 commits into main from dev-isherstyuk-extend-sanity-check-tests
