feat: Extend sanity check tests #605

Open
ilyasher wants to merge 2 commits into main from dev-isherstyuk-extend-sanity-check-tests

Conversation

@ilyasher ilyasher commented Mar 17, 2026

Overview:

  1. Add tests that check the validate_database.ipynb sanity check notebook works with every system+backend combination
  2. Fix some issues in validate_database.ipynb so that it works with every system+backend combo
    a. Add squeeze=False to the subplots calls to fix the matplotlib IndexError raised when an axis dimension has size 1
    b. Small fixes for the MoE sanity check
    c. Wrap MLA in try/except since it's not needed for many models
  3. For SGLang, if custom_allreduce_perf.txt is not available, fall back to TRTLLM's custom_allreduce_perf.txt. (We are already using TRTLLM allreduce data for SGLang; this just means we no longer need to copy the file.)
  4. If a file called INCOMPLETE is present in the database dir for a given backend version, that version is ignored.
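
Item 4 can be pictured with a short sketch. Only the `INCOMPLETE` file name comes from the description above; the directory layout (`<backend>/<version>/`), the helper name `list_complete_versions`, and the version strings are assumptions made for illustration, not the project's actual code.

```python
import tempfile
from pathlib import Path


def list_complete_versions(backend_dir: Path) -> list[str]:
    """Return version directory names that do not hold an INCOMPLETE marker.

    Hypothetical helper: assumes one sub-directory per backend version.
    """
    versions = []
    for child in sorted(backend_dir.iterdir()):
        if not child.is_dir():
            continue  # ignore stray files next to the version directories
        if (child / "INCOMPLETE").exists():
            continue  # marker present: treat this version as not ready
        versions.append(child.name)
    return versions


# Build a throwaway layout with one complete and one incomplete version.
with tempfile.TemporaryDirectory() as tmp:
    backend = Path(tmp) / "sglang"
    (backend / "0.5.0").mkdir(parents=True)
    (backend / "0.5.1").mkdir()
    (backend / "0.5.1" / "INCOMPLETE").touch()
    complete_versions = list_complete_versions(backend)

print(complete_versions)
```

An empty marker file is enough here; the sketch never reads its contents.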

Summary by CodeRabbit

  • New Features

    • Added fallback mechanism for custom_allreduce performance data when using SGLang backend, automatically loading alternative data when needed.
  • Improvements

    • Enhanced database validation with comprehensive error handling and environment-driven configuration support.
    • Improved data visualization with robust safeguards to gracefully handle unavailable performance data.

copy-pr-bot bot commented Mar 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the feat label Mar 17, 2026
@ilyasher ilyasher marked this pull request as ready for review March 17, 2026 18:06
coderabbitai bot commented Mar 17, 2026

Walkthrough

Three files updated to enhance performance data fallback handling and validation robustness: a fallback mechanism in perf_database.py loads TensorRT-LLM data for SGLang when unavailable, test_sanity_check.py switches from fixture-based to subprocess-parametrized validation, and validate_database.ipynb adds environment-driven configuration with comprehensive error handling.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Performance Data Fallback `src/aiconfigurator/sdk/perf_database.py` | Implements SGLang fallback mechanism: when SGLang backend lacks custom_allreduce data, code locates latest TensorRT-LLM database version, loads its custom_allreduce data file, and wraps it in LoadedOpData with debug logging. |
| Test Refactoring `tests/e2e/tools/test_sanity_check.py` | Replaces fixture-based CWD handling with parametrized subprocess approach; enumerates supported system/backend/version combinations, executes notebook validation in separate process with environment variables (AIC_VALIDATE_SYSTEM, AIC_VALIDATE_BACKEND, AIC_VALIDATE_VERSION, MPLBACKEND), and validates exit codes. |
| Validation Notebook `tools/sanity_check/validate_database.ipynb` | Adds environment-driven database lookup (system, backend, version parameters), imports PerfDataNotAvailableError, wraps database operations in try/except blocks, adds squeeze=False to subplots for consistent axes shape, includes pre-visualization guards via raise_if_not_loaded() calls, and validates data availability before plotting. |
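
The environment-variable handoff between the test and the notebook might look roughly like the sketch below. The four variable names come from the walkthrough; the helpers `build_env` and `read_config`, the `Agg` value for `MPLBACKEND`, and the example system, backend, and version strings are assumptions for illustration only.

```python
import os
from typing import Mapping, Optional, Tuple


def build_env(system: str, backend: str, version: str) -> dict:
    """Test side: copy the current environment and add the selection."""
    env = dict(os.environ)
    env.update(
        {
            "AIC_VALIDATE_SYSTEM": system,
            "AIC_VALIDATE_BACKEND": backend,
            "AIC_VALIDATE_VERSION": version,
            # Assumed value: Agg is matplotlib's non-interactive backend.
            "MPLBACKEND": "Agg",
        }
    )
    return env


def read_config(
    env: Mapping[str, str],
) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Notebook side: read the selection, yielding None for unset values."""
    return (
        env.get("AIC_VALIDATE_SYSTEM"),
        env.get("AIC_VALIDATE_BACKEND"),
        env.get("AIC_VALIDATE_VERSION"),
    )


child_env = build_env("h100_sxm", "sglang", "0.5.0")  # hypothetical values
print(read_config(child_env))
```

Returning `None` for unset variables leaves room for the notebook to fall back to its own defaults when it is run interactively.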

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A fallback hops where data should be,
SGLang whispers "TensorRT, help me!"
Tests now bound for subprocess lands,
With env vars held in gentle hands,
Notebooks guard their plots with care,
Errors caught in try/except snare.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main objective of the PR: extending sanity check tests to cover all system/backend combinations. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Description check | ✅ Passed | The pull request description is well-structured and covers all main objectives with clear technical details about fixes and features added. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/e2e/tools/test_sanity_check.py (1)

48-54: Bound the notebook subprocess with a timeout.

A notebook regression can hang this child process forever and block the whole e2e shard until the outer CI timeout. Adding a local timeout makes failures deterministic and much easier to diagnose.

Suggested fix
     result = sp.run(
         [sys.executable, "-c", "import import_ipynb; import validate_database"],
         cwd=SANITY_CHECK_DIR,
         env=env,
         capture_output=True,
         text=True,
+        timeout=600,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/tools/test_sanity_check.py` around lines 48 - 54, The subprocess
call that runs the notebook using sp.run (assigning to result) needs a bounded
timeout to avoid hangs; add a timeout argument (e.g., timeout=60 or an
appropriate constant) to the sp.run invocation when calling [sys.executable,
"-c", "import import_ipynb; import validate_database"] with cwd=SANITY_CHECK_DIR
and env=env, and wrap the call in a try/except for subprocess.TimeoutExpired to
fail the test with a clear error message and capture/print any partial
stdout/stderr for diagnostics.
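
A minimal sketch of the bounded subprocess call described above, using only the standard library. The helper name `run_bounded` and the timeout values are hypothetical; in the real test the timeout branch would presumably fail the test (for example via `pytest.fail`) rather than return a status string.

```python
import subprocess as sp
import sys
from typing import Tuple


def run_bounded(code: str, timeout_s: float) -> Tuple[str, str]:
    """Run `python -c code` and return (status, captured stdout)."""
    try:
        result = sp.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except sp.TimeoutExpired as exc:
        # Partial output may be None or bytes, even when text=True was set.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return "timeout", partial
    status = "ok" if result.returncode == 0 else "failed"
    return status, result.stdout


status_fast, out_fast = run_bounded("print('done')", timeout_s=30)
status_slow, _ = run_bounded("import time; time.sleep(30)", timeout_s=1)
print(status_fast, status_slow)
```

`subprocess.run` kills the child before raising `TimeoutExpired`, so a hung notebook does not linger after the test reports the failure.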
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/aiconfigurator/sdk/perf_database.py`:
- Around line 2052-2068: The code builds trtllm_custom_allreduce_path under the
current systems_root even though get_latest_database_version(self.system,
"trtllm") may have discovered the version in a different configured root; update
the logic to use the actual database root where that TRTLLM version was found
(rather than always using systems_root). Concretely, modify
get_latest_database_version or add a helper (e.g.,
get_latest_database_root_and_version or find_database_root_for_version) so you
can obtain both the root and version, then construct
trtllm_custom_allreduce_path using that discovered root together with
self.system_spec["data_dir"], "trtllm", the returned version and
PerfDataFilename.custom_allreduce; then call load_custom_allreduce_data and wrap
into LoadedOpData as before so query_custom_allreduce() can succeed when TRTLLM
lives in a different configured systems path.

In `@tools/sanity_check/validate_database.ipynb`:
- Around line 857-888: The try/except around database.query_moe in the
validate_database notebook swallows all exceptions and treats failures as
successes; change it to only catch the specific PerfDataNotAvailableError (or
whatever sentinel the database uses) and re-raise any other Exception, or better
yet pre-filter invalid (moe_tp_size=tp, moe_ep_size=ep, quant_mode) combinations
before calling database.query_moe to avoid calling it for unsupported configs;
update the exception handler around the database.query_moe call to catch
PerfDataNotAvailableError and continue, but let other exceptions propagate (or
explicitly raise/log and fail the notebook) so real regressions are not masked.
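
The first inline comment asks for the database root to be returned together with the version. One way to sketch that idea is shown below; the helper names `version_key` and `find_latest_root_and_version`, the `<root>/<data_dir>/<backend>/<version>/` layout, and the dotted numeric version format are all assumptions, not the behaviour of the real `get_latest_database_version`.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional, Tuple


def version_key(name: str) -> Tuple[int, ...]:
    """Sort key for dotted numeric versions such as '1.0.0' (assumed format)."""
    return tuple(int(part) for part in name.split(".") if part.isdigit())


def find_latest_root_and_version(
    roots: Iterable[Path], data_dir: str, backend: str
) -> Optional[Tuple[Path, str]]:
    """Search every configured root and return (root, latest version)."""
    best: Optional[Tuple[Path, str]] = None
    for root in roots:
        backend_dir = Path(root) / data_dir / backend
        if not backend_dir.is_dir():
            continue  # this root has no data for the backend
        for version_dir in backend_dir.iterdir():
            if not version_dir.is_dir():
                continue
            if best is None or version_key(version_dir.name) > version_key(best[1]):
                best = (Path(root), version_dir.name)
    return best


with tempfile.TemporaryDirectory() as tmp:
    root_a, root_b = Path(tmp) / "a", Path(tmp) / "b"
    (root_a / "data" / "h100" / "sglang" / "0.5.0").mkdir(parents=True)
    (root_b / "data" / "h100" / "trtllm" / "0.20.0").mkdir(parents=True)
    (root_b / "data" / "h100" / "trtllm" / "1.0.0").mkdir()
    found = find_latest_root_and_version([root_a, root_b], "data/h100", "trtllm")

found_root_name, found_version = found[0].name, found[1]
print(found_root_name, found_version)
```

With both values in hand, the fallback path can be built under the root where the TRTLLM data actually lives instead of the root currently being loaded.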

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 15356501-4712-4e9d-8403-0948a0befb68

📥 Commits

Reviewing files that changed from the base of the PR and between 46a3f23 and 7bf61d9.

📒 Files selected for processing (3)
  • src/aiconfigurator/sdk/perf_database.py
  • tests/e2e/tools/test_sanity_check.py
  • tools/sanity_check/validate_database.ipynb

Comment on lines +857 to +888
" try:\n",
" db_time = database.query_moe(\n",
" num_tokens=m,\n",
" hidden_size=hidden_size,\n",
" inter_size=inter_size,\n",
" topk=topk,\n",
" num_experts=num_experts,\n",
" moe_tp_size=tp,\n",
" moe_ep_size=ep,\n",
" quant_mode=quant_mode,\n",
" workload_distribution=workload_distribution,\n",
" database_mode=DatabaseMode.SILICON,\n",
" )\n",
" except Exception as e:\n",
" print(f\"Error querying moe: {e}\")\n",
" break\n",
" percentage_of_math = sol_math / db_time\n",
" percentage_of_mem = sol_mem / db_time\n",
" sol_math_list.append(percentage_of_math)\n",
" sol_mem_list.append(percentage_of_mem)\n",
"\n",
" ax[workload_distribution_id*2, i].plot(\n",
" m_list, sol_math_list, color=color_list[color_id], label=f\"{quant_mode} math\"\n",
" )\n",
" ax[workload_distribution_id*2+1, i].plot(\n",
" m_list,\n",
" sol_mem_list,\n",
" color=color_list[color_id],\n",
" linestyle=\"--\",\n",
" label=f\"{quant_mode} mem\",\n",
" )\n",
" if len(m_list) == len(sol_math_list) and len(m_list) == len(sol_mem_list):\n",
" ax[workload_distribution_id*2, i].plot(\n",
" m_list, sol_math_list, color=color_list[color_id], label=f\"{quant_mode} math\"\n",
" )\n",
" ax[workload_distribution_id*2+1, i].plot(\n",
" m_list,\n",
" sol_mem_list,\n",
" color=color_list[color_id],\n",
" linestyle=\"--\",\n",
" label=f\"{quant_mode} mem\",\n",
" )\n",
@coderabbitai coderabbitai bot Mar 17, 2026

⚠️ Potential issue | 🟠 Major

Don’t turn arbitrary query_moe() failures into a passing sanity check.

This notebook is now part of the e2e gate, but the bare except Exception only prints and skips plotting. That masks real regressions in query_moe() and still lets the notebook pass. If the intent is to skip unsupported cases, catch PerfDataNotAvailableError specifically or pre-filter invalid (tp, ep, quant_mode) combinations before the query.

Suggested fix
-                    except Exception as e:
-                        print(f"Error querying moe: {e}")
+                    except PerfDataNotAvailableError as e:
+                        print(f"Skipping unsupported MoE case: {e}")
                         break
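
The effect of narrowing the handler can be seen in a small self-contained sketch. `PerfDataNotAvailableError` is defined locally here as a stand-in for the project's exception of the same name, and `fake_query` and `collect` are hypothetical helpers. The sketch uses `continue` over a flat list of cases, whereas the notebook uses `break` inside its inner loop.

```python
class PerfDataNotAvailableError(Exception):
    """Local stand-in for the project's exception of the same name."""


def fake_query(tp: int) -> float:
    """Hypothetical query: tp=4 has no data, tp=8 hits an unrelated bug."""
    if tp == 4:
        raise PerfDataNotAvailableError("no data for tp=4")
    if tp == 8:
        raise KeyError("unexpected bug")
    return 1.0 / tp


def collect(tp_values):
    """Skip cases without data, but let every other exception propagate."""
    times = []
    for tp in tp_values:
        try:
            times.append(fake_query(tp))
        except PerfDataNotAvailableError as e:
            print(f"Skipping unsupported case: {e}")
            continue
    return times


supported = collect([1, 2, 4])  # the tp=4 gap is skipped quietly

try:
    collect([1, 8])
    bug_surfaced = False
except KeyError:
    bug_surfaced = True  # the real bug is not masked by the handler

print(supported, bug_surfaced)
```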
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
" try:\n",
" db_time = database.query_moe(\n",
" num_tokens=m,\n",
" hidden_size=hidden_size,\n",
" inter_size=inter_size,\n",
" topk=topk,\n",
" num_experts=num_experts,\n",
" moe_tp_size=tp,\n",
" moe_ep_size=ep,\n",
" quant_mode=quant_mode,\n",
" workload_distribution=workload_distribution,\n",
" database_mode=DatabaseMode.SILICON,\n",
" )\n",
" except PerfDataNotAvailableError as e:\n",
" print(f\"Skipping unsupported MoE case: {e}\")\n",
" break\n",
" percentage_of_math = sol_math / db_time\n",
" percentage_of_mem = sol_mem / db_time\n",
" sol_math_list.append(percentage_of_math)\n",
" sol_mem_list.append(percentage_of_mem)\n",
"\n",
" if len(m_list) == len(sol_math_list) and len(m_list) == len(sol_mem_list):\n",
" ax[workload_distribution_id*2, i].plot(\n",
" m_list, sol_math_list, color=color_list[color_id], label=f\"{quant_mode} math\"\n",
" )\n",
" ax[workload_distribution_id*2+1, i].plot(\n",
" m_list,\n",
" sol_mem_list,\n",
" color=color_list[color_id],\n",
" linestyle=\"--\",\n",
" label=f\"{quant_mode} mem\",\n",
" )\n",

@ilyasher (Contributor, Author) replied:

That's true but I can't fix query_moe right now. I think it's better to still generate the rest of the charts even if some combinations fail for moe.

@coderabbitai coderabbitai bot replied:

@ilyasher, understood — that's a reasonable trade-off for a sanity-check notebook where partial output is better than no output.

One lightweight improvement that doesn't require fixing query_moe: you could log the exception type alongside the message so unexpected errors (e.g. AttributeError, KeyError) are still clearly visible in CI output and distinguishable from expected data-availability gaps:

except Exception as e:
    print(f"Skipping MoE case (tp={tp}, ep={ep}, quant_mode={quant_mode}): [{type(e).__name__}] {e}")
    break

That way, if a real regression surfaces, it won't be silently swallowed — the error class will appear in the logs and someone can act on it. The existing # TODO: fix query_moe for all combos and remove try/except comment already signals that this is temporary, which is good.
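
A sketch of how that logging could be combined with a per-type tally, so a CI log ends with a compact summary while the remaining charts are still produced. The tally goes beyond the suggestion above and is an assumption, as are the stand-in `fake_query_moe` and the example combinations.

```python
from collections import Counter


def fake_query_moe(tp: int, ep: int) -> float:
    """Hypothetical stand-in for database.query_moe with two failure modes."""
    if tp * ep > 8:
        raise ValueError(f"unsupported parallelism tp={tp}, ep={ep}")
    if ep == 3:
        raise KeyError("ep=3")
    return 0.1 * tp * ep


combos = [(1, 1), (2, 4), (4, 4), (1, 3)]
results = {}
skipped = Counter()

for tp, ep in combos:
    try:
        results[(tp, ep)] = fake_query_moe(tp, ep)
    except Exception as e:  # broad on purpose: partial output is wanted
        skipped[type(e).__name__] += 1
        print(f"Skipping MoE case (tp={tp}, ep={ep}): [{type(e).__name__}] {e}")
        continue

# One summary line per exception type makes unexpected classes easy to spot.
for name, count in sorted(skipped.items()):
    print(f"{name}: {count} case(s) skipped")
```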

( ˶ˆᗜˆ˵ )

@ilyasher ilyasher force-pushed the dev-isherstyuk-extend-sanity-check-tests branch from 7bf61d9 to e4654a8 Compare March 17, 2026 20:13
@ai-dynamo ai-dynamo deleted a comment from github-actions bot Mar 17, 2026
Signed-off-by: Ilya Sherstyuk <isherstyuk@nvidia.com>
Signed-off-by: Ilya Sherstyuk <isherstyuk@nvidia.com>
@ilyasher ilyasher force-pushed the dev-isherstyuk-extend-sanity-check-tests branch from c413194 to 4cd27d3 Compare March 17, 2026 23:17
@ai-dynamo ai-dynamo deleted a comment from github-actions bot Mar 18, 2026
3 participants