chore: Add data from auto-collector pipeline 46210829 (h100_sxm_sglang_0.5.9)#597

Open
dynamo-ops wants to merge 1 commit into main from auto-data-collection-46210829-h100_sxm_sglang_0.5.9

Conversation

@dynamo-ops
Contributor

@dynamo-ops dynamo-ops commented Mar 16, 2026

Error Summary for Auto-Collector Run

Collection summary for h100_sxm sglang:0.5.9

Error summary

{
    "backend": "sglang",
    "version": "0.5.9",
    "timestamp": "2026-03-16T06:36:57.049936",
    "total_errors": 487,
    "errors_by_module": {
        "sglang.gemm": 486,
        "sglang.wideep_moe": 1
    },
    "errors_by_type": {
        "RuntimeError": 484,
        "OutOfMemoryError": 3
    }
}
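A run summary in this shape can be sanity-checked programmatically before the collected artifacts are committed. A minimal sketch in Python (the field names follow the JSON above; the 50% dominance threshold is an illustrative assumption, not part of the actual pipeline):

```python
import json

# Error summary as reported by the auto-collector run (fields as above).
summary_json = """
{
    "backend": "sglang",
    "version": "0.5.9",
    "total_errors": 487,
    "errors_by_module": {
        "sglang.gemm": 486,
        "sglang.wideep_moe": 1
    },
    "errors_by_type": {
        "RuntimeError": 484,
        "OutOfMemoryError": 3
    }
}
"""

summary = json.loads(summary_json)

# Sanity check: per-module counts should add up to total_errors.
module_total = sum(summary["errors_by_module"].values())
assert module_total == summary["total_errors"]

# Flag modules whose error count dominates the run (threshold is an assumption).
dominant = {m: n for m, n in summary["errors_by_module"].items()
            if n / summary["total_errors"] > 0.5}
print(dominant)  # {'sglang.gemm': 486}
```

For this run the check would flag `sglang.gemm`, which accounts for 486 of the 487 reported errors.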

Summary by CodeRabbit

  • Chores
    • Added seven new performance benchmark data files for H100 SXM with sglang v0.5.9. Datasets cover context attention mechanisms, matrix multiplication operations, generation-phase attention processing, MLA-based operations, and mixture-of-experts configurations. These files enable comprehensive AI system performance evaluation, detailed technical analysis, and ongoing optimization tracking for improved system efficiency and resource utilization.

Signed-off-by: dynamo-ops <170655669+dynamo-ops@users.noreply.github.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the chore label Mar 16, 2026
@coderabbitai

coderabbitai bot commented Mar 16, 2026

Walkthrough

Seven new Git LFS pointer files are added to track large binary performance metric assets in the H100 SXM SGLang 0.5.9 directory. Each pointer contains standard Git LFS metadata (version, oid, size) without introducing code or logic changes.
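Each pointer file follows the Git LFS pointer format: a handful of space-separated key/value lines (version, oid, size). A minimal parser sketch in Python (the oid and size values are taken from the gemm_perf.txt pointer quoted later in this review):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:5ef3d903bda0116a2dbb99c3394e7a5763301e6197dcfd6f5d3af33c55f64517
size 9096207
"""

fields = parse_lfs_pointer(pointer)
assert fields["version"] == "https://git-lfs.github.com/spec/v1"
assert fields["oid"].startswith("sha256:")
assert int(fields["size"]) == 9096207  # size of the real binary, in bytes
```

The pointer is all that lives in the Git history; the 9 MB binary itself is fetched from LFS storage on checkout.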

Changes

Cohort: Git LFS Performance Data Pointers
File(s): src/aiconfigurator/systems/data/h100_sxm/sglang/0.5.9/context_attention_perf.txt, context_mla_perf.txt, gemm_perf.txt, generation_attention_perf.txt, generation_mla_perf.txt, mla_bmm_perf.txt, moe_perf.txt
Summary: Added seven Git LFS pointer files for tracking large binary performance metric assets. Each file contains standard LFS metadata (version, oid, size) with no code or logic modifications.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 Seven scrolls of data we now shall keep,
Performance metrics, stashed so deep,
Git LFS pointers, neat and small,
Tracking benchmarks for H100's call! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Description check — ⚠️ Warning. The description lacks the required template sections (Overview, Details, Where should reviewer start, Related Issues) and primarily contains error summary data instead of explaining the changes being merged. Resolution: restructure the description to follow the template: add an Overview section, explain what data is being added and why, specify which files to review, and note any related issues.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed. The title accurately describes the main change: adding performance data files from an automated collection pipeline run for H100 SXM with sglang 0.5.9.
  • Docstring Coverage — ✅ Passed. No functions were found in the changed files, so the docstring coverage check was skipped.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can customize the high-level summary generated by CodeRabbit.

Configure the reviews.high_level_summary_instructions setting to provide custom instructions for generating the high-level summary.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/aiconfigurator/systems/data/h100_sxm/sglang/0.5.9/gemm_perf.txt`:
- Around line 1-3: Block ingestion of the gemm_perf.txt artifact until collector
errors are resolved: stop persisting
src/aiconfigurator/systems/data/h100_sxm/sglang/0.5.9/gemm_perf.txt when the run
metadata reports errors (currently 486 GEMM module errors out of 487); require
either a successful rerun with zero collector/GEMM errors or attach a validation
report that proves coverage and data integrity for gemm_perf.txt before allowing
ingestion; update the ingestion gating logic (the collector/GEMM validation
step) to check the run metadata error count and reject artifacts with non-zero
GEMM/collector errors, and surface a clear error message referencing the
offending gemm_perf.txt artifact when rejecting.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 730bf8a2-96d7-4519-9807-a0b6c976df22

📥 Commits

Reviewing files that changed from the base of the PR and between ae86e39 and 2d9299b.

📒 Files selected for processing (7)
  • src/aiconfigurator/systems/data/h100_sxm/sglang/0.5.9/context_attention_perf.txt
  • src/aiconfigurator/systems/data/h100_sxm/sglang/0.5.9/context_mla_perf.txt
  • src/aiconfigurator/systems/data/h100_sxm/sglang/0.5.9/gemm_perf.txt
  • src/aiconfigurator/systems/data/h100_sxm/sglang/0.5.9/generation_attention_perf.txt
  • src/aiconfigurator/systems/data/h100_sxm/sglang/0.5.9/generation_mla_perf.txt
  • src/aiconfigurator/systems/data/h100_sxm/sglang/0.5.9/mla_bmm_perf.txt
  • src/aiconfigurator/systems/data/h100_sxm/sglang/0.5.9/moe_perf.txt

Comment on lines +1 to +3
version https://git-lfs.github.com/spec/v1
oid sha256:5ef3d903bda0116a2dbb99c3394e7a5763301e6197dcfd6f5d3af33c55f64517
size 9096207

⚠️ Potential issue | 🟠 Major

Block ingest of this GEMM artifact until collector errors are resolved.

The pointer itself is valid, but this PR’s run metadata reports 486 GEMM module errors (out of 487 total). Shipping gemm_perf.txt from that run risks persisting incomplete/corrupted performance data. Please gate this update on a successful rerun (or attach a validation report proving coverage/quality for this artifact).
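The gating the reviewer describes could be sketched roughly as follows. This is illustrative only: the function name, the artifact-to-module mapping, and the metadata layout are assumptions, not the real collector's API.

```python
def gate_artifact(artifact_path: str, run_metadata: dict) -> None:
    """Refuse to ingest an artifact whose collector module reported errors.

    Illustrative sketch: the module mapping and metadata schema below
    are assumptions, not the actual pipeline's.
    """
    # Hypothetical mapping from artifact file name to collector module.
    module_for_artifact = {"gemm_perf.txt": "sglang.gemm"}
    filename = artifact_path.rsplit("/", 1)[-1]
    module = module_for_artifact.get(filename)
    errors = run_metadata.get("errors_by_module", {}).get(module, 0)
    if errors > 0:
        raise ValueError(
            f"refusing to ingest {artifact_path}: "
            f"{errors} collector errors in {module}"
        )

run_metadata = {"errors_by_module": {"sglang.gemm": 486, "sglang.wideep_moe": 1}}
try:
    gate_artifact(
        "src/aiconfigurator/systems/data/h100_sxm/sglang/0.5.9/gemm_perf.txt",
        run_metadata,
    )
except ValueError as e:
    print(e)
```

With this run's metadata, the gemm_perf.txt artifact would be rejected with a message naming the offending file and its 486 collector errors.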


@Arsene12358
Contributor

Do we actually expect most of the GEMM test cases to fail?
