Add AI-consumable skills for Slurm GPU cluster operations by edwardsp · Pull Request #123 · Azure/ai-infrastructure-on-azure

edwardsp · 2026-03-06T18:40:28Z

What

11 operational skills (knowledge documents) for managing Azure CycleCloud Workspace for Slurm clusters with NVIDIA GPU nodes (GB300 and H100). Each skill is a self-contained markdown file with exact commands, expected values, thresholds, and decision trees — written to be directly consumed by AI assistants.

Skills (1,715 lines across 11 files)

Routing

slurm_router — Intent-to-skill mapping with enforced response contract

Diagnostic — How to run tests and read results

sku_performance_baseline — Expected NCCL busbw, GPU GFlops, thermal limits per SKU
nccl_allreduce_test — NCCL all_reduce_perf launcher, per-SKU env vars, output interpretation
node_gpu_validation — ubergemm GEMM benchmarks, CSV parsing, fleet analysis
thermal_stress_test — dcgmproftester execution, pass/fail, supplementary diagnostics
ib_link_validation — IB port state, pkeys, error counters, soft fixes

Reasoning — How to analyze and isolate problems

nccl_performance_diagnosis — Bisection algorithm, intra-rack vs inter-rack, root cause
cluster_outlier_detection — Z-score, MAD, absolute threshold methods
rack_topology — MNNVL domains, ClusterUUID discovery, FabricManager
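The three methods named for cluster_outlier_detection can be illustrated with a minimal sketch. This is not taken from the skill file: the function names, the cutoffs (3.0 for z-score, 3.5 for the MAD-based modified z-score), and the sample values are all assumptions chosen for illustration, not baselines from this repo.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


def mad_outliers(values, threshold=3.5):
    """Return indices flagged by the modified z-score (median absolute deviation)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # more than half the values equal the median
    # 0.6745 scales the MAD so it is comparable to a standard deviation
    # for normally distributed data.
    return [i for i, v in enumerate(values) if 0.6745 * abs(v - med) / mad > threshold]


def below_floor(values, floor):
    """Return indices whose value is under an absolute threshold."""
    return [i for i, v in enumerate(values) if v < floor]


# Hypothetical per-node measurements; index 7 is the slow node.
samples = [100, 101, 99, 100, 102, 98, 100, 60]
print(zscore_outliers(samples))   # the outlier inflates the stdev and hides itself
print(mad_outliers(samples))
print(below_floor(samples, 90))
```

On a small fleet a single bad node inflates the standard deviation enough to slip under a z-score cutoff of 3, which is one reason a median-based method is commonly paired with it.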

Remediation — How to fix or replace bad hardware

azure_node_health_report — Complete 26-category GHR reference from official MS docs, REST API
node_drain_and_replace — Drain/undrain/reboot decision tree, post-replacement validation

AI assistant integration

GitHub Copilot: .copilot/skills/ with 11 symlinks to skills/slurm/ for selective skill loading. .github/copilot-instructions.md enforces skill-first workflow with mandatory router.
Claude Code: CLAUDE.md at repo root points to skill directory.
Each SKILL.md includes YAML frontmatter (name, description) for Copilot's skill discovery.
Skills that reference test scripts from infrastructure_validations/ include a prerequisite note linking back to this repo.
skills/README.md documents usage options for both Copilot and Claude Code.
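As a rough illustration of the frontmatter requirement above, the sketch below checks that a SKILL.md file starts with a `---` delimited block containing `name` and `description`. It is a deliberately simple `key: value` reader, not a YAML parser, and the helper names and sample text are hypothetical; how Copilot itself reads the frontmatter is not described in this PR.

```python
REQUIRED_FIELDS = ("name", "description")


def read_frontmatter(text):
    """Parse flat `key: value` frontmatter delimited by '---' lines.

    Returns an empty dict when the block is missing or never closed.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip("\"'")
    return {}  # no closing delimiter found


def missing_fields(text):
    """Return the required frontmatter fields that are absent or empty."""
    fields = read_frontmatter(text)
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]


# Hypothetical file content for illustration only.
sample = """---
name: rack_topology
description: MNNVL domains, ClusterUUID discovery, FabricManager
---
# Rack topology
"""
print(read_frontmatter(sample))
print(missing_fields(sample))
```

A check like this could run over every `skills/slurm/*/SKILL.md` in CI; for anything beyond flat string fields a real YAML parser would be the safer choice.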

Diagnostic skills:
- sku_performance_baseline: Expected values per SKU (NCCL, GPU GEMM, thermal)
- node_gpu_validation: ubergemm GEMM test execution and result interpretation
- ib_link_validation: IB port state, pkeys, error counters, soft fixes
- nccl_allreduce_test: NCCL all_reduce_perf test, per-SKU env vars, output parsing
- thermal_stress_test: dcgmproftester thermal stress test and diagnostics

Reasoning skills:
- nccl_performance_diagnosis: Busbw analysis, bisection algorithm, root cause
- cluster_outlier_detection: Statistical methods (z-score, MAD) for fleet analysis
- rack_topology: MNNVL domains, ClusterUUID discovery, rack structure per SKU

Remediation skills:
- azure_node_health_report: Full GHR category reference from official MS docs, IMDS/KVP data collection, REST API format, insight polling
- node_drain_and_replace: Slurm drain/undrain workflow, reboot procedure, decision tree for when to drain vs reboot vs GHR

Both CLAUDE.md and .github/copilot-instructions.md point assistants to skills/slurm/ for cluster operations knowledge. These are picked up automatically when the repo is opened in VS Code — no per-client configuration needed.

Add comparison table showing which assistants have auto-discovery support (Copilot instructions, Copilot skills, Claude CLAUDE.md, Cursor rules, Windsurf rules) and what this repo currently provides. Explain .copilot/skills/ directory structure for selective loading.

Move each skills/slurm/<name>.md into skills/slurm/<name>/SKILL.md. Add YAML frontmatter (name, description) to each for .copilot/skills/ compatibility. Update README with Copilot and Claude Code usage instructions including cp/symlink examples for .copilot/skills/.
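The layout change described above can be sketched in a few lines. This is an illustration only, not the script used in the PR: the function names are hypothetical, and the use of relative directory symlinks under `.copilot/skills/` is an assumption (the commit message mentions both cp and symlink options).

```python
import os
from pathlib import Path


def restructure(skills_dir: Path) -> list[Path]:
    """Move each <name>.md in skills_dir to <name>/SKILL.md."""
    moved = []
    for md in sorted(skills_dir.glob("*.md")):
        target_dir = skills_dir / md.stem
        target_dir.mkdir(exist_ok=True)
        target = target_dir / "SKILL.md"
        md.rename(target)
        moved.append(target)
    return moved


def link_skills(skills_dir: Path, copilot_dir: Path) -> list[Path]:
    """Create one relative symlink per skill directory inside copilot_dir."""
    copilot_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for skill in sorted(p for p in skills_dir.iterdir() if (p / "SKILL.md").is_file()):
        link = copilot_dir / skill.name
        if not link.exists() and not link.is_symlink():
            # A relative target keeps the link valid wherever the repo is cloned.
            rel = os.path.relpath(skill, copilot_dir)
            link.symlink_to(rel, target_is_directory=True)
        links.append(link)
    return links
```

Called as `restructure(Path("skills/slurm"))` followed by `link_skills(Path("skills/slurm"), Path(".copilot/skills"))` from the repo root, this would produce the `skills/slurm/<name>/SKILL.md` layout with a matching entry per skill under `.copilot/skills/`. Selective loading then amounts to linking only a subset of the skill directories.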

5 skills reference scripts from infrastructure_validations/. Add a callout so users who install skills standalone know to clone the repo.

…t instructions
- New slurm_router skill: intent-to-skill routing with response contract
- .copilot/skills/: 11 symlinks to skills/slurm/ for Copilot selective loading
- copilot-instructions.md: rewritten as skill-first workflow with mandatory router
- skills/README.md: add slurm_router to reference table and example workflows

edwardsp added 8 commits March 5, 2026 11:55

Add skills README with overview, usage patterns, and example workflows

6e1d5f1

Add repo-level instruction files for Copilot and Claude

275b0b0

Add repo prerequisite note to skills that reference test scripts

ff1eb70

Merge upstream main into skill

b92a89b

edwardsp requested review from jermth and wolfgang-desalvador March 6, 2026 18:40

wolfgang-desalvador approved these changes Mar 6, 2026

edwardsp merged commit e62e77f into main Mar 9, 2026
19 of 20 checks passed

edwardsp merged 8 commits into main from skill

2 participants
