Add AI-consumable skills for Slurm GPU cluster operations #123

Merged: edwardsp merged 8 commits into main from skill on Mar 9, 2026
edwardsp (Contributor) commented on Mar 6, 2026

What

11 operational skills (knowledge documents) for managing Azure CycleCloud Workspace for Slurm clusters with NVIDIA GPU nodes (GB300 and H100). Each skill is a self-contained markdown file with exact commands, expected values, thresholds, and decision trees — written to be directly consumed by AI assistants.

Skills (1,715 lines across 11 files)

Routing

  • slurm_router — Intent-to-skill mapping with enforced response contract

Diagnostic — How to run tests and read results

  • sku_performance_baseline — Expected NCCL busbw, GPU GFlops, thermal limits per SKU
  • nccl_allreduce_test — NCCL all_reduce_perf launcher, per-SKU env vars, output interpretation
  • node_gpu_validation — ubergemm GEMM benchmarks, CSV parsing, fleet analysis
  • thermal_stress_test — dcgmproftester execution, pass/fail, supplementary diagnostics
  • ib_link_validation — IB port state, pkeys, error counters, soft fixes

Reasoning — How to analyze and isolate problems

  • nccl_performance_diagnosis — Bisection algorithm, intra-rack vs inter-rack, root cause
  • cluster_outlier_detection — Z-score, MAD, absolute threshold methods
  • rack_topology — MNNVL domains, ClusterUUID discovery, FabricManager
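The three outlier methods named for cluster_outlier_detection can be sketched as below. This is a minimal illustration, not the skill's own code: the function names, the thresholds (3.0 and 3.5), the 0.6745 scaling constant, and the sample GFlops values are all assumptions chosen because they are common defaults.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag nodes whose value lies more than `threshold` population
    standard deviations from the fleet mean (threshold is an assumed default)."""
    mean = statistics.fmean(values.values())
    stdev = statistics.pstdev(values.values())
    if stdev == 0:
        return []
    return sorted(n for n, v in values.items() if abs(v - mean) / stdev > threshold)


def mad_outliers(values, threshold=3.5):
    """Flag nodes using the modified z-score built on the median absolute
    deviation (MAD), which a single bad node distorts far less than the mean."""
    med = statistics.median(values.values())
    mad = statistics.median(abs(v - med) for v in values.values())
    if mad == 0:
        return []
    # 0.6745 rescales MAD so the score is comparable to a z-score on normal data.
    return sorted(n for n, v in values.items() if 0.6745 * abs(v - med) / mad > threshold)


def below_absolute(values, minimum):
    """Flag nodes below a fixed floor, e.g. a per-SKU baseline (hypothetical value)."""
    return sorted(n for n, v in values.items() if v < minimum)


# Hypothetical per-node benchmark results; n9 is the degraded node.
gflops = {"n1": 100, "n2": 101, "n3": 99, "n4": 100, "n5": 102,
          "n6": 98, "n7": 100, "n8": 101, "n9": 60}

print(zscore_outliers(gflops))      # the bad node inflates the stdev and masks itself
print(mad_outliers(gflops))
print(below_absolute(gflops, 90))
```

On a small fleet, one very bad node can inflate the standard deviation enough to hide itself from the z-score check, which is one reason to keep a median-based method and an absolute floor alongside it.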

Remediation — How to fix or replace bad hardware

  • azure_node_health_report — Complete 26-category GHR reference from official MS docs, REST API
  • node_drain_and_replace — Drain/undrain/reboot decision tree, post-replacement validation

AI assistant integration

  • GitHub Copilot: .copilot/skills/ contains 11 symlinks to skills/slurm/ for selective skill loading. .github/copilot-instructions.md enforces a skill-first workflow with a mandatory router.
  • Claude Code: CLAUDE.md at repo root points to skill directory.
  • Each SKILL.md includes YAML frontmatter (name, description) for Copilot's skill discovery.
  • Skills that reference test scripts from infrastructure_validations/ include a prerequisite note linking back to this repo.
  • skills/README.md documents usage options for both Copilot and Claude Code.
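As a rough sketch of how the .copilot/skills/ symlinks could be generated, the helper below links every directory under skills/slurm/ that contains a SKILL.md. The function name `link_skills` is hypothetical, and the use of relative, directory-level links is an assumption; the PR does not say how the links were created.

```python
import os
from pathlib import Path


def link_skills(repo_root: Path) -> list[str]:
    """Create .copilot/skills/<name> -> skills/slurm/<name> symlinks.

    Returns the names of the skills that were newly linked.
    Relative targets are an assumption so the links survive a repo move.
    """
    src_dir = repo_root / "skills" / "slurm"
    dst_dir = repo_root / ".copilot" / "skills"
    dst_dir.mkdir(parents=True, exist_ok=True)

    linked = []
    # Only directories that actually hold a SKILL.md count as skills.
    for skill in sorted(p for p in src_dir.iterdir() if (p / "SKILL.md").is_file()):
        link = dst_dir / skill.name
        if link.is_symlink() or link.exists():
            continue  # leave existing links or files untouched
        target = os.path.relpath(skill, dst_dir)
        link.symlink_to(target, target_is_directory=True)
        linked.append(skill.name)
    return linked
```

Calling `link_skills(Path("."))` from the repo root is idempotent: a second run finds the existing links and returns an empty list.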

edwardsp added 8 commits March 5, 2026 11:55
Diagnostic skills:
- sku_performance_baseline: Expected values per SKU (NCCL, GPU GEMM, thermal)
- node_gpu_validation: ubergemm GEMM test execution and result interpretation
- ib_link_validation: IB port state, pkeys, error counters, soft fixes
- nccl_allreduce_test: NCCL all_reduce_perf test, per-SKU env vars, output parsing
- thermal_stress_test: dcgmproftester thermal stress test and diagnostics

Reasoning skills:
- nccl_performance_diagnosis: Busbw analysis, bisection algorithm, root cause
- cluster_outlier_detection: Statistical methods (z-score, MAD) for fleet analysis
- rack_topology: MNNVL domains, ClusterUUID discovery, rack structure per SKU
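For readers unfamiliar with bisection as a diagnosis technique, here is a generic sketch of the idea, not the procedure from nccl_performance_diagnosis itself. It assumes exactly one degraded node and a hypothetical `busbw_ok` callback that runs a test on any subset of nodes (in practice a small subset may need padding with known-good nodes) and reports whether bus bandwidth met the baseline.

```python
from typing import Callable, Optional, Sequence


def bisect_slow_node(
    nodes: Sequence[str],
    busbw_ok: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Return the node suspected of degrading busbw, or None if the full set passes.

    Assumes a single culprit: each round tests one half and keeps
    whichever half must contain the failure.
    """
    candidates = list(nodes)
    if busbw_ok(candidates):
        return None
    while len(candidates) > 1:
        mid = len(candidates) // 2
        left, right = candidates[:mid], candidates[mid:]
        # If the left half fails on its own, the culprit is there;
        # otherwise, under the single-culprit assumption, it is in the right half.
        candidates = left if not busbw_ok(left) else right
    return candidates[0]


# Hypothetical usage: simulate a fleet where node05 is the slow node.
fleet = [f"node{i:02d}" for i in range(8)]
suspect = bisect_slow_node(fleet, lambda subset: "node05" not in subset)
print(suspect)
```

With N nodes this needs roughly log2(N) test runs after the initial full-set run, which is the main appeal over testing every pair.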

Remediation skills:
- azure_node_health_report: Full GHR category reference from official MS docs,
  IMDS/KVP data collection, REST API format, insight polling
- node_drain_and_replace: Slurm drain/undrain workflow, reboot procedure,
  decision tree for when to drain vs reboot vs GHR

Both CLAUDE.md and .github/copilot-instructions.md point assistants
to skills/slurm/ for cluster operations knowledge. These are picked
up automatically when the repo is opened in VS Code; no per-client
configuration is needed.

Add comparison table showing which assistants have auto-discovery
support (Copilot instructions, Copilot skills, Claude CLAUDE.md,
Cursor rules, Windsurf rules) and what this repo currently provides.
Explain .copilot/skills/ directory structure for selective loading.

Move each skills/slurm/<name>.md into skills/slurm/<name>/SKILL.md.
Add YAML frontmatter (name, description) to each for .copilot/skills/
compatibility. Update README with Copilot and Claude Code usage
instructions including cp/symlink examples for .copilot/skills/.
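The restructuring described in this commit could be scripted roughly as follows. This is a sketch under assumptions: `migrate_skill` is a hypothetical helper, the description text is supplied by the caller, and quoting the description with `json.dumps` is just one way to keep the frontmatter valid YAML when the text contains colons.

```python
import json
from pathlib import Path


def migrate_skill(md_file: Path, description: str) -> Path:
    """Move skills/slurm/<name>.md to skills/slurm/<name>/SKILL.md,
    prepending YAML frontmatter with name and description."""
    name = md_file.stem
    skill_dir = md_file.parent / name
    skill_dir.mkdir(exist_ok=True)

    body = md_file.read_text(encoding="utf-8")
    # A JSON string is also a valid YAML double-quoted scalar, so this
    # safely handles descriptions containing ':' or '#'.
    frontmatter = (
        "---\n"
        f"name: {name}\n"
        f"description: {json.dumps(description)}\n"
        "---\n\n"
    )

    target = skill_dir / "SKILL.md"
    target.write_text(frontmatter + body, encoding="utf-8")
    md_file.unlink()  # remove the old flat file only after the new one is written
    return target
```

For example, `migrate_skill(Path("skills/slurm/rack_topology.md"), "MNNVL domains and rack structure")` would leave `skills/slurm/rack_topology/SKILL.md` in place of the flat file.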

Five skills reference scripts from infrastructure_validations/. Add a
callout so users who install skills standalone know to clone the repo.
…t instructions

- New slurm_router skill: intent-to-skill routing with response contract
- .copilot/skills/: 11 symlinks to skills/slurm/ for Copilot selective loading
- copilot-instructions.md: rewritten as skill-first workflow with mandatory router
- skills/README.md: add slurm_router to reference table and example workflows
edwardsp merged commit e62e77f into main on Mar 9, 2026
19 of 20 checks passed
2 participants