Diagnostic skills:
- sku_performance_baseline: Expected values per SKU (NCCL, GPU GEMM, thermal)
- node_gpu_validation: ubergemm GEMM test execution and result interpretation
- ib_link_validation: IB port state, pkeys, error counters, soft fixes
- nccl_allreduce_test: NCCL all_reduce_perf test, per-SKU env vars, output parsing
- thermal_stress_test: dcgmproftester thermal stress test and diagnostics

Reasoning skills:
- nccl_performance_diagnosis: Busbw analysis, bisection algorithm, root cause
- cluster_outlier_detection: Statistical methods (z-score, MAD) for fleet analysis
- rack_topology: MNNVL domains, ClusterUUID discovery, rack structure per SKU

Remediation skills:
- azure_node_health_report: Full GHR category reference from official MS docs, IMDS/KVP data collection, REST API format, insight polling
- node_drain_and_replace: Slurm drain/undrain workflow, reboot procedure, decision tree for when to drain vs reboot vs GHR
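As a rough illustration of the MAD method named for cluster_outlier_detection, the sketch below flags fleet outliers with a modified z-score. The 0.6745 scaling constant and the 3.5 cutoff are the conventional textbook defaults, not values taken from the skill, and the busbw readings are invented for the example.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds `threshold`.

    Assumption: 0.6745 and 3.5 are the commonly cited defaults for the
    modified z-score; the skill itself may use different thresholds.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread by this measure, so nothing can be ranked as an outlier.
        return []
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    ]


# Hypothetical per-node busbw readings in GB/s (illustrative values only).
busbw = [480.1, 479.5, 481.0, 478.9, 350.2, 480.4]
print(mad_outliers(busbw))  # prints [4]
```

MAD is used here rather than a plain z-score because a single badly degraded node inflates the standard deviation and can mask itself, while the median-based spread is barely affected.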
Both CLAUDE.md and .github/copilot-instructions.md point assistants to skills/slurm/ for cluster operations knowledge. These are picked up automatically when the repo is opened in VS Code — no per-client configuration needed.
Add comparison table showing which assistants have auto-discovery support (Copilot instructions, Copilot skills, Claude CLAUDE.md, Cursor rules, Windsurf rules) and what this repo currently provides. Explain .copilot/skills/ directory structure for selective loading.
Move each skills/slurm/<name>.md into skills/slurm/<name>/SKILL.md. Add YAML frontmatter (name, description) to each for .copilot/skills/ compatibility. Update README with Copilot and Claude Code usage instructions including cp/symlink examples for .copilot/skills/.
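The restructuring described above could be scripted roughly as follows. This is a minimal sketch, not the script used in this PR: the function name migrate_skill is hypothetical, the description text is a placeholder supplied by the caller, and the demo runs in a temporary directory rather than in skills/slurm/.

```python
import tempfile
from pathlib import Path


def migrate_skill(md_path: Path, description: str) -> Path:
    """Move <name>.md to <name>/SKILL.md and prepend YAML frontmatter.

    Assumption: `description` is a single line without YAML-special
    characters (such as ':' or '#'); quote it otherwise.
    """
    name = md_path.stem
    target_dir = md_path.with_suffix("")  # e.g. skills/slurm/<name>
    target_dir.mkdir(exist_ok=True)
    body = md_path.read_text(encoding="utf-8")
    frontmatter = f"---\nname: {name}\ndescription: {description}\n---\n\n"
    target = target_dir / "SKILL.md"
    target.write_text(frontmatter + body, encoding="utf-8")
    md_path.unlink()  # remove the old flat file once the new one is written
    return target


# Demo in a throwaway directory so nothing in a real checkout is touched.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "rack_topology.md"
    src.write_text("# Rack topology\n", encoding="utf-8")
    out = migrate_skill(src, "placeholder description")
    print(out.read_text(encoding="utf-8"))
```

In a real checkout the same function would be applied to each file matched by skills/slurm/*.md, with a hand-written description per skill.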
5 skills reference scripts from infrastructure_validations/. Add a callout so users who install skills standalone know to clone the repo.
…t instructions
- New slurm_router skill: intent-to-skill routing with response contract
- .copilot/skills/: 11 symlinks to skills/slurm/ for Copilot selective loading
- copilot-instructions.md: rewritten as skill-first workflow with mandatory router
- skills/README.md: add slurm_router to reference table and example workflows
wolfgang-desalvador approved these changes on Mar 6, 2026
What
11 operational skills (knowledge documents) for managing Azure CycleCloud Workspace for Slurm clusters with NVIDIA GPU nodes (GB300 and H100). Each skill is a self-contained markdown file with exact commands, expected values, thresholds, and decision trees — written to be directly consumed by AI assistants.
Skills (1,715 lines across 11 files)
Routing
- slurm_router: Intent-to-skill mapping with enforced response contract

Diagnostic: How to run tests and read results
- sku_performance_baseline: Expected NCCL busbw, GPU GFlops, thermal limits per SKU
- nccl_allreduce_test: NCCL all_reduce_perf launcher, per-SKU env vars, output interpretation
- node_gpu_validation: ubergemm GEMM benchmarks, CSV parsing, fleet analysis
- thermal_stress_test: dcgmproftester execution, pass/fail, supplementary diagnostics
- ib_link_validation: IB port state, pkeys, error counters, soft fixes

Reasoning: How to analyze and isolate problems
- nccl_performance_diagnosis: Bisection algorithm, intra-rack vs inter-rack, root cause
- cluster_outlier_detection: Z-score, MAD, absolute threshold methods
- rack_topology: MNNVL domains, ClusterUUID discovery, FabricManager

Remediation: How to fix or replace bad hardware
- azure_node_health_report: Complete 26-category GHR reference from official MS docs, REST API
- node_drain_and_replace: Drain/undrain/reboot decision tree, post-replacement validation

AI assistant integration
- .copilot/skills/ with 11 symlinks to skills/slurm/ for selective skill loading.
- .github/copilot-instructions.md enforces skill-first workflow with mandatory router.
- CLAUDE.md at repo root points to skill directory.
- Each SKILL.md includes YAML frontmatter (name, description) for Copilot's skill discovery.
- Skills that reference scripts from infrastructure_validations/ include a prerequisite note linking back to this repo.
- skills/README.md documents usage options for both Copilot and Claude Code.
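One way the .copilot/skills/ symlinks could be generated is sketched below. The function name link_skills is hypothetical, and the layout (one directory symlink per skill, named after the skill, with a relative target) is an assumption; the PR may have created the links by hand or in a different shape.

```python
import os
from pathlib import Path


def link_skills(repo_root: Path) -> list[Path]:
    """Create .copilot/skills/<name> symlinks pointing at skills/slurm/<name>.

    Assumption: every skill lives in its own directory containing SKILL.md.
    """
    src_root = repo_root / "skills" / "slurm"
    dst_root = repo_root / ".copilot" / "skills"
    dst_root.mkdir(parents=True, exist_ok=True)

    links = []
    skill_dirs = sorted(
        p for p in src_root.iterdir() if (p / "SKILL.md").is_file()
    )
    for skill_dir in skill_dirs:
        link = dst_root / skill_dir.name
        # Skip anything already present, including dangling symlinks.
        if not (link.is_symlink() or link.exists()):
            # A relative target keeps the link valid wherever the repo is cloned.
            target = os.path.relpath(skill_dir, start=dst_root)
            link.symlink_to(target, target_is_directory=True)
        links.append(link)
    return links
```

Calling link_skills(Path(".")) from the repository root would then populate .copilot/skills/ with one link per skill directory. Creating symlinks on Windows may require elevated privileges or developer mode.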