diff --git a/.copilot/skills/azure_node_health_report b/.copilot/skills/azure_node_health_report new file mode 120000 index 0000000..465ff80 --- /dev/null +++ b/.copilot/skills/azure_node_health_report @@ -0,0 +1 @@ +../../skills/slurm/azure_node_health_report \ No newline at end of file diff --git a/.copilot/skills/cluster_outlier_detection b/.copilot/skills/cluster_outlier_detection new file mode 120000 index 0000000..40d0535 --- /dev/null +++ b/.copilot/skills/cluster_outlier_detection @@ -0,0 +1 @@ +../../skills/slurm/cluster_outlier_detection \ No newline at end of file diff --git a/.copilot/skills/ib_link_validation b/.copilot/skills/ib_link_validation new file mode 120000 index 0000000..d247411 --- /dev/null +++ b/.copilot/skills/ib_link_validation @@ -0,0 +1 @@ +../../skills/slurm/ib_link_validation \ No newline at end of file diff --git a/.copilot/skills/nccl_allreduce_test b/.copilot/skills/nccl_allreduce_test new file mode 120000 index 0000000..f437eb9 --- /dev/null +++ b/.copilot/skills/nccl_allreduce_test @@ -0,0 +1 @@ +../../skills/slurm/nccl_allreduce_test \ No newline at end of file diff --git a/.copilot/skills/nccl_performance_diagnosis b/.copilot/skills/nccl_performance_diagnosis new file mode 120000 index 0000000..faeddf1 --- /dev/null +++ b/.copilot/skills/nccl_performance_diagnosis @@ -0,0 +1 @@ +../../skills/slurm/nccl_performance_diagnosis \ No newline at end of file diff --git a/.copilot/skills/node_drain_and_replace b/.copilot/skills/node_drain_and_replace new file mode 120000 index 0000000..6aa5d65 --- /dev/null +++ b/.copilot/skills/node_drain_and_replace @@ -0,0 +1 @@ +../../skills/slurm/node_drain_and_replace \ No newline at end of file diff --git a/.copilot/skills/node_gpu_validation b/.copilot/skills/node_gpu_validation new file mode 120000 index 0000000..36bb75e --- /dev/null +++ b/.copilot/skills/node_gpu_validation @@ -0,0 +1 @@ +../../skills/slurm/node_gpu_validation \ No newline at end of file diff --git 
a/.copilot/skills/rack_topology b/.copilot/skills/rack_topology new file mode 120000 index 0000000..4650685 --- /dev/null +++ b/.copilot/skills/rack_topology @@ -0,0 +1 @@ +../../skills/slurm/rack_topology \ No newline at end of file diff --git a/.copilot/skills/sku_performance_baseline b/.copilot/skills/sku_performance_baseline new file mode 120000 index 0000000..a56d63d --- /dev/null +++ b/.copilot/skills/sku_performance_baseline @@ -0,0 +1 @@ +../../skills/slurm/sku_performance_baseline \ No newline at end of file diff --git a/.copilot/skills/slurm_router b/.copilot/skills/slurm_router new file mode 120000 index 0000000..8ad67bf --- /dev/null +++ b/.copilot/skills/slurm_router @@ -0,0 +1 @@ +../../skills/slurm/slurm_router \ No newline at end of file diff --git a/.copilot/skills/thermal_stress_test b/.copilot/skills/thermal_stress_test new file mode 120000 index 0000000..5959c13 --- /dev/null +++ b/.copilot/skills/thermal_stress_test @@ -0,0 +1 @@ +../../skills/slurm/thermal_stress_test \ No newline at end of file diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md new file mode 100644 index 0000000..8ba89d1 --- /dev/null +++ b/.github/copilot-instructions.md @@ -0,0 +1,47 @@ +# Azure HPC GPU Cluster Operations (Skill-First) + +This repository is operated with a **skill-first workflow** for Azure CycleCloud Workspace for Slurm clusters with NVIDIA GPU nodes. + +## Mandatory Behavior + +For any cluster operations, validation, or troubleshooting request: + +1. Use local skills from `.copilot/skills/` first. +2. Start with `.copilot/skills/slurm_router/SKILL.md` to select the right skill set. +3. Execute commands and thresholds from the selected `SKILL.md` files. +4. Do not provide generic HPC advice when a skill exists for that task. +5. If required inputs are missing (SKU, nodelist, cluster name, failing job details), ask for them explicitly. 
+ +## Local Skills Directory + +Primary skill source: + +- `.copilot/skills/slurm_router/SKILL.md` (intent router) +- `.copilot/skills/sku_performance_baseline/SKILL.md` +- `.copilot/skills/node_gpu_validation/SKILL.md` +- `.copilot/skills/ib_link_validation/SKILL.md` +- `.copilot/skills/nccl_allreduce_test/SKILL.md` +- `.copilot/skills/thermal_stress_test/SKILL.md` +- `.copilot/skills/nccl_performance_diagnosis/SKILL.md` +- `.copilot/skills/cluster_outlier_detection/SKILL.md` +- `.copilot/skills/rack_topology/SKILL.md` +- `.copilot/skills/azure_node_health_report/SKILL.md` +- `.copilot/skills/node_drain_and_replace/SKILL.md` + +Canonical source (symlink targets) is `skills/slurm/`. + +## Response Contract + +For operational responses, follow this structure: + +1. Selected skills +2. Ordered run plan +3. Exact commands +4. Pass/fail thresholds +5. Action decision (continue, isolate, drain, reboot, GHR) + +## Test Script Paths + +- `infrastructure_validations/slurm/NCCL/` — NCCL all_reduce_perf launcher with per-SKU configs +- `infrastructure_validations/slurm/gpu_test/` — GPU GEMM benchmark (ubergemm) +- `infrastructure_validations/slurm/thermal_test/` — Thermal stress test (dcgmproftester) diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..317a62a --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,24 @@ +# Azure HPC GPU Cluster Operations + +This repo contains infrastructure validation tests and operational knowledge for Azure CycleCloud Workspace for Slurm clusters with NVIDIA GPU nodes. + +## Skills + +Read the skills in `skills/slurm/` for domain knowledge about cluster validation, diagnosis, and remediation. 
These cover: + +- **SKU baselines** — expected NCCL bandwidth, GPU GFlops, and thermal limits for GB300 and H100 +- **Test execution** — how to run NCCL, GPU GEMM, and thermal tests via Slurm +- **IB validation** — checking InfiniBand links, pkeys, error counters +- **NCCL diagnosis** — bisection algorithm for isolating bad nodes, intra-rack vs inter-rack analysis +- **Rack topology** — MNNVL domains, ClusterUUID discovery +- **Outlier detection** — statistical methods for fleet-wide analysis +- **Azure GHR** — full impact category reference, data collection, REST API +- **Node lifecycle** — drain/undrain/reboot decision tree + +When answering questions about cluster operations, hardware validation, or troubleshooting GPU/network issues, refer to the relevant skill file for exact commands, thresholds, and procedures. + +## Test Scripts + +- `infrastructure_validations/slurm/NCCL/` — NCCL all_reduce_perf launcher with per-SKU configs +- `infrastructure_validations/slurm/gpu_test/` — GPU GEMM benchmark (ubergemm) +- `infrastructure_validations/slurm/thermal_test/` — Thermal stress test (dcgmproftester) diff --git a/skills/README.md b/skills/README.md new file mode 100644 index 0000000..d248e1c --- /dev/null +++ b/skills/README.md @@ -0,0 +1,117 @@ +# Skills + +Operational knowledge for managing Azure HPC GPU clusters. Each skill is a self-contained markdown document covering one aspect of cluster validation, diagnosis, or remediation. + +## Who Is This For? + +You're on an Azure CycleCloud Workspace for Slurm cluster, you've cloned this repo, and you've opened VS Code. You need to validate hardware, troubleshoot a slow training job, or file an Azure health report — and you want an AI assistant (Copilot, Claude, etc.) to help. + +These skills give the assistant the domain knowledge it needs to actually help — correct commands, expected values, environment variables, and decision trees that are specific to Azure HPC GPU SKUs. 
+ +## How to Use + +Each skill is a directory containing a `SKILL.md` file with YAML frontmatter (`name`, `description`) and the full skill content. This structure is directly compatible with `.copilot/skills/` and easy to reference from any assistant. + +``` +skills/slurm/ + nccl_allreduce_test/ + SKILL.md # frontmatter + full skill content + rack_topology/ + SKILL.md + ... +``` + +### GitHub Copilot + +**Option 1 — Always-on instructions.** The repo includes `.github/copilot-instructions.md`, which Copilot auto-loads for every chat in this workspace. It points to these skills. + +**Option 2 — Selective skill loading.** Copy (or symlink) skill directories into `.copilot/skills/` at the repo root: + +```bash +# Copy all skills +cp -r skills/slurm/* .copilot/skills/ + +# Or symlink individual ones +mkdir -p .copilot/skills +ln -s ../../skills/slurm/nccl_performance_diagnosis .copilot/skills/ +``` + +Copilot reads the `description` in each `SKILL.md` frontmatter and **selectively loads only relevant skills** based on the query — better than always-on when you have many skills. + +**Option 3 — On demand.** Attach a specific skill in chat: `#file:skills/slurm/nccl_performance_diagnosis/SKILL.md` + +### Claude Code + +**Option 1 — Always-on instructions.** The repo includes `CLAUDE.md` at the root, which Claude auto-loads when the repo is opened. It points to these skills. + +**Option 2 — Subdirectory CLAUDE.md.** Claude Code also reads `CLAUDE.md` files in subdirectories for scoped context. You could add a `skills/slurm/CLAUDE.md` that lists all skills in that directory. + +**Option 3 — On demand.** Drag a skill file into the chat input or reference it with `@file`. + +### As agent system prompts + +If you're building an AI agent, load the relevant `SKILL.md` content into the system prompt. The skills are written to be directly usable as context — they contain commands, thresholds, and decision logic, not just descriptions. 
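
As a rough illustration, a loader along these lines is enough — it assumes only the layout described above (one `SKILL.md` per directory with `---`-delimited `name`/`description` frontmatter); the function names are illustrative, and a real YAML parser would be more robust:

```python
from pathlib import Path

def parse_skill(md_text):
    """Split the '---'-delimited frontmatter from the skill body."""
    meta = {}
    body = md_text
    if md_text.startswith("---"):
        _, fm, body = md_text.split("---", 2)
        for line in fm.strip().splitlines():
            key, sep, value = line.partition(":")
            if sep:
                meta[key.strip()] = value.strip().strip('"')
    return meta, body.lstrip()

def build_system_prompt(skill_root, selected):
    """Concatenate the selected skills' full content into one system prompt."""
    sections = []
    for name in selected:
        md = (Path(skill_root) / name / "SKILL.md").read_text()
        meta, body = parse_skill(md)
        sections.append(f"## Skill: {meta.get('name', name)}\n\n{body}")
    return "\n\n".join(sections)
```

An agent can use the `description` values for routing (decide which skills to load) and the concatenated bodies as the actual context.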
+ +## Skills Reference + +### Routing — Choose the right skill set first + +| Skill | What It Covers | +|-------|---------------| +| [slurm_router](slurm/slurm_router/SKILL.md) | Intent-to-skill routing for Slurm operations. Selects the correct skills first, then enforces exact commands, thresholds, and action decisions from those skills. | + +### Diagnostic — How to run tests and read results + +| Skill | What It Covers | +|-------|---------------| +| [sku_performance_baseline](slurm/sku_performance_baseline/SKILL.md) | Expected NCCL busbw, GPU GFlops, thermal limits, IB ports, and rack sizes for GB300 and H100 SKUs. Warn and GHR thresholds. | +| [node_gpu_validation](slurm/node_gpu_validation/SKILL.md) | Running ubergemm GEMM benchmarks, parsing CSV output, identifying underperforming GPUs, fleet-wide analysis. | +| [ib_link_validation](slurm/ib_link_validation/SKILL.md) | Checking IB port state (operstate, ibstat), partition keys, error counters, link flap detection, and soft fixes. | +| [nccl_allreduce_test](slurm/nccl_allreduce_test/SKILL.md) | Running NCCL all_reduce_perf via the launcher, per-SKU environment variables (MNNVL, SHARP, GDR), output columns, quick vs full sweep. | +| [thermal_stress_test](slurm/thermal_stress_test/SKILL.md) | Running dcgmproftester thermal stress, interpreting pass/fail, supplementary diagnostics (temperatures, throttle reasons, DCGMI levels). | + +### Reasoning — How to analyze and isolate problems + +| Skill | What It Covers | +|-------|---------------| +| [nccl_performance_diagnosis](slurm/nccl_performance_diagnosis/SKILL.md) | Scoping intra-rack vs inter-rack failures, bisection algorithm for isolating bad nodes, GPU vs network root cause analysis. | +| [cluster_outlier_detection](slurm/cluster_outlier_detection/SKILL.md) | Statistical methods (absolute threshold, z-score, MAD) for finding degraded nodes in fleet-wide test results. 
| +| [rack_topology](slurm/rack_topology/SKILL.md) | MNNVL domains, ClusterUUID discovery via nvidia-smi, expected rack sizes, FabricManager troubleshooting. | + +### Remediation — How to fix or replace bad hardware + +| Skill | What It Covers | +|-------|---------------| +| [azure_node_health_report](slurm/azure_node_health_report/SKILL.md) | Complete GHR impact category reference (26 categories), collecting PhysicalHostName and Resource ID, REST API format, polling insights. | +| [node_drain_and_replace](slurm/node_drain_and_replace/SKILL.md) | Slurm drain/undrain commands, reboot procedure, decision tree for when to drain vs reboot vs GHR, post-replacement validation. | + +## Example Workflows + +### "I just got a new cluster, validate everything" + +Skills needed: `slurm_router`, `sku_performance_baseline`, `rack_topology`, `nccl_allreduce_test`, `node_gpu_validation`, `thermal_stress_test` + +1. Discover rack topology (ClusterUUIDs). +2. Run NCCL all_reduce per rack (MNNVL test). +3. Run GPU GEMM test on all nodes. +4. Run thermal stress test on all nodes. +5. Compare results against SKU baselines. + +### "A training job is running slow" + +Skills needed: `slurm_router`, `nccl_performance_diagnosis`, `sku_performance_baseline`, `ib_link_validation` + +1. Run a quick NCCL check on the job's nodelist. +2. If bandwidth is low, identify which rack is affected. +3. Bisect the failing rack to find the bad node. +4. Check IB links and GPU health on the suspect node. + +### "I found a bad node, now what?" + +Skills needed: `slurm_router`, `node_drain_and_replace`, `azure_node_health_report` + +1. Collect metadata (PhysicalHostName, Resource ID) **before** rebooting. +2. Drain the node. +3. Attempt reboot if appropriate. +4. If issue persists, file GHR with the correct impact category. +5. Poll insights for resolution status. 
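
For reference, the statistical methods used by `cluster_outlier_detection` (z-score and MAD) reduce to a few lines. A sketch with illustrative function names; the `-2.0` cutoff matches that skill's "investigate" threshold, and the z-score variant needs at least two data points:

```python
import statistics

def zscore_outliers(metrics, cutoff=-2.0):
    """Flag nodes whose metric falls more than |cutoff| stdevs below the fleet mean."""
    mean = statistics.mean(metrics.values())
    stdev = statistics.stdev(metrics.values())  # requires >= 2 nodes
    return [n for n, v in metrics.items() if (v - mean) / stdev < cutoff]

def mad_outliers(metrics, cutoff=-2.0):
    """Robust variant: modified z-score using the median absolute deviation."""
    med = statistics.median(metrics.values())
    mad = statistics.median(abs(v - med) for v in metrics.values())
    return [n for n, v in metrics.items() if 0.6745 * (v - med) / mad < cutoff]
```

`metrics` is a `{node_name: value}` dict, e.g. minimum per-node GFlops from a GEMM sweep; the MAD variant is the one to prefer when a few badly degraded nodes would inflate the standard deviation.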
diff --git a/skills/slurm/azure_node_health_report/SKILL.md b/skills/slurm/azure_node_health_report/SKILL.md new file mode 100644 index 0000000..2a0ae7b --- /dev/null +++ b/skills/slurm/azure_node_health_report/SKILL.md @@ -0,0 +1,236 @@ +--- +name: azure-node-health-report +description: "File Azure Guest Health Reports for node investigation or replacement. Complete impact category reference (26 categories), PhysicalHostName and Resource ID collection, REST API format, and insight polling." +--- + +# Azure Node Health Report (GHR) + +How to file an Azure Guest Health Report to request node investigation or replacement. Includes the complete impact category reference from official Microsoft documentation, data collection procedures, and REST API format. + +**Reference**: [Report node health by using Guest Health Reporting](https://learn.microsoft.com/en-us/azure/azure-impact-reporting/guest-health-impact-report) | [Impact categories](https://learn.microsoft.com/en-us/azure/azure-impact-reporting/guest-health-impact-categories) + +## Data Collection — Do This FIRST + +**ALWAYS collect node metadata before rebooting or draining.** If the node goes down, you lose access to IMDS and KVP data needed for the GHR. + +### 1. Get the PhysicalHostName (REQUIRED) + +The PhysicalHostName identifies the physical server hosting the VM. It is read from Hyper-V KVP (Key-Value Pair) pool 3. + +```bash +# On the target node +tr -d '\0' < /var/lib/hyperv/.kvp_pool_3 2>/dev/null | sed -e 's/.*Qualified\(.*\)VirtualMachineDynamic.*/\1/' +``` + +This returns a string like `GGBB90904476`. + +**All HPC impact requests must include PhysicalHostName.** Without it, Azure cannot identify the physical server for remediation. + +### 2. Get the Resource ID (REQUIRED) + +The Resource ID is the fully qualified ARM path to the VM. 
Query it from the Azure Instance Metadata Service (IMDS): + +```bash +# On the target node +curl -s -H "Metadata:true" "http://169.254.169.254/metadata/instance/compute?api-version=2021-02-01" +``` + +Parse the JSON response to construct the resource ID: + +``` +/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Compute/virtualMachines/{name} +``` + +Some images return `resourceId` directly in the response. If not, construct it from `subscriptionId`, `resourceGroupName`, and `name` fields. + +### 3. Get the VmUniqueId (recommended) + +```bash +# On the target node +cat /sys/class/dmi/id/product_uuid 2>/dev/null || \ + curl -s -H "Metadata:true" "http://169.254.169.254/metadata/instance/compute/vmId?api-version=2021-02-01&format=text" +``` + +### 4. GPU details (optional, speeds up recovery) + +For GPU-related GHRs, include as much detail as possible: + +```bash +# GPU serial numbers and PCIe locations +nvidia-smi --query-gpu=index,serial,pci.bus_id,name --format=csv,noheader + +# For a specific bad GPU (e.g., GPU 2) +nvidia-smi -i 2 --query-gpu=serial,pci.bus_id,name --format=csv,noheader +``` + +## Impact Categories — Complete Reference + +Source: [Impact categories for Guest Health Reporting](https://learn.microsoft.com/en-us/azure/azure-impact-reporting/guest-health-impact-categories) + +Three main types: +- **Reset**: Refresh node health state. +- **Reboot**: Request node restart. +- **Unhealthy**: Node has issues — take out of production for diagnostics and repair. + +### Full Category List + +| Category | Description | Node Removed? 
| +|----------|-------------|:------------:| +| `Resource.Hpc.Reset` | Reset node health status | No | +| `Resource.Hpc.Reboot` | Restart the node | No | +| `Resource.Hpc.Unhealthy.HpcMissingGpu` | Missing GPU | Yes | +| `Resource.Hpc.Unhealthy.MissingIB` | Missing InfiniBand port | Yes | +| `Resource.Hpc.Unhealthy.IBPerformance` | Degraded InfiniBand performance | Yes | +| `Resource.Hpc.Unhealthy.IBPortDown` | InfiniBand port is in a down state | Yes | +| `Resource.Hpc.Unhealthy.IBPortFlapping` | InfiniBand port flapping | Yes | +| `Resource.Hpc.Unhealthy.HpcGpuDcgmDiagFailure` | DCGMI diagnostic failure | Yes | +| `Resource.Hpc.Unhealthy.HpcRowRemapFailure` | GPU row remapping failure | Yes | +| `Resource.Hpc.Unhealthy.HpcInforomCorruption` | GPU infoROM corruption | Yes | +| `Resource.Hpc.Unhealthy.HpcGenericFailure` | Issue doesn't fit other categories | Yes | +| `Resource.Hpc.Unhealthy.ManualInvestigation` | Request manual investigation by HPC team | Yes | +| `Resource.Hpc.Unhealthy.XID95UncontainedECCError` | GPU uncontained ECC error (XID 95) | Yes | +| `Resource.Hpc.Unhealthy.XID94ContainedECCError` | GPU contained ECC error (XID 94) | Yes | +| `Resource.Hpc.Unhealthy.XID79FallenOffBus` | GPU fell off PCIe bus (XID 79) | Yes | +| `Resource.Hpc.Unhealthy.XID48DoubleBitECC` | GPU double-bit ECC error (XID 48) | Yes | +| `Resource.Hpc.Unhealthy.UnhealthyGPUNvidiasmi` | nvidia-smi unresponsive | Yes | +| `Resource.Hpc.Unhealthy.NvLink` | NVLink is down | Yes | +| `Resource.Hpc.Unhealthy.HpcDcgmiThermalReport` | DCGMI thermal violations | Yes | +| `Resource.Hpc.Unhealthy.ECCPageRetirementTableFull` | Page retirements over threshold | Yes | +| `Resource.Hpc.Unhealthy.DBEOverLimit` | >10 retired pages for double-bit ECC in 7 days | Yes | +| `Resource.Hpc.Unhealthy.GpuXIDError` | GPU XID error (other than 48, 79, 94, 95) | Yes | +| `Resource.Hpc.Unhealthy.AmdGpuResetFailed` | AMD GPU unrecoverable reset failure | Yes | +| `Resource.Hpc.Unhealthy.EROTFailure` | GPU 
memory External Root of Trust failure | Yes | +| `Resource.Hpc.Unhealthy.GPUMemoryBWFailure` | GPU memory bandwidth failure | Yes | +| `Resource.Hpc.Unhealthy.CPUPerformance` | CPU performance issue | Yes | + +### Choosing the Right Category + +| Observed Issue | Category | +|---------------|----------| +| GPU not visible in nvidia-smi | `HpcMissingGpu` | +| IB port shows carrier=-1, won't come up after reboot | `IBPortDown` | +| IB port carrier_changes count is high | `IBPortFlapping` | +| IB bandwidth test consistently degraded | `IBPerformance` | +| IB interface completely missing | `MissingIB` | +| dcgmi diag -r 3 fails | `HpcGpuDcgmDiagFailure` | +| Thermal throttling under load | `HpcDcgmiThermalReport` | +| XID 79 in dmesg (GPU fallen off bus) | `XID79FallenOffBus` | +| XID 94 in dmesg (contained ECC error) | `XID94ContainedECCError` | +| XID 95 in dmesg (uncontained ECC error) | `XID95UncontainedECCError` | +| XID 48 in dmesg (double-bit ECC) | `XID48DoubleBitECC` | +| Other XID errors | `GpuXIDError` | +| nvidia-smi hangs or crashes | `UnhealthyGPUNvidiasmi` | +| NVLink down / FabricManager errors / ClusterUUID all zeros | `NvLink` | +| GPU row remap failure | `HpcRowRemapFailure` | +| GPU infoROM corruption | `HpcInforomCorruption` | +| None of the above fits | `HpcGenericFailure` | +| Need Azure HPC team to investigate | `ManualInvestigation` | + +## REST API Format + +### Endpoint + +``` +PUT https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.Impact/workloadImpacts/{workloadImpactName}?api-version=2023-02-01-preview +``` + +- `{subscriptionId}`: The subscription onboarded to GHR. +- `{workloadImpactName}`: A unique identifier (use a GUID). + +### Request Body + +```json +{ + "properties": { + "startDateTime": "2025-01-15T12:00:00Z", + "reportedTimeUtc": "2025-01-15T12:05:00Z", + "impactCategory": "Resource.Hpc.Unhealthy.IBPortDown", + "impactDescription": "IB port ib2 down on ccw-gpu-5. Persists after reboot. 
ibstat shows State: Down, Physical state: Polling.", + "impactedResourceId": "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/myRG/providers/Microsoft.Compute/virtualMachines/ccw-gpu-5", + "additionalProperties": { + "PhysicalHostName": "GGBB90904476", + "VmUniqueId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" + } + } +} +``` + +### Using az CLI + +```bash +az rest --method PUT \ + --headers "Content-Type=application/json" \ + --url "https://management.azure.com/subscriptions/${SUB_ID}/providers/Microsoft.Impact/workloadImpacts/$(uuidgen)?api-version=2023-02-01-preview" \ + --body '{ + "properties": { + "startDateTime": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'", + "reportedTimeUtc": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'", + "impactCategory": "Resource.Hpc.Unhealthy.IBPortDown", + "impactDescription": "IB port ib2 down on ccw-gpu-5 after reboot", + "impactedResourceId": "/subscriptions/.../virtualMachines/ccw-gpu-5", + "additionalProperties": { + "PhysicalHostName": "GGBB90904476" + } + } + }' +``` + +### Additional Properties for GPU Issues + +For GPU-related categories, include these optional fields to speed up recovery: + +```json +"additionalProperties": { + "PhysicalHostName": "GGBB90904476", + "VmUniqueId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", + "Manufacturer": "NVIDIA", + "SerialNumber": "1234567890", + "ModelNumber": "GB300", + "Location": "00000000:C9:00.0", + "LogUrl": "https://..." 
+} +``` + +### Row Remap Fields + +For `HpcRowRemapFailure`, include row remap details: + +```json +"additionalProperties": { + "PhysicalHostName": "GGBB90904476", + "UCE": "3", + "SerialNumber": "1234567890" +} +``` + +## Querying GHR Status (Insights) + +After submitting a GHR, poll for insights to track progress: + +```bash +GET "https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.Impact/workloadImpacts/{impactId}/insights?api-version=2025-01-01-preview" +``` + +### Insight Status Codes + +| statusCode | terminalInsight | Meaning | +|-----------|:-:|---------| +| `AcknowledgedUnhealthy` | false | Azure acknowledged the report; investigation in progress | +| `NodeRemovedFromService` | true | Node removed for repair; expect replacement | +| `TooManyRequests` | true | Rate limited — wait before resubmitting | + +### Interpreting Insights + +Insights arrive as a sequence. Check `additionalDetails.terminalInsight`: +- `false` — still being processed, check again later. +- `true` — final state, no more updates coming. + +## Workflow Summary + +1. **Detect issue** (via NCCL test, GPU test, healthcheck, user report). +2. **Collect metadata** — PhysicalHostName + Resource ID (BEFORE any reboot). +3. **Attempt soft fix** — reboot the node (unless it's FabricManager/XID79/XID95). +4. **If issue persists after reboot** — drain the node in Slurm. +5. **File GHR** — use the correct impact category, include PhysicalHostName and all available GPU details. +6. **Poll insights** — monitor for acknowledgment and resolution. +7. **After Azure repairs/replaces** — the node will return with new hardware. Undrain and re-validate. 
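
The collection and submission steps above can be sketched end-to-end. This is a hedged outline, not a client library — the function names are illustrative, and the payload simply mirrors the request-body example earlier in this skill:

```python
import uuid
from datetime import datetime, timezone

def build_ghr_body(category, description, resource_id, physical_host, extra=None):
    """Assemble a workloadImpacts request body; PhysicalHostName is mandatory for HPC."""
    if not physical_host:
        raise ValueError("PhysicalHostName is required for all HPC impact reports")
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "properties": {
            "startDateTime": now,
            "reportedTimeUtc": now,
            "impactCategory": category,
            "impactDescription": description,
            "impactedResourceId": resource_id,
            "additionalProperties": {"PhysicalHostName": physical_host, **(extra or {})},
        }
    }

def ghr_put_url(subscription_id):
    """Endpoint for a new report; the impact name is just a fresh GUID."""
    return (f"https://management.azure.com/subscriptions/{subscription_id}"
            f"/providers/Microsoft.Impact/workloadImpacts/{uuid.uuid4()}"
            f"?api-version=2023-02-01-preview")
```

Submit the body with `az rest --method PUT` (as shown above) or any HTTP client that carries an ARM bearer token.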
diff --git a/skills/slurm/cluster_outlier_detection/SKILL.md b/skills/slurm/cluster_outlier_detection/SKILL.md new file mode 100644 index 0000000..c4409e5 --- /dev/null +++ b/skills/slurm/cluster_outlier_detection/SKILL.md @@ -0,0 +1,119 @@ +--- +name: cluster-outlier-detection +description: "Statistical methods for identifying underperforming nodes from batch test results. Absolute thresholds, z-score, and MAD methods for fleet-wide GPU and NCCL analysis." +--- + +# Cluster Outlier Detection + +Statistical methods for identifying underperforming nodes from batch test results. + +## When to Use + +After running fleet-wide tests (GPU GEMM, NCCL per-rack, thermal), you have a set of per-node or per-rack metrics. Outlier detection finds nodes that are degraded relative to their peers, even if their absolute values are technically within tolerance. + +## Method 1: Absolute Threshold + +Compare each node's metric against a fixed threshold from the SKU baseline. + +``` +if metric < threshold: + flag node +``` + +Pros: Simple, deterministic, directly actionable. +Cons: Misses nodes that are degrading but not yet below the threshold. Does not adapt to fleet conditions. + +Use the thresholds from `sku_performance_baseline` for pass/fail decisions. + +## Method 2: Z-Score (Standard Deviation) + +Compute fleet mean and standard deviation, then flag nodes more than N standard deviations below the mean. + +``` +mean = average(all_node_metrics) +stdev = standard_deviation(all_node_metrics) +z_score = (node_metric - mean) / stdev + +if z_score < -2.0: + flag as outlier +``` + +### Threshold guidance + +| Z-score | Percentile | Action | +|---------|-----------|--------| +| < -1.5 | ~7th percentile | Monitor — performance is below peers | +| < -2.0 | ~2nd percentile | Investigate — likely degraded | +| < -3.0 | ~0.1th percentile | Drain — almost certainly hardware issue | + +Pros: Adapts to actual fleet performance. Catches relative degradation. 
+Cons: Requires enough data points (≥ 10 nodes). Sensitive to outliers in the dataset itself (one very bad node inflates stdev). + +### Robust variant: use median and MAD + +For small fleets or fleets with known bad nodes: + +``` +median = median(all_node_metrics) +MAD = median(|metric - median| for each node) +modified_z = 0.6745 * (node_metric - median) / MAD + +if modified_z < -2.0: + flag as outlier +``` + +MAD (Median Absolute Deviation) is less sensitive to extreme outliers than standard deviation. + +## Method 3: Deviation from Expected + +Compare each node against the expected value for the SKU, expressed as percentage deviation. + +``` +deviation_pct = (expected - node_metric) / expected * 100 + +if deviation_pct > warn_pct: + flag as warning (e.g., > 3.5%) +if deviation_pct > ghr_pct: + flag for GHR (e.g., > 7%) +``` + +This is what the GPU GEMM analysis uses (see `node_gpu_validation` skill). + +## Applying to Different Test Types + +### GPU GEMM results + +- **Metric**: Minimum GFlops across GPUs on each node (one bad GPU = bad node). +- **Expected**: Per-SKU from `sku_performance_baseline`. +- **Method**: Absolute threshold (deviation from expected) **plus** z-score across fleet. +- **Granularity**: Per-GPU if you want to identify which GPU is degraded. + +### NCCL per-rack results + +- **Metric**: Peak busbw at 16 G message size for each rack's NCCL test. +- **Expected**: Per-SKU MNNVL or IB baseline. +- **Method**: Absolute threshold first. For racks near the threshold, compare against other racks' results. +- **Note**: A single bad node in a rack drags down the entire rack's result. If one rack fails, bisect it (see `nccl_performance_diagnosis`). + +### NCCL pairwise results + +- **Metric**: busbw for each node-pair test. +- **Expected**: Similar to full-rack baseline (may be slightly higher for 2-node test due to less contention). +- **Method**: The node that appears in all failing pairs is the bad one. 
If node A fails with [B, C, D] but B passes with [C, D], then A is the problem. + +### Thermal test results + +- **Metric**: Binary pass/fail per GPU. +- **Method**: No statistics needed — any failure is a flag. + +## Reporting Format + +When presenting outlier results, include: + +1. **Fleet summary**: Total nodes, mean, stdev, min, max. +2. **Sorted list** (worst first): Node name, metric value, deviation from expected (%), z-score. +3. **Action categories**: + - GHR required (below absolute GHR threshold) + - Warning (below absolute warn threshold or z < -2) + - Healthy (above all thresholds) +4. **Per-node detail**: If GPU GEMM, include per-GPU values for flagged nodes (to identify which GPU). diff --git a/skills/slurm/ib_link_validation/SKILL.md b/skills/slurm/ib_link_validation/SKILL.md new file mode 100644 index 0000000..1fb5ad8 --- /dev/null +++ b/skills/slurm/ib_link_validation/SKILL.md @@ -0,0 +1,163 @@ +--- +name: ib-link-validation +description: "Check InfiniBand connectivity, port state, partition keys, and error counters on Azure HPC nodes. Covers operstate, ibstat, pkey verification, link flap detection, and soft fixes." +--- + +# InfiniBand Link Validation + +How to check InfiniBand connectivity, port state, partition keys, and error counters on Azure HPC nodes. + +## IB Interface Layout + +### GB300 (Standard_ND128isr_GB300_v6) +- 4 IB ports: `ib0`, `ib1`, `ib2`, `ib3` +- 4 × 400 Gb/s (NDR) +- HCA devices: `mlx5_ib0` through `mlx5_ib3` (plus additional for management) + +### H100 (Standard_ND96isr_H100_v5) +- 8 IB ports: `ib0` through `ib7` +- 8 × 400 Gb/s (NDR) +- HCA devices: `mlx5_ib0` through `mlx5_ib7` + +## Quick Health Check + +### 1. Linux network layer — operstate + +```bash +for i in ib0 ib1 ib2 ib3; do + echo "$i: $(cat /sys/class/net/$i/operstate 2>/dev/null || echo missing)" +done +``` + +Expected: all `up`. If any shows `down` or `missing`, the link is not functional. + +### 2. 
healthagent check + +```bash +sudo /usr/bin/health +``` + +Returns JSON. Look for IB interfaces in the output — `carrier=-1` means link down. + +### 3. IB layer — ibstat + +```bash +ibstat | grep -A5 "Port 1" +``` + +Key fields: +- `State: Active` — link is up and routed +- `Physical state: LinkUp` — physical layer is connected +- `Rate: 400` — NDR speed + +Bad states: `State: Down`, `Physical state: Polling` (cable or switch issue). + +### 4. IB device list + +```bash +ibv_devinfo | grep -E "hca_id|port:|state|phys_state|rate" +``` + +## Partition Key (pkey) Validation + +Pkeys control IB subnet membership. NCCL traffic requires a valid pkey. + +```bash +# Show pkeys on all ports +for dev in $(ibv_devinfo -l 2>/dev/null | grep -v "^$" | grep -v "device" | awk '{print $1}'); do + echo "=== $dev ===" + cat /sys/class/infiniband/$dev/ports/1/pkeys/* 2>/dev/null | sort -u +done +``` + +Expected: at least one non-zero pkey (typically `0x8001` or similar full-member key). If only `0x0000` or `0x7fff`, the port is not properly joined to the subnet. + +### Common pkey commands + +```bash +# Check specific device +cat /sys/class/infiniband/mlx5_ib0/ports/1/pkeys/0 + +# Verify NCCL can see the right interface +ibv_devinfo -d mlx5_ib0 -v | grep pkey +``` + +## Error Counter Checks + +IB error counters indicate link quality issues. High error rates cause retransmissions that degrade NCCL performance. 
+ +```bash +# Per-port error counters +perfquery -x # extended counters on default port + +# All ports, specific counters +for port in 1; do + for dev in mlx5_ib0 mlx5_ib1 mlx5_ib2 mlx5_ib3; do + echo "=== $dev port $port ===" + perfquery -x -d $dev -P $port 2>/dev/null | grep -i "err\|discard\|drop" + done +done +``` + +Key counters: +- `SymbolErrorCounter` — encoding errors (cable/transceiver issue) +- `LinkErrorRecoveryCounter` — link retrained (flapping) +- `LinkDownedCounter` — link went down +- `PortRcvErrors` — received malformed packets +- `PortXmitDiscards` — packets dropped on transmit + +### Threshold guidance + +| Counter | Normal | Investigate | +|---------|--------|-------------| +| SymbolErrorCounter | 0 | > 0 (cable issue) | +| LinkErrorRecoveryCounter | 0 | > 0 (flapping) | +| LinkDownedCounter | 0 | > 0 (link failure history) | +| PortRcvErrors | 0 | > 100 | +| PortXmitDiscards | 0–low | > 1000 (congestion or config) | + +## Link Flap Detection + +```bash +# Check link_flap sysfs counter (if available) +for i in ib0 ib1 ib2 ib3; do + echo "$i flaps: $(cat /sys/class/net/$i/carrier_changes 2>/dev/null || echo N/A)" +done +``` + +High `carrier_changes` indicates an unstable link (bad cable, transceiver, or switch port). + +## Soft Fix: Bring Interface Up + +```bash +sudo ip link set ib0 up +sudo ip link set ib1 up +sudo ip link set ib2 up +sudo ip link set ib3 up +``` + +After bringing links up, restart healthagent and re-check: + +```bash +sudo systemctl restart healthagent && sleep 5 +sudo /usr/bin/health +``` + +If the interface stays down after `ip link set up`, the problem is at the physical layer (cable, switch, HCA). A reboot may help; if not, file GHR with category `Resource.Hpc.Unhealthy.IBPortDown`.
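
The operstate and carrier_changes checks can also be combined into a single scripted pass. A sketch assuming the standard Linux sysfs layout; the interface list and flap threshold are illustrative and should match your SKU:

```python
from pathlib import Path

def check_ib_links(sysfs_net="/sys/class/net", ifaces=("ib0", "ib1", "ib2", "ib3"),
                   flap_threshold=10):
    """Return {iface: status} where status is 'up', 'down', 'missing', or 'flapping'."""
    report = {}
    for name in ifaces:
        base = Path(sysfs_net) / name
        if not base.exists():
            report[name] = "missing"       # interface absent -> MissingIB territory
            continue
        state = (base / "operstate").read_text().strip()
        flaps = int((base / "carrier_changes").read_text().strip() or 0)
        if state != "up":
            report[name] = "down"          # candidate for soft fix, then IBPortDown
        elif flaps > flap_threshold:
            report[name] = "flapping"      # candidate for IBPortFlapping
        else:
            report[name] = "up"
    return report
```

Running this before and after the soft fix gives a clean before/after record to attach to a GHR description.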
+ +## dmesg Diagnostics + +```bash +# IB / Mellanox errors +sudo dmesg | grep -i "ib\|infiniband\|mlx" | tail -20 + +# Look for specific failure modes +sudo dmesg | grep -i "link_state\|link down\|port_inactive" +``` + +## GHR Categories for IB Issues + +| Issue | GHR Category | +|-------|-------------| +| Port down (carrier=-1, not recoverable by reboot) | `ib_down` | +| Port flapping (high carrier_changes / LinkErrorRecovery) | `ib_flapping` | diff --git a/skills/slurm/nccl_allreduce_test/SKILL.md b/skills/slurm/nccl_allreduce_test/SKILL.md new file mode 100644 index 0000000..54cd962 --- /dev/null +++ b/skills/slurm/nccl_allreduce_test/SKILL.md @@ -0,0 +1,137 @@ +--- +name: nccl-allreduce-test +description: "Run NCCL all_reduce_perf bandwidth tests via Slurm, configure per-SKU environment variables (MNNVL, SHARP, GDR), and interpret busbw results." +--- + +# NCCL AllReduce Test + +How to run NCCL all_reduce_perf bandwidth tests, configure environment variables per SKU, and interpret results. + +> **Scripts**: This skill references test scripts from the [Azure/ai-infrastructure-on-azure](https://github.com/Azure/ai-infrastructure-on-azure) repo. Clone it and run from the repo root. + +## Test Binary + +``` +/opt/nccl-tests/build/all_reduce_perf +``` + +This is the standard NCCL test binary from [nccl-tests](https://github.com/NVIDIA/nccl-tests). It measures collective bandwidth across GPUs and nodes. + +## Running via the Launcher + +The launcher script is at `infrastructure_validations/slurm/NCCL/nccl_test.sh`. It loads per-SKU configs and handles sbatch submission. 
+ +```bash +cd infrastructure_validations/slurm/NCCL + +# Full sweep — GB300, 4 nodes +./nccl_test.sh --sku graceblackwell -N 4 + +# Full sweep — H100, 8 nodes +./nccl_test.sh --sku hopper -N 8 -w ccw-gpu-[1-8] + +# Quick bandwidth check — large messages only, 10 iterations +./nccl_test.sh --sku graceblackwell --begin-size 16G --end-size 16G --iters 10 -N 18 + +# Auto-detect SKU from nodelist +./nccl_test.sh -N 4 -w ccw-gpu-[1-4] +``` + +### CLI options + +| Option | Default | Description | +|--------|---------|-------------| +| `--sku NAME` | auto-detect | Config name: `graceblackwell` or `hopper` | +| `--begin-size SIZE` | `1K` | Start message size | +| `--end-size SIZE` | `16G` | End message size | +| `--iters N` | nccl default | Iterations per message size | +| `--check` | off | Enable data correctness validation | + +All other arguments pass through to sbatch (e.g., `-N 4`, `-w nodelist`). + +## Per-SKU Environment Variables + +### Grace Blackwell (GB300 / NDv6) + +Config file: `configs/graceblackwell.conf` + +Key settings: +- 4 GPUs per node, 4 tasks per node, 24 CPUs per task +- MNNVL enabled (`NCCL_MNNVL_ENABLE=1`, `NCCL_NVLS_ENABLE=1`) +- DMA-BUF for GPU-direct (`NCCL_DMABUF_ENABLE=1`) +- SHM disabled (`NCCL_SHM_DISABLE=1`) — NVLink is faster +- IB SL=1 (`NCCL_IB_SL=1`) — required for Azure NDR fabric +- GDR C2C enabled (`NCCL_NET_GDR_C2C=1`) +- RDMA-SHARP plugin library on LD_LIBRARY_PATH + +### Hopper (H100 / NDv5) + +Config file: `configs/hopper.conf` + +Key settings: +- 8 GPUs per node, 8 tasks per node, 12 CPUs per task +- CPU affinity mask binding (complex hex mask per GPU) +- Topology file: `NCCL_TOPO_FILE=/opt/microsoft/ndv5-topo.xml` +- PXN disabled (`NCCL_PXN_DISABLE=1`) +- Min 32 channels (`NCCL_MIN_NCHANNELS=32`) +- SHARP / CollNet enabled (`NCCL_COLLNET_ENABLE=1`) +- UCX transport (`UCX_TLS=rc`) +- IB PCIe relaxed ordering enabled + +## Output Format + +``` +# out-of-place in-place +# size count type redop root time algbw busbw #wrong time 
algbw busbw #wrong + 0 0 float sum -1 0.02 0.00 0.00 0 0.01 0.00 0.00 0 + 1024 256 float sum -1 17.94 0.06 0.11 0 17.94 0.06 0.11 0 +... + 17179869184 4294967296 float sum -1 18285.0 939.58 936.93 0 18292.6 939.19 936.54 0 +# Out of bounds values : 0 OK +# Avg bus bandwidth : 487.265 +``` + +### Key columns + +- **busbw** (bus bandwidth, GB/s): The primary metric for evaluating collective performance. This accounts for the algorithm's data movement pattern. +- **algbw** (algorithm bandwidth, GB/s): Raw data rate. Always ≥ busbw. +- **#wrong**: Data corruption errors (should be 0). + +### What to look at + +1. **Peak busbw at 16 G message size**: This is the headline number. Compare against SKU baseline. +2. **Avg bus bandwidth**: Reported at the end of the run. This averages across all message sizes — small messages drag it down, so it's always lower than peak. +3. **#wrong column**: Any non-zero value indicates data corruption — serious hardware problem. + +## Quick vs Full Sweep + +| Mode | Begin | End | Iters | Duration | Purpose | +|------|-------|-----|-------|----------|---------| +| Quick check | 16G | 16G | 10 | ~2 min | Validate peak bandwidth | +| Full sweep | 1K | 16G | default | ~15-30 min | Profile across all sizes, detect small-message regressions | +| Bisection test | 8G | 16G | 20 | ~5 min | Balance speed and confidence during fault isolation | + +## Expected Results + +See `sku_performance_baseline` skill for per-SKU busbw targets. + +### GB300 intra-rack (MNNVL, 18 nodes) +- Peak busbw at 16 G: ~937 GB/s +- This tests NVLink/NVSwitch/MNNVL interconnect within the rack. + +### GB300 inter-rack (IB-only, across racks) +- Peak busbw at 16 G: ~200 GB/s +- This tests InfiniBand interconnect between racks. 
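The two headline numbers called out above (peak busbw at the largest message size, and the reported average) can be pulled out of the run log mechanically. A sketch over trimmed hypothetical output in the format shown earlier:

```shell
# Trimmed, hypothetical all_reduce_perf output; in practice, feed the
# whole job log through the same awk one-liners.
sample='     1024           256   float     sum      -1    17.94    0.06    0.11      0    17.94    0.06    0.11      0
 17179869184    4294967296   float     sum      -1  18285.0  939.58  936.93      0  18292.6  939.19  936.54      0
# Avg bus bandwidth    : 487.265'

# out-of-place busbw is column 8; the last numeric row is the largest size
peak=$(echo "$sample" | awk '$1 ~ /^[0-9]+$/ { p = $8 } END { print p }')
avg=$(echo "$sample" | awk '/Avg bus bandwidth/ { print $NF }')
echo "peak busbw: $peak GB/s, avg: $avg GB/s"
```

Compare `peak` against the SKU baseline; remember the average is always lower because small messages drag it down.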
+ +### H100 (8 nodes, full IB) +- Peak busbw at 16 G: ~450 GB/s + +## Failure Indicators + +| Observation | What It Means | +|------------|---------------| +| busbw near zero | NCCL could not establish communication — check IB links, pkeys | +| busbw < 50 % of expected | Likely a bad node dragging down the collective | +| #wrong > 0 | Data corruption — hardware fault, file GHR immediately | +| Job hangs (no output growth) | NCCL initialization stuck — likely a downed IB link or pkey mismatch | +| "NCCL WARN" in output about IB | IB fabric issue — check ibstat on all nodes | diff --git a/skills/slurm/nccl_performance_diagnosis/SKILL.md b/skills/slurm/nccl_performance_diagnosis/SKILL.md new file mode 100644 index 0000000..154cc92 --- /dev/null +++ b/skills/slurm/nccl_performance_diagnosis/SKILL.md @@ -0,0 +1,130 @@ +--- +name: nccl-performance-diagnosis +description: "Analyze NCCL bandwidth results, scope intra-rack vs inter-rack failures, and use bisection algorithm to isolate bad nodes. GPU vs network root cause analysis." +--- + +# NCCL Performance Diagnosis + +How to analyze NCCL bandwidth results, identify what type of failure is occurring, and isolate the bad node(s). + +> **Scripts**: This skill references test scripts from the [Azure/ai-infrastructure-on-azure](https://github.com/Azure/ai-infrastructure-on-azure) repo. Clone it and run from the repo root. + +## Diagnosis Framework + +When NCCL bandwidth is below the expected baseline, work through these levels: + +1. **Is the problem intra-rack or inter-rack?** +2. **Is it one node or multiple nodes?** +3. **Is it a GPU issue or a network issue?** + +## Step 1: Scope the Problem + +### Intra-rack (MNNVL) test fails + +If a per-rack NCCL test (using all nodes in one MNNVL domain) shows low bandwidth: +- The problem is within the NVLink/NVSwitch fabric in that rack. +- One bad node in the rack will drag down the entire collective. +- Proceed to **bisection** to find the bad node. 
+ +### Inter-rack (IB-only) test fails + +If cross-rack NCCL tests show low bandwidth: +- The problem is in the InfiniBand fabric. +- Could be a bad IB link, switch port, or pkey issue on one or more nodes. +- Check IB links on all participating nodes (see `ib_link_validation` skill). +- Also compare per-rack results — if one rack is consistently the slow side, the problem is nodes in that rack. + +### Single-node test (intra-node only) + +If all inter-node tests are fine but a single node shows issues: +- Run a 2-node NCCL test with the suspect node + a known-good node. +- If that pair fails: the suspect is confirmed bad. +- If that pair passes: the issue may be environmental/transient. + +## Step 2: Bisection Algorithm + +Bisection isolates the bad node(s) from a failing group by repeatedly splitting and testing. + +### Algorithm + +1. **Start**: Take all N nodes in the failing group. +2. **Split**: Divide into two halves (group A, group B). +3. **Test both halves in parallel** (as separate NCCL test jobs). +4. **Analyze**: + - **Both pass**: The problem only occurs when all nodes interact — rare, possibly a specific switch or routing issue. Try recombining to confirm. + - **One passes, one fails**: The passing half is "known good." Recurse on the failing half. + - **Both fail**: Multiple bad nodes, one in each half. Recurse on both. +5. **Terminate** when a group has 2–3 nodes. +6. **Individual isolation**: Test each suspect node paired with a **different** known-good node. The node in the failing pair is the bad one. +7. **Drain** confirmed bad node(s). +8. **Verify**: Run the original test with remaining good nodes. Confirm it passes. + +### Parallel pair testing + +When testing 2–3 suspects individually, pair each with a different known-good node and run all pairs as separate jobs simultaneously. This avoids serializing the final isolation step. 
+ +Example with 3 suspects (S1, S2, S3) and known-good nodes (G1, G2, G3): +``` +Test 1: [S1, G1] → FAIL → S1 is bad +Test 2: [S2, G2] → PASS → S2 is good +Test 3: [S3, G3] → FAIL → S3 is bad +``` + +**Important**: Use a different good node for each pair to avoid the good node being a bottleneck or correlating failures. + +### Minimum group sizes for NCCL testing + +- GB300 MNNVL test: Minimum 2 nodes (NVLink bisection within rack). +- H100 IB test: Minimum 2 nodes. +- For meaningful bandwidth, 4+ nodes is preferred. + +## Step 3: Root Cause Analysis + +Once the bad node is identified, determine whether the issue is GPU or network: + +### GPU issue indicators +- GPU GEMM test also fails on this node → GPU compute problem. +- `nvidia-smi nvlink -s` shows inactive or degraded NVLink connections. +- `dmesg` shows XID errors. +- `dcgmi diag -r 1` fails. + +### Network issue indicators +- GPU GEMM test passes (compute is fine) but NCCL fails → network path issue. +- `ibstat` shows a port down or in `Polling` state. +- IB error counters are elevated (see `ib_link_validation` skill). +- pkey is missing or wrong on one port. + +### NVSwitch / MNNVL issue indicators (GB300) +- NCCL intra-rack test fails but inter-rack test is fine between other racks. +- `nvidia-smi -q` shows `ClusterUUID: 00000000-0000-0000-0000-000000000000` (NVLink fabric not initialized). +- FabricManager errors in `systemctl status nvidia-fabricmanager`. +- NVLink errors: `nvidia-smi nvlink -e`. 
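The Step 3 indicators can be collected from a suspect node in one pass. A sketch that only prints the evidence-collection plan; the node name is a hypothetical example, and the commented-out `ssh` line is where the commands would actually run:

```shell
# Sketch: one-pass evidence collection for the GPU-vs-network call above.
# Uncomment the ssh line to execute against a real suspect node.
node=ccw-gpu-5   # hypothetical suspect
plan=$(for cmd in \
  'nvidia-smi nvlink -s' \
  'dmesg | grep -i xid | tail -5' \
  'ibstat | grep -E "State|Physical state"' \
  'dcgmi diag -r 1'; do
  echo "=== $node: $cmd ==="
  # ssh "$node" "sudo sh -c '$cmd'"
done)
echo "$plan"
```

Reviewing all four outputs side by side usually makes the GPU-vs-network-vs-NVSwitch classification obvious.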
+ +## Bandwidth Patterns and Interpretation + +| Pattern | Likely Cause | +|---------|-------------| +| busbw ~50 % of expected | One bad node in a 2-node test | +| busbw ~0 | NCCL cannot communicate — IB link down or pkey issue | +| busbw normal at small sizes, drops at large sizes | Congestion or IB bandwidth limit | +| busbw varies across runs (±20 %) | Transient issue — noisy neighbor, thermal throttle, or IB congestion | +| All racks fail | Cluster-wide issue — check switch, SM, or subnet manager | +| One rack fails, others pass | Rack-level issue — NVSwitch, TOR switch, or power | + +## Quick-vs-Full Test Strategy + +| Scenario | Test Approach | +|----------|--------------| +| Initial validation of a new cluster | Full sweep (1K–16G) on full rack | +| Routine daily check | Quick check (16G, 10 iters) per rack | +| After node replacement | Quick check on affected rack | +| Investigating a user-reported slow job | Quick check on the job's nodelist | +| Bad rack found | Bisect within that rack | + +## Tools Reference + +- NCCL test launcher: `infrastructure_validations/slurm/NCCL/nccl_test.sh` +- Per-SKU configs: `infrastructure_validations/slurm/NCCL/configs/` +- GPU GEMM test: `infrastructure_validations/slurm/gpu_test/gpu_test.slurm` +- IB validation commands: see `ib_link_validation` skill +- Baselines: see `sku_performance_baseline` skill diff --git a/skills/slurm/node_drain_and_replace/SKILL.md b/skills/slurm/node_drain_and_replace/SKILL.md new file mode 100644 index 0000000..3d5c017 --- /dev/null +++ b/skills/slurm/node_drain_and_replace/SKILL.md @@ -0,0 +1,193 @@ +--- +name: node-drain-and-replace +description: "Slurm node lifecycle management — drain, undrain, reboot, and file for replacement. Decision tree for when to drain vs reboot vs GHR." +--- + +# Node Drain and Replace + +Slurm node lifecycle management: when and how to drain, undrain, reboot, and file for replacement. 
+ +## Slurm Node States + +| State | Meaning | +|-------|---------| +| `idle` | Available for jobs | +| `allocated` | Running a job | +| `mixed` | Some CPUs/GPUs allocated, some free | +| `drained` | Administratively removed from scheduling; no new jobs | +| `draining` | Drained but still running existing job(s) | +| `down` | Node is unreachable or failed healthcheck | +| `down*` | Node is down and not responding to slurmd | + +## Drain a Node + +```bash +sudo scontrol update NodeName=ccw-gpu-5 State=DRAIN Reason="IB_port_down_20250115" +``` + +**Always include a dated reason.** Format: `<issue>_<YYYYMMDD>`. This creates an audit trail so you know why nodes were drained and when. + +### Drain multiple nodes + +```bash +sudo scontrol update NodeName=ccw-gpu-[5-8] State=DRAIN Reason="NCCL_low_mnnvl_20250115" +``` + +### Check drain reasons + +```bash +sinfo -R +``` + +## Undrain a Node + +Return a drained node to service: + +```bash +sudo scontrol update NodeName=ccw-gpu-5 State=RESUME Reason="fixed_after_reboot" +``` + +**Critical**: Actually run this command. Just saying the node is undrained doesn't make it so. + +### Verify it's back in service + +```bash +sinfo -N -n ccw-gpu-5 -o "%N %T" +``` + +Should show `idle` (or `allocated` if a job grabbed it immediately). + +## Decision Tree: What to Do with a Bad Node + +``` +Issue Detected +│ +├─ FabricManager error / XID 79 / XID 95? +│ └─ YES → Drain → Collect metadata → File GHR (skip reboot) +│ +├─ IB port down? +│ ├─ Try soft fix: sudo ip link set ibX up +│ ├─ If soft fix works → restart healthagent → verify → undrain +│ └─ If soft fix fails → reboot → check → if still down → Drain + GHR +│ +├─ GPU performance degraded? +│ ├─ Re-test to confirm (not transient) +│ ├─ Check nvidia-smi -q for throttling, ECC errors +│ ├─ Run dcgmi diag -r 1 for quick validation +│ ├─ If persistent → reboot → re-test +│ └─ If still degraded after reboot → Drain + GHR +│ +├─ NCCL bandwidth low (one rack)?
+│ ├─ Bisect to find the bad node (see nccl_performance_diagnosis) +│ ├─ Drain the bad node +│ ├─ Investigate the bad node (GPU test, IB check, healthcheck) +│ └─ File GHR if issue persists after reboot +│ +├─ Thermal test failure? +│ ├─ Reboot → re-test +│ └─ If still fails → Drain + GHR (category: gpu_throttling or dcgm_failure) +│ +└─ Unknown / general issue? + ├─ Run healthcheck: sudo /usr/bin/health + ├─ Check dmesg for errors + ├─ Reboot → re-check + └─ If unresolved → Drain + GHR (category: HpcGenericFailure) +``` + +## Reboot Procedure + +### 1. BEFORE rebooting — cache metadata + +```bash +# On the target node, save physical hostname and resource ID +# See azure_node_health_report skill for commands +``` + +This is **critical** — if you reboot first and the node doesn't come back, you won't have the data needed for a GHR. + +### 2. Reboot + +```bash +# From scheduler, via SSH to the node +ssh ccw-gpu-5 'sudo reboot' +``` + +### 3. Wait for node to return + +Poll until the node is reachable (typically 2–3 minutes): + +```bash +# Simple poll loop +for i in $(seq 1 20); do + ssh -o ConnectTimeout=5 ccw-gpu-5 uptime 2>/dev/null && break + echo "Waiting... ($i)" + sleep 15 +done +``` + +### 4. Verify after reboot + +```bash +# Check healthagent +ssh ccw-gpu-5 'sudo /usr/bin/health' + +# Check IB interfaces directly (healthagent may have stale data) +ssh ccw-gpu-5 'for i in ib0 ib1 ib2 ib3; do echo "$i: $(cat /sys/class/net/$i/operstate 2>/dev/null || echo missing)"; done' + +# Check GPUs +ssh ccw-gpu-5 'nvidia-smi -L' + +# Check NVLink +ssh ccw-gpu-5 'nvidia-smi nvlink -s 2>&1 | head -20' +``` + +### 5. If healthagent shows stale data + +Real commands show everything OK but healthagent still reports failure: + +```bash +ssh ccw-gpu-5 'sudo systemctl restart healthagent && sleep 5 && sudo /usr/bin/health' +``` + +## After Azure Replaces the Node + +When Azure processes a GHR and replaces/repairs the physical hardware: + +1. 
The node will come back online (may take hours to days). +2. **Verify the replacement**: + - Run GPU GEMM test on the node. + - Run a 2-node NCCL test (pair with a known-good node). + - Check IB links and pkeys. + - Run healthcheck. +3. **If all checks pass**: Undrain the node. +4. **If checks fail**: File a new GHR — the replacement may also be faulty. + +## Batch Operations + +### Drain all nodes in a failing rack + +After bisection identifies a rack-level issue: + +```bash +# Get all nodes with a specific ClusterUUID +RACK_NODES="ccw-gpu-[1-18]" +sudo scontrol update NodeName=$RACK_NODES State=DRAIN Reason="rack_nvswitch_failure_20250115" +``` + +### Undrain all nodes after validation + +```bash +sudo scontrol update NodeName=ccw-gpu-[1-18] State=RESUME Reason="validated_after_repair" +``` + +### List all drained nodes + +```bash +sinfo -t drain,drained -N -o "%N %T %E" +``` + +### Count nodes by state + +```bash +sinfo -p gpu -h -o "%T" | sort | uniq -c | sort -rn +``` diff --git a/skills/slurm/node_gpu_validation/SKILL.md b/skills/slurm/node_gpu_validation/SKILL.md new file mode 100644 index 0000000..375c6a1 --- /dev/null +++ b/skills/slurm/node_gpu_validation/SKILL.md @@ -0,0 +1,102 @@ +--- +name: node-gpu-validation +description: "Test GPU compute performance using ubergemm GEMM benchmarks. Parse CSV output, identify underperforming GPUs, run fleet-wide analysis." +--- + +# Node GPU Validation + +How to test GPU compute performance on individual nodes using NVIDIA's ubergemm GEMM benchmark. + +> **Scripts**: This skill references test scripts from the [Azure/ai-infrastructure-on-azure](https://github.com/Azure/ai-infrastructure-on-azure) repo. Clone it and run from the repo root. + +## What It Tests + +ubergemm runs a sustained General Matrix Multiply workload on each GPU independently. The output is GFlops per GPU. A healthy GPU produces consistent results near the SKU baseline; a degraded GPU will show significantly lower throughput. 
+ +## Running the Test + +### Slurm batch script + +The self-contained script is at `infrastructure_validations/slurm/gpu_test/gpu_test.slurm`. + +```bash +# Test 4 nodes, 4 GPUs each (GB300) +sbatch --gpus-per-node=4 -N 4 gpu_test.slurm + +# Test 8 nodes, 8 GPUs each (H100) +sbatch --gpus-per-node=8 -N 8 gpu_test.slurm + +# Target specific nodes +sbatch --gpus-per-node=4 -N 2 -w ccw-gpu-[1-2] gpu_test.slurm +``` + +The script runs ubergemm for 60 seconds per GPU, in parallel across all GPUs on each node via `srun --ntasks-per-node=$SLURM_GPUS_ON_NODE`. + +### ubergemm binary location + +``` +/usr/libexec/datacenter-gpu-manager-4/plugins/cuda13/updated/ubergemm +``` + +This path is consistent across both GB300 and H100 Azure HPC images. + +### Manual single-node test + +```bash +# Run on GPU 0 for 60 seconds +CUDA_VISIBLE_DEVICES=0 /usr/libexec/datacenter-gpu-manager-4/plugins/cuda13/updated/ubergemm -t 60 +``` + +## Output Format + +The batch script produces CSV output: + +``` +hostname,gpu0,gpu1,gpu2,gpu3 +ccw-gpu-1,1856202,1849317,1852441,1847956 +ccw-gpu-2,1851000,1848200,1850100,1849500 +``` + +Each value is GFlops for that GPU. The raw ubergemm output contains a line like: + +``` +GFlops:1.85620e+06 GFlops +``` + +The batch script parses this with `grep -oP 'GFlops:[0-9.e+]+'` and converts via awk. + +## Interpreting Results + +### Per-node analysis + +1. Parse each row into `hostname` and per-GPU GFlops values. +2. Take the **minimum** GFlops across GPUs on that node — one bad GPU flags the node. +3. Compare against the SKU baseline (see `sku_performance_baseline` skill). + +### Fleet analysis + +1. Collect per-node minimum GFlops across all tested nodes. +2. Compute fleet mean and standard deviation. +3. Flag nodes where min GFlops < warn threshold (3.5 % below expected). +4. Flag nodes where min GFlops < GHR threshold (7 % below expected). +5. Sort worst-first for triage. 
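The fleet-analysis steps above can be sketched directly against the CSV. This is a minimal example assuming GB300 numbers (~1,850,000 GFlops expected per GPU, warn at 3.5 % below, GHR at 7 % below); the CSV rows are hypothetical:

```shell
# Sketch: per-node minimum GFlops from the CSV, flagged against the
# warn/GHR thresholds, sorted worst-first. Rows are hypothetical.
csv='hostname,gpu0,gpu1,gpu2,gpu3
ccw-gpu-1,1856202,1849317,1852441,1847956
ccw-gpu-2,1700000,1848200,1850100,1849500'

report=$(echo "$csv" | awk -F, -v expected=1850000 '
  NR == 1 { next }                                  # skip header row
  {
    min = $2 + 0
    for (i = 3; i <= NF; i++) if ($i + 0 < min) min = $i + 0
    if      (min < expected * 0.93)  verdict = "GHR"   # 7 % below
    else if (min < expected * 0.965) verdict = "WARN"  # 3.5 % below
    else                             verdict = "OK"
    printf "%s min=%d %s\n", $1, min, verdict
  }' | sort -t= -k2n)                               # worst-first
echo "$report"
```

`GHR` rows go straight to the drain-and-report workflow; `WARN` rows get a re-test first.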
+ +### Statistical outlier detection + +When the fleet is large (> 10 nodes), also flag nodes more than 2 standard deviations below the mean. This catches nodes that are degraded relative to peers even if still above the absolute threshold. + +## Common Failure Patterns + +| Pattern | Likely Cause | +|---------|-------------| +| All GPUs on a node are equally low | Thermal throttling, power capping, or PCIe bandwidth issue | +| One GPU significantly lower than others | Degraded GPU — hardware fault | +| All nodes in a rack are low | Power or cooling issue at rack level | +| GFlops near zero or parse error | GPU not visible, driver crash, XID error in dmesg | + +## What to Do with Results + +- **All nodes pass**: Record baseline for future comparison. +- **Warn-level nodes**: Re-test to confirm. Check `nvidia-smi -q` for thermal throttling or ECC errors. Consider running DCGMI diagnostics (`dcgmi diag -r 3`). +- **GHR-level nodes**: Drain the node, file GHR with category `generic` (include per-GPU GFlops in description). +- **Zero / missing output**: Check if GPUs are visible (`nvidia-smi -L`), check dmesg for XID errors (`sudo dmesg | grep -i xid`). diff --git a/skills/slurm/rack_topology/SKILL.md b/skills/slurm/rack_topology/SKILL.md new file mode 100644 index 0000000..36e00f1 --- /dev/null +++ b/skills/slurm/rack_topology/SKILL.md @@ -0,0 +1,141 @@ +--- +name: rack-topology +description: "MNNVL domain discovery on Azure GB300 clusters. ClusterUUID lookup via nvidia-smi, expected rack sizes per SKU, FabricManager troubleshooting." +--- + +# Rack Topology + +How MNNVL domains work on Azure GB300 clusters, how to discover rack membership, and expected rack structure per SKU. + +> **Scripts**: This skill references test scripts from the [Azure/ai-infrastructure-on-azure](https://github.com/Azure/ai-infrastructure-on-azure) repo. Clone it and run from the repo root. + +## What Is a Rack / MNNVL Domain? 
+ +On GB300 (NDv6) clusters, nodes within a physical rack are connected via NVSwitch/NVLink in an MNNVL (Multi-Node NVLink) domain. This gives intra-rack bandwidth of ~900+ GB/s for allreduce operations — far higher than the ~200 GB/s available over InfiniBand between racks. + +Each MNNVL domain has a unique **ClusterUUID** reported by nvidia-smi. All nodes sharing the same ClusterUUID are in the same physical rack and can use NVLink for communication. + +## Rack Structure by SKU + +### GB300 (Standard_ND128isr_GB300_v6) + +- **18 nodes per rack** (72 GPUs per MNNVL domain) +- 4 GPUs per node +- Nodes within a rack communicate via NVLink/NVSwitch/MNNVL +- Nodes across racks communicate via InfiniBand NDR 400 Gb/s +- ClusterUUID is a valid UUID (e.g., `a1b2c3d4-e5f6-7890-abcd-ef1234567890`) + +### H100 (Standard_ND96isr_H100_v5) + +- **No MNNVL** — NVSwitch is intra-node only (8 GPUs within one node) +- 8 GPUs per node +- All inter-node communication is via InfiniBand +- ClusterUUID may not be present or meaningful +- Rack topology is less relevant for NCCL testing (no intra-rack NVLink advantage) + +## Discovering Rack Topology + +### Single node query + +```bash +nvidia-smi -q | grep ClusterUUID +``` + +Output: +``` + ClusterUUID : a1b2c3d4-e5f6-7890-abcd-ef1234567890 +``` + +### Fleet-wide discovery with parallel-ssh + +```bash +# From the scheduler node +parallel-ssh -H "ccw-gpu-1 ccw-gpu-2 ccw-gpu-3 ..." -t 15 -i \ + "nvidia-smi -q 2>/dev/null | grep 'ClusterUUID' | head -1 | awk -F': ' '{print \$2}'" +``` + +Output: +``` +[1] 14:23:45 [SUCCESS] ccw-gpu-1 +a1b2c3d4-e5f6-7890-abcd-ef1234567890 +[2] 14:23:45 [SUCCESS] ccw-gpu-2 +a1b2c3d4-e5f6-7890-abcd-ef1234567890 +[3] 14:23:46 [SUCCESS] ccw-gpu-19 +b2c3d4e5-f6a7-8901-bcde-f12345678901 +``` + +Group nodes by UUID to get rack membership. 
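The grouping step can be scripted over the `parallel-ssh -i` output. A sketch on the hypothetical sample shown above:

```shell
# Sketch: group parallel-ssh -i output into racks by ClusterUUID.
# Host/UUID values below are the hypothetical examples from above.
out='[1] 14:23:45 [SUCCESS] ccw-gpu-1
a1b2c3d4-e5f6-7890-abcd-ef1234567890
[2] 14:23:45 [SUCCESS] ccw-gpu-2
a1b2c3d4-e5f6-7890-abcd-ef1234567890
[3] 14:23:46 [SUCCESS] ccw-gpu-19
b2c3d4e5-f6a7-8901-bcde-f12345678901'

racks=$(echo "$out" | awk '
  /\[SUCCESS\]/ { host = $NF; next }   # remember which host just answered
  { rack[$1] = rack[$1] " " host }     # the next line is its ClusterUUID
  END { for (u in rack) printf "%s:%s\n", u, rack[u] }' | sort)
echo "$racks"
```

Each output line is one MNNVL domain with its member nodes, ready for per-rack NCCL testing.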
+ +### Programmatic discovery + +Using Slurm hostlist expansion and parallel SSH: + +```bash +# Get all nodes in the GPU partition +NODES=$(sinfo -p gpu -h -N -o '%N' | sort -u | tr '\n' ' ') + +# Query ClusterUUID from all nodes +parallel-ssh -H "$NODES" -t 15 -i \ + "nvidia-smi -q 2>/dev/null | grep 'ClusterUUID' | head -1 | awk -F': ' '{print \$2}'" +``` + +### Handling edge cases + +- **Drained/down nodes**: Skip them — they can't be queried. Clear any cached rack_id. +- **ClusterUUID = N/A or all zeros**: NVLink fabric not initialized. This is a hardware issue — file GHR with category `nvlink_down`. +- **Node missing from output**: SSH failed — node may be unresponsive. + +## Validating Rack Size + +After discovery, verify each rack has the expected number of nodes: + +| SKU | Expected Rack Size | +|-----|-------------------| +| GB300 (NDv6) | 18 nodes | + +If a rack has fewer than expected nodes: +- Check if the missing nodes are drained/down (expected — they were filtered out). +- If nodes are in `idle` or `allocated` state but didn't return a ClusterUUID, investigate those nodes. + +## Using Rack Topology for Testing + +### Per-rack NCCL tests (MNNVL) + +Test each rack independently to validate intra-rack NVLink bandwidth: + +```bash +# For each rack, run NCCL test on its nodes +./nccl_test.sh --sku graceblackwell -N 18 -w ccw-gpu-[1-18] +``` + +Expected busbw: ~937 GB/s at 16 G message size. + +### Inter-rack NCCL tests (IB-only) + +Pick one node from each rack and test across racks: + +```bash +# One node per rack, testing IB fabric +./nccl_test.sh --sku graceblackwell -N 4 -w ccw-gpu-1,ccw-gpu-19,ccw-gpu-37,ccw-gpu-55 +``` + +Use IB-only NCCL settings (disable MNNVL) for pure IB measurement. + +### Rack-aware training node selection + +For training jobs, prefer allocating full racks (or multiples of racks) to maximize MNNVL utilization. Incomplete rack allocation wastes NVLink bandwidth and forces more traffic over IB. 
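The rack-size validation described earlier can be automated over a host-to-UUID map. A sketch; the "host uuid" input is a hypothetical intermediate from the discovery step, with placeholder rack names standing in for real ClusterUUIDs:

```shell
# Sketch: flag racks smaller than the expected GB300 size of 18 nodes.
# The map below is a hypothetical discovery-step intermediate.
expected=18
map='ccw-gpu-1 rackA
ccw-gpu-2 rackA
ccw-gpu-19 rackB'

short=$(echo "$map" | awk -v want="$expected" '
  { n[$2]++ }                     # count nodes per ClusterUUID
  END {
    for (u in n)
      if (n[u] != want) printf "%s: %d/%d nodes\n", u, n[u], want
  }' | sort)
echo "$short"
```

Any flagged rack warrants the drained/down check described above before assuming a discovery problem.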
+ +## FabricManager + +NVLink/MNNVL requires NVIDIA FabricManager to be running: + +```bash +systemctl status nvidia-fabricmanager +``` + +Healthy output includes `Active: active (running)`. + +Common FabricManager issues: +- **"training in progress"** with ClusterUUID all zeros → NVLink fabric failed to initialize. GHR category: `nvlink_down`. +- **"FabricManager not running"** → Service crashed or failed to start. Try `sudo systemctl restart nvidia-fabricmanager`. If it won't start, GHR. +- **DCGM NVSwitch errors** → `dcgmi discovery -l | grep -i nvswitch` to check NVSwitch visibility. diff --git a/skills/slurm/sku_performance_baseline/SKILL.md b/skills/slurm/sku_performance_baseline/SKILL.md new file mode 100644 index 0000000..2db843f --- /dev/null +++ b/skills/slurm/sku_performance_baseline/SKILL.md @@ -0,0 +1,54 @@ +--- +name: sku-performance-baseline +description: "Expected NCCL busbw, GPU GFlops, thermal limits, IB port counts, and rack sizes for GB300 and H100 SKUs. Warn and GHR thresholds." +--- + +# SKU Performance Baseline + +Expected performance values for Azure HPC GPU SKUs. Use these baselines to determine whether test results indicate healthy or degraded hardware. + +## SKU Reference + +### Standard_ND128isr_GB300_v6 (Grace Blackwell) + +| Metric | Expected | Warn | GHR | +|--------|----------|------|-----| +| GPU count | 4 per node | — | < 4 | +| GPU GEMM (ubergemm, 60 s) | ~1,850 TFlops/GPU | < 1,785 TFlops (3.5 %) | < 1,720 TFlops (7 %) | +| NCCL all_reduce busbw (intra-rack, MNNVL, 16 G) | ~937 GB/s | < 800 GB/s | < 600 GB/s | +| NCCL all_reduce busbw (inter-rack, IB-only, 16 G) | ~200 GB/s | < 180 GB/s | < 150 GB/s | +| Thermal stress (dcgmproftester, target 1004) | All GPUs pass | — | Any GPU fail | +| IB ports | 4 × 400 Gb/s (ib0–ib3) | — | Any port down | +| NVLink domain | 18 nodes per MNNVL rack (ClusterUUID) | < 18 nodes in rack | — | + +- **Rack size**: 18 nodes (72 GPUs per MNNVL domain). 
+- **NVLink**: Inter-node NVLink via NVSwitch / MNNVL within a rack. +- **Interconnect**: InfiniBand NDR 400 Gb/s across racks, IB SL=1. + +### Standard_ND96isr_H100_v5 (Hopper) + +| Metric | Expected | Warn | GHR | +|--------|----------|------|-----| +| GPU count | 8 per node | — | < 8 | +| GPU GEMM (ubergemm, 60 s) | ~769 TFlops/GPU | < 742 TFlops (3.5 %) | < 715 TFlops (7 %) | +| NCCL all_reduce busbw (full sweep, 16 G) | ~450 GB/s | < 400 GB/s | < 300 GB/s | +| Thermal stress (dcgmproftester, target 1004) | All GPUs pass | — | Any GPU fail | +| IB ports | 8 × 400 Gb/s (ib0–ib7) | — | Any port down | + +- **Rack size**: No MNNVL; NVSwitch is intra-node only. +- **NVLink**: 8 GPUs connected via NVSwitch within a single node. +- **Interconnect**: InfiniBand NDR 400 Gb/s, SHARP / CollNet enabled. + +## How to Use These Baselines + +1. **Run the test** (GPU GEMM, NCCL, thermal) on the target nodes. +2. **Compare results** against the Expected column for the node's SKU. +3. **If below Warn**: Re-test to confirm. Check for transient issues (thermal throttling, noisy neighbors). +4. **If below GHR**: Drain the node and file an Azure Guest Health Report. + +## Notes + +- GEMM values are per-GPU. A single underperforming GPU flags the entire node. +- NCCL busbw depends on node count and message size. Baselines assume a full rack at 16 G message size. +- Thermal test is binary pass/fail — any GPU failure is grounds for GHR. +- Always test with enough nodes to be meaningful (≥ 2 for NCCL, full rack preferred for MNNVL). diff --git a/skills/slurm/slurm_router/SKILL.md b/skills/slurm/slurm_router/SKILL.md new file mode 100644 index 0000000..4f4668a --- /dev/null +++ b/skills/slurm/slurm_router/SKILL.md @@ -0,0 +1,96 @@ +--- +name: slurm_router +description: "Router for Azure HPC Slurm operations. Selects the correct skills for validation, NCCL diagnosis, IB checks, topology, outlier detection, thermal checks, and node replacement workflows.
Use this skill first for any Slurm GPU cluster question."
+---
+
+# Slurm Skill Router
+
+Use this skill first for any cluster-operations question in this repo.
+
+## Goal
+
+Map user intent to the correct skill(s), then execute only the procedures and thresholds from those selected skills.
+
+## Required Workflow
+
+1. Classify the request using the intent map below.
+2. Explicitly list selected skills before giving commands.
+3. Use exact commands, thresholds, and decision criteria from selected skill files.
+4. If data is missing, ask for the minimum required input (SKU, nodelist, cluster name, failing job context).
+5. Do not invent thresholds or procedures outside the selected skills.
+
+## Intent Map
+
+### New cluster bring-up / full validation
+Use:
+- `sku_performance_baseline`
+- `rack_topology`
+- `nccl_allreduce_test`
+- `node_gpu_validation`
+- `thermal_stress_test`
+
+### Slow training or low multi-node throughput
+Use:
+- `nccl_performance_diagnosis`
+- `sku_performance_baseline`
+- `ib_link_validation`
+- `rack_topology` (when topology correlation is needed)
+
+### NCCL failures or low all-reduce bandwidth
+Use:
+- `nccl_allreduce_test`
+- `nccl_performance_diagnosis`
+- `ib_link_validation`
+- `cluster_outlier_detection` (fleet-wide analysis)
+
+### GPU underperformance on one or more nodes
+Use:
+- `node_gpu_validation`
+- `cluster_outlier_detection`
+- `sku_performance_baseline`
+
+### Thermal throttling or suspected cooling issues
+Use:
+- `thermal_stress_test`
+- `sku_performance_baseline`
+- `node_gpu_validation` (if thermal impact on GEMM performance is suspected)
+
+### InfiniBand link/pkey/errors investigation
+Use:
+- `ib_link_validation`
+- `nccl_performance_diagnosis`
+- `rack_topology` (when rack-locality matters)
+
+### Identify degraded nodes across fleet
+Use:
+- `cluster_outlier_detection`
+- `sku_performance_baseline`
+- `node_gpu_validation` and/or `nccl_allreduce_test` (depending on metric source)
+
+### Node remediation / drain / replace / GHR
+Use:
+- `node_drain_and_replace`
+- `azure_node_health_report`
+
+## Response Contract
+
+For every operations answer:
+
+1. **Selected skills:** list skill names.
+2. **Run plan:** concise ordered steps.
+3. **Commands:** exact commands from selected skills.
+4. **Pass/fail criteria:** thresholds from selected skills.
+5. **Decision:** next action (continue, drain, reboot, or file GHR).
+
+## Skill Locations
+
+- `skills/slurm/sku_performance_baseline/SKILL.md`
+- `skills/slurm/node_gpu_validation/SKILL.md`
+- `skills/slurm/ib_link_validation/SKILL.md`
+- `skills/slurm/nccl_allreduce_test/SKILL.md`
+- `skills/slurm/thermal_stress_test/SKILL.md`
+- `skills/slurm/nccl_performance_diagnosis/SKILL.md`
+- `skills/slurm/cluster_outlier_detection/SKILL.md`
+- `skills/slurm/rack_topology/SKILL.md`
+- `skills/slurm/azure_node_health_report/SKILL.md`
+- `skills/slurm/node_drain_and_replace/SKILL.md`
diff --git a/skills/slurm/thermal_stress_test/SKILL.md b/skills/slurm/thermal_stress_test/SKILL.md
new file mode 100644
index 0000000..5e71cf3
--- /dev/null
+++ b/skills/slurm/thermal_stress_test/SKILL.md
@@ -0,0 +1,145 @@
+---
+name: thermal-stress-test
+description: "Run GPU thermal stress tests using dcgmproftester. Interpret pass/fail results, check temperatures, throttle reasons, and DCGMI diagnostic levels."
+---
+
+# Thermal Stress Test
+
+How to run GPU thermal stress tests using dcgmproftester and interpret the results.
+
+> **Scripts**: This skill references test scripts from the [Azure/ai-infrastructure-on-azure](https://github.com/Azure/ai-infrastructure-on-azure) repo. Clone it and run from the repo root.
+
+## What It Tests
+
+dcgmproftester drives sustained GPU compute load to stress thermal limits. The test verifies that GPUs can maintain target performance under full thermal load without throttling or errors. A healthy GPU sustains the target workload for the full duration; a failing GPU throttles, produces errors, or crashes.
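The pass/fail mechanics described above reduce to a simple pattern: one stress process per GPU, one exit code per GPU. A hedged sketch of that pattern follows; it is not the repo's actual `thermal_test.sh`, and `run_stress` is a stub standing in for the real `dcgmproftester13` call so the sketch runs on any machine.

```shell
#!/usr/bin/env bash
# Sketch of the per-GPU fan-out: launch one stress process per GPU in the
# background, then collect one exit code per GPU. run_stress is a hypothetical
# stub; the real script would run dcgmproftester13 pinned to one GPU.
set -u

run_stress() {
    # Real invocation (assumption): CUDA_VISIBLE_DEVICES=$1 dcgmproftester13 --no-dcgm-validation -t 1004 -d "$DURATION"
    sleep 0.1   # stub workload so the sketch runs without GPUs
}

run_node_thermal_sketch() {
    local num_gpus=$1 failures=0 gpu rc
    local -a pids=()
    for ((gpu = 0; gpu < num_gpus; gpu++)); do
        run_stress "$gpu" &          # one background stress process per GPU
        pids[gpu]=$!
    done
    for ((gpu = 0; gpu < num_gpus; gpu++)); do
        wait "${pids[gpu]}"; rc=$?   # per-GPU verdict from the exit code
        if ((rc != 0)); then
            echo "GPU $gpu FAILED thermal test on $(hostname) (exit code $rc)"
            failures=$((failures + 1))
        fi
    done
    if ((failures == 0)); then
        echo "All $num_gpus GPU thermal tests passed on $(hostname)!"
    else
        echo "THERMAL TEST FAILED on $(hostname): $failures of $num_gpus GPUs failed"
        return 1
    fi
}

out="$(run_node_thermal_sketch 4)"
echo "$out"
```

Because each GPU gets its own process and exit code, a single flaky GPU is identified by index rather than failing the whole node anonymously.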
+
+## Running the Test
+
+### Slurm batch script
+
+The script is at `infrastructure_validations/slurm/thermal_test/thermal_test.slurm`.
+
+```bash
+# Test 4 nodes, 4 GPUs each (GB300) — 15-minute stress
+sbatch --gpus-per-node=4 -N 4 thermal_test.slurm
+
+# Test 2 nodes, 8 GPUs each (H100)
+sbatch --gpus-per-node=8 -N 2 thermal_test.slurm
+
+# Target specific nodes
+sbatch --gpus-per-node=4 -N 1 -w ccw-gpu-1 thermal_test.slurm
+```
+
+### Test parameters (hardcoded in thermal_test.sh)
+
+| Parameter | Value | Description |
+|-----------|-------|-------------|
+| `DURATION` | 900 (15 min) | Stress test duration in seconds |
+| Target activity | 1004 | dcgmproftester stress workload ID |
+| Binary | auto-detected | `dcgmproftester13` (preferred) or `dcgmproftester12` |
+
+### dcgmproftester binary location
+
+The script auto-detects the binary:
+
+```bash
+command -v dcgmproftester13 || command -v dcgmproftester12
+```
+
+On current Azure HPC images, `dcgmproftester13` is the available version.
+
+### Manual single-GPU test
+
+```bash
+CUDA_VISIBLE_DEVICES=0 dcgmproftester13 --no-dcgm-validation -t 1004 -d 900
+```
+
+## How the Test Works
+
+1. `thermal_test.slurm` runs `srun --ntasks-per-node=1` to execute `thermal_test.sh` once per node.
+2. `thermal_test.sh` launches one `dcgmproftester` process per GPU in parallel (using `CUDA_VISIBLE_DEVICES`).
+3. Each process runs for `DURATION` seconds.
+4. After all processes complete, the script checks each process's exit code.
+5. A non-zero exit code for any GPU means that GPU failed the thermal test.
+
+## Output Format
+
+```
+Starting thermal test on node ccw-gpu-1 (4 GPUs, 900s)...
+All 4 GPU thermal tests passed on ccw-gpu-1!
+```
+
+Or on failure:
+
+```
+GPU 2 FAILED thermal test on ccw-gpu-1 (exit code 1)
+THERMAL TEST FAILED on ccw-gpu-1: 1 of 4 GPUs failed
+```
+
+## Interpreting Results
+
+### Pass
+All GPUs sustain the workload for the full duration, maintaining safe temperatures under load.
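When several nodes are tested in one submission, the per-GPU failure lines shown under "Output Format" follow a fixed shape, so verdicts can be harvested from the job logs in bulk. A minimal sketch; the log filename pattern and the sample content below are fabricated for the demo, so point it at the real job output files instead.

```shell
#!/usr/bin/env bash
# Build a unique bad-node list from thermal-test job logs by matching the
# per-GPU failure line ("GPU N FAILED thermal test on <node> (exit code M)").
failed_nodes() {
    grep -h "FAILED thermal test" "$@" 2>/dev/null |
        sed -E 's/.* on ([^ ]+) .*/\1/' |
        sort -u
}

# Demo against a fabricated log file (stand-in for a real Slurm output file):
log="$(mktemp)"
cat > "$log" <<'EOF'
Starting thermal test on node ccw-gpu-1 (4 GPUs, 900s)...
GPU 2 FAILED thermal test on ccw-gpu-1 (exit code 1)
THERMAL TEST FAILED on ccw-gpu-1: 1 of 4 GPUs failed
EOF
bad="$(failed_nodes "$log")"
echo "$bad"
rm -f "$log"
```

The resulting node list feeds directly into the remediation skills (drain, health report) selected by the router.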
+
+### Fail
+One or more GPUs could not sustain the workload. Common reasons:
+- **Thermal throttling**: GPU junction temperature exceeded safe limits, causing clock reduction that dropped below target.
+- **ECC errors under load**: Heat-induced memory errors.
+- **GPU hang / XID error**: The GPU stopped responding during the stress test.
+- **Power capping**: Power delivery issue preventing sustained boost clocks.
+
+## Supplementary Diagnostics
+
+When a thermal test fails, gather more data:
+
+### GPU temperature and clocks during test
+
+```bash
+# Run in parallel with the thermal test on the same node
+watch -n 5 'nvidia-smi --query-gpu=index,temperature.gpu,clocks.sm,power.draw --format=csv,noheader,nounits'
+```
+
+### Temperature thresholds
+
+```bash
+nvidia-smi -q | grep -A2 "Temperature"
+```
+
+Look for:
+- `GPU Current Temp`: Current temperature
+- `GPU T.Limit Temp`: Temperature headroom before throttling (negative = throttling)
+- `GPU Shutdown Temp`: Hard shutdown limit
+
+### Clock throttle reasons
+
+```bash
+nvidia-smi --query-gpu=index,clocks_event_reasons.active --format=csv,noheader
+```
+
+Key throttle reasons:
+- `HW Thermal Slowdown` — GPU is too hot
+- `SW Thermal Slowdown` — Driver-imposed thermal protection
+- `HW Power Brake Slowdown` — External power brake signal
+- `SW Power Cap` — Power limit reached
+
+### DCGMI diagnostics (more thorough)
+
+```bash
+# Quick check (level 1, ~2 min)
+dcgmi diag -r 1
+
+# Extended diagnostics (level 3, ~20-30 min)
+dcgmi diag -r 3
+```
+
+Level 3 includes stress tests, memory bandwidth, PCIe bandwidth, and NVLink bandwidth checks.
+
+## GHR Category
+
+If a GPU fails thermal testing and the issue persists after reboot:
+
+| Issue | GHR Category |
+|-------|-------------|
+| Thermal throttling / thermal failure | `gpu_throttling` |
+| DCGM diagnostic failure | `dcgm_failure` |
+| GPU crashes during stress (XID error) | `xid_79` or `xid_94`/`xid_95` depending on XID code |
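Because `--format=csv,noheader,nounits` produces plain CSV, the temperature query above lends itself to quick batch triage. A sketch that flags GPUs at or above an assumed 85C threshold; the threshold and the sample data below are illustrative, not official limits.

```shell
#!/usr/bin/env bash
# Filter nvidia-smi CSV (index,temperature.gpu,clocks.sm,power.draw) down to
# hot GPUs; prints "index temperature" for each GPU at or above the limit.
# On a live node, pipe in the output of the nvidia-smi query shown earlier.
hot_gpus() {
    local limit=$1
    awk -F', *' -v limit="$limit" '$2 + 0 >= limit { print $1, $2 }'
}

# Demo with fabricated sample output (four GPUs, two of them hot):
sample='0, 62, 1980, 650.1
1, 91, 1410, 702.3
2, 64, 1980, 655.0
3, 88, 1500, 698.7'

hot="$(printf '%s\n' "$sample" | hot_gpus 85)"
echo "$hot"
```

Cross-checking the flagged indices against the clock-throttle-reason query separates true cooling problems from power capping.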