
Commit 4587349

danmoseley and Copilot authored
Add ci-crash-dump skill for debugging CI test crashes (#12)
* Add ci-crash-dump skill for debugging CI test crashes

  Adds a skill to the dotnet-dnceng plugin for downloading and debugging crash dumps from CI test failures in dotnet repositories. The skill covers:
  - Finding crashed work items from a PR or build via the ci-analysis script
  - Querying the Helix API for dump artifacts and identifying crashes by exit code (Windows NTSTATUS codes, Unix signals)
  - Downloading dumps and runtime binaries from Helix
  - Debugging with dotnet-dump (managed), cdb (native/Windows), or lldb (native/Linux/macOS)
  - OS compatibility matrix for cross-platform dump analysis
  - Guidance on NativeAOT, Cross DAC, internal jobs, and mobile/WASM limitations

* Address review feedback: multi-crash, runfo fallback, analysis guidance
  - Clarify script reference is from this plugin's ci-analysis skill
  - Stop and redirect to ci-analysis if no dump files found
  - Multiple crashes: list them and ask user which to investigate
  - runfo is fallback when Files URIs are inaccessible, not alternative
  - Archive extraction covers .zip and .tar.gz
  - NativeAOT identified by CI job name containing 'NativeAOT'
  - After loading dump: correlate with PR changes, don't stop at backtrace
  - Expired artifacts: report to user instead of just 'download promptly'
  - Internal jobs: report and hand off (agent can't get tokens)

* Add symbol server guidance for managed and native debugging

* Add native binary version metadata guidance for crash analysis

* Address testing feedback: crashreport.json, macOS compat, cross-DAC discovery
  - Highlight .crashreport.json files as primary data source for macOS crashes
  - Prioritize direct download of crash artifacts before full payload
  - Be honest that macOS Mach-O dumps won't work with dotnet-dump cross-platform
  - Explain how to search AzDO artifacts for CrossDac binaries

* Require including resolved call stack in crash analysis report

* Require exit code, exception details, and stack in crash report

* Support issue-based workflow: find linked builds from issue comments

* Require PR/build link and existing issue search in crash report

* Address Windows testing feedback: console log stacks, DAC version, duplicate dumps, AzDO expiry
  - Console log often has symbolicated stacks from Helix crash handler (cdb)
  - dotnet-dump DAC version mismatch: add error code and --prerelease update
  - Duplicate dumps: WER vs createdump, (1) variant more reliable for SOS
  - AzDO builds expire too, not just Helix artifacts
  - Issue entry point: check body + comments, start with newest build

* Console log: only skip dump download if stacks have resolved symbols

* Always do independent analysis, don't just repeat existing stacks

* Fix cdb.exe discovery: search common paths instead of relying on PATH

* Add dotnet-symbol guidance for resolving stripped Linux native frames

* Address review feedback: fix dotnet-dump version guidance and CODEOWNERS ordering
  - Replace invalid '<major>.*' version specifier with exact version example
  - Reorder CODEOWNERS to list @lewing before @danmoseley for consistency

* Clarify how to determine dotnet-dump version from payload

  Explain that the runtime major version comes from the payload's shared/Microsoft.NETCore.App/<version>/ directory, and show how to find and install a matching dotnet-dump version from NuGet.

* Address review: duplicate dumps note applies to Linux too

  Update the duplicate dumps guidance to cover both Windows and Linux, since Linux CI also produces duplicate crash dumps (e.g., core.1000.33 and coredump.33.dmp for the same crash).

* Clarify which dump to prefer on each platform

  On Windows prefer the createdump variant (the (1) file). On Linux prefer coredump.*.dmp (createdump) over core.* (kernel core dump).

* Address PR review feedback: step ordering, lldb/SOS clarity, Unicode, script path
  - Step 2: reference console log via Get-CIStatus.ps1 output, not Details endpoint
  - lldb section: clarify SOS plugin commands work inside lldb
  - Capitalize Unicode
  - Script path: reference by skill name, not relative path

* Expand lldb native commands: bt, bt all, frame variable

* Add DAC version mismatch fallback guidance and cdb discovery improvements
  - When dotnet-dump DAC loading fails, recommend 'modules -v' as first diagnostic step (works without DAC, reveals runtime paths)
  - Add guidance for unreleased .NET versions where no matching dotnet-dump package exists: skip directly to cdb instead of searching feeds
  - Position cdb as a fallback for DAC mismatch, not just native crashes
  - Add MSIX package discovery snippet (Get-AppxPackage) for finding cdb.exe
  - Document MSIX sandbox limitation: copy dumps to %TEMP% if cdb can't access the original path
  - Add 'No CLR runtime found' as an additional DAC mismatch error message

* Use relative path for Get-CIStatus.ps1 references

  Address review feedback: use ../ci-analysis/scripts/Get-CIStatus.ps1 instead of bare Get-CIStatus.ps1 so the script path is unambiguous.

* Address hoyosjs PR feedback: crash attribution, console log scope, crashreport coverage
  - Crashes can be attributed to individual tests, not just entire work items; a single work item can have multiple crashes (e.g., RemoteExecutor)
  - Console log cdb/crash stacks are runtime-specific, not all repos; stacks alone may be insufficient for corruption or xunit-caught exceptions
  - .crashreport.json files are generated on both macOS and Linux, not just macOS

* Address remaining hoyosjs feedback: dotnet-dump backwards compat, 20-day expiry
  - Simplify dotnet-dump version guidance: it's backwards compatible, so just install latest from dotnet-tools feed
  - Fix Helix artifact expiry from ~30 days to 20 days

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 8c856f3 commit 4587349

File tree

2 files changed: +280 additions, 0 deletions

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions

```diff
@@ -6,3 +6,4 @@
 
 # Skills
 /plugins/dotnet-dnceng/skills/ci-analysis/ @lewing @danmoseley
+/plugins/dotnet-dnceng/skills/ci-crash-dump/ @lewing @danmoseley
```

Lines changed: 279 additions & 0 deletions (new file, `@@ -0,0 +1,279 @@`)
---
name: ci-crash-dump
description: >
  Download and debug crash dumps from CI test failures in dotnet repositories.
  Use when a CI test crashed (not just failed), when the user wants to debug a crash dump
  from a PR or build, or when asked "debug dump", "download dump", "crash dump from CI",
  "test crashed", "analyze crash in PR", or "why did the test crash".
  DO NOT USE FOR: test failures that are not crashes (use ci-analysis),
  build failures, performance analysis, or analyzing dumps you already have locally.
---

# CI Crash Dump Analysis

Dotnet repositories run tests on a distributed test infrastructure called Helix. When a test
process crashes, Helix captures a dump file and publishes it as an artifact. This skill
covers finding those artifacts, downloading them, and analyzing the dump.

## When to Use

- A CI test crashed (not just failed with assertion errors)
- User wants to debug a dump from a PR or build

## When Not to Use

- Test failed but didn't crash (normal assertion failure) — use `ci-analysis`
- User already has a dump file locally
- Build failures (no test execution occurred)

## Step 1: Identify the Crashed Work Item

**If pointed at a PR**, use this plugin's `ci-analysis` skill to find failing Helix jobs.
The `ci-analysis` skill provides `Get-CIStatus.ps1`:
```
../ci-analysis/scripts/Get-CIStatus.ps1 -PRNumber <PR> -Repository "dotnet/runtime" -ShowLogs
../ci-analysis/scripts/Get-CIStatus.ps1 -BuildId <BuildId> -ShowLogs
```

**If pointed at an issue** (not a PR), look at the issue body and comments for linked AzDO
build URLs. Multiple builds may be listed — start with the most recent, as older builds may
have expired from AzDO retention policies. Pass its build ID to
`../ci-analysis/scripts/Get-CIStatus.ps1`.
There is no associated PR in this scenario — skip PR correlation in your analysis.

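The build-link discovery described above can be mechanized. A minimal Python sketch — the `dev.azure.com` `buildId=` query-string shape is an assumption about how builds are commonly linked, so adapt the pattern to what the issue actually contains:

```python
import re

def find_build_ids(issue_text: str) -> list[int]:
    """Extract AzDO buildId values from dev.azure.com build-results links.
    Returns highest build ID first, since older builds may have expired."""
    ids = {int(m) for m in re.findall(r"dev\.azure\.com/[^\s)]*buildId=(\d+)", issue_text)}
    return sorted(ids, reverse=True)

# Hypothetical issue comments containing linked builds:
comments = """
Crash seen again: https://dev.azure.com/dnceng-public/public/_build/results?buildId=912345&view=results
Earlier run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=901122
"""
print(find_build_ids(comments))  # newest (highest) build ID first
```

Starting from the highest ID approximates "most recent build first" without needing an extra API call.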
> **Stacks already in the issue/PR:** The issue or PR may already contain a pasted stack trace.
> Do not simply repeat it — always perform your own independent analysis from the console log
> and/or dump. You may get better symbol resolution, find additional threads, or identify
> details the original poster missed. Use any existing stacks as a cross-reference, not a
> substitute.

Crashes are reported at the work item level. Look for work items that have dump files in
their artifacts. Note that even individual test-name failures can be crashes (not just
assertion failures) — for example, in dotnet/runtime test crashes are often attributed to
individual tests. A single work item may also contain multiple crashes (e.g., tests using
`RemoteExecutor` or process isolation). A PR may have many failures — look specifically
for work items with dump files. If multiple work items crashed, list them and ask the user
which one to investigate.

## Step 2: Check the Console Log First

Before downloading any dump files, check the work item's console log.
`../ci-analysis/scripts/Get-CIStatus.ps1 -ShowLogs`
reports the `ConsoleOutputUri`, or find it in the Helix work item Details response.
In **dotnet/runtime**, the crash handler runs `cdb` (or equivalent) on the machine before
uploading, so the console log may contain symbolicated native stacks (`~*k`) and
managed stacks (`!clrstack -all`). Other repos may not have this — the console log may only
show test output. If crash stacks are present and symbols are resolved (function names, not
just hex addresses), use them as a starting point for analysis. However, stacks alone may
not be sufficient — for corruption, heap issues, or cases where managed exceptions are
caught by the test framework (common in libs tests using xunit), downloading and analyzing
the dump provides deeper insight. If the stacks are missing, truncated, or show only
unresolved addresses, proceed directly to download the dump.

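The "resolved symbols vs. bare hex addresses" judgment can be expressed as a rough heuristic. A sketch — the `module!Function` frame shape is illustrative of cdb-style output, not an exact format:

```python
import re

def stacks_look_resolved(frames: list[str], threshold: float = 0.5) -> bool:
    """Heuristic: treat a stack as usable if most frames carry a
    module!Function-style symbol rather than a bare hex address."""
    symbolic = sum(1 for f in frames if re.search(r"\w+!\w+", f))
    return bool(frames) and symbolic / len(frames) >= threshold

# Illustrative frames, not real console-log output:
resolved = ["coreclr!MethodTable::GetClass+0x12", "System_Private_CoreLib!System.String.Concat"]
unresolved = ["0x00007ffd`1a2b3c4d", "0x00007ffd`deadbeef"]
print(stacks_look_resolved(resolved))    # True
print(stacks_look_resolved(unresolved))  # False
```

If the check fails, fall through to downloading the dump as the text above describes.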
## Step 3: Query the Work Item for Crash Evidence

Query the Helix API for work item details:
```
GET https://helix.dot.net/api/2019-06-17/jobs/{jobId}/workitems/{workItemName}
```

> The work item name often contains spaces, parentheses, or other special characters.
> URL-encode it (e.g., `[uri]::EscapeDataString($workItemName)` in PowerShell) or the
> request will 404.

The response includes `ExitCode` and a `Files` array (each with `FileName` and `Uri`).
However, the `Files` URIs from the Details endpoint can be broken for subdirectory or Unicode
filenames. To get reliable download URIs, use the separate ListFiles endpoint:
```
GET https://helix.dot.net/api/2019-06-17/jobs/{jobId}/workitems/{workItemName}/files
```
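URL-encoding the work item name is the step most easily missed. A Python sketch of building the ListFiles URL from the endpoint shown above — `urllib.parse.quote` plays the role of `[uri]::EscapeDataString`:

```python
from urllib.parse import quote

HELIX_API = "https://helix.dot.net/api/2019-06-17"

def workitem_files_url(job_id: str, work_item: str) -> str:
    """Build the ListFiles URL; the work item name must be URL-encoded
    because it can contain spaces, parentheses, or Unicode characters."""
    return (f"{HELIX_API}/jobs/{quote(job_id, safe='')}"
            f"/workitems/{quote(work_item, safe='')}/files")

# Hypothetical work item name with spaces and parentheses:
print(workitem_files_url("abc123", "System.Text.Json Tests (net11.0)"))
```

Passing `safe=''` forces encoding of every reserved character, matching `EscapeDataString` behavior.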

**Crash vs. normal failure:** Crashes have a negative or large `ExitCode` and crash artifacts
in the `Files` array: `.dmp` files (Windows) or `.crashreport.json` files (macOS/Linux).
Normal failures have `ExitCode: 1` and no crash artifacts.
**If there are no dump or crashreport files, stop here** — this is a normal test failure,
not a crash. Report the failure details to the user and suggest using the `ci-analysis`
skill instead.

> **`.crashreport.json` files** are generated on macOS and Linux (there is no Windows
> equivalent). They contain full native call stacks and are often the most useful starting
> point for crash analysis — check these first before attempting to load the dump.

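When a `.crashreport.json` is available, its stacks can be printed without any debugger. A sketch under an assumed schema — the field names below (`payload`, `threads`, `stack_frames`, `symbol`) mirror the general shape of createdump-style reports, but verify them against an actual file before relying on them:

```python
import json

# Illustrative .crashreport.json fragment — the real schema is defined by the
# runtime's createdump; the field names here are assumptions for the sketch.
report = json.loads("""
{
  "payload": {
    "reason": "SIGSEGV",
    "threads": [
      {"native_thread_id": "0x1a03",
       "stack_frames": [
         {"stack_pointer": "0x7ff7b000", "symbol": "libcoreclr.dylib!MethodTable::GetClass"},
         {"stack_pointer": "0x7ff7b040", "symbol": "libcoreclr.dylib!JIT_New"}
       ]}
    ]
  }
}
""")

# Walk every thread and print its frames, falling back when a frame has no symbol.
for thread in report["payload"]["threads"]:
    print(f"thread {thread['native_thread_id']}:")
    for frame in thread["stack_frames"]:
        print(f"  {frame.get('symbol', '<unresolved>')}")
```

Even this much is often enough to attribute a macOS/Linux crash without loading the dump at all.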
Common crash exit codes:

| Exit code | Meaning | Platform |
|-----------|---------|----------|
| `-1073740771` (`0xC000041D`) | Process abort | Windows |
| `-1073741819` (`0xC0000005`) | Access violation | Windows |
| `-532462766` (`0xE0434352`) | CLR unhandled exception | Windows |
| `134` (128+6) | SIGABRT | Linux/macOS |
| `139` (128+11) | SIGSEGV | Linux/macOS |

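The sign conventions in the table can be decoded mechanically: Windows exit codes are NTSTATUS values viewed through a signed 32-bit lens, and Unix codes are 128 plus the signal number. An illustrative sketch:

```python
def describe_exit_code(code: int) -> str:
    """Interpret a Helix work item exit code: on Windows the value is a
    signed 32-bit view of an NTSTATUS; on Unix, 128 + N means signal N."""
    known = {
        0xC000041D: "process abort (Windows)",
        0xC0000005: "access violation (Windows)",
        0xE0434352: "CLR unhandled exception (Windows)",
    }
    unsigned = code & 0xFFFFFFFF  # undo the signed 32-bit wraparound
    if unsigned in known:
        return f"0x{unsigned:08X}: {known[unsigned]}"
    if 128 < code < 160:
        names = {6: "SIGABRT", 11: "SIGSEGV"}
        sig = code - 128
        return f"signal {sig} ({names.get(sig, 'unknown')}) (Linux/macOS)"
    return "normal failure" if code == 1 else f"unrecognized exit code {code}"

print(describe_exit_code(-1073741819))  # 0xC0000005: access violation (Windows)
print(describe_exit_code(139))          # signal 11 (SIGSEGV) (Linux/macOS)
```

The table above lists only the most common codes; anything else should be reported verbatim.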
## Step 4: Download Artifacts

> **Check the console log first** (Step 2). If it already contains the crash stacks, you may
> not need to download the dump at all. Only proceed with download if the console log doesn't
> have sufficient detail.

Download files using the ListFiles endpoint URIs. Start with `.crashreport.json` files
(contain stack traces, especially useful for macOS) and `.dmp` files — these are directly
downloadable and often sufficient for initial analysis without needing the full payload.

> **Duplicate dumps:** Crashes often produce multiple dump files for the same crash.
> On Windows you may see e.g. `dotnet.exe.6524.dmp` and `dotnet.exe(1).6524.dmp` — one from
> Windows Error Reporting and one from `createdump`. Prefer the `createdump` variant (usually
> the `(1)` file) as it is more reliable for SOS/`dotnet-dump`. On Linux you may see e.g.
> `core.1000.33` and `coredump.33.dmp` for the same crash. Prefer the `coredump.*.dmp` file
> (produced by `createdump`) over the `core.*` file (kernel core dump). In either case, only
> analyze one and ignore the duplicate.

Download the remaining payload files (runtime binaries, test binaries) only if you need to
load the dump in a debugger. Do not use `runfo get-helix-payload` unless you actually need
the full payload — it downloads everything including large runtime binaries. If the `Files`
URIs are inaccessible (expired, 403, etc.) and you do need the payload, fall back to
[runfo](https://github.com/jaredpar/runfo): `runfo get-helix-payload -j <jobId> -w <workItem> -o <dir>`.

> **Internal Helix jobs** (identified by the org `dnceng` rather than `dnceng-public` in URLs,
> or when the Helix API returns 401/403) require authentication that the agent does not have.
> Report the job ID and work item name to the user and ask them to download manually.

Extract any archive files (`.zip`, `.tar.gz`) in the downloaded payload.

## Step 5: Debug the Dump

The dump needs matching runtime binaries (DAC, SOS) from the payload at
`shared/Microsoft.NETCore.App/<version>/`.

> **`dotnet-dump` must be at least as new as the runtime of the dump.** A .NET 9.0
> `dotnet-dump` cannot load a .NET 11.0 DAC (fails with `0x80004002` or
> `No CLR runtime found`). `dotnet-dump` is backwards compatible, so the simplest
> approach is to install the latest version:
> `dotnet tool install -g dotnet-dump --prerelease` (or `dotnet tool update -g dotnet-dump --prerelease`).
> This usually comes from the `dotnet-tools` feed and works for all supported runtime versions.
>
> **If a matching dotnet-dump version is not available** (common for unreleased .NET versions
> where no package exists on NuGet or dev feeds), **skip dotnet-dump and use `cdb` instead**
> (see "Native crashes on Windows" below). `cdb` does not have the DAC version coupling
> problem — its `!analyze -v` and `kn` commands produce native + managed stacks without
> needing a matching SOS/DAC version.
>
> **When DAC loading fails**, run `modules -v` inside `dotnet-dump analyze` as a first
> diagnostic step. This command works without the DAC and shows the full paths of all loaded
> modules, including the exact `coreclr.dll` and `System.Private.CoreLib.dll` that were in
> use. This tells you which runtime build produced the dump and where to find the matching
> DAC. Use `setclrpath` to point to that directory.

Determine the dump's platform from the CI job name (e.g., "windows-x64", "linux-arm64").

### OS compatibility

`dotnet-dump` can analyze managed state cross-platform. Native debuggers require a matching OS.

| Dump OS | Agent on Windows | Agent on Linux | Agent on macOS |
|---------|------------------|----------------|----------------|
| Windows | `dotnet-dump`, `cdb` | ⚠️ `dotnet-dump` managed-only | ⚠️ `dotnet-dump` managed-only |
| Linux | ⚠️ `dotnet-dump` managed-only (needs Cross DAC — see below) | `dotnet-dump`, `lldb` | ⚠️ `dotnet-dump` managed-only |
| macOS | ⚠️ Use `.crashreport.json` stacks only | ⚠️ Use `.crashreport.json` stacks only | `lldb` |

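If you script the triage, the matrix above can be carried as a lookup table. A sketch restating it in code — the option labels are informal descriptions, not tool flags:

```python
# (dump_os, agent_os) -> available analysis options, per the matrix above.
COMPAT = {
    ("windows", "windows"): ["dotnet-dump", "cdb"],
    ("windows", "linux"):   ["dotnet-dump (managed-only)"],
    ("windows", "macos"):   ["dotnet-dump (managed-only)"],
    ("linux",   "windows"): ["dotnet-dump (managed-only, needs Cross DAC)"],
    ("linux",   "linux"):   ["dotnet-dump", "lldb"],
    ("linux",   "macos"):   ["dotnet-dump (managed-only)"],
    ("macos",   "windows"): [".crashreport.json stacks only"],
    ("macos",   "linux"):   [".crashreport.json stacks only"],
    ("macos",   "macos"):   ["lldb"],
}

def analysis_options(dump_os: str, agent_os: str) -> list[str]:
    """Look up what the agent can do with a dump, normalizing case."""
    return COMPAT.get((dump_os.lower(), agent_os.lower()), [])

print(analysis_options("linux", "windows"))
```

An empty result means an unknown platform pair — report and hand off rather than guessing.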
> **Cross DAC for Linux dumps on Windows**: `dotnet-dump` needs the cross-DAC binaries to read
> Linux ELF core dumps. Search the AzDO build artifacts for names containing `CrossDac` (the
> exact artifact name varies by build — e.g., `CoreCLRCrossDacArtifacts`). Copy the matching
> architecture's binaries into the runtime dir alongside the payload. If no cross-DAC artifact
> is found, report this to the user.
>
> **macOS Mach-O core dumps** generally cannot be loaded by `dotnet-dump` even with cross-DAC.
> The `.crashreport.json` files from the Helix `Files` array contain full native stacks and
> are the primary analysis path for macOS crashes on non-macOS agents.

If the agent cannot fully analyze the dump (OS mismatch), report the crash type, exit code,
dump file path, and runtime binaries path, and suggest the user debug manually.

### Managed crashes

Use `dotnet-dump analyze`. The critical Helix-specific setup:
- `setclrpath` — point to the runtime binaries from the payload
- `setsymbolserver -directory` — same path, for symbols
- `setsymbolserver` (no args) — also enables the Microsoft public symbol server for OS and framework symbols

Start with `pe` (print exception) and `clrstack -all`. See the [SOS command reference](https://learn.microsoft.com/dotnet/core/diagnostics/sos-debugging-extension) for further commands.

### Native crashes on Windows

Use `cdb.exe` (command-line debugger) for native crashes, or as a **fallback when
`dotnet-dump` cannot load the DAC** (e.g., unreleased .NET versions). `cdb` does not depend
on SOS/DAC version matching for native stacks and `!analyze -v`.

`cdb.exe` may be at `C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\cdb.exe`
(Windows SDK) or inside the WinDbg MSIX package (`winget install --id Microsoft.WinDbg`).
To find it in the MSIX package:
```powershell
$pkg = Get-AppxPackage -Name "*WinDbg*"
$cdb = Join-Path $pkg.InstallLocation "amd64\cdb.exe"
```

> **MSIX sandbox limitation:** `cdb.exe` from the WinDbg MSIX package may not be able to
> access files at arbitrary paths (e.g., `C:\dumps`). If it reports "file not found" for a
> dump that exists, copy the dump to `$env:TEMP` and open it from there.

Set up the Microsoft public symbol server: `.symfix+ c:\symbols`. Key commands: `!analyze -v` (automatic crash analysis), `kP` / `~*kP` (native stacks).
For mixed native+managed: `.loadby sos coreclr`, then `!setclrpath`, `!pe`, `!clrstack`.

### Native crashes on Linux/macOS

Use `lldb` with the SOS plugin for combined native + managed debugging. Point it at the dump
and the dotnet host binary from the payload. Native commands: `bt` (backtrace), `bt all`
(all threads), `frame variable` (locals). After loading the SOS plugin, use `setclrpath` /
`setsymbolserver`, then `pe`, `clrstack -all` for managed state — these SOS commands work
inside `lldb` the same as in `dotnet-dump`.

For native symbol resolution: runtime binaries in CI are stripped (only `.dynsym` exports).
Use `dotnet-symbol --host-only --debugging <path-to-libcoreclr.so>` to download the matching
`.dbg` files, which resolve internal function names. Without these, native frames show only
as offsets between exported symbols.

Setup: [LLDB for .NET](https://github.com/dotnet/diagnostics/blob/main/documentation/lldb/linux-instructions.md).

### NativeAOT crashes

The CI job name will contain "NativeAOT" (e.g., "osx-arm64 Release NativeAOT_Libraries").
SOS does not work with NativeAOT. Use `cdb` or `lldb` directly.

## After Loading the Dump

**Always include the following in your report to the user:**
- A **link to the PR or build** that was analyzed
- The **exit code** and its meaning (e.g., `0xC0000005` = access violation)
- The **crashing thread's call stack** (and any other relevant threads), with symbols resolved
  as far as possible — this is the most important output
- Any **managed exception** type, message, and inner exception stack (use `pe`) if different
  from the native crash stack
- **Links to existing issues** that match this crash — search the repo for open issues (and
  also closed issues) with matching crash signatures, stack frames, or error messages. If
  found, link them so the user can see prior context and whether this is a known problem.

Without these the user cannot verify your interpretation or dig deeper.

Use the backtrace, exception info, heap state, etc. together with the PR's code changes to
understand *why* the crash happened. Correlate the crash location with what was changed — a
crash in code touched by the PR is likely caused by it; a crash elsewhere may be pre-existing
or a side effect.

If the crash is in a native binary not part of the PR, report its version metadata
(`lm v m <module>` in cdb, `image list <module>` in lldb) — the version helps identify
which build or package introduced it.

Use your judgement and knowledge of the codebase to form a diagnosis and suggest a fix.

## Common Pitfalls

- **Helix artifacts expire after 20 days.** If downloads fail with 404, the artifacts have likely expired — tell the user.
- **AzDO builds also expire** due to retention policies. If a build returns "not found", try a more recent build. When multiple builds are listed (e.g., in an issue), start with the newest.
- **`dotnet-dump` only handles managed state.** For native crashes, use `cdb`/`lldb` on a matching OS.
- **32-bit dumps on a 64-bit OS:** use the 32-bit .NET SDK to install `dotnet-dump`.
- **Mobile/WASM dumps** are not covered — report the dump location and hand off.
- **Internal jobs** (`dnceng` org) require auth the agent doesn't have — report and hand off.

## Further Reading

- [dotnet-dump](https://learn.microsoft.com/dotnet/core/diagnostics/dotnet-dump)
- [SOS debugging extension](https://learn.microsoft.com/dotnet/core/diagnostics/sos-debugging-extension)
- [Debugging .NET core dumps](https://github.com/dotnet/diagnostics/blob/main/documentation/debugging-coredump.md)
