Commit a050741

Document canceled job recovery in helix skill
Canceled AzDO jobs (typically from timeouts) still have pipeline artifacts containing binlogs. The SendToHelix.binlog contains Helix job IDs that can be queried directly to recover actual test results. Discovered while investigating PR #124125 where a 3-hour timeout caused a WasmBuildTests job to be canceled, but all 226 Helix work items had actually passed.
1 parent 0b691ba commit a050741

1 file changed: +15 -0 lines changed

.github/skills/azdo-helix-failures/SKILL.md

Lines changed: 15 additions & 0 deletions
@@ -111,10 +111,25 @@ The script provides a recommendation at the end, but this is based on heuristics
 - **Manual investigation steps**: See [references/manual-investigation.md](references/manual-investigation.md)
 - **AzDO/Helix details**: See [references/azdo-helix-reference.md](references/azdo-helix-reference.md)
 
+## Recovering Results from Canceled Jobs
+
+Canceled jobs (typically from timeouts) often still have useful artifacts. The Helix work items may have completed successfully even though the AzDO job was killed while waiting to collect results.
+
+**To investigate canceled jobs:**
+
+1. **Download build artifacts**: Use the AzDO artifacts API to get `Logs_Build_*` pipeline artifacts for the canceled job. These contain binlogs even for canceled jobs.
+2. **Extract Helix job IDs**: Use the binlog MCP server to load the `SendToHelix.binlog` and search for `"Sent Helix Job"` messages. Each contains a Helix job ID.
+3. **Query Helix directly**: For each job ID, query `https://helix.dot.net/api/jobs/{jobId}/workitems?api-version=2019-06-17` to get actual pass/fail results.
+
+**Example**: A `browser-wasm windows WasmBuildTests` job was canceled after 3 hours. The binlog (truncated) still contained 12 Helix job IDs. Querying them revealed all 226 work items passed; the "failure" was purely a timeout in the AzDO wrapper.
+
+**Key insight**: "Canceled" ≠ "Failed". Always check artifacts before concluding results are lost.
+
 ## Tips
 
 1. Read PR description and comments first for context
 2. Check if same test fails on main branch before assuming transient
 3. Look for `[ActiveIssue]` attributes for known skipped tests
 4. Use `-SearchMihuBot` for semantic search of related issues
 5. Binlogs in artifacts help diagnose MSB4018 task failures
+6. Use the binlog MCP server (`binlog.mcp`) to search binlogs for Helix job IDs, build errors, and properties
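
Steps 2 and 3 of the added "Recovering Results from Canceled Jobs" section could be scripted roughly as follows. This is a minimal sketch, not part of the commit, and it rests on several assumptions: the `"Sent Helix Job"` messages have already been exported from `SendToHelix.binlog` as plain text lines, Helix job IDs are GUID-formatted, and each work item in the response carries a `State` field. The helper names (`extract_job_ids`, `workitems_url`, `summarize`, `fetch_work_items`) are hypothetical; only the URL template comes from the diff.

```python
import json
import re
from collections import Counter
from urllib.request import urlopen

# URL template taken from step 3 of the skill text.
HELIX_WORKITEMS_URL = (
    "https://helix.dot.net/api/jobs/{job_id}/workitems?api-version=2019-06-17"
)

# Assumption: Helix job IDs are GUID-formatted.
GUID_RE = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def extract_job_ids(log_lines):
    """Return unique job IDs from lines containing 'Sent Helix Job', in first-seen order."""
    job_ids = []
    for line in log_lines:
        if "Sent Helix Job" not in line:
            continue
        for match in GUID_RE.findall(line):
            job_id = match.lower()
            if job_id not in job_ids:
                job_ids.append(job_id)
    return job_ids


def workitems_url(job_id):
    """Build the work items query URL for one Helix job."""
    return HELIX_WORKITEMS_URL.format(job_id=job_id)


def summarize(work_items, state_field="State"):
    """Count work items per state. The 'State' field name is an assumption."""
    return Counter(item.get(state_field, "Unknown") for item in work_items)


def fetch_work_items(job_id, timeout=30):
    """Fetch the work item list for one job (performs a network request)."""
    with urlopen(workitems_url(job_id), timeout=timeout) as response:
        return json.load(response)
```

A caller would pass the exported message lines to `extract_job_ids`, then call `summarize(fetch_work_items(job_id))` for each ID and compare the counts against the expected number of work items. Check the field names against a real response before relying on the summary.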
