Commit 7ca4d35: Update 3 skills: ci-analysis-flow-analysis-flow-tracing (synced from copilot-skills)

---
name: ci-analysis
description: >
  Analyze CI build and test status from Azure DevOps and Helix for dotnet repository PRs.
  Use when checking CI status, investigating failures, determining if a PR is ready to merge,
  or given URLs containing dev.azure.com or helix.dot.net. Also use when asked "why is CI red",
  "test failures", "retry CI", "rerun tests", "is CI green", "build failed", "checks failing",
  or "flaky tests". DO NOT USE FOR: investigating stale codeflow PRs or dependency update health,
  tracing whether a commit has flowed from one repo to another, reviewing code changes for
  correctness or style.
---

# Azure DevOps and Helix CI Analysis

Analyze CI build status and test failures in Azure DevOps and Helix for dotnet repositories (runtime, sdk, aspnetcore, roslyn, and more).

> 🚨 **NEVER** use `gh pr review --approve` or `--request-changes`. Only `--comment` is allowed. Approval and blocking are human-only actions.

**Workflow**: Gather PR context (Step 0) → run the script → read human-readable output + `[CI_ANALYSIS_SUMMARY]` JSON → synthesize recommendations. The script collects data; you generate the advice. MCP tools (AzDO, Helix, GitHub) provide supplementary access when available; the script and `gh` CLI work independently when they're not.

**Accessing services**: There are several possible methods to access each service (AzDO, Helix, GitHub). Start with MCP tools, then fall back to CLI (`gh` for GitHub, `Invoke-RestMethod` for AzDO/Helix REST APIs). Explore all available options before determining you don't have access. For AzDO, multiple tool sets may exist for different organizations — match the org in the build URL to the correct tools (see [references/azdo-helix-reference.md](references/azdo-helix-reference.md#azure-devops-organizations)). If queries return null, check the org before trying other approaches. For complex investigations, track what you've tried in SQL to avoid repeating failed approaches.

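As a concrete fallback sketch when no Helix MCP tools are available (the endpoint paths and `api-version` value are assumptions based on the Helix API's usual shape; confirm the exact routes against the Swagger docs linked in the references):

```powershell
# Hypothetical Helix job GUID for illustration; substitute a real one.
$job = '4b24b2c2-0000-0000-0000-000000000000'

# List work items for a Helix job via the REST API (assumed endpoint; verify in Swagger).
$workItems = Invoke-RestMethod "https://helix.dot.net/api/jobs/$job/workitems?api-version=2019-06-17"

# Fetch one work item's console log (work item name is illustrative).
Invoke-RestMethod "https://helix.dot.net/api/jobs/$job/workitems/System.Net.Http.Tests/console?api-version=2019-06-17"
```
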
## When to Use This Skill

- Checking CI status ("is CI passing?", "why is CI red?")
- Investigating CI failures or determining merge readiness
- Debugging Helix test issues or build errors
- URLs containing `dev.azure.com`, `helix.dot.net`, or GitHub PR links with failing checks
- Questions like "retry CI", "rerun tests", "test failures", "checks failing"
- Investigating canceled or timed-out jobs for recoverable results

**Not for**: GitHub Actions workflows, non-Helix repos, or build performance (use binlog analysis).

> 💡 **Per-repo CI patterns differ significantly.** Each dotnet repo structures test results differently (TRX availability, console log patterns, work item naming). Before investigating a repo you haven't seen before, check the empirical profiles in `dnceng-knowledge/ci-repo-profiles` — they document the fastest investigation path per repo and prevent wasted MCP calls.

## Quick Start

```powershell
# Analyze PR failures (most common) - defaults to dotnet/runtime
./scripts/Get-CIStatus.ps1 -PRNumber 123445 -ShowLogs

# Analyze by build ID
./scripts/Get-CIStatus.ps1 -BuildId 1276327 -ShowLogs

# Query specific Helix work item
./scripts/Get-CIStatus.ps1 -HelixJob "4b24b2c2-..." -WorkItem "System.Net.Http.Tests"

# Other dotnet repositories
./scripts/Get-CIStatus.ps1 -PRNumber 12345 -Repository "dotnet/aspnetcore"
```

For full parameter reference and mode details, see [references/script-modes.md](references/script-modes.md).

53+
54+
## Step 0: Gather Context (before running anything)
55+
56+
Context changes how you interpret every failure. **Don't skip this.**
57+
58+
1. **Read PR metadata** — title, description, author, labels, linked issues
59+
2. **Classify the PR type**:
60+
61+
| PR Type | How to detect | Interpretation shift |
62+
|---------|--------------|---------------------|
63+
| **Code PR** | Human author, code changes | Failures likely relate to the changes |
64+
| **Flow/Codeflow PR** | Author is `dotnet-maestro[bot]`, "Update dependencies" | Missing packages may be behavioral, not infrastructure |
65+
| **Backport** | Title mentions "backport", targets release branch | Check if test exists on target branch |
66+
| **Merge PR** | Merging between branches | Conflicts cause failures, not individual changes |
67+
| **Dependency update** | Bumps package versions, global.json | Build failures often trace to the dependency |
68+
69+
3. **Check existing comments** — has someone already diagnosed failures or is a retry pending?
70+
4. **Note the changed files** — you'll use these for correlation after the script runs
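All of this context can be pulled in one `gh` call (field names are the standard `gh pr view --json` fields; the PR number is illustrative):

```bash
# Fetch metadata, labels, changed files, and existing comments for Step 0 in one query.
gh pr view 12345 --repo dotnet/runtime \
  --json title,body,author,labels,files,comments
```
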
## After the Script: Use Its Output

> 🚨 **The script already collected the data. Do NOT re-query AzDO or Helix for information the script already produced.** Parse the `[CI_ANALYSIS_SUMMARY]` JSON and the human-readable output first. Only make additional API calls for data the script *didn't* provide (e.g., deeper Helix log searches, binlog analysis, build progression).

**If the script found no builds** (e.g., AzDO builds expired, CI not triggered): report this to the user immediately. Don't spend turns re-querying AzDO with different org/project combinations — if the script couldn't find builds, they're likely unavailable. Offer alternatives: analyze by build ID if the user has one, check GitHub PR status for summary info, or note that Helix results may still be queryable directly even when AzDO builds have expired.

**If the script succeeded**: the `[CI_ANALYSIS_SUMMARY]` JSON contains `failedJobDetails`, `knownIssues`, `canceledJobNames`, `prCorrelation`, and `recommendationHint`. Use these fields — don't re-fetch the same data via MCP tools or REST APIs. To find specific details in large output, use `Select-String` or `grep` on the output file rather than re-running the script.

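A minimal extraction sketch (this assumes the summary is emitted as a single line beginning with the `[CI_ANALYSIS_SUMMARY]` marker, and the inner field name `jobName` is a hypothetical example; adjust to your script version's actual output):

```powershell
# Pull the JSON summary line out of a saved script run and parse it once.
$line = Select-String -Path ci-output.txt -Pattern '\[CI_ANALYSIS_SUMMARY\]' |
    Select-Object -First 1
$summary = ($line.Line -replace '^.*?\[CI_ANALYSIS_SUMMARY\]\s*', '') | ConvertFrom-Json

# Work from the parsed fields instead of re-querying AzDO/Helix.
$summary.failedJobDetails | ForEach-Object { $_.jobName }   # jobName is illustrative
$summary.knownIssues
```
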
> 🚨 **Check build progression on multi-commit PRs.** If the PR has multiple commits, query AzDO for builds on `refs/pull/{PR}/merge` (sorted by queue time, top 10-20) — `gh pr checks` only shows the latest SHA. Present a progression table showing which builds passed/failed at which SHAs. This narrows failures to the commit that introduced them. See [references/build-progression-analysis.md](references/build-progression-analysis.md).

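One way to query that progression without MCP tools, using the standard AzDO Builds REST API (org, project, and PR number are illustrative):

```powershell
# List recent builds for a PR's merge branch, newest first (AzDO Builds API, version 7.0).
$org = 'dnceng-public'; $project = 'public'; $pr = 12345
$uri = "https://dev.azure.com/$org/$project/_apis/build/builds" +
       "?branchName=refs/pull/$pr/merge&queryOrder=queueTimeDescending&`$top=20&api-version=7.0"
(Invoke-RestMethod $uri).value |
    Select-Object id, status, result, sourceVersion, queueTime |
    Format-Table
```

`sourceVersion` is the SHA each build ran against, which is what the progression table needs.
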
Then follow the detailed workflow in [references/analysis-workflow.md](references/analysis-workflow.md). Key principles:

1. **Cross-reference failures with known issues** — The script outputs `failedJobDetails` and `knownIssues` as separate lists. You must explicitly match each failure to a known issue (by error message, test name, or job type) or mark it **unmatched**. Don't present them as two independent lists — the user needs a per-failure verdict.
2. **Check Build Analysis status** — Green = all failures matched known issues. Red = some unmatched. Never claim "all known issues" when Build Analysis is red.
3. **Correlate with PR changes** — same files failing = likely PR-related.
4. **Verify before claiming** — don't call it "infrastructure" without Build Analysis match or target-branch verification. Don't call it "safe to retry" unless ALL failures are accounted for.

For interpreting error categories, crash recovery, and canceled jobs: [references/failure-interpretation.md](references/failure-interpretation.md)

For generating recommendations from `[CI_ANALYSIS_SUMMARY]` JSON: [references/recommendation-generation.md](references/recommendation-generation.md)

## Presenting Results

> 🚨 **Keep tables narrow — 4 short columns max (# | Job | Verdict | Issue).** Put error descriptions, work item lists, and evidence in **detail bullets below the table**, not in cells. Wide tables wrap and become unreadable in terminals.

> 🚨 **Use markdown links** for PRs (`[#121195](url)`), builds (`[Build 1305302](url)`), and jobs (`[job name](azdo-job-url)`). The script output and MCP tools provide URLs — thread them through.

Lead with a 1-2 sentence verdict, then the summary table, then detail bullets (one per failure), then recommended actions. For the full format example: [references/recommendation-generation.md](references/recommendation-generation.md).

## Anti-Patterns

> 🚨 **Every failure verdict needs evidence — no "Likely flaky" without proof.** Each row in your summary table must cite a specific source: known issue number, Build Analysis match, or target-branch verification. If Build Analysis didn't match it and you haven't verified the target branch, the verdict is **"Unmatched — needs investigation"**, not "Likely flaky." A test that *looks* like it could be flaky is not the same as one you've *verified* is flaky.

> **Don't label failures "infrastructure" without evidence.** Requires: Build Analysis match, identical failure on target branch, or confirmed outage. Exception: `tests-passed-reporter-failed` is genuinely infrastructure.

> **Don't dismiss timed-out builds.** A build "failed" due to AzDO timeout can have 100% passing Helix work items. Check Helix job status before concluding failure.

> **Missing packages on flow PRs ≠ infrastructure.** Flow PRs request *different* packages. Check *which* package and *why* before assuming feed delay.

> **Don't present failures and known issues as separate lists.** Cross-reference them: for each `failedJobDetails` entry, state whether it matches a `knownIssues` entry or is unmatched. An `unclassified` failure can still match a known issue by error pattern.

> **Don't say "safe to retry" with Build Analysis red.** Map each failing job to a specific known issue first.

> **Don't use `Invoke-RestMethod` or `curl` for AzDO/Helix when MCP tools are available.** Check your available tools for Azure DevOps and Helix operations first. REST API fallback is for when MCP tools are genuinely unavailable, not a first resort.

## References

- **Script modes & parameters**: [references/script-modes.md](references/script-modes.md)
- **Failure interpretation**: [references/failure-interpretation.md](references/failure-interpretation.md)
- **Recommendation generation**: [references/recommendation-generation.md](references/recommendation-generation.md)
- **Analysis workflow (Steps 1–3)**: [references/analysis-workflow.md](references/analysis-workflow.md)
- **Helix artifacts & binlogs**: [references/helix-artifacts.md](references/helix-artifacts.md)
- **Binlog comparison**: [references/binlog-comparison.md](references/binlog-comparison.md)
- **Build progression analysis**: [references/build-progression-analysis.md](references/build-progression-analysis.md)
- **Subagent delegation**: [references/delegation-patterns.md](references/delegation-patterns.md)
- **Azure CLI investigation**: [references/azure-cli.md](references/azure-cli.md)
- **Manual investigation**: [references/manual-investigation.md](references/manual-investigation.md)
- **SQL tracking**: [references/sql-tracking.md](references/sql-tracking.md)
- **AzDO/Helix details**: [references/azdo-helix-reference.md](references/azdo-helix-reference.md)

## Tips

1. Check whether the same test fails on the target branch before assuming it's transient
2. Look for `[ActiveIssue]` attributes for known skipped tests
3. Use `-SearchMihuBot` for semantic search of related issues
4. `gh pr checks --json` fields: `bucket`, `completedAt`, `description`, `event`, `link`, `name`, `startedAt`, `state`, `workflow`. Note that `state` holds `SUCCESS`/`FAILURE` directly (there is no `conclusion` field)
5. "Canceled" ≠ "Failed" — canceled jobs may have recoverable Helix results. Helix data may persist even when AzDO builds have expired — query Helix directly if you have job IDs.

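Tip 4 in practice, assuming a `gh` version whose `pr checks` supports `--json`/`--jq` (the PR number is illustrative; field names are those listed above):

```bash
# List only the failing checks for a PR, with links (filter on state, not conclusion).
gh pr checks 12345 --repo dotnet/runtime \
  --json name,state,link \
  --jq '.[] | select(.state == "FAILURE") | "\(.name)\t\(.link)"'
```
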
# Analysis Workflow (Steps 1–3)

After completing Step 0 (Gather Context — see SKILL.md), follow these steps.

## Step 1: Run the Script

Run with `-ShowLogs` for detailed failure info. See [script-modes.md](script-modes.md) for parameter details.

## Step 1b: Investigate with AzDO Tools

When the script output is insufficient (e.g., build timeline fetch fails), use AzDO MCP tools to query builds directly. **Match the org from the build URL to the correct AzDO tools** — see [azdo-helix-reference.md](azdo-helix-reference.md#azure-devops-organizations). PR builds are in `dnceng-public`; internal builds are in `dnceng`.

## Step 2: Analyze Results

1. **Check Build Analysis** — If the Build Analysis GitHub check is **green**, all failures matched known issues and it's safe to retry. If it's **red**, some failures are unaccounted for — you must identify which failing jobs are covered by known issues and which are not. For 3+ failures, use SQL tracking to avoid missed matches (see [sql-tracking.md](sql-tracking.md)).
2. **Correlate with PR changes** — Same files failing = likely PR-related
3. **Compare with baseline** — If a test passes on the target branch but fails on the PR, compare Helix binlogs. See [binlog-comparison.md](binlog-comparison.md). **Delegate binlog download/extraction to subagents** to avoid burning context on mechanical work.
4. **Check build progression** — If the PR has multiple builds (multiple pushes), check whether earlier builds passed. A failure that appeared after a specific push narrows the investigation to those commits. See [build-progression-analysis.md](build-progression-analysis.md). Present findings as facts, not fix recommendations.
5. **Interpret patterns** (but don't jump to conclusions):
   - Same error across many jobs → Real code issue
   - Build Analysis flags a known issue → That *specific failure* is safe to retry (but others may not be)
   - Failure is **not** in Build Analysis → Investigate further before assuming transient
   - Device failures, Docker pulls, network timeouts → *Could* be infrastructure, but verify against the target branch first
   - Test timeout but tests passed → Executor issue, not test failure
6. **Check for mismatch with user's question** — The script only reports builds for the current head SHA. If the user asks about a job, error, or cancellation that doesn't appear in the results, **ask** if they're referring to a prior build. Common triggers:
   - User mentions a canceled job but `canceledJobNames` is empty
   - User says "CI is failing" but the latest build is green
   - User references a specific job name not in the current results

   Offer to re-run with `-BuildId` if the user can provide the earlier build ID from AzDO.

## Step 3: Verify Before Claiming

Before stating a failure's cause, verify your claim:

- **"Infrastructure failure"** → Did Build Analysis flag it? Does the same test pass on the target branch? If neither, don't call it infrastructure.
- **"Transient/flaky"** → Has it failed before? Is there a known issue? A single non-reproducing failure isn't enough to call it flaky.
- **"PR-related"** → Do the changed files actually relate to the failing test? Correlation in the script output is heuristic, not proof.
- **"Safe to retry"** → Are ALL failures accounted for (known issues or infrastructure), or are you ignoring some? Check the Build Analysis check status — if it's red, not all failures are matched. Map each failing job to a specific known issue before concluding "safe to retry."
- **"Not related to this PR"** → Have you checked if the test passes on the target branch? Don't assume — verify.

# Azure DevOps and Helix Reference

## Supported Repositories

The script works with any dotnet repository that uses Azure DevOps and Helix:

| Repository | Common Pipelines |
|------------|-----------------|
| `dotnet/runtime` | runtime, runtime-dev-innerloop, dotnet-linker-tests |
| `dotnet/sdk` | dotnet-sdk (mix of local and Helix tests) |
| `dotnet/aspnetcore` | aspnetcore-ci |
| `dotnet/roslyn` | roslyn-CI |
| `dotnet/maui` | maui-public |

Use `-Repository` to specify the target:

```powershell
./scripts/Get-CIStatus.ps1 -PRNumber 12345 -Repository "dotnet/aspnetcore"
```

## Build Definition IDs (Example: dotnet/runtime)

Each repository has its own build definition IDs. Here are common ones for dotnet/runtime:

| Definition ID | Name | Description |
|---------------|------|-------------|
| `129` | runtime | Main PR validation build |
| `133` | runtime-dev-innerloop | Fast innerloop validation |
| `139` | dotnet-linker-tests | ILLinker/trimming tests |

**Note:** The script auto-discovers builds for a PR, so you rarely need to know definition IDs.

## Azure DevOps Organizations

Builds live in two Azure DevOps organizations. **Determine the org from the build URL**, then use AzDO tools for that org:

| Organization | URL pattern | Project | When used |
|---|---|---|---|
| `dnceng-public` | `dev.azure.com/dnceng-public/...` | `public` (GUID: `cbb18261-c48f-4abb-8651-8cdcb5474649`) | PR validation builds (most common) |
| `dnceng` | `dev.azure.com/dnceng/...` | `internal` or varies | Official/internal builds, signed builds |

**How to pick the right tools:** Look at the build URL from `gh pr checks` output or the script's `[CI_ANALYSIS_SUMMARY]`. The URL contains the org name (e.g., `dev.azure.com/dnceng-public/...`). Search your available AzDO tools for ones matching that org name — there may be multiple sets of AzDO tools for different organizations. If a query returns null, you're likely using tools for the wrong org.

> ⚠️ **Common mistake:** Wrong org = null results. PR validation builds are almost always in `dnceng-public`. If you get null, check the org before trying other approaches.

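A minimal sketch of pulling the org and project out of a build URL before choosing a tool set (the URL is illustrative):

```powershell
# Extract the AzDO org and project from a build URL to pick the matching tools.
$buildUrl = 'https://dev.azure.com/dnceng-public/public/_build/results?buildId=1276327'
if ($buildUrl -match 'dev\.azure\.com/([^/]+)/([^/]+)/') {
    $org, $project = $Matches[1], $Matches[2]
    "Org: $org, Project: $project"   # Org: dnceng-public, Project: public
}
```
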
Override with:

```powershell
./scripts/Get-CIStatus.ps1 -BuildId 1276327 -Organization "dnceng" -Project "internal-project-guid"
```

## Common Pipeline Names (Example: dotnet/runtime)

| Pipeline | Description |
|----------|-------------|
| `runtime` | Main PR validation build |
| `runtime-dev-innerloop` | Fast innerloop validation |
| `dotnet-linker-tests` | ILLinker/trimming tests |
| `runtime-wasm-perf` | WASM performance tests |
| `runtime-libraries enterprise-linux` | Enterprise Linux compatibility |

Other repos have different pipelines; the script discovers them automatically from the PR.

## Useful Links

- [Helix Portal](https://helix.dot.net/): View Helix jobs and work items (all repos)
- [Helix API Documentation](https://helix.dot.net/swagger/): Swagger docs for Helix REST API
- [Build Analysis](https://github.com/dotnet/arcade/blob/main/Documentation/Projects/Build%20Analysis/LandingPage.md): Known issues tracking (arcade infrastructure)
- [dnceng-public AzDO](https://dev.azure.com/dnceng-public/public/_build): Public builds for all dotnet repos

### Repository-specific docs

- [runtime: Triaging Failures](https://github.com/dotnet/runtime/blob/main/docs/workflow/ci/triaging-failures.md)
- [runtime: Area Owners](https://github.com/dotnet/runtime/blob/main/docs/area-owners.md)

## Test Execution Types

### Helix Tests

Tests run on the Helix distributed test infrastructure. The script extracts console log URLs and can fetch detailed failure info with `-ShowLogs`.

### Local Tests (Non-Helix)

Some repositories (e.g., dotnet/sdk) run tests directly on the build agent. The script detects these and extracts Azure DevOps Test Run URLs.

## Known Issue Labels

- `Known Build Error` - Used by Build Analysis across all dotnet repositories
- Search syntax: `repo:<owner>/<repo> is:issue is:open label:"Known Build Error" <test-name>`

Example searches (use `search_issues` when GitHub MCP is available, `gh` CLI otherwise):

```bash
# Search in runtime
gh issue list --repo dotnet/runtime --label "Known Build Error" --search "FileSystemWatcher"

# Search in aspnetcore
gh issue list --repo dotnet/aspnetcore --label "Known Build Error" --search "Blazor"

# Search in sdk
gh issue list --repo dotnet/sdk --label "Known Build Error" --search "template"
```
