dotnet
diff --git a/plugins/dotnet-dnceng/skills/ci-analysis/SKILL.md b/plugins/dotnet-dnceng/skills/ci-analysis/SKILL.md
Lines changed: 138 additions & 0 deletions
diff --git a/plugins/dotnet-dnceng/skills/ci-analysis/references/analysis-workflow.md b/plugins/dotnet-dnceng/skills/ci-analysis/references/analysis-workflow.md
Lines changed: 39 additions & 0 deletions
diff --git a/plugins/dotnet-dnceng/skills/ci-analysis/references/azdo-helix-reference.md b/plugins/dotnet-dnceng/skills/ci-analysis/references/azdo-helix-reference.md
Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,138 @@
+---
+name: ci-analysis
+description: >
+  Analyze CI build and test status from Azure DevOps and Helix for dotnet repository PRs.
+  Use when checking CI status, investigating failures, determining if a PR is ready to merge,
+  or given URLs containing dev.azure.com or helix.dot.net. Also use when asked "why is CI red",
+  "test failures", "retry CI", "rerun tests", "is CI green", "build failed", "checks failing",
+  or "flaky tests". DO NOT USE FOR: investigating stale codeflow PRs or dependency update health,
+  tracing whether a commit has flowed from one repo to another, or reviewing code changes for
+  correctness or style.
+---
+
+# Azure DevOps and Helix CI Analysis
+
+Analyze CI build status and test failures in Azure DevOps and Helix for dotnet repositories (runtime, sdk, aspnetcore, roslyn, and more).
+
+> 🚨 **NEVER** use `gh pr review --approve` or `--request-changes`. Only `--comment` is allowed. Approval and blocking are human-only actions.
+
+**Workflow**: Gather PR context (Step 0) → run the script → read human-readable output + `[CI_ANALYSIS_SUMMARY]` JSON → synthesize recommendations. The script collects data; you generate the advice. MCP tools (AzDO, Helix, GitHub) provide supplementary access when available; the script and `gh` CLI work independently when they're not.
+
+**Accessing services**: There are several possible methods to access each service (AzDO, Helix, GitHub). Start with MCP tools, then fall back to CLI (`gh` for GitHub, `Invoke-RestMethod` for AzDO/Helix REST APIs). Explore all available options before determining you don't have access. For AzDO, multiple tool sets may exist for different organizations — match the org in the build URL to the correct tools (see [references/azdo-helix-reference.md](references/azdo-helix-reference.md#azure-devops-organizations)). If queries return null, check the org before trying other approaches. For complex investigations, track what you've tried in SQL to avoid repeating failed approaches.
+
+## When to Use This Skill
+
+- Checking CI status ("is CI passing?", "why is CI red?")
+- Investigating CI failures or determining merge readiness
+- Debugging Helix test issues or build errors
+- URLs containing `dev.azure.com`, `helix.dot.net`, or GitHub PR links with failing checks
+- Questions like "retry CI", "rerun tests", "test failures", "checks failing"
+- Investigating canceled or timed-out jobs for recoverable results
+
+**Not for**: GitHub Actions workflows, non-Helix repos, or build performance (use binlog analysis).
+
+> 💡 **Per-repo CI patterns differ significantly.** Each dotnet repo structures test results differently (TRX availability, console log patterns, work item naming). Before investigating a repo you haven't seen before, check the empirical profiles in `dnceng-knowledge/ci-repo-profiles` — they document the fastest investigation path per repo and prevent wasted MCP calls.
+
+## Quick Start
+
+```powershell
+# Analyze PR failures (most common) - defaults to dotnet/runtime
+./scripts/Get-CIStatus.ps1 -PRNumber 123445 -ShowLogs
+
+# Analyze by build ID
+./scripts/Get-CIStatus.ps1 -BuildId 1276327 -ShowLogs
+
+# Query specific Helix work item
+./scripts/Get-CIStatus.ps1 -HelixJob "4b24b2c2-..." -WorkItem "System.Net.Http.Tests"
+
+# Other dotnet repositories
+./scripts/Get-CIStatus.ps1 -PRNumber 12345 -Repository "dotnet/aspnetcore"
+```
+
+For full parameter reference and mode details, see [references/script-modes.md](references/script-modes.md).
+
+## Step 0: Gather Context (before running anything)
+
+Context changes how you interpret every failure. **Don't skip this.**
+
+1. **Read PR metadata** — title, description, author, labels, linked issues
+2. **Classify the PR type**:
+
+| PR Type | How to detect | Interpretation shift |
+|---------|--------------|---------------------|
+| **Code PR** | Human author, code changes | Failures likely relate to the changes |
+| **Flow/Codeflow PR** | Author is `dotnet-maestro[bot]`, "Update dependencies" | Missing packages may be behavioral, not infrastructure |
+| **Backport** | Title mentions "backport", targets release branch | Check if test exists on target branch |
+| **Merge PR** | Merging between branches | Conflicts cause failures, not individual changes |
+| **Dependency update** | Bumps package versions, global.json | Build failures often trace to the dependency |
+
+3. **Check existing comments** — has someone already diagnosed failures or is a retry pending?
+4. **Note the changed files** — you'll use these for correlation after the script runs
+
+## After the Script: Use Its Output
+
+> 🚨 **The script already collected the data. Do NOT re-query AzDO or Helix for information the script already produced.** Parse the `[CI_ANALYSIS_SUMMARY]` JSON and the human-readable output first. Only make additional API calls for data the script *didn't* provide (e.g., deeper Helix log searches, binlog analysis, build progression).
+
+**If the script found no builds** (e.g., AzDO builds expired, CI not triggered): report this to the user immediately. Don't spend turns re-querying AzDO with different org/project combinations — if the script couldn't find builds, they're likely unavailable. Offer alternatives: analyze by build ID if the user has one, check GitHub PR status for summary info, or note that Helix results may still be queryable directly even when AzDO builds have expired.
+
+**If the script succeeded**: the `[CI_ANALYSIS_SUMMARY]` JSON contains `failedJobDetails`, `knownIssues`, `canceledJobNames`, `prCorrelation`, and `recommendationHint`. Use these fields — don't re-fetch the same data via MCP tools or REST APIs. To find specific details in large output, use `Select-String` or `grep` on the output file rather than re-running the script.
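+
+A minimal sketch of that last step, not part of the script itself: the helper below prints the summary marker and the lines that follow it from a saved output file. The function name `show_ci_summary`, the file name `ci-output.txt`, and the assumption that the JSON directly follows the marker line are illustrative only; adjust the context length to the real output.
+
+```bash
+# Hypothetical helper: show the [CI_ANALYSIS_SUMMARY] marker and the lines after it.
+# Assumes the script output was saved to a file and the JSON follows the marker.
+show_ci_summary() {
+  local output_file="$1"
+  local context_lines="${2:-40}"
+  # -F treats the marker as a fixed string, so the brackets are not a character class.
+  grep -F -A "$context_lines" '[CI_ANALYSIS_SUMMARY]' "$output_file"
+}
+
+# Usage, assuming the output was saved earlier:
+#   show_ci_summary ci-output.txt 60
+```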
+
+> 🚨 **Check build progression on multi-commit PRs.** If the PR has multiple commits, query AzDO for builds on `refs/pull/{PR}/merge` (sorted by queue time, top 10-20) — `gh pr checks` only shows the latest SHA. Present a progression table showing which builds passed/failed at which SHAs. This narrows failures to the commit that introduced them. See [references/build-progression-analysis.md](references/build-progression-analysis.md).
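+
+When MCP tools are genuinely unavailable, the AzDO Builds REST API can list these builds. The sketch below only constructs the request URL. It assumes the `branchName`, `queryOrder`, and `$top` query parameters of the Builds List endpoint and `api-version=7.1`; the function name `azdo_pr_builds_url` is hypothetical.
+
+```bash
+# Hypothetical helper: build the AzDO REST URL that lists builds for a PR merge ref,
+# newest queue time first. Arguments: org, project, PR number, optional result count.
+azdo_pr_builds_url() {
+  local org="$1" project="$2" pr="$3" top="${4:-20}"
+  # Single quotes keep "$top" literal; it is the API's parameter name, not a shell variable.
+  printf 'https://dev.azure.com/%s/%s/_apis/build/builds?branchName=refs/pull/%s/merge&queryOrder=queueTimeDescending&$top=%s&api-version=7.1\n' \
+    "$org" "$project" "$pr" "$top"
+}
+
+# Usage (REST fallback only):
+#   curl -s "$(azdo_pr_builds_url dnceng-public public 121195 10)"
+```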
+
+Then follow the detailed workflow in [references/analysis-workflow.md](references/analysis-workflow.md). Key principles:
+
+1. **Cross-reference failures with known issues** — The script outputs `failedJobDetails` and `knownIssues` as separate lists. You must explicitly match each failure to a known issue (by error message, test name, or job type) or mark it **unmatched**. Don't present them as two independent lists — the user needs a per-failure verdict.
+2. **Check Build Analysis status** — Green = all failures matched known issues. Red = some unmatched. Never claim "all known issues" when Build Analysis is red.
+3. **Correlate with PR changes** — same files failing = likely PR-related.
+4. **Verify before claiming** — don't call it "infrastructure" without Build Analysis match or target-branch verification. Don't call it "safe to retry" unless ALL failures are accounted for.
+
+For interpreting error categories, crash recovery, and canceled jobs: [references/failure-interpretation.md](references/failure-interpretation.md)
+
+For generating recommendations from `[CI_ANALYSIS_SUMMARY]` JSON: [references/recommendation-generation.md](references/recommendation-generation.md)
+
+## Presenting Results
+
+> 🚨 **Keep tables narrow — 4 short columns max (# | Job | Verdict | Issue).** Put error descriptions, work item lists, and evidence in **detail bullets below the table**, not in cells. Wide tables wrap and become unreadable in terminals.
+
+> 🚨 **Use markdown links** for PRs (`[#121195](url)`), builds (`[Build 1305302](url)`), and jobs (`[job name](azdo-job-url)`). The script output and MCP tools provide URLs — thread them through.
+
+Lead with a 1-2 sentence verdict, then the summary table, then detail bullets (one per failure), then recommended actions. For the full format example: [references/recommendation-generation.md](references/recommendation-generation.md).
+
+## Anti-Patterns
+
+> 🚨 **Every failure verdict needs evidence — no "Likely flaky" without proof.** Each row in your summary table must cite a specific source: known issue number, Build Analysis match, or target-branch verification. If Build Analysis didn't match it and you haven't verified the target branch, the verdict is **"Unmatched — needs investigation"**, not "Likely flaky." A test that *looks* like it could be flaky is not the same as one you've *verified* is flaky.
+
+> ❌ **Don't label failures "infrastructure" without evidence.** Requires: Build Analysis match, identical failure on target branch, or confirmed outage. Exception: `tests-passed-reporter-failed` is genuinely infrastructure.
+
+> ❌ **Don't dismiss timed-out builds.** A build "failed" due to AzDO timeout can have 100% passing Helix work items. Check Helix job status before concluding failure.
+
+> ❌ **Missing packages on flow PRs ≠ infrastructure.** Flow PRs request *different* packages. Check *which* package and *why* before assuming feed delay.
+
+> ❌ **Don't present failures and known issues as separate lists.** Cross-reference them: for each `failedJobDetails` entry, state whether it matches a `knownIssues` entry or is unmatched. An `unclassified` failure can still match a known issue by error pattern.
+
+> ❌ **Don't say "safe to retry" with Build Analysis red.** Map each failing job to a specific known issue first.
+
+> ❌ **Don't use `Invoke-RestMethod` or `curl` for AzDO/Helix when MCP tools are available.** Check your available tools for Azure DevOps and Helix operations first. REST API fallback is for when MCP tools are genuinely unavailable, not a first resort.
+
+## References
+
+- **Script modes & parameters**: [references/script-modes.md](references/script-modes.md)
+- **Failure interpretation**: [references/failure-interpretation.md](references/failure-interpretation.md)
+- **Recommendation generation**: [references/recommendation-generation.md](references/recommendation-generation.md)
+- **Analysis workflow (Steps 1–3)**: [references/analysis-workflow.md](references/analysis-workflow.md)
+- **Helix artifacts & binlogs**: [references/helix-artifacts.md](references/helix-artifacts.md)
+- **Binlog comparison**: [references/binlog-comparison.md](references/binlog-comparison.md)
+- **Build progression analysis**: [references/build-progression-analysis.md](references/build-progression-analysis.md)
+- **Subagent delegation**: [references/delegation-patterns.md](references/delegation-patterns.md)
+- **Azure CLI investigation**: [references/azure-cli.md](references/azure-cli.md)
+- **Manual investigation**: [references/manual-investigation.md](references/manual-investigation.md)
+- **SQL tracking**: [references/sql-tracking.md](references/sql-tracking.md)
+- **AzDO/Helix details**: [references/azdo-helix-reference.md](references/azdo-helix-reference.md)
+
+## Tips
+
+1. Check whether the same test fails on the target branch before assuming a failure is transient
+2. Look for `[ActiveIssue]` attributes for known skipped tests
+3. Use `-SearchMihuBot` for semantic search of related issues
+4. `gh pr checks --json` fields: `bucket`, `completedAt`, `description`, `event`, `link`, `name`, `startedAt`, `state`, `workflow` — `state` has `SUCCESS`/`FAILURE` directly (no `conclusion` field)
+5. "Canceled" ≠ "Failed" — canceled jobs may have recoverable Helix results. Helix data may persist even when AzDO builds have expired — query Helix directly if you have job IDs.
@@ -0,0 +1,39 @@
+# Analysis Workflow (Steps 1–3)
+
+After completing Step 0 (Gather Context — see SKILL.md), follow these steps.
+
+## Step 1: Run the Script
+
+Run with `-ShowLogs` for detailed failure info. See [script-modes.md](script-modes.md) for parameter details.
+
+## Step 1b: Investigate with AzDO Tools
+
+When the script output is insufficient (e.g., build timeline fetch fails), use AzDO MCP tools to query builds directly. **Match the org from the build URL to the correct AzDO tools** — see [azdo-helix-reference.md](azdo-helix-reference.md#azure-devops-organizations). PR builds are in `dnceng-public`; internal builds are in `dnceng`.
+
+## Step 2: Analyze Results
+
+1. **Check Build Analysis** — If the Build Analysis GitHub check is **green**, all failures matched known issues and it's safe to retry. If it's **red**, some failures are unaccounted for — you must identify which failing jobs are covered by known issues and which are not. For 3+ failures, use SQL tracking to avoid missed matches (see [sql-tracking.md](sql-tracking.md)).
+2. **Correlate with PR changes** — Same files failing = likely PR-related
+3. **Compare with baseline** — If a test passes on the target branch but fails on the PR, compare Helix binlogs. See [binlog-comparison.md](binlog-comparison.md) — **delegate binlog download/extraction to subagents** to avoid burning context on mechanical work.
+4. **Check build progression** — If the PR has multiple builds (multiple pushes), check whether earlier builds passed. A failure that appeared after a specific push narrows the investigation to those commits. See [build-progression-analysis.md](build-progression-analysis.md). Present findings as facts, not fix recommendations.
+5. **Interpret patterns** (but don't jump to conclusions):
+   - Same error across many jobs → Likely a real code issue
+   - Build Analysis flags a known issue → That *specific failure* is safe to retry (but others may not be)
+   - Failure is **not** in Build Analysis → Investigate further before assuming transient
+   - Device failures, Docker pulls, network timeouts → *Could* be infrastructure, but verify against the target branch first
+   - Test timeout but tests passed → Executor issue, not test failure
+6. **Check for mismatch with user's question** — The script only reports builds for the current head SHA. If the user asks about a job, error, or cancellation that doesn't appear in the results, **ask** if they're referring to a prior build. Common triggers:
+   - User mentions a canceled job but `canceledJobNames` is empty
+   - User says "CI is failing" but the latest build is green
+   - User references a specific job name not in the current results
+
+   Offer to re-run with `-BuildId` if the user can provide the earlier build ID from AzDO.
+
+## Step 3: Verify Before Claiming
+
+Before stating a failure's cause, verify your claim:
+
+- **"Infrastructure failure"** → Did Build Analysis flag it? Does the same failure also occur on the target branch? If neither, don't call it infrastructure.
+- **"Transient/flaky"** → Has it failed before? Is there a known issue? A single non-reproducing failure isn't enough to call it flaky.
+- **"PR-related"** → Do the changed files actually relate to the failing test? Correlation in the script output is heuristic, not proof.
+- **"Safe to retry"** → Are ALL failures accounted for (known issues or infrastructure), or are you ignoring some? Check the Build Analysis check status — if it's red, not all failures are matched. Map each failing job to a specific known issue before concluding "safe to retry."
+- **"Not related to this PR"** → Have you checked if the test passes on the target branch? Don't assume — verify.
@@ -0,0 +1,96 @@
+# Azure DevOps and Helix Reference
+
+## Supported Repositories
+
+The script works with any dotnet repository that uses Azure DevOps and Helix:
+
+| Repository | Common Pipelines |
+|------------|-----------------|
+| `dotnet/runtime` | runtime, runtime-dev-innerloop, dotnet-linker-tests |
+| `dotnet/sdk` | dotnet-sdk (mix of local and Helix tests) |
+| `dotnet/aspnetcore` | aspnetcore-ci |
+| `dotnet/roslyn` | roslyn-CI |
+| `dotnet/maui` | maui-public |
+
+Use `-Repository` to specify the target:
+```powershell
+./scripts/Get-CIStatus.ps1 -PRNumber 12345 -Repository "dotnet/aspnetcore"
+```
+
+## Build Definition IDs (Example: dotnet/runtime)
+
+Each repository has its own build definition IDs. Here are common ones for dotnet/runtime:
+
+| Definition ID | Name | Description |
+|---------------|------|-------------|
+| `129` | runtime | Main PR validation build |
+| `133` | runtime-dev-innerloop | Fast innerloop validation |
+| `139` | dotnet-linker-tests | ILLinker/trimming tests |
+
+**Note:** The script auto-discovers builds for a PR, so you rarely need to know definition IDs.
+
+## Azure DevOps Organizations
+
+Builds live in two Azure DevOps organizations. **Determine the org from the build URL**, then use AzDO tools for that org:
+
+| Organization | URL pattern | Project | When used |
+|---|---|---|---|
+| `dnceng-public` | `dev.azure.com/dnceng-public/...` | `public` (GUID: `cbb18261-c48f-4abb-8651-8cdcb5474649`) | PR validation builds (most common) |
+| `dnceng` | `dev.azure.com/dnceng/...` | `internal` or varies | Official/internal builds, signed builds |
+
+**How to pick the right tools:** Look at the build URL from `gh pr checks` output or the script's `[CI_ANALYSIS_SUMMARY]`. The URL contains the org name (e.g., `dev.azure.com/dnceng-public/...`). Search your available AzDO tools for ones matching that org name — there may be multiple sets of AzDO tools for different organizations. If a query returns null, you're likely using tools for the wrong org.
+
+> ⚠️ **Common mistake:** Wrong org = null results. PR validation builds are almost always in `dnceng-public`. If you get null, check the org before trying other approaches.
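+
+Reading the org out of a build URL can be sketched as below. This is an illustration only: the function name `azdo_org_from_url` is hypothetical, and it assumes the `dev.azure.com/<org>/...` URL shape shown in the table above.
+
+```bash
+# Hypothetical helper: print the AzDO organization from a dev.azure.com build URL.
+azdo_org_from_url() {
+  local url="$1"
+  # Strip everything up to and including the host.
+  local rest="${url#*dev.azure.com/}"
+  if [ "$rest" = "$url" ]; then
+    echo "not a dev.azure.com URL: $url" >&2
+    return 1
+  fi
+  # Keep only the first path segment, which is the organization.
+  echo "${rest%%/*}"
+}
+
+# Usage:
+#   azdo_org_from_url 'https://dev.azure.com/dnceng-public/public/_build/results?buildId=1276327'
+```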
+
+Override with:
+```powershell
+./scripts/Get-CIStatus.ps1 -BuildId 1276327 -Organization "dnceng" -Project "internal-project-guid"
+```
+
+## Common Pipeline Names (Example: dotnet/runtime)
+
+| Pipeline | Description |
+|----------|-------------|
+| `runtime` | Main PR validation build |
+| `runtime-dev-innerloop` | Fast innerloop validation |
+| `dotnet-linker-tests` | ILLinker/trimming tests |
+| `runtime-wasm-perf` | WASM performance tests |
+| `runtime-libraries enterprise-linux` | Enterprise Linux compatibility |
+
+Other repos have different pipelines - the script discovers them automatically from the PR.
+
+## Useful Links
+
+- [Helix Portal](https://helix.dot.net/): View Helix jobs and work items (all repos)
+- [Helix API Documentation](https://helix.dot.net/swagger/): Swagger docs for Helix REST API
+- [Build Analysis](https://github.com/dotnet/arcade/blob/main/Documentation/Projects/Build%20Analysis/LandingPage.md): Known issues tracking (arcade infrastructure)
+- [dnceng-public AzDO](https://dev.azure.com/dnceng-public/public/_build): Public builds for all dotnet repos
+
+### Repository-specific docs:
+- [runtime: Triaging Failures](https://github.com/dotnet/runtime/blob/main/docs/workflow/ci/triaging-failures.md)
+- [runtime: Area Owners](https://github.com/dotnet/runtime/blob/main/docs/area-owners.md)
+
+## Test Execution Types
+
+### Helix Tests
+Tests run on Helix distributed test infrastructure. The script extracts console log URLs and can fetch detailed failure info with `-ShowLogs`.
+
+### Local Tests (Non-Helix)
+Some repositories (e.g., dotnet/sdk) run tests directly on the build agent. The script detects these and extracts Azure DevOps Test Run URLs.
+
+## Known Issue Labels
+
+- `Known Build Error` - Used by Build Analysis across all dotnet repositories
+- Search syntax: `repo:<owner>/<repo> is:issue is:open label:"Known Build Error" <test-name>`
+
+Example searches (use `search_issues` when GitHub MCP is available, `gh` CLI otherwise):
+```bash
+# Search in runtime
+gh issue list --repo dotnet/runtime --label "Known Build Error" --search "FileSystemWatcher"
+
+# Search in aspnetcore
+gh issue list --repo dotnet/aspnetcore --label "Known Build Error" --search "Blazor"
+
+# Search in sdk
+gh issue list --repo dotnet/sdk --label "Known Build Error" --search "template"
+```