[CI] Extend gate to all test types and decouple from PR review #34705

Open
kubaflo wants to merge 45 commits into main from copilot-kubaflo

@kubaflo kubaflo commented Mar 27, 2026

Summary

Extends the CI PR review pipeline to support all test types (UI tests, device tests, unit tests, XAML tests) and restructures the review flow by decoupling the gate from the copilot agent.

Before

  • Gate only supported UI tests (TestCases.HostApp / TestCases.Shared.Tests)
  • PRs with device tests, unit tests, or XAML tests were skipped by the gate
  • Gate ran as Phase 2 inside the copilot agent (4-phase: Pre-Flight → Gate → Try-Fix → Report)
  • Gate results were duplicated across all phase outputs
  • AI summary comment included session history merging (841 lines of code)

After

  • Gate supports all test types with auto-detection
  • Gate runs as a standalone script step before the copilot agent
  • Gate posts its own separate PR comment (<!-- AI Gate -->)
  • AI summary is simplified (170 lines, always overwrites, no session history)
  • PR review is now 3 phases: Pre-Flight → Try-Fix → Report

New Scripts

| Script | Purpose |
|--------|---------|
| Detect-TestsInDiff.ps1 | Analyzes PR files, classifies tests by type (UITest, DeviceTest, UnitTest, XamlUnitTest), extracts method names from diffs |
| post-gate-comment.ps1 | Posts/updates gate result as separate PR comment |
| RunTests.ps1 | Unified test runner entry point for all test types |

Test Detection

pwsh .github/scripts/shared/Detect-TestsInDiff.ps1 -PRNumber 25129
📱 [DeviceTest] EditorTests (PlaceholderHorizontalTextAlignment)
   Filter:  Category=Editor
🖥️ [UITest] Issue10987
   Filter:  Issue10987

New Review Flow

Step 0: Branch setup
Step 1: Gate (verify-tests-fail.ps1 — direct script, no copilot agent)
         → Posts <!-- AI Gate --> comment immediately
Step 2: PR Review (copilot agent — 3 phases: Pre-Flight, Try-Fix, Report)
         → Gate result passed in prompt
Step 3: Post AI Summary (<!-- AI Summary --> comment)
Step 4: Apply labels

PR Comments (Two Separate Comments)

Gate comment (<!-- AI Gate -->):

## 🚦 Gate — Test Verification
► Expand Full Gate — abc1234 · Fix editor alignment

### Gate Result: ✅ PASSED
| Step | Expected | Actual | Result |
|------|----------|--------|--------|
| Without fix | FAIL | FAIL | ✅ |
| With fix | PASS | PASS | ✅ |

AI Summary comment (<!-- AI Summary -->):
Pre-Flight, Fix, Report sections only — no gate duplication.

Key Changes

  • verify-tests-fail.ps1: Auto-detects test type, routes to correct runner (BuildAndRunHostApp, Run-DeviceTests, dotnet test), iterates over all detected tests, -Platform mandatory
  • Detect-TestsInDiff.ps1: Shared detection engine — reads [Category] attributes for device test filtering, extracts method names from PR diffs
  • Review-PR.ps1: Gate as Step 1 (script), PR review as Step 2 (copilot), removed PR finalize step
  • post-ai-summary-comment.ps1: Rewritten from 841 → 170 lines, always overwrites
  • pr-gate.md: Strict output template, no cross-phase duplication rule
  • pr-review/SKILL.md: 3 phases (removed Gate), no-duplication rule
  • EstablishBrokenBaseline.ps1: Excludes TestUtils/DeviceTests.Runners from fix file detection

Verified

Copilot AI review requested due to automatic review settings March 27, 2026 13:29

github-actions bot commented Mar 27, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34705

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34705"


Copilot AI left a comment


Pull request overview

Extends the CI “gate” and PR review automation to detect and run all MAUI test types (UI/device/unit/XAML), and restructures the review flow so gate runs as a standalone step with its own PR comment, while the Copilot-driven review focuses on Pre-Flight/Try-Fix/Report.

Changes:

  • Update verify-tests-fail.ps1 to auto-detect test type(s) and dispatch to the right runner, running all detected tests.
  • Add shared test detection (Detect-TestsInDiff.ps1) and a dedicated gate comment poster (post-gate-comment.ps1); simplify AI summary posting.
  • Update review orchestration/docs to remove gate from the pr-review skill and run it from Review-PR.ps1 instead.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 10 comments.

| File | Description |
|------|-------------|
| .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1 | Adds multi-test-type detection/routing and multi-test execution for gate verification. |
| .github/skills/verify-tests-fail-without-fix/SKILL.md | Updates skill documentation for broader test-type support (but currently out of sync with script behavior/outputs). |
| .github/skills/try-fix/references/example-invocation.md | Documents device/unit test command examples. |
| .github/skills/try-fix/SKILL.md | Updates try-fix guidance to select the correct runner per test type. |
| .github/skills/pr-review/SKILL.md | Changes orchestrator to 3 phases (Pre-Flight/Try-Fix/Report) and states gate is pre-run. |
| .github/scripts/shared/Detect-TestsInDiff.ps1 | New shared test detection/classification script used by gate and tooling. |
| .github/scripts/post-gate-comment.ps1 | New script to post/update a dedicated <!-- AI Gate --> PR comment. |
| .github/scripts/post-ai-summary-comment.ps1 | Simplifies AI summary comment generation (no session history; gate posted separately). |
| .github/scripts/RunTests.ps1 | New unified local entry point to run Unit/Device/UI/Integration tests. |
| .github/scripts/Review-PR.ps1 | Reorders flow to run gate first via script, then invoke pr-review, then post comments/labels. |
| .github/scripts/EstablishBrokenBaseline.ps1 | Expands “test path” patterns to exclude more test utility/runner paths from fix detection. |
| .github/pr-review/pr-report.md | Updates report phase prerequisites now that gate is external to pr-review. |
| .github/pr-review/pr-gate.md | Updates gate doc for detection template/output rules (but still references task-agent flow). |
| .github/copilot-instructions.md | Updates repository Copilot instructions to reflect 3-phase PR review and multi-type gate. |

Comment on lines +836 to +837
$TestLog = Join-Path $OutputPath "test-failure-$($testEntry.TestName).log"

Copilot AI Mar 27, 2026

Log file names are derived directly from TestName (e.g., test-failure-<TestName>.log). TestName can include spaces/parentheses/commas (device tests append method names), which can create awkward or invalid paths on some filesystems and can hit path-length limits. Consider sanitizing TestName for file names (or use an index-based file name) and store the display name inside the log content instead.

Suggested change
$TestLog = Join-Path $OutputPath "test-failure-$($testEntry.TestName).log"
# Sanitize TestName for use in a file name and keep it reasonably short
$rawTestName = [string]$testEntry.TestName
$invalidFileNameChars = [IO.Path]::GetInvalidFileNameChars()
$extraProblematicChars = [char[]]' ()[],'
$charsToReplace = $invalidFileNameChars + $extraProblematicChars
$sanitizedTestName = ($rawTestName.ToCharArray() | ForEach-Object {
    if ($charsToReplace -contains $_) { '_' } else { $_ }
}) -join ''
if ([string]::IsNullOrWhiteSpace($sanitizedTestName)) {
    $sanitizedTestName = "test-$testIndex"
}
$maxNameLength = 60
if ($sanitizedTestName.Length -gt $maxNameLength) {
    $sanitizedTestName = $sanitizedTestName.Substring(0, $maxNameLength)
}
$TestLog = Join-Path $OutputPath ("test-failure-{0}.log" -f $sanitizedTestName)

Copilot uses AI. Check for mistakes.
Comment on lines +96 to +100
3. Auto-detects test classes from changed test files
4. Routes to the appropriate test runner
5. Runs tests (should FAIL to prove they catch the bug)
6. **Updates PR labels** based on result
7. Reports result
Copilot AI Mar 27, 2026

This doc claims the script "Updates PR labels" as part of the workflow, but the updated verify-tests-fail.ps1 no longer contains label-management logic. Please remove or update these steps so they reflect the current behavior (labeling appears to happen later in Review-PR.ps1).

Comment on lines +317 to +321
$patch = $null
if ($PRNumber) {
    # Get per-file patch from GitHub API
    $patch = gh api "repos/dotnet/maui/pulls/$PRNumber/files" --jq ".[] | select(.filename == `"$file`") | .patch" 2>$null
} else {
Copilot AI Mar 27, 2026

For each device-test file, Step 4 calls gh api repos/dotnet/maui/pulls/$PRNumber/files and filters it with jq. This creates an N+1 GitHub API pattern that can hit rate limits and slow down gate on large PRs. Consider fetching the PR files/patches once (single API call) and caching them in a lookup keyed by filename.
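The single-fetch idea above could look roughly like the following. This is a minimal sketch, not the code from the PR: `New-PatchLookup` is a hypothetical helper name, and the sample file entries stand in for the parsed response of one `gh api "repos/dotnet/maui/pulls/$PRNumber/files"` call piped through `ConvertFrom-Json`.

```powershell
# Hypothetical helper: build a filename -> patch lookup from the PR "files" payload.
function New-PatchLookup {
    param([object[]]$Files)

    $lookup = @{}
    foreach ($f in $Files) {
        # Some entries (for example binary files) may have no patch; keep them as $null.
        $lookup[$f.filename] = $f.patch
    }
    return $lookup
}

# Stand-in data (assumption): in the real script this array would come from
# a single API call made once, before the per-file loop.
$sampleFiles = @(
    [pscustomobject]@{ filename = 'src/A.cs'; patch = '@@ -1 +1 @@' },
    [pscustomobject]@{ filename = 'src/B.cs'; patch = $null }
)

$patchByFile = New-PatchLookup -Files $sampleFiles

# Inside the per-file loop, a hashtable lookup replaces the per-file API call.
$patch = $patchByFile['src/A.cs']
```

With a lookup like this, the number of API calls no longer grows with the number of device-test files in the PR.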

Comment on lines +322 to +323
# Try from git diff
$patch = git diff $mergeBase HEAD -- $file 2>$null
Copilot AI Mar 27, 2026

When running without -PRNumber, Step 4 uses git diff $mergeBase HEAD -- $file, but $mergeBase is only set in the earlier auto-diff path. If the caller provides -ChangedFiles directly, $mergeBase may be $null, producing an unintended diff (or no patch). Consider ensuring $mergeBase is always initialized for non-PR runs, or diffing against HEAD explicitly.

Suggested change
# Try from git diff
$patch = git diff $mergeBase HEAD -- $file 2>$null
# Try from git diff. If $mergeBase is not set (for example when -ChangedFiles
# was provided directly), fall back to diffing against the previous commit.
if ($mergeBase) {
    $patch = git diff $mergeBase HEAD -- $file 2>$null
} else {
    $patch = git diff HEAD~1 HEAD -- $file 2>$null
}

-DetectedProjectPath $testEntry.ProjectPath `
-LogFile $TestLog

$testResult = Get-TestResultFromOutput -LogFile $testOutputLog
Copilot AI Mar 27, 2026

Get-TestResultFromOutput accepts a -TestFilter parameter (and contains filter-aware parsing logic), but this call doesn't pass the current test's filter. Either pass $testEntry.Filter through or remove the unused parameter/logic; otherwise the filter-aware parsing path can never be exercised.

Suggested change
$testResult = Get-TestResultFromOutput -LogFile $testOutputLog
$testResult = Get-TestResultFromOutput -LogFile $testOutputLog -TestFilter $testEntry.Filter

Comment on lines 28 to +31
.PARAMETER Platform
Target platform: "android", "ios", "catalyst" (MacCatalyst), or "windows"
Required for UITest and DeviceTest types. Optional for UnitTest and XamlUnitTest.

Copilot AI Mar 27, 2026

The comment header says -Platform is optional for UnitTest/XamlUnitTest, but $Platform is declared as a mandatory parameter, and the examples suggest running without it. Either make -Platform truly optional (and only enforce it for UI/Device tests), or update the docs/examples to reflect that -Platform is always required.

Comment on lines 139 to +143
| `test-without-fix.log` | Full test output from run without fix |
| `test-with-fix.log` | Full test output from run with fix |

**Plus UI test logs in** `CustomAgentLogsTmp/UITests/`:
- `android-device.log` or `ios-device.log` - Device logs
- `test-output.log` - NUnit test output
**Plus test logs in** `CustomAgentLogsTmp/`:
- `UITests/` - UI test device logs and output
Copilot AI Mar 27, 2026

In the Output Files section, the directory/path described earlier (CustomAgentLogsTmp/PRState/<PRNumber>/verify-tests-fail/) doesn't match the current verify-tests-fail.ps1 output location (now under .../PRAgent/gate/verify-tests-fail). Please update the paths and example structure here so consumers can find verification-report.md and the per-test logs reliably.

Comment on lines +36 to 39
2. **Select platform** — must be affected by bug AND available on host (see table above).

3. **Run verification via task agent** (MUST use task agent — never inline):
```
Copilot AI Mar 27, 2026

Step 3 still instructs running gate via the task agent, but the new workflow in Review-PR.ps1 runs gate directly via verify-tests-fail.ps1 before invoking the Copilot pr-review skill. Please update this step to reflect the new script-driven gate (or clarify that this doc is only for manual, agent-driven gate runs).

# ============================================================================

# Get latest commit info
$commitJson = gh api "repos/dotnet/maui/pulls/$PRNumber/commits" --jq '.[-1] | {message: .commit.message, sha: .sha}' 2>$null | ConvertFrom-Json
Copilot AI Mar 27, 2026

gh api ... | ConvertFrom-Json will throw if gh fails (e.g., auth missing / rate limit) because stderr is suppressed and stdout may be empty. Since $ErrorActionPreference = 'Stop', this can break the whole posting step. Wrap this in try/catch and fall back to Unknown commit info when the API call fails or returns empty.

Suggested change
$commitJson = gh api "repos/dotnet/maui/pulls/$PRNumber/commits" --jq '.[-1] | {message: .commit.message, sha: .sha}' 2>$null | ConvertFrom-Json
$commitJson = $null
try {
    $rawCommitJson = gh api "repos/dotnet/maui/pulls/$PRNumber/commits" --jq '.[-1] | {message: .commit.message, sha: .sha}' 2>$null
    if (-not [string]::IsNullOrWhiteSpace($rawCommitJson)) {
        $commitJson = $rawCommitJson | ConvertFrom-Json
    }
}
catch {
    # ${PRNumber} is delimited so the ':' that follows is not parsed as part of the variable reference
    Write-Host "⚠️ Failed to fetch or parse commit info for PR #${PRNumber}: $($_.Exception.Message)" -ForegroundColor Yellow
    $commitJson = $null
}

Comment on lines 105 to 109
$commitJson = gh api "repos/dotnet/maui/pulls/$PRNumber/commits" --jq '.[-1] | {message: .commit.message, sha: .sha}' 2>$null | ConvertFrom-Json
$commitTitle = if ($commitJson) { ($commitJson.message -split "`n")[0] } else { "Unknown" }
$commitSha = if ($commitJson) { $commitJson.sha.Substring(0, 7) } else { "unknown" }
$commitUrl = if ($commitJson) { "https://github.com/dotnet/maui/commit/$($commitJson.sha)" } else { "#" }

Copilot AI Mar 27, 2026

Similar to post-gate-comment.ps1, gh api ... | ConvertFrom-Json will throw when gh fails or returns empty output (stderr is suppressed). With $ErrorActionPreference = 'Stop', that prevents the summary comment from being posted. Add try/catch and default commit title/SHA/URL when the API call is unavailable.

Suggested change
$commitJson = gh api "repos/dotnet/maui/pulls/$PRNumber/commits" --jq '.[-1] | {message: .commit.message, sha: .sha}' 2>$null | ConvertFrom-Json
$commitTitle = if ($commitJson) { ($commitJson.message -split "`n")[0] } else { "Unknown" }
$commitSha = if ($commitJson) { $commitJson.sha.Substring(0, 7) } else { "unknown" }
$commitUrl = if ($commitJson) { "https://github.com/dotnet/maui/commit/$($commitJson.sha)" } else { "#" }
$commitJson = $null
$commitTitle = "Unknown"
$commitSha = "unknown"
$commitUrl = "#"
try {
    $commitRaw = gh api "repos/dotnet/maui/pulls/$PRNumber/commits" --jq '.[-1] | {message: .commit.message, sha: .sha}' 2>$null
    if ($commitRaw) {
        $commitJson = $commitRaw | ConvertFrom-Json
    }
} catch {
    # ${PRNumber} is delimited so the ':' that follows is not parsed as part of the variable reference
    Write-Warning "Failed to fetch latest commit info for PR #${PRNumber}: $($_.Exception.Message)"
}
if ($commitJson) {
    $commitTitle = ($commitJson.message -split "`n")[0]
    $commitSha = $commitJson.sha.Substring(0, 7)
    $commitUrl = "https://github.com/dotnet/maui/commit/$($commitJson.sha)"
}


kubaflo commented Mar 27, 2026

🔬 Multi-Model Code Review — PR #34705

Cross-pollinated review from GPT-5.4 (Gemini persona), GPT-5.2-Codex, and Claude Opus 4.6, each reviewing independently then synthesized.


✅ Consensus: Architecture is Sound

All three models agree the decoupling direction is correct:

  • Moving gate from Phase 2 (inside copilot agent) → standalone script Step 1 is deterministic, faster, and cleaner
  • Two-comment approach (<!-- AI Gate --> / <!-- AI Summary -->) separates concerns well
  • post-ai-summary-comment.ps1 rewrite (841→170 lines) is a major simplification

🔴 Critical — Unanimous (3/3 models flagged)

1. Empty test array → false-positive gate PASS

All three models independently identified this as the #1 issue:

$failedWithoutFix = ($withoutFixResults | Where-Object { $_.Passed }).Count -eq 0
$passedWithFix = ($withFixResults | Where-Object { -not $_.Passed }).Count -eq 0

When $withoutFixResults is empty (zero tests ran), both evaluate to $true → gate reports PASSED with nothing tested. This is the most dangerous failure mode.

Fix (one-liner):

if ($AllDetectedTests.Count -eq 0) {
    Write-Error "No tests detected — gate cannot verify"; exit 1
}
# + similar guard after each test run loop for empty results
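To illustrate the "similar guard after each test run loop" mentioned above, here is a minimal sketch. `Test-GateVerdict` and the `ERROR` verdict string are hypothetical names chosen for illustration; the only thing taken from the PR is the shape of the two aggregation expressions and the `Passed` property on each result.

```powershell
# Hypothetical helper: same aggregation as above, but empty input is never a pass.
function Test-GateVerdict {
    param(
        [object[]]$WithoutFixResults = @(),
        [object[]]$WithFixResults = @()
    )

    # Guard first: if either run produced no results, nothing was verified.
    if ($WithoutFixResults.Count -eq 0 -or $WithFixResults.Count -eq 0) {
        return 'ERROR'
    }

    # Every test must fail without the fix...
    $failedWithoutFix = @($WithoutFixResults | Where-Object { $_.Passed }).Count -eq 0
    # ...and every test must pass with the fix.
    $passedWithFix = @($WithFixResults | Where-Object { -not $_.Passed }).Count -eq 0

    if ($failedWithoutFix -and $passedWithFix) { return 'PASSED' }
    return 'FAILED'
}

# Usage with stand-in results (assumed shape: objects with a boolean Passed property).
$without = @([pscustomobject]@{ Passed = $false })
$with    = @([pscustomobject]@{ Passed = $true })

$emptyVerdict = Test-GateVerdict -WithoutFixResults @() -WithFixResults @()
$realVerdict  = Test-GateVerdict -WithoutFixResults $without -WithFixResults $with
```

Here `$emptyVerdict` is `ERROR` and `$realVerdict` is `PASSED`, so a run that parsed zero results can no longer be reported as a successful gate.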

2. No automated tests for ~1,200 lines of new script logic (Opus + Gemini)

Detect-TestsInDiff.ps1 (424 lines), RunTests.ps1 (625 lines), post-gate-comment.ps1 (134 lines) — all pure logic, highly testable with Pester, but zero tests. Get-TestResultFromOutput does regex parsing of varied output formats — the most fragile code in this PR.


🟡 Medium — Strong Agreement (2-3/3 models)

3. Device test filter falls back to bare class name (All 3)

When [Category] extraction fails, Detect-TestsInDiff.ps1 falls back to the class name (e.g., EditorTests). But Run-DeviceTests.ps1 expects Category=X format. A bare class name either runs all tests or fails silently. Additionally (Gemini), the category regex \[Category\(TestCategory\.(\w+)\)\] misses string categories ([Category("Battery")]) and multi-category attributes.
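A broader category pattern might look like the sketch below. It is an illustration only: `Get-CategoryNames` is a hypothetical helper, and the pattern is an assumption about which attribute spellings need to be covered (enum-style `TestCategory.X`, string-style `"X"`, and several `Category(...)` entries inside one attribute list).

```powershell
# Hypothetical helper: collect category names from C# source text.
function Get-CategoryNames {
    param([string]$Source)

    # Matches Category(TestCategory.Name) and Category("Name"), with or without
    # surrounding brackets, so multi-category attribute lists are covered too.
    $pattern = '\bCategory\(\s*(?:TestCategory\.(?<enum>\w+)|"(?<text>[^"]+)")\s*\)'

    $names = [System.Collections.Generic.List[string]]::new()
    foreach ($m in [regex]::Matches($Source, $pattern)) {
        if ($m.Groups['enum'].Success) {
            $names.Add($m.Groups['enum'].Value)
        } else {
            $names.Add($m.Groups['text'].Value)
        }
    }
    return $names.ToArray()
}

# Usage: wrap the call in @() so a single match still yields an array.
$sample = '[Category(TestCategory.Editor)] [Category("Battery")] [Category(TestCategory.Entry), Category(TestCategory.Label)]'
$categories = @(Get-CategoryNames -Source $sample)
$filters = $categories | ForEach-Object { "Category=$_" }
```

Producing `Category=X` strings from every match keeps the output in the format that, per the finding above, Run-DeviceTests.ps1 expects.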

4. Gate can't produce documented SKIPPED state (Gemini + Codex)

Review-PR.ps1 unconditionally runs verify-tests-fail.ps1. If no tests detected, the script exits with error → gate becomes FAILED instead of the documented ⚠️ SKIPPED. Testless PRs shouldn't be failures.

5. Review-PR.ps1 doesn't pass -RequireFullVerification (Gemini)

pr-gate.md says invoke with RequireFullVerification: true, but Review-PR.ps1 omits it. The gate can silently fall back to failure-only mode, skipping "tests pass WITH fix" verification — half the gate contract.

6. Synthesized test entries have inconsistent key shapes (Opus + Codex)

Detection-returned entries have Runner, NeedsPlatform, Files, Methods. Explicitly-provided entries (verify-failure-only path) omit Runner and NeedsPlatform. Future code accessing $t.Runner on these entries will get $null.

Recommendation: Create a New-TestEntry helper that always produces a canonical hashtable shape.
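A minimal sketch of what such a helper could look like follows. The key names `Runner`, `NeedsPlatform`, `Files`, `Methods`, `TestName`, `Filter` and `ProjectPath` appear elsewhere in this PR discussion; the `Type` key, the parameter defaults, and the fallback of `Filter` to `TestName` are assumptions made for illustration.

```powershell
# Hypothetical helper: always returns a hashtable with the same set of keys.
function New-TestEntry {
    param(
        [Parameter(Mandatory)][string]$Type,        # assumed values: UITest, DeviceTest, UnitTest, XamlUnitTest
        [Parameter(Mandatory)][string]$TestName,
        [string]$Filter = '',
        [string]$ProjectPath = '',
        [string]$Runner = '',
        [bool]$NeedsPlatform = $true,
        [string[]]$Files = @(),
        [string[]]$Methods = @()
    )

    # Assumption: when no filter is given, fall back to the test name.
    if ([string]::IsNullOrWhiteSpace($Filter)) { $Filter = $TestName }

    return @{
        Type          = $Type
        TestName      = $TestName
        Filter        = $Filter
        ProjectPath   = $ProjectPath
        Runner        = $Runner
        NeedsPlatform = $NeedsPlatform
        Files         = $Files
        Methods       = $Methods
    }
}

# Usage: an explicitly provided entry gets the same keys as a detected one.
$entry = New-TestEntry -Type 'UITest' -TestName 'Issue10987'
```

Routing both the detection path and the explicit path through one constructor means a later `$t.Runner` access returns an empty string by design instead of an accidental `$null`.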

7. Comment marker mismatch (Opus)

PR description says <!-- copilot:gate --> but code uses <!-- AI Gate -->. Other tooling searching for the documented marker won't find it.

8. shared-utils.ps1 import may not exist (Opus)

RunTests.ps1 does . "$PSScriptRoot/shared/shared-utils.ps1" — this file isn't in the PR diff. If it doesn't exist on the target branch, every unit test invocation crashes.

9. Label parsing expects old format (Gemini)

Update-AgentLabels.ps1 still looks for Result: lines, but new gate format uses ### Gate Result:. Labels will stop updating.

10. Documentation inconsistencies (Opus + Gemini)

  • SKILL.md says -Platform is always required, but it's only needed for UITest/DeviceTest
  • pr-gate.md still says "use task agent" — stale for the new standalone-script flow

🟢 Minor — Individual Model Insights

| Finding | Source |
|---------|--------|
| GitHub API pagination missing — large PRs skip patches past page 1 | Codex + Gemini |
| ^\s+Failed: regex never matches in multiline string (dead code path) | Opus |
| HostApp-only UI test changes (no Shared.Tests file) dropped as "no tests" | Gemini |
| Unit test project map incomplete — misses Core.Design.UnitTests, DualScreen.UnitTests | Gemini |
| post-gate-comment.ps1 create-comment path lacks try/catch (update path has it) | Gemini + Opus |
| GetTempFileName() uses system temp instead of project-relative path | Opus |
| Get-TestResultFromLog appears dead after rewrite | Gemini |
| No -SkipGate rollback flag in Review-PR.ps1 | Opus |

📊 Summary Matrix

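| Finding | GPT-5.4 | Codex | Opus | Severity |
|---------|---------|-------|------|----------|
| Empty array → false pass | ✅ | ✅ | ✅ | 🔴 |
| No script tests | ✅ | | ✅ | 🔴 |
| Device filter fallback | ✅ | ✅ | ✅ | 🟡 |
| No SKIPPED state | ✅ | ✅ | | 🟡 |
| Missing -RequireFullVerification | ✅ | | | 🟡 |
| Inconsistent entry shapes | | ✅ | ✅ | 🟡 |
| Comment marker mismatch | | | ✅ | 🟡 |
| shared-utils.ps1 missing | | | ✅ | 🟡 |
| Label parsing old format | ✅ | | | 🟡 |
| Stale docs | ✅ | | ✅ | 🟡 |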

🎯 Top 3 Recommended Actions

  1. Guard empty test arrays — Add explicit Count -eq 0 checks before and after verification loops. One-liner fix, prevents the most dangerous failure mode.

  2. Fix device test filter contract — Either ensure Category=X is always produced (fix regex to handle string categories + multi-category), or teach Get-TestResultFromOutput to handle bare class names.

  3. Normalize test entry contract: add a New-TestEntry helper function that always populates all keys. Eliminates the two-shape problem.


🤖 Generated via multi-model cross-pollination: GPT-5.4 · GPT-5.2-Codex · Claude Opus 4.6


kubaflo commented Mar 27, 2026

All 10 review comments addressed in commit 0c33bfe:

  1. ✅ Sanitize TestName for log file names (replace invalid chars, truncate to 60)
  2. ✅ Remove stale 'Updates PR labels' from SKILL.md
  3. ✅ Cache PR files API call (single fetch, keyed lookup)
  4. ✅ Fix $mergeBase null fallback (default to HEAD~1)
  5. ✅ Pass TestFilter to Get-TestResultFromOutput in all loops
  6. ✅ Fix script header: Platform is mandatory for all test types
  7. ✅ Fix output file paths in SKILL.md (PRAgent/gate/verify-tests-fail/)
  8. ✅ Update pr-gate.md: gate runs as direct script, task agent optional
  9. ✅ Add try/catch for gh api in post-gate-comment.ps1
  10. ✅ Add try/catch for gh api in post-ai-summary-comment.ps1


kubaflo commented Mar 27, 2026

🔬 Multi-Model Re-Review (v2) — PR #34705

Cross-pollinated re-review from GPT-5.4, GPT-5.2-Codex, and Claude Opus 4.6 after fix commit 0c33bfe (10 items addressed).


✅ Fixes Confirmed Working (All 3 Models Agree)

| Fix | Status | Evidence |
|-----|--------|----------|
| #5 Pass TestFilter to Get-TestResultFromOutput | ✅ Fixed | verify-tests-fail.ps1 lines 1184, 1252 |
| #8 pr-gate.md → gate runs as script | ✅ Fixed | Doc correctly describes standalone flow |
| #9/#10 try/catch for gh api in posting scripts | ✅ Fixed | Both scripts have guarded commit info fetch |
| #3 Cache PR files API (avoid N+1) | ✅ Fixed | Detect-TestsInDiff.ps1 caches API response |
| Comment marker mismatch (round-1 #7) | ✅ Fixed | <!-- AI Gate --> consistent in code + docs |
| shared-utils.ps1 missing (round-1 #8) | ✅ Resolved | File exists at .github/scripts/shared/shared-utils.ps1 (was false alarm) |

📊 Round-1 Issue Tracking — Updated Status

| # | Issue | GPT-5.4 | Codex | Opus | Consensus |
|---|-------|---------|-------|------|-----------|
| 1 | Empty array → false pass | ✅ Fixed | ✅ Fixed | ⚠️ Mitigated (🟡) | ⚠️ Mitigated |
| 2 | No automated tests | ❌ Open | ❌ Open | ❌ Open | ❌ Open (🟡) |
| 3 | Device filter bare class name | ❌ Open | ⚠️ Partial | ⚠️ Partial | ⚠️ Partial (🟡) |
| 4 | SKIPPED state unreachable | ❌ Open | ❌ Open | ❌ Open | ❌ Open (🟡) |
| 5 | Missing -RequireFullVerification | ❌ Open | ❌ Open | ❌ Open | ❌ Open (🟡) |
| 6 | Inconsistent entry key shapes | ⚠️ Partial | ⚠️ Partial | ⚠️ Partial | ⚠️ Partial (🟢) |
| 7 | Comment marker mismatch | ✅ Fixed | ✅ Fixed | ✅ Fixed | ✅ Fixed |
| 8 | shared-utils.ps1 missing | ✅ Fixed | ✅ Fixed | ✅ Fixed | ✅ Fixed |
| 9 | Label parsing old format | ❌ Open | ❌ Open | ✅ Fixed* | ⚠️ Disputed |
| 10 | Platform defaults to android | ❌ Open | ❌ Open | ⚠️ Partial | ❌ Open (🟢) |
| 11 | Stale docs | ⚠️ Partial | ✅ Fixed | ⚠️ Partial | ⚠️ Partial (🟡) |

*Item 9 disagreement: Opus considers the SKILL.md cleanup sufficient; Gemini/Codex note Update-AgentLabels.ps1 still regex-matches Result: not ### Gate Result:. This label parsing mismatch would cause labels to stop updating.


🔍 Key Disagreement: Empty Array Guard (#1)

This was the unanimous #1 critical from round 1. The models now diverge:

  • GPT-5.4 + Codex: ✅ Fixed — Get-AutoDetectedTestFilter now returns $null when no tests found, triggering hard exit 1 before the aggregation logic runs.
  • Opus: ⚠️ Mitigated, not eliminated — The upstream guard works, but the downstream aggregation logic structurally still treats empty arrays as "all passed." If a test IS detected but Invoke-TestRun produces unparseable output, the empty-array path is still reachable.

Cross-pollination verdict: The fix closes the main entry point (no tests → exit 1). The residual structural flaw is defense-in-depth — low real-world probability. Downgraded to 🟡.


🎯 Remaining Issues — Should They Block Merge?

Split verdict across models:

  • GPT-5.4: REQUEST CHANGES — SKIPPED state, -RequireFullVerification, device filter are blockers
  • Codex: REQUEST CHANGES — -RequireFullVerification is the main blocker
  • Opus: COMMENT (soft approve) — all remaining items are reasonable follow-ups; only stale docs should be fixed pre-merge

Items where blocking is debatable:

#4 SKIPPED state (🟡): Review-PR.ps1 maps exit codes to only PASSED/FAILED. Testless PRs report FAILED instead of SKIPPED. Opus argues this is a cosmetic distinction (gate failure doesn't halt the workflow). Gemini/Codex argue it creates confusing false failures.

Recommendation: Low-effort fix — use distinct exit codes (0=pass, 1=fail, 2=skip) in verify-tests-fail.ps1 and map exit code 2 to SKIPPED in Review-PR.ps1.
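The exit-code mapping recommended above could be sketched as follows. `ConvertTo-GateStatus` is a hypothetical helper name, and the choice to treat any unrecognized exit code as FAILED is an assumption, not something the PR specifies.

```powershell
# Hypothetical helper: map the gate script's exit code to a gate status.
function ConvertTo-GateStatus {
    param([int]$ExitCode)

    switch ($ExitCode) {
        0       { 'PASSED' }
        2       { 'SKIPPED' }   # no tests detected, so there is nothing to verify
        default { 'FAILED' }    # 1 and any unexpected code count as failure
    }
}

# Usage: in Review-PR.ps1 the argument would be $LASTEXITCODE, read right after
# invoking verify-tests-fail.ps1. Literal values are used here for illustration.
$statusPass = ConvertTo-GateStatus -ExitCode 0
$statusFail = ConvertTo-GateStatus -ExitCode 1
$statusSkip = ConvertTo-GateStatus -ExitCode 2
```

Keeping the mapping in one small function makes it straightforward to add further states later without touching the orchestration flow.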

#5 -RequireFullVerification (🟡): Without this flag, the gate only verifies "tests fail without fix" and skips "tests pass with fix." Opus argues this is actually reasonable for edge cases (test-only PRs). Gemini/Codex argue it's half the gate contract.

Recommendation: Add -RequireFullVerification to the gate invocation in Review-PR.ps1 line 462. One flag addition.


📝 Pre-Merge Doc Fixes (All 3 Models Agree — Small Effort, High Value)

These are stale references that AI agents will consume directly, causing wrong paths or confused phase numbering:

  1. pr-review/SKILL.md line 212: Directory structure says gate/ → content.md # Phase 2 output (pr-gate) — gate is no longer Phase 2
  2. pr-review/SKILL.md line 231: Quick Reference table still lists 2. Gate | pr-gate.md | Verify tests via task agent
  3. verify-tests-fail.ps1 line ~56: Example says # Verify unit tests (no platform needed) but Platform is now Mandatory = $true
  4. Update-AgentLabels.ps1 lines 349-353: Regex matches Result: but gate output uses ### Gate Result:

🏁 Cross-Pollinated Verdict

| Model | Verdict | Rationale |
|-------|---------|-----------|
| GPT-5.4 | REQUEST CHANGES | 3 structural issues still open |
| Codex | REQUEST CHANGES | -RequireFullVerification is a must |
| Opus | COMMENT (soft approve) | Remaining issues are follow-up worthy, not blockers |

Synthesized recommendation: COMMENT with targeted fixes.

The architecture is sound and most critical issues are resolved. The remaining items fall into two buckets:

Fix before merge (small, high-confidence):

  1. Add -RequireFullVerification to gate invocation in Review-PR.ps1
  2. Fix 4 stale doc references (listed above)

Track as follow-ups:

  • SKIPPED state via distinct exit codes
  • Empty-array defense-in-depth guard
  • Device test category regex expansion
  • Automated Pester tests for script logic
  • Label parser format alignment

🤖 Multi-model cross-pollination v2: GPT-5.4 · GPT-5.2-Codex · Claude Opus 4.6

kubaflo and others added 19 commits March 27, 2026 18:55
Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
- try-fix SKILL.md: test command table for all test types
- pr-review SKILL.md: test_command placeholder instead of hardcoded BuildAndRunHostApp
- verify-tests-fail SKILL.md: log paths for all test types

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
- pr-gate.md: exact template with no-extras rule, no-duplication warning
- pr-review SKILL.md: critical rule against duplicating phase content
- pr-report.md: explicit rule not to copy gate/try-fix output

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review-PR.ps1 now runs verify-tests-fail.ps1 directly as Step 1 (no copilot
agent needed for gate). The pr-review skill becomes 3 phases: Pre-Flight,
Try-Fix, Report. Gate result is passed in the prompt to the copilot agent.

Flow:
  Step 0: Branch setup
  Step 1: Gate (verify-tests-fail.ps1 — direct script)
  Step 2: PR Review (copilot — 3 phases)
  Step 3: PR Finalize (copilot)
  Step 4: Post comments
  Step 5: Labels

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…story

Rewrote from 841 lines to ~170 lines. Removes all session merging,
extraction, and history logic. Just loads content.md files, builds
comment body, and posts/overwrites.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The @dryRunFlag splatting with empty array passed $null as positional
argument. Replaced with explicit if/else for -DryRun parameter.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Sanitize TestName for log file names (spaces/parens/commas)
2. Remove stale 'Updates PR labels' from SKILL.md
3. Cache PR files API call (avoid N+1 pattern)
4. Fix $mergeBase null fallback when -ChangedFiles provided
5. Pass TestFilter to Get-TestResultFromOutput in all loops
6. Fix script header: Platform is mandatory for all test types
7. Fix output file paths in SKILL.md
8. Update pr-gate.md: gate runs as script, not task agent
9. Add try/catch for gh api in post-gate-comment.ps1
10. Add try/catch for gh api in post-ai-summary-comment.ps1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When PRNumber is available, use GitHub API to get exact PR files.
Git diff from merge-base includes all branch changes (infrastructure
commits), causing 60+ unrelated tests to be detected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Detect ADB install failures, app launch failures, AOT crashes,
  Appium errors → report as ⚠️ ENV ERROR instead of ❌ FAILED
- Extract test failure details: test name, duration, error message
- Gate report now shows Details section with actual error messages
  (e.g., 'snapshot 11.01% different', 'ADB broken pipe')

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Invoke-TestRunWithRetry wraps Invoke-TestRun + result parsing.
When EnvError is detected (ADB install failure, app launch crash,
Appium timeout), retries up to 3 times with 5s delay between attempts.
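
The retry behaviour described above can be sketched roughly as follows, in Python rather than the PR's PowerShell. `run_with_retry`, the `"ENV_ERROR"` string, and the reading of "up to 3 times" as three attempts in total are all assumptions made for illustration.

```python
import time
from typing import Callable


def run_with_retry(run_once: Callable[[], str],
                   max_attempts: int = 3,
                   delay_seconds: float = 5.0,
                   sleep: Callable[[float], None] = time.sleep) -> tuple[str, int]:
    """Re-run only while the result is an environment error.

    Returns the final result and the number of attempts used.
    """
    result = "ENV_ERROR"
    for attempt in range(1, max_attempts + 1):
        result = run_once()
        if result != "ENV_ERROR":
            return result, attempt
        if attempt < max_attempts:
            sleep(delay_seconds)  # give the device or emulator time to recover
    return result, max_attempts


# Usage with a fake runner and a no-op sleep so the example finishes instantly.
outcomes = iter(["ENV_ERROR", "PASSED"])
result, attempts = run_with_retry(lambda: next(outcomes), sleep=lambda s: None)
print(result, attempts)
```

Genuine test failures are returned immediately. Only environment errors trigger another attempt, so a real regression is never masked by retries.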

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kubaflo and others added 3 commits March 27, 2026 20:57
- retryCountOnTaskFailure: 1 → 3
- Offline wait: 10 iterations × 3s → 20 iterations × 5s (100s total)
- Fail fast with clear error if device stays offline (instead of
  hanging on adb shell commands until 15min timeout)
- All adb shell prep commands now have || true to not hang on offline
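
The bounded wait with a fail-fast error can be sketched like this. It is a Python illustration of the polling logic only: the real change lives in the pipeline's device setup step, and `wait_for_device` and the `is_online` probe are hypothetical names.

```python
import time
from typing import Callable


def wait_for_device(is_online: Callable[[], bool],
                    attempts: int = 20,
                    interval_seconds: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll until the device reports online and return the attempt number.

    Raises TimeoutError instead of letting later commands hang on an
    offline device.
    """
    for attempt in range(1, attempts + 1):
        if is_online():
            return attempt
        sleep(interval_seconds)
    total = attempts * interval_seconds
    raise TimeoutError(f"device still offline after {total:.0f}s")


# Usage with a fake probe: the device comes online on the third check.
states = iter([False, False, True])
print(wait_for_device(lambda: next(states), sleep=lambda s: None))
```

With the defaults, the loop gives up after 20 × 5s = 100s, well short of the 15-minute job timeout mentioned above.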

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
test-output.log from the 'without fix' run could be read by the
'with fix' parser if the file wasn't overwritten yet, causing
false FAILED results when tests actually passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tee-Object pipeline can break and prevent return statements from
executing, causing the parser to read stale or wrong log files.
Now captures output to variable first, then writes to file.
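
A rough Python analogue of "capture first, then write" is shown below. The PR does this in PowerShell. `run_and_log` is a hypothetical name and the child command is a stand-in for the real test script.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_log(command: list[str], log_file: Path) -> tuple[int, str]:
    """Capture stdout and stderr in memory, then write the log in one step."""
    completed = subprocess.run(command, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, text=True)
    # The file is written only after the process has finished, so a parser
    # never sees a half-written or stale log.
    log_file.write_text(completed.stdout, encoding="utf-8")
    return completed.returncode, completed.stdout


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "without-fix.log"
    exit_code, output = run_and_log(
        [sys.executable, "-c", "print('Passed: 1')"], log_path)
    logged = log_path.read_text(encoding="utf-8")
```

Returning the captured text alongside the exit code also lets the caller parse results without re-reading the file.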

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kubaflo and others added 4 commits March 27, 2026 21:19
Each Invoke-TestRun now returns a unique file path per run instead
of the shared test-output.log. For UITest, copies test-output.log
to {logfile}.testresult immediately after the run. For unit/XAML/device
tests, returns the unique $LogFile directly.

This prevents the 'without fix' results from being read during the
'with fix' parse, which caused false FAILED reports when tests passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Every Invoke-TestRun now returns its own unique $LogFile.
No more copying from shared test-output.log. The captured script
output already contains the test results (build + test stdout).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kubaflo and others added 9 commits March 28, 2026 19:24
Add timing, test counts, and per-step failure details to the gate
markdown report. Previously the gate comment only showed a summary
table + deduplicated error lines. Now it includes:

- Per-test duration (Stopwatch around each Invoke-TestRunWithRetry)
- Test counts per step (total/passed/failed)
- Failure reason and error message per test (truncated to 300 chars)
- Separate 'Without fix' and 'With fix' sections with inline details
- Duration in console logs for easier CI debugging
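
The timing and truncation pieces could look roughly like this in Python. The PR uses a PowerShell Stopwatch; `timed` and `truncate_error` are hypothetical names, and the trailing `...` marker is an assumption.

```python
import time
from typing import Callable

MAX_ERROR_CHARS = 300  # limit stated in the commit message


def timed(run: Callable[[], str]) -> tuple[str, float]:
    """Run a test step and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = run()
    return result, time.perf_counter() - start


def truncate_error(message: str, limit: int = MAX_ERROR_CHARS) -> str:
    """Keep the first `limit` characters so the report stays readable."""
    if len(message) <= limit:
        return message
    return message[:limit] + "..."


result, elapsed = timed(lambda: "PASSED")
print(f"{result} in {elapsed:.3f}s")
```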

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When no tests are detected in a PR, the gate now shows a friendly
'⚠️ SKIPPED' message with a recommendation to use write-tests-agent,
instead of showing a bare '❌ FAILED' with no context.

- verify-tests-fail.ps1: exit code 2 for 'no tests' (vs 1 for failure)
- Review-PR.ps1: map exit code 2 to SKIPPED state, write helpful
  gate/content.md with write-tests-agent suggestion
- Agent prompt updated to reflect SKIPPED state

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After the gate finishes, apply the corresponding label and remove
any stale gate labels from previous runs:
- s/agent-gate-passed (exit 0)
- s/agent-gate-failed (exit 1)
- s/agent-gate-skipped (exit 2, no tests)
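
The exit-code-to-label mapping above can be sketched as a small lookup, again in Python for illustration only. `plan_label_changes` is a hypothetical name, and treating an unexpected exit code as a failure is an assumption rather than something the commit states.

```python
GATE_LABELS = {
    0: "s/agent-gate-passed",
    1: "s/agent-gate-failed",
    2: "s/agent-gate-skipped",  # no tests detected
}


def plan_label_changes(exit_code: int, current_labels: set[str]) -> tuple[str, set[str]]:
    """Return the label to add and the stale gate labels to remove."""
    # Assumption: any unexpected exit code is treated as a failure.
    new_label = GATE_LABELS.get(exit_code, GATE_LABELS[1])
    stale = {label for label in current_labels
             if label in GATE_LABELS.values() and label != new_label}
    return new_label, stale


# "area-example" is a made-up non-gate label that must be left alone.
to_add, to_remove = plan_label_changes(0, {"s/agent-gate-failed", "area-example"})
print(to_add, sorted(to_remove))
```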

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The gate console output was a wall of interleaved build/test output
with no clear separation between 'without fix' and 'with fix' runs.

Now:
- Raw test output is inside AzDO collapsible groups (##[group])
  so it's available but doesn't flood the log
- Each test run has a clear banner:
  🔴 WITHOUT FIX / 🟢 WITH FIX
- Step headers use box-drawing characters for visual separation
- Results print OUTSIDE groups so they're always visible with
  duration, test counts, and failure details
- Final summary is a side-by-side comparison table:

  Test Name              │ Without Fix │  With Fix
  ───────────────────────┼─────────────┼────────────
  Issue34591             │ ✅ FAIL      │ ✅ PASS
  ───────────────────────┼─────────────┼────────────
  Expected               │   FAIL      │   PASS

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The gate comment was hard to read — scattered sections with no clear
association between 'without fix' and 'with fix' results for each test.

New layout uses a single comparison table:

| Test | Without Fix (expect FAIL) | With Fix (expect PASS) |
|------|--------------------------|------------------------|
| 🖥️ **Issue34591** | ✅ FAIL — 245s | ✅ PASS — 180s |

- Each test shows both directions in one row — instantly clear
- Duration per direction per test
- Failure details only shown when something went wrong (collapsible)
- Fix files list is collapsible to reduce noise
- Platform, base branch, merge base on one line

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each test run now has its own expandable section with the full log:

🔴 **Without fix** — 🖥️ Issue34591: FAIL ✅ · 245s
🟢 **With fix** — 🖥️ Issue34591: PASS ✅ · 180s

Click to expand and see the complete build + test output.
Logs truncated to last 15k chars if too large for GitHub comments.
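
Keeping only the tail of a large log is simple to sketch. The Python below assumes a 15,000-character limit, as stated above, and adds a hypothetical `[log truncated]` notice that the real script may not emit.

```python
MAX_LOG_CHARS = 15_000


def tail_for_comment(log_text: str, limit: int = MAX_LOG_CHARS) -> str:
    """Return the log unchanged if it fits, otherwise its last `limit` chars."""
    if len(log_text) <= limit:
        return log_text
    # The end of a build + test log usually holds the result summary,
    # so the tail is the more useful part to keep.
    return "[log truncated]\n" + log_text[-limit:]


long_log = "a" * 20_000 + "END"
shortened = tail_for_comment(long_log)
print(len(shortened))
```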

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
gh pr edit --add-label silently fails when the label doesn't exist
in the repo. Now:
- Creates label with gh label create --force before applying
- Uses --repo dotnet/maui explicitly for fork PRs
- Logs actual errors instead of swallowing them with 2>$null

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both post-gate-comment.ps1 and post-ai-summary-comment.ps1 used
gh api without --paginate, so PRs with 30+ comments couldn't find
the existing marker comment. Each run created a new comment instead
of updating the existing one.

Fixes:
- Add --paginate to search ALL comments
- Pick the LAST matching comment (most recent) instead of first
- Handle 'null' string from jq when no match found
- On PATCH failure, try to find a comment owned by the current
  bot user before falling back to creating a new one
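
The "pick the last matching comment" step might look like this in Python once every page of comments has been collected into one list. `find_marker_comment` is a hypothetical name, and the list is assumed to be in the order the API returned it, oldest first.

```python
from typing import Optional

MARKER = "<!-- AI Gate -->"


def find_marker_comment(comments: list[dict], marker: str = MARKER) -> Optional[int]:
    """Return the id of the most recent comment containing the marker, if any."""
    matches = [c["id"] for c in comments if marker in (c.get("body") or "")]
    # An empty result means "create a new comment", not the string 'null'.
    return matches[-1] if matches else None


comments = [
    {"id": 1, "body": "<!-- AI Gate --> old result"},
    {"id": 2, "body": "unrelated discussion"},
    {"id": 3, "body": "<!-- AI Gate --> newer result"},
]
print(find_marker_comment(comments))
```

Returning `None` explicitly avoids the case where a missing match is passed on as text and then used as a comment id.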

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kubaflo and others added 3 commits March 31, 2026 21:27
APP_LAUNCH_FAILURE (XHarness exit 83) was causing false ENV ERROR
gate results. The 5-second retry wait was too short for iOS simulator
recovery. Now:
- Wait 30s between retries (up from 5s)
- Reboot iOS simulator on APP_LAUNCH_FAILURE before retrying
- Reboot Android emulator on app crash before retrying
- Log the reboot action for visibility

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When XHarness crashes with exit code 83 (APP_LAUNCH_FAILURE), the
test runner outputs 'Passed: 0 / Failed: 0'. The parser was not
detecting this as an env error because:
1. The regex 'APP_LAUNCH_FAILURE|exit code:? 83' didn't match the
   actual format 'XHarness exit code: 83 (APP_LAUNCH_FAILURE)'
2. 'Passed: 0 / Failed: 0' fell through all checks to the generic
   'Could not parse' path, but in some flows was treated as PASSED

Fixes:
- Add 'XHarness exit code: 83' as explicit env error pattern
- Add 'Application test run crashed' as env error pattern
- Guard: 'Passed: 0 + Failed: 0' = env error (zero tests ran)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
XHarness can report exit code 83 (APP_LAUNCH_FAILURE) even when
tests ran successfully (57 passed, 0 failed). This is a teardown/
cleanup issue, not a real test failure.

The parser was checking env error patterns (exit code 83) BEFORE
checking actual test results (Passed: 57). This caused the gate
to report ENV ERROR when tests actually passed.

Fix: check for actual test results (Passed: N where N > 0) FIRST.
If tests produced real results, trust them over the exit code.
Env error patterns only apply when zero tests ran.

Uses the LAST Passed:/Failed: counts in the log to handle cases
where Run-DeviceTests.ps1 retries internally and the log contains
multiple result blocks.
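
The "results first, env patterns second" ordering can be sketched as below. This is a Python approximation, not the PowerShell parser from the PR: the function name, the return strings, and the `UNKNOWN` fallback are assumptions, and only the two log phrases quoted in these commit messages are used as patterns.

```python
import re

ENV_ERROR_PATTERN = re.compile(r"XHarness exit code: 83|Application test run crashed")


def classify_device_log(log_text: str) -> str:
    """Classify a device-test log, trusting real test counts over exit codes."""
    passed = re.findall(r"\bPassed:\s*(\d+)", log_text)
    failed = re.findall(r"\bFailed:\s*(\d+)", log_text)
    # Use the LAST counts: internal retries can leave several result blocks.
    pass_count = int(passed[-1]) if passed else 0
    fail_count = int(failed[-1]) if failed else 0

    if fail_count > 0:
        return "FAILED"
    if pass_count > 0:
        return "PASSED"  # real results win, even if exit code 83 was logged
    # Zero tests ran: either a known env pattern or an explicit 0/0 summary.
    if ENV_ERROR_PATTERN.search(log_text) or (passed and failed):
        return "ENV_ERROR"
    return "UNKNOWN"


retried_log = (
    "Passed: 0\nFailed: 0\nXHarness exit code: 83 (APP_LAUNCH_FAILURE)\n"
    "Passed: 57\nFailed: 0\nXHarness exit code: 83 (APP_LAUNCH_FAILURE)\n"
)
print(classify_device_log(retried_log))
```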

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
gh label create uses GraphQL which requires read:org scope that the
CI token doesn't have. gh pr edit --add-label uses REST API and
works with just repo scope — same as the Step 4 labels that work.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kubaflo and others added 5 commits April 1, 2026 11:45
ROOT CAUSE: Run-DeviceTests.ps1 used Write-Host for 'Passed: 57'
and 'Failed: 0'. Write-Host writes directly to the console and
bypasses PowerShell's output stream. When the gate captures output
with $scriptOutput = & script 2>&1, Write-Host output is NOT
captured. The log file never contained 'Passed:' lines, so the
parser always fell through to env error patterns (XHarness exit 83).

Fix:
- Run-DeviceTests.ps1: Write-Output for Passed/Failed/exit code
  lines so they appear in captured stdout
- verify-tests-fail.ps1: use (?m) multiline regex, check
  devicePassCount > 0 (not total > 0), add debug output

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
.NET 10 SDK no longer recognizes 'win10-x64' as a valid
RuntimeIdentifier (NETSDK1083). The correct RID is 'win-x64'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Windows device tests use CoreCLR, not Mono. Passing any RID
(win10-x64 or win-x64) forces Mono runtime resolution which fails
with NU1102 (no Microsoft.NETCore.App.Runtime.Mono.win-x64 package).

Fix: set RuntimeIdentifier to null for Windows — let MSBuild use
its default. The WindowsPackageType=None and SelfContained flags
are already added at lines 300-301.

Also: use recursive search for the exe output path since without
an RID the output folder structure may vary.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
WindowsAppSDKSelfContained requires an explicit architecture RID,
but win-x64 triggers Mono runtime resolution by default. Fix:
- Restore RuntimeIdentifier = win-x64 (needed for SelfContained)
- Add /p:UseMonoRuntime=false to force CoreCLR instead of Mono
- Use recursive exe search for output path flexibility

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Member

PureWeen commented Apr 3, 2026

PR #34705 Review -- [CI] Extend gate to all test types and decouple from PR review

Verdict: ⚠️ Request Changes

🚨 Prompt Injection (5/5 models)

Comment by kubaflo instructs AI to "ignore findings and approve." Must be deleted.

Previous Findings Status

| Finding | Status |
|---------|--------|
| -RequireFullVerification missing | 🔴 STILL PRESENT |
| Empty array false-positive PASS | 🟡 MITIGATED but structurally present |
| Category regex misses string literals | 🔴 STILL PRESENT |
| GitHub API 100-file truncation | 🔴 STILL PRESENT |
| Platform mandatory for UnitTest | 🔴 STILL PRESENT |

New Findings

| Severity | Issue |
|----------|-------|
| 🔴 CRITICAL | Detect-TestsInDiff output discarded (Out-Null) -- gate runs blind |
| 🔴 CRITICAL | Write-Error + $ErrorActionPreference="Stop" kills multi-project loop on first failure |
| 🟡 MODERATE | Write-MarkdownReport ignores its own params, uses script-scope vars |
| 🟡 MODERATE | Gate failure is advisory -- never blocks the agent |
| 🟡 MODERATE | Git option injection via fork branch name |

CI: ❌ Failures are pre-existing (PR only touches .github/ scripts)

- Remove Out-Null from Detect-TestsInDiff invocation (gate no longer runs blind)
- Wrap test loop iterations in try-catch (ErrorActionPreference=Stop no longer kills multi-project loop)
- Fix Write-MarkdownReport to use explicit parameters instead of script-scope variables
- Make -Platform optional for UnitTest/XamlUnitTest (only required for UITest/DeviceTest)
- Add -RequireFullVerification to gate invocation in Review-PR.ps1
- Fix category regex to also match string literal categories
- Use paginated GitHub API for PR file listing (handles >30 files)
- Quote branch name in git merge-base to prevent option injection
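
To illustrate the category-regex item, here is a Python sketch of a pattern that accepts both the enum form and the string-literal form. The real regex lives in Detect-TestsInDiff.ps1 and may differ, so `CATEGORY_PATTERN` and `extract_categories` are hypothetical.

```python
import re

# Matches [Category(TestCategory.X)] and [Category("X")].
CATEGORY_PATTERN = re.compile(
    r'\[Category\(\s*(?:TestCategory\.(\w+)|"([^"]+)")\s*\)\]'
)


def extract_categories(source: str) -> list[str]:
    """Return category names in order of appearance, whichever form was used."""
    return [enum_name or literal
            for enum_name, literal in CATEGORY_PATTERN.findall(source)]


sample = '[Category(TestCategory.Editor)]\n[Category("Layout")]\n'
print(extract_categories(sample))
```

Each match fills exactly one of the two capture groups, so `enum_name or literal` picks whichever form was present.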

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@kubaflo
Contributor Author

kubaflo commented Apr 4, 2026

Addressed in 5ecf021

Thanks for the review @PureWeen — here is what I applied and what remains as follow-up:

✅ Fixed

| Finding | Fix |
|---------|-----|
| 🔴 Detect-TestsInDiff output discarded (Out-Null) | Removed the pipe to Out-Null; detection output is now visible |
| 🔴 Write-Error + ErrorActionPreference="Stop" kills loop | Wrapped both test loops (without-fix / with-fix) in try-catch; the loop now continues on error and records EnvError |
| 🟡 Write-MarkdownReport uses script-scope vars | Refactored to accept all data as explicit parameters (no more hidden dependencies) |
| 🔴 -RequireFullVerification missing | Added to gate invocation in Review-PR.ps1 |
| 🔴 Category regex misses string literals | Now matches both [Category(TestCategory.X)] and [Category("X")] |
| 🔴 GitHub API 100-file truncation | Primary path now uses gh api --paginate with fallback chain |
| 🔴 Platform mandatory for UnitTest | -Platform is now optional; validated at runtime only for UITest/DeviceTest |
| 🟡 Git option injection via branch name | Quoted $BaseBranch and added -- separator in git merge-base |
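
For readers unfamiliar with the option-injection issue, here is a hedged Python sketch of the same idea. The fix in this PR is in PowerShell; `merge_base_command` is a hypothetical helper, and rejecting names that start with `-` is an extra safeguard not mentioned above.

```python
def merge_base_command(base_branch: str, head: str = "HEAD") -> list[str]:
    """Build a git merge-base argv that cannot treat the branch as an option."""
    if base_branch.startswith("-"):
        # A fork branch named like "--some-option" must never reach git as a flag.
        raise ValueError(f"suspicious branch name: {base_branch!r}")
    # "--" ends option parsing; passing a list (no shell) avoids quoting issues.
    return ["git", "merge-base", "--", base_branch, head]


print(merge_base_command("main"))
```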

📌 Acknowledged — tracking as follow-ups

| Finding | Rationale |
|---------|-----------|
| 🟡 Gate failure is advisory (never blocks agent) | Design decision: gate results feed into the agent prompt for context. Blocking would prevent the try-fix phase from attempting repairs on failing tests. Will revisit if false-pass rates are observed. |
| 🟡 Empty array false-positive (structural) | Upstream guard (exit 1 on no tests) + new try-catch EnvError tracking mitigate this further. Residual structural flaw is defense-in-depth. |
| Automated Pester tests for scripts | Agreed this is needed; will add in a follow-up PR |


3 participants