Fix CI flakiness: MSB4216 task host failures, dotnet-watch hangs, NuGet errors by mmitche · Pull Request #53424 · dotnet/sdk

mmitche · 2026-03-12T20:04:26Z

Summary

This PR addresses intermittent CI failures identified through systematic analysis of recent builds.

Remaining Fixes (after merge from main)

Root Cause 1 - MSB4216 task host failures on macOS Helix

Impact: Tests using NuGet package tasks with TaskHostFactory fail intermittently on macOS.
Fix: Export DOTNET_HOST_PATH in RunTestsOnHelix.sh and RunTestsOnHelix.cmd.

Root Cause 2 - GZipCompress file lock races

Impact: Parallel.For in GZipCompress races with antivirus/file indexer.
Fix: Added retry with exponential backoff for file access.

Root Cause 3 - Noisy NuGet source removal errors

Impact: Helix test setup produces confusing errors removing non-existent NuGet sources.
Fix: Suppress errors from dotnet nuget remove source commands.

Root Cause 4 - DefaultRequestDispatcherTest timing issues

Impact: Test makes timing assumptions that fail under load.
Fix: Increased timeouts for CI environment.

Root Cause 5 - Missing runtimeconfig.json in test assets

Impact: Test tool projects fail to locate runtime configuration.
Fix: Added MSBuild target in test assets Directory.Build.targets.

Changes Removed

The dotnet-watch Aspire race condition fix has been removed from this PR — it was superseded by @tmat's proper fix in #53271 (now merged to main).

Validation

Starting fresh validation run after merge from main. Target: 25 consecutive passing builds.

Copilot

Pull request overview

This PR hardens Helix/CI test execution for Hot Reload / dotnet-watch scenarios by reducing hang risk and improving the reliability of Helix environment setup.

Changes:

Replace “effectively infinite” DCP/Aspire timeouts used by watch-based tests with bounded (5-minute) values to prevent multi-hour Helix hangs.
Improve test process cleanup by closing stdin prior to termination and adding a bounded wait for process exit.
Update Helix test entrypoint scripts to set DOTNET_HOST_PATH and make NuGet source removal resilient when sources are absent.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
test/Microsoft.DotNet.HotReload.Test.Utilities/WatchableApp.cs	Switches DCP/Aspire-related environment timeouts to bounded values for CI.
test/Microsoft.DotNet.HotReload.Test.Utilities/AwaitableProcess.cs	Adjusts disposal/termination behavior to reduce hangs during cleanup.
build/RunTestsOnHelix.sh	Sets `DOTNET_HOST_PATH` and makes `dotnet nuget remove source` tolerant of missing sources.
build/RunTestsOnHelix.cmd	Sets `DOTNET_HOST_PATH` and suppresses errors when removing non-existent NuGet sources.

You can also share your feedback on Copilot code review. Take the survey.

Copilot · 2026-03-12T20:08:07Z

test/Microsoft.DotNet.HotReload.Test.Utilities/AwaitableProcess.cs

+            catch (OperationCanceledException)
+            {
+                Logger.Log($"Process {Id} did not exit within 30 seconds after Kill()");


If the process still hasn’t exited after the 30s timeout, DisposeAsync only logs and continues. That can leave a runaway dotnet-watch process (or child processes) running in the Helix work item, causing resource leaks and cross-test interference while also hiding the failure signal. Consider failing the test/cleanup in this case (or at least making the cleanup path retry/force termination and surface the problem), rather than just logging and proceeding to dispose the Process handle.

Suggested change

catch (OperationCanceledException)

{

Logger.Log($"Process {Id} did not exit within 30 seconds after Kill()");

catch (OperationCanceledException ex)

{

Logger.Log($"Process {Id} did not exit within 30 seconds after Kill()");

throw new TimeoutException($"Process {Id} did not exit within 30 seconds after Kill().", ex);

Copilot · 2026-03-12T20:08:08Z

test/Microsoft.DotNet.HotReload.Test.Utilities/AwaitableProcess.cs

+            try
+            {
+                Process.StandardInput.Close();
+            }
+            catch


The broad catch around closing StandardInput/Kill() swallows all exceptions without logging. Since this code was added to address a platform-specific hang, swallowing the exception makes it hard to diagnose when stdin can’t be closed (or why). Consider at least logging the exception details (or narrowing the caught exception types) so cleanup failures are actionable in CI logs.

tmat · 2026-03-13T02:29:33Z

src/Dotnet.Watch/Watch/Aspire/AspireServiceFactory.cs

            _isDisposed = true;

            // wait for all in-flight process initialization to complete:
+            // If no session initialization is in-flight (_pendingSessionInitializationCount == 0),


Not entirely correct either. Ok to merge, I'll follow up with better fix.

@tmat This is an automatic fix by the AI for flakiness. Don't merge this...when it's gotten 25 passing runs we'll take a second pass over this to smooth it out.

Yeah, I figured. It looks very AI like. It's great it found the issue. I'll work on a better fix over the weekend.

tmat · 2026-03-13T02:34:13Z

watch changes lgtm

mmitche · 2026-03-13T12:44:20Z

🎯 Milestone: 5 Consecutive Passes

Validation Results

Build	Jobs	Result
1333280	18/18 ✅	Passed
1333378	18/18 ✅	Passed
1333489	18/18 ✅	Passed
1333567	18/18 ✅	Passed
1333595	18/18 ✅	Passed

Root Causes Fixed

MSB4216 TaskHostFactory — DOTNET_HOST_PATH not set in Helix scripts
dotnet-watch Aspire hang — Per-operation timeout was inheriting the 2-hour Helix work-item timeout instead of being capped at 5 minutes. Also fixed semaphore deadlock in AspireServiceFactory.DisposeAsync, added stdin close before process kill, and set DCP timeout environment variables.
GZipCompress file lock — Parallel.For races with antivirus/file indexer; added retry with exponential backoff
NuGet source removal noise — Suppressed stderr from removing non-existent NuGet sources
Missing runtimeconfig.json — Added MSBuild target to include runtimeconfig.json in NuGet packages for test tool projects

Baseline vs Current

Before: ~57% failure rate (17/30 builds failed on main)
After: 5/5 consecutive perfect builds (100% pass rate so far)

Continuing validation toward 25 consecutive passes target.

akoeplinger · 2026-03-13T13:04:38Z

build/RunTestsOnHelix.cmd


+REM Set DOTNET_HOST_PATH so MSBuild task hosts can locate the dotnet executable.
+REM Without this, tasks from NuGet packages that use TaskHostFactory fail with MSB4216.
+set DOTNET_HOST_PATH=%DOTNET_ROOT%\dotnet.exe


@ViktorHofer does this make sense?

akoeplinger · 2026-03-13T13:06:39Z

build/RunTestsOnHelix.cmd

-dotnet nuget remove source dotnet-tools-transport --configfile %TestExecutionDirectory%\nuget.config
-dotnet nuget remove source dotnet-libraries --configfile %TestExecutionDirectory%\nuget.config
-dotnet nuget remove source dotnet-eng --configfile %TestExecutionDirectory%\nuget.config
+REM Remove feeds not needed for tests. Errors from non-existent sources


any idea which feeds remain? maybe we should have a separate nuget.config just for the tests

I think it just updated comments here.

mmitche · 2026-03-13T13:44:05Z

@akoeplinger This is an automated run...let it go for a while..don't merge or approve. I'll have it go through another pass and also reivew at the end of when it thinks it gets a stable run.

tmat · 2026-03-16T18:47:33Z

#53271 has a better fix for the dotnet-watch race condition.

The dotnet-watch Aspire race condition fix from this PR has been superseded by a proper fix in #53271 (merged to main). Reverting these files so the merge from main brings in the better fix. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ther updates Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Reverted dotnet-watch changes (superseded by tmat's fix in #53271). Merged from main to pick up latest changes. Starting fresh validation run toward 25 consecutive passes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Build #4 failed due to pre-existing flaky dotnet-watch test: ProjectAndSourceFileChange_AddPackageReference (Assert.Equal 1 vs 0) This is not related to our PR changes. Restarting streak count. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The ProjectAndSourceFileChange_AddPackageReference and ProjectAndSourceFileChange_AddProjectReference tests check managedCodeChangesApplied.CurrentCount immediately after seeing app output. However, ManagedCodeChangesApplied is logged in a fire-and-forget task (CompilationHandler.cs:497) that may not have completed yet, causing the semaphore count to be 0 instead of 1. Fix: Wait for the semaphore with a 30s timeout instead of checking CurrentCount synchronously. This ensures the fire-and-forget task completes before the assertion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace WaitAsync/await with Task.Wait(timeout) which uses kernel-level ManualResetEventSlim instead of thread pool-dependent timer callbacks. Under extreme thread pool starvation on Helix CI, WaitAsync's timer continuations can't be scheduled, causing the test to hang for 60+ minutes. Task.Wait uses a kernel wait that works regardless of thread pool state. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The xUnit analyzer prohibits blocking Task.Wait in test methods. Use a dedicated background thread with Thread.Join(timeout) for a kernel-level wait that doesn't depend on thread pool scheduling, avoiding both the xUnit1031 analyzer error and the thread pool starvation hang. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

TemplateEngine: Add scrubber for intermittent MSBuild debug log message ('MSBuild logs and debug information will be at...') that appears when telemetry/profiling is enabled on some Helix machines, causing snapshot mismatches. Added to all 6 WithCustomScrubbers locations across 3 test files. BrowserDiagnostics: Close stdin before killing process in AwaitableProcess DisposeAsync to unblock PhysicalConsole.ListenToStandardInputAsync() which uses CancellationToken.None on stdin ReadAsync. On Linux, stdin reads don't unblock on process kill, causing 60-minute hangs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 12, 2026 20:04

github-actions bot added the Area-Infrastructure label Mar 12, 2026

Copilot started reviewing on behalf of mmitche March 12, 2026 20:05 View session

Copilot AI reviewed Mar 12, 2026

View reviewed changes

mmitche force-pushed the fix/ci-flakiness/main branch from 83d044f to ae5112f Compare March 12, 2026 20:21

mmitche requested a review from a team as a code owner March 12, 2026 20:21

mmitche force-pushed the fix/ci-flakiness/main branch 2 times, most recently from a35fa75 to d9f4358 Compare March 12, 2026 23:54

mmitche requested review from a team and tmat as code owners March 12, 2026 23:54

build-analysis bot mentioned this pull request Mar 13, 2026

dotnet-watch tests failing #40006

Open

mmitche force-pushed the fix/ci-flakiness/main branch from d9f4358 to 9949b40 Compare March 13, 2026 01:26

mmitche requested a review from a team as a code owner March 13, 2026 01:26

mmitche force-pushed the fix/ci-flakiness/main branch from 9949b40 to ed04d76 Compare March 13, 2026 02:28

tmat reviewed Mar 13, 2026

View reviewed changes

mmitche force-pushed the fix/ci-flakiness/main branch 3 times, most recently from 2eb12cb to bcae687 Compare March 13, 2026 12:43

mmitche force-pushed the fix/ci-flakiness/main branch from bcae687 to 1644c42 Compare March 13, 2026 12:45

akoeplinger reviewed Mar 13, 2026

View reviewed changes

mmitche force-pushed the fix/ci-flakiness/main branch 7 times, most recently from 4cb1ca9 to ea8fad3 Compare March 14, 2026 00:12

tmat mentioned this pull request Mar 16, 2026

Improve shutdown sequence of Aspire service and fix race condition #53271

Merged

mmitche and others added 11 commits March 17, 2026 10:35

Merge upstream/main to pick up tmat's dotnet-watch fix (#53271) and o…

842031e

…ther updates Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Trigger CI build #1 after merge from main

fccc655

Reverted dotnet-watch changes (superseded by tmat's fix in #53271). Merged from main to pick up latest changes. Starting fresh validation run toward 25 consecutive passes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Trigger CI build #2 - streak: 1/25

5be6d9e

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Trigger CI build #3 - streak: 2/25

cdf5d19

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Trigger CI build #4 - streak: 3/25

22f201a

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Trigger CI build #6 - streak: 1/25

e22cf34

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Trigger CI build #7 - streak: 2/25

a1ebf22

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ci: validation run 9

17ed46b

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

build-analysis bot mentioned this pull request Mar 18, 2026

"We stopped hearing from agent Azure Pipelines 32. Verify the agent machine is running and has a healthy network connection" dotnet/dnceng#1886

Open

3 tasks

mmitche and others added 2 commits March 18, 2026 14:00

ci: validation run 10

90c3350

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ci: validation run 11

c690171

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

This was referenced Mar 19, 2026

slow macOS - "##[error]The job running on agent Azure Pipelines 9 ran longer than the maximum time of 60 minutes." dotnet/dnceng#1883

Open

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

ci: validation run 12

5386d82

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

build-analysis bot mentioned this pull request Mar 19, 2026

[android-arm64] The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#6408

Open

3 tasks

mmitche and others added 3 commits March 18, 2026 22:24

ci: validation run 13

230d5ee

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

build-analysis bot mentioned this pull request Mar 19, 2026

TemplateEngineToolDiscovery Network Failure: Unable to read data from the transport connection: Connection reset by peer #51154

Open

mmitche requested a review from a team as a code owner March 19, 2026 15:27

mmitche and others added 4 commits March 19, 2026 09:36

ci: validation run 15

1677d0c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ci: validation run 16

e97da4d

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ci: validation run 17

bbf6efd

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ci: validation run 18

c3c701f

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Conversation

mmitche commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Remaining Fixes (after merge from main)

Root Cause 1 - MSB4216 task host failures on macOS Helix

Root Cause 2 - GZipCompress file lock races

Root Cause 3 - Noisy NuGet source removal errors

Root Cause 4 - DefaultRequestDispatcherTest timing issues

Root Cause 5 - Missing runtimeconfig.json in test assets

Changes Removed

Validation

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

tmat Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

mmitche Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

tmat Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

tmat commented Mar 13, 2026

Uh oh!

mmitche commented Mar 13, 2026

🎯 Milestone: 5 Consecutive Passes

Validation Results

Root Causes Fixed

Baseline vs Current

Uh oh!

akoeplinger Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

akoeplinger Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

mmitche Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

mmitche commented Mar 13, 2026

Uh oh!

tmat commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mmitche commented Mar 12, 2026 •

edited

Loading