[Membership] Ensure predecessors are killed early during startup by ReubenBond · Pull Request #9955 · dotnet/orleans

ReubenBond · 2026-02-27T22:59:53Z

Microsoft Reviewers: Open in CodeFlow

Copilot

Pull request overview

This PR updates Orleans membership startup behavior so that predecessor silo entries (older generations on the same endpoint) are declared Dead before the first membership snapshot is published to other components, aligning runtime behavior with the desired startup semantics.

Changes:

Update the membership startup test to assert predecessors are marked Dead in the first published snapshot.
Reorder RefreshInternal so publishing occurs after cleanup.
Add “acknowledged” kill requests and nullable annotations in MembershipTableManager.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
test/Orleans.Core.Tests/Membership/MembershipTableManagerTests.cs	Updates the restart scenario assertions to require predecessors be marked Dead before the first published snapshot.
src/Orleans.Runtime/MembershipService/MembershipTableManager.cs	Defers snapshot publication until after cleanup and introduces acknowledged kill requests + nullable updates.

Comments suppressed due to low confidence (1)

src/Orleans.Runtime/MembershipService/MembershipTableManager.cs:661

ProcessSuspectOrKillLists ignores the boolean result of InnerTryKill/InnerTryToSuspectOrKill and always completes acknowledged requests with success. If InnerTryKill returns false (eg, conditional update failed), CleanupMyTableEntries will still proceed and the published snapshot may still contain the predecessor as Active. Capture the return value and only complete the request successfully when it returns true; otherwise retry (up to attempts) or complete the TaskCompletionSource with failure.

                        switch (request.Type)
                        {
                            case SuspectOrKillRequest.RequestType.Kill:
                                await InnerTryKill(request.SiloAddress, _shutdownCts.Token);
                                break;
                            case SuspectOrKillRequest.RequestType.SuspectOrKill:
                                await InnerTryToSuspectOrKill(request.SiloAddress, request.OtherSilo, _shutdownCts.Token);
                                break;
                        }
                        runningFailureCount = 0;
                        request.Completion?.TrySetResult(true);
                    }

src/Orleans.Runtime/MembershipService/MembershipTableManager.cs

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Copilot · 2026-03-25T19:40:43Z

src/Orleans.Runtime/MembershipService/MembershipTableManager.cs

+            try
+            {
+                await Task.WhenAll(completions).WaitAsync(_shutdownCts.Token);
+            }
+            catch (OperationCanceledException) when (_shutdownCts.IsCancellationRequested)
+            {
+            }


The Task.WhenAll(completions).WaitAsync(_shutdownCts.Token) wait in CleanupMyTableEntries is only cancelable via shutdown and has no bounded timeout. If killing predecessors repeatedly fails (eg due to persistent membership-table errors), Start() can block forever and will bypass MaxJoinAttemptTime because AsyncExecutorWithRetries cannot preempt a hung attempt. Consider enforcing a timeout (eg based on clusterMembershipOptions.MaxJoinAttemptTime) and/or surfacing failure so the outer retry loop can continue/fail fast.

Copilot · 2026-03-25T19:40:44Z

src/Orleans.Runtime/MembershipService/MembershipTableManager.cs

+            var completions = new List<Task>(silosToDeclareDead.Count);
            foreach (var siloData in silosToDeclareDead)
            {
-                await _trySuspectOrKillChannel.Writer.WriteAsync(
-                    SuspectOrKillRequest.CreateKillRequest(siloData.Item1.SiloAddress));
+                var (request, completion) = SuspectOrKillRequest.CreateAcknowledgedKillRequest(siloData.Item1.SiloAddress);
+                await _trySuspectOrKillChannel.Writer.WaitToWriteAsync(_shutdownCts.Token);
+                if (_trySuspectOrKillChannel.Writer.TryWrite(request))
+                {
+                    completions.Add(completion);
+                }
            }

+            try
+            {
+                await Task.WhenAll(completions).WaitAsync(_shutdownCts.Token);
+            }


By awaiting kill-processing completions here, membership startup/refresh now blocks on the work queue which currently includes an exponential backoff delay before each request is processed. That means a predecessor during startup can add ~1–2s latency per entry (even with zero failures). Consider a fast path for these cleanup kills (eg process inline or avoid initial backoff for first-attempt kills) so startup isn’t penalized.

See below for a potential fix:

foreach (var siloData in silosToDeclareDead) { var (request, completion) = SuspectOrKillRequest.CreateAcknowledgedKillRequest(siloData.Item1.SiloAddress); await _trySuspectOrKillChannel.Writer.WaitToWriteAsync(_shutdownCts.Token); if (_trySuspectOrKillChannel.Writer.TryWrite(request)) { _ = completion.ContinueWith( static t => { // Observe any exception from the kill request without blocking startup. _ = t.Exception; }, CancellationToken.None, TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously, TaskScheduler.Default); } }

…components observing them

- Use non-generic TaskCompletionSource instead of TaskCompletionSource<bool> since the result value is never used (pentp) - Fix misleading comment on myEtag null check to match actual logic - Use WaitToWriteAsync+TryWrite pattern in CleanupMyTableEntries to avoid relying on DropOldest semantics for acknowledged requests, and add cancellation via WaitAsync(_shutdownCts.Token) to prevent hanging - Always retry failed requests regardless of whether Completion is set; Completion is only signaled on success, not on individual failures Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings February 27, 2026 22:59

Copilot AI reviewed Feb 27, 2026

View reviewed changes

src/Orleans.Runtime/MembershipService/MembershipTableManager.cs Outdated Show resolved Hide resolved

src/Orleans.Runtime/MembershipService/MembershipTableManager.cs Show resolved Hide resolved

src/Orleans.Runtime/MembershipService/MembershipTableManager.cs Outdated Show resolved Hide resolved

pentp reviewed Mar 8, 2026

View reviewed changes

src/Orleans.Runtime/MembershipService/MembershipTableManager.cs Outdated Show resolved Hide resolved

ReubenBond force-pushed the fix/remove-predecessor-early branch 3 times, most recently from 5e0b015 to 95273db Compare March 25, 2026 19:22

ReubenBond requested a review from Copilot March 25, 2026 19:35

Copilot started reviewing on behalf of ReubenBond March 25, 2026 19:36 View session

Copilot AI reviewed Mar 25, 2026

View reviewed changes

ReubenBond and others added 3 commits March 26, 2026 15:16

[Membership] Ensure predecessors are marked as 'Dead' prior to other …

8d9fdac

…components observing them

pr suggestions

53c9ae5

ReubenBond force-pushed the fix/remove-predecessor-early branch from 95273db to 53c9ae5 Compare March 26, 2026 22:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Membership] Ensure predecessors are killed early during startup#9955

[Membership] Ensure predecessors are killed early during startup#9955
ReubenBond wants to merge 3 commits intodotnet:mainfrom
ReubenBond:fix/remove-predecessor-early

ReubenBond commented Feb 27, 2026 •

edited by dotnet-policy-service bot

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ReubenBond commented Feb 27, 2026 • edited by dotnet-policy-service bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Microsoft Reviewers: Open in CodeFlow

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ReubenBond commented Feb 27, 2026 •

edited by dotnet-policy-service bot

Loading