Skip to content

[Membership] Ensure predecessors are killed early during startup#9955

Open
ReubenBond wants to merge 3 commits intodotnet:mainfrom
ReubenBond:fix/remove-predecessor-early
Open

[Membership] Ensure predecessors are killed early during startup#9955
ReubenBond wants to merge 3 commits intodotnet:mainfrom
ReubenBond:fix/remove-predecessor-early

Conversation

@ReubenBond
Copy link
Copy Markdown
Member

@ReubenBond ReubenBond commented Feb 27, 2026

Microsoft Reviewers: Open in CodeFlow

Copilot AI review requested due to automatic review settings February 27, 2026 22:59
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Orleans membership startup behavior so that predecessor silo entries (older generations on the same endpoint) are declared Dead before the first membership snapshot is published to other components, aligning runtime behavior with the desired startup semantics.

Changes:

  • Update the membership startup test to assert predecessors are marked Dead in the first published snapshot.
  • Reorder RefreshInternal so publishing occurs after cleanup.
  • Add “acknowledged” kill requests and nullable annotations in MembershipTableManager.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
test/Orleans.Core.Tests/Membership/MembershipTableManagerTests.cs Updates the restart scenario assertions to require predecessors be marked Dead before the first published snapshot.
src/Orleans.Runtime/MembershipService/MembershipTableManager.cs Defers snapshot publication until after cleanup and introduces acknowledged kill requests + nullable updates.
Comments suppressed due to low confidence (1)

src/Orleans.Runtime/MembershipService/MembershipTableManager.cs:661

  • ProcessSuspectOrKillLists ignores the boolean result of InnerTryKill/InnerTryToSuspectOrKill and always completes acknowledged requests with success. If InnerTryKill returns false (eg, conditional update failed), CleanupMyTableEntries will still proceed and the published snapshot may still contain the predecessor as Active. Capture the return value and only complete the request successfully when it returns true; otherwise retry (up to attempts) or complete the TaskCompletionSource with failure.
                        switch (request.Type)
                        {
                            case SuspectOrKillRequest.RequestType.Kill:
                                await InnerTryKill(request.SiloAddress, _shutdownCts.Token);
                                break;
                            case SuspectOrKillRequest.RequestType.SuspectOrKill:
                                await InnerTryToSuspectOrKill(request.SiloAddress, request.OtherSilo, _shutdownCts.Token);
                                break;
                        }
                        runningFailureCount = 0;
                        request.Completion?.TrySetResult(true);
                    }

@ReubenBond ReubenBond force-pushed the fix/remove-predecessor-early branch 3 times, most recently from 5e0b015 to 95273db Compare March 25, 2026 19:22
@ReubenBond ReubenBond requested a review from Copilot March 25, 2026 19:35
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment on lines +540 to +546
try
{
await Task.WhenAll(completions).WaitAsync(_shutdownCts.Token);
}
catch (OperationCanceledException) when (_shutdownCts.IsCancellationRequested)
{
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Task.WhenAll(completions).WaitAsync(_shutdownCts.Token) wait in CleanupMyTableEntries is only cancelable via shutdown and has no bounded timeout. If killing predecessors repeatedly fails (eg due to persistent membership-table errors), Start() can block forever and will bypass MaxJoinAttemptTime because AsyncExecutorWithRetries cannot preempt a hung attempt. Consider enforcing a timeout (eg based on clusterMembershipOptions.MaxJoinAttemptTime) and/or surfacing failure so the outer retry loop can continue/fail fast.

Copilot uses AI. Check for mistakes.
Comment on lines +529 to +543
var completions = new List<Task>(silosToDeclareDead.Count);
foreach (var siloData in silosToDeclareDead)
{
await _trySuspectOrKillChannel.Writer.WriteAsync(
SuspectOrKillRequest.CreateKillRequest(siloData.Item1.SiloAddress));
var (request, completion) = SuspectOrKillRequest.CreateAcknowledgedKillRequest(siloData.Item1.SiloAddress);
await _trySuspectOrKillChannel.Writer.WaitToWriteAsync(_shutdownCts.Token);
if (_trySuspectOrKillChannel.Writer.TryWrite(request))
{
completions.Add(completion);
}
}

try
{
await Task.WhenAll(completions).WaitAsync(_shutdownCts.Token);
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By awaiting kill-processing completions here, membership startup/refresh now blocks on the work queue which currently includes an exponential backoff delay before each request is processed. That means a predecessor during startup can add ~1–2s latency per entry (even with zero failures). Consider a fast path for these cleanup kills (eg process inline or avoid initial backoff for first-attempt kills) so startup isn’t penalized.

See below for a potential fix:

            foreach (var siloData in silosToDeclareDead)
            {
                var (request, completion) = SuspectOrKillRequest.CreateAcknowledgedKillRequest(siloData.Item1.SiloAddress);
                await _trySuspectOrKillChannel.Writer.WaitToWriteAsync(_shutdownCts.Token);
                if (_trySuspectOrKillChannel.Writer.TryWrite(request))
                {
                    _ = completion.ContinueWith(
                        static t =>
                        {
                            // Observe any exception from the kill request without blocking startup.
                            _ = t.Exception;
                        },
                        CancellationToken.None,
                        TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously,
                        TaskScheduler.Default);
                }
            }

Copilot uses AI. Check for mistakes.
ReubenBond and others added 3 commits March 26, 2026 15:16
- Use non-generic TaskCompletionSource instead of TaskCompletionSource<bool>
  since the result value is never used (pentp)
- Fix misleading comment on myEtag null check to match actual logic
- Use WaitToWriteAsync+TryWrite pattern in CleanupMyTableEntries to avoid
  relying on DropOldest semantics for acknowledged requests, and add
  cancellation via WaitAsync(_shutdownCts.Token) to prevent hanging
- Always retry failed requests regardless of whether Completion is set;
  Completion is only signaled on success, not on individual failures

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ReubenBond ReubenBond force-pushed the fix/remove-predecessor-early branch from 95273db to 53c9ae5 Compare March 26, 2026 22:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants