[Membership] Ensure predecessors are killed early during startup #9955
ReubenBond wants to merge 3 commits into dotnet:main
Conversation
Pull request overview
This PR updates Orleans membership startup behavior so that predecessor silo entries (older generations on the same endpoint) are declared Dead before the first membership snapshot is published to other components, aligning runtime behavior with the desired startup semantics.
Changes:
- Update the membership startup test to assert predecessors are marked Dead in the first published snapshot.
- Reorder RefreshInternal so publishing occurs after cleanup.
- Add "acknowledged" kill requests and nullable annotations in MembershipTableManager.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| test/Orleans.Core.Tests/Membership/MembershipTableManagerTests.cs | Updates the restart scenario assertions to require predecessors be marked Dead before the first published snapshot. |
| src/Orleans.Runtime/MembershipService/MembershipTableManager.cs | Defers snapshot publication until after cleanup and introduces acknowledged kill requests + nullable updates. |
Comments suppressed due to low confidence (1)
src/Orleans.Runtime/MembershipService/MembershipTableManager.cs:661
- ProcessSuspectOrKillLists ignores the boolean result of InnerTryKill/InnerTryToSuspectOrKill and always completes acknowledged requests with success. If InnerTryKill returns false (e.g., a conditional update failed), CleanupMyTableEntries will still proceed and the published snapshot may still contain the predecessor as Active. Capture the return value and only complete the request successfully when it returns true; otherwise retry (up to the attempt limit) or complete the TaskCompletionSource with failure.
switch (request.Type)
{
case SuspectOrKillRequest.RequestType.Kill:
await InnerTryKill(request.SiloAddress, _shutdownCts.Token);
break;
case SuspectOrKillRequest.RequestType.SuspectOrKill:
await InnerTryToSuspectOrKill(request.SiloAddress, request.OtherSilo, _shutdownCts.Token);
break;
}
runningFailureCount = 0;
request.Completion?.TrySetResult(true);
}
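A minimal, self-contained sketch of the reviewer's suggested fix, with stand-in types for the PR's internal SuspectOrKillRequest/InnerTryKill machinery (all names below are illustrative, not the actual Orleans API): capture the boolean result, retry on failure, and only complete the acknowledgement on success.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for the PR's acknowledged kill request.
sealed class KillRequest
{
    public string Silo { get; }
    public TaskCompletionSource Completion { get; } = new();
    public KillRequest(string silo) => Silo = silo;
}

static class KillSketch
{
    // tryKill stands in for InnerTryKill, which is assumed to return
    // Task<bool> indicating whether the conditional table update succeeded.
    public static async Task<bool> ProcessAsync(
        KillRequest request,
        Func<string, CancellationToken, Task<bool>> tryKill,
        int maxAttempts,
        CancellationToken ct)
    {
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            // Capture the result instead of discarding it.
            if (await tryKill(request.Silo, ct))
            {
                request.Completion.TrySetResult();
                return true;
            }
        }

        // Surface failure instead of reporting success, so callers awaiting
        // the acknowledgement never observe the predecessor as still Active.
        request.Completion.TrySetException(
            new InvalidOperationException($"Could not declare {request.Silo} dead."));
        return false;
    }
}
```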
Force-pushed from 5e0b015 to 95273db.
try
{
    await Task.WhenAll(completions).WaitAsync(_shutdownCts.Token);
}
catch (OperationCanceledException) when (_shutdownCts.IsCancellationRequested)
{
}
The Task.WhenAll(completions).WaitAsync(_shutdownCts.Token) wait in CleanupMyTableEntries is only cancelable via shutdown and has no bounded timeout. If killing predecessors repeatedly fails (e.g., due to persistent membership-table errors), Start() can block forever and will bypass MaxJoinAttemptTime because AsyncExecutorWithRetries cannot preempt a hung attempt. Consider enforcing a timeout (e.g., based on clusterMembershipOptions.MaxJoinAttemptTime) and/or surfacing failure so the outer retry loop can continue/fail fast.
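A sketch of the bounded-wait idea, using the standard Task.WaitAsync(TimeSpan, CancellationToken) overload (available since .NET 6). The method and parameter names are illustrative; a never-completing task stands in for a kill that keeps failing, and the timeout converts the hang into a failure an outer retry loop could act on.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class BoundedWaitSketch
{
    // Returns true if all completions finished within the bound,
    // false on timeout or shutdown cancellation.
    public static async Task<bool> WaitForCompletionsAsync(
        Task completions, TimeSpan maxJoinAttemptTime, CancellationToken shutdown)
    {
        try
        {
            await completions.WaitAsync(maxJoinAttemptTime, shutdown);
            return true;
        }
        catch (TimeoutException)
        {
            // Surface the failure so an outer retry loop can continue or fail fast
            // instead of blocking Start() indefinitely.
            return false;
        }
        catch (OperationCanceledException) when (shutdown.IsCancellationRequested)
        {
            return false;
        }
    }
}
```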
var completions = new List<Task>(silosToDeclareDead.Count);
foreach (var siloData in silosToDeclareDead)
{
-   await _trySuspectOrKillChannel.Writer.WriteAsync(
-       SuspectOrKillRequest.CreateKillRequest(siloData.Item1.SiloAddress));
+   var (request, completion) = SuspectOrKillRequest.CreateAcknowledgedKillRequest(siloData.Item1.SiloAddress);
+   await _trySuspectOrKillChannel.Writer.WaitToWriteAsync(_shutdownCts.Token);
+   if (_trySuspectOrKillChannel.Writer.TryWrite(request))
+   {
+       completions.Add(completion);
+   }
}

try
{
    await Task.WhenAll(completions).WaitAsync(_shutdownCts.Token);
}
By awaiting kill-processing completions here, membership startup/refresh now blocks on the work queue, which currently includes an exponential backoff delay before each request is processed. That means a predecessor during startup can add ~1–2s of latency per entry (even with zero failures). Consider a fast path for these cleanup kills (e.g., process them inline, or skip the initial backoff for first-attempt kills) so startup isn't penalized.
See below for a potential fix:
foreach (var siloData in silosToDeclareDead)
{
var (request, completion) = SuspectOrKillRequest.CreateAcknowledgedKillRequest(siloData.Item1.SiloAddress);
await _trySuspectOrKillChannel.Writer.WaitToWriteAsync(_shutdownCts.Token);
if (_trySuspectOrKillChannel.Writer.TryWrite(request))
{
_ = completion.ContinueWith(
static t =>
{
// Observe any exception from the kill request without blocking startup.
_ = t.Exception;
},
CancellationToken.None,
TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously,
TaskScheduler.Default);
}
}
…components observing them
- Use non-generic TaskCompletionSource instead of TaskCompletionSource<bool> since the result value is never used (pentp)
- Fix misleading comment on myEtag null check to match actual logic
- Use WaitToWriteAsync+TryWrite pattern in CleanupMyTableEntries to avoid relying on DropOldest semantics for acknowledged requests, and add cancellation via WaitAsync(_shutdownCts.Token) to prevent hanging
- Always retry failed requests regardless of whether Completion is set; Completion is only signaled on success, not on individual failures

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
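The WaitToWriteAsync+TryWrite pattern and the non-generic TaskCompletionSource from the commit message can be illustrated in isolation. This is a hedged sketch with illustrative names (AckRequest, TryEnqueueAsync are not Orleans types): waiting for capacity and then attempting the write lets the producer observe whether its request was actually enqueued, rather than relying on a DropOldest channel silently accepting it.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Illustrative acknowledged request: non-generic TaskCompletionSource,
// since the result value itself is never used.
sealed class AckRequest
{
    public string Silo { get; }
    public TaskCompletionSource Completion { get; } = new();
    public AckRequest(string silo) => Silo = silo;
}

static class ChannelSketch
{
    public static async Task<bool> TryEnqueueAsync(
        ChannelWriter<AckRequest> writer, AckRequest request, CancellationToken ct)
    {
        // Wait until the channel reports capacity (or completes), then try the write.
        while (await writer.WaitToWriteAsync(ct))
        {
            if (writer.TryWrite(request)) return true;
        }
        return false; // channel was completed; the request was never enqueued
    }
}
```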
Force-pushed from 95273db to 53c9ae5.