Implement slow-start budget for orphaned shard claims and enhance related logging #9943
Open
benjaminpetit wants to merge 3 commits into dotnet:main
Pull request overview
This pull request implements a "slow-start" mechanism for orphaned job shard claiming in Orleans' Durable Jobs feature to prevent silos from overwhelming themselves during startup and disaster recovery scenarios. The implementation adds configurable limits that ramp up over time, integrates with overload detection to pause claims when needed, and includes comprehensive logging for observability.
Changes:
- Added three new configuration options (`SlowStartInitialBudget`, `SlowStartMaxBudget`, `SlowStartRampUpDuration`) with validation to control the slow-start behavior
- Modified the `AssignJobShardsAsync` API across `JobShardManager` implementations to accept a `maxNewClaims` parameter that enforces the budget
- Implemented ramp-up logic in `LocalDurableJobManager` that computes the current claim budget, tracks claimed shards, and respects overload detection
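
The ramp-up and overload behavior listed above can be sketched as follows. This is a minimal illustration, not the PR's code: the class name, the `ComputeMaxNewClaims` helper, and the rule "remaining allowance = budget minus shards already claimed" are assumptions. The linear shape of the ramp is taken from a later commit message in this thread, which mentions linear interpolation and a static `ComputeClaimBudget` method; the exact signature and rounding used there are not shown in this page.

```csharp
using System;

// Hypothetical helper: names, signatures, and rounding are assumptions.
public static class ClaimBudgetSketch
{
    // Ramps linearly from initialBudget to maxBudget over rampUpDuration.
    public static int ComputeClaimBudget(
        int initialBudget, int maxBudget, TimeSpan rampUpDuration, TimeSpan elapsed)
    {
        // No ramp configured, or ramp finished: the full budget applies.
        if (rampUpDuration <= TimeSpan.Zero || elapsed >= rampUpDuration)
        {
            return maxBudget;
        }

        // Before or at startup: only the initial budget applies.
        if (elapsed <= TimeSpan.Zero)
        {
            return initialBudget;
        }

        double fraction = elapsed.Ticks / (double)rampUpDuration.Ticks;
        // Use long for the difference so extreme budgets cannot overflow int.
        return initialBudget + (int)(((long)maxBudget - initialBudget) * fraction);
    }

    // New claims allowed in this cycle: zero while overloaded (assumed),
    // otherwise whatever part of the current budget is still unused.
    public static int ComputeMaxNewClaims(int currentBudget, int totalClaimed, bool isOverloaded)
    {
        if (isOverloaded)
        {
            return 0;
        }

        return Math.Max(0, currentBudget - totalClaimed);
    }
}
```

Under these assumptions, the value returned by `ComputeMaxNewClaims` is what a caller would pass as `maxNewClaims` to `AssignJobShardsAsync` on each assignment cycle.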
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `src/Orleans.DurableJobs/Hosting/DurableJobsOptions.cs` | Added three new slow-start configuration properties with XML documentation and validation logic |
| `src/Orleans.DurableJobs/LocalDurableJobManager.cs` | Implemented slow-start state tracking, budget computation with overload integration, and orphaned shard claim counting |
| `src/Orleans.DurableJobs/LocalDurableJobManager.Log.cs` | Added three new log messages for claim budget, orphaned claims, and overload pauses |
| `src/Orleans.DurableJobs/JobShardManager.cs` | Updated `AssignJobShardsAsync` signature to include `maxNewClaims` parameter and implemented budget enforcement in `InMemoryJobShardManager` |
| `src/Azure/Orleans.DurableJobs.AzureStorage/AzureStorageJobShardManager.cs` | Implemented budget enforcement for the Azure Storage provider |
| `test/Tester/DurableJobs/JobShardManagerTestsRunner.cs` | Added three comprehensive tests for slow-start behavior and updated all existing test calls with `int.MaxValue` |
| `test/Tester/DurableJobs/InMemoryJobShardManagerTests.cs` | Added test method delegates for the three new slow-start tests |
| `test/Extensions/TesterAzureUtils/DurableJobs/AzureStorageJobShardManagerTests.cs` | Added test method delegates for the three new slow-start tests with appropriate test categories |
| `test/Extensions/TesterAzureUtils/DurableJobs/AzureStorageJobShardBatchingTests.cs` | Updated all test calls to use `maxNewClaims: int.MaxValue` for unlimited claiming |
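
The summary for `DurableJobsOptions.cs` mentions validation logic without stating the rules. The sketch below shows one plausible shape for such checks. Everything in it is an assumption made for illustration: the stand-in options class, the property names, and the three rules (non-negative initial budget, maximum not below initial, non-negative ramp-up duration) are hypothetical and may differ from what the PR enforces.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the three slow-start settings; the real
// properties live on DurableJobsOptions and may be named differently.
public sealed class ShardClaimOptionsSketch
{
    public int InitialBudget { get; set; }
    public int MaxBudget { get; set; }
    public TimeSpan RampUpDuration { get; set; }
}

public static class ShardClaimOptionsValidatorSketch
{
    // Returns every rule violation; an empty list means the options are valid.
    // The rules themselves are assumed, not taken from the PR.
    public static IReadOnlyList<string> Validate(ShardClaimOptionsSketch options)
    {
        var errors = new List<string>();

        if (options.InitialBudget < 0)
        {
            errors.Add("InitialBudget must be non-negative.");
        }

        if (options.MaxBudget < options.InitialBudget)
        {
            errors.Add("MaxBudget must be greater than or equal to InitialBudget.");
        }

        if (options.RampUpDuration < TimeSpan.Zero)
        {
            errors.Add("RampUpDuration must be non-negative.");
        }

        return errors;
    }
}
```

Collecting all violations rather than failing on the first one lets a misconfigured silo report every problem in a single startup error.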
ReubenBond reviewed Feb 20, 2026
ReubenBond reviewed Feb 20, 2026
6dc83a8 to fc150cf
d39aade to 8d9a548
…ering, add tests
- Rename `SlowStartInitialBudget`/`SlowStartMaxBudget`/`SlowStartRampUpDuration` to `ShardClaimInitialBudget`/`ShardClaimMaxBudget`/`ShardClaimRampUpDuration` to avoid naming collision with existing concurrency slow-start options.
- Rename `_totalClaimedOrphanedShards` to `_totalClaimedShards` (counts all new claims).
- Fix `InMemoryJobShardManager`: move budget check before `AdoptedCount` increment to prevent false poison detection when budget is exhausted.
- Extract `ComputeClaimBudget` into a testable static method.
- Add `ShardClaimBudgetTests`: linear interpolation, edge cases, validator rules.
- Add `SlowStart_BudgetExhaustion_DoesNotInflateAdoptedCount` test to runner.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
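
The `InMemoryJobShardManager` fix in this commit message is about ordering: if `AdoptedCount` is incremented before the budget check, a shard that is merely skipped for lack of budget looks like it was adopted again, which can trip poison detection. The sketch below illustrates that ordering only. The `ShardSketch` type, the `ClaimOrphans` method, and the use of a null owner to mark an orphan are hypothetical simplifications, not the PR's implementation.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical minimal shard record; the real shard metadata is richer.
public sealed class ShardSketch
{
    public string Id { get; set; } = "";
    public string? Owner { get; set; }      // null means orphaned (assumed)
    public int AdoptedCount { get; set; }   // times the shard has been adopted
}

public static class BudgetedClaimSketch
{
    // Claims at most maxNewClaims orphaned shards for the given claimant.
    public static List<ShardSketch> ClaimOrphans(
        IEnumerable<ShardSketch> shards, string claimant, int maxNewClaims)
    {
        var claimed = new List<ShardSketch>();
        foreach (var shard in shards)
        {
            if (shard.Owner != null)
            {
                continue; // already owned, not a claim candidate
            }

            // Budget check comes first, so shards left unclaimed because the
            // budget ran out do not have their AdoptedCount touched.
            if (claimed.Count >= maxNewClaims)
            {
                break;
            }

            shard.AdoptedCount++;
            shard.Owner = claimant;
            claimed.Add(shard);
        }

        return claimed;
    }
}
```

With the check placed after the increment instead, every exhausted-budget cycle would bump the counter on unclaimed shards, which is the behavior the `SlowStart_BudgetExhaustion_DoesNotInflateAdoptedCount` test name suggests the PR guards against.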
ReubenBond approved these changes Mar 26, 2026
This pull request introduces a "slow-start" mechanism for orphaned job shard claiming, designed to prevent silos from overwhelming themselves by claiming too many shards immediately after startup, especially during disaster recovery scenarios. The changes add configurable limits and ramp-up logic, integrate overload detection to pause claims when needed, and update logging and validation accordingly.
Slow-start shard claiming mechanism:
- Added `SlowStartInitialBudget`, `SlowStartMaxBudget`, and `SlowStartRampUpDuration` configuration options to `DurableJobsOptions`, allowing control over how many orphaned shards a silo may claim immediately after startup and how this budget increases over time.
- Updated `LocalDurableJobManager` to compute the current claim budget, track claimed shards, and respect overload detection by pausing claims when overloaded.

API and implementation updates:
- Updated `AssignJobShardsAsync` in `JobShardManager` and its implementations to accept a `maxNewClaims` parameter, enforcing the slow-start budget during shard assignment.
- Updated existing tests to pass the `maxNewClaims` parameter, ensuring compatibility and correctness.

Validation and logging enhancements:
- Added validation for the new slow-start options and log messages for the claim budget, orphaned shard claims, and overload pauses.

Integration with overload detection:
- Integrated `IOverloadDetector` in `LocalDurableJobManager` to pause new shard claims when the silo is overloaded, further protecting system stability.

These changes collectively improve the robustness and resilience of the job shard assignment process during silo startup and recovery scenarios.