Implement slow-start budget for orphaned shard claims and enhance related logging by benjaminpetit · Pull Request #9943 · dotnet/orleans

benjaminpetit · 2026-02-20T16:57:20Z

This pull request introduces a "slow-start" mechanism for orphaned job shard claiming, designed to prevent silos from overwhelming themselves by claiming too many shards immediately after startup, especially during disaster recovery scenarios. The changes add configurable limits and ramp-up logic, integrate overload detection to pause claims when needed, and update logging and validation accordingly.

Slow-start shard claiming mechanism:

Added SlowStartInitialBudget, SlowStartMaxBudget, and SlowStartRampUpDuration configuration options to DurableJobsOptions, allowing control over how many orphaned shards a silo may claim immediately after startup and how this budget increases over time.
Implemented ramp-up logic in LocalDurableJobManager to compute the current claim budget, track claimed shards, and respect overload detection by pausing claims when overloaded.
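
As a rough illustration of the ramp-up, the budget can be modeled as a pure function of elapsed time since startup. A commit message further down this page mentions linear interpolation; everything else here is an assumption made for the sketch: the Python function name `compute_claim_budget`, its parameters, the truncating rounding, and the behavior at the boundaries. It is not the C# code from this PR.

```python
from datetime import timedelta


def compute_claim_budget(initial_budget: int, max_budget: int,
                         ramp_up: timedelta, elapsed: timedelta) -> int:
    """Hypothetical helper: how many shards a silo may have claimed in total so far."""
    # Assumption: a zero ramp-up, or an elapsed time past the ramp-up, yields the max budget.
    if ramp_up <= timedelta(0) or elapsed >= ramp_up:
        return max_budget
    if elapsed <= timedelta(0):
        return initial_budget
    # Linear interpolation between the initial and maximum budget.
    fraction = elapsed / ramp_up  # timedelta / timedelta gives a float
    return initial_budget + int((max_budget - initial_budget) * fraction)
```

With an initial budget of 2, a maximum of 10, and a 10-minute ramp-up, this sketch allows 6 claims after 5 minutes.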

API and implementation updates:

Modified AssignJobShardsAsync in JobShardManager and its implementations to accept a maxNewClaims parameter, enforcing the slow-start budget during shard assignment.
Updated tests to use the new maxNewClaims parameter, ensuring compatibility and correctness.
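
To make the effect of a `maxNewClaims`-style limit concrete, here is a small Python sketch. The function `assign_job_shards`, its arguments, and the assumption that shards the silo already owns are returned regardless of the budget are all hypothetical; the real method is the C# `AssignJobShardsAsync`, whose exact semantics are not shown on this page.

```python
import sys


def assign_job_shards(already_owned, orphaned, max_new_claims):
    """Hypothetical stand-in for a shard manager's assignment step."""
    # Assumption: shards already owned by this silo do not count against the budget.
    assigned = list(already_owned)
    new_claims = 0
    for shard in orphaned:
        if new_claims >= max_new_claims:
            break  # budget used up: leave the remaining orphans for a later pass
        assigned.append(shard)
        new_claims += 1
    return assigned


# Passing a very large limit mirrors the updated tests, which pass int.MaxValue.
unlimited = assign_job_shards(["a"], ["b", "c", "d"], sys.maxsize)
limited = assign_job_shards(["a"], ["b", "c", "d"], 2)
```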

Validation and logging enhancements:

Added configuration validation for the new slow-start options to prevent misconfiguration.
Introduced new log messages for shard claim budget, orphaned shard claims, and overload pauses to improve observability.
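
The validation rules themselves are not listed in this description. The Python sketch below only shows the general shape such a check could take; the three rules it enforces (initial budget of at least 1, maximum not below initial, non-negative ramp-up duration) and the names `SlowStartOptions` and `validate` are assumptions for illustration, not the PR's actual validator.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class SlowStartOptions:
    """Hypothetical mirror of the three slow-start options named above."""
    initial_budget: int
    max_budget: int
    ramp_up_duration: timedelta


def validate(options: SlowStartOptions) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if options.initial_budget < 1:  # assumed rule
        errors.append("initial budget must be at least 1")
    if options.max_budget < options.initial_budget:  # assumed rule
        errors.append("max budget must not be below the initial budget")
    if options.ramp_up_duration < timedelta(0):  # assumed rule
        errors.append("ramp-up duration must not be negative")
    return errors
```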

Integration with overload detection:

Integrated IOverloadDetector in LocalDurableJobManager to pause new shard claims when the silo is overloaded, further protecting system stability.
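
One way to picture how the overload signal and the budget could combine: the number of claims allowed in the next assignment pass drops to zero while the silo reports overload, and otherwise equals the budget minus what has already been claimed. The Python function `remaining_claims` and its parameters are hypothetical; the actual decision is made in the C# LocalDurableJobManager and may differ in detail.

```python
def remaining_claims(current_budget: int, total_claimed: int, is_overloaded: bool) -> int:
    """Hypothetical helper: how many new shards may be claimed in the next pass."""
    if is_overloaded:
        return 0  # pause all new claims while the silo is overloaded
    # Never return a negative number, even if more was claimed than the budget allows.
    return max(0, current_budget - total_claimed)
```

A result of 0 would simply mean the silo keeps processing the shards it already holds and tries again on the next pass.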

These changes collectively improve the robustness and resilience of the job shard assignment process during silo startup and recovery scenarios.

Copilot

Pull request overview

This pull request implements a "slow-start" mechanism for orphaned job shard claiming in Orleans' Durable Jobs feature to prevent silos from overwhelming themselves during startup and disaster recovery scenarios. The implementation adds configurable limits that ramp up over time, integrates with overload detection to pause claims when needed, and includes comprehensive logging for observability.

Changes:

Added three new configuration options (SlowStartInitialBudget, SlowStartMaxBudget, SlowStartRampUpDuration) with validation to control the slow-start behavior
Modified AssignJobShardsAsync API across JobShardManager implementations to accept a maxNewClaims parameter that enforces the budget
Implemented ramp-up logic in LocalDurableJobManager that computes the current claim budget, tracks claimed shards, and respects overload detection

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

File	Description
`src/Orleans.DurableJobs/Hosting/DurableJobsOptions.cs`	Added three new slow-start configuration properties with XML documentation and validation logic
`src/Orleans.DurableJobs/LocalDurableJobManager.cs`	Implemented slow-start state tracking, budget computation with overload integration, and orphaned shard claim counting
`src/Orleans.DurableJobs/LocalDurableJobManager.Log.cs`	Added three new log messages for claim budget, orphaned claims, and overload pauses
`src/Orleans.DurableJobs/JobShardManager.cs`	Updated `AssignJobShardsAsync` signature to include `maxNewClaims` parameter and implemented budget enforcement in InMemoryJobShardManager
`src/Azure/Orleans.DurableJobs.AzureStorage/AzureStorageJobShardManager.cs`	Implemented budget enforcement for the Azure Storage provider
`test/Tester/DurableJobs/JobShardManagerTestsRunner.cs`	Added three comprehensive tests for slow-start behavior and updated all existing test calls with `int.MaxValue`
`test/Tester/DurableJobs/InMemoryJobShardManagerTests.cs`	Added test method delegates for the three new slow-start tests
`test/Extensions/TesterAzureUtils/DurableJobs/AzureStorageJobShardManagerTests.cs`	Added test method delegates for the three new slow-start tests with appropriate test categories
`test/Extensions/TesterAzureUtils/DurableJobs/AzureStorageJobShardBatchingTests.cs`	Updated all test calls to use `maxNewClaims: int.MaxValue` for unlimited claiming

…ering, add tests

- Rename SlowStartInitialBudget/SlowStartMaxBudget/SlowStartRampUpDuration to ShardClaimInitialBudget/ShardClaimMaxBudget/ShardClaimRampUpDuration to avoid naming collision with existing concurrency slow-start options.
- Rename _totalClaimedOrphanedShards to _totalClaimedShards (counts all new claims).
- Fix InMemoryJobShardManager: move budget check before AdoptedCount increment to prevent false poison detection when budget is exhausted.
- Extract ComputeClaimBudget into a testable static method.
- Add ShardClaimBudgetTests: linear interpolation, edge cases, validator rules.
- Add SlowStart_BudgetExhaustion_DoesNotInflateAdoptedCount test to runner.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
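
The ordering fix for InMemoryJobShardManager can be illustrated with a short Python sketch. The idea, as the commit message describes it, is that a shard skipped because the budget ran out must not have its adoption counter bumped, since an inflated counter could later look like a poisoned shard. The `Shard` class, the `claim_orphans` function, and the field `adopted_count` are hypothetical stand-ins, not the C# types from the PR.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    """Hypothetical shard record with an adoption counter."""
    shard_id: str
    adopted_count: int = 0


def claim_orphans(orphans, max_new_claims):
    claimed = []
    for shard in orphans:
        # Check the budget first, so skipped shards are left completely untouched.
        if len(claimed) >= max_new_claims:
            break
        shard.adopted_count += 1  # only incremented for shards actually claimed
        claimed.append(shard)
    return claimed


orphans = [Shard("a"), Shard("b"), Shard("c")]
claimed = claim_orphans(orphans, 1)
```

Had the increment come before the check, shard "b" would have ended with a count of 1 despite never being claimed.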

Copilot AI review requested due to automatic review settings February 20, 2026 16:57

Copilot started reviewing on behalf of benjaminpetit February 20, 2026 16:57

benjaminpetit mentioned this pull request Feb 20, 2026

Durable Jobs follow-up #9750

Open

14 tasks

Copilot AI reviewed Feb 20, 2026

ReubenBond reviewed Feb 20, 2026

src/Orleans.DurableJobs/JobShardManager.cs (outdated, resolved)

ReubenBond reviewed Feb 20, 2026

src/Orleans.DurableJobs/LocalDurableJobManager.cs (outdated, resolved)

benjaminpetit force-pushed the feature/slow-shard-assignement branch 2 times, most recently from 6dc83a8 to fc150cf, February 20, 2026 17:43

benjaminpetit added 2 commits March 25, 2026 13:10

Implement slow-start budget for orphaned shard claims and enhance related logging

2752dcf

Fix DurableJobs test shard assignment signature

8d9a548

ReubenBond force-pushed the feature/slow-shard-assignement branch from d39aade to 8d9a548, March 25, 2026 20:15

ReubenBond approved these changes Mar 26, 2026

ReubenBond added this pull request to the merge queue Mar 26, 2026

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 26, 2026

benjaminpetit wants to merge 3 commits into dotnet:main from benjaminpetit:feature/slow-shard-assignement

3 participants
