Enhance purge with parallel batch deletes and partial purge timeout #1321
YunchuWang wants to merge 21 commits into main
- Add TimeSpan? Timeout to PurgeInstanceFilter for partial purge support
- Add bool? IsComplete to PurgeHistoryResult to indicate completion status
- Add new PurgeInstanceHistoryAsync overload with TimeSpan timeout parameter
- Use CancellationToken-based timeout (linked CTS) in DeleteHistoryAsync
- Already-dispatched deletions complete before returning partial results
- Backward compatible: no timeout = original behavior (IsComplete = null)
- Forward IsComplete through ToCorePurgeHistoryResult to PurgeResult
- Add scenario tests for partial purge timeout, generous timeout, and compat

- Always cap timeout to 30s max, even if not specified or exceeds 30s
- Pass effectiveToken into DeleteAllDataForOrchestrationInstance so in-flight deletes are also cancelled on timeout
- Catch OperationCanceledException from Task.WhenAll for timed-out in-flight deletes
- External cancellationToken cancellation still propagates normally
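The linked-token timeout described in these commits can be sketched roughly as follows. This is a minimal illustration, not the PR's `DeleteHistoryAsync`: `PartialPurgeSketch`, `PurgeAsync`, and the `deleteInstanceAsync` delegate are hypothetical names, the loop is sequential for brevity (the PR dispatches deletes in parallel), and the timeout is treated as opt-in rather than capped at 30s. Only `CancellationTokenSource`, `CreateLinkedTokenSource`, and `ThrowIfCancellationRequested` are real .NET APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a timeout-aware purge loop (assumption, not the PR's code).
public static class PartialPurgeSketch
{
    // Returns the number of deleted instances and a tri-state completion flag:
    // null = no timeout requested, true = finished in time, false = timed out.
    public static async Task<(int Deleted, bool? IsComplete)> PurgeAsync(
        IEnumerable<string> instanceIds,
        Func<string, CancellationToken, Task> deleteInstanceAsync,
        TimeSpan? timeout,
        CancellationToken cancellationToken = default)
    {
        // Without a timeout this source never fires, so only the caller's token applies.
        using var timeoutCts = timeout.HasValue
            ? new CancellationTokenSource(timeout.Value)
            : new CancellationTokenSource();
        using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(
            cancellationToken, timeoutCts.Token);
        CancellationToken effectiveToken = linkedCts.Token;

        int deleted = 0;
        bool timedOut = false;
        try
        {
            foreach (string id in instanceIds)
            {
                // Stop dispatching new work once the timeout or the caller cancels.
                effectiveToken.ThrowIfCancellationRequested();
                // Passing effectiveToken means an in-flight delete is cancelled too.
                await deleteInstanceAsync(id, effectiveToken);
                deleted++;
            }
        }
        catch (OperationCanceledException) when (
            timeoutCts.IsCancellationRequested && !cancellationToken.IsCancellationRequested)
        {
            // Only the timeout is swallowed; caller cancellation still propagates.
            timedOut = true;
        }

        return (deleted, timeout.HasValue ? !timedOut : (bool?)null);
    }
}
```

The exception filter is the design point worth noting: it distinguishes "our timeout fired" from "the caller cancelled", so a partial result is returned only in the first case.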
Pull request overview
Improves purge scalability and robustness for DurableTask’s Azure Storage backend by adding parallelized table batch deletes, optional timeout-based partial purging, better 404/idempotency handling, and expanded test coverage.
Changes:
- Add optional purge timeout (`PurgeInstanceFilter.Timeout`) and propagate completion status via `IsComplete` into core `PurgeResult`.
- Implement parallel table batch deletion with 404 fallback to per-entity deletes.
- Add scenario + unit tests for partial purge behavior, blob cleanup, and parallel batch delete behavior.
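A rough sketch of the "parallel batch delete with 404 fallback" idea from the list above, under stated assumptions. It does not use the Azure SDK: `submitBatchAsync` and `deleteOneAsync` are hypothetical delegates standing in for a table transaction and a single-entity delete, and `NotFoundException` is a hypothetical stand-in for a storage 404. The chunk size of 100 reflects the Azure Table transaction limit; the real `Table.DeleteBatchParallelAsync` may chunk, count, and throttle differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical exception standing in for a storage request that failed with 404.
public sealed class NotFoundException : Exception { }

public static class ParallelBatchDeleteSketch
{
    // Azure Table transactions accept at most 100 operations.
    private const int MaxBatchSize = 100;

    // Returns the number of entities this call actually deleted.
    public static async Task<int> DeleteAsync<T>(
        IReadOnlyList<T> entities,
        Func<IReadOnlyList<T>, Task> submitBatchAsync,
        Func<T, Task> deleteOneAsync)
    {
        // Split into chunks of at most 100 and submit every chunk concurrently.
        IEnumerable<Task<int>> chunkTasks = entities
            .Select((entity, index) => (entity, index))
            .GroupBy(pair => pair.index / MaxBatchSize, pair => pair.entity)
            .Select(group => DeleteChunkAsync<T>(group.ToList(), submitBatchAsync, deleteOneAsync));

        int[] counts = await Task.WhenAll(chunkTasks);
        return counts.Sum();
    }

    private static async Task<int> DeleteChunkAsync<T>(
        IReadOnlyList<T> chunk,
        Func<IReadOnlyList<T>, Task> submitBatchAsync,
        Func<T, Task> deleteOneAsync)
    {
        try
        {
            await submitBatchAsync(chunk);
            return chunk.Count;
        }
        catch (NotFoundException)
        {
            // A transaction is all-or-nothing, so fall back to entity-by-entity deletes.
            int deleted = 0;
            foreach (T entity in chunk)
            {
                try
                {
                    await deleteOneAsync(entity);
                    deleted++;
                }
                catch (NotFoundException)
                {
                    // Already gone: treated as success for idempotency, but not counted.
                }
            }
            return deleted;
        }
    }
}
```

Whether an already-deleted entity should count toward the returned total is a design choice; this sketch excludes it, which is an assumption rather than the PR's documented behavior.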
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| test/DurableTask.AzureStorage.Tests/TestOrchestrationClient.cs | Adds helper API to invoke the new timed purge overload in tests. |
| test/DurableTask.AzureStorage.Tests/AzureStorageScenarioTests.cs | Adds new purge/partial-purge scenario tests and validation for large-message blob cleanup. |
| src/DurableTask.Core/PurgeInstanceFilter.cs | Introduces optional Timeout for partial purge. |
| src/DurableTask.AzureStorage/Tracking/TrackingStoreBase.cs | Extends purge-by-time signature to include an optional timeout. |
| src/DurableTask.AzureStorage/Tracking/ITrackingStore.cs | Extends tracking store purge API contract to include optional timeout. |
| src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs | Implements timeout-aware, parallel purge-by-time behavior and uses parallel batch delete. |
| src/DurableTask.AzureStorage/Storage/Table.cs | Adds DeleteBatchParallelAsync with transactional chunking and 404 fallback. |
| src/DurableTask.AzureStorage/PurgeHistoryResult.cs | Adds IsComplete and forwards it to core PurgeResult. |
| src/DurableTask.AzureStorage/MessageManager.cs | Improves 404 handling for large-message blob deletion by relying on list/delete with exception handling. |
| src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs | Adds timed purge overload and wires PurgeInstanceFilter.Timeout into the call path. |
| Test/DurableTask.AzureStorage.Tests/Storage/TableDeleteBatchParallelTests.cs | Adds unit tests validating parallel batch delete chunking, fallback, and cancellation behavior. |
- Hard-code 30s CancellationToken-based timeout in DeleteHistoryAsync
- Remove configurable Timeout from PurgeInstanceFilter (not needed)
- Remove timeout overload from AzureStorageOrchestrationService
- IsComplete = true when all purged within 30s, false when timed out
- Callers loop until IsComplete = true for large-scale purge

- Add TimeSpan? Timeout property to PurgeInstanceFilter (opt-in, default null)
- When null: unbounded purge, IsComplete=null (backward compat, no behavior change)
- When set: CancellationToken-based timeout, IsComplete=true/false
- Thread Timeout through IOrchestrationServicePurgeClient path
- Zero breaking changes: existing callers unaffected
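The "callers loop until IsComplete = true" pattern mentioned in these commits might look like the sketch below. It is written against a hypothetical `purgeOnceAsync` delegate that wraps one timed purge call and returns a deleted count plus the tri-state completion flag; `PurgeLoopSketch`, `PurgeUntilCompleteAsync`, and `maxCalls` are illustrative names, not part of DurableTask.

```csharp
using System;
using System.Threading.Tasks;

public static class PurgeLoopSketch
{
    // purgeOnceAsync is a hypothetical wrapper around a single timed purge call.
    public static async Task<int> PurgeUntilCompleteAsync(
        Func<TimeSpan, Task<(int Deleted, bool? IsComplete)>> purgeOnceAsync,
        TimeSpan timeoutPerCall,
        int maxCalls)
    {
        int totalDeleted = 0;
        for (int call = 0; call < maxCalls; call++)
        {
            (int deleted, bool? isComplete) = await purgeOnceAsync(timeoutPerCall);
            totalDeleted += deleted;

            // Only an explicit false means "timed out, more work remains".
            // true = finished; null = no completion status reported (no timeout applied).
            if (isComplete != false)
            {
                break;
            }
        }

        return totalDeleted;
    }
}
```

Bounding the loop with `maxCalls` is a defensive choice for the sketch, so that a purge that never reports completion cannot spin forever.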
Pull request overview
This PR enhances the Azure Storage purge pipeline to improve throughput and reliability for large purges by introducing parallelized batch deletes, a timeout-driven partial purge mechanism, and forwarding completion status back to the core purge result shape.
Changes:
- Added `PurgeInstanceFilter.Timeout` and plumbed timeout support into Azure Storage tracking-store purging.
- Implemented `Table.DeleteBatchParallelAsync` with 404/idempotency fallback and updated purge to use it.
- Added/updated purge-related tests and extended purge result types to carry `IsComplete`.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| test/DurableTask.AzureStorage.Tests/AzureStorageScenarioTests.cs | Adds new purge scenario tests for scalability/idempotency/large-blob cleanup and a test intended to validate completion semantics. |
| src/DurableTask.Core/PurgeInstanceFilter.cs | Adds Timeout option to the core purge filter contract. |
| src/DurableTask.AzureStorage/Tracking/TrackingStoreBase.cs | Extends time-range purge signature to accept optional timeout. |
| src/DurableTask.AzureStorage/Tracking/ITrackingStore.cs | Extends tracking store purge API with an optional timeout parameter. |
| src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs | Implements timeout-aware, parallel instance purging and returns IsComplete based on timeout. |
| src/DurableTask.AzureStorage/Storage/Table.cs | Adds DeleteBatchParallelAsync with parallel transactions and 404 fallback to individual deletes. |
| src/DurableTask.AzureStorage/PurgeHistoryResult.cs | Adds IsComplete to AzureStorage purge result and forwards it to DurableTask.Core.PurgeResult. |
| src/DurableTask.AzureStorage/MessageManager.cs | Improves 404 handling for large message blob cleanup by relying on try/catch rather than container existence checks. |
| src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs | Wires PurgeInstanceFilter.Timeout into the tracking-store purge path used by IOrchestrationServicePurgeClient. |
| Test/DurableTask.AzureStorage.Tests/Storage/TableDeleteBatchParallelTests.cs | Adds unit tests for DeleteBatchParallelAsync (but currently placed outside the referenced test project directory). |
- Update PurgeInstanceFilter.Timeout docs: in-flight deletions are cancelled (intentional)
- Add using var for SemaphoreSlim disposal
- Fix DateTime.Now/UtcNow mixing in purge tests (use UtcNow consistently)
- Rename PurgeReturnsIsComplete test to match actual assertions
- Move TableDeleteBatchParallelTests.cs from Test/ to test/ (correct project path)
- Fix typos: grater->greater, status->statuses in XML docs
- Use LINQ Select for foreach loop per code quality suggestion
Pull request overview
This PR improves the Azure Storage purge pipeline to better handle large-scale instance purges by adding parallelized table batch deletes, introducing an optional timeout for partial purges, and improving idempotency around already-deleted storage artifacts. It also expands scenario/unit test coverage to validate the new purge behaviors and scalability characteristics.
Changes:
- Add `PurgeInstanceFilter.Timeout` and propagate `IsComplete` via `PurgeHistoryResult` → `PurgeResult`.
- Implement parallel table batch deletion with 404 fallback to per-entity deletes.
- Update purge and blob cleanup implementations for better cancellation/timeout behavior and add comprehensive tests.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| test/DurableTask.AzureStorage.Tests/Storage/TableDeleteBatchParallelTests.cs | Adds unit tests validating new parallel batch delete behavior (including 404 fallback and cancellation). |
| test/DurableTask.AzureStorage.Tests/AzureStorageScenarioTests.cs | Adds end-to-end purge scenario tests and uses UTC timestamps for purge windows. |
| src/DurableTask.Core/PurgeInstanceFilter.cs | Introduces optional Timeout for partial purge semantics. |
| src/DurableTask.AzureStorage/Tracking/TrackingStoreBase.cs | Extends tracking store purge API shape to accept optional timeout. |
| src/DurableTask.AzureStorage/Tracking/ITrackingStore.cs | Updates tracking store interface to include optional timeout parameter. |
| src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs | Implements timeout-linked cancellation + throttled parallel instance purges and uses parallel history row deletes. |
| src/DurableTask.AzureStorage/Storage/Table.cs | Adds DeleteBatchParallelAsync with concurrent chunk submission and 404 fallback behavior. |
| src/DurableTask.AzureStorage/PurgeHistoryResult.cs | Adds IsComplete and forwards completion to core PurgeResult. |
| src/DurableTask.AzureStorage/MessageManager.cs | Improves large-message blob deletion to handle missing containers via exception-based 404 handling. |
| src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs | Threads the new timeout value through purge calls and fixes doc typos. |
- PurgeHistoryResultTests: constructor IsComplete (true/false/null), ToCorePurgeHistoryResult propagation, backward compat
- PurgeInstanceFilterTests: Timeout default null, set/reset, PurgeResult IsComplete tri-state, old constructor compat
Regarding the pendingTasks memory concern: with the new opt-in timeout feature (default 30s when used), the maximum number of pending tasks is naturally bounded by how many instances can be dispatched within the timeout window (roughly 100 concurrent deletes over 30s, so a few thousand tasks at most). For the no-timeout path (backward compat), the existing behavior is preserved. The SemaphoreSlim(100) already limits actual concurrency. Switching to Parallel.ForEachAsync would be a larger refactor that changes the async enumeration pattern, which is better suited for a follow-up.
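To make the trade-off in this comment concrete, here is a minimal sketch of semaphore-throttled dispatch, assuming a simplified shape of the purge loop. `ThrottledDispatchSketch`, `RunAsync`, and the `work` delegate are hypothetical names. The semaphore caps how many items run at once, while the `pending` list still grows by one task per dispatched item, which is the memory concern under discussion.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledDispatchSketch
{
    // Runs work for every item with at most maxConcurrency items in flight.
    // Returns the number of tasks that were dispatched.
    public static async Task<int> RunAsync<T>(
        IEnumerable<T> items,
        Func<T, Task> work,
        int maxConcurrency)
    {
        using var throttle = new SemaphoreSlim(maxConcurrency);
        var pending = new List<Task>();

        foreach (T item in items)
        {
            // Waits here while maxConcurrency items are already running.
            await throttle.WaitAsync();
            pending.Add(RunOneAsync(item));
        }

        // Every dispatched task is awaited before the semaphore is disposed.
        await Task.WhenAll(pending);
        return pending.Count;

        async Task RunOneAsync(T item)
        {
            try
            {
                await work(item);
            }
            finally
            {
                // Release the slot even if the work item fails.
                throttle.Release();
            }
        }
    }
}
```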
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
- Revert to effectiveToken so in-flight deletes are cancelled on timeout
- Update PurgeInstanceFilter.Timeout XML doc to match behavior
- Docs and comments now consistently say in-flight deletes are cancelled
@YunchuWang I've opened a new pull request, #1325, to work on those changes. Once the pull request is ready, I'll request review from you.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/DurableTask.AzureStorage/MessageManager.cs:1
`storageOperationCount` no longer counts the list operation when blobs exist (it only counts deletes and only adds “1” when there are no blobs). If this value is used as “requests sent to storage,” it will undercount in the common case where blobs exist. Consider initializing to 1 before enumerating (to count the list) and then adding delete counts, or explicitly incrementing for the list call regardless of whether any blobs were found.
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Summary
Enhance the Azure Storage purge implementation with parallel batch deletes, CancellationToken-based partial purge timeout, improved error handling, and comprehensive tests.
Motivation
Purging large numbers of orchestration instances (100K+) with the current implementation causes:
- `DeleteBatchAsync` fails with 404 when entities are already deleted (race condition)

Changes
Core (DurableTask.Core)
- `PurgeInstanceFilter.Timeout` (`TimeSpan?`): Optional timeout for partial purge
- `PurgeResult.IsComplete` (`bool?`): Already existed, now properly populated

Azure Storage (DurableTask.AzureStorage)
- `PurgeHistoryResult.IsComplete`: New property + constructor overload, forwarded via `ToCorePurgeHistoryResult()`
- `AzureStorageOrchestrationService.PurgeInstanceHistoryAsync(..., TimeSpan timeout)`: New overload
- `AzureTableTrackingStore.DeleteHistoryAsync`: CancellationToken-based timeout using linked `CancellationTokenSource`
- `Table.DeleteBatchParallelAsync`: New parallel batch delete with concurrent transactions and 404 fallback
- `MessageManager.DeleteLargeMessageBlobs`: Fixed 404 handling with try/catch instead of `ExistsAsync` + delete
- `SemaphoreSlim(100)` for instance-level parallelism

Behavior
When `Timeout` is set:
- Creates a `CancellationTokenSource(timeout)` linked with the caller's `CancellationToken`
- Calls `ThrowIfCancellationRequested`
- Catches `OperationCanceledException`, waits for in-flight deletions, returns `IsComplete = false`

When `Timeout` is not set:
- Unbounded purge (`IsComplete = null` for backward compatibility)

Benchmark Results
100K Instances (EP1, separate ASPs/storage)
500K Instances (EP1, isolated worker SDK path with 25s timeout)
Breaking Changes
None. All changes are additive:
- `Timeout` property on `PurgeInstanceFilter`
- `PurgeHistoryResult.IsComplete` property and constructor overload
- `PurgeInstanceHistoryAsync` overload (original method unchanged)

Tests Added
- `PartialPurge_TimesOutThenCompletesOnRetry`
- `PartialPurge_GenerousTimeout_CompletesAll`
- `PartialPurge_WithoutTimeout_ReturnsNullIsComplete`
- `PurgeMultipleInstancesHistoryByTimePeriod_ScalabilityValidation`
- `PurgeSingleInstanceWithIdempotency`
- `PurgeSingleInstance_WithLargeBlobs_CleansUpBlobs`
- `PurgeInstance_WithManyHistoryRows_DeletesAll`
- `DeleteBatchParallelAsync`

Related PRs