fix: pruner retry only failed partition ranges instead of restarting all workers #623
Open
…all workers

When any partition worker timed out, PruneWithPartitions restarted ALL workers from scratch. This created a feedback loop: completed workers re-scanned already-deleted records, generating KEY_NOT_FOUND errors and extra Aerospike load, which caused more timeouts, producing the saw-tooth pattern seen in both clusters.

Now only the timed-out partition ranges are retried. Successfully completed ranges are tracked and skipped on subsequent attempts. This also fixes a reporting bug where partial progress from timed-out attempts was lost (counters were reset each attempt).

Changes:
- Add prunerClient interface for Aerospike client abstraction
- Track partition ranges across retries (partitionRange struct)
- Extend workerResult with partition identity for per-worker classification
- Add partitionWorkerFn for testability without Aerospike
- Accumulate progress across retry attempts (cumulativeProcessed)
- 8 unit tests covering all retry edge cases
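The retry behavior described in this commit message can be sketched roughly as below. This is a minimal illustration, not the code from the PR: the `pruneWithRetry` function, the `workerFn` signature, the `errTimeout` sentinel, the `demo` helper, and the choice to abort on non-timeout errors are all assumptions made for the example. Only the names `partitionRange`, `workerResult`, `pendingRanges`, `failedRanges`, and `cumulativeProcessed` come from the description.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// partitionRange identifies the block of partitions owned by one worker.
type partitionRange struct {
	partitionStart int
	partitionCount int
}

// workerResult carries the range identity so results can be classified per worker.
type workerResult struct {
	rng       partitionRange
	processed int64
	err       error
}

// workerFn is a hypothetical injectable worker; tests can supply a fake
// instead of talking to Aerospike.
type workerFn func(ctx context.Context, r partitionRange) (int64, error)

// errTimeout is an assumed sentinel standing in for a worker timeout.
var errTimeout = errors.New("partition worker timed out")

// pruneWithRetry runs one worker per pending range and retries only the
// ranges that timed out. Progress accumulates across attempts.
func pruneWithRetry(ctx context.Context, ranges []partitionRange, maxAttempts int, work workerFn) (int64, []partitionRange, error) {
	pendingRanges := ranges
	var cumulativeProcessed int64

	for attempt := 0; attempt < maxAttempts && len(pendingRanges) > 0; attempt++ {
		results := make(chan workerResult, len(pendingRanges))
		for _, r := range pendingRanges {
			go func(r partitionRange) {
				n, err := work(ctx, r)
				results <- workerResult{rng: r, processed: n, err: err}
			}(r)
		}

		var failedRanges []partitionRange
		var fatal error
		for range pendingRanges {
			res := <-results
			// Partial progress from timed-out workers is kept, not reset.
			cumulativeProcessed += res.processed
			switch {
			case res.err == nil:
				// Completed range: never scheduled again.
			case errors.Is(res.err, errTimeout):
				failedRanges = append(failedRanges, res.rng)
			default:
				fatal = res.err // assumption: non-timeout errors abort
			}
		}
		if fatal != nil {
			return cumulativeProcessed, failedRanges, fatal
		}
		pendingRanges = failedRanges
	}
	return cumulativeProcessed, pendingRanges, nil
}

// demo runs four ranges; the one starting at 2048 times out once after
// processing 3 records, then finishes the remaining 7 on the retry.
func demo() (int64, map[int]int, error) {
	var mu sync.Mutex
	calls := map[int]int{}
	work := func(_ context.Context, r partitionRange) (int64, error) {
		mu.Lock()
		calls[r.partitionStart]++
		n := calls[r.partitionStart]
		mu.Unlock()
		if r.partitionStart == 2048 && n == 1 {
			return 3, errTimeout
		}
		if r.partitionStart == 2048 {
			return 7, nil
		}
		return 10, nil
	}
	ranges := []partitionRange{{0, 1024}, {1024, 1024}, {2048, 1024}, {3072, 1024}}
	processed, remaining, err := pruneWithRetry(context.Background(), ranges, 3, work)
	if err == nil && len(remaining) > 0 {
		err = fmt.Errorf("%d ranges still pending", len(remaining))
	}
	return processed, calls, err
}

func main() {
	processed, calls, err := demo()
	fmt.Println(processed, calls[0], calls[2048], err) // prints: 40 1 2 <nil>
}
```

In this sketch the three ranges that succeed on the first attempt are invoked exactly once, which is the property that avoids re-scanning already-deleted records.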
Contributor
🤖 Claude Code Review
Status: Complete
Current Review: Minor Note:
Contributor
Benchmark Comparison Report
Threshold: >10% with p < 0.05 | Generated: 2026-03-26 13:28 UTC
… MapPutItems

- Remove prunerClient interface that was unused by tests (tests mock at partitionWorkerFn level), revert client field to concrete *uaerospike.Client
- Reduce uniqueSpendingChildren preallocation from 100k to 1k to match typical usage (~50-100 entries per chunk)
- Use single MapPutItemsOp instead of N individual MapPutOps for batch parent updates, reducing per-record operation count
freemans13 approved these changes Mar 26, 2026



Summary
`PruneWithPartitions` restarted all workers from partition 0, creating a feedback loop: completed workers re-scanned already-deleted records → KEY_NOT_FOUND errors → extra Aerospike load → more timeouts → saw-tooth pattern

Changes
- `prunerClient` interface
- `partitionRange` struct
- `workerResult` extension: `partitionStart`/`partitionCount` fields to identify which worker succeeded vs failed
- `partitionWorkerFn` field
- … `pendingRanges` before loop, classifies results per-worker, sets `pendingRanges = failedRanges` on timeout
- `cumulativeProcessed`/`cumulativeSkipped` persist across retry attempts

Test plan
- `partition_retry_test.go`: …
- `go build ./...` passes
- `go vet` passes
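To illustrate how `partitionStart`/`partitionCount` can identify a worker's share of the keyspace, here is a small sketch that divides a partition space evenly across workers. The `splitPartitions` helper and the even-split policy are assumptions for illustration and are not taken from the PR; the total of 4096 is the number of partitions in an Aerospike namespace.

```go
package main

import "fmt"

// partitionRange mirrors the struct named in the PR description; the
// field names follow the partitionStart/partitionCount fields it lists.
type partitionRange struct {
	partitionStart int
	partitionCount int
}

// splitPartitions is a hypothetical helper that divides [0, total) into
// contiguous ranges, one per worker. Any remainder is spread over the
// first ranges so sizes differ by at most one.
func splitPartitions(total, workers int) []partitionRange {
	if total <= 0 || workers <= 0 {
		return nil
	}
	if workers > total {
		workers = total // never create empty ranges
	}
	base, rem := total/workers, total%workers
	ranges := make([]partitionRange, 0, workers)
	start := 0
	for i := 0; i < workers; i++ {
		count := base
		if i < rem {
			count++
		}
		ranges = append(ranges, partitionRange{partitionStart: start, partitionCount: count})
		start += count
	}
	return ranges
}

func main() {
	for _, r := range splitPartitions(4096, 3) {
		fmt.Printf("start=%d count=%d\n", r.partitionStart, r.partitionCount)
	}
	// prints:
	// start=0 count=1366
	// start=1366 count=1365
	// start=2731 count=1365
}
```

Because each range is a plain value, a failed worker's range can be re-queued on its own without recomputing or touching the ranges that already completed.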