
fix: pruner retry only failed partition ranges instead of restarting all workers#623

Open
icellan wants to merge 2 commits into main from fix/pruner-targeted-partition-retry

Conversation


@icellan icellan commented Mar 26, 2026

Summary

  • When any partition worker timed out, PruneWithPartitions restarted all workers from partition 0, creating a feedback loop: completed workers re-scanned already-deleted records → KEY_NOT_FOUND errors → extra Aerospike load → more timeouts → saw-tooth pattern
  • Now only the timed-out partition ranges are retried; successfully completed ranges are tracked and skipped
  • Also fixes a reporting bug where partial progress from timed-out attempts was lost (counters reset each attempt)

Changes

Area | Detail
prunerClient interface | Abstracts the Aerospike client methods used by the pruner (QueryPartitions, BatchOperate, GetNodes, GetConnectionQueueSize)
partitionRange struct | Tracks start/count for partition ranges across retries
workerResult extension | Adds partitionStart/partitionCount fields to identify which worker succeeded vs. failed
partitionWorkerFn field | Swappable worker function for testability without Aerospike
Retry logic rewrite | Builds pendingRanges before the loop, classifies results per worker, sets pendingRanges = failedRanges on timeout
Cumulative counters | cumulativeProcessed/cumulativeSkipped persist across retry attempts

Test plan

  • 8 new unit tests in partition_retry_test.go:
    • All workers succeed (no retry)
    • Single worker timeout → only that range retried
    • Multiple workers timeout → only failed ranges retried
    • Non-timeout error → immediate return, no retry
    • Mixed timeout + non-timeout → non-timeout wins
    • Max retries exceeded → error with cumulative progress
    • Progressive recovery across 3 attempts
    • Context cancellation during retry
  • Full go build ./... passes
  • go vet passes
  • Pre-commit linter passes
  • Deploy to staging and observe Aerospike ops/s and KEY_NOT_FOUND rate during pruner catchup

…all workers

When any partition worker timed out, PruneWithPartitions restarted ALL
workers from scratch. This created a feedback loop: completed workers
re-scanned already-deleted records, generating KEY_NOT_FOUND errors and
extra Aerospike load, which caused more timeouts, producing the
saw-tooth pattern seen in both clusters.

Now only the timed-out partition ranges are retried. Successfully
completed ranges are tracked and skipped on subsequent attempts. This
also fixes a reporting bug where partial progress from timed-out
attempts was lost (counters were reset each attempt).

Changes:
- Add prunerClient interface for Aerospike client abstraction
- Track partition ranges across retries (partitionRange struct)
- Extend workerResult with partition identity for per-worker classification
- Add partitionWorkerFn for testability without Aerospike
- Accumulate progress across retry attempts (cumulativeProcessed)
- 8 unit tests covering all retry edge cases

github-actions bot commented Mar 26, 2026

🤖 Claude Code Review

Status: Complete

Current Review:
No issues found. The implementation correctly addresses the retry feedback loop by tracking and retrying only failed partition ranges instead of restarting all workers. The test coverage is comprehensive with 8 unit tests covering all retry scenarios.

Minor Note:
The PR description mentions removing prunerClient interface, but this change does not appear in the diff. This may have been merged separately or is a documentation inconsistency.

@icellan icellan requested a review from freemans13 March 26, 2026 09:45

github-actions bot commented Mar 26, 2026

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-623 (bfc5d57)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 151
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.406µ 1.416µ ~ 1.000
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 61.51n 61.99n ~ 0.100
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 61.52n 61.43n ~ 0.400
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 61.49n 61.65n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 30.02n 29.98n ~ 1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 50.30n 49.74n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 132.9n 121.1n ~ 0.200
MiningCandidate_Stringify_Short-4 257.6n 259.1n ~ 0.600
MiningCandidate_Stringify_Long-4 1.853µ 1.862µ ~ 0.300
MiningSolution_Stringify-4 938.6n 949.1n ~ 0.100
BlockInfo_MarshalJSON-4 1.803µ 1.802µ ~ 1.000
NewFromBytes-4 131.5n 129.1n ~ 0.200
Mine_EasyDifficulty-4 63.00µ 62.98µ ~ 0.400
Mine_WithAddress-4 4.927µ 5.130µ ~ 0.100
BlockAssembler_AddTx-4 0.03169n 0.03235n ~ 0.400
AddNode-4 10.10 10.64 ~ 0.400
AddNodeWithMap-4 10.47 10.72 ~ 0.700
DirectSubtreeAdd/4_per_subtree-4 62.36n 64.38n ~ 1.000
DirectSubtreeAdd/64_per_subtree-4 32.02n 31.55n ~ 0.200
DirectSubtreeAdd/256_per_subtree-4 30.64n 37.45n ~ 0.100
DirectSubtreeAdd/1024_per_subtree-4 29.68n 29.45n ~ 0.700
DirectSubtreeAdd/2048_per_subtree-4 28.99n 28.93n ~ 0.200
SubtreeProcessorAdd/4_per_subtree-4 293.8n 291.4n ~ 0.100
SubtreeProcessorAdd/64_per_subtree-4 292.8n 291.3n ~ 0.100
SubtreeProcessorAdd/256_per_subtree-4 294.1n 291.9n ~ 0.200
SubtreeProcessorAdd/1024_per_subtree-4 294.5n 292.6n ~ 0.100
SubtreeProcessorAdd/2048_per_subtree-4 293.8n 292.6n ~ 0.200
SubtreeProcessorRotate/4_per_subtree-4 300.4n 297.5n ~ 0.100
SubtreeProcessorRotate/64_per_subtree-4 299.1n 296.5n ~ 0.200
SubtreeProcessorRotate/256_per_subtree-4 295.8n 297.2n ~ 1.000
SubtreeProcessorRotate/1024_per_subtree-4 297.8n 295.6n ~ 1.000
SubtreeNodeAddOnly/4_per_subtree-4 63.73n 63.63n ~ 1.000
SubtreeNodeAddOnly/64_per_subtree-4 38.52n 37.91n ~ 0.100
SubtreeNodeAddOnly/256_per_subtree-4 37.71n 37.10n ~ 0.100
SubtreeNodeAddOnly/1024_per_subtree-4 36.83n 36.15n ~ 0.100
SubtreeCreationOnly/4_per_subtree-4 143.8n 139.1n ~ 0.100
SubtreeCreationOnly/64_per_subtree-4 622.6n 609.3n ~ 0.400
SubtreeCreationOnly/256_per_subtree-4 2.165µ 2.073µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 7.883µ 7.488µ ~ 0.100
SubtreeCreationOnly/2048_per_subtree-4 14.86µ 14.45µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 301.6n 297.9n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 290.4n 299.5n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 926.3µ 927.7µ ~ 1.000
ParallelGetAndSetIfNotExists/10k_nodes-4 1.849m 1.864m ~ 0.200
ParallelGetAndSetIfNotExists/50k_nodes-4 7.887m 8.097m ~ 0.100
ParallelGetAndSetIfNotExists/100k_nodes-4 15.70m 15.93m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 739.7µ 762.5µ ~ 0.100
SequentialGetAndSetIfNotExists/10k_nodes-4 2.899m 2.940m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 10.56m 10.93m ~ 0.100
SequentialGetAndSetIfNotExists/100k_nodes-4 20.95m 20.96m ~ 1.000
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 993.9µ 967.7µ ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 4.628m 4.596m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 18.62m 18.86m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 799.8µ 804.4µ ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 5.963m 6.091m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 39.60m 40.45m ~ 0.100
DiskTxMap_SetIfNotExists-4 3.649µ 3.479µ ~ 0.100
DiskTxMap_SetIfNotExists_Parallel-4 3.518µ 3.594µ ~ 1.000
DiskTxMap_ExistenceOnly-4 318.3n 307.4n ~ 0.100
Queue-4 193.8n 200.1n ~ 0.100
AtomicPointer-4 4.708n 4.893n ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 902.5µ 852.7µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/New/10K-4 856.6µ 806.7µ ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4 112.9µ 117.4µ ~ 0.100
ReorgOptimizations/AllMarkFalse/New/10K-4 62.16µ 62.32µ ~ 0.400
ReorgOptimizations/HashSlicePool/Old/10K-4 73.03µ 72.73µ ~ 0.700
ReorgOptimizations/HashSlicePool/New/10K-4 11.54µ 12.13µ ~ 0.700
ReorgOptimizations/NodeFlags/Old/10K-4 5.717µ 5.985µ ~ 0.700
ReorgOptimizations/NodeFlags/New/10K-4 2.296µ 2.105µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 11.354m 9.991m ~ 0.100
ReorgOptimizations/DedupFilterPipeline/New/100K-4 10.55m 11.15m ~ 1.000
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.218m 1.190m ~ 0.100
ReorgOptimizations/AllMarkFalse/New/100K-4 690.0µ 688.9µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/100K-4 680.6µ 641.4µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/100K-4 311.8µ 338.5µ ~ 0.100
ReorgOptimizations/NodeFlags/Old/100K-4 56.32µ 57.29µ ~ 1.000
ReorgOptimizations/NodeFlags/New/100K-4 19.87µ 19.67µ ~ 0.200
TxMapSetIfNotExists-4 52.19n 51.56n ~ 0.400
TxMapSetIfNotExistsDuplicate-4 37.92n 37.58n ~ 0.700
ChannelSendReceive-4 604.6n 606.5n ~ 0.200
CalcBlockWork-4 518.5n 522.7n ~ 0.200
CalculateWork-4 708.8n 701.1n ~ 0.700
BuildBlockLocatorString_Helpers/Size_10-4 1.365µ 1.411µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 13.01µ 13.10µ ~ 0.200
BuildBlockLocatorString_Helpers/Size_1000-4 150.5µ 155.3µ ~ 0.700
CatchupWithHeaderCache-4 104.5m 104.4m ~ 1.000
_BufferPoolAllocation/16KB-4 3.408µ 3.340µ ~ 0.700
_BufferPoolAllocation/32KB-4 6.690µ 7.033µ ~ 0.700
_BufferPoolAllocation/64KB-4 16.28µ 13.43µ ~ 0.700
_BufferPoolAllocation/128KB-4 27.11µ 28.18µ ~ 0.200
_BufferPoolAllocation/512KB-4 121.3µ 110.5µ ~ 0.100
_BufferPoolConcurrent/32KB-4 17.19µ 16.63µ ~ 0.400
_BufferPoolConcurrent/64KB-4 26.89µ 25.58µ ~ 0.700
_BufferPoolConcurrent/512KB-4 137.2µ 135.9µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/16KB-4 616.9µ 614.5µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/32KB-4 625.4µ 602.5µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 631.9µ 605.5µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 638.5µ 604.5µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 641.2µ 624.3µ ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4 35.59m 35.49m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/32KB-4 35.88m 35.24m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/64KB-4 35.64m 35.49m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/128KB-4 35.50m 35.40m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/512KB-4 35.55m 35.13m ~ 0.200
_PooledVsNonPooled/Pooled-4 736.3n 731.8n ~ 0.100
_PooledVsNonPooled/NonPooled-4 6.585µ 6.401µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 6.716µ 6.559µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 8.988µ 9.386µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 9.059µ 8.738µ ~ 0.100
_prepareTxsPerLevel-4 411.4m 401.5m ~ 0.700
_prepareTxsPerLevelOrdered-4 3.651m 3.493m ~ 0.100
_prepareTxsPerLevel_Comparison/Original-4 424.4m 408.9m ~ 0.100
_prepareTxsPerLevel_Comparison/Optimized-4 3.965m 3.713m ~ 0.100
SubtreeProcessor/100_tx_64_per_subtree-4 78.43m 78.49m ~ 0.400
SubtreeProcessor/500_tx_64_per_subtree-4 379.5m 376.8m ~ 0.700
SubtreeProcessor/500_tx_256_per_subtree-4 394.1m 394.4m ~ 1.000
SubtreeProcessor/1k_tx_64_per_subtree-4 755.3m 753.0m ~ 0.700
SubtreeProcessor/1k_tx_256_per_subtree-4 777.0m 773.8m ~ 0.700
StreamingProcessorPhases/FilterValidated/100_tx-4 2.685m 2.673m ~ 1.000
StreamingProcessorPhases/ClassifyProcess/100_tx-4 238.4µ 239.9µ ~ 0.200
StreamingProcessorPhases/FilterValidated/500_tx-4 13.14m 13.03m ~ 0.100
StreamingProcessorPhases/ClassifyProcess/500_tx-4 585.2µ 591.3µ ~ 0.100
StreamingProcessorPhases/FilterValidated/1k_tx-4 26.21m 26.30m ~ 1.000
StreamingProcessorPhases/ClassifyProcess/1k_tx-4 1.050m 1.045m ~ 1.000
SubtreeSizes/10k_tx_4_per_subtree-4 1.300m 1.280m ~ 0.100
SubtreeSizes/10k_tx_16_per_subtree-4 320.0µ 303.1µ ~ 0.100
SubtreeSizes/10k_tx_64_per_subtree-4 73.92µ 73.25µ ~ 0.700
SubtreeSizes/10k_tx_256_per_subtree-4 18.39µ 18.41µ ~ 0.700
SubtreeSizes/10k_tx_512_per_subtree-4 9.124µ 9.121µ ~ 1.000
SubtreeSizes/10k_tx_1024_per_subtree-4 4.574µ 4.537µ ~ 0.700
SubtreeSizes/10k_tx_2k_per_subtree-4 2.244µ 2.271µ ~ 0.100
BlockSizeScaling/10k_tx_64_per_subtree-4 72.48µ 72.55µ ~ 1.000
BlockSizeScaling/10k_tx_256_per_subtree-4 18.23µ 18.33µ ~ 0.100
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.519µ 4.543µ ~ 0.200
BlockSizeScaling/50k_tx_64_per_subtree-4 386.6µ 383.7µ ~ 0.400
BlockSizeScaling/50k_tx_256_per_subtree-4 91.51µ 91.08µ ~ 1.000
BlockSizeScaling/50k_tx_1024_per_subtree-4 22.49µ 22.56µ ~ 1.000
SubtreeAllocations/small_subtrees_exists_check-4 152.0µ 153.9µ ~ 0.700
SubtreeAllocations/small_subtrees_data_fetch-4 162.2µ 162.3µ ~ 1.000
SubtreeAllocations/small_subtrees_full_validation-4 316.7µ 317.3µ ~ 1.000
SubtreeAllocations/medium_subtrees_exists_check-4 9.009µ 9.006µ ~ 1.000
SubtreeAllocations/medium_subtrees_data_fetch-4 9.622µ 9.576µ ~ 0.700
SubtreeAllocations/medium_subtrees_full_validation-4 18.43µ 18.30µ ~ 0.400
SubtreeAllocations/large_subtrees_exists_check-4 2.143µ 2.172µ ~ 0.200
SubtreeAllocations/large_subtrees_data_fetch-4 2.360µ 2.358µ ~ 0.700
SubtreeAllocations/large_subtrees_full_validation-4 4.550µ 4.570µ ~ 0.700
GetUtxoHashes-4 254.7n 265.9n ~ 0.200
GetUtxoHashes_ManyOutputs-4 42.48µ 44.57µ ~ 0.100
_NewMetaDataFromBytes-4 237.1n 240.7n ~ 0.400
_Bytes-4 617.1n 619.3n ~ 1.000
_MetaBytes-4 569.1n 565.8n ~ 0.200

Threshold: >10% with p < 0.05 | Generated: 2026-03-26 13:28 UTC

… MapPutItems

- Remove prunerClient interface that was unused by tests (tests mock at
  partitionWorkerFn level), revert client field to concrete *uaerospike.Client
- Reduce uniqueSpendingChildren preallocation from 100k to 1k to match
  typical usage (~50-100 entries per chunk)
- Use single MapPutItemsOp instead of N individual MapPutOps for batch
  parent updates, reducing per-record operation count
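The batching change — one MapPutItemsOp carrying all entries instead of N individual MapPutOps — can be illustrated with stub types. The real code uses aerospike-client-go's operation builders; the op type and both helper functions below are hypothetical stand-ins:

```go
package main

import "fmt"

// op is an illustrative stub for an Aerospike CDT map operation.
type op struct {
	name    string
	entries map[string]int64
}

// perItemOps models the old approach: one MapPutOp per parent entry,
// so the per-record operation count grows with the number of parents.
func perItemOps(parents map[string]int64) []op {
	ops := make([]op, 0, len(parents))
	for k, v := range parents {
		ops = append(ops, op{name: "MapPutOp", entries: map[string]int64{k: v}})
	}
	return ops
}

// batchedOp models the new approach: a single MapPutItemsOp whose payload
// holds every entry, keeping the operation count constant.
func batchedOp(parents map[string]int64) []op {
	return []op{{name: "MapPutItemsOp", entries: parents}}
}

func main() {
	parents := map[string]int64{"a": 1, "b": 2, "c": 3}
	fmt.Println(len(perItemOps(parents)), len(batchedOp(parents))) // 3 1
}
```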
