Fix/784 fund atomicity error safety #789
0xMayoor wants to merge 10 commits into gonka-ai:upgrade-v0.2.11
Conversation
…-ai#784)
- Return error from GetBitcoinSettleAmounts to prevent nil slice panic in EndBlock
- Wrap state mutations in CacheContext for atomic commit (no partial settlement)
- Check SetSettleAmountWithGovernanceTransfer error (was silently discarded)
- Check AddTokenomicsData error (was silently discarded)
- Move old settle cleanup after atomic commit (independent of current epoch)
Wrap payoutClaim in CacheContext so that if any coin transfer fails, the settle record is NOT consumed and the participant can retry. Previously, finishSettle was called on error paths (including handleUnderfundedWork), permanently consuming the settle record even when no payment was made. Remove dead handleUnderfundedWork function (its finishSettle call was the root cause of the bug). Add red-green rollback tests proving:
- Escrow payment failure preserves settle record
- Reward payment failure preserves settle record
- Happy path correctly consumes settle record
Wrap expireInferenceAndIssueRefund in CacheContext so that if the refund transfer fails, the inference status is NOT changed to EXPIRED and the timeout is preserved for retry on the next EndBlock pass. Changes:
- expireInferenceAndIssueRefund: uses CacheContext, returns (Inference, bool)
- handleExpiredInferenceWithContext: returns bool indicating commit success
- expireInferences: returns map of committed inference IDs
- EndBlock: only removes timeouts for successfully expired inferences
- 4 red-green atomicity tests in test/expiry-atomicity/
Wrap the vested payment path in PayParticipantFromModule with CacheContext so that if AddVestedRewards fails at any step (coin transfer or schedule update), no partial state persists. Previously, a failure in SetVestingSchedule after SendCoinsFromModuleToModule would leave coins in the streamvesting module with no tracking schedule. Changes:
- PayParticipantFromModule: vested path uses CacheContext
- Fixed confusing return value (was returning err from vesting path instead of explicit nil)
- 4 atomicity tests for vested/direct payment paths
…tion
- streamvesting_integration_test.go: AddVestedRewards expectations now use gomock.Any() for the context arg, since PayParticipantFromModule passes cacheCtx, not the original ctx
- testutil/keeper/inference.go: set IgnoreDuplicateDenomRegistration in StubForInitGenesisWithValidators (matching StubForInitGenesis)
…r propagation
- Wrap the FinishInference mutation section (processInferencePayments, SetInference, handleInferenceCompleted) in CacheContext so partial state never persists on failure
- Fix swallowed IssueRefund error in processInferencePayments: now returns the error instead of logging and continuing, enabling CacheContext rollback
- Fix LogError positional arg format (was passing err as a bare value, now a key-value pair)
- Add centralized comment documenting the nil-error-to-SDK contract and the CacheContext rationale
- Red-green atomicity test proves inference status is preserved on payment failure
…uction casts
- Create safecast.go with safeInt32FromInt64, safeUint32FromInt64, safeUint32FromUint64, safeInt32FromUint64, safeUint8FromUint32, and clampInt32FromInt helpers
- Table-driven boundary tests for all helpers (zero, max, overflow, underflow)
- HIGH risk: ShouldValidate weight casts use safeUint32FromInt64 with skip-on-overflow
- MEDIUM risk: query stats InferenceCount uses clampInt32FromInt with a warning log
- LOW risk: bridge/token decimals uses safeUint8FromUint32 with proper error returns
- LOW risk: top_miners payout params checked against MaxInt32 before cast
…ix (F-11)
- Document the EndBlock error handling philosophy (UNRECOVERABLE vs RECOVERABLE vs CROSS-MODULE) with inline rationale at each return-err path
- Add structured "epoch_error" SDK events for critical error paths in onEndOfPoCValidationStage (collateral advance, settle, weight adjustment) to enable operator monitoring beyond log-grep
- Fix TransferOldSettleAmountsToGovernance: change from halting (return err) to non-fatal (log + continue), matching the existing comment that old settle cleanup should not roll back current participants' rewards
BTW, the error here is also not processed and propagated.
You’re right with this one @akup
Do you want to do the fix in a separate PR, or should I commit one?
I think it is such a tiny change that it's better if you add it to this commit; otherwise I will create one more PR just for this change.
```diff
@@ -344,12 +392,16 @@ func (am AppModule) EndBlock(ctx context.Context) error {
	}

	timeouts := am.keeper.GetAllInferenceTimeoutForHeight(ctx, uint64(blockHeight))
```
If the refund fails at block 100, the timeout stays stored with height=100. But EndBlock at block 101 only queries timeouts for height=101 (exact match), so the old timeout is never retried. Maybe it should query <= blockHeight, or move the failed timeout to blockHeight+1?
this is actually what F-07 in the PR fixes @x0152 — expireInferenceAndIssueRefund uses CacheContext now so if the refund fails nothing commits, inference stays STARTED. and the timeout only gets removed if the refund actually went through (check the committedInferences map in EndBlock).
so the scenario you're describing is the bug on main, not on this branch.
It seems that @x0152 is right on this.
@0xMayoor
We are setting InferenceTimeouts on InferenceStart and we are setting their keys by ExpirationHeight.
So if we don't commit the changes in expireInferences, as you do now, these inferences will not be read and cleared at the next block height, since their key still holds the old expiration height that we have already passed.
@akup Yes, exactly what I meant
The timeout stays at the old height when the refund fails, but EndBlock reads only the current height, so it won't be retried on the next block.
Very accurate and good PR, and it should be merged. But it should be rebased first, as there were a lot of conflicting changes in #812.
Continuing #787 (settlement atomicity). Same `CacheContext` approach applied to the remaining fund-movement paths, plus error propagation fixes and an integer narrowing audit. Finding IDs reference the Task 1 analysis on #784.

**[F-06]** `finishSettle` deletes the settle record and marks `Claimed = true` before the payout is confirmed. If the payout fails, the claim is gone permanently — no retry path. Wrapped the entire `payoutClaim` (escrow pay + reward pay + `finishSettle`) in a single `CacheContext` so the record only gets removed after both payments succeed. The old code also had a partial path where work was paid but rewards failed and the settle was still consumed — that's gone now; it's all-or-nothing.

**[F-07]** When an inference times out and `IssueRefund` fails, the error was logged, the inference was marked EXPIRED, and the timeout record was removed. The requester's escrow was stuck with no retry. Now `expireInferenceAndIssueRefund` uses its own `CacheContext` — status only flips to EXPIRED after the refund commits. EndBlock only removes the timeout for inferences that actually committed, so failed ones get retried on the next pass. The executor penalty is also gated on commit, so nobody gets punished for a refund the chain couldn't process.

**[F-08]** In `processInferencePayments`, when FinishInference reprices lower and `IssueRefund` fails, the error was logged and execution continued: the inference finishes without the developer getting their change back. Now it returns the error, which feeds into the FinishInference `CacheContext` — if the refund fails, nothing commits.

**[F-09]** `processInferencePayments`, `SetInference`, and `handleInferenceCompleted` all wrote to state on the raw context. If any step failed after earlier writes, partial state persisted and `FinishedProcessed()` blocked retry, making it permanent. Wrapped the whole mutation section in one `CacheContext`. Found this while fixing F-08.

**[F-12]** In `PayParticipantFromModule`'s vesting path, `AddVestedRewards` does a coin transfer then a schedule update. If the schedule write fails after coins move, the schedule still has the old entry and the same amount gets sent again next epoch. Wrapped in `CacheContext` so the transfer and schedule commit together or not at all.

**[F-14]** 11 narrowing casts across the keeper with no bounds checks. Highest risk: three weight casts in ClaimRewards validation sampling where `Weight` (int64) gets cast to uint32 — silent truncation corrupts sampling probability. Added bounds-checking helpers in `safecast.go`. Consensus paths error on overflow (skipping that inference for validation, which is better than a corrupted probability); query paths clamp to MaxInt32 with a log warning. One tradeoff worth noting: if weights somehow exceed ~4.2B, that inference becomes unvalidatable. In practice weights are bounded by staked supply, so this shouldn't happen, but it's a conscious choice over silent truncation.

**[F-15]** A `TransferOldSettleAmountsToGovernance` failure returned an error to EndBlock, halting the chain. The existing code comment already said this should be non-fatal — it's just cleanup of unclaimed settles from previous epochs. Changed to log and continue. The unclaimed amounts stay in the module account and get picked up on the next epoch's settlement pass.

**EndBlock error path audit** — went through all 5 `return err` paths in EndBlock and verified each one is genuinely unrecoverable (missing params, failed epoch state writes, failed DKG group creation). Added inline comments with the rationale. Also added minimal `epoch_error` events at 3 cross-module error paths (collateral advance, settlement, weight adjustment) so indexers can detect failures without parsing logs.

**Note on error semantics:** msg servers in this codebase return `(response, nil)` to the SDK and encode failures in `ErrorMessage`. This PR preserves that contract. `CacheContext` handles atomicity within it — if a mutation fails, state rolls back, but the tx still "succeeds" from the SDK's perspective.