
Fix/784 fund atomicity error safety#789

Open
0xMayoor wants to merge 10 commits into gonka-ai:upgrade-v0.2.11 from 0xMayoor:fix/784-fund-atomicity-error-safety

Conversation

@0xMayoor
Contributor

Continuing #787 (settlement atomicity). Same CacheContext approach applied to the remaining fund-movement paths, plus error propagation fixes and an integer narrowing audit. Finding IDs reference the Task 1 analysis on #784.


[F-06] finishSettle deletes the settle record and marks Claimed = true before the payout is confirmed. If the payout fails, the claim is gone permanently — no retry path. Wrapped the entire payoutClaim (escrow pay + reward pay + finishSettle) in a single CacheContext so the record only gets removed after both payments succeed. The old code also had a partial path where work was paid but rewards failed and the settle was still consumed — that's gone now, it's all-or-nothing.

[F-07] When an inference times out and IssueRefund fails, the error was logged, inference marked EXPIRED, and the timeout record removed. Requester's escrow stuck with no retry. Now expireInferenceAndIssueRefund uses its own CacheContext — status only flips to EXPIRED after the refund commits. EndBlock only removes the timeout for inferences that actually committed, so failed ones get retried next pass. Executor penalty is also gated on commit, so nobody gets punished for a refund the chain couldn't process.

[F-08] In processInferencePayments, when FinishInference reprices lower and IssueRefund fails, the error was logged and execution continued. Inference finishes without the developer getting their change back. Now returns the error, which feeds into the FinishInference CacheContext — if refund fails, nothing commits.

[F-09] processInferencePayments, SetInference, and handleInferenceCompleted all wrote to state on the raw context. If any step failed after earlier writes, partial state persisted and FinishedProcessed() blocked retry, making it permanent. Wrapped the whole mutation section in one CacheContext. Found this while fixing F-08.

[F-12] In PayParticipantFromModule's vesting path, AddVestedRewards does a coin transfer then a schedule update. If the schedule write fails after coins move, the schedule still has the old entry and the same amount gets sent again next epoch. Wrapped in CacheContext so the transfer and schedule commit together or not at all.

[F-14] 11 narrowing casts across the keeper with no bounds checks. Highest risk: three weight casts in ClaimRewards validation sampling where Weight (int64) gets cast to uint32 — silent truncation corrupts sampling probability. Added bounds-checking helpers in safecast.go. Consensus paths error on overflow (skips that inference for validation, which is better than corrupted probability), query paths clamp to MaxInt32 with a log warning. One tradeoff worth noting: if weights somehow exceed ~4.2B, that inference becomes unvalidatable. In practice weights are bounded by staked supply so this shouldn't happen, but it's a conscious choice over silent truncation.

[F-15] TransferOldSettleAmountsToGovernance failure returned an error to EndBlock, halting the chain. The existing code comment already said this should be non-fatal — it's just cleanup of unclaimed settles from previous epochs. Changed to log and continue. The unclaimed amounts stay in the module account and get picked up on the next epoch's settlement pass.


EndBlock error path audit — went through all 5 return err paths in EndBlock and verified each one is genuinely unrecoverable (missing params, failed epoch state writes, failed DKG group creation). Added inline comments with the rationale. Also added minimal epoch_error events at 3 cross-module error paths (collateral advance, settlement, weight adjustment) so indexers can detect failures without parsing logs.


Note on error semantics:
msg servers in this codebase return (response, nil) to the SDK and encode failures in ErrorMessage. This PR preserves that contract. CacheContext handles atomicity within the handler: if a mutation fails, its state rolls back, but the tx still "succeeds" from the SDK's perspective.

…-ai#784)

- Return error from GetBitcoinSettleAmounts to prevent nil slice panic in EndBlock
- Wrap state mutations in CacheContext for atomic commit (no partial settlement)
- Check SetSettleAmountWithGovernanceTransfer error (was silently discarded)
- Check AddTokenomicsData error (was silently discarded)
- Move old settle cleanup after atomic commit (independent of current epoch)
Wrap payoutClaim in CacheContext so that if any coin transfer fails,
the settle record is NOT consumed and the participant can retry.

Previously, finishSettle was called on error paths (including
handleUnderfundedWork), permanently consuming the settle record
even when no payment was made.

Remove dead handleUnderfundedWork function (its finishSettle call
was the root cause of the bug).

Add red-green rollback tests proving:
- Escrow payment failure preserves settle record
- Reward payment failure preserves settle record
- Happy path correctly consumes settle record
Wrap expireInferenceAndIssueRefund in CacheContext so that if the refund
transfer fails, the inference status is NOT changed to EXPIRED and the
timeout is preserved for retry on the next EndBlock pass.

Changes:
- expireInferenceAndIssueRefund: uses CacheContext, returns (Inference, bool)
- handleExpiredInferenceWithContext: returns bool indicating commit success
- expireInferences: returns map of committed inference IDs
- EndBlock: only removes timeouts for successfully expired inferences
- 4 red-green atomicity tests in test/expiry-atomicity/
Wrap the vested payment path in PayParticipantFromModule with CacheContext
so that if AddVestedRewards fails at any step (coin transfer or schedule
update), no partial state persists. Previously, a failure in
SetVestingSchedule after SendCoinsFromModuleToModule would leave coins
in the streamvesting module with no tracking schedule.

Changes:
- PayParticipantFromModule: vested path uses CacheContext
- Fixed confusing return value (was returning err from vesting path instead
  of explicit nil)
- 4 atomicity tests for vested/direct payment paths
…tion

- streamvesting_integration_test.go: AddVestedRewards expectations now
  use gomock.Any() for context arg since PayParticipantFromModule passes
  cacheCtx, not the original ctx
- testutil/keeper/inference.go: set IgnoreDuplicateDenomRegistration in
  StubForInitGenesisWithValidators (matching StubForInitGenesis)
…r propagation

- Wrap FinishInference mutation section (processInferencePayments, SetInference,
  handleInferenceCompleted) in CacheContext so partial state never persists on failure
- Fix swallowed IssueRefund error in processInferencePayments: now returns error
  instead of logging and continuing, enabling CacheContext rollback
- Fix LogError positional arg format (was passing err as value, now key-value pair)
- Add centralized comment documenting nil-error-to-SDK contract and CacheContext rationale
- Red-green atomicity test proves inference status preserved on payment failure
…uction casts

- Create safecast.go with safeInt32FromInt64, safeUint32FromInt64, safeUint32FromUint64,
  safeInt32FromUint64, safeUint8FromUint32, clampInt32FromInt helpers
- Table-driven boundary tests for all helpers (zero, max, overflow, underflow)
- HIGH risk: ShouldValidate weight casts use safeUint32FromInt64 with skip-on-overflow
- MEDIUM risk: query stats InferenceCount uses clampInt32FromInt with warning log
- LOW risk: bridge/token decimals uses safeUint8FromUint32 with proper error returns
- LOW risk: top_miners payout params checked against MaxInt32 before cast
…ix (F-11)

- Document EndBlock error handling philosophy (UNRECOVERABLE vs RECOVERABLE
  vs CROSS-MODULE) with inline rationale at each return-err path
- Add structured "epoch_error" SDK events for critical error paths in
  onEndOfPoCValidationStage (collateral advance, settle, weight adjustment)
  to enable operator monitoring beyond log-grep
- Fix TransferOldSettleAmountsToGovernance: change from halting (return err)
  to non-fatal (log + continue), matching the existing comment that old settle
  cleanup should not roll back current participants' rewards
Contributor

@akup akup Feb 26, 2026


BTW, the error here is also not handled or propagated

Contributor Author


will check this one later @akup

Contributor Author


You're right about this one @akup

you want to do a fix in a separate PR or should I commit one?

Contributor


It's such a tiny change that it's better if you add it to your commits here; otherwise I'll create a separate PR just for this change.

@tcharchian tcharchian added this to the v0.2.11 milestone Feb 26, 2026
@tcharchian tcharchian linked an issue Feb 26, 2026 that may be closed by this pull request
@tcharchian tcharchian requested a review from patimen February 28, 2026 00:51
@tcharchian tcharchian moved this from Todo to Needs reviewer in Upgrade v0.2.11 Feb 28, 2026
@Mayveskii

Reviewed msg_server_claim_rewards.go and settle_amount.go against the full settlement flow.

F-06 confirmed at line 86: finishSettle (removes settle record + sets Claimed=true) runs on the reward error path after work coins are already transferred. The settle entitlement is permanently consumed with no retry. CacheContext wrapping the entire payoutClaim is the correct fix.

F-14 confirmed at three sites: int64(GetWorkCoins()) (line 62), int64(GetRewardCoins()) (line 80), and uint32(totalWeight) / uint32(weight.Weight) in ShouldValidate (line 469). The safecast.go approach handles this correctly.

Additional finding — GetTotalCoins overflow (settle_amount.go:6): RewardCoins + WorkCoins is added as uint64 before the > math.MaxInt64 guard. If both values are large, uint64 wrap-around produces a small sum that bypasses the check. The guard should precede the addition.

Settlement correctness is foundational. Any additive weight feature built on top — EffectivePocWeight (#856), CacheQualityWeight (#859), or future quality axes — computes on top of the same settleAmount values that pass through ShouldValidate. Partial payouts or silent integer truncation here propagate into sampling probability and ultimately into epoch weight distribution. This PR makes the layer below those features trustworthy.

@@ -344,12 +392,16 @@ func (am AppModule) EndBlock(ctx context.Context) error {
}

timeouts := am.keeper.GetAllInferenceTimeoutForHeight(ctx, uint64(blockHeight))
Contributor


If refund fails at block 100, timeout stays stored with height=100. But EndBlock on block 101 only queries timeouts for height=101 (exact match), so the old timeout is never retried. Maybe should query <= blockHeight or move failed timeout to blockHeight+1?

Contributor Author


this is actually what F-07 in the PR fixes @x0152 — expireInferenceAndIssueRefund uses CacheContext now so if the refund fails nothing commits, inference stays STARTED. and the timeout only gets removed if the refund actually went through (check the committedInferences map in EndBlock).

so the scenario you're describing is the bug on main, not on this branch.

Contributor


It seems that @x0152 is right on this.

@0xMayoor
We set InferenceTimeouts on InferenceStart, keyed by ExpirationHeight.
So if the changes in expireInferences don't commit, as your code now does on refund failure, those inferences will never be read and cleared at the next block height, since their key still holds an old expiration height we have already passed.

Contributor


@akup Yes, exactly what I meant

The timeout stays at the old height when the refund fails, but EndBlock reads only the current height, so it won't retry on the next block.

@akup
Contributor

akup commented Mar 6, 2026

Very accurate, good PR; it should be merged.

But it should be rebased first, as there were a lot of conflicting changes in #812, and then re-reviewed.


Status: Needs reviewer

Successfully merging this pull request may close these issues: Possible underfunded issues

5 participants