
Fix/784 fund atomicity error safety#789

Open
0xMayoor wants to merge 10 commits into gonka-ai:upgrade-v0.2.11 from 0xMayoor:fix/784-fund-atomicity-error-safety

Conversation

@0xMayoor
Contributor

Continuing #787 (settlement atomicity). Same CacheContext approach applied to the remaining fund-movement paths, plus error propagation fixes and an integer narrowing audit. Finding IDs reference the Task 1 analysis on #784.


[F-06] finishSettle deletes the settle record and marks Claimed = true before the payout is confirmed. If the payout fails, the claim is gone permanently — no retry path. Wrapped the entire payoutClaim (escrow pay + reward pay + finishSettle) in a single CacheContext so the record only gets removed after both payments succeed. The old code also had a partial path where work was paid but rewards failed and the settle was still consumed — that's gone now, it's all-or-nothing.

[F-07] When an inference times out and IssueRefund fails, the error was logged, inference marked EXPIRED, and the timeout record removed. Requester's escrow stuck with no retry. Now expireInferenceAndIssueRefund uses its own CacheContext — status only flips to EXPIRED after the refund commits. EndBlock only removes the timeout for inferences that actually committed, so failed ones get retried next pass. Executor penalty is also gated on commit, so nobody gets punished for a refund the chain couldn't process.

[F-08] In processInferencePayments, when FinishInference reprices lower and IssueRefund fails, the error was logged and execution continued. Inference finishes without the developer getting their change back. Now returns the error, which feeds into the FinishInference CacheContext — if refund fails, nothing commits.

[F-09] processInferencePayments, SetInference, and handleInferenceCompleted all wrote to state on the raw context. If any step failed after earlier writes, partial state persisted and FinishedProcessed() blocked retry, making it permanent. Wrapped the whole mutation section in one CacheContext. Found this while fixing F-08.

[F-12] In PayParticipantFromModule's vesting path, AddVestedRewards does a coin transfer then a schedule update. If the schedule write fails after coins move, the schedule still has the old entry and the same amount gets sent again next epoch. Wrapped in CacheContext so the transfer and schedule commit together or not at all.

[F-14] 11 narrowing casts across the keeper with no bounds checks. Highest risk: three weight casts in ClaimRewards validation sampling where Weight (int64) gets cast to uint32 — silent truncation corrupts sampling probability. Added bounds-checking helpers in safecast.go. Consensus paths error on overflow (skips that inference for validation, which is better than corrupted probability), query paths clamp to MaxInt32 with a log warning. One tradeoff worth noting: if weights somehow exceed ~4.2B, that inference becomes unvalidatable. In practice weights are bounded by staked supply so this shouldn't happen, but it's a conscious choice over silent truncation.

[F-15] TransferOldSettleAmountsToGovernance failure returned an error to EndBlock, halting the chain. The existing code comment already said this should be non-fatal — it's just cleanup of unclaimed settles from previous epochs. Changed to log and continue. The unclaimed amounts stay in the module account and get picked up on the next epoch's settlement pass.


EndBlock error path audit — went through all 5 return err paths in EndBlock and verified each one is genuinely unrecoverable (missing params, failed epoch state writes, failed DKG group creation). Added inline comments with the rationale. Also added minimal epoch_error events at 3 cross-module error paths (collateral advance, settlement, weight adjustment) so indexers can detect failures without parsing logs.


Note on error semantics:
msg servers in this codebase return (response, nil) to the SDK and encode failures in ErrorMessage. This PR preserves that contract. CacheContext handles atomicity within the handler: if a mutation fails, its state rolls back, but the tx still "succeeds" from the SDK's perspective.

…-ai#784)

- Return error from GetBitcoinSettleAmounts to prevent nil slice panic in EndBlock
- Wrap state mutations in CacheContext for atomic commit (no partial settlement)
- Check SetSettleAmountWithGovernanceTransfer error (was silently discarded)
- Check AddTokenomicsData error (was silently discarded)
- Move old settle cleanup after atomic commit (independent of current epoch)
Wrap payoutClaim in CacheContext so that if any coin transfer fails,
the settle record is NOT consumed and the participant can retry.

Previously, finishSettle was called on error paths (including
handleUnderfundedWork), permanently consuming the settle record
even when no payment was made.

Remove dead handleUnderfundedWork function (its finishSettle call
was the root cause of the bug).

Add red-green rollback tests proving:
- Escrow payment failure preserves settle record
- Reward payment failure preserves settle record
- Happy path correctly consumes settle record
Wrap expireInferenceAndIssueRefund in CacheContext so that if the refund
transfer fails, the inference status is NOT changed to EXPIRED and the
timeout is preserved for retry on the next EndBlock pass.

Changes:
- expireInferenceAndIssueRefund: uses CacheContext, returns (Inference, bool)
- handleExpiredInferenceWithContext: returns bool indicating commit success
- expireInferences: returns map of committed inference IDs
- EndBlock: only removes timeouts for successfully expired inferences
- 4 red-green atomicity tests in test/expiry-atomicity/
Wrap the vested payment path in PayParticipantFromModule with CacheContext
so that if AddVestedRewards fails at any step (coin transfer or schedule
update), no partial state persists. Previously, a failure in
SetVestingSchedule after SendCoinsFromModuleToModule would leave coins
in the streamvesting module with no tracking schedule.

Changes:
- PayParticipantFromModule: vested path uses CacheContext
- Fixed confusing return value (was returning err from vesting path instead
  of explicit nil)
- 4 atomicity tests for vested/direct payment paths
…tion

- streamvesting_integration_test.go: AddVestedRewards expectations now
  use gomock.Any() for context arg since PayParticipantFromModule passes
  cacheCtx, not the original ctx
- testutil/keeper/inference.go: set IgnoreDuplicateDenomRegistration in
  StubForInitGenesisWithValidators (matching StubForInitGenesis)
…r propagation

- Wrap FinishInference mutation section (processInferencePayments, SetInference,
  handleInferenceCompleted) in CacheContext so partial state never persists on failure
- Fix swallowed IssueRefund error in processInferencePayments: now returns error
  instead of logging and continuing, enabling CacheContext rollback
- Fix LogError positional arg format (was passing err as value, now key-value pair)
- Add centralized comment documenting nil-error-to-SDK contract and CacheContext rationale
- Red-green atomicity test proves inference status preserved on payment failure
…uction casts

- Create safecast.go with safeInt32FromInt64, safeUint32FromInt64, safeUint32FromUint64,
  safeInt32FromUint64, safeUint8FromUint32, clampInt32FromInt helpers
- Table-driven boundary tests for all helpers (zero, max, overflow, underflow)
- HIGH risk: ShouldValidate weight casts use safeUint32FromInt64 with skip-on-overflow
- MEDIUM risk: query stats InferenceCount uses clampInt32FromInt with warning log
- LOW risk: bridge/token decimals uses safeUint8FromUint32 with proper error returns
- LOW risk: top_miners payout params checked against MaxInt32 before cast
…ix (F-11)

- Document EndBlock error handling philosophy (UNRECOVERABLE vs RECOVERABLE
  vs CROSS-MODULE) with inline rationale at each return-err path
- Add structured "epoch_error" SDK events for critical error paths in
  onEndOfPoCValidationStage (collateral advance, settle, weight adjustment)
  to enable operator monitoring beyond log-grep
- Fix TransferOldSettleAmountsToGovernance: change from halting (return err)
  to non-fatal (log + continue), matching the existing comment that old settle
  cleanup should not roll back current participants' rewards
Contributor

@akup akup Feb 26, 2026


BTW, the error here is also not handled or propagated

Contributor Author


will check this one later @akup

Contributor Author


You're right about this one @akup

you want to do a fix in a separate PR or should I commit one?

Contributor


It's such a tiny change that it's better if you add it to your commits here; otherwise I'll create a separate PR just for this change.

@tcharchian tcharchian added this to the v0.2.11 milestone Feb 26, 2026
@tcharchian tcharchian linked an issue Feb 26, 2026 that may be closed by this pull request
@tcharchian tcharchian requested a review from patimen February 28, 2026 00:51
@tcharchian tcharchian moved this from Todo to Needs reviewer in Upgrade v0.2.11 Feb 28, 2026
@Mayveskii

Reviewed msg_server_claim_rewards.go and settle_amount.go against the full settlement flow.

F-06 confirmed at line 86: finishSettle (removes settle record + sets Claimed=true) runs on the reward error path after work coins are already transferred. The settle entitlement is permanently consumed with no retry. CacheContext wrapping the entire payoutClaim is the correct fix.

F-14 confirmed at three sites: int64(GetWorkCoins()) (line 62), int64(GetRewardCoins()) (line 80), and uint32(totalWeight) / uint32(weight.Weight) in ShouldValidate (line 469). The safecast.go approach handles this correctly.

Additional finding — GetTotalCoins overflow (settle_amount.go:6): RewardCoins + WorkCoins is added as uint64 before the > math.MaxInt64 guard. If both values are large, uint64 wrap-around produces a small sum that bypasses the check. The guard should precede the addition.

Settlement correctness is foundational. Any additive weight feature built on top — EffectivePocWeight (#856), CacheQualityWeight (#859), or future quality axes — computes on top of the same settleAmount values that pass through ShouldValidate. Partial payouts or silent integer truncation here propagate into sampling probability and ultimately into epoch weight distribution. This PR makes the layer below those features trustworthy.

@@ -344,12 +392,16 @@ func (am AppModule) EndBlock(ctx context.Context) error {
}

timeouts := am.keeper.GetAllInferenceTimeoutForHeight(ctx, uint64(blockHeight))
Contributor


If refund fails at block 100, timeout stays stored with height=100. But EndBlock on block 101 only queries timeouts for height=101 (exact match), so the old timeout is never retried. Maybe should query <= blockHeight or move failed timeout to blockHeight+1?

Contributor Author


this is actually what F-07 in the PR fixes @x0152 — expireInferenceAndIssueRefund uses CacheContext now so if the refund fails nothing commits, inference stays STARTED. and the timeout only gets removed if the refund actually went through (check the committedInferences map in EndBlock).

so the scenario you're describing is the bug on main, not on this branch.

Contributor


It seems that @x0152 is right on this.

@0xMayoor
We set InferenceTimeouts on InferenceStart, keyed by ExpirationHeight.
So if the changes in expireInferences don't commit, as your code now does on refund failure, those inferences will never be read and cleared at the next block height, since their key still holds an old expiration height we have already passed.

Contributor


@akup Yes, exactly what I meant

The timeout stays at the old height when the refund fails, but EndBlock reads only the current height, so it won't retry on the next block.

@akup
Contributor

akup commented Mar 6, 2026

Very accurate, good PR; it should be merged.

But it should be rebased first, as there were a lot of conflicting changes in #812, and then re-reviewed.


Status: Needs reviewer

Successfully merging this pull request may close these issues: Possible underfunded issues

5 participants