fix(core): quorum pledge/unpledge correctness — TOCTOU races, token scoping, and unpledge invariant#651
Open
gklps wants to merge 16 commits intorubixchain:release-v1from
Open
fix(core): quorum pledge/unpledge correctness — TOCTOU races, token scoping, and unpledge invariant#651gklps wants to merge 16 commits intorubixchain:release-v1from
gklps wants to merge 16 commits intorubixchain:release-v1from
Conversation
Move FOR UPDATE SKIP LOCKED to the Phase 1 candidate SELECT so row locks are acquired before denomination selection runs in Go. Remove Phase 3 (re-lock on selected IDs — now redundant) and Phase 4 (TOCTOU check — impossible when lock precedes selection). The function shrinks from 5 phases to 3 and returns the `selected` slice directly. Concurrent callers each see only tokens not already row-locked by another in-flight transaction, closing the ~30% failure window at concurrency=5.
Move ReadLatestTokenChainRows outside pledgeTx to a SELECT ... FOR UPDATE on tokens inside pledgeTx. This closes the race window where two concurrent pledge calls could read the same latest_position and produce a silent duplicate skip. - Remove ON CONFLICT DO NOTHING from tokenchain INSERT (conflicts are now bugs) - Add row-count validation after FOR UPDATE (rejects non-Free or missing tokens) - Lock in ORDER BY token_id order to prevent deadlock across concurrent callers
Three Info logs in fullnode.go fire on every published transaction on every node — at 10 nodes x 1000 tokens this produces 21K+ log lines per run. Commented with ### marker for easy re-enable when log levels are configurable.
- PledgeV2 SELECT FOR UPDATE now filters token_status = Locked, matching the state left by LockTokensForSplit. Previously filtered Free, which guaranteed zero rows and a hard failure on every normal quorum pledge. - initiateConsensusHandler now locates the pledge token entry in txnInfo.Quorums by DID match instead of hardcoding index 0, so quorums whose entry is not first in the slice no longer receive foreign tokens to pledge. - Added a defensive consistency check before PledgeV2 that rejects count/ID mismatches between the txnInfo pledge token list and the slice actually passed to PledgeV2. Refs quick task 260408-0hn analysis Case B and §5 context.
- schema.go: add lock_reference_id column + partial index in InitSchema - token.go ReleaseNonSelectedLockedRBTTokensForDID: scope by reference_id (4th param) - token.go ReleaseTokens: clear lock_reference_id=NULL on release - quorum_initiator.go: pass ReferenceId through to the non-selected release call Prevents cross-request token leakage when two concurrent pledges hit the same quorum DID, and ensures released tokens do not retain stale lock_reference_id values.
…rr() check Switch from ORDER BY token_value ASC to ORDER BY token_id ASC with a WHERE token_id > $lastSeenID cursor so each batch iteration scans a strictly advancing window. SKIP LOCKED does not skip rows held by the same transaction, so without a cursor the same rows were pushed into candidates multiple times across batches, producing duplicate token_id entries that caused PledgeV2 hard errors under load. Also add rows.Err() check after the inner scan loop to surface driver-level iteration errors instead of silently dropping them.
…ledgeV2 (VULN-4) Add two pre-PledgeV2 guards in initiateConsensusHandler: 1. Soft dedup (F-2): walk tokenInfos and remove any duplicate TokenID entries before calling PledgeV2. Duplicates are logged at WARN with a metric-shaped log line (pledge_v2_duplicate_token_total) and silently dropped. The hard dedup guard inside PledgeV2 stays as the last line of defense. 2. Lock reference_id validation (VULN-4 / F-3): query lock_reference_id for every pledge token and reject the request (HTTP 400) if any token is missing from the DB, has a NULL lock_reference_id, or has a lock_reference_id that does not match consensusRequest.ReferenceId. This prevents a replayed or interleaved consensus request from pledging tokens that belong to a different reference_id. Add GetTokenLockReferenceIDs wallet helper (token_id -> *lock_reference_id map) used by the validation block to keep SQL out of the core layer.
…s-request token release (VULN-5) Add referenceID parameter to ReleaseTokens. The UPDATE now includes AND lock_reference_id=$3 so a call for one request cannot accidentally free tokens that belong to a concurrent request with a different lock_reference_id. Calling with an empty referenceID is a no-op guarded by an early-return + warn log to prevent unscoped accidental calls. Update the sole active caller in deploySmartContractToken to pass reqID, which matches the referenceID used when LockTokensForSplit locked the tokens. The previously silent error discard (broken _ = fmt.Errorf(...)) is replaced with a proper w.log.Warn call.
…high-contention txns - Add lock_reference_id TEXT column to tokens table (schema.go) so every lock operation is tagged with the request ID that acquired it; add composite index on (did, token_status, token_value) and sparse index on lock_reference_id for faster diagnostics - Thread referenceID through LockTokensForSplit, UnlockLockedTokens, and ReleaseAllLockedRBTTokensForDID so locked tokens can always be traced back to the request that owns them (transaction_builder.go, consensus.go, quorum_recv.go, transaction.go) - Replace fixed retryBackoff with retryWithRandomBackoff (base 50ms * attempt + uniform 0-1000ms jitter) to break correlated retry storms under high concurrency; add ReleaseReferenceID call on success to clear the tag - Tune DB connection pool: min_connections 5→10, statement_timeout 5→20s, idle/lifetime timeouts extended to prevent churning under load
Returns COALESCE(SUM(token_value), 0) for free RBT tokens owned by a DID, matching the exact filter semantics (token_type=RBT, token_status=Free, did=arg) that LockTokensForSplit would see. Used as a cheap pre-check before entering the SELECT FOR UPDATE path under load.
Before calling consensus.ReqPledgeToken (which enters LockTokensForSplit
with SELECT FOR UPDATE), check the quorum's free RBT balance. If strictly
less than the requested TransactionValue, return {status:false, message:
"insufficient quorum liquidity"} at HTTP 200 without touching the DB lock
path. Equal balance is allowed through (strict "<" comparison). DB errors
on the balance fetch are returned as 500. Insufficient liquidity is logged
at Debug level only — it is a normal business condition under load.
…dgeMismatch helper Lazy-init unpledgeAuditLog/*Once/*Mu fields on Core struct (zero values usable). writeUnpledgeMismatch opens unpledge_mismatch.log on first mismatch event via sync.Once, serialises writes with a mutex, and never panics — audit logging must not crash the node.
…dgeSequenceInfo Outer ownership gate: wallet helper now filters unpledge_sequence_info by BOTH tx_id and quorum_did so TxnCallBack cannot trigger unpledge for a DID that did not pledge the tokens. CallBackQuorumUnpledge propagates did to the query. Error message includes did for easier diagnosis.
…geV2 Inner ownership gate: idempotency SELECT now filters by BOTH tx_id AND quorum_did. On pgx.ErrNoRows, a fallback query distinguishes "row absent (already unpledged)" from "row exists but owned by different DID" — the latter writes to unpledge_mismatch.log and returns nil without mutation. Chain-progress guard: inserted after allFree early-return and before BeginTx. Verifies at least one tokenchain row has previous_transaction_id=mainTxID before proceeding. Premature or misrouted unpledge calls are audited and skipped without error so sibling tokens continue processing.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this fixes
This PR addresses correctness and concurrency bugs in the quorum-side
pledge/unpledge flow discovered during E2E stress testing.
Changes
TOCTOU race elimination
LockTokensForSplit: moved token selection inside a singleSELECT FOR UPDATEtransactionPledgeV2: movedReadLatestTokenChainRowsinsidepledgeTxwithSELECT FOR UPDATEToken scoping / cross-request safety
ReleaseTokensnow scoped bylock_reference_id— prevents one request releasing another's tokensreference_idenforcement added across all token release pathslock_reference_idvalidation beforePledgeV2entry (VULN-4)lockTokensForSplitOnceswitched to cursor-based batching withrows.Err()checkUnpledge correctness (core invariant)
tokenchain.previous_transaction_id == pledge_tx_idCheckTxnsPresentInUnpledgeSequenceInfo(outer gate)and inside
UnpledgeV2idempotency SELECT (inner gate)BeginTx— prevents early unpledgeunpledge_mismatch.login node config dirSupporting
lock_reference_idtracing + random retry backoff for high-contention transactionsLockTokensForSplitin pledge request handler