fix(core): quorum pledge/unpledge correctness — TOCTOU races, token scoping, and unpledge invariant by gklps · Pull Request #651 · rubixchain/rubixgoplatform

gklps · 2026-04-08T21:48:31Z

What this fixes

This PR addresses correctness and concurrency bugs in the quorum-side
pledge/unpledge flow discovered during E2E stress testing.

Changes

TOCTOU race elimination

LockTokensForSplit: moved token selection inside a single SELECT FOR UPDATE transaction
PledgeV2: moved ReadLatestTokenChainRows inside pledgeTx with SELECT FOR UPDATE

Token scoping / cross-request safety

ReleaseTokens now scoped by lock_reference_id — prevents one request releasing another's tokens
reference_id enforcement added across all token release paths
Soft dedup guard + lock_reference_id validation before PledgeV2 entry (VULN-4)
lockTokensForSplitOnce switched to cursor-based batching with rows.Err() check

Unpledge correctness (core invariant)

Unpledge now only triggers when tokenchain.previous_transaction_id == pledge_tx_id
Quorum ownership check added at both CheckTxnsPresentInUnpledgeSequenceInfo (outer gate)
and inside UnpledgeV2 idempotency SELECT (inner gate)
Chain-progress guard added before BeginTx — prevents early unpledge
Mismatch events logged to unpledge_mismatch.log in node config dir

Supporting

lock_reference_id tracing + random retry backoff for high-contention transactions
Liquidity pre-check guard before LockTokensForSplit in pledge request handler
High-volume pub-sub logs silenced on quorum nodes

Move FOR UPDATE SKIP LOCKED to the Phase 1 candidate SELECT so row locks are acquired before denomination selection runs in Go. Remove Phase 3 (re-lock on selected IDs — now redundant) and Phase 4 (TOCTOU check — impossible when lock precedes selection). The function shrinks from 5 phases to 3 and returns the `selected` slice directly. Concurrent callers each see only tokens not already row-locked by another in-flight transaction, closing the ~30% failure window at concurrency=5.

Move ReadLatestTokenChainRows from outside pledgeTx to inside it, behind a SELECT ... FOR UPDATE on tokens. This closes the race window where two concurrent pledge calls could read the same latest_position and produce a silent duplicate skip.

- Remove ON CONFLICT DO NOTHING from the tokenchain INSERT (conflicts are now bugs)
- Add row-count validation after FOR UPDATE (rejects non-Free or missing tokens)
- Acquire locks in token_id order (ORDER BY token_id) to prevent deadlock across concurrent callers

Three Info logs in fullnode.go fire on every published transaction on every node. At 10 nodes x 1000 tokens this produces 21K+ log lines per run. They are commented out with a ### marker for easy re-enabling once log levels are configurable.

- PledgeV2 SELECT FOR UPDATE now filters token_status = Locked, matching the state left by LockTokensForSplit. Previously it filtered Free, which guaranteed zero rows and a hard failure on every normal quorum pledge.
- initiateConsensusHandler now locates the pledge token entry in txnInfo.Quorums by DID match instead of hardcoding index 0, so quorums whose entry is not first in the slice no longer receive foreign tokens to pledge.
- Added a defensive consistency check before PledgeV2 that rejects count/ID mismatches between the txnInfo pledge token list and the slice actually passed to PledgeV2.

Refs quick task 260408-0hn analysis Case B and §5 context.
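The DID lookup and the consistency check described above could look roughly like the following. This is a minimal sketch, not the code in this PR: `quorumPledge`, `findPledgeForDID`, and `checkPledgeConsistency` are hypothetical names, and the real `txnInfo.Quorums` entry type may carry different fields.

```go
package main

import "fmt"

// quorumPledge is a hypothetical stand-in for one entry of txnInfo.Quorums.
type quorumPledge struct {
	DID      string
	TokenIDs []string
}

// findPledgeForDID returns the entry whose DID matches selfDID.
// ok is false when nothing matches, so the caller can reject the request
// instead of silently falling back to index 0.
func findPledgeForDID(quorums []quorumPledge, selfDID string) (entry quorumPledge, ok bool) {
	for _, q := range quorums {
		if q.DID == selfDID {
			return q, true
		}
	}
	return quorumPledge{}, false
}

// checkPledgeConsistency rejects count or ID mismatches between the token IDs
// recorded for this quorum (expected) and the IDs about to be pledged (passed).
// Each expected ID may be matched only once, so repeated IDs are also rejected.
func checkPledgeConsistency(expected, passed []string) error {
	if len(expected) != len(passed) {
		return fmt.Errorf("pledge token count mismatch: expected %d, got %d", len(expected), len(passed))
	}
	remaining := make(map[string]struct{}, len(expected))
	for _, id := range expected {
		remaining[id] = struct{}{}
	}
	for _, id := range passed {
		if _, found := remaining[id]; !found {
			return fmt.Errorf("pledge token %s is not in the expected list (or is repeated)", id)
		}
		delete(remaining, id)
	}
	return nil
}
```

Returning an explicit `ok` flag keeps the "no entry for this DID" case visible to the handler, which is the failure the hardcoded index used to hide.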

- schema.go: add lock_reference_id column + partial index in InitSchema
- token.go ReleaseNonSelectedLockedRBTTokensForDID: scope by reference_id (4th param)
- token.go ReleaseTokens: clear lock_reference_id=NULL on release
- quorum_initiator.go: pass ReferenceId through to the non-selected release call

Prevents cross-request token leakage when two concurrent pledges hit the same quorum DID, and ensures released tokens do not retain stale lock_reference_id values.

…rr() check

Switch from ORDER BY token_value ASC to ORDER BY token_id ASC with a WHERE token_id > $lastSeenID cursor so each batch iteration scans a strictly advancing window. SKIP LOCKED does not skip rows held by the same transaction, so without a cursor the same rows were pushed into candidates multiple times across batches, producing duplicate token_id entries that caused PledgeV2 hard errors under load.

Also add rows.Err() check after the inner scan loop to surface driver-level iteration errors instead of silently dropping them.

…ledgeV2 (VULN-4)

Add two pre-PledgeV2 guards in initiateConsensusHandler:

1. Soft dedup (F-2): walk tokenInfos and remove any duplicate TokenID entries before calling PledgeV2. Duplicates are logged at WARN with a metric-shaped log line (pledge_v2_duplicate_token_total) and silently dropped. The hard dedup guard inside PledgeV2 stays as the last line of defense.
2. Lock reference_id validation (VULN-4 / F-3): query lock_reference_id for every pledge token and reject the request (HTTP 400) if any token is missing from the DB, has a NULL lock_reference_id, or has a lock_reference_id that does not match consensusRequest.ReferenceId. This prevents a replayed or interleaved consensus request from pledging tokens that belong to a different reference_id.

Add GetTokenLockReferenceIDs wallet helper (token_id -> *lock_reference_id map) used by the validation block to keep SQL out of the core layer.
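The two guards can be sketched as pure functions. This is an illustration under stated assumptions rather than the handler code: `dedupTokenIDs` and `validateLockReference` are hypothetical names, token IDs are modelled as plain strings, and `lockRefs` mirrors the `token_id -> *lock_reference_id` map shape described for GetTokenLockReferenceIDs, with a nil pointer standing for SQL NULL.

```go
package main

import "fmt"

// dedupTokenIDs removes repeated IDs, keeping the first occurrence of each
// and preserving order. dropped is the number of duplicates removed, which a
// caller could log next to pledge_v2_duplicate_token_total.
func dedupTokenIDs(ids []string) (unique []string, dropped int) {
	seen := make(map[string]struct{}, len(ids))
	unique = make([]string, 0, len(ids))
	for _, id := range ids {
		if _, dup := seen[id]; dup {
			dropped++
			continue
		}
		seen[id] = struct{}{}
		unique = append(unique, id)
	}
	return unique, dropped
}

// validateLockReference returns an error (which the handler would map to
// HTTP 400) if any token is missing, has a NULL lock_reference_id, or is
// locked under a different reference ID.
func validateLockReference(ids []string, lockRefs map[string]*string, referenceID string) error {
	for _, id := range ids {
		ref, found := lockRefs[id]
		switch {
		case !found:
			return fmt.Errorf("token %s: not found", id)
		case ref == nil:
			return fmt.Errorf("token %s: lock_reference_id is NULL", id)
		case *ref != referenceID:
			return fmt.Errorf("token %s: locked by reference %q, request is %q", id, *ref, referenceID)
		}
	}
	return nil
}
```

Running the dedup first means the validation query sees each token once, so a duplicated entry cannot turn into a confusing double failure.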

…s-request token release (VULN-5)

Add referenceID parameter to ReleaseTokens. The UPDATE now includes AND lock_reference_id=$3 so a call for one request cannot accidentally free tokens that belong to a concurrent request with a different lock_reference_id. Calling with an empty referenceID is a no-op guarded by an early-return + warn log to prevent unscoped accidental calls.

Update the sole active caller in deploySmartContractToken to pass reqID, which matches the referenceID used when LockTokensForSplit locked the tokens. The previously silent error discard (broken _ = fmt.Errorf(...)) is replaced with a proper w.log.Warn call.

…high-contention txns

- Add lock_reference_id TEXT column to tokens table (schema.go) so every lock operation is tagged with the request ID that acquired it; add composite index on (did, token_status, token_value) and sparse index on lock_reference_id for faster diagnostics
- Thread referenceID through LockTokensForSplit, UnlockLockedTokens, and ReleaseAllLockedRBTTokensForDID so locked tokens can always be traced back to the request that owns them (transaction_builder.go, consensus.go, quorum_recv.go, transaction.go)
- Replace fixed retryBackoff with retryWithRandomBackoff (base 50ms * attempt + uniform 0-1000ms jitter) to break correlated retry storms under high concurrency; add ReleaseReferenceID call on success to clear the tag
- Tune DB connection pool: min_connections 5→10, statement_timeout 5→20s, idle/lifetime timeouts extended to prevent churning under load

Returns COALESCE(SUM(token_value), 0) for free RBT tokens owned by a DID, matching the exact filter semantics (token_type=RBT, token_status=Free, did=arg) that LockTokensForSplit would see. Used as a cheap pre-check before entering the SELECT FOR UPDATE path under load.

Before calling consensus.ReqPledgeToken (which enters LockTokensForSplit with SELECT FOR UPDATE), check the quorum's free RBT balance. If strictly less than the requested TransactionValue, return {status:false, message: "insufficient quorum liquidity"} at HTTP 200 without touching the DB lock path. Equal balance is allowed through (strict "<" comparison). DB errors on the balance fetch are returned as 500. Insufficient liquidity is logged at Debug level only — it is a normal business condition under load.

…R UPDATE query

…dgeMismatch helper

Lazy-init unpledgeAuditLog/*Once/*Mu fields on Core struct (zero values usable). writeUnpledgeMismatch opens unpledge_mismatch.log on first mismatch event via sync.Once, serialises writes with a mutex, and never panics, because audit logging must not crash the node.

…dgeSequenceInfo

Outer ownership gate: wallet helper now filters unpledge_sequence_info by BOTH tx_id and quorum_did so TxnCallBack cannot trigger unpledge for a DID that did not pledge the tokens. CallBackQuorumUnpledge propagates did to the query. Error message includes did for easier diagnosis.

…geV2

Inner ownership gate: idempotency SELECT now filters by BOTH tx_id AND quorum_did. On pgx.ErrNoRows, a fallback query distinguishes "row absent (already unpledged)" from "row exists but owned by different DID"; the latter writes to unpledge_mismatch.log and returns nil without mutation.

Chain-progress guard: inserted after allFree early-return and before BeginTx. Verifies at least one tokenchain row has previous_transaction_id=mainTxID before proceeding. Premature or misrouted unpledge calls are audited and skipped without error so sibling tokens continue processing.
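The ordering of the two gates can be modelled as one decision function. This is a simplified sketch, not UnpledgeV2 itself: `tokenChainRow`, `unpledgeDecision`, and `decideUnpledge` are hypothetical names, the owner lookup is reduced to a nullable string (nil meaning no unpledge_sequence_info row exists for the transaction), and the allFree early-return is left out.

```go
package main

// tokenChainRow is a hypothetical projection of a tokenchain row.
type tokenChainRow struct {
	TokenID               string
	PreviousTransactionID string
}

// unpledgeDecision enumerates the outcomes described for the two gates.
type unpledgeDecision int

const (
	proceedUnpledge        unpledgeDecision = iota // open the transaction and unpledge
	skipAlreadyUnpledged                           // no row for this tx: nothing to do
	skipOwnedByOtherDID                            // audited, returns nil, no mutation
	skipChainNotProgressed                         // audited, returns nil, no mutation
)

// decideUnpledge applies the inner ownership gate first, then the
// chain-progress guard. ownerDID is nil when no sequence row exists for
// the transaction at all.
func decideUnpledge(ownerDID *string, callerDID string, rows []tokenChainRow, mainTxID string) unpledgeDecision {
	if ownerDID == nil {
		return skipAlreadyUnpledged
	}
	if *ownerDID != callerDID {
		return skipOwnedByOtherDID
	}
	// Proceed only if at least one token has advanced past the pledge,
	// that is, its latest row points back at mainTxID.
	for _, r := range rows {
		if r.PreviousTransactionID == mainTxID {
			return proceedUnpledge
		}
	}
	return skipChainNotProgressed
}
```

Modelling the skips as distinct values rather than errors matches the stated behaviour that premature or misrouted calls are skipped without error so sibling tokens keep processing.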

…V2 token update

gklps added 16 commits April 9, 2026 03:08

fix(core): add referenceID parameter to PledgeV2 and update SELECT FOR UPDATE query

d29c64a

fix(core): enhance error message and add referenceID filter in PledgeV2 token update

aef6403

gklps wants to merge 16 commits into rubixchain:release-v1 from gklps:fix/rbt-transfer-e2e-v2