Skip to content

fix(core): quorum pledge/unpledge correctness — TOCTOU races, token scoping, and unpledge invariant#651

Open
gklps wants to merge 16 commits intorubixchain:release-v1from
gklps:fix/rbt-transfer-e2e-v2
Open

fix(core): quorum pledge/unpledge correctness — TOCTOU races, token scoping, and unpledge invariant#651
gklps wants to merge 16 commits intorubixchain:release-v1from
gklps:fix/rbt-transfer-e2e-v2

Conversation

@gklps
Copy link
Copy Markdown
Member

@gklps gklps commented Apr 8, 2026

What this fixes

This PR addresses correctness and concurrency bugs in the quorum-side
pledge/unpledge flow discovered during E2E stress testing.

Changes

TOCTOU race elimination

  • LockTokensForSplit: moved token selection inside a single SELECT FOR UPDATE transaction
  • PledgeV2: moved ReadLatestTokenChainRows inside pledgeTx with SELECT FOR UPDATE

Token scoping / cross-request safety

  • ReleaseTokens now scoped by lock_reference_id — prevents one request releasing another's tokens
  • reference_id enforcement added across all token release paths
  • Soft dedup guard + lock_reference_id validation before PledgeV2 entry (VULN-4)
  • lockTokensForSplitOnce switched to cursor-based batching with rows.Err() check

Unpledge correctness (core invariant)

  • Unpledge now only triggers when tokenchain.previous_transaction_id == pledge_tx_id
  • Quorum ownership check added at both CheckTxnsPresentInUnpledgeSequenceInfo (outer gate)
    and inside UnpledgeV2 idempotency SELECT (inner gate)
  • Chain-progress guard added before BeginTx — prevents early unpledge
  • Mismatch events logged to unpledge_mismatch.log in node config dir

Supporting

  • lock_reference_id tracing + random retry backoff for high-contention transactions
  • Liquidity pre-check guard before LockTokensForSplit in pledge request handler
  • High-volume pub-sub logs silenced on quorum nodes

gklps added 16 commits April 9, 2026 03:08
Move FOR UPDATE SKIP LOCKED to the Phase 1 candidate SELECT so row
locks are acquired before denomination selection runs in Go.

Remove Phase 3 (re-lock on selected IDs — now redundant) and Phase 4
(TOCTOU check — impossible when lock precedes selection). The function
shrinks from 5 phases to 3 and returns the `selected` slice directly.

Concurrent callers each see only tokens not already row-locked by
another in-flight transaction, closing the ~30% failure window at
concurrency=5.
Move ReadLatestTokenChainRows outside pledgeTx to a SELECT ... FOR UPDATE on tokens
inside pledgeTx. This closes the race window where two concurrent pledge calls could
read the same latest_position and produce a silent duplicate skip.

- Remove ON CONFLICT DO NOTHING from tokenchain INSERT (conflicts are now bugs)
- Add row-count validation after FOR UPDATE (rejects non-Free or missing tokens)
- Lock in ORDER BY token_id order to prevent deadlock across concurrent callers
Three Info logs in fullnode.go fire on every published transaction on every
node — at 10 nodes x 1000 tokens this produces 21K+ log lines per run.
Commented with ### marker for easy re-enable when log levels are configurable.
- PledgeV2 SELECT FOR UPDATE now filters token_status = Locked,
  matching the state left by LockTokensForSplit. Previously filtered
  Free, which guaranteed zero rows and a hard failure on every
  normal quorum pledge.
- initiateConsensusHandler now locates the pledge token entry in
  txnInfo.Quorums by DID match instead of hardcoding index 0, so
  quorums whose entry is not first in the slice no longer receive
  foreign tokens to pledge.
- Added a defensive consistency check before PledgeV2 that rejects
  count/ID mismatches between the txnInfo pledge token list and
  the slice actually passed to PledgeV2.

Refs quick task 260408-0hn analysis Case B and §5 context.
- schema.go: add lock_reference_id column + partial index in InitSchema
- token.go ReleaseNonSelectedLockedRBTTokensForDID: scope by reference_id (4th param)
- token.go ReleaseTokens: clear lock_reference_id=NULL on release
- quorum_initiator.go: pass ReferenceId through to the non-selected release call

Prevents cross-request token leakage when two concurrent pledges hit the
same quorum DID, and ensures released tokens do not retain stale
lock_reference_id values.
…rr() check

Switch from ORDER BY token_value ASC to ORDER BY token_id ASC with a
WHERE token_id > $lastSeenID cursor so each batch iteration scans a
strictly advancing window. SKIP LOCKED does not skip rows held by the
same transaction, so without a cursor the same rows were pushed into
candidates multiple times across batches, producing duplicate token_id
entries that caused PledgeV2 hard errors under load.

Also add rows.Err() check after the inner scan loop to surface
driver-level iteration errors instead of silently dropping them.
…ledgeV2 (VULN-4)

Add two pre-PledgeV2 guards in initiateConsensusHandler:

1. Soft dedup (F-2): walk tokenInfos and remove any duplicate TokenID
   entries before calling PledgeV2. Duplicates are logged at WARN with a
   metric-shaped log line (pledge_v2_duplicate_token_total) and silently
   dropped. The hard dedup guard inside PledgeV2 stays as the last line
   of defense.

2. Lock reference_id validation (VULN-4 / F-3): query lock_reference_id
   for every pledge token and reject the request (HTTP 400) if any token
   is missing from the DB, has a NULL lock_reference_id, or has a
   lock_reference_id that does not match consensusRequest.ReferenceId.
   This prevents a replayed or interleaved consensus request from pledging
   tokens that belong to a different reference_id.

Add GetTokenLockReferenceIDs wallet helper (token_id -> *lock_reference_id
map) used by the validation block to keep SQL out of the core layer.
…s-request token release (VULN-5)

Add referenceID parameter to ReleaseTokens. The UPDATE now includes
AND lock_reference_id=$3 so a call for one request cannot accidentally
free tokens that belong to a concurrent request with a different
lock_reference_id. Calling with an empty referenceID is a no-op guarded
by an early-return + warn log to prevent unscoped accidental calls.

Update the sole active caller in deploySmartContractToken to pass reqID,
which matches the referenceID used when LockTokensForSplit locked the tokens.

The previously silent error discard (broken _ = fmt.Errorf(...)) is
replaced with a proper w.log.Warn call.
…high-contention txns

- Add lock_reference_id TEXT column to tokens table (schema.go) so every
  lock operation is tagged with the request ID that acquired it; add
  composite index on (did, token_status, token_value) and sparse index on
  lock_reference_id for faster diagnostics

- Thread referenceID through LockTokensForSplit, UnlockLockedTokens, and
  ReleaseAllLockedRBTTokensForDID so locked tokens can always be traced
  back to the request that owns them (transaction_builder.go, consensus.go,
  quorum_recv.go, transaction.go)

- Replace fixed retryBackoff with retryWithRandomBackoff (base 50ms * attempt
  + uniform 0-1000ms jitter) to break correlated retry storms under high
  concurrency; add ReleaseReferenceID call on success to clear the tag

- Tune DB connection pool: min_connections 5→10, statement_timeout 5→20s,
  idle/lifetime timeouts extended to prevent churning under load
Returns COALESCE(SUM(token_value), 0) for free RBT tokens owned by a DID,
matching the exact filter semantics (token_type=RBT, token_status=Free, did=arg)
that LockTokensForSplit would see. Used as a cheap pre-check before entering
the SELECT FOR UPDATE path under load.
Before calling consensus.ReqPledgeToken (which enters LockTokensForSplit
with SELECT FOR UPDATE), check the quorum's free RBT balance. If strictly
less than the requested TransactionValue, return {status:false, message:
"insufficient quorum liquidity"} at HTTP 200 without touching the DB lock
path. Equal balance is allowed through (strict "<" comparison). DB errors
on the balance fetch are returned as 500. Insufficient liquidity is logged
at Debug level only — it is a normal business condition under load.
…dgeMismatch helper

Lazy-init unpledgeAuditLog/*Once/*Mu fields on Core struct (zero values usable).
writeUnpledgeMismatch opens unpledge_mismatch.log on first mismatch event via
sync.Once, serialises writes with a mutex, and never panics — audit logging
must not crash the node.
…dgeSequenceInfo

Outer ownership gate: wallet helper now filters unpledge_sequence_info by
BOTH tx_id and quorum_did so TxnCallBack cannot trigger unpledge for a DID
that did not pledge the tokens. CallBackQuorumUnpledge propagates did to
the query. Error message includes did for easier diagnosis.
…geV2

Inner ownership gate: idempotency SELECT now filters by BOTH tx_id AND
quorum_did. On pgx.ErrNoRows, a fallback query distinguishes "row absent
(already unpledged)" from "row exists but owned by different DID" — the
latter writes to unpledge_mismatch.log and returns nil without mutation.

Chain-progress guard: inserted after allFree early-return and before BeginTx.
Verifies at least one tokenchain row has previous_transaction_id=mainTxID
before proceeding. Premature or misrouted unpledge calls are audited and
skipped without error so sibling tokens continue processing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant