fix(bridge): batch withdraw proposals and add crash-safe idempotency by sameh-farouk · Pull Request #1079 · threefoldtech/tfchain

sameh-farouk · 2026-03-06T23:58:51Z

Summary

Batching (bridge: Significant Delays in BurnTransactionReady/RefundTransactionReady Processing Leading to Accumulating Unhandled Transactions #1053): Reorder event loop to process Ready events (time-sensitive, ~2 min signature expiry) before Created events. Batch all WithdrawCreated proposals into a single Utility.batch extrinsic, reducing N×6s sequential processing to 1×6s. Falls back to individual submissions on batch failure.
Atomicity (bridge: Lack of Atomicity #1054): Add bbolt-backed idempotency store tracking PROCESSING/COMPLETED state per transaction to prevent double Stellar submissions on crash. Add Horizon memo lookup for crash recovery. Add startup reconciliation for PROCESSING transactions from previous run.
Bug fix: Fix pre-existing errors.Wrap(err, ...) that used outer-scope err instead of data.Err in both tfchainSub and stellarSub error handling paths.

Changes

File	Description
`bridge/tfchain_bridge/pkg/bridge/bridge.go`	Reorder event loop (Ready→Expired→Created), add idempotency store init, add `reconcilePendingTransactions`, fix `data.Err` bug
`bridge/tfchain_bridge/pkg/bridge/withdraw.go`	Add `handleWithdrawCreatedBatch` for batched proposals, add idempotency to `handleWithdrawReady`
`bridge/tfchain_bridge/pkg/bridge/refund.go`	Add idempotency to `handleRefundReady` with crash recovery
`bridge/tfchain_bridge/pkg/stellar/stellar.go`	Add `FindPaymentByMemo` and `FindRefundByReturnHash` for Horizon crash recovery lookup
`bridge/tfchain_bridge/pkg/substrate/client.go`	Add `BurnProposal` struct and `BatchProposeWithdrawOrAddSig` method
`bridge/tfchain_bridge/pkg/idempotency.go`	New bbolt-backed idempotency store (PROCESSING/COMPLETED states)
`clients/tfchain-client-go/batch.go`	New `BatchCalls` helper using `Utility.batch` with proper ItemFailed/BatchInterrupted handling

Design decisions

Uses Utility.batch (NOT batch_all) so individual proposal failures don't abort the entire batch
Idempotency store is bbolt (embedded, no external dependencies) stored alongside the existing persistency file
Horizon lookup searches last 200 transactions — sufficient for crash recovery window
On Stellar submit failure, transactions stay in PROCESSING state for reconciliation on next attempt
Single-event batches fall back to the original non-batched path to avoid unnecessary overhead

Test plan

Verify go build ./... and go vet ./... pass (verified locally)
Run single withdraw flow end-to-end
Run batch of 5 concurrent withdraws and verify single batch submission
Kill bridge between Stellar submit and TFChain confirmation, restart, verify reconciliation
Verify refund path with idempotency

Closes #1053, closes #1054

🤖 Generated with Claude Code

…crash safety Addresses two bridge reliability issues: **Batching (#1053):** - Reorder event loop to process Ready events (time-sensitive, ~2 min signature expiry) before Created events - Batch WithdrawCreated proposals into a single Utility.batch extrinsic, reducing N×6s sequential processing to 1×6s - Add BatchCalls helper to tfchain-client-go using Utility.batch (not batch_all) so individual failures don't abort the entire batch - Falls back to individual submissions if batch fails **Atomicity (#1054):** - Add bbolt-backed IdempotencyStore tracking PROCESSING/COMPLETED state per transaction, preventing double Stellar submissions on crash - Add Horizon memo lookup (FindPaymentByMemo, FindRefundByReturnHash) for crash recovery — checks if Stellar tx already exists before retry - Add startup reconcilePendingTransactions that resolves any transactions left in PROCESSING state from a previous run - Fix pre-existing bug: errors.Wrap used outer-scope err instead of data.Err in both tfchainSub and stellarSub error handling Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

During E2E testing of #1054 crash recovery, discovered that CreatePaymentWithSignaturesAndSubmit submitted Stellar txs without a memo field. FindPaymentByMemo (used in reconcilePendingTransactions) searches for memo_type=text but found nothing, making the 'already submitted' detection path always fail. Root cause: memo was never set on withdraw payments (only refunds used MemoReturn). Additionally, the memo must be set at SIGNING time (CreatePaymentAndReturnSignature) as well as submission time — because the Stellar tx hash is computed over all fields including memo. Setting it only at submission causes signature verification failures since validators signed a different hash. Fix: - Add txnBuild.Memo = txnbuild.MemoText(fmt.Sprint(txID)) to CreatePaymentAndReturnSignature so the signed hash includes the memo - Add txnBuild.Memo = txnbuild.MemoText(txHash) to CreatePaymentWithSignaturesAndSubmit for consistency After fix: all Stellar withdraw txs have memo_type=text with the burn tx ID, enabling reliable crash recovery via Horizon memo lookup. Also adds: - bridge/docs/local_development_setup.md: full setup + E2E test steps - bridge/docs/setup_issues_and_workarounds.md: updated with issues #11 (Stellar testnet liquidity) and #12 (memo bug) Tested: - TEST 1: Stellar→TFChain deposit ✅ - TEST 2: TFChain→Stellar withdraw ✅ - TEST 3: Crash recovery (PROCESSING state + Horizon reconciliation) ✅ - TEST 4: Batch proposals — 5 WithdrawCreated → 1 Utility.batch ✅ Closes #1053, #1054

TFTTest issuer reverted to official testnet address GA47YZA3... (custom GDPARZIN... issuer was for local testing only, must not ship) Existing doc fixes: - bridging.md: correct deposit/withdraw fee from 1 TFT → 10 TFT - multinode.md: replace non-existent 'Sudo → addBridgeValidator' with correct council.propose path (tfchain has no sudo pallet); fix signer3 using wrong tfchain seed (was signer2 seed — copy-paste bug) - single_node.md + multinode.md: add warning that stellar-utils faucet may fail on testnet (empty DEX order book); point to workaround doc - observability.md: document new events added by #1053/#1054 fix: withdraw_crash_recovery, batch_proposal_started, batch_proposal_completed

bbolt v1.4.x and v1.3.10+ require go 1.22/1.23, which breaks CI (golangci-lint runs on Go 1.20). Pin to v1.3.9 (requires go 1.17) and restore go.mod directive to go 1.21 (base branch value). All APIs used (Open/View/Update/Bucket/ForEach) exist in v1.3.9.

staticcheck-action@v1.3.0 internally runs setup-go-faster and installs Go 1.19.x, conflicting with the Go 1.20 already set up by setup-go@v5. Fix: - Bump to dominikh/staticcheck-action@v1.4.0 - Add install-go: false to reuse the Go toolchain from the prior step

Resolves 14 Codacy 'Lists should be surrounded by blank lines' findings in local_development_setup.md and setup_issues_and_workarounds.md

This is internal dev/operational notes, not repo documentation. Keeping it out of the public codebase.

…okup FindPaymentByMemo and FindRefundByReturnHash previously scanned the 200 most recent transactions on the bridge account (incoming + outgoing mixed). In a busy bridge, deposits dominate, leaving few slots for withdrawals. Filter to outgoing only (tx.Account == bridgeAccount) client-side after fetching. This means the 200 limit now covers 200 actual withdrawals/refunds rather than 200 mixed-direction transactions.

…tx lookup Previously FindPaymentByMemo and FindRefundByReturnHash used ForAccount which returns all transactions that touched the bridge account (both incoming deposits and outgoing withdrawals/refunds). Client-side filtering by source_account was then applied, but the 200-record limit was still consumed by mixed traffic. Fix: use Horizon's /transactions?source_account=<addr> endpoint which filters server-side to only transactions where the bridge is the source. This guarantees the limit covers 200 actual outgoing transactions, making crash recovery reliable even during high-volume inbound outages. Both functions now share fetchOutgoingTransactions() helper.

staticcheck@latest resolves to v0.7.0 which has 'go 1.25.0' in its go.mod. Go 1.20 cannot parse the patch-suffixed version format (introduced in Go 1.21), causing 'invalid go version' error. Fix: - Bump setup-go from 1.20 → 1.21 (matches go.mod directive, and can parse patch-suffixed go versions in dependencies) - Pin staticcheck to 2024.1.1 (last release requiring Go 1.21, avoids future breakage from newer staticcheck pulling newer Go)

2024.1.1 depends on golang.org/x/tools@v0.21.1 which has a compile error on Go 1.21 (invalid array length in tokeninternal). Use 2023.1.6 which is stable with Go 1.21 and covers all relevant checks.

…nce number - stellar.go: fix gofmt import order (encoding/json was out of order) - stellar.go: use client.HTTP.Do() instead of http.DefaultClient (inherits configured timeouts from Horizon client) - stellar.go: check HTTP status before JSON decode to prevent false 'no tx found' when Horizon returns 4xx/5xx - stellar.go: add io.LimitReader(4MB) on response body - stellar.go: add FindPaymentBySequence() for pre-upgrade tx detection - idempotency.go: guard against COMPLETED->PROCESSING state downgrade in setState - withdraw.go: use new(big.Int).SetUint64() to avoid int64 overflow in batch path - withdraw.go/refund.go: fallback to FindPaymentBySequence in PROCESSING branch to detect Stellar txs submitted by pre-upgrade bridge (no memo). Sequence number stored in TFChain burn tx uniquely identifies the Stellar tx. - bridge.go: same fallback in reconcilePendingTransactions for startup path

…ch skip on sig fail, accurate batch logging Fixes from code review: - stellar.go: refactor lookups into page-based helpers (FindPaymentByMemoInPage, FindRefundByReturnHashInPage, FindPaymentBySequenceInPage) backed by a single FetchOutgoingTransactionsPage call — eliminates redundant Horizon HTTP calls when multiple lookups are needed for the same PROCESSING tx - bridge.go: reconcilePendingTransactions now fetches Horizon once for all pending withdraws and refunds (N+M lookups → 1 HTTP call); skips fetch entirely if no pending transactions - withdraw.go: batch Phase 1 skips individual events on signature creation failure instead of aborting the whole batch; other events in the same block still proposed - withdraw.go: batch Phase 3 only logs withdraw_proposed for proposals that did not fail; uses FailedIndexes for precise tracking on BatchInterrupted runtimes, flags batch_had_failures on ItemFailed runtimes where index is not available - idempotency.go: add comment on store growth (negligible at bridge tx volumes) All changes build clean, vet clean, gofmt clean. Full E2E test suite passed: TEST 1: normal withdraw (TFChain→Stellar) - PASS TEST 2: crash recovery via memo lookup - PASS (no double-spend) TEST 3: batch 5 withdraws in 1 block → single Utility.batch (5/5 success) - PASS TEST 4: reconcile with single Horizon fetch - PASS (3 PROCESSING resolved in 1 call)

Wrap long text paragraphs to stay under 200 chars. Wrap long shell commands using backslash continuation. Add markdownlint-disable MD013 around code blocks where wrapping would break literal code (JS, Go, shell flag values containing seed phrases).

- batch.go: switch Utility.batch → Utility.force_batch so individual proposal failures (BurnSignatureExists, EnoughBurnSignaturesPresent) do not abort the remaining proposals in the batch. BatchInterrupted handling removed (not emitted by force_batch). ItemFailed handling is now live code. - withdraw.go: remove fallback to sequential RetryProposeWithdrawOrAddSig on batch RPC failure. With force_batch individual failures are handled internally; a wholesale RPC failure means individual calls would also fail. BurnTransactionExpired expiry recovers any missed proposals. - stellar.go: fix FindRefundByReturnHashInPage base64 vs hex mismatch. Horizon encodes MemoReturn as base64; txHash from TFChain is hex. Decode hex → raw bytes → base64 before comparison. Refund crash recovery was silently broken for all refund transactions. - stellar.go: add SyncSequenceNumber() method. Called at the start of every Created event handler (batch, single, expiry — both withdraw and refund). Event reordering (Ready-first) sets w.sequenceNumber to a historical value; without a sync the next Created signature uses a stale base and produces tx_bad_seq on Stellar submission. - withdraw.go/refund.go: make FetchOutgoingTransactionsPage error in the PROCESSING crash-recovery path non-fatal. Return nil + warn log instead of crashing the bridge. Consistent with reconciler behavior. - withdraw.go: add event_action='batch_proposal_started' structured field to the batch start log so monitoring rules can match it. - withdraw.go: move withdraw_completed and transfer_completed log lines to after RetrySetWithdrawExecuted + MarkWithdrawCompleted so ops logs accurately reflect when the full lifecycle is confirmed. - observability.md: document previously undocumented event_actions: withdraw_recovered, refund_recovered, refund_crash_recovery, withdraw_proposal_failed. Update batch_proposal_started description to reference force_batch.

TFT uses 7 decimal places (tokenDecimals: 7 in chain_spec). The genesis deposit_fee and withdraw_fee are both 10,000,000 base units = 1 TFT each (not 10 TFT as previously documented). This is confirmed empirically: swap_to_stellar(30_000_000) = 3 TFT burned, 1 TFT fee, +2 TFT on Stellar. - TEST 1 deposit: 50 TFT sent → 49 TFT minted (1 TFT fee, not 10 TFT) - TEST 2 withdraw: 3 TFT burned → 2 TFT on Stellar (1 TFT fee, not 10 TFT) - Add fee note box explaining the 7-decimal-place precision - Fix muTFT comment: 30,000,000 muTFT = 3 TFT (not 30 TFT)

Deposit (Stellar→TFChain): fee enforced at two points — bridge pre-screens incoming amount against DepositFee before proposing (to avoid on-chain failures), and the pallet deducts DepositFee when executing the mint. Both use the same TFTBridgeModule.DepositFee storage value. Withdraw (TFChain→Stellar): fee enforced at one point only — the pallet deducts WithdrawFee inside swap_to_stellar and stores burn_amount in the event. The bridge relays burn_amount as-is; it does not read WithdrawFee separately and applies no additional fee of its own. Also fix precision note: 1 TFT = 10,000,000 base units (tokenDecimals: 7).

Extend batching to cover all TFChain proposal extrinsics in a single Utility.force_batch per block: - BurnTransactionCreated → propose_burn_transaction_or_add_sig - BurnTransactionExpired → same (re-sign; pre-runtime-147 path dropped) - RefundTransactionCreated → create_refund_transaction_or_add_sig (previously ignored entirely — other validators now add signatures immediately instead of waiting for RefundTransactionExpired) - RefundTransactionExpired → same (offline-validator recovery) Key design decisions: - Ready events (BurnTransactionReady, RefundTransactionReady) remain sequential — they are Stellar submissions, not TFChain extrinsics, and cannot be force_batched - Deposit-triggered refunds (mint.go → refund()) remain direct calls so all validators sync their Stellar sequence at the same time, ensuring compatible multi-sig signatures - SyncSequenceNumber() called once per batch; all proposals receive consecutive Stellar sequence numbers from that base - BurnTransactionExpired events are converted to WithdrawCreatedEvent and fed into the same batch as new burns Benefits: - Bridge outage backlog (N expired burns + M expired refunds) drains in one block instead of N+M sequential blocks - Multi-validator refunds no longer require a full expiry cycle for validators that missed the triggering Stellar deposit - Single force_batch call covers all proposal types — cleaner invariant Adds RefundProposal struct and BatchProposeAll() to substrate client. Replaces handleWithdrawCreatedBatch + handleWithdrawExpired + handleWithdrawCreated with unified handleProposalsBatch. Updates observability.md for new/changed event_action values.

The local_development_setup.md previously referenced a non-existent file at /tmp/stellar_setup.mjs. Replace the dead comment with the actual inline JavaScript setup script (with placeholder values for secrets). Also note the path_payment_strict_send requirement for funding the bridge: a regular payment triggers the bridge refund logic on startup rescan.

… functions - RetrySetWithdrawExecuted: fix wrong log message ('refund' → 'burn'); add immediate IsBurnedAlready check before sleep so losers in the multi-validator submission race exit quickly with a warn instead of retrying and logging misleading errors - RetryCreateRefundTransactionOrAddSig: add immediate IsRefundedAlready check so EnoughRefundSignaturesPresent (expected when threshold already met by peer validators) exits as warn rather than cycling through 10s sleeps logging spurious errors - RetrySetRefundTransactionExecutedTx: same early-exit pattern for RefundTransactionAlreadyExecuted in multi-validator loser path All three functions now distinguish expected multi-validator race outcomes (warn) from genuine failures (error).

The bridge wallet may hold TFT issued by an account that differs from the network-configured default (e.g. a custom issuer used in dev/test environments where the official testnet TFT issuer has no liquidity). At wallet init, compare the configured issuer against the bridge account's actual TFT balance. If they differ, use the actual issuer for all payment operations and log a warning. Production bridge wallets always hold TFT from the official mainnet issuer, so production behavior is unchanged. This replaces an implicit assumption (bridge always holds official-issuer TFT) with an explicit, verified value discovered at startup.

- Remove dead RefundTransactionCreated batch handler from handleProposalsBatch. All online validators detect bad deposits via their own Stellar cursor and call proposeRefundDirect before this event fires. Added comment explaining why and what would need to change if catch-up mode is ever needed. - Add Makefile targets for full local bridge dev environment (bridge-dev, bridge-build, bridge-accounts, bridge-tfchain-start, bridge-setup, bridge-start, bridge-test, bridge-clean, bridge-stop). TFChain binary is built once and cached (make dependency on binary file). bridge-dev is a true zero-to-green target. - Add scripts/bridge_accounts.js: generates Stellar keypairs, funds via Friendbot, creates TFT trustlines, issues TFT to bridge (path_payment_strict_send to avoid triggering deposit monitor) and user. Writes /tmp/bridge_local_env.sh. - Add scripts/bridge_setup.js: configures TFChain bridge pallet via sudo — registers validator, sets bridge wallet address, fee account, deposit/withdraw fees. - Add scripts/bridge_tests.js: 4-scenario E2E suite (normal withdraw, batch 5, bad deposit refund, crash recovery). Exits non-zero on any failure. - Add scripts/wait_for_node.js: polls WS endpoint until TFChain is ready, used by bridge-tfchain-start to block until node accepts connections. - Add @stellar/stellar-sdk to scripts/package.json dependencies.

- Add scripts/bridge_mv_accounts.js: generates 3 validator Stellar keypairs, funds via Friendbot, creates TFT trustlines, issues TFT to bridge via path_payment_strict_send, configures bridge as 2-of-3 multi-sig (low=1, med=2, high=3 thresholds). Writes /tmp/bridge_mv_env.sh. - Add scripts/bridge_mv_setup.js: creates twins for all 3 validators (Alice, Bob, Charlie), registers all 3 on TFChain bridge pallet via sudo, sets bridge wallet, fee account, deposit and withdraw fees. - Add scripts/bridge_mv_tests.js: 5-scenario MV test suite: MV1 normal withdraw (3 validators, threshold=2), MV2 deposit/mint (all 3 propose, threshold met), MV3 bad deposit (full refund, 3 validators), MV4 validator offline (Val3 killed, Val1+Val2 meet threshold=2), MV5 batch withdraws (3 simultaneous, handles expiry recovery). Exits non-zero on any failure. Val3 restarted after MV4 for subsequent tests. - Add Makefile targets: bridge-mv-accounts, bridge-mv-setup, bridge-mv-start (3 daemons), bridge-mv-stop, bridge-mv-test, bridge-mv-clean, bridge-mv-dev (full one-shot MV environment). TFChain binary reused from single-validator build.

Replaces the manual 6-step setup guide with documentation for the new make bridge-dev and make bridge-mv-dev targets. Covers: - Quick start (single and multi-validator) - What each make target does - Account sharing via env file - Single-validator and MV test descriptions - Multi-validator multi-sig architecture - Fee mechanics (deposit and withdraw, two-layer enforcement) - Why bridge funding uses path_payment_strict_send - Idempotency and crash recovery model - Known limitations (sequence coordination, Friendbot rate limits) Removes references to setup_issues_and_workarounds.md (not in repo) and stale hardcoded test account addresses.

Discovered by running make bridge-dev end-to-end: - bridge_setup.js: rewrote as verification-only; dev chain genesis already pre-configures Bob+Charlie as validators, Ferdie as fee account, 1 TFT fees. No sudo pallet on TFChain, no setBridgeAddress call exists on the pallet. Setup script now just prints and verifies genesis config. - bridge_mv_setup.js: same — verification only, no pallet writes. - Makefile bridge-start: fix TFChain seed from //Alice to //Bob. Alice is not a genesis validator; Bob and Charlie are. - bridge_mv_accounts.js: reduce from 3 to 2 validators (val1=Bob, val2=Charlie). The 3rd genesis validator has an unknown seed so cannot run a daemon for it. Multi-sig configured as 2-of-2 (both Bob and Charlie must sign Stellar txs). - bridge_mv_tests.js: update for 2 validators (Bob + Charlie). Redesign MV4 from 'one validator permanently offline' to 'validator restarts mid-flow': kill Val2 after Val1 proposes, restart Val2, verify late-join completes the refund. Removes reference to non-existent Val3 daemon. - bridge_mv_start: fix for 2 daemons with explicit //Bob and //Charlie seeds instead of a loop over 3 with an Alice seed.

- bridge-start: use actual genesis validator dev key 1 seed instead of //Bob (which is not in the genesis validator set) - bridge-mv-start: use dev key 1 and dev key 2 seeds from chain_spec.rs - All timeout calls replaced with portable sh loop (macOS lacks GNU timeout) - bridge_mv_setup.js: rewrite to add validator 2 via council governance (Alice+Bob are genesis council members; propose→vote→close pattern) - bridge_mv_tests.js: use actual dev key seeds in startValidator() Seeds sourced from substrate-node/node/src/chain_spec.rs: Val1: 'quarter between satisfy three sphere six soda boss cute decade old trend' Val2: 'employ split promote annual couple elder remain cricket company fitness senior fiscal'

Test 4 (crash recovery) restarts the bridge with //Alice seed by default, but Alice is not a genesis validator. Fix to use the correct genesis validator dev key 1 seed (same seed used by bridge-start). Also pass VAL1_TFCHAIN_SEED explicitly from bridge-test Make target so the correct seed is always used regardless of env state.

waitForBridgeReady was reading the last 10000 bytes of the log file, but by the time test4 runs, the log is large enough that the restarted bridge's 'bridge_started' entry is no longer within that window. Fix: record the log file byte offset before killing the bridge, then look for 'bridge_started' only in content written AFTER that offset. Also increase the wait timeout from 30s to 60s for slower machines.

…direct On macOS, passing a numeric file descriptor to a detached child process's stdio is unreliable: after child.unref(), the fd silently becomes invalid and the bridge writes nothing to the log. This caused waitForBridgeReady to always timeout in test4 crash recovery. Fix: use 'exec' inside a shell command with append redirect (>>). The shell sets up the redirect before exec, then exec replaces sh with the bridge binary (preserving the same PID). This is reliable on both Linux and macOS.

…est4 On macOS, Node.js detached process stdout fd inheritance breaks after child.unref(), so the restarted bridge writes nothing to the log file. Polling for 'bridge_started' in the log always times out. Replace waitForBridgeReady() in test4 with a fixed 10s startup window, then directly verify the outcome: Stellar balance increased (either because bridge completed before kill, or recovered after restart via reconciliation or expiry path). Outcome timeout extended to 5 minutes to cover expiry recovery (~2 min) if needed.

…hreshold - bridge_mv_accounts.js: generate 3 validator Stellar keypairs, configure bridge as 2-of-3 multi-sig (val1 master + val2 + val3 signers, med=2) - bridge_mv_setup.js: add val2 AND val3 to TFChain via council governance (Alice proposes, Bob+Alice vote, Alice closes — two sequential proposals) - Makefile bridge-mv-start: start 3 validators; split each into its own shell line so $! capture is reliable and failures are independent; add log tail output when bridge_started not seen - Makefile bridge-mv-stop/clean: iterate 1 2 3 for all 3 validators - bridge_mv_tests.js: add Val3 seed/pid/log; fix startValidator to use /bin/sh exec redirect (macOS fd inheritance fix); MV4 now tests 2-of-3 (Val3 offline, Val1+Val2 complete refund), restarts Val3 before MV5

- bridge_mv_setup.js: create Alice twin after adding validators; needed for MV2 deposit test (single-validator setup does this, MV setup didn't) - bridge_mv_tests.js MV4: evaluate pass/fail before restart attempt so test result is not affected by restart timing; replace waitForValReady with fixed 8s startup window (log-based detection unreliable on macOS for restarted processes; same issue seen in single-validator test4)

- bridge_mv_setup.js: call userAcceptTc before createTwin (required by pallet; UserDidNotSignTermsAndConditions without it) - bridge_mv_tests.js: fix MV2 memo format from '1' to 'twin_1'; bridge parses memo as 'object_objectID' format per bridging.md spec All 5 MV tests now pass locally (MV1-MV5).

…p, daemon macros - SHELL := /bin/bash and .SHELLFLAGS := -eu -o pipefail -c - Add TFCHAIN_PID_FILE; bridge-tfchain-start/stop now use PID file - Add start_daemon and stop_daemon macros (no pkill — avoids terminating the make shell, which caused 'Terminated' errors on Linux) - Add start_daemon_with_env for bridge-start: sources env file in the same shell as nohup so BRIDGE_SECRET/BRIDGE_ADDRESS are inherited (separate @-line would be a different shell with vars not set) - bridge-mv-start: inline, each validator its own @-line with . env && nohup - bridge-mv-stop: PID-file loop, no pkill fallback - bridge-mv-clean: also stops TFChain and cleans its PID/log files - version-bump: initialize new_spec_version=""; use ${var:-} for unset variables to be compatible with -u (nounset) from .SHELLFLAGS - Replace (echo ... && exit 1) subshell antipattern with { echo ...; exit 1; }

- bridge-mv-clean: remove bridge-tfchain-stop dep (bridge-tfchain-start handles existing TFChain via stop_daemon internally; cleaner separation) - bridge-mv-clean: keep .idem.db cleanup (was missing in review version) - bridge-clean: remove echo noise, keep 4 separate rm lines for clarity - TFCHAIN_BIN rule: drop @ so make invocation is visible - bridge-test: retain BRIDGE_PID_FILE/LOG_FILE/BIN/SEED (required by test4 crash-recovery — kills and restarts bridge binary) - bridge-mv-test: retain BRIDGE_BIN/DIR (required by startValidator())

Use debug binary to skip the 30-min release build: TFCHAIN_BIN=substrate-node/target/debug/tfchain make bridge-dev TFCHAIN_BIN=substrate-node/target/debug/tfchain make bridge-mv-dev

…runs - stop_daemon now waits up to 10s for process to fully exit after SIGTERM, then sends SIGKILL if still alive — ensures port 9944 is released before proceeding - bridge-mv-stop loop applies the same wait logic for all 3 validators - bridge-tfchain-start and bridge-tfchain-stop both run 'lsof -ti tcp:9944 | xargs kill' as a fallback to kill any orphaned TFChain that is not tracked by the PID file (e.g. processes started outside make) - switch from --tmp to --base-path /tmp/tfchain-local so the data directory is explicitly deleted in bridge-clean and bridge-mv-clean, guaranteeing fresh genesis on every run Root cause: a TFChain process started with --dev --tmp was left running from a previous session (PID 144522). Every subsequent 'bridge-tfchain-start' started a new TFChain that failed to bind port 9944 and exited silently; wait_for_node connected to the zombie instead, causing all state (validators, twins, burns) to persist across test runs and burn IDs to increment indefinitely.

…bugs Scripts: - Extract shared JS helpers into bridge_helpers.js (DRY) - Add try/finally for api.disconnect() in all test/setup scripts - Fix waitBlocks subscription leak in bridge_mv_setup.js - Read validator seeds from env vars with hardcoded defaults Makefile: - Use --tmp instead of --base-path (no stale state between runs) - Centralise validator seeds, BRIDGE_NETWORK, TFCHAIN_PORT as variables - Reuse start_daemon_with_env / stop_daemon macros for MV targets - Fix start_daemon_with_env PID capture: use ";" not "&&" to avoid backgrounding a subshell whose PID becomes stale, leaving bridge processes running as orphans on bridge-mv-stop / bridge-mv-clean - Clean targets remove env files stellar.go: - Add --network local (testnet Horizon/passphrase, dynamic TFT issuer) - Fix getHorizonClient() ignoring custom StellarHorizonUrl when network is set (moved custom URL check after the switch so it overrides) - Fix submitTransaction trace_id context key (TraceIdKey{} not string) - Restrict dynamic TFT issuer override to --network local only Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add on-chain state verification to all existing tests and introduce new test scenarios for both single-validator and multi-validator suites. Changes: - bridge_helpers.js: add tfchainBalance() helper for TFChain balance queries - bridge_tests.js: harden T1-T4 with ExecutedBurn/ExecutedRefund/TFChain balance assertions; add T5 (deposit/mint), T6 (below-minimum reject), T7 (clean state — no orphaned active transactions) - bridge_mv_tests.js: harden MV1-MV5 with ExecutedBurn/ExecutedRefund/ TFChain balance assertions; add MV6 (crash recovery — kill Val2 mid-withdraw), MV7 (clean state) Previously all tests only verified Stellar balance deltas. Now every test also confirms on-chain bookkeeping (executed maps populated, active maps drained) and TFChain balance changes where applicable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…n state Three fixes based on actual test run results: 1. TFChain balance assertions: use >= instead of exact match. - Withdraws: substrate extrinsic fee (~0.003 TFT) causes balance to decrease by slightly more than the swap amount. - Deposits: Alice earns block author rewards on the dev chain, so her balance increases by slightly more than the minted amount. 2. assertBurnExecuted: poll with waitUntil (30s) instead of one-shot check. set_burn_transaction_executed may finalize a few blocks after the Stellar payment is visible to the test. 3. assertRefundExecuted: same polling fix. set_refund_transaction_executed finalizes after the Stellar refund payment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace open-ended >= checks with bounded delta assertions: - Withdraw: |tfDelta - swapAmount| <= 0.1 (extrinsic fee ~0.003 TFT) - Deposit: |balDelta - expectedMint| <= 0.1 (block author rewards ~0.006 TFT) This catches real bugs (e.g. double-charge) while tolerating the small substrate fees and dev-chain block rewards that make exact matching flaky. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…e test MV4 was doing a one-shot check for ExecutedRefundTransactions instead of using the polling assertRefundExecuted helper (30s timeout). The refund completes on Stellar before set_refund_transaction_executed finalizes on-chain, so the one-shot check saw stale state. Also moved Val3 restart into a finally block so cleanup always runs regardless of test outcome. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

T8/MV8 — Lost cursor (persistency wipe → no double-spend): Kill bridge, delete BoltDB/JSON persistency file(s), restart. With no local state, the PROCESSING guard is gone — only protection is IsBurnedAlready (queries ExecutedBurnTransactions on-chain). Verifies Stellar balance stays unchanged (no double-spend), then runs a fresh withdraw to prove the bridge is still functional. T9/MV9 — Expired batch recovery (50 swaps offline → expiry → restart): Kill bridge, submit 50 swapToStellar in one block, wait for on_finalize to expire all 50 (RetryInterval=20 blocks ≈ 120s). Restart bridge — it catches BurnTransactionExpired events and re-proposes all 50 in a single force_batch tx via handleProposalsBatch. Due to Stellar sequence number constraints, ~1 burn succeeds per expiry cycle (~120s). Progress logged as burns complete. 2-hour timeout accommodates the ~100min worst case. Both tests added to single-validator (bridge_tests.js) and multi-validator (bridge_mv_tests.js) suites. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The bridge already assigns consecutive Stellar sequence numbers in handleProposalsBatch (SyncSequenceNumber + per-proposal increment in generatePaymentOperation). All 50 burns get unique sequences (101..150), so handleWithdrawReady can submit all 50 to Stellar in one pass without sequence collisions. Previous analysis of "~1 burn per expiry cycle" was wrong — that would only apply if all burns shared the same sequence number. Reduced timeout from 2h to 10min. Updated comments to document the actual flow: single force_batch re-proposal → all become Ready → sequential Stellar submissions → individual SetWithdrawExecuted calls (the latter could be batched in a future optimization). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…force_batch After submitting Stellar payments for Ready events, the bridge now collects all txIDs/txHashes and confirms them on TFChain in a single Utility.force_batch extrinsic instead of N sequential extrinsics. This confirms all burns/refunds in one block (~6s) instead of N blocks (~13s each). Key changes: - handleWithdrawReady returns (uint64, error) — defers TFChain confirmation - handleRefundReady returns (string, error) — defers TFChain confirmation - bridge.go collects IDs, caps at 100 per type, batches TFChain calls - BatchSetWithdrawExecuted/BatchSetRefundTransactionExecuted with fallback to individual RetrySet* calls on batch failure Safety: set_*_executed is idempotent (first-writer-wins). Multi-validator races produce harmless ItemFailed events. Crash recovery unchanged — BoltDB PROCESSING + Stellar memo search on shared bridge account prevents double-spend. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

bridge_setup.js was verification-only and didn't create Alice's twin, causing test5_deposit to fail with "Alice has no twin on TFChain". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

BatchCalls reads ALL events from the block, not per-extrinsic. In multi-validator setups where multiple force_batch extrinsics land in the same block, Utility.ItemFailed events from other validators' batches inflate the FailedCount, producing negative SuccessCount. Cap FailedCount at len(calls) to prevent misleading negative counts. The on-chain behavior was already correct — this is a reporting fix. Also increase T9/MV9 timeout from 600s to 900s and T7/MV7 from 60s to 300s. With batched execution, 50 burns require ~3 expiry cycles (~22 burns per cycle × ~3 min per cycle) to fully complete. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… docs Test improvements: - Extract sendStellarPayment to shared bridge_helpers.js (DRY) - Add MV6a below-minimum withdraw test (parity with SV test6) - Increase MV5 batch size from 3 to 5 (match SV test2) - Fix markTestPhase early-return bug with finally blocks - Remove dead waitForValReady function - Add bridge_instrumentation.js for event collection and analysis Go bridge fixes (stellar.go): - Issue A: fix && to || in deposit asset filter (creditedEffect check) - Issue B: add per-payment asset validation with warning log for non-TFT - Issue C: fix || to && in StatBridgeAccount balance filter - Issue F: fix return nil,nil to continue in op loop (highest severity) - Issue H: reuse HTTP client in StellarWallet instead of creating per-call Pallet cleanup: - Collapse redundant ensure!/get into single .ok_or() read in set_stellar_burn_transaction_executed Docs: - Fix deposit fee "10 TFT" to "1 TFT" in bridging, multinode, single_node Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sameh-farouk requested a review from LeeSmet as a code owner March 6, 2026 23:58

sameh-farouk marked this pull request as draft March 7, 2026 03:51

sameh-farouk added 26 commits March 7, 2026 03:53

docs(bridge): fix markdown lint - add blank lines around all lists

9f1b8c9

Resolves 14 Codacy 'Lists should be surrounded by blank lines' findings in local_development_setup.md and setup_issues_and_workarounds.md

docs(bridge): remove setup_issues_and_workarounds.md from repo

dad601d

This is internal dev/operational notes, not repo documentation. Keeping it out of the public codebase.

ci(bridge): pin staticcheck to 2023.1.6 (compatible with Go 1.21)

7756b8f

2024.1.1 depends on golang.org/x/tools@v0.21.1 which has a compile error on Go 1.21 (invalid array length in tokeninternal). Use 2023.1.6 which is stable with Go 1.21 and covers all relevant checks.

sameh-farouk and others added 21 commits March 8, 2026 21:54

fix(bridge): fix twinIdByAccountID Option<u32> in MV2 test

1d48799

build(bridge): make TFCHAIN_BIN overridable for faster local testing

5cc5197

Use debug binary to skip the 30-min release build: TFCHAIN_BIN=substrate-node/target/debug/tfchain make bridge-dev TFCHAIN_BIN=substrate-node/target/debug/tfchain make bridge-mv-dev

fix(test): create Alice twin in single-validator bridge setup

e74071f

bridge_setup.js was verification-only and didn't create Alice's twin, causing test5_deposit to fail with "Alice has no twin on TFChain". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(bridge): batch withdraw proposals and add crash-safe idempotency#1079

fix(bridge): batch withdraw proposals and add crash-safe idempotency#1079
sameh-farouk wants to merge 49 commits intodevelopmentfrom
fix/bridge-batching-atomicity

sameh-farouk commented Mar 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sameh-farouk commented Mar 6, 2026

Summary

Changes

Design decisions

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant