Nonce counter management bugs: RESET_STORAGE_ON_START doesn't clear counters, no periodic re-sync, INCR race condition #683

@snnbotchway

Describe the bug

The nonce counter management system (RedisTransactionCounter) has several bugs that cause permanent nonce desync, making all new transactions fail with wrong nonces until manual intervention.

Bug 1: RESET_STORAGE_ON_START does not clear transaction counter Redis keys

When RESET_STORAGE_ON_START=true is set, the relayer clears repository data (transaction store, policy store) but does not clear the transaction_counter Redis keys (pattern: {prefix}:transaction_counter:{relayer_id}:{address}).

If a relayer's nonce counter gets inflated (e.g. from stuck/retried transactions), the bad counter persists across restarts even with RESET_STORAGE_ON_START=true. Since sync_nonce() uses max(on_chain_nonce, redis_counter), the inflated Redis counter always wins and the relayer continues using wrong nonces.
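The max() behavior described above can be sketched as follows. This is an illustrative reconstruction of the selection logic, not the relayer's actual `sync_nonce()` code:

```rust
// Sketch of why an inflated Redis counter survives sync_nonce():
// the sync picks the maximum of the on-chain nonce and the stored
// counter, so a too-high counter is never corrected downward.
fn sync_nonce(on_chain_nonce: u64, redis_counter: Option<u64>) -> u64 {
    match redis_counter {
        Some(counter) => on_chain_nonce.max(counter),
        None => on_chain_nonce,
    }
}

fn main() {
    // Healthy case: counter matches the chain.
    assert_eq!(sync_nonce(42, Some(42)), 42);
    // Counter inflated by stuck/retried transactions: the bad value
    // wins, and RESET_STORAGE_ON_START never deletes the counter key.
    assert_eq!(sync_nonce(42, Some(57)), 57);
    // Only when the key is absent does the on-chain nonce take over.
    assert_eq!(sync_nonce(42, None), 42);
}
```

Because the counter only ever moves up under this logic, the sole recovery path is deleting the key, which Bug 3 below shows is itself racy.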

Bug 2: sync_nonce only runs at startup and on health check failure

sync_nonce() (in src/domain/relayer/evm/evm_relayer.rs) only executes:

  1. At relayer startup
  2. When a health check fails

There is no periodic nonce re-synchronization. If the nonce counter drifts out of sync during normal operation (e.g., a transaction gets mined via a different path, or a resubmission succeeds with an earlier nonce), the counter stays wrong indefinitely.

Bug 3: Race condition when manually deleting counter key

Even manually deleting the Redis transaction counter key doesn't reliably fix the issue. The RedisTransactionCounter uses INCR to atomically increment the counter for each new transaction. If a new transaction is submitted between the key deletion and the next sync_nonce() call, INCR on a non-existent key creates it at value 1 (essentially nonce 0), which is also wrong. The nonce needs to be re-synced before the next transaction is submitted.
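The race can be demonstrated with an in-memory stand-in for Redis INCR semantics (INCR on a missing key creates it at 0 and increments to 1). The key string below is illustrative:

```rust
use std::collections::HashMap;

// In-memory stand-in for Redis INCR: a missing key is created at 0
// and incremented, so the first INCR after DEL returns 1.
fn incr(store: &mut HashMap<String, i64>, key: &str) -> i64 {
    let value = store.entry(key.to_string()).or_insert(0);
    *value += 1;
    *value
}

fn main() {
    let mut store = HashMap::new();
    let key = "prefix:transaction_counter:relayer-1:0xabc"; // illustrative
    store.insert(key.to_string(), 57); // inflated counter

    store.remove(key); // operator runs DEL to "fix" the counter

    // A transaction slips in before sync_nonce() runs:
    assert_eq!(incr(&mut store, key), 1); // nonce 0 again, still wrong
}
```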

Relationship to cascading provider pausing

These nonce management bugs are the root cause that leads to the cascading provider pausing described in #681. When the nonce counter is out of sync, every send_raw_transaction returns "Transaction nonce too low" (-32603), which the retry logic misclassifies as a provider health issue.

Steps to reproduce

Bug 1 (RESET_STORAGE_ON_START):

  1. Run a relayer with Redis-based transaction counter
  2. Submit transactions — counter increments in Redis
  3. Some transactions get stuck/retried, inflating the counter above on-chain nonce
  4. Restart with RESET_STORAGE_ON_START=true
  5. Observe: transaction store is cleared, but transaction_counter key still has inflated value
  6. sync_nonce() picks max(on_chain, redis_counter) = inflated value
  7. New transactions use wrong nonce and fail

Bug 2 (No periodic sync):

  1. Run a relayer normally
  2. A transaction with nonce N gets resubmitted multiple times (different gas prices)
  3. An earlier hash (nonce N) gets mined
  4. The counter still has N + resubmission_count as next nonce
  5. All subsequent transactions fail with nonce too high/low
  6. No recovery without restart or health check failure
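The arithmetic behind this drift, with illustrative numbers (N = 10, three resubmissions):

```rust
// After the original submit plus gas-bumped resubmissions, the counter's
// next nonce is N + resubmission_count (per the scenario above), while
// the correct next nonce is N + 1 once any of the nonce-N hashes mines.
fn counter_drift(resubmission_count: u64) -> u64 {
    let n: u64 = 10; // nonce shared by the original tx and all resubmissions
    let counter_next = n + resubmission_count;
    let chain_next = n + 1;
    counter_next - chain_next
}

fn main() {
    // Three resubmissions leave the counter two nonces ahead of the
    // chain, and nothing corrects it short of a restart or a failed
    // health check.
    assert_eq!(counter_drift(3), 2);
}
```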

Bug 3 (INCR race condition):

  1. Identify a relayer with inflated nonce counter in Redis
  2. Delete the key: DEL {prefix}:transaction_counter:{relayer_id}:{address}
  3. Before sync_nonce() runs, a new transaction submission calls INCR on the (now missing) key
  4. Redis creates the key with value 1 (i.e., nonce 0), which is also wrong
  5. Nonce is still desynced

Suggested fix

  1. RESET_STORAGE_ON_START should also clear transaction counter keys — include transaction_counter:* pattern in the reset logic
  2. Add periodic nonce re-sync — run sync_nonce() on a configurable interval (e.g., every 60 seconds), not just at startup
  3. Trigger nonce re-sync on "nonce too low" errors — when send_raw_transaction returns a nonce error, immediately re-sync from chain before the next transaction attempt
  4. Use SET instead of INCR for nonce management — or add a check-and-sync mechanism that detects drift
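Fixes 2 through 4 could be combined into a re-sync path that overwrites the counter from the chain (SET semantics) instead of taking the maximum, so an inflated counter can heal downward. The `Chain` trait and all names below are hypothetical, not the relayer's actual API:

```rust
// Hedged sketch of the suggested re-sync, under assumed names.
trait Chain {
    fn transaction_count(&self) -> u64; // pending nonce from the node
}

struct NonceManager {
    counter: u64, // next nonce to hand out
}

impl NonceManager {
    // Overwrite from the chain rather than max(on_chain, counter),
    // so this can run periodically AND on "nonce too low" errors and
    // actually correct an inflated counter.
    fn resync(&mut self, chain: &impl Chain) {
        self.counter = chain.transaction_count();
    }

    fn next_nonce(&mut self) -> u64 {
        let nonce = self.counter;
        self.counter += 1;
        nonce
    }
}

struct FakeChain(u64);
impl Chain for FakeChain {
    fn transaction_count(&self) -> u64 {
        self.0
    }
}

fn main() {
    let chain = FakeChain(42); // chain says next nonce is 42
    let mut manager = NonceManager { counter: 57 }; // inflated
    manager.resync(&chain); // periodic or error-triggered re-sync
    assert_eq!(manager.next_nonce(), 42); // back in step with the chain
    assert_eq!(manager.next_nonce(), 43);
}
```

In the real system the overwrite would need to be atomic with respect to concurrent submitters (e.g., guarded by the same lock or Lua script that allocates nonces), otherwise it reintroduces the Bug 3 race.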

Version Information

openzeppelin-relayer 1.3.0

Network Type

EVM (Monad Mainnet, chain ID 143)

Deployment Type

Docker container (ECS)

Platform

Linux (x86)
