Commit 7c40cff

feat: adds documentation for nonce ordering feature (#4641)

Signed-off-by: Simeon Nakov <[email protected]>
Signed-off-by: Konstantina Blazhukova <[email protected]>

1 file changed: docs/nonce-ordering-with-locks.md (314 additions, 0 deletions)

> [!NOTE]
> This is an experimental feature hidden behind the `ENABLE_NONCE_ORDERING` flag.

## Nonce ordering with locks

This document explains how per-sender address locking ensures transaction ordering and prevents nonce-related failures when multiple transactions from the same sender arrive in rapid succession.

It covers the background and motivation, configuration, locking strategies, request flows, failure handling, and how this impacts `eth_sendRawTransaction`.

---

### Background and motivation

The Hedera JSON-RPC Relay processes `eth_sendRawTransaction` requests asynchronously. When multiple transactions from the same sender arrive within milliseconds of each other, asynchronous operations can cause them to reach consensus nodes out of order:

```
User submits:       Tx(nonce=0) → Tx(nonce=1) → Tx(nonce=2)
                        ↓             ↓             ↓
Async processing:   [validate]    [validate]    [validate]
                        ↓             ↓             ↓
Reaches consensus:  Tx(nonce=1) ← Tx(nonce=0) ← Tx(nonce=2)   ❌ Wrong order!
```

**Result:** "Wrong nonce" errors because transactions reach consensus nodes out of order.

The root cause is that async calls to mirror nodes have variable latency, precheck operations complete at different speeds, and multiple relay instances can process transactions from the same sender simultaneously without any synchronization mechanism.

To address this, the relay implements a per-sender locking mechanism that serializes transaction processing per address while allowing concurrent processing for different senders.

---

### High-level behavior

- When enabled via `ENABLE_NONCE_ORDERING`, the relay acquires a per-address lock **before any async operations or side effects**, ensuring FIFO ordering.
- Lock acquisition happens before prechecks, validation, and transaction pool updates to prevent race conditions.
- Locks are automatically released immediately after consensus submission (on success or failure), with a maximum hold time (default: 30 seconds) to prevent deadlocks.
- If lock acquisition fails (e.g., Redis connectivity issues), the relay fails open and processes the transaction without locking to maintain availability.
- Different senders can process transactions concurrently without blocking each other, as locks are isolated per address.

Limitations (by design):

- This is not an Ethereum-style mempool. Transactions are processed in arrival order, not buffered for later reordering.
- Hedera consensus nodes reject transactions with nonce gaps; users must resubmit later transactions after gaps are filled.

---

### Configuration

- `ENABLE_NONCE_ORDERING` (boolean; default: false)
  - Master feature flag that enables the nonce ordering mechanism.
  - When disabled, transactions are processed without any locking, maintaining current behavior.
  - When enabled, transactions acquire locks before any async operations or side effects.

- `REDIS_ENABLED` (boolean) and `REDIS_URL` (string)
  - If enabled and a valid URL is provided, the relay will use Redis for distributed locking across multiple relay instances.
  - If disabled or unavailable, an in-memory local locking strategy is used (single process only).

- `LOCK_MAX_HOLD_MS` (number; default: 30000)
  - Maximum time (in milliseconds) a lock can be held before automatic force release.
  - Prevents deadlocks when transaction processing hangs or crashes.

- `LOCK_QUEUE_POLL_INTERVAL_MS` (number; default: 50)
  - Polling interval (in milliseconds) for Redis queue checks when waiting for lock acquisition.
  - Only applicable to the Redis locking strategy.

- `LOCAL_LOCK_MAX_ENTRIES` (number; default: 1000)
  - Maximum number of addresses to track in the local lock cache.
  - Uses LRU eviction when the limit is reached.
  - Only applicable to the local locking strategy.

Strategy selection:

- If Redis is enabled and reachable, the relay uses the distributed Redis locking strategy.
- Otherwise, it falls back to the local in-memory strategy automatically.
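
For illustration, a deployment that turns the feature on with Redis-backed distributed locking might set the following (the `REDIS_URL` value is a placeholder; the remaining values are the documented defaults):

```
ENABLE_NONCE_ORDERING=true
REDIS_ENABLED=true
REDIS_URL=redis://localhost:6379
LOCK_MAX_HOLD_MS=30000
LOCK_QUEUE_POLL_INTERVAL_MS=50
LOCAL_LOCK_MAX_ENTRIES=1000
```
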

---

### Locking strategies

The lock service uses a strategy pattern to support both local and distributed locking.

#### Local in-memory strategy

- Uses the `async-mutex` library wrapped with session key tracking and automatic expiration.
- Stores lock state in an LRU cache with configurable maximum entries.
- Guarantees FIFO ordering within a single process.
- Locks are lost on process restart; state is not shared across relay instances.

Key properties:

- ✅ FIFO ordering guaranteed by `async-mutex`
- ✅ Per-address isolation
- ✅ Automatic cleanup via LRU cache
- ✅ Never fails (always returns a session key)
- ❌ Single process only (no distributed locking)
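
A minimal sketch of the idea behind the local strategy, assuming the `async-mutex` package for per-address mutual exclusion; the names below are illustrative and omit the LRU cache, expiration, and logging of the real implementation:

```typescript
import { Mutex } from 'async-mutex';
import { randomUUID } from 'crypto';

// Illustrative only: one mutex per (lowercased) sender address, plus the releaser
// held by the current owner. The real strategy keeps this state in an LRU cache
// bounded by LOCAL_LOCK_MAX_ENTRIES and adds automatic expiration.
const mutexes = new Map<string, Mutex>();
const holders = new Map<string, { sessionKey: string; release: () => void }>();

export async function acquireLock(address: string): Promise<string> {
  const key = address.toLowerCase();
  if (!mutexes.has(key)) mutexes.set(key, new Mutex());
  const release = await mutexes.get(key)!.acquire(); // waiters are served in FIFO order
  const sessionKey = randomUUID();
  holders.set(key, { sessionKey, release });
  return sessionKey; // never fails: a session key is always returned
}

export function releaseLock(address: string, sessionKey: string): void {
  const key = address.toLowerCase();
  const holder = holders.get(key);
  if (!holder || holder.sessionKey !== sessionKey) return; // invalid releases are ignored
  holders.delete(key);
  holder.release(); // wakes the next waiter for this address
}
```
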

#### Redis distributed strategy

- Uses Redis `SET NX` (set if not exists) with TTL for lock ownership.
- Uses a Redis `LIST` as a FIFO queue of waiters.
- Polling-based acquisition (checks queue position every 50ms by default).
- Automatic TTL-based expiration handles process crashes gracefully.

Key properties:

- ✅ Works across multiple relay instances
- ✅ FIFO ordering via Redis queue
- ✅ Automatic cleanup via TTL on process crashes
- ✅ Fail-open behavior on errors (returns null, transaction proceeds without lock)
- ⚠️ Requires Redis availability

Storage schema:

```
lock:{address}        → Current lock holder's session key (SET with TTL)
lock:queue:{address}  → FIFO queue of waiters (LIST)
```
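
As a rough sketch of acquisition over this schema (not the relay's actual code), assuming a node-redis v4 style client that is already connected; key names follow the schema above, while the helper names and error handling are simplified:

```typescript
import { createClient } from 'redis';
import { randomUUID } from 'crypto';

const client = createClient({ url: process.env.REDIS_URL }); // assumes connect() is awaited elsewhere

export async function acquireLock(address: string): Promise<string | null> {
  const sessionKey = randomUUID();
  const lockKey = `lock:${address}`;
  const queueKey = `lock:queue:${address}`;
  try {
    await client.rPush(queueKey, sessionKey); // join the FIFO queue of waiters
    for (;;) {
      const head = await client.lIndex(queueKey, 0);
      if (head === sessionKey) {
        // First in queue: try to take ownership; the TTL lets crashed holders expire.
        const ok = await client.set(lockKey, sessionKey, { NX: true, PX: 30_000 });
        if (ok === 'OK') {
          await client.lRem(queueKey, 1, sessionKey); // leave the queue once the lock is held
          return sessionKey;
        }
      }
      await new Promise((resolve) => setTimeout(resolve, 50)); // LOCK_QUEUE_POLL_INTERVAL_MS
    }
  } catch (err) {
    // Fail open: log and let the transaction proceed without a lock.
    console.error('Lock acquisition failed, proceeding without lock', err);
    return null;
  }
}
```
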

---

### Lock lifecycle

1. **Lock acquisition request**
   - Transaction arrives for processing.
   - Generate a unique session key (UUID) to identify this lock holder.

2. **Wait for lock**
   - Join the FIFO queue for this sender address.
   - Wait until first in queue (no timeout on waiting).
   - Acquire lock once available.

3. **Lock held**
   - Set ownership metadata (session key, acquisition time).
   - Start automatic force-release timer (default: 30 seconds).
   - Process transaction while holding the lock (validate, update transaction pool, submit to consensus).

4. **Lock release**
   - On successful submission or error, release lock.
   - Verify session key matches current holder (prevents hijacking).
   - Clear timer and wake next waiter in queue.

5. **Automatic force release**
   - If lock is held longer than `LOCK_MAX_HOLD_MS`, automatically release it.
   - Ensures queue progresses even if transaction processing hangs or crashes.
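
The force-release step can be pictured as a timer armed on acquisition, roughly as in this sketch (the function names are illustrative; the real service wires this into its locking strategies):

```typescript
// Illustrative sketch: arm a force-release timer whenever a lock is acquired.
const LOCK_MAX_HOLD_MS = Number(process.env.LOCK_MAX_HOLD_MS ?? 30_000);
const forceReleaseTimers = new Map<string, NodeJS.Timeout>();

export function armForceRelease(address: string, release: () => void): void {
  const timer = setTimeout(() => {
    release(); // holder exceeded the maximum hold time: let the queue progress
    forceReleaseTimers.delete(address);
  }, LOCK_MAX_HOLD_MS);
  forceReleaseTimers.set(address, timer);
}

export function cancelForceRelease(address: string): void {
  const timer = forceReleaseTimers.get(address);
  if (timer !== undefined) clearTimeout(timer); // normal release path: cancel the pending force release
  forceReleaseTimers.delete(address);
}
```
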

---

### Request flows

#### eth_sendRawTransaction

1. **Lock acquisition (before any async operations)**
   - If `ENABLE_NONCE_ORDERING` is enabled, acquire lock for sender address.
   - Normalize sender address (lowercase).
   - If acquisition fails (e.g., a Redis error), it returns null and the transaction proceeds without a lock (fail-open).
   - Lock is acquired BEFORE any validation, side effects, or async operations to prevent race conditions.

2. **Prechecks** (protected by lock)
   - Validate transaction size, type, gas, and signature.
   - Verify account exists and nonce is valid via Mirror Node.
   - Add transaction to pending pool (if `ENABLE_TX_POOL` is enabled).

3. **Transaction processing** (protected by lock)
   - Submit transaction to consensus node.
   - Lock is released immediately after submission completes.

4. **Post-submission** (lock already released)
   - Remove transaction from pending pool (if `ENABLE_TX_POOL` is enabled).
   - Poll Mirror Node for confirmation and retrieve transaction hash (depending on `USE_ASYNC_TX_PROCESSING`).

5. **Error handling**
   - If an error occurs during prechecks or validation, the lock is released before the error is thrown.
   - The lock is always released via a try-catch-finally pattern to ensure cleanup.

These rules ensure transactions from the same sender are processed in order while maintaining high availability through fail-open behavior.
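
A condensed sketch of how this flow might be wired in the synchronous case; `lockService`, `recoverSenderAddress`, `precheck`, and `submitToConsensus` are hypothetical stand-ins for the relay's actual components:

```typescript
// Hypothetical shapes standing in for the relay's real services.
declare const lockService: {
  acquireLock(address: string): Promise<string | null>;
  releaseLock(address: string, sessionKey: string): Promise<void>;
};
declare function recoverSenderAddress(rawTx: string): string;
declare function precheck(rawTx: string): Promise<void>;
declare function submitToConsensus(rawTx: string): Promise<string>;

export async function sendRawTransaction(rawTx: string): Promise<string> {
  const sender = recoverSenderAddress(rawTx).toLowerCase(); // normalized sender address
  let sessionKey: string | null = null;

  if (process.env.ENABLE_NONCE_ORDERING === 'true') {
    // Acquired before any validation or side effects; null means fail-open.
    sessionKey = await lockService.acquireLock(sender);
  }

  try {
    await precheck(rawTx);                 // size, type, gas, signature, nonce via Mirror Node
    return await submitToConsensus(rawTx); // protected by the lock
  } finally {
    if (sessionKey !== null) {
      await lockService.releaseLock(sender, sessionKey); // always released, on success or error
    }
  }
}
```
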

---

### Fail-open behavior

When the Redis locking strategy encounters an error (e.g., network failure, connection timeout), it **fails open**:

- `acquireLock()` returns null instead of a session key.
- The transaction proceeds without locking.
- An error is logged for monitoring and debugging.

**Rationale:**

- Availability is prioritized over strict ordering in degraded states.
- Temporary nonce ordering issues are preferable to blocking all transactions.
- Users can still submit transactions even if Redis is down.

The local in-memory strategy never needs to fail open because it has no external dependencies.

---

### Session keys and ownership verification

Each lock acquisition generates a unique session key (UUID) that:

- Proves ownership when releasing the lock.
- Prevents double-release bugs.
- Prevents lock hijacking by other sessions.

Only the session key holder can release a lock. Invalid release attempts are silently ignored.

Example:

```typescript
const sessionKey = await lockService.acquireLock(address); // "a1b2c3d4-5678-..."
// ... process transaction ...
await lockService.releaseLock(address, sessionKey); // Only succeeds if sessionKey matches
```

---

### Timeout strategy

| Timeout Type      | Duration           | Purpose                                          | Behavior                    |
| ----------------- | ------------------ | ------------------------------------------------ | --------------------------- |
| **Waiting Time**  | None               | Allow queue buildup without failing transactions | Waits indefinitely in queue |
| **Max Lock Time** | 30s (configurable) | Prevent deadlocks from hung transactions         | Force release after 30s     |

**Design decision:** No timeout on waiting in queue because the max lock time provides sufficient protection. If the current holder hangs, force release kicks in after 30 seconds and the queue progresses.

---

### Compatibility with async transaction processing

The lock service is fully compatible with `USE_ASYNC_TX_PROCESSING`:

- Lock is acquired before any prechecks or validation (synchronously in the main request path).
- When async mode is enabled, the transaction hash is returned immediately after prechecks pass.
- The lock persists across the async boundary during background processing.
- The lock is released after consensus submission completes in the background.
- Session key is passed to the async processor to ensure correct ownership.
- If an error occurs during prechecks (before async processing starts), the lock is released immediately.
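
One way to picture the handoff under `USE_ASYNC_TX_PROCESSING`, again with hypothetical stand-ins; the background worker receives the session key so it can release a lock it did not acquire:

```typescript
// Hypothetical shapes standing in for the relay's real services.
declare const lockService: {
  acquireLock(address: string): Promise<string | null>;
  releaseLock(address: string, sessionKey: string): Promise<void>;
};
declare function precheck(rawTx: string): Promise<void>;
declare function computeTransactionHash(rawTx: string): string;
declare function submitToConsensus(rawTx: string): Promise<void>;

export async function sendRawTransactionAsync(rawTx: string, sender: string): Promise<string> {
  const sessionKey = await lockService.acquireLock(sender); // before any prechecks

  try {
    await precheck(rawTx);
  } catch (err) {
    // Precheck failed before the async boundary: release immediately, then rethrow.
    if (sessionKey !== null) await lockService.releaseLock(sender, sessionKey);
    throw err;
  }

  // The hash is returned right away; the lock persists across the async boundary.
  void processInBackground(rawTx, sender, sessionKey);
  return computeTransactionHash(rawTx);
}

async function processInBackground(rawTx: string, sender: string, sessionKey: string | null) {
  try {
    await submitToConsensus(rawTx);
  } finally {
    if (sessionKey !== null) await lockService.releaseLock(sender, sessionKey); // released in the background
  }
}
```
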

---

### Monitoring and observability

The lock service logs the following events at appropriate levels:

- **Debug:** Lock acquisition/release with hold times and queue lengths
- **Trace:** Detailed lock lifecycle events (queue join, polling, acquisition)
- **Error:** Lock acquisition failures with fail-open behavior

Key metrics to monitor:

- Lock hold times (should be well under 30 seconds)
- Queue lengths (high values indicate congestion)
- Failed lock acquisitions (indicate Redis issues)
- Force releases (indicate hung transactions or timeouts)

---

### FAQ

#### Does this guarantee out-of-order nonce execution without resubmission?

No. Hedera consensus nodes do not maintain an execution buffer by nonce. This feature ensures transactions are submitted in order, but if a nonce gap exists when a transaction reaches the consensus node, it will be rejected and must be resubmitted.

#### Can transactions from different senders process in parallel?

Yes! Locks are per-sender address. Different senders have independent locks and process concurrently without blocking each other.

#### What happens if a transaction crashes while holding the lock?

The automatic force-release timer (default: 30 seconds) will release the lock. The next waiter in the queue will be awakened and can proceed.

#### What happens if Redis goes down?

The Redis locking strategy fails open: transactions proceed without locking. Once Redis is restored, the relay automatically resumes using distributed locks. No manual intervention is required.

#### Why no timeout on waiting in queue?

The max lock time (30 seconds) provides sufficient protection. If the current holder hangs, it will be force-released after 30 seconds and the queue progresses. Adding a wait timeout would cause later transactions to fail unnecessarily.

#### If 100 transactions are waiting in queue and the first one hangs, won't they all time out?

No. Each transaction gets its own fresh 30-second window **after acquiring the lock**. The timer starts only when you hold the lock, not when you join the queue:

```
t=0s:  Tx1 acquires lock    → 30s timer starts for Tx1
t=1s:  Tx2-100 join queue   → NO timers yet, just waiting
t=30s: Tx1's timer expires  → Force released
t=30s: Tx2 acquires lock    → NEW 30s timer starts for Tx2
t=35s: Tx2 completes and releases
t=35s: Tx3 acquires lock    → NEW 30s timer starts for Tx3
```

Each transaction in the queue gets a full 30 seconds to process once it acquires the lock.

#### Does this work with the transaction pool feature (`ENABLE_TX_POOL`)?

Yes! The lock service and transaction pool work together:

1. Lock is acquired for the sender address (before any operations)
2. Transaction prechecks are performed (protected by lock)
3. Transaction is added to the pending pool (protected by lock)
4. Transaction is submitted to the consensus node (protected by lock)
5. Lock is released immediately after submission
6. Transaction is removed from the pending pool after consensus (no longer needs the lock)

Both features are independent and can be enabled/disabled separately.

#### How do I enable this feature?

Set the environment variable `ENABLE_NONCE_ORDERING=true`. The feature is disabled by default to allow for gradual rollout and testing.

#### What if I don't use Redis? Do I still get ordering guarantees?

Yes, but only within a single relay instance. The local in-memory strategy ensures FIFO ordering for transactions processed by the same relay process. If you run multiple relay instances without Redis, each instance has its own locks and cannot coordinate with others.
