Description
What’s happening
The provider attempts many MsgWithdrawLease transactions (lease withdrawal). We see repeated failures with err="context deadline exceeded" and then the provider crashes with a SIGSEGV nil-pointer panic inside the AKT tx broadcaster client:
pkg.akt.dev/go/node/client/v1beta3.(*serialBroadcaster).syncSequence
The nil-pointer dereference occurs at tx.go:541, called from the broadcaster at tx.go:572.
Example log evidence around the incident:
DBG sending withdraw cmp=balance-checker lease=<...>
many ERR failed to do lease withdrawal cmp=balance-checker err="context deadline exceeded"
The provider panics immediately afterwards (last observed withdraw at 6:55:48), with a stack trace pointing to syncSequence in pkg.akt.dev/go@v0.1.9.
Why this seems to be happening
The withdraw timeouts (context deadline exceeded) indicate the tx broadcast flow is timing out under load/queueing/backpressure. The crash site suggests the broadcaster can encounter a typed-nil *sdk.TxResponse stored in an interface{} and then dereference it (e.g., accessing txResp.Code) without a nil guard.
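To illustrate the suspected failure mode, here is a minimal, self-contained Go sketch. It is not the chain-sdk code: `txResponse` and `codeOf` are hypothetical stand-ins for `sdk.TxResponse` and for whatever `syncSequence` does with the value. It only shows that a nil pointer stored in an `interface{}` passes an `== nil` check on the interface, survives a type assertion, and then panics on field access.

```go
package main

import "fmt"

// txResponse is a hypothetical stand-in for sdk.TxResponse.
type txResponse struct {
	Code uint32
}

// codeOf mimics the suspected pattern: assert the concrete pointer type
// out of an interface value, then read a field without a nil guard.
// The deferred recover turns the panic into an error for demonstration only.
func codeOf(v interface{}) (code uint32, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	resp, ok := v.(*txResponse)
	if !ok {
		return 0, fmt.Errorf("unexpected type %T", v)
	}
	return resp.Code, nil // panics when resp is a nil pointer
}

func main() {
	var resp *txResponse     // nil pointer
	var v interface{} = resp // typed nil: the interface itself is non-nil

	fmt.Println(v == nil) // false
	_, err := codeOf(v)
	fmt.Println(err != nil) // true: the dereference panicked and was recovered
}
```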
Proposed fix #1 (upstream / chain-sdk): fix the nil pointer panic
In pkg.akt.dev/go/node/client/v1beta3/tx.go, update serialBroadcaster.syncSequence(...) to guard against a typed-nil txResp before dereferencing fields (e.g., txResp.Code).
Concretely: only evaluate the txResp.Code == ... branch if txResp != nil.
This ensures withdraw/broadcast timeouts never crash the broadcaster (and therefore never crash the provider).
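A sketch of what such a guard could look like, under the same assumptions as the report. `txResponse` and `codeMatches` are hypothetical names, and the code value `7` is arbitrary; the real comparison in `syncSequence` is not reproduced here.

```go
package main

import "fmt"

// txResponse is a hypothetical stand-in for sdk.TxResponse.
type txResponse struct {
	Code uint32
}

// codeMatches reports whether v holds a non-nil *txResponse whose Code
// equals want. Both a nil interface and a typed-nil pointer yield false
// instead of panicking.
func codeMatches(v interface{}, want uint32) bool {
	resp, ok := v.(*txResponse)
	if !ok || resp == nil {
		return false
	}
	return resp.Code == want
}

func main() {
	var typedNil *txResponse

	fmt.Println(codeMatches(typedNil, 7))              // false
	fmt.Println(codeMatches(&txResponse{Code: 7}, 7)) // true
	fmt.Println(codeMatches(nil, 7))                   // false
}
```

The important detail is that the nil check is made on the asserted pointer, not on the interface value, since the interface is non-nil in the typed-nil case.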
Proposed fix #2 (provider mitigation): limit concurrent withdraw attempts
Even with the upstream crash fix, the system can still experience a “withdraw timeout storm” when many leases become eligible at once. To reduce queueing/timeouts and improve the chance withdrawals actually complete, we plan to cap in-flight withdraw attempts using a semaphore.
Provider-side mitigation implemented/planned:
add a semaphore in provider/balance_checker.go to cap concurrent withdraw goroutines
current plan: maxConcurrentWithdraws = 5
This should:
reduce the number of simultaneous BroadcastMsgs calls timing out
allow more sequential progress through the tx broadcaster
Expected behavior
Lease withdraw failures due to timeouts should return an error and be retried later.
The provider must never crash due to tx broadcaster internal nil dereferences.
Current status / operator workaround
Restarting the provider recovers it, but the crash can recur.
Provider-side semaphore reduces likelihood of the timeout storm; upstream nil-guard is required for correctness.