Fix App Crashes and Lightning Node Recovery When Starting Offline #363
… network related failures
Changed to draft to fix
Ran both test cases successfully, with a caveat on test case 2:
Phone online -> open app -> should display successful status -> phone offline -> should update failure status -> online again -> resync and update status again -> send bitcoin with success
I ended up with log spam after restoring the internet connection and then trying swipe-to-refresh (all coming from lightingRepo.sync()):
ERROR❌️: ServiceQueue.LDK error [AppError='TxSyncTimeout='Syncing transactions timed out.'']
The error also shows up as a toast after each pull-to-refresh.

Otherwise, send and receive work for both onchain and LN, with one problem for LN send that I encountered only once: estimateRoutingFeesForAmount takes a long time right after tapping Continue on the send amount screen, freezing the send flow for a while.
Remarks
- I suggest retrying in a loop instead of stopping after 5 attempts.
If I wait long enough during test case 1, nothing works anymore, even after I get back online.
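To make the remark above concrete, here is a minimal sketch of retrying without an attempt limit, using capped exponential backoff with jitter. The names `retryDelayMs` and `retryUntilSuccess`, the 1 s base, the 60 s cap, and the 20% jitter are all assumptions for illustration and are not taken from the PR; a real implementation would likely run in a coroutine and use `delay` instead of a blocking sleep.

```kotlin
import kotlin.random.Random

// Hypothetical helper (not from the PR): capped exponential backoff with +/-20% jitter.
fun retryDelayMs(
    attempt: Int,
    baseMs: Long = 1_000,
    maxMs: Long = 60_000,
    random: Random = Random.Default,
): Long {
    // Clamp the shift so the multiplication cannot overflow for large attempt counts.
    val exponential = baseMs * (1L shl attempt.coerceIn(0, 20))
    val capped = minOf(exponential, maxMs)
    // Jitter spreads retries out so they do not all fire at the same instant.
    val jitter = (capped * 0.2 * (random.nextDouble() * 2 - 1)).toLong()
    return (capped + jitter).coerceAtLeast(0)
}

// Keeps retrying until the operation succeeds; `sleep` is injected so the loop is testable.
// Note: runCatching catches every Throwable, which is acceptable for this blocking sketch only.
fun <T> retryUntilSuccess(
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    operation: () -> T,
): T {
    var attempt = 0
    var result = runCatching(operation)
    while (result.isFailure) {
        sleep(retryDelayMs(attempt))
        attempt++
        result = runCatching(operation)
    }
    return result.getOrThrow()
}
```

Because the delay is capped, the loop settles at roughly one attempt per minute while offline, so the node can still recover whenever connectivity returns.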
Set as draft to apply changes and retest
The send and receive flow checks while offline can be addressed in another branch to keep this PR as specific as possible
Tests updated
ovitrif
left a comment
Added a few remarks about the code, will test next…
    /**
     * Determines if an error is retryable based on its type and characteristics
     */
nit: private methods should not have doc comments
This also applies to lines 315-317 for calculateRetryDelayWithJitter
        _lightningState.update { it.copy(nodeLifecycleState = NodeLifecycleState.Initializing) }
    }

    @Suppress("TooGenericExceptionCaught")
We should have disabled the rule in the lint config in detekt.yml, but we can do that in another PR.
But… even better would have been to use runCatching instead.
Tbh this lint rule is a nice guard that pushes for preferring runCatching over try/catch.
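As a rough illustration of the runCatching suggestion, the sketch below wraps a peer-connection call in a `Result` without any catch block. `FakeLightningService` is a hypothetical stand-in; only the method name `connectToTrustedPeers` comes from the diff, and the logging via `println` is an assumption.

```kotlin
// Hypothetical stand-in for the real service; only the method name appears in the diff.
class FakeLightningService(private val online: Boolean) {
    fun connectToTrustedPeers() {
        if (!online) throw IllegalStateException("network unreachable")
    }
}

// runCatching wraps the call in a Result, so there is no generic catch block to suppress.
fun connectToTrustedPeers(service: FakeLightningService): Result<Unit> =
    runCatching { service.connectToTrustedPeers() }
        .onFailure { e -> println("Peer connect failed: ${e.message}") }
```

One caveat worth keeping in mind: runCatching catches every Throwable, including CancellationException, so inside suspend functions the cancellation case may need to be rethrown.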
    try {
        lightningService.connectToTrustedPeers()
        Result.success(Unit)
    } catch (e: NetworkException) {
It doesn't make much sense to check here whether it's a network exception, because connecting to a peer is itself a network op :)…
    /**
     * Enhanced network error detection for LDK-specific errors
     */
This is a bit too AI-like, but nvm, at least clean up the comments pls 🙏🏻
        lowerMessage.contains("unreachable") ||
        lowerMessage.contains("refused") ||
        // VSS-specific network errors
        lowerMessage.contains("vss") ||
I don't think any ldk-node error mentions VSS
    ServiceQueue.LDK.background {
        var networkFailures = 0
        val maxNetworkFailures = trustedLnPeers.size // Allow all to fail due to network issues

        for (peer in trustedLnPeers) {
            try {
                node.connect(peer.nodeId, peer.address, persist = true)
                Logger.info("Connected to trusted peer: $peer")
            } catch (e: NodeException) {
                Logger.error("Peer connect error: $peer", LdkError(e))
                val ldkError = LdkError(e)
                val isNetworkError = isNetworkRelatedError(e.message)

                if (isNetworkError) {
                    networkFailures++
                    Logger.warn("Network error connecting to trusted peer: $peer", ldkError)

                    // If all connections failed due to network, throw network exception
                    if (networkFailures >= maxNetworkFailures) {
                        throw NetworkException("Failed to connect to any trusted peers due to network issues")
                    }
                } else {
                    Logger.error("Peer connect error: $peer", ldkError)
                }
            }
The current networkFailures tracking adds complexity without clear benefit. What's the intended behavior when some (but not all) connections fail?
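One possible simplification, sketched under assumptions: try each peer independently, log failures, and return the peers that connected, leaving it to the caller to decide what an empty result means. `Peer`, `PeerConnector`, and `connectToPeers` are hypothetical names for illustration; they are not the types used in the PR or in ldk-node.

```kotlin
// Hypothetical types standing in for the real peer and node classes.
data class Peer(val nodeId: String, val address: String)

fun interface PeerConnector {
    // Expected to throw when the connection attempt fails.
    fun connect(peer: Peer)
}

// Tries every peer independently and returns the ones that connected.
// No failure counter is kept: a partial failure simply yields a shorter list.
fun connectToPeers(peers: List<Peer>, connector: PeerConnector): List<Peer> =
    peers.filter { peer ->
        runCatching { connector.connect(peer) }
            .onFailure { e -> println("Peer connect error: $peer (${e.message})") }
            .isSuccess
    }
```

With this shape, "some failed" and "all failed" are both visible from the returned list, so no exception has to be thrown from inside the loop.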
Converted to draft to check comments
Tests
1️⃣ Start app when offline 🟢
Phone offline -> open app -> should not crash -> phone online -> should sync and update status successfully
2️⃣ Go offline after app started when online 🔴
Phone online -> open app -> should display successful status -> phone offline -> should update failure status -> online again -> resync and update status again -> send bitcoin with success
The app works very badly for me after restoring internet:
- from second 19 till 1m:26s the loading spinner keeps going
- then every pull-to-refresh still takes a lot of time
- every pull-to-refresh ultimately fails with an error: TxSyncTimeout
- each time I go to receive sheet, it takes a lot of time for the QR to show up
Android.Studio.2025-09-11.000337.mp4
actually the app never recovers, not even after restarting it
errAfterAppRestart.mp4
The only way to fix the issue is to close the app via the node notification, and then to wait a bit.
if I pull-to-refresh before node is ready, I get another non-fixable toast error on each refresh
errAfterNodeRestart.mp4
3️⃣ Pull-to-refresh when offline 🟢
- Phone offline > swipe to refresh > display a device offline error
overall:
Honestly, I'm still as concerned as yesterday about merging this PR, especially given that we have a testing session tomorrow.
Not sure if test case 2️⃣ behaves better on master, but if it does, I would suggest retrying this fix from scratch. I'm not very confident in the amount of code changes; they make too many core code paths too difficult to reason about without AI.
Found a simpler solution for the crash. I'll close this PR and open another one



Description
The app was crashing when started without internet connection due to unhandled network exceptions. Additionally, when users started the app offline and later connected to the internet, the Lightning node would get stuck in "Starting" state and never recover, requiring a manual app restart.
Key changes:
Preview
Screen_recording_20250911_084013.mp4
QA Notes