
Fix for wifi roaming for 4.2.2 #1134

Open

SachD-Av wants to merge 16 commits into wgtunnel:master from SachD-Av:master-bssid-4.2.2

Conversation


@SachD-Av SachD-Av commented Jan 7, 2026

BSSID Roaming for 4.2.2

Works on all WiFi connections when Auto-Tunnel is ON.

  • Event-driven BSSID detection for instant WiFi AP switching
  • Anti-leak block config (interface without peers during transition)
  • Adaptive debounce (first roaming immediate)
  • Auto-cancellation on WiFi loss (Cellular/Ethernet)
  • WakeLock management to prevent interruption

This implementation avoids socket checks and eliminates redundancy in the auto-detection pipeline. Without NET_CAPABILITY_VALIDATED, the tunnel would be restored before the new access point's network stack is fully ready, leading to DNS failures and unstable connections.

We perform exactly one tunnel bounce per BSSID change and only after the new network is confirmed ready.
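The single-bounce guard described above could be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code; `RoamBounceGuard` and its names are hypothetical, and in the real pipeline the `isValidated` flag would come from `NET_CAPABILITY_VALIDATED` on the new network's capabilities.

```kotlin
// Hypothetical sketch: bounce the tunnel exactly once per new BSSID,
// and only after the network reports NET_CAPABILITY_VALIDATED.
class RoamBounceGuard {
    private var lastBouncedBssid: String? = null

    /** Returns true at most once per new BSSID, and only once the network is validated. */
    fun shouldBounce(bssid: String?, isValidated: Boolean): Boolean {
        if (bssid == null || !isValidated) return false // wait for a validated network
        if (bssid == lastBouncedBssid) return false     // already bounced for this AP
        lastBouncedBssid = bssid
        return true
    }
}
```

Until validation arrives the guard simply keeps returning false, so repeated capability callbacks for the same AP never cause a second bounce.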

@SachD-Av
Author

Hello,

After some testing, I found that the roaming handler could enter a "gray zone" state where it kept executing even after the network changed. This resulted in dead tunnels that required a manual restart to recover.

I added context tracking and network-change detection to safely cancel roaming when necessary. The fix also adds validation checkpoints between each phase to prevent the handler from continuing after cancellation, and includes a CancellationException handler to guarantee cleanup even when the coroutine is cancelled externally.

The fix specifically handles edge cases like rapid BSSID transitions, ensuring the tunnel always remains synchronized with the current network connection.

Roaming still takes precedence over AutoTunnel's normal operation, but it now cancels safely when any network change occurs and lets AutoTunnel take over.
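The checkpoint pattern described above could look roughly like this. It is a simplified, coroutine-free sketch with hypothetical names (`RoamingHandler`, `RoamingCancelled`); the PR itself works with coroutine cancellation, but the control flow is the same: validate between phases, and guarantee cleanup when cancelled.

```kotlin
// Illustrative sketch of per-phase validation checkpoints with guaranteed cleanup.
class RoamingCancelled : RuntimeException()

class RoamingHandler(private val startedOnNetworkId: Long) {
    @Volatile var currentNetworkId: Long = startedOnNetworkId

    /** Throws if the network changed since roaming started, aborting remaining phases. */
    private fun checkpoint() {
        if (currentNetworkId != startedOnNetworkId) throw RoamingCancelled()
    }

    /** Runs each phase with a checkpoint in between; returns false if cancelled. */
    fun run(phases: List<() -> Unit>): Boolean = try {
        for (phase in phases) {
            checkpoint() // validate before every phase
            phase()
        }
        true
    } catch (e: RoamingCancelled) {
        cleanup()        // cleanup runs even when cancelled mid-flight
        false
    }

    private fun cleanup() { /* release WakeLock, reset roaming state, etc. */ }
}
```

If a network change lands between phase 1 and phase 2, the next checkpoint trips, the remaining phases never run, and cleanup still executes.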

@zaneschepke
Collaborator

Thank you for all of this work. I'll take a look and do some testing.

@SachD-Av
Author

Hello,

Still testing: roaming works about 90% of the time, but I occasionally see DNS resolution failures after the roaming procedure completes.

I think the issue is that we're restoring the main tunnel before the network's DNS is actually ready after a BSSID change. Looking at the logs, the tunnel starts up fine, but then it can't resolve the endpoint hostname because the DNS stack hasn't stabilized yet (I use DDNS).

Would it make sense to add a waitForDnsReady() function that uses Android's DnsResolver to actively verify that the endpoint hostname resolves before restoring the tunnel? Something like a retry every 300 ms with a 10-second timeout, so that we only proceed once DNS is actually working?
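The proposed retry loop could be sketched as below. On Android the real implementation would use `android.net.DnsResolver` against the new `Network`; this sketch substitutes `java.net.InetAddress` so the loop itself is easy to see. The 300 ms interval and 10 s timeout are the values suggested above; the function name is the proposed one.

```kotlin
import java.net.InetAddress

// Sketch of waitForDnsReady(): poll until the endpoint hostname resolves,
// or give up at the deadline. Android would use DnsResolver here instead.
fun waitForDnsReady(
    hostname: String,
    timeoutMs: Long = 10_000,
    intervalMs: Long = 300,
): Boolean {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (System.currentTimeMillis() < deadline) {
        try {
            InetAddress.getByName(hostname) // throws until DNS actually works
            return true
        } catch (e: Exception) {
            Thread.sleep(intervalMs)        // back off, then retry
        }
    }
    return false
}
```

The tunnel restore would then be gated on `waitForDnsReady(endpointHost)` returning true.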

@SachD-Av
Author

Hello again,

I have been testing a few things, and I have found a solution that is both fast and reliable.

Instead of an explicit stop with a smart fallback, we now use a direct startTunnel(BLOCK) approach. To resolve the 'already running' conflicts I was experiencing with the BLOCK tunnel, it now uses a unique ID of -1. I also implemented BSSID detection using drop(1) to ensure we wait for an actual network change before proceeding.

## Latest version

I've been testing this one for a few days without issue. With the previous version I kept having DNS resolution issues.

Roaming runs only on Wi-Fi when needed, and Auto-Tunnel always has priority for network events.

We keep the active tunnel up to try a socket rebind using cached endpoint IPs (DDNS). If the rebind fails, we simply restart the tunnel. No block necessary.

## What it does

- Detects WiFi roaming (BSSID change on same SSID)
- Caches endpoint IPs for DDNS to avoid DNS resolution through broken tunnel
- Tries socket rebind first (zero downtime)
- Falls back to tunnel restart if rebind fails
- Skips kernel mode (handles rebinding natively)

## Keep it simple

- Does NOT interfere with AutoTunnel
- Does NOT trigger on 4G ↔ WiFi transitions
- Does NOT trigger on SSID changes
- Does NOT trigger on first WiFi connection
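
The trigger rules above boil down to a single predicate, sketched here with hypothetical names (`NetState`, `isRoamEvent`); the real code would read SSID/BSSID from WifiInfo and the transport from NetworkCapabilities.

```kotlin
// Illustrative predicate for the "keep it simple" rules:
// fire only on a BSSID change on the same SSID while staying on Wi-Fi.
data class NetState(val transport: String, val ssid: String?, val bssid: String?)

fun isRoamEvent(prev: NetState?, curr: NetState): Boolean {
    if (prev == null) return false                     // first WiFi connection: no trigger
    if (prev.transport != "WIFI" || curr.transport != "WIFI") return false // no 4G <-> WiFi
    if (prev.ssid != curr.ssid) return false           // SSID change: AutoTunnel handles it
    return prev.bssid != null && curr.bssid != null && prev.bssid != curr.bssid
}
```

Everything that is not a same-SSID BSSID change falls through to AutoTunnel untouched.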

## Misc

- Thread-safe (Mutex for job management)
- WakeLock for background operation
- Debounce for rapid roaming (2s)
- First roaming is immediate
Recovery flow is now fully stateless:

- Phase 1: setState with cached IPs (instant, no DNS needed)
- Phase 2: fresh DNS via the system resolver + setState (zero downtime)

Added a 500 ms settle delay before all roaming recovery, and we always verify we are still on WiFi before proceeding. During a WiFi→4G transition the check sees Cellular and skips recovery; during real roaming the phone stays on WiFi and recovery proceeds normally.

Other changes:

- Improved endpoint caching with IPv6 support
- Added releaseOnFailure() to properly close sockets and channels if tunnel startup fails
- Added a backendMode reset in UserspaceTunnel to ensure the VPN service recovers cleanly
- Fixed tunnel cleanup after app update
- Fixed endpoint handling for IPv4 and IPv6
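
The adaptive debounce mentioned above (first roam immediate, rapid repeats held back by the 2 s window) could be sketched like this. `AdaptiveDebounce` is an illustrative name, not the PR's class.

```kotlin
// Sketch of the adaptive debounce: first roaming event fires immediately,
// events inside the 2 s window are delayed to the end of the window.
class AdaptiveDebounce(private val windowMs: Long = 2_000) {
    private var lastFiredAt: Long? = null

    /** Returns the delay (ms) to apply before handling this roam event. */
    fun delayFor(nowMs: Long): Long {
        val last = lastFiredAt
        lastFiredAt = nowMs
        return if (last == null || nowMs - last >= windowMs) 0L
               else windowMs - (nowMs - last)
    }
}
```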
@zaneschepke
Collaborator

Hello again!

I'm sorry I'm just getting back to you on this one.

First, great work on this and thank you for this contribution! I've been reviewing the code and overall I think it looks good.

That said, there is a fundamental change I'd like to make to how you are resolving the tunnel break on BSSID change/roaming. The current forceSocketRebind essentially forces the backend to bring down the old tunnel, remove the tunnel interface, and recreate it, which causes a roughly one-second leak during the transition. Although this is not the worst fix, there is a better way to resolve this leak-free, using VpnService.setUnderlyingNetworks. If we expose the Network instance from the network monitor, pass it to a new API in our go backend on BSSID change, and call this method internally, it will refresh the tunnel's routing binding and should resolve the roaming issues without ever recreating the tunnel interface.

This will also simplify things as we no longer need to do any caching/etc as you are currently doing.
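
A rough shape of the proposed flow, with everything here hypothetical (`TunnelBackend`, `FakeBackend`, `onBssidChange` are illustrative stand-ins, not the actual wgtunnel or go-backend API): the network monitor hands the new Network to the backend, which internally calls VpnService.setUnderlyingNetworks so routing is refreshed without recreating the interface.

```kotlin
// Hypothetical API shape for the setUnderlyingNetworks-based approach.
interface TunnelBackend {
    /** Rebind tunnel routing to [networkId] without recreating the interface. */
    fun setUnderlyingNetwork(networkId: Long?)
}

// Recording fake, standing in for the real backend in this sketch.
class FakeBackend : TunnelBackend {
    val calls = mutableListOf<Long?>()
    override fun setUnderlyingNetwork(networkId: Long?) { calls.add(networkId) }
}

/** On a BSSID change, refresh the underlying network instead of bouncing the tunnel. */
fun onBssidChange(backend: TunnelBackend, networkId: Long) {
    backend.setUnderlyingNetwork(networkId)
}
```

With this shape, no endpoint-IP caching is needed: the kernel/backend just re-evaluates routing against the refreshed underlying network.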

I'll make these changes as part of your PR because there are upstream APIs that need to be created for this.

Once done, I could really use your help in testing this change to make sure it does indeed resolve the issue similarly to your current solution.

@zaneschepke zaneschepke self-requested a review March 8, 2026 00:09
