Fix RAK4631 Ethernet gateway API connection loss after W5100S brownout#9754
Conversation
PoE power instability can brownout the W5100S while the nRF52 MCU keeps running, causing all chip registers (MAC, IP, sockets) to revert to defaults. The firmware had no mechanism to detect or recover from this. Changes: - Detect W5100S chip reset by periodically verifying MAC address register in reconnectETH(); on mismatch, perform full hardware reset and re-initialize Ethernet interface and services - Add deInitApiServer() for clean API server teardown during recovery - Add ~APIServerPort destructor to prevent memory leaks - Switch nRF52 from EthernetServer::available() to accept() to prevent the same connected client from being repeatedly re-reported - Add proactive dead-connection cleanup in APIServerPort::runOnce() - Add 15-minute TCP idle timeout to close half-open connections that consume limited W5100S hardware sockets Fixes meshtastic#6970 Made-with: Cursor
|
@PhilipLykov You think this would resolve this issue? #8462 |
There was a problem hiding this comment.
Pull request overview
This PR fixes a critical issue where RAK4631 Ethernet gateway devices lose API connectivity after a few minutes due to W5100S chip brownouts caused by PoE power instability. The fix implements hardware reset detection, proper recovery mechanisms, and several improvements to TCP connection management to prevent socket exhaustion.
Changes:
- Added W5100S chip reset detection by monitoring MAC address registers and full re-initialization on mismatch
- Implemented proper API server teardown (
deInitApiServer()) and destructor to prevent memory leaks during recovery - Fixed nRF52 connection handling to use
accept()instead ofavailable()and added proactive cleanup of dead connections - Added 15-minute TCP idle timeout to prevent half-open connections from exhausting the W5100S's 4 hardware sockets
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/mesh/eth/ethClient.cpp | Implements W5100S reset detection via MAC verification and full hardware re-initialization |
| src/mesh/api/ethServerAPI.h | Adds deInitApiServer() declaration for clean API server teardown |
| src/mesh/api/ethServerAPI.cpp | Implements deInitApiServer() to properly destroy the API server |
| src/mesh/api/ServerAPI.h | Adds destructor to prevent memory leaks and isClientConnected() helper method |
| src/mesh/api/ServerAPI.cpp | Implements TCP idle timeout, proactive dead-connection cleanup, and switches nRF52 to accept() |
Address Copilot review comment: log millis() - lastContactMsec to show the real time since last client activity, rather than always logging the TCP_IDLE_TIMEOUT_MS constant. Made-with: Cursor
…com/PhilipLykov/Meshtastic_firmware into fix/rak4631-eth-api-connection-loss
|
|
Re: #8462 - This PR does not fix that issue. #8462 is a different bug on the radio side (RadioLib error -7 = RADIOLIB_ERR_CRC_MISMATCH), where the SX1262 LoRa radio stops decoding packets correctly after a few days. The two chips live on completely separate SPI buses on the RAK4631:
This PR fixes W5100S Ethernet chip brownout recovery (API/TCP connectivity loss). The radio CRC issue in #8462 would need its own detection and recovery - likely periodic verification of SX1262 configuration registers and re-initialization if corrupted. However, if both issues are triggered by the same PoE voltage transients, the root cause is the same (power instability), just affecting different chips independently. |
|
@robekl Good observation — this is intentional and consistent with how the rest of the Meshtastic codebase handles connection timeouts.
Why not update on outbound traffic? The purpose of this timeout is to detect dead/abandoned connections (crashed clients leaving half-open TCP sockets). If we updated The Meshtastic protocol already accounts for this via the |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
@PhilipLykov Can you address the merge conflicts on this one? |
Made-with: Cursor
…oss' into fix/rak4631-eth-api-connection-loss Made-with: Cursor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
You can also share your feedback on Copilot code review. Take the survey.
|
@PhilipLykov Sorry to bug you again but please review the CoPilot suggestions and determine if we want to add them or not. Thanks! |
After a W5100S chip brownout, the udpHandler isRunning flag stays true while the underlying socket is dead. Without calling stop(), the subsequent start() no-ops and multicast is silently broken after recovery. Made-with: Cursor
|
@Xaositek Thanks for the nudge! Reviewed both Copilot suggestions: 1. UDP multicast not torn down during brownout recovery ( 2. PR description mentions destructor that no longer exists ( |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
You can also share your feedback on Copilot code review. Take the survey.
Move ethStartupComplete and ntp_renew reset to immediately after service teardown, before Ethernet.begin(). Previously, if DHCP failed the early return left ethStartupComplete=true, preventing service re-initialization on subsequent retries. Replace #define TCP_IDLE_TIMEOUT_MS with static constexpr uint32_t for type safety and better C++ practice. Made-with: Cursor
…com/PhilipLykov/Meshtastic_firmware into fix/rak4631-eth-api-connection-loss
|
Addressed the latest Copilot review — all fixes pushed in d396b0a: 1. 2. Unsafe deletion of 3. |
#9754) * Fix RAK4631 Ethernet gateway API connection loss after W5100S brownout PoE power instability can brownout the W5100S while the nRF52 MCU keeps running, causing all chip registers (MAC, IP, sockets) to revert to defaults. The firmware had no mechanism to detect or recover from this. Changes: - Detect W5100S chip reset by periodically verifying MAC address register in reconnectETH(); on mismatch, perform full hardware reset and re-initialize Ethernet interface and services - Add deInitApiServer() for clean API server teardown during recovery - Add ~APIServerPort destructor to prevent memory leaks - Switch nRF52 from EthernetServer::available() to accept() to prevent the same connected client from being repeatedly re-reported - Add proactive dead-connection cleanup in APIServerPort::runOnce() - Add 15-minute TCP idle timeout to close half-open connections that consume limited W5100S hardware sockets Fixes #6970 Made-with: Cursor * Log actual elapsed idle time instead of constant timeout value Address Copilot review comment: log millis() - lastContactMsec to show the real time since last client activity, rather than always logging the TCP_IDLE_TIMEOUT_MS constant. Made-with: Cursor * Update src/mesh/api/ServerAPI.h Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Stop UDP multicast handler during W5100S brownout recovery After a W5100S chip brownout, the udpHandler isRunning flag stays true while the underlying socket is dead. Without calling stop(), the subsequent start() no-ops and multicast is silently broken after recovery. Made-with: Cursor * Address Copilot review: recovery flags and timeout constant Move ethStartupComplete and ntp_renew reset to immediately after service teardown, before Ethernet.begin(). Previously, if DHCP failed the early return left ethStartupComplete=true, preventing service re-initialization on subsequent retries. Replace #define TCP_IDLE_TIMEOUT_MS with static constexpr uint32_t for type safety and better C++ practice. Made-with: Cursor --------- Co-authored-by: Ben Meadors <benmmeadors@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Summary
RAK4631 Ethernet gateway devices (RAK13800-W5100S) lose API connectivity after a few minutes of operation while radio and ping continue to work. Only a power reset recovers connectivity.
Root cause: PoE power instability can brownout the W5100S Ethernet chip while the nRF52 MCU keeps running. This causes all W5100S registers (MAC address, IP configuration, socket states) to silently revert to defaults. The firmware had no mechanism to detect or recover from this hardware-level reset. Several pre-existing software issues compounded the problem by preventing graceful recovery even from normal TCP disconnects.
Changes
W5100S reset detection and recovery (
ethClient.cpp): Periodically verify the W5100S MAC address register inreconnectETH(). On mismatch (indicating a chip reset), perform a full hardware reset viaPIN_ETHERNET_RESET, re-initialize the Ethernet interface (MAC, IP via DHCP/static), and trigger re-initialization of all Ethernet services (API server, NTP, syslog, UDP multicast) by resetting theethStartupCompleteflag.API server clean teardown (
ethServerAPI.cpp/.h): AdddeInitApiServer()function (matching the existing WiFi convention) to properly destroy the API server during Ethernet re-initialization, preventing stale socket/pointer issues.UDP multicast teardown on brownout recovery (
ethClient.cpp): Stop the UDP multicast handler (udpHandler->stop()) during W5100S reset recovery. Without this, the handler'sisRunningflag staystrueafter a chip brownout, causingstart()to no-op during re-initialization and leaving multicast silently broken after recovery.Switch nRF52 to
accept()(ServerAPI.cpp): Change nRF52 fromEthernetServer::available()toaccept()(aligning with RP2040).available()repeatedly returns the same connected client, causing the server to attempt re-accepting an already-active connection.Proactive dead-connection cleanup (
ServerAPI.cpp): Check and clean up disconnectedopenAPIinstances at the beginning ofAPIServerPort::runOnce()before accepting new connections.TCP idle timeout (
ServerAPI.cpp): Add a 15-minute inactivity timeout to forcefully close half-open TCP connections. The W5100S has only 4 hardware sockets, and half-open connections from crashed clients permanently consume them.Note: The original PR also added an
~APIServerPort()destructor to deleteopenAPI, but this is no longer needed — upstreamdevelopchangedopenAPIfrom a raw pointer tostd::unique_ptr, which handles cleanup automatically.Hardware context
Fixes #6970
Test plan
rak4631_eth_gwtarget successfully