Skip to content

Fix W5100S socket exhaustion blocking MQTT and additional TCP clients#9770

Open
PhilipLykov wants to merge 6 commits intomeshtastic:developfrom
PhilipLykov:fix/w5100s-socket-exhaustion
Open

Fix W5100S socket exhaustion blocking MQTT and additional TCP clients#9770
PhilipLykov wants to merge 6 commits intomeshtastic:developfrom
PhilipLykov:fix/w5100s-socket-exhaustion

Conversation

@PhilipLykov
Copy link
Contributor

@PhilipLykov PhilipLykov commented Feb 27, 2026

Summary

Fix W5100S socket exhaustion on RAK4631 Ethernet gateways that prevents MQTT connections, DHCP lease renewal, and additional TCP API clients.

The W5100S chip has only 4 hardware sockets. With NTP and Syslog enabled, all 4 sockets were permanently consumed:

Socket Service Lifecycle
0 NTP UDP Permanent (held from boot)
1 Syslog UDP Permanent (held from boot)
2 TCP API Server (LISTEN) Permanent
3 TCP API Client (e.g. MeshMonitor) While connected

This left zero sockets for MQTT, DHCP lease renewal, or NTP updates — causing MQTT to silently fail with no packets sent on the wire.

Changes

  • NTP on-demand (ethClient.cpp): Remove permanent timeClient.begin() at startup. NTPClient::update() auto-initializes when _udpSetup is false. Add timeClient.end() after each query to release the socket immediately.
  • Syslog on-demand (DebugConfiguration.cpp): Remove socket allocation from Syslog::enable(). Open and close the UDP socket in _sendLog() around each message send. Syslog overhead is ~320μs per message for socket open/close, well under 1% CPU at typical log rates.
  • MQTT validation socket leak (MQTT.cpp): Fix isValidConfig() where a successful test connection was never closed — PubSubClient::~PubSubClient() only frees buffers, it does not call disconnect() or _client->stop(). The leaked socket in ESTABLISHED state permanently consumed a W5100S slot.

After fix

Socket Service Lifecycle
0 TCP API Server (LISTEN) Permanent
1 TCP API Client While connected
2 MQTT Client While connected
3 NTP / Syslog / DHCP (shared) On-demand, brief

Test plan

  • RAK4631 Ethernet gateway with syslog, NTP, MQTT, and API client all enabled simultaneously
  • Verify MQTT connects successfully while MeshMonitor is connected
  • Verify syslog messages still appear on the syslog server
  • Verify NTP time sync works after the change
  • Verify DHCP lease renewal works (check over 24+ hours)
  • Verify no regressions on ESP32 WiFi builds (syslog change affects all HAS_NETWORKING platforms)

The W5100S Ethernet chip has only 4 hardware sockets. On RAK4631
Ethernet gateways with syslog and NTP enabled, all 4 sockets were
permanently consumed (NTP UDP + Syslog UDP + TCP API listener + TCP
API client), leaving none for MQTT, DHCP lease renewal, or additional
TCP connections.

- NTP: Remove permanent timeClient.begin() at startup; NTPClient::update()
  auto-initializes when needed. Add timeClient.end() after each query to
  release the UDP socket immediately.
- Syslog: Remove socket allocation from Syslog::enable(). Open and close
  the UDP socket on-demand in _sendLog() around each message send.
- MQTT: Fix socket leak in isValidConfig() where a successful test
  connection was never closed (PubSubClient destructor does not call
  disconnect). Add explicit pubSub->disconnect() before returning.

Made-with: Cursor
@github-actions github-actions bot added the bugfix Pull request that fixes bugs label Feb 27, 2026
delete[] message;
message = new char[len + 1];

vsnprintf(message, len + 1, fmt, args);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

args is reused without va_copy(); reusing a traversed va_list is undefined behavior and can produce corrupted logs or crashes

@@ -193,6 +199,8 @@
this->_client->print(message);
this->_client->endPacket();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return value of endPacket() is ignored; true is returned unconditionally. On embedded Ethernet stacks, endPacket() is the point where TX failure is reported; this currently reports success even when the datagram send fails.

return connectPubSub(parsed, *pubSub, (client != nullptr) ? *client : *clientConnection);
bool result = connectPubSub(parsed, *pubSub, (client != nullptr) ? *client : *clientConnection);
if (client == nullptr) {
pubSub->disconnect();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Socket cleanup is conditional on client == nullptr; when a non-null client is passed, successful validation can still leave the connection open. Since this function is a validator, that side effect can leak scarce sockets on Ethernet-class devices. Consider ensuring cleanup for all validation paths (or explicitly documenting and renaming if persistent connection is intended).

@thebentern
Copy link
Contributor

@robekl please stop using LLMs to review random PRs. We have CoPilot for this. Additionally the reviews are also commenting on existing unchanged code. This is your one and only warning.

@PhilipLykov
Copy link
Contributor Author

@robekl Thanks for the review.

  1. va_copy() — This is pre-existing code in vlogf() that we did not modify. Valid observation, but out of scope for this PR.

  2. endPacket() return value — Also pre-existing behavior in _sendLog(). Our change only added the begin()/stop() wrapping around it.

  3. Conditional disconnect() — Intentional. When client != nullptr, a caller-provided mock is in use (unit tests assert client.connected_ stays true after isValidConfig()). In production client is always nullptr, so disconnect() always runs.

Resolve conflict in MQTT.cpp: take upstream's rewritten isValidConfig()
which replaced connectPubSub() validation with a lightweight TCP-only
MQTTClient check. The upstream approach inherently fixes the socket
leak our PR addressed (testClient.stop() is always called), and adds
user-visible warnings via clientNotification.

Made-with: Cursor
@PhilipLykov
Copy link
Contributor Author

Resolved the merge conflict in MQTT.cpp — upstream rewrote isValidConfig() to use a local MQTTClient testClient with an explicit testClient.stop(), which inherently fixes the socket leak our PR addressed in that function. Took the upstream version.

The other two fixes in this PR are still needed and unaffected by upstream:

  • NTP on-demand (ethClient.cpp): timeClient.end() after each NTP query to release the UDP socket — upstream still holds it permanently from boot.
  • Syslog on-demand (DebugConfiguration.cpp): Open/close the UDP socket in _sendLog() per message — upstream still holds it permanently from enable().

Without these two changes, the W5100S still exhausts all 4 hardware sockets (NTP UDP + Syslog UDP + TCP listener + TCP client), leaving none for MQTT, DHCP renewal, or additional connections.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix Pull request that fixes bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants