Hi! First off — thanks for endlessh-go, it's a great project. I've been running it on multiple servers for a while now.
I noticed the issue while building Endlessh Fisher, a gamification layer on top of endlessh's Prometheus metrics. The "Currently Trapped" counter didn't match reality, which led me to investigate.
What I observed
On one of my servers running Debian 13 (kernel 6.12), endlessh_client_open_count_total and endlessh_client_closed_count_total drifted apart permanently over time — even when no actual TCP connections existed:
# Prometheus says 24 connections "open"
curl -s http://localhost:2112/metrics | grep client_open_count_total # 3969
curl -s http://localhost:2112/metrics | grep client_closed_count_total # 3945
# But reality:
ss -tn 'sport = :22' | wc -l # 0
On my other two servers running Debian 12 (kernel 6.1), the same drift appeared briefly during traffic spikes but self-healed within hours.
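For anyone who wants to track the same drift, here is a minimal Go sketch that computes "open minus closed" from the text of a `/metrics` scrape. The helper names `counterSum` and `ghostConnections` are hypothetical and not part of endlessh-go. The sketch sums all series of each metric, on the assumption that the counters may be exported with labels; the sample input reuses the two values quoted above.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// counterSum adds up every sample of the named metric in a Prometheus
// text-format scrape. Summing covers the case where the counter is
// exported as several labeled series.
func counterSum(metrics, name string) (float64, error) {
	var total float64
	found := false
	sc := bufio.NewScanner(strings.NewReader(metrics))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, name) {
			continue // also skips blank lines and "# HELP" / "# TYPE" comments
		}
		rest := line[len(name):]
		// Reject longer metric names that merely share the prefix.
		if rest == "" || !strings.ContainsRune("{ \t", rune(rest[0])) {
			continue
		}
		// Drop the label set, if present; the sample value follows it.
		if i := strings.LastIndex(rest, "}"); i >= 0 {
			rest = rest[i+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return 0, fmt.Errorf("parse %q: %w", line, err)
		}
		total += v
		found = true
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	if !found {
		return 0, fmt.Errorf("metric %q not found", name)
	}
	return total, nil
}

// ghostConnections returns opened minus closed, i.e. the number of
// connections the exporter still believes are open.
func ghostConnections(metrics string) (float64, error) {
	opened, err := counterSum(metrics, "endlessh_client_open_count_total")
	if err != nil {
		return 0, err
	}
	closed, err := counterSum(metrics, "endlessh_client_closed_count_total")
	if err != nil {
		return 0, err
	}
	return opened - closed, nil
}

func main() {
	// Sample scrape built from the values in this report.
	sample := "endlessh_client_open_count_total 3969\n" +
		"endlessh_client_closed_count_total 3945\n"
	d, err := ghostConnections(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("ghost connections: %.0f\n", d) // prints "ghost connections: 24"
}
```

Comparing this number against the `ss` output over time shows whether the difference is transient or accumulating.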
What I think is happening
When a remote peer disconnects without the server-side kernel detecting it promptly (no FIN/RST received, or delayed detection), the send goroutine in startSending() keeps cycling: Send() → Write() succeeds (kernel buffers the small data) → re-enqueue → repeat. Since no TCP keepalive is set on accepted connections, the kernel has no proactive mechanism to detect dead peers. On kernel 6.1, TCP retransmission timeouts alone seem sufficient to eventually detect the dead peer. On kernel 6.12, the behavior appears more conservative, so goroutines accumulate permanently.
My data (3 servers, 7 days via InfluxDB)
| Server | Kernel | Arch | Behavior |
|---|---|---|---|
| Server A | 6.1.0-43 (Debian 12) | amd64 | Drift self-heals within hours |
| Server B | 6.12.73 (Debian 13) | arm64 | Drift accumulates permanently (0→24 in 24h) |
| Server C | 6.1.0-43 (Debian 12) | arm64 | Drift self-heals within hours |
A potential fix that worked for me
I tried enabling TCP keepalive on accepted connections (30s period) plus a write deadline as a safety net. The rationale: keepalive fires based on the last received ACK, not the last sent data, so it detects dead peers even while endlessh is actively writing.
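The idea can be sketched in Go roughly as follows. This is an illustration of the approach, not the code from the PR: the helper names (`enableKeepAlive`, `writeWithDeadline`, `loopbackPair`) and the 10s write timeout are assumptions made for the example, and only the 30s keepalive period comes from the description above. It uses the standard `net.TCPConn` and `net.Conn` methods.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

const (
	keepAlivePeriod = 30 * time.Second // period mentioned in this report
	writeTimeout    = 10 * time.Second // hypothetical value for the sketch
)

// enableKeepAlive turns on TCP keepalive probes for an accepted connection
// so the kernel can notice a peer that vanished without sending FIN/RST.
func enableKeepAlive(conn net.Conn, period time.Duration) error {
	tcp, ok := conn.(*net.TCPConn)
	if !ok {
		return fmt.Errorf("not a TCP connection: %T", conn)
	}
	if err := tcp.SetKeepAlive(true); err != nil {
		return err
	}
	return tcp.SetKeepAlivePeriod(period)
}

// writeWithDeadline bounds a single write so that a blocked write returns
// an error instead of parking the sending goroutine indefinitely.
func writeWithDeadline(conn net.Conn, data []byte, timeout time.Duration) error {
	if err := conn.SetWriteDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	_, err := conn.Write(data)
	return err
}

// loopbackPair opens a listener on a free local port and returns the
// accepted (server) side and the dialing (client) side of one connection.
func loopbackPair() (server, client net.Conn, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, nil, err
	}
	defer ln.Close()
	client, err = net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return nil, nil, err
	}
	server, err = ln.Accept()
	if err != nil {
		client.Close()
		return nil, nil, err
	}
	return server, client, nil
}

func main() {
	server, client, err := loopbackPair()
	if err != nil {
		panic(err)
	}
	defer server.Close()
	defer client.Close()

	if err := enableKeepAlive(server, keepAlivePeriod); err != nil {
		panic(err)
	}
	// On a write error the caller would close the connection and count it
	// as closed rather than re-enqueueing it.
	if err := writeWithDeadline(server, []byte("x\r\n"), writeTimeout); err != nil {
		fmt.Println("write failed, closing:", err)
		return
	}
	fmt.Println("write ok")
}
```

The same keepalive setting can also be applied at the listener level through the `KeepAlive` field of `net.ListenConfig`. The write deadline is per call, so it needs to be refreshed before every write.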
After deploying on all 3 servers with real scanner traffic:
| Server | Kernel | Connections (12h) | Max ghost diff (after fix) | Max ghost diff (before fix) |
|---|---|---|---|---|
| Server A | 6.1 | 228 | 0-1 (transient) | 0-1 (same) |
| Server B | 6.12 | 327 | 0-1 (transient) | 0→24 (permanent) |
| Server C | 6.1 | 166 | 0 | 0-1 (same) |
I submitted a PR with the changes in case this is useful for the project.
Environment
- endlessh-go version: 2025.0914.0 (latest)
- Go: 1.24.2
- Docker on distroless base