Observation: send goroutines may not terminate on dead connections (missing TCP keepalive) #195

@DarkWolfCave

Hi! First off — thanks for endlessh-go, it's a great project. I've been running it on multiple servers for a while now.

I noticed the issue while building Endlessh Fisher, a gamification layer on top of endlessh's Prometheus metrics. The "Currently Trapped" counter didn't match reality, which led me to investigate.

What I observed

On one of my servers running Debian 13 (kernel 6.12), endlessh_client_open_count_total and endlessh_client_closed_count_total drifted apart permanently over time — even when no actual TCP connections existed:

# Prometheus says 24 connections "open"
curl -s http://localhost:2112/metrics | grep client_open_count_total   # 3969
curl -s http://localhost:2112/metrics | grep client_closed_count_total # 3945
# But reality:
ss -tn 'sport = :22' | wc -l   # 0

On my other two servers running Debian 12 (kernel 6.1), the same drift appeared briefly during traffic spikes but self-healed within hours.

What I think is happening

When a remote peer disconnects without the server-side kernel detecting it promptly (no FIN/RST received, or delayed detection), the send goroutine in startSending() keeps cycling: Send() → Write() succeeds (the kernel buffers the small payload) → re-enqueue → repeat. Since no TCP keepalive is set on accepted connections, the kernel has no proactive mechanism to detect dead peers. On kernel 6.1, TCP retransmission timeouts alone seem sufficient to eventually detect the dead peer. On kernel 6.12, the behavior appears more conservative, so goroutines accumulate permanently.

My data (3 servers, 7 days via InfluxDB)

| Server | Kernel | Arch | Behavior |
|---|---|---|---|
| Server A | 6.1.0-43 (Debian 12) | amd64 | Drift self-heals within hours |
| Server B | 6.12.73 (Debian 13) | arm64 | Drift accumulates permanently (0→24 in 24h) |
| Server C | 6.1.0-43 (Debian 12) | arm64 | Drift self-heals within hours |

A potential fix that worked for me

I tried enabling TCP keepalive on accepted connections (30s period) plus a write deadline as safety net. The rationale: keepalive fires based on last received ACK — not last sent data — so it detects dead peers even while endlessh actively writes.

After deploying on all 3 servers with real scanner traffic:

| Server | Kernel | Connections (12h) | Max ghost diff (with fix) | Max ghost diff (before fix) |
|---|---|---|---|---|
| Server A | 6.1 | 228 | 0-1 (transient) | 0-1 (same) |
| Server B | 6.12 | 327 | 0-1 (transient) | 0→24 (permanent) |
| Server C | 6.1 | 166 | 0 | 0-1 (same) |

I submitted a PR with the changes in case this is useful for the project.

Environment

  • endlessh-go version: 2025.0914.0 (latest)
  • Go: 1.24.2
  • Docker on distroless base
