Observation: send goroutines may not terminate on dead connections (missing TCP keepalive) #195

Hi! First off — thanks for endlessh-go, it's a great project. I've been running it on multiple servers for a while now.

I noticed the issue while building [Endlessh Fisher](https://github.com/DarkWolfCave/endlessh-fisher), a gamification layer on top of endlessh's Prometheus metrics. The "Currently Trapped" counter didn't match reality, which led me to investigate.

**What I observed**

On one of my servers running Debian 13 (kernel 6.12), `endlessh_client_open_count_total` and `endlessh_client_closed_count_total` drifted apart permanently over time — even when no actual TCP connections existed:

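```bash
# Prometheus counters imply 3969 - 3945 = 24 connections still "open"
curl -s http://localhost:2112/metrics | grep '^endlessh_client_open_count_total'   # 3969
curl -s http://localhost:2112/metrics | grep '^endlessh_client_closed_count_total' # 3945
# But the kernel shows no matching connections (-H drops the header line so it is not counted):
ss -Htn 'sport = :22' | wc -l   # 0
```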
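The same comparison can be scripted so the drift is a single number. This is a minimal sketch, assuming the two counters appear in the plain Prometheus text format under the names quoted above; `sumMetric` and `ghostDrift` are hypothetical helper names, and the parser is deliberately simplified (it sums all label sets and ignores timestamps).

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumMetric adds up every sample of the named metric in Prometheus text
// exposition format, skipping comment lines and ignoring label sets.
func sumMetric(exposition, name string) (float64, error) {
	var total float64
	found := false
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' (labels) or whitespace.
		end := strings.IndexAny(line, "{ \t")
		if end < 0 || line[:end] != name {
			continue
		}
		rest := line[end:]
		// Drop the label set, if any, so only "value [timestamp]" remains.
		if rb := strings.LastIndexByte(rest, '}'); rb >= 0 {
			rest = rest[rb+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			return 0, fmt.Errorf("no value on line %q", line)
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return 0, err
		}
		total += v
		found = true
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	if !found {
		return 0, fmt.Errorf("metric %q not found", name)
	}
	return total, nil
}

// ghostDrift returns open minus closed, i.e. how many connections the
// counters still consider open.
func ghostDrift(exposition string) (float64, error) {
	open, err := sumMetric(exposition, "endlessh_client_open_count_total")
	if err != nil {
		return 0, err
	}
	closed, err := sumMetric(exposition, "endlessh_client_closed_count_total")
	if err != nil {
		return 0, err
	}
	return open - closed, nil
}
```

Feeding it the body of `http://localhost:2112/metrics` and comparing the result with the `ss` count gives the number of ghost connections.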

On my other two servers running Debian 12 (kernel 6.1), the same drift appeared briefly during traffic spikes but self-healed within hours.

**What I think is happening**

When a remote peer disappears without the server-side kernel noticing promptly (no FIN/RST received, or delayed detection), the send goroutine in `startSending()` keeps cycling: `Send()` → `Write()` succeeds (the kernel buffers the small payload) → re-enqueue → repeat. Since no TCP keepalive is set on accepted connections, the kernel has no proactive mechanism to detect dead peers. On kernel 6.1, TCP retransmission timeouts alone seem sufficient to eventually detect the dead peer and fail the write. On kernel 6.12, detection appears to take far longer or not happen at all, so the goroutines (and the open count) accumulate permanently.
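To make the hypothesis concrete, here is a toy model of that cycle. It is not the endlessh-go code: `sendLoop`, `silentDeadPeer`, and `detectedDeadPeer` are hypothetical names, and the fake writers only imitate what a socket might report. The point is that a loop whose only exit is a write error never exits if every write is accepted.

```go
package main

import (
	"errors"
	"io"
)

// errPeerGone stands in for the error a real socket returns once the kernel
// knows the peer is dead (for example after an RST or a keepalive timeout).
var errPeerGone = errors.New("connection reset by peer")

// sendLoop is a hypothetical simplification of the send/re-enqueue cycle.
// It writes line repeatedly and stops only on a write error, or after
// maxRounds, which exists solely so the demo terminates.
func sendLoop(w io.Writer, line []byte, maxRounds int) (int, error) {
	for round := 0; round < maxRounds; round++ {
		if _, err := w.Write(line); err != nil {
			return round, err
		}
	}
	return maxRounds, nil
}

// silentDeadPeer models a socket whose peer vanished without FIN/RST:
// every small write is accepted as if the kernel had buffered it.
type silentDeadPeer struct{}

func (silentDeadPeer) Write(p []byte) (int, error) { return len(p), nil }

// detectedDeadPeer models a socket where the kernel notices the dead peer
// after okWrites successful writes.
type detectedDeadPeer struct{ okWrites int }

func (d *detectedDeadPeer) Write(p []byte) (int, error) {
	if d.okWrites == 0 {
		return 0, errPeerGone
	}
	d.okWrites--
	return len(p), nil
}
```

With `silentDeadPeer{}` the loop runs for all `maxRounds` without an error; with `&detectedDeadPeer{okWrites: 2}` it stops after two writes. In this model, the second case corresponds to what kernel 6.1 seems to reach eventually, and the first to what I see on 6.12.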

**My data (3 servers, 7 days via InfluxDB)**

| Server | Kernel | Arch | Behavior |
|--------|--------|------|----------|
| Server A | 6.1.0-43 (Debian 12) | amd64 | Drift self-heals within hours |
| Server B | 6.12.73 (Debian 13) | arm64 | Drift accumulates permanently (0→24 in 24h) |
| Server C | 6.1.0-43 (Debian 12) | arm64 | Drift self-heals within hours |

**A potential fix that worked for me**

I tried enabling TCP keepalive on accepted connections (30s period) plus a write deadline as a safety net. The rationale, as I understand it: keepalive is driven by the time since the last *received* ACK, not the last sent data, so it should detect dead peers even while endlessh is actively writing.

After deploying on all 3 servers with real scanner traffic:

| Server | Kernel | Connections (12h) | Max ghost diff, after fix | Max ghost diff, before fix |
|--------|--------|-------------------|---------------------------|----------------------------|
| Server A | 6.1 | 228 | 0-1 (transient) | 0-1 (same) |
| Server B | 6.12 | 327 | 0-1 (transient) | 0→24 (permanent) |
| Server C | 6.1 | 166 | 0 | 0-1 (same) |

I submitted a PR with the changes in case this is useful for the project.

**Environment**
- endlessh-go version: 2025.0914.0 (latest)
- Go: 1.24.2
- Docker on distroless base
