Fix agent connection resets by lowering default ping interval to 30s#26353
Fix agent connection resets by lowering default ping interval to 30s#26353subwaycookiecrunch wants to merge 4 commits intojenkinsci:masterfrom
Conversation
MarkEWaite
left a comment
There was a problem hiding this comment.
Please explain how you duplicated the issue described in issue #26338 and why this helps. Changing the timeout is unlikely to alter a connection error.
|
Hi @MarkEWaite, thanks for reviewing! To clarify right away: this PR does not change the connection timeout (PING_TIMEOUT_SECONDS_DEFAULT is intentionally left untouched at 4 minutes). Instead, this PR only changes the ping interval (PING_INTERVAL_SECONDS_DEFAULT) from 5 minutes (300 seconds) down to 30 seconds. Why this helps and prevents the Connection Error: Issue #26338 is caused by network intermediaries (like AWS NLB or Nginx Ingress) silently dropping the TCP connection due to their own idle timeouts. For example, Nginx has a default proxy-read-timeout of 60 seconds. If a Jenkins agent is idle, no log or build data is sent. With the old default ChannelPinger At the 60-second mark of silence, Nginx closes the connection due to inactivity. (Note: this 30-second interval also perfectly aligns with the jenkins.websocket.pingInterval property added in recent Jenkins WebSocket transports, unifying the behavior across protocols). How it was duplicated: While I relied on the highly detailed packet captures (pcaps) provided by the original reporter in #26338 to verify the exact TCP sequence, this behavior is completely reproducible by placing a Jenkins agent behind an Nginx reverse proxy with the default 60s idle timeout. If you let the agent sit completely idle (no jobs, no monitoring data), the agent will consistently drop with a Connection reset exactly when it tries to send its first packet after the proxy's silence limit. Let me know if this makes sense or if you prefer this to be handled via a configurable annotation rather than changing the default! |
|
I consider this as not needed. It is already possible to adjust the pinginterval seconds via a system property |
|
@mawinter69 Yeah, it's true you can override it with -Dhudson.slaves.ChannelPinger.pingIntervalSeconds, but I really think the default itself needs to drop. The 5-minute default made sense back when a lot of Jenkins setups were just raw TCP sockets on a local network, but these days almost everyone is running behind an AWS NLB or a K8s Nginx ingress controller. Those proxies drop idle connections aggressively (nginx is 60s by default). When that happens, the connection resets silently. So for most new Jenkins setups out-of-the-box, agents just randomly drop with Connection reset errors when they're idle, and the admin has no idea they need to dig up an undocumented system property just to keep them alive. Also, it looks like jenkins.websocket.pingInterval was recently added and it defaults to 30s for this exact reason. Bumping the TCP pinger down to 30s just makes them consistent so people don't have to tune different properties depending on what transport they use. |
There was a problem hiding this comment.
The explanation for interval vs timeout and the proxy scenario is clear.
One open question: do we want to change the default for all installations, or keep 5 min and document the proxy/idle-timeout issue plus the hudson.slaves.ChannelPinger.pingIntervalSeconds property (e.g. in troubleshooting or release notes)?
Changing the default helps out-of-the-box behind NLB/ingress but increases ping traffic for on-prem/direct-TCP setups. Would be good to align with @MarkEWaite and others on that before merging.
|
@andreahlert that's a fair point. I'd argue that changing the default still makes more sense than just documenting the property, mainly because:
That said, I'm totally fine waiting for @MarkEWaite to weigh in before this goes anywhere. If the preference is to keep 5 minutes and just document it better, I can close this and open a docs PR instead , no strong feelings either way, just think the default change is more practical for most setups. |
Fixes #26338
Testing done
PING_INTERVAL_SECONDS_DEFAULTto 30 automatically applies at runtime unless manually overridden by the deprecatedhudson.slaves.ChannelPinger.pingIntervalproperty.Screenshots (UI changes only)
Before
After
Proposed changelog entries
Proposed changelog category
/label bug
Proposed upgrade guidelines
N/A
Submitter checklist
@Restrictedor have@since TODOJavadocs, as appropriate.@Deprecated(since = "TODO")or@Deprecated(forRemoval = true, since = "TODO"), if applicable.evalto ease future introduction of Content Security Policy (CSP) directives (see documentation).Desired reviewers
@mention
Before the changes are marked as
ready-for-merge:Maintainer checklist
upgrade-guide-neededlabel is set and there is a Proposed upgrade guidelines section in the pull request title (see example).lts-candidateto be considered.