Fix agent connection resets by lowering default ping interval to 30s #26353

Open
subwaycookiecrunch wants to merge 4 commits into jenkinsci:master from subwaycookiecrunch:fix-ping-interval

Conversation

@subwaycookiecrunch subwaycookiecrunch commented Feb 22, 2026

Fixes #26338

Testing done

  • Ran the existing ChannelPingerTest suite, which verifies that the constants are injected correctly into the Channel setup callbacks.
  • Verified that decreasing PING_INTERVAL_SECONDS_DEFAULT to 30 automatically applies at runtime unless manually overridden by the deprecated hudson.slaves.ChannelPinger.pingInterval property.
  • This change only adjusts a single static default integer that controls the keep-alive interval. No logic has been altered, so the existing automated tests that cover the channel lifecycle are sufficient.
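
The interaction between the default and the override properties can be sketched roughly as follows. This is a minimal illustration, not the ChannelPinger source: the class and method names are hypothetical, and the precedence applied when both properties are set (the deprecated minutes property wins) is an assumption of this sketch. Only the two property names and the 30-second value come from this PR.

```java
import java.util.Properties;

// Hypothetical helper, not part of Jenkins core.
final class PingIntervalResolver {
    // Property names as quoted in this PR.
    static final String SECONDS_PROPERTY = "hudson.slaves.ChannelPinger.pingIntervalSeconds";
    static final String DEPRECATED_MINUTES_PROPERTY = "hudson.slaves.ChannelPinger.pingInterval";

    /**
     * Returns the effective ping interval in seconds.
     * Assumption of this sketch: the deprecated minutes property, if set,
     * overrides the seconds property; otherwise the seconds property is used,
     * and the supplied default applies only when neither is set.
     */
    public static int resolveIntervalSeconds(Properties props, int defaultSeconds) {
        String minutes = props.getProperty(DEPRECATED_MINUTES_PROPERTY);
        if (minutes != null) {
            return Integer.parseInt(minutes.trim()) * 60;
        }
        String seconds = props.getProperty(SECONDS_PROPERTY);
        if (seconds != null) {
            return Integer.parseInt(seconds.trim());
        }
        return defaultSeconds;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Nothing set: the proposed default of 30 seconds applies.
        System.out.println(resolveIntervalSeconds(props, 30));

        // Deprecated property expressed in minutes: 5 minutes becomes 300 seconds.
        props.setProperty(DEPRECATED_MINUTES_PROPERTY, "5");
        System.out.println(resolveIntervalSeconds(props, 30));
    }
}
```

Under these assumptions, an installation that sets neither property picks up the new default, while one that has pinned a value keeps it.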

Screenshots (UI changes only)

Before

After

Proposed changelog entries

  • Decrease the default TCP agent connection ChannelPinger interval from 5 minutes to 30 seconds to prevent reverse proxies and load balancers from dropping idle connections.

Proposed changelog category

/label bug

Proposed upgrade guidelines

N/A

Submitter checklist

  • The issue, if it exists, is well-described.
  • The changelog entries and upgrade guidelines are appropriate for the audience affected by the change (users or developers, depending on the change) and are in the imperative mood (see examples). Fill in the Proposed upgrade guidelines section only if there are breaking changes or changes that may require extra steps from users during upgrade.
  • There is automated testing or an explanation as to why this change has no tests.
  • New public classes, fields, and methods are annotated with @Restricted or have @since TODO Javadocs, as appropriate.
  • New deprecations are annotated with @Deprecated(since = "TODO") or @Deprecated(forRemoval = true, since = "TODO"), if applicable.
  • UI changes do not introduce regressions when enforcing the current default rules of Content Security Policy Plugin. In particular, new or substantially changed JavaScript is not defined inline and does not call eval to ease future introduction of Content Security Policy (CSP) directives (see documentation).
  • For dependency updates, there are links to external changelogs and, if possible, full differentials.
  • For new APIs and extension points, there is a link to at least one consumer.

Desired reviewers

@mention

Before the changes are marked as ready-for-merge:

Maintainer checklist

  • There are at least two (2) approvals for the pull request and no outstanding requests for change.
  • Conversations in the pull request are over, or it is explicit that a reviewer is not blocking the change.
  • Changelog entries in the pull request title and/or Proposed changelog entries are accurate, human-readable, and in the imperative mood.
  • Proper changelog labels are set so that the changelog can be generated automatically.
  • If the change needs additional upgrade steps from users, the upgrade-guide-needed label is set and there is a Proposed upgrade guidelines section in the pull request title (see example).
  • If it would make sense to backport the change to LTS, the change must be a Bug or Improvement, and either the issue or pull request must be labeled as lts-candidate to be considered.

@comment-ops-bot added the bug label (For changelog: Minor bug. Will be listed after features) on Feb 22, 2026
Contributor

@MarkEWaite MarkEWaite left a comment

Please explain how you duplicated the issue described in issue #26338 and why this helps. Changing the timeout is unlikely to alter a connection error.

@subwaycookiecrunch
Author

Hi @MarkEWaite, thanks for reviewing!

To clarify right away: this PR does not change the connection timeout (PING_TIMEOUT_SECONDS_DEFAULT is intentionally left untouched at 4 minutes). Instead, this PR only changes the ping interval (PING_INTERVAL_SECONDS_DEFAULT) from 5 minutes (300 seconds) down to 30 seconds.

Why this helps and prevents the Connection Error: Issue #26338 is caused by network intermediaries (like AWS NLB or Nginx Ingress) silently dropping the TCP connection due to their own idle timeouts.

For example, Nginx has a default proxy-read-timeout of 60 seconds. If a Jenkins agent is idle, no log or build data is sent. With the old default ChannelPinger interval of 5 minutes (300 seconds), the agent remains completely silent on the wire for 5 minutes.

  • At the 60-second mark of silence, Nginx closes the connection due to inactivity.
  • At the 300-second mark, the agent finally attempts to send its PingThread$Ping.
  • Because Nginx already dropped the session, it immediately replies with a TCP RST.
  • The agent receives this and logs java.io.IOException: Connection reset and SEVERE: Connection error has occurred.

By reducing the ping interval from 300 seconds to 30 seconds, we guarantee that keep-alive traffic is sent frequently enough to prevent these proxies from purging the connection from their state tables.
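
The timeline above can be expressed as a small check. This is a simplified model under stated assumptions: pings are the only traffic on an idle connection, latency is ignored, and an interval equal to the proxy timeout is treated as unsafe. The class and method names are hypothetical and do not exist in Jenkins or Nginx.

```java
// Hypothetical model of an idle agent connection behind a proxy with an idle timeout.
final class IdleTimeoutCheck {
    /**
     * True if pings arrive often enough that the proxy never sees
     * a silent period as long as its idle timeout.
     */
    public static boolean survivesIdle(int pingIntervalSeconds, int proxyIdleTimeoutSeconds) {
        return pingIntervalSeconds > 0 && pingIntervalSeconds < proxyIdleTimeoutSeconds;
    }

    /**
     * Seconds of idleness after which the agent first notices the dropped
     * connection (its first ping after the proxy gave up), or -1 if the
     * connection survives in this model.
     */
    public static int firstResetObservedAtSeconds(int pingIntervalSeconds, int proxyIdleTimeoutSeconds) {
        if (pingIntervalSeconds <= 0) {
            throw new IllegalArgumentException("ping interval must be positive");
        }
        return survivesIdle(pingIntervalSeconds, proxyIdleTimeoutSeconds) ? -1 : pingIntervalSeconds;
    }

    public static void main(String[] args) {
        // Old default (300 s) behind a 60 s idle timeout: dropped at 60 s, noticed at 300 s.
        System.out.println(survivesIdle(300, 60));                // false
        System.out.println(firstResetObservedAtSeconds(300, 60)); // 300
        // Proposed default (30 s) behind the same proxy: the connection stays alive.
        System.out.println(survivesIdle(30, 60));                 // true
    }
}
```

The model only captures the argument made here, namely that the ping interval has to stay below the smallest idle timeout on the path.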

(Note: this 30-second interval also matches the default of the jenkins.websocket.pingInterval property used by the Jenkins WebSocket transport, which makes the keep-alive behavior consistent across protocols.)

How it was duplicated: While I relied on the highly detailed packet captures (pcaps) provided by the original reporter in #26338 to verify the exact TCP sequence, this behavior is completely reproducible by placing a Jenkins agent behind an Nginx reverse proxy with the default 60s idle timeout. If you let the agent sit completely idle (no jobs, no monitoring data), the agent will consistently drop with a Connection reset exactly when it tries to send its first packet after the proxy's silence limit.

Let me know if this makes sense or if you prefer this to be handled via a configurable annotation rather than changing the default!

@mawinter69
Contributor

I don't consider this necessary. The ping interval (in seconds) can already be adjusted via a system property.

@subwaycookiecrunch
Author

@mawinter69 Yeah, it's true you can override it with -Dhudson.slaves.ChannelPinger.pingIntervalSeconds, but I really think the default itself needs to drop.

The 5-minute default made sense back when a lot of Jenkins setups were just raw TCP sockets on a local network, but these days almost everyone is running behind an AWS NLB or a K8s Nginx ingress controller. Those proxies drop idle connections aggressively (nginx is 60s by default).

When that happens, the connection resets silently. So for most new Jenkins setups out-of-the-box, agents just randomly drop with Connection reset errors when they're idle, and the admin has no idea they need to dig up an undocumented system property just to keep them alive.

Also, it looks like jenkins.websocket.pingInterval was recently added and it defaults to 30s for this exact reason. Lowering the TCP pinger to 30s just makes them consistent so people don't have to tune different properties depending on what transport they use.

@andreahlert andreahlert left a comment

The explanation for interval vs timeout and the proxy scenario is clear.

One open question: do we want to change the default for all installations, or keep 5 min and document the proxy/idle-timeout issue plus the hudson.slaves.ChannelPinger.pingIntervalSeconds property (e.g. in troubleshooting or release notes)?

Changing the default helps out-of-the-box behind NLB/ingress but increases ping traffic for on-prem/direct-TCP setups. Would be good to align with @MarkEWaite and others on that before merging.

@subwaycookiecrunch
Author

@andreahlert that's a fair point. I'd argue that changing the default still makes more sense than just documenting the property, mainly because:

  • Most people hitting this won't know the system property exists, and the failure mode (silent connection drop after idle period) is really confusing to debug if you don't know what you're looking for.
  • The extra ping traffic is minimal: one small keep-alive packet every 30s per agent. Even with hundreds of agents on a direct-TCP setup that's basically nothing.
  • The WebSocket transport already defaults to 30s for jenkins.websocket.pingInterval, so having the TCP pinger at 5 minutes is just inconsistent.
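
To put a rough number on the extra traffic, here is a back-of-the-envelope sketch. The fleet size of 500 agents is an illustrative assumption, the class name is hypothetical, and the count covers ping requests only, not replies or packet sizes.

```java
// Hypothetical estimate of controller-wide ping volume for a fleet of idle agents.
final class PingTrafficEstimate {
    /** Number of pings sent per hour across all agents at the given interval. */
    public static long pingsPerHour(int agents, int intervalSeconds) {
        if (intervalSeconds <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return (long) agents * 3600 / intervalSeconds;
    }

    public static void main(String[] args) {
        int agents = 500; // illustrative fleet size, not taken from the issue
        System.out.println(pingsPerHour(agents, 300)); // 6000 with the 5-minute default
        System.out.println(pingsPerHour(agents, 30));  // 60000 with the proposed 30 s default
    }
}
```

That is a tenfold increase in ping count, which for 500 agents still averages under 17 pings per second across the whole fleet.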

That said, I'm totally fine waiting for @MarkEWaite to weigh in before this goes anywhere. If the preference is to keep 5 minutes and just document it better, I can close this and open a docs PR instead. No strong feelings either way; I just think the default change is more practical for most setups.

Labels

bug For changelog: Minor bug. Will be listed after features

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remote agents connection problems. SEVERE: Connection error has occurred

4 participants