
Conversation

@hzhaop

@hzhaop hzhaop commented Oct 24, 2025


This commit addresses two issues related to gRPC connection stability and recovery.

1.  **Half-open connections:** In unstable network environments, the agent could encounter half-open TCP connections where the server side has closed the connection but the client side still holds it open. This would cause the send-queue to grow indefinitely without automatic recovery. To resolve this, the change introduces gRPC keepalive probes: the agent now sends keepalive pings to the collector, ensuring that dead connections are detected and pruned in a timely manner. Two new configuration parameters, `collector.grpc_keepalive_time` and `collector.grpc_keepalive_timeout`, have been added to control this behavior (a configuration sketch follows after this list).

2.  **Reconnect logic:** The existing reconnection logic did not immediately re-establish a connection if the same backend instance was selected during a reconnect attempt. This could lead to a delay of up to an hour before the connection was re-established. The logic has been updated to ensure that the channel is always shut down and recreated, forcing an immediate reconnection attempt regardless of which backend is selected.
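
For context on item 1, here is a minimal sketch, not the agent's actual channel factory, of how the two new parameters would typically map onto gRPC Java's keepalive options. The class name, the second-based units, and the use of `keepAliveWithoutCalls` are assumptions of this sketch:

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class KeepaliveChannelSketch {

    /**
     * Builds a channel with client-side keepalive enabled. The two time values
     * correspond to collector.grpc_keepalive_time and collector.grpc_keepalive_timeout
     * (assumed here to be expressed in seconds).
     */
    public static ManagedChannel build(String host, int port,
                                       long keepaliveTimeSeconds,
                                       long keepaliveTimeoutSeconds) {
        return ManagedChannelBuilder.forAddress(host, port)
                .usePlaintext()
                // Send an HTTP/2 PING after this much idle time on the connection...
                .keepAliveTime(keepaliveTimeSeconds, TimeUnit.SECONDS)
                // ...and declare the connection dead if the PING ack does not arrive in time.
                .keepAliveTimeout(keepaliveTimeoutSeconds, TimeUnit.SECONDS)
                // Probe even when no RPC is in flight, so a half-open link is still detected
                // (whether the agent enables this is an assumption of this sketch).
                .keepAliveWithoutCalls(true)
                .build();
    }
}
```
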
@wu-sheng
Member

I am confused about this. In our test, the agent reconnected quickly and automatically when the server rebooted.
Why did it take so long on your side?
If nothing changed, there is no point in creating a new channel.

@hzhaop
Author

hzhaop commented Oct 24, 2025

The scenario you mentioned, where the agent quickly reconnects after a server reboot, typically occurs when the server shuts down cleanly, allowing TCP connections to terminate properly.

However, the problem we encountered primarily arises in unstable network environments, leading to TCP connections entering a half-open state. In such situations:

  1. The server-side connection is terminated, but the client still believes the connection is alive. This causes the client's send-Q to continuously accumulate data, and the agent remains unaware that the connection has become invalid, thus not triggering an automatic reconnection.

  2. The role of gRPC keepalive: introducing keepalive is precisely intended to actively detect these half-open connections. By periodically sending keepalive pings, the agent can promptly discover connections that are actually dead but still perceived as alive by the client, forcibly close them, and initiate the reconnection process (see the sketch after this list).
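
For what it's worth, a rough sketch, not taken from the agent's channel manager, of how a keepalive failure becomes visible to the client: when a ping goes unanswered, gRPC Java tears down the transport and the channel leaves the READY state, which a state-change callback can use to kick off recovery. `onConnectionBroken` is a hypothetical hook:

```java
import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;

public class ConnectionWatcherSketch {

    /** Re-registers on every state change and reacts when the channel degrades. */
    public static void watch(ManagedChannel channel, Runnable onConnectionBroken) {
        ConnectivityState current = channel.getState(false);
        channel.notifyWhenStateChanged(current, () -> {
            ConnectivityState next = channel.getState(false);
            // An unanswered keepalive PING closes the transport, so the channel leaves
            // READY even though no RPC was in flight at the time.
            if (next == ConnectivityState.TRANSIENT_FAILURE || next == ConnectivityState.IDLE) {
                onConnectionBroken.run();
            }
            if (next != ConnectivityState.SHUTDOWN) {
                watch(channel, onConnectionBroken); // keep listening for further changes
            }
        });
    }
}
```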

Regarding your point, "If nothing changed, there is no point to create a new channel":

  • Change in connection state: Even if the target backend address remains unchanged, the internal state of the previous connection is corrupted due to its half-open status. In this scenario, simply reusing the old channel is ineffective, as it cannot recover.

  • Necessity of forced reconnection: We observed that after keepalive detected a connection failure, if the agent then selected the same backend, the original reconnection logic would not immediately force a new channel to be established; instead, it would wait for a long period (approximately one hour) before attempting to reconnect. Modifying the reconnection logic so that, upon detecting a connection failure, the old channel is forcibly closed and a new channel is created, regardless of which backend is selected, is therefore crucial for timely connection recovery and for preventing prolonged service interruptions. A sketch of this forced-reconnect flow follows below.
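
To make that concrete, a minimal sketch of the forced-reconnect idea, reusing the hypothetical `KeepaliveChannelSketch.build(...)` helper from the earlier comment; the agent's real reconnect path lives in its channel manager and is not shown here:

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;

public class ForcedReconnectSketch {

    /**
     * Always tears down the old channel and builds a fresh one, even when the newly
     * selected backend is the same address as before; a channel whose underlying TCP
     * connection is half-open cannot be revived by simply reusing it.
     */
    public static ManagedChannel reconnect(ManagedChannel oldChannel,
                                           String host, int port,
                                           long keepaliveTimeSeconds,
                                           long keepaliveTimeoutSeconds) throws InterruptedException {
        if (oldChannel != null) {
            oldChannel.shutdownNow();
            // Bound the wait so a stuck teardown cannot block the reconnect path.
            oldChannel.awaitTermination(5, TimeUnit.SECONDS);
        }
        return KeepaliveChannelSketch.build(host, port, keepaliveTimeSeconds, keepaliveTimeoutSeconds);
    }
}
```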
