
Conversation

@hzhaop

@hzhaop hzhaop commented Oct 24, 2025


This commit addresses two issues related to gRPC connection stability and recovery.

1.  **Half-open connections:** In unstable network environments, the agent could encounter half-open TCP connections where the server side has closed the connection but the client side still holds it open. This would cause the send-queue to grow indefinitely without automatic recovery. To resolve this, the change introduces gRPC keepalive probes: the agent now sends keepalive pings to the collector, ensuring that dead connections are detected and pruned in a timely manner. Two new configuration parameters, `collector.grpc_keepalive_time` and `collector.grpc_keepalive_timeout`, have been added to control this behavior (a configuration sketch follows after this list).

2.  **Reconnect logic:** The existing reconnection logic did not immediately re-establish a connection if the same backend instance was selected during a reconnect attempt. This could lead to a delay of up to an hour before the connection was re-established. The logic has been updated to ensure that the channel is always shut down and recreated, forcing an immediate reconnection attempt regardless of which backend is selected.
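
For context on item 1, here is a minimal sketch, not the agent's actual channel factory, of how the two new parameters would typically map onto gRPC Java's keepalive options. The class name, the second-based units, and the use of `keepAliveWithoutCalls` are assumptions of this sketch:

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class KeepaliveChannelSketch {

    /**
     * Builds a channel with client-side keepalive enabled. The two time values
     * correspond to collector.grpc_keepalive_time and collector.grpc_keepalive_timeout
     * (assumed here to be expressed in seconds).
     */
    public static ManagedChannel build(String host, int port,
                                       long keepaliveTimeSeconds,
                                       long keepaliveTimeoutSeconds) {
        return ManagedChannelBuilder.forAddress(host, port)
                .usePlaintext()
                // Send an HTTP/2 PING after this much idle time on the connection...
                .keepAliveTime(keepaliveTimeSeconds, TimeUnit.SECONDS)
                // ...and declare the connection dead if the PING ack does not arrive in time.
                .keepAliveTimeout(keepaliveTimeoutSeconds, TimeUnit.SECONDS)
                // Probe even when no RPC is in flight, so a half-open link is still detected
                // (whether the agent enables this is an assumption of this sketch).
                .keepAliveWithoutCalls(true)
                .build();
    }
}
```
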
@wu-sheng
Member

I am confused about this. In our test, the agent reconnected quickly and automatically when the server rebooted.
Why did it take so long on your side?
If nothing changed, there is no point in creating a new channel.

@hzhaop
Author

hzhaop commented Oct 24, 2025

The scenario you mentioned, where the agent quickly reconnects after a server reboot, typically occurs when the server shuts down cleanly, allowing TCP connections to terminate properly.

However, the problem we encountered primarily arises in unstable network environments, leading to TCP connections entering a half-open state. In such situations:

  1. The server-side connection is terminated, but the client still believes the connection is alive. This causes the client's send-Q to continuously accumulate data, and the agent remains unaware that the connection has become invalid, thus not triggering an automatic reconnection.

  2. The role of gRPC keepalive: introducing keepalive is precisely intended to actively detect these half-open connections. By periodically sending keepalive pings, the agent can promptly discover connections that are actually dead but still perceived as alive by the client, forcibly close them, and initiate the reconnection process (see the sketch after this list).
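
For what it's worth, a rough sketch, not taken from the agent's channel manager, of how a keepalive failure becomes visible to the client: when a ping goes unanswered, gRPC Java tears down the transport and the channel leaves the READY state, which a state-change callback can use to kick off recovery. `onConnectionBroken` is a hypothetical hook:

```java
import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;

public class ConnectionWatcherSketch {

    /** Re-registers on every state change and reacts when the channel degrades. */
    public static void watch(ManagedChannel channel, Runnable onConnectionBroken) {
        ConnectivityState current = channel.getState(false);
        channel.notifyWhenStateChanged(current, () -> {
            ConnectivityState next = channel.getState(false);
            // An unanswered keepalive PING closes the transport, so the channel leaves
            // READY even though no RPC was in flight at the time.
            if (next == ConnectivityState.TRANSIENT_FAILURE || next == ConnectivityState.IDLE) {
                onConnectionBroken.run();
            }
            if (next != ConnectivityState.SHUTDOWN) {
                watch(channel, onConnectionBroken); // keep listening for further changes
            }
        });
    }
}
```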

Regarding your point, "If nothing changed, there is no point to create a new channel":

  • Change in connection state: Even if the target backend address remains unchanged, the internal state of the previous connection is corrupted due to its half-open status. In this scenario, simply reusing the old channel is ineffective, as it cannot recover.

  • Necessity of forced reconnection: We observed that after keepalive detected a connection failure, if the agent then selected the same backend, the original reconnection logic would not immediately force a new channel to be established; instead, it would wait for a long period (approximately one hour) before attempting to reconnect. Modifying the reconnection logic so that, upon detecting a connection failure, the old channel is forcibly closed and a new channel is created, regardless of which backend is selected, is therefore crucial for timely connection recovery and for preventing prolonged service interruptions. A sketch of this forced-reconnect flow follows below.
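
To make that concrete, a minimal sketch of the forced-reconnect idea, reusing the hypothetical `KeepaliveChannelSketch.build(...)` helper from the earlier comment; the agent's real reconnect path lives in its channel manager and is not shown here:

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;

public class ForcedReconnectSketch {

    /**
     * Always tears down the old channel and builds a fresh one, even when the newly
     * selected backend is the same address as before; a channel whose underlying TCP
     * connection is half-open cannot be revived by simply reusing it.
     */
    public static ManagedChannel reconnect(ManagedChannel oldChannel,
                                           String host, int port,
                                           long keepaliveTimeSeconds,
                                           long keepaliveTimeoutSeconds) throws InterruptedException {
        if (oldChannel != null) {
            oldChannel.shutdownNow();
            // Bound the wait so a stuck teardown cannot block the reconnect path.
            oldChannel.awaitTermination(5, TimeUnit.SECONDS);
        }
        return KeepaliveChannelSketch.build(host, port, keepaliveTimeSeconds, keepaliveTimeoutSeconds);
    }
}
```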
