Previously, our only metric for gauging network latency was
rpc.connection.avg_round_trip_latency. This metric was calculated by timing
heartbeats in the RPC layer. However, because these measurements are taken
within the cockroach process, they include its own scheduling and processing
delays and can be confounded by CPU-heavy workloads.
Through escalations, we've found that elevated network latencies (outside of
CRDB's control) can severely degrade cluster performance. So, being able to
directly and accurately identify these cases would be helpful.
To address this, this patch introduces two new metrics whose values are computed
by Linux. As kernel-computed metrics, these are less sensitive to CPU overload:
1. rpc.connection.tcp_rtt: TCP smoothed round-trip time
2. rpc.connection.tcp_rtt_var: TCP round-trip time variance
Since these metrics are internally aggregated by Linux, we only need to sample
them periodically. We update them in the heartbeat loop, at the same cadence as
the original avg_round_trip_latency.
To obtain these metrics, we need access to the underlying *net.TCPConn of our
gRPC peer connection. So, the dial function we pass to gRPC has been modified to
update the tcpConn field of the peer struct on each network dial.
Part of: #149959
Release note: None