150869: rpc: add tcp_rtt and tcp_rtt_var metrics for gRPC r=cthumuluru-crdb a=xuchef
### `util/sysutil: add functions for getting TCP round-trip time and variance`
Previously, we relied on timing heartbeats in the RPC layer to estimate network
latency (rpc.connection.avg_round_trip_latency). However, because these
measurements are computed within cockroach, they can be confounded by CPU-heavy
workloads.
Luckily, Linux maintains the smoothed round-trip time (SRTT) and RTT variance
for each TCP socket it opens. As kernel-computed metrics, these are less
sensitive to CPU overload.
This patch adds a sysutil function for obtaining the SRTT and RTT variance of a
connection, given a *net.TCPConn.
Part of: #149959
Release note: None
### `rpc: add tcp_rtt and tcp_rtt_var metrics for gRPC`
Previously, our only metric for gauging network latency was
rpc.connection.avg_round_trip_latency. This metric was calculated by timing
heartbeats in the RPC layer. However, because these measurements are computed
within cockroach, they can be confounded by CPU-heavy workloads.
Through escalations, we've found that elevated network latencies (outside of
CRDB's control) can severely degrade cluster performance. So, being able to
directly and accurately identify these cases would be helpful.
To address this, this patch introduces two new metrics whose values are computed
by Linux. As kernel-computed metrics, these are less sensitive to CPU overload:
1. rpc.connection.tcp_rtt: TCP smoothed round-trip time
2. rpc.connection.tcp_rtt_var: TCP round-trip time variance
Since these metrics are internally aggregated by Linux, we only need to sample
them periodically. We update them in the heartbeat loop, at the same cadence as
the original avg_round_trip_latency.
To obtain these metrics, we need access to the underlying *net.TCPConn of our
gRPC peer connection. So, the dial function we pass to gRPC has been modified to
update the tcpConn field of the peer struct on each network dial.
Part of: #149959
Release note: None
Co-authored-by: Michael Xu <[email protected]>