Commit 36d351d

craig[bot] and xuchef committed
Merge #150869
150869: rpc: add tcp_rtt and tcp_rtt_var metrics for gRPC r=cthumuluru-crdb a=xuchef

### `util/sysutil: add functions for getting TCP round-trip time and variance`

Previously, we relied on timing heartbeats in the RPC layer to estimate network latency (rpc.connection.avg_round_trip_latency). However, because these measurements are computed within cockroach, they can be confounded by CPU-heavy workloads.

Luckily, Linux maintains the smoothed round-trip time (SRTT) and RTT variance for each TCP socket it opens. As kernel-computed metrics, these are less sensitive to CPU overload.

This patch adds a sysutil function for obtaining SRTT and RTT variance given a net.TCPConn pointer.

Part of: #149959

Release note: None

### `rpc: add tcp_rtt and tcp_rtt_var metrics for gRPC`

Previously, our only metric for gauging network latency was rpc.connection.avg_round_trip_latency. This metric was calculated by timing heartbeats in the RPC layer. However, because these measurements are computed within cockroach, they can be confounded by CPU-heavy workloads.

Through escalations, we've found that elevated network latencies (outside of CRDB's control) can severely degrade cluster performance. So, being able to directly and accurately identify these cases would be helpful.

To address this, this patch introduces two new metrics whose values are computed by Linux. As kernel-computed metrics, these are less sensitive to CPU overload:

1. rpc.connection.tcp_rtt: TCP smoothed round-trip time
2. rpc.connection.tcp_rtt_var: TCP round-trip time variance

Since these metrics are internally aggregated by Linux, we only need to sample them periodically. We update them in the heartbeat loop, at the same cadence as the original avg_round_trip_latency.

To obtain these metrics, we need access to the underlying *net.TCPConn of our gRPC peer connection. So, the dial function we pass to gRPC has been modified to update the tcpConn field of the peer struct on each network dial.

Part of: #149959

Release note: None

Co-authored-by: Michael Xu <[email protected]>
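The sysutil helper itself is not shown on this page. As a rough illustration of the idea in the commit message, the sketch below reads the kernel's SRTT and RTT variance for a `*net.TCPConn` through `getsockopt(TCP_INFO)`. The names `tcpRTT` and `sampleLoopbackRTT` are hypothetical, the code uses only the standard `syscall` package, and it assumes Linux on a platform that exposes `SYS_GETSOCKOPT` (for example amd64 or arm64). The actual implementation in the patch may differ.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
	"time"
	"unsafe"
)

// tcpRTT returns the kernel's smoothed RTT and RTT variance for conn.
// Linux-only sketch: it reads struct tcp_info via getsockopt(TCP_INFO).
func tcpRTT(conn *net.TCPConn) (time.Duration, time.Duration, error) {
	raw, err := conn.SyscallConn()
	if err != nil {
		return 0, 0, err
	}
	var info syscall.TCPInfo
	size := uint32(unsafe.Sizeof(info))
	var sockErr error
	ctrlErr := raw.Control(func(fd uintptr) {
		_, _, errno := syscall.Syscall6(
			syscall.SYS_GETSOCKOPT, fd,
			uintptr(syscall.SOL_TCP), uintptr(syscall.TCP_INFO),
			uintptr(unsafe.Pointer(&info)), uintptr(unsafe.Pointer(&size)), 0,
		)
		if errno != 0 {
			sockErr = errno
		}
	})
	if ctrlErr != nil {
		return 0, 0, ctrlErr
	}
	if sockErr != nil {
		return 0, 0, sockErr
	}
	// The kernel reports both fields in microseconds.
	return time.Duration(info.Rtt) * time.Microsecond,
		time.Duration(info.Rttvar) * time.Microsecond, nil
}

// sampleLoopbackRTT dials a throwaway loopback listener and returns the
// sampled SRTT and RTT variance in nanoseconds.
func sampleLoopbackRTT() (int64, int64, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, 0, err
	}
	defer ln.Close()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return 0, 0, err
	}
	defer conn.Close()

	rtt, rttVar, err := tcpRTT(conn.(*net.TCPConn))
	if err != nil {
		return 0, 0, err
	}
	return rtt.Nanoseconds(), rttVar.Nanoseconds(), nil
}

func main() {
	rtt, rttVar, err := sampleLoopbackRTT()
	fmt.Println(rtt, rttVar, err)
}
```

Because the kernel keeps these values up to date for every socket, a caller only needs to read them occasionally rather than time each request itself.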
2 parents a491fd9 + 3d8c5ca commit 36d351d

14 files changed

+386
-21
lines changed

docs/generated/metrics/metrics.yaml

Lines changed: 38 additions & 0 deletions
@@ -198,6 +198,10 @@ layers:
   description: |
     Sum of exponentially weighted moving average of round-trip latencies, as measured through a gRPC RPC.

+    Since this metric is based on gRPC RPCs, it is affected by application-level
+    processing delays and CPU overload effects. See rpc.connection.tcp_rtt for a
+    metric that is obtained from the kernel's TCP stack.
+
     Dividing this Gauge by rpc.connection.healthy gives an approximation of average
     latency, but the top-level round-trip-latency histogram is more useful. Instead,
     users should consult the label families of this metric if they are available
@@ -265,6 +269,40 @@ layers:
   derivative: NON_NEGATIVE_DERIVATIVE
   how_to_use: See Description.
   essential: true
+- name: rpc.connection.tcp_rtt
+  exported_name: rpc_connection_tcp_rtt
+  description: |
+    Kernel-level TCP round-trip time as measured by the Linux TCP stack.
+
+    This metric reports the smoothed round-trip time (SRTT) as maintained by the
+    kernel's TCP implementation. Unlike application-level RPC latency measurements,
+    this reflects pure network latency and is less affected by CPU overload effects.
+
+    This metric is only available on Linux.
+  y_axis_label: Latency
+  type: GAUGE
+  unit: NANOSECONDS
+  aggregation: AVG
+  derivative: NONE
+  how_to_use: High TCP RTT values indicate network issues outside of CockroachDB that could be impacting the user's workload.
+  essential: true
+- name: rpc.connection.tcp_rtt_var
+  exported_name: rpc_connection_tcp_rtt_var
+  description: |
+    Kernel-level TCP round-trip time variance as measured by the Linux TCP stack.
+
+    This metric reports the smoothed round-trip time variance (RTTVAR) as maintained
+    by the kernel's TCP implementation. This measures the stability of the
+    connection latency.
+
+    This metric is only available on Linux.
+  y_axis_label: Latency Variance
+  type: GAUGE
+  unit: NANOSECONDS
+  aggregation: AVG
+  derivative: NONE
+  how_to_use: High TCP RTT variance values indicate network stability issues outside of CockroachDB that could be impacting the user's workload.
+  essential: true
 - name: rpc.connection.unhealthy
   exported_name: rpc_connection_unhealthy
   description: Gauge of current connections in an unhealthy state (not bidirectionally connected or heartbeating)

pkg/rpc/BUILD.bazel

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -62,6 +62,7 @@ go_library(
6262
"//pkg/util/netutil/addr",
6363
"//pkg/util/stop",
6464
"//pkg/util/syncutil",
65+
"//pkg/util/sysutil",
6566
"//pkg/util/timeutil",
6667
"//pkg/util/tracing",
6768
"//pkg/util/tracing/grpcinterceptor",

pkg/rpc/context.go

Lines changed: 42 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1390,12 +1390,23 @@ func (rpcCtx *Context) GRPCDialOptions(
13901390
// See the explanation on loopbackDialFn for an explanation about this.
13911391
transport = loopbackTransport
13921392
}
1393-
return rpcCtx.grpcDialOptionsInternal(ctx, target, class, transport)
1393+
// In other invokations of grpcDialOptionsInternal, we care about having a
1394+
// hook into each network dial so we can store the most recent TCP
1395+
// connection that we've dialed.
1396+
//
1397+
// Here, though, we don't currently care about the underlying TCP connection
1398+
// backing a gRPC channel so onNetworkDial is a no-op.
1399+
onNetworkDial := func(conn net.Conn) {}
1400+
return rpcCtx.grpcDialOptionsInternal(ctx, target, class, transport, onNetworkDial)
13941401
}
13951402

13961403
// grpcDialOptions produces dial options suitable for connecting to the given target and class.
13971404
func (rpcCtx *Context) grpcDialOptionsInternal(
1398-
ctx context.Context, target string, class rpcbase.ConnectionClass, transport transportType,
1405+
ctx context.Context,
1406+
target string,
1407+
class rpcbase.ConnectionClass,
1408+
transport transportType,
1409+
onNetworkDial onDialFunc,
13991410
) ([]grpc.DialOption, error) {
14001411
dialOpts, err := rpcCtx.dialOptsCommon(ctx, target, class)
14011412
if err != nil {
@@ -1404,7 +1415,7 @@ func (rpcCtx *Context) grpcDialOptionsInternal(
14041415

14051416
switch transport {
14061417
case tcpTransport:
1407-
netOpts, err := rpcCtx.dialOptsNetwork(ctx, target, class)
1418+
netOpts, err := rpcCtx.dialOptsNetwork(ctx, target, class, onNetworkDial)
14081419
if err != nil {
14091420
return nil, err
14101421
}
@@ -1549,10 +1560,27 @@ func (t *statsTracker) HandleConn(ctx context.Context, s stats.ConnStats) {
15491560
}
15501561
}
15511562

1563+
type onDialFunc func(conn net.Conn)
1564+
1565+
func (rpcCtx *Context) dialerWithCallback(
1566+
dialerFunc dialerFunc, onNetworkDial onDialFunc,
1567+
) dialerFunc {
1568+
return func(ctx context.Context, addr string) (net.Conn, error) {
1569+
conn, err := dialerFunc(ctx, addr)
1570+
if err != nil {
1571+
return nil, err
1572+
}
1573+
if onNetworkDial != nil {
1574+
onNetworkDial(conn)
1575+
}
1576+
return conn, nil
1577+
}
1578+
}
1579+
15521580
// dialOptsNetwork compute options used only for over-the-network RPC
15531581
// connections.
15541582
func (rpcCtx *Context) dialOptsNetwork(
1555-
ctx context.Context, target string, class rpcbase.ConnectionClass,
1583+
ctx context.Context, target string, class rpcbase.ConnectionClass, onNetworkDial onDialFunc,
15561584
) ([]grpc.DialOption, error) {
15571585
dialOpts, err := rpcCtx.dialOptsNetworkCredentials()
15581586
if err != nil {
@@ -1639,6 +1667,11 @@ func (rpcCtx *Context) dialOptsNetwork(
16391667
}
16401668
dialerFunc = dialer.dial
16411669
}
1670+
// Wrap the dial function with the callback that's been passed down so we
1671+
// have a hook into each network dial from higher up.
1672+
//
1673+
// This allows us to keep the peer's tcpConn up to date.
1674+
dialerFunc = rpcCtx.dialerWithCallback(dialerFunc, onNetworkDial)
16421675
dialOpts = append(dialOpts, grpc.WithContextDialer(dialerFunc))
16431676

16441677
// Don't retry on dial errors either, otherwise the onlyOnceDialer will get
@@ -1982,14 +2015,15 @@ func (rpcCtx *Context) grpcDialRaw(
19822015
ctx context.Context,
19832016
target string,
19842017
class rpcbase.ConnectionClass,
2018+
onNetworkDial onDialFunc,
19852019
additionalOpts ...grpc.DialOption,
19862020
) (*grpc.ClientConn, error) {
19872021
transport := tcpTransport
19882022
if rpcCtx.ContextOptions.AdvertiseAddr == target && !rpcCtx.ClientOnly {
19892023
// See the explanation on loopbackDialFn for an explanation about this.
19902024
transport = loopbackTransport
19912025
}
1992-
dialOpts, err := rpcCtx.grpcDialOptionsInternal(ctx, target, class, transport)
2026+
dialOpts, err := rpcCtx.grpcDialOptionsInternal(ctx, target, class, transport, onNetworkDial)
19932027
if err != nil {
19942028
return nil, err
19952029
}
@@ -2190,7 +2224,7 @@ type Dialbacker interface {
21902224
GRPCUnvalidatedDial(string, roachpb.Locality) *GRPCConnection
21912225
GRPCDialNode(string, roachpb.NodeID, roachpb.Locality, rpcbase.ConnectionClass) *GRPCConnection
21922226
grpcDialRaw(
2193-
context.Context, string, rpcbase.ConnectionClass, ...grpc.DialOption,
2227+
context.Context, string, rpcbase.ConnectionClass, onDialFunc, ...grpc.DialOption,
21942228
) (*grpc.ClientConn, error)
21952229
wrapCtx(
21962230
ctx context.Context, target string, remoteNodeID roachpb.NodeID, class rpcbase.ConnectionClass,
@@ -2266,7 +2300,8 @@ func VerifyDialback(
22662300
// A throwaway connection keeps it simple.
22672301
ctx := rpcCtx.wrapCtx(ctx, target, request.OriginNodeID, rpcbase.SystemClass)
22682302
ctx = logtags.AddTag(ctx, "dialback", nil)
2269-
conn, err := rpcCtx.grpcDialRaw(ctx, target, rpcbase.SystemClass, grpc.WithBlock())
2303+
onNetworkDial := func(conn net.Conn) {}
2304+
conn, err := rpcCtx.grpcDialRaw(ctx, target, rpcbase.SystemClass, onNetworkDial, grpc.WithBlock())
22702305
if conn != nil { // NB: the nil check simplifies mocking in TestVerifyDialback
22712306
_ = conn.Close() // nolint:grpcconnclose
22722307
}
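
The wrapping pattern used by `dialerWithCallback` in the diff above can be tried in isolation. The following is a minimal standalone sketch rather than the CockroachDB code: `dialFunc`, `withDialCallback`, and `countDials` are hypothetical names, and a plain `net.Dialer` against a loopback listener stands in for the real dialer that gets handed to `grpc.WithContextDialer`.

```go
package main

import (
	"context"
	"fmt"
	"net"
)

// dialFunc has the same shape as the function accepted by
// grpc.WithContextDialer.
type dialFunc func(ctx context.Context, addr string) (net.Conn, error)

// withDialCallback returns a dialer that invokes onDial after every
// successful dial. A nil callback is tolerated.
func withDialCallback(dial dialFunc, onDial func(net.Conn)) dialFunc {
	return func(ctx context.Context, addr string) (net.Conn, error) {
		conn, err := dial(ctx, addr)
		if err != nil {
			return nil, err
		}
		if onDial != nil {
			onDial(conn)
		}
		return conn, nil
	}
}

// countDials dials a throwaway loopback listener n times through the wrapped
// dialer and reports how often the callback fired.
func countDials(n int) (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	var d net.Dialer
	calls := 0
	dial := withDialCallback(
		func(ctx context.Context, addr string) (net.Conn, error) {
			return d.DialContext(ctx, "tcp", addr)
		},
		func(net.Conn) { calls++ },
	)
	for i := 0; i < n; i++ {
		conn, err := dial(context.Background(), ln.Addr().String())
		if err != nil {
			return calls, err
		}
		_ = conn.Close()
	}
	return calls, nil
}

func main() {
	calls, err := countDials(3)
	fmt.Println(calls, err)
}
```

Since the callback runs once per successful dial, a reconnect replaces whatever the callback stored earlier, which is what keeps the stored connection pointing at the most recent one.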

pkg/rpc/context_test.go

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1690,7 +1690,7 @@ func BenchmarkGRPCDial(b *testing.B) {
16901690

16911691
b.RunParallel(func(pb *testing.PB) {
16921692
for pb.Next() {
1693-
_, err := rpcCtx.grpcDialRaw(ctx, remoteAddr, rpcbase.DefaultClass)
1693+
_, err := rpcCtx.grpcDialRaw(ctx, remoteAddr, rpcbase.DefaultClass, nil /* onNetworkDial */)
16941694
if err != nil {
16951695
b.Fatal(err)
16961696
}
@@ -2052,8 +2052,8 @@ func TestVerifyDialback(t *testing.T) {
20522052
ctx context.Context, _ string, _ roachpb.NodeID, _ rpcbase.ConnectionClass) context.Context {
20532053
return ctx
20542054
})
2055-
mockRPCCtx.EXPECT().grpcDialRaw(gomock.Any() /* ctx */, "1.1.1.1", rpcbase.SystemClass, gomock.Any()).
2056-
DoAndReturn(func(context.Context, string, rpcbase.ConnectionClass, ...grpc.DialOption) (*grpc.ClientConn, error) {
2055+
mockRPCCtx.EXPECT().grpcDialRaw(gomock.Any() /* ctx */, "1.1.1.1", rpcbase.SystemClass, gomock.Any() /* onDialFunc */, gomock.Any()).
2056+
DoAndReturn(func(context.Context, string, rpcbase.ConnectionClass, onDialFunc, ...grpc.DialOption) (*grpc.ClientConn, error) {
20572057
if dialbackOK {
20582058
return nil, nil
20592059
}
@@ -2088,8 +2088,8 @@ func TestVerifyDialback(t *testing.T) {
20882088
ctx context.Context, _ string, _ roachpb.NodeID, _ rpcbase.ConnectionClass) context.Context {
20892089
return ctx
20902090
})
2091-
mockRPCCtx.EXPECT().grpcDialRaw(gomock.Any() /* ctx */, "1.1.1.1", rpcbase.SystemClass, gomock.Any()).
2092-
DoAndReturn(func(context.Context, string, rpcbase.ConnectionClass, ...grpc.DialOption) (*grpc.ClientConn, error) {
2091+
mockRPCCtx.EXPECT().grpcDialRaw(gomock.Any() /* ctx */, "1.1.1.1", rpcbase.SystemClass, gomock.Any() /* onDialFunc */, gomock.Any()).
2092+
DoAndReturn(func(context.Context, string, rpcbase.ConnectionClass, onDialFunc, ...grpc.DialOption) (*grpc.ClientConn, error) {
20932093
return nil, nil
20942094
})
20952095
require.NoError(t, VerifyDialback(context.Background(), mockRPCCtx, req, &PingResponse{}, roachpb.Locality{}, sv))
@@ -2297,7 +2297,7 @@ func BenchmarkGRPCPing(b *testing.B) {
22972297

22982298
cliRPCCtx := newTestContext(uuid.MakeV4(), clock, maxOffset, stopper)
22992299
cliRPCCtx.NodeID.Set(ctx, 2)
2300-
cc, err := cliRPCCtx.grpcDialRaw(ctx, remoteAddr, rpcbase.DefaultClass)
2300+
cc, err := cliRPCCtx.grpcDialRaw(ctx, remoteAddr, rpcbase.DefaultClass, nil /* onNetworkDial */)
23012301
require.NoError(b, err)
23022302

23032303
for _, tc := range []struct {

pkg/rpc/grpc.go

Lines changed: 22 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,10 +7,12 @@ package rpc
77

88
import (
99
"context"
10+
"net"
1011

1112
"github.com/cockroachdb/cockroach/pkg/kv/kvpb"
1213
"github.com/cockroachdb/cockroach/pkg/roachpb"
1314
"github.com/cockroachdb/cockroach/pkg/rpc/rpcbase"
15+
"github.com/cockroachdb/cockroach/pkg/util/log"
1416
"github.com/cockroachdb/cockroach/pkg/util/stop"
1517
"google.golang.org/grpc"
1618
"google.golang.org/grpc/connectivity"
@@ -63,7 +65,26 @@ func newGRPCPeerOptions(
6365
dial: func(ctx context.Context, target string, class rpcbase.ConnectionClass) (*grpc.ClientConn, error) {
6466
additionalDialOpts := []grpc.DialOption{grpc.WithStatsHandler(&statsTracker{lm})}
6567
additionalDialOpts = append(additionalDialOpts, rpcCtx.testingDialOpts...)
66-
return rpcCtx.grpcDialRaw(ctx, target, class, additionalDialOpts...)
68+
// onNetworkDial is a callback that is called after we dial a TCP connection.
69+
// It is not called if we use the loopback dialer.
70+
// We define it here because we need access to the peer map.
71+
onNetworkDial := func(conn net.Conn) {
72+
tcpConn, ok := conn.(*net.TCPConn)
73+
if !ok {
74+
return
75+
}
76+
77+
rpcCtx.peers.mu.Lock()
78+
defer rpcCtx.peers.mu.Unlock()
79+
p := rpcCtx.peers.mu.m[k]
80+
81+
p.mu.Lock()
82+
defer p.mu.Unlock()
83+
p.mu.tcpConn = tcpConn
84+
85+
log.VEventf(ctx, 2, "gRPC network dial: laddr=%v", tcpConn.LocalAddr())
86+
}
87+
return rpcCtx.grpcDialRaw(ctx, target, class, onNetworkDial, additionalDialOpts...)
6788
},
6889
connEquals: func(a, b *grpc.ClientConn) bool {
6990
return a == b
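
The `onNetworkDial` callback in this hunk does two things: it ignores connections that are not `*net.TCPConn`, and it stores the TCP connection under the peer's mutex. The sketch below shows that behaviour on a simplified, hypothetical `peer` type. The real peer struct and the peer map lookup are not reproduced here, and `hasTCPConn` and `demo` exist only for illustration.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// peer is a hypothetical stand-in for the RPC layer's peer struct.
type peer struct {
	mu struct {
		sync.Mutex
		tcpConn *net.TCPConn // most recently dialed TCP connection, if any
	}
}

// onNetworkDial records conn on the peer if it is a TCP connection and
// silently ignores any other kind of net.Conn.
func (p *peer) onNetworkDial(conn net.Conn) {
	tcpConn, ok := conn.(*net.TCPConn)
	if !ok {
		return
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	p.mu.tcpConn = tcpConn
}

// hasTCPConn reports whether a TCP connection has been recorded.
func (p *peer) hasTCPConn() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.mu.tcpConn != nil
}

// demo reports whether the peer holds a TCP connection after seeing an
// in-memory pipe, and again after seeing a real loopback TCP connection.
func demo() (bool, bool, error) {
	p := &peer{}

	c1, c2 := net.Pipe()
	defer c1.Close()
	defer c2.Close()
	p.onNetworkDial(c1)
	afterPipe := p.hasTCPConn()

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return afterPipe, false, err
	}
	defer ln.Close()
	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return afterPipe, false, err
	}
	defer conn.Close()
	p.onNetworkDial(conn)

	return afterPipe, p.hasTCPConn(), nil
}

func main() {
	afterPipe, afterTCP, err := demo()
	fmt.Println(afterPipe, afterTCP, err)
}
```

The type assertion is what makes the callback safe to install unconditionally: a non-TCP connection leaves the stored connection unchanged.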

pkg/rpc/metrics.go

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -128,6 +128,10 @@ Decommissioned peers are excluded.
128128
Unit: metric.Unit_NANOSECONDS,
129129
Help: `Sum of exponentially weighted moving average of round-trip latencies, as measured through a gRPC RPC.
130130
131+
Since this metric is based on gRPC RPCs, it is affected by application-level
132+
processing delays and CPU overload effects. See rpc.connection.tcp_rtt for a
133+
metric that is obtained from the kernel's TCP stack.
134+
131135
Dividing this Gauge by rpc.connection.healthy gives an approximation of average
132136
latency, but the top-level round-trip-latency histogram is more useful. Instead,
133137
users should consult the label families of this metric if they are available
@@ -142,6 +146,40 @@ is reset to zero.
142146
Category: metric.Metadata_NETWORKING,
143147
HowToUse: `This metric is helpful in understanding general network issues outside of CockroachDB that could be impacting the user’s workload.`,
144148
}
149+
150+
metaConnectionTCPRTT = metric.Metadata{
151+
Name: "rpc.connection.tcp_rtt",
152+
Unit: metric.Unit_NANOSECONDS,
153+
Help: `Kernel-level TCP round-trip time as measured by the Linux TCP stack.
154+
155+
This metric reports the smoothed round-trip time (SRTT) as maintained by the
156+
kernel's TCP implementation. Unlike application-level RPC latency measurements,
157+
this reflects pure network latency and is less affected by CPU overload effects.
158+
159+
This metric is only available on Linux.
160+
`,
161+
Measurement: "Latency",
162+
Essential: true,
163+
Category: metric.Metadata_NETWORKING,
164+
HowToUse: `High TCP RTT values indicate network issues outside of CockroachDB that could be impacting the user's workload.`,
165+
}
166+
167+
metaConnectionTCPRTTVar = metric.Metadata{
168+
Name: "rpc.connection.tcp_rtt_var",
169+
Unit: metric.Unit_NANOSECONDS,
170+
Help: `Kernel-level TCP round-trip time variance as measured by the Linux TCP stack.
171+
172+
This metric reports the smoothed round-trip time variance (RTTVAR) as maintained
173+
by the kernel's TCP implementation. This measures the stability of the
174+
connection latency.
175+
176+
This metric is only available on Linux.
177+
`,
178+
Measurement: "Latency Variance",
179+
Essential: true,
180+
Category: metric.Metadata_NETWORKING,
181+
HowToUse: `High TCP RTT variance values indicate network stability issues outside of CockroachDB that could be impacting the user's workload.`,
182+
}
145183
metaConnectionConnected = metric.Metadata{
146184
Name: "rpc.connection.connected",
147185
Help: `Counter of TCP level connected connections.
@@ -226,6 +264,8 @@ func newMetrics(locality roachpb.Locality) *Metrics {
226264
ConnectionBytesSent: aggmetric.NewCounter(metaNetworkBytesEgress, localityLabels...),
227265
ConnectionBytesRecv: aggmetric.NewCounter(metaNetworkBytesIngress, localityLabels...),
228266
ConnectionAvgRoundTripLatency: aggmetric.NewGauge(metaConnectionAvgRoundTripLatency, childLabels...),
267+
ConnectionTCPRTT: aggmetric.NewGauge(metaConnectionTCPRTT, childLabels...),
268+
ConnectionTCPRTTVar: aggmetric.NewGauge(metaConnectionTCPRTTVar, childLabels...),
229269
}
230270
m.mu.peerMetrics = make(map[string]peerMetrics)
231271
m.mu.localityMetrics = make(map[string]localityMetrics)
@@ -270,6 +310,8 @@ type Metrics struct {
270310
ConnectionBytesSent *aggmetric.AggCounter
271311
ConnectionBytesRecv *aggmetric.AggCounter
272312
ConnectionAvgRoundTripLatency *aggmetric.AggGauge
313+
ConnectionTCPRTT *aggmetric.AggGauge
314+
ConnectionTCPRTTVar *aggmetric.AggGauge
273315
mu struct {
274316
syncutil.Mutex
275317
// peerMetrics is a map of peerKey to peerMetrics.
@@ -318,6 +360,12 @@ type peerMetrics struct {
318360
// Updated on each successful heartbeat, reset (along with roundTripLatency)
319361
// after runHeartbeatUntilFailure returns.
320362
AvgRoundTripLatency *aggmetric.Gauge
363+
// TCP-level round trip time as measured by the kernel's TCP stack.
364+
// This provides network-level latency without application overhead.
365+
TCPRTT *aggmetric.Gauge
366+
// TCP-level round trip time variance as measured by the kernel's TCP stack.
367+
// This indicates connection stability and jitter.
368+
TCPRTTVar *aggmetric.Gauge
321369
// roundTripLatency is the source for the AvgRoundTripLatency gauge. We don't
322370
// want to maintain a full histogram per peer, so instead on each heartbeat we
323371
// update roundTripLatency and flush the result into AvgRoundTripLatency.
@@ -353,6 +401,8 @@ func (m *Metrics) acquire(k peerKey, l roachpb.Locality) (peerMetrics, localityM
353401
ConnectionHeartbeats: m.ConnectionHeartbeats.AddChild(labelVals...),
354402
ConnectionFailures: m.ConnectionFailures.AddChild(labelVals...),
355403
AvgRoundTripLatency: m.ConnectionAvgRoundTripLatency.AddChild(labelVals...),
404+
TCPRTT: m.ConnectionTCPRTT.AddChild(labelVals...),
405+
TCPRTTVar: m.ConnectionTCPRTTVar.AddChild(labelVals...),
356406
// We use a SimpleEWMA which uses the zero value to mean "uninitialized"
357407
// and operates on a ~60s decay rate.
358408
roundTripLatency: &ThreadSafeMovingAverage{ma: &ewma.SimpleEWMA{}},
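
The commit message notes that the kernel already aggregates these values, so the gauges only need to be overwritten with a fresh sample on each heartbeat. The heartbeat code is not part of the files shown on this page, so the sketch below only illustrates that sampling pattern. `rttSampler`, `gauges`, `onHeartbeat`, and `replay` are hypothetical names, `atomic.Int64` stands in for the aggmetric gauges, and keeping the previous value when sampling fails is an assumption rather than confirmed behaviour.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// rttSampler returns the kernel's current SRTT and RTT variance for a
// connection, or an error if they cannot be read (for example off Linux).
type rttSampler func() (rtt, rttVar time.Duration, err error)

// gauges is a hypothetical pair of nanosecond gauges.
type gauges struct {
	tcpRTT    atomic.Int64
	tcpRTTVar atomic.Int64
}

// onHeartbeat overwrites both gauges with the latest sample. No averaging is
// done here because the kernel has already smoothed the values.
// Assumption: a failed sample leaves the previous values in place.
func (g *gauges) onHeartbeat(sample rttSampler) {
	rtt, rttVar, err := sample()
	if err != nil {
		return
	}
	g.tcpRTT.Store(rtt.Nanoseconds())
	g.tcpRTTVar.Store(rttVar.Nanoseconds())
}

// replay simulates one heartbeat per entry of rttMicros and returns the final
// RTT gauge value in nanoseconds. A negative entry simulates a failed sample.
func replay(rttMicros []int64) int64 {
	var g gauges
	for _, us := range rttMicros {
		us := us
		g.onHeartbeat(func() (time.Duration, time.Duration, error) {
			if us < 0 {
				return 0, 0, errors.New("sample unavailable")
			}
			rtt := time.Duration(us) * time.Microsecond
			return rtt, rtt / 2, nil
		})
	}
	return g.tcpRTT.Load()
}

func main() {
	fmt.Println(replay([]int64{100, 250, -1})) // prints 250000
}
```

The microsecond-to-nanosecond conversion mirrors the metric metadata above, which declares both gauges in nanoseconds while Linux reports `tcp_info` RTT fields in microseconds.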

pkg/rpc/metrics_test.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -60,7 +60,7 @@ func TestMetricsRelease(t *testing.T) {
6060
return metricFields
6161
}
6262

63-
const expectedCount = 11
63+
const expectedCount = 13
6464
k1 := peerKey{NodeID: 5, TargetAddr: "192.168.0.1:1234", Class: rpcbase.DefaultClass}
6565
k2 := peerKey{NodeID: 6, TargetAddr: "192.168.0.1:1234", Class: rpcbase.DefaultClass}
6666
l1 := roachpb.Locality{Tiers: []roachpb.Tier{{Key: "region", Value: "us-east"}}}
