Commit dafaf1d

craig[bot] and tbg committed
Merge #149928
149928: status: add sys.host.net.send.tcp.retrans_segs (and friends) r=tbg a=tbg

`retrans_segs` tracks TCP segments retransmitted for any reason. Please refer to the help text of the metric, the (internal) [talk], and the excellent [post by Arthur Chiao].

[talk]: https://cockroachlabs.slack.com/archives/C05B1BPMJLE/p1752780299376659
[post by Arthur Chiao]: http://arthurchiao.art/blog/tcp-retransmission-may-be-misleading/

A number of additional metrics have been added. They all have in common that they break up the retrans_segs metric, i.e. they can possibly allow us to dig deeper.

These metrics are parsed from /proc/net/netstat (and /proc/net/snmp, via `gops`), so they're only supported on linux. When not supported, these counters will remain at a value of -1.

Fixes #149598.
Epic: none

Release note: the `sys.host.net.send.tcp.retrans_segs` metric has been added, alongside a number of additional TCP metrics. These can be useful to diagnose network issues.

Co-authored-by: Tobias Grieger <[email protected]>
2 parents fccbd7e + 04e5627 commit dafaf1d

4 files changed: +520 -99 lines changed

docs/generated/metrics/metrics.yaml

Lines changed: 80 additions & 0 deletions
@@ -9981,6 +9981,86 @@ layers:
 derivative: NON_NEGATIVE_DERIVATIVE
 how_to_use: This metric measures the length of time, in seconds, that the CockroachDB process has been running. Monitor this metric to detect events such as node restarts, which may require investigation or intervention.
 essential: true
+- name: NETWORKING
+metrics:
+- name: sys.host.net.send.tcp.fast_retrans_segs
+exported_name: sys_host_net_send_tcp_fast_retrans_segs
+description: |-
+Segments retransmitted due to the fast retransmission mechanism in TCP.
+Fast retransmissions occur when the sender learns that intermediate segments have been lost.
+y_axis_label: Segments
+type: COUNTER
+unit: COUNT
+aggregation: AVG
+derivative: NON_NEGATIVE_DERIVATIVE
+- name: sys.host.net.send.tcp.loss_probes
+exported_name: sys_host_net_send_tcp_loss_probes
+description: |2-
+
+Number of TCP tail loss probes sent. Loss probes are an optimization to detect
+loss of the last packet earlier than the retransmission timer, and can indicate
+network issues. Tail loss probes are aggressive, so the base rate is often nonzero
+even in healthy networks.
+y_axis_label: Probes
+type: COUNTER
+unit: COUNT
+aggregation: AVG
+derivative: NON_NEGATIVE_DERIVATIVE
+- name: sys.host.net.send.tcp.retrans_segs
+exported_name: sys_host_net_send_tcp_retrans_segs
+description: |2
+
+The number of TCP segments retransmitted across all network interfaces.
+This can indicate packet loss occurring in the network. However, it can
+also be caused by recipient nodes not consuming packets in a timely manner,
+or the local node overflowing its outgoing buffers, for example due to overload.
+
+Retransmissions also occur in the absence of problems, as modern TCP stacks
+err on the side of aggressively retransmitting segments.
+
+The linux tool 'ss -i' can show the Linux kernel's smoothed view of round-trip
+latency and variance on a per-connection basis. Additionally, 'netstat -s'
+shows all TCP counters maintained by the kernel.
+y_axis_label: Segments
+type: COUNTER
+unit: COUNT
+aggregation: AVG
+derivative: NON_NEGATIVE_DERIVATIVE
+how_to_use: |2
+
+Phase changes, especially when occurring on groups of nodes, can indicate packet
+loss in the network or a slow consumer of packets. On slow consumers, the
+'sys.host.net.rcvd.drop' metric may be elevated; on overloaded senders, it
+is worth checking the 'sys.host.net.send.drop' metric.
+Additionally, the 'sys.host.net.send.tcp.*' may provide more insight into the
+specific type of retransmission.
+essential: true
+- name: sys.host.net.send.tcp.slow_start_retrans
+exported_name: sys_host_net_send_tcp_slow_start_retrans
+description: |2
+
+Number of TCP retransmissions in slow start. This can indicate that the network
+is unable to support the initial fast ramp-up in window size, and can be a sign
+of packet loss or congestion.
+y_axis_label: Segments
+type: COUNTER
+unit: COUNT
+aggregation: AVG
+derivative: NON_NEGATIVE_DERIVATIVE
+- name: sys.host.net.send.tcp_timeouts
+exported_name: sys_host_net_send_tcp_timeouts
+description: |2
+
+Number of TCP retransmission timeouts. These typically imply that a packet has
+not been acknowledged within at least 200ms. Modern TCP stacks use
+optimizations such as fast retransmissions and loss probes to avoid hitting
+retransmission timeouts. Anecdotally, they still occasionally present themselves
+even in supposedly healthy cloud environments.
+y_axis_label: Timeouts
+type: COUNTER
+unit: COUNT
+aggregation: AVG
+derivative: NON_NEGATIVE_DERIVATIVE
 - name: UNSET
 metrics:
 - name: build.timestamp
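
All of the new metrics above are declared as `type: COUNTER` with `derivative: NON_NEGATIVE_DERIVATIVE`, so they are meant to be read as rates rather than raw totals. As a rough illustration of that idea, here is a small sketch that turns two counter samples into a per-second rate. It assumes that "non-negative derivative" means clamping negative deltas (for example after a counter reset) to zero, and it treats the -1 "unsupported" sentinel from the commit message as "no data". The function name `nonNegativeRate` is hypothetical and this is not CockroachDB's time-series implementation.

```go
package main

import "fmt"

// nonNegativeRate (hypothetical name) computes a per-second rate from two
// samples of a monotonically increasing counter taken `seconds` apart.
// It reports ok=false when either sample is negative (the -1 sentinel for
// an unsupported counter) or when the interval is not positive.
// Assumption: a decrease between samples, such as after a restart, is
// clamped to a rate of zero.
func nonNegativeRate(prev, cur int64, seconds float64) (rate float64, ok bool) {
	if prev < 0 || cur < 0 || seconds <= 0 {
		return 0, false
	}
	if cur < prev {
		return 0, true
	}
	return float64(cur-prev) / seconds, true
}

func main() {
	// Made-up retrans_segs samples taken 10 seconds apart.
	rate, ok := nonNegativeRate(100, 160, 10)
	fmt.Println(rate, ok) // 6 true

	// Unsupported platform: both samples are the -1 sentinel.
	rate, ok = nonNegativeRate(-1, -1, 10)
	fmt.Println(rate, ok) // 0 false
}
```

Separating "no data" from "zero rate" matters here because a flat -1 would otherwise look like a perfectly healthy counter with no retransmissions.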

pkg/server/status/BUILD.bazel

Lines changed: 2 additions & 0 deletions
@@ -157,6 +157,7 @@ go_test(
 "//pkg/settings/cluster",
 "//pkg/sql/sem/catconstants",
 "//pkg/testutils/serverutils",
+"//pkg/testutils/skip",
 "//pkg/ts/tspb",
 "//pkg/ts/tsutil",
 "//pkg/util/hlc",
@@ -171,6 +172,7 @@ go_test(
 "@com_github_prometheus_client_model//go",
 "@com_github_prometheus_common//expfmt",
 "@com_github_shirou_gopsutil_v3//net",
+"@com_github_stretchr_testify//assert",
 "@com_github_stretchr_testify//require",
 ],
 )
