
Conversation


@netoptimizer netoptimizer commented Oct 29, 2025

WARNING: this is not intended for the "upstream" cloudflare repo... will fix soon

Evaluating netstacklat TCP Head-of-Line (HoL) filtering for ebpf_exporter

This builds on top of netoptimizer#1

This is a direct copy from bpf-examples,
taken from Simon's devel branch 'netstacklat-groupby':

https://github.com/simosund/bpf-examples/tree/netstacklat-groupby/netstacklat

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Keeping this as a separate commit to track what needed to change

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Gotcha#1: My devel laptop has a TAI offset of zero
 - Other systems (incl prod) all have 37 sec

Gotcha#2: RX-timestamping needs to be enabled manually
 - something else has to enable RX-timestamping

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
For ebpf_exporter we cannot control which BPF sections
get loaded.  Instead we disable some of the hooks
at compile time via define/ifdefs.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
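A minimal sketch of the compile-time hook selection described above, assuming the usual vmlinux.h/libbpf CO-RE includes used by netstacklat.bpf.c; the guard macro name is hypothetical and the program body is elided:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  /* Hypothetical guard; the real code has its own define/ifdef names */
  #define ENABLE_TCP_READ_HOOK 1

  #if ENABLE_TCP_READ_HOOK
  SEC("fentry/tcp_recv_timestamp")
  int BPF_PROG(netstacklat_tcp_recv_timestamp, struct msghdr *msg,
               const struct sock *sk)
  {
      /* latency recording elided */
      return 0;
  }
  #endif /* ENABLE_TCP_READ_HOOK */

  char LICENSE[] SEC("license") = "GPL";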
Instead, hardcode ifindex limits based on the prod setup,
as we don't have a way to configure this via YAML.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
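A sketch of the hard-coded ifindex range check; the bounds below are placeholders, not the actual prod values, and the helper name is illustrative (same includes as the sketch above assumed):

  static __always_inline bool filter_ifindex(__u32 ifindex)
  {
      /* Placeholder bounds; prod hard-codes its own range */
      const __u32 ifindex_min = 1;
      const __u32 ifindex_max = 16;

      return ifindex >= ifindex_min && ifindex <= ifindex_max;
  }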
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 18885872401 run_cnt 71,443,820 = 264.34 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 6124797061 run_cnt 16,324,309 = 375.19 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Something is buggy with this filter
 - All latency records are in the max bucket

The READ_ONCE change doesn't fix the issue

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
This was the real reason for seeing wrong numbers in prod.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
…sync

Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 32164560127 run_cnt 116,590,498 = 275.88 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 10490230543 run_cnt 23,993,428 = 437.21 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Moved filter_nonempty_sockqueue to callers
 - because record_socket_latency() becomes a BPF function-call
 - perf e.g. shows bpf_prog_fb69587c6ea462b7_record_socket_latency

Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 181391511590 run_cnt 788663546 = 229.99 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 16212598612 run_cnt 137812779 = 117.64 ns

This is clearly a huge improvement for UDP packets.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
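A sketch of this restructuring: the cheap queue check stays force-inlined at the call site, while the heavier record routine is left non-inlined and therefore compiles into a BPF-to-BPF call (signatures and the stub body are illustrative; assumes vmlinux.h, bpf_tracing.h and bpf_core_read.h):

  static __noinline int record_socket_latency(struct sock *sk)
  {
      /* histogram update elided; __noinline keeps this a BPF function-call */
      return 0;
  }

  static __always_inline bool sock_queue_nonempty(struct sock *sk)
  {
      return BPF_CORE_READ(sk, sk_receive_queue.qlen) > 0;
  }

  SEC("fentry/tcp_recv_timestamp")
  int BPF_PROG(netstacklat_tcp_recv_timestamp, struct msghdr *msg,
               struct sock *sk)
  {
      if (!sock_queue_nonempty(sk))
          return 0; /* filter before paying for the function call */

      return record_socket_latency(sk);
  }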
Enable filter_nonempty_sockqueue at compile time to make sure
that this config setting doesn't influence performance.

I'm only seeing a small effect.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
The filter_nonempty_sockqueue() is efficient for UDP packets,
but doesn't work well for TCP packets.

Add filtering on socket queue length. Try filtering out samples
if qlen is not above 3 packets for TCP.

Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 10690540076 run_cnt 117852699 = 90.71 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 2206621632 run_cnt 20004338 = 110.30 ns

This is a HUGE improvement for the TCP case.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
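A sketch of the queue-length threshold, using sk_receive_queue.qlen as the proxy (a later commit notes the backlog queue also matters); the helper shape and the example threshold are illustrative:

  static __always_inline bool filter_queue_len(struct sock *sk, __u32 min_qlen)
  {
      __u32 qlen = BPF_CORE_READ(sk, sk_receive_queue.qlen);

      /* Only record latency once the receive queue has built up,
       * e.g. min_qlen = 3 for TCP as measured above. */
      return qlen > min_qlen;
  }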
Leverage BPF_MAP_TYPE_CGRP_STORAGE for our cgroup filter.

To evaluate the two different cgroup_id_map types, the
macro CONFIG_CGRP_STORAGE is introduced, allowing
us to switch implementations at compile time.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
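A sketch of the CONFIG_CGRP_STORAGE variant, following the BPF_MAP_TYPE_CGRP_STORAGE usage pattern from the kernel BPF selftests; the map name, value layout and lookup-only semantics ("entry present means monitor this cgroup") are illustrative:

  struct {
      __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
      __uint(map_flags, BPF_F_NO_PREALLOC);
      __type(key, int);
      __type(value, __u64);
  } cgroup_filter_storage SEC(".maps");

  #ifdef CONFIG_CGRP_STORAGE
  static __always_inline bool filter_cgroup(void)
  {
      struct task_struct *task = bpf_get_current_task_btf();

      return bpf_cgrp_storage_get(&cgroup_filter_storage,
                                  task->cgroups->dfl_cgrp, NULL, 0) != NULL;
  }
  #endif /* CONFIG_CGRP_STORAGE */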
The ebpf_exporter variant of netstacklat is not runtime configurable at
BPF-load time. Thus, the user_config below isn't defined as 'volatile';
instead the 'const' allows the compiler to do dead-code elimination.

We also disable user_config.filter_nonempty_sockqueue as we want to
stress the cgroup lookup types some more.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
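A sketch of the 'const' (not 'volatile const') config; the field names are illustrative. Because the values are compile-time constants, branches testing them get dead-code eliminated:

  struct netstacklat_user_config {
      bool filter_nonempty_sockqueue;
      bool filter_cgroup;
  };

  /* No 'volatile': ebpf_exporter never rewrites .rodata before load,
   * so let the compiler constant-fold and drop disabled branches. */
  static const struct netstacklat_user_config user_config = {
      .filter_nonempty_sockqueue = false, /* off to stress cgroup lookups */
      .filter_cgroup = true,
  };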
This required changing call signature of filter_cgroup() to also
populate the cgroup_id as that is used for groupby_cgroup.

We are still using the CONFIG_CGRP_STORAGE variant that leverages the
BPF_MAP_TYPE_CGRP_STORAGE. Without the queue length filters
(filter_nonempty_sockqueue) we are getting more calls. This is on
purpose, to evaluate the cgroup_id_map types.

name netstacklat_tcp_recv_timestamp
 - run_time_ns 17953390952 run_cnt 55079620 = 325.95 ns

netstacklat_skb_consume_udp
 - run_time_ns 5779869863 run_cnt 11650472 = 496.10 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Changing the filter_cgroup() to use the cgroup_id_map type hash.
Surprisingly this type seems to be faster than type cgrp_storage.

name netstacklat_tcp_recv_timestamp
 - run_time_ns 10705407914 run_cnt 41576592 = 257.48 ns
 - diff: 325.95 - 257.48 = 68.47 ns better

name netstacklat_skb_consume_udp
 - run_time_ns 3716653454 run_cnt 8499677 = 437.27 ns
 - diff: 496.10 - 437.27 = 58.83 ns better

On this AMD CPU with SRSO enabled, we have extra overheads on BPF helper calls.
The filter_cgroup() for type cgrp_storage has two extra helper calls,
bpf_get_current_task_btf() and bpf_cgrp_storage_get(), but we eliminated the
bpf_get_current_cgroup_id() helper call. Still, this doesn't fully account for
the diff. That said, if a BPF-prog already has the task_struct available then the
bpf_get_current_task_btf() call can also be eliminated, so type cgrp_storage might
still be useful in that case.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
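A sketch of the hash-based cgroup_id_map variant that measured faster here: one bpf_get_current_cgroup_id() helper call plus a map lookup. Map sizing and value type are illustrative; the signature also returns the cgroup id for groupby_cgroup, as the earlier commit describes:

  struct {
      __uint(type, BPF_MAP_TYPE_HASH);
      __uint(max_entries, 1024);
      __type(key, __u64);  /* cgroup id */
      __type(value, __u8); /* presence flag */
  } cgroup_id_map SEC(".maps");

  static __always_inline bool filter_cgroup(__u64 *cgroup_id)
  {
      *cgroup_id = bpf_get_current_cgroup_id();

      return bpf_map_lookup_elem(&cgroup_id_map, cgroup_id) != NULL;
  }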
If we have disabled network_ns filtering, the code still does a lookup of the
current namespace by calling get_network_ns().

Reorganize the code to avoid this call when the feature is disabled.

name netstacklat_tcp_recv_timestamp
 - run_time_ns 10623365578 run_cnt 44842812 = 236.90 ns
 - diff: 257.48 - 236.90 = 20.58 ns better

name netstacklat_skb_consume_udp
 - run_time_ns 3718153230 run_cnt 9902613 = 375.47 ns
 - diff: 437.27 - 375.47 = 61.80 ns better

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
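A sketch of the reorg: resolve the netns inode only when the filter is enabled, so the disabled case never pays for the lookup (the CO-RE field walk stands in for get_network_ns(); the parameters are illustrative):

  static __always_inline bool filter_network_ns(struct sock *sk,
                                                bool filter_enabled,
                                                __u32 wanted_netns_inum)
  {
      if (!filter_enabled)
          return true; /* skip the namespace lookup entirely */

      return BPF_CORE_READ(sk, __sk_common.skc_net.net, ns.inum) ==
             wanted_netns_inum;
  }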
Let us control the filter_queue_len() via a separate user_config setting.
Also enable this for UDP sockets.

Notice the filter_cgroup() is still running first, so this is the primary
filter. And filter_nonempty_sockqueue is still false.

name netstacklat_tcp_recv_timestamp
 - run_time_ns 15661530364 run_cnt 94922963 = 164.99 ns

name netstacklat_skb_consume_udp
 - run_time_ns 3255451250 run_cnt 14532586 = 224.01 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Verified performance is same as previous patch.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
It is more efficient to first filter out sockets with empty or small queues.
This is a tradeoff as our production use-case is to capture when the system
becomes overloaded.

name netstacklat_tcp_recv_timestamp
 - run_time_ns 15347880145 run_cnt 177786472 = 86.32 ns

name netstacklat_skb_consume_udp
 - run_time_ns 3096529442 run_cnt 33903931 = 91.33 ns

The performance gain is huge. Do remember that this is the average runtime cost
that is reduced, because we can skip recording many of these events.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Our checks for empty or almost empty sockets were wrong,
because sockets also have a backlog queue.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
This changes the test to be 'ge', greater-than-or-equal.
We want the ability to specify 1, meaning a queue size of one
and above. Before, 1 meant queue size 2 and above.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Let record_skb_latency() also use it.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
For production usage, we want the ability to disable the UDP
hooks.  Introduce CONFIG_xxx_HOOKS for IP, UDP and TCP.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
For production we need to reduce the number of Prometheus
histogram buckets.  As the dequeue hooks are unlikely to see
sub-microsecond latencies we reduce the resolution to usecs.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
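A sketch of the resolution change: convert the measured latency from nanoseconds to microseconds before picking the log2 bucket, which removes roughly ten sub-usec buckets per histogram. The bucketing helper below is illustrative, not the actual ebpf_exporter/netstacklat histogram code:

  static __always_inline __u32 latency_bucket_usec(__u64 latency_ns)
  {
      __u64 usec = latency_ns / 1000; /* usec resolution for dequeue hooks */
      __u32 bucket = 0;

      /* bounded log2 bucketing, verifier-friendly */
      for (int i = 0; i < 64 && usec > 1; i++) {
          usec >>= 1;
          bucket++;
      }
      return bucket;
  }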
netoptimizer and others added 15 commits August 27, 2025 18:52
When importing this into ebpf_exporter we need to reformat
the C-code, which is done via command line:

 clang-format -i configs/netstacklat.{h,bpf.c}

But it doesn't convert these macros and comments.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
We are currently only using a single hook, so change N_HOOKS
that is used in the max_entries calc of netstack_latency_seconds.

Detect if other hooks get enabled and make the compile fail.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
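A sketch of the compile-time guard: size the histogram map for a single hook and fail the build if more hooks get enabled. N_HOOKS and netstack_latency_seconds come from the commit text; the exact CONFIG_xxx_HOOKS spellings and the bucket count are assumptions:

  #define N_HOOKS 1 /* only the TCP read hook is enabled */

  #if defined(CONFIG_UDP_HOOKS) || defined(CONFIG_IP_HOOKS)
  #error "More hooks enabled than accounted for in netstack_latency_seconds max_entries"
  #endif

  #define HIST_NBUCKETS 34 /* illustrative */
  #define NETSTACK_LATENCY_MAX_ENTRIES (N_HOOKS * HIST_NBUCKETS)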
We unfortunately need an atomic counter update for the nth-filter.
This is because hooks like tcp-socket-read run outside the socket
lock in a preempt/migrate-able user context.

We don't need an accurate nth-counter across CPUs, as this is
just a down-sampling mechanism.  Thus, we keep the PERCPU
array map and have the nth-counter on a per-CPU basis.  The
trick here is that in most cases the counter is only used
by the currently running CPU, and the cache line will mostly
be in a cache coherency Exclusive/Modified (MOESI) state,
which costs less when doing atomic updates.

Manually testing on production showed 7ns runtime increase
(before 150.88 ns, after 157.67 ns).

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
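A sketch of the per-CPU nth down-sampling described above: a PERCPU_ARRAY keeps the counter CPU-local (so the cache line usually stays in Exclusive/Modified state), and the atomic add guards against preemption/migration. Names and the nth handling are illustrative:

  struct {
      __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
      __uint(max_entries, 1);
      __type(key, __u32);
      __type(value, __u64);
  } nth_counter SEC(".maps");

  static __always_inline bool sample_nth(__u64 nth)
  {
      __u32 key = 0;
      __u64 *cnt;

      if (nth <= 1)
          return true; /* no down-sampling */

      cnt = bpf_map_lookup_elem(&nth_counter, &key);
      if (!cnt)
          return false;

      /* Atomic because e.g. tcp-socket-read runs outside the socket lock
       * and can be preempted/migrated mid-update. */
      return (__sync_fetch_and_add(cnt, 1) % nth) == 0;
  }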
Sample every 2-nth packet (50%)
 - Overhead: 6039419756 / 32219314 = 187.44 ns

Compared to local atomic-nth overhead: 157.67 ns
 - approx 30 ns extra cost to sample 50% vs 3.12%

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
name netstacklat_tcp_recv_timestamp
 - run_time_ns 17510185954 run_cnt 101083454 = 173.23 ns

Sample every 4-nth packet (25%)
 - Overhead: 173.23 ns
 - Compared to nth-2 (187.44 ns) saved 14.21 ns (187.44-173.23)
 - Compared to nth-32 (157.67 ns) cost 15.56 ns more (173.23-157.67)

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
name netstacklat_tcp_recv_timestamp
 - run_time_ns 24383044912 run_cnt 121125888 = 201.30 ns

Compared to
 - nth-2  : 186 ns -> +15 ns
 - nth-4  : 173 ns -> +28 ns
 - nth-32 : 157 ns -> +44 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Because production has too little traffic on the internal
interface, the latency stats become unusable.

Via CONFIG_GROUPBY_IFINDEX also remove ifindex from the
hist_key, and note that the YAML file also needs adjustments.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
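A sketch of the CONFIG_GROUPBY_IFINDEX knob on the histogram key; the field layout is illustrative. When the ifindex member is compiled out, the ebpf_exporter YAML label decoding has to be adjusted to match:

  struct hist_key {
      __u64 cgroup_id;
  #ifdef CONFIG_GROUPBY_IFINDEX
      __u32 ifindex; /* dropped in prod: internal iface has too little traffic */
  #endif
      __u16 hook;
      __u16 bucket;
  };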
The upstream version of netstacklat that we are based on
got merged, see PR#129.

xdp-project/bpf-examples#129

Some adjustments were made, so let's sync with these
to avoid diverging too much from upstream.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
As filter_min_sockqueue_len can replace it.

This was also part of the PR#129 merge, but keeping this in a
separate commit makes it easier to review.

xdp-project/bpf-examples#129

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Lacking a YAML ebpf_exporter config for selecting iface names,
we hard-coded the ifindex, but some production servers have a higher
ifindex for vlan100.

Relax the ifindex range as a workaround.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
The 'tcp-socket-read' currently reports the latency for the skb
containing the last TCP segment read from the socket. However, this
segment might have been head of line (HOL) blocked by a previous
segment missing. In this case, netstacklat's reported latency will
include HOL blocking periods that are dependent on external
factors (such as network packet loss, and network latency impacting
retransmission time). As netstacklat is primarily intended to identify
issues within the local host (in the network stack or receiving
applications), by default filter out any socket reads where the last
read SKB might have experienced HOL-blocking.

Add the new -y/--include-tcp-hol-delay option to retain the old
behavior of reporting latency for all reads, including those that are
HOL-blocked. This may be useful in some scenarios where you still want
to be aware of latency issues caused by HOL-blocking, even though it
is caused by external components. For example, in a data center
context where you have full control over the network, it may still be
relevant to monitor HOL-blocking caused by the network.

To exclude HOL-blocked reads, detect if any new ooo-segments have
arrived by checking for differences in the number of ooo-packets in
tcp_sock->rcv_ooopack. If any new ooo-segments have arrived, exclude
the latency sample from the current read and set a limit for the next
safe sequence number to read where the current ooo-packets must have
been passed so segments can no longer be HOL-blocked. If there are
skbs in the ooo-queue, set the limit to the end of the
ooo-queue. Otherwise, set the limit to the current rcv_nxt (as if the
ooo-queue is empty the detected ooo-segments must already have been
merged into the receive queue and rcv_nxt must have advanced past
them). If the read is past the safe sequence limit and no new
ooo-segments have arrived, it's safe to start including the latency
samples again.

For sockets where some ooo-segments have been observed, keep the
ooo-range state in socket storage (BPF_MAP_TYPE_SK_STORAGE). Skip
protecting this state with a spin-lock, as it should only be
concurrently accessed if there are concurrent reads on the same TCP
socket, which is assumed to be very rare as applications attempting
that cannot know which part of the data each of their concurrent reads
will get.

There are some scenarios that may cause this ooo-filtering to fail.
- If multiple reads are done to the socket concurrently, we may not
  correctly track the last read byte. The kernel does not keep a lock
  on the TCP socket at the time our hooked function
  tcp_recv_timestamp() runs. If two reads are done in parallel, it's
  therefore possible that for both reads we will check the last read
  byte (tcp_sock.copied_seq) after the second read has updated it. We
  may then incorrectly conclude that the first read was ahead of the
  ooo-range when it was not, and record its latency when we should
  have excluded it. In practice I believe this issue should be quite
  rare, as most applications will probably not attempt to perform
  multiple concurrent reads to a single connected TCP socket in
  parallel (as then you cannot know which part of the payload the
  parallel reads will return).

- As tcp_recv_timestamp() runs outside of the socket lock, the various
  state members we access may concurrently be updated as we're
  attempting to read them. An especially problematic one is
  tcp_sock.ooo_last_skb, which keeps a pointer to an SKB that is only
  valid while the ooo-queue is non-empty. It is possible that between
  our check for if the ooo-queue is non-empty and following the
  ooo_last_skb pointer, the ooo-queue is cleared and the ooo_last_skb
  pointer may end up pointing towards a freed SKB. If the socket
  members we access are updated before or while we read them, it can
  break the filtering in numerous ways, e.g. result in including
  samples that should have been excluded (due to e.g. copied_seq being
  updated before our read) or excluding a large amount of valid
  samples (due to e.g. setting a sequence limit based on garbage in a
  freed SKB).

Signed-off-by: Simon Sundberg <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
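A sketch of the HOL-exclusion state and check described above. The tcp_sock fields (rcv_ooopack, copied_seq, rcv_nxt) are real, but the map, struct and helper names are illustrative and the ooo_last_skb handling is elided (this sketch always falls back to rcv_nxt for the safe limit):

  struct ooo_state {
      __u32 seen_ooopack;  /* last observed tp->rcv_ooopack */
      __u32 safe_seq;      /* reads before this seq may be HOL-blocked */
  };

  struct {
      __uint(type, BPF_MAP_TYPE_SK_STORAGE);
      __uint(map_flags, BPF_F_NO_PREALLOC);
      __type(key, int);
      __type(value, struct ooo_state);
  } ooo_state_map SEC(".maps");

  static __always_inline bool tcp_read_maybe_hol_blocked(struct sock *sk)
  {
      struct tcp_sock *tp = (struct tcp_sock *)sk;
      __u32 ooopack = BPF_CORE_READ(tp, rcv_ooopack);
      __u32 copied = BPF_CORE_READ(tp, copied_seq);
      struct ooo_state *st;

      st = bpf_sk_storage_get(&ooo_state_map, sk, NULL,
                              BPF_SK_STORAGE_GET_F_CREATE);
      if (!st)
          return true; /* be conservative: exclude the sample */

      if (ooopack != st->seen_ooopack) {
          /* New ooo segments seen: exclude samples until reads pass
           * rcv_nxt (the real code prefers ooo_last_skb->end_seq when
           * the ooo-queue is non-empty). */
          st->seen_ooopack = ooopack;
          st->safe_seq = BPF_CORE_READ(tp, rcv_nxt);
          return true;
      }

      /* Sequence compare with wrap-around handling */
      return (__s32)(copied - st->safe_seq) < 0;
  }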
The logic for excluding samples from TCP reads that may have been
delayed by HOL blocking relies on reading a number of fields from the
TCP socket outside of the socket lock. This may be prone to errors due
to the socket state being updated at another place in the kernel while
our eBPF program is running. To reduce the risk that a data race
causes the filter to fail, add a sanity check for the maximum out of
order sequence used to exclude future TCP reads from monitoring.

The most problematic of the read fields in the tcp_sock is
ooo_last_skb, as that is a pointer to another SKB rather than a direct
value. This pointer is only valid as long as the out_of_order_queue is
non-empty. Due to a data race, we may check that the ooo-queue is
non-empty while there are still SKBs in it, then have the kernel clear
out the ooo-queue, and finally attempt to read the ooo_last_skb
pointer later when it is no longer valid (and may now point to a
freed/recycled SKB). This may result in incorrect values being used
for the sequence limit used to exclude future reads of
ooo-segments. The faulty sequence limit may both cause reads of
HOL-blocked segments to be included or the exclusion of an
unnecessarily large amount of future reads (up to 2 GB).

To reduce the risk that the garbage data from an invalid SKB is used,
introduce two sanity checks for end_seq in the ooo_last_skb. First,
check if the sequence number is zero; if so, assume it is invalid (even
though it can be a valid sequence number). Even though we will get an
error code if reading the data from this SKB fails altogether, we may
still succeed reading from a no longer valid SKB, in which case there
is a high risk the data will have been zeroed. If it's non-zero, also
check that it is within the current receive window (if not, clamp it
to the receive window).

Signed-off-by: Simon Sundberg <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
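A sketch of the end_seq sanity checking: treat a zero sequence as likely garbage from a freed SKB and clamp anything beyond the current receive window. The rcv_nxt/rcv_wnd fields are real tcp_sock members; the helper name, the fallback value and the exact window arithmetic are illustrative:

  static __always_inline __u32 sanitize_ooo_end_seq(struct tcp_sock *tp,
                                                    __u32 end_seq)
  {
      __u32 rcv_nxt = BPF_CORE_READ(tp, rcv_nxt);
      __u32 rcv_wnd = BPF_CORE_READ(tp, rcv_wnd);
      __u32 wnd_end = rcv_nxt + rcv_wnd;

      if (end_seq == 0)
          return rcv_nxt;             /* assume invalid, fall back */

      if ((__s32)(end_seq - wnd_end) > 0)
          return wnd_end;             /* clamp to the receive window */

      return end_seq;
  }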