Netstacklat02 - porting TCP HoL patches from netstacklat PR#133 #587
Closed
netoptimizer wants to merge 45 commits into cloudflare:master from netoptimizer:netstacklat02-TCP-HoL-filter
+936 −0
Conversation
This is a direct copy from bpf-examples, but taken from Simon's devel branch 'netstacklat-groupby': https://github.com/simosund/bpf-examples/tree/netstacklat-groupby/netstacklat Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Keeping this as a separate commit to track what needed to change. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Gotcha#1: My devel laptop has a TAI offset of zero - other systems (incl. prod) all have 37 sec. Gotcha#2: RX-timestamping needs to be enabled manually - something else has to enable RX-timestamping. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
For ebpf_exporter we cannot control which BPF sections get loaded. Instead we disable some of the hooks at compile time via defines/ifdefs. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
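A minimal sketch of the ifdef pattern described here, assuming fentry hooks as in netstacklat; the CONFIG_ macro name is illustrative, not necessarily the one used in the PR:

```c
#define CONFIG_UDP_HOOKS 0 /* illustrative compile-time switch */

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#if CONFIG_UDP_HOOKS
/* Only emitted into the BPF object when enabled, so ebpf_exporter never
 * sees (and never tries to attach) this section. */
SEC("fentry/skb_consume_udp")
int BPF_PROG(netstacklat_skb_consume_udp, struct sock *sk, struct sk_buff *skb)
{
	/* record dequeue latency for this skb ... */
	return 0;
}
#endif

char LICENSE[] SEC("license") = "GPL";
```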
Instead, hardcode ifindex limits based on the prod setup, as we don't have a way to configure this via YAML. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
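Hedged sketch of a hard-coded ifindex range check; the bounds below are placeholders, not the production values:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Placeholder bounds standing in for the prod ifindex range. */
#define IFINDEX_MIN 2
#define IFINDEX_MAX 10

static __always_inline bool filter_ifindex(__u32 ifindex)
{
	return ifindex >= IFINDEX_MIN && ifindex <= IFINDEX_MAX;
}
```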
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Execution runtime: netstacklat_tcp_recv_timestamp - run_time_ns 18885872401 run_cnt 71,443,820 = 264.34 ns Execution runtime: netstacklat_skb_consume_udp - run_time_ns 6124797061 run_cnt 16,324,309 = 375.19 ns Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Something is buggy with this filter - all latency records land in the max bucket. The READ_ONCE change doesn't fix the issue. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
This was the real reason for seeing wrong numbers in prod. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
…sync Execution runtime: netstacklat_tcp_recv_timestamp - run_time_ns 32164560127 run_cnt 116,590,498 = 275.88 ns Execution runtime: netstacklat_skb_consume_udp - run_time_ns 10490230543 run_cnt 23,993,428 = 437.21 ns Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Moved filter_nonempty_sockqueue to the callers - because record_socket_latency() becomes a BPF function-call - perf e.g. shows bpf_prog_fb69587c6ea462b7_record_socket_latency Execution runtime: netstacklat_tcp_recv_timestamp - run_time_ns 181391511590 run_cnt 788663546 = 229.99 ns Execution runtime: netstacklat_skb_consume_udp - run_time_ns 16212598612 run_cnt 137812779 = 117.64 ns This is clearly a huge improvement for UDP packets. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Enable filter_nonempty_sockqueue at compile time to make sure that this config setting doesn't influence performance. I'm only seeing a small effect. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
The filter_nonempty_sockqueue() is efficient for UDP packets, but doesn't work well for TCP packets. Add filtering on socket queue length. Try filtering if qlen is not above 3 packets for TCP. Execution runtime: netstacklat_tcp_recv_timestamp - run_time_ns 10690540076 run_cnt 117852699 = 90.71 ns Execution runtime: netstacklat_skb_consume_udp - run_time_ns 2206621632 run_cnt 20004338 = 110.30 ns This is a HUGE improvement for the TCP case. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
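A sketch of the kind of queue-length gate described above, using a CO-RE read of the socket receive queue; the threshold and function name are illustrative:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Illustrative threshold: only record TCP samples when more than 3 skbs
 * are already queued on the socket. */
#define TCP_MIN_QLEN 3

static __always_inline bool filter_queue_len(struct sock *sk)
{
	__u32 qlen = BPF_CORE_READ(sk, sk_receive_queue.qlen);

	return qlen > TCP_MIN_QLEN;
}
```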
Leverage BPF_MAP_TYPE_CGRP_STORAGE for our cgroup filter. To evaluate the two different cgroup_id_map types, the code macro CONFIG_CGRP_STORAGE is introduced, allowing us to switch implementations at compile time. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
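A sketch of what the cgrp_storage-based filter could look like, assuming userspace marks allowed cgroups by creating a storage entry; map and field names are illustrative:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, __u64);
} cgroup_allowed SEC(".maps");

static __always_inline bool filter_cgroup_storage(__u64 *cgroup_id)
{
	struct task_struct *task = bpf_get_current_task_btf();
	__u64 *val;

	/* Lookup only (no create flag): an entry exists only for cgroups that
	 * userspace explicitly marked as interesting. */
	val = bpf_cgrp_storage_get(&cgroup_allowed, task->cgroups->dfl_cgrp, 0, 0);
	if (!val)
		return false;

	*cgroup_id = *val;
	return true;
}
```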
The ebpf_exporter variant of netstacklat is not runtime configurable at BPF-load time. Thus, the user_config below isn't defined as 'volatile'; instead the 'const' allows the compiler to do dead-code elimination. We also disable user_config.filter_nonempty_sockqueue as we want to stress the cgroup lookup types some more. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
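Sketch of the const-config pattern; the struct and field names are simplified stand-ins for the real user_config:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct netstacklat_config {
	bool filter_nonempty_sockqueue;
	bool filter_cgroup;
};

/* Plain 'const' (not 'const volatile'): the values cannot be patched at
 * BPF-load time, but clang can dead-code-eliminate disabled branches. */
const struct netstacklat_config user_config = {
	.filter_nonempty_sockqueue = false, /* disabled to stress cgroup lookups */
	.filter_cgroup = true,
};

static __always_inline bool should_record(__u32 qlen)
{
	/* This branch is compiled out entirely when the flag is const false. */
	if (user_config.filter_nonempty_sockqueue && qlen == 0)
		return false;
	return true;
}
```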
This required changing the call signature of filter_cgroup() to also populate the cgroup_id, as that is used for groupby_cgroup. We are still using CONFIG_CGRP_STORAGE, which leverages the BPF_MAP_TYPE_CGRP_STORAGE. Without the queue length filters (filter_nonempty_sockqueue) we are getting more calls; this is on purpose, to evaluate the cgroup_id_map types. name netstacklat_tcp_recv_timestamp - run_time_ns 17953390952 run_cnt 55079620 = 325.95 ns netstacklat_skb_consume_udp - run_time_ns 5779869863 run_cnt 11650472 = 496.10 ns Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Changing filter_cgroup() to use the cgroup_id_map type hash. Surprisingly this type seems to be faster than type cgrp_storage. name netstacklat_tcp_recv_timestamp - run_time_ns 10705407914 run_cnt 41576592 = 257.48 ns - diff: 325.95 - 257.48 = 68.47 ns better name netstacklat_skb_consume_udp - run_time_ns 3716653454 run_cnt 8499677 = 437.27 ns - diff: 496.10 - 437.27 = 58.83 ns better On this AMD CPU with SRSO enabled, we have extra overhead on BPF helper calls. The filter_cgroup() for type cgrp_storage has two extra helper calls, bpf_get_current_task_btf() and bpf_cgrp_storage_get(), though we eliminated the bpf_get_current_cgroup_id() helper call. Still, this doesn't fully account for the diff. That said, if a BPF-prog already has the struct task available, then the bpf_get_current_task_btf() can also be eliminated, so type cgrp_storage might still be useful in that case. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
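For comparison, a sketch of the hash-based variant, assuming userspace pre-populates the map with the allowed cgroup ids; names are illustrative:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);  /* cgroup id */
	__type(value, __u8); /* presence marker */
} cgroup_id_map SEC(".maps");

static __always_inline bool filter_cgroup_hash(__u64 *cgroup_id)
{
	__u64 id = bpf_get_current_cgroup_id();

	if (!bpf_map_lookup_elem(&cgroup_id_map, &id))
		return false;

	*cgroup_id = id; /* reused for groupby_cgroup */
	return true;
}
```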
If we have disabled network_ns filtering, the code still does a lookup of the current namespace via a call to get_network_ns(). Reorganize the code to avoid this call if the feature is disabled. name netstacklat_tcp_recv_timestamp - run_time_ns 10623365578 run_cnt 44842812 = 236.90 ns - diff: 257.48 - 236.90 = 20.58 ns better name netstacklat_skb_consume_udp - run_time_ns 3718153230 run_cnt 9902613 = 375.47 ns - diff: 437.27 - 375.47 = 61.80 ns better Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Let us control filter_queue_len() via a separate user_config. Also enable this for UDP sockets. Notice that filter_cgroup() still runs first, so it is the primary filter, and filter_nonempty_sockqueue is still false. name netstacklat_tcp_recv_timestamp - run_time_ns 15661530364 run_cnt 94922963 = 164.99 ns name netstacklat_skb_consume_udp - run_time_ns 3255451250 run_cnt 14532586 = 224.01 ns Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Verified performance is same as previous patch. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
It is more efficient to first filter out sockets with empty or small queues. This is a tradeoff as our production use-case is to capture when the system becomes overloaded. name netstacklat_tcp_recv_timestamp - run_time_ns 15347880145 run_cnt 177786472 = 86.32 ns name netstacklat_skb_consume_udp - run_time_ns 3096529442 run_cnt 33903931 = 91.33 ns The performance gain is huge. Do remember that this is the average runtime cost that is reduced, because we can skip recording many of these events. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Our checks for empty or almost empty sockets were wrong, because sockets also have a backlog queue. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
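Sketch of a queue check that also looks at the backlog queue, via CO-RE reads; the helper name is illustrative:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

static __always_inline bool sock_queues_empty(struct sock *sk)
{
	__u32 rx_qlen = BPF_CORE_READ(sk, sk_receive_queue.qlen);
	/* sk_backlog.len is truesize-accounted bytes; non-zero means skbs are
	 * parked on the backlog even if sk_receive_queue looks empty. */
	int backlog = BPF_CORE_READ(sk, sk_backlog.len);

	return rx_qlen == 0 && backlog == 0;
}
```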
This changes the test to be 'ge', greater-than-or-equal. We want the ability to specify 1, meaning a queue size of one and above. Before, 1 meant queue size 2 and above. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Also let record_skb_latency() use it. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
For production usage, we want the ability to disable the UDP hooks. Introduce CONFIG_xxx_HOOKS for IP, UDP and TCP. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
For production we need to reduce the number of Prometheus bucket metrics. As the dequeue hooks are unlikely to see sub-microsecond latencies, we reduce the resolution to microseconds. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
When importing this into ebpf_exporter we need to reformat
the C-code, which is done via command line:
clang-format -i configs/netstacklat.{h,bpf.c}
But it doesn't convert these macros and comments.
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
We are currently only using a single hook, so change N_HOOKS, which is used in the max_entries calculation for netstack_latency_seconds. Detect if other hooks get enabled and make the compile fail. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
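One way to make the build fail when extra hooks get enabled, sketched with preprocessor checks; the CONFIG_ names and the single-hook assumption mirror the commit text but are otherwise illustrative:

```c
/* Illustrative hook switches; the names in the PR may differ. */
#define CONFIG_TCP_HOOKS 1
#define CONFIG_UDP_HOOKS 0
#define CONFIG_IP_HOOKS  0

/* netstack_latency_seconds max_entries is derived from N_HOOKS. */
#define N_HOOKS (CONFIG_TCP_HOOKS + CONFIG_UDP_HOOKS + CONFIG_IP_HOOKS)

#if N_HOOKS != 1
#error "map sizing assumes a single enabled hook; update max_entries first"
#endif
```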
We unfortunately need an atomic counter update for the nth-filter. This is because hooks like tcp-socket-read run outside the socket lock in a preempt/migrate-able user context. We don't need an accurate nth-counter across CPUs, as this is just a down-sampling mechanism. Thus, we keep the PERCPU array map and maintain the nth-counter on a per-CPU basis. The trick here is that in most cases the counter is only used by the currently running CPU, and the cache-line will mostly be in a cache coherency Exclusive/Modified (MOESI) state, which costs less when doing atomic updates. Manual testing in production showed a 7 ns runtime increase (before 150.88 ns, after 157.67 ns). Signed-off-by: Jesper Dangaard Brouer <[email protected]>
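Sketch of the per-CPU nth-sampling counter with an atomic update, assuming a single-slot PERCPU_ARRAY; names and the sampling divisor are illustrative:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define SAMPLE_EVERY_NTH 32 /* ~3.12% of events */

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} nth_counter SEC(".maps");

static __always_inline bool sample_this_event(void)
{
	__u32 key = 0;
	__u64 *cnt = bpf_map_lookup_elem(&nth_counter, &key);

	if (!cnt)
		return true;

	/* Atomic add: the hook runs preemptible/migratable, so another task on
	 * this CPU can race with us.  The cache line usually stays in the local
	 * CPU's Exclusive/Modified state, keeping the atomic cheap. */
	return (__sync_fetch_and_add(cnt, 1) % SAMPLE_EVERY_NTH) == 0;
}
```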
Sample every 2-nth packet (50%) - Overhead: 6039419756 / 32219314 = 187.44 ns Compared to local atomic-nth overhead: 157.67 ns - approx 30 ns extra cost to sample 50% vs 3.12% Signed-off-by: Jesper Dangaard Brouer <[email protected]>
name netstacklat_tcp_recv_timestamp - run_time_ns 17510185954 run_cnt 101083454 = 173.23 ns Sample every 4-nth packet (25%) - Overhead: 173.23 ns - Compared to nth-2 (187.44 ns) saved 14.21 ns (187.44-173.23) - Compared to nth-32 (157.67 ns) cost 15.56 ns more (173.23-157.67) Signed-off-by: Jesper Dangaard Brouer <[email protected]>
name netstacklat_tcp_recv_timestamp - run_time_ns 24383044912 run_cnt 121125888 = 201.30 ns Compared to - nth-2 : 186 ns -> +15 ns - nth-4 : 173 ns -> +28 ns - nth-32 : 157 ns -> +44 ns Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Because production has too little traffic on the internal interface, the latency stats become unusable. Via CONFIG_GROUPBY_IFINDEX also remove this from the hist_key, and note that the YAML file also needs adjustments. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
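Sketch of how the histogram key could drop the ifindex at compile time; the other fields are illustrative, not the actual hist_key layout:

```c
#include "vmlinux.h"

/* When ifindex grouping is compiled out, the label disappears from the key,
 * so the ebpf_exporter YAML label list must be adjusted to match. */
struct hist_key {
	__u64 cgroup_id;
	__u16 hook;
#ifdef CONFIG_GROUPBY_IFINDEX
	__u16 ifindex;
#endif
	__u16 bucket;
};
```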
The upstream version of netstacklat that we are based on got merged, see PR#129 (xdp-project/bpf-examples#129). Some adjustments were made, so let's sync with these to avoid diverging too much from upstream. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
As filter_min_sockqueue_len can replace it. This was also part of the PR#129 merge (xdp-project/bpf-examples#129), but keeping this in a separate commit makes it easier to review. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Lacking a YAML ebpf_exporter config for selecting iface names, we hard-coded the ifindex, but some production servers have a higher ifindex for vlan100. Relax the ifindex range as a workaround. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
The 'tcp-socket-read' hook currently reports the latency for the skb containing the last TCP segment read from the socket. However, this segment might have been head of line (HOL) blocked by a previous missing segment. In this case, netstacklat's reported latency will include HOL blocking periods that depend on external factors (such as network packet loss, and network latency which impacts retransmission time). As netstacklat is primarily intended to identify issues within the local host (in the network stack or receiving applications), by default filter out any socket reads where the last read SKB might have experienced HOL-blocking.

Add the new -y/--include-tcp-hol-delay option to retain the old behavior of reporting latency for all reads, including those that are HOL-blocked. This may be useful in some scenarios where you still want to be aware of latency issues caused by HOL-blocking, even though it is caused by external components. For example, in a data center context where you have full control over the network, it may still be relevant to monitor HOL-blocking caused by the network.

To exclude HOL-blocked reads, detect if any new ooo-segments have arrived by checking for differences in the number of ooo-packets in tcp_sock->rcv_ooopack. If any new ooo-segments have arrived, exclude the latency sample from the current read and set a limit for the next safe sequence number to read, at which the current ooo-packets must have been passed so segments can no longer be HOL-blocked. If there are skbs in the ooo-queue, set the limit to the end of the ooo-queue. Otherwise, set the limit to the current rcv_nxt (as if the ooo-queue is empty the detected ooo-segments must already have been merged into the receive queue and rcv_nxt must have advanced past them). If the read is past the safe sequence limit and no new ooo-segments have arrived, it's safe to start including the latency samples again.

For sockets where some ooo-segments have been observed, keep the ooo-range state in socket storage (BPF_MAP_TYPE_SK_STORAGE). Skip protecting this state with a spin-lock, as it should only be concurrently accessed if there are concurrent reads on the same TCP socket, which is assumed to be very rare as applications attempting that cannot know which part of the data each of their concurrent reads will get.

There are some scenarios that may cause this ooo-filtering to fail.

- If multiple reads are done to the socket concurrently, we may not correctly track the last read byte. The kernel does not keep a lock on the TCP socket at the time our hooked function tcp_recv_timestamp() runs. If two reads are done in parallel, it's therefore possible that for both reads we will check the last read byte (tcp_sock.copied_seq) after the second read has updated it. We may then incorrectly conclude that the first read was ahead of the ooo-range when it was not, and record its latency when we should have excluded it. In practice I believe this issue should be quite rare, as most applications will probably not attempt to perform multiple concurrent reads to a single connected TCP socket in parallel (as then you cannot know which part of the payload the parallel reads will return).

- As tcp_recv_timestamp() runs outside of the socket lock, the various state members we access may concurrently be updated as we're attempting to read them. An especially problematic one is tcp_sock.ooo_last_skb, which keeps a pointer to an SKB that is only valid while the ooo-queue is non-empty. It is possible that between our check for whether the ooo-queue is non-empty and following the ooo_last_skb pointer, the ooo-queue is cleared and the ooo_last_skb pointer may end up pointing towards a freed SKB. If the socket members we access are updated before or while we read them, it can break the filtering in numerous ways, e.g. resulting in the inclusion of samples that should have been excluded (due to e.g. copied_seq being updated before our read) or excluding a large amount of valid samples (due to e.g. setting a sequence limit based on garbage in a freed SKB).

Signed-off-by: Simon Sundberg <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
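A condensed sketch of the ooo-tracking described above: compare rcv_ooopack against per-socket state kept in SK_STORAGE, and exclude samples until copied_seq passes a safe limit. All socket reads are racy (no socket lock), exactly as the commit message warns; field handling is simplified and names are illustrative:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

struct ooo_state {
	__u32 last_rcv_ooopack; /* rcv_ooopack seen at the previous read */
	__u32 safe_seq;         /* exclude samples until copied_seq passes this */
	bool excluding;
};

struct {
	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct ooo_state);
} netstacklat_ooo_state SEC(".maps");

/* Return true if the latency sample from this read may include HOL blocking
 * and should be dropped. */
static __always_inline bool tcp_read_maybe_hol_blocked(struct sock *sk)
{
	struct tcp_sock *tp = (struct tcp_sock *)sk;
	__u32 ooopack = BPF_CORE_READ(tp, rcv_ooopack);
	__u32 copied_seq = BPF_CORE_READ(tp, copied_seq);
	struct ooo_state *st;

	st = bpf_sk_storage_get(&netstacklat_ooo_state, sk, NULL,
				BPF_SK_STORAGE_GET_F_CREATE);
	if (!st)
		return true; /* no state: play it safe and drop the sample */

	if (ooopack != st->last_rcv_ooopack) {
		/* New ooo segments seen: exclude until past the end of the
		 * ooo-queue, or past rcv_nxt if the queue is already empty. */
		struct sk_buff *last = BPF_CORE_READ(tp, ooo_last_skb);

		st->last_rcv_ooopack = ooopack;
		st->excluding = true;
		if (BPF_CORE_READ(tp, out_of_order_queue.rb_node) && last) {
			struct tcp_skb_cb *cb = (struct tcp_skb_cb *)&last->cb[0];

			st->safe_seq = BPF_CORE_READ(cb, end_seq);
		} else {
			st->safe_seq = BPF_CORE_READ(tp, rcv_nxt);
		}
		return true;
	}

	if (st->excluding && (__s32)(copied_seq - st->safe_seq) < 0)
		return true; /* still before the safe sequence limit */

	st->excluding = false;
	return false;
}
```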
The logic for excluding samples from TCP reads that may have been delayed by HOL blocking relies on reading a number of fields from the TCP socket outside of the socket lock. This may be prone to errors due to the socket state being updated elsewhere in the kernel while our eBPF program is running. To reduce the risk that a data race causes the filter to fail, add a sanity check for the maximum out-of-order sequence used to exclude future TCP reads from monitoring.

The most problematic of the read fields in the tcp_sock is ooo_last_skb, as it is a pointer to another SKB rather than a direct value. This pointer is only valid as long as the out_of_order_queue is non-empty. Due to a data race, we may check that the ooo-queue is non-empty while there are still SKBs in it, then have the kernel clear out the ooo-queue, and finally attempt to read the ooo_last_skb pointer when it is no longer valid (and may now point to a freed/recycled SKB). This may result in incorrect values being used for the sequence limit that excludes future reads of ooo-segments. The faulty sequence limit may cause reads of HOL-blocked segments to be included, or the exclusion of an unnecessarily large amount of future reads (up to 2 GB).

To reduce the risk that garbage data from an invalid SKB is used, introduce two sanity checks for end_seq in the ooo_last_skb. First, check if the sequence number is zero; if so, assume it is invalid (even though zero can be a valid sequence number). Even though we will get an error code if reading the data from this SKB fails altogether, we may still succeed in reading from a no-longer-valid SKB, in which case there is a high risk the data will have been zeroed. If it's non-zero, also check that it is within the current receive window (if not, clamp it to the receive window).

Signed-off-by: Simon Sundberg <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
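Sketch of the described end_seq sanity checks; the receive-window computation is simplified and should be read as an assumption:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* 'end_seq' comes from a potentially stale ooo_last_skb, so treat it with
 * suspicion: reject zero (likely a freed/zeroed skb) and clamp anything
 * beyond the receive window.  Returns 0 if the caller should fall back to
 * rcv_nxt. */
static __always_inline __u32 sanitize_ooo_seq_limit(__u32 end_seq, __u32 rcv_nxt,
						    __u32 rcv_wnd)
{
	__u32 wnd_end = rcv_nxt + rcv_wnd;

	if (end_seq == 0)
		return 0;

	if ((__s32)(end_seq - wnd_end) > 0)
		return wnd_end; /* beyond the receive window: clamp */

	return end_seq;
}
```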
WARNING: this is not intended for the "upstream" cloudflare repo... will fix soon
Evaluating netstacklat TCP Head-of-Line (HoL) filtering for ebpf_exporter
This builds on top of netoptimizer#1