
Conversation

@jotak (Member) commented Jan 29, 2025

No description provided.

@jotak force-pushed the ebpf-perfs branch 2 times, most recently from 3be70e0 to 5d96778 on January 29, 2025 at 23:40
github-actions bot commented Jan 29, 2025

🙈 The PR is closed and the preview is expired.

@skrthomas (Contributor) left a comment


Nice blog, @jotak! Thanks for all this! I left a few copy edits for you to consider :)

authors: [jotak]
---

Last year, in 2024, there were a few R&D-related discussions at Red Hat around the eBPF Agent provided with NetObserv. One of these discussions focused especially on its performance and was a starting point to bring ideas to life. I'll take this opportunity to warmly thank Simone Ferlin-Reiter, Toke Hoiland Jorgensen, Mohamed S. Mahmoud and Donald Hunter for their contributions. Toke took the time to deep-dive into the code and shared his thoughts on potential improvements.

Contributor:
Suggested change:
- Last year, in 2024, there were a few R&D-related discussions at Red Hat around the eBPF Agent provided with NetObserv. One of these discussions focused especially on its performance and was a starting point to bring ideas to life. I'll take this opportunity to warmly thank Simone Ferlin-Reiter, Toke Hoiland Jorgensen, Mohamed S. Mahmoud and Donald Hunter for their contributions. Toke took the time to deep-dive into the code and shared his thoughts on potential improvements.
+ Last year, in 2024, there were a few R&D-related discussions at Red Hat around the eBPF Agent provided with Network Observability. One of these discussions focused especially on its performance and was a starting point to bring ideas to life. I'll take this opportunity to warmly thank Simone Ferlin-Reiter, Toke Hoiland Jorgensen, Mohamed S. Mahmoud and Donald Hunter for their contributions. Toke took the time to deep-dive into the code and shared his thoughts on potential improvements.

Contributor:

Global comment: Network Observability instead of NetObserv?

@jotak (Member Author):

On the upstream blog, it's rather NetObserv (the project name). If it ends up on the Red Hat blog, it will be Network Observability.

@memodi (Member) left a comment


Nice technical read, @jotak. Thanks! A few minor comments/questions from me.

Comment on lines 36 to 37
- Medium load: 200 distributed workers rate-limited at 50 qps
- High load: 900 distributed workers rate-limited at 50 qps

Member:

Does "workers" here mean "workloads"?

@jotak (Member Author):

No, it's threads (or goroutines, to be precise) within a single pod: it's the -c flag from hey: https://github.com/rakyll/hey/blob/master/README.md#usage
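
(For illustration: a medium-load run along those lines would look like `hey -z 60s -c 200 -q 50 http://target-service`, i.e. 200 concurrent worker goroutines in a single process, each rate-limited at 50 queries per second; the duration and target here are made up, not the exact test parameters.)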

0.085 [1] |
0.097 [0] |
0.109 [0] |
0.121 [50] |

Contributor:

Is each row the number of queries (HTTP requests?) in square brackets, with the stated response time (first number)?

Is it because 0.013 is so dominant that the other values (even 538) are just a single "|"?

What does 0.097 [0] and 0.109 [0] mean?

@jotak (Member Author):

Rows are the histogram buckets, and yes, what's between brackets is the number of requests/responses in each bucket. So the vast majority of requests indeed fall between 0.001 and 0.013 seconds.

> Is it because 0.013 is so dominant that the other values (even 538) are just a single "|"?

Yes, it seems so.

> What does 0.097 [0] and 0.109 [0] mean?

There were 0 requests between 0.085 and 0.097 seconds, and also 0 between 0.097 and 0.109.

Contributor:

Okay, I was wondering why it even lists the "0 requests" unless these are preset time ranges.

@jotak (Member Author):

Well, the buckets are not always the same, so I believe they're adapted dynamically during the test; I'm not sure exactly how.
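
(The bounds in the output above are consistent with buckets spaced linearly between the fastest and slowest observed latencies. Here is a minimal sketch of that idea; an assumption about hey's behavior, not its actual code:)

```go
package main

import "fmt"

// linearBuckets spaces n+1 bucket bounds evenly between the fastest and
// slowest observed latencies, so the bounds change from one run to the
// next. Illustrative only; not hey's actual implementation.
func linearBuckets(fastest, slowest float64, n int) []float64 {
	step := (slowest - fastest) / float64(n)
	bounds := make([]float64, n+1)
	for i := range bounds {
		bounds[i] = fastest + step*float64(i)
	}
	return bounds
}

func main() {
	// With fastest=0.001s and slowest=0.121s this yields
	// 0.001, 0.013, 0.025, ..., 0.097, 0.109, 0.121,
	// matching the bounds shown in the histogram above.
	fmt.Println(linearBuckets(0.001, 0.121, 10))
}
```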

Contributor:

/lgtm


Today, we change all of that by refactoring our data structures. Our solution is to avoid writing the flow map from the enrichment hooks. Instead, we are introducing a new map, dedicated to enriched data. Map keys are going to be duplicated across those two maps, but it's a lesser evil. Now we can change the main map to be shared across CPUs, with a spinlock. We still have some reassembling work to do in the user space, to merge the main map with the enrichment map, but it is more straightforward than merging entire flows together. We also have a couple of ideas to further improve this process; more on that later.

Splitting data between a main map and an enrichment map has another benefit: when no enrichment is needed (e.g. when none of the [agent features](https://github.com/netobserv/network-observability-operator/blob/main/docs/FlowCollector.md#flowcollectorspecagentebpf-1) are enabled), no memory is allocated for them, resulting — again — in a more efficient memory usage.

Contributor:

This isn't the case, because the interface list is in the additional map, so it will always be allocated.

@jotak (Member Author):

Remember, we moved it to the main map: netobserv/netobserv-ebpf-agent#509 :-)

Contributor:

True, but the map is still allocated even with no features; I don't think we changed that back?

@jotak (Member Author) commented Feb 10, 2025:

The map is not really allocated, since it has BPF_F_NO_PREALLOC: its size depends on its content.
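
(To make that concrete, here is roughly what the flag looks like in cilium/ebpf terms. The map name, key/value sizes and entry count below are hypothetical, not the agent's actual definitions:)

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical enrichment map spec: with BPF_F_NO_PREALLOC the
	// kernel allocates entries lazily, so an empty map costs almost
	// nothing even though max_entries is large.
	spec := &ebpf.MapSpec{
		Name:       "enrich_map", // illustrative name
		Type:       ebpf.Hash,
		KeySize:    8,      // e.g. a simplified flow id
		ValueSize:  16,     // e.g. a small enrichment struct
		MaxEntries: 100000, // upper bound, not preallocated
		Flags:      unix.BPF_F_NO_PREALLOC,
	}
	m, err := ebpf.NewMap(spec)
	if err != nil {
		log.Fatal(err)
	}
	defer m.Close()
}
```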


Those changes triggered a refactoring that doesn't come without consequences and tradeoffs.

- Most importantly, **partial flows**: because there are now two different maps, one for the main flows, generated from the TC hook, and another for the enrichments, generated from other hooks, there can sometimes be a mismatch between the two. When an enrichment map key isn't found in the flows map, the result is a partial flow, that is, a flow that lacks some information, namely the TCP flags, the MAC addresses, and the bytes and packets counters. It doesn't mean these values are entirely lost; you may still find them in an adjacent flow. This is because flows are evicted periodically, and an eviction might occur precisely at the _wrong moment_ (it's a race condition), with only partial data being available at that time. Another cause for partial flows is when the agent is configured with a sampling rate greater than one: if, for some reason, the enrichment data is sampled but the corresponding TC hook packet isn't, this also results in a partial flow.

Contributor:

I never really digested this: for some hooks, like network events, there must be an entry in the flow map, and the sampling is global, so I fail to explain why we can end up with partial flows in such cases. Maybe other hooks can run before the TC one, but for a hook like this it shouldn't happen. I feel this is something that needs to be further analyzed.
I know this is not related to the blog, but I never got to share this concern clearly, and took this change as an opportunity to do so.

@jotak (Member Author):

I think the issue comes from do_sampling, which is not set-then-read in a fully predictable/synchronized way. I mean, I think it can be set to 1 for packet A, then set back to 0 for the next packet B, even before packet A reaches the other hooks. I'm not totally sure, but I wonder if it's something along those lines.

@jotak (Member Author) commented Feb 7, 2025:

That's something the mark field could be useful for, to flag whether a packet is sampled or not, if only it wasn't so overloaded ¯\_(ツ)_/¯
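
(To make the partial-flow scenario above concrete, here is a minimal sketch of a user-space merge for the two-map design; all names are hypothetical simplifications, not the actual agent code. An enrichment key with no counterpart in the main map becomes a partial flow:)

```go
package main

// Hypothetical, simplified stand-ins for the agent's flow id, metrics
// and enrichment records (not the actual NetObserv structs).
type FlowID uint64

type FlowMetrics struct {
	Packets, Bytes uint64
	TCPFlags       uint16
}

type Enrichment struct {
	DNSLatencyNs uint64
}

type Flow struct {
	ID      FlowID
	Metrics *FlowMetrics // nil when only enrichment data was seen
	Enrich  *Enrichment
	Partial bool
}

// mergeMaps joins the main flow map with the enrichment map after an
// eviction. An enrichment key missing from the main map (eviction race,
// or asymmetric sampling) yields a partial flow: no TCP flags, MACs, or
// byte/packet counters for this eviction cycle.
func mergeMaps(main map[FlowID]FlowMetrics, enrich map[FlowID]Enrichment) []Flow {
	flows := make([]Flow, 0, len(main))
	for id, m := range main {
		metrics := m
		f := Flow{ID: id, Metrics: &metrics}
		if e, ok := enrich[id]; ok {
			f.Enrich = &e
		}
		flows = append(flows, f)
	}
	for id, e := range enrich {
		if _, ok := main[id]; !ok {
			enr := e
			flows = append(flows, Flow{ID: id, Enrich: &enr, Partial: true})
		}
	}
	return flows
}

func main() {}
```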


[Previously](https://opensource.com/article/22/8/ebpf-network-observability-cloud), we found that a hash map is more relevant than a ringbuffer for forwarding flows to the user space. This is mostly because the hash map allows in-kernel aggregation, thus reducing the size of the data to copy, whereas the ringbuffer is for stateless usage, transferring data as soon as it comes, at the risk of copying duplicated parts again and again.

However, now that we have split the data into two maps, while the above certainly remains true for the main flows map, we can't say the same for the enrichment map. Enriched data is less subject to in-kernel aggregation, and the enrichment-specialized hooks are triggered less often. So it's surely worth re-evaluating in that context.

Contributor:

pktDrop statistics and events may still be an exception to this statement, right?

@jotak (Member Author):

Why do you think so? Because they're too frequent?

Contributor:

Because we maintain counters in that drop record, which need aggregation.

@jotak (Member Author):

Yes, so maybe it is less likely to have a positive impact, but that doesn't mean it can't be re-evaluated either. Drops are still much less frequent than non-drops, so I can imagine they're less likely to trigger an actual aggregation during a 5s time frame. Nothing is sure here, but the purpose of evaluating it is precisely to know it better.
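
(For reference when re-evaluating: consuming a BPF_MAP_TYPE_RINGBUF from Go with cilium/ebpf looks roughly like the sketch below; the wiring and the existence of a dedicated enrichment-events ringbuf are assumptions, not the agent's actual code. Every record is copied to user space individually, with no in-kernel aggregation, which is why this path mainly pays off for infrequent events:)

```go
package main

import (
	"errors"
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/ringbuf"
)

// readEnrichmentEvents drains a hypothetical ring buffer of enrichment
// events. Unlike the hash map path, each record arrives as-is:
// stateless and simple, but duplicates are not aggregated in-kernel.
func readEnrichmentEvents(m *ebpf.Map) {
	rd, err := ringbuf.NewReader(m)
	if err != nil {
		log.Fatal(err)
	}
	defer rd.Close()

	for {
		rec, err := rd.Read()
		if err != nil {
			if errors.Is(err, ringbuf.ErrClosed) {
				return
			}
			continue
		}
		// rec.RawSample holds one event's bytes, to be decoded into an
		// enrichment struct and merged with the main flow data.
		_ = rec.RawSample
	}
}

func main() {}
```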

@jotak merged commit 8248592 into netobserv:main on Feb 13, 2025
1 check passed
eBPF provides several [structures](https://docs.ebpf.io/linux/map-type/) that allow sharing data between the kernel space and the user space. Hash maps are especially useful as they allow for data persistence across calls and hook points. Thanks to hash maps, NetObserv can keep track of flows in the kernel space, aggregate new packets observed there, and thus limit the size of data transfer between the kernel and the user spaces for better performance.

Among the different map types, there is [BPF_MAP_TYPE_HASH](https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_HASH/) and [BPF_MAP_TYPE_PERCPU_HASH](https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_PERCPU_HASH/).

@stleerh (Contributor) commented Feb 17, 2025:

I know this PR is already merged, but I have another comment now.

The diagram for Per-CPU hash map shows a single hash map, but in fact, according to BPF_MAP_TYPE_PERCPU_HASH, it is actually "a separate hash map for each logical CPU". It should look like the diagram on the left, but one for each CPU. Retrieving a value is the same regardless of which model you choose, because you are given the right hash map to begin with when you are in a function, whereas the Per-CPU diagram seems to imply that the value is a list.

@jotak (Member Author):

I am not exactly sure how the internal implementation works, but I can tell that when per-CPU maps are fetched from the user space, it's a list that you get: see the diff here: https://github.com/netobserv/netobserv-ebpf-agent/pull/469/files#diff-daa370b7c53b55b200b08ed318bef1a61e5838f5f5c5ded3442c83caf4e96f54L726
Previously, we fetched map[ebpf.BpfFlowId][]ebpf.BpfFlowMetrics (so, a list of BpfFlowMetrics - one per CPU), and now, without the per-CPU map, we get map[ebpf.BpfFlowId]model.BpfFlowContent, so no list.
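
(Sketched in cilium/ebpf terms, with simplified stand-ins for the generated types; not the agent's actual code:)

```go
package main

import "github.com/cilium/ebpf"

// Simplified stand-ins for the bpf2go-generated flow types.
type BpfFlowID struct {
	SrcPort uint16
	DstPort uint16
	Proto   uint8
}

type BpfFlowMetrics struct {
	Packets, Bytes uint64
}

// With BPF_MAP_TYPE_PERCPU_HASH, a user-space lookup returns one value
// per possible CPU, which then needs to be aggregated:
func lookupPerCPU(m *ebpf.Map, key BpfFlowID) (BpfFlowMetrics, error) {
	var perCPU []BpfFlowMetrics
	if err := m.Lookup(key, &perCPU); err != nil {
		return BpfFlowMetrics{}, err
	}
	var total BpfFlowMetrics
	for _, v := range perCPU {
		total.Packets += v.Packets
		total.Bytes += v.Bytes
	}
	return total, nil
}

// With a plain BPF_MAP_TYPE_HASH shared across CPUs, the same lookup
// yields a single, already-aggregated value:
func lookupShared(m *ebpf.Map, key BpfFlowID) (BpfFlowMetrics, error) {
	var v BpfFlowMetrics
	err := m.Lookup(key, &v)
	return v, err
}

func main() {}
```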

@jotak (Member Author):

But maybe my representation here is only valid from a user-space perspective, using the provided bpf helpers, with some transformation happening under the hood.
