
Conversation

@jotak (Member) commented Jan 29, 2025

No description provided.

@jotak force-pushed the ebpf-perfs branch 2 times, most recently from 3be70e0 to 5d96778 on January 29, 2025 at 23:40
github-actions bot commented Jan 29, 2025

🙈 The PR is closed and the preview is expired.

@skrthomas (Contributor) left a comment


Nice blog, @jotak! Thanks for all this! I left a few copy edits for you to consider :)

authors: [jotak]
---

Last year, in 2024, there were a few R&D-related discussions at Red Hat around the eBPF Agent provided with NetObserv. One of these discussions focused especially on its performance and was a starting point to bring ideas to life. I'll take this opportunity to warmly thank Simone Ferlin-Reiter, Toke Hoiland Jorgensen, Mohamed S. Mahmoud and Donald Hunter for their contributions. Toke took the time to deep-dive into the code and shared his thoughts on potential improvements.

Contributor:
Suggested change:
- Last year, in 2024, there were a few R&D-related discussions at Red Hat around the eBPF Agent provided with NetObserv. One of these discussions focused especially on its performance and was a starting point to bring ideas to life. I'll take this opportunity to warmly thank Simone Ferlin-Reiter, Toke Hoiland Jorgensen, Mohamed S. Mahmoud and Donald Hunter for their contributions. Toke took the time to deep-dive into the code and shared his thoughts on potential improvements.
+ Last year, in 2024, there were a few R&D-related discussions at Red Hat around the eBPF Agent provided with Network Observability. One of these discussions focused especially on its performance and was a starting point to bring ideas to life. I'll take this opportunity to warmly thank Simone Ferlin-Reiter, Toke Hoiland Jorgensen, Mohamed S. Mahmoud and Donald Hunter for their contributions. Toke took the time to deep-dive into the code and shared his thoughts on potential improvements.

Contributor:

Global comment: Network Observability instead of NetObserv?

@jotak (Member Author):

On the upstream blog, it's rather NetObserv (the project name). If it ends up on the Red Hat blog, it will be Network Observability.

@memodi (Member) left a comment


Nice technical read, @jotak. Thanks! A few minor comments/questions from me.

Comment on lines 36 to 37
- Medium load: 200 distributed workers rate-limited at 50 qps
- High load: 900 distributed workers rate-limited at 50 qps

Member:

Does "workers" here mean "workloads"?

@jotak (Member Author):

No, it's threads (or goroutines, to be precise) within a single pod: it's the -c flag from hey: https://github.com/rakyll/hey/blob/master/README.md#usage
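
(For illustration: a medium-load run along those lines would look like `hey -z 60s -c 200 -q 50 http://target-service`, i.e. 200 concurrent worker goroutines in a single process, each rate-limited at 50 queries per second; the duration and target here are made up, not the exact test parameters.)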

0.085 [1] |
0.097 [0] |
0.109 [0] |
0.121 [50] |

Contributor:

Is each row the number of queries (HTTP requests?) in square brackets, with the stated response time (first number)?

Is it because 0.013 is so dominant that the other values (even 538) are just a single "|"?

What does 0.097 [0] and 0.109 [0] mean?

@jotak (Member Author):

Rows are the histogram buckets, and yes, what's between brackets is the number of requests/responses in each bucket. So the vast majority of requests indeed fall between 0.001 and 0.013 seconds.

> Is it because 0.013 is so dominant that the other values (even 538) are just a single "|"?

Yes, it seems so.

> What does 0.097 [0] and 0.109 [0] mean?

There were 0 requests between 0.085 and 0.097 seconds, and also 0 between 0.097 and 0.109.

Contributor:

Okay, I was wondering why it even lists the "0 requests" unless these are preset time ranges.

@jotak (Member Author):

Well, the buckets are not always the same, so I believe they're adapted dynamically during the test; I'm not sure exactly how.
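
(The bounds in the output above are consistent with buckets spaced linearly between the fastest and slowest observed latencies. Here is a minimal sketch of that idea; an assumption about hey's behavior, not its actual code:)

```go
package main

import "fmt"

// linearBuckets spaces n+1 bucket bounds evenly between the fastest and
// slowest observed latencies, so the bounds change from one run to the
// next. Illustrative only; not hey's actual implementation.
func linearBuckets(fastest, slowest float64, n int) []float64 {
	step := (slowest - fastest) / float64(n)
	bounds := make([]float64, n+1)
	for i := range bounds {
		bounds[i] = fastest + step*float64(i)
	}
	return bounds
}

func main() {
	// With fastest=0.001s and slowest=0.121s this yields
	// 0.001, 0.013, 0.025, ..., 0.097, 0.109, 0.121,
	// matching the bounds shown in the histogram above.
	fmt.Println(linearBuckets(0.001, 0.121, 10))
}
```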

Contributor:

/lgtm


Today, we change all of that by refactoring our data structures. Our solution is to avoid writing the flow map from the enrichment hooks. Instead, we are introducing a new map, dedicated to enriched data. Map keys are going to be duplicated across those two maps, but it's a lesser evil. Now we can change the main map to be shared across CPUs, with a spinlock. We still have some reassembling work to do in the user space, to merge the main map with the enrichment map, but it is more straightforward than merging entire flows together. We also have a couple of ideas to further improve this process; more on that later.

Splitting data between a main map and an enrichment map has another benefit: when no enrichment is needed (e.g. when none of the [agent features](https://github.com/netobserv/network-observability-operator/blob/main/docs/FlowCollector.md#flowcollectorspecagentebpf-1) are enabled), no memory is allocated for them, resulting — again — in a more efficient memory usage.

Contributor:

This isn't the case, because the interface list is in the additional map, so it will always be allocated.

@jotak (Member Author):

Remember, we moved it to the main map: netobserv/netobserv-ebpf-agent#509 :-)

Contributor:

True, but the map is still allocated even with no features; I don't think we changed that back?

@jotak (Member Author) commented Feb 10, 2025:

The map is not really allocated, since it has BPF_F_NO_PREALLOC: its size depends on its content.
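
(To make that concrete, here is roughly what the flag looks like in cilium/ebpf terms. The map name, key/value sizes and entry count below are hypothetical, not the agent's actual definitions:)

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical enrichment map spec: with BPF_F_NO_PREALLOC the
	// kernel allocates entries lazily, so an empty map costs almost
	// nothing even though max_entries is large.
	spec := &ebpf.MapSpec{
		Name:       "enrich_map", // illustrative name
		Type:       ebpf.Hash,
		KeySize:    8,      // e.g. a simplified flow id
		ValueSize:  16,     // e.g. a small enrichment struct
		MaxEntries: 100000, // upper bound, not preallocated
		Flags:      unix.BPF_F_NO_PREALLOC,
	}
	m, err := ebpf.NewMap(spec)
	if err != nil {
		log.Fatal(err)
	}
	defer m.Close()
}
```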


Those changes triggered a refactoring that doesn't come without consequences and tradeoffs.

- Most importantly, **partial flows**: because there are now two different maps, one for the main flows, generated from the TC hook, and another for the enrichments, generated from other hooks, there can sometimes be a mismatch between the two. When an enrichment map key isn't found in the flows map, the result is a partial flow, that is, a flow that lacks some information, namely the TCP flags, the MAC addresses, and the bytes and packets counters. It doesn't mean these values are entirely lost; you may still find them in an adjacent flow. This is because flows are evicted periodically, and an eviction might occur precisely at the _wrong moment_ (it's a race condition), with only partial data being available at that time. Another cause for partial flows is when the agent is configured with a sampling rate greater than one: if, for some reason, the enrichment data is sampled but the corresponding TC hook packet isn't, this also results in a partial flow.

Contributor:

I never really digested this: for some hooks, like network events, there must be an entry in the flow map, and the sampling is global, so I fail to explain why we can end up with partial flows in such cases. Maybe other hooks can run before the TC one, but for a hook like this it shouldn't happen. I feel this is something that needs to be further analyzed.
I know this is not related to the blog, but I never got to share this concern clearly, and took this change as an opportunity to do so.

@jotak (Member Author):

I think the issue comes from do_sampling, which is not set-then-read in a fully predictable/synchronized way. I mean, I think it can be set to 1 for packet A, then set back to 0 for the next packet B, even before packet A reaches the other hooks. I'm not totally sure, but I wonder if it's something along those lines.

@jotak (Member Author) commented Feb 7, 2025:

That's something the mark field could be useful for, to flag whether a packet is sampled or not, if only it wasn't so overloaded ¯\_(ツ)_/¯
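
(To make the partial-flow scenario above concrete, here is a minimal sketch of a user-space merge for the two-map design; all names are hypothetical simplifications, not the actual agent code. An enrichment key with no counterpart in the main map becomes a partial flow:)

```go
package main

// Hypothetical, simplified stand-ins for the agent's flow id, metrics
// and enrichment records (not the actual NetObserv structs).
type FlowID uint64

type FlowMetrics struct {
	Packets, Bytes uint64
	TCPFlags       uint16
}

type Enrichment struct {
	DNSLatencyNs uint64
}

type Flow struct {
	ID      FlowID
	Metrics *FlowMetrics // nil when only enrichment data was seen
	Enrich  *Enrichment
	Partial bool
}

// mergeMaps joins the main flow map with the enrichment map after an
// eviction. An enrichment key missing from the main map (eviction race,
// or asymmetric sampling) yields a partial flow: no TCP flags, MACs, or
// byte/packet counters for this eviction cycle.
func mergeMaps(main map[FlowID]FlowMetrics, enrich map[FlowID]Enrichment) []Flow {
	flows := make([]Flow, 0, len(main))
	for id, m := range main {
		metrics := m
		f := Flow{ID: id, Metrics: &metrics}
		if e, ok := enrich[id]; ok {
			f.Enrich = &e
		}
		flows = append(flows, f)
	}
	for id, e := range enrich {
		if _, ok := main[id]; !ok {
			enr := e
			flows = append(flows, Flow{ID: id, Enrich: &enr, Partial: true})
		}
	}
	return flows
}

func main() {}
```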


[Previously](https://opensource.com/article/22/8/ebpf-network-observability-cloud), we found that a hash map is more relevant than a ringbuffer for forwarding flows to the user space. This is mostly because the hash map allows in-kernel aggregation, thus reducing the size of the data to copy, whereas the ringbuffer is for stateless usage, transferring data as soon as it comes, at the risk of copying duplicated parts again and again.

However, now that we have split the data into two maps, while the above certainly remains true for the main flows map, we can't say the same for the enrichment map. Enriched data is less subject to in-kernel aggregation, and the enrichment-specialized hooks are triggered less often. So it's surely worth re-evaluating in that context.

Contributor:

pktDrop statistics and events may still be an exception to this statement, right?

@jotak (Member Author):

Why do you think so? Because they're too frequent?

Contributor:

Because we maintain counters in that drop record, which need aggregation.

@jotak (Member Author):

Yes, so maybe it is less likely to have a positive impact, but that doesn't mean it can't be re-evaluated either. Drops are still much less frequent than non-drops, so I can imagine they're less likely to trigger an actual aggregation during a 5s time frame. Nothing is sure here, but the purpose of evaluating it is precisely to know it better.
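
(For reference when re-evaluating: consuming a BPF_MAP_TYPE_RINGBUF from Go with cilium/ebpf looks roughly like the sketch below; the wiring and the existence of a dedicated enrichment-events ringbuf are assumptions, not the agent's actual code. Every record is copied to user space individually, with no in-kernel aggregation, which is why this path mainly pays off for infrequent events:)

```go
package main

import (
	"errors"
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/ringbuf"
)

// readEnrichmentEvents drains a hypothetical ring buffer of enrichment
// events. Unlike the hash map path, each record arrives as-is:
// stateless and simple, but duplicates are not aggregated in-kernel.
func readEnrichmentEvents(m *ebpf.Map) {
	rd, err := ringbuf.NewReader(m)
	if err != nil {
		log.Fatal(err)
	}
	defer rd.Close()

	for {
		rec, err := rd.Read()
		if err != nil {
			if errors.Is(err, ringbuf.ErrClosed) {
				return
			}
			continue
		}
		// rec.RawSample holds one event's bytes, to be decoded into an
		// enrichment struct and merged with the main flow data.
		_ = rec.RawSample
	}
}

func main() {}
```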

@jotak merged commit 8248592 into netobserv:main on Feb 13, 2025
1 check passed
eBPF provides several [structures](https://docs.ebpf.io/linux/map-type/) that allow sharing data between the kernel space and the user space. Hash maps are especially useful as they allow for data persistence across calls and hook points. Thanks to hash maps, NetObserv can keep track of flows in the kernel space, aggregate new packets observed there, and thus limit the size of data transfer between the kernel and the user spaces for better performance.

Among the different map types, there is [BPF_MAP_TYPE_HASH](https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_HASH/) and [BPF_MAP_TYPE_PERCPU_HASH](https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_PERCPU_HASH/).

@stleerh (Contributor) commented Feb 17, 2025:

I know this PR is already merged, but I have another comment now.

The diagram for Per-CPU hash map shows a single hash map, but in fact, according to BPF_MAP_TYPE_PERCPU_HASH, it is actually "a separate hash map for each logical CPU". It should look like the diagram on the left, but one for each CPU. Retrieving a value is the same regardless of which model you choose, because you are given the right hash map to begin with when you are in a function, whereas the Per-CPU diagram seems to imply that the value is a list.

@jotak (Member Author):

I am not exactly sure how the internal implementation works, but I can tell that when per-CPU maps are fetched from the user space, it's a list that you get: see the diff here: https://github.com/netobserv/netobserv-ebpf-agent/pull/469/files#diff-daa370b7c53b55b200b08ed318bef1a61e5838f5f5c5ded3442c83caf4e96f54L726
Previously, we fetched map[ebpf.BpfFlowId][]ebpf.BpfFlowMetrics (so, a list of BpfFlowMetrics - one per CPU), and now, without the per-CPU map, we get map[ebpf.BpfFlowId]model.BpfFlowContent, so no list.
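
(Sketched in cilium/ebpf terms, with simplified stand-ins for the generated types; not the agent's actual code:)

```go
package main

import "github.com/cilium/ebpf"

// Simplified stand-ins for the bpf2go-generated flow types.
type BpfFlowID struct {
	SrcPort uint16
	DstPort uint16
	Proto   uint8
}

type BpfFlowMetrics struct {
	Packets, Bytes uint64
}

// With BPF_MAP_TYPE_PERCPU_HASH, a user-space lookup returns one value
// per possible CPU, which then needs to be aggregated:
func lookupPerCPU(m *ebpf.Map, key BpfFlowID) (BpfFlowMetrics, error) {
	var perCPU []BpfFlowMetrics
	if err := m.Lookup(key, &perCPU); err != nil {
		return BpfFlowMetrics{}, err
	}
	var total BpfFlowMetrics
	for _, v := range perCPU {
		total.Packets += v.Packets
		total.Bytes += v.Bytes
	}
	return total, nil
}

// With a plain BPF_MAP_TYPE_HASH shared across CPUs, the same lookup
// yields a single, already-aggregated value:
func lookupShared(m *ebpf.Map, key BpfFlowID) (BpfFlowMetrics, error) {
	var v BpfFlowMetrics
	err := m.Lookup(key, &v)
	return v, err
}

func main() {}
```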

@jotak (Member Author):

But maybe my representation here is only valid from a user-space perspective, using the provided bpf helpers, with some transformation happening under the hood.
