ebpf perfs 1.8 #7
Conversation
3be70e0 to 5d96778 (Compare)
Nice blog, @jotak! Thanks for all this! I left a few copy edits for you to consider :)
authors: [jotak]
---

Last year, 2024, there has been a few discussions at Red Hat, R&D related, around the eBPF Agent provided with NetObserv. One of these discussions focused especially on its performances and was a starting point to bring life to ideas. I'll take this opportunity to warmly thank Simone Ferlin-Reiter, Toke Hoiland Jorgensen, Mohamed S. Mahmoud and Donald Hunter for their contributions. Toke took the time to deep-dive in the code and shared his thoughts on the potential improvements.
Suggested change: "…the eBPF Agent provided with NetObserv…" → "…the eBPF Agent provided with Network Observability…"
Global comment: Network Observability instead of NetObserv?
On the upstream blog, it's more NetObserv (the project name). If it ends up on the Red Hat blog, it will be Network Observability.
Nice technical read, @jotak. Thanks! A few minor comments/questions from me.
- Medium load: 200 distributed workers rate-limited at 50 qps
- High load: 900 distributed workers rate-limited at 50 qps
does "workers" here mean "workloads"?
No, it's threads (or goroutines, to be precise) within a single pod: it's the -c flag from hey: https://github.com/rakyll/hey/blob/master/README.md#usage
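For illustration, the "medium load" scenario above would correspond to something like this hey invocation (the target URL and duration are hypothetical, not from the actual test setup):

```shell
# 200 concurrent workers (-c), each rate-limited to 50 queries per second (-q),
# for a fixed duration (-z), against a hypothetical in-cluster service
hey -c 200 -q 50 -z 60s http://sample-app.test.svc:8080/
```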
0.085 [1]  |
0.097 [0]  |
0.109 [0]  |
0.121 [50] |
Is each row the number of queries (HTTP requests?) in square brackets with the stated response time (first number)?
Is it because 0.013 is so dominant that the other values (even 538) are just a single "|"?
What do 0.097 [0] and 0.109 [0] mean?
Rows are the histogram buckets, and yes, I think what's between brackets is the number of requests/responses in each bucket.
So indeed, the vast majority of requests fall between 0.001 and 0.013 seconds.
> Is it because 0.013 is so dominant that the other values (even 538) are just a single "|"?

Yes, it seems so.
> What do 0.097 [0] and 0.109 [0] mean?

There were 0 requests between 0.085 and 0.097 seconds, and also 0 between 0.097 and 0.109.
Okay, I was wondering why it even lists the "0 requests" unless these are preset time ranges.
Well, the buckets are not always the same, so I believe they're adapted dynamically during the test; I'm not sure exactly how.
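For what it's worth, a common way to get such dynamic buckets is to split the [fastest, slowest] latency range into equal-width intervals. I haven't checked hey's exact implementation, but a sketch along those lines (sample values hypothetical) would be:

```go
package main

import "fmt"

// bucketize splits the [fastest, slowest] latency range into n equal-width
// buckets and counts how many samples fall into each. Empty buckets still
// appear with a count of 0, which would explain lines like "0.097 [0]".
func bucketize(samples []float64, fastest, slowest float64, n int) ([]float64, []int) {
	width := (slowest - fastest) / float64(n)
	edges := make([]float64, n+1)
	for i := range edges {
		edges[i] = fastest + float64(i)*width
	}
	counts := make([]int, n)
	for _, s := range samples {
		for i := 0; i < n; i++ {
			// The last bucket catches anything left over, to be safe
			// against floating-point rounding at the upper edge.
			if s <= edges[i+1] || i == n-1 {
				counts[i]++
				break
			}
		}
	}
	return edges[1:], counts
}

func main() {
	// Hypothetical latencies: most responses are fast, a few are slow.
	samples := []float64{0.002, 0.003, 0.004, 0.005, 0.120, 0.121}
	edges, counts := bucketize(samples, 0.001, 0.121, 10)
	for i := range counts {
		fmt.Printf("%.3f [%d]\n", edges[i], counts[i])
	}
}
```

Empty buckets like "0.097 [0]" would then simply mean that no sample fell into that interval.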
/lgtm
Today, we change all of that by refactoring our data structures. Our solution is to avoid writing to the flow map from the enrichment hooks. Instead, we introduce a new map, dedicated to enriched data. Map keys are duplicated across the two maps, but it's a lesser evil. So now, we can change the main map to be a single one shared across CPUs, protected with a spinlock. We still have some reassembly work to do in user space to merge the main map with the enrichment map, but it is more straightforward than merging entire flows together. We also have a couple of ideas to further improve this process; more on that later.
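To give a rough idea of that user-space reassembly (all type and field names here are hypothetical placeholders, not the agent's actual structures), merging the two maps by flow ID could look like:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the kernel-side records.
type FlowID struct {
	SrcIP, DstIP     string
	SrcPort, DstPort uint16
}

type FlowMetrics struct {
	Bytes, Packets uint64
}

type Enrichment struct {
	DropCause string
}

type Flow struct {
	ID      FlowID
	Metrics FlowMetrics
	Enrich  *Enrichment
	Partial bool // enrichment existed but the main entry was missing
}

// merge joins the main flow map (from the TC hook) with the enrichment map
// (from the other hooks), by flow ID.
func merge(mainMap map[FlowID]FlowMetrics, enrich map[FlowID]Enrichment) []Flow {
	var out []Flow
	for id, m := range mainMap {
		f := Flow{ID: id, Metrics: m}
		if e, ok := enrich[id]; ok {
			ec := e
			f.Enrich = &ec
		}
		out = append(out, f)
	}
	// Enrichment keys with no matching main entry yield "partial" flows,
	// lacking counters, TCP flags, MAC addresses, etc.
	for id, e := range enrich {
		if _, ok := mainMap[id]; !ok {
			ec := e
			out = append(out, Flow{ID: id, Enrich: &ec, Partial: true})
		}
	}
	return out
}

func main() {
	id1 := FlowID{"10.0.0.1", "10.0.0.2", 1234, 80}
	id2 := FlowID{"10.0.0.3", "10.0.0.4", 5678, 443}
	mainMap := map[FlowID]FlowMetrics{id1: {Bytes: 1500, Packets: 3}}
	enrich := map[FlowID]Enrichment{id2: {DropCause: "NO_SOCKET"}}
	for _, f := range merge(mainMap, enrich) {
		fmt.Printf("flow %v partial=%v\n", f.ID, f.Partial)
	}
}
```

The second loop is where partial flows appear: an enrichment entry whose key has no counterpart in the main map.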
Splitting data between a main map and an enrichment map has another benefit: when no enrichment is needed (e.g. when none of the [agent features](https://github.com/netobserv/network-observability-operator/blob/main/docs/FlowCollector.md#flowcollectorspecagentebpf-1) are enabled), no memory is allocated for it, resulting, again, in more efficient memory usage.
This isn't the case, because the interface list is in the additional map, so it will always be allocated.
remember, we moved it to the main map: netobserv/netobserv-ebpf-agent#509 :-)
True, but the map is still allocated even with no features; I don't think we changed that back?
The map is not really preallocated since it has BPF_F_NO_PREALLOC, so its size depends on its content.
Those changes triggered a refactoring that doesn't come without consequences and tradeoffs.
- Most importantly, **partial flows**: because there are now two different maps, one for the main flows, generated from the TC hook, and another for the enrichments, generated from other hooks, there can sometimes be a mismatch between the two. When an enrichment map key isn't found in the flows map, the result is a partial flow, which is a flow that lacks some information, namely the TCP flags, the MAC addresses, and the bytes and packets counters. It doesn't mean these values are entirely lost; you may still find them in an adjacent flow. This happens because flows are evicted periodically, and an eviction might occur precisely at the _wrong moment_ (it's a race condition), with only partial data being available at that time. Another cause of partial flows is when the agent is configured with a sampling rate greater than one: if, for some reason, the enrichment data is sampled but the corresponding TC hook packet isn't, this also results in a partial flow.
I really never digested this. For some hooks, like network events, there must be an entry in the flow map, and the sampling is global, so I fail to explain why we can end up with partial flows in such cases. Maybe other hooks can run before the TC one, but a hook like this shouldn't; I feel this is something that needs to be analyzed further.
I know this is not related to the blog, but I just never got to share this concern clearly and took this change to do so.
I think the issue comes from do_sampling, which is not set-then-read in a fully predictable/synchronized way. I mean, I think it can be set to 1 for packet A then set back to 0 for the next packet B, even before packet A reaches the other hooks; I'm not totally sure, but I wonder if it's something along those lines.
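A deliberately simplified model of that suspicion (not the agent's actual code; the interleaving is made explicit and deterministic here, whereas in the kernel it would come from concurrent hook execution):

```go
package main

import "fmt"

// sharedFlagRace models a single shared per-packet sampling flag read by
// later hooks. The TC hooks for packets A and B run before A's enrichment
// hook, so B's decision overwrites A's before A ever reads it.
func sharedFlagRace(aSampled, bSampled bool) (decidedForA, seenByA bool) {
	var doSampling bool // the shared flag

	// TC hook for packet A: records its sampling decision in the flag.
	doSampling = aSampled
	// TC hook for packet B runs before A reaches its enrichment hooks,
	// overwriting the shared flag.
	doSampling = bSampled
	// Enrichment hook for packet A now reads the flag and sees B's value.
	return aSampled, doSampling
}

func main() {
	decided, seen := sharedFlagRace(true, false)
	fmt.Printf("packet A sampled at TC: %v, enrichment hook saw: %v\n", decided, seen)
	// The mismatch means A gets a TC record but no enrichment (or vice
	// versa), which would show up as a partial flow.
}
```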
That's something the mark field could be useful for, setting whether a packet is sampled or not, if only it wasn't too overloaded ¯\_(ツ)_/¯
[Previously](https://opensource.com/article/22/8/ebpf-network-observability-cloud) we found that a hash map is more relevant than a ring buffer for forwarding flows to the user space. This is mostly because the hash map allows in-kernel aggregation, thus reducing the size of the data to copy, whereas the ring buffer is for stateless usage, transferring data as soon as it comes, with the risk of copying duplicated parts again and again.
However, now that we have split the data into two maps, while the above certainly remains true for the main flows map, we can't say the same for the enrichment map. Enriched data is less subject to in-kernel aggregation, and the specialized enrichment hooks are triggered less often. So it's surely worth re-evaluating in that context.
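As a toy model of the difference (hypothetical types, nothing from the agent's real code): with a per-flow hash map, N packets of the same flow collapse into a single entry to copy at eviction time, whereas a ring buffer would carry one record per packet:

```go
package main

import "fmt"

type flowKey struct{ Src, Dst string }
type counters struct{ Bytes, Packets uint64 }
type pkt struct {
	Key  flowKey
	Size uint64
}

// aggregate mimics the in-kernel hash-map behavior: packets of the same
// flow collapse into a single entry holding running counters.
func aggregate(packets []pkt) map[flowKey]counters {
	agg := map[flowKey]counters{}
	for _, p := range packets {
		c := agg[p.Key]
		c.Bytes += p.Size
		c.Packets++
		agg[p.Key] = c
	}
	return agg
}

func main() {
	// 1000 packets belonging to a single flow.
	var packets []pkt
	for i := 0; i < 1000; i++ {
		packets = append(packets, pkt{flowKey{"10.0.0.1", "10.0.0.2"}, 1500})
	}
	agg := aggregate(packets)
	// A ring buffer would instead forward one record per packet.
	fmt.Printf("map entries to copy: %d, ring records to copy: %d\n", len(agg), len(packets))
}
```

For rarely-triggered enrichment hooks, that aggregation win shrinks, which is the point being made above.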
pktDrop statistics and events may still be an exception to this statement, right?
Why do you think so? Because they're too frequent?
Because we maintain counters in the drop record, which need aggregation.
Yes, so maybe it is less likely to have a positive impact, but that doesn't mean it can't be re-evaluated either. Drops are still much less frequent than non-drops, so I can imagine they're less likely to trigger an actual aggregation during a 5s time frame. Nothing is sure here, but the purpose of evaluating it is precisely to know better.
Co-authored-by: Julien Pinsonneau <[email protected]> Co-authored-by: Sara Thomas <[email protected]>
eBPF provides several [structures](https://docs.ebpf.io/linux/map-type/) that allow sharing data between the kernel space and the user space. Hash maps are especially useful as they allow for data persistence across calls and hook points. Thanks to hash maps, NetObserv can keep track of flows in the kernel space, aggregate new packets observed there, and thus limit the size of data transfers between the kernel and the user spaces for better performance.
Among the different map types, there are [BPF_MAP_TYPE_HASH](https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_HASH/) and [BPF_MAP_TYPE_PERCPU_HASH](https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_PERCPU_HASH/).
I know this PR is already merged, but I have another comment now.
The diagram for the per-CPU hash map shows a single hash map, but in fact, according to BPF_MAP_TYPE_PERCPU_HASH, it is actually "a separate hash map for each logical CPU". It should look like the diagram on the left, but with one map for each CPU. Retrieving a value is the same regardless of which model you choose, because you are given the right hash map to begin with when you are in a function, whereas the per-CPU diagram seems to imply that the value is a list.
I am not exactly sure how the internal implementation works, but I can tell that when per-CPU maps are fetched from the user space, it's a list that you get: see the diff here: https://github.com/netobserv/netobserv-ebpf-agent/pull/469/files#diff-daa370b7c53b55b200b08ed318bef1a61e5838f5f5c5ded3442c83caf4e96f54L726
Previously we fetched map[ebpf.BpfFlowId][]ebpf.BpfFlowMetrics (so, a list of BpfFlowMetrics, one per CPU), and now without the per-CPU map we have map[ebpf.BpfFlowId]model.BpfFlowContent, so no list.
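In other words, from user space each key of a per-CPU map comes back with one value per logical CPU, which then has to be reduced. A minimal sketch with placeholder types (not the actual ebpf.BpfFlowId / ebpf.BpfFlowMetrics ones):

```go
package main

import "fmt"

// Placeholder for the per-flow metrics; one instance per CPU when read
// from a per-CPU hash map.
type metrics struct {
	Bytes, Packets uint64
}

// reduce sums the per-CPU copies of each flow's metrics into one value,
// which is what user space has to do with a BPF_MAP_TYPE_PERCPU_HASH.
func reduce(perCPU map[string][]metrics) map[string]metrics {
	out := make(map[string]metrics)
	for flowID, values := range perCPU {
		var total metrics
		for _, v := range values { // one entry per logical CPU
			total.Bytes += v.Bytes
			total.Packets += v.Packets
		}
		out[flowID] = total
	}
	return out
}

func main() {
	// A flow whose packets were handled on 3 different CPUs.
	perCPU := map[string][]metrics{
		"flow-1": {{Bytes: 3000, Packets: 2}, {Bytes: 1500, Packets: 1}, {Bytes: 0, Packets: 0}},
	}
	fmt.Printf("%+v\n", reduce(perCPU))
}
```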
But maybe my representation here is only valid from a user-space perspective, using the provided bpf helpers, with some transformation happening under the hood.