-
Notifications
You must be signed in to change notification settings - Fork 57
Uniform debug print messages in ebpf code #941
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Giuseppe Ognibene <[email protected]>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #941 +/- ##
==========================================
+ Coverage 47.13% 55.85% +8.71%
==========================================
Files 255 255
Lines 22139 22139
==========================================
+ Hits 10436 12366 +1930
+ Misses 11041 8935 -2106
- Partials 662 838 +176
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Is there a way to avoid passing |
Hi @marctc I hope I understand the question. I pass |
grcevski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea, I'd propose that we think a bit more on how to do this that will have the best outcome. At the moment the log line is limited to 80 chars, since we send a lot of them and we want to capture as many as possible. I'm afraid that if we prefix the line with the function name, we'll end up losing valuable information, which is the values we keep.
|
Another thing to watch out is that kernel 5.15 and lower support only up to 3 arguments in the format, so adding one more will cause us to go on separate lines for many of our prints. Perhaps the best way to do this is to ensure that all probes have printed the function name on all entry points. |
There's a section at the end of the document about this problem. The solutions I could think of were splitting or hard-coding the function name.
Yes, but that's not enough. If there are two printouts with the same (or similar) message in two different probes, we could have a mix-up. The only way to ensure uniqueness is for each printout to print the function name. I had also thought about adding an additional argument to some functions so that the probe name would be passed and then printed, but it seemed like too much. |
Good point, I thought about this and wondered why do we send data to userspace via a ringbuffer instead of reading directly from the trace pipe? That just requires tracefs/debugfs to be mounted but I think debug will only be enabled in dev/test environments so it should be ok. That will have benefits such as:
Are there things I'm not taking into consideration for this approach? |
I don't how tracefs/debugfs works but sometimes we ask users to enable BPF logging to investigate bugs regardless their environment they use to run OBI. |
rafaelroquetto
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I think this will definitely help with the current state of the onion.
In addition to my inline comments, I think we need some sort of linter to validate this (for the cases which cannot be enforced by code), otherwise people will eventually and unintentionally deviate from these rules.
| if (invocation) { | ||
| bpf_dbg_printk( | ||
| "Found Go HTTP invocation, resetting the span id to %x%x", tcp->seq, tcp->ack); | ||
| bpf_dbg_printk("%s: found Go HTTP invocation, resetting the span id to seq=%x, ack=%x", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is a mistake - the span id here is indeed meant to be the combination of %x%x, i.e. the concatenation of seq and ack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ops, thanks!
| Addressing issues like logging multiple variables without a separator (e.g., avoiding `%x%x`). Example: | ||
|
|
||
| ```c | ||
| bpf_dbg_printk("%s: found HTTP info, resetting the span id to seq=%x, ack=%x", __FUNCTION__, tcp->seq, tcp->ack); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get the point, but this is a bad example (see previous comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, thanks
| At the beginning of an eBPF probe (not every probe, as it can be too verbose\!), print the function name in the following way: | ||
|
|
||
| ```c | ||
| bpf_dbg_printk("=== %s ===", __FUNCTION__); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could wrap this in a macro, so that one simply writes:
bpf_dbg_log_func()
|
|
||
| #### Function name in generic functions | ||
|
|
||
| To print the function name in a generic function, use the **`__FUNCTION__`** identifier in **every log statement** without the bounding **`===`**. The triple equals signs are reserved exclusively for entry points of eBPF probes. Example: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the value here, but there's an issue as others have pointed out: some kernels only allow at most 3 arguments, and now we are "wasting" one with __FUNCTION__ - this has been a problem before, and you will notice that because of that some bpf_dbg_printk statements had to be split into at least 2 as a workaround.
This why I've explicitly opted to spell out the function name as part of the format string rather than using __FUNCTION__ in tpinjector.c and friends. I think it's a fair trade-off, given we are dealing with eBPF, but YMMV.
Now, if we want something like that anyway, we could probably have a macro like this:
#define bpf_func_printk(fmt, ...) \
bpf_printk("%s: " fmt, __func__, ##__VA_ARGS__)
(or wrap around bpf_dbg_printk, I haven't tried with that).
If we do that, it should be a separate macro, whilst bpf_dbg_printk remains as is, giving the caller the flexibility to pick - think of it as a convenience macro.
| It can happen, sometimes, that the number of arguments to print is 3, but when `__FUNCTION__` is included, the total reaches 4. In these cases, `bpf_dbg_printk` calls `bpf_printk`, which subsequently calls `___bpf_pick_printk`. This chain then selects and calls `__bpf_vprintk` (for 4+ arguments), which finally invokes `bpf_trace_vprintk`. Since `bpf_trace_vprintk` is only available starting in [kernel version 5.16](https://docs.ebpf.io/linux/helper-function/bpf_trace_vprintk/), two options are available to avoid errors when loading the eBPF program: | ||
| 1. Split the print into two separate calls: one with one argument and one with two arguments, where `__FUNCTION__` is added to both. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, in light of this comment, perhaps the way is to simply provide the convenience macro above - makes it flexible.
As @marctc pointed out, this is because 99.9% of the logs end users can provide us comes from the usual OBI logs - people don't know how or don't have access to It also means sometimes logs lines can get lost, when the ring buffer is under pressure, but that's the nature of it. |
Could mounting host |
It might, but in a sense it's not always enough. For some platforms like GKE Autopilot we need to explicitly make that access as something we need, which reminds me I need to follow-up on the update about the cgroup. Also sometimes the trace pipe might be held by another program, although it's rare. |
I know, how about this? What if we changed the bpf_debug_printk to add the line for |
|
@mmat11 in addition to what @grcevski, we removed the dependency on In my experience, we tend to know which customers do have access to the node/host and those few select ones are indeed able to provide us with a Some people literally only have access to logs from a web interface (e.g. Grafana Dashboard/logs) and they end up pasting that. Or sometimes granting us access to their web UI where we can check the logs ourselves, so regardless of |
I see, I think my proposal is not viable for now then. What about using dynamic log sizes then, similar to large buffer events? It would at least remove the "truncated" limit |
@mmat11 I think that's a great idea, and probably the way to go. One thing along those lines to consider (not sure if a good idea): if we want true structured logging, that can be achieved in this fashion - imagine something like this: this gets serialised into a "log event" struct that gets shipped to userspace, which can then decide to unpack it as JSON or as a line (similar to TracePrinter). but I digress |
This PR is a proposal to try to find a more uniform way to print ebpf debug information.
I only modified a few debug prints as a test and wrote the
bpf-print-format.mdfile to document the proposal.I'm open to suggestions.