-
Notifications
You must be signed in to change notification settings - Fork 499
handle resolve of NULL pointers #4327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
handle resolve of NULL pointers #4327
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello !
Thank you for taking time to write this PR.
I opened an issue for this few month ago, see #3728.
I don’t know of you will share my point of view because will I have thinking about this i have found 3 problems :
- First, what happen if the null structure is not the last resolving field. How can you track where it ends ?
- Also, what happened if you are resolving
int *valueand the ptr is null ? - And if you are resolving a struct (e.g.
type: file) you need to letextract_argends correctly, because at this point you have no idea ifextract_argreturn a legitimate0or a null struct. But when resolving it's fields, you will figure out that it is actually NULL.
The potential solution I was thinking of for this was to take all the return values of the probe_read and find a way to send this error to the userspace if it is < 0. So in case of a null pointer, you can have a new error code that the kernel does not cover, for instance -100, that is dedicated to it.
bpf/process/generic_calls.h
Outdated
| return 1; | ||
| /* NULL pointer */ | ||
| if (*data->arg == 0) { | ||
| *data->null_ptr_found = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the last btf_config resolve in a null pointer, you would not reach that code right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is good point. I think my new placement for error detection takes care of this problem.
| - index: 1 | ||
| type: "uint8" | ||
| btfType: "mystruct" | ||
| resolve: "sub.v8" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 2 test cases should be added. This one is good, but if sub alone is resolved, does it raise an error too ? For instance, if type: file is null, does Tetragon display null as well ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 2 test cases should be added.
You go on to describe what sounds like one test case. What is the other test case you're thinking about here?
but if sub alone is resolved, does it raise an error too ? For instance, if type: file is null, does Tetragon display null as well ?
As I mentioned in #3728, I think that we should break this up into two independent tasks. One is to handle the dereferencing in extract_arg_depth, and the other is to handle issues for the terminal types reach by resolve (in read_arg), which would happen even without the use of the resolve feature. With pointer types, read_arg() has some issues when the pointer is NULL. Because of this, I don't want to test scenarios where we resolve to a NULL pointer and pass that NULL pointer to read_arg() (yet).
Also, for the example you have above:
what happened if you are resolving int *value and the ptr is null ?
From my testing, I see that read_arg() is passed a pointer to an int in this scenario. If we want to resolve the int value that the pointer points to, I think we need to call extract_arg_depth one more time in order to dereference, because read_arg() takes the integer "by value" instead of by pointer to an integer. So, I don't want to test this scenario until we make this work in the non-NULL case.
So, I think that I should add a test for when the first pointer is NULL, and when the second to last pointer is NULL, but I don't want to test the scenario where the last pointer is NULL, because that problem is not specific to resolve, so I want to solve that issue independently.
c9bf1d8 to
478412b
Compare
✅ Deploy Preview for tetragon ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Thanks for bringing this to my attention, and your feedback/ideas here.
I've changed things so that I keep track of which level of depth we are unable to dereference. I use a sentinel value of -1 to indicated that no issues were detected. This is then used as an indicator that selectors against this arg shouldn't match. I'm thinking that perhaps this level of depth of failure information could be put in the event not only to distinguish resolve failures from success, but also what depth the failure occurred.
I'm investigated this, and it seems that this scenario doesn't work for the non-NULL case. So, I want to fix that first, before considering how to handle the NULL case. This is what I'm trying: In the userspace program I'm testing against, I do this: The policy's arg config: Is this the scenario you're describing here? It seems like we might need to do one more iteration of calling extract_arg_depth and dereference again to make this work as expected. Put another way, I think that we are acting on a pointer as if it's an integer.
I'll look into this more closely, but from reading the code it seems that we don't handle the
Thanks for this idea. I think this is the right approach, and I have started moving in this direction.
I'm not sure if I follow this. I think we need to communicate the error out of band of the arg value or else we will have a collision. Perhaps we can report the depth at which dereferencing failed in the event, in order to distinguish between resolve success and failure. I'm trying to figure out what is the right behavior is for when we cannot resolve due to a NULL pointer. My current approach is to indicate there was a resolve error, but not let that prevent the event from firing, unless there is a matchArgs selector associated with the arg. In this method, I would communicate that the arg was not actually resolved in the event somehow. But, I realized from the issue that you created about this, that NULL string pointers prevent an event from firing, regardless of selectors. So maybe that's a more appropriate outcome. |
Above makes sense to me.
I don't think that the appropriate outcome. I'll use the example from the issue for clarity: lsmhooks:
- hook: "bprm_check_security"
args:
- index: 0
type: "string"
resolve: "executable.f_path.dentry.d_name.name"
selectors:
- matchActions:
- action: PostIn this case (that is, where there no matchArgs), I think we should generate an event (perhaps with an empty argument, or even a special value to indicate that some resolution went wrong). |
765d55d to
f56102b
Compare
e900a2c to
5bcb17c
Compare
kkourt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great PR!
I have some minor suggestions, but LGTM!
6334632 to
47ef274
Compare
47ef274 to
2ea598d
Compare
Use the configuration to determine if we are examining a valid argument. Commit 8dea69f ("tetragon: Read only defined arguments ") made it so we only read arguments defined in the policy. So, we need to adapt the rate limiting logic to account for this. Otherwise, we might put garbage in the key. Signed-off-by: Andy Strohman <[email protected]>
We have been ignoring errors returned from bpf_probe_read() when dereferencing during resolve. bpf_probe_read() returns a negative value when it detects that a seg fault would occur. This is problematic both in terms of exposing bogus argument values in the event and also when preforming filtering on behalf of the matchArgs selector. This change notes when we were unable to dereference so that arg filtering does not happen against a bogus value. Resolve may traverse multiple pointers, so we need a way to distinguish which pointer could not be dereferenced. The "depth" of the dereference failure is recorded, and exposed to user space, so that we can report the pointer that could not be dereferenced in the event. For each arg, we set the depth at which the dereference failed in a status header that precedes the argument value. If the status value is 0, then no error occurred and the argument's value follows. Any non-zero status indicates failure reading the arg value. When user space encounters this non-zero status, it will not attempt to read the arg's value. This allows us to avoid transmitting bogus argument values from the bpf program to user space. Signed-off-by: Andy Strohman <[email protected]>
This arg type will be found in the event instead of a configured arg type, when the configured arg could not be read. This error argumetn type includes a message that describes what error occurred while attempting to read the arg value. Signed-off-by: Andy Strohman <[email protected]>
The previous commit altered api/v1/tetragon/tetragon.proto. The is commit reflects the outcome of running "make protogen" which is required in order to generated code/documentation based on that that change. Signed-off-by: Andy Strohman <[email protected]>
Signed-off-by: Andy Strohman <[email protected]>
Signed-off-by: Andy Strohman <[email protected]>
This program will be used to test how resolve handles NULL pointers. Signed-off-by: Andy Strohman <[email protected]>
Test that the resolve error depth is expected when NULL pointers are encountered. Test that matchArgs will not match when resolve fails. Signed-off-by: Andy Strohman <[email protected]>
With this change, GetIndex() has the same semantic meaning regardless of hook type. It returns the index of the Arg within the spec. This fixes a returnCopy issue where we were overwriting the wrong arg when we merge the arg value after the retprobe event. Signed-off-by: Andy Strohman <[email protected]>
This test confirms that we don't confuse the meaning of index. Arguments have a position (index) within the args section of a tracing policy spec. The arg's position defined within the spec is not related to the arg's position (index) within the function signature or tracepoint that is being hooked. We had a bug where we confused the meaning of index. Retprobes need to overwrite a argument value that not available at function entry. We locate the argument which needs to be overwritten by referencing its position within the tracing policy spec. The bug happened because we overwrote based on arg's position within the function signature. As such, returnCopy only worked as expected when the user defined the args in the tracing policy spec in the same order as they appear in the function/tracepoint signature. This returnCopy test is constructed such that the configured argument's index within the spec does not correspond to the arguments position within the function signature. Signed-off-by: Andy Strohman <[email protected]>
To a string that shows which pointer could not be dereferenced. Signed-off-by: Andy Strohman <[email protected]>
failed prog /host/src/tetragon/bpf/objs/bpf_generic_tracepoint_v53.o kern_version 328959 loadInstance: opening collection '/host/src/tetragon/bpf/objs/bpf_generic_tracepoint_v53.o' failed: program generic_tracepoint_process_event: load program: invalid argument: math between map_value pointer and register with unbounded min value is not allowed (4126 line(s) omitted) Signed-off-by: Andy Strohman <[email protected]>
Use it for masking to appease the verifier when we have arrays with MAX_POSSIBLE_ARGS. This prevents silent breakage if we increase the size of MAX_POSSIBLE_ARGS beyond 8 in the future. Suggested-by: Kornilios Kourtis <[email protected]> Signed-off-by: Andy Strohman <[email protected]>
2ea598d to
964a1c0
Compare
Currently, the max number of selectors is the same as the max number of configured args, so we don't have problems due to the fact that we use the MAX_SELECTORS_MASK on arg indexes. This change makes arg index masking use MAX_POSSIBLE_ARGS_INDEX for clarity and to decouple these independent concepts. EVENT_CONFIG_MAX_ARG is replaced everywhere with MAX_POSSIBLE_ARGS because these two values cannot deviate. There are places where we mask an index that is used for both event config and kprobe messages. This consolidation helps in those cases. Convenience masks are added for the usdt and reg arg masks, because these arrays can/do have different sizes than the original arg arrays. There is a limit on the number of args that a user can config. There is also an unrelated limit about which args can be reached within a tracepoint/function signature. Both of these limits are currently 5, but they are logically unrelated limits. However, we use the same arg index mask for both. This change decouples these two concepts from a masking perspective so that we could increase the number of arguments that can be reached from a function signature perspective without increasing the allowed number of configured arguments. Signed-off-by: Andy Strohman <[email protected]>
The resolve logic follows pointers by dereferencing. Some of these pointer values may be NULL, or otherwise impossible to dereference. In such cases, matchArgs selectors should fail to match, because the argument was not really resolved.
When we fail to dereference during resolve, report the pointer at which the failure occurred. Also, when argument reading fails, the event's kprobe argument type will be converted to KprobeError, instead of what was specified in the tracing policy.
When investigating how to plumb this to the event, I found a bug regarding how returnCopy works. I've added a fix for that as well.