
Commit cdf355c

anakryiko authored and mhiramat committed
uprobes: add speculative lockless system-wide uprobe filter check
It's very common with BPF-based uprobe/uretprobe use cases to have system-wide (not PID-specific) probes. In this case uprobe's trace_uprobe_filter->nr_systemwide counter is bumped at registration time, and the actual filtering is short-circuited at the time the uprobe/uretprobe is triggered.

This is a great optimization, and the only issue with it is that even to check this counter the uprobe subsystem takes the read side of trace_uprobe_filter->rwlock. This is actually noticeable in profiles and is just another point of contention when a uprobe is triggered on multiple CPUs simultaneously.

This patch moves the nr_systemwide check outside of the filter list's rwlock scope, as the rwlock is meant to protect list modification, while the nr_systemwide-based check is speculative and racy already, despite the lock (as discussed in [0]).

trace_uprobe_filter_remove() and trace_uprobe_filter_add() already check filter->nr_systemwide explicitly outside of __uprobe_perf_filter, so no modifications are required there.

Confirming with BPF selftests-based benchmarks.

BEFORE (based on changes in previous patch)
===========================================
uprobe-nop      :    2.732 ± 0.022M/s
uprobe-push     :    2.621 ± 0.016M/s
uprobe-ret      :    1.105 ± 0.007M/s
uretprobe-nop   :    1.396 ± 0.007M/s
uretprobe-push  :    1.347 ± 0.008M/s
uretprobe-ret   :    0.800 ± 0.006M/s

AFTER
=====
uprobe-nop      :    2.878 ± 0.017M/s (+5.5%, total +8.3%)
uprobe-push     :    2.753 ± 0.013M/s (+5.3%, total +10.2%)
uprobe-ret      :    1.142 ± 0.010M/s (+3.8%, total +3.8%)
uretprobe-nop   :    1.444 ± 0.008M/s (+3.5%, total +6.5%)
uretprobe-push  :    1.410 ± 0.010M/s (+4.8%, total +7.1%)
uretprobe-ret   :    0.816 ± 0.002M/s (+2.0%, total +3.9%)

In the above, the first percentage value is relative to the previous patch (lazy uprobe buffer optimization), while the "total" percentage is relative to the kernel without any of the changes in this patch set.

As can be seen, we get about a 4% - 10% speed up, in total, with both the lazy uprobe buffer and speculative filter check optimizations.

[0] https://lore.kernel.org/bpf/[email protected]/

Reviewed-by: Jiri Olsa <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Masami Hiramatsu (Google) <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
1 parent 1b8f85d commit cdf355c
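
The commit message describes a common pattern: check a "matches everything" counter speculatively before taking the read lock that guards the per-task list, since the counter check is racy anyway. Below is a minimal userspace sketch of that pattern (not the kernel code itself); the names demo_filter and demo_filter_match are purely illustrative, and pthread rwlock plus C11 atomics stand in for the kernel's rwlock and READ_ONCE().

/*
 * Illustrative userspace sketch of the speculative fast-path check.
 * Compile with: cc -std=c11 demo.c -lpthread
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct demo_filter {
	pthread_rwlock_t rwlock;   /* protects the (elided) per-task list */
	atomic_int nr_systemwide;  /* > 0 means "any task matches" */
};

static bool demo_filter_match(struct demo_filter *f)
{
	bool ret;

	/* speculative, lockless fast path, analogous to READ_ONCE() in the patch */
	if (atomic_load_explicit(&f->nr_systemwide, memory_order_relaxed))
		return true;

	/* slow path: walking the per-task list still needs the read lock */
	pthread_rwlock_rdlock(&f->rwlock);
	ret = false;               /* list walk elided in this sketch */
	pthread_rwlock_unlock(&f->rwlock);
	return ret;
}

int main(void)
{
	struct demo_filter f = { .rwlock = PTHREAD_RWLOCK_INITIALIZER };

	atomic_store(&f.nr_systemwide, 1);
	printf("match: %d\n", demo_filter_match(&f)); /* fast path, lock never taken */
	return 0;
}
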

File tree

1 file changed: +7 −3 lines changed

kernel/trace/trace_uprobe.c

Lines changed: 7 additions & 3 deletions
@@ -1226,9 +1226,6 @@ __uprobe_perf_filter(struct trace_uprobe_filter *filter, struct mm_struct *mm)
 {
 	struct perf_event *event;
 
-	if (filter->nr_systemwide)
-		return true;
-
 	list_for_each_entry(event, &filter->perf_events, hw.tp_list) {
 		if (event->hw.target->mm == mm)
 			return true;
@@ -1353,6 +1350,13 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc,
 	tu = container_of(uc, struct trace_uprobe, consumer);
 	filter = tu->tp.event->filter;
 
+	/*
+	 * speculative short-circuiting check to avoid unnecessarily taking
+	 * filter->rwlock below, if the uprobe has system-wide consumer
+	 */
+	if (READ_ONCE(filter->nr_systemwide))
+		return true;
+
 	read_lock(&filter->rwlock);
 	ret = __uprobe_perf_filter(filter, mm);
 	read_unlock(&filter->rwlock);
