
Commit baaf553

mrutland-arm authored and ctmarinas committed
arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on arm64. This allows each ftrace callsite to provide an ftrace_ops to the common ftrace trampoline, allowing each callsite to invoke distinct tracer functions without the need to fall back to list processing or to allocate custom trampolines for each callsite. This significantly speeds up cases where multiple distinct trace functions are used and callsites are mostly traced by a single tracer.

The main idea is to place a pointer to the ftrace_ops as a literal at a fixed offset from the function entry point, which can be recovered by the common ftrace trampoline. Using a 64-bit literal avoids branch range limitations, and permits the ops to be swapped atomically without special considerations that apply to code-patching. In future this will also allow for the implementation of DYNAMIC_FTRACE_WITH_DIRECT_CALLS without branch range limitations by using additional fields in struct ftrace_ops.

As noted in the core patch adding support for DYNAMIC_FTRACE_WITH_CALL_OPS, this approach allows for directly invoking ftrace_ops::func even for ftrace_ops which are dynamically allocated (or part of a module), without going via ftrace_ops_list_func.

Currently, this approach is not compatible with CLANG_CFI, as the presence/absence of pre-function NOPs changes the offset of the pre-function type hash, and there is no existing mechanism to ensure a consistent offset for instrumented and uninstrumented functions. When CLANG_CFI is enabled, the existing scheme with a global ops->func pointer is used, and there should be no functional change. I am currently working with others to allow the two to work together in future (though this will likely require updated compiler support).

I've benchmarked this with the ftrace_ops sample module [1], which is not currently upstream, but available at:

  https://lore.kernel.org/lkml/[email protected]
  git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git ftrace-ops-sample-20230109

Using that module I measured the total time taken for 100,000 calls to a trivial instrumented function, with a number of tracers enabled with relevant filters (which would apply to the instrumented function) and a number of tracers enabled with irrelevant filters (which would not apply to the instrumented function). I tested on an M1 MacBook Pro, running under a HVF-accelerated QEMU VM (i.e. on real hardware).

Before this patch:

  Number of tracers     || Total time  | Per-call average time (ns)
  Relevant | Irrelevant || (ns)        | Total     | Overhead
  =========+============++=============+===========+==========
         0 |          0 ||      94,583 |      0.95 |        -
         0 |          1 ||      93,709 |      0.94 |        -
         0 |          2 ||      93,666 |      0.94 |        -
         0 |         10 ||      93,709 |      0.94 |        -
         0 |        100 ||      93,792 |      0.94 |        -
  ---------+------------++-------------+-----------+----------
         1 |          1 ||   6,467,833 |     64.68 |    63.73
         1 |          2 ||   7,509,708 |     75.10 |    74.15
         1 |         10 ||  23,786,792 |    237.87 |   236.92
         1 |        100 || 106,432,500 |  1,064.43 | 1,063.38
  ---------+------------++-------------+-----------+----------
         1 |          0 ||   1,431,875 |     14.32 |    13.37
         2 |          0 ||   6,456,334 |     64.56 |    63.62
        10 |          0 ||  22,717,000 |    227.17 |   226.22
       100 |          0 || 103,293,667 |  1,032.94 | 1,031.99
  ---------+------------++-------------+-----------+----------

  Note: per-call overhead is estimated relative to the baseline case with
  0 relevant tracers and 0 irrelevant tracers.
After this patch:

  Number of tracers     || Total time  | Per-call average time (ns)
  Relevant | Irrelevant || (ns)        | Total     | Overhead
  =========+============++=============+===========+==========
         0 |          0 ||      94,541 |      0.95 |        -
         0 |          1 ||      93,666 |      0.94 |        -
         0 |          2 ||      93,709 |      0.94 |        -
         0 |         10 ||      93,667 |      0.94 |        -
         0 |        100 ||      93,792 |      0.94 |        -
  ---------+------------++-------------+-----------+----------
         1 |          1 ||     281,000 |      2.81 |     1.86
         1 |          2 ||     281,042 |      2.81 |     1.87
         1 |         10 ||     280,958 |      2.81 |     1.86
         1 |        100 ||     281,250 |      2.81 |     1.87
  ---------+------------++-------------+-----------+----------
         1 |          0 ||     280,959 |      2.81 |     1.86
         2 |          0 ||   6,502,708 |     65.03 |    64.08
        10 |          0 ||  18,681,209 |    186.81 |   185.87
       100 |          0 || 103,550,458 |  1,035.50 | 1,034.56
  ---------+------------++-------------+-----------+----------

  Note: per-call overhead is estimated relative to the baseline case with
  0 relevant tracers and 0 irrelevant tracers.

As can be seen from the above:

a) Whenever there is a single relevant tracer function associated with a
   tracee, the overhead of invoking the tracer is constant, and does not
   scale with the number of tracers which are *not* associated with that
   tracee.

b) The overhead for a single relevant tracer has dropped to ~1/7 of the
   overhead prior to this series (from 13.37ns to 1.86ns). This is largely
   due to permitting calls to dynamically-allocated ftrace_ops without
   going through ftrace_ops_list_func.

I've run the ftrace selftests from v6.2-rc3, which reports:

| # of passed:  110
| # of failed:  0
| # of unresolved:  3
| # of untested:  0
| # of unsupported:  0
| # of xfailed:  1
| # of undefined(test bug):  0

... where the unresolved entries were the tests for DIRECT functions (which
are not supported), and the checkbashisms selftest (which is irrelevant
here):

| [8] Test ftrace direct functions against tracers        [UNRESOLVED]
| [9] Test ftrace direct functions against kprobes        [UNRESOLVED]
| [62] Meta-selftest: Checkbashisms                       [UNRESOLVED]

... with all other tests passing (or failing as expected).

Signed-off-by: Mark Rutland <[email protected]>
Cc: Florent Revest <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
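For illustration, the per-callsite ops scheme described above can be sketched in plain C: each instrumented function is preceded by an 8-byte literal holding a pointer to the ftrace_ops that traces it, and the common trampoline recovers that pointer from the callsite's return address and calls ops->func directly. A minimal sketch under those assumptions (the struct and function names here are hypothetical, not kernel APIs):

#include <stdint.h>

/* Stand-in for the one field of struct ftrace_ops the trampoline needs. */
struct ops_like {
	void (*func)(unsigned long ip, unsigned long parent_ip,
		     struct ops_like *op, void *regs);
};

/*
 * Conceptual C equivalent of the CALL_OPS path in ftrace_caller: 'lr' is
 * the return address of the callsite's BL, which lies 16 bytes (no BTI)
 * or 20 bytes (with BTI) past the 8-byte-aligned ops literal.
 */
static void call_ops_trampoline(unsigned long ip, unsigned long parent_ip,
				unsigned long lr, void *regs)
{
	unsigned long literal = (lr & ~7UL) - 16;	/* mirrors BIC + LDR offset */
	struct ops_like *op = *(struct ops_like **)literal;

	op->func(ip, parent_ip, op, regs);		/* op->func(ip, parent_ip, op, regs) */
}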
1 parent 90955d7 commit baaf553


6 files changed, 195 insertions(+), 20 deletions(-)


arch/arm64/Kconfig

Lines changed: 3 additions & 0 deletions
@@ -124,6 +124,7 @@ config ARM64
 	select EDAC_SUPPORT
 	select FRAME_POINTER
 	select FUNCTION_ALIGNMENT_4B
+	select FUNCTION_ALIGNMENT_8B if DYNAMIC_FTRACE_WITH_CALL_OPS
 	select GENERIC_ALLOCATOR
 	select GENERIC_ARCH_TOPOLOGY
 	select GENERIC_CLOCKEVENTS_BROADCAST
@@ -187,6 +188,8 @@ config ARM64
 	select HAVE_DYNAMIC_FTRACE
 	select HAVE_DYNAMIC_FTRACE_WITH_ARGS \
 		if $(cc-option,-fpatchable-function-entry=2)
+	select HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS \
+		if (DYNAMIC_FTRACE_WITH_ARGS && !CFI_CLANG)
 	select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY \
 		if DYNAMIC_FTRACE_WITH_ARGS
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS

arch/arm64/Makefile

Lines changed: 4 additions & 1 deletion
@@ -139,7 +139,10 @@ endif

 CHECKFLAGS	+= -D__aarch64__

-ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_ARGS),y)
+ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS),y)
+  KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
+  CC_FLAGS_FTRACE := -fpatchable-function-entry=4,2
+else ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_ARGS),y)
   KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
   CC_FLAGS_FTRACE := -fpatchable-function-entry=2
 endif
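The key change here is the new flag value: -fpatchable-function-entry=4,2 asks the compiler to emit four NOPs per function, two of them *before* the function's entry symbol (these become the 64-bit ops literal), whereas the existing DYNAMIC_FTRACE_WITH_ARGS build uses =2 with both NOPs after the entry. A rough per-function equivalent using the GCC/Clang attribute, shown only to illustrate the layout (the function name below is made up):

/* Build-wide this is done via CC_FLAGS_FTRACE; the attribute form is equivalent. */
__attribute__((patchable_function_entry(4, 2)))
void traced_function(void)
{
	/*
	 * Resulting entry layout (see ftrace_call_adjust() below):
	 *   func-08:  NOP   // literal, first 32 bits
	 *   func-04:  NOP   // literal, last 32 bits
	 *   func+00:  NOP   // patched to MOV X9, LR
	 *   func+04:  NOP   // patched to BL ftrace_caller
	 */
}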

arch/arm64/include/asm/ftrace.h

Lines changed: 1 addition & 14 deletions
@@ -62,20 +62,7 @@ extern unsigned long ftrace_graph_call;

 extern void return_to_handler(void);

-static inline unsigned long ftrace_call_adjust(unsigned long addr)
-{
-	/*
-	 * Adjust addr to point at the BL in the callsite.
-	 * See ftrace_init_nop() for the callsite sequence.
-	 */
-	if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_ARGS))
-		return addr + AARCH64_INSN_SIZE;
-	/*
-	 * addr is the address of the mcount call instruction.
-	 * recordmcount does the necessary offset calculation.
-	 */
-	return addr;
-}
+unsigned long ftrace_call_adjust(unsigned long addr);

 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_ARGS
 struct dyn_ftrace;

arch/arm64/kernel/asm-offsets.c

Lines changed: 4 additions & 0 deletions
@@ -9,6 +9,7 @@

 #include <linux/arm_sdei.h>
 #include <linux/sched.h>
+#include <linux/ftrace.h>
 #include <linux/kexec.h>
 #include <linux/mm.h>
 #include <linux/dma-mapping.h>
@@ -193,6 +194,9 @@ int main(void)
   DEFINE(KIMAGE_HEAD,		offsetof(struct kimage, head));
   DEFINE(KIMAGE_START,		offsetof(struct kimage, start));
   BLANK();
+#endif
+#ifdef CONFIG_FUNCTION_TRACER
+  DEFINE(FTRACE_OPS_FUNC,		offsetof(struct ftrace_ops, func));
 #endif
   return 0;
 }
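FTRACE_OPS_FUNC gives the assembly trampoline the byte offset of the func member within struct ftrace_ops, so entry-ftrace.S can do `ldr x4, [x2, #FTRACE_OPS_FUNC]` without hard-coding the structure layout. The mechanism is ordinary offsetof(); a small stand-alone sketch of the same idea, using a mock struct rather than the real ftrace_ops:

#include <stddef.h>
#include <stdio.h>

/* Mock structure purely to show what the DEFINE() above computes. */
struct mock_ops {
	void *private;
	void (*func)(void);
};

int main(void)
{
	/* asm-offsets.c emits this value as an assembler-visible constant. */
	printf("offsetof(struct mock_ops, func) = %zu\n",
	       offsetof(struct mock_ops, func));
	return 0;
}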

arch/arm64/kernel/entry-ftrace.S

Lines changed: 27 additions & 5 deletions
@@ -65,13 +65,35 @@ SYM_CODE_START(ftrace_caller)
 	stp	x29, x30, [sp, #FREGS_SIZE]
 	add	x29, sp, #FREGS_SIZE

-	sub	x0, x30, #AARCH64_INSN_SIZE	// ip (callsite's BL insn)
-	mov	x1, x9				// parent_ip (callsite's LR)
-	ldr_l	x2, function_trace_op		// op
-	mov	x3, sp				// regs
+	/* Prepare arguments for the the tracer func */
+	sub	x0, x30, #AARCH64_INSN_SIZE		// ip (callsite's BL insn)
+	mov	x1, x9					// parent_ip (callsite's LR)
+	mov	x3, sp					// regs
+
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
+	/*
+	 * The literal pointer to the ops is at an 8-byte aligned boundary
+	 * which is either 12 or 16 bytes before the BL instruction in the call
+	 * site. See ftrace_call_adjust() for details.
+	 *
+	 * Therefore here the LR points at `literal + 16` or `literal + 20`,
+	 * and we can find the address of the literal in either case by
+	 * aligning to an 8-byte boundary and subtracting 16. We do the
+	 * alignment first as this allows us to fold the subtraction into the
+	 * LDR.
+	 */
+	bic	x2, x30, 0x7
+	ldr	x2, [x2, #-16]				// op
+
+	ldr	x4, [x2, #FTRACE_OPS_FUNC]		// op->func
+	blr	x4					// op->func(ip, parent_ip, op, regs)
+
+#else
+	ldr_l	x2, function_trace_op		// op

 SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
-	bl	ftrace_stub
+	bl	ftrace_stub				// func(ip, parent_ip, op, regs)
+#endif

 	/*
 	 * At the callsite x0-x8 and x19-x30 were live. Any C code will have preserved
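The `bic x2, x30, 0x7` / `ldr x2, [x2, #-16]` pair implements the address recovery described in the comment: because the literal is 8-byte aligned and the LR is either literal + 16 or literal + 20, aligning the LR down to 8 bytes and subtracting 16 lands on the literal in both cases. A tiny stand-alone check of that arithmetic (illustrative only, with an arbitrary example address):

#include <assert.h>
#include <stdint.h>

/* Mirrors BIC x2, x30, #0x7 followed by the LDR's -16 offset. */
static uint64_t recover_literal(uint64_t lr)
{
	return (lr & ~UINT64_C(7)) - 16;
}

int main(void)
{
	uint64_t literal = 0xffff800010001000;	/* any 8-byte-aligned address */

	assert(recover_literal(literal + 16) == literal);	/* no BTI */
	assert(recover_literal(literal + 20) == literal);	/* with BTI C */
	return 0;
}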

arch/arm64/kernel/ftrace.c

Lines changed: 156 additions & 0 deletions
@@ -60,6 +60,89 @@ int ftrace_regs_query_register_offset(const char *name)
 }
 #endif

+unsigned long ftrace_call_adjust(unsigned long addr)
+{
+	/*
+	 * When using mcount, addr is the address of the mcount call
+	 * instruction, and no adjustment is necessary.
+	 */
+	if (!IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_ARGS))
+		return addr;
+
+	/*
+	 * When using patchable-function-entry without pre-function NOPS, addr
+	 * is the address of the first NOP after the function entry point.
+	 *
+	 * The compiler has either generated:
+	 *
+	 * addr+00:	func:	NOP		// To be patched to MOV X9, LR
+	 * addr+04:		NOP		// To be patched to BL <caller>
+	 *
+	 * Or:
+	 *
+	 * addr-04:		BTI	C
+	 * addr+00:	func:	NOP		// To be patched to MOV X9, LR
+	 * addr+04:		NOP		// To be patched to BL <caller>
+	 *
+	 * We must adjust addr to the address of the NOP which will be patched
+	 * to `BL <caller>`, which is at `addr + 4` bytes in either case.
+	 *
+	 */
+	if (!IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS))
+		return addr + AARCH64_INSN_SIZE;
+
+	/*
+	 * When using patchable-function-entry with pre-function NOPs, addr is
+	 * the address of the first pre-function NOP.
+	 *
+	 * Starting from an 8-byte aligned base, the compiler has either
+	 * generated:
+	 *
+	 * addr+00:		NOP		// Literal (first 32 bits)
+	 * addr+04:		NOP		// Literal (last 32 bits)
+	 * addr+08:	func:	NOP		// To be patched to MOV X9, LR
+	 * addr+12:		NOP		// To be patched to BL <caller>
+	 *
+	 * Or:
+	 *
+	 * addr+00:		NOP		// Literal (first 32 bits)
+	 * addr+04:		NOP		// Literal (last 32 bits)
+	 * addr+08:	func:	BTI	C
+	 * addr+12:		NOP		// To be patched to MOV X9, LR
+	 * addr+16:		NOP		// To be patched to BL <caller>
+	 *
+	 * We must adjust addr to the address of the NOP which will be patched
+	 * to `BL <caller>`, which is at either addr+12 or addr+16 depending on
+	 * whether there is a BTI.
+	 */
+
+	if (!IS_ALIGNED(addr, sizeof(unsigned long))) {
+		WARN_RATELIMIT(1, "Misaligned patch-site %pS\n",
+			       (void *)(addr + 8));
+		return 0;
+	}
+
+	/* Skip the NOPs placed before the function entry point */
+	addr += 2 * AARCH64_INSN_SIZE;
+
+	/* Skip any BTI */
+	if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)) {
+		u32 insn = le32_to_cpu(*(__le32 *)addr);
+
+		if (aarch64_insn_is_bti(insn)) {
+			addr += AARCH64_INSN_SIZE;
+		} else if (insn != aarch64_insn_gen_nop()) {
+			WARN_RATELIMIT(1, "unexpected insn in patch-site %pS: 0x%08x\n",
+				       (void *)addr, insn);
+		}
+	}
+
+	/* Skip the first NOP after function entry */
+	addr += AARCH64_INSN_SIZE;
+
+	return addr;
+}
+
 /*
  * Replace a single instruction, which may be a branch or NOP.
  * If @validate == true, a replaced instruction is checked against 'old'.
@@ -98,6 +181,13 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
 	unsigned long pc;
 	u32 new;

+	/*
+	 * When using CALL_OPS, the function to call is associated with the
+	 * call site, and we don't have a global function pointer to update.
+	 */
+	if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS))
+		return 0;
+
 	pc = (unsigned long)ftrace_call;
 	new = aarch64_insn_gen_branch_imm(pc, (unsigned long)func,
 					  AARCH64_INSN_BRANCH_LINK);
@@ -176,13 +266,56 @@ static bool ftrace_find_callable_addr(struct dyn_ftrace *rec,
 	return true;
 }

+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
+static const struct ftrace_ops *arm64_rec_get_ops(struct dyn_ftrace *rec)
+{
+	const struct ftrace_ops *ops = NULL;
+
+	if (rec->flags & FTRACE_FL_CALL_OPS_EN) {
+		ops = ftrace_find_unique_ops(rec);
+		WARN_ON_ONCE(!ops);
+	}
+
+	if (!ops)
+		ops = &ftrace_list_ops;
+
+	return ops;
+}
+
+static int ftrace_rec_set_ops(const struct dyn_ftrace *rec,
+			      const struct ftrace_ops *ops)
+{
+	unsigned long literal = ALIGN_DOWN(rec->ip - 12, 8);
+	return aarch64_insn_write_literal_u64((void *)literal,
+					      (unsigned long)ops);
+}
+
+static int ftrace_rec_set_nop_ops(struct dyn_ftrace *rec)
+{
+	return ftrace_rec_set_ops(rec, &ftrace_nop_ops);
+}
+
+static int ftrace_rec_update_ops(struct dyn_ftrace *rec)
+{
+	return ftrace_rec_set_ops(rec, arm64_rec_get_ops(rec));
+}
+#else
+static int ftrace_rec_set_nop_ops(struct dyn_ftrace *rec) { return 0; }
+static int ftrace_rec_update_ops(struct dyn_ftrace *rec) { return 0; }
+#endif
+
 /*
  * Turn on the call to ftrace_caller() in instrumented function
  */
 int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
 {
 	unsigned long pc = rec->ip;
 	u32 old, new;
+	int ret;
+
+	ret = ftrace_rec_update_ops(rec);
+	if (ret)
+		return ret;

 	if (!ftrace_find_callable_addr(rec, NULL, &addr))
 		return -EINVAL;
@@ -193,6 +326,19 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
 	return ftrace_modify_code(pc, old, new, true);
 }

+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
+int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
+		       unsigned long addr)
+{
+	if (WARN_ON_ONCE(old_addr != (unsigned long)ftrace_caller))
+		return -EINVAL;
+	if (WARN_ON_ONCE(addr != (unsigned long)ftrace_caller))
+		return -EINVAL;
+
+	return ftrace_rec_update_ops(rec);
+}
+#endif
+
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_ARGS
 /*
  * The compiler has inserted two NOPs before the regular function prologue.
@@ -220,6 +366,11 @@ int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
 {
 	unsigned long pc = rec->ip - AARCH64_INSN_SIZE;
 	u32 old, new;
+	int ret;
+
+	ret = ftrace_rec_set_nop_ops(rec);
+	if (ret)
+		return ret;

 	old = aarch64_insn_gen_nop();
 	new = aarch64_insn_gen_move_reg(AARCH64_INSN_REG_9,
@@ -237,9 +388,14 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec,
 {
 	unsigned long pc = rec->ip;
 	u32 old = 0, new;
+	int ret;

 	new = aarch64_insn_gen_nop();

+	ret = ftrace_rec_set_nop_ops(rec);
+	if (ret)
+		return ret;
+
 	/*
 	 * When using mcount, callsites in modules may have been initalized to
 	 * call an arbitrary module PLT (which redirects to the _mcount stub)
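ftrace_rec_set_ops() finds the same literal from the other direction: rec->ip (after ftrace_call_adjust()) points at the to-be-patched BL, which sits 12 bytes (no BTI) or 16 bytes (with BTI) past the 8-byte-aligned literal, so ALIGN_DOWN(rec->ip - 12, 8) yields the literal in both cases. A small stand-alone check of that arithmetic (illustrative only; ALIGN_DOWN is redefined here for a user-space build):

#include <assert.h>
#include <stdint.h>

#define ALIGN_DOWN(x, a)	((x) & ~((uint64_t)(a) - 1))

/* Mirrors the literal lookup in ftrace_rec_set_ops(). */
static uint64_t literal_for_rec(uint64_t rec_ip)
{
	return ALIGN_DOWN(rec_ip - 12, 8);
}

int main(void)
{
	uint64_t literal = 0xffff800010002000;	/* 8-byte-aligned patch-site base */

	assert(literal_for_rec(literal + 12) == literal);	/* BL at literal + 12 (no BTI) */
	assert(literal_for_rec(literal + 16) == literal);	/* BL at literal + 16 (BTI C) */
	return 0;
}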
