
Commit c1bc51f

Alexei Starovoitov authored and committed
Merge branch 'bpf-support-private-stack-for-bpf-progs'
Yonghong Song says:

====================
bpf: Support private stack for bpf progs

The main motivation for the private stack comes from the nested scheduler in
sched-ext from Tejun. The basic idea is that
  - each cgroup will have its own associated bpf program, and
  - the bpf program for a parent cgroup will call bpf programs in its
    immediate child cgroups.

Let us say we have the following cgroup hierarchy:

  root_cg (prog0):
    cg1 (prog1):
      cg11 (prog11):
        cg111 (prog111)
        cg112 (prog112)
      cg12 (prog12):
        cg121 (prog121)
        cg122 (prog122)
    cg2 (prog2):
      cg21 (prog21)
      cg22 (prog22)
      cg23 (prog23)

In the above example, prog0 will call a kfunc which will call prog1 and prog2
to get sched info for cg1 and cg2, and then the information is summarized and
sent back to prog0. Similarly, prog11 and prog12 will be invoked in the kfunc
and the result will be summarized and sent back to prog1, etc. The following
illustrates a possible call sequence:

  ... -> bpf prog A -> kfunc -> ops.<callback_fn> (bpf prog B) ...

Currently, for each thread, the x86 kernel allocates a 16KB stack. Each bpf
program (including its subprograms) has a maximum stack size of 512B to avoid
potential stack overflow. Nested bpf programs further increase the risk of
stack overflow. To avoid potential stack overflow caused by bpf programs, this
patch set adds support for a private stack, with bpf program stack space
allocated at JIT time. Using a private stack for bpf progs can reduce or avoid
potential kernel stack overflow.

Currently the private stack is applied to tracing programs (kprobe/uprobe,
perf_event, tracepoint and raw tracepoint) and struct_ops progs. Tracing progs
enable the private stack if any subprog stack size is more than a threshold
(i.e. 64B). Struct_ops progs enable the private stack based on the particular
struct_ops implementation, which can enable the private stack before
verification at the per-insn level. Struct_ops progs get the same treatment as
tracing progs w.r.t. when to enable the private stack.

For all these progs, the kernel will do a recursion check (no nesting per prog
per cpu) to ensure that the private stack won't be overwritten. The
bpf_prog_aux struct has a callback func recursion_detected() which can be
implemented by a kernel subsystem to synchronously detect recursion, report an
error, etc.

Only the x86_64 arch supports the private stack now. It can be extended to
other archs later. Please see each individual patch for details.

Change logs:
  v11 -> v12:
    - v11 link: https://lore.kernel.org/bpf/[email protected]/
    - Fix a bug where the allocated percpu space was less than the actual
      private stack.
    - Add guard memory (before and after the actual prog stack) to detect
      potential underflow/overflow.
  v10 -> v11:
    - v10 link: https://lore.kernel.org/bpf/[email protected]/
    - Use two bool variables, priv_stack_requested (used by struct_ops only)
      and jits_use_priv_stack, in order to make the code cleaner.
    - Set env->prog->aux->jits_use_priv_stack to true if any subprog uses the
      private stack. This is for the struct_ops use case to kick in recursion
      protection.
  v9 -> v10:
    - v9 link: https://lore.kernel.org/bpf/[email protected]/
    - Simplify handling of async cbs by making the async-cb-related progs use
      the normal kernel stack.
    - Do percpu allocation in the jit instead of the verifier.
  v8 -> v9:
    - v8 link: https://lore.kernel.org/bpf/[email protected]/
    - Use an enum to express the priv stack mode.
    - Use bits in the bpf_subprog_info struct to do subprog recursion checks
      between main/async and async subprogs.
    - Fix a potential memory leak.
    - Rename the recursion detection func from recursion_skipped() to
      recursion_detected().
  v7 -> v8:
    - v7 link: https://lore.kernel.org/bpf/[email protected]/
    - Add a recursion_skipped() callback func to the bpf_prog->aux structure
      such that if a recursion miss happened and
      bpf_prog->aux->recursion_skipped is not NULL, the callback fn will be
      called so the subsystem can take proper action based on its respective
      design.
  v6 -> v7:
    - v6 link: https://lore.kernel.org/bpf/[email protected]/
    - Go back to doing private stack allocation per prog instead of per
      subtree. This simplifies the implementation and avoids verifier
      complexity.
    - Handle potential nested subprog runs if an async callback exists.
    - Use the struct_ops->check_member() callback to set whether a particular
      struct_ops prog wants a private stack or not.
  v5 -> v6:
    - v5 link: https://lore.kernel.org/bpf/[email protected]/
    - Instead of using (or not using) the private stack at the struct_ops
      level, each prog in a struct_ops can decide whether to use the private
      stack or not.
  v4 -> v5:
    - v4 link: https://lore.kernel.org/bpf/[email protected]/
    - Remove the bpf_prog_call() related implementation.
    - Allow (opt-in) private stack for sched-ext progs.
  v3 -> v4:
    - v3 link: https://lore.kernel.org/bpf/[email protected]/
      There is a long discussion in the above v3 link trying to allow the
      private stack to be used by kernel functions in order to simplify the
      implementation. Unfortunately we didn't find a workable solution yet, so
      we return to the approach where the private stack is only used by bpf
      programs.
    - Add a bpf_prog_call() kfunc.
  v2 -> v3:
    - Instead of per-subprog private stack allocation, allocate private stacks
      at the main prog or callback entry prog. Subprogs that are not main or
      callback progs will increment the inherited stack pointer to be their
      frame pointer.
    - The private stack allows each prog's max stack size to be 512 bytes,
      instead of limiting the whole prog hierarchy to 512 bytes.
    - Add some tests.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
2 parents c748a25 + becfe32 commit c1bc51f
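
Before the per-file diffs, a note on the layout the x86-64 JIT sets up below: each program that opts in gets one per-CPU region, and register R9 carries the private frame pointer while the program runs. A minimal sketch using only the constants and arithmetic introduced in this series (the diagram is illustrative, not text from the patches):

        /*
         * Per-CPU private stack region, one instance per possible CPU:
         *
         *   priv_stack_ptr -> [ 8B guard | round_up(stack_depth, 8) bytes | 8B guard ]
         *                                                                 ^
         *   private frame pointer (R9) = priv_stack_ptr + PRIV_STACK_GUARD_SZ
         *                                + round_up(stack_depth, 8)
         *
         * BPF stack accesses use negative offsets from the frame pointer, so
         * the low guard word catches overflow and the high guard word catches
         * underflow; both are compared against PRIV_STACK_GUARD_VAL when the
         * program is freed.
         */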

15 files changed: +930 -14 lines changed

arch/x86/net/bpf_jit_comp.c

Lines changed: 143 additions & 4 deletions
@@ -325,6 +325,22 @@ struct jit_context {
 /* Number of bytes that will be skipped on tailcall */
 #define X86_TAIL_CALL_OFFSET (12 + ENDBR_INSN_SIZE)
 
+static void push_r9(u8 **pprog)
+{
+        u8 *prog = *pprog;
+
+        EMIT2(0x41, 0x51);   /* push r9 */
+        *pprog = prog;
+}
+
+static void pop_r9(u8 **pprog)
+{
+        u8 *prog = *pprog;
+
+        EMIT2(0x41, 0x59);   /* pop r9 */
+        *pprog = prog;
+}
+
 static void push_r12(u8 **pprog)
 {
         u8 *prog = *pprog;
@@ -1404,6 +1420,24 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
         *pprog = prog;
 }
 
+static void emit_priv_frame_ptr(u8 **pprog, void __percpu *priv_frame_ptr)
+{
+        u8 *prog = *pprog;
+
+        /* movabs r9, priv_frame_ptr */
+        emit_mov_imm64(&prog, X86_REG_R9, (__force long) priv_frame_ptr >> 32,
+                       (u32) (__force long) priv_frame_ptr);
+
+#ifdef CONFIG_SMP
+        /* add <r9>, gs:[<off>] */
+        EMIT2(0x65, 0x4c);
+        EMIT3(0x03, 0x0c, 0x25);
+        EMIT((u32)(unsigned long)&this_cpu_off, 4);
+#endif
+
+        *pprog = prog;
+}
+
 #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
 
 #define __LOAD_TCC_PTR(off) \
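
The sequence emitted by emit_priv_frame_ptr() above first materializes the per-CPU cookie in R9 with a movabs and then adds this_cpu_off through the gs segment, which is how x86-64 turns a per-CPU pointer into the current CPU's address. A rough C equivalent, shown only as a sketch of what the emitted instructions compute (not code from the patch):

        void __percpu *priv_frame_ptr;  /* cookie returned by __alloc_percpu_gfp() */
        void *frame;

        /* what "movabs r9, imm64; add r9, gs:[&this_cpu_off]" computes at run time */
        frame = (void *)((__force unsigned long)priv_frame_ptr +
                         raw_cpu_read(this_cpu_off));
        /* i.e. the same address raw_cpu_ptr(priv_frame_ptr) would return */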
@@ -1412,6 +1446,10 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
 #define LOAD_TAIL_CALL_CNT_PTR(stack) \
         __LOAD_TCC_PTR(BPF_TAIL_CALL_CNT_PTR_STACK_OFF(stack))
 
+/* Memory size/value to protect private stack overflow/underflow */
+#define PRIV_STACK_GUARD_SZ 8
+#define PRIV_STACK_GUARD_VAL 0xEB9F12345678eb9fULL
+
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
                   int oldproglen, struct jit_context *ctx, bool jmp_padding)
 {
@@ -1421,18 +1459,28 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
         int insn_cnt = bpf_prog->len;
         bool seen_exit = false;
         u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
+        void __percpu *priv_frame_ptr = NULL;
         u64 arena_vm_start, user_vm_start;
+        void __percpu *priv_stack_ptr;
         int i, excnt = 0;
         int ilen, proglen = 0;
         u8 *prog = temp;
+        u32 stack_depth;
         int err;
 
+        stack_depth = bpf_prog->aux->stack_depth;
+        priv_stack_ptr = bpf_prog->aux->priv_stack_ptr;
+        if (priv_stack_ptr) {
+                priv_frame_ptr = priv_stack_ptr + PRIV_STACK_GUARD_SZ + round_up(stack_depth, 8);
+                stack_depth = 0;
+        }
+
         arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
         user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
 
         detect_reg_usage(insn, insn_cnt, callee_regs_used);
 
-        emit_prologue(&prog, bpf_prog->aux->stack_depth,
+        emit_prologue(&prog, stack_depth,
                       bpf_prog_was_classic(bpf_prog), tail_call_reachable,
                       bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
         /* Exception callback will clobber callee regs for its own use, and
@@ -1454,6 +1502,9 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
                 emit_mov_imm64(&prog, X86_REG_R12,
                                arena_vm_start >> 32, (u32) arena_vm_start);
 
+        if (priv_frame_ptr)
+                emit_priv_frame_ptr(&prog, priv_frame_ptr);
+
         ilen = prog - temp;
         if (rw_image)
                 memcpy(rw_image + proglen, temp, ilen);
@@ -1473,6 +1524,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
                 u8 *func;
                 int nops;
 
+                if (priv_frame_ptr) {
+                        if (src_reg == BPF_REG_FP)
+                                src_reg = X86_REG_R9;
+
+                        if (dst_reg == BPF_REG_FP)
+                                dst_reg = X86_REG_R9;
+                }
+
                 switch (insn->code) {
                         /* ALU */
                 case BPF_ALU | BPF_ADD | BPF_X:
@@ -2128,14 +2187,20 @@ st: if (is_imm8(insn->off))
 
                         func = (u8 *) __bpf_call_base + imm32;
                         if (tail_call_reachable) {
-                                LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
+                                LOAD_TAIL_CALL_CNT_PTR(stack_depth);
                                 ip += 7;
                         }
                         if (!imm32)
                                 return -EINVAL;
+                        if (priv_frame_ptr) {
+                                push_r9(&prog);
+                                ip += 2;
+                        }
                         ip += x86_call_depth_emit_accounting(&prog, func, ip);
                         if (emit_call(&prog, func, ip))
                                 return -EINVAL;
+                        if (priv_frame_ptr)
+                                pop_r9(&prog);
                         break;
                 }
 
@@ -2145,13 +2210,13 @@ st: if (is_imm8(insn->off))
                                                   &bpf_prog->aux->poke_tab[imm32 - 1],
                                                   &prog, image + addrs[i - 1],
                                                   callee_regs_used,
-                                                  bpf_prog->aux->stack_depth,
+                                                  stack_depth,
                                                   ctx);
                         else
                                 emit_bpf_tail_call_indirect(bpf_prog,
                                                             &prog,
                                                             callee_regs_used,
-                                                            bpf_prog->aux->stack_depth,
+                                                            stack_depth,
                                                             image + addrs[i - 1],
                                                             ctx);
                         break;
@@ -3303,6 +3368,42 @@ int arch_prepare_bpf_dispatcher(void *image, void *buf, s64 *funcs, int num_func
         return emit_bpf_dispatcher(&prog, 0, num_funcs - 1, funcs, image, buf);
 }
 
+static const char *bpf_get_prog_name(struct bpf_prog *prog)
+{
+        if (prog->aux->ksym.prog)
+                return prog->aux->ksym.name;
+        return prog->aux->name;
+}
+
+static void priv_stack_init_guard(void __percpu *priv_stack_ptr, int alloc_size)
+{
+        int cpu, underflow_idx = (alloc_size - PRIV_STACK_GUARD_SZ) >> 3;
+        u64 *stack_ptr;
+
+        for_each_possible_cpu(cpu) {
+                stack_ptr = per_cpu_ptr(priv_stack_ptr, cpu);
+                stack_ptr[0] = PRIV_STACK_GUARD_VAL;
+                stack_ptr[underflow_idx] = PRIV_STACK_GUARD_VAL;
+        }
+}
+
+static void priv_stack_check_guard(void __percpu *priv_stack_ptr, int alloc_size,
+                                   struct bpf_prog *prog)
+{
+        int cpu, underflow_idx = (alloc_size - PRIV_STACK_GUARD_SZ) >> 3;
+        u64 *stack_ptr;
+
+        for_each_possible_cpu(cpu) {
+                stack_ptr = per_cpu_ptr(priv_stack_ptr, cpu);
+                if (stack_ptr[0] != PRIV_STACK_GUARD_VAL ||
+                    stack_ptr[underflow_idx] != PRIV_STACK_GUARD_VAL) {
+                        pr_err("BPF private stack overflow/underflow detected for prog %sx\n",
+                               bpf_get_prog_name(prog));
+                        break;
+                }
+        }
+}
+
 struct x64_jit_data {
         struct bpf_binary_header *rw_header;
         struct bpf_binary_header *header;
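
To make the guard indexing above concrete, a worked example with an assumed stack depth (the numbers are not from the patch): for stack_depth = 120, priv_stack_alloc_sz = round_up(120, 8) + 2 * PRIV_STACK_GUARD_SZ = 136 bytes, i.e. 17 u64 slots, so underflow_idx = (136 - 8) >> 3 = 16 picks the last slot, the word sitting at the private frame pointer itself, while slot 0 is the word just below the lowest valid stack slot. A clobbered value found in slot 0 at free time therefore indicates overflow, and one in the top slot indicates underflow.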
@@ -3320,7 +3421,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
         struct bpf_binary_header *rw_header = NULL;
         struct bpf_binary_header *header = NULL;
         struct bpf_prog *tmp, *orig_prog = prog;
+        void __percpu *priv_stack_ptr = NULL;
         struct x64_jit_data *jit_data;
+        int priv_stack_alloc_sz;
         int proglen, oldproglen = 0;
         struct jit_context ctx = {};
         bool tmp_blinded = false;
@@ -3356,6 +3459,23 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
                 }
                 prog->aux->jit_data = jit_data;
         }
+        priv_stack_ptr = prog->aux->priv_stack_ptr;
+        if (!priv_stack_ptr && prog->aux->jits_use_priv_stack) {
+                /* Allocate actual private stack size with verifier-calculated
+                 * stack size plus two memory guards to protect overflow and
+                 * underflow.
+                 */
+                priv_stack_alloc_sz = round_up(prog->aux->stack_depth, 8) +
+                                      2 * PRIV_STACK_GUARD_SZ;
+                priv_stack_ptr = __alloc_percpu_gfp(priv_stack_alloc_sz, 8, GFP_KERNEL);
+                if (!priv_stack_ptr) {
+                        prog = orig_prog;
+                        goto out_priv_stack;
+                }
+
+                priv_stack_init_guard(priv_stack_ptr, priv_stack_alloc_sz);
+                prog->aux->priv_stack_ptr = priv_stack_ptr;
+        }
         addrs = jit_data->addrs;
         if (addrs) {
                 ctx = jit_data->ctx;
@@ -3491,6 +3611,11 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
                 bpf_prog_fill_jited_linfo(prog, addrs + 1);
 out_addrs:
                 kvfree(addrs);
+                if (!image && priv_stack_ptr) {
+                        free_percpu(priv_stack_ptr);
+                        prog->aux->priv_stack_ptr = NULL;
+                }
+out_priv_stack:
                 kfree(jit_data);
                 prog->aux->jit_data = NULL;
         }
@@ -3529,6 +3654,8 @@ void bpf_jit_free(struct bpf_prog *prog)
         if (prog->jited) {
                 struct x64_jit_data *jit_data = prog->aux->jit_data;
                 struct bpf_binary_header *hdr;
+                void __percpu *priv_stack_ptr;
+                int priv_stack_alloc_sz;
 
                 /*
                  * If we fail the final pass of JIT (from jit_subprogs),
@@ -3544,6 +3671,13 @@ void bpf_jit_free(struct bpf_prog *prog)
                 prog->bpf_func = (void *)prog->bpf_func - cfi_get_offset();
                 hdr = bpf_jit_binary_pack_hdr(prog);
                 bpf_jit_binary_pack_free(hdr, NULL);
+                priv_stack_ptr = prog->aux->priv_stack_ptr;
+                if (priv_stack_ptr) {
+                        priv_stack_alloc_sz = round_up(prog->aux->stack_depth, 8) +
+                                              2 * PRIV_STACK_GUARD_SZ;
+                        priv_stack_check_guard(priv_stack_ptr, priv_stack_alloc_sz, prog);
+                        free_percpu(prog->aux->priv_stack_ptr);
+                }
                 WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(prog));
         }
 
@@ -3559,6 +3693,11 @@ bool bpf_jit_supports_exceptions(void)
         return IS_ENABLED(CONFIG_UNWINDER_ORC);
 }
 
+bool bpf_jit_supports_private_stack(void)
+{
+        return true;
+}
+
 void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
 {
 #if defined(CONFIG_UNWINDER_ORC)

include/linux/bpf.h

Lines changed: 4 additions & 0 deletions
@@ -1507,6 +1507,7 @@ struct bpf_prog_aux {
         u32 max_rdwr_access;
         struct btf *attach_btf;
         const struct bpf_ctx_arg_aux *ctx_arg_info;
+        void __percpu *priv_stack_ptr;
         struct mutex dst_mutex; /* protects dst_* pointers below, *after* prog becomes visible */
         struct bpf_prog *dst_prog;
         struct bpf_trampoline *dst_trampoline;
@@ -1523,9 +1524,12 @@ struct bpf_prog_aux {
         bool exception_cb;
         bool exception_boundary;
         bool is_extended; /* true if extended by freplace program */
+        bool jits_use_priv_stack;
+        bool priv_stack_requested;
         u64 prog_array_member_cnt; /* counts how many times as member of prog_array */
         struct mutex ext_mutex; /* mutex for is_extended and prog_array_member_cnt */
         struct bpf_arena *arena;
+        void (*recursion_detected)(struct bpf_prog *prog); /* callback if recursion is detected */
         /* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
         const struct btf_type *attach_func_proto;
         /* function name for valid attach_btf_id */

include/linux/bpf_verifier.h

Lines changed: 8 additions & 0 deletions
@@ -633,6 +633,12 @@ struct bpf_subprog_arg_info {
         };
 };
 
+enum priv_stack_mode {
+        PRIV_STACK_UNKNOWN,
+        NO_PRIV_STACK,
+        PRIV_STACK_ADAPTIVE,
+};
+
 struct bpf_subprog_info {
         /* 'start' has to be the first field otherwise find_subprog() won't work */
         u32 start; /* insn idx of function entry point */
@@ -653,6 +659,7 @@ struct bpf_subprog_info {
         /* true if bpf_fastcall stack region is used by functions that can't be inlined */
         bool keep_fastcall_stack: 1;
 
+        enum priv_stack_mode priv_stack_mode;
         u8 arg_cnt;
         struct bpf_subprog_arg_info args[MAX_BPF_FUNC_REG_ARGS];
 };
@@ -872,6 +879,7 @@ static inline bool bpf_prog_check_recur(const struct bpf_prog *prog)
         case BPF_PROG_TYPE_TRACING:
                 return prog->expected_attach_type != BPF_TRACE_ITER;
         case BPF_PROG_TYPE_STRUCT_OPS:
+                return prog->aux->jits_use_priv_stack;
         case BPF_PROG_TYPE_LSM:
                 return false;
         default:

include/linux/filter.h

Lines changed: 1 addition & 0 deletions
@@ -1119,6 +1119,7 @@ bool bpf_jit_supports_exceptions(void);
 bool bpf_jit_supports_ptr_xchg(void);
 bool bpf_jit_supports_arena(void);
 bool bpf_jit_supports_insn(struct bpf_insn *insn, bool in_arena);
+bool bpf_jit_supports_private_stack(void);
 u64 bpf_arch_uaddress_limit(void);
 void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
 bool bpf_helper_changes_pkt_data(void *func);

kernel/bpf/core.c

Lines changed: 5 additions & 0 deletions
@@ -3045,6 +3045,11 @@ bool __weak bpf_jit_supports_exceptions(void)
         return false;
 }
 
+bool __weak bpf_jit_supports_private_stack(void)
+{
+        return false;
+}
+
 void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
 {
 }

kernel/bpf/trampoline.c

Lines changed: 4 additions & 0 deletions
@@ -899,6 +899,8 @@ static u64 notrace __bpf_prog_enter_recur(struct bpf_prog *prog, struct bpf_tram
 
         if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
                 bpf_prog_inc_misses_counter(prog);
+                if (prog->aux->recursion_detected)
+                        prog->aux->recursion_detected(prog);
                 return 0;
         }
         return bpf_prog_start_time();
@@ -975,6 +977,8 @@ u64 notrace __bpf_prog_enter_sleepable_recur(struct bpf_prog *prog,
 
         if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
                 bpf_prog_inc_misses_counter(prog);
+                if (prog->aux->recursion_detected)
+                        prog->aux->recursion_detected(prog);
                 return 0;
         }
         return bpf_prog_start_time();
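
These two hunks are what drive the recursion_detected() hook described in the cover letter: when the per-prog, per-CPU active counter shows nesting, the invocation is skipped and the owning subsystem is notified. As a usage illustration only, a hypothetical subsystem-side sketch (the callback name and reporting policy are made up; the series itself only adds the hook and these call sites):

        /* Hypothetical callback a subsystem could install on its progs. */
        static void my_subsys_recursion_detected(struct bpf_prog *prog)
        {
                pr_warn_ratelimited("bpf prog %s skipped due to recursion\n",
                                    prog->aux->name);
        }

        /* ...at program registration time... */
        prog->aux->recursion_detected = my_subsys_recursion_detected;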
