Commit ee8d72a

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-02-17

We've added 64 non-merge commits during the last 7 day(s) which contain
a total of 158 files changed, 4190 insertions(+), 988 deletions(-).

The main changes are:

1) Add a rbtree data structure following the "next-gen data structure"
   precedent set by recently-added linked-list, that is, by using
   kfunc + kptr instead of adding a new BPF map type, from Dave Marchevsky.

2) Add a new benchmark for hashmap lookups to BPF selftests,
   from Anton Protopopov.

3) Fix bpf_fib_lookup to only return valid neighbors and add an option
   to skip the neigh table lookup, from Martin KaFai Lau.

4) Add cgroup.memory=nobpf kernel parameter option to disable BPF memory
   accounting for container environments, from Yafang Shao.

5) Batch of ice multi-buffer and driver performance fixes,
   from Alexander Lobakin.

6) Fix a bug in determining whether global subprog's argument is
   PTR_TO_CTX, which is based on type names which breaks kprobe progs,
   from Andrii Nakryiko.

7) Prep work for future -mcpu=v4 LLVM option which includes usage of
   BPF_ST insn. Thus improve BPF_ST-related value tracking in verifier,
   from Eduard Zingerman.

8) More prep work for later building selftests with Memory Sanitizer
   in order to detect usages of undefined memory, from Ilya Leoshkevich.

9) Fix xsk sockets to check IFF_UP earlier to avoid a NULL pointer
   dereference via sendmsg(), from Maciej Fijalkowski.

10) Implement BPF trampoline for RV64 JIT compiler, from Pu Lehui.

11) Fix BPF memory allocator in combination with BPF hashtab where it could
    corrupt special fields e.g. used in bpf_spin_lock, from Hou Tao.

12) Fix LoongArch BPF JIT to always use 4 instructions for function address
    so that instruction sequences don't change between passes,
    from Hengqi Chen.
====================
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (64 commits)
  selftests/bpf: Add bpf_fib_lookup test
  bpf: Add BPF_FIB_LOOKUP_SKIP_NEIGH for bpf_fib_lookup
  riscv, bpf: Add bpf trampoline support for RV64
  riscv, bpf: Add bpf_arch_text_poke support for RV64
  riscv, bpf: Factor out emit_call for kernel and bpf context
  riscv: Extend patch_text for multiple instructions
  Revert "bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES"
  selftests/bpf: Add global subprog context passing tests
  selftests/bpf: Convert test_global_funcs test to test_loader framework
  bpf: Fix global subprog context argument resolution logic
  LoongArch, bpf: Use 4 instructions for function address in JIT
  bpf: bpf_fib_lookup should not return neigh in NUD_FAILED state
  bpf: Disable bh in bpf_test_run for xdp and tc prog
  xsk: check IFF_UP earlier in Tx path
  Fix typos in selftest/bpf files
  selftests/bpf: Use bpf_{btf,link,map,prog}_get_info_by_fd()
  samples/bpf: Use bpf_{btf,link,map,prog}_get_info_by_fd()
  bpftool: Use bpf_{btf,link,map,prog}_get_info_by_fd()
  libbpf: Use bpf_{btf,link,map,prog}_get_info_by_fd()
  libbpf: Introduce bpf_{btf,link,map,prog}_get_info_by_fd()
  ...
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2 parents 01bb11a + 168de02 commit ee8d72a

File tree

158 files changed: +4194, -1007 lines


Documentation/admin-guide/kernel-parameters.txt

Lines changed: 1 addition & 0 deletions

@@ -557,6 +557,7 @@
 			Format: <string>
 			nosocket -- Disable socket memory accounting.
 			nokmem -- Disable kernel memory accounting.
+			nobpf -- Disable BPF memory accounting.

 	checkreqprot=	[SELINUX] Set initial checkreqprot flag value.
 			Format: { "0" | "1" }

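For context, ``nobpf`` joins the existing comma-separated ``cgroup.memory=`` flags shown above. An illustrative boot entry (the kernel image path and other parameters are examples, not from this commit):

```
linux /boot/vmlinuz root=/dev/sda1 cgroup.memory=nokmem,nobpf

# After boot, the active setting can be confirmed with:
cat /proc/cmdline
```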
Documentation/bpf/graph_ds_impl.rst

Lines changed: 267 additions & 0 deletions (new file)

=========================
BPF Graph Data Structures
=========================

This document describes implementation details of new-style "graph" data
structures (linked_list, rbtree), with particular focus on the verifier's
implementation of semantics specific to those data structures.

Although no specific verifier code is referred to in this document, the document
assumes that the reader has general knowledge of BPF verifier internals, BPF
maps, and BPF program writing.

Note that the intent of this document is to describe the current state of
these graph data structures. **No guarantees** of stability for either
semantics or APIs are made or implied here.

.. contents::
   :local:
   :depth: 2

Introduction
------------

The BPF map API has historically been the main way to expose data structures
of various types for use within BPF programs. Some data structures fit naturally
with the map API (HASH, ARRAY), others less so. Consequently, programs
interacting with the latter group of data structures can be hard to parse
for kernel programmers without previous BPF experience.

Luckily, some restrictions which necessitated the use of BPF map semantics are
no longer relevant. With the introduction of kfuncs, kptrs, and the any-context
BPF allocator, it is now possible to implement BPF data structures whose API
and semantics more closely match those exposed to the rest of the kernel.

Two such data structures - linked_list and rbtree - have many verification
details in common. Because both have "root"s ("head" for linked_list) and
"node"s, the verifier code and this document refer to common functionality
as "graph_api", "graph_root", "graph_node", etc.

Unless otherwise stated, examples and semantics below apply to both graph data
structures.

Unstable API
------------

Data structures implemented using the BPF map API have historically used BPF
helper functions - either standard map API helpers like ``bpf_map_update_elem``
or map-specific helpers. The new-style graph data structures instead use kfuncs
to define their manipulation helpers. Because there are no stability guarantees
for kfuncs, the API and semantics for these data structures can be evolved in
a way that breaks backwards compatibility if necessary.

Root and node types for the new data structures are opaquely defined in the
``uapi/linux/bpf.h`` header.

Locking
-------

The new-style data structures are intrusive and are defined similarly to their
vanilla kernel counterparts:

.. code-block:: c

   struct node_data {
           long key;
           long data;
           struct bpf_rb_node node;
   };

   struct bpf_spin_lock glock;
   struct bpf_rb_root groot __contains(node_data, node);

The "root" type for both linked_list and rbtree expects to be in a map_value
which also contains a ``bpf_spin_lock`` - in the above example both global
variables are placed in a single-value arraymap. The verifier considers this
spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
the same map_value and will enforce that the correct lock is held when
verifying BPF programs that manipulate the tree. Since this lock checking
happens at verification time, there is no runtime penalty.

Non-owning references
---------------------

**Motivation**

Consider the following BPF code:

.. code-block:: c

   struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */

   bpf_spin_lock(&lock);

   bpf_rbtree_add(&tree, n); /* PASSED */

   bpf_spin_unlock(&lock);

From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of
``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
program has ownership of the pointee's (object pointed to by ``n``) lifetime.
The BPF program must pass off ownership before exiting - either via
``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
``bpf_rbtree_add``.

(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
"ownership is acquired" and "ownership is passed", respectively)

What should the verifier do with ``n`` after ownership is passed off? If the
object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
the object is no longer valid. The underlying memory may have been reused for
some other allocation, unmapped, etc.

When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
but that would result in programs with useful, common coding patterns being
rejected, e.g.:

.. code-block:: c

   int x;
   struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */

   bpf_spin_lock(&lock);

   bpf_rbtree_add(&tree, n); /* PASSED */
   x = n->data;
   n->data = 42;

   bpf_spin_unlock(&lock);

Both the read from and write to ``n->data`` would be rejected. The verifier
can do better, though, by taking advantage of two details:

* Graph data structure APIs can only be used when the ``bpf_spin_lock``
  associated with the graph root is held

* Both graph data structures have pointer stability

  * Because graph nodes are allocated with ``bpf_obj_new`` and
    adding / removing from the root involves fiddling with the
    ``bpf_{list,rb}_node`` field of the node struct, a graph node will
    remain at the same address after either operation.

Because the associated ``bpf_spin_lock`` must be held by any program adding
or removing, if we're in the critical section bounded by that lock, we know
that no other program can add or remove until the end of the critical section.
This combined with pointer stability means that, until the critical section
ends, we can safely access the graph node through ``n`` even after it was used
to pass ownership.

The verifier considers such a reference a *non-owning reference*. The ref
returned by ``bpf_obj_new`` is accordingly considered an *owning reference*.
Both terms currently only have meaning in the context of graph nodes and API.

**Details**

Let's enumerate the properties of both types of references.

*owning reference*

* This reference controls the lifetime of the pointee

* Ownership of pointee must be 'released' by passing it to some graph API
  kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee

  * If not released before program ends, verifier considers program invalid

* Access to the pointee's memory will not page fault

*non-owning reference*

* This reference does not own the pointee

  * It cannot be used to add the graph node to a graph root, nor ``free``'d
    via ``bpf_obj_drop``

* No explicit control of lifetime, but can infer valid lifetime based on
  non-owning ref existence (see explanation below)

* Access to the pointee's memory will not page fault

From verifier's perspective non-owning references can only exist
between spin_lock and spin_unlock. Why? After spin_unlock another program
can do arbitrary operations on the data structure like removing and ``free``-ing
via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
``free``'d, and reused via bpf_obj_new would point to an entirely different thing.
Or the memory could go away.

To prevent this logic violation all non-owning references are invalidated by the
verifier after a critical section ends. This is necessary to ensure the "will
not page fault" property of non-owning references. So if the verifier hasn't
invalidated a non-owning ref, accessing it will not page fault.

Currently ``bpf_obj_drop`` is not allowed in the critical section, so
if there's a valid non-owning ref, we must be in a critical section, and can
conclude that the ref's memory hasn't been dropped-and- ``free``'d or
dropped-and-reused.

Any reference to a node that is in an rbtree _must_ be non-owning, since
the tree has control of the pointee's lifetime. Similarly, any ref to a node
that isn't in rbtree _must_ be owning. This results in a nice property:
graph API add / remove implementations don't need to check if a node
has already been added (or already removed), as the ownership model
allows the verifier to prevent such a state from being valid by simply checking
types.

However, pointer aliasing poses an issue for the above "nice property".
Consider the following example:

.. code-block:: c

   struct node_data *n, *m, *o, *p;
   n = bpf_obj_new(typeof(*n));     /* 1 */

   bpf_spin_lock(&lock);

   bpf_rbtree_add(&tree, n);        /* 2 */
   m = bpf_rbtree_first(&tree);     /* 3 */

   o = bpf_rbtree_remove(&tree, n); /* 4 */
   p = bpf_rbtree_remove(&tree, m); /* 5 */

   bpf_spin_unlock(&lock);

   bpf_obj_drop(o);
   bpf_obj_drop(p); /* 6 */

Assume the tree is empty before this program runs. If we track verifier state
changes here using numbers in above comments:

1) n is an owning reference

2) n is a non-owning reference, it's been added to the tree

3) n and m are non-owning references, they both point to the same node

4) o is an owning reference, n and m non-owning, all point to same node

5) o and p are owning, n and m non-owning, all point to the same node

6) a double-free has occurred, since o and p point to same node and o was
   ``free``'d in previous statement

States 4 and 5 violate our "nice property", as there are non-owning refs to
a node which is not in an rbtree. Statement 5 will try to remove a node which
has already been removed as a result of this violation. State 6 is a dangerous
double-free.

At a minimum we should prevent state 6 from being possible. If we can't also
prevent state 5 then we must abandon our "nice property" and check whether a
node has already been removed at runtime.

We prevent both by generalizing the "invalidate non-owning references" behavior
of ``bpf_spin_unlock`` and doing similar invalidation after
``bpf_rbtree_remove``. The logic here being that any graph API kfunc which:

* takes an arbitrary node argument

* removes it from the data structure

* returns an owning reference to the removed node

May result in a state where some other non-owning reference points to the same
node. So ``remove``-type kfuncs must be considered a non-owning reference
invalidation point as well.

Documentation/bpf/other.rst

Lines changed: 2 additions & 1 deletion

@@ -6,4 +6,5 @@ Other
    :maxdepth: 1

    ringbuf
-   llvm_reloc
+   llvm_reloc
+   graph_ds_impl

MAINTAINERS

Lines changed: 7 additions & 0 deletions

@@ -4023,6 +4023,13 @@ L: [email protected]
 S:	Maintained
 F:	tools/testing/selftests/bpf/

+BPF [DOCUMENTATION] (Related to Standardization)
+R:	David Vernet <[email protected]>
+
+
+S:	Maintained
+F:	Documentation/bpf/instruction-set.rst
+
 BPF [MISC]

 S:	Odd Fixes

arch/loongarch/net/bpf_jit.c

Lines changed: 1 addition & 1 deletion

@@ -841,7 +841,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx, bool ext
 		if (ret < 0)
 			return ret;

-		move_imm(ctx, t1, func_addr, is32);
+		move_addr(ctx, t1, func_addr);
 		emit_insn(ctx, jirl, t1, LOONGARCH_GPR_RA, 0);
 		move_reg(ctx, regmap[BPF_REG_0], LOONGARCH_GPR_A0);
 		break;

arch/loongarch/net/bpf_jit.h

Lines changed: 21 additions & 0 deletions

@@ -82,6 +82,27 @@ static inline void emit_sext_32(struct jit_ctx *ctx, enum loongarch_gpr reg, boo
 	emit_insn(ctx, addiw, reg, reg, 0);
 }

+static inline void move_addr(struct jit_ctx *ctx, enum loongarch_gpr rd, u64 addr)
+{
+	u64 imm_11_0, imm_31_12, imm_51_32, imm_63_52;
+
+	/* lu12iw rd, imm_31_12 */
+	imm_31_12 = (addr >> 12) & 0xfffff;
+	emit_insn(ctx, lu12iw, rd, imm_31_12);
+
+	/* ori rd, rd, imm_11_0 */
+	imm_11_0 = addr & 0xfff;
+	emit_insn(ctx, ori, rd, rd, imm_11_0);
+
+	/* lu32id rd, imm_51_32 */
+	imm_51_32 = (addr >> 32) & 0xfffff;
+	emit_insn(ctx, lu32id, rd, imm_51_32);
+
+	/* lu52id rd, rd, imm_63_52 */
+	imm_63_52 = (addr >> 52) & 0xfff;
+	emit_insn(ctx, lu52id, rd, rd, imm_63_52);
+}
+
 static inline void move_imm(struct jit_ctx *ctx, enum loongarch_gpr rd, long imm, bool is32)
 {
 	long imm_11_0, imm_31_12, imm_51_32, imm_63_52, imm_51_0, imm_51_31;

arch/riscv/include/asm/patch.h

Lines changed: 1 addition & 1 deletion

@@ -7,6 +7,6 @@
 #define _ASM_RISCV_PATCH_H

 int patch_text_nosync(void *addr, const void *insns, size_t len);
-int patch_text(void *addr, u32 insn);
+int patch_text(void *addr, u32 *insns, int ninsns);

 #endif /* _ASM_RISCV_PATCH_H */

arch/riscv/kernel/patch.c

Lines changed: 12 additions & 7 deletions

@@ -15,7 +15,8 @@

 struct patch_insn {
 	void *addr;
-	u32 insn;
+	u32 *insns;
+	int ninsns;
 	atomic_t cpu_count;
 };

@@ -102,12 +103,15 @@ NOKPROBE_SYMBOL(patch_text_nosync);
 static int patch_text_cb(void *data)
 {
 	struct patch_insn *patch = data;
-	int ret = 0;
+	unsigned long len;
+	int i, ret = 0;

 	if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
-		ret =
-		    patch_text_nosync(patch->addr, &patch->insn,
-				      GET_INSN_LENGTH(patch->insn));
+		for (i = 0; ret == 0 && i < patch->ninsns; i++) {
+			len = GET_INSN_LENGTH(patch->insns[i]);
+			ret = patch_text_nosync(patch->addr + i * len,
+						&patch->insns[i], len);
+		}
 		atomic_inc(&patch->cpu_count);
 	} else {
 		while (atomic_read(&patch->cpu_count) <= num_online_cpus())
@@ -119,11 +123,12 @@ static int patch_text_cb(void *data)
 }
 NOKPROBE_SYMBOL(patch_text_cb);

-int patch_text(void *addr, u32 insn)
+int patch_text(void *addr, u32 *insns, int ninsns)
 {
 	struct patch_insn patch = {
 		.addr = addr,
-		.insn = insn,
+		.insns = insns,
+		.ninsns = ninsns,
 		.cpu_count = ATOMIC_INIT(0),
 	};
