|
| 1 | +========================= |
| 2 | +BPF Graph Data Structures |
| 3 | +========================= |
| 4 | + |
| 5 | +This document describes implementation details of new-style "graph" data |
| 6 | +structures (linked_list, rbtree), with particular focus on the verifier's |
| 7 | +implementation of semantics specific to those data structures. |
| 8 | + |
| 9 | +Although no specific verifier code is referred to in this document, the document |
| 10 | +assumes that the reader has general knowledge of BPF verifier internals, BPF |
| 11 | +maps, and BPF program writing. |
| 12 | + |
| 13 | +Note that the intent of this document is to describe the current state of |
| 14 | +these graph data structures. **No guarantees** of stability for either |
| 15 | +semantics or APIs are made or implied here. |
| 16 | + |
| 17 | +.. contents:: |
| 18 | + :local: |
| 19 | + :depth: 2 |
| 20 | + |
| 21 | +Introduction |
| 22 | +------------ |
| 23 | + |
| 24 | +The BPF map API has historically been the main way to expose data structures |
| 25 | +of various types for use within BPF programs. Some data structures fit naturally |
| 26 | +with the map API (HASH, ARRAY), others less so. Consequentially, programs |
| 27 | +interacting with the latter group of data structures can be hard to parse |
| 28 | +for kernel programmers without previous BPF experience. |
| 29 | + |
| 30 | +Luckily, some restrictions which necessitated the use of BPF map semantics are |
| 31 | +no longer relevant. With the introduction of kfuncs, kptrs, and the any-context |
| 32 | +BPF allocator, it is now possible to implement BPF data structures whose API |
| 33 | +and semantics more closely match those exposed to the rest of the kernel. |
| 34 | + |
| 35 | +Two such data structures - linked_list and rbtree - have many verification |
| 36 | +details in common. Because both have "root"s ("head" for linked_list) and |
| 37 | +"node"s, the verifier code and this document refer to common functionality |
| 38 | +as "graph_api", "graph_root", "graph_node", etc. |
| 39 | + |
| 40 | +Unless otherwise stated, examples and semantics below apply to both graph data |
| 41 | +structures. |
| 42 | + |
| 43 | +Unstable API |
| 44 | +------------ |
| 45 | + |
| 46 | +Data structures implemented using the BPF map API have historically used BPF |
| 47 | +helper functions - either standard map API helpers like ``bpf_map_update_elem`` |
| 48 | +or map-specific helpers. The new-style graph data structures instead use kfuncs |
| 49 | +to define their manipulation helpers. Because there are no stability guarantees |
| 50 | +for kfuncs, the API and semantics for these data structures can be evolved in |
| 51 | +a way that breaks backwards compatibility if necessary. |
| 52 | + |
| 53 | +Root and node types for the new data structures are opaquely defined in the |
| 54 | +``uapi/linux/bpf.h`` header. |
| 55 | + |
| 56 | +Locking |
| 57 | +------- |
| 58 | + |
| 59 | +The new-style data structures are intrusive and are defined similarly to their |
| 60 | +vanilla kernel counterparts: |
| 61 | + |
| 62 | +.. code-block:: c |
| 63 | +
|
| 64 | + struct node_data { |
| 65 | + long key; |
| 66 | + long data; |
| 67 | + struct bpf_rb_node node; |
| 68 | + }; |
| 69 | +
|
| 70 | + struct bpf_spin_lock glock; |
| 71 | + struct bpf_rb_root groot __contains(node_data, node); |
| 72 | +
|
| 73 | +The "root" type for both linked_list and rbtree expects to be in a map_value |
| 74 | +which also contains a ``bpf_spin_lock`` - in the above example both global |
| 75 | +variables are placed in a single-value arraymap. The verifier considers this |
| 76 | +spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in |
| 77 | +the same map_value and will enforce that the correct lock is held when |
| 78 | +verifying BPF programs that manipulate the tree. Since this lock checking |
| 79 | +happens at verification time, there is no runtime penalty. |
| 80 | + |
| 81 | +Non-owning references |
| 82 | +--------------------- |
| 83 | + |
| 84 | +**Motivation** |
| 85 | + |
| 86 | +Consider the following BPF code: |
| 87 | + |
| 88 | +.. code-block:: c |
| 89 | +
|
| 90 | + struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */ |
| 91 | +
|
| 92 | + bpf_spin_lock(&lock); |
| 93 | +
|
| 94 | + bpf_rbtree_add(&tree, n); /* PASSED */ |
| 95 | +
|
| 96 | + bpf_spin_unlock(&lock); |
| 97 | +
|
| 98 | +From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new`` |
| 99 | +has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of |
| 100 | +``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the |
| 101 | +program has ownership of the pointee's (object pointed to by ``n``) lifetime. |
| 102 | +The BPF program must pass off ownership before exiting - either via |
| 103 | +``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with |
| 104 | +``bpf_rbtree_add``. |
| 105 | + |
| 106 | +(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where |
| 107 | +"ownership is acquired" and "ownership is passed", respectively) |
| 108 | + |
| 109 | +What should the verifier do with ``n`` after ownership is passed off? If the |
| 110 | +object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier |
| 111 | +should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as |
| 112 | +the object is no longer valid. The underlying memory may have been reused for |
| 113 | +some other allocation, unmapped, etc. |
| 114 | + |
| 115 | +When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less |
| 116 | +obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``, |
| 117 | +but that would result in programs with useful, common coding patterns being |
| 118 | +rejected, e.g.: |
| 119 | + |
| 120 | +.. code-block:: c |
| 121 | +
|
| 122 | + int x; |
| 123 | + struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */ |
| 124 | +
|
| 125 | + bpf_spin_lock(&lock); |
| 126 | +
|
| 127 | + bpf_rbtree_add(&tree, n); /* PASSED */ |
| 128 | + x = n->data; |
| 129 | + n->data = 42; |
| 130 | +
|
| 131 | + bpf_spin_unlock(&lock); |
| 132 | +
|
| 133 | +Both the read from and write to ``n->data`` would be rejected. The verifier |
| 134 | +can do better, though, by taking advantage of two details: |
| 135 | + |
| 136 | + * Graph data structure APIs can only be used when the ``bpf_spin_lock`` |
| 137 | + associated with the graph root is held |
| 138 | + |
| 139 | + * Both graph data structures have pointer stability |
| 140 | + |
| 141 | + * Because graph nodes are allocated with ``bpf_obj_new`` and |
| 142 | + adding / removing from the root involves fiddling with the |
| 143 | + ``bpf_{list,rb}_node`` field of the node struct, a graph node will |
| 144 | + remain at the same address after either operation. |
| 145 | + |
| 146 | +Because the associated ``bpf_spin_lock`` must be held by any program adding |
| 147 | +or removing, if we're in the critical section bounded by that lock, we know |
| 148 | +that no other program can add or remove until the end of the critical section. |
| 149 | +This combined with pointer stability means that, until the critical section |
| 150 | +ends, we can safely access the graph node through ``n`` even after it was used |
| 151 | +to pass ownership. |
| 152 | + |
| 153 | +The verifier considers such a reference a *non-owning reference*. The ref |
| 154 | +returned by ``bpf_obj_new`` is accordingly considered an *owning reference*. |
| 155 | +Both terms currently only have meaning in the context of graph nodes and API. |
| 156 | + |
| 157 | +**Details** |
| 158 | + |
| 159 | +Let's enumerate the properties of both types of references. |
| 160 | + |
| 161 | +*owning reference* |
| 162 | + |
| 163 | + * This reference controls the lifetime of the pointee |
| 164 | + |
| 165 | + * Ownership of pointee must be 'released' by passing it to some graph API |
| 166 | + kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee |
| 167 | + |
| 168 | + * If not released before program ends, verifier considers program invalid |
| 169 | + |
| 170 | + * Access to the pointee's memory will not page fault |
| 171 | + |
| 172 | +*non-owning reference* |
| 173 | + |
| 174 | + * This reference does not own the pointee |
| 175 | + |
| 176 | + * It cannot be used to add the graph node to a graph root, nor ``free``'d via |
| 177 | + ``bpf_obj_drop`` |
| 178 | + |
| 179 | + * No explicit control of lifetime, but can infer valid lifetime based on |
| 180 | + non-owning ref existence (see explanation below) |
| 181 | + |
| 182 | + * Access to the pointee's memory will not page fault |
| 183 | + |
| 184 | +From verifier's perspective non-owning references can only exist |
| 185 | +between spin_lock and spin_unlock. Why? After spin_unlock another program |
| 186 | +can do arbitrary operations on the data structure like removing and ``free``-ing |
| 187 | +via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd, |
| 188 | +``free``'d, and reused via bpf_obj_new would point to an entirely different thing. |
| 189 | +Or the memory could go away. |
| 190 | + |
| 191 | +To prevent this logic violation all non-owning references are invalidated by the |
| 192 | +verifier after a critical section ends. This is necessary to ensure the "will |
| 193 | +not page fault" property of non-owning references. So if the verifier hasn't |
| 194 | +invalidated a non-owning ref, accessing it will not page fault. |
| 195 | + |
| 196 | +Currently ``bpf_obj_drop`` is not allowed in the critical section, so |
| 197 | +if there's a valid non-owning ref, we must be in a critical section, and can |
| 198 | +conclude that the ref's memory hasn't been dropped-and- ``free``'d or |
| 199 | +dropped-and-reused. |
| 200 | + |
| 201 | +Any reference to a node that is in an rbtree _must_ be non-owning, since |
| 202 | +the tree has control of the pointee's lifetime. Similarly, any ref to a node |
| 203 | +that isn't in rbtree _must_ be owning. This results in a nice property: |
| 204 | +graph API add / remove implementations don't need to check if a node |
| 205 | +has already been added (or already removed), as the ownership model |
| 206 | +allows the verifier to prevent such a state from being valid by simply checking |
| 207 | +types. |
| 208 | + |
| 209 | +However, pointer aliasing poses an issue for the above "nice property". |
| 210 | +Consider the following example: |
| 211 | + |
| 212 | +.. code-block:: c |
| 213 | +
|
| 214 | + struct node_data *n, *m, *o, *p; |
| 215 | + n = bpf_obj_new(typeof(*n)); /* 1 */ |
| 216 | +
|
| 217 | + bpf_spin_lock(&lock); |
| 218 | +
|
| 219 | + bpf_rbtree_add(&tree, n); /* 2 */ |
| 220 | + m = bpf_rbtree_first(&tree); /* 3 */ |
| 221 | +
|
| 222 | + o = bpf_rbtree_remove(&tree, n); /* 4 */ |
| 223 | + p = bpf_rbtree_remove(&tree, m); /* 5 */ |
| 224 | +
|
| 225 | + bpf_spin_unlock(&lock); |
| 226 | +
|
| 227 | + bpf_obj_drop(o); |
| 228 | + bpf_obj_drop(p); /* 6 */ |
| 229 | +
|
| 230 | +Assume the tree is empty before this program runs. If we track verifier state |
| 231 | +changes here using numbers in above comments: |
| 232 | + |
| 233 | + 1) n is an owning reference |
| 234 | + |
| 235 | + 2) n is a non-owning reference, it's been added to the tree |
| 236 | + |
| 237 | + 3) n and m are non-owning references, they both point to the same node |
| 238 | + |
| 239 | + 4) o is an owning reference, n and m non-owning, all point to same node |
| 240 | + |
| 241 | + 5) o and p are owning, n and m non-owning, all point to the same node |
| 242 | + |
| 243 | + 6) a double-free has occurred, since o and p point to same node and o was |
| 244 | + ``free``'d in previous statement |
| 245 | + |
| 246 | +States 4 and 5 violate our "nice property", as there are non-owning refs to |
| 247 | +a node which is not in an rbtree. Statement 5 will try to remove a node which |
| 248 | +has already been removed as a result of this violation. State 6 is a dangerous |
| 249 | +double-free. |
| 250 | + |
| 251 | +At a minimum we should prevent state 6 from being possible. If we can't also |
| 252 | +prevent state 5 then we must abandon our "nice property" and check whether a |
| 253 | +node has already been removed at runtime. |
| 254 | + |
| 255 | +We prevent both by generalizing the "invalidate non-owning references" behavior |
| 256 | +of ``bpf_spin_unlock`` and doing similar invalidation after |
| 257 | +``bpf_rbtree_remove``. The logic here being that any graph API kfunc which: |
| 258 | + |
| 259 | + * takes an arbitrary node argument |
| 260 | + |
| 261 | + * removes it from the data structure |
| 262 | + |
| 263 | + * returns an owning reference to the removed node |
| 264 | + |
| 265 | +May result in a state where some other non-owning reference points to the same |
| 266 | +node. So ``remove``-type kfuncs must be considered a non-owning reference |
| 267 | +invalidation point as well. |
0 commit comments