# kernelCTF: CVE-2024-26923_lts_cos
## Triggering the race condition

In this vulnerability we have two race windows.
The first one is in unix_stream_connect():

```
static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
			       int addr_len, int flags)
{
	...
	unix_peer(sk) = newsk;

	[window 1 start]
	unix_state_unlock(sk);

	/* take ten and send info to listening sock */
	spin_lock(&other->sk_receive_queue.lock);
	[window 1 end]

	__skb_queue_tail(&other->sk_receive_queue, skb);
	spin_unlock(&other->sk_receive_queue.lock);
	...
```

This function is triggered by calling connect() on CPU 0. This CPU will do nothing else until the race-condition part of the exploit is over.

We have to use the other available CPU to perform 3 operations during this window:
1. Send the victim socket through this connecting socket.
2. Close the victim socket.
3. Trigger garbage collection and run unix_gc() until the start of window 2.

If we start too early, our send will fail because the socket is not connected yet and the exploit attempt will fail.
The second window is in the unix_gc() function:

```
void unix_gc(void)
{
	...
	list_for_each_entry(u, &gc_candidates, link)
		scan_children(&u->sk, dec_inflight, NULL);

	[window 2 start]
	/* Restore the references for children of all candidates,
	 * which have remaining references. Do this recursively, so
	 * only those remain, which form cyclic references.
	 *
	 * Use a "cursor" link, to make the list traversal safe, even
	 * though elements might be moved about.
	 */
	list_add(&cursor, &gc_candidates);
	[window 2 end]
	while (cursor.next != &gc_candidates) {
		u = list_entry(cursor.next, struct unix_sock, link);

		/* Move cursor to after the current position. */
		list_move(&cursor, &u->link);

		if (atomic_long_read(&u->inflight) > 0) {
			list_move_tail(&u->link, &not_cycle_list);
			__clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
			scan_children(&u->sk, inc_inflight_move_tail, NULL);
		}
	}
	list_del(&cursor);
	...
```

For the vulnerability to be triggered, two conditions have to be met:
1. The first scan_children() must not see the embryo in the receive queue of the server socket.
2. The second scan_children() has to see the embryo.

This causes a decrement/increment mismatch and the resulting use-after-free.

In other words, window 2 has to run inside window 1 of unix_stream_connect().

To have a chance of aligning the two threads correctly we have to extend both race windows as much as possible.
To do that we use a well-known timerfd technique invented by Jann Horn.
The basic idea is to set an hrtimer-based timerfd to trigger a timer interrupt during our race window and to attach a lot (as many as RLIMIT_NOFILE allows) of epoll watches to this timerfd, making the time needed to handle the interrupt longer.
For more details see the original [blog post](https://googleprojectzero.blogspot.com/2022/03/racing-against-clock-hitting-tiny.html).
Here's the triggering sequence (we use 2 CPUs; CPU 1 executes the child_send() thread):

| CPU 0 | CPU 1 |
| -------- | -------- |
| arms timer 1 to trigger after a delay | - |
| calls connect() on the client socket | - |
| unix_stream_connect() runs until the start of window 1 | - |
| timer 1 is triggered during window 1 | - |
| timer goes through all epoll notifications | sends victim socket through the client socket |
| ... | closes victim socket |
| ... | arms timer 2 to trigger after a delay |
| ... | closes another socket to trigger unix_gc() |
| ... | unix_gc() runs until the start of window 2 |
| ... | timer 2 is triggered during window 2 |
| timer 1 handler ends | timer goes through all epoll notifications |
| window 1 ends, embryo is added to the receive queue | ... |
| - | timer 2 handler ends, window 2 ends |
| - | second scan_children() executes inc_inflight_move_tail() on the victim socket |
## Exploiting the use-after-free

At this point our victim socket is in flight, linked into the gc_inflight_list, and has an inflight reference count of 2.
The next step is to receive this socket and close it. This will cause its struct sock object to be freed, but it will stay referenced in the gc_inflight_list.
In the case of unix sockets, struct sock is allocated from a separate kmalloc cache called 'UNIX'. On our target one slab takes an order-2 (size 0x4000) page and fits 15 objects.

To be able to exploit the use-after-free we have to cause the slab containing our victim object to be discarded and returned to the page allocator.
This is done using standard cross-cache techniques:
1. Free all objects of the given slab.
2. Create a lot of partial slabs to unfreeze the empty slab and get it discarded.
However, in this case we need maximum reliability - winning the race is such a rare event that we can't afford to make mistakes in the later stages of the exploit.

Because of this we used the /proc/zoneinfo parsing technique to establish a known UNIX cache state before starting the exploit attempt.
This is done in the get_fresh_unix() function.
One problem that we have to solve is that when a unix socket is created, an allocation is also made from sock_inode_cache, which uses slabs of the same size (0x4000) as the UNIX cache, causing issues with detecting a new UNIX slab.

To solve this we first allocate some netlink sockets, which do not have their own dedicated sock object cache, so the only order-2 page allocation comes from sock_inode_cache.
This allows us to get a fresh sock_inode_cache slab first and then proceed with unix socket allocations to get a fresh UNIX slab.
After the slab page is returned to the page allocator we can easily reallocate it using an xattr of size 0x4000 - xattrs larger than 0x2000 are allocated directly from the page allocator.
## Getting RIP control

At this point we have a struct sock object linked into the gc_inflight_list that we can fill with arbitrary data.
This list is used by unix_gc(): if we craft a convincing enough fake sock object, unix_gc() will be able to traverse the gc_inflight_list and move sk_buff objects from our sock object to the 'hitlist' that is then passed to skb_queue_purge().

unix_gc() uses list handling functions to move the victim object between lists multiple times and CONFIG_DEBUG_LIST is on, so our object has to have valid prev/next list pointers.
Also, fields such as sk_socket->file are accessed, meaning we also have to craft the related objects (or at least parts of them): struct socket, struct file, struct inode and finally sk_buff - this last object will contain the function pointer giving us RIP control and the ROP chain.
But first, we need a place with a known address to store all these objects.

### Crafting objects in kernel memory

The fake kernel objects are sprayed into physical memory by creating a lot of large tmpfs xattrs and are referenced using a direct mapping address - more details about that can be found in the [novel techniques](novel-techniques.md) section.

The first fake socket, which replaces the victim object on the gc_inflight_list, is prepared in prepare_sock() and has pointers to the ones prepared in prepare_more_socks(). These are actually allocated at the very beginning of the exploit - we can do that because their location in memory is known in advance.
### Triggering the sk_buff destructor to get RIP control

When an inflight socket is chosen to be released by unix_gc(), the sk_buff carrying it is removed from the sk_receive_queue and linked into the 'hitlist', and then skb_queue_purge() is called on that list:

```
static inline void __skb_queue_purge(struct sk_buff_head *list)
{
	struct sk_buff *skb;
	while ((skb = __skb_dequeue(list)) != NULL)
		kfree_skb(skb);
}
```

If this is the last reference to a given skb, skb_release_head_state() is eventually called:

```
void skb_release_head_state(struct sk_buff *skb)
{
	skb_dst_drop(skb);
	if (skb->destructor) {
		DEBUG_NET_WARN_ON_ONCE(in_hardirq());
		skb->destructor(skb);
	}
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
	nf_conntrack_put(skb_nfct(skb));
#endif
	skb_ext_put(skb);
}
```

Because we control all the contents of the sk_buff object, we can make sure the destructor will be called.
## Pivot to ROP

When the destructor is called, RDI contains a pointer to our fake sk_buff object.
The first 16 bytes of this object are used by the list head, so we can't start our ROP chain there.

The following chain of gadgets allows us to pivot to rdi+0x10, where our ROP chain starts:
180+
181+
```
182+
mov r8,QWORD PTR [rdi+0xc8]
183+
mov eax,0x1
184+
test r8,r8
185+
je 0xffffffff8218aac1
186+
mov rsi,rdi
187+
mov rcx,r14
188+
mov rdi,rbp
189+
mov rdx,r15
190+
call 0xffffffff8242ca60 <__x86_indirect_thunk_r8>
191+
```
192+
193+
This copies RDI to RSI
194+
195+
```
196+
push rsi
197+
jmp qword [rsi-0x70]
198+
```
199+
200+
This pushes RSI to the stack. We can safely use -0x70 offset because our sk_buff is a part of a larger allocation
201+
202+
Finally:
203+
204+
```
205+
pop rsp
206+
pop rbp
207+
pop rbx
208+
ret
209+
```
210+
211+
Two pops at the end move RSP after the list_head head of sk_buff
212+
213+
## Second pivot
214+
215+
There is not much space at the beginning of the sk_buff - next used field is at 0x38 offset, so we have space for only 3 gadgets, but this is enough to pivot to a larger space with a simple pop rsp ; ret
## Privilege escalation

The second stage of the ROP chain performs the standard commit_creds(init_cred); switch_task_namespaces(pid, init_nsproxy); sequence and returns to userspace.
### Storing objects under a known address in kernel memory

There's a surprisingly simple way to store an almost unlimited amount of data at a known kernel address, which makes tricks like using cpu_entry_area obsolete.

All data stored in memory is accessible in kernel mode via the direct physical memory mapping.
The virtual address of an object is the start of the direct mapping (page_offset_base) plus an offset based on the PFN of the physical page.
Physical memory addresses of heap slabs or user memory are easily predicted - only kernel code/data sections are randomized.
Even if everything were randomized, we could just spray most of the physical memory with our payload, defeating any such mitigation.
page_offset_base is randomized, but on systems with PTI disabled we can use a side-channel technique like prefetch to leak this address, the same way we do with the start of the kernel code section.

There are many ways to store data in memory - I prefer using large xattrs on tmpfs.
The maximum size of one xattr is 0xffff bytes, and allocations over 0x2000 bytes are served directly from the page allocator.
## Requirements to trigger the vulnerability

- Kernel configuration: CONFIG_UNIX
- User namespaces required: no

## Commit which introduced the vulnerability

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1fd05ba5a2f2aa8e7b9b52ef55df850e2e7d54c9

## Commit which fixed the vulnerability

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=47d8ac011fe1c9251070e1bd64cb10b48193ec51

## Affected kernel versions

Introduced in 3.0. Fixed in 6.1.87, 6.6.28 and other stable trees.

## Affected component, subsystem

net/unix

## Description

The garbage collector does not take into account the risk of an embryo getting enqueued during garbage collection. If such an embryo has a peer that carries SCM_RIGHTS, two consecutive passes of scan_children() may see a different set of children, leading to an incorrectly elevated inflight count and then a dangling pointer within the gc_inflight_list.

Sockets are AF_UNIX/SOCK_STREAM:
- S is an unconnected socket
- L is a listening in-flight socket bound to addr, not in fdtable
- V's fd will be passed via sendmsg(), gets inflight count bumped
```
connect(S, addr)         sendmsg(S, [V]); close(V)      __unix_gc()
----------------         -------------------------      -----------

NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
unix_state_lock(L)
unix_peer(S) = NS

                         // V count=1 inflight=0

                         NS = unix_peer(S)
                         skb2 = sock_alloc()
                         skb_queue_tail(NS, skb2[V])

                         // V became in-flight
                         // V count=2 inflight=1

                         close(V)

                         // V count=1 inflight=1
                         // GC candidate condition met

                                                        for u in gc_inflight_list:
                                                          if (total_refs == inflight_refs)
                                                            add u to gc_candidates

                                                        // gc_candidates={L, V}

                                                        for u in gc_candidates:
                                                          scan_children(u, dec_inflight)

                                                        // embryo (skb1) was not
                                                        // reachable from L yet, so V's
                                                        // inflight remains unchanged

__skb_queue_tail(L, skb1)
unix_state_unlock(L)

                                                        for u in gc_candidates:
                                                          if (u.inflight)
                                                            scan_children(u, inc_inflight_move_tail)

                                                        // V count=1 inflight=2 (!)
```
## Building the exploit

```make
INCLUDES =
LIBS = -pthread -ldl
CFLAGS = -fomit-frame-pointer -static -fcf-protection=none

exploit: exploit.c kernelver_17800.147.54.h
	gcc -o $@ exploit.c $(INCLUDES) $(CFLAGS) $(LIBS)

prerequisites:
	sudo apt-get install libkeyutils-dev
```