
Commit d20eb2d

Frederic Weisbecker authored and Peter Zijlstra committed
perf: Fix irq work dereferencing garbage
The following commit:

	da916e9 ("perf: Make perf_pmu_unregister() useable")

has introduced two significant changes to an event's parent lifecycle:

1) An event that has exited now has EVENT_TOMBSTONE as a parent.
   This can result in a situation where the delayed wakeup irq_work can
   accidentally dereference EVENT_TOMBSTONE on:

   CPU 0                                          CPU 1
   -----                                          -----

   __schedule()
       local_irq_disable()
       rq_lock()
   <NMI>
       perf_event_overflow()
           irq_work_queue(&child->pending_irq)
   </NMI>
       perf_event_task_sched_out()
           raw_spin_lock(&ctx->lock)
           ctx_sched_out()
           ctx->is_active = 0
           event_sched_out(child)
           raw_spin_unlock(&ctx->lock)
                                                  perf_event_release_kernel(parent)
                                                      perf_remove_from_context(child)
                                                          raw_spin_lock_irq(&ctx->lock)
                                                          // Sees !ctx->is_active
                                                          // Removes from context inline
                                                          __perf_remove_from_context(child)
                                                              perf_child_detach(child)
                                                                  event->parent = EVENT_TOMBSTONE
       raw_spin_rq_unlock_irq(rq);
   <IRQ>
       perf_pending_irq()
           perf_event_wakeup(child)
               ring_buffer_wakeup(child)
                   rcu_dereference(child->parent->rb) <--- CRASH

   This also concerns the call to kill_fasync() on parent->fasync.

2) The final parent reference count decrement can now happen before the
   final child reference count decrement, i.e. the parent can now be
   freed before its child. On PREEMPT_RT, this can result in a situation
   where the delayed wakeup irq_work can accidentally dereference a freed
   parent:

   CPU 0                      CPU 1                     CPU 2
   -----                      -----                     -----

   perf_pmu_unregister()
       pmu_detach_events()
           pmu_get_event()
               atomic_long_inc_not_zero(&child->refcount)

                              <NMI>
                                  perf_event_overflow()
                                      irq_work_queue(&child->pending_irq);
                              </NMI>
                              <IRQ>
                                  irq_work_run()
                                      wake_irq_workd()
                              </IRQ>
                              preempt_schedule_irq()
                              =========> SWITCH to workd

                              irq_work_run_list()
                                  perf_pending_irq()
                                      perf_event_wakeup(child)
                                          ring_buffer_wakeup(child)
                                              event = child->parent
                                                        perf_event_release_kernel(parent)
                                                            // Not last ref, PMU holds it
                                                            put_event(child)
                                                            // Last ref
                                                            put_event(parent)
                                                                free_event()
                                                                    call_rcu(...)
                                                        rcu_core()
                                                            free_event_rcu()
                                              rcu_dereference(event->rb) <--- CRASH

   This also concerns the call to kill_fasync() on parent->fasync.

The "easy" solution to 1) is to check that event->parent is not
EVENT_TOMBSTONE in perf_event_wakeup() (covering both the ring buffer
and fasync uses).

The "easy" solution to 2) is to make perf_event_wakeup() run wholly
under rcu_read_lock().

However, because of 2), sanity would prescribe making event::parent an
__rcu pointer and annotating each and every user to prove they are
reliable.

Propose an alternate solution instead: restore the stable pointer to
the parent until all its children have called _free_event() themselves,
to avoid any further accident. Also revert the EVENT_TOMBSTONE design,
which mostly exists to determine which caller of perf_event_exit_event()
must perform the refcount decrement on a child event matching the
increment in inherit_event().

Arrange instead for checking the attach state of an event prior to its
removal and decrement the refcount of the child accordingly.

Fixes: da916e9 ("perf: Make perf_pmu_unregister() useable")
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
1 parent 22d38ba commit d20eb2d

File tree

1 file changed: +15 −16 lines changed


kernel/events/core.c

Lines changed: 15 additions & 16 deletions
@@ -208,7 +208,6 @@ static void perf_ctx_unlock(struct perf_cpu_context *cpuctx,
 }
 
 #define TASK_TOMBSTONE ((void *)-1L)
-#define EVENT_TOMBSTONE ((void *)-1L)
 
 static bool is_kernel_event(struct perf_event *event)
 {
@@ -2338,12 +2337,6 @@ static void perf_child_detach(struct perf_event *event)
 
 	sync_child_event(event);
 	list_del_init(&event->child_list);
-	/*
-	 * Cannot set to NULL, as that would confuse the situation vs
-	 * not being a child event. See for example unaccount_event().
-	 */
-	event->parent = EVENT_TOMBSTONE;
-	put_event(parent_event);
 }
 
 static bool is_orphaned_event(struct perf_event *event)
@@ -5705,7 +5698,7 @@ static void put_event(struct perf_event *event)
 	_free_event(event);
 
 	/* Matches the refcount bump in inherit_event() */
-	if (parent && parent != EVENT_TOMBSTONE)
+	if (parent)
 		put_event(parent);
 }
 
@@ -9998,7 +9991,7 @@ void perf_event_text_poke(const void *addr, const void *old_bytes,
 
 void perf_event_itrace_started(struct perf_event *event)
 {
-	event->attach_state |= PERF_ATTACH_ITRACE;
+	WRITE_ONCE(event->attach_state, event->attach_state | PERF_ATTACH_ITRACE);
 }
 
 static void perf_log_itrace_start(struct perf_event *event)
@@ -13922,10 +13915,7 @@ perf_event_exit_event(struct perf_event *event,
 {
 	struct perf_event *parent_event = event->parent;
 	unsigned long detach_flags = DETACH_EXIT;
-	bool is_child = !!parent_event;
-
-	if (parent_event == EVENT_TOMBSTONE)
-		parent_event = NULL;
+	unsigned int attach_state;
 
 	if (parent_event) {
 		/*
@@ -13942,6 +13932,8 @@ perf_event_exit_event(struct perf_event *event,
 		 */
 		detach_flags |= DETACH_GROUP | DETACH_CHILD;
 		mutex_lock(&parent_event->child_mutex);
+		/* PERF_ATTACH_ITRACE might be set concurrently */
+		attach_state = READ_ONCE(event->attach_state);
 	}
 
 	if (revoke)
@@ -13951,18 +13943,25 @@ perf_event_exit_event(struct perf_event *event,
 	/*
 	 * Child events can be freed.
 	 */
-	if (is_child) {
-		if (parent_event) {
-			mutex_unlock(&parent_event->child_mutex);
+	if (parent_event) {
+		mutex_unlock(&parent_event->child_mutex);
+
+		/*
+		 * Match the refcount initialization. Make sure it doesn't happen
+		 * twice if pmu_detach_event() calls it on an already exited task.
+		 */
+		if (attach_state & PERF_ATTACH_CHILD) {
 			/*
 			 * Kick perf_poll() for is_event_hup();
 			 */
 			perf_event_wakeup(parent_event);
 			/*
 			 * pmu_detach_event() will have an extra refcount.
+			 * perf_pending_task() might have one too.
 			 */
 			put_event(event);
 		}
+
 		return;
 	}
 
