
Commit b027789

minipli-oss authored and Peter Zijlstra committed
sched/fair: Prevent dead task groups from regaining cfs_rq's
Kevin is reporting crashes which point to a use-after-free of a cfs_rq in update_blocked_averages(). Initial debugging revealed that we have live cfs_rq's (on_list=1) in an about-to-be-kfree()'d task group in free_fair_sched_group(). However, it was unclear how that can happen.

His kernel config happened to lead to a layout of struct sched_entity that put the 'my_q' member directly into the middle of the object, which makes it incidentally overlap with SLUB's freelist pointer. That, in combination with SLAB_FREELIST_HARDENED's freelist pointer mangling, leads to a reliable access violation in the form of a #GP, which made the UAF fail fast.

Michal seems to have run into the same issue[1]. He already correctly diagnosed that commit a7b359f ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") is causing the preconditions for the UAF to happen by re-adding cfs_rq's also to task groups that have no more running tasks, i.e. also to dead ones. His analysis, however, misses the real root cause, and it cannot be seen from the crash backtrace alone, as the real offender is tg_unthrottle_up() getting called via sched_cfs_period_timer() from the timer interrupt at an inconvenient time.

When unregister_fair_sched_group() unlinks all cfs_rq's from the dying task group, it doesn't protect itself from getting interrupted. If the timer interrupt triggers while we iterate over all CPUs, or after unregister_fair_sched_group() has finished but prior to unlinking the task group, sched_cfs_period_timer() will execute and walk the list of task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the dying task group. These will later -- in free_fair_sched_group() -- be kfree()'ed while still being linked, leading to the fireworks Kevin and Michal are seeing.

To fix this race, ensure the dying task group gets unlinked first. However, simply switching the order of unregistering and unlinking the task group isn't sufficient, as concurrent RCU walkers might still see it, as can be seen below:

      CPU1:                                        CPU2:
        :                                            timer IRQ:
        :                                              do_sched_cfs_period_timer():
        :                                                :
        :                                                distribute_cfs_runtime():
        :                                                  rcu_read_lock();
        :                                                  :
        :                                                  unthrottle_cfs_rq():
      sched_offline_group():                                 :
        :                                                    walk_tg_tree_from(…,tg_unthrottle_up,…):
        list_del_rcu(&tg->list);                               :
   (1)  :                                                      list_for_each_entry_rcu(child, &parent->children, siblings)
        :                                                        :
   (2)  list_del_rcu(&tg->siblings);                             :
        :                                                        tg_unthrottle_up():
      unregister_fair_sched_group():                               struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
        :                                                          :
        list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);                     :
        :                                                          :
        :                                                          if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
        :                                              (3)           list_add_leaf_cfs_rq(cfs_rq);
        :                                                            :
        :                                                          :
        :                                                        :
        :                                                      :
   (4)  :                                                    rcu_read_unlock();

CPU 2 walks the task group list in parallel to sched_offline_group(); specifically, it'll read the soon-to-be-unlinked task group entry at (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink all cfs_rq's via list_del_leaf_cfs_rq() in unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of these at (3), which is the cause of the UAF later on.

To prevent this additional race from happening, we need to wait until walk_tg_tree_from() has finished traversing the task groups, i.e. after the RCU read critical section ends at (4). Afterwards we're safe to call unregister_fair_sched_group(), as each new walk won't see the dying task group any more.
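To make the new ordering concrete, here is a condensed sketch of that unlink step, paraphrased from the kernel/sched/core.c hunk further down. It is not the literal upstream code: the comment block is abbreviated and the closing spin_unlock_irqrestore() is assumed, since the hunk ends before it.

    /*
     * Sketch of sched_release_group() as introduced by this patch.
     * Unlink first, so walk_tg_tree_from() -- e.g. via
     * sched_cfs_period_timer() -- can no longer find the dying group.
     * Walkers already inside their RCU read-side critical section may
     * still see it until they reach rcu_read_unlock(), see (4) above.
     */
    void sched_release_group(struct task_group *tg)
    {
            unsigned long flags;

            spin_lock_irqsave(&task_group_lock, flags);
            list_del_rcu(&tg->list);
            list_del_rcu(&tg->siblings);
            spin_unlock_irqrestore(&task_group_lock, flags);
    }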
On top of that, we need to wait yet another RCU grace period after unregister_fair_sched_group() to ensure print_cfs_stats(), which might run concurrently, always sees valid objects, i.e. not already free'd ones.

This patch survives Michal's reproducer[2] for 8h+ now, which used to trigger within minutes before.

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/

Fixes: a7b359f ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
[peterz: shuffle code around a bit]
Reported-by: Kevin Tanguy <[email protected]>
Signed-off-by: Mathias Krause <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
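The deferred half of the teardown then looks roughly as follows -- a condensed paraphrase of the sched_free_group_rcu()/sched_unregister_group() additions in the kernel/sched/core.c hunk below, not the literal upstream code:

    static void sched_free_group_rcu(struct rcu_head *rcu)
    {
            sched_free_group(container_of(rcu, struct task_group, rcu));
    }

    /*
     * Runs only once an RCU grace period has passed since
     * sched_release_group(), so no tree walker can re-add cfs_rq's to the
     * dying group while they are being unlinked here.
     */
    static void sched_unregister_group(struct task_group *tg)
    {
            unregister_fair_sched_group(tg);
            unregister_rt_sched_group(tg);
            /*
             * Wait for yet another RCU grace period before the final
             * kfree(), as print_cfs_stats() might still be looking at the
             * cfs_rq's.
             */
            call_rcu(&tg->rcu, sched_free_group_rcu);
    }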
1 parent 42dc938 commit b027789


5 files changed (+49, -16 lines)


kernel/sched/autogroup.c

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ static inline void autogroup_destroy(struct kref *kref)
 	ag->tg->rt_se = NULL;
 	ag->tg->rt_rq = NULL;
 #endif
-	sched_offline_group(ag->tg);
+	sched_release_group(ag->tg);
 	sched_destroy_group(ag->tg);
 }

kernel/sched/core.c

Lines changed: 35 additions & 9 deletions
@@ -9719,6 +9719,22 @@ static void sched_free_group(struct task_group *tg)
 	kmem_cache_free(task_group_cache, tg);
 }
 
+static void sched_free_group_rcu(struct rcu_head *rcu)
+{
+	sched_free_group(container_of(rcu, struct task_group, rcu));
+}
+
+static void sched_unregister_group(struct task_group *tg)
+{
+	unregister_fair_sched_group(tg);
+	unregister_rt_sched_group(tg);
+	/*
+	 * We have to wait for yet another RCU grace period to expire, as
+	 * print_cfs_stats() might run concurrently.
+	 */
+	call_rcu(&tg->rcu, sched_free_group_rcu);
+}
+
 /* allocate runqueue etc for a new task group */
 struct task_group *sched_create_group(struct task_group *parent)
 {
@@ -9762,25 +9778,35 @@ void sched_online_group(struct task_group *tg, struct task_group *parent)
 }
 
 /* rcu callback to free various structures associated with a task group */
-static void sched_free_group_rcu(struct rcu_head *rhp)
+static void sched_unregister_group_rcu(struct rcu_head *rhp)
 {
 	/* Now it should be safe to free those cfs_rqs: */
-	sched_free_group(container_of(rhp, struct task_group, rcu));
+	sched_unregister_group(container_of(rhp, struct task_group, rcu));
 }
 
 void sched_destroy_group(struct task_group *tg)
 {
 	/* Wait for possible concurrent references to cfs_rqs complete: */
-	call_rcu(&tg->rcu, sched_free_group_rcu);
+	call_rcu(&tg->rcu, sched_unregister_group_rcu);
 }
 
-void sched_offline_group(struct task_group *tg)
+void sched_release_group(struct task_group *tg)
 {
 	unsigned long flags;
 
-	/* End participation in shares distribution: */
-	unregister_fair_sched_group(tg);
-
+	/*
+	 * Unlink first, to avoid walk_tg_tree_from() from finding us (via
+	 * sched_cfs_period_timer()).
+	 *
+	 * For this to be effective, we have to wait for all pending users of
+	 * this task group to leave their RCU critical section to ensure no new
+	 * user will see our dying task group any more. Specifically ensure
+	 * that tg_unthrottle_up() won't add decayed cfs_rq's to it.
+	 *
+	 * We therefore defer calling unregister_fair_sched_group() to
+	 * sched_unregister_group() which is guarantied to get called only after the
+	 * current RCU grace period has expired.
+	 */
 	spin_lock_irqsave(&task_group_lock, flags);
 	list_del_rcu(&tg->list);
 	list_del_rcu(&tg->siblings);
@@ -9899,7 +9925,7 @@ static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 
-	sched_offline_group(tg);
+	sched_release_group(tg);
 }
 
 static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
@@ -9909,7 +9935,7 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 	/*
 	 * Relies on the RCU grace period between css_released() and this.
 	 */
-	sched_free_group(tg);
+	sched_unregister_group(tg);
 }
 
 /*

kernel/sched/fair.c

Lines changed: 2 additions & 2 deletions
@@ -11456,8 +11456,6 @@ void free_fair_sched_group(struct task_group *tg)
 {
 	int i;
 
-	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
-
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
@@ -11534,6 +11532,8 @@ void unregister_fair_sched_group(struct task_group *tg)
 	struct rq *rq;
 	int cpu;
 
+	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(cpu) {
 		if (tg->se[cpu])
 			remove_entity_load_avg(tg->se[cpu]);

kernel/sched/rt.c

Lines changed: 9 additions & 3 deletions
@@ -137,13 +137,17 @@ static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
 	return rt_rq->rq;
 }
 
-void free_rt_sched_group(struct task_group *tg)
+void unregister_rt_sched_group(struct task_group *tg)
 {
-	int i;
-
 	if (tg->rt_se)
 		destroy_rt_bandwidth(&tg->rt_bandwidth);
 
+}
+
+void free_rt_sched_group(struct task_group *tg)
+{
+	int i;
+
 	for_each_possible_cpu(i) {
 		if (tg->rt_rq)
 			kfree(tg->rt_rq[i]);
@@ -250,6 +254,8 @@ static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
 	return &rq->rt;
 }
 
+void unregister_rt_sched_group(struct task_group *tg) { }
+
 void free_rt_sched_group(struct task_group *tg) { }
 
 int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)

kernel/sched/sched.h

Lines changed: 2 additions & 1 deletion
@@ -488,6 +488,7 @@ extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
 extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
 
+extern void unregister_rt_sched_group(struct task_group *tg);
 extern void free_rt_sched_group(struct task_group *tg);
 extern int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent);
 extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
@@ -503,7 +504,7 @@ extern struct task_group *sched_create_group(struct task_group *parent);
 extern void sched_online_group(struct task_group *tg,
 			       struct task_group *parent);
 extern void sched_destroy_group(struct task_group *tg);
-extern void sched_offline_group(struct task_group *tg);
+extern void sched_release_group(struct task_group *tg);
 
 extern void sched_move_task(struct task_struct *tsk);
