Commit 0e4a169
sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
enqueue_task_fair() post commit fe8d238 ("sched/fair: Propagate
load for throttled cfs_rq") which transitioned to using
cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
propagate_entity_cfs_rq().
The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
its descendants should have their PELT frozen too or weird things can
happen as a result of children accumulating PELT signals when the
parents have their PELT clock stopped.
Another side effect of this is the loss of integrity of the leaf cfs_rq
list. As debugged by Aaron, consider the following hierarchy:
root(#)
/ \
A(#) B(*)
|
C <--- new cgroup
|
D <--- new cgroup
# - Already on leaf cfs_rq list
* - Throttled with PELT frozen
The newly created cgroups don't have their "pelt_clock_throttled" signal
synced with cgroup B. Next, the following series of events occur:
1. online_fair_sched_group() for cgroup D will call
propagate_entity_cfs_rq(). (Same can happen if a throttled task is
moved to cgroup C and enqueue_task_fair() returns early.)
propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
"rq->tmp_alone_branch" since its PELT clock is not marked throttled
and cfs_rq of cgroup B is not on the list.
cfs_rq of cgroup B is skipped since its PELT is throttled.
root cfs_rq already exists on cfs_rq leading to
list_add_leaf_cfs_rq() returning early.
The cfs_rq of cgroup C is left dangling on the
"rq->tmp_alone_branch".
2. A new task wakes up on cgroup A. Since the whole hierarchy is already
on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early
without any modifications to "rq->tmp_alone_branch".
The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".
!!! Splat !!!
Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
enough since the new cfs_rq is not yet enqueued on the hierarchy. A
dequeue on other subtree on the throttled hierarchy can freeze the PELT
clock for the parent hierarchy without setting the indicators for this
newly added cfs_rq which was never enqueued.
Since there are no tasks on the new hierarchy, start a cfs_rq on a
throttled hierarchy with its PELT clock throttled. The first enqueue, or
the distribution (whichever happens first) will unfreeze the PELT clock
and queue the cfs_rq on the leaf cfs_rq list.
While at it, add an assert_list_leaf_cfs_rq() in
propagate_entity_cfs_rq() to catch such cases in the future.
Closes: https://lore.kernel.org/lkml/[email protected]/
Fixes: e1fad12 ("sched/fair: Switch to task based throttle model")
Reported-by: Matteo Martelli <[email protected]>
Suggested-by: Aaron Lu <[email protected]>
Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Aaron Lu <[email protected]>
Tested-by: Aaron Lu <[email protected]>
Tested-by: Matteo Martelli <[email protected]>
Link: https://patch.msgid.link/[email protected]1 parent 211ddde commit 0e4a169
1 file changed
+12
-0
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
6437 | 6437 | | |
6438 | 6438 | | |
6439 | 6439 | | |
| 6440 | + | |
| 6441 | + | |
| 6442 | + | |
| 6443 | + | |
| 6444 | + | |
| 6445 | + | |
| 6446 | + | |
| 6447 | + | |
| 6448 | + | |
| 6449 | + | |
6440 | 6450 | | |
6441 | 6451 | | |
6442 | 6452 | | |
| |||
13187 | 13197 | | |
13188 | 13198 | | |
13189 | 13199 | | |
| 13200 | + | |
| 13201 | + | |
13190 | 13202 | | |
13191 | 13203 | | |
13192 | 13204 | | |
| |||
0 commit comments