Skip to content

Commit 65176f5

Browse files
Chengming ZhouPeter Zijlstra
authored andcommitted
sched/psi: Optimize task switch inside shared cgroups again
Way back when PSI_MEM_FULL was accounted from the timer tick, task switching could simply iterate next and prev to the common ancestor to update TSK_ONCPU and be done. Then memstall ticks were replaced with checking curr->in_memstall directly in psi_group_change(). That meant that now if the task switch was between a memstall and a !memstall task, we had to iterate through the common ancestors at least ONCE to fix up their state_masks. We added the identical_state filter to make sure the common ancestor elimination was skipped in that case. It seems that was always a little too eager, because it caused us to walk the common ancestors *twice* instead of the required once: the iteration for next could have stopped at the common ancestor; prev could have updated TSK_ONCPU up to the common ancestor, then finish to the root without changing any flags, just to get the new curr->in_memstall into the state_masks. This patch recognizes this and makes it so that we walk to the root exactly once if state_mask needs updating, which is simply catching up on a missed optimization that could have been done in commit 7fae6c8 ("psi: Use ONCPU state tracking machinery to detect reclaim") directly. Apart from this, it's also necessary for the next patch "sched/psi: remove NR_ONCPU task accounting". Suppose we walk the common ancestors twice: (1) psi_group_change(.clear = 0, .set = TSK_ONCPU) (2) psi_group_change(.clear = TSK_ONCPU, .set = 0) We previously used tasks[NR_ONCPU] to record TSK_ONCPU, tasks[NR_ONCPU]++ in (1) then tasks[NR_ONCPU]-- in (2), so tasks[NR_ONCPU] still be correct. The next patch change to use one bit in state mask to record TSK_ONCPU, PSI_ONCPU bit will be set in (1), but then be cleared in (2), which cause the psi_group_cpu has task running on CPU but without PSI_ONCPU bit set! With this patch, we will never walk the common ancestors twice, so won't have above problem. Suggested-by: Johannes Weiner <[email protected]> Signed-off-by: Chengming Zhou <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
1 parent d79ddb0 commit 65176f5

File tree

1 file changed

+9
-12
lines changed

1 file changed

+9
-12
lines changed

kernel/sched/psi.c

Lines changed: 9 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -820,20 +820,15 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
820820
u64 now = cpu_clock(cpu);
821821

822822
if (next->pid) {
823-
bool identical_state;
824-
825823
psi_flags_change(next, 0, TSK_ONCPU);
826824
/*
827-
* When switching between tasks that have an identical
828-
* runtime state, the cgroup that contains both tasks
829-
* we reach the first common ancestor. Iterate @next's
830-
* ancestors only until we encounter @prev's ONCPU.
825+
* Set TSK_ONCPU on @next's cgroups. If @next shares any
826+
* ancestors with @prev, those will already have @prev's
827+
* TSK_ONCPU bit set, and we can stop the iteration there.
831828
*/
832-
identical_state = prev->psi_flags == next->psi_flags;
833829
iter = NULL;
834830
while ((group = iterate_groups(next, &iter))) {
835-
if (identical_state &&
836-
per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
831+
if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
837832
common = group;
838833
break;
839834
}
@@ -877,10 +872,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
877872
psi_group_change(group, cpu, clear, set, now, wake_clock);
878873

879874
/*
880-
* TSK_ONCPU is handled up to the common ancestor. If we're tasked
881-
* with dequeuing too, finish that for the rest of the hierarchy.
875+
* TSK_ONCPU is handled up to the common ancestor. If there are
876+
* any other differences between the two tasks (e.g. prev goes
877+
* to sleep, or only one task is memstall), finish propagating
878+
* those differences all the way up to the root.
882879
*/
883-
if (sleep) {
880+
if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
884881
clear &= ~TSK_ONCPU;
885882
for (; group; group = iterate_groups(prev, &iter))
886883
psi_group_change(group, cpu, clear, set, now, wake_clock);

0 commit comments

Comments
 (0)