Commit 318e18e

Pingfan Liu authored and Tejun Heo (htejun) committed
sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
*** Bug description ***

When testing kexec-reboot on a 144-CPU machine with isolcpus=managed_irq,domain,1-71,73-143 on the kernel command line, I encountered the following bug:

[   97.114759] psci: CPU142 killed (polled 0 ms)
[   97.333236] Failed to offline CPU143 - error=-16
[   97.333246] ------------[ cut here ]------------
[   97.342682] kernel BUG at kernel/cpu.c:1569!
[   97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[...]

In essence, the issue originates in the CPU hot-removal process and is not limited to kexec. It can be reproduced by writing a SCHED_DEADLINE program that waits indefinitely on a semaphore, spawning multiple instances so that some run on CPU 72, and then offlining CPUs 1-143 one by one. When attempting this, CPU 143 fails to go offline:

bash -c 'taskset -cp 0 $$ && for i in {1..143}; do echo 0 > /sys/devices/system/cpu/cpu$i/online 2>/dev/null; done'

Tracking this down, I found that dl_bw_deactivate() returned -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU. But the bandwidth is not genuinely exhausted; the failure results from the following factors: when a CPU becomes inactive, its cpu_rq()->rd is set to def_root_domain. A blocked-state deadline task (in this case, "cppc_fie") is not migrated to CPU0, so its task_rq() information is stale and its rq->rd points to def_root_domain instead of the root domain shared with CPU0. As a result, its bandwidth is accounted into the wrong root domain during the domain rebuild.

*** Issue ***

The key point is that a root_domain is only reachable through an active rq->rd. To avoid introducing a global data structure that tracks all root_domains in the system, there must be a way to locate an active CPU within the corresponding root domain.

*** Solution ***

To locate an active CPU, two rules of the deadline subsystem are useful:
1. Any CPU belongs to exactly one root domain at a given time.
2. The DL bandwidth checker guarantees that a root domain with reserved DL bandwidth keeps at least one active CPU.

Now, let's examine the blocked-state task P.
If P is attached to a cpuset that is a partition root, it is straightforward to find an active CPU. If P is attached to a cpuset that has changed from 'root' to 'member', the active CPUs are folded into the parent root domain; naturally, the CPUs' capacity and reserved DL bandwidth are accounted in that ancestor root domain. (In practice, it may be unsafe to attach P to an arbitrary root domain, since that domain may lack sufficient DL bandwidth for P.) Again, it is straightforward to find an active CPU in the ancestor root domain.

This patch groups CPUs into an isolated set and a housekeeping set. For the housekeeping group, it walks up the cpuset hierarchy to find active CPUs in P's root domain and retrieves the valid rd from cpu_rq(cpu)->rd.

Signed-off-by: Pingfan Liu <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Chen Ridong <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Pierre Gondois <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Valentin Schneider <[email protected]>
To: [email protected]
Signed-off-by: Tejun Heo <[email protected]>
1 parent 1f38221 commit 318e18e

1 file changed: +48 additions, -6 deletions

kernel/sched/deadline.c

Lines changed: 48 additions & 6 deletions
@@ -2465,6 +2465,7 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu
 	return NULL;
 }
 
+/* Access rule: must be called on local CPU with preemption disabled */
 static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
 
 static int find_later_rq(struct task_struct *task)
@@ -2907,28 +2908,69 @@ void __init init_sched_dl_class(void)
 			GFP_KERNEL, cpu_to_node(i));
 }
 
+/*
+ * This function always returns a non-empty bitmap in @cpus. This is because
+ * if a root domain has reserved bandwidth for DL tasks, the DL bandwidth
+ * check will prevent CPU hotplug from deactivating all CPUs in that domain.
+ */
+static void dl_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
+{
+	const struct cpumask *hk_msk;
+
+	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
+	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
+		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
+			/*
+			 * CPUs isolated by isolcpus="domain" always belong to
+			 * def_root_domain.
+			 */
+			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
+			return;
+		}
+	}
+
+	/*
+	 * If a root domain holds a DL task, it must have active CPUs. So
+	 * active CPUs can always be found by walking the task's cpuset
+	 * hierarchy up to the partition root.
+	 */
+	cpuset_cpus_allowed_locked(p, cpus);
+}
+
+/* The caller should hold cpuset_mutex */
 void dl_add_task_root_domain(struct task_struct *p)
 {
 	struct rq_flags rf;
 	struct rq *rq;
 	struct dl_bw *dl_b;
+	unsigned int cpu;
+	struct cpumask *msk = this_cpu_cpumask_var_ptr(local_cpu_mask_dl);
 
 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
 	if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
 		raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
 		return;
 	}
 
-	rq = __task_rq_lock(p, &rf);
-
+	/*
+	 * Get an active rq, whose rq->rd tracks the correct root domain.
+	 * Ideally this would be under the cpuset reader lock until rq->rd is
+	 * fetched. However, sleepable locks cannot nest inside pi_lock, so we
+	 * rely on the caller of dl_add_task_root_domain() holding
+	 * 'cpuset_mutex' to guarantee the CPU stays in the cpuset.
+	 */
+	dl_get_task_effective_cpus(p, msk);
+	cpu = cpumask_first_and(cpu_active_mask, msk);
+	BUG_ON(cpu >= nr_cpu_ids);
+	rq = cpu_rq(cpu);
 	dl_b = &rq->rd->dl_bw;
-	raw_spin_lock(&dl_b->lock);
+	/* End of fetching rd */
 
+	raw_spin_lock(&dl_b->lock);
 	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
-
 	raw_spin_unlock(&dl_b->lock);
-
-	task_rq_unlock(rq, p, &rf);
+	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
 }
 
 void dl_clear_root_domain(struct root_domain *rd)
