
Commit 10a0e6f

anna-marialx authored and KAGA-KOKO committed
timers/migration: Move hierarchy setup into cpuhotplug prepare callback
When a CPU comes online the first time, it is possible that a new top
level group will be created. In general all propagation is done from the
bottom to top. This minimizes complexity and prevents possible races. But
when a new top level group is created, the formerly top level group needs
to be connected to the new level. This is the only time when the
direction of propagation changes: the changes are propagated from top
(new top level group) to bottom (formerly top level group).

This introduces two races (see (A) and (B)) as reported by Frederic:

(A) This race happens when marking the formerly top level group as
active, but the last active CPU of the formerly top level group goes
idle. Then it is likely that the formerly top level group is no longer
active, but is nevertheless marked as active in the new top level group:

              [GRP0:0]
           migrator = 0
           active   = 0
           nextevt  = KTIME_MAX
            /       \
           0      1 .. 7
        active      idle

0) The hierarchy has for now only 8 CPUs and CPU 0 is the only active CPU.

                     [GRP1:0]
                migrator = TMIGR_NONE
                active   = NONE
                nextevt  = KTIME_MAX
                                \
              [GRP0:0]                      [GRP0:1]
           migrator = 0                  migrator = TMIGR_NONE
           active   = 0                  active   = NONE
           nextevt  = KTIME_MAX          nextevt  = KTIME_MAX
            /       \
           0      1 .. 7                     8
        active      idle                  !online

1) CPU 8 is booting and creates a new group in the first level, GRP0:1,
   and therefore also a new top group, GRP1:0. For now the setup code has
   proceeded only up to the connection between GRP0:1 and the new top
   group. The connection between CPU 8 and GRP0:1 is not yet established
   and CPU 8 is still !online.

                     [GRP1:0]
                migrator = TMIGR_NONE
                active   = NONE
                nextevt  = KTIME_MAX
                 /              \
              [GRP0:0]                      [GRP0:1]
           migrator = 0                  migrator = TMIGR_NONE
           active   = 0                  active   = NONE
           nextevt  = KTIME_MAX          nextevt  = KTIME_MAX
            /       \
           0      1 .. 7                     8
        active      idle                  !online

2) The setup code now connects GRP0:0 to GRP1:0 and observes in
   tmigr_connect_child_parent() that GRP0:0 is not TMIGR_NONE. So it
   prepares to call tmigr_active_up() on it. It has not done it yet.
                     [GRP1:0]
                migrator = TMIGR_NONE
                active   = NONE
                nextevt  = KTIME_MAX
                 /              \
              [GRP0:0]                      [GRP0:1]
           migrator = TMIGR_NONE         migrator = TMIGR_NONE
           active   = NONE               active   = NONE
           nextevt  = KTIME_MAX          nextevt  = KTIME_MAX
            /       \
           0      1 .. 7                     8
          idle      idle                  !online

3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with
   GRP0:0->lock held, CPU 0 observes GRP1:0 after calling
   tmigr_update_events() and it propagates the change to the top (no
   change there and no wakeup programmed since there is no timer).

                     [GRP1:0]
                migrator = GRP0:0
                active   = GRP0:0
                nextevt  = KTIME_MAX
                 /              \
              [GRP0:0]                      [GRP0:1]
           migrator = TMIGR_NONE         migrator = TMIGR_NONE
           active   = NONE               active   = NONE
           nextevt  = KTIME_MAX          nextevt  = KTIME_MAX
            /       \
           0      1 .. 7                     8
          idle      idle                  !online

4) Now the setup code finally calls tmigr_active_up() and sets GRP0:0
   active in GRP1:0.

                     [GRP1:0]
                migrator = GRP0:0
                active   = GRP0:0, GRP0:1
                nextevt  = KTIME_MAX
                 /              \
              [GRP0:0]                      [GRP0:1]
           migrator = TMIGR_NONE         migrator = 8
           active   = NONE               active   = 8
           nextevt  = KTIME_MAX          nextevt  = KTIME_MAX
            /       \                        |
           0      1 .. 7                     8
          idle      idle                  active

5) Now CPU 8 is connected with GRP0:1 and CPU 8 calls tmigr_active_up()
   out of tmigr_cpu_online().

                     [GRP1:0]
                migrator = GRP0:0
                active   = GRP0:0
                nextevt  = T8
                 /              \
              [GRP0:0]                      [GRP0:1]
           migrator = TMIGR_NONE         migrator = TMIGR_NONE
           active   = NONE               active   = NONE
           nextevt  = KTIME_MAX          nextevt  = T8
            /       \                        |
           0      1 .. 7                     8
          idle      idle                    idle

6) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator.
   But GRP0:0 is not really active, so T8 gets ignored.

--> The update which is done in the third step is not noticed by the
setup code. So a wrong migrator is set in the top level group and a
timer could get ignored.

(B) Reading group->parent and group->childmask while a hierarchy update
is ongoing and reaches the formerly top level group is racy, as those
values could be inconsistent. (The notation of migrator and active now
changes slightly in contrast to the above example, as now the childmasks
are used.)
                     [GRP1:0]
                migrator = TMIGR_NONE
                active   = 0x00
                nextevt  = KTIME_MAX
                                \
              [GRP0:0]                      [GRP0:1]
           migrator = TMIGR_NONE         migrator = TMIGR_NONE
           active   = 0x00               active   = 0x00
           nextevt  = KTIME_MAX          nextevt  = KTIME_MAX
           childmask= 0                  childmask= 1
           parent   = NULL               parent   = GRP1:0
            /       \
           0      1 .. 7                     8
          idle      idle                  !online
                                         childmask=1

1) The hierarchy has 8 CPUs. CPU 8 is at the moment in the process of
   onlining but did not yet connect GRP0:0 to GRP1:0.

                     [GRP1:0]
                migrator = TMIGR_NONE
                active   = 0x00
                nextevt  = KTIME_MAX
                 /              \
              [GRP0:0]                      [GRP0:1]
           migrator = TMIGR_NONE         migrator = TMIGR_NONE
           active   = 0x00               active   = 0x00
           nextevt  = KTIME_MAX          nextevt  = KTIME_MAX
           childmask= 0                  childmask= 1
           parent   = GRP1:0             parent   = GRP1:0
            /       \
           0      1 .. 7                     8
          idle      idle                  !online
                                         childmask=1

2) The setup code (running on CPU 8) now connects GRP0:0 to GRP1:0,
   updates the parent pointer of GRP0:0 and ...

                     [GRP1:0]
                migrator = TMIGR_NONE
                active   = 0x00
                nextevt  = KTIME_MAX
                 /              \
              [GRP0:0]                      [GRP0:1]
           migrator = 0x01               migrator = TMIGR_NONE
           active   = 0x01               active   = 0x00
           nextevt  = KTIME_MAX          nextevt  = KTIME_MAX
           childmask= 0                  childmask= 1
           parent   = GRP1:0             parent   = GRP1:0
            /       \
           0      1 .. 7                     8
        active      idle                  !online
                                         childmask=1

                                         tmigr_walk.childmask = 0

3) ... CPU 0 comes active at the same time. As the migrator in GRP0:0 was
   TMIGR_NONE, the childmask of GRP0:0 is stored in the update propagation
   data structure tmigr_walk (as the update of the childmask is not yet
   visible/updated). And now ...

                     [GRP1:0]
                migrator = TMIGR_NONE
                active   = 0x00
                nextevt  = KTIME_MAX
                 /              \
              [GRP0:0]                      [GRP0:1]
           migrator = 0x01               migrator = TMIGR_NONE
           active   = 0x01               active   = 0x00
           nextevt  = KTIME_MAX          nextevt  = KTIME_MAX
           childmask= 2                  childmask= 1
           parent   = GRP1:0             parent   = GRP1:0
            /       \
           0      1 .. 7                     8
        active      idle                  !online
                                         childmask=1

                                         tmigr_walk.childmask = 0

4) ... the childmask of GRP0:0 is updated by CPU 8 (still part of the
   setup code).
                     [GRP1:0]
                migrator = 0x00
                active   = 0x00
                nextevt  = KTIME_MAX
                 /              \
              [GRP0:0]                      [GRP0:1]
           migrator = 0x01               migrator = TMIGR_NONE
           active   = 0x01               active   = 0x00
           nextevt  = KTIME_MAX          nextevt  = KTIME_MAX
           childmask= 2                  childmask= 1
           parent   = GRP1:0             parent   = GRP1:0
            /       \
           0      1 .. 7                     8
        active      idle                  !online
                                         childmask=1

                                         tmigr_walk.childmask = 0

5) CPU 0 sees the connection to GRP1:0 and now propagates the active
   state to GRP1:0, but with childmask = 0 as stored in the propagation
   data structure.

--> Now GRP1:0 always has a migrator, as 0x00 != TMIGR_NONE, and to all
CPUs it looks like GRP1:0 is always active.

To prevent those races, the setup of the hierarchy is moved into the
cpuhotplug prepare callback. The prepare callback is not executed by the
CPU which will come online, it is executed by the CPU which prepares the
onlining of the other CPU. This CPU is active while it is connecting the
formerly top level to the new one. This prevents (A) from happening and
it also prevents any further walk above the formerly top level until that
active CPU becomes inactive, releasing the new ->parent and ->childmask
updates to be visible to any subsequent walk up above the formerly top
level hierarchy. This prevents (B) from happening. The direction of the
updates is now forced to look like "from bottom to top".

However, while the active CPU prevents tmigr_cpu_(in)active() from
walking up with the update not or only half visible, nothing prevents
walking up to the new top with a 0 childmask in tmigr_handle_remote_up()
or tmigr_requires_handle_remote_up() if the active CPU doing the prepare
is not the migrator. But then it is fine because:

 * tmigr_check_migrator() should just return false
 * The migrator is active and should eventually observe the new childmask
   at some point in a future tick.

Split the setup functionality of the online callback into the cpuhotplug
prepare callback and setup the hotplug state. Change the init call into
early_initcall() to make sure an already active CPU prepares everything
for newly upcoming CPUs.
Reorder the code so that all prepare related functions are close to each
other and the online and offline callbacks are also close together.

Fixes: 7ee9887 ("timers: Implement the hierarchical pull model")
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
1 parent facd40a commit 10a0e6f


2 files changed: +120 -87 lines changed


include/linux/cpuhotplug.h

Lines changed: 1 addition & 0 deletions
@@ -122,6 +122,7 @@ enum cpuhp_state {
 	CPUHP_KVM_PPC_BOOK3S_PREPARE,
 	CPUHP_ZCOMP_PREPARE,
 	CPUHP_TIMERS_PREPARE,
+	CPUHP_TMIGR_PREPARE,
 	CPUHP_MIPS_SOC_PREPARE,
 	CPUHP_BP_PREPARE_DYN,
 	CPUHP_BP_PREPARE_DYN_END	= CPUHP_BP_PREPARE_DYN + 20,

kernel/time/timer_migration.c

Lines changed: 119 additions & 87 deletions
@@ -1438,6 +1438,66 @@ u64 tmigr_quick_check(u64 nextevt)
 	return KTIME_MAX;
 }
 
+/*
+ * tmigr_trigger_active() - trigger a CPU to become active again
+ *
+ * This function is executed on a CPU which is part of cpu_online_mask, when the
+ * last active CPU in the hierarchy is offlining. With this, it is ensured that
+ * the other CPU is active and takes over the migrator duty.
+ */
+static long tmigr_trigger_active(void *unused)
+{
+	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+
+	WARN_ON_ONCE(!tmc->online || tmc->idle);
+
+	return 0;
+}
+
+static int tmigr_cpu_offline(unsigned int cpu)
+{
+	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+	int migrator;
+	u64 firstexp;
+
+	raw_spin_lock_irq(&tmc->lock);
+	tmc->online = false;
+	WRITE_ONCE(tmc->wakeup, KTIME_MAX);
+
+	/*
+	 * CPU has to handle the local events on his own, when on the way to
+	 * offline; Therefore nextevt value is set to KTIME_MAX
+	 */
+	firstexp = __tmigr_cpu_deactivate(tmc, KTIME_MAX);
+	trace_tmigr_cpu_offline(tmc);
+	raw_spin_unlock_irq(&tmc->lock);
+
+	if (firstexp != KTIME_MAX) {
+		migrator = cpumask_any_but(cpu_online_mask, cpu);
+		work_on_cpu(migrator, tmigr_trigger_active, NULL);
+	}
+
+	return 0;
+}
+
+static int tmigr_cpu_online(unsigned int cpu)
+{
+	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+
+	/* Check whether CPU data was successfully initialized */
+	if (WARN_ON_ONCE(!tmc->tmgroup))
+		return -EINVAL;
+
+	raw_spin_lock_irq(&tmc->lock);
+	trace_tmigr_cpu_online(tmc);
+	tmc->idle = timer_base_is_idle();
+	if (!tmc->idle)
+		__tmigr_cpu_activate(tmc);
+	tmc->online = true;
+	raw_spin_unlock_irq(&tmc->lock);
+	return 0;
+}
+
 static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
 			     int node)
 {
@@ -1510,9 +1570,10 @@ static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node,
 }
 
 static void tmigr_connect_child_parent(struct tmigr_group *child,
-				       struct tmigr_group *parent)
+				       struct tmigr_group *parent,
+				       bool activate)
 {
-	union tmigr_state childstate;
+	struct tmigr_walk data;
 
 	raw_spin_lock_irq(&child->lock);
 	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
@@ -1525,6 +1586,9 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 
 	trace_tmigr_connect_child_parent(child);
 
+	if (!activate)
+		return;
+
 	/*
 	 * To prevent inconsistent states, active children need to be active in
 	 * the new parent as well. Inactive children are already marked inactive
@@ -1540,22 +1604,24 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	 * child to the new parent. So tmigr_connect_child_parent() is
 	 * executed with the formerly top level group (child) and the newly
 	 * created group (parent).
+	 *
+	 * * It is ensured that the child is active, as this setup path is
+	 *   executed in hotplug prepare callback. This is exectued by an
+	 *   already connected and !idle CPU. Even if all other CPUs go idle,
+	 *   the CPU executing the setup will be responsible up to current top
+	 *   level group. And the next time it goes inactive, it will release
+	 *   the new childmask and parent to subsequent walkers through this
+	 *   @child. Therefore propagate active state unconditionally.
 	 */
-	childstate.state = atomic_read(&child->migr_state);
-	if (childstate.migrator != TMIGR_NONE) {
-		struct tmigr_walk data;
-
-		data.childmask = child->childmask;
+	data.childmask = child->childmask;
 
-		/*
-		 * There is only one new level per time (which is protected by
-		 * tmigr_mutex). When connecting the child and the parent and
-		 * set the child active when the parent is inactive, the parent
-		 * needs to be the uppermost level. Otherwise there went
-		 * something wrong!
-		 */
-		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
-	}
+	/*
+	 * There is only one new level per time (which is protected by
+	 * tmigr_mutex). When connecting the child and the parent and set the
+	 * child active when the parent is inactive, the parent needs to be the
+	 * uppermost level. Otherwise there went something wrong!
+	 */
+	WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
 }
 
 static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
@@ -1608,7 +1674,7 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 	 * Update tmc -> group / child -> group connection
 	 */
 	if (i == 0) {
-		struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+		struct tmigr_cpu *tmc = per_cpu_ptr(&tmigr_cpu, cpu);
 
 		raw_spin_lock_irq(&group->lock);
 
@@ -1623,7 +1689,8 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 			continue;
 		} else {
 			child = stack[i - 1];
-			tmigr_connect_child_parent(child, group);
+			/* Will be activated at online time */
+			tmigr_connect_child_parent(child, group, false);
 		}
 
 		/* check if uppermost level was newly created */
@@ -1634,12 +1701,21 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 
 		lvllist = &tmigr_level_list[top];
 		if (group->num_children == 1 && list_is_singular(lvllist)) {
+			/*
+			 * The target CPU must never do the prepare work, except
+			 * on early boot when the boot CPU is the target. Otherwise
+			 * it may spuriously activate the old top level group inside
+			 * the new one (nevertheless whether old top level group is
+			 * active or not) and/or release an uninitialized childmask.
+			 */
+			WARN_ON_ONCE(cpu == raw_smp_processor_id());
+
 			lvllist = &tmigr_level_list[top - 1];
 			list_for_each_entry(child, lvllist, list) {
 				if (child->parent)
 					continue;
 
-				tmigr_connect_child_parent(child, group);
+				tmigr_connect_child_parent(child, group, true);
 			}
 		}
 	}
@@ -1661,80 +1737,31 @@ static int tmigr_add_cpu(unsigned int cpu)
 	return ret;
 }
 
-static int tmigr_cpu_online(unsigned int cpu)
+static int tmigr_cpu_prepare(unsigned int cpu)
 {
-	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	int ret;
-
-	/* First online attempt? Initialize CPU data */
-	if (!tmc->tmgroup) {
-		raw_spin_lock_init(&tmc->lock);
-
-		ret = tmigr_add_cpu(cpu);
-		if (ret < 0)
-			return ret;
-
-		if (tmc->childmask == 0)
-			return -EINVAL;
-
-		timerqueue_init(&tmc->cpuevt.nextevt);
-		tmc->cpuevt.nextevt.expires = KTIME_MAX;
-		tmc->cpuevt.ignore = true;
-		tmc->cpuevt.cpu = cpu;
-
-		tmc->remote = false;
-		WRITE_ONCE(tmc->wakeup, KTIME_MAX);
-	}
-	raw_spin_lock_irq(&tmc->lock);
-	trace_tmigr_cpu_online(tmc);
-	tmc->idle = timer_base_is_idle();
-	if (!tmc->idle)
-		__tmigr_cpu_activate(tmc);
-	tmc->online = true;
-	raw_spin_unlock_irq(&tmc->lock);
-	return 0;
-}
-
-/*
- * tmigr_trigger_active() - trigger a CPU to become active again
- *
- * This function is executed on a CPU which is part of cpu_online_mask, when the
- * last active CPU in the hierarchy is offlining. With this, it is ensured that
- * the other CPU is active and takes over the migrator duty.
- */
-static long tmigr_trigger_active(void *unused)
-{
-	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+	struct tmigr_cpu *tmc = per_cpu_ptr(&tmigr_cpu, cpu);
+	int ret = 0;
 
-	WARN_ON_ONCE(!tmc->online || tmc->idle);
-
-	return 0;
-}
-
-static int tmigr_cpu_offline(unsigned int cpu)
-{
-	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	int migrator;
-	u64 firstexp;
+	/* Not first online attempt? */
+	if (tmc->tmgroup)
+		return ret;
 
-	raw_spin_lock_irq(&tmc->lock);
-	tmc->online = false;
+	raw_spin_lock_init(&tmc->lock);
+	timerqueue_init(&tmc->cpuevt.nextevt);
+	tmc->cpuevt.nextevt.expires = KTIME_MAX;
+	tmc->cpuevt.ignore = true;
+	tmc->cpuevt.cpu = cpu;
+	tmc->remote = false;
 	WRITE_ONCE(tmc->wakeup, KTIME_MAX);
 
-	/*
-	 * CPU has to handle the local events on his own, when on the way to
-	 * offline; Therefore nextevt value is set to KTIME_MAX
-	 */
-	firstexp = __tmigr_cpu_deactivate(tmc, KTIME_MAX);
-	trace_tmigr_cpu_offline(tmc);
-	raw_spin_unlock_irq(&tmc->lock);
+	ret = tmigr_add_cpu(cpu);
+	if (ret < 0)
+		return ret;
 
-	if (firstexp != KTIME_MAX) {
-		migrator = cpumask_any_but(cpu_online_mask, cpu);
-		work_on_cpu(migrator, tmigr_trigger_active, NULL);
-	}
+	if (tmc->childmask == 0)
+		return -EINVAL;
 
-	return 0;
+	return ret;
 }
 
 static int __init tmigr_init(void)
@@ -1793,6 +1820,11 @@ static int __init tmigr_init(void)
 			tmigr_hierarchy_levels, TMIGR_CHILDREN_PER_GROUP,
 			tmigr_crossnode_level);
 
+	ret = cpuhp_setup_state(CPUHP_TMIGR_PREPARE, "tmigr:prepare",
+				tmigr_cpu_prepare, NULL);
+	if (ret)
+		goto err;
+
 	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online",
 				tmigr_cpu_online, tmigr_cpu_offline);
 	if (ret)
@@ -1804,4 +1836,4 @@ static int __init tmigr_init(void)
 	pr_err("Timer migration setup failed\n");
 	return ret;
 }
-late_initcall(tmigr_init);
+early_initcall(tmigr_init);
