corresponding CPU's leaf node lock is held. This avoids race conditions
between RCU's hotplug notifier hooks, the grace period initialization
code, and the FQS loop, all of which refer to or modify this bookkeeping.

Note that grace period initialization (rcu_gp_init()) must carefully sequence
CPU hotplug scanning with grace period state changes. For example, the
following race could occur in rcu_gp_init() if rcu_seq_start() were to happen
after the CPU-hotplug scan:

.. code-block:: none

   CPU0 (rcu_gp_init)                  CPU1                         CPU2
   ---------------------               ----                         ----
   // Hotplug scan first (WRONG ORDER)
   rcu_for_each_leaf_node(rnp) {
       rnp->qsmaskinit = rnp->qsmaskinitnext;
   }
                                       rcutree_report_cpu_starting()
                                           rnp->qsmaskinitnext |= mask;
                                       rcu_read_lock()
                                       r0 = *X;
                                                                    r1 = *X;
                                                                    X = NULL;
                                                                    cookie = get_state_synchronize_rcu();
                                                                    // cookie = 8 (future GP)
   rcu_seq_start(&rcu_state.gp_seq);
   // gp_seq = 5

   // CPU1 now invisible to this GP!
   rcu_for_each_node_breadth_first() {
       rnp->qsmask = rnp->qsmaskinit;
       // CPU1 not included!
   }

   // GP completes without CPU1
   rcu_seq_end(&rcu_state.gp_seq);
   // gp_seq = 8
                                                                    poll_state_synchronize_rcu(cookie);
                                                                    // Returns true!
                                                                    kfree(r1);
                                       r2 = *r0; // USE-AFTER-FREE!

By incrementing gp_seq first, CPU1's RCU read-side critical section
is guaranteed not to be missed by CPU2.

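The correct ordering can be sketched as follows (a condensed sketch, not the
verbatim kernel source; the locking around the snapshot, described in the
next subsection, is elided here):

.. code-block:: none

   // rcu_gp_init(), correct ordering (simplified sketch):

   rcu_seq_start(&rcu_state.gp_seq);          // 1. Publish the GP start first
                                              //    (gp_seq: 4 -> 5)...
   rcu_for_each_leaf_node(rnp) {              // 2. ...then scan hotplug state
       rnp->qsmaskinit = rnp->qsmaskinitnext; //    and take the snapshot
   }

The numbered comments mark the two steps whose order matters; the snapshot
itself is additionally protected by ``ofl_lock`` and the leaf ``rnp->lock``,
as described below.
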
**Concurrent Quiescent State Reporting for Offline CPUs**

RCU must ensure that CPUs going offline report quiescent states to avoid
blocking grace periods. This requires careful synchronization to handle
race conditions.

**Race condition causing an offline CPU to hang the GP**

A race between CPU offlining and new GP initialization (``rcu_gp_init()``)
may occur because ``rcu_report_qs_rnp()`` in ``rcutree_report_cpu_dead()``
must temporarily release the ``rcu_node`` lock to wake the RCU grace-period
kthread:

.. code-block:: none

   CPU1 (going offline)                CPU0 (GP kthread)
   --------------------                -----------------
   rcutree_report_cpu_dead()
     rcu_report_qs_rnp()
       // Must release rnp->lock to wake GP kthread
       raw_spin_unlock_irqrestore_rcu_node()
                                       // Wakes up and starts new GP
                                       rcu_gp_init()
                                         // First loop:
                                         copies qsmaskinitnext->qsmaskinit
                                         // CPU1 still in qsmaskinitnext!

                                         // Second loop:
                                         rnp->qsmask = rnp->qsmaskinit
                                         mask = rnp->qsmask & ~rnp->qsmaskinitnext
                                         // mask is 0! CPU1 still in both masks
       // Reacquire lock (but too late)
   rnp->qsmaskinitnext &= ~mask  // Finally clears bit

Without ``ofl_lock``, the new grace period includes the offline CPU and waits
forever for its quiescent state, causing a GP hang.

**A solution with ofl_lock**

The ``ofl_lock`` (offline lock) prevents ``rcu_gp_init()`` from running during
the vulnerable window when ``rcu_report_qs_rnp()`` has released ``rnp->lock``:

.. code-block:: none

   CPU0 (rcu_gp_init)                  CPU1 (rcutree_report_cpu_dead)
   ------------------                  ------------------------------
   rcu_for_each_leaf_node(rnp) {
       arch_spin_lock(&ofl_lock) ----->    arch_spin_lock(&ofl_lock) [BLOCKED]

       // Safe: CPU1 can't interfere
       rnp->qsmaskinit = rnp->qsmaskinitnext

       arch_spin_unlock(&ofl_lock) --->    // Now CPU1 can proceed
   }                                       // But snapshot already taken

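In C, this pairing looks roughly like the sketch below, modeled on the first
loop of ``rcu_gp_init()`` (simplified; the actual code in kernel/rcu/tree.c
performs additional bookkeeping under these locks):

.. code-block:: none

   // First loop of rcu_gp_init() (simplified sketch):
   rcu_for_each_leaf_node(rnp) {
       local_irq_disable();
       arch_spin_lock(&rcu_state.ofl_lock);   // Exclude rcutree_report_cpu_dead()
       raw_spin_lock_rcu_node(rnp);
       rnp->qsmaskinit = rnp->qsmaskinitnext; // Snapshot is atomic with respect
                                              // to concurrent offline reports
       raw_spin_unlock_rcu_node(rnp);
       arch_spin_unlock(&rcu_state.ofl_lock); // Offlining may now proceed
       local_irq_enable();
   }

Because ``ofl_lock`` is acquired with interrupts disabled and is taken outside
the leaf ``rnp->lock`` on both sides, the snapshot and the offline path's
clearing of ``qsmaskinitnext`` cannot interleave.
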
**Another race causing GP hangs in rcu_gp_init(): Reporting QS for now-offline CPUs**

After the first loop takes an atomic snapshot of online CPUs, as shown above,
the second loop in ``rcu_gp_init()`` detects CPUs that went offline between
releasing ``ofl_lock`` and acquiring the per-node ``rnp->lock``. This detection
is crucial because:

1. The CPU might have gone offline after the snapshot but before the second
   loop ran.
2. The offline CPU cannot report its own QS if it is already dead.
3. Without this detection, the grace period would wait forever for CPUs that
   are now offline.

The second loop performs this detection safely:

.. code-block:: none

   rcu_for_each_node_breadth_first(rnp) {
       raw_spin_lock_irqsave_rcu_node(rnp, flags);
       rnp->qsmask = rnp->qsmaskinit; // Apply the snapshot

       // Detect CPUs that went offline after the snapshot
       mask = rnp->qsmask & ~rnp->qsmaskinitnext;

       if (mask && rcu_is_leaf_node(rnp))
           rcu_report_qs_rnp(mask, ...) // Report QS for offline CPUs
   }

This approach ensures atomicity: quiescent state reporting for offline CPUs
happens either in ``rcu_gp_init()`` (second loop) or in
``rcutree_report_cpu_dead()``, never both and never neither. The ``rnp->lock``
held throughout the sequence prevents races: ``rcutree_report_cpu_dead()``
also acquires this lock when clearing ``qsmaskinitnext``, ensuring mutual
exclusion.

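The offline side of this either/or can be sketched as follows (a simplified
sketch of ``rcutree_report_cpu_dead()``; the actual function records further
grace-period state before clearing the bit):

.. code-block:: none

   // rcutree_report_cpu_dead() (simplified sketch):
   mask = rdp->grpmask;                         // Bit for the outgoing CPU
   arch_spin_lock(&rcu_state.ofl_lock);         // Serialize with rcu_gp_init()
   raw_spin_lock_irqsave_rcu_node(rnp, flags);
   if (rnp->qsmask & mask) {
       // The current GP is waiting on this CPU: report the QS here,
       // before clearing the bit in qsmaskinitnext. This drops
       // rnp->lock, so reacquire it afterwards.
       rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
       raw_spin_lock_irqsave_rcu_node(rnp, flags);
   }
   // Otherwise, the second loop of rcu_gp_init() reports the QS.
   WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask);
   raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
   arch_spin_unlock(&rcu_state.ofl_lock);
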
Scheduler and RCU
~~~~~~~~~~~~~~~~~
