
Commit cc1d136

Author: Neeraj Upadhyay (AMD) (committed)
Merge branches 'rcu-exp.23.07.2025', 'rcu.22.07.2025', 'torture-scripts.16.07.2025', 'srcu.19.07.2025', 'rcu.nocb.18.07.2025' and 'refscale.07.07.2025' into rcu.merge.23.07.2025
6 parents fc39760 + 5d71c2b + cbd5d35 + 954c0d7 + 463d460 + 005b618 commit cc1d136

File tree

23 files changed: +755 -266 lines


Documentation/RCU/Design/Data-Structures/Data-Structures.rst

Lines changed: 33 additions & 0 deletions
@@ -286,6 +286,39 @@ in order to detect the beginnings and ends of grace periods in a
 distributed fashion. The values flow from ``rcu_state`` to ``rcu_node``
 (down the tree from the root to the leaves) to ``rcu_data``.
 
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Given that the root rcu_node structure has a gp_seq field, |
+| why does RCU maintain a separate gp_seq in the rcu_state structure? |
+| Why not just use the root rcu_node's gp_seq as the official record |
+| and update it directly when starting a new grace period? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| On single-node RCU trees (where the root node is also a leaf), |
+| updating the root node's gp_seq immediately would create unnecessary |
+| lock contention. Here's why: |
+| |
+| If we did rcu_seq_start() directly on the root node's gp_seq: |
+| |
+| 1. All CPUs would immediately see that their node's gp_seq differs |
+|    from their rdp's gp_seq in rcu_pending(), so they would all |
+|    invoke the RCU core. |
+| 2. The RCU core calls note_gp_changes(), which tries to acquire the |
+|    node lock. |
+| 3. But rnp->qsmask isn't initialized yet (that happens later in |
+|    rcu_gp_init()). |
+| 4. So each CPU would acquire the lock, find it can't determine if it |
+|    needs to report a quiescent state (no qsmask), update rdp->gp_seq, |
+|    and release the lock. |
+| 5. Result: lots of lock acquisitions with no grace-period progress. |
+| |
+| By having a separate rcu_state.gp_seq, we can increment the official |
+| grace period counter without immediately affecting what CPUs see in |
+| their nodes. The hierarchical propagation in rcu_gp_init() then |
+| updates the root node's gp_seq and qsmask together under the same |
+| lock acquisition, avoiding this useless contention. |
++-----------------------------------------------------------------------+
+
 Miscellaneous
 '''''''''''''

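To make the quick quiz answer above concrete, here is a minimal userspace
sketch, not kernel code: the official counter (standing in for
rcu_state.gp_seq) is advanced first, and the node's gp_seq is published only
together with its qsmask under a single lock acquisition, so a CPU polling
the node never sees a started grace period whose qsmask is still empty. The
names used here (struct node, official_gp_seq, start_grace_period()) are
invented for illustration.

#include <pthread.h>
#include <stdio.h>

struct node {                           /* stands in for struct rcu_node */
        pthread_mutex_t lock;           /* stands in for rnp->lock */
        unsigned long gp_seq;           /* stands in for rnp->gp_seq */
        unsigned long qsmask;           /* stands in for rnp->qsmask */
};

static unsigned long official_gp_seq;   /* stands in for rcu_state.gp_seq */

static void start_grace_period(struct node *rnp, unsigned long online_mask)
{
        /* Advance the official counter first; no CPU acts on it yet. */
        official_gp_seq++;

        /* Publish the node's view of the new GP and its qsmask together,
         * under one lock acquisition, as rcu_gp_init() is described doing. */
        pthread_mutex_lock(&rnp->lock);
        rnp->qsmask = online_mask;
        rnp->gp_seq = official_gp_seq;
        pthread_mutex_unlock(&rnp->lock);
}

int main(void)
{
        struct node n = { .lock = PTHREAD_MUTEX_INITIALIZER };

        start_grace_period(&n, 0x3);
        printf("node gp_seq=%lu qsmask=%#lx\n", n.gp_seq, n.qsmask);
        return 0;
}
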
Documentation/RCU/Design/Requirements/Requirements.rst

Lines changed: 128 additions & 0 deletions
@@ -1970,6 +1970,134 @@ corresponding CPU's leaf node lock is held. This avoids race conditions
 between RCU's hotplug notifier hooks, the grace period initialization
 code, and the FQS loop, all of which refer to or modify this bookkeeping.
 
+Note that grace period initialization (rcu_gp_init()) must carefully sequence
+CPU hotplug scanning with grace period state changes. For example, the
+following race could occur in rcu_gp_init() if rcu_seq_start() were to happen
+after the CPU hotplug scanning.
+
+.. code-block:: none
+
+   CPU0 (rcu_gp_init)                        CPU1                            CPU2
+   ---------------------                     ----                            ----
+   // Hotplug scan first (WRONG ORDER)
+   rcu_for_each_leaf_node(rnp) {
+     rnp->qsmaskinit = rnp->qsmaskinitnext;
+   }
+                                             rcutree_report_cpu_starting()
+                                               rnp->qsmaskinitnext |= mask;
+                                             rcu_read_lock()
+                                             r0 = *X;
+                                                                             r1 = *X;
+                                                                             X = NULL;
+                                                                             cookie = get_state_synchronize_rcu();
+                                                                             // cookie = 8 (future GP)
+   rcu_seq_start(&rcu_state.gp_seq);
+   // gp_seq = 5
+
+   // CPU1 now invisible to this GP!
+   rcu_for_each_node_breadth_first() {
+     rnp->qsmask = rnp->qsmaskinit;
+     // CPU1 not included!
+   }
+
+   // GP completes without CPU1
+   rcu_seq_end(&rcu_state.gp_seq);
+   // gp_seq = 8
+                                                                             poll_state_synchronize_rcu(cookie);
+                                                                             // Returns true!
+                                                                             kfree(r1);
+                                             r2 = *r0; // USE-AFTER-FREE!
+
+By incrementing gp_seq first, CPU1's RCU read-side critical section
+is guaranteed to not be missed by CPU2.
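
As an aside on the numbers annotated in the timeline above (cookie = 8,
gp_seq = 5, then gp_seq = 8): gp_seq is a sequence counter whose low bits
record whether a grace period is in progress, with the grace-period number
kept in the remaining bits. The small userspace model below reproduces the
timeline's values; the constants and helper names are assumptions patterned
on the kernel's rcu_seq_*(), get_state_synchronize_rcu(), and
poll_state_synchronize_rcu() helpers, not copies of them.

#include <stdio.h>

#define SEQ_STATE_BITS  2UL
#define SEQ_STATE_MASK  ((1UL << SEQ_STATE_BITS) - 1)   /* low bits: GP in progress? */

static unsigned long gp_seq = 4;        /* idle, as before the timeline starts */

/* Start a GP: set a nonzero state in the low bits (4 -> 5). */
static void seq_start(void) { gp_seq++; }

/* End a GP: clear the state bits and advance the GP number (5 -> 8). */
static void seq_end(void) { gp_seq = (gp_seq | SEQ_STATE_MASK) + 1; }

/* Cookie naming a future GP, like get_state_synchronize_rcu(): with
 * gp_seq = 4 this returns 8, the value shown in the timeline. */
static unsigned long get_cookie(void)
{
        return (gp_seq + 2 * SEQ_STATE_MASK + 1) & ~SEQ_STATE_MASK;
}

/* Has the GP named by the cookie completed (ignoring counter wrap)? */
static int cookie_done(unsigned long cookie) { return gp_seq >= cookie; }

int main(void)
{
        unsigned long cookie = get_cookie();    /* cookie = 8 (future GP) */

        seq_start();                            /* gp_seq = 5 */
        printf("started: gp_seq=%lu done=%d\n", gp_seq, cookie_done(cookie));
        seq_end();                              /* gp_seq = 8 */
        printf("ended:   gp_seq=%lu done=%d\n", gp_seq, cookie_done(cookie));
        return 0;
}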
+
+**Concurrent Quiescent State Reporting for Offline CPUs**
+
+RCU must ensure that CPUs going offline report quiescent states to avoid
+blocking grace periods. This requires careful synchronization to handle
+race conditions.
+
+**Race condition causing Offline CPU to hang GP**
+
+A race between CPU offlining and new GP initialization (gp_init) may occur
+because `rcu_report_qs_rnp()` in `rcutree_report_cpu_dead()` must temporarily
+release the `rcu_node` lock to wake the RCU grace-period kthread:
+
+.. code-block:: none
+
+   CPU1 (going offline)                         CPU0 (GP kthread)
+   --------------------                         -----------------
+   rcutree_report_cpu_dead()
+     rcu_report_qs_rnp()
+       // Must release rnp->lock to wake GP kthread
+       raw_spin_unlock_irqrestore_rcu_node()
+                                                // Wakes up and starts new GP
+                                                rcu_gp_init()
+                                                  // First loop:
+                                                  copies qsmaskinitnext->qsmaskinit
+                                                  // CPU1 still in qsmaskinitnext!
+
+                                                  // Second loop:
+                                                  rnp->qsmask = rnp->qsmaskinit
+                                                  mask = rnp->qsmask & ~rnp->qsmaskinitnext
+                                                  // mask is 0! CPU1 still in both masks
+       // Reacquire lock (but too late)
+       rnp->qsmaskinitnext &= ~mask  // Finally clears bit
+
+Without `ofl_lock`, the new grace period includes the offline CPU and waits
+forever for its quiescent state, causing a GP hang.
+
+**A solution with ofl_lock**
+
+The `ofl_lock` (offline lock) prevents `rcu_gp_init()` from running during
+the vulnerable window when `rcu_report_qs_rnp()` has released `rnp->lock`:
+
+.. code-block:: none
+
+   CPU0 (rcu_gp_init)                        CPU1 (rcutree_report_cpu_dead)
+   ------------------                        ------------------------------
+   rcu_for_each_leaf_node(rnp) {
+     arch_spin_lock(&ofl_lock) ----->        arch_spin_lock(&ofl_lock) [BLOCKED]
+
+     // Safe: CPU1 can't interfere
+     rnp->qsmaskinit = rnp->qsmaskinitnext
+
+     arch_spin_unlock(&ofl_lock) --->        // Now CPU1 can proceed
+   }                                         // But snapshot already taken
+
+**Another race causing GP hangs in rcu_gp_init(): Reporting QS for Now-offline CPUs**
+
+After the first loop takes an atomic snapshot of online CPUs, as shown above,
+the second loop in `rcu_gp_init()` detects CPUs that went offline between
+releasing `ofl_lock` and acquiring the per-node `rnp->lock`. This detection is
+crucial because:
+
+1. The CPU might have gone offline after the snapshot but before the second loop.
+2. The offline CPU cannot report its own QS if it's already dead.
+3. Without this detection, the grace period would wait forever for CPUs that
+   are now offline.
+
+The second loop performs this detection safely:
+
+.. code-block:: none
+
+   rcu_for_each_node_breadth_first(rnp) {
+       raw_spin_lock_irqsave_rcu_node(rnp, flags);
+       rnp->qsmask = rnp->qsmaskinit;               // Apply the snapshot
+
+       // Detect CPUs offline after snapshot
+       mask = rnp->qsmask & ~rnp->qsmaskinitnext;
+
+       if (mask && rcu_is_leaf_node(rnp))
+           rcu_report_qs_rnp(mask, ...)             // Report QS for offline CPUs
+   }
+
+This approach ensures atomicity: quiescent state reporting for offline CPUs
+happens either in `rcu_gp_init()` (second loop) or in `rcutree_report_cpu_dead()`,
+never both and never neither. The `rnp->lock` held throughout the sequence
+prevents races: `rcutree_report_cpu_dead()` also acquires this lock when
+clearing `qsmaskinitnext`, ensuring mutual exclusion.
+
 Scheduler and RCU
 ~~~~~~~~~~~~~~~~~

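Tying the three subsections above together, here is a compact userspace model
of the snapshot-and-detect scheme they describe, simplified to a single
rcu_node, ordinary mutexes, and a single-threaded scenario; it deliberately
omits the grace-period-kthread wakeup that motivates ofl_lock in the first
place. The helper names (gp_init(), cpu_dead(), report_qs()) and the scenario
in main() are invented for illustration, not kernel code.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t ofl_lock = PTHREAD_MUTEX_INITIALIZER;   /* models rcu_state.ofl_lock */
static pthread_mutex_t node_lock = PTHREAD_MUTEX_INITIALIZER;  /* models rnp->lock */
static unsigned long qsmaskinitnext = 0x3;      /* CPUs currently online */
static unsigned long qsmaskinit;                /* snapshot used by this GP */
static unsigned long qsmask;                    /* CPUs this GP still waits on */

static void report_qs(unsigned long mask)       /* models rcu_report_qs_rnp() */
{
        qsmask &= ~mask;
}

/* Models rcutree_report_cpu_dead() for the CPU(s) in @mask. */
static void cpu_dead(unsigned long mask)
{
        pthread_mutex_lock(&ofl_lock);          /* excludes the first gp_init() loop */
        pthread_mutex_lock(&node_lock);
        qsmaskinitnext &= ~mask;
        if (qsmask & mask)                      /* GP already waiting on this CPU? */
                report_qs(mask);
        pthread_mutex_unlock(&node_lock);
        pthread_mutex_unlock(&ofl_lock);
}

/* Models the two loops of rcu_gp_init() described above. */
static void gp_init(void)
{
        pthread_mutex_lock(&ofl_lock);          /* first loop: take the snapshot */
        qsmaskinit = qsmaskinitnext;
        pthread_mutex_unlock(&ofl_lock);

        pthread_mutex_lock(&node_lock);         /* second loop: apply the snapshot */
        qsmask = qsmaskinit;
        if (qsmask & ~qsmaskinitnext)           /* went offline after the snapshot? */
                report_qs(qsmask & ~qsmaskinitnext);
        pthread_mutex_unlock(&node_lock);
}

int main(void)
{
        gp_init();
        cpu_dead(0x2);          /* CPU goes offline while the GP waits on it */
        printf("qsmask now %#lx (GP no longer waits on the dead CPU)\n", qsmask);
        return 0;
}
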
Documentation/admin-guide/kernel-parameters.txt

Lines changed: 2 additions & 1 deletion
@@ -5485,7 +5485,8 @@
 			echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
 			or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
 
-			Default is 0.
+			Default is 1 if num_possible_cpus() <= 16 and it has not been
+			explicitly disabled by passing the boot parameter with a value of 0.
 
 	rcuscale.gp_async= [KNL]
 			Measure performance of asynchronous

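For context on the new default above, here is a sketch of the kind of logic
the updated text implies: honor an explicit rcutree.rcu_normal_wake_from_gp=
setting, and otherwise enable the behavior only on machines with at most 16
possible CPUs. The "-1 means unset" convention and the helper name are
assumptions made for illustration; this is not the actual rcutree code.

#include <linux/cpumask.h>
#include <linux/init.h>

/* -1: not set on the command line (assumed convention for this sketch). */
static int rcu_normal_wake_from_gp = -1;

static void __init choose_wake_from_gp_default(void)
{
        /* Keep an explicit boot-time 0 or 1; otherwise default by CPU count. */
        if (rcu_normal_wake_from_gp < 0)
                rcu_normal_wake_from_gp = num_possible_cpus() <= 16;
}
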
include/linux/srcu.h

Lines changed: 8 additions & 46 deletions
@@ -46,11 +46,11 @@ int init_srcu_struct(struct srcu_struct *ssp);
 /* Values for SRCU Tree srcu_data ->srcu_reader_flavor, but also used by rcutorture. */
 #define SRCU_READ_FLAVOR_NORMAL	0x1	// srcu_read_lock().
 #define SRCU_READ_FLAVOR_NMI		0x2	// srcu_read_lock_nmisafe().
-#define SRCU_READ_FLAVOR_LITE		0x4	// srcu_read_lock_lite().
+					// 0x4 // SRCU-lite is no longer with us.
 #define SRCU_READ_FLAVOR_FAST		0x8	// srcu_read_lock_fast().
 #define SRCU_READ_FLAVOR_ALL		(SRCU_READ_FLAVOR_NORMAL | SRCU_READ_FLAVOR_NMI | \
-					 SRCU_READ_FLAVOR_LITE | SRCU_READ_FLAVOR_FAST) // All of the above.
-#define SRCU_READ_FLAVOR_SLOWGP		(SRCU_READ_FLAVOR_LITE | SRCU_READ_FLAVOR_FAST)
+					 SRCU_READ_FLAVOR_FAST) // All of the above.
+#define SRCU_READ_FLAVOR_SLOWGP		SRCU_READ_FLAVOR_FAST
 						// Flavors requiring synchronize_rcu()
 						// instead of smp_mb().
 void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
@@ -299,33 +299,6 @@ static inline struct srcu_ctr __percpu *srcu_down_read_fast(struct srcu_struct *
 	return __srcu_read_lock_fast(ssp);
 }
 
-/**
- * srcu_read_lock_lite - register a new reader for an SRCU-protected structure.
- * @ssp: srcu_struct in which to register the new reader.
- *
- * Enter an SRCU read-side critical section, but for a light-weight
- * smp_mb()-free reader. See srcu_read_lock() for more information.
- *
- * If srcu_read_lock_lite() is ever used on an srcu_struct structure,
- * then none of the other flavors may be used, whether before, during,
- * or after. Note that grace-period auto-expediting is disabled for _lite
- * srcu_struct structures because auto-expedited grace periods invoke
- * synchronize_rcu_expedited(), IPIs and all.
- *
- * Note that srcu_read_lock_lite() can be invoked only from those contexts
- * where RCU is watching, that is, from contexts where it would be legal
- * to invoke rcu_read_lock(). Otherwise, lockdep will complain.
- */
-static inline int srcu_read_lock_lite(struct srcu_struct *ssp) __acquires(ssp)
-{
-	int retval;
-
-	srcu_check_read_flavor_force(ssp, SRCU_READ_FLAVOR_LITE);
-	retval = __srcu_read_lock_lite(ssp);
-	rcu_try_lock_acquire(&ssp->dep_map);
-	return retval;
-}
-
 /**
  * srcu_read_lock_nmisafe - register a new reader for an SRCU-protected structure.
  * @ssp: srcu_struct in which to register the new reader.
@@ -434,22 +407,6 @@ static inline void srcu_up_read_fast(struct srcu_struct *ssp, struct srcu_ctr __
 	__srcu_read_unlock_fast(ssp, scp);
 }
 
-/**
- * srcu_read_unlock_lite - unregister a old reader from an SRCU-protected structure.
- * @ssp: srcu_struct in which to unregister the old reader.
- * @idx: return value from corresponding srcu_read_lock_lite().
- *
- * Exit a light-weight SRCU read-side critical section.
- */
-static inline void srcu_read_unlock_lite(struct srcu_struct *ssp, int idx)
-	__releases(ssp)
-{
-	WARN_ON_ONCE(idx & ~0x1);
-	srcu_check_read_flavor(ssp, SRCU_READ_FLAVOR_LITE);
-	srcu_lock_release(&ssp->dep_map);
-	__srcu_read_unlock_lite(ssp, idx);
-}
-
 /**
  * srcu_read_unlock_nmisafe - unregister a old reader from an SRCU-protected structure.
  * @ssp: srcu_struct in which to unregister the old reader.
@@ -524,4 +481,9 @@ DEFINE_LOCK_GUARD_1(srcu, struct srcu_struct,
 	srcu_read_unlock(_T->lock, _T->idx),
 	int idx)
 
+DEFINE_LOCK_GUARD_1(srcu_fast, struct srcu_struct,
+	_T->scp = srcu_read_lock_fast(_T->lock),
+	srcu_read_unlock_fast(_T->lock, _T->scp),
+	struct srcu_ctr __percpu *scp)
+
 #endif

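The DEFINE_LOCK_GUARD_1(srcu_fast, ...) added at the end of this file makes
the fast flavor usable with the scope-based guard()/scoped_guard() helpers
from <linux/cleanup.h>, mirroring the existing srcu guard. Below is a sketch
of how a caller might use it; my_srcu, struct foo, foo_ptr, and the
surrounding function are hypothetical and not part of the patch.

#include <linux/cleanup.h>
#include <linux/errno.h>
#include <linux/srcu.h>

struct foo {
        int val;
};

DEFINE_SRCU(my_srcu);                   /* hypothetical srcu_struct */
static struct foo __rcu *foo_ptr;       /* hypothetical SRCU-protected pointer */

static int read_foo_val(void)
{
        struct foo *p;

        /* Enters the SRCU-fast read-side critical section here; the matching
         * srcu_read_unlock_fast() runs automatically when the scope ends,
         * even on early return. RCU must be watching in this context. */
        guard(srcu_fast)(&my_srcu);

        p = srcu_dereference(foo_ptr, &my_srcu);
        return p ? p->val : -ENOENT;
}
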
include/linux/srcutiny.h

Lines changed: 0 additions & 3 deletions
@@ -93,9 +93,6 @@ static inline void __srcu_read_unlock_fast(struct srcu_struct *ssp, struct srcu_
 	__srcu_read_unlock(ssp, __srcu_ptr_to_ctr(ssp, scp));
 }
 
-#define __srcu_read_lock_lite __srcu_read_lock
-#define __srcu_read_unlock_lite __srcu_read_unlock
-
 static inline void synchronize_srcu_expedited(struct srcu_struct *ssp)
 {
 	synchronize_srcu(ssp);

include/linux/srcutree.h

Lines changed: 0 additions & 38 deletions
@@ -278,44 +278,6 @@ static inline void __srcu_read_unlock_fast(struct srcu_struct *ssp, struct srcu_
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "RCU must be watching srcu_read_unlock_fast().");
 }
 
-/*
- * Counts the new reader in the appropriate per-CPU element of the
- * srcu_struct. Returns an index that must be passed to the matching
- * srcu_read_unlock_lite().
- *
- * Note that this_cpu_inc() is an RCU read-side critical section either
- * because it disables interrupts, because it is a single instruction,
- * or because it is a read-modify-write atomic operation, depending on
- * the whims of the architecture.
- */
-static inline int __srcu_read_lock_lite(struct srcu_struct *ssp)
-{
-	struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
-
-	RCU_LOCKDEP_WARN(!rcu_is_watching(), "RCU must be watching srcu_read_lock_lite().");
-	this_cpu_inc(scp->srcu_locks.counter); /* Y */
-	barrier(); /* Avoid leaking the critical section. */
-	return __srcu_ptr_to_ctr(ssp, scp);
-}
-
-/*
- * Removes the count for the old reader from the appropriate
- * per-CPU element of the srcu_struct. Note that this may well be a
- * different CPU than that which was incremented by the corresponding
- * srcu_read_lock_lite(), but it must be within the same task.
- *
- * Note that this_cpu_inc() is an RCU read-side critical section either
- * because it disables interrupts, because it is a single instruction,
- * or because it is a read-modify-write atomic operation, depending on
- * the whims of the architecture.
- */
-static inline void __srcu_read_unlock_lite(struct srcu_struct *ssp, int idx)
-{
-	barrier(); /* Avoid leaking the critical section. */
-	this_cpu_inc(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter); /* Z */
-	RCU_LOCKDEP_WARN(!rcu_is_watching(), "RCU must be watching srcu_read_unlock_lite().");
-}
-
 void __srcu_check_read_flavor(struct srcu_struct *ssp, int read_flavor);
 
 // Record reader usage even for CONFIG_PROVE_RCU=n kernels. This is

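Since this merge removes __srcu_read_lock_lite() and __srcu_read_unlock_lite()
here, along with the srcu_read_lock_lite() wrappers in include/linux/srcu.h
above, any remaining lite users would move to the fast flavor, whose unlock
takes the srcu_ctr pointer returned by the lock rather than an integer index.
A before/after sketch under that assumption; my_srcu and do_something() are
placeholders.

#include <linux/srcu.h>

DEFINE_SRCU(my_srcu);                   /* placeholder srcu_struct */

static void do_something(void) { }      /* placeholder read-side work */

/*
 * Before this merge, an SRCU-lite reader looked like:
 *
 *	int idx = srcu_read_lock_lite(&my_srcu);
 *	do_something();
 *	srcu_read_unlock_lite(&my_srcu, idx);
 */

/* After: the fast flavor hands back a per-CPU counter pointer instead. */
static void reader_new(void)
{
        struct srcu_ctr __percpu *scp;

        scp = srcu_read_lock_fast(&my_srcu);    /* RCU must be watching here */
        do_something();
        srcu_read_unlock_fast(&my_srcu, scp);
}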