Commit 0cd4b50

srcu: Yet more detail for srcu_readers_active_idx_check() comments
The comment in srcu_readers_active_idx_check() following the smp_mb() is out of date, hailing from a simpler time when preemption was disabled across the bulk of __srcu_read_lock(). The fact that preemption was disabled meant that the number of tasks that had fetched the old index but not yet incremented counters was limited by the number of CPUs.

In our more complex modern times, the number of CPUs is no longer a limit. This commit therefore updates this comment, additionally giving more memory-ordering detail.

[ paulmck: Apply Nt->Nc feedback from Joel Fernandes. ]

Reported-by: Boqun Feng <[email protected]>
Reported-by: Frederic Weisbecker <[email protected]>
Reported-by: "Joel Fernandes (Google)" <[email protected]>
Reported-by: Neeraj Upadhyay <[email protected]>
Reported-by: Uladzislau Rezki <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
1 parent 1bafbfb commit 0cd4b50
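
For reference, below is a simplified sketch of the SRCU reader fast paths that the updated comment reasons about; it is written from memory rather than copied verbatim from kernel/rcu/srcutree.c at this commit, so treat it as illustrative only. The points that matter for the argument are that the fetch from ->srcu_idx precedes smp_mb() B in __srcu_read_lock(), that the increment of ->srcu_unlock_count[idx] follows smp_mb() C in __srcu_read_unlock(), and that a reader can be preempted for an arbitrarily long time between fetching ->srcu_idx and incrementing its CPU's ->srcu_lock_count[idx] counter.

int __srcu_read_lock(struct srcu_struct *ssp)
{
	int idx;

	/* Fetch the current index; the task may be preempted here indefinitely. */
	idx = READ_ONCE(ssp->srcu_idx) & 0x1;
	/* Count this reader on the current CPU under the (possibly stale) index. */
	this_cpu_inc(ssp->sda->srcu_lock_count[idx]);
	smp_mb(); /* B */  /* Order the increment before the critical section. */
	return idx;
}

void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
{
	smp_mb(); /* C */  /* Order the critical section before the increment. */
	this_cpu_inc(ssp->sda->srcu_unlock_count[idx]);
}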

kernel/rcu/srcutree.c

Lines changed: 51 additions & 16 deletions
@@ -469,24 +469,59 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *ssp, int idx)
 
 	/*
 	 * If the locks are the same as the unlocks, then there must have
-	 * been no readers on this index at some time in between. This does
-	 * not mean that there are no more readers, as one could have read
-	 * the current index but not have incremented the lock counter yet.
+	 * been no readers on this index at some point in this function.
+	 * But there might be more readers, as a task might have read
+	 * the current ->srcu_idx but not yet have incremented its CPU's
+	 * ->srcu_lock_count[idx] counter. In fact, it is possible
+	 * that most of the tasks have been preempted between fetching
+	 * ->srcu_idx and incrementing ->srcu_lock_count[idx]. And there
+	 * could be almost (ULONG_MAX / sizeof(struct task_struct)) tasks
+	 * in a system whose address space was fully populated with memory.
+	 * Call this quantity Nt.
 	 *
-	 * So suppose that the updater is preempted here for so long
-	 * that more than ULONG_MAX non-nested readers come and go in
-	 * the meantime. It turns out that this cannot result in overflow
-	 * because if a reader modifies its unlock count after we read it
-	 * above, then that reader's next load of ->srcu_idx is guaranteed
-	 * to get the new value, which will cause it to operate on the
-	 * other bank of counters, where it cannot contribute to the
-	 * overflow of these counters. This means that there is a maximum
-	 * of 2*NR_CPUS increments, which cannot overflow given current
-	 * systems, especially not on 64-bit systems.
+	 * So suppose that the updater is preempted at this point in the
+	 * code for a long time. That now-preempted updater has already
+	 * flipped ->srcu_idx (possibly during the preceding grace period),
+	 * done an smp_mb() (again, possibly during the preceding grace
+	 * period), and summed up the ->srcu_unlock_count[idx] counters.
+	 * How many times can a given one of the aforementioned Nt tasks
+	 * increment the old ->srcu_idx value's ->srcu_lock_count[idx]
+	 * counter, in the absence of nesting?
 	 *
-	 * OK, how about nesting? This does impose a limit on nesting
-	 * of floor(ULONG_MAX/NR_CPUS/2), which should be sufficient,
-	 * especially on 64-bit systems.
+	 * It can clearly do so once, given that it has already fetched
+	 * the old value of ->srcu_idx and is just about to use that value
+	 * to index its increment of ->srcu_lock_count[idx]. But as soon as
+	 * it leaves that SRCU read-side critical section, it will increment
+	 * ->srcu_unlock_count[idx], which must follow the updater's above
+	 * read from that same value. Thus, as soon as the reading task does
+	 * an smp_mb() and a later fetch from ->srcu_idx, that task will be
+	 * guaranteed to get the new index. Except that the increment of
+	 * ->srcu_unlock_count[idx] in __srcu_read_unlock() is after the
+	 * smp_mb(), and the fetch from ->srcu_idx in __srcu_read_lock()
+	 * is before the smp_mb(). Thus, that task might not see the new
+	 * value of ->srcu_idx until the -second- __srcu_read_lock(),
+	 * which in turn means that this task might well increment
+	 * ->srcu_lock_count[idx] for the old value of ->srcu_idx twice,
+	 * not just once.
+	 *
+	 * However, it is important to note that a given smp_mb() takes
+	 * effect not just for the task executing it, but also for any
+	 * later task running on that same CPU.
+	 *
+	 * That is, there can be almost Nt + Nc further increments of
+	 * ->srcu_lock_count[idx] for the old index, where Nc is the number
+	 * of CPUs. But this is OK because the size of the task_struct
+	 * structure limits the value of Nt and current systems limit Nc
+	 * to a few thousand.
+	 *
+	 * OK, but what about nesting? This does impose a limit on
+	 * nesting of half of the size of the task_struct structure
+	 * (measured in bytes), which should be sufficient. A late 2022
+	 * TREE01 rcutorture run reported this size to be no less than
+	 * 9408 bytes, allowing up to 4704 levels of nesting, which is
+	 * comfortably beyond excessive. Especially on 64-bit systems,
+	 * which are unlikely to be configured with an address space fully
+	 * populated with memory, at least not anytime soon.
 	 */
 	return srcu_readers_lock_idx(ssp, idx) == unlocks;
 }
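
For added context, here is an abbreviated sketch of the updater-side check that the rewritten comment documents, with the long comments elided; it is not the verbatim source, but it shows the ordering the argument relies on: the ->srcu_unlock_count[idx] counters are summed first, smp_mb() A executes, and only then are the ->srcu_lock_count[idx] counters summed and compared.

static bool srcu_readers_active_idx_check(struct srcu_struct *ssp, int idx)
{
	unsigned long unlocks;

	/* Sum the ->srcu_unlock_count[idx] counters across all CPUs. */
	unlocks = srcu_readers_unlock_idx(ssp, idx);

	/* Pairs with the smp_mb()s in the reader fast paths. */
	smp_mb(); /* A */

	/* Sum the ->srcu_lock_count[idx] counters and compare. */
	return srcu_readers_lock_idx(ssp, idx) == unlocks;
}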
