Commit af1a731

doc: Update listRCU.rst
This commit updates listRCU.rst to reflect RCU additions and changes over the past few years. Signed-off-by: Paul E. McKenney <[email protected]>
1 parent 2254ac1 commit af1a731

1 file changed

+103
-71
lines changed

Documentation/RCU/listRCU.rst

Lines changed: 103 additions & 71 deletions
@@ -3,11 +3,10 @@
33
Using RCU to Protect Read-Mostly Linked Lists
44
=============================================
55

6-
One of the best applications of RCU is to protect read-mostly linked lists
7-
(``struct list_head`` in list.h). One big advantage of this approach
8-
is that all of the required memory barriers are included for you in
9-
the list macros. This document describes several applications of RCU,
10-
with the best fits first.
6+
One of the most common uses of RCU is protecting read-mostly linked lists
7+
(``struct list_head`` in list.h). One big advantage of this approach is
8+
that all of the required memory ordering is provided by the list macros.
9+
This document describes several list-based RCU use cases.
1110

1211

1312
Example 1: Read-mostly list: Deferred Destruction
@@ -35,7 +34,8 @@ The code traversing the list of all processes typically looks like::
3534
}
3635
rcu_read_unlock();
3736

38-
The simplified code for removing a process from a task list is::
37+
The simplified and heavily inlined code for removing a process from a
38+
task list is::
3939

4040
void release_task(struct task_struct *p)
4141
{
@@ -45,39 +45,48 @@ The simplified code for removing a process from a task list is::
4545
call_rcu(&p->rcu, delayed_put_task_struct);
4646
}
4747

48-
When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)`` under
49-
``tasklist_lock`` writer lock protection, to remove the task from the list of
50-
all tasks. The ``tasklist_lock`` prevents concurrent list additions/removals
51-
from corrupting the list. Readers using ``for_each_process()`` are not protected
52-
with the ``tasklist_lock``. To prevent readers from noticing changes in the list
53-
pointers, the ``task_struct`` object is freed only after one or more grace
54-
periods elapse (with the help of call_rcu()). This deferring of destruction
55-
ensures that any readers traversing the list will see valid ``p->tasks.next``
56-
pointers and deletion/freeing can happen in parallel with traversal of the list.
57-
This pattern is also called an **existence lock**, since RCU pins the object in
58-
memory until all existing readers finish.
48+
When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)``
49+
via __exit_signal() and __unhash_process() under ``tasklist_lock``
50+
writer lock protection. The list_del_rcu() invocation removes
51+
the task from the list of all tasks. The ``tasklist_lock``
52+
prevents concurrent list additions/removals from corrupting the
53+
list. Readers using ``for_each_process()`` are not protected with the
54+
``tasklist_lock``. To prevent readers from noticing changes in the list
55+
pointers, the ``task_struct`` object is freed only after one or more
56+
grace periods elapse, with the help of call_rcu(), which is invoked via
57+
put_task_struct_rcu_user(). This deferring of destruction ensures that
58+
any readers traversing the list will see valid ``p->tasks.next`` pointers
59+
and deletion/freeing can happen in parallel with traversal of the list.
60+
This pattern is also called an **existence lock**, since RCU refrains
61+
from invoking the delayed_put_task_struct() callback function until
62+
all existing readers finish, which guarantees that the ``task_struct``
63+
object in question will remain in existence until after the completion
64+
of all RCU readers that might possibly have a reference to that object.
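
The deferred-destruction idea described above can be sketched outside the kernel. The following is a single-threaded toy model, not real RCU: every ``toy_*`` name is hypothetical, a plain counter stands in for RCU read-side critical sections, and the model waits for all readers rather than only pre-existing ones. It shows only the ordering that matters: the object is unlinked first, its ``next`` pointer is left intact, and the free is postponed until no reader can still hold a reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy types: none of these names exist in the kernel. */
struct toy_task {
	int pid;
	struct toy_task *next;     /* stays valid after unlinking */
	struct toy_task *pending;  /* link in the deferred-free queue */
};

static struct toy_task *toy_head;     /* list of all tasks */
static struct toy_task *toy_pending;  /* unlinked, not yet freed */
static int toy_readers;               /* readers currently traversing */
static int toy_freed;                 /* number of objects actually freed */

static void toy_read_lock(void)
{
	toy_readers++;
}

static void toy_read_unlock(void)
{
	assert(toy_readers > 0);
	toy_readers--;
}

/* Add a task at the head of the list. */
static struct toy_task *toy_add(int pid)
{
	struct toy_task *t = calloc(1, sizeof(*t));

	assert(t != NULL);
	t->pid = pid;
	t->next = toy_head;
	toy_head = t;
	return t;
}

/* Unlink the task, leave t->next intact, and queue it for freeing. */
static void toy_release(struct toy_task *t)
{
	struct toy_task **pp;

	for (pp = &toy_head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == t) {
			*pp = t->next;            /* new readers no longer see t */
			t->pending = toy_pending; /* loosely analogous to call_rcu() */
			toy_pending = t;
			return;
		}
	}
}

/* Free queued objects, but only once no reader can still hold them. */
static void toy_grace_period(void)
{
	if (toy_readers != 0)
		return;
	while (toy_pending != NULL) {
		struct toy_task *t = toy_pending;

		toy_pending = t->pending;
		free(t);
		toy_freed++;
	}
}
```

In this model, a reader that was positioned on the removed object can still follow its ``next`` pointer, and the object is freed only after that reader calls ``toy_read_unlock()``.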
5965

6066

6167
Example 2: Read-Side Action Taken Outside of Lock: No In-Place Updates
6268
----------------------------------------------------------------------
6369

64-
The best applications are cases where, if reader-writer locking were
65-
used, the read-side lock would be dropped before taking any action
66-
based on the results of the search. The most celebrated example is
67-
the routing table. Because the routing table is tracking the state of
68-
equipment outside of the computer, it will at times contain stale data.
69-
Therefore, once the route has been computed, there is no need to hold
70-
the routing table static during transmission of the packet. After all,
71-
you can hold the routing table static all you want, but that won't keep
72-
the external Internet from changing, and it is the state of the external
73-
Internet that really matters. In addition, routing entries are typically
74-
added or deleted, rather than being modified in place.
75-
76-
A straightforward example of this use of RCU may be found in the
77-
system-call auditing support. For example, a reader-writer locked
70+
Some reader-writer locking use cases compute a value while holding
71+
the read-side lock, but continue to use that value after that lock is
72+
released. These use cases are often good candidates for conversion
73+
to RCU. One prominent example involves network packet routing.
74+
Because the packet-routing data tracks the state of equipment outside
75+
of the computer, it will at times contain stale data. Therefore, once
76+
the route has been computed, there is no need to hold the routing table
77+
static during transmission of the packet. After all, you can hold the
78+
routing table static all you want, but that won't keep the external
79+
Internet from changing, and it is the state of the external Internet
80+
that really matters. In addition, routing entries are typically added
81+
or deleted, rather than being modified in place. This is a rare example
82+
of the finite speed of light and the non-zero size of atoms actually
83+
helping to make synchronization lighter weight.
84+
85+
A straightforward example of this type of RCU use case may be found in
86+
the system-call auditing support. For example, a reader-writer locked
7887
implementation of ``audit_filter_task()`` might be as follows::
7988

80-
static enum audit_state audit_filter_task(struct task_struct *tsk)
89+
static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
8190
{
8291
struct audit_entry *e;
8392
enum audit_state state;
@@ -86,6 +95,8 @@ implementation of ``audit_filter_task()`` might be as follows::
8695
/* Note: audit_filter_mutex held by caller. */
8796
list_for_each_entry(e, &audit_tsklist, list) {
8897
if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
98+
if (state == AUDIT_STATE_RECORD)
99+
*key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
89100
read_unlock(&auditsc_lock);
90101
return state;
91102
}
@@ -101,7 +112,7 @@ you are turning auditing off, it is OK to audit a few extra system calls.
101112

102113
This means that RCU can be easily applied to the read side, as follows::
103114

104-
static enum audit_state audit_filter_task(struct task_struct *tsk)
115+
static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
105116
{
106117
struct audit_entry *e;
107118
enum audit_state state;
@@ -110,6 +121,8 @@ This means that RCU can be easily applied to the read side, as follows::
110121
/* Note: audit_filter_mutex held by caller. */
111122
list_for_each_entry_rcu(e, &audit_tsklist, list) {
112123
if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
124+
if (state == AUDIT_STATE_RECORD)
125+
*key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
113126
rcu_read_unlock();
114127
return state;
115128
}
@@ -118,13 +131,15 @@ This means that RCU can be easily applied to the read side, as follows::
118131
return AUDIT_BUILD_CONTEXT;
119132
}
120133

121-
The ``read_lock()`` and ``read_unlock()`` calls have become rcu_read_lock()
122-
and rcu_read_unlock(), respectively, and the list_for_each_entry() has
123-
become list_for_each_entry_rcu(). The **_rcu()** list-traversal primitives
124-
insert the read-side memory barriers that are required on DEC Alpha CPUs.
134+
The read_lock() and read_unlock() calls have become rcu_read_lock()
135+
and rcu_read_unlock(), respectively, and the list_for_each_entry()
136+
has become list_for_each_entry_rcu(). The **_rcu()** list-traversal
137+
primitives add READ_ONCE() and diagnostic checks for incorrect use
138+
outside of an RCU read-side critical section.
125139

126140
The changes to the update side are also straightforward. A reader-writer lock
127-
might be used as follows for deletion and insertion::
141+
might be used as follows for deletion and insertion in these simplified
142+
versions of audit_del_rule() and audit_add_rule()::
128143

129144
static inline int audit_del_rule(struct audit_rule *rule,
130145
struct list_head *list)
@@ -188,16 +203,16 @@ Following are the RCU equivalents for these two functions::
188203
return 0;
189204
}
190205

191-
Normally, the ``write_lock()`` and ``write_unlock()`` would be replaced by a
206+
Normally, the write_lock() and write_unlock() would be replaced by a
192207
spin_lock() and a spin_unlock(). But in this case, all callers hold
193208
``audit_filter_mutex``, so no additional locking is required. The
194-
``auditsc_lock`` can therefore be eliminated, since use of RCU eliminates the
209+
auditsc_lock can therefore be eliminated, since use of RCU eliminates the
195210
need for writers to exclude readers.
196211

197212
The list_del(), list_add(), and list_add_tail() primitives have been
198213
replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
199-
The **_rcu()** list-manipulation primitives add memory barriers that are needed on
200-
weakly ordered CPUs (most of them!). The list_del_rcu() primitive omits the
214+
The **_rcu()** list-manipulation primitives add memory barriers that are
215+
needed on weakly ordered CPUs. The list_del_rcu() primitive omits the
201216
pointer poisoning debug-assist code that would otherwise cause concurrent
202217
readers to fail spectacularly.
203218

@@ -238,7 +253,9 @@ need to be filled in)::
238253
The RCU version creates a copy, updates the copy, then replaces the old
239254
entry with the newly updated entry. This sequence of actions, allowing
240255
concurrent reads while making a copy to perform an update, is what gives
241-
RCU (*read-copy update*) its name. The RCU code is as follows::
256+
RCU (*read-copy update*) its name.
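
The copy-then-replace sequence can be illustrated in portable C11, assuming a single updater (mirroring the assumption that the caller holds ``audit_filter_mutex``). This is only a sketch: ``toy_rule``, ``toy_current`` and the ``toy_*`` functions are hypothetical names, and C11 release/acquire atomics stand in for the kernel's RCU publish and dereference primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical rule type, for illustration only. */
struct toy_rule {
	int id;
	int action;
};

/* Pointer that readers dereference; updates are published with release. */
static _Atomic(struct toy_rule *) toy_current;

/* Reader: take one snapshot of the pointer and use only that snapshot. */
static int toy_read_action(void)
{
	struct toy_rule *r =
		atomic_load_explicit(&toy_current, memory_order_acquire);

	return r ? r->action : -1;
}

/*
 * Updater: copy the current version, modify the copy, then publish it.
 * Returns the old version. The caller must not free it until every
 * reader that might still hold it has finished.
 */
static struct toy_rule *toy_update_action(int new_action)
{
	struct toy_rule *old =
		atomic_load_explicit(&toy_current, memory_order_relaxed);
	struct toy_rule *copy = malloc(sizeof(*copy));

	assert(copy != NULL);
	if (old)
		*copy = *old;    /* the "copy" step */
	else
		copy->id = 0;    /* first version: nothing to copy */
	copy->action = new_action;  /* the "update" step, on the copy only */

	/* Publish: readers see either the old or the new version, never a mix. */
	atomic_store_explicit(&toy_current, copy, memory_order_release);
	return old;
}
```

The old version is never modified in place, so a concurrent reader always sees a fully consistent rule. When the old version may be freed is a separate question, which in the kernel is answered by waiting for a grace period.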
257+
258+
The RCU version of audit_upd_rule() is as follows::
242259

243260
static inline int audit_upd_rule(struct audit_rule *rule,
244261
struct list_head *list,
@@ -267,6 +284,9 @@ RCU (*read-copy update*) its name. The RCU code is as follows::
267284
Again, this assumes that the caller holds ``audit_filter_mutex``. Normally, the
268285
writer lock would become a spinlock in this sort of code.
269286

287+
The update_lsm_rule() function does something very similar, for those who would
288+
prefer to look at real Linux-kernel code.
289+
270290
Another use of this pattern can be found in the openvswitch driver's *connection
271291
tracking table* code in ``ct_limit_set()``. The table holds connection tracking
272292
entries and has a limit on the maximum entries. There is one such table
@@ -281,9 +301,10 @@ Example 4: Eliminating Stale Data
281301
---------------------------------
282302

283303
The auditing example above tolerates stale data, as do most algorithms
284-
that are tracking external state. Because there is a delay from the
285-
time the external state changes before Linux becomes aware of the change,
286-
additional RCU-induced staleness is generally not a problem.
304+
that are tracking external state. After all, there is a delay between
305+
the time the external state changes and the time Linux becomes aware
306+
of the change, so, as noted earlier, a small quantity of additional
307+
RCU-induced staleness is generally not a problem.
287308

288309
However, there are many examples where stale data cannot be tolerated.
289310
One example in the Linux kernel is the System V IPC (see the shm_lock()
@@ -302,7 +323,7 @@ Quick Quiz:
302323

303324
If the system-call audit module were to ever need to reject stale data, one way
304325
to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to the
305-
audit_entry structure, and modify ``audit_filter_task()`` as follows::
326+
``audit_entry`` structure, and modify audit_filter_task() as follows::
306327

307328
static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
308329
{
@@ -319,19 +340,15 @@ audit_entry structure, and modify ``audit_filter_task()`` as follows::
319340
return AUDIT_BUILD_CONTEXT;
320341
}
321342
rcu_read_unlock();
343+
if (state == AUDIT_STATE_RECORD)
344+
*key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
322345
return state;
323346
}
324347
}
325348
rcu_read_unlock();
326349
return AUDIT_BUILD_CONTEXT;
327350
}
328351

329-
Note that this example assumes that entries are only added and deleted.
330-
Additional mechanism is required to deal correctly with the update-in-place
331-
performed by ``audit_upd_rule()``. For one thing, ``audit_upd_rule()`` would
332-
need additional memory barriers to ensure that the list_add_rcu() was really
333-
executed before the list_del_rcu().
334-
335352
The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the
336353
spinlock as follows::
337354

@@ -357,38 +374,49 @@ spinlock as follows::
357374

358375
This too assumes that the caller holds ``audit_filter_mutex``.
359376

377+
Note that this example assumes that entries are only added and deleted.
378+
Additional mechanism is required to deal correctly with the update-in-place
379+
performed by audit_upd_rule(). For one thing, audit_upd_rule() would
380+
need to hold the locks of both the old ``audit_entry`` and its replacement
381+
while executing the list_replace_rcu().
382+
360383

361384
Example 5: Skipping Stale Objects
362385
---------------------------------
363386

364-
For some usecases, reader performance can be improved by skipping stale objects
365-
during read-side list traversal if the object in concern is pending destruction
366-
after one or more grace periods. One such example can be found in the timerfd
367-
subsystem. When a ``CLOCK_REALTIME`` clock is reprogrammed - for example due to
368-
setting of the system time, then all programmed timerfds that depend on this
369-
clock get triggered and processes waiting on them to expire are woken up in
370-
advance of their scheduled expiry. To facilitate this, all such timers are added
371-
to an RCU-managed ``cancel_list`` when they are setup in
387+
For some use cases, reader performance can be improved by skipping
388+
stale objects during read-side list traversal, where stale objects
389+
are those that will be removed and destroyed after one or more grace
390+
periods. One such example can be found in the timerfd subsystem. When a
391+
``CLOCK_REALTIME`` clock is reprogrammed (for example due to setting
392+
of the system time) then all programmed ``timerfds`` that depend on
393+
this clock get triggered and processes waiting on them are awakened in
394+
advance of their scheduled expiry. To facilitate this, all such timers
395+
are added to an RCU-managed ``cancel_list`` when they are set up in
372396
``timerfd_setup_cancel()``::
373397

374398
static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags)
375399
{
376400
spin_lock(&ctx->cancel_lock);
377-
if ((ctx->clockid == CLOCK_REALTIME &&
401+
if ((ctx->clockid == CLOCK_REALTIME ||
402+
ctx->clockid == CLOCK_REALTIME_ALARM) &&
378403
(flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) {
379404
if (!ctx->might_cancel) {
380405
ctx->might_cancel = true;
381406
spin_lock(&cancel_lock);
382407
list_add_rcu(&ctx->clist, &cancel_list);
383408
spin_unlock(&cancel_lock);
384409
}
410+
} else {
411+
__timerfd_remove_cancel(ctx);
385412
}
386413
spin_unlock(&ctx->cancel_lock);
387414
}
388415

389-
When a timerfd is freed (fd is closed), then the ``might_cancel`` flag of the
390-
timerfd object is cleared, the object removed from the ``cancel_list`` and
391-
destroyed::
416+
When a timerfd is freed (fd is closed), then the ``might_cancel``
417+
flag of the timerfd object is cleared, and the object is removed from the
418+
``cancel_list`` and destroyed, as shown in this simplified and inlined
419+
version of timerfd_release()::
392420

393421
int timerfd_release(struct inode *inode, struct file *file)
394422
{
@@ -403,7 +431,10 @@ destroyed::
403431
}
404432
spin_unlock(&ctx->cancel_lock);
405433

406-
hrtimer_cancel(&ctx->t.tmr);
434+
if (isalarm(ctx))
435+
alarm_cancel(&ctx->t.alarm);
436+
else
437+
hrtimer_cancel(&ctx->t.tmr);
407438
kfree_rcu(ctx, rcu);
408439
return 0;
409440
}
@@ -416,6 +447,7 @@ objects::
416447

417448
void timerfd_clock_was_set(void)
418449
{
450+
ktime_t moffs = ktime_mono_to_real(0);
419451
struct timerfd_ctx *ctx;
420452
unsigned long flags;
421453

@@ -424,7 +456,7 @@ objects::
424456
if (!ctx->might_cancel)
425457
continue;
426458
spin_lock_irqsave(&ctx->wqh.lock, flags);
427-
if (ctx->moffs != ktime_mono_to_real(0)) {
459+
if (ctx->moffs != moffs) {
428460
ctx->moffs = KTIME_MAX;
429461
ctx->ticks++;
430462
wake_up_locked_poll(&ctx->wqh, EPOLLIN);
@@ -434,10 +466,10 @@ objects::
434466
rcu_read_unlock();
435467
}
436468

437-
The key point here is, because RCU-traversal of the ``cancel_list`` happens
438-
while objects are being added and removed to the list, sometimes the traversal
439-
can step on an object that has been removed from the list. In this example, it
440-
is seen that it is better to skip such objects using a flag.
469+
The key point is that because RCU-protected traversal of the
470+
``cancel_list`` happens concurrently with object addition and removal,
471+
sometimes the traversal can access an object that has been removed from
472+
the list. In this example, a flag is used to skip such objects.
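
The flag-based skip can be reduced to a few lines of plain C. The sketch below is single-threaded and uses hypothetical names (``toy_ctx``, ``toy_clock_was_set()``). It omits the RCU read-side critical section and the per-object lock that the real code uses, and shows only the traversal logic: an object whose flag has been cleared is passed over rather than acted upon.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a timerfd context. */
struct toy_ctx {
	bool might_cancel;     /* cleared when the object is being removed */
	int ticks;             /* bumped when the object is notified */
	struct toy_ctx *next;
};

/* Notify every live object and return how many were notified. */
static int toy_clock_was_set(struct toy_ctx *head)
{
	int notified = 0;

	for (struct toy_ctx *c = head; c != NULL; c = c->next) {
		if (!c->might_cancel)
			continue;  /* stale object: skip it */
		c->ticks++;
		notified++;
	}
	return notified;
}
```

Skipping is safe here because a stale object is merely ignored. It remains allocated for the duration of the traversal, so following its ``next`` pointer is still valid.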
441473

442474

443475
Summary
