33Using RCU to Protect Read-Mostly Linked Lists
44=============================================
55
6- One of the best applications of RCU is to protect read-mostly linked lists
7- (``struct list_head`` in list.h). One big advantage of this approach
8- is that all of the required memory barriers are included for you in
9- the list macros. This document describes several applications of RCU,
10- with the best fits first.
6+ One of the most common uses of RCU is protecting read-mostly linked lists
7+ (``struct list_head`` in list.h). One big advantage of this approach is
8+ that all of the required memory ordering is provided by the list macros.
9+ This document describes several list-based RCU use cases.
1110
1211
1312Example 1: Read-mostly list: Deferred Destruction
@@ -35,7 +34,8 @@ The code traversing the list of all processes typically looks like::
3534 }
3635 rcu_read_unlock();
3736
38- The simplified code for removing a process from a task list is::
37+ The simplified and heavily inlined code for removing a process from a
38+ task list is::
3939
4040 void release_task(struct task_struct *p)
4141 {
@@ -45,39 +45,48 @@ The simplified code for removing a process from a task list is::
4545 call_rcu(&p->rcu, delayed_put_task_struct);
4646 }
4747
48- When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)`` under
49- ``tasklist_lock`` writer lock protection, to remove the task from the list of
50- all tasks. The ``tasklist_lock`` prevents concurrent list additions/removals
51- from corrupting the list. Readers using ``for_each_process()`` are not protected
52- with the ``tasklist_lock``. To prevent readers from noticing changes in the list
53- pointers, the ``task_struct`` object is freed only after one or more grace
54- periods elapse (with the help of call_rcu()). This deferring of destruction
55- ensures that any readers traversing the list will see valid ``p->tasks.next``
56- pointers and deletion/freeing can happen in parallel with traversal of the list.
57- This pattern is also called an **existence lock**, since RCU pins the object in
58- memory until all existing readers finish.
48+ When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)``
49+ via __exit_signal() and __unhash_process() under ``tasklist_lock``
50+ writer lock protection. The list_del_rcu() invocation removes
51+ the task from the list of all tasks. The ``tasklist_lock``
52+ prevents concurrent list additions/removals from corrupting the
53+ list. Readers using ``for_each_process()`` are not protected with the
54+ ``tasklist_lock``. To prevent readers from noticing changes in the list
55+ pointers, the ``task_struct`` object is freed only after one or more
56+ grace periods elapse, with the help of call_rcu(), which is invoked via
57+ put_task_struct_rcu_user(). This deferring of destruction ensures that
58+ any readers traversing the list will see valid ``p->tasks.next`` pointers
59+ and deletion/freeing can happen in parallel with traversal of the list.
60+ This pattern is also called an **existence lock**, since RCU refrains
61+ from invoking the delayed_put_task_struct() callback function until
62+ all existing readers finish, which guarantees that the ``task_struct``
63+ object in question will remain in existence until after the completion
64+ of all RCU readers that might possibly have a reference to that object.
5965
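The deferred-destruction idea described above can be illustrated with a small userspace sketch. This is a single-threaded toy, not kernel code: ``struct deferred``, ``defer_call()``, ``run_deferred()``, ``struct task`` and ``free_task_cb()`` are hypothetical names invented for illustration, and the "grace period" is simply the moment the caller chooses to invoke ``run_deferred()``. The point it shows is that removal and freeing are separate steps, with the free queued as a callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical miniature of struct rcu_head: one queued callback. */
struct deferred {
	struct deferred *next;
	void (*func)(struct deferred *d);
};

/* Callbacks waiting for the (simulated) grace period to end. */
static struct deferred *pending;

/* Toy analogue of call_rcu(): queue the callback instead of running it. */
static void defer_call(struct deferred *d, void (*func)(struct deferred *d))
{
	assert(d && func);
	d->func = func;
	d->next = pending;
	pending = d;
}

/*
 * Toy "end of grace period": the caller asserts that no reader can
 * still hold a reference, so every queued callback may now run.
 * Returns the number of callbacks invoked.
 */
static int run_deferred(void)
{
	int n = 0;

	while (pending) {
		struct deferred *d = pending;

		pending = d->next;
		d->func(d);
		n++;
	}
	return n;
}

/* Hypothetical object with an embedded callback head. */
struct task {
	int pid;
	struct deferred rcu;
};

static int tasks_freed;

/* Callback: recover the enclosing object and free it. */
static void free_task_cb(struct deferred *d)
{
	struct task *t = (struct task *)((char *)d - offsetof(struct task, rcu));

	free(t);
	tasks_freed++;
}
```

Between ``defer_call()`` and ``run_deferred()`` the object is still valid memory, which is the property that the existence guarantee provides to readers in the real implementation.
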
6066
6167Example 2: Read-Side Action Taken Outside of Lock: No In-Place Updates
6268----------------------------------------------------------------------
6369
64- The best applications are cases where, if reader-writer locking were
65- used, the read-side lock would be dropped before taking any action
66- based on the results of the search. The most celebrated example is
67- the routing table. Because the routing table is tracking the state of
68- equipment outside of the computer, it will at times contain stale data.
69- Therefore, once the route has been computed, there is no need to hold
70- the routing table static during transmission of the packet. After all,
71- you can hold the routing table static all you want, but that won't keep
72- the external Internet from changing, and it is the state of the external
73- Internet that really matters. In addition, routing entries are typically
74- added or deleted, rather than being modified in place.
75-
76- A straightforward example of this use of RCU may be found in the
77- system-call auditing support. For example, a reader-writer locked
70+ Some reader-writer locking use cases compute a value while holding
71+ the read-side lock, but continue to use that value after that lock is
72+ released. These use cases are often good candidates for conversion
73+ to RCU. One prominent example involves network packet routing.
74+ Because the packet-routing data tracks the state of equipment outside
75+ of the computer, it will at times contain stale data. Therefore, once
76+ the route has been computed, there is no need to hold the routing table
77+ static during transmission of the packet. After all, you can hold the
78+ routing table static all you want, but that won't keep the external
79+ Internet from changing, and it is the state of the external Internet
80+ that really matters. In addition, routing entries are typically added
81+ or deleted, rather than being modified in place. This is a rare example
82+ of the finite speed of light and the non-zero size of atoms actually
83+ helping make synchronization be lighter weight.
84+
85+ A straightforward example of this type of RCU use case may be found in
86+ the system-call auditing support. For example, a reader-writer locked
7887implementation of ``audit_filter_task()`` might be as follows::
7988
80- static enum audit_state audit_filter_task(struct task_struct *tsk)
89+ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
8190 {
8291 struct audit_entry *e;
8392 enum audit_state state;
@@ -86,6 +95,8 @@ implementation of ``audit_filter_task()`` might be as follows::
8695 /* Note: audit_filter_mutex held by caller. */
8796 list_for_each_entry(e, &audit_tsklist, list) {
8897 if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
98+ if (state == AUDIT_STATE_RECORD)
99+ *key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
89100 read_unlock(&auditsc_lock);
90101 return state;
91102 }
@@ -101,7 +112,7 @@ you are turning auditing off, it is OK to audit a few extra system calls.
101112
102113This means that RCU can be easily applied to the read side, as follows::
103114
104- static enum audit_state audit_filter_task(struct task_struct *tsk)
115+ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
105116 {
106117 struct audit_entry *e;
107118 enum audit_state state;
@@ -110,6 +121,8 @@ This means that RCU can be easily applied to the read side, as follows::
110121 /* Note: audit_filter_mutex held by caller. */
111122 list_for_each_entry_rcu(e, &audit_tsklist, list) {
112123 if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
124+ if (state == AUDIT_STATE_RECORD)
125+ *key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
113126 rcu_read_unlock();
114127 return state;
115128 }
@@ -118,13 +131,15 @@ This means that RCU can be easily applied to the read side, as follows::
118131 return AUDIT_BUILD_CONTEXT;
119132 }
120133
121- The ``read_lock()`` and ``read_unlock()`` calls have become rcu_read_lock()
122- and rcu_read_unlock(), respectively, and the list_for_each_entry() has
123- become list_for_each_entry_rcu(). The **_rcu()** list-traversal primitives
124- insert the read-side memory barriers that are required on DEC Alpha CPUs.
134+ The read_lock() and read_unlock() calls have become rcu_read_lock()
135+ and rcu_read_unlock(), respectively, and the list_for_each_entry()
136+ has become list_for_each_entry_rcu(). The **_rcu()** list-traversal
137+ primitives add READ_ONCE() and diagnostic checks for incorrect use
138+ outside of an RCU read-side critical section.
125139
126140The changes to the update side are also straightforward. A reader-writer lock
127- might be used as follows for deletion and insertion::
141+ might be used as follows for deletion and insertion in these simplified
142+ versions of audit_del_rule() and audit_add_rule()::
128143
129144 static inline int audit_del_rule(struct audit_rule *rule,
130145 struct list_head *list)
@@ -188,16 +203,16 @@ Following are the RCU equivalents for these two functions::
188203 return 0;
189204 }
190205
191- Normally, the ``write_lock()`` and ``write_unlock()`` would be replaced by a
206+ Normally, the write_lock() and write_unlock() would be replaced by a
192207spin_lock() and a spin_unlock(). But in this case, all callers hold
193208``audit_filter_mutex``, so no additional locking is required. The
194- ``auditsc_lock`` can therefore be eliminated, since use of RCU eliminates the
209+ auditsc_lock can therefore be eliminated, since use of RCU eliminates the
195210need for writers to exclude readers.
196211
197212The list_del(), list_add(), and list_add_tail() primitives have been
198213replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
199- The **_rcu()** list-manipulation primitives add memory barriers that are needed on
200- weakly ordered CPUs (most of them!). The list_del_rcu() primitive omits the
214+ The **_rcu()** list-manipulation primitives add memory barriers that are
215+ needed on weakly ordered CPUs. The list_del_rcu() primitive omits the
201216pointer poisoning debug-assist code that would otherwise cause concurrent
202217readers to fail spectacularly.
203218
@@ -238,7 +253,9 @@ need to be filled in)::
238253The RCU version creates a copy, updates the copy, then replaces the old
239254entry with the newly updated entry. This sequence of actions, allowing
240255concurrent reads while making a copy to perform an update, is what gives
241- RCU (*read-copy update*) its name. The RCU code is as follows::
256+ RCU (*read-copy update*) its name.
257+
258+ The RCU version of audit_upd_rule() is as follows::
242259
243260 static inline int audit_upd_rule(struct audit_rule *rule,
244261 struct list_head *list,
@@ -267,6 +284,9 @@ RCU (*read-copy update*) its name. The RCU code is as follows::
267284Again, this assumes that the caller holds ``audit_filter_mutex``. Normally, the
268285writer lock would become a spinlock in this sort of code.
269286
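The copy, update, and replace sequence that gives RCU its name can also be sketched in portable userspace C. The following is a minimal illustration that assumes C11 atomics in place of the kernel primitives: ``struct rule``, ``current_rule``, ``read_action()`` and ``update_action()`` are hypothetical names, a single pointer stands in for the list linkage, and the acquire load and release store are only loosely analogous to rcu_dereference() and list_replace_rcu(). Waiting for readers before freeing the old copy is left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for an audit rule; not a kernel structure. */
struct rule {
	int action;
	int field_count;
};

/* The pointer that readers follow; stands in for the list linkage. */
static _Atomic(struct rule *) current_rule;

/* Reader: one acquire load, then plain accesses to the snapshot. */
static int read_action(void)
{
	struct rule *r = atomic_load_explicit(&current_rule,
					      memory_order_acquire);

	return r ? r->action : -1;
}

/*
 * Updater: copy the old rule, modify the copy, then publish the copy
 * with a release store.  The old rule is handed back through oldp so
 * that the caller can free it once no reader can still be using it.
 * Returns 0 on success or -1 if allocation fails.  Updaters are
 * assumed to be serialized by the caller, as with audit_filter_mutex.
 */
static int update_action(int new_action, struct rule **oldp)
{
	struct rule *old = atomic_load_explicit(&current_rule,
						memory_order_relaxed);
	struct rule *copy = malloc(sizeof(*copy));

	assert(oldp != NULL);
	if (!copy)
		return -1;
	if (old) {
		*copy = *old;		/* copy */
	} else {
		copy->action = 0;
		copy->field_count = 0;
	}
	copy->action = new_action;	/* update the copy */
	atomic_store_explicit(&current_rule, copy,
			      memory_order_release);	/* replace */
	*oldp = old;
	return 0;
}
```

A reader that loaded the old pointer just before the store keeps seeing a consistent old rule, which is why the old copy must not be freed immediately in a concurrent setting.
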
287+ The update_lsm_rule() function does something very similar, for those who
288+ would prefer to look at real Linux-kernel code.
289+
270290Another use of this pattern can be found in the openvswitch driver's *connection
271291tracking table* code in ``ct_limit_set()``. The table holds connection tracking
272292entries and has a limit on the maximum entries. There is one such table
@@ -281,9 +301,10 @@ Example 4: Eliminating Stale Data
281301---------------------------------
282302
283303The auditing example above tolerates stale data, as do most algorithms
284- that are tracking external state. Because there is a delay from the
285- time the external state changes before Linux becomes aware of the change,
286- additional RCU-induced staleness is generally not a problem.
304+ that are tracking external state. After all, there is a delay
305+ from the time the external state changes until Linux becomes aware
306+ of the change, so, as noted earlier, a small quantity of additional
307+ RCU-induced staleness is generally not a problem.
287308
288309However, there are many examples where stale data cannot be tolerated.
289310One example in the Linux kernel is the System V IPC (see the shm_lock()
@@ -302,7 +323,7 @@ Quick Quiz:
302323
303324If the system-call audit module were to ever need to reject stale data, one way
304325to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to the
305- audit_entry structure, and modify ``audit_filter_task()`` as follows::
326+ ``audit_entry`` structure, and modify audit_filter_task() as follows::
306327
307328 static enum audit_state audit_filter_task(struct task_struct *tsk)
308329 {
@@ -319,19 +340,15 @@ audit_entry structure, and modify ``audit_filter_task()`` as follows::
319340 return AUDIT_BUILD_CONTEXT;
320341 }
321342 rcu_read_unlock();
343+ if (state == AUDIT_STATE_RECORD)
344+ *key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
322345 return state;
323346 }
324347 }
325348 rcu_read_unlock();
326349 return AUDIT_BUILD_CONTEXT;
327350 }
328351
329- Note that this example assumes that entries are only added and deleted.
330- Additional mechanism is required to deal correctly with the update-in-place
331- performed by ``audit_upd_rule()``. For one thing, ``audit_upd_rule()`` would
332- need additional memory barriers to ensure that the list_add_rcu() was really
333- executed before the list_del_rcu().
334-
335352The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the
336353spinlock as follows::
337354
@@ -357,38 +374,49 @@ spinlock as follows::
357374
358375This too assumes that the caller holds ``audit_filter_mutex``.
359376
377+ Note that this example assumes that entries are only added and deleted.
378+ Additional mechanism is required to deal correctly with the update-in-place
379+ performed by audit_upd_rule(). For one thing, audit_upd_rule() would
380+ need to hold the locks of both the old ``audit_entry`` and its replacement
381+ while executing the list_replace_rcu().
382+
360383
361384Example 5: Skipping Stale Objects
362385---------------------------------
363386
364- For some usecases, reader performance can be improved by skipping stale objects
365- during read-side list traversal if the object in concern is pending destruction
366- after one or more grace periods. One such example can be found in the timerfd
367- subsystem. When a ``CLOCK_REALTIME`` clock is reprogrammed - for example due to
368- setting of the system time, then all programmed timerfds that depend on this
369- clock get triggered and processes waiting on them to expire are woken up in
370- advance of their scheduled expiry. To facilitate this, all such timers are added
371- to an RCU-managed ``cancel_list`` when they are setup in
387+ For some use cases, reader performance can be improved by skipping
388+ stale objects during read-side list traversal, where stale objects
389+ are those that will be removed and destroyed after one or more grace
390+ periods. One such example can be found in the timerfd subsystem. When a
391+ ``CLOCK_REALTIME`` clock is reprogrammed (for example, due to setting
392+ of the system time), all programmed ``timerfds`` that depend on
393+ this clock get triggered and processes waiting on them are awakened in
394+ advance of their scheduled expiry. To facilitate this, all such timers
395+ are added to an RCU-managed ``cancel_list`` when they are set up in
372396``timerfd_setup_cancel()``::
373397
374398 static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags)
375399 {
376400 spin_lock(&ctx->cancel_lock);
377- if ((ctx->clockid == CLOCK_REALTIME &&
401+ if ((ctx->clockid == CLOCK_REALTIME ||
402+ ctx->clockid == CLOCK_REALTIME_ALARM) &&
378403 (flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) {
379404 if (!ctx->might_cancel) {
380405 ctx->might_cancel = true;
381406 spin_lock(&cancel_lock);
382407 list_add_rcu(&ctx->clist, &cancel_list);
383408 spin_unlock(&cancel_lock);
384409 }
410+ } else {
411+ __timerfd_remove_cancel(ctx);
385412 }
386413 spin_unlock(&ctx->cancel_lock);
387414 }
388415
389- When a timerfd is freed (fd is closed), then the ``might_cancel`` flag of the
390- timerfd object is cleared, the object removed from the ``cancel_list`` and
391- destroyed::
416+ When a timerfd is freed (fd is closed), the ``might_cancel``
417+ flag of the timerfd object is cleared, and the object is removed from the
418+ ``cancel_list`` and destroyed, as shown in this simplified and inlined
419+ version of timerfd_release()::
392420
393421 int timerfd_release(struct inode *inode, struct file *file)
394422 {
@@ -403,7 +431,10 @@ destroyed::
403431 }
404432 spin_unlock(&ctx->cancel_lock);
405433
406- hrtimer_cancel(&ctx->t.tmr);
434+ if (isalarm(ctx))
435+ alarm_cancel(&ctx->t.alarm);
436+ else
437+ hrtimer_cancel(&ctx->t.tmr);
407438 kfree_rcu(ctx, rcu);
408439 return 0;
409440 }
@@ -416,6 +447,7 @@ objects::
416447
417448 void timerfd_clock_was_set(void)
418449 {
450+ ktime_t moffs = ktime_mono_to_real(0);
419451 struct timerfd_ctx *ctx;
420452 unsigned long flags;
421453
@@ -424,7 +456,7 @@ objects::
424456 if (!ctx->might_cancel)
425457 continue;
426458 spin_lock_irqsave(&ctx->wqh.lock, flags);
427- if (ctx->moffs != ktime_mono_to_real(0)) {
459+ if (ctx->moffs != moffs) {
428460 ctx->moffs = KTIME_MAX;
429461 ctx->ticks++;
430462 wake_up_locked_poll(&ctx->wqh, EPOLLIN);
@@ -434,10 +466,10 @@ objects::
434466 rcu_read_unlock();
435467 }
436468
437- The key point here is, because RCU-traversal of the ``cancel_list`` happens
438- while objects are being added and removed to the list, sometimes the traversal
439- can step on an object that has been removed from the list. In this example, it
440- is seen that it is better to skip such objects using a flag.
469+ The key point is that because RCU-protected traversal of the
470+ ``cancel_list`` happens concurrently with object addition and removal,
471+ sometimes the traversal can access an object that has been removed from
472+ the list. In this example, a flag is used to skip such objects.
441473
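The flag-based skip can be reduced to a very small sketch. The version below is a single-threaded userspace illustration with hypothetical names (``struct ctx``, ``notify_all()``); it omits RCU, the spinlocks, and the wait queue entirely, and only shows the traversal logic: an object whose flag has been cleared is treated as stale and passed over.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for struct timerfd_ctx. */
struct ctx {
	struct ctx *next;
	bool might_cancel;	/* cleared once removal has begun */
	int ticks;
};

/*
 * Walk the list and notify every live object, skipping any object
 * whose flag says it is stale.  Returns the number of objects notified.
 */
static int notify_all(struct ctx *head)
{
	int notified = 0;

	for (struct ctx *c = head; c; c = c->next) {
		if (!c->might_cancel)
			continue;	/* stale: skip it */
		c->ticks++;
		notified++;
	}
	return notified;
}
```

In the real code the stale object remains safe to inspect only because its memory is not freed until a grace period has elapsed, which this sketch does not model.
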
442474
443475Summary