Commit 4b9fd8a

Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Continued user-access cleanups in the futex code.

   - percpu-rwsem rewrite that uses its own waitqueue and atomic_t
     instead of an embedded rwsem. This addresses a couple of
     weaknesses, but the primary motivation was complications on the
     -rt kernel.

   - Introduce raw lock nesting detection on lockdep
     (CONFIG_PROVE_RAW_LOCK_NESTING=y), document the raw_lock vs. normal
     lock differences. This too originates from -rt.

   - Reuse lockdep zapped chain_hlocks entries, to conserve RAM
     footprint on distro-ish kernels running into the "BUG:
     MAX_LOCKDEP_CHAIN_HLOCKS too low!" depletion of the lockdep
     chain-entries pool.

   - Misc cleanups, smaller fixes and enhancements - see the changelog
     for details"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
  fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t
  thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_t
  Documentation/locking/locktypes: Minor copy editor fixes
  Documentation/locking/locktypes: Further clarifications and wordsmithing
  m68knommu: Remove mm.h include from uaccess_no.h
  x86: get rid of user_atomic_cmpxchg_inatomic()
  generic arch_futex_atomic_op_inuser() doesn't need access_ok()
  x86: don't reload after cmpxchg in unsafe_atomic_op2() loop
  x86: convert arch_futex_atomic_op_inuser() to user_access_begin/user_access_end()
  objtool: whitelist __sanitizer_cov_trace_switch()
  [parisc, s390, sparc64] no need for access_ok() in futex handling
  sh: no need of access_ok() in arch_futex_atomic_op_inuser()
  futex: arch_futex_atomic_op_inuser() calling conventions change
  completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all()
  lockdep: Add posixtimer context tracing bits
  lockdep: Annotate irq_work
  lockdep: Add hrtimer context tracing bits
  lockdep: Introduce wait-type checks
  completion: Use simple wait queues
  sched/swait: Prepare usage in completions
  ...
2 parents a776c27 + f1e67e3 commit 4b9fd8a

85 files changed, +1611 -702 lines changed


Documentation/locking/index.rst

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ locking
 .. toctree::
     :maxdepth: 1
 
+    locktypes
     lockdep-design
     lockstat
     locktorture

Documentation/locking/locktypes.rst

Lines changed: 347 additions & 0 deletions
@@ -0,0 +1,347 @@
.. SPDX-License-Identifier: GPL-2.0

.. _kernel_hacking_locktypes:

==========================
Lock types and their rules
==========================

Introduction
============

The kernel provides a variety of locking primitives which can be divided
into two categories:

- Sleeping locks
- Spinning locks

This document conceptually describes these lock types and provides rules
for their nesting, including the rules for use under PREEMPT_RT.


Lock categories
===============

Sleeping locks
--------------

Sleeping locks can only be acquired in preemptible task context.

Although implementations allow try_lock() from other contexts, it is
necessary to carefully evaluate the safety of unlock() as well as of
try_lock(). Furthermore, it is also necessary to evaluate the debugging
versions of these primitives. In short, don't acquire sleeping locks from
other contexts unless there is no other option.

Sleeping lock types:

- mutex
- rt_mutex
- semaphore
- rw_semaphore
- ww_mutex
- percpu_rw_semaphore

On PREEMPT_RT kernels, these lock types are converted to sleeping locks:

- spinlock_t
- rwlock_t

Spinning locks
--------------

- raw_spinlock_t
- bit spinlocks

On non-PREEMPT_RT kernels, these lock types are also spinning locks:

- spinlock_t
- rwlock_t

Spinning locks implicitly disable preemption and the lock / unlock
functions can have suffixes which apply further protections:

===================  ====================================================
_bh()                Disable / enable bottom halves (soft interrupts)
_irq()               Disable / enable interrupts
_irqsave/restore()   Save and disable / restore interrupt disabled state
===================  ====================================================
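
For illustration, a minimal sketch of the _irqsave() variant, assuming a
hypothetical stats_lock that is also taken from an interrupt handler::

  unsigned long flags;

  /* save the current interrupt state, then disable interrupts */
  spin_lock_irqsave(&stats_lock, flags);
  stats.packets++;	/* critical section, safe against the IRQ path */
  /* restore the previously saved interrupt state */
  spin_unlock_irqrestore(&stats_lock, flags);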

Owner semantics
===============

The aforementioned lock types except semaphores have strict owner
semantics:

  The context (task) that acquired the lock must release it.

rw_semaphores have a special interface which allows non-owner release for
readers.
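
A minimal sketch of that non-owner interface; the semaphore, work item,
and hand-off are hypothetical::

  /* task A acquires the reader side and hands the object off */
  down_read_non_owner(&obj->sem);
  queue_work(wq, &obj->work);

  /* the work item, running in a different task, drops the reader side */
  up_read_non_owner(&obj->sem);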

rtmutex
=======

RT-mutexes are mutexes with support for priority inheritance (PI).

PI has limitations on non-PREEMPT_RT kernels due to preemption and
interrupt disabled sections.

PI clearly cannot preempt preemption-disabled or interrupt-disabled
regions of code, even on PREEMPT_RT kernels. Instead, PREEMPT_RT kernels
execute most such regions of code in preemptible task context, especially
interrupt handlers and soft interrupts. This conversion allows spinlock_t
and rwlock_t to be implemented via RT-mutexes.


semaphore
=========

semaphore is a counting semaphore implementation.

Semaphores are often used for both serialization and waiting, but new use
cases should instead use separate serialization and wait mechanisms, such
as mutexes and completions.

semaphores and PREEMPT_RT
-------------------------

PREEMPT_RT does not change the semaphore implementation because counting
semaphores have no concept of owners, thus preventing PREEMPT_RT from
providing priority inheritance for semaphores. After all, an unknown
owner cannot be boosted. As a consequence, blocking on semaphores can
result in priority inversion.
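
A minimal sketch of the preferred split, using a mutex for serialization
and a completion for waiting; all names are hypothetical::

  static DEFINE_MUTEX(dev_mutex);
  static DECLARE_COMPLETION(dev_ready);

  /* producer side: serialize the setup, then signal readiness */
  mutex_lock(&dev_mutex);
  setup_device();
  mutex_unlock(&dev_mutex);
  complete(&dev_ready);

  /* consumer side: sleep until the producer signals readiness */
  wait_for_completion(&dev_ready);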

rw_semaphore
============

rw_semaphore is a multiple readers and single writer lock mechanism.

On non-PREEMPT_RT kernels the implementation is fair, thus preventing
writer starvation.

rw_semaphore complies by default with the strict owner semantics, but there
exist special-purpose interfaces that allow non-owner release for readers.
These interfaces work independent of the kernel configuration.

rw_semaphore and PREEMPT_RT
---------------------------

PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based
implementation, thus changing the fairness:

  Because an rw_semaphore writer cannot grant its priority to multiple
  readers, a preempted low-priority reader will continue holding its lock,
  thus starving even high-priority writers. In contrast, because readers
  can grant their priority to a writer, a preempted low-priority writer will
  have its priority boosted until it releases the lock, thus preventing that
  writer from starving readers.
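
For reference, the basic reader / writer pattern; cfg_rwsem and the
protected data are hypothetical::

  static DECLARE_RWSEM(cfg_rwsem);

  /* readers may hold the lock concurrently */
  down_read(&cfg_rwsem);
  val = cfg.value;
  up_read(&cfg_rwsem);

  /* a writer gets exclusive access */
  down_write(&cfg_rwsem);
  cfg.value = new_val;
  up_write(&cfg_rwsem);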

raw_spinlock_t and spinlock_t
=============================

raw_spinlock_t
--------------

raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels. Use raw_spinlock_t only in real critical
core code, low-level interrupt handling and places where disabling
preemption or interrupts is required, for example, to safely access
hardware state. raw_spinlock_t can sometimes also be used when the
critical section is tiny, thus avoiding RT-mutex overhead.
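
A minimal sketch, assuming a hypothetical lock protecting a two-register
command sequence that must stay atomic against interrupts::

  static DEFINE_RAW_SPINLOCK(hw_lock);

  raw_spin_lock_irqsave(&hw_lock, flags);
  writel(cmd, base + REG_CMD);	/* hypothetical registers */
  writel(arg, base + REG_ARG);
  raw_spin_unlock_irqrestore(&hw_lock, flags);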

spinlock_t
----------

The semantics of spinlock_t change with the state of PREEMPT_RT.

On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has
exactly the same semantics.

spinlock_t and PREEMPT_RT
-------------------------

On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation
based on rt_mutex which changes the semantics:

- Preemption is not disabled.

- The hard interrupt related suffixes for spin_lock / spin_unlock
  operations (_irq, _irqsave / _irqrestore) do not affect the CPU's
  interrupt disabled state.

- The soft interrupt related suffix (_bh()) still disables softirq
  handlers.

  Non-PREEMPT_RT kernels disable preemption to get this effect.

  PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
  preemption disabled. The lock disables softirq handlers and also
  prevents reentrancy due to task preemption.

PREEMPT_RT kernels preserve all other spinlock_t semantics:

- Tasks holding a spinlock_t do not migrate. Non-PREEMPT_RT kernels
  avoid migration by disabling preemption. PREEMPT_RT kernels instead
  disable migration, which ensures that pointers to per-CPU variables
  remain valid even if the task is preempted.

- Task state is preserved across spinlock acquisition, ensuring that the
  task-state rules apply to all kernel configurations. Non-PREEMPT_RT
  kernels leave task state untouched. However, PREEMPT_RT must change
  task state if the task blocks during acquisition. Therefore, it saves
  the current task state before blocking and the corresponding lock wakeup
  restores it, as shown below::

    task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                       lock wakeup
                                         task->state = task->saved_state

  Other types of wakeups would normally unconditionally set the task state
  to RUNNING, but that does not work here because the task must remain
  blocked until the lock becomes available. Therefore, when a non-lock
  wakeup attempts to awaken a task blocked waiting for a spinlock, it
  instead sets the saved state to RUNNING. Then, when the lock
  acquisition completes, the lock wakeup sets the task state to the saved
  state, in this case setting it to RUNNING::

    task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                       non lock wakeup
                                         task->saved_state = TASK_RUNNING

                                       lock wakeup
                                         task->state = task->saved_state

  This ensures that the real wakeup cannot be lost.

rwlock_t
========

rwlock_t is a multiple readers and single writer lock mechanism.

Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the
suffix rules of spinlock_t apply accordingly. The implementation is fair,
thus preventing writer starvation.

rwlock_t and PREEMPT_RT
-----------------------

PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based
implementation, thus changing semantics:

- All the spinlock_t changes also apply to rwlock_t.

- Because an rwlock_t writer cannot grant its priority to multiple
  readers, a preempted low-priority reader will continue holding its lock,
  thus starving even high-priority writers. In contrast, because readers
  can grant their priority to a writer, a preempted low-priority writer
  will have its priority boosted until it releases the lock, thus
  preventing that writer from starving readers.
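
For reference, the basic usage pattern; stats_rwlock and the counters are
hypothetical::

  static DEFINE_RWLOCK(stats_rwlock);

  /* readers may enter concurrently */
  read_lock(&stats_rwlock);
  total = stats.rx + stats.tx;
  read_unlock(&stats_rwlock);

  /* a writer excludes readers and other writers */
  write_lock(&stats_rwlock);
  stats.rx = stats.tx = 0;
  write_unlock(&stats_rwlock);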

PREEMPT_RT caveats
==================

spinlock_t and rwlock_t
-----------------------

The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
have a few implications. For example, on a non-PREEMPT_RT kernel the
following code sequence works as expected::

  local_irq_disable();
  spin_lock(&lock);

and is fully equivalent to::

  spin_lock_irq(&lock);

The same applies to rwlock_t and the _irqsave() suffix variants.

On a PREEMPT_RT kernel this code sequence breaks because RT-mutexes
require a fully preemptible context. Instead, use spin_lock_irq() or
spin_lock_irqsave() and their unlock counterparts. In cases where the
interrupt disabling and locking must remain separate, PREEMPT_RT offers a
local_lock mechanism. Acquiring the local_lock pins the task to a CPU,
allowing things like per-CPU interrupt disabled locks to be acquired.
However, this approach should be used only where absolutely necessary.

raw_spinlock_t
--------------

Acquiring a raw_spinlock_t disables preemption and possibly also
interrupts, so the critical section must avoid acquiring a regular
spinlock_t or rwlock_t; for example, it must avoid allocating memory,
because the memory allocator takes regular spinlocks internally. Thus,
on a non-PREEMPT_RT kernel the following code works perfectly::

  raw_spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);

But this code fails on PREEMPT_RT kernels because the memory allocator is
fully preemptible and therefore cannot be invoked from truly atomic
contexts. However, it is perfectly fine to invoke the memory allocator
while holding normal non-raw spinlocks because they do not disable
preemption on PREEMPT_RT kernels::

  spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);

bit spinlocks
-------------

PREEMPT_RT cannot substitute bit spinlocks because a single bit is too
small to accommodate an RT-mutex. Therefore, the semantics of bit
spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t
caveats also apply to bit spinlocks.

Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT
using conditional (#ifdef'ed) code changes at the usage site. In contrast,
usage-site changes are not needed for the spinlock_t substitution.
Instead, conditionals in header files and the core locking implementation
enable the compiler to do the substitution transparently.
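
For reference, the bare bit spinlock API; the bit number and flags word
are hypothetical, and the flags word must be an unsigned long::

  bit_spin_lock(MY_LOCK_BIT, &obj->flags);
  /* critical section: behaves like raw_spinlock_t, even on PREEMPT_RT */
  bit_spin_unlock(MY_LOCK_BIT, &obj->flags);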

Lock type nesting rules
=======================

The most basic rules are:

- Lock types of the same lock category (sleeping, spinning) can nest
  arbitrarily as long as they respect the general lock ordering rules to
  prevent deadlocks.

- Sleeping lock types cannot nest inside spinning lock types.

- Spinning lock types can nest inside sleeping lock types.

These constraints apply both in PREEMPT_RT and otherwise.

The fact that PREEMPT_RT changes the lock category of spinlock_t and
rwlock_t from spinning to sleeping means that they cannot be acquired while
holding a raw spinlock. This results in the following nesting ordering:

1) Sleeping locks
2) spinlock_t and rwlock_t
3) raw_spinlock_t and bit spinlocks

Lockdep will complain if these constraints are violated, both in
PREEMPT_RT and otherwise.
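
A minimal sketch of a nesting that follows this ordering; the lock
instances are hypothetical::

  mutex_lock(&m);	/* 1) sleeping lock */
  spin_lock(&s);	/* 2) spinlock_t */
  raw_spin_lock(&r);	/* 3) raw_spinlock_t */
  /* ... */
  raw_spin_unlock(&r);
  spin_unlock(&s);
  mutex_unlock(&m);

Reversing any of these pairs, for example taking a spinlock_t inside
raw_spin_lock(), violates the ordering, and kernels built with
CONFIG_PROVE_RAW_LOCK_NESTING=y will emit a lockdep splat for it.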

arch/alpha/include/asm/futex.h

Lines changed: 2 additions & 3 deletions
@@ -31,7 +31,8 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval,
 {
 	int oldval = 0, ret;
 
-	pagefault_disable();
+	if (!access_ok(uaddr, sizeof(u32)))
+		return -EFAULT;
 
 	switch (op) {
 	case FUTEX_OP_SET:
@@ -53,8 +54,6 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval,
 		ret = -ENOSYS;
 	}
 
-	pagefault_enable();
-
 	if (!ret)
 		*oval = oldval;
 
arch/arc/include/asm/futex.h

Lines changed: 3 additions & 2 deletions
@@ -75,10 +75,12 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval,
 {
 	int oldval = 0, ret;
 
+	if (!access_ok(uaddr, sizeof(u32)))
+		return -EFAULT;
+
 #ifndef CONFIG_ARC_HAS_LLSC
 	preempt_disable();	/* to guarantee atomic r-m-w of futex op */
 #endif
-	pagefault_disable();
 
 	switch (op) {
 	case FUTEX_OP_SET:
@@ -101,7 +103,6 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval,
 		ret = -ENOSYS;
 	}
 
-	pagefault_enable();
 #ifndef CONFIG_ARC_HAS_LLSC
 	preempt_enable();
 #endif
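
Both hunks reflect the "futex: arch_futex_atomic_op_inuser() calling
conventions change" commit from the shortlog above: access_ok() moves into
the arch helpers, while pagefault disabling moves to the generic caller.
As a rough sketch (not the verbatim upstream code), the caller in
kernel/futex.c now does::

  pagefault_disable();
  ret = arch_futex_atomic_op_inuser(op, oparg, &oldval, uaddr);
  pagefault_enable();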
