.. SPDX-License-Identifier: GPL-2.0

.. _kernel_hacking_locktypes:

==========================
Lock types and their rules
==========================

Introduction
============

The kernel provides a variety of locking primitives which can be divided
into two categories:

 - Sleeping locks
 - Spinning locks

This document conceptually describes these lock types and provides rules
for their nesting, including the rules for use under PREEMPT_RT.

Lock categories
===============

Sleeping locks
--------------

Sleeping locks can only be acquired in preemptible task context.

Although implementations allow try_lock() from other contexts, it is
necessary to carefully evaluate the safety of unlock() as well as of
try_lock(). Furthermore, it is also necessary to evaluate the debugging
versions of these primitives. In short, don't acquire sleeping locks from
other contexts unless there is no other option.

Sleeping lock types:

 - mutex
 - rt_mutex
 - semaphore
 - rw_semaphore
 - ww_mutex
 - percpu_rw_semaphore

On PREEMPT_RT kernels, these lock types are converted to sleeping locks:

 - spinlock_t
 - rwlock_t
Spinning locks
--------------

 - raw_spinlock_t
 - bit spinlocks

On non-PREEMPT_RT kernels, these lock types are also spinning locks:

 - spinlock_t
 - rwlock_t

Spinning locks implicitly disable preemption and the lock / unlock functions
can have suffixes which apply further protections:

 ===================  ====================================================
 _bh()                Disable / enable bottom halves (soft interrupts)
 _irq()               Disable / enable interrupts
 _irqsave/restore()   Save and disable / restore interrupt disabled state
 ===================  ====================================================
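
As an illustration, here is a sketch of how the suffix variants are
typically chosen; the data structure, field, and function names are
hypothetical::

   struct foo {
           spinlock_t lock;
           int counter;
   };

   void foo_update(struct foo *f)
   {
           unsigned long flags;

           /* Racing with hard interrupt context: disable interrupts
            * and preserve the previous interrupt state.
            */
           spin_lock_irqsave(&f->lock, flags);
           f->counter++;
           spin_unlock_irqrestore(&f->lock, flags);

           /* Racing only with soft interrupt context. */
           spin_lock_bh(&f->lock);
           f->counter++;
           spin_unlock_bh(&f->lock);
   }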

Owner semantics
===============

The aforementioned lock types except semaphores have strict owner
semantics:

  The context (task) that acquired the lock must release it.

rw_semaphores have a special interface which allows non-owner release for
readers.
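
As an illustration of the reader-side non-owner interface, a sketch in
which one task acquires the lock and a different context releases it; the
rw_semaphore and the I/O helper are hypothetical::

   /* Task A: acquire the reader side, then hand it off. */
   down_read_non_owner(&foo_rwsem);
   submit_foo_io(req);             /* hypothetical async operation */

   /* I/O completion (a different task or context): release the
    * reader side on behalf of task A.
    */
   up_read_non_owner(&foo_rwsem);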


rtmutex
=======

RT-mutexes are mutexes with support for priority inheritance (PI).

PI has limitations on non-PREEMPT_RT kernels due to preemption and
interrupt disabled sections.

PI clearly cannot preempt preemption-disabled or interrupt-disabled
regions of code, even on PREEMPT_RT kernels. Instead, PREEMPT_RT kernels
execute most such regions of code in preemptible task context, especially
interrupt handlers and soft interrupts. This conversion allows spinlock_t
and rwlock_t to be implemented via RT-mutexes.
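
As a usage sketch, RT-mutexes follow the regular mutex API while priority
inheritance is handled internally; the lock name is hypothetical::

   static DEFINE_RT_MUTEX(foo_rtmutex);

   rt_mutex_lock(&foo_rtmutex);
   /* Critical section: if a higher-priority task blocks on the
    * lock, the current owner's priority is boosted.
    */
   rt_mutex_unlock(&foo_rtmutex);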


semaphore
=========

semaphore is a counting semaphore implementation.

Semaphores are often used for both serialization and waiting, but new use
cases should instead use separate serialization and wait mechanisms, such
as mutexes and completions.
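
For the waiting use case, a sketch of the completion mechanism that new
code should prefer; the completion name is hypothetical::

   static DECLARE_COMPLETION(foo_ready);

   /* Waiting side: sleeps until the event is signalled. */
   wait_for_completion(&foo_ready);

   /* Signalling side: wakes up a waiter. */
   complete(&foo_ready);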

semaphores and PREEMPT_RT
-------------------------

PREEMPT_RT does not change the semaphore implementation because counting
semaphores have no concept of owners, thus preventing PREEMPT_RT from
providing priority inheritance for semaphores. After all, an unknown
owner cannot be boosted. As a consequence, blocking on semaphores can
result in priority inversion.


rw_semaphore
============

rw_semaphore is a multiple readers and single writer lock mechanism.

On non-PREEMPT_RT kernels the implementation is fair, thus preventing
writer starvation.

rw_semaphore complies by default with the strict owner semantics, but there
exist special-purpose interfaces that allow non-owner release for readers.
These interfaces work independent of the kernel configuration.

rw_semaphore and PREEMPT_RT
---------------------------

PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based
implementation, thus changing the fairness:

 Because an rw_semaphore writer cannot grant its priority to multiple
 readers, a preempted low-priority reader will continue holding its lock,
 thus starving even high-priority writers. In contrast, because readers
 can grant their priority to a writer, a preempted low-priority writer will
 have its priority boosted until it releases the lock, thus preventing that
 writer from starving readers.


raw_spinlock_t and spinlock_t
=============================

raw_spinlock_t
--------------

raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels. Use raw_spinlock_t only in real critical
core code, low-level interrupt handling and places where disabling
preemption or interrupts is required, for example, to safely access
hardware state. raw_spinlock_t can sometimes also be used when the
critical section is tiny, thus avoiding RT-mutex overhead.

spinlock_t
----------

The semantics of spinlock_t change with the state of PREEMPT_RT.

On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has
exactly the same semantics.

spinlock_t and PREEMPT_RT
-------------------------

On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation
based on rt_mutex which changes the semantics:

 - Preemption is not disabled.

 - The hard interrupt related suffixes for spin_lock / spin_unlock
   operations (_irq, _irqsave / _irqrestore) do not affect the CPU's
   interrupt disabled state.

 - The soft interrupt related suffix (_bh()) still disables softirq
   handlers.

   Non-PREEMPT_RT kernels disable preemption to get this effect.

   PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
   preemption enabled. The lock disables softirq handlers and also
   prevents reentrancy due to task preemption.

PREEMPT_RT kernels preserve all other spinlock_t semantics:

 - Tasks holding a spinlock_t do not migrate. Non-PREEMPT_RT kernels
   avoid migration by disabling preemption. PREEMPT_RT kernels instead
   disable migration, which ensures that pointers to per-CPU variables
   remain valid even if the task is preempted.

 - Task state is preserved across spinlock acquisition, ensuring that the
   task-state rules apply to all kernel configurations. Non-PREEMPT_RT
   kernels leave task state untouched. However, PREEMPT_RT must change
   task state if the task blocks during acquisition. Therefore, it saves
   the current task state before blocking and the corresponding lock wakeup
   restores it, as shown below::

    task->state = TASK_INTERRUPTIBLE
    lock()
      block()
        task->saved_state = task->state
        task->state = TASK_UNINTERRUPTIBLE
        schedule()
          lock wakeup
            task->state = task->saved_state

   Other types of wakeups would normally unconditionally set the task state
   to RUNNING, but that does not work here because the task must remain
   blocked until the lock becomes available. Therefore, when a non-lock
   wakeup attempts to awaken a task blocked waiting for a spinlock, it
   instead sets the saved state to RUNNING. Then, when the lock
   acquisition completes, the lock wakeup sets the task state to the saved
   state, in this case setting it to RUNNING::

    task->state = TASK_INTERRUPTIBLE
    lock()
      block()
        task->saved_state = task->state
        task->state = TASK_UNINTERRUPTIBLE
        schedule()
          non lock wakeup
            task->saved_state = TASK_RUNNING

          lock wakeup
            task->state = task->saved_state

   This ensures that the real wakeup cannot be lost.


rwlock_t
========

rwlock_t is a multiple readers and single writer lock mechanism.

Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the
suffix rules of spinlock_t apply accordingly. The implementation is fair,
thus preventing writer starvation.

rwlock_t and PREEMPT_RT
-----------------------

PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based
implementation, thus changing semantics:

 - All the spinlock_t changes also apply to rwlock_t.

 - Because an rwlock_t writer cannot grant its priority to multiple
   readers, a preempted low-priority reader will continue holding its lock,
   thus starving even high-priority writers. In contrast, because readers
   can grant their priority to a writer, a preempted low-priority writer
   will have its priority boosted until it releases the lock, thus
   preventing that writer from starving readers.

PREEMPT_RT caveats
==================

spinlock_t and rwlock_t
-----------------------

These changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
have a few implications. For example, on a non-PREEMPT_RT kernel the
following code sequence works as expected::

   local_irq_disable();
   spin_lock(&lock);

and is fully equivalent to::

   spin_lock_irq(&lock);

The same applies to rwlock_t and the _irqsave() suffix variants.

On a PREEMPT_RT kernel this code sequence breaks because RT-mutex requires
a fully preemptible context. Instead, use spin_lock_irq() or
spin_lock_irqsave() and their unlock counterparts. In cases where the
interrupt disabling and locking must remain separate, PREEMPT_RT offers a
local_lock mechanism. Acquiring the local_lock pins the task to a CPU,
allowing things like per-CPU interrupt disabled locks to be acquired.
However, this approach should be used only where absolutely necessary.
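
A sketch of the local_lock mechanism protecting per-CPU data; the
structure and variable names are hypothetical::

   struct foo_pcpu {
           local_lock_t lock;
           int counter;
   };

   static DEFINE_PER_CPU(struct foo_pcpu, foo_pcpu) = {
           .lock = INIT_LOCAL_LOCK(lock),
   };

   unsigned long flags;

   /* Disables interrupts on non-PREEMPT_RT kernels; acquires a
    * per-CPU lock and prevents migration on PREEMPT_RT kernels.
    */
   local_lock_irqsave(&foo_pcpu.lock, flags);
   this_cpu_inc(foo_pcpu.counter);
   local_unlock_irqrestore(&foo_pcpu.lock, flags);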

raw_spinlock_t
--------------

Acquiring a raw_spinlock_t disables preemption and possibly also
interrupts, so the critical section must avoid acquiring a regular
spinlock_t or rwlock_t. This means, for example, that the critical
section must avoid allocating memory. Thus, on a non-PREEMPT_RT kernel
the following code works perfectly::

   raw_spin_lock(&lock);
   p = kmalloc(sizeof(*p), GFP_ATOMIC);

But this code fails on PREEMPT_RT kernels because the memory allocator is
fully preemptible and therefore cannot be invoked from truly atomic
contexts. However, it is perfectly fine to invoke the memory allocator
while holding normal non-raw spinlocks because they do not disable
preemption on PREEMPT_RT kernels::

   spin_lock(&lock);
   p = kmalloc(sizeof(*p), GFP_ATOMIC);


bit spinlocks
-------------

PREEMPT_RT cannot substitute bit spinlocks because a single bit is too
small to accommodate an RT-mutex. Therefore, the semantics of bit
spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t
caveats also apply to bit spinlocks.

Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT
using conditional (#ifdef'ed) code changes at the usage site. In contrast,
usage-site changes are not needed for the spinlock_t substitution.
Instead, conditionals in header files and the core locking implementation
enable the compiler to do the substitution transparently.

Lock type nesting rules
=======================

The most basic rules are:

 - Lock types of the same lock category (sleeping, spinning) can nest
   arbitrarily as long as they respect the general lock ordering rules to
   prevent deadlocks.

 - Sleeping lock types cannot nest inside spinning lock types.

 - Spinning lock types can nest inside sleeping lock types.

These constraints apply both in PREEMPT_RT and otherwise.

The fact that PREEMPT_RT changes the lock category of spinlock_t and
rwlock_t from spinning to sleeping means that they cannot be acquired while
holding a raw spinlock. This results in the following nesting ordering:

 1) Sleeping locks
 2) spinlock_t and rwlock_t
 3) raw_spinlock_t and bit spinlocks

Lockdep will complain if these constraints are violated, both in
PREEMPT_RT and otherwise.
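
As an illustration, a nesting sketch that respects this ordering on all
kernel configurations; the locks are hypothetical::

   /* Valid: spinning locks nest inside sleeping locks, and
    * raw_spinlock_t nests inside spinlock_t.
    */
   mutex_lock(&foo_mutex);
   spin_lock(&foo_spinlock);
   raw_spin_lock(&foo_rawlock);
   /* critical section */
   raw_spin_unlock(&foo_rawlock);
   spin_unlock(&foo_spinlock);
   mutex_unlock(&foo_mutex);

   /* Invalid on PREEMPT_RT, where spinlock_t is a sleeping lock:
    * it must not be acquired while holding a raw_spinlock_t.
    */
   raw_spin_lock(&foo_rawlock);
   spin_lock(&foo_spinlock);       /* wrong nesting order */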