.. SPDX-License-Identifier: GPL-2.0

.. _kernel_hacking_locktypes:

==========================
Lock types and their rules
==========================

Introduction
============

The kernel provides a variety of locking primitives which can be divided
into three categories:

 - Sleeping locks
 - CPU local locks
 - Spinning locks

This document conceptually describes these lock types and provides rules
for their nesting, including the rules for use under PREEMPT_RT.


Lock categories
===============

Sleeping locks
--------------

Sleeping locks can only be acquired in preemptible task context.

Although implementations allow try_lock() from other contexts, it is
necessary to carefully evaluate the safety of unlock() as well as of
try_lock(). Furthermore, it is also necessary to evaluate the debugging
versions of these primitives. In short, don't acquire sleeping locks from
other contexts unless there is no other option.

Sleeping lock types:

 - mutex
 - rt_mutex
 - semaphore
 - rw_semaphore
 - ww_mutex
 - percpu_rw_semaphore

On PREEMPT_RT kernels, these lock types are converted to sleeping locks:

 - local_lock
 - spinlock_t
 - rwlock_t


CPU local locks
---------------

 - local_lock

On non-PREEMPT_RT kernels, local_lock functions are wrappers around
preemption and interrupt disabling primitives. Contrary to other locking
mechanisms, disabling preemption or interrupts is a pure CPU local
concurrency control mechanism and is not suited for inter-CPU concurrency
control.


Spinning locks
--------------

 - raw_spinlock_t
 - bit spinlocks

On non-PREEMPT_RT kernels, these lock types are also spinning locks:

 - spinlock_t
 - rwlock_t

Spinning locks implicitly disable preemption and the lock / unlock functions
can have suffixes which apply further protections:

 ===================  ====================================================
 _bh()                Disable / enable bottom halves (soft interrupts)
 _irq()               Disable / enable interrupts
 _irqsave/restore()   Save and disable / restore interrupt disabled state
 ===================  ====================================================

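
As an illustration, the following minimal sketch (the lock and counter
names are made up for this example) uses the _irqsave() variant, which
saves and restores whatever interrupt state was in effect before the lock
was acquired::

  static DEFINE_SPINLOCK(counter_lock);
  static unsigned long counter;

  void counter_inc(void)
  {
          unsigned long flags;

          spin_lock_irqsave(&counter_lock, flags);
          counter++;
          spin_unlock_irqrestore(&counter_lock, flags);
  }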

Owner semantics
===============

The aforementioned lock types except semaphores have strict owner
semantics:

  The context (task) that acquired the lock must release it.

rw_semaphores have a special interface which allows non-owner release for
readers.


rtmutex
=======

RT-mutexes are mutexes with support for priority inheritance (PI).

PI has limitations on non-PREEMPT_RT kernels due to preemption and
interrupt disabled sections.

PI clearly cannot preempt preemption-disabled or interrupt-disabled
regions of code, even on PREEMPT_RT kernels. Instead, PREEMPT_RT kernels
execute most such regions of code in preemptible task context, especially
interrupt handlers and soft interrupts. This conversion allows spinlock_t
and rwlock_t to be implemented via RT-mutexes.
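
As an illustrative sketch (the lock name is made up for this example), the
basic rt_mutex interfaces mirror those of regular mutexes::

  static DEFINE_RT_MUTEX(pi_lock);

  void update_shared_state(void)
  {
          /* A blocked higher-priority waiter boosts the current owner */
          rt_mutex_lock(&pi_lock);
          /* ... modify state shared with higher-priority tasks ... */
          rt_mutex_unlock(&pi_lock);
  }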


semaphore
=========

semaphore is a counting semaphore implementation.

Semaphores are often used for both serialization and waiting, but new use
cases should instead use separate serialization and wait mechanisms, such
as mutexes and completions.
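
For example, a semaphore that is used only to wait for an event can
typically be replaced by a completion; the names below are illustrative::

  static DECLARE_COMPLETION(setup_done);

  /* Waiting side, previously down(&setup_sem) */
  wait_for_completion(&setup_done);

  /* Signalling side, previously up(&setup_sem) */
  complete(&setup_done);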

semaphores and PREEMPT_RT
-------------------------

PREEMPT_RT does not change the semaphore implementation because counting
semaphores have no concept of owners, thus preventing PREEMPT_RT from
providing priority inheritance for semaphores. After all, an unknown
owner cannot be boosted. As a consequence, blocking on semaphores can
result in priority inversion.


rw_semaphore
============

rw_semaphore is a multiple readers and single writer lock mechanism.

On non-PREEMPT_RT kernels the implementation is fair, thus preventing
writer starvation.

rw_semaphore complies by default with the strict owner semantics, but there
exist special-purpose interfaces that allow non-owner release for readers.
These interfaces work independently of the kernel configuration.
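
One such interface pair, shown here purely as an illustration, is
down_read_non_owner() / up_read_non_owner(), which allows the reader side
to be handed from one task to another::

  /* Task A: acquire the reader side and hand the protected object over */
  down_read_non_owner(&sem);
  /* ... pass the object to task B, e.g. via a queue ... */

  /* Task B: finished with the object, release the reader side */
  up_read_non_owner(&sem);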

rw_semaphore and PREEMPT_RT
---------------------------

PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based
implementation, thus changing the fairness:

 Because an rw_semaphore writer cannot grant its priority to multiple
 readers, a preempted low-priority reader will continue holding its lock,
 thus starving even high-priority writers. In contrast, because readers
 can grant their priority to a writer, a preempted low-priority writer will
 have its priority boosted until it releases the lock, thus preventing that
 writer from starving readers.


local_lock
==========

local_lock provides a named scope to critical sections which are protected
by disabling preemption or interrupts.

On non-PREEMPT_RT kernels local_lock operations map to the preemption and
interrupt disabling and enabling primitives:

 ===============================  ======================
 local_lock(&llock)               preempt_disable()
 local_unlock(&llock)             preempt_enable()
 local_lock_irq(&llock)           local_irq_disable()
 local_unlock_irq(&llock)         local_irq_enable()
 local_lock_irqsave(&llock)       local_irq_save()
 local_unlock_irqrestore(&llock)  local_irq_restore()
 ===============================  ======================

The named scope of local_lock has two advantages over the regular
primitives (see the usage sketch after this list):

 - The lock name allows static analysis and also clearly documents the
   protection scope, while the regular primitives are scopeless and
   opaque.

 - If lockdep is enabled, the local_lock gains a lockmap which allows
   validating the correctness of the protection. This can detect cases
   where e.g. a function using preempt_disable() as the protection
   mechanism is invoked from interrupt or soft-interrupt context. Aside
   from that, lockdep_assert_held(&llock) works as with any other locking
   primitive.

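
As a usage sketch (the structure and lock names are made up for this
example), a local_lock is typically embedded in the per-CPU data it
protects::

  struct foo_pcpu {
          local_lock_t lock;
          unsigned int count;
  };

  static DEFINE_PER_CPU(struct foo_pcpu, foo_pcpu) = {
          .lock = INIT_LOCAL_LOCK(lock),
  };

  void foo_count(void)
  {
          /* Serializes access to this CPU's instance of foo_pcpu */
          local_lock(&foo_pcpu.lock);
          this_cpu_inc(foo_pcpu.count);
          local_unlock(&foo_pcpu.lock);
  }
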
local_lock and PREEMPT_RT
-------------------------

PREEMPT_RT kernels map local_lock to a per-CPU spinlock_t, thus changing
semantics:

 - All spinlock_t changes also apply to local_lock.

local_lock usage
----------------

local_lock should be used in situations where disabling preemption or
interrupts is the appropriate form of concurrency control to protect
per-CPU data structures on a non-PREEMPT_RT kernel.

local_lock is not suitable to protect against preemption or interrupts on a
PREEMPT_RT kernel due to the PREEMPT_RT-specific spinlock_t semantics.


raw_spinlock_t and spinlock_t
=============================

raw_spinlock_t
--------------

raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels. Use raw_spinlock_t only in real critical
core code, low-level interrupt handling and places where disabling
preemption or interrupts is required, for example, to safely access
hardware state. raw_spinlock_t can sometimes also be used when the
critical section is tiny, thus avoiding RT-mutex overhead.
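
A minimal sketch (the lock name and the register access are made up for
this example); the critical section is truly atomic on all kernel
configurations, so it must stay short and must not sleep::

  static DEFINE_RAW_SPINLOCK(hw_lock);

  void hw_update(void)
  {
          unsigned long flags;

          raw_spin_lock_irqsave(&hw_lock, flags);
          /* ... access device registers which require atomicity ... */
          raw_spin_unlock_irqrestore(&hw_lock, flags);
  }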

spinlock_t
----------

The semantics of spinlock_t change with the state of PREEMPT_RT.

On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has
exactly the same semantics.

spinlock_t and PREEMPT_RT
-------------------------

On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation
based on rt_mutex which changes the semantics:

 - Preemption is not disabled.

 - The hard interrupt related suffixes for spin_lock / spin_unlock
   operations (_irq, _irqsave / _irqrestore) do not affect the CPU's
   interrupt disabled state.

 - The soft interrupt related suffix (_bh()) still disables softirq
   handlers.

   Non-PREEMPT_RT kernels disable preemption to get this effect.

   PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
   preemption enabled. The lock disables softirq handlers and also
   prevents reentrancy due to task preemption.

PREEMPT_RT kernels preserve all other spinlock_t semantics:

 - Tasks holding a spinlock_t do not migrate. Non-PREEMPT_RT kernels
   avoid migration by disabling preemption. PREEMPT_RT kernels instead
   disable migration, which ensures that pointers to per-CPU variables
   remain valid even if the task is preempted.

 - Task state is preserved across spinlock acquisition, ensuring that the
   task-state rules apply to all kernel configurations. Non-PREEMPT_RT
   kernels leave task state untouched. However, PREEMPT_RT must change
   task state if the task blocks during acquisition. Therefore, it saves
   the current task state before blocking and the corresponding lock wakeup
   restores it, as shown below::

     task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                      lock wakeup
                                        task->state = task->saved_state

   Other types of wakeups would normally unconditionally set the task state
   to RUNNING, but that does not work here because the task must remain
   blocked until the lock becomes available. Therefore, when a non-lock
   wakeup attempts to awaken a task blocked waiting for a spinlock, it
   instead sets the saved state to RUNNING. Then, when the lock
   acquisition completes, the lock wakeup sets the task state to the saved
   state, in this case setting it to RUNNING::

     task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                      non lock wakeup
                                        task->saved_state = TASK_RUNNING

                                      lock wakeup
                                        task->state = task->saved_state

   This ensures that the real wakeup cannot be lost.


rwlock_t
========

rwlock_t is a multiple readers and single writer lock mechanism.

Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the
suffix rules of spinlock_t apply accordingly. The implementation is fair,
thus preventing writer starvation.
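
A minimal usage sketch (the lock name is made up for this example); the
suffix variants described for spinlock_t apply to rwlock_t as well::

  static DEFINE_RWLOCK(conf_lock);

  /* Readers can run concurrently */
  read_lock(&conf_lock);
  /* ... read the shared configuration ... */
  read_unlock(&conf_lock);

  /* Writers exclude readers and other writers */
  write_lock(&conf_lock);
  /* ... update the shared configuration ... */
  write_unlock(&conf_lock);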

rwlock_t and PREEMPT_RT
-----------------------

PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based
implementation, thus changing semantics:

 - All the spinlock_t changes also apply to rwlock_t.

 - Because an rwlock_t writer cannot grant its priority to multiple
   readers, a preempted low-priority reader will continue holding its lock,
   thus starving even high-priority writers. In contrast, because readers
   can grant their priority to a writer, a preempted low-priority writer
   will have its priority boosted until it releases the lock, thus
   preventing that writer from starving readers.


PREEMPT_RT caveats
==================

local_lock on RT
----------------

The mapping of local_lock to spinlock_t on PREEMPT_RT kernels has a few
implications. For example, on a non-PREEMPT_RT kernel the following code
sequence works as expected::

  local_lock_irq(&local_lock);
  raw_spin_lock(&lock);

and is fully equivalent to::

  raw_spin_lock_irq(&lock);

On a PREEMPT_RT kernel this code sequence breaks because local_lock_irq()
is mapped to a per-CPU spinlock_t which neither disables interrupts nor
preemption. The following code sequence works correctly on both
PREEMPT_RT and non-PREEMPT_RT kernels::

  local_lock_irq(&local_lock);
  spin_lock(&lock);

Another caveat with local locks is that each local_lock has a specific
protection scope. So the following substitution is wrong::

  func1()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock_1, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_1, flags);
  }

  func2()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock_2, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_2, flags);
  }

  func3()
  {
    lockdep_assert_irqs_disabled();
    access_protected_data();
  }

On a non-PREEMPT_RT kernel this works correctly, but on a PREEMPT_RT kernel
local_lock_1 and local_lock_2 are distinct and cannot serialize the callers
of func3(). Also the lockdep assert will trigger on a PREEMPT_RT kernel
because local_lock_irqsave() does not disable interrupts due to the
PREEMPT_RT-specific semantics of spinlock_t. The correct substitution is::

  func1()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
  }

  func2()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
  }

  func3()
  {
    lockdep_assert_held(&local_lock);
    access_protected_data();
  }


spinlock_t and rwlock_t
-----------------------

The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
have a few implications. For example, on a non-PREEMPT_RT kernel the
following code sequence works as expected::

  local_irq_disable();
  spin_lock(&lock);

and is fully equivalent to::

  spin_lock_irq(&lock);

The same applies to rwlock_t and the _irqsave() suffix variants.

On a PREEMPT_RT kernel this code sequence breaks because RT-mutex requires
a fully preemptible context. Instead, use spin_lock_irq() or
spin_lock_irqsave() and their unlock counterparts. In cases where the
interrupt disabling and locking must remain separate, PREEMPT_RT offers a
local_lock mechanism. Acquiring the local_lock pins the task to a CPU,
allowing things like per-CPU interrupt disabled locks to be acquired.
However, this approach should be used only where absolutely necessary.

A typical scenario is protection of per-CPU variables in thread context::

  struct foo *p = get_cpu_ptr(&var1);

  spin_lock(&p->lock);
  p->count += this_cpu_read(var2);

This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel
this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does
not allow acquiring p->lock because get_cpu_ptr() implicitly disables
preemption. The following substitution works on both kernels::

  struct foo *p;

  migrate_disable();
  p = this_cpu_ptr(&var1);
  spin_lock(&p->lock);
  p->count += this_cpu_read(var2);

On a non-PREEMPT_RT kernel migrate_disable() maps to preempt_disable()
which makes the above code fully equivalent. On a PREEMPT_RT kernel
migrate_disable() ensures that the task is pinned on the current CPU which
in turn guarantees that the per-CPU accesses to var1 and var2 stay on the
same CPU.

The migrate_disable() substitution is not valid for the following
scenario::

  func()
  {
    struct foo *p;

    migrate_disable();
    p = this_cpu_ptr(&var1);
    p->val = func2();

While correct on a non-PREEMPT_RT kernel, this breaks on PREEMPT_RT because
here migrate_disable() does not protect against reentrancy from a
preempting task. A correct substitution for this case is::

  func()
  {
    struct foo *p;

    local_lock(&foo_lock);
    p = this_cpu_ptr(&var1);
    p->val = func2();

On a non-PREEMPT_RT kernel this protects against reentrancy by disabling
preemption. On a PREEMPT_RT kernel this is achieved by acquiring the
underlying per-CPU spinlock.


raw_spinlock_t on RT
--------------------

Acquiring a raw_spinlock_t disables preemption and possibly also
interrupts, so the critical section must avoid acquiring a regular
spinlock_t or rwlock_t; for example, the critical section must avoid
allocating memory. Thus, on a non-PREEMPT_RT kernel the following code
works perfectly::

  raw_spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);

But this code fails on PREEMPT_RT kernels because the memory allocator is
fully preemptible and therefore cannot be invoked from truly atomic
contexts. However, it is perfectly fine to invoke the memory allocator
while holding normal non-raw spinlocks because they do not disable
preemption on PREEMPT_RT kernels::

  spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);


bit spinlocks
-------------

PREEMPT_RT cannot substitute bit spinlocks because a single bit is too
small to accommodate an RT-mutex. Therefore, the semantics of bit
spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t
caveats also apply to bit spinlocks.
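
A minimal usage sketch (the flags word and the bit number are made up for
this example); as with raw_spinlock_t, the critical section must stay
short and must not sleep::

  static unsigned long obj_flags;

  void obj_update(void)
  {
          /* Bit 0 of obj_flags serves as the lock bit */
          bit_spin_lock(0, &obj_flags);
          /* ... short, non-sleeping critical section ... */
          bit_spin_unlock(0, &obj_flags);
  }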

Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT
using conditional (#ifdef'ed) code changes at the usage site. In contrast,
usage-site changes are not needed for the spinlock_t substitution.
Instead, conditionals in header files and the core locking implementation
enable the compiler to do the substitution transparently.


Lock type nesting rules
=======================

The most basic rules are:

 - Lock types of the same lock category (sleeping, CPU local, spinning)
   can nest arbitrarily as long as they respect the general lock ordering
   rules to prevent deadlocks.

 - Sleeping lock types cannot nest inside CPU local and spinning lock types.

 - CPU local and spinning lock types can nest inside sleeping lock types.

 - Spinning lock types can nest inside all lock types.

These constraints apply both in PREEMPT_RT and otherwise.

The fact that PREEMPT_RT changes the lock category of spinlock_t and
rwlock_t from spinning to sleeping and substitutes local_lock with a
per-CPU spinlock_t means that they cannot be acquired while holding a raw
spinlock. This results in the following nesting ordering:

 1) Sleeping locks
 2) spinlock_t, rwlock_t, local_lock
 3) raw_spinlock_t and bit spinlocks

Lockdep will complain if these constraints are violated, both in
PREEMPT_RT and otherwise.
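
As an illustration, the following nesting respects this ordering on both
PREEMPT_RT and non-PREEMPT_RT kernels (the lock names are made up for this
example)::

  mutex_lock(&m);       /* 1) sleeping lock */
  spin_lock(&s);        /* 2) spinning lock, sleeping on PREEMPT_RT */
  raw_spin_lock(&r);    /* 3) always a spinning lock */

  raw_spin_unlock(&r);
  spin_unlock(&s);
  mutex_unlock(&m);

Acquiring them in the reverse order, for example taking a spinlock_t
inside a raw_spinlock_t critical section, violates these rules.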