====================
PCI Power Management
====================

Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.

An overview of concepts and the Linux kernel's interfaces related to PCI power
management.  Based on previous work by Patrick Mochel <mochel@transmeta.com>
(and others).

This document only covers the aspects of power management specific to PCI
devices.  For a general description of the kernel's interfaces related to device
power management refer to Documentation/driver-api/pm/devices.rst and
Documentation/power/runtime_pm.rst.

.. contents:

   1. Hardware and Platform Support for PCI Power Management
   2. PCI Subsystem and Device Power Management
   3. PCI Device Drivers and Power Management
   4. Resources


1. Hardware and Platform Support for PCI Power Management
=========================================================

1.1. Native and Platform-Based Power Management
-----------------------------------------------

In general, power management is a feature allowing one to save energy by putting
devices into states in which they draw less power (low-power states) at the
price of reduced functionality or performance.

Usually, a device is put into a low-power state when it is underutilized or
completely inactive.  However, when it is necessary to use the device once
again, it has to be put back into the "fully functional" state (full-power
state).  This may happen when there are some data for the device to handle or
as a result of an external event requiring the device to be active, which may
be signaled by the device itself.

PCI devices may be put into low-power states in two ways: by using the device
capabilities introduced by the PCI Bus Power Management Interface Specification,
or with the help of platform firmware, such as an ACPI BIOS.  In the first
approach, which is referred to as native PCI power management (native PCI PM)
in what follows, the device power state is changed as a result of writing a
specific value into one of its standard configuration registers.  The second
approach requires the platform firmware to provide special methods that may be
used by the kernel to change the device's power state.

Devices supporting the native PCI PM usually can generate wakeup signals called
Power Management Events (PMEs) to let the kernel know about external events
requiring the device to be active.  After receiving a PME the kernel is supposed
to put the device that sent it into the full-power state.  However, the PCI Bus
Power Management Interface Specification doesn't define any standard method of
delivering the PME from the device to the CPU and the operating system kernel.
It is assumed that the platform firmware will perform this task and therefore,
even though a PCI device is set up to generate PMEs, it also may be necessary to
prepare the platform firmware for notifying the CPU of the PMEs coming from the
device (e.g. by generating interrupts).

In turn, if the methods provided by the platform firmware are used for changing
the power state of a device, usually the platform also provides a method for
preparing the device to generate wakeup signals.  In that case, however, it
often also is necessary to prepare the device for generating PMEs using the
native PCI PM mechanism, because the method provided by the platform depends on
that.

Thus in many situations both the native and the platform-based power management
mechanisms have to be used simultaneously to obtain the desired result.

1.2. Native PCI Power Management
--------------------------------

The PCI Bus Power Management Interface Specification (PCI PM Spec) was
introduced between the PCI 2.1 and PCI 2.2 Specifications.  It defined a
standard interface for performing various operations related to power
management.

The implementation of the PCI PM Spec is optional for conventional PCI devices,
but it is mandatory for PCI Express devices.  If a device supports the PCI PM
Spec, it has an 8-byte power management capability field in its PCI
configuration space.  This field is used to describe and control the standard
features related to the native PCI power management.

The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses
(B0-B3).  The higher the number, the less power is drawn by the device or bus
in that state.  However, the higher the number, the longer the latency for
the device or bus to return to the full-power state (D0 or B0, respectively).

There are two variants of the D3 state defined by the specification.  The first
one is D3hot, referred to as the software accessible D3, because devices can be
programmed to go into it.  The second one, D3cold, is the state that PCI devices
are in when the supply voltage (Vcc) is removed from them.  It is not possible
to program a PCI device to go into D3cold, although there may be a programmable
interface for putting the bus the device is on into a state in which Vcc is
removed from all devices on the bus.

PCI bus power management, however, is not supported by the Linux kernel at the
time of this writing and therefore it is not covered by this document.

Note that every PCI device can be in the full-power state (D0) or in D3cold,
regardless of whether or not it implements the PCI PM Spec.  In addition to
that, if the PCI PM Spec is implemented by the device, it must support D3hot
as well as D0.  The support for the D1 and D2 power states is optional.

PCI devices supporting the PCI PM Spec can be programmed to go to any of the
supported low-power states (except for D3cold).  While in D1-D3hot the
standard configuration registers of the device must be accessible to software
(i.e. the device is required to respond to PCI configuration accesses), although
its I/O and memory spaces are then disabled.  This allows the device to be
programmatically put into D0.  Thus the kernel can switch the device back and
forth between D0 and the supported low-power states (except for D3cold) and the
possible power state transitions the device can undergo are the following:

+---------------+------------+
| Current State | New State  |
+===============+============+
| D0            | D1, D2, D3 |
+---------------+------------+
| D1            | D2, D3     |
+---------------+------------+
| D2            | D3         |
+---------------+------------+
| D1, D2, D3    | D0         |
+---------------+------------+
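
The table can also be read as a small predicate.  The following is a minimal
userspace sketch in C, not kernel code: the enum and the function name
``pm_transition_allowed()`` are hypothetical and exist only for illustration.
It assumes that "D3" in the table means D3hot and that D3cold is never a
programmable source or target, as stated above.

```c
#include <stdbool.h>

/* Userspace model of the device power states named in the text. */
enum pm_state { PM_D0, PM_D1, PM_D2, PM_D3HOT, PM_D3COLD };

/*
 * Return true if software may program the transition from -> to,
 * following the table: a deeper state is reachable from any shallower
 * programmable state, and D1-D3hot can only be left by going to D0.
 * D3cold is excluded because it is entered and left by removing or
 * restoring Vcc, not by programming the device.
 */
bool pm_transition_allowed(enum pm_state from, enum pm_state to)
{
	if (from == PM_D3COLD || to == PM_D3COLD)
		return false;
	if (from == to)
		return false;		/* not a transition */
	if (to == PM_D0)
		return true;		/* D1, D2, D3hot -> D0 */
	return to > from;		/* only towards a deeper state */
}
```

For example, ``pm_transition_allowed(PM_D2, PM_D1)`` is false in this model:
a device in D2 has to go through D0 before it can be put into D1.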

The transition from D3cold to D0 occurs when the supply voltage is provided to
the device (i.e. power is restored).  In that case the device returns to D0 with
a full power-on reset sequence and the power-on defaults are restored to the
device by hardware just as at initial power up.

PCI devices supporting the PCI PM Spec can be programmed to generate PMEs
while in any power state (D0-D3), but they are not required to be capable
of generating PMEs from all supported power states.  In particular, the
capability of generating PMEs from D3cold is optional and depends on the
presence of additional voltage (3.3Vaux) allowing the device to remain
sufficiently active to generate a wakeup signal.

1.3. ACPI Device Power Management
---------------------------------

The platform firmware support for the power management of PCI devices is
system-specific.  However, if the system in question is compliant with the
Advanced Configuration and Power Interface (ACPI) Specification, like the
majority of x86-based systems, it is supposed to implement device power
management interfaces defined by the ACPI standard.

For this purpose the ACPI BIOS provides special functions called "control
methods" that may be executed by the kernel to perform specific tasks, such as
putting a device into a low-power state.  These control methods are encoded
using a special byte-code language called the ACPI Machine Language (AML) and
stored in the machine's BIOS.  The kernel loads them from the BIOS and executes
them as needed using an AML interpreter that translates the AML byte code into
computations and memory or I/O space accesses.  This way, in theory, a BIOS
writer can provide the kernel with a means to perform actions depending
on the system design in a system-specific fashion.

ACPI control methods may be divided into global control methods, which are not
associated with any particular devices, and device control methods, which have
to be defined separately for each device supposed to be handled with the help of
the platform.  This means, in particular, that ACPI device control methods can
only be used to handle devices that the BIOS writer knew about in advance.  The
ACPI methods used for device power management fall into that category.

The ACPI specification assumes that devices can be in one of four power states
labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM
D0-D3 states (although the difference between D3hot and D3cold is not taken
into account by ACPI).  Moreover, for each power state of a device there is a
set of power resources that have to be enabled for the device to be put into
that state.  These power resources are controlled (i.e. enabled or disabled)
with the help of their own control methods, _ON and _OFF, that have to be
defined individually for each of them.

To put a device into the ACPI power state Dx (where x is a number between 0 and
3 inclusive) the kernel is supposed to (1) enable the power resources required
by the device in this state using their _ON control methods and (2) execute the
_PSx control method defined for the device.  In addition to that, if the device
is going to be put into a low-power state (D1-D3) and is supposed to generate
wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI
3.0) control method defined for it has to be executed before _PSx.  Power
resources that are not required by the device in the target power state and are
not required any more by any other device should be disabled (by executing their
_OFF control methods).  If the current power state of the device is D3, it can
only be put into D0 this way.

However, quite often the power states of devices are changed during a
system-wide transition into a sleep state or back into the working state.  ACPI
defines four system sleep states, S1, S2, S3, and S4, and denotes the system
working state as S0.  In general, the target system sleep (or working) state
determines the highest power (lowest number) state the device can be put
into and the kernel is supposed to obtain this information by executing the
device's _SxD control method (where x is a number between 0 and 4 inclusive).
If the device is required to wake up the system from the target sleep state, the
lowest power (highest number) state it can be put into is also determined by the
target state of the system.  The kernel is then supposed to use the device's
_SxW control method to obtain the number of that state.  It also is supposed to
use the device's _PRW control method to learn which power resources need to be
enabled for the device to be able to generate wakeup signals.
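
The way _SxD and _SxW bound the usable device states can be illustrated with a
short userspace sketch in C.  It is not the kernel's ACPI code: the structure
and the function ``sleep_dstate_range()`` are hypothetical, the values that the
control methods would return are passed in as plain integers, and the handling
of inconsistent firmware data (clamping) is an assumption made here only to
keep the result well defined.

```c
#include <stdbool.h>

/* Range of ACPI device states (0 = D0 ... 3 = D3) usable in a sleep state. */
struct dstate_range {
	int shallowest;	/* highest-power state allowed (from _SxD) */
	int deepest;	/* lowest-power state allowed */
};

/*
 * sxd:    value reported by _SxD for the target system state
 * sxw:    value reported by _SxW for the target system state
 * wakeup: whether the device is required to wake up the system
 */
struct dstate_range sleep_dstate_range(int sxd, int sxw, bool wakeup)
{
	struct dstate_range r = { .shallowest = sxd, .deepest = 3 };

	/* Wakeup capability may forbid the deepest states. */
	if (wakeup && sxw < r.deepest)
		r.deepest = sxw;

	/* Assumed handling of inconsistent data: collapse to one state. */
	if (r.deepest < r.shallowest)
		r.deepest = r.shallowest;

	return r;
}
```

With ``sxd = 1`` and ``sxw = 2``, a device that has to wake up the system may
only be put into D1 or D2 in this model, whereas without the wakeup
requirement the range extends to D3.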

1.4. Wakeup Signaling
---------------------

Wakeup signals generated by PCI devices, either as native PCI PMEs, or as
a result of the execution of the _DSW (or _PSW) ACPI control method before
putting the device into a low-power state, have to be caught and handled as
appropriate.  If they are sent while the system is in the working state
(ACPI S0), they should be translated into interrupts so that the kernel can
put the devices generating them into the full-power state and take care of the
events that triggered them.  In turn, if they are sent while the system is
sleeping, they should cause the system's core logic to trigger wakeup.

On ACPI-based systems wakeup signals sent by conventional PCI devices are
converted into ACPI General-Purpose Events (GPEs) which are hardware signals
from the system core logic generated in response to various events that need to
be acted upon.  Every GPE is associated with one or more sources of potentially
interesting events.  In particular, a GPE may be associated with a PCI device
capable of signaling wakeup.  The information on the connections between GPEs
and event sources is recorded in the system's ACPI BIOS from where it can be
read by the kernel.

If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE
associated with it (if there is one) is triggered.  The GPEs associated with PCI
bridges may also be triggered in response to a wakeup signal from one of the
devices below the bridge (this also is the case for root bridges) and, for
example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be
handled this way.

A GPE may be triggered when the system is sleeping (i.e. when it is in one of
the ACPI S1-S4 states), in which case system wakeup is started by its core logic
(the device that was the source of the signal causing the system wakeup to occur
may be identified later).  The GPEs used in such situations are referred to as
wakeup GPEs.

Usually, however, GPEs are also triggered when the system is in the working
state (ACPI S0) and in that case the system's core logic generates a System
Control Interrupt (SCI) to notify the kernel of the event.  Then, the SCI
handler identifies the GPE that caused the interrupt to be generated which,
in turn, allows the kernel to identify the source of the event (that may be
a PCI device signaling wakeup).  The GPEs used for notifying the kernel of
events occurring while the system is in the working state are referred to as
runtime GPEs.

Unfortunately, there is no standard way of handling wakeup signals sent by
conventional PCI devices on systems that are not ACPI-based, but there is one
for PCI Express devices.  Namely, the PCI Express Base Specification introduced
a native mechanism for converting native PCI PMEs into interrupts generated by
root ports.  For conventional PCI devices native PMEs are out-of-band, so they
are routed separately and they need not pass through bridges (in principle they
may be routed directly to the system's core logic), but for PCI Express devices
they are in-band messages that have to pass through the PCI Express hierarchy,
including the root port on the path from the device to the Root Complex.  Thus
it was possible to introduce a mechanism by which a root port generates an
interrupt whenever it receives a PME message from one of the devices below it.
The PCI Express Requester ID of the device that sent the PME message is then
recorded in one of the root port's configuration registers from where it may be
read by the interrupt handler allowing the device to be identified.  [PME
messages sent by PCI Express endpoints integrated with the Root Complex don't
pass through root ports, but instead they cause a Root Complex Event Collector
(if there is one) to generate interrupts.]

In principle the native PCI Express PME signaling may also be used on ACPI-based
systems along with the GPEs, but to use it the kernel has to ask the system's
ACPI BIOS to release control of root port configuration registers.  The ACPI
BIOS, however, is not required to allow the kernel to control these registers
and if it doesn't do that, the kernel must not modify their contents.  Of course
the native PCI Express PME signaling cannot be used by the kernel in that case.


2. PCI Subsystem and Device Power Management
============================================

2.1. Device Power Management Callbacks
--------------------------------------

The PCI Subsystem participates in the power management of PCI devices in a
number of ways.  First of all, it provides an intermediate code layer between
the device power management core (PM core) and PCI device drivers.
Specifically, the pm field of the PCI subsystem's struct bus_type object,
pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing
pointers to several device power management callbacks::

  const struct dev_pm_ops pci_dev_pm_ops = {
        .prepare = pci_pm_prepare,
        .complete = pci_pm_complete,
        .suspend = pci_pm_suspend,
        .resume = pci_pm_resume,
        .freeze = pci_pm_freeze,
        .thaw = pci_pm_thaw,
        .poweroff = pci_pm_poweroff,
        .restore = pci_pm_restore,
        .suspend_noirq = pci_pm_suspend_noirq,
        .resume_noirq = pci_pm_resume_noirq,
        .freeze_noirq = pci_pm_freeze_noirq,
        .thaw_noirq = pci_pm_thaw_noirq,
        .poweroff_noirq = pci_pm_poweroff_noirq,
        .restore_noirq = pci_pm_restore_noirq,
        .runtime_suspend = pci_pm_runtime_suspend,
        .runtime_resume = pci_pm_runtime_resume,
        .runtime_idle = pci_pm_runtime_idle,
  };

These callbacks are executed by the PM core in various situations related to
device power management and they, in turn, execute power management callbacks
provided by PCI device drivers.  They also perform power management operations
involving some standard configuration registers of PCI devices that device
drivers need not know or care about.

The structure representing a PCI device, struct pci_dev, contains several fields
that these callbacks operate on::

  struct pci_dev {
        ...
        pci_power_t     current_state;  /* Current operating state. */
        int             pm_cap;         /* PM capability offset in the
                                           configuration space */
        unsigned int    pme_support:5;  /* Bitmask of states from which PME#
                                           can be generated */
        unsigned int    pme_poll:1;     /* Poll device's PME status bit */
        unsigned int    d1_support:1;   /* Low power state D1 is supported */
        unsigned int    d2_support:1;   /* Low power state D2 is supported */
        unsigned int    no_d1d2:1;      /* D1 and D2 are forbidden */
        unsigned int    wakeup_prepared:1;  /* Device prepared for wake up */
        unsigned int    d3hot_delay;    /* D3hot->D0 transition time in ms */
        ...
  };

They also indirectly use some fields of the struct device that is embedded in
struct pci_dev.

2.2. Device Initialization
--------------------------

The PCI subsystem's first task related to device power management is to
prepare the device for power management and initialize the fields of struct
pci_dev used for this purpose.  This happens in two functions defined in
drivers/pci/, pci_pm_init() and pci_acpi_setup().

The first of these functions checks if the device supports native PCI PM
and if that's the case the offset of its power management capability structure
in the configuration space is stored in the pm_cap field of the device's struct
pci_dev object.  Next, the function checks which PCI low-power states are
supported by the device and from which low-power states the device can generate
native PCI PMEs.  The power management fields of the device's struct pci_dev and
the struct device embedded in it are updated accordingly and the generation of
PMEs by the device is disabled.

The second function checks if the device can be prepared to signal wakeup with
the help of the platform firmware, such as the ACPI BIOS.  If that is the case,
the function updates the wakeup fields in struct device embedded in the
device's struct pci_dev and uses the firmware-provided method to prevent the
device from signaling wakeup.

At this point the device is ready for power management.  For driverless devices,
however, this functionality is limited to a few basic operations carried out
during system-wide transitions to a sleep state and back to the working state.

2.3. Runtime Device Power Management
------------------------------------

The PCI subsystem plays a vital role in the runtime power management of PCI
devices.  For this purpose it uses the general runtime power management
(runtime PM) framework described in Documentation/power/runtime_pm.rst.
Namely, it provides subsystem-level callbacks::

	pci_pm_runtime_suspend()
	pci_pm_runtime_resume()
	pci_pm_runtime_idle()

that are executed by the core runtime PM routines.  It also implements the
entire mechanics necessary for handling runtime wakeup signals from PCI devices
in low-power states, which at the time of this writing works for both the native
PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in
Section 1.

First, a PCI device is put into a low-power state, or suspended, with the help
of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call
pci_pm_runtime_suspend() to do the actual job.  For this to work, the device's
driver has to provide a pm->runtime_suspend() callback (see below), which is
run by pci_pm_runtime_suspend() as the first action.  If the driver's callback
returns successfully, the device's standard configuration registers are saved,
the device is prepared to generate wakeup signals and, finally, it is put into
the target low-power state.
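
The ordering described in the paragraph above can be sketched as a userspace
model in C.  This is not the code of pci_pm_runtime_suspend(): the structure
``rpm_model_dev``, the two example driver callbacks and the numeric error
values are all hypothetical, and the three subsystem steps are reduced to flag
updates so that only the order of operations is shown.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a device; this is not struct pci_dev. */
struct rpm_model_dev {
	int (*drv_runtime_suspend)(struct rpm_model_dev *dev);
	bool config_saved;
	bool wakeup_prepared;
	int state;			/* 0 = D0 ... 3 = D3hot */
};

/* Example driver callback that agrees to suspend. */
int drv_suspend_ok(struct rpm_model_dev *dev)
{
	(void)dev;
	return 0;
}

/* Example driver callback that refuses (-16 stands in for -EBUSY). */
int drv_suspend_busy(struct rpm_model_dev *dev)
{
	(void)dev;
	return -16;
}

int model_runtime_suspend(struct rpm_model_dev *dev, int target_state)
{
	int err;

	if (!dev->drv_runtime_suspend)
		return -1;	/* the text requires a driver callback */

	err = dev->drv_runtime_suspend(dev);	/* always the first action */
	if (err)
		return err;	/* the device is left untouched */

	dev->config_saved = true;	/* save standard config registers */
	dev->wakeup_prepared = true;	/* prepare wakeup signaling */
	dev->state = target_state;	/* enter the low-power state last */
	return 0;
}
```

In this model a failing driver callback leaves the device in D0 with nothing
saved, which is the property the ordering is meant to guarantee.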

The low-power state to put the device into is the lowest-power (highest number)
state from which it can signal wakeup.  The exact method of signaling wakeup is
system-dependent and is determined by the PCI subsystem on the basis of the
reported capabilities of the device and the platform firmware.  To prepare the
device for signaling wakeup and put it into the selected low-power state, the
PCI subsystem can use the platform firmware as well as the device's native PCI
PM capabilities, if supported.
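
For the native PCI PM part of that choice, a minimal userspace sketch in C may
help.  It assumes that bit n of a ``pme_support``-style bitmask (see the field
of the same name in Section 2.1) means "PME can be generated from state n",
with D0 as bit 0 and D3cold as bit 4.  The function names are hypothetical and
platform firmware input is deliberately ignored.

```c
#include <stdbool.h>

enum { ST_D0 = 0, ST_D1, ST_D2, ST_D3HOT, ST_D3COLD };

/* Assumed layout: bit n set means PME can be generated from state n. */
bool model_pme_capable(unsigned int pme_support, int state)
{
	return (pme_support & (1u << state)) != 0;
}

/*
 * Starting from the deepest state under consideration, walk towards D0
 * until a state is found from which the device can signal wakeup.
 * Returns ST_D0 if no low-power state qualifies.
 */
int model_runtime_target_state(unsigned int pme_support, int deepest)
{
	int state = deepest;

	while (state > ST_D0 && !model_pme_capable(pme_support, state))
		state--;

	return state;
}
```

With a bitmask of 0x09 (PME from D0 and D3hot) the model selects D3hot, while
with 0x02 (PME from D1 only) it selects D1 even though deeper states exist.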

It is expected that the device driver's pm->runtime_suspend() callback will
not attempt to prepare the device for signaling wakeup or to put it into a
low-power state.  The driver ought to leave these tasks to the PCI subsystem
that has all of the information necessary to perform them.

A suspended device is brought back into the "active" state, or resumed,
with the help of pm_request_resume() or pm_runtime_resume() which both call
pci_pm_runtime_resume() for PCI devices.  Again, this only works if the device's
driver provides a pm->runtime_resume() callback (see below).  However, before
the driver's callback is executed, pci_pm_runtime_resume() brings the device
back into the full-power state, prevents it from signaling wakeup while in that
state and restores its standard configuration registers.  Thus the driver's
callback need not worry about the PCI-specific aspects of the device resume.

Note that generally pci_pm_runtime_resume() may be called in two different
situations.  First, it may be called at the request of the device's driver, for
example if there are some data for it to process.  Second, it may be called
as a result of a wakeup signal from the device itself (this sometimes is
referred to as "remote wakeup").  Of course, for this purpose the wakeup signal
is handled in one of the ways described in Section 1 and finally converted into
a notification for the PCI subsystem after the source device has been
identified.

The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle()
and pm_request_idle(), executes the device driver's pm->runtime_idle()
callback, if defined, and if that callback doesn't return an error code (or is
not present at all), suspends the device with the help of pm_runtime_suspend().
Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for
example, it is called right after the device has just been resumed), in which
cases it is expected to suspend the device if that makes sense.  Usually,
however, the PCI subsystem doesn't really know if the device really can be
suspended, so it lets the device's driver decide by running its
pm->runtime_idle() callback.

2.4. System-Wide Power Transitions
----------------------------------

There are a few different types of system-wide power transitions, described in
Documentation/driver-api/pm/devices.rst.  Each of them requires devices to be
handled in a specific way and the PM core executes subsystem-level power
management callbacks for this purpose.  They are executed in phases such that
each phase involves executing the same subsystem-level callback for every device
belonging to the given subsystem before the next phase begins.  These phases
always run after tasks have been frozen.
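
The phase ordering can be pictured as two nested loops, phases on the outside
and devices on the inside.  The following userspace C sketch is only a model
of that ordering: ``struct phase_log``, ``run_callback()`` and
``run_transition()`` are hypothetical names, phases and devices are plain
indices, and both device ordering within a phase and asynchronous execution
are ignored.

```c
#include <stddef.h>

#define PHASE_LOG_MAX 32

/* Records the order in which (phase, device) callbacks were run. */
struct phase_log {
	size_t count;
	int phase[PHASE_LOG_MAX];
	int device[PHASE_LOG_MAX];
};

/* Hypothetical stand-in for invoking one subsystem-level callback. */
static void run_callback(struct phase_log *log, int phase, int device)
{
	if (log->count < PHASE_LOG_MAX) {
		log->phase[log->count] = phase;
		log->device[log->count] = device;
		log->count++;
	}
}

/* Every device completes a phase before any device starts the next one. */
void run_transition(struct phase_log *log, int nr_phases, int nr_devices)
{
	for (int phase = 0; phase < nr_phases; phase++)
		for (int device = 0; device < nr_devices; device++)
			run_callback(log, phase, device);
}
```

With three phases and two devices the log contains six entries, and the second
phase does not start until both devices have been through the first one.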

2.4.1. System Suspend
^^^^^^^^^^^^^^^^^^^^^

When the system is going into a sleep state in which the contents of memory will
be preserved, such as one of the ACPI sleep states S1-S3, the phases are:

	prepare, suspend, suspend_noirq.

The following PCI bus type's callbacks, respectively, are used in these phases::

	pci_pm_prepare()
	pci_pm_suspend()
	pci_pm_suspend_noirq()

The pci_pm_prepare() routine first puts the device into the "fully functional"
state with the help of pm_runtime_resume().  Then, it executes the device
driver's pm->prepare() callback if defined (i.e. if the driver's struct
dev_pm_ops object is present and the prepare pointer in that object is valid).

The pci_pm_suspend() routine first checks if the device's driver implements
legacy PCI suspend routines (see Section 3), in which case the driver's legacy
suspend callback is executed, if present, and its result is returned.  Next, if
the device's driver doesn't provide a struct dev_pm_ops object (containing
pointers to the driver's callbacks), pci_pm_default_suspend() is called, which
simply turns off the device's bus master capability and runs
pcibios_disable_device() to disable it, unless the device is a bridge (PCI
bridges are ignored by this routine).  Next, the device driver's pm->suspend()
callback is executed, if defined, and its result is returned if it fails.
Finally, pci_fixup_device() is called to apply hardware suspend quirks related
to the device if necessary.

Note that the suspend phase is carried out asynchronously for PCI devices, so
the pci_pm_suspend() callback may be executed in parallel for any pair of PCI
devices that don't depend on each other in a known way (i.e. none of the paths
in the device tree from the root bridge to a leaf device contains both of them).
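
The dependency rule in the note above amounts to an ancestor check: two
devices lie on a common root-to-leaf path exactly when one of them is an
ancestor of the other.  The following userspace C sketch illustrates that
check on a tree built from parent pointers.  ``struct tree_node`` and both
function names are hypothetical, and the sketch does not reflect how the PM
core actually tracks dependencies.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal device-tree node: only the link to the parent is modeled. */
struct tree_node {
	const struct tree_node *parent;
};

/* Return true if 'anc' is a proper ancestor of 'node'. */
static bool is_ancestor(const struct tree_node *anc,
			const struct tree_node *node)
{
	for (node = node->parent; node; node = node->parent)
		if (node == anc)
			return true;
	return false;
}

/*
 * Two distinct devices may be handled in parallel if no path from the
 * root to a leaf contains both, i.e. neither is an ancestor of the other.
 */
bool may_run_in_parallel(const struct tree_node *a, const struct tree_node *b)
{
	return a != b && !is_ancestor(a, b) && !is_ancestor(b, a);
}
```

Two endpoints below the same bridge may be suspended in parallel in this
model, whereas a bridge and an endpoint below it may not.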

The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has
been called, which means that the device driver's interrupt handler won't be
invoked while this routine is running.  It first checks if the device's driver
implements legacy PCI suspend routines (Section 3), in which case the legacy
late suspend routine is called and its result is returned (the standard
configuration registers of the device are saved if the driver's callback hasn't
done that).  Second, if the device driver's struct dev_pm_ops object is not
present, the device's standard configuration registers are saved and the routine
returns success.  Otherwise the device driver's pm->suspend_noirq() callback is
executed, if present, and its result is returned if it fails.  Next, if the
device's standard configuration registers haven't been saved yet (one of the
device driver's callbacks executed before might do that), pci_pm_suspend_noirq()
saves them, prepares the device to signal wakeup (if necessary) and puts it into
a low-power state.
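
The last two steps of that routine hinge on whether the configuration
registers have already been saved.  The userspace C sketch below models only
the branch in which the driver provides a struct dev_pm_ops object; the legacy
path and the "no dev_pm_ops" path are left out.  ``struct noirq_model_dev``,
its fields and the example callback are hypothetical, and the subsystem's work
is reduced to flag updates.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a device; this is not struct pci_dev. */
struct noirq_model_dev {
	int (*drv_suspend_noirq)(struct noirq_model_dev *dev);
	bool state_saved;
	bool wakeup_prepared;
	int state;			/* 0 = D0 ... 3 = D3hot */
};

/* Example driver callback that saves state and picks D1 on its own. */
int drv_handles_power(struct noirq_model_dev *dev)
{
	dev->state_saved = true;
	dev->state = 1;
	return 0;
}

int model_suspend_noirq(struct noirq_model_dev *dev, int target_state)
{
	if (dev->drv_suspend_noirq) {
		int err = dev->drv_suspend_noirq(dev);

		if (err)
			return err;	/* a failing callback ends the routine */
	}

	/* Only act if no driver callback has saved the registers already. */
	if (!dev->state_saved) {
		dev->state_saved = true;
		dev->wakeup_prepared = true;
		dev->state = target_state;
	}

	return 0;
}
```

When the example driver callback has already saved the registers, the model
leaves the wakeup setup and the power state exactly as the driver left them.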

The low-power state to put the device into is the lowest-power (highest number)
state from which it can signal wakeup while the system is in the target sleep
state.  Just like in the runtime PM case described above, the mechanism of
signaling wakeup is system-dependent and determined by the PCI subsystem, which
is also responsible for preparing the device to signal wakeup from the system's
target sleep state as appropriate.

PCI device drivers (that don't implement legacy power management callbacks) are
generally not expected to prepare devices for signaling wakeup or to put them
into low-power states.  However, if one of the driver's suspend callbacks
(pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration
registers, pci_pm_suspend_noirq() will assume that the device has been prepared
to signal wakeup and put into a low-power state by the driver (the driver is
then assumed to have used the helper functions provided by the PCI subsystem for
this purpose).  PCI device drivers are not encouraged to do that, but in some
rare cases doing that in the driver may be the optimum approach.

2.4.2. System Resume
^^^^^^^^^^^^^^^^^^^^

When the system is undergoing a transition from a sleep state in which the
contents of memory have been preserved, such as one of the ACPI sleep states
S1-S3, into the working state (ACPI S0), the phases are:

	resume_noirq, resume, complete.

The following PCI bus type's callbacks, respectively, are executed in these
phases::

	pci_pm_resume_noirq()
	pci_pm_resume()
	pci_pm_complete()

The pci_pm_resume_noirq() routine first puts the device into the full-power
state, restores its standard configuration registers and applies early resume
hardware quirks related to the device, if necessary.  This is done
unconditionally, regardless of whether or not the device's driver implements
legacy PCI power management callbacks (this way all PCI devices are in the
full-power state and their standard configuration registers have been restored
when their interrupt handlers are invoked for the first time during resume,
which allows the kernel to avoid problems with the handling of shared interrupts
by drivers whose devices are still suspended).  If legacy PCI power management
callbacks (see Section 3) are implemented by the device's driver, the legacy
early resume callback is executed and its result is returned.  Otherwise, the
device driver's pm->resume_noirq() callback is executed, if defined, and its
result is returned.

The pci_pm_resume() routine first checks if the device's standard configuration
registers have been restored and restores them if that's not the case (this
only is necessary in the error path during a failing suspend).  Next, resume
hardware quirks related to the device are applied, if necessary, and if the
device's driver implements legacy PCI power management callbacks (see
Section 3), the driver's legacy resume callback is executed and its result is
returned.  Otherwise, the device's wakeup signaling mechanisms are blocked and
its driver's pm->resume() callback is executed, if defined (the callback's
result is then returned).

The resume phase is carried out asynchronously for PCI devices, like the
suspend phase described above, which means that if two PCI devices don't depend
on each other in a known way, the pci_pm_resume() routine may be executed for
both of them in parallel.

The pci_pm_complete() routine only executes the device driver's pm->complete()
callback, if defined.

 5522.4.3. System Hibernation
 553^^^^^^^^^^^^^^^^^^^^^^^^^
 554
 555System hibernation is more complicated than system suspend, because it requires
 556a system image to be created and written into a persistent storage medium.  The
 557image is created atomically and all devices are quiesced, or frozen, before that
 558happens.
 559
 560The freezing of devices is carried out after enough memory has been freed (at
 561the time of this writing the image creation requires at least 50% of system RAM
 562to be free) in the following three phases:
 563
 564	prepare, freeze, freeze_noirq
 565
 566that correspond to the PCI bus type's callbacks::
 567
 568	pci_pm_prepare()
 569	pci_pm_freeze()
 570	pci_pm_freeze_noirq()
 571
 572This means that the prepare phase is exactly the same as for system suspend.
 573The other two phases, however, are different.
 574
 575The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs
 576the device driver's pm->freeze() callback, if defined, instead of pm->suspend(),
 577and it doesn't apply the suspend-related hardware quirks.  It is executed
 578asynchronously for different PCI devices that don't depend on each other in a
 579known way.
 580
 581The pci_pm_freeze_noirq() routine, in turn, is similar to
 582pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq()
 583routine instead of pm->suspend_noirq().  It also doesn't attempt to prepare the
 584device for signaling wakeup and put it into a low-power state.  Still, it saves
 585the device's standard configuration registers if they haven't been saved by one
 586of the driver's callbacks.
 587
 588Once the image has been created, it has to be saved.  However, at this point all
 589devices are frozen and they cannot handle I/O, while their ability to handle
 590I/O is obviously necessary for the image saving.  Thus they have to be brought
 591back to the fully functional state and this is done in the following phases:
 592
 593	thaw_noirq, thaw, complete
 594
 595using the following PCI bus type's callbacks::
 596
 597	pci_pm_thaw_noirq()
 598	pci_pm_thaw()
 599	pci_pm_complete()
 600
 601respectively.
 602
 603The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq().
 604It puts the device into the full power state and restores its standard
 605configuration registers.  It also executes the device driver's pm->thaw_noirq()
 606callback, if defined, instead of pm->resume_noirq().
 607
 608The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device
 609driver's pm->thaw() callback instead of pm->resume().  It is executed
 610asynchronously for different PCI devices that don't depend on each other in a
 611known way.
 612
 613The complete phase is the same as for system resume.
 614
 615After saving the image, devices need to be powered down before the system can
 616enter the target sleep state (ACPI S4 for ACPI-based systems).  This is done in
 617three phases:
 618
 619	prepare, poweroff, poweroff_noirq
 620
 621where the prepare phase is exactly the same as for system suspend.  The other
 622two phases are analogous to the suspend and suspend_noirq phases, respectively.
 623The PCI subsystem-level callbacks they correspond to::
 624
 625	pci_pm_poweroff()
 626	pci_pm_poweroff_noirq()
 627
 628work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively,
 629although they don't attempt to save the device's standard configuration
 630registers.
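
The three groups of phases used during hibernation are spread over several
paragraphs above, so the following table-driven sketch collects them in one
place, in the order in which they run.  The enum, the structure and the
accessor functions are hypothetical names introduced for this illustration
only; the phase and callback names are the ones listed in the text.

```c
#include <stddef.h>

enum hib_stage { HIB_FREEZE, HIB_THAW, HIB_POWEROFF };

struct hib_phase {
	enum hib_stage stage;
	const char *phase;        /* PM core phase name */
	const char *pci_callback; /* PCI bus type callback run in that phase */
};

static const struct hib_phase hibernation_phases[] = {
	/* Quiesce devices before the image is created. */
	{ HIB_FREEZE,   "prepare",        "pci_pm_prepare()" },
	{ HIB_FREEZE,   "freeze",         "pci_pm_freeze()" },
	{ HIB_FREEZE,   "freeze_noirq",   "pci_pm_freeze_noirq()" },
	/* Bring devices back so that the image can be written out. */
	{ HIB_THAW,     "thaw_noirq",     "pci_pm_thaw_noirq()" },
	{ HIB_THAW,     "thaw",           "pci_pm_thaw()" },
	{ HIB_THAW,     "complete",       "pci_pm_complete()" },
	/* Power devices down before entering the target sleep state. */
	{ HIB_POWEROFF, "prepare",        "pci_pm_prepare()" },
	{ HIB_POWEROFF, "poweroff",       "pci_pm_poweroff()" },
	{ HIB_POWEROFF, "poweroff_noirq", "pci_pm_poweroff_noirq()" },
};

/* Number of phases a device goes through during hibernation. */
size_t hibernation_phase_count(void)
{
	return sizeof(hibernation_phases) / sizeof(hibernation_phases[0]);
}

/* Returns the i-th phase in execution order, or NULL past the end. */
const struct hib_phase *hibernation_phase(size_t i)
{
	return i < hibernation_phase_count() ? &hibernation_phases[i] : NULL;
}
```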
 631
 6322.4.4. System Restore
 633^^^^^^^^^^^^^^^^^^^^^
 634
 635System restore requires a hibernation image to be loaded into memory and the
 636pre-hibernation memory contents to be restored before the pre-hibernation system
 637activity can be resumed.
 638
 639As described in Documentation/driver-api/pm/devices.rst, the hibernation image
 640is loaded into memory by a fresh instance of the kernel, called the boot kernel,
 641which in turn is loaded and run by a boot loader in the usual way.  After the
 642boot kernel has loaded the image, it needs to replace its own code and data with
 643the code and data of the "hibernated" kernel stored within the image, called the
 644image kernel.  For this purpose all devices are frozen just like before creating
 645the image during hibernation, in the
 646
 647	prepare, freeze, freeze_noirq
 648
 649phases described above.  However, the devices affected by these phases are only
 650those having drivers in the boot kernel; other devices will still be in whatever
 651state the boot loader left them.
 652
 653Should the restoration of the pre-hibernation memory contents fail, the boot
 654kernel would go through the "thawing" procedure described above, using the
 655thaw_noirq, thaw, and complete phases (that will only affect the devices having
 656drivers in the boot kernel), and then continue running normally.
 657
 658If the pre-hibernation memory contents are restored successfully, which is the
 659usual situation, control is passed to the image kernel, which then becomes
 660responsible for bringing the system back to the working state.  To achieve this,
 661it must restore the devices' pre-hibernation functionality, which is done much
 662like waking up from the memory sleep state, although it involves different
 663phases:
 664
 665	restore_noirq, restore, complete
 666
 667The first two of these are analogous to the resume_noirq and resume phases
 668described above, respectively, and correspond to the following PCI subsystem
 669callbacks::
 670
 671	pci_pm_restore_noirq()
 672	pci_pm_restore()
 673
 674These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(),
 675respectively, but they execute the device driver's pm->restore_noirq() and
 676pm->restore() callbacks, if available.
 677
 678The complete phase is carried out in exactly the same way as during system
 679resume.
 680
 681
 6823. PCI Device Drivers and Power Management
 683==========================================
 684
 6853.1. Power Management Callbacks
 686-------------------------------
 687
 688PCI device drivers participate in power management by providing callbacks to be
 689executed by the PCI subsystem's power management routines described above and by
 690controlling the runtime power management of their devices.
 691
At the time of this writing there are two ways to define power management
callbacks for a PCI device driver: the recommended one, based on the
dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst, and
the "legacy" one, in which the .suspend() and .resume() callbacks from struct
pci_driver are used.  The legacy approach, however, doesn't allow one to define
runtime power management callbacks and is not really suitable for any new
drivers.  Therefore it is not covered by this document (refer to the source code
to learn more about it).
 700
 701It is recommended that all PCI device drivers define a struct dev_pm_ops object
 702containing pointers to power management (PM) callbacks that will be executed by
 703the PCI subsystem's PM routines in various circumstances.  A pointer to the
 704driver's struct dev_pm_ops object has to be assigned to the driver.pm field in
 705its struct pci_driver object.  Once that has happened, the "legacy" PM callbacks
 706in struct pci_driver are ignored (even if they are not NULL).
 707
 708The PM callbacks in struct dev_pm_ops are not mandatory and if they are not
 709defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI
 710subsystem will handle the device in a simplified default manner.  If they are
 711defined, though, they are expected to behave as described in the following
 712subsections.
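
As a rough sketch of the recommended arrangement, a driver could wire up its
callbacks as shown below.  The driver name "foo" and its callbacks are
hypothetical, and the structure definitions at the top are cut-down stand-ins
that only exist so that the example builds outside the kernel tree; a real
driver gets the full definitions from the kernel headers.

```c
/*
 * Minimal stand-ins for illustration.  The kernel's structures have many
 * more members than the ones shown here.
 */
struct device { void *driver_data; };

struct dev_pm_ops {
	int (*suspend)(struct device *dev);
	int (*resume)(struct device *dev);
	int (*freeze)(struct device *dev);
};

struct device_driver { const struct dev_pm_ops *pm; };

struct pci_driver {
	const char *name;
	struct device_driver driver;
};

/* Hypothetical driver "foo": quiesce on suspend, reprogram on resume. */
static int foo_suspend(struct device *dev)
{
	(void)dev;	/* stop DMA, wait for pending I/O, ... */
	return 0;
}

static int foo_resume(struct device *dev)
{
	(void)dev;	/* reprogram device-specific registers, ... */
	return 0;
}

static const struct dev_pm_ops foo_pm_ops = {
	.suspend = foo_suspend,
	.resume  = foo_resume,
	/* .freeze is left unset: the subsystem's default handling applies. */
};

struct pci_driver foo_pci_driver = {
	.name      = "foo",
	.driver.pm = &foo_pm_ops,	/* the driver.pm field mentioned above */
};
```

Leaving a member unset, as with .freeze here, is the case in which the PCI
subsystem falls back to its simplified default handling.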
 713
 7143.1.1. prepare()
 715^^^^^^^^^^^^^^^^
 716
 717The prepare() callback is executed during system suspend, during hibernation
 718(when a hibernation image is about to be created), during power-off after
 719saving a hibernation image and during system restore, when a hibernation image
 720has just been loaded into memory.
 721
 722This callback is only necessary if the driver's device has children that in
 723general may be registered at any time.  In that case the role of the prepare()
 724callback is to prevent new children of the device from being registered until
 725one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run.
 726
 727In addition to that the prepare() callback may carry out some operations
 728preparing the device to be suspended, although it should not allocate memory
 729(if additional memory is required to suspend the device, it has to be
 730preallocated earlier, for example in a suspend/hibernate notifier as described
 731in Documentation/driver-api/pm/notifiers.rst).
 732
 7333.1.2. suspend()
 734^^^^^^^^^^^^^^^^
 735
 736The suspend() callback is only executed during system suspend, after prepare()
 737callbacks have been executed for all devices in the system.
 738
This callback is expected to quiesce the device and prepare it to be put into a
low-power state by the PCI subsystem.  It is not required (in fact, it is not
even recommended) that a PCI driver's suspend() callback save the standard
configuration registers of the device, prepare it for waking up the system, or
put it into a low-power state.  All of these operations can be taken care of by
the PCI subsystem, without the driver's participation.
 745
However, in some rare cases it is convenient to carry out these operations in
a PCI driver.  Then, pci_save_state(), pci_prepare_to_sleep(), and
pci_set_power_state() should be used to save the device's standard configuration
registers, to prepare it for system wakeup (if necessary), and to put it into a
low-power state, respectively.  Moreover, if the driver calls pci_save_state(),
the PCI subsystem will execute neither pci_prepare_to_sleep() nor
pci_set_power_state() for its device, so the driver is then responsible for
handling the device as appropriate.
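
The division of labour between the driver and the PCI subsystem can be modelled
in user space as follows.  This is a sketch under stated assumptions: struct
model_pci_dev and all model_* functions are hypothetical and merely mimic the
rule that saving the state in the driver makes the subsystem leave the wakeup
setup and the power state transition to the driver.

```c
/* Hypothetical model of the per-device state relevant to suspend. */
struct model_pci_dev {
	int state_saved;   /* standard config registers have been saved */
	int wakeup_armed;  /* device prepared for signaling wakeup */
	int low_power;     /* device has been put into a low-power state */
};

/* Stand-in for pci_save_state(): records that the registers were saved. */
void model_save_state(struct model_pci_dev *pdev)
{
	pdev->state_saved = 1;
}

/* Recommended driver suspend(): quiesce only, leave the rest to the bus. */
int model_suspend_default(struct model_pci_dev *pdev)
{
	(void)pdev;	/* stop I/O, wait for pending work, ... */
	return 0;
}

/* Rare case: the driver saves the state and handles power itself. */
int model_suspend_driver_managed(struct model_pci_dev *pdev)
{
	model_save_state(pdev);
	pdev->wakeup_armed = 0;	/* assume this device is no wakeup source */
	pdev->low_power = 1;	/* the driver picked the low-power state */
	return 0;
}

/* What the subsystem does once the driver's callbacks have run. */
void model_subsystem_finish_suspend(struct model_pci_dev *pdev)
{
	if (pdev->state_saved)
		return;		/* driver took over: nothing more is done */

	model_save_state(pdev);
	pdev->wakeup_armed = 1;	/* set up wakeup signaling */
	pdev->low_power = 1;	/* put the device into a low-power state */
}
```

With model_suspend_default() the subsystem performs all three steps; with
model_suspend_driver_managed() it finds the state already saved and leaves the
device exactly as the driver configured it.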
 754
 755While the suspend() callback is being executed, the driver's interrupt handler
 756can be invoked to handle an interrupt from the device, so all suspend-related
 757operations relying on the driver's ability to handle interrupts should be
 758carried out in this callback.
 759
 7603.1.3. suspend_noirq()
 761^^^^^^^^^^^^^^^^^^^^^^
 762
 763The suspend_noirq() callback is only executed during system suspend, after
 764suspend() callbacks have been executed for all devices in the system and
 765after device interrupts have been disabled by the PM core.
 766
 767The difference between suspend_noirq() and suspend() is that the driver's
 768interrupt handler will not be invoked while suspend_noirq() is running.  Thus
 769suspend_noirq() can carry out operations that would cause race conditions to
 770arise if they were performed in suspend().
 771
 7723.1.4. freeze()
 773^^^^^^^^^^^^^^^
 774
 775The freeze() callback is hibernation-specific and is executed in two situations,
 776during hibernation, after prepare() callbacks have been executed for all devices
 777in preparation for the creation of a system image, and during restore,
 778after a system image has been loaded into memory from persistent storage and the
 779prepare() callbacks have been executed for all devices.
 780
 781The role of this callback is analogous to the role of the suspend() callback
 782described above.  In fact, they only need to be different in the rare cases when
 783the driver takes the responsibility for putting the device into a low-power
 784state.
 785
In those cases the freeze() callback should not prepare the device for system
wakeup or put it into a low-power state.  Still, either it or freeze_noirq()
should save the device's standard configuration registers using pci_save_state().
 789
 7903.1.5. freeze_noirq()
 791^^^^^^^^^^^^^^^^^^^^^
 792
 793The freeze_noirq() callback is hibernation-specific.  It is executed during
 794hibernation, after prepare() and freeze() callbacks have been executed for all
 795devices in preparation for the creation of a system image, and during restore,
 796after a system image has been loaded into memory and after prepare() and
 797freeze() callbacks have been executed for all devices.  It is always executed
 798after device interrupts have been disabled by the PM core.
 799
 800The role of this callback is analogous to the role of the suspend_noirq()
 801callback described above and it very rarely is necessary to define
 802freeze_noirq().
 803
 804The difference between freeze_noirq() and freeze() is analogous to the
 805difference between suspend_noirq() and suspend().
 806
 8073.1.6. poweroff()
 808^^^^^^^^^^^^^^^^^
 809
 810The poweroff() callback is hibernation-specific.  It is executed when the system
 811is about to be powered off after saving a hibernation image to a persistent
 812storage.  prepare() callbacks are executed for all devices before poweroff() is
 813called.
 814
 815The role of this callback is analogous to the role of the suspend() and freeze()
 816callbacks described above, although it does not need to save the contents of
 817the device's registers.  In particular, if the driver wants to put the device
 818into a low-power state itself instead of allowing the PCI subsystem to do that,
 819the poweroff() callback should use pci_prepare_to_sleep() and
 820pci_set_power_state() to prepare the device for system wakeup and to put it
 821into a low-power state, respectively, but it need not save the device's standard
 822configuration registers.
 823
 8243.1.7. poweroff_noirq()
 825^^^^^^^^^^^^^^^^^^^^^^^
 826
 827The poweroff_noirq() callback is hibernation-specific.  It is executed after
 828poweroff() callbacks have been executed for all devices in the system.
 829
 830The role of this callback is analogous to the role of the suspend_noirq() and
 831freeze_noirq() callbacks described above, but it does not need to save the
 832contents of the device's registers.
 833
 834The difference between poweroff_noirq() and poweroff() is analogous to the
 835difference between suspend_noirq() and suspend().
 836
 8373.1.8. resume_noirq()
 838^^^^^^^^^^^^^^^^^^^^^
 839
 840The resume_noirq() callback is only executed during system resume, after the
 841PM core has enabled the non-boot CPUs.  The driver's interrupt handler will not
 842be invoked while resume_noirq() is running, so this callback can carry out
 843operations that might race with the interrupt handler.
 844
 845Since the PCI subsystem unconditionally puts all devices into the full power
 846state in the resume_noirq phase of system resume and restores their standard
 847configuration registers, resume_noirq() is usually not necessary.  In general
 848it should only be used for performing operations that would lead to race
 849conditions if carried out by resume().
 850
 8513.1.9. resume()
 852^^^^^^^^^^^^^^^
 853
 854The resume() callback is only executed during system resume, after
 855resume_noirq() callbacks have been executed for all devices in the system and
 856device interrupts have been enabled by the PM core.
 857
 858This callback is responsible for restoring the pre-suspend configuration of the
 859device and bringing it back to the fully functional state.  The device should be
 860able to process I/O in a usual way after resume() has returned.
 861
 8623.1.10. thaw_noirq()
 863^^^^^^^^^^^^^^^^^^^^
 864
 865The thaw_noirq() callback is hibernation-specific.  It is executed after a
 866system image has been created and the non-boot CPUs have been enabled by the PM
 867core, in the thaw_noirq phase of hibernation.  It also may be executed if the
 868loading of a hibernation image fails during system restore (it is then executed
 869after enabling the non-boot CPUs).  The driver's interrupt handler will not be
 870invoked while thaw_noirq() is running.
 871
 872The role of this callback is analogous to the role of resume_noirq().  The
 873difference between these two callbacks is that thaw_noirq() is executed after
 874freeze() and freeze_noirq(), so in general it does not need to modify the
 875contents of the device's registers.
 876
 8773.1.11. thaw()
 878^^^^^^^^^^^^^^
 879
 880The thaw() callback is hibernation-specific.  It is executed after thaw_noirq()
 881callbacks have been executed for all devices in the system and after device
 882interrupts have been enabled by the PM core.
 883
 884This callback is responsible for restoring the pre-freeze configuration of
 885the device, so that it will work in a usual way after thaw() has returned.
 886
 8873.1.12. restore_noirq()
 888^^^^^^^^^^^^^^^^^^^^^^^
 889
 890The restore_noirq() callback is hibernation-specific.  It is executed in the
 891restore_noirq phase of hibernation, when the boot kernel has passed control to
 892the image kernel and the non-boot CPUs have been enabled by the image kernel's
 893PM core.
 894
 895This callback is analogous to resume_noirq() with the exception that it cannot
 896make any assumption on the previous state of the device, even if the BIOS (or
 897generally the platform firmware) is known to preserve that state over a
 898suspend-resume cycle.
 899
 900For the vast majority of PCI device drivers there is no difference between
 901resume_noirq() and restore_noirq().
 902
 9033.1.13. restore()
 904^^^^^^^^^^^^^^^^^
 905
 906The restore() callback is hibernation-specific.  It is executed after
 907restore_noirq() callbacks have been executed for all devices in the system and
 908after the PM core has enabled device drivers' interrupt handlers to be invoked.
 909
 910This callback is analogous to resume(), just like restore_noirq() is analogous
 911to resume_noirq().  Consequently, the difference between restore_noirq() and
 912restore() is analogous to the difference between resume_noirq() and resume().
 913
 914For the vast majority of PCI device drivers there is no difference between
 915resume() and restore().
 916
 9173.1.14. complete()
 918^^^^^^^^^^^^^^^^^^
 919
 920The complete() callback is executed in the following situations:
 921
 922  - during system resume, after resume() callbacks have been executed for all
 923    devices,
 924  - during hibernation, before saving the system image, after thaw() callbacks
 925    have been executed for all devices,
 926  - during system restore, when the system is going back to its pre-hibernation
 927    state, after restore() callbacks have been executed for all devices.
 928
 929It also may be executed if the loading of a hibernation image into memory fails
 930(in that case it is run after thaw() callbacks have been executed for all
 931devices that have drivers in the boot kernel).
 932
 933This callback is entirely optional, although it may be necessary if the
 934prepare() callback performs operations that need to be reversed.
 935
 9363.1.15. runtime_suspend()
 937^^^^^^^^^^^^^^^^^^^^^^^^^
 938
 939The runtime_suspend() callback is specific to device runtime power management
 940(runtime PM).  It is executed by the PM core's runtime PM framework when the
 941device is about to be suspended (i.e. quiesced and put into a low-power state)
 942at run time.
 943
 944This callback is responsible for freezing the device and preparing it to be
 945put into a low-power state, but it must allow the PCI subsystem to perform all
 946of the PCI-specific actions necessary for suspending the device.
 947
 9483.1.16. runtime_resume()
 949^^^^^^^^^^^^^^^^^^^^^^^^
 950
 951The runtime_resume() callback is specific to device runtime PM.  It is executed
 952by the PM core's runtime PM framework when the device is about to be resumed
 953(i.e. put into the full-power state and programmed to process I/O normally) at
 954run time.
 955
 956This callback is responsible for restoring the normal functionality of the
 957device after it has been put into the full-power state by the PCI subsystem.
 958The device is expected to be able to process I/O in the usual way after
 959runtime_resume() has returned.
 960
 9613.1.17. runtime_idle()
 962^^^^^^^^^^^^^^^^^^^^^^
 963
 964The runtime_idle() callback is specific to device runtime PM.  It is executed
 965by the PM core's runtime PM framework whenever it may be desirable to suspend
 966the device according to the PM core's information.  In particular, it is
 967automatically executed right after runtime_resume() has returned in case the
 968resume of the device has happened as a result of a spurious event.
 969
 970This callback is optional, but if it is not implemented or if it returns 0, the
 971PCI subsystem will call pm_runtime_suspend() for the device, which in turn will
 972cause the driver's runtime_suspend() callback to be executed.
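
The effect of the runtime_idle() return value can be shown with a short
user-space model.  It is a sketch only: struct model_rpm_dev,
model_handle_idle() and demo_runtime_idle() are hypothetical names, and the
choice of -EBUSY as the "do not suspend now" value is an assumption of this
example (any nonzero value has the same effect in the model).

```c
#include <errno.h>

/* Hypothetical model of a device taking part in runtime PM. */
struct model_rpm_dev {
	int busy;       /* device still has work queued */
	int suspended;  /* set once a runtime suspend has been started */
	int (*runtime_idle)(struct model_rpm_dev *dev); /* may be unset */
};

/* Example callback: veto the suspend while the device is busy. */
int demo_runtime_idle(struct model_rpm_dev *dev)
{
	return dev->busy ? -EBUSY : 0;
}

/*
 * Models the idle check: a missing callback or a return value of 0 leads
 * to a suspend of the device, anything else leaves it active.
 */
void model_handle_idle(struct model_rpm_dev *dev)
{
	int ret = dev->runtime_idle ? dev->runtime_idle(dev) : 0;

	if (ret == 0)
		dev->suspended = 1;
}
```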
 973
 9743.1.18. Pointing Multiple Callback Pointers to One Routine
 975^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 976
 977Although in principle each of the callbacks described in the previous
 978subsections can be defined as a separate function, it often is convenient to
 979point two or more members of struct dev_pm_ops to the same routine.  There are
 980a few convenience macros that can be used for this purpose.
 981
The DEFINE_SIMPLE_DEV_PM_OPS() macro declares a struct dev_pm_ops object with
one suspend routine pointed to by the .suspend(), .freeze(), and .poweroff()
members and one resume routine pointed to by the .resume(), .thaw(), and
.restore() members.  The other function pointers in this struct dev_pm_ops are
unset.
 987
The DEFINE_RUNTIME_DEV_PM_OPS() macro is similar to DEFINE_SIMPLE_DEV_PM_OPS(),
but the routines passed to it are used as the .runtime_suspend(),
.runtime_resume(), and .runtime_idle() callbacks, while the system sleep members
are pointed to pm_runtime_force_suspend() and pm_runtime_force_resume().
 991
The SYSTEM_SLEEP_PM_OPS() macro can be used inside of a declaration of struct
dev_pm_ops to indicate that one suspend routine is to be pointed to by the
.suspend(), .freeze(), and .poweroff() members and one resume routine is to
be pointed to by the .resume(), .thaw(), and .restore() members.
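
To make the pointer sharing concrete, the sketch below uses simplified replicas
of the two helpers.  The MODEL_* macros, the cut-down struct dev_pm_ops and the
foo_* names are assumptions of this example; the kernel's own macros are
defined in its headers and contain additional details that are omitted here.

```c
struct device;	/* opaque here; the callbacks only receive a pointer */

/* Stand-in holding just the six system sleep members discussed above. */
struct dev_pm_ops {
	int (*suspend)(struct device *dev);
	int (*resume)(struct device *dev);
	int (*freeze)(struct device *dev);
	int (*thaw)(struct device *dev);
	int (*poweroff)(struct device *dev);
	int (*restore)(struct device *dev);
};

/* Replica: one suspend routine and one resume routine, three members each. */
#define MODEL_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	.suspend  = (suspend_fn), \
	.resume   = (resume_fn), \
	.freeze   = (suspend_fn), \
	.thaw     = (resume_fn), \
	.poweroff = (suspend_fn), \
	.restore  = (resume_fn),

/* Replica: declares a whole object using the helper above. */
#define MODEL_DEFINE_SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \
	const struct dev_pm_ops name = { \
		MODEL_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	}

/* Call counters, only here to make the sharing observable. */
int foo_suspend_calls;
int foo_resume_calls;

static int foo_suspend(struct device *dev)
{
	(void)dev;
	foo_suspend_calls++;
	return 0;
}

static int foo_resume(struct device *dev)
{
	(void)dev;
	foo_resume_calls++;
	return 0;
}

MODEL_DEFINE_SIMPLE_DEV_PM_OPS(foo_pm_ops, foo_suspend, foo_resume);
```

Calling foo_pm_ops.freeze() or foo_pm_ops.poweroff() in this model ends up in
foo_suspend(), and foo_pm_ops.thaw() or foo_pm_ops.restore() in foo_resume().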
 996
 9973.1.19. Driver Flags for Power Management
 998^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 999
1000The PM core allows device drivers to set flags that influence the handling of
1001power management for the devices by the core itself and by middle layer code
1002including the PCI bus type.  The flags should be set once at the driver probe
1003time with the help of the dev_pm_set_driver_flags() function and they should not
1004be updated directly afterwards.
1005
1006The DPM_FLAG_NO_DIRECT_COMPLETE flag prevents the PM core from using the
1007direct-complete mechanism allowing device suspend/resume callbacks to be skipped
1008if the device is in runtime suspend when the system suspend starts.  That also
1009affects all of the ancestors of the device, so this flag should only be used if
1010absolutely necessary.
1011
1012The DPM_FLAG_SMART_PREPARE flag causes the PCI bus type to return a positive
1013value from pci_pm_prepare() only if the ->prepare callback provided by the
1014driver of the device returns a positive value.  That allows the driver to opt
1015out from using the direct-complete mechanism dynamically (whereas setting
1016DPM_FLAG_NO_DIRECT_COMPLETE means permanent opt-out).
1017
1018The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver's
1019perspective the device can be safely left in runtime suspend during system
1020suspend.  That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff()
1021to avoid resuming the device from runtime suspend unless there are PCI-specific
1022reasons for doing that.  Also, it causes pci_pm_suspend_late/noirq() and
1023pci_pm_poweroff_late/noirq() to return early if the device remains in runtime
1024suspend during the "late" phase of the system-wide transition under way.
1025Moreover, if the device is in runtime suspend in pci_pm_resume_noirq() or
1026pci_pm_restore_noirq(), its runtime PM status will be changed to "active" (as it
1027is going to be put into D0 going forward).
1028
1029Setting the DPM_FLAG_MAY_SKIP_RESUME flag means that the driver allows its
1030"noirq" and "early" resume callbacks to be skipped if the device can be left
1031in suspend after a system-wide transition into the working state.  This flag is
1032taken into consideration by the PM core along with the power.may_skip_resume
1033status bit of the device which is set by pci_pm_suspend_noirq() in certain
1034situations.  If the PM core determines that the driver's "noirq" and "early"
1035resume callbacks should be skipped, the dev_pm_skip_resume() helper function
1036will return "true" and that will cause pci_pm_resume_noirq() and
1037pci_pm_resume_early() to return upfront without touching the device and
1038executing the driver callbacks.
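
A driver would typically set its flags once, from its probe callback.  The
sketch below shows the shape of such a call.  The foo_* functions are
hypothetical, and the definitions at the top are stand-ins that let the example
build outside the kernel tree; in particular, the numeric flag values are
illustrative and should be taken from the kernel headers in real code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for illustration; real code includes the kernel headers. */
#define DPM_FLAG_NO_DIRECT_COMPLETE	(1u << 0)
#define DPM_FLAG_SMART_PREPARE		(1u << 1)
#define DPM_FLAG_SMART_SUSPEND		(1u << 2)
#define DPM_FLAG_MAY_SKIP_RESUME	(1u << 3)

struct dev_pm_info { uint32_t driver_flags; };
struct device { struct dev_pm_info power; };
struct pci_dev { struct device dev; };
struct pci_device_id;

static void dev_pm_set_driver_flags(struct device *dev, uint32_t flags)
{
	dev->power.driver_flags = flags;
}

static bool dev_pm_test_driver_flags(struct device *dev, uint32_t flags)
{
	return (dev->power.driver_flags & flags) != 0;
}

/* Hypothetical probe callback: the flags are set once and then left alone. */
int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	(void)id;

	/* The device may stay runtime-suspended over a system suspend. */
	dev_pm_set_driver_flags(&pdev->dev,
				DPM_FLAG_SMART_SUSPEND |
				DPM_FLAG_MAY_SKIP_RESUME);
	return 0;
}

/* Hypothetical helper used to inspect the result. */
bool foo_has_flag(struct pci_dev *pdev, uint32_t flag)
{
	return dev_pm_test_driver_flags(&pdev->dev, flag);
}
```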
1039
10403.2. Device Runtime Power Management
1041------------------------------------
1042
1043In addition to providing device power management callbacks PCI device drivers
1044are responsible for controlling the runtime power management (runtime PM) of
1045their devices.
1046
1047The PCI device runtime PM is optional, but it is recommended that PCI device
1048drivers implement it at least in the cases where there is a reliable way of
1049verifying that the device is not used (like when the network cable is detached
1050from an Ethernet adapter or there are no devices attached to a USB controller).
1051
1052To support the PCI runtime PM the driver first needs to implement the
1053runtime_suspend() and runtime_resume() callbacks.  It also may need to implement
1054the runtime_idle() callback to prevent the device from being suspended again
1055every time right after the runtime_resume() callback has returned
1056(alternatively, the runtime_suspend() callback will have to check if the
1057device should really be suspended and return -EAGAIN if that is not the case).
1058
1059The runtime PM of PCI devices is enabled by default by the PCI core.  PCI
1060device drivers do not need to enable it and should not attempt to do so.
1061However, it is blocked by pci_pm_init() that runs the pm_runtime_forbid()
1062helper function.  In addition to that, the runtime PM usage counter of
1063each PCI device is incremented by local_pci_probe() before executing the
1064probe callback provided by the device's driver.
1065
1066If a PCI driver implements the runtime PM callbacks and intends to use the
1067runtime PM framework provided by the PM core and the PCI subsystem, it needs
1068to decrement the device's runtime PM usage counter in its probe callback
1069function.  If it doesn't do that, the counter will always be different from
1070zero for the device and it will never be runtime-suspended.  The simplest
1071way to do that is by calling pm_runtime_put_noidle(), but if the driver
1072wants to schedule an autosuspend right away, for example, it may call
pm_runtime_put_autosuspend() instead for this purpose.  Generally, it
just needs to call a function that decrements the device's usage counter
from its probe routine to make runtime PM work for the device.
1076
1077It is important to remember that the driver's runtime_suspend() callback
1078may be executed right after the usage counter has been decremented, because
1079user space may already have caused the pm_runtime_allow() helper function
1080unblocking the runtime PM of the device to run via sysfs, so the driver must
1081be prepared to cope with that.
1082
1083The driver itself should not call pm_runtime_allow(), though.  Instead, it
1084should let user space or some platform-specific code do that (user space can
1085do it via sysfs as stated above), but it must be prepared to handle the
1086runtime PM of the device correctly as soon as pm_runtime_allow() is called
1087(which may happen at any time, even before the driver is loaded).
1088
1089When the driver's remove callback runs, it has to balance the decrementation
1090of the device's runtime PM usage counter at the probe time.  For this reason,
1091if it has decremented the counter in its probe callback, it must run
1092pm_runtime_get_noresume() in its remove callback.  [Since the core carries
1093out a runtime resume of the device and bumps up the device's usage counter
1094before running the driver's remove callback, the runtime PM of the device
1095is effectively disabled for the duration of the remove execution and all
1096runtime PM helper functions incrementing the device's usage counter are
1097then effectively equivalent to pm_runtime_get_noresume().]
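
The usage counter balance between probe and remove can be checked with the
following user-space model.  It is a sketch under several assumptions: the
structures and the two pm_runtime_*() functions are stand-ins that only count,
the foo_* callbacks belong to a hypothetical driver, and model_core_bind() and
model_core_unbind() approximate the core's behaviour described above (the real
core also resumes the device, which is not modelled).  The counter starts at
zero, as if runtime PM had already been allowed for the device.

```c
#include <assert.h>

/* Stand-ins: only the usage counter is modelled. */
struct device { int usage_count; };
struct pci_dev { struct device dev; };
struct pci_device_id;

static void pm_runtime_get_noresume(struct device *dev)
{
	dev->usage_count++;
}

static void pm_runtime_put_noidle(struct device *dev)
{
	assert(dev->usage_count > 0);	/* an unbalanced put is a bug */
	dev->usage_count--;
}

/* Hypothetical driver callbacks. */
int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	(void)id;
	/* ... device setup ... */
	pm_runtime_put_noidle(&pdev->dev);	/* opt in to runtime PM */
	return 0;
}

void foo_remove(struct pci_dev *pdev)
{
	pm_runtime_get_noresume(&pdev->dev);	/* balance the put in probe */
	/* ... device teardown ... */
}

/* Model of the core's part of binding a driver to the device. */
int model_core_bind(struct pci_dev *pdev)
{
	pm_runtime_get_noresume(&pdev->dev);	/* increment before probe */
	return foo_probe(pdev, 0);
}

/* Model of the core's part of unbinding the driver. */
void model_core_unbind(struct pci_dev *pdev)
{
	pm_runtime_get_noresume(&pdev->dev);	/* bump before remove */
	foo_remove(pdev);
	pm_runtime_put_noidle(&pdev->dev);	/* drop the remove-time bump */
	pm_runtime_put_noidle(&pdev->dev);	/* undo the probe-time increment */
}
```

If foo_remove() omitted its pm_runtime_get_noresume() call, the last put in
model_core_unbind() would find the counter already at zero and trip the
assertion, which is the imbalance the paragraph above warns about.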
1098
1099The runtime PM framework works by processing requests to suspend or resume
devices, or to check if they are idle (in which case it is reasonable to
1101subsequently request that they be suspended).  These requests are represented
1102by work items put into the power management workqueue, pm_wq.  Although there
1103are a few situations in which power management requests are automatically
1104queued by the PM core (for example, after processing a request to resume a
1105device the PM core automatically queues a request to check if the device is
1106idle), device drivers are generally responsible for queuing power management
1107requests for their devices.  For this purpose they should use the runtime PM
1108helper functions provided by the PM core, discussed in
1109Documentation/power/runtime_pm.rst.
1110
1111Devices can also be suspended and resumed synchronously, without placing a
1112request into pm_wq.  In the majority of cases this also is done by their
1113drivers that use helper functions provided by the PM core for this purpose.
1114
1115For more information on the runtime PM of devices refer to
1116Documentation/power/runtime_pm.rst.
1117
1118
11194. Resources
1120============
1121
1122PCI Local Bus Specification, Rev. 3.0
1123
1124PCI Bus Power Management Interface Specification, Rev. 1.2
1125
1126Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b
1127
1128PCI Express Base Specification, Rev. 2.0
1129
1130Documentation/driver-api/pm/devices.rst
1131
1132Documentation/power/runtime_pm.rst
v6.2
   1====================
   2PCI Power Management
   3====================
   4
   5Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
   6
   7An overview of concepts and the Linux kernel's interfaces related to PCI power
   8management.  Based on previous work by Patrick Mochel <mochel@transmeta.com>
   9(and others).
  10
  11This document only covers the aspects of power management specific to PCI
  12devices.  For general description of the kernel's interfaces related to device
  13power management refer to Documentation/driver-api/pm/devices.rst and
  14Documentation/power/runtime_pm.rst.
  15
  16.. contents:
  17
  18   1. Hardware and Platform Support for PCI Power Management
  19   2. PCI Subsystem and Device Power Management
  20   3. PCI Device Drivers and Power Management
  21   4. Resources
  22
  23
1. Hardware and Platform Support for PCI Power Management
=========================================================

1.1. Native and Platform-Based Power Management
-----------------------------------------------

In general, power management is a feature allowing one to save energy by putting
devices into states in which they draw less power (low-power states) at the
price of reduced functionality or performance.

Usually, a device is put into a low-power state when it is underutilized or
completely inactive.  However, when it is necessary to use the device once
again, it has to be put back into the "fully functional" state (full-power
state).  This may happen when there are some data for the device to handle or
as a result of an external event requiring the device to be active, which may
be signaled by the device itself.

PCI devices may be put into low-power states in two ways: by using the device
capabilities introduced by the PCI Bus Power Management Interface Specification,
or with the help of platform firmware, such as an ACPI BIOS.  In the first
approach, which is referred to as the native PCI power management (native PCI
PM) in what follows, the device power state is changed as a result of writing a
specific value into one of its standard configuration registers.  The second
approach requires the platform firmware to provide special methods that may be
used by the kernel to change the device's power state.

Devices supporting the native PCI PM usually can generate wakeup signals called
Power Management Events (PMEs) to let the kernel know about external events
requiring the device to be active.  After receiving a PME the kernel is supposed
to put the device that sent it into the full-power state.  However, the PCI Bus
Power Management Interface Specification doesn't define any standard method of
delivering the PME from the device to the CPU and the operating system kernel.
It is assumed that the platform firmware will perform this task and therefore,
even though a PCI device is set up to generate PMEs, it also may be necessary to
prepare the platform firmware for notifying the CPU of the PMEs coming from the
device (e.g. by generating interrupts).

In turn, if the methods provided by the platform firmware are used for changing
the power state of a device, usually the platform also provides a method for
preparing the device to generate wakeup signals.  In that case, however, it
often also is necessary to prepare the device for generating PMEs using the
native PCI PM mechanism, because the method provided by the platform depends on
that.

Thus in many situations both the native and the platform-based power management
mechanisms have to be used simultaneously to obtain the desired result.

1.2. Native PCI Power Management
--------------------------------

The PCI Bus Power Management Interface Specification (PCI PM Spec) was
introduced between the PCI 2.1 and PCI 2.2 Specifications.  It defined a
standard interface for performing various operations related to power
management.

The implementation of the PCI PM Spec is optional for conventional PCI devices,
but it is mandatory for PCI Express devices.  If a device supports the PCI PM
Spec, it has an 8-byte power management capability field in its PCI
configuration space.  This field is used to describe and control the standard
features related to the native PCI power management.

The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses
(B0-B3).  The higher the number, the less power is drawn by the device or bus
in that state.  However, the higher the number, the longer the latency for
the device or bus to return to the full-power state (D0 or B0, respectively).

There are two variants of the D3 state defined by the specification.  The first
one is D3hot, referred to as the software accessible D3, because devices can be
programmed to go into it.  The second one, D3cold, is the state that PCI devices
are in when the supply voltage (Vcc) is removed from them.  It is not possible
to program a PCI device to go into D3cold, although there may be a programmable
interface for putting the bus the device is on into a state in which Vcc is
removed from all devices on the bus.

PCI bus power management, however, is not supported by the Linux kernel at the
time of this writing and therefore it is not covered by this document.

Note that every PCI device can be in the full-power state (D0) or in D3cold,
regardless of whether or not it implements the PCI PM Spec.  In addition to
that, if the PCI PM Spec is implemented by the device, it must support D3hot
as well as D0.  The support for the D1 and D2 power states is optional.

PCI devices supporting the PCI PM Spec can be programmed to go to any of the
supported low-power states (except for D3cold).  While in D1-D3hot the
standard configuration registers of the device must be accessible to software
(i.e. the device is required to respond to PCI configuration accesses), although
its I/O and memory spaces are then disabled.  This allows the device to be
programmatically put into D0.  Thus the kernel can switch the device back and
forth between D0 and the supported low-power states (except for D3cold) and the
possible power state transitions the device can undergo are the following:

+---------------+------------+
| Current State | New State  |
+===============+============+
| D0            | D1, D2, D3 |
+---------------+------------+
| D1            | D2, D3     |
+---------------+------------+
| D2            | D3         |
+---------------+------------+
| D1, D2, D3    | D0         |
+---------------+------------+

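The transition table above can be expressed as a small predicate.  The
following is only a userspace sketch: the enum and the function name are
hypothetical (they are not the kernel's pci_power_t or any kernel helper), D3
stands for D3hot, and the optional nature of D1 and D2 is deliberately not
modeled.

```c
#include <stdbool.h>

/* Hypothetical state numbering for this sketch (D3HOT is "D3" in the table). */
enum d_state { D0 = 0, D1 = 1, D2 = 2, D3HOT = 3 };

/*
 * Return true if software may program a transition from 'cur' to 'next'
 * according to the table: a deeper state can be entered from any shallower
 * one, and D0 can be entered from any low-power state.
 */
bool d_transition_allowed(enum d_state cur, enum d_state next)
{
	if (next == D0)
		return cur != D0;	/* D1, D2, D3 -> D0 */

	return next > cur;		/* D0 -> D1..D3, D1 -> D2..D3, D2 -> D3 */
}
```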
The transition from D3cold to D0 occurs when the supply voltage is provided to
the device (i.e. power is restored).  In that case the device returns to D0 with
a full power-on reset sequence and the power-on defaults are restored to the
device by hardware just as at initial power up.

PCI devices supporting the PCI PM Spec can be programmed to generate PMEs
while in any power state (D0-D3), but they are not required to be capable
of generating PMEs from all supported power states.  In particular, the
capability of generating PMEs from D3cold is optional and depends on the
presence of additional voltage (3.3Vaux) allowing the device to remain
sufficiently active to generate a wakeup signal.

1.3. ACPI Device Power Management
---------------------------------

The platform firmware support for the power management of PCI devices is
system-specific.  However, if the system in question is compliant with the
Advanced Configuration and Power Interface (ACPI) Specification, like the
majority of x86-based systems, it is supposed to implement device power
management interfaces defined by the ACPI standard.

For this purpose the ACPI BIOS provides special functions called "control
methods" that may be executed by the kernel to perform specific tasks, such as
putting a device into a low-power state.  These control methods are encoded
using a special byte-code language called the ACPI Machine Language (AML) and
stored in the machine's BIOS.  The kernel loads them from the BIOS and executes
them as needed using an AML interpreter that translates the AML byte code into
computations and memory or I/O space accesses.  This way, in theory, a BIOS
writer can provide the kernel with a means to perform actions depending
on the system design in a system-specific fashion.

ACPI control methods may be divided into global control methods, that are not
associated with any particular devices, and device control methods, that have
to be defined separately for each device supposed to be handled with the help of
the platform.  This means, in particular, that ACPI device control methods can
only be used to handle devices that the BIOS writer knew about in advance.  The
ACPI methods used for device power management fall into that category.

The ACPI specification assumes that devices can be in one of four power states
labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM
D0-D3 states (although the difference between D3hot and D3cold is not taken
into account by ACPI).  Moreover, for each power state of a device there is a
set of power resources that have to be enabled for the device to be put into
that state.  These power resources are controlled (i.e. enabled or disabled)
with the help of their own control methods, _ON and _OFF, that have to be
defined individually for each of them.

To put a device into the ACPI power state Dx (where x is a number between 0 and
3 inclusive) the kernel is supposed to (1) enable the power resources required
by the device in this state using their _ON control methods and (2) execute the
_PSx control method defined for the device.  In addition to that, if the device
is going to be put into a low-power state (D1-D3) and is supposed to generate
wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI
3.0) control method defined for it has to be executed before _PSx.  Power
resources that are not required by the device in the target power state and are
not required any more by any other device should be disabled (by executing their
_OFF control methods).  If the current power state of the device is D3, it can
only be put into D0 this way.

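The rule that a power resource is only turned off once no device needs it any
more is naturally modeled with a use count.  The sketch below is a userspace
illustration under that assumption; struct power_res and both helpers are
hypothetical names, and the assignments to the "on" flag merely stand in for
evaluating the resource's _ON and _OFF control methods.

```c
#include <stdbool.h>

/* Hypothetical model of one ACPI power resource shared by several devices. */
struct power_res {
	int users;	/* number of devices currently requiring the resource */
	bool on;	/* true after _ON, false after _OFF */
};

/* A device starts requiring the resource; the first user turns it on. */
void power_res_get(struct power_res *res)
{
	if (res->users++ == 0)
		res->on = true;		/* stands in for evaluating _ON */
}

/* A device stops requiring the resource; the last user turns it off. */
void power_res_put(struct power_res *res)
{
	if (res->users > 0 && --res->users == 0)
		res->on = false;	/* stands in for evaluating _OFF */
}
```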
However, quite often the power states of devices are changed during a
system-wide transition into a sleep state or back into the working state.  ACPI
defines four system sleep states, S1, S2, S3, and S4, and denotes the system
working state as S0.  In general, the target system sleep (or working) state
determines the highest power (lowest number) state the device can be put
into and the kernel is supposed to obtain this information by executing the
device's _SxD control method (where x is a number between 0 and 4 inclusive).
If the device is required to wake up the system from the target sleep state, the
lowest power (highest number) state it can be put into is also determined by the
target state of the system.  The kernel is then supposed to use the device's
_SxW control method to obtain the number of that state.  It also is supposed to
use the device's _PRW control method to learn which power resources need to be
enabled for the device to be able to generate wakeup signals.

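Taken together, _SxD and _SxW bound the range of device power states that are
acceptable for a given system sleep state.  A minimal sketch of that check,
assuming the two values have already been obtained from the firmware and are
passed in as plain integers (the function name and its parameters are
hypothetical):

```c
#include <stdbool.h>

/*
 * Return true if device state D<d> (0..3) is acceptable for the target
 * system state.
 *
 * sxd: value reported by _SxD, the shallowest (lowest number) state allowed
 * sxw: value reported by _SxW, the deepest (highest number) state from
 *      which the device can still wake up the system
 */
bool acpi_d_state_ok(int d, int sxd, int sxw, bool wakeup_needed)
{
	if (d < sxd || d > 3)
		return false;	/* shallower than _SxD allows, or not a D-state */

	if (wakeup_needed && d > sxw)
		return false;	/* too deep to signal wakeup */

	return true;
}
```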
1.4. Wakeup Signaling
---------------------

Wakeup signals generated by PCI devices, either as native PCI PMEs, or as
a result of the execution of the _DSW (or _PSW) ACPI control method before
putting the device into a low-power state, have to be caught and handled as
appropriate.  If they are sent while the system is in the working state
(ACPI S0), they should be translated into interrupts so that the kernel can
put the devices generating them into the full-power state and take care of the
events that triggered them.  In turn, if they are sent while the system is
sleeping, they should cause the system's core logic to trigger wakeup.

On ACPI-based systems wakeup signals sent by conventional PCI devices are
converted into ACPI General-Purpose Events (GPEs) which are hardware signals
from the system core logic generated in response to various events that need to
be acted upon.  Every GPE is associated with one or more sources of potentially
interesting events.  In particular, a GPE may be associated with a PCI device
capable of signaling wakeup.  The information on the connections between GPEs
and event sources is recorded in the system's ACPI BIOS from where it can be
read by the kernel.

If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE
associated with it (if there is one) is triggered.  The GPEs associated with PCI
bridges may also be triggered in response to a wakeup signal from one of the
devices below the bridge (this also is the case for root bridges) and, for
example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be
handled this way.

A GPE may be triggered when the system is sleeping (i.e. when it is in one of
the ACPI S1-S4 states), in which case system wakeup is started by its core logic
(the device that was the source of the signal causing the system wakeup to occur
may be identified later).  The GPEs used in such situations are referred to as
wakeup GPEs.

Usually, however, GPEs are also triggered when the system is in the working
state (ACPI S0) and in that case the system's core logic generates a System
Control Interrupt (SCI) to notify the kernel of the event.  Then, the SCI
handler identifies the GPE that caused the interrupt to be generated which,
in turn, allows the kernel to identify the source of the event (that may be
a PCI device signaling wakeup).  The GPEs used for notifying the kernel of
events occurring while the system is in the working state are referred to as
runtime GPEs.

Unfortunately, there is no standard way of handling wakeup signals sent by
conventional PCI devices on systems that are not ACPI-based, but there is one
for PCI Express devices.  Namely, the PCI Express Base Specification introduced
a native mechanism for converting native PCI PMEs into interrupts generated by
root ports.  For conventional PCI devices native PMEs are out-of-band, so they
are routed separately and they need not pass through bridges (in principle they
may be routed directly to the system's core logic), but for PCI Express devices
they are in-band messages that have to pass through the PCI Express hierarchy,
including the root port on the path from the device to the Root Complex.  Thus
it was possible to introduce a mechanism by which a root port generates an
interrupt whenever it receives a PME message from one of the devices below it.
The PCI Express Requester ID of the device that sent the PME message is then
recorded in one of the root port's configuration registers from where it may be
read by the interrupt handler allowing the device to be identified.  [PME
messages sent by PCI Express endpoints integrated with the Root Complex don't
pass through root ports, but instead they cause a Root Complex Event Collector
(if there is one) to generate interrupts.]

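To illustrate how the recorded Requester ID identifies the source device, the
sketch below splits a 16-bit Requester ID into its bus, device and function
numbers.  It assumes the conventional (non-ARI) layout with the bus number in
bits 15:8, the device number in bits 7:3 and the function number in bits 2:0;
the structure and function names are hypothetical and reading the root port
register itself is not shown.

```c
#include <stdint.h>

/* Hypothetical helper type for this sketch. */
struct pcie_bdf {
	uint8_t bus;	/* Requester ID bits 15:8 */
	uint8_t dev;	/* Requester ID bits 7:3 */
	uint8_t fn;	/* Requester ID bits 2:0 */
};

/* Split a Requester ID into bus, device and function numbers. */
struct pcie_bdf decode_requester_id(uint16_t rid)
{
	struct pcie_bdf bdf = {
		.bus = (uint8_t)(rid >> 8),
		.dev = (uint8_t)((rid >> 3) & 0x1f),
		.fn  = (uint8_t)(rid & 0x07),
	};

	return bdf;
}
```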
In principle the native PCI Express PME signaling may also be used on ACPI-based
systems along with the GPEs, but to use it the kernel has to ask the system's
ACPI BIOS to release control of root port configuration registers.  The ACPI
BIOS, however, is not required to allow the kernel to control these registers
and if it doesn't do that, the kernel must not modify their contents.  Of course
the native PCI Express PME signaling cannot be used by the kernel in that case.


2. PCI Subsystem and Device Power Management
============================================

2.1. Device Power Management Callbacks
--------------------------------------

The PCI Subsystem participates in the power management of PCI devices in a
number of ways.  First of all, it provides an intermediate code layer between
the device power management core (PM core) and PCI device drivers.
Specifically, the pm field of the PCI subsystem's struct bus_type object,
pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing
pointers to several device power management callbacks::

  const struct dev_pm_ops pci_dev_pm_ops = {
	.prepare = pci_pm_prepare,
	.complete = pci_pm_complete,
	.suspend = pci_pm_suspend,
	.resume = pci_pm_resume,
	.freeze = pci_pm_freeze,
	.thaw = pci_pm_thaw,
	.poweroff = pci_pm_poweroff,
	.restore = pci_pm_restore,
	.suspend_noirq = pci_pm_suspend_noirq,
	.resume_noirq = pci_pm_resume_noirq,
	.freeze_noirq = pci_pm_freeze_noirq,
	.thaw_noirq = pci_pm_thaw_noirq,
	.poweroff_noirq = pci_pm_poweroff_noirq,
	.restore_noirq = pci_pm_restore_noirq,
	.runtime_suspend = pci_pm_runtime_suspend,
	.runtime_resume = pci_pm_runtime_resume,
	.runtime_idle = pci_pm_runtime_idle,
  };

These callbacks are executed by the PM core in various situations related to
device power management and they, in turn, execute power management callbacks
provided by PCI device drivers.  They also perform power management operations
involving some standard configuration registers of PCI devices that device
drivers need not know or care about.

The structure representing a PCI device, struct pci_dev, contains several fields
that these callbacks operate on::

  struct pci_dev {
	...
	pci_power_t     current_state;  /* Current operating state. */
	int		pm_cap;		/* PM capability offset in the
					   configuration space */
	unsigned int	pme_support:5;	/* Bitmask of states from which PME#
					   can be generated */
	unsigned int	pme_poll:1;	/* Poll device's PME status bit */
	unsigned int	d1_support:1;	/* Low power state D1 is supported */
	unsigned int	d2_support:1;	/* Low power state D2 is supported */
	unsigned int	no_d1d2:1;	/* D1 and D2 are forbidden */
	unsigned int	wakeup_prepared:1;  /* Device prepared for wake up */
	unsigned int	d3hot_delay;	/* D3hot->D0 transition time in ms */
	...
  };

They also indirectly use some fields of the struct device that is embedded in
struct pci_dev.

2.2. Device Initialization
--------------------------

The PCI subsystem's first task related to device power management is to
prepare the device for power management and initialize the fields of struct
pci_dev used for this purpose.  This happens in two functions defined in
drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init().

The first of these functions checks if the device supports native PCI PM
and if that's the case the offset of its power management capability structure
in the configuration space is stored in the pm_cap field of the device's struct
pci_dev object.  Next, the function checks which PCI low-power states are
supported by the device and from which low-power states the device can generate
native PCI PMEs.  The power management fields of the device's struct pci_dev and
the struct device embedded in it are updated accordingly and the generation of
PMEs by the device is disabled.

The second function checks if the device can be prepared to signal wakeup with
the help of the platform firmware, such as the ACPI BIOS.  If that is the case,
the function updates the wakeup fields in struct device embedded in the
device's struct pci_dev and uses the firmware-provided method to prevent the
device from signaling wakeup.

At this point the device is ready for power management.  For driverless devices,
however, this functionality is limited to a few basic operations carried out
during system-wide transitions to a sleep state and back to the working state.

2.3. Runtime Device Power Management
------------------------------------

The PCI subsystem plays a vital role in the runtime power management of PCI
devices.  For this purpose it uses the general runtime power management
(runtime PM) framework described in Documentation/power/runtime_pm.rst.
Namely, it provides subsystem-level callbacks::

	pci_pm_runtime_suspend()
	pci_pm_runtime_resume()
	pci_pm_runtime_idle()

that are executed by the core runtime PM routines.  It also implements the
entire mechanics necessary for handling runtime wakeup signals from PCI devices
in low-power states, which at the time of this writing works for both the native
PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in
Section 1.

First, a PCI device is put into a low-power state, or suspended, with the help
of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call
pci_pm_runtime_suspend() to do the actual job.  For this to work, the device's
driver has to provide a pm->runtime_suspend() callback (see below), which is
run by pci_pm_runtime_suspend() as the first action.  If the driver's callback
returns successfully, the device's standard configuration registers are saved,
the device is prepared to generate wakeup signals and, finally, it is put into
the target low-power state.

The low-power state to put the device into is the lowest-power (highest number)
state from which it can signal wakeup.  The exact method of signaling wakeup is
system-dependent and is determined by the PCI subsystem on the basis of the
reported capabilities of the device and the platform firmware.  To prepare the
device for signaling wakeup and put it into the selected low-power state, the
PCI subsystem can use the platform firmware as well as the device's native PCI
PM capabilities, if supported.

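For the native PCI PM part of that decision, picking "the lowest-power state
from which the device can signal wakeup" amounts to scanning the pme_support
bitmask from the deepest programmable state downwards.  The sketch below is a
userspace model only: it assumes that bit n of the mask corresponds to state
Dn (with D3hot as bit 3), ignores the platform firmware and the D1/D2 support
flags, and uses hypothetical names throughout.

```c
#include <stdbool.h>

/* Hypothetical state numbering for this sketch: one mask bit per state. */
enum { ST_D0, ST_D1, ST_D2, ST_D3HOT, ST_D3COLD };

/* True if the mask says PME# can be generated from 'state'. */
bool pme_capable(unsigned int pme_support, int state)
{
	return (pme_support >> state) & 1u;
}

/*
 * Deepest software-programmable state (D3hot at most) from which the device
 * can signal PME, or -1 if it cannot signal PME from any of them.
 */
int deepest_pme_state(unsigned int pme_support)
{
	for (int s = ST_D3HOT; s >= ST_D0; s--)
		if (pme_capable(pme_support, s))
			return s;

	return -1;
}
```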
It is expected that the device driver's pm->runtime_suspend() callback will
not attempt to prepare the device for signaling wakeup or to put it into a
low-power state.  The driver ought to leave these tasks to the PCI subsystem
that has all of the information necessary to perform them.

A suspended device is brought back into the "active" state, or resumed,
with the help of pm_request_resume() or pm_runtime_resume() which both call
pci_pm_runtime_resume() for PCI devices.  Again, this only works if the device's
driver provides a pm->runtime_resume() callback (see below).  However, before
the driver's callback is executed, pci_pm_runtime_resume() brings the device
back into the full-power state, prevents it from signaling wakeup while in that
state and restores its standard configuration registers.  Thus the driver's
callback need not worry about the PCI-specific aspects of the device resume.

Note that generally pci_pm_runtime_resume() may be called in two different
situations.  First, it may be called at the request of the device's driver, for
example if there are some data for it to process.  Second, it may be called
as a result of a wakeup signal from the device itself (this sometimes is
referred to as "remote wakeup").  Of course, for this purpose the wakeup signal
is handled in one of the ways described in Section 1 and finally converted into
a notification for the PCI subsystem after the source device has been
identified.

The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle()
and pm_request_idle(), executes the device driver's pm->runtime_idle()
callback, if defined, and if that callback doesn't return an error code (or is
not present at all), suspends the device with the help of pm_runtime_suspend().
Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for
example, it is called right after the device has just been resumed), in which
case it is expected to suspend the device if that makes sense.  Usually,
however, the PCI subsystem doesn't know whether the device can actually be
suspended, so it lets the device's driver decide by running its
pm->runtime_idle() callback.

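The decision flow described for pci_pm_runtime_idle() can be modeled in a few
lines.  This is a userspace sketch, not the kernel's implementation: struct
fake_dev, fake_runtime_idle() and the two example callbacks are hypothetical,
and setting the "suspended" flag stands in for calling pm_runtime_suspend().

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a device with an optional driver idle callback. */
struct fake_dev {
	int (*runtime_idle)(struct fake_dev *dev);	/* may be NULL */
	int suspended;
};

/* Example driver callbacks: one agrees to suspend, one vetoes it. */
int idle_ok(struct fake_dev *dev)   { (void)dev; return 0; }
int idle_busy(struct fake_dev *dev) { (void)dev; return -EBUSY; }

/*
 * Run the driver's idle callback, if any.  Suspend the device unless the
 * callback returned an error code.
 */
int fake_runtime_idle(struct fake_dev *dev)
{
	if (dev->runtime_idle) {
		int ret = dev->runtime_idle(dev);

		if (ret)
			return ret;	/* the driver does not want a suspend */
	}

	dev->suspended = 1;		/* stands in for pm_runtime_suspend() */
	return 0;
}
```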
2.4. System-Wide Power Transitions
----------------------------------

There are a few different types of system-wide power transitions, described in
Documentation/driver-api/pm/devices.rst.  Each of them requires devices to be
handled in a specific way and the PM core executes subsystem-level power
management callbacks for this purpose.  They are executed in phases such that
each phase involves executing the same subsystem-level callback for every device
belonging to the given subsystem before the next phase begins.  These phases
always run after tasks have been frozen.

2.4.1. System Suspend
^^^^^^^^^^^^^^^^^^^^^

When the system is going into a sleep state in which the contents of memory will
be preserved, such as one of the ACPI sleep states S1-S3, the phases are:

	prepare, suspend, suspend_noirq.

The following PCI bus type's callbacks, respectively, are used in these phases::

	pci_pm_prepare()
	pci_pm_suspend()
	pci_pm_suspend_noirq()

The pci_pm_prepare() routine first puts the device into the "fully functional"
state with the help of pm_runtime_resume().  Then, it executes the device
driver's pm->prepare() callback if defined (i.e. if the driver's struct
dev_pm_ops object is present and the prepare pointer in that object is valid).

The pci_pm_suspend() routine first checks if the device's driver implements
legacy PCI suspend routines (see Section 3), in which case the driver's legacy
suspend callback is executed, if present, and its result is returned.  Next, if
the device's driver doesn't provide a struct dev_pm_ops object (containing
pointers to the driver's callbacks), pci_pm_default_suspend() is called, which
simply turns off the device's bus master capability and runs
pcibios_disable_device() to disable it, unless the device is a bridge (PCI
bridges are ignored by this routine).  Next, the device driver's pm->suspend()
callback is executed, if defined, and its result is returned if it fails.
Finally, pci_fixup_device() is called to apply hardware suspend quirks related
to the device if necessary.

Note that the suspend phase is carried out asynchronously for PCI devices, so
the pci_pm_suspend() callback may be executed in parallel for any pair of PCI
devices that don't depend on each other in a known way (i.e. none of the paths
in the device tree from the root bridge to a leaf device contains both of them).

The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has
been called, which means that the device driver's interrupt handler won't be
invoked while this routine is running.  It first checks if the device's driver
implements legacy PCI suspend routines (Section 3), in which case the legacy
late suspend routine is called and its result is returned (the standard
configuration registers of the device are saved if the driver's callback hasn't
done that).  Second, if the device driver's struct dev_pm_ops object is not
present, the device's standard configuration registers are saved and the routine
returns success.  Otherwise the device driver's pm->suspend_noirq() callback is
executed, if present, and its result is returned if it fails.  Next, if the
device's standard configuration registers haven't been saved yet (one of the
device driver's callbacks executed before might do that), pci_pm_suspend_noirq()
saves them, prepares the device to signal wakeup (if necessary) and puts it into
a low-power state.

The low-power state to put the device into is the lowest-power (highest number)
state from which it can signal wakeup while the system is in the target sleep
state.  Just like in the runtime PM case described above, the mechanism of
signaling wakeup is system-dependent and determined by the PCI subsystem, which
is also responsible for preparing the device to signal wakeup from the system's
target sleep state as appropriate.

PCI device drivers (that don't implement legacy power management callbacks) are
generally not expected to prepare devices for signaling wakeup or to put them
into low-power states.  However, if one of the driver's suspend callbacks
(pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration
registers, pci_pm_suspend_noirq() will assume that the device has been prepared
to signal wakeup and put into a low-power state by the driver (the driver is
then assumed to have used the helper functions provided by the PCI subsystem for
this purpose).  PCI device drivers are not encouraged to do that, but in some
rare cases doing that in the driver may be the optimum approach.

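The "registers already saved" rule can be made concrete with a small model.
The sketch below is a userspace illustration under stated assumptions: struct
fake_pci_dev, its fields and both functions are hypothetical, the legacy and
driverless paths are left out, and plain assignments stand in for saving the
configuration registers, preparing wakeup and changing the power state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few pieces of device state used here. */
struct fake_pci_dev {
	bool state_saved;	/* config registers have been saved */
	bool wakeup_prepared;	/* device prepared to signal wakeup */
	int current_state;	/* 0 = D0 ... 3 = D3hot */
};

typedef int (*noirq_cb)(struct fake_pci_dev *dev);

/* Example driver callback that saves the registers on its own. */
int drv_saves_state(struct fake_pci_dev *dev)
{
	dev->state_saved = true;
	return 0;
}

/*
 * Run the driver's suspend_noirq callback (if any).  Only if the registers
 * have not been saved yet does the subsystem save them, prepare wakeup and
 * put the device into 'target'; otherwise the driver is assumed to have
 * handled all of that itself.
 */
int fake_suspend_noirq(struct fake_pci_dev *dev, noirq_cb drv_cb, int target)
{
	if (drv_cb) {
		int ret = drv_cb(dev);

		if (ret)
			return ret;
	}

	if (!dev->state_saved) {
		dev->state_saved = true;	/* save config registers */
		dev->wakeup_prepared = true;	/* prepare wakeup signaling */
		dev->current_state = target;	/* enter the low-power state */
	}

	return 0;
}
```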
2.4.2. System Resume
^^^^^^^^^^^^^^^^^^^^

When the system is undergoing a transition from a sleep state in which the
contents of memory have been preserved, such as one of the ACPI sleep states
S1-S3, into the working state (ACPI S0), the phases are:

	resume_noirq, resume, complete.

The following PCI bus type's callbacks, respectively, are executed in these
phases::

	pci_pm_resume_noirq()
	pci_pm_resume()
	pci_pm_complete()

The pci_pm_resume_noirq() routine first puts the device into the full-power
state, restores its standard configuration registers and applies early resume
hardware quirks related to the device, if necessary.  This is done
unconditionally, regardless of whether or not the device's driver implements
legacy PCI power management callbacks (this way all PCI devices are in the
full-power state and their standard configuration registers have been restored
when their interrupt handlers are invoked for the first time during resume,
which allows the kernel to avoid problems with the handling of shared interrupts
by drivers whose devices are still suspended).  If legacy PCI power management
callbacks (see Section 3) are implemented by the device's driver, the legacy
early resume callback is executed and its result is returned.  Otherwise, the
device driver's pm->resume_noirq() callback is executed, if defined, and its
result is returned.

The pci_pm_resume() routine first checks if the device's standard configuration
registers have been restored and restores them if that's not the case (this
only is necessary in the error path during a failing suspend).  Next, resume
hardware quirks related to the device are applied, if necessary, and if the
device's driver implements legacy PCI power management callbacks (see
Section 3), the driver's legacy resume callback is executed and its result is
returned.  Otherwise, the device's wakeup signaling mechanisms are blocked and
its driver's pm->resume() callback is executed, if defined (the callback's
result is then returned).

The resume phase is carried out asynchronously for PCI devices, like the
suspend phase described above, which means that if two PCI devices don't depend
on each other in a known way, the pci_pm_resume() routine may be executed for
both of them in parallel.

The pci_pm_complete() routine only executes the device driver's pm->complete()
callback, if defined.

2.4.3. System Hibernation
^^^^^^^^^^^^^^^^^^^^^^^^^

System hibernation is more complicated than system suspend, because it requires
a system image to be created and written into a persistent storage medium.  The
image is created atomically and all devices are quiesced, or frozen, before that
happens.

The freezing of devices is carried out after enough memory has been freed (at
the time of this writing the image creation requires at least 50% of system RAM
to be free) in the following three phases:

	prepare, freeze, freeze_noirq

that correspond to the PCI bus type's callbacks::

	pci_pm_prepare()
	pci_pm_freeze()
	pci_pm_freeze_noirq()

This means that the prepare phase is exactly the same as for system suspend.
The other two phases, however, are different.

The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs
the device driver's pm->freeze() callback, if defined, instead of pm->suspend(),
and it doesn't apply the suspend-related hardware quirks.  It is executed
asynchronously for different PCI devices that don't depend on each other in a
known way.

The pci_pm_freeze_noirq() routine, in turn, is similar to
pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq()
routine instead of pm->suspend_noirq().  It also doesn't attempt to prepare the
device for signaling wakeup and put it into a low-power state.  Still, it saves
the device's standard configuration registers if they haven't been saved by one
of the driver's callbacks.

Once the image has been created, it has to be saved.  However, at this point all
devices are frozen and they cannot handle I/O, while their ability to handle
I/O is obviously necessary for the image saving.  Thus they have to be brought
back to the fully functional state and this is done in the following phases:

	thaw_noirq, thaw, complete

using the following PCI bus type's callbacks::

	pci_pm_thaw_noirq()
	pci_pm_thaw()
	pci_pm_complete()

respectively.

 603The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq().
 604It puts the device into the full power state and restores its standard
 605configuration registers.  It also executes the device driver's pm->thaw_noirq()
 606callback, if defined, instead of pm->resume_noirq().
 607
 608The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device
 609driver's pm->thaw() callback instead of pm->resume().  It is executed
 610asynchronously for different PCI devices that don't depend on each other in a
 611known way.
 612
 613The complete phase is the same as for system resume.
 614
 615After saving the image, devices need to be powered down before the system can
 616enter the target sleep state (ACPI S4 for ACPI-based systems).  This is done in
 617three phases:
 618
 619	prepare, poweroff, poweroff_noirq
 620
 621where the prepare phase is exactly the same as for system suspend.  The other
 622two phases are analogous to the suspend and suspend_noirq phases, respectively.
 623The PCI subsystem-level callbacks they correspond to::
 624
 625	pci_pm_poweroff()
 626	pci_pm_poweroff_noirq()
 627
work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively,
although they don't attempt to save the device's standard configuration
registers.
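
The order of the phases described in this subsection can be summarized in a
small table.  The following is a minimal user-space sketch, not kernel code;
struct pm_phase, hibernation_phases and NR_HIBERNATION_PHASES are names made up
for this illustration::

	#include <stddef.h>

	/* One device-related step of hibernation, in execution order. */
	struct pm_phase {
		const char *phase;		/* phase name used in this document */
		const char *pci_callback;	/* PCI bus type callback for it */
	};

	static const struct pm_phase hibernation_phases[] = {
		/* Freeze devices before the image is created. */
		{ "prepare",		"pci_pm_prepare" },
		{ "freeze",		"pci_pm_freeze" },
		{ "freeze_noirq",	"pci_pm_freeze_noirq" },
		/* Thaw devices so that the image can be saved. */
		{ "thaw_noirq",		"pci_pm_thaw_noirq" },
		{ "thaw",		"pci_pm_thaw" },
		{ "complete",		"pci_pm_complete" },
		/* Power devices down before entering the sleep state. */
		{ "prepare",		"pci_pm_prepare" },
		{ "poweroff",		"pci_pm_poweroff" },
		{ "poweroff_noirq",	"pci_pm_poweroff_noirq" },
	};

	#define NR_HIBERNATION_PHASES \
		(sizeof(hibernation_phases) / sizeof(hibernation_phases[0]))

The table only lists the phases named in this subsection; the prepare phase
appears twice because it is run again before the devices are powered off.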
 631
 6322.4.4. System Restore
 633^^^^^^^^^^^^^^^^^^^^^
 634
 635System restore requires a hibernation image to be loaded into memory and the
 636pre-hibernation memory contents to be restored before the pre-hibernation system
 637activity can be resumed.
 638
 639As described in Documentation/driver-api/pm/devices.rst, the hibernation image
 640is loaded into memory by a fresh instance of the kernel, called the boot kernel,
 641which in turn is loaded and run by a boot loader in the usual way.  After the
 642boot kernel has loaded the image, it needs to replace its own code and data with
 643the code and data of the "hibernated" kernel stored within the image, called the
 644image kernel.  For this purpose all devices are frozen just like before creating
 645the image during hibernation, in the
 646
 647	prepare, freeze, freeze_noirq
 648
 649phases described above.  However, the devices affected by these phases are only
 650those having drivers in the boot kernel; other devices will still be in whatever
 651state the boot loader left them.
 652
 653Should the restoration of the pre-hibernation memory contents fail, the boot
 654kernel would go through the "thawing" procedure described above, using the
 655thaw_noirq, thaw, and complete phases (that will only affect the devices having
 656drivers in the boot kernel), and then continue running normally.
 657
 658If the pre-hibernation memory contents are restored successfully, which is the
 659usual situation, control is passed to the image kernel, which then becomes
 660responsible for bringing the system back to the working state.  To achieve this,
 661it must restore the devices' pre-hibernation functionality, which is done much
 662like waking up from the memory sleep state, although it involves different
 663phases:
 664
 665	restore_noirq, restore, complete
 666
 667The first two of these are analogous to the resume_noirq and resume phases
 668described above, respectively, and correspond to the following PCI subsystem
 669callbacks::
 670
 671	pci_pm_restore_noirq()
 672	pci_pm_restore()
 673
 674These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(),
 675respectively, but they execute the device driver's pm->restore_noirq() and
 676pm->restore() callbacks, if available.
 677
 678The complete phase is carried out in exactly the same way as during system
 679resume.
 680
 681
 6823. PCI Device Drivers and Power Management
 683==========================================
 684
 6853.1. Power Management Callbacks
 686-------------------------------
 687
 688PCI device drivers participate in power management by providing callbacks to be
 689executed by the PCI subsystem's power management routines described above and by
 690controlling the runtime power management of their devices.
 691
 692At the time of this writing there are two ways to define power management
 693callbacks for a PCI device driver, the recommended one, based on using a
 694dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst, and
 695the "legacy" one, in which the .suspend() and .resume() callbacks from struct
 696pci_driver are used.  The legacy approach, however, doesn't allow one to define
 697runtime power management callbacks and is not really suitable for any new
 698drivers.  Therefore it is not covered by this document (refer to the source code
 699to learn more about it).
 700
 701It is recommended that all PCI device drivers define a struct dev_pm_ops object
 702containing pointers to power management (PM) callbacks that will be executed by
 703the PCI subsystem's PM routines in various circumstances.  A pointer to the
 704driver's struct dev_pm_ops object has to be assigned to the driver.pm field in
 705its struct pci_driver object.  Once that has happened, the "legacy" PM callbacks
 706in struct pci_driver are ignored (even if they are not NULL).
 707
 708The PM callbacks in struct dev_pm_ops are not mandatory and if they are not
 709defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI
 710subsystem will handle the device in a simplified default manner.  If they are
 711defined, though, they are expected to behave as described in the following
 712subsections.
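
As a rough illustration of the recommended approach, the following sketch wires
a struct dev_pm_ops object into a driver through the driver.pm field.  So that
it can be compiled outside the kernel tree, it declares heavily simplified
stand-ins for struct device, struct dev_pm_ops, struct device_driver and struct
pci_driver (a real driver gets the real definitions from the kernel headers),
and all of the foo_* names are hypothetical::

	#include <stddef.h>

	/* Simplified stand-ins for kernel types; NOT the real definitions. */
	struct device {
		int quiesced;	/* hypothetical driver state */
	};

	struct dev_pm_ops {
		int (*suspend)(struct device *dev);
		int (*resume)(struct device *dev);
	};

	struct device_driver {
		const struct dev_pm_ops *pm;
	};

	struct pci_driver {
		const char *name;
		struct device_driver driver;
	};

	/* Hypothetical callbacks: they only quiesce and restart the device. */
	static int foo_suspend(struct device *dev)
	{
		dev->quiesced = 1;
		return 0;
	}

	static int foo_resume(struct device *dev)
	{
		dev->quiesced = 0;
		return 0;
	}

	static const struct dev_pm_ops foo_pm_ops = {
		.suspend = foo_suspend,
		.resume = foo_resume,
	};

	static struct pci_driver foo_pci_driver = {
		.name = "foo",
		.driver.pm = &foo_pm_ops,	/* the driver.pm field */
	};

Members of foo_pm_ops that are not listed in the initializer stay unset, which
corresponds to the default handling by the PCI subsystem mentioned above.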
 713
 7143.1.1. prepare()
 715^^^^^^^^^^^^^^^^
 716
 717The prepare() callback is executed during system suspend, during hibernation
 718(when a hibernation image is about to be created), during power-off after
 719saving a hibernation image and during system restore, when a hibernation image
 720has just been loaded into memory.
 721
 722This callback is only necessary if the driver's device has children that in
 723general may be registered at any time.  In that case the role of the prepare()
 724callback is to prevent new children of the device from being registered until
 725one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run.
 726
 727In addition to that the prepare() callback may carry out some operations
 728preparing the device to be suspended, although it should not allocate memory
 729(if additional memory is required to suspend the device, it has to be
 730preallocated earlier, for example in a suspend/hibernate notifier as described
 731in Documentation/driver-api/pm/notifiers.rst).
 732
 7333.1.2. suspend()
 734^^^^^^^^^^^^^^^^
 735
 736The suspend() callback is only executed during system suspend, after prepare()
 737callbacks have been executed for all devices in the system.
 738
This callback is expected to quiesce the device and prepare it to be put into a
low-power state by the PCI subsystem.  It is not required (in fact, it is not
even recommended) that a PCI driver's suspend() callback save the standard
configuration registers of the device, prepare it for waking up the system, or
put it into a low-power state.  All of these operations can very well be taken
care of by the PCI subsystem, without the driver's participation.
 745
However, in some rare cases it is convenient to carry out these operations in
a PCI driver.  Then, pci_save_state(), pci_prepare_to_sleep(), and
pci_set_power_state() should be used to save the device's standard configuration
registers, to prepare it for system wakeup (if necessary), and to put it into a
low-power state, respectively.  Moreover, if the driver calls pci_save_state(),
the PCI subsystem will execute neither pci_prepare_to_sleep() nor
pci_set_power_state() for its device, so the driver is then responsible for
handling the device as appropriate.
 754
 755While the suspend() callback is being executed, the driver's interrupt handler
 756can be invoked to handle an interrupt from the device, so all suspend-related
 757operations relying on the driver's ability to handle interrupts should be
 758carried out in this callback.
 759
 7603.1.3. suspend_noirq()
 761^^^^^^^^^^^^^^^^^^^^^^
 762
 763The suspend_noirq() callback is only executed during system suspend, after
 764suspend() callbacks have been executed for all devices in the system and
 765after device interrupts have been disabled by the PM core.
 766
 767The difference between suspend_noirq() and suspend() is that the driver's
 768interrupt handler will not be invoked while suspend_noirq() is running.  Thus
 769suspend_noirq() can carry out operations that would cause race conditions to
 770arise if they were performed in suspend().
 771
 7723.1.4. freeze()
 773^^^^^^^^^^^^^^^
 774
 775The freeze() callback is hibernation-specific and is executed in two situations,
 776during hibernation, after prepare() callbacks have been executed for all devices
 777in preparation for the creation of a system image, and during restore,
 778after a system image has been loaded into memory from persistent storage and the
 779prepare() callbacks have been executed for all devices.
 780
 781The role of this callback is analogous to the role of the suspend() callback
 782described above.  In fact, they only need to be different in the rare cases when
 783the driver takes the responsibility for putting the device into a low-power
 784state.
 785
In those cases the freeze() callback should not prepare the device for system
wakeup or put it into a low-power state.  Still, either it or freeze_noirq()
should save the device's standard configuration registers using
pci_save_state().
 789
 7903.1.5. freeze_noirq()
 791^^^^^^^^^^^^^^^^^^^^^
 792
 793The freeze_noirq() callback is hibernation-specific.  It is executed during
 794hibernation, after prepare() and freeze() callbacks have been executed for all
 795devices in preparation for the creation of a system image, and during restore,
 796after a system image has been loaded into memory and after prepare() and
 797freeze() callbacks have been executed for all devices.  It is always executed
 798after device interrupts have been disabled by the PM core.
 799
The role of this callback is analogous to the role of the suspend_noirq()
callback described above and it is very rarely necessary to define
freeze_noirq().
 803
 804The difference between freeze_noirq() and freeze() is analogous to the
 805difference between suspend_noirq() and suspend().
 806
 8073.1.6. poweroff()
 808^^^^^^^^^^^^^^^^^
 809
The poweroff() callback is hibernation-specific.  It is executed when the system
is about to be powered off after saving a hibernation image to persistent
storage.  prepare() callbacks are executed for all devices before poweroff() is
called.
 814
 815The role of this callback is analogous to the role of the suspend() and freeze()
 816callbacks described above, although it does not need to save the contents of
 817the device's registers.  In particular, if the driver wants to put the device
 818into a low-power state itself instead of allowing the PCI subsystem to do that,
 819the poweroff() callback should use pci_prepare_to_sleep() and
 820pci_set_power_state() to prepare the device for system wakeup and to put it
 821into a low-power state, respectively, but it need not save the device's standard
 822configuration registers.
 823
 8243.1.7. poweroff_noirq()
 825^^^^^^^^^^^^^^^^^^^^^^^
 826
 827The poweroff_noirq() callback is hibernation-specific.  It is executed after
 828poweroff() callbacks have been executed for all devices in the system.
 829
 830The role of this callback is analogous to the role of the suspend_noirq() and
 831freeze_noirq() callbacks described above, but it does not need to save the
 832contents of the device's registers.
 833
 834The difference between poweroff_noirq() and poweroff() is analogous to the
 835difference between suspend_noirq() and suspend().
 836
 8373.1.8. resume_noirq()
 838^^^^^^^^^^^^^^^^^^^^^
 839
 840The resume_noirq() callback is only executed during system resume, after the
 841PM core has enabled the non-boot CPUs.  The driver's interrupt handler will not
 842be invoked while resume_noirq() is running, so this callback can carry out
 843operations that might race with the interrupt handler.
 844
 845Since the PCI subsystem unconditionally puts all devices into the full power
 846state in the resume_noirq phase of system resume and restores their standard
 847configuration registers, resume_noirq() is usually not necessary.  In general
 848it should only be used for performing operations that would lead to race
 849conditions if carried out by resume().
 850
 8513.1.9. resume()
 852^^^^^^^^^^^^^^^
 853
 854The resume() callback is only executed during system resume, after
 855resume_noirq() callbacks have been executed for all devices in the system and
 856device interrupts have been enabled by the PM core.
 857
This callback is responsible for restoring the pre-suspend configuration of the
device and bringing it back to the fully functional state.  The device should be
able to process I/O in the usual way after resume() has returned.
 861
 8623.1.10. thaw_noirq()
 863^^^^^^^^^^^^^^^^^^^^
 864
 865The thaw_noirq() callback is hibernation-specific.  It is executed after a
 866system image has been created and the non-boot CPUs have been enabled by the PM
 867core, in the thaw_noirq phase of hibernation.  It also may be executed if the
 868loading of a hibernation image fails during system restore (it is then executed
 869after enabling the non-boot CPUs).  The driver's interrupt handler will not be
 870invoked while thaw_noirq() is running.
 871
 872The role of this callback is analogous to the role of resume_noirq().  The
 873difference between these two callbacks is that thaw_noirq() is executed after
 874freeze() and freeze_noirq(), so in general it does not need to modify the
 875contents of the device's registers.
 876
 8773.1.11. thaw()
 878^^^^^^^^^^^^^^
 879
 880The thaw() callback is hibernation-specific.  It is executed after thaw_noirq()
 881callbacks have been executed for all devices in the system and after device
 882interrupts have been enabled by the PM core.
 883
This callback is responsible for restoring the pre-freeze configuration of
the device, so that it will work in the usual way after thaw() has returned.
 886
 8873.1.12. restore_noirq()
 888^^^^^^^^^^^^^^^^^^^^^^^
 889
The restore_noirq() callback is hibernation-specific.  It is executed in the
restore_noirq phase of system restore, when the boot kernel has passed control
to the image kernel and the non-boot CPUs have been enabled by the image
kernel's PM core.
 894
 895This callback is analogous to resume_noirq() with the exception that it cannot
 896make any assumption on the previous state of the device, even if the BIOS (or
 897generally the platform firmware) is known to preserve that state over a
 898suspend-resume cycle.
 899
 900For the vast majority of PCI device drivers there is no difference between
 901resume_noirq() and restore_noirq().
 902
 9033.1.13. restore()
 904^^^^^^^^^^^^^^^^^
 905
 906The restore() callback is hibernation-specific.  It is executed after
 907restore_noirq() callbacks have been executed for all devices in the system and
 908after the PM core has enabled device drivers' interrupt handlers to be invoked.
 909
 910This callback is analogous to resume(), just like restore_noirq() is analogous
 911to resume_noirq().  Consequently, the difference between restore_noirq() and
 912restore() is analogous to the difference between resume_noirq() and resume().
 913
 914For the vast majority of PCI device drivers there is no difference between
 915resume() and restore().
 916
 9173.1.14. complete()
 918^^^^^^^^^^^^^^^^^^
 919
 920The complete() callback is executed in the following situations:
 921
 922  - during system resume, after resume() callbacks have been executed for all
 923    devices,
 924  - during hibernation, before saving the system image, after thaw() callbacks
 925    have been executed for all devices,
 926  - during system restore, when the system is going back to its pre-hibernation
 927    state, after restore() callbacks have been executed for all devices.
 928
 929It also may be executed if the loading of a hibernation image into memory fails
 930(in that case it is run after thaw() callbacks have been executed for all
 931devices that have drivers in the boot kernel).
 932
 933This callback is entirely optional, although it may be necessary if the
 934prepare() callback performs operations that need to be reversed.
 935
 9363.1.15. runtime_suspend()
 937^^^^^^^^^^^^^^^^^^^^^^^^^
 938
 939The runtime_suspend() callback is specific to device runtime power management
 940(runtime PM).  It is executed by the PM core's runtime PM framework when the
 941device is about to be suspended (i.e. quiesced and put into a low-power state)
 942at run time.
 943
 944This callback is responsible for freezing the device and preparing it to be
 945put into a low-power state, but it must allow the PCI subsystem to perform all
 946of the PCI-specific actions necessary for suspending the device.
 947
 9483.1.16. runtime_resume()
 949^^^^^^^^^^^^^^^^^^^^^^^^
 950
 951The runtime_resume() callback is specific to device runtime PM.  It is executed
 952by the PM core's runtime PM framework when the device is about to be resumed
 953(i.e. put into the full-power state and programmed to process I/O normally) at
 954run time.
 955
 956This callback is responsible for restoring the normal functionality of the
 957device after it has been put into the full-power state by the PCI subsystem.
 958The device is expected to be able to process I/O in the usual way after
 959runtime_resume() has returned.
 960
 9613.1.17. runtime_idle()
 962^^^^^^^^^^^^^^^^^^^^^^
 963
 964The runtime_idle() callback is specific to device runtime PM.  It is executed
 965by the PM core's runtime PM framework whenever it may be desirable to suspend
 966the device according to the PM core's information.  In particular, it is
 967automatically executed right after runtime_resume() has returned in case the
 968resume of the device has happened as a result of a spurious event.
 969
 970This callback is optional, but if it is not implemented or if it returns 0, the
 971PCI subsystem will call pm_runtime_suspend() for the device, which in turn will
 972cause the driver's runtime_suspend() callback to be executed.
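
The rule above can be illustrated with a minimal user-space sketch.  It is not
kernel code: struct device is a simplified stand-in, the open_count member and
the foo_/model_ names are hypothetical, and -EBUSY is merely one plausible
nonzero value that a driver could return to veto the suspend::

	#include <errno.h>
	#include <stddef.h>

	/* Simplified stand-in for struct device with hypothetical state. */
	struct device {
		int open_count;	/* hypothetical: number of active users */
	};

	/*
	 * Hypothetical runtime_idle() callback: returning 0 lets the suspend
	 * go ahead, a nonzero value prevents it.
	 */
	static int foo_runtime_idle(struct device *dev)
	{
		return dev->open_count > 0 ? -EBUSY : 0;
	}

	/*
	 * Model of the decision described above: returns 1 if the device's
	 * runtime_suspend() callback would be requested, 0 otherwise.
	 */
	static int model_idle_leads_to_suspend(struct device *dev,
					       int (*runtime_idle)(struct device *))
	{
		if (!runtime_idle)
			return 1;	/* callback not implemented */

		return runtime_idle(dev) == 0;
	}

With this model, a device that still has users is left alone, while an unused
device, or one whose driver has no runtime_idle() callback at all, proceeds to
runtime_suspend().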
 973
 9743.1.18. Pointing Multiple Callback Pointers to One Routine
 975^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 976
Although in principle each of the callbacks described in the previous
subsections can be defined as a separate function, it is often convenient to
point two or more members of struct dev_pm_ops to the same routine.  There are
a few convenience macros that can be used for this purpose.
 981
 982The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one
 983suspend routine pointed to by the .suspend(), .freeze(), and .poweroff()
 984members and one resume routine pointed to by the .resume(), .thaw(), and
 985.restore() members.  The other function pointers in this struct dev_pm_ops are
 986unset.
 987
The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it
additionally sets the .runtime_resume() pointer to the same value as
.resume() (and .thaw() and .restore()) and the .runtime_suspend() pointer to
the same value as .suspend() (and .freeze() and .poweroff()).
 992
The SET_SYSTEM_SLEEP_PM_OPS macro can be used inside a declaration of struct
dev_pm_ops to indicate that one suspend routine is to be pointed to by the
.suspend(), .freeze(), and .poweroff() members and one resume routine is to
be pointed to by the .resume(), .thaw(), and .restore() members.
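
The pointer assignments described in this subsection can be modeled as follows.
This is a user-space sketch under stated assumptions, not the kernel's
definitions: struct dev_pm_ops is a reduced stand-in, the MODEL_* macros only
imitate the behavior described above for SET_SYSTEM_SLEEP_PM_OPS and
SIMPLE_DEV_PM_OPS, and the foo_* names are hypothetical::

	#include <stddef.h>

	struct device;

	/* Reduced stand-in; the real struct dev_pm_ops has more members. */
	struct dev_pm_ops {
		int (*suspend)(struct device *dev);
		int (*resume)(struct device *dev);
		int (*freeze)(struct device *dev);
		int (*thaw)(struct device *dev);
		int (*poweroff)(struct device *dev);
		int (*restore)(struct device *dev);
		int (*runtime_suspend)(struct device *dev);
		int (*runtime_resume)(struct device *dev);
	};

	/* Model of the assignments made inside a struct dev_pm_ops initializer. */
	#define MODEL_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
		.suspend = suspend_fn, \
		.freeze = suspend_fn, \
		.poweroff = suspend_fn, \
		.resume = resume_fn, \
		.thaw = resume_fn, \
		.restore = resume_fn,

	/* Model of a macro that declares the whole object. */
	#define MODEL_SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \
		const struct dev_pm_ops name = { \
			MODEL_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
		}

	static int foo_suspend(struct device *dev) { (void)dev; return 0; }
	static int foo_resume(struct device *dev) { (void)dev; return 0; }
	static int foo_rt_suspend(struct device *dev) { (void)dev; return 0; }
	static int foo_rt_resume(struct device *dev) { (void)dev; return 0; }

	/* One suspend and one resume routine; everything else stays unset. */
	static MODEL_SIMPLE_DEV_PM_OPS(foo_simple_pm_ops, foo_suspend, foo_resume);

	/* The same system sleep routines combined with separate runtime PM ones. */
	static const struct dev_pm_ops foo_pm_ops = {
		MODEL_SET_SYSTEM_SLEEP_PM_OPS(foo_suspend, foo_resume)
		.runtime_suspend = foo_rt_suspend,
		.runtime_resume = foo_rt_resume,
	};

The second form is useful when the system sleep callbacks can share code but
the runtime PM callbacks have to be different routines.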
 997
 9983.1.19. Driver Flags for Power Management
 999^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1000
1001The PM core allows device drivers to set flags that influence the handling of
1002power management for the devices by the core itself and by middle layer code
1003including the PCI bus type.  The flags should be set once at the driver probe
1004time with the help of the dev_pm_set_driver_flags() function and they should not
1005be updated directly afterwards.
1006
1007The DPM_FLAG_NO_DIRECT_COMPLETE flag prevents the PM core from using the
1008direct-complete mechanism allowing device suspend/resume callbacks to be skipped
1009if the device is in runtime suspend when the system suspend starts.  That also
1010affects all of the ancestors of the device, so this flag should only be used if
1011absolutely necessary.
1012
1013The DPM_FLAG_SMART_PREPARE flag causes the PCI bus type to return a positive
1014value from pci_pm_prepare() only if the ->prepare callback provided by the
1015driver of the device returns a positive value.  That allows the driver to opt
1016out from using the direct-complete mechanism dynamically (whereas setting
1017DPM_FLAG_NO_DIRECT_COMPLETE means permanent opt-out).
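
The effect of DPM_FLAG_SMART_PREPARE on the return value can be modeled with a
short user-space sketch.  It is a simplification and not the kernel's
pci_pm_prepare(): the MODEL_FLAG_SMART_PREPARE value, the function name and its
parameters are assumptions made for this illustration, and other_checks_ok
stands for every PCI-specific condition that is not modeled here::

	#include <stdbool.h>

	/* Hypothetical flag value used by this model only. */
	#define MODEL_FLAG_SMART_PREPARE	(1u << 1)

	/*
	 * Model of the return-value rule.  A positive result means that the
	 * direct-complete mechanism may be used for the device.
	 *
	 * has_prepare: the driver provides a ->prepare callback
	 * prepare_ret: the value that callback returned (if it exists)
	 */
	static int model_pci_pm_prepare(unsigned int flags, bool has_prepare,
					int prepare_ret, bool other_checks_ok)
	{
		if (has_prepare) {
			if (prepare_ret < 0)
				return prepare_ret;	/* pass the error on */

			/* The driver opts out for this transition only. */
			if (prepare_ret == 0 &&
			    (flags & MODEL_FLAG_SMART_PREPARE))
				return 0;
		}

		return other_checks_ok ? 1 : 0;
	}

In this model a driver that sets the flag decides per transition, through the
return value of its ->prepare callback, whether direct-complete may be used.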
1018
1019The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver's
1020perspective the device can be safely left in runtime suspend during system
1021suspend.  That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff()
1022to avoid resuming the device from runtime suspend unless there are PCI-specific
1023reasons for doing that.  Also, it causes pci_pm_suspend_late/noirq() and
1024pci_pm_poweroff_late/noirq() to return early if the device remains in runtime
1025suspend during the "late" phase of the system-wide transition under way.
1026Moreover, if the device is in runtime suspend in pci_pm_resume_noirq() or
1027pci_pm_restore_noirq(), its runtime PM status will be changed to "active" (as it
1028is going to be put into D0 going forward).
1029
1030Setting the DPM_FLAG_MAY_SKIP_RESUME flag means that the driver allows its
1031"noirq" and "early" resume callbacks to be skipped if the device can be left
1032in suspend after a system-wide transition into the working state.  This flag is
1033taken into consideration by the PM core along with the power.may_skip_resume
1034status bit of the device which is set by pci_pm_suspend_noirq() in certain
1035situations.  If the PM core determines that the driver's "noirq" and "early"
1036resume callbacks should be skipped, the dev_pm_skip_resume() helper function
1037will return "true" and that will cause pci_pm_resume_noirq() and
1038pci_pm_resume_early() to return upfront without touching the device and
1039executing the driver callbacks.
1040
10413.2. Device Runtime Power Management
1042------------------------------------
1043
1044In addition to providing device power management callbacks PCI device drivers
1045are responsible for controlling the runtime power management (runtime PM) of
1046their devices.
1047
1048The PCI device runtime PM is optional, but it is recommended that PCI device
1049drivers implement it at least in the cases where there is a reliable way of
1050verifying that the device is not used (like when the network cable is detached
1051from an Ethernet adapter or there are no devices attached to a USB controller).
1052
1053To support the PCI runtime PM the driver first needs to implement the
1054runtime_suspend() and runtime_resume() callbacks.  It also may need to implement
1055the runtime_idle() callback to prevent the device from being suspended again
1056every time right after the runtime_resume() callback has returned
1057(alternatively, the runtime_suspend() callback will have to check if the
1058device should really be suspended and return -EAGAIN if that is not the case).
1059
The runtime PM of PCI devices is enabled by default by the PCI core.  PCI
device drivers do not need to enable it and should not attempt to do so.
However, it is blocked by pci_pm_init(), which runs the pm_runtime_forbid()
helper function.  In addition to that, the runtime PM usage counter of
each PCI device is incremented by local_pci_probe() before executing the
probe callback provided by the device's driver.
1066
If a PCI driver implements the runtime PM callbacks and intends to use the
runtime PM framework provided by the PM core and the PCI subsystem, it needs
to decrement the device's runtime PM usage counter in its probe callback
function.  If it doesn't do that, the counter will always be different from
zero for the device and it will never be runtime-suspended.  The simplest
way to do that is by calling pm_runtime_put_noidle(), but if the driver
wants to schedule an autosuspend right away, for example, it may call
pm_runtime_put_autosuspend() instead for this purpose.  Generally, it
just needs to call a function that decrements the device's usage counter
from its probe routine to make runtime PM work for the device.
1077
1078It is important to remember that the driver's runtime_suspend() callback
1079may be executed right after the usage counter has been decremented, because
1080user space may already have caused the pm_runtime_allow() helper function
1081unblocking the runtime PM of the device to run via sysfs, so the driver must
1082be prepared to cope with that.
1083
1084The driver itself should not call pm_runtime_allow(), though.  Instead, it
1085should let user space or some platform-specific code do that (user space can
1086do it via sysfs as stated above), but it must be prepared to handle the
1087runtime PM of the device correctly as soon as pm_runtime_allow() is called
1088(which may happen at any time, even before the driver is loaded).
1089
When the driver's remove callback runs, it has to balance the decrement of
the device's runtime PM usage counter made at probe time.  For this reason,
if it has decremented the counter in its probe callback, it must run
pm_runtime_get_noresume() in its remove callback.  [Since the core carries
out a runtime resume of the device and bumps up the device's usage counter
before running the driver's remove callback, the runtime PM of the device
is effectively disabled for the duration of the remove execution and all
runtime PM helper functions incrementing the device's usage counter are
then effectively equivalent to pm_runtime_get_noresume().]
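
The usage counter bookkeeping described above can be sketched in user space as
follows.  This is a model under stated assumptions, not kernel code: struct
device and the two pm_runtime_*() functions are stand-ins that only track a
counter, the hypothetical foo_probe() and foo_remove() take a struct device
pointer for brevity, and model_core_probe() merely imitates the increment made
before the driver's probe callback runs::

	/* Stand-in that only tracks the runtime PM usage counter. */
	struct device {
		int usage_count;
	};

	/* Stand-in: increment the counter without resuming the device. */
	static void pm_runtime_get_noresume(struct device *dev)
	{
		dev->usage_count++;
	}

	/* Stand-in: decrement the counter without running an idle check. */
	static void pm_runtime_put_noidle(struct device *dev)
	{
		if (dev->usage_count > 0)
			dev->usage_count--;
	}

	/* Hypothetical probe callback of a driver that uses runtime PM. */
	static int foo_probe(struct device *dev)
	{
		/* ... device setup would go here ... */
		pm_runtime_put_noidle(dev);	/* drop the pre-probe reference */
		return 0;
	}

	/* Hypothetical remove callback balancing the put in foo_probe(). */
	static void foo_remove(struct device *dev)
	{
		pm_runtime_get_noresume(dev);
		/* ... device teardown would go here ... */
	}

	/* Model of the increment done before the probe callback is called. */
	static int model_core_probe(struct device *dev)
	{
		pm_runtime_get_noresume(dev);
		return foo_probe(dev);
	}

After model_core_probe() the counter is zero, so the device may be
runtime-suspended once that is allowed, and after foo_remove() the counter is
back at the value it had right before the probe callback was called.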
1099
The runtime PM framework works by processing requests to suspend or resume
devices, or to check if they are idle (in which case it is reasonable to
subsequently request that they be suspended).  These requests are represented
by work items put into the power management workqueue, pm_wq.  Although there
are a few situations in which power management requests are automatically
queued by the PM core (for example, after processing a request to resume a
device the PM core automatically queues a request to check if the device is
idle), device drivers are generally responsible for queuing power management
requests for their devices.  For this purpose they should use the runtime PM
helper functions provided by the PM core, discussed in
Documentation/power/runtime_pm.rst.
1111
1112Devices can also be suspended and resumed synchronously, without placing a
1113request into pm_wq.  In the majority of cases this also is done by their
1114drivers that use helper functions provided by the PM core for this purpose.
1115
1116For more information on the runtime PM of devices refer to
1117Documentation/power/runtime_pm.rst.
1118
1119
11204. Resources
1121============
1122
1123PCI Local Bus Specification, Rev. 3.0
1124
1125PCI Bus Power Management Interface Specification, Rev. 1.2
1126
1127Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b
1128
1129PCI Express Base Specification, Rev. 2.0
1130
1131Documentation/driver-api/pm/devices.rst
1132
1133Documentation/power/runtime_pm.rst