=================
Concepts overview
=================

Memory management in Linux is a complex system that has evolved over the
years and has gained more and more functionality to support a variety of
systems, from MMU-less microcontrollers to supercomputers. The memory
management for systems without an MMU is called ``nommu`` and it
definitely deserves a dedicated document, which hopefully will be
eventually written. Yet, although some of the concepts are the same,
here we assume that an MMU is available and a CPU can translate a virtual
address to a physical address.

.. contents:: :local:

Virtual Memory Primer
=====================

The physical memory in a computer system is a limited resource and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is not
necessarily contiguous; it might be accessible as a set of distinct
address ranges. Besides, different CPU architectures, and even
different implementations of the same architecture, have different views
of how these address ranges are defined.

All this makes dealing directly with physical memory quite complex, and
to avoid this complexity the concept of virtual memory was developed.

Virtual memory abstracts the details of physical memory from the
application software, allows keeping only the needed information in
physical memory (demand paging), and provides a mechanism for the
protection and controlled sharing of data between processes.

With virtual memory, each and every memory access uses a virtual
address. When the CPU decodes an instruction that reads (or
writes) from (or to) the system memory, it translates the `virtual`
address encoded in that instruction to a `physical` address that the
memory controller can understand.

The physical system memory is divided into page frames, or pages. The
size of each page is architecture-specific. Some architectures allow
selection of the page size from several supported values; this
selection is performed at kernel build time by setting an
appropriate kernel configuration option.

Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
translation from a virtual address used by programs to the physical
memory address. The page tables are organized hierarchically.

The tables at the lowest level of the hierarchy contain physical
addresses of actual pages used by the software. The tables at higher
levels contain physical addresses of the pages belonging to the lower
levels. The pointer to the top-level page table resides in a
register. When the CPU performs the address translation, it uses this
register to access the top-level page table. The high bits of the
virtual address are used to index an entry in the top-level page
table. That entry is then used to access the next level in the
hierarchy, with the next bits of the virtual address as the index into
that level's page table. The lowest bits in the virtual address define
the offset inside the actual page.

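As an illustration of how a virtual address is carved into table indices
and a page offset, here is a minimal sketch. It assumes 4 KiB pages and
four table levels with 512 entries each (a common x86-64 configuration);
the constants and the ``split_virtual_address`` helper are hypothetical
and do not correspond to kernel code.

```python
# Assumed layout: 4 KiB pages, four levels, 9 index bits per level.
# Other architectures and configurations use different values.
PAGE_SHIFT = 12
BITS_PER_LEVEL = 9
LEVELS = 4


def split_virtual_address(vaddr):
    """Return ([top-level index, ..., lowest-level index], page offset)."""
    # The lowest PAGE_SHIFT bits are the offset inside the page.
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    indices = []
    # Walk from the highest level down, taking the next group of bits each time.
    for level in range(LEVELS - 1, -1, -1):
        shift = PAGE_SHIFT + level * BITS_PER_LEVEL
        indices.append((vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1))
    return indices, offset
```

Each index selects one entry in the table of the corresponding level;
only the final lookup yields the physical page to which the offset is
added.
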
Huge Pages
==========

The address translation requires several memory accesses, and memory
accesses are slow relative to CPU speed. To avoid spending precious
processor cycles on the address translation, CPUs maintain a cache of
such translations called the Translation Lookaside Buffer (or
TLB). Usually the TLB is a pretty scarce resource and applications with
a large memory working set will experience a performance hit because of
TLB misses.

Many modern CPU architectures allow mapping of the memory pages
directly by the higher levels in the page table. For instance, on x86,
it is possible to map 2M and even 1G pages using entries in the second
and the third level page tables. In Linux such pages are called
`huge`. Usage of huge pages significantly reduces pressure on the TLB,
improves the TLB hit rate and thus improves overall system performance.

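The benefit can be estimated with simple arithmetic: the amount of
memory a TLB can cover at once is the number of entries multiplied by
the page size. The sketch below uses a hypothetical 64-entry TLB; real
TLBs have several levels and per-page-size entry counts, so the numbers
are illustrative only.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_size):
    """Bytes of memory covered when every TLB entry maps one page."""
    return entries * page_size


# Hypothetical TLB size, chosen only for the example.
ENTRIES = 64
reach_4k = tlb_reach(ENTRIES, 4 * KIB)   # base pages
reach_2m = tlb_reach(ENTRIES, 2 * MIB)   # 2M huge pages
```

With these assumed numbers the same 64 entries cover 256 KiB with base
pages and 128 MiB with 2M huge pages.
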
There are two mechanisms in Linux that enable mapping of the physical
memory with huge pages. The first one is the `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem the data resides in
memory and is mapped using huge pages. The hugetlbfs is described at
Documentation/admin-guide/mm/hugetlbpage.rst.

Another, more recent, mechanism that enables use of huge pages is
called `Transparent HugePages`, or THP. Unlike hugetlbfs, which
requires users and/or system administrators to configure what parts of
the system memory should and can be mapped by huge pages, THP
manages such mappings transparently to the user, hence the
name. See Documentation/admin-guide/mm/transhuge.rst for more details
about THP.

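On systems with THP support, the current policy is exposed in
``/sys/kernel/mm/transparent_hugepage/enabled`` as a list of modes with
the active one in square brackets. A minimal sketch for extracting the
active mode from such a string follows; the ``active_thp_mode`` helper
is hypothetical.

```python
import re


def active_thp_mode(text):
    """Return the bracketed (active) mode, or None if none is marked."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None
```

For example, passing the file contents ``"always [madvise] never"``
returns ``"madvise"``. Reading the file itself is left to the caller,
since it is absent on kernels built without THP.
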
Zones
=====

Often hardware poses restrictions on how different physical memory
ranges can be accessed. In some cases, devices cannot perform DMA to
all the addressable memory. In other cases, the size of the physical
memory exceeds the maximal addressable size of virtual memory and
special actions are required to access portions of the memory. Linux
groups memory pages into `zones` according to their possible
usage. For example, ZONE_DMA will contain memory that can be used by
devices for DMA, ZONE_HIGHMEM will contain memory that is not
permanently mapped into the kernel's address space and ZONE_NORMAL will
contain normally addressed pages.

The actual layout of the memory zones is hardware dependent as not all
architectures define all zones, and requirements for DMA are different
for different platforms.

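The zones present on a running system can be inspected through
``/proc/zoneinfo``, where each zone section starts with a header line of
the form ``Node <n>, zone <name>``. The following sketch collects zone
names per node from such text; the ``zones_by_node`` helper is
hypothetical and ignores all other fields.

```python
import re


def zones_by_node(zoneinfo_text):
    """Map node id -> list of zone names, in the order they appear."""
    zones = {}
    for line in zoneinfo_text.splitlines():
        match = re.match(r"Node (\d+), zone\s+(\S+)", line)
        if match:
            zones.setdefault(int(match.group(1)), []).append(match.group(2))
    return zones
```

Feeding it the contents of ``/proc/zoneinfo`` shows which zones a given
architecture and machine actually define.
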
Nodes
=====

Many multi-processor machines are NUMA - Non-Uniform Memory Access -
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
processor. Each bank is referred to as a `node` and for each node Linux
constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
Documentation/mm/numa.rst and in
Documentation/admin-guide/mm/numa_memory_policy.rst.

Page cache
==========

The physical memory is volatile and the common case for getting data
into the memory is to read it from files. Whenever a file is read, the
data is put into the `page cache` to avoid expensive disk access on
the subsequent reads. Similarly, when one writes to a file, the data
is placed in the page cache and eventually gets into the backing
storage device. The written pages are marked as `dirty` and when Linux
decides to reuse them for other purposes, it makes sure to synchronize
the file contents on the device with the updated data.

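The size of the page cache and the amount of dirty data waiting for
writeback are reported in ``/proc/meminfo``. The sketch below extracts a
few of these fields from text in that format; the ``meminfo_kib`` helper
and its default field list are hypothetical.

```python
def meminfo_kib(meminfo_text, fields=("Cached", "Dirty", "Writeback")):
    """Return {field: value in KiB} for the requested /proc/meminfo fields."""
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in fields:
            # Lines look like "Dirty:   64 kB"; the first token is the number.
            values[name] = int(rest.split()[0])
    return values
```

Sampling these values before and after writing a large file shows
``Dirty`` growing and then shrinking as the data reaches the storage
device.
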
Anonymous Memory
================

The `anonymous memory` or `anonymous mappings` represent memory that
is not backed by a filesystem. Such mappings are implicitly created
for a program's stack and heap or by explicit calls to the mmap(2) system
call. Usually, the anonymous mappings only define virtual memory areas
that the program is allowed to access. The read accesses will result
in creation of a page table entry that references a special physical
page filled with zeroes. When the program performs a write, a regular
physical page will be allocated to hold the written data. The page
will be marked dirty and if the kernel decides to repurpose it,
the dirty page will be swapped out.

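From user space this behaviour is visible only as "fresh anonymous
memory reads as zeroes". A minimal sketch using Python's ``mmap`` module
on a Unix system is shown below; the ``touch_anonymous_mapping`` helper
is hypothetical, and whether the kernel uses the shared zero page for a
particular read is not observable from this code.

```python
import mmap

PAGE = mmap.PAGESIZE


def touch_anonymous_mapping(pages=4):
    """Create a private anonymous mapping, read it, then write to it."""
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    with mmap.mmap(-1, pages * PAGE, flags=flags) as region:
        # Reading untouched anonymous memory yields zeroes.
        before = region[:PAGE]
        # The first write makes the kernel back the page with real memory.
        region[0:5] = b"hello"
        after = region[:5]
    return before, after
```

The mapping has no file behind it, so once written its contents can only
be preserved by swapping the page out.
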
Reclaim
=======

Throughout the system lifetime, a physical page can be used for storing
different types of data. It can be kernel internal data structures,
DMA'able buffers for device drivers' use, data read from a filesystem,
memory allocated by user space processes, etc.

Depending on the page usage, it is treated differently by the Linux
memory management. The pages that can be freed at any time, either
because they cache the data available elsewhere, for instance, on a
hard disk, or because they can be swapped out, again, to the hard
disk, are called `reclaimable`. The most notable categories of the
reclaimable pages are page cache and anonymous memory.

In most cases, the pages holding internal kernel data and used as DMA
buffers cannot be repurposed, and they remain pinned until freed by
their user. Such pages are called `unreclaimable`. However, in certain
circumstances, even pages occupied with kernel data structures can be
reclaimed. For instance, in-memory caches of filesystem metadata can
be re-read from the storage device and therefore it is possible to
discard them from the main memory when the system is under memory
pressure.

The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
of the system. When the system is not loaded, most of the memory is free
and allocation requests will be satisfied immediately from the free
pages supply. As the load increases, the amount of free pages goes
down and when it reaches a certain threshold (low watermark), an
allocation request will awaken the ``kswapd`` daemon. It will
asynchronously scan memory pages and either just free them if the data
they contain is available elsewhere, or evict them to the backing storage
device (remember those dirty pages?). As memory usage increases even
more and reaches another threshold - min watermark - an allocation
will trigger `direct reclaim`. In this case allocation is stalled
until enough memory pages are reclaimed to satisfy the request.

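The two thresholds can be summarized as a small decision function. This
is a deliberately simplified sketch of the description above, not the
kernel's allocator logic: the ``reclaim_action`` helper and its
parameter names are hypothetical, and the real checks are made per zone
and depend on the allocation being attempted.

```python
def reclaim_action(free_pages, min_wmark, low_wmark):
    """Return what an allocation request would trigger in this toy model."""
    if free_pages <= min_wmark:
        # Below the min watermark the allocating task reclaims synchronously.
        return "direct reclaim"
    if free_pages <= low_wmark:
        # Below the low watermark kswapd is woken to reclaim in the background.
        return "wake kswapd"
    return "allocate"
```

With assumed watermarks of 100 and 250 pages, a request made with 1000
free pages is simply satisfied, one made with 200 free pages wakes
``kswapd``, and one made with 50 free pages stalls in direct reclaim.
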
Compaction
==========

As the system runs, tasks allocate and free the memory and it becomes
fragmented. Although with virtual memory it is possible to present
scattered physical pages as a virtually contiguous range, sometimes it is
necessary to allocate large physically contiguous memory areas. Such
need may arise, for instance, when a device driver requires a large
buffer for DMA, or when THP allocates a huge page. Memory `compaction`
addresses the fragmentation issue. This mechanism moves occupied pages
from the lower part of a memory zone to free pages in the upper part
of the zone. When a compaction scan is finished, free pages are grouped
together at the beginning of the zone and allocations of large
physically contiguous areas become possible.

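The idea can be shown with a toy model in which a zone is a list and two
scanners move towards each other: one looks for occupied pages from the
bottom, the other for free pages from the top. The ``compact`` function
below is a hypothetical illustration; it ignores unmovable pages, page
blocks and everything else the kernel has to deal with.

```python
def compact(zone):
    """zone: list where None marks a free page and anything else a used page."""
    zone = list(zone)
    migrate, free = 0, len(zone) - 1
    while True:
        # Migration scanner: find the next used page from the bottom.
        while migrate < free and zone[migrate] is None:
            migrate += 1
        # Free scanner: find the next free page from the top.
        while migrate < free and zone[free] is not None:
            free -= 1
        if migrate >= free:
            break  # the scanners met, the scan is finished
        # Move the used page into the free slot near the top.
        zone[free], zone[migrate] = zone[migrate], None
    return zone
```

After the scan the free entries form one contiguous run at the start of
the list, which is what makes a large contiguous allocation possible.
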
Like reclaim, the compaction may happen asynchronously in the ``kcompactd``
daemon or synchronously as a result of a memory allocation request.

OOM killer
==========

It is possible that on a loaded machine memory will be exhausted and the
kernel will be unable to reclaim enough memory to continue to operate. In
order to save the rest of the system, the kernel invokes the `OOM killer`.

The `OOM killer` selects a task to sacrifice for the sake of the overall
system health. The selected task is killed in the hope that after it exits
enough memory will be freed to continue normal operation.
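
The selection can be pictured as choosing the task with the highest
"badness" score. The sketch below is only a toy stand-in for that step:
the ``pick_oom_victim`` helper is hypothetical, and it assumes the
caller has already gathered per-task scores, for example from
``/proc/<pid>/oom_score``.

```python
def pick_oom_victim(scores):
    """scores: mapping of pid -> badness score; returns the pid to kill."""
    if not scores:
        return None
    # The task with the highest score is the one sacrificed in this model.
    return max(scores, key=scores.get)
```

For instance, given scores ``{100: 12, 200: 740, 300: 3}`` the function
returns ``200``.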
