Commit fbc90c0

Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

- In the series "mm: Avoid possible overflows in dirty throttling" Jan Kara addresses a couple of issues in the writeback throttling code. These fixes are also targeted at -stable kernels.
- Ryusuke Konishi's series "nilfs2: fix potential issues related to reserved inodes" does that. This should actually be in the mm-nonmm-stable tree, along with the many other nilfs2 patches. My bad.
- More folio conversions from Kefeng Wang in the series "mm: convert to folio_alloc_mpol()"
- Kemeng Shi has sent some cleanups to the writeback code in the series "Add helper functions to remove repeated code and improve readability of cgroup writeback"
- Kairui Song has made the swap code a little smaller and a little faster in the series "mm/swap: clean up and optimize swap cache index".
- In the series "mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David Hildenbrand has reworked the rather sketchy handling of the use of the zeropage in MAP_SHARED mappings. I don't see any runtime effects here - more a cleanup/understandability/maintainability thing.
- Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of higher addresses, for aarch64. The (poorly named) series is "Restructure va_high_addr_switch".
- The core TLB handling code gets some cleanups and possible slight optimizations in Bang Li's series "Add update_mmu_tlb_range() to simplify code".
- Jane Chu has improved the handling of our fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the series "Enhance soft hwpoison handling and injection".
- Jeff Johnson has sent a billion patches everywhere to add MODULE_DESCRIPTION() to everything. Some landed in this pull.
- In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has simplified migration's use of hardware-offload memory copying.
- Yosry Ahmed performs more folio API conversions in his series "mm: zswap: trivial folio conversions".
- In the series "large folios swap-in: handle refault cases first", Chuanhua Han inches us forward in the handling of large pages in the swap code. This is a cleanup and optimization, working toward the end objective of full support of large folio swapin/out.
- In the series "mm,swap: cleanup VMA based swap readahead window calculation", Huang Ying has contributed some cleanups and a possible fixlet to his VMA based swap readahead code.
- In the series "add mTHP support for anonymous shmem" Baolin Wang has taught anonymous shmem mappings to use multisize THP. By default this is a no-op - users must opt in via sysfs controls. Dramatic improvements in pagefault latency are realized.
- David Hildenbrand has some cleanups to our remaining use of page_mapcount() in the series "fs/proc: move page_mapcount() to fs/proc/internal.h".
- David also has some highmem accounting cleanups in the series "mm/highmem: don't track highmem pages manually".
- Build-time fixes and cleanups from John Hubbard in the series "cleanups, fixes, and progress towards avoiding "make headers"".
- Cleanups and consolidation of the core pagemap handling from Barry Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers and utilize them".
- Lance Yang's series "Reclaim lazyfree THP without splitting" has reduced the latency of the reclaim of pmd-mapped THPs under fairly common circumstances. A 10x speedup is seen in a microbenchmark. It does this by punting to another CPU but I guess that's a win unless all CPUs are pegged.
- hugetlb_cgroup cleanups from Xiu Jianfeng in the series "mm/hugetlb_cgroup: rework on cftypes".
- Miaohe Lin's series "Some cleanups for memory-failure" does just that thing.
- Someone other than SeongJae has developed a DAMON feature in Honggyu Kim's series "DAMON based tiered memory management for CXL memory". This adds DAMON features which may be used to help determine the efficiency of our placement of CXL/PCIe attached DRAM.
- DAMON user API centralization and simplification work in SeongJae Park's series "mm/damon: introduce DAMON parameters online commit function".
- In the series "mm: page_type, zsmalloc and page_mapcount_reset()" David Hildenbrand does some maintenance work on zsmalloc - partially modernizing its use of pageframe fields.
- Kefeng Wang provides more folio conversions in the series "mm: remove page_maybe_dma_pinned() and page_mkclean()".
- More cleanup from David Hildenbrand, this time in the series "mm/memory_hotplug: use PageOffline() instead of PageReserved() for !ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline() pages" and permits the removal of some virtio-mem hacks.
- Barry Song's series "mm: clarify folio_add_new_anon_rmap() and __folio_add_anon_rmap()" is a cleanup to the anon folio handling in preparation for mTHP (multisize THP) swapin.
- Kefeng Wang's series "mm: improve clear and copy user folio" implements more folio conversions, this time in the area of large folio userspace copying.
- The series "Docs/mm/damon/maintaier-profile: document a mailing tool and community meetup series" tells people how to get better involved with other DAMON developers. From SeongJae Park.
- A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does that.
- David Hildenbrand sends along more cleanups, this time against the migration code. The series is "mm/migrate: move NUMA hinting fault folio isolation + checks under PTL".
- Jan Kara has found quite a lot of strangenesses and minor errors in the readahead code. He addresses this in the series "mm: Fix various readahead quirks".
- SeongJae Park's series "selftests/damon: test DAMOS tried regions and {min,max}_nr_regions" adds features and addresses errors in DAMON's self testing code.
- Gavin Shan has found a userspace-triggerable WARN in the pagecache code. The series "mm/filemap: Limit page cache size to that supported by xarray" addresses this. The series is marked cc:stable.
- Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations and cleanup" cleans up and slightly optimizes KSM.
- Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of code motion. The series (which also makes the memcg-v1 code Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put under config option" and "mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1"
- Dan Schatzberg's series "Add swappiness argument to memory.reclaim" adds an additional feature to this cgroup-v2 control file.
- The series "Userspace controls soft-offline pages" from Jiaqi Yan permits userspace to stop the kernel's automatic treatment of excessive correctable memory errors, in order to permit userspace to monitor and handle this situation.
- Kefeng Wang's series "mm: migrate: support poison recover from migrate folio" teaches the kernel to appropriately handle migration from poisoned source folios rather than simply panicking.
- SeongJae Park's series "Docs/damon: minor fixups and improvements" does those things.
- In the series "mm/zsmalloc: change back to per-size_class lock" Chengming Zhou improves zsmalloc's scalability and memory utilization.
- Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare refcount increments. So these pages can first be moved aside if they reside in the movable zone or a CMA block.
- Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps for much faster reading of vma information. The series is "query VMAs from /proc/<pid>/maps".
- In the series "mm: introduce per-order mTHP split counters" Lance Yang improves the kernel's presentation of developer information related to multisize THP splitting.
- Michael Ellerman has developed the series "Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)". This permits userspace to use all available huge page sizes.
- In the series "revert unconditional slab and page allocator fault injection calls" Vlastimil Babka removes a performance-affecting and not very useful feature from slab fault injection.

* tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits)
  mm/mglru: fix ineffective protection calculation
  mm/zswap: fix a white space issue
  mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
  mm/hugetlb: fix possible recursive locking detected warning
  mm/gup: clear the LRU flag of a page before adding to LRU batch
  mm/numa_balancing: teach mpol_to_str about the balancing mode
  mm: memcg1: convert charge move flags to unsigned long long
  alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting
  lib: reuse page_ext_data() to obtain codetag_ref
  lib: add missing newline character in the warning message
  mm/mglru: fix overshooting shrinker memory
  mm/mglru: fix div-by-zero in vmpressure_calc_level()
  mm/kmemleak: replace strncpy() with strscpy()
  mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
  mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
  mm: ignore data-race in __swap_writepage
  hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr
  mm: shmem: rename mTHP shmem counters
  mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()
  mm/migrate: putback split folios when numa hint migration fails
  ...

2 parents 7846b61 + 30d77b7 commit fbc90c0

328 files changed, +12964 -9724 lines changed

Documentation/ABI/testing/sysfs-kernel-mm-damon

Lines changed: 6 additions & 0 deletions
@@ -155,6 +155,12 @@ Contact: SeongJae Park <[email protected]>
 Description: Writing to and reading from this file sets and gets the action
 of the scheme.

+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/target_nid
+Date: Jun 2024
+Contact: SeongJae Park <[email protected]>
+Description: Action's target NUMA node id. Supported by only relevant
+actions.
+
 What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/apply_interval_us
 Date: Sep 2023
 Contact: SeongJae Park <[email protected]>

Documentation/admin-guide/cgroup-v2.rst

Lines changed: 11 additions & 7 deletions
@@ -1306,17 +1306,10 @@ PAGE_SIZE multiple when read back.
 This is a simple interface to trigger memory reclaim in the
 target cgroup.

-This file accepts a single key, the number of bytes to reclaim.
-No nested keys are currently supported.
-
 Example::

 echo "1G" > memory.reclaim

-The interface can be later extended with nested keys to
-configure the reclaim behavior. For example, specify the
-type of memory to reclaim from (anon, file, ..).
-
 Please note that the kernel can over or under reclaim from
 the target cgroup. If less bytes are reclaimed than the
 specified amount, -EAGAIN is returned.
@@ -1328,6 +1321,17 @@ PAGE_SIZE multiple when read back.
 This means that the networking layer will not adapt based on
 reclaim induced by memory.reclaim.

+The following nested keys are defined.
+
+========== ================================
+swappiness Swappiness value to reclaim with
+========== ================================
+
+Specifying a swappiness value instructs the kernel to perform
+the reclaim with that swappiness value. Note that this has the
+same semantics as vm.swappiness applied to memcg reclaim with
+all the existing limitations and potential future extensions.
+
 memory.peak
 A read-only single value file which exists on non-root
 cgroups.
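
As a minimal sketch of the nested key documented in the hunk above, the Python helper below builds a `memory.reclaim` request and writes it into a cgroup directory. The `swappiness=<value>` spelling of the nested key, the 0 to 200 range check, and the helper names are assumptions made for illustration; the hunk itself only names the key.

```python
import os
from typing import Optional


def build_reclaim_request(amount: str, swappiness: Optional[int] = None) -> str:
    """Return the string to write to memory.reclaim.

    Assumption: the nested key is written as "swappiness=<value>".
    """
    if swappiness is None:
        return amount
    # Assumption: the accepted range matches vm.swappiness (0..200).
    if not 0 <= swappiness <= 200:
        raise ValueError("swappiness outside the assumed range 0..200")
    return f"{amount} swappiness={swappiness}"


def request_reclaim(cgroup_dir: str, amount: str,
                    swappiness: Optional[int] = None) -> None:
    """Write a reclaim request to <cgroup_dir>/memory.reclaim."""
    path = os.path.join(cgroup_dir, "memory.reclaim")
    with open(path, "w") as f:
        f.write(build_reclaim_request(amount, swappiness))
```

A call such as `request_reclaim("/sys/fs/cgroup/workload", "1G", 60)` (the cgroup path is hypothetical) would ask for 1G to be reclaimed with swappiness 60. Because the kernel returns -EAGAIN when fewer bytes are reclaimed than requested, a caller should expect an `OSError` in that case.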

Documentation/admin-guide/kernel-parameters.txt

Lines changed: 6 additions & 3 deletions
@@ -7239,9 +7239,12 @@

 vmalloc=nn[KMG] [KNL,BOOT,EARLY] Forces the vmalloc area to have an
 exact size of <nn>. This can be used to increase
-the minimum size (128MB on x86). It can also be
-used to decrease the size and leave more room
-for directly mapped kernel RAM.
+the minimum size (128MB on x86, arm32 platforms).
+It can also be used to decrease the size and leave more room
+for directly mapped kernel RAM. Note that this parameter does
+not exist on many other platforms (including arm64, alpha,
+loongarch, arc, csky, hexagon, microblaze, mips, nios2, openrisc,
+parisc, m64k, powerpc, riscv, sh, um, xtensa, s390, sparc).

 vmcp_cma=nn[MG] [KNL,S390,EARLY]
 Sets the memory size reserved for contiguous memory

Documentation/admin-guide/mm/damon/start.rst

Lines changed: 42 additions & 4 deletions
@@ -34,18 +34,56 @@ detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is
 mounted.


+Snapshot Data Access Patterns
+=============================
+
+The commands below show the memory access pattern of a program at the moment of
+the execution. ::
+
+$ git clone https://github.com/sjp38/masim; cd masim; make
+$ sudo damo start "./masim ./configs/stairs.cfg --quiet"
+$ sudo ./damo show
+0 addr [85.541 TiB , 85.541 TiB ) (57.707 MiB ) access 0 % age 10.400 s
+1 addr [85.541 TiB , 85.542 TiB ) (413.285 MiB) access 0 % age 11.400 s
+2 addr [127.649 TiB , 127.649 TiB) (57.500 MiB ) access 0 % age 1.600 s
+3 addr [127.649 TiB , 127.649 TiB) (32.500 MiB ) access 0 % age 500 ms
+4 addr [127.649 TiB , 127.649 TiB) (9.535 MiB ) access 100 % age 300 ms
+5 addr [127.649 TiB , 127.649 TiB) (8.000 KiB ) access 60 % age 0 ns
+6 addr [127.649 TiB , 127.649 TiB) (6.926 MiB ) access 0 % age 1 s
+7 addr [127.998 TiB , 127.998 TiB) (120.000 KiB) access 0 % age 11.100 s
+8 addr [127.998 TiB , 127.998 TiB) (8.000 KiB ) access 40 % age 100 ms
+9 addr [127.998 TiB , 127.998 TiB) (4.000 KiB ) access 0 % age 11 s
+total size: 577.590 MiB
+$ sudo ./damo stop
+
+The first command of the above example downloads and builds an artificial
+memory access generator program called ``masim``. The second command asks DAMO
+to execute the artificial generator process start via the given command and
+make DAMON monitors the generator process. The third command retrieves the
+current snapshot of the monitored access pattern of the process from DAMON and
+shows the pattern in a human readable format.
+
+Each line of the output shows which virtual address range (``addr [XX, XX)``)
+of the process is how frequently (``access XX %``) accessed for how long time
+(``age XX``). For example, the fifth region of ~9 MiB size is being most
+frequently accessed for last 300 milliseconds. Finally, the fourth command
+stops DAMON.
+
+Note that DAMON can monitor not only virtual address spaces but multiple types
+of address spaces including the physical address space.
+
+
 Recording Data Access Patterns
 ==============================

 The commands below record the memory access patterns of a program and save the
 monitoring results to a file. ::

-$ git clone https://github.com/sjp38/masim
-$ cd masim; make; ./masim ./configs/zigzag.cfg &
+$ ./masim ./configs/zigzag.cfg &
 $ sudo damo record -o damon.data $(pidof masim)

-The first two lines of the commands download an artificial memory access
-generator program and run it in the background. The generator will repeatedly
+The line of the commands run the artificial memory access
+generator program again. The generator will repeatedly
 access two 100 MiB sized memory regions one by one. You can substitute this
 with your real workload. The last line asks ``damo`` to record the access
 pattern in the ``damon.data`` file.
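
The snapshot output added in this hunk can also be post-processed by a script. The following is a rough sketch that parses lines in the layout shown in the example (`<index> addr [<start> , <end>) (<size>) access <N> % age <age>`) and picks the region with the highest access percentage. The regular expression and function names are illustrative assumptions, and the real `damo show` output may differ between versions.

```python
import re

# One region line, assuming the layout of the example output above:
#   <index> addr [<start> , <end>) (<size>) access <N> % age <age>
REGION_RE = re.compile(
    r"^\s*(?P<index>\d+)\s+addr\s+\[\s*(?P<start>[^,]+?)\s*,\s*(?P<end>[^)]+?)\s*\)"
    r"\s+\(\s*(?P<size>[^)]+?)\s*\)\s+access\s+(?P<access>\d+)\s*%"
    r"\s+age\s+(?P<age>.+?)\s*$"
)


def parse_regions(output):
    """Return one dict per region line; other lines are skipped."""
    regions = []
    for line in output.splitlines():
        m = REGION_RE.match(line)
        if m is None:
            continue  # e.g. shell prompts or the "total size" line
        region = m.groupdict()
        region["index"] = int(region["index"])
        region["access"] = int(region["access"])
        regions.append(region)
    return regions


def hottest_region(regions):
    """Return the region with the highest access percentage.

    Raises ValueError if the list is empty.
    """
    return max(regions, key=lambda r: r["access"])
```

Applied to the example output, `hottest_region(parse_regions(text))` selects the region with index 4, which is the "fifth region of ~9 MiB size" that the added documentation text refers to.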

Documentation/admin-guide/mm/damon/usage.rst

Lines changed: 7 additions & 3 deletions
@@ -78,7 +78,7 @@ comma (",").
 │ │ │ │ │ │ │ │ ...
 │ │ │ │ │ │ ...
 │ │ │ │ │ :ref:`schemes <sysfs_schemes>`/nr_schemes
-│ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,apply_interval_us
+│ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,target_nid,apply_interval_us
 │ │ │ │ │ │ │ :ref:`access_pattern <sysfs_access_pattern>`/
 │ │ │ │ │ │ │ │ sz/min,max
 │ │ │ │ │ │ │ │ nr_accesses/min,max
@@ -289,14 +289,18 @@ schemes/<N>/
 ------------

 In each scheme directory, five directories (``access_pattern``, ``quotas``,
-``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and two files
-(``action`` and ``apply_interval``) exist.
+``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and three files
+(``action``, ``target_nid`` and ``apply_interval``) exist.

 The ``action`` file is for setting and getting the scheme's :ref:`action
 <damon_design_damos_action>`. The keywords that can be written to and read
 from the file and their meaning are same to those of the list on
 :ref:`design doc <damon_design_damos_action>`.

+The ``target_nid`` file is for setting the migration target node, which is
+only meaningful when the ``action`` is either ``migrate_hot`` or
+``migrate_cold``.
+
 The ``apply_interval_us`` file is for setting and getting the scheme's
 :ref:`apply_interval <damon_design_damos>` in microseconds.
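
To illustrate how the new ``target_nid`` file sits next to ``action``, here is a small sketch that builds a scheme directory path from the sysfs layout named in this diff and writes both files. The helper names are hypothetical, and the sketch covers only these two files; the rest of the DAMON sysfs workflow (other scheme parameters, turning the kdamond on) is outside its scope.

```python
import os

DAMON_SYSFS_ROOT = "/sys/kernel/mm/damon/admin"


def scheme_dir(kdamond, context, scheme, root=DAMON_SYSFS_ROOT):
    """Build <root>/kdamonds/<K>/contexts/<C>/schemes/<S>."""
    return os.path.join(root, "kdamonds", str(kdamond),
                        "contexts", str(context),
                        "schemes", str(scheme))


def set_migration_scheme(sdir, action, target_nid):
    """Write the action and target_nid files of one scheme directory.

    target_nid is documented as meaningful only for migrate_hot and
    migrate_cold, so other actions are rejected here.
    """
    if action not in ("migrate_hot", "migrate_cold"):
        raise ValueError("target_nid only applies to migrate_hot/migrate_cold")
    for name, value in (("action", action), ("target_nid", str(target_nid))):
        with open(os.path.join(sdir, name), "w") as f:
            f.write(value)
```

For instance, `set_migration_scheme(scheme_dir(0, 0, 0), "migrate_cold", 1)` would request that cold regions be migrated to NUMA node 1; the node id is an arbitrary example value.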

Documentation/admin-guide/mm/pagemap.rst

Lines changed: 2 additions & 23 deletions
@@ -118,7 +118,7 @@ Short descriptions to the page flags
 21 - KSM
 Identical memory pages dynamically shared between one or more processes.
 22 - THP
-Contiguous pages which construct transparent hugepages.
+Contiguous pages which construct THP of any size and mapped by any granularity.
 23 - OFFLINE
 The page is logically offline.
 24 - ZERO_PAGE
@@ -173,27 +173,6 @@ LRU related page flags
 The page-types tool in the tools/mm directory can be used to query the
 above flags.

-Using pagemap to do something useful
-====================================
-
-The general procedure for using pagemap to find out about a process' memory
-usage goes like this:
-
-1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
-mapped to what.
-2. Select the maps you are interested in -- all of them, or a particular
-library, or the stack or the heap, etc.
-3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
-4. Read a u64 for each page from pagemap.
-5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you
-just read, seek to that entry in the file, and read the data you want.
-
-For example, to find the "unique set size" (USS), which is the amount of
-memory that a process is using that is not shared with any other process,
-you can go through every map in the process, find the PFNs, look those up
-in kpagecount, and tally up the number of pages that are only referenced
-once.
-
 Exceptions for Shared Memory
 ============================

@@ -252,7 +231,7 @@ Following flags about pages are currently supported:
 - ``PAGE_IS_PRESENT`` - Page is present in the memory
 - ``PAGE_IS_SWAPPED`` - Page is in swapped
 - ``PAGE_IS_PFNZERO`` - Page has zero PFN
-- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed
+- ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed
 - ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty

 The ``struct pm_scan_arg`` is used as the argument of the IOCTL.

Documentation/admin-guide/mm/transhuge.rst

Lines changed: 68 additions & 17 deletions
@@ -202,12 +202,11 @@ PMD-mappable transparent hugepage::

 cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size

-khugepaged will be automatically started when one or more hugepage
-sizes are enabled (either by directly setting "always" or "madvise",
-or by setting "inherit" while the top-level enabled is set to "always"
-or "madvise"), and it'll be automatically shutdown when the last
-hugepage size is disabled (either by directly setting "never", or by
-setting "inherit" while the top-level enabled is set to "never").
+khugepaged will be automatically started when PMD-sized THP is enabled
+(either of the per-size anon control or the top-level control are set
+to "always" or "madvise"), and it'll be automatically shutdown when
+PMD-sized THP is disabled (when both the per-size anon control and the
+top-level control are "never")

 Khugepaged controls
 -------------------
@@ -332,6 +331,31 @@ deny
 force
 Force the huge option on for all - very useful for testing;

+Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to
+control mTHP allocation:
+'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
+and its value for each mTHP is essentially consistent with the global
+setting. An 'inherit' option is added to ensure compatibility with these
+global settings. Conversely, the options 'force' and 'deny' are dropped,
+which are rather testing artifacts from the old ages.
+
+always
+Attempt to allocate <size> huge pages every time we need a new page;
+
+inherit
+Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
+have enabled="inherit" and all other hugepage sizes have enabled="never";
+
+never
+Do not allocate <size> huge pages;
+
+within_size
+Only allocate <size> huge page if it will be fully within i_size.
+Also respect fadvise()/madvise() hints;
+
+advise
+Only allocate <size> huge pages if requested with fadvise()/madvise();
+
 Need of application restart
 ===========================

@@ -344,10 +368,6 @@ also applies to the regions registered in khugepaged.
 Monitoring usage
 ================

-.. note::
-Currently the below counters only record events relating to
-PMD-sized THP. Events relating to other THP sizes are not included.
-
 The number of PMD-sized anonymous transparent huge pages currently used by the
 system is available by reading the AnonHugePages field in ``/proc/meminfo``.
 To identify what applications are using PMD-sized anonymous transparent huge
@@ -392,20 +412,23 @@ thp_collapse_alloc_failed
 the allocation.

 thp_file_alloc
-is incremented every time a file huge page is successfully
-allocated.
+is incremented every time a shmem huge page is successfully
+allocated (Note that despite being named after "file", the counter
+measures only shmem).

 thp_file_fallback
-is incremented if a file huge page is attempted to be allocated
-but fails and instead falls back to using small pages.
+is incremented if a shmem huge page is attempted to be allocated
+but fails and instead falls back to using small pages. (Note that
+despite being named after "file", the counter measures only shmem).

 thp_file_fallback_charge
-is incremented if a file huge page cannot be charged and instead
+is incremented if a shmem huge page cannot be charged and instead
 falls back to using small pages even though the allocation was
-successful.
+successful. (Note that despite being named after "file", the
+counter measures only shmem).

 thp_file_mapped
-is incremented every time a file huge page is mapped into
+is incremented every time a file or shmem huge page is mapped into
 user address space.

 thp_split_page
@@ -476,6 +499,34 @@ swpout_fallback
 Usually because failed to allocate some continuous swap space
 for the huge page.

+shmem_alloc
+is incremented every time a shmem huge page is successfully
+allocated.
+
+shmem_fallback
+is incremented if a shmem huge page is attempted to be allocated
+but fails and instead falls back to using small pages.
+
+shmem_fallback_charge
+is incremented if a shmem huge page cannot be charged and instead
+falls back to using small pages even though the allocation was
+successful.
+
+split
+is incremented every time a huge page is successfully split into
+smaller orders. This can happen for a variety of reasons but a
+common reason is that a huge page is old and is being reclaimed.
+
+split_failed
+is incremented if kernel fails to split huge
+page. This can happen if the page was pinned by somebody.
+
+split_deferred
+is incremented when a huge page is put onto split queue.
+This happens when a huge page is partially unmapped and splitting
+it would free up some memory. Pages on split queue are going to
+be split under memory pressure, if splitting is possible.
+
 As the system ages, allocating huge pages may be expensive as the
 system uses memory compaction to copy data around memory to free a
 huge page for use. There are some counters in ``/proc/vmstat`` to help
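
The per-size ``shmem_enabled`` knob introduced in this file's diff can be inspected with a short script. The sketch below lists every `hugepages-<size>kB/shmem_enabled` file under the THP sysfs directory and reports the active option for each size. It assumes that the file marks the active option with square brackets, as other THP sysfs controls do; that convention and the function names are assumptions, not something stated in the diff.

```python
import glob
import os
import re

THP_ROOT = "/sys/kernel/mm/transparent_hugepage"


def selected_option(contents):
    """Return the bracketed option, e.g. 'never' from 'always [never]'.

    Assumption: the active value is enclosed in square brackets.
    """
    m = re.search(r"\[(\w+)\]", contents)
    if m is None:
        raise ValueError("no bracketed option found")
    return m.group(1)


def shmem_enabled_by_size(root=THP_ROOT):
    """Map each 'hugepages-<size>kB' directory name to its active option."""
    result = {}
    pattern = os.path.join(root, "hugepages-*kB", "shmem_enabled")
    for path in glob.glob(pattern):
        size_dir = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            result[size_dir] = selected_option(f.read())
    return result
```

On a kernel with this series, `shmem_enabled_by_size()` returns a dictionary keyed by directory name, with one of ``always``, ``inherit``, ``never``, ``within_size`` or ``advise`` as the value; on older kernels the files do not exist and the dictionary is empty.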

Documentation/admin-guide/sysctl/vm.rst

Lines changed: 38 additions & 0 deletions
@@ -36,6 +36,7 @@ Currently, these files are in /proc/sys/vm:
 - dirtytime_expire_seconds
 - dirty_writeback_centisecs
 - drop_caches
+- enable_soft_offline
 - extfrag_threshold
 - highmem_is_dirtyable
 - hugetlb_shm_group
@@ -267,6 +268,43 @@ used::
 These are informational only. They do not mean that anything is wrong
 with your system. To disable them, echo 4 (bit 2) into drop_caches.

+enable_soft_offline
+===================
+Correctable memory errors are very common on servers. Soft-offline is kernel's
+solution for memory pages having (excessive) corrected memory errors.
+
+For different types of page, soft-offline has different behaviors / costs.
+
+- For a raw error page, soft-offline migrates the in-use page's content to
+a new raw page.
+
+- For a page that is part of a transparent hugepage, soft-offline splits the
+transparent hugepage into raw pages, then migrates only the raw error page.
+As a result, user is transparently backed by 1 less hugepage, impacting
+memory access performance.
+
+- For a page that is part of a HugeTLB hugepage, soft-offline first migrates
+the entire HugeTLB hugepage, during which a free hugepage will be consumed
+as migration target. Then the original hugepage is dissolved into raw
+pages without compensation, reducing the capacity of the HugeTLB pool by 1.
+
+It is user's call to choose between reliability (staying away from fragile
+physical memory) vs performance / capacity implications in transparent and
+HugeTLB cases.
+
+For all architectures, enable_soft_offline controls whether to soft offline
+memory pages. When set to 1, kernel attempts to soft offline the pages
+whenever it thinks needed. When set to 0, kernel returns EOPNOTSUPP to
+the request to soft offline the pages. Its default value is 1.
+
+It is worth mentioning that after setting enable_soft_offline to 0, the
+following requests to soft offline pages will not be performed:
+
+- Request to soft offline pages from RAS Correctable Errors Collector.
+
+- On ARM, the request to soft offline pages from GHES driver.
+
+- On PARISC, the request to soft offline pages from Page Deallocation Table.

 extfrag_threshold
 =================
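
As a brief sketch of how a monitoring agent might use the new sysctl, the helpers below read and set ``enable_soft_offline`` through its file in /proc/sys/vm, the directory the hunk above lists it under. The function names are hypothetical, and writing the file normally requires root privileges.

```python
SYSCTL_PATH = "/proc/sys/vm/enable_soft_offline"


def soft_offline_enabled(path=SYSCTL_PATH):
    """Return True if the sysctl file currently reads 1."""
    with open(path) as f:
        return f.read().strip() == "1"


def set_soft_offline(enabled, path=SYSCTL_PATH):
    """Write 1 (allow soft offline) or 0 (requests get EOPNOTSUPP)."""
    with open(path, "w") as f:
        f.write("1" if enabled else "0")
```

Calling `set_soft_offline(False)` corresponds to the case described above, in which later soft-offline requests are refused so that userspace can monitor and handle corrected errors itself.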