Commit 98931dd

Merge tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
 "Almost all of MM here. A few things are still getting finished off,
  reviewed, etc.

   - Yang Shi has improved the behaviour of khugepaged collapsing of
     readonly file-backed transparent hugepages.

   - Johannes Weiner has arranged for zswap memory use to be tracked and
     managed on a per-cgroup basis.

   - Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for
     runtime enablement of the recent huge page vmemmap optimization
     feature.

   - Baolin Wang contributes a series to fix some issues around hugetlb
     pagetable invalidation.

   - Zhenwei Pi has fixed some interactions between hwpoisoned pages and
     virtualization.

   - Tong Tiangen has enabled the use of the presently x86-only
     page_table_check debugging feature on arm64 and riscv.

   - David Vernet has done some fixup work on the memcg selftests.

   - Peter Xu has taught userfaultfd to handle write protection faults
     against shmem- and hugetlbfs-backed files.

   - More DAMON development from SeongJae Park - adding online tuning of
     the feature and support for monitoring of fixed virtual address
     ranges. Also easier discovery of which monitoring operations are
     available.

   - Nadav Amit has done some optimization of TLB flushing during
     mprotect().

   - Neil Brown continues to labor away at improving our swap-over-NFS
     support.

   - David Hildenbrand has some fixes to anon page COWing versus
     get_user_pages().

   - Peng Liu fixed some errors in the core hugetlb code.

   - Joao Martins has reduced the amount of memory consumed by
     device-dax's compound devmaps.

   - Some cleanups of the arch-specific pagemap code from Anshuman
     Khandual.

   - Muchun Song has found and fixed some errors in the TLB flushing of
     transparent hugepages.

   - Roman Gushchin has done more work on the memcg selftests.

  ... and, of course, many smaller fixes and cleanups. Notably, the
  customary million cleanup serieses from Miaohe Lin"

* tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits)
  mm: kfence: use PAGE_ALIGNED helper
  selftests: vm: add the "settings" file with timeout variable
  selftests: vm: add "test_hmm.sh" to TEST_FILES
  selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests
  selftests: vm: add migration to the .gitignore
  selftests/vm/pkeys: fix typo in comment
  ksm: fix typo in comment
  selftests: vm: add process_mrelease tests
  Revert "mm/vmscan: never demote for memcg reclaim"
  mm/kfence: print disabling or re-enabling message
  include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace"
  include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion"
  mm: fix a potential infinite loop in start_isolate_page_range()
  MAINTAINERS: add Muchun as co-maintainer for HugeTLB
  zram: fix Kconfig dependency warning
  mm/shmem: fix shmem folio swapoff hang
  cgroup: fix an error handling path in alloc_pagecache_max_30M()
  mm: damon: use HPAGE_PMD_SIZE
  tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
  nodemask.h: fix compilation error with GCC12
  ...
2 parents df202b4 + f403f22 commit 98931dd

240 files changed: +9207 -4664 lines changed

Documentation/ABI/testing/sysfs-kernel-mm-damon

Lines changed: 18 additions & 7 deletions
@@ -23,9 +23,10 @@ Date: Mar 2022
 Contact: SeongJae Park <[email protected]>
 Description: Writing 'on' or 'off' to this file makes the kdamond starts or
 stops, respectively. Reading the file returns the keywords
-based on the current status. Writing 'update_schemes_stats' to
-the file updates contents of schemes stats files of the
-kdamond.
+based on the current status. Writing 'commit' to this file
+makes the kdamond reads the user inputs in the sysfs files
+except 'state' again. Writing 'update_schemes_stats' to the
+file updates contents of schemes stats files of the kdamond.
 
 What: /sys/kernel/mm/damon/admin/kdamonds/<K>/pid
 Date: Mar 2022
@@ -40,14 +41,24 @@ Description: Writing a number 'N' to this file creates the number of
 directories for controlling each DAMON context named '0' to
 'N-1' under the contexts/ directory.
 
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/avail_operations
+Date: Apr 2022
+Contact: SeongJae Park <[email protected]>
+Description: Reading this file returns the available monitoring operations
+sets on the currently running kernel.
+
 What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/operations
 Date: Mar 2022
 Contact: SeongJae Park <[email protected]>
 Description: Writing a keyword for a monitoring operations set ('vaddr' for
-virtual address spaces monitoring, and 'paddr' for the physical
-address space monitoring) to this file makes the context to use
-the operations set. Reading the file returns the keyword for
-the operations set the context is set to use.
+virtual address spaces monitoring, 'fvaddr' for fixed virtual
+address ranges monitoring, and 'paddr' for the physical address
+space monitoring) to this file makes the context to use the
+operations set. Reading the file returns the keyword for the
+operations set the context is set to use.
+
+Note that only the operations sets that listed in
+'avail_operations' file are valid inputs.
 
 What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/intervals/sample_us
 Date: Mar 2022

Documentation/admin-guide/blockdev/zram.rst

Lines changed: 5 additions & 0 deletions
@@ -343,6 +343,11 @@ Admin can request writeback of those idle pages at right timing via::
 
 With the command, zram will writeback idle pages from memory to the storage.
 
+Additionally, if a user choose to writeback only huge and idle pages
+this can be accomplished with::
+
+echo huge_idle > /sys/block/zramX/writeback
+
 If an admin wants to write a specific page in zram device to the backing device,
 they could write a page index into the interface.
 

Documentation/admin-guide/cgroup-v2.rst

Lines changed: 49 additions & 0 deletions
@@ -1208,6 +1208,34 @@ PAGE_SIZE multiple when read back.
 high limit is used and monitored properly, this limit's
 utility is limited to providing the final safety net.
 
+memory.reclaim
+A write-only nested-keyed file which exists for all cgroups.
+
+This is a simple interface to trigger memory reclaim in the
+target cgroup.
+
+This file accepts a single key, the number of bytes to reclaim.
+No nested keys are currently supported.
+
+Example::
+
+echo "1G" > memory.reclaim
+
+The interface can be later extended with nested keys to
+configure the reclaim behavior. For example, specify the
+type of memory to reclaim from (anon, file, ..).
+
+Please note that the kernel can over or under reclaim from
+the target cgroup. If less bytes are reclaimed than the
+specified amount, -EAGAIN is returned.
+
+memory.peak
+A read-only single value file which exists on non-root
+cgroups.
+
+The max memory usage recorded for the cgroup and its
+descendants since the creation of the cgroup.
+
 memory.oom.group
 A read-write single value file which exists on non-root
 cgroups. The default value is "0".
@@ -1326,6 +1354,12 @@ PAGE_SIZE multiple when read back.
 Amount of cached filesystem data that is swap-backed,
 such as tmpfs, shm segments, shared anonymous mmap()s
 
+zswap
+Amount of memory consumed by the zswap compression backend.
+
+zswapped
+Amount of application memory swapped out to zswap.
+
 file_mapped
 Amount of cached filesystem data mapped with mmap()
 
@@ -1516,6 +1550,21 @@ PAGE_SIZE multiple when read back.
 higher than the limit for an extended period of time. This
 reduces the impact on the workload and memory management.
 
+memory.zswap.current
+A read-only single value file which exists on non-root
+cgroups.
+
+The total amount of memory consumed by the zswap compression
+backend.
+
+memory.zswap.max
+A read-write single value file which exists on non-root
+cgroups. The default is "max".
+
+Zswap usage hard limit. If a cgroup's zswap pool reaches this
+limit, it will refuse to take any more stores before existing
+entries fault back in or are written out to disk.
+
 memory.pressure
 A read-only nested-keyed file.
 
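As an illustration of the cgroup v2 files documented in this diff, the following is a minimal Python sketch of a caller that triggers `memory.reclaim` and reads single-value files such as `memory.peak`, `memory.zswap.current`, and `memory.zswap.max`. The function names and the `cgroup_dir` argument (for example a directory under a cgroup2 mount) are assumptions made for the example, not part of the patch. The `-EAGAIN` handling follows the note that the kernel may reclaim fewer bytes than requested.

```python
import errno
from pathlib import Path
from typing import Optional


def reclaim(cgroup_dir: str, amount: str) -> bool:
    """Ask the kernel to reclaim `amount` (e.g. "1G") from the cgroup.

    Returns False when the write fails with EAGAIN, which the documentation
    describes as "fewer bytes reclaimed than requested".
    """
    try:
        # memory.reclaim is write-only and takes a single key: a byte count.
        (Path(cgroup_dir) / "memory.reclaim").write_text(amount)
    except OSError as exc:
        if exc.errno == errno.EAGAIN:
            return False
        raise
    return True


def read_single_value(cgroup_dir: str, name: str) -> Optional[int]:
    """Read a single-value file; the literal "max" is reported as None."""
    text = (Path(cgroup_dir) / name).read_text().strip()
    return None if text == "max" else int(text)
```

A caller could, for instance, invoke `reclaim(path, "1G")` and then compare `read_single_value(path, "memory.zswap.current")` against `read_single_value(path, "memory.zswap.max")`. Writing to these files normally requires appropriate privileges.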

Documentation/admin-guide/kernel-parameters.txt

Lines changed: 5 additions & 5 deletions
@@ -1705,16 +1705,16 @@
 boot-time allocation of gigantic hugepages is skipped.
 
 hugetlb_free_vmemmap=
-[KNL] Reguires CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+[KNL] Reguires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 enabled.
 Allows heavy hugetlb users to free up some more
 memory (7 * PAGE_SIZE for each 2MB hugetlb page).
-Format: { on | off (default) }
+Format: { [oO][Nn]/Y/y/1 | [oO][Ff]/N/n/0 (default) }
 
-on: enable the feature
-off: disable the feature
+[oO][Nn]/Y/y/1: enable the feature
+[oO][Ff]/N/n/0: disable the feature
 
-Built with CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON=y,
+Built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y,
 the default is on.
 
 This is not compatible with memory_hotplug.memmap_on_memory.
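The new `Format:` line for `hugetlb_free_vmemmap=` reads most naturally as a prefix pattern, so that `on` and `off` are still accepted even though only their first two characters are listed. The Python sketch below illustrates one such prefix-style reading. It is an assumption about how the documented pattern is meant to be interpreted, not a copy of the kernel's parser, and the function name is made up for the example.

```python
from typing import Optional


def parse_on_off(value: str) -> Optional[bool]:
    """Interpret a value per '[oO][Nn]/Y/y/1 | [oO][Ff]/N/n/0'.

    Only the first one or two characters are inspected (an assumed,
    prefix-style reading). Returns None for unrecognised input.
    """
    if not value:
        return None
    first = value[0]
    if first in "yY1":
        return True
    if first in "nN0":
        return False
    if first in "oO" and len(value) > 1:
        if value[1] in "nN":
            return True
        if value[1] in "fF":
            return False
    return None
```

Under this reading, `parse_on_off("on")` and `parse_on_off("Y")` enable the feature, while `parse_on_off("OFF")` and `parse_on_off("0")` disable it.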

Documentation/admin-guide/mm/damon/reclaim.rst

Lines changed: 11 additions & 0 deletions
@@ -66,6 +66,17 @@ Setting it as ``N`` disables DAMON_RECLAIM. Note that DAMON_RECLAIM could do
 no real monitoring and reclamation due to the watermarks-based activation
 condition. Refer to below descriptions for the watermarks parameter for this.
 
+commit_inputs
+-------------
+
+Make DAMON_RECLAIM reads the input parameters again, except ``enabled``.
+
+Input parameters that updated while DAMON_RECLAIM is running are not applied
+by default. Once this parameter is set as ``Y``, DAMON_RECLAIM reads values
+of parametrs except ``enabled`` again. Once the re-reading is done, this
+parameter is set as ``N``. If invalid parameters are found while the
+re-reading, DAMON_RECLAIM will be disabled.
+
 min_age
 -------
 

Documentation/admin-guide/mm/damon/usage.rst

Lines changed: 28 additions & 13 deletions
@@ -68,7 +68,7 @@ comma (","). ::
 │ kdamonds/nr_kdamonds
 │ │ 0/state,pid
 │ │ │ contexts/nr_contexts
-│ │ │ │ 0/operations
+│ │ │ │ 0/avail_operations,operations
 │ │ │ │ │ monitoring_attrs/
 │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
 │ │ │ │ │ │ nr_regions/min,max
@@ -121,10 +121,11 @@ In each kdamond directory, two files (``state`` and ``pid``) and one directory
 
 Reading ``state`` returns ``on`` if the kdamond is currently running, or
 ``off`` if it is not running. Writing ``on`` or ``off`` makes the kdamond be
-in the state. Writing ``update_schemes_stats`` to ``state`` file updates the
-contents of stats files for each DAMON-based operation scheme of the kdamond.
-For details of the stats, please refer to :ref:`stats section
-<sysfs_schemes_stats>`.
+in the state. Writing ``commit`` to the ``state`` file makes kdamond reads the
+user inputs in the sysfs files except ``state`` file again. Writing
+``update_schemes_stats`` to ``state`` file updates the contents of stats files
+for each DAMON-based operation scheme of the kdamond. For details of the
+stats, please refer to :ref:`stats section <sysfs_schemes_stats>`.
 
 If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
 
@@ -143,17 +144,28 @@ be written to the file.
 contexts/<N>/
 -------------
 
-In each context directory, one file (``operations``) and three directories
-(``monitoring_attrs``, ``targets``, and ``schemes``) exist.
+In each context directory, two files (``avail_operations`` and ``operations``)
+and three directories (``monitoring_attrs``, ``targets``, and ``schemes``)
+exist.
 
 DAMON supports multiple types of monitoring operations, including those for
-virtual address space and the physical address space. You can set and get what
-type of monitoring operations DAMON will use for the context by writing one of
-below keywords to, and reading from the file.
+virtual address space and the physical address space. You can get the list of
+available monitoring operations set on the currently running kernel by reading
+``avail_operations`` file. Based on the kernel configuration, the file will
+list some or all of below keywords.
 
 - vaddr: Monitor virtual address spaces of specific processes
+- fvaddr: Monitor fixed virtual address ranges
 - paddr: Monitor the physical address space of the system
 
+Please refer to :ref:`regions sysfs directory <sysfs_regions>` for detailed
+differences between the operations sets in terms of the monitoring target
+regions.
+
+You can set and get what type of monitoring operations DAMON will use for the
+context by writing one of the keywords listed in ``avail_operations`` file and
+reading from the ``operations`` file.
+
 contexts/<N>/monitoring_attrs/
 ------------------------------
 
@@ -192,6 +204,8 @@ If you wrote ``vaddr`` to the ``contexts/<N>/operations``, each target should
 be a process. You can specify the process to DAMON by writing the pid of the
 process to the ``pid_target`` file.
 
+.. _sysfs_regions:
+
 targets/<N>/regions
 -------------------
 
@@ -202,9 +216,10 @@ can be covered. However, users could want to set the initial monitoring region
 to specific address ranges.
 
 In contrast, DAMON do not automatically sets and updates the monitoring target
-regions when ``paddr`` monitoring operations set is being used (``paddr`` is
-written to the ``contexts/<N>/operations``). Therefore, users should set the
-monitoring target regions by themselves in the case.
+regions when ``fvaddr`` or ``paddr`` monitoring operations sets are being used
+(``fvaddr`` or ``paddr`` have written to the ``contexts/<N>/operations``).
+Therefore, users should set the monitoring target regions by themselves in the
+cases.
 
 For such cases, users can explicitly set the initial monitoring target regions
 as they want, by writing proper values to the files under this directory.
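The DAMON sysfs flow described in this file's diff (read `avail_operations`, write one of the listed keywords to `operations`, then write `commit` to the kdamond's `state` file) can be sketched in Python as follows. The helper names and the `admin_root` override are assumptions introduced for the example. The sketch also assumes that `avail_operations` lists keywords separated by whitespace, which the diff does not spell out.

```python
from pathlib import Path

# Default location of the DAMON sysfs interface, as given in the ABI document.
DAMON_ADMIN = "/sys/kernel/mm/damon/admin"


def set_operations(kdamond: int, context: int, ops: str,
                   admin_root: str = DAMON_ADMIN) -> None:
    """Select a monitoring operations set after checking it is available."""
    ctx = (Path(admin_root) / "kdamonds" / str(kdamond)
           / "contexts" / str(context))
    # Assumption: keywords in avail_operations are whitespace-separated.
    available = (ctx / "avail_operations").read_text().split()
    if ops not in available:
        raise ValueError(f"{ops!r} not in avail_operations: {available}")
    (ctx / "operations").write_text(ops)


def commit_inputs(kdamond: int, admin_root: str = DAMON_ADMIN) -> None:
    """Ask a running kdamond to re-read its sysfs inputs (except 'state')."""
    state = Path(admin_root) / "kdamonds" / str(kdamond) / "state"
    state.write_text("commit")
```

Checking `avail_operations` first turns an invalid keyword into a clear `ValueError` instead of relying on the kernel to reject the write.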

Documentation/admin-guide/mm/hugetlbpage.rst

Lines changed: 1 addition & 1 deletion
@@ -164,7 +164,7 @@ default_hugepagesz
 will all result in 256 2M huge pages being allocated. Valid default
 huge page size is architecture dependent.
 hugetlb_free_vmemmap
-When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set, this enables freeing
+When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables optimizing
 unused vmemmap pages associated with each HugeTLB page.
 
 When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``

Documentation/admin-guide/mm/ksm.rst

Lines changed: 18 additions & 0 deletions
@@ -184,6 +184,24 @@ The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
 ``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
 be increased accordingly.
 
+Monitoring KSM events
+=====================
+
+There are some counters in /proc/vmstat that may be used to monitor KSM events.
+KSM might help save memory, it's a tradeoff by may suffering delay on KSM COW
+or on swapping in copy. Those events could help users evaluate whether or how
+to use KSM. For example, if cow_ksm increases too fast, user may decrease the
+range of madvise(, , MADV_MERGEABLE).
+
+cow_ksm
+is incremented every time a KSM page triggers copy on write (COW)
+when users try to write to a KSM page, we have to make a copy.
+
+ksm_swpin_copy
+is incremented every time a KSM page is copied when swapping in
+note that KSM page might be copied when swapping in because do_swap_page()
+cannot do all the locking needed to reconstitute a cross-anon_vma KSM page.
+
 --
 Izik Eidus,
 Hugh Dickins, 17 Nov 2009
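To watch the two KSM counters added to the documentation above, a monitoring script only needs to pick `cow_ksm` and `ksm_swpin_copy` out of the `name value` lines in `/proc/vmstat`. The Python sketch below shows one way to do that. The function name is made up for the example, and it takes the file contents as a string so it can be exercised without a KSM-enabled kernel.

```python
from typing import Dict

KSM_EVENT_COUNTERS = ("cow_ksm", "ksm_swpin_copy")


def ksm_event_counters(vmstat_text: str) -> Dict[str, int]:
    """Extract the KSM event counters from the text of /proc/vmstat."""
    counters: Dict[str, int] = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        # Each /proc/vmstat line has the form "<name> <value>".
        if len(parts) == 2 and parts[0] in KSM_EVENT_COUNTERS:
            counters[parts[0]] = int(parts[1])
    return counters
```

Sampling `ksm_event_counters(open("/proc/vmstat").read())` at two points in time and taking the difference gives the rate at which `cow_ksm` grows, which is the signal the text suggests using when deciding whether to narrow the `MADV_MERGEABLE` range. Counters missing from the running kernel are simply absent from the result.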

Documentation/admin-guide/sysctl/vm.rst

Lines changed: 48 additions & 0 deletions
@@ -62,6 +62,7 @@ Currently, these files are in /proc/sys/vm:
 - overcommit_memory
 - overcommit_ratio
 - page-cluster
+- page_lock_unfairness
 - panic_on_oom
 - percpu_pagelist_high_fraction
 - stat_interval
@@ -561,6 +562,45 @@ Change the minimum size of the hugepage pool.
 See Documentation/admin-guide/mm/hugetlbpage.rst
 
 
+hugetlb_optimize_vmemmap
+========================
+
+This knob is not available when memory_hotplug.memmap_on_memory (kernel parameter)
+is configured or the size of 'struct page' (a structure defined in
+include/linux/mm_types.h) is not power of two (an unusual system config could
+result in this).
+
+Enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap pages
+associated with each HugeTLB page.
+
+Once enabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
+buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095 pages
+per 1GB HugeTLB page), whereas already allocated HugeTLB pages will not be
+optimized. When those optimized HugeTLB pages are freed from the HugeTLB pool
+to the buddy allocator, the vmemmap pages representing that range needs to be
+remapped again and the vmemmap pages discarded earlier need to be rellocated
+again. If your use case is that HugeTLB pages are allocated 'on the fly' (e.g.
+never explicitly allocating HugeTLB pages with 'nr_hugepages' but only set
+'nr_overcommit_hugepages', those overcommitted HugeTLB pages are allocated 'on
+the fly') instead of being pulled from the HugeTLB pool, you should weigh the
+benefits of memory savings against the more overhead (~2x slower than before)
+of allocation or freeing HugeTLB pages between the HugeTLB pool and the buddy
+allocator. Another behavior to note is that if the system is under heavy memory
+pressure, it could prevent the user from freeing HugeTLB pages from the HugeTLB
+pool to the buddy allocator since the allocation of vmemmap pages could be
+failed, you have to retry later if your system encounter this situation.
+
+Once disabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
+buddy allocator will not be optimized meaning the extra overhead at allocation
+time from buddy allocator disappears, whereas already optimized HugeTLB pages
+will not be affected. If you want to make sure there are no optimized HugeTLB
+pages, you can set "nr_hugepages" to 0 first and then disable this. Note that
+writing 0 to nr_hugepages will make any "in use" HugeTLB pages become surplus
+pages. So, those surplus pages are still optimized until they are no longer
+in use. You would need to wait for those surplus pages to be released before
+there are no optimized pages in the system.
+
+
 nr_hugepages_mempolicy
 ======================
 
@@ -754,6 +794,14 @@ extra faults and I/O delays for following faults if they would have been part of
 that consecutive pages readahead would have brought in.
 
 
+page_lock_unfairness
+====================
+
+This value determines the number of times that the page lock can be
+stolen from under a waiter. After the lock is stolen the number of times
+specified in this file (default is 5), the "fair lock handoff" semantics
+will apply, and the waiter will only be awakened if the lock can be taken.
+
 panic_on_oom
 ============
 
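The `hugetlb_optimize_vmemmap` text in this file's diff quotes savings of 7 vmemmap pages per 2MB HugeTLB page and 4095 per 1GB HugeTLB page. The Python sketch below reproduces those figures under three assumptions that the diff does not state: a 4 KiB base page, a 64-byte `struct page`, and one vmemmap page retained per HugeTLB page. These values were chosen because they yield the documented numbers, and the function name is made up for the example.

```python
def optimized_vmemmap_pages(hugepage_bytes: int,
                            base_page_bytes: int = 4096,   # assumed base page size
                            struct_page_bytes: int = 64,   # assumed sizeof(struct page)
                            retained_pages: int = 1) -> int:  # assumed pages kept
    """Number of vmemmap pages that could be freed for one HugeTLB page."""
    # One struct page describes each base page inside the huge page.
    struct_pages = hugepage_bytes // base_page_bytes
    # Those struct pages themselves occupy this many vmemmap pages.
    vmemmap_pages = (struct_pages * struct_page_bytes) // base_page_bytes
    return vmemmap_pages - retained_pages


two_mb = optimized_vmemmap_pages(2 * 1024 * 1024)      # 512 * 64 B = 8 pages, minus 1
one_gb = optimized_vmemmap_pages(1024 * 1024 * 1024)   # 262144 * 64 B = 4096 pages, minus 1
```

With these assumptions, `two_mb` is 7 and `one_gb` is 4095, matching the text. On systems with a different base page size or `struct page` size the result would differ.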
