Commit e2ca6ba
Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - More userfaultfd work from Peter Xu

 - Several convert-to-folios series from Sidhartha Kumar and Huang Ying

 - Some filemap cleanups from Vishal Moola

 - David Hildenbrand added the ability to selftest anon memory COW handling

 - Some cpuset simplifications from Liu Shixin

 - Addition of vmalloc tracing support by Uladzislau Rezki

 - Some pagecache folioifications and simplifications from Matthew Wilcox

 - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it

 - Miguel Ojeda contributed some cleanups for our use of the
   __no_sanitize_thread__ gcc keyword.  This series should have been in
   the non-MM tree, my bad

 - Naoya Horiguchi improved the interaction between memory poisoning and
   memory section removal for huge pages

 - DAMON cleanups and tuneups from SeongJae Park

 - Tony Luck fixed the handling of COW faults against poisoned pages

 - Peter Xu utilized the PTE marker code for handling swapin errors

 - Hugh Dickins reworked compound page mapcount handling, simplifying it
   and making it more efficient

 - Removal of the autonuma savedwrite infrastructure from Nadav Amit and
   David Hildenbrand

 - zram support for multiple compression streams from Sergey Senozhatsky

 - David Hildenbrand reworked the GUP code's R/O long-term pinning so
   that drivers no longer need to use the FOLL_FORCE workaround which
   didn't work very well anyway

 - Mel Gorman altered the page allocator so that local IRQs can remain
   enabled during per-cpu page allocations

 - Vishal Moola removed the try_to_release_page() wrapper

 - Stefan Roesch added some per-BDI sysfs tunables which are used to
   prevent network block devices from dirtying excessive amounts of
   pagecache

 - David Hildenbrand did some cleanup and repair work on KSM COW breaking

 - Nhat Pham and Johannes Weiner have implemented writeback in zswap's
   zsmalloc backend

 - Brian Foster has fixed a longstanding corner-case oddity in
   file[map]_write_and_wait_range()

 - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen

 - Shiyang Ruan has done some work on fsdax, to make its reflink mode
   work better under xfstests.  Better, but still not perfect

 - Christoph Hellwig has removed the .writepage() method from several
   filesystems.  They only need .writepages()

 - Yosry Ahmed wrote a series which fixes the memcg reclaim target
   beancounting

 - David Hildenbrand has fixed some of our MM selftests for 32-bit
   machines

 - Many singleton patches, as usual

* tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits)
  mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio
  mm: mmu_gather: allow more than one batch of delayed rmaps
  mm: fix typo in struct pglist_data code comment
  kmsan: fix memcpy tests
  mm: add cond_resched() in swapin_walk_pmd_entry()
  mm: do not show fs mm pc for VM_LOCKONFAULT pages
  selftests/vm: ksm_functional_tests: fixes for 32bit
  selftests/vm: cow: fix compile warning on 32bit
  selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions
  mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem
  mm,thp,rmap: fix races between updates of subpages_mapcount
  mm: memcg: fix swapcached stat accounting
  mm: add nodes= arg to memory.reclaim
  mm: disable top-tier fallback to reclaim on proactive reclaim
  selftests: cgroup: make sure reclaim target memcg is unprotected
  selftests: cgroup: refactor proactive reclaim code to reclaim_until()
  mm: memcg: fix stale protection of reclaim target memcg
  mm/mmap: properly unaccount memory on mas_preallocate() failure
  omfs: remove ->writepage
  jfs: remove ->writepage
  ...
2 parents 7e68dd7 + c45bc55

File tree

237 files changed (+9281, -5047 lines)


Documentation/ABI/testing/sysfs-block-zram

Lines changed: 14 additions & 0 deletions
@@ -137,3 +137,17 @@ Description:
 		The writeback_limit file is read-write and specifies the maximum
 		amount of writeback ZRAM can do. The limit could be changed
 		in run time.
+
+What:		/sys/block/zram<id>/recomp_algorithm
+Date:		November 2022
+Contact:	Sergey Senozhatsky <[email protected]>
+Description:
+		The recomp_algorithm file is read-write and allows to set
+		or show secondary compression algorithms.
+
+What:		/sys/block/zram<id>/recompress
+Date:		November 2022
+Contact:	Sergey Senozhatsky <[email protected]>
+Description:
+		The recompress file is write-only and triggers re-compression
+		with secondary compression algorithms.
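A reading of recomp_algorithm marks the currently selected algorithm of each priority slot with brackets (the sample reading below is the one used in the zram admin guide elsewhere in this commit). A minimal sketch of extracting the selections from a captured reading:

```shell
#!/bin/sh
# Extract the selected (bracketed) algorithm from each priority slot of
# a recomp_algorithm reading. On a live system the reading would come
# from: cat /sys/block/zram0/recomp_algorithm
reading='#1: lzo lzo-rle lz4 lz4hc [zstd]
#2: lzo lzo-rle lz4 [lz4hc] zstd'

# Emit "<slot> <selected-algorithm>" per priority slot.
selected=$(printf '%s\n' "$reading" |
    sed -n 's/^\(#[0-9]*\):.*\[\([^]]*\)\].*/\1 \2/p')
printf '%s\n' "$selected"
```

This prints `#1 zstd` and `#2 lz4hc` for the sample above; it is a parsing sketch, not part of the kernel interface itself.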

Documentation/ABI/testing/sysfs-class-bdi

Lines changed: 68 additions & 0 deletions
@@ -44,6 +44,21 @@ Description:
 
 		(read-write)
 
+What:		/sys/class/bdi/<bdi>/min_ratio_fine
+Date:		November 2022
+Contact:	Stefan Roesch <[email protected]>
+Description:
+		Under normal circumstances each device is given a part of the
+		total write-back cache that relates to its current average
+		writeout speed in relation to the other devices.
+
+		The 'min_ratio_fine' parameter allows assigning a minimum reserve
+		of the write-back cache to a particular device. The value is
+		expressed as part of 1 million. For example, this is useful for
+		providing a minimum QoS.
+
+		(read-write)
+
 What:		/sys/class/bdi/<bdi>/max_ratio
 Date:		January 2008
 Contact:	Peter Zijlstra <[email protected]>
@@ -55,6 +70,59 @@ Description:
 		mount that is prone to get stuck, or a FUSE mount which cannot
 		be trusted to play fair.
 
+		(read-write)
+
+What:		/sys/class/bdi/<bdi>/max_ratio_fine
+Date:		November 2022
+Contact:	Stefan Roesch <[email protected]>
+Description:
+		Allows limiting a particular device to use not more than the
+		given value of the write-back cache. The value is given as part
+		of 1 million. This is useful in situations where we want to avoid
+		one device taking all or most of the write-back cache. For example
+		in case of an NFS mount that is prone to get stuck, or a FUSE mount
+		which cannot be trusted to play fair.
+
+		(read-write)
+
+What:		/sys/class/bdi/<bdi>/min_bytes
+Date:		October 2022
+Contact:	Stefan Roesch <[email protected]>
+Description:
+		Under normal circumstances each device is given a part of the
+		total write-back cache that relates to its current average
+		writeout speed in relation to the other devices.
+
+		The 'min_bytes' parameter allows assigning a minimum
+		percentage of the write-back cache to a particular device
+		expressed in bytes.
+		For example, this is useful for providing a minimum QoS.
+
+		(read-write)
+
+What:		/sys/class/bdi/<bdi>/max_bytes
+Date:		October 2022
+Contact:	Stefan Roesch <[email protected]>
+Description:
+		Allows limiting a particular device to use not more than the
+		given 'max_bytes' of the write-back cache. This is useful in
+		situations where we want to avoid one device taking all or
+		most of the write-back cache. For example in case of an NFS
+		mount that is prone to get stuck, a FUSE mount which cannot be
+		trusted to play fair, or a nbd device.
+
+		(read-write)
+
+What:		/sys/class/bdi/<bdi>/strict_limit
+Date:		October 2022
+Contact:	Stefan Roesch <[email protected]>
+Description:
+		Forces per-BDI checks for the share of given device in the write-back
+		cache even before the global background dirty limit is reached. This
+		is useful in situations where the global limit is much higher than
+		affordable for given relatively slow (or untrusted) device. Turning
+		strictlimit on has no visible effect if max_ratio is equal to 100%.
+
+		(read-write)
 What:		/sys/class/bdi/<bdi>/stable_pages_required
 Date:		January 2008
Documentation/ABI/testing/sysfs-kernel-mm-damon

Lines changed: 32 additions & 0 deletions
@@ -27,6 +27,10 @@ Description: Writing 'on' or 'off' to this file makes the kdamond starts or
 		makes the kdamond reads the user inputs in the sysfs files
 		except 'state' again.  Writing 'update_schemes_stats' to the
 		file updates contents of schemes stats files of the kdamond.
+		Writing 'update_schemes_tried_regions' to the file updates
+		contents of 'tried_regions' directory of every scheme directory
+		of this kdamond.  Writing 'clear_schemes_tried_regions' to the
+		file removes contents of the 'tried_regions' directory.
 
 What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/pid
 Date:		Mar 2022
@@ -283,3 +287,31 @@ Date:		Mar 2022
 Contact:	SeongJae Park <[email protected]>
 Description:	Reading this file returns the number of the exceed events of
 		the scheme's quotas.
+
+What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/start
+Date:		Oct 2022
+Contact:	SeongJae Park <[email protected]>
+Description:	Reading this file returns the start address of a memory region
+		that corresponding DAMON-based Operation Scheme's action has
+		tried to be applied.
+
+What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/end
+Date:		Oct 2022
+Contact:	SeongJae Park <[email protected]>
+Description:	Reading this file returns the end address of a memory region
+		that corresponding DAMON-based Operation Scheme's action has
+		tried to be applied.
+
+What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/nr_accesses
+Date:		Oct 2022
+Contact:	SeongJae Park <[email protected]>
+Description:	Reading this file returns the 'nr_accesses' of a memory region
+		that corresponding DAMON-based Operation Scheme's action has
+		tried to be applied.
+
+What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/age
+Date:		Oct 2022
+Contact:	SeongJae Park <[email protected]>
+Description:	Reading this file returns the 'age' of a memory region that
+		corresponding DAMON-based Operation Scheme's action has tried
+		to be applied.
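After 'update_schemes_tried_regions' is written to a kdamond's 'state' file, the four per-region files above can be read in a loop. A sketch, written as a function so the scheme directory (normally under /sys/kernel/mm/damon/admin) can be pointed anywhere; the default of kdamond 0, context 0, scheme 0 is an assumption:

```shell
#!/bin/sh
# Print "start end nr_accesses age" for every tried region of a scheme.
list_tried_regions() {
    scheme=${1:-/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0}
    for r in "$scheme"/tried_regions/*/; do
        [ -f "$r/start" ] || continue   # no regions recorded yet
        printf '%s %s %s %s\n' \
            "$(cat "$r/start")" "$(cat "$r/end")" \
            "$(cat "$r/nr_accesses")" "$(cat "$r/age")"
    done
}
```

Writing 'clear_schemes_tried_regions' to 'state' afterwards empties the directories again, as the ABI text above describes.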

Documentation/admin-guide/blockdev/zram.rst

Lines changed: 96 additions & 4 deletions
@@ -348,8 +348,13 @@ this can be accomplished with::
 
 	echo huge_idle > /sys/block/zramX/writeback
 
+If a user chooses to writeback only incompressible pages (pages that none of
+algorithms can compress) this can be accomplished with::
+
+	echo incompressible > /sys/block/zramX/writeback
+
 If an admin wants to write a specific page in zram device to the backing device,
-they could write a page index into the interface.
+they could write a page index into the interface::
 
 	echo "page_index=1251" > /sys/block/zramX/writeback
 
@@ -401,6 +406,87 @@ budget in next setting is user's job.
 If admin wants to measure writeback count in a certain period, they could
 know it via /sys/block/zram0/bd_stat's 3rd column.
 
+recompression
+-------------
+
+With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative
+(secondary) compression algorithms. The basic idea is that alternative
+compression algorithm can provide better compression ratio at a price of
+(potentially) slower compression/decompression speeds. Alternative compression
+algorithm can, for example, be more successful compressing huge pages (those
+that default algorithm failed to compress). Another application is idle pages
+recompression - pages that are cold and sit in the memory can be recompressed
+using more effective algorithm and, hence, reduce zsmalloc memory usage.
+
+With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms:
+one primary and up to 3 secondary ones. Primary zram compressor is explained
+in "3) Select compression algorithm", secondary algorithms are configured
+using recomp_algorithm device attribute.
+
+Example:::
+
+	#show supported recompression algorithms
+	cat /sys/block/zramX/recomp_algorithm
+	#1: lzo lzo-rle lz4 lz4hc [zstd]
+	#2: lzo lzo-rle lz4 [lz4hc] zstd
+
+Alternative compression algorithms are sorted by priority. In the example
+above, zstd is used as the first alternative algorithm, which has priority
+of 1, while lz4hc is configured as a compression algorithm with priority 2.
+Alternative compression algorithm's priority is provided during algorithms
+configuration:::
+
+	#select zstd recompression algorithm, priority 1
+	echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm
+
+	#select deflate recompression algorithm, priority 2
+	echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm
+
+Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress,
+which controls recompression.
+
+Examples:::
+
+	#IDLE pages recompression is activated by `idle` mode
+	echo "type=idle" > /sys/block/zramX/recompress
+
+	#HUGE pages recompression is activated by `huge` mode
+	echo "type=huge" > /sys/block/zram0/recompress
+
+	#HUGE_IDLE pages recompression is activated by `huge_idle` mode
+	echo "type=huge_idle" > /sys/block/zramX/recompress
+
+The number of idle pages can be significant, so user-space can pass a size
+threshold (in bytes) to the recompress knob: zram will recompress only pages
+of equal or greater size:::
+
+	#recompress all pages larger than 3000 bytes
+	echo "threshold=3000" > /sys/block/zramX/recompress
+
+	#recompress idle pages larger than 2000 bytes
+	echo "type=idle threshold=2000" > /sys/block/zramX/recompress
+
+Recompression of idle pages requires memory tracking.
+
+During re-compression for every page, that matches re-compression criteria,
+ZRAM iterates the list of registered alternative compression algorithms in
+order of their priorities. ZRAM stops either when re-compression was
+successful (re-compressed object is smaller in size than the original one)
+and matches re-compression criteria (e.g. size threshold) or when there are
+no secondary algorithms left to try. If none of the secondary algorithms can
+successfully re-compress the page such a page is marked as incompressible,
+so ZRAM will not attempt to re-compress it in the future.
+
+This re-compression behaviour, when it iterates through the list of
+registered compression algorithms, increases our chances of finding the
+algorithm that successfully compresses a particular page. Sometimes, however,
+it is convenient (and sometimes even necessary) to limit recompression to
+only one particular algorithm so that it will not try any other algorithms.
+This can be achieved by providing an algo=NAME parameter:::
+
+	#use zstd algorithm only (if registered)
+	echo "type=huge algo=zstd" > /sys/block/zramX/recompress
+
 memory tracking
 ===============
 
@@ -411,9 +497,11 @@ pages of the process with pagemap.
 If you enable the feature, you could see block state via
 /sys/kernel/debug/zram/zram0/block_state. The output is as follows::
 
-	  300    75.033841 .wh.
-	  301    63.806904 s...
-	  302    63.806919 ..hi
+	  300    75.033841 .wh...
+	  301    63.806904 s.....
+	  302    63.806919 ..hi..
+	  303    62.801919 ....r.
+	  304   146.781902 ..hi.n
 
 First column
 	zram's block index.
@@ -430,6 +518,10 @@ Third column
 	huge page
 i:
 	idle page
+r:
+	recompressed page (secondary compression algorithm)
+n:
+	none (including secondary) of algorithms could compress it
 
 First line of above example says 300th block is accessed at 75.033841sec
 and the block's state is huge so it is written back to the backing
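The six flag columns lend themselves to simple tallying. A sketch that counts recompressed and incompressible blocks in a captured dump; the sample lines are the ones shown above, and a live dump would come from /sys/kernel/debug/zram/zram0/block_state:

```shell
#!/bin/sh
# Tally 'r' (recompressed) and 'n' (incompressible) flags in the third
# column of a block_state dump.
dump='300    75.033841 .wh...
301    63.806904 s.....
302    63.806919 ..hi..
303    62.801919 ....r.
304   146.781902 ..hi.n'

summary=$(printf '%s\n' "$dump" | awk '
    $3 ~ /r/ { recomp++ }
    $3 ~ /n/ { incomp++ }
    END { printf "recompressed=%d incompressible=%d\n", recomp+0, incomp+0 }')
printf '%s\n' "$summary"
```

For the sample dump this reports one recompressed block (303) and one incompressible block (304), matching the flag legend above.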

Documentation/admin-guide/cgroup-v1/memory.rst

Lines changed: 2 additions & 1 deletion
@@ -543,7 +543,8 @@ inactive_anon	# of bytes of anonymous and swap cache memory on inactive
 		LRU list.
 active_anon	# of bytes of anonymous and swap cache memory on active
 		LRU list.
-inactive_file	# of bytes of file-backed memory on inactive LRU list.
+inactive_file	# of bytes of file-backed memory and MADV_FREE anonymous memory(
+		LazyFree pages) on inactive LRU list.
 active_file	# of bytes of file-backed memory on active LRU list.
 unevictable	# of bytes of memory that cannot be reclaimed (mlocked etc).
 =============== ===============================================================

Documentation/admin-guide/cgroup-v2.rst

Lines changed: 15 additions & 6 deletions
@@ -1245,17 +1245,13 @@ PAGE_SIZE multiple when read back.
 	This is a simple interface to trigger memory reclaim in the
 	target cgroup.
 
-	This file accepts a single key, the number of bytes to reclaim.
-	No nested keys are currently supported.
+	This file accepts a string which contains the number of bytes to
+	reclaim.
 
 	Example::
 
 	  echo "1G" > memory.reclaim
 
-	The interface can be later extended with nested keys to
-	configure the reclaim behavior. For example, specify the
-	type of memory to reclaim from (anon, file, ..).
-
 	Please note that the kernel can over or under reclaim from
 	the target cgroup. If less bytes are reclaimed than the
 	specified amount, -EAGAIN is returned.
@@ -1267,6 +1263,13 @@ PAGE_SIZE multiple when read back.
 	This means that the networking layer will not adapt based on
 	reclaim induced by memory.reclaim.
 
+	This file also allows the user to specify the nodes to reclaim from,
+	via the 'nodes=' key, for example::
+
+	  echo "1G nodes=0,1" > memory.reclaim
+
+	The above instructs the kernel to reclaim memory from nodes 0,1.
+
 memory.peak
 	A read-only single value file which exists on non-root
 	cgroups.
@@ -1488,12 +1491,18 @@ PAGE_SIZE multiple when read back.
 	pgscan_direct (npn)
 		Amount of scanned pages directly (in an inactive LRU list)
 
+	pgscan_khugepaged (npn)
+		Amount of scanned pages by khugepaged (in an inactive LRU list)
+
 	pgsteal_kswapd (npn)
 		Amount of reclaimed pages by kswapd
 
 	pgsteal_direct (npn)
 		Amount of reclaimed pages directly
 
+	pgsteal_khugepaged (npn)
+		Amount of reclaimed pages by khugepaged
+
 	pgfault (npn)
 		Total number of page faults incurred