
Commit beace86

Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

"As usual, many cleanups. The below blurbiage describes 42 patchsets. 21 of those are partially or fully cleanup work. "cleans up", "cleanup", "maintainability", "rationalizes", etc.

I never knew the MM code was so dirty.

"mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes) addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for merging with existing adjacent VMAs.

"mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park) adds a new kernel module which simplifies the setup and usage of DAMON in production environments.

"stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig) is a cleanup to the writeback code which removes a couple of pointers from struct writeback_control.

"drivers/base/node.c: optimization and cleanups" (Donet Tom) contains largely uncorrelated cleanups to the NUMA node setup and management code.

"mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman) does some maintenance work on the userfaultfd code.

"Readahead tweaks for larger folios" (Ryan Roberts) implements some tuneups for pagecache readahead when it is reading into order>0 folios.

"selftests/mm: Tweaks to the cow test" (Mark Brown) provides some cleanups and consistency improvements to the selftests code.

"Optimize mremap() for large folios" (Dev Jain) does that. A 37% reduction in execution time was measured in a memset+mremap+munmap microbenchmark.

"Remove zero_user()" (Matthew Wilcox) expunges zero_user() in favor of the more modern memzero_page().

"mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand) addresses some warts which David noticed in the huge page code. These were not known to be causing any issues at this time.

"mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD}" (SeongJae Park) provides some cleanup and consolidation work in DAMON.

"use vm_flags_t consistently" (Lorenzo Stoakes) uses vm_flags_t in places where we were inappropriately using other types.

"mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy) increases the reliability of large page allocation in the memfd code.

"mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple) removes several now-unneeded PFN_* flags.

"mm/damon: decouple sysfs from core" (SeongJae Park) implements some cleanup and maintainability work in the DAMON sysfs layer.

"madvise cleanup" (Lorenzo Stoakes) does quite a lot of cleanup/maintenance work in the madvise() code.

"madvise anon_name cleanups" (Vlastimil Babka) provides additional cleanups on top of Lorenzo's effort.

"Implement numa node notifier" (Oscar Salvador) creates a standalone notifier for NUMA node memory state changes. Previously these were lumped under the more general memory on/offline notifier.

"Make MIGRATE_ISOLATE a standalone bit" (Zi Yan) cleans up the pageblock isolation code and fixes a potential issue which doesn't seem to cause any problems in practice.

"selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park) adds additional drgn- and python-based DAMON selftests which are more comprehensive than the existing selftest suite.

"Misc rework on hugetlb faulting path" (Oscar Salvador) fixes a rather obscure deadlock in the hugetlb fault code and follows that fix with a series of cleanups.

"cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport) rationalizes and cleans up the highmem-specific code in the CMA allocator.

"mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand) provides cleanups and future-preparedness to the migration code.

"mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park) adds some tracepoints to some DAMON auto-tuning code.

"mm/damon: fix misc bugs in DAMON modules" (SeongJae Park) does that.

"mm/damon: misc cleanups" (SeongJae Park) also does what it claims.

"mm: folio_pte_batch() improvements" (David Hildenbrand) cleans up the large folio PTE batching code.

"mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park) facilitates dynamic alteration of DAMON's inter-node allocation policy.

"Remove unmap_and_put_page()" (Vishal Moola) provides a couple of page->folio conversions.

"mm: per-node proactive reclaim" (Davidlohr Bueso) implements a per-node control of proactive reclaim - beyond the current memcg-based implementation.

"mm/damon: remove damon_callback" (SeongJae Park) replaces the damon_callback interface with a more general and powerful damon_call()+damos_walk() interface.

"mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes) implements a number of mremap cleanups (of course) in preparation for adding new mremap() functionality: newly permit the remapping of multiple VMAs when the user is specifying MREMAP_FIXED. It still excludes some specialized situations where this cannot be performed reliably.

"drop hugetlb_free_pgd_range()" (Anthony Yznaga) switches some sparc hugetlb code over to the generic version and removes the thus-unneeded hugetlb_free_pgd_range().

"mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park) augments the present userspace-requested update of DAMON sysfs monitoring files. Automatic update is now provided, along with a tunable to control the update interval.

"Some randome fixes and cleanups to swapfile" (Kemeng Shi) does what it claims.

"mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand) provides (and uses) a means by which debug-style functions can grab a copy of a pageframe and inspect it locklessly without tripping over the races inherent in operating on the live pageframe directly.

"use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan) addresses the large contention issues which can be triggered by reads from that procfs file. Latencies are reduced by more than half in some situations. The series also introduces several new selftests for the /proc/pid/maps interface.

"__folio_split() clean up" (Zi Yan) cleans up __folio_split()!

"Optimize mprotect() for large folios" (Dev Jain) provides some quite large (>3x) speedups to mprotect() when dealing with large folios.

"selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian) does some cleanup work in the selftests code.

"tools/testing: expand mremap testing" (Lorenzo Stoakes) extends the mremap() selftest in several ways, including adding more checking of Lorenzo's recently added "permit mremap() move of multiple VMAs" feature.

"selftests/damon/sysfs.py: test all parameters" (SeongJae Park) extends the DAMON sysfs interface selftest so that it tests all possible user-requested parameters, rather than the present minimal subset"

* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
MAINTAINERS: add missing headers to mempory policy & migration section
MAINTAINERS: add missing file to cgroup section
MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
MAINTAINERS: add missing zsmalloc file
MAINTAINERS: add missing files to page alloc section
MAINTAINERS: add missing shrinker files
MAINTAINERS: move memremap.[ch] to hotplug section
MAINTAINERS: add missing mm_slot.h file THP section
MAINTAINERS: add missing interval_tree.c to memory mapping section
MAINTAINERS: add missing percpu-internal.h file to per-cpu section
mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
selftests/damon: introduce _common.sh to host shared function
selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
selftests/damon/sysfs.py: test non-default parameters runtime commit
selftests/damon/sysfs.py: generalize DAMON context commit assertion
selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
selftests/damon/sysfs.py: test DAMOS filters commitment
selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
selftests/damon/sysfs.py: test DAMOS destinations commitment
...
2 parents cbbf0a7 + af915c3 commit beace86

329 files changed, 10711 insertions(+), 5779 deletions(-)


Documentation/ABI/stable/sysfs-devices-node

Lines changed: 9 additions & 0 deletions
@@ -227,3 +227,12 @@ Contact: Jiaqi Yan <[email protected]>
 Description:
 Of the raw poisoned pages on a NUMA node, how many pages are
 recovered by memory error recovery attempt.
+
+What: /sys/devices/system/node/nodeX/reclaim
+Date: June 2025
+Contact: Linux Memory Management list <[email protected]>
+Description:
+Perform user-triggered proactive reclaim on a NUMA node.
+This interface is equivalent to the memcg variant.
+
+See Documentation/admin-guide/cgroup-v2.rst
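
As a rough illustration of the new per-node `reclaim` file, here is a minimal Python sketch. It assumes the file accepts the same size string as the memcg `memory.reclaim` file (for example `64M`), which is how "equivalent to the memcg variant" is read here. The helper names and the `sysfs_root` parameter are hypothetical and exist only so the sketch can be tried against a scratch directory.

```python
from pathlib import Path


def node_reclaim_path(nid: int, sysfs_root: str = "/sys") -> Path:
    """Build the path of the per-node proactive reclaim file."""
    if nid < 0:
        raise ValueError("node id must be non-negative")
    return Path(sysfs_root) / "devices/system/node" / f"node{nid}" / "reclaim"


def request_node_reclaim(nid: int, amount: str, sysfs_root: str = "/sys") -> Path:
    """Ask the kernel to reclaim `amount` (e.g. "64M") from node `nid`.

    Assumption: the file takes the same size string as memcg memory.reclaim.
    On a real system this needs root, and the write may raise OSError if the
    kernel could not reclaim the requested amount.
    """
    path = node_reclaim_path(nid, sysfs_root)
    path.write_text(amount)
    return path
```

On a real machine, `request_node_reclaim(0, "64M")` would target `/sys/devices/system/node/node0/reclaim`.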

Documentation/ABI/testing/sysfs-kernel-mm-damon

Lines changed: 29 additions & 0 deletions
@@ -44,6 +44,13 @@ Contact: SeongJae Park <[email protected]>
 Description: Reading this file returns the pid of the kdamond if it is
 running.

+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/refresh_ms
+Date: Jul 2025
+Contact: SeongJae Park <[email protected]>
+Description: Writing a value to this file sets the time interval for
+automatic DAMON status file contents update. Writing '0'
+disables the update. Reading this file returns the value.
+
 What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/nr_contexts
 Date: Mar 2022
 Contact: SeongJae Park <[email protected]>
@@ -431,6 +438,28 @@ Description: Directory for DAMON operations set layer-handled DAMOS filters.
 /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters
 directory.

+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/dests/nr_dests
+Date: Jul 2025
+Contact: SeongJae Park <[email protected]>
+Description: Writing a number 'N' to this file creates the number of
+directories for setting action destinations of the scheme named
+'0' to 'N-1' under the dests/ directory.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/dests/<D>/id
+Date: Jul 2025
+Contact: SeongJae Park <[email protected]>
+Description: Writing to and reading from this file sets and gets the id of
+the DAMOS action destination. For DAMOS_MIGRATE_{HOT,COLD}
+actions, the destination node's node id can be written and
+read.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/dests/<D>/weight
+Date: Jul 2025
+Contact: SeongJae Park <[email protected]>
+Description: Writing to and reading from this file sets and gets the weight
+of the DAMOS action destination to select as the destination of
+each action among the destinations.
+
 What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/stats/nr_tried
 Date: Mar 2022
 Contact: SeongJae Park <[email protected]>
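
The `refresh_ms` and `dests/` entries above imply an ordering: `nr_dests` has to be written before the per-destination `id` and `weight` files exist. The following Python sketch only plans that sequence of writes as `(path, value)` pairs. The function names are hypothetical, and the sketch assumes the kdamond, context and scheme directories have already been created.

```python
from pathlib import PurePosixPath

DAMON_ADMIN = PurePosixPath("/sys/kernel/mm/damon/admin")


def plan_dest_writes(kdamond, context, scheme, dests, refresh_ms=None):
    """Return the ordered (path, value) writes for the documented files.

    `dests` is a list of (node_id, weight) pairs. nr_dests is written first
    because the per-destination directories only exist after that write.
    """
    kdamond_dir = DAMON_ADMIN / "kdamonds" / str(kdamond)
    dests_dir = (kdamond_dir / "contexts" / str(context)
                 / "schemes" / str(scheme) / "dests")
    writes = []
    if refresh_ms is not None:
        # Writing 0 disables the periodic update, per the ABI text.
        writes.append((str(kdamond_dir / "refresh_ms"), str(refresh_ms)))
    writes.append((str(dests_dir / "nr_dests"), str(len(dests))))
    for index, (node_id, weight) in enumerate(dests):
        writes.append((str(dests_dir / str(index) / "id"), str(node_id)))
        writes.append((str(dests_dir / str(index) / "weight"), str(weight)))
    return writes


def apply_writes(writes):
    """Perform the planned writes in order (needs root and DAMON sysfs)."""
    for path, value in writes:
        with open(path, "w") as f:
            f.write(value)
```

Keeping the plan separate from `apply_writes()` lets the sequence be inspected or logged before anything touches sysfs.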

Documentation/admin-guide/mm/damon/index.rst

Lines changed: 1 addition & 0 deletions
@@ -14,3 +14,4 @@ access monitoring and access-aware system operations.
 usage
 reclaim
 lru_sort
+stat
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Data Access Monitoring Results Stat
+===================================
+
+Data Access Monitoring Results Stat (DAMON_STAT) is a static kernel module that
+is aimed to be used for simple access pattern monitoring. It monitors accesses
+on the system's entire physical memory using DAMON, and provides simplified
+access monitoring results statistics, namely idle time percentiles and
+estimated memory bandwidth.
+
+Monitoring Accuracy and Overhead
+================================
+
+DAMON_STAT uses monitoring intervals :ref:`auto-tuning
+<damon_design_monitoring_intervals_autotuning>` to make its accuracy high and
+overhead minimum. It auto-tunes the intervals aiming 4 % of observable access
+events to be captured in each snapshot, while limiting the resulting sampling
+interval to be 5 milliseconds in minimum and 10 seconds in maximum. On a few
+production server systems, it resulted in consuming only 0.x % single CPU time,
+while capturing reasonable quality of access patterns.
+
+Interface: Module Parameters
+============================
+
+To use this feature, you should first ensure your system is running on a kernel
+that is built with ``CONFIG_DAMON_STAT=y``. The feature can be enabled by
+default at build time, by setting ``CONFIG_DAMON_STAT_ENABLED_DEFAULT`` true.
+
+To let sysadmins enable or disable it at boot and/or runtime, and read the
+monitoring results, DAMON_STAT provides module parameters. Following
+sections are descriptions of the parameters.
+
+enabled
+-------
+
+Enable or disable DAMON_STAT.
+
+You can enable DAMON_STAT by setting the value of this parameter as ``Y``.
+Setting it as ``N`` disables DAMON_STAT. The default value is set by
+``CONFIG_DAMON_STAT_ENABLED_DEFAULT`` build config option.
+
+estimated_memory_bandwidth
+--------------------------
+
+Estimated memory bandwidth consumption (bytes per second) of the system.
+
+DAMON_STAT reads observed access events on the current DAMON results snapshot
+and converts it to memory bandwidth consumption estimation in bytes per second.
+The resulting metric is exposed to user via this read-only parameter. Because
+DAMON uses sampling, this is only an estimation of the access intensity rather
+than accurate memory bandwidth.
+
+memory_idle_ms_percentiles
+--------------------------
+
+Per-byte idle time (milliseconds) percentiles of the system.
+
+DAMON_STAT calculates how long each byte of the memory was not accessed until
+now (idle time), based on the current DAMON results snapshot. If DAMON found a
+region of access frequency (nr_accesses) larger than zero, every byte of the
+region gets zero idle time. If a region has zero access frequency
+(nr_accesses), how long the region was keeping the zero access frequency (age)
+becomes the idle time of every byte of the region. Then, DAMON_STAT exposes
+the percentiles of the idle time values via this read-only parameter. Reading
+the parameter returns 101 idle time values in milliseconds, separated by comma.
+Each value represents 0-th, 1st, 2nd, 3rd, ..., 99th and 100th percentile idle
+times.
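
The idle-time rule described for `memory_idle_ms_percentiles` can be illustrated with a small Python sketch. It is not the kernel's implementation. The `(size_bytes, nr_accesses, age_ms)` tuples are a simplified stand-in for a DAMON snapshot, the function names are hypothetical, and the way a percentile is mapped to a byte offset is an assumption made for illustration.

```python
def idle_ms_percentiles(regions):
    """Compute 101 per-byte idle time percentiles (0th..100th).

    `regions` is a list of (size_bytes, nr_accesses, age_ms) tuples. A region
    with nr_accesses > 0 gives zero idle time to each of its bytes; otherwise
    its age is the idle time of each of its bytes.
    """
    weighted = sorted(
        (0 if nr_accesses > 0 else age_ms, size)
        for size, nr_accesses, age_ms in regions
    )
    total = sum(size for _, size in weighted)
    if total == 0:
        return [0] * 101
    result = []
    covered = 0  # bytes in regions already passed over
    index = 0
    for percentile in range(101):
        target = total * percentile / 100
        # Advance until the current region covers the target byte offset.
        while index < len(weighted) - 1 and covered + weighted[index][1] < target:
            covered += weighted[index][1]
            index += 1
        result.append(weighted[index][0])
    return result


def parse_percentiles(text):
    """Parse the comma-separated parameter value into 101 integers."""
    values = [int(v) for v in text.strip().split(",")]
    if len(values) != 101:
        raise ValueError(f"expected 101 values, got {len(values)}")
    return values
```

For two equally sized regions, one recently accessed and one idle for 2000 ms, the lower half of the percentiles is 0 and the upper half is 2000.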

Documentation/admin-guide/mm/damon/usage.rst

Lines changed: 39 additions & 7 deletions
@@ -59,7 +59,7 @@ comma (",").

 :ref:`/sys/kernel/mm/damon <sysfs_root>`/admin
 :ref:`kdamonds <sysfs_kdamonds>`/nr_kdamonds
-│ │ :ref:`0 <sysfs_kdamond>`/state,pid
+│ │ :ref:`0 <sysfs_kdamond>`/state,pid,refresh_ms
 │ │ │ :ref:`contexts <sysfs_contexts>`/nr_contexts
 │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations
 │ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/
@@ -85,6 +85,8 @@ comma (",").
 │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
 │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters
 │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max
+│ │ │ │ │ │ │ :ref:`dests <damon_sysfs_dests>`/nr_dests
+│ │ │ │ │ │ │ │ 0/id,weight
 │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds
 │ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes
 │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed
@@ -121,8 +123,8 @@ kdamond.
 kdamonds/<N>/
 -------------

-In each kdamond directory, two files (``state`` and ``pid``) and one directory
-(``contexts``) exist.
+In each kdamond directory, three files (``state``, ``pid`` and ``refresh_ms``)
+and one directory (``contexts``) exist.

 Reading ``state`` returns ``on`` if the kdamond is currently running, or
 ``off`` if it is not running.
@@ -159,6 +161,13 @@ Users can write below commands for the kdamond to the ``state`` file.

 If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.

+Users can ask the kernel to periodically update files showing auto-tuned
+parameters and DAMOS stats instead of manually writing
+``update_tuned_intervals`` like keywords to ``state`` file. For this, users
+should write the desired update time interval in milliseconds to ``refresh_ms``
+file. If the interval is zero, the periodic update is disabled. Reading the
+file shows currently set time interval.
+
 ``contexts`` directory contains files for controlling the monitoring contexts
 that this kdamond will execute.

@@ -307,10 +316,10 @@ to ``N-1``. Each directory represents each DAMON-based operation scheme.
 schemes/<N>/
 ------------

-In each scheme directory, seven directories (``access_pattern``, ``quotas``,
-``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``stats``, and
-``tried_regions``) and three files (``action``, ``target_nid`` and
-``apply_interval``) exist.
+In each scheme directory, nine directories (``access_pattern``, ``quotas``,
+``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``dests``,
+``stats``, and ``tried_regions``) and three files (``action``, ``target_nid``
+and ``apply_interval``) exist.

 The ``action`` file is for setting and getting the scheme's :ref:`action
 <damon_design_damos_action>`. The keywords that can be written to and read
@@ -484,6 +493,29 @@ Refer to the :ref:`DAMOS filters design documentation
 of different ``allow`` works, when each of the filters are supported, and
 differences on stats.

+.. _damon_sysfs_dests:
+
+schemes/<N>/dests/
+------------------
+
+Directory for specifying the destinations of given DAMON-based operation
+scheme's action. This directory is ignored if the action of the given scheme
+is not supporting multiple destinations. Only ``DAMOS_MIGRATE_{HOT,COLD}``
+actions are supporting multiple destinations.
+
+In the beginning, the directory has only one file, ``nr_dests``. Writing a
+number (``N``) to the file creates the number of child directories named ``0``
+to ``N-1``. Each directory represents each action destination.
+
+Each destination directory contains two files, namely ``id`` and ``weight``.
+Users can write and read the identifier of the destination to ``id`` file.
+For ``DAMOS_MIGRATE_{HOT,COLD}`` actions, the migrate destination node's node
+id should be written to ``id`` file. Users can write and read the weight of
+the destination among the given destinations to the ``weight`` file. The
+weight can be an arbitrary integer. When DAMOS applies the action to each entity
+of the memory region, it will select the destination of the action based on the
+relative weights of the destinations.
+
 .. _sysfs_schemes_stats:

 schemes/<N>/stats/
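
To make "relative weights" in the `dests/` description concrete, here is a small Python sketch of one way to spread migrated units across destinations in proportion to their weights. It is an illustration only and is not claimed to match how DAMOS selects destinations internally. The function name is hypothetical, and non-negative weights are assumed.

```python
def pick_destination(dests, unit_index):
    """Pick a destination id for the `unit_index`-th migrated unit.

    `dests` is a list of (id, weight) pairs, as written to dests/<D>/id and
    dests/<D>/weight. Units are spread in proportion to the weights by
    walking the cumulative weight ranges. Assumes non-negative weights.
    """
    total = sum(weight for _, weight in dests)
    if total <= 0:
        raise ValueError("at least one destination needs a positive weight")
    slot = unit_index % total
    for dest_id, weight in dests:
        if slot < weight:
            return dest_id
        slot -= weight
    raise AssertionError("unreachable: slot always falls in some range")
```

With destinations `[(0, 1), (1, 3)]`, one unit in every four goes to node 0 and the other three go to node 1.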

Documentation/admin-guide/mm/transhuge.rst

Lines changed: 15 additions & 4 deletions
@@ -107,7 +107,7 @@ sysfs
 Global THP controls
 -------------------

-Transparent Hugepage Support for anonymous memory can be entirely disabled
+Transparent Hugepage Support for anonymous memory can be disabled
 (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
 regions (to avoid the risk of consuming more memory resources) or enabled
 system wide. This can be achieved per-supported-THP-size with one of::
@@ -119,6 +119,11 @@ system wide. This can be achieved per-supported-THP-size with one of::
 where <size> is the hugepage size being addressed, the available sizes
 for which vary by system.

+.. note:: Setting "never" in all sysfs THP controls does **not** disable
+Transparent Huge Pages globally. This is because ``madvise(...,
+MADV_COLLAPSE)`` ignores these settings and collapses ranges to
+PMD-sized huge pages unconditionally.
+
 For example::

 echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
@@ -187,7 +192,9 @@ madvise
 behaviour.

 never
-should be self-explanatory.
+should be self-explanatory. Note that ``madvise(...,
+MADV_COLLAPSE)`` can still cause transparent huge pages to be
+obtained even if this mode is specified everywhere.

 By default kernel tries to use huge, PMD-mappable zero page on read
 page fault to anonymous mapping. It's possible to disable huge zero
@@ -378,7 +385,9 @@ always
 Attempt to allocate huge pages every time we need a new page;

 never
-Do not allocate huge pages;
+Do not allocate huge pages. Note that ``madvise(..., MADV_COLLAPSE)``
+can still cause transparent huge pages to be obtained even if this mode
+is specified everywhere;

 within_size
 Only allocate huge page if it will be fully within i_size.
@@ -434,7 +443,9 @@ inherit
 have enabled="inherit" and all other hugepage sizes have enabled="never";

 never
-Do not allocate <size> huge pages;
+Do not allocate <size> huge pages. Note that ``madvise(...,
+MADV_COLLAPSE)`` can still cause transparent huge pages to be obtained
+even if this mode is specified everywhere;

 within_size
 Only allocate <size> huge page if it will be fully within i_size.
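
The practical consequence of the added notes is that a script cannot conclude "no THPs anywhere" just because every control reads `never`. The Python sketch below parses the active mode from a control file's text and checks whether all controls are `never`. It assumes the usual sysfs convention of marking the active choice with square brackets, and both function names are hypothetical.

```python
import re


def active_thp_mode(text):
    """Return the bracketed (active) mode from a THP control file's text."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)


def all_controls_never(modes):
    """True if every given control is set to 'never'.

    Even when this returns True, madvise(..., MADV_COLLAPSE) can still
    produce PMD-sized huge pages, as the notes above point out, so this is
    not a guarantee that a process has no transparent huge pages.
    """
    return all(mode == "never" for mode in modes)
```

A caller would read each `enabled` file, pass its contents through `active_thp_mode()`, and treat a `True` result from `all_controls_never()` only as "THP is not used on page faults or by khugepaged".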

Documentation/core-api/memory-hotplug.rst

Lines changed: 82 additions & 9 deletions
@@ -9,6 +9,9 @@ Memory hotplug event notifier

 Hotplugging events are sent to a notification queue.

+Memory notifier
+----------------
+
 There are six types of notification defined in ``include/linux/memory.h``:

 MEM_GOING_ONLINE
@@ -56,20 +59,18 @@ The third argument (arg) passes a pointer of struct memory_notify::
 struct memory_notify {
 unsigned long start_pfn;
 unsigned long nr_pages;
-int status_change_nid_normal;
-int status_change_nid;
 }

 - start_pfn is start_pfn of online/offline memory.
 - nr_pages is # of pages of online/offline memory.
-- status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask
-is (will be) set/clear, if this is -1, then nodemask status is not changed.
-- status_change_nid is set node id when N_MEMORY of nodemask is (will be)
-set/clear. It means a new(memoryless) node gets new memory by online and a
-node loses all memory. If this is -1, then nodemask status is not changed.

-If status_changed_nid* >= 0, callback should create/discard structures for the
-node if necessary.
+It is possible to get notified for MEM_CANCEL_ONLINE without having been notified
+for MEM_GOING_ONLINE, and the same applies to MEM_CANCEL_OFFLINE and
+MEM_GOING_OFFLINE.
+This can happen when a consumer fails, meaning we break the callchain and we
+stop calling the remaining consumers of the notifier.
+It is then important that users of memory_notify make no assumptions and get
+prepared to handle such cases.

 The callback routine shall return one of the values
 NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP
@@ -83,6 +84,78 @@ further processing of the notification queue.

 NOTIFY_STOP stops further processing of the notification queue.

+Numa node notifier
+------------------
+
+There are six types of notification defined in ``include/linux/node.h``:
+
+NODE_ADDING_FIRST_MEMORY
+Generated before memory becomes available to this node for the first time.
+
+NODE_CANCEL_ADDING_FIRST_MEMORY
+Generated if NODE_ADDING_FIRST_MEMORY fails.
+
+NODE_ADDED_FIRST_MEMORY
+Generated when memory has become available for this node for the first time.
+
+NODE_REMOVING_LAST_MEMORY
+Generated when the last memory available to this node is about to be offlined.
+
+NODE_CANCEL_REMOVING_LAST_MEMORY
+Generated when NODE_REMOVING_LAST_MEMORY fails.
+
+NODE_REMOVED_LAST_MEMORY
+Generated when the last memory available to this node has been offlined.
+
+A callback routine can be registered by calling::
+
+hotplug_node_notifier(callback_func, priority)
+
+Callback functions with higher values of priority are called before callback
+functions with lower values.
+
+A callback function must have the following prototype::
+
+int callback_func(
+
+struct notifier_block *self, unsigned long action, void *arg);
+
+The first argument of the callback function (self) is a pointer to the block
+of the notifier chain that points to the callback function itself.
+The second argument (action) is one of the event types described above.
+The third argument (arg) passes a pointer of struct node_notify::
+
+struct node_notify {
+int nid;
+}
+
+- nid is the node we are adding or removing memory to.
+
+It is possible to get notified for NODE_CANCEL_ADDING_FIRST_MEMORY without
+having been notified for NODE_ADDING_FIRST_MEMORY, and the same applies to
+NODE_CANCEL_REMOVING_LAST_MEMORY and NODE_REMOVING_LAST_MEMORY.
+This can happen when a consumer fails, meaning we break the callchain and we
+stop calling the remaining consumers of the notifier.
+It is then important that users of node_notify make no assumptions and get
+prepared to handle such cases.
+
+The callback routine shall return one of the values
+NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP
+defined in ``include/linux/notifier.h``
+
+NOTIFY_DONE and NOTIFY_OK have no effect on the further processing.
+
+NOTIFY_BAD is used as response to the NODE_ADDING_FIRST_MEMORY,
+NODE_REMOVING_LAST_MEMORY, NODE_ADDED_FIRST_MEMORY or
+NODE_REMOVED_LAST_MEMORY action to cancel hotplugging.
+It stops further processing of the notification queue.
+
+NOTIFY_STOP stops further processing of the notification queue.
+
+Please note that we should not fail for NODE_ADDED_FIRST_MEMORY /
+NODE_REMOVED_LAST_MEMORY, as memory_hotplug code cannot rollback at that
+point anymore.
+
 Locking Internals
 =================
