
Commit 7eec11d

Merge branch 'akpm' (patches from Andrew)
Pull updates from Andrew Morton:
 "Most of -mm and quite a number of other subsystems: hotfixes, scripts,
  ocfs2, misc, lib, binfmt, init, reiserfs, exec, dma-mapping, kcov.

  MM is fairly quiet this time. Holidays, I assume"

* emailed patches from Andrew Morton <[email protected]>: (118 commits)
  kcov: ignore fault-inject and stacktrace
  include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()
  execve: warn if process starts with executable stack
  reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
  init/main.c: fix misleading "This architecture does not have kernel memory protection" message
  init/main.c: fix quoted value handling in unknown_bootoption
  init/main.c: remove unnecessary repair_env_string in do_initcall_level
  init/main.c: log arguments and environment passed to init
  fs/binfmt_elf.c: coredump: allow process with empty address space to coredump
  fs/binfmt_elf.c: coredump: delete duplicated overflow check
  fs/binfmt_elf.c: coredump: allocate core ELF header on stack
  fs/binfmt_elf.c: make BAD_ADDR() unlikely
  fs/binfmt_elf.c: better codegen around current->mm
  fs/binfmt_elf.c: don't copy ELF header around
  fs/binfmt_elf.c: fix ->start_code calculation
  fs/binfmt_elf.c: smaller code generation around auxv vector fill
  lib/find_bit.c: uninline helper _find_next_bit()
  lib/find_bit.c: join _find_next_bit{_le}
  uapi: rename ext2_swab() to swab() and share globally in swab.h
  lib/scatterlist.c: adjust indentation in __sg_alloc_table
  ...
2 parents ddaefe8 + 43e76af commit 7eec11d

File tree

136 files changed (+2724, -1293 lines)


Documentation/admin-guide/kernel-parameters.txt

Lines changed: 12 additions & 0 deletions

@@ -834,6 +834,18 @@
 			dump out devices still on the deferred probe list after
 			retrying.
 
+	dfltcc=		[HW,S390]
+			Format: { on | off | def_only | inf_only | always }
+			on:       s390 zlib hardware support for compression on
+			          level 1 and decompression (default)
+			off:      No s390 zlib hardware support
+			def_only: s390 zlib hardware support for deflate
+			          only (compression on level 1)
+			inf_only: s390 zlib hardware support for inflate
+			          only (decompression)
+			always:   Same as 'on' but ignores the selected compression
+			          level always using hardware support (used for debugging)
+
 	dhash_entries=	[KNL]
 			Set number of hash buckets for dentry cache.
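As a rough userspace sketch of how these five values map onto an internal support mode: parse_dfltcc() is a hypothetical helper written only for this example; the real parsing lives in arch/s390/boot/ipl_parm.c in this commit, and the ZLIB_DFLTCC_* values mirror arch/s390/include/asm/setup.h.

```c
#include <string.h>

/* Mirrors the ZLIB_DFLTCC_* constants from arch/s390/include/asm/setup.h. */
enum {
	ZLIB_DFLTCC_DISABLED     = 0,
	ZLIB_DFLTCC_FULL         = 1,
	ZLIB_DFLTCC_DEFLATE_ONLY = 2,
	ZLIB_DFLTCC_INFLATE_ONLY = 3,
	ZLIB_DFLTCC_FULL_DEBUG   = 4,
};

/* Hypothetical helper: map a dfltcc= value to a support mode.
 * Unrecognized values fall back to the default, ZLIB_DFLTCC_FULL ("on"). */
static int parse_dfltcc(const char *val)
{
	if (!strcmp(val, "off"))
		return ZLIB_DFLTCC_DISABLED;
	if (!strcmp(val, "def_only"))
		return ZLIB_DFLTCC_DEFLATE_ONLY;
	if (!strcmp(val, "inf_only"))
		return ZLIB_DFLTCC_INFLATE_ONLY;
	if (!strcmp(val, "always"))
		return ZLIB_DFLTCC_FULL_DEBUG;
	return ZLIB_DFLTCC_FULL;	/* "on" and anything else */
}
```

Note that the in-kernel code parses in boot context, with no fallback branch written out this way; the table above is only a compact restatement of the documented mapping.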

Documentation/core-api/index.rst

Lines changed: 1 addition & 0 deletions

@@ -31,6 +31,7 @@ Core utilities
    generic-radix-tree
    memory-allocation
    mm-api
+   pin_user_pages
    gfp_mask-from-fs-io
    timekeeping
    boot-time-mm
Documentation/core-api/pin_user_pages.rst (new file)

Lines changed: 232 additions & 0 deletions

@@ -0,0 +1,232 @@

.. SPDX-License-Identifier: GPL-2.0

====================================================
pin_user_pages() and related calls
====================================================

.. contents:: :local:

Overview
========

This document describes the following functions::

    pin_user_pages()
    pin_user_pages_fast()
    pin_user_pages_remote()
Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
sites. This allows the associated wrapper functions (pin_user_pages*() and
others) to set the correct combination of these flags, and to check for problems
as well.

FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
This is in order to avoid creating a large number of wrapper functions to cover
all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
that's a natural dividing line, and a good point to make separate wrapper calls.
In other words, use pin_user_pages*() for DMA-pinned pages, and
get_user_pages*() for other cases. There are four cases described later on in
this document, to further clarify that concept.

FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
multiple threads and call sites are free to pin the same struct pages, via both
FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
other, not the struct page(s).

The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
uses a different reference counting technique.

FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.
Which flags are set by each wrapper
===================================

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null struct
pages* array, and the function then pins pages by incrementing each by a special
value. For now, that value is +1, just like get_user_pages*().::

    Function
    --------
    pin_user_pages          FOLL_PIN is always set internally by this function.
    pin_user_pages_fast     FOLL_PIN is always set internally by this function.
    pin_user_pages_remote   FOLL_PIN is always set internally by this function.

For these get_user_pages*() functions, FOLL_GET might not even be specified.
Behavior is a little more complex than above. If FOLL_GET was *not* specified,
but the caller passed in a non-null struct pages* array, then the function
sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
of each page by +1.::

    Function
    --------
    get_user_pages          FOLL_GET is sometimes set internally by this function.
    get_user_pages_fast     FOLL_GET is sometimes set internally by this function.
    get_user_pages_remote   FOLL_GET is sometimes set internally by this function.
Tracking dma-pinned pages
=========================

Some of the key design constraints, and solutions, for tracking dma-pinned
pages:

* An actual reference count, per struct page, is required. This is because
  multiple processes may pin and unpin a page.

* False positives (reporting that a page is dma-pinned, when in fact it is not)
  are acceptable, but false negatives are not.

* struct page may not be increased in size for this, and all fields are already
  used.

* Given the above, we can overload the page->_refcount field by using, sort of,
  the upper bits in that field for a dma-pinned count. "Sort of", means that,
  rather than dividing page->_refcount into bit fields, we simply add a
  medium-large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10
  bits) to page->_refcount. This provides fuzzy behavior: if a page has
  get_page() called on it 1024 times, then it will appear to have a single
  dma-pinned count. And again, that's acceptable.

This also leads to limitations: there are only 31-10==21 bits available for a
counter that increments 10 bits at a time.
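The counting scheme this section describes can be simulated in a few lines of ordinary userspace C. This is a sketch of the design only, not kernel code; the sim_* names are invented for illustration, and (as noted earlier) the pin functions in this commit still increment by +1, with the 1024 bias being the design target.

```c
#include <assert.h>

#define GUP_PIN_COUNTING_BIAS 1024	/* 2^10, as chosen above */

/* Simulated page->_refcount; no real struct page is involved here. */
static int sim_refcount;

static void sim_get_page(void)   { sim_refcount += 1; }
static void sim_put_page(void)   { sim_refcount -= 1; }
static void sim_pin_page(void)   { sim_refcount += GUP_PIN_COUNTING_BIAS; }
static void sim_unpin_page(void) { sim_refcount -= GUP_PIN_COUNTING_BIAS; }

/* Fuzzy query: anything at or above one bias unit reports "dma-pinned".
 * 1024 plain get_page() calls are indistinguishable from one pin; that is
 * exactly the acceptable false positive described in the text. */
static int sim_page_dma_pinned(void)
{
	return sim_refcount >= GUP_PIN_COUNTING_BIAS;
}
```

One sim_pin_page() makes the page report pinned; after sim_unpin_page() it no longer does, unless enough ordinary references pile up to cross the bias.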
TODO: for 1GB and larger huge pages, this is cutting it close. That's because
when pin_user_pages() follows such pages, it increments the head page by "1"
(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
pin_user_pages()) for each tail page. So if you have a 1GB huge page:

* There are 256K (18 bits) worth of 4 KB tail pages.
* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
  10 bits at a time)
* There are 21 - 18 == 3 bits available to count. Except that there aren't,
  because you need to allow for a few normal get_page() calls on the head page,
  as well. Fortunately, the approach of using addition, rather than "hard"
  bitfields, within page->_refcount, allows for sharing these bits gracefully.
  But we're still looking at about 8 references.
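The arithmetic in the bullets above is easy to check mechanically. A tiny C sketch, with constant names of our own invention that mirror the numbers in the text:

```c
#include <stdint.h>

/* Numbers from the bullets above, spelled out as code. */
#define REFCOUNT_BITS 31	/* usable bits in page->_refcount */
#define BIAS_BITS     10	/* GUP_PIN_COUNTING_BIAS == 1024 == 2^10 */
#define TAIL_BITS     18	/* 1 GB / 4 KB == 256K == 2^18 tail pages */

/* How many whole-1GB-huge-page pins fit before the counter overflows:
 * one bias unit per tail page consumes 18 of the 21 available bias-sized
 * increments, leaving 3 bits == 8 pins (minus head-page get_page() slack). */
static int64_t huge_pin_headroom(void)
{
	return (int64_t)1 << (REFCOUNT_BITS - BIAS_BITS - TAIL_BITS);
}
```

This reproduces the "about 8 references" conclusion in the text.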
This, however, is a missing feature more than anything else, because it's easily
solved by addressing an obvious inefficiency in the original get_user_pages()
approach of retrieving pages: stop treating all the pages as if they were
PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
this, so some work is required. Once that's in place, this limitation mostly
disappears from view, because there will be ample refcounting range available.

* Callers must specifically request "dma-pinned tracking of pages". In other
  words, just calling get_user_pages() will not suffice; a new set of
  functions, pin_user_pages() and related, must be used.
FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
==========================================================

Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
these categories:

CASE 1: Direct IO (DIO)
-----------------------
There are GUP references to pages that are serving
as DIO buffers. These buffers are needed for a relatively short time (so they
are not "long term"). No special synchronization with page_mkclean() or
munmap() is provided. Therefore, flags to set at the call site are: ::

    FOLL_PIN

...but rather than setting FOLL_PIN directly, call sites should use one of
the pin_user_pages*() routines that set FOLL_PIN.

CASE 2: RDMA
------------
There are GUP references to pages that are serving as DMA
buffers. These buffers are needed for a long time ("long term"). No special
synchronization with page_mkclean() or munmap() is provided. Therefore, flags
to set at the call site are: ::

    FOLL_PIN | FOLL_LONGTERM

NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
because DAX pages do not have a separate page cache, and so "pinning" implies
locking down file system blocks, which is not (yet) supported in that way.

CASE 3: Hardware with page faulting support
-------------------------------------------
Here, a well-written driver doesn't normally need to pin pages at all. However,
if the driver does choose to do so, it can register MMU notifiers for the range,
and will be called back upon invalidation. Either way (avoiding page pinning, or
using MMU notifiers to unpin upon request), there is proper synchronization with
both filesystem and mm (page_mkclean(), munmap(), etc).

Therefore, neither flag needs to be set.

In this case, ideally, neither get_user_pages() nor pin_user_pages() should be
called. Instead, the software should be written so that it does not pin pages.
This allows mm and filesystems to operate more efficiently and reliably.

CASE 4: Pinning for struct page manipulation only
-------------------------------------------------
Here, normal GUP calls are sufficient, so neither flag needs to be set.
page_dma_pinned(): the whole point of pinning
=============================================

The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
(and file system writeback code in general) to make informed decisions about
what to do when a page cannot be unmapped due to such pins.

What to do in those cases is the subject of a years-long series of discussions
and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available: ::

    static inline bool page_dma_pinned(struct page *page)

...is a prerequisite to solving the long-running gup+DMA problem.
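As a toy illustration of why the query matters, here is a userspace sketch of a writeback-style decision. The sim_* names and the defer policy are invented for this example; the document itself says the real policy is still a TODO.

```c
#include <stdbool.h>

#define GUP_PIN_COUNTING_BIAS 1024

/* Stand-in for the kernel's struct page; only the refcount matters here. */
struct sim_page {
	int _refcount;
};

/* Fuzzy dma-pinned query, per the tracking scheme described earlier. */
static bool sim_page_dma_pinned(const struct sim_page *page)
{
	return page->_refcount >= GUP_PIN_COUNTING_BIAS;
}

/* Illustrative writeback decision: 0 = write back now, 1 = defer.
 * This stands in for whatever policy the TODO above eventually settles on. */
static int sim_writeback_action(const struct sim_page *page)
{
	return sim_page_dma_pinned(page) ? 1 : 0;
}
```

The point is simply that once page_dma_pinned() exists, callers like page_mkclean() can branch on it instead of guessing.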
Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
===================================================================

Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
will be pinned longterm, and whose data will be accessed.
Unit testing
============
This file::

    tools/testing/selftests/vm/gup_benchmark.c

has the following new calls to exercise the new pin*() wrapper functions:

* PIN_FAST_BENCHMARK (./gup_benchmark -a)
* PIN_BENCHMARK (./gup_benchmark -b)

You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new /proc/vmstat entries: ::

    /proc/vmstat/nr_foll_pin_requested
    /proc/vmstat/nr_foll_pin_returned

Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
because there is a noticeable performance drop in unpin_user_page(), when they
are activated.
References
==========

* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_

John Hubbard, October, 2019

Documentation/vm/zswap.rst

Lines changed: 13 additions & 0 deletions

@@ -130,6 +130,19 @@ checking for the same-value filled pages during store operation. However, the
 existing pages which are marked as same-value filled pages remain stored
 unchanged in zswap until they are either loaded or invalidated.
 
+When zswap is full and there is high pressure on swap, shrinking the pool
+would only flip pages into and out of the zswap pool, without any real
+benefit but with a performance drop for the system. To prevent this, a
+special parameter implements a form of hysteresis: once the pool limit has
+been hit, zswap refuses to take new pages until it again has sufficient
+space. To set the threshold at which zswap starts accepting pages again
+after it became full, use the sysfs ``accept_threshold_percent`` attribute,
+e.g.::
+
+	echo 80 > /sys/module/zswap/parameters/accept_threshold_percent
+
+Setting this parameter to 100 will disable the hysteresis.
+
 A debugfs interface is provided for various statistic about pool size, number
 of pages stored, same-value filled pages and various counters for the reasons
 pages are rejected.
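The hysteresis can be modelled in a few lines of userspace C. This is a sketch only: the struct fields and function names here are invented for illustration, and zswap's in-kernel implementation differs.

```c
#include <stdbool.h>

/* Toy model of zswap's accept-threshold hysteresis: once the pool hits
 * its limit, refuse new pages until usage drops below the threshold. */
struct sim_zswap {
	long max_pages;		/* pool limit */
	long used_pages;	/* current pool usage */
	int  accept_thr_pct;	/* accept_threshold_percent; 100 disables hysteresis */
	bool full;		/* set when the limit was hit */
};

/* Would a store into the pool be accepted right now? */
static bool sim_zswap_store_ok(struct sim_zswap *z)
{
	if (z->used_pages >= z->max_pages)
		z->full = true;
	else if (z->used_pages < z->max_pages * z->accept_thr_pct / 100)
		z->full = false;	/* dropped below threshold: accept again */
	return !z->full;
}
```

With accept_thr_pct set to 100, the pool starts accepting again as soon as it is no longer full, which matches "Setting this parameter to 100 will disable the hysteresis."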

arch/powerpc/mm/book3s64/iommu_api.c

Lines changed: 5 additions & 5 deletions

@@ -103,7 +103,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	for (entry = 0; entry < entries; entry += chunk) {
 		unsigned long n = min(entries - entry, chunk);
 
-		ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
+		ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n,
 				FOLL_WRITE | FOLL_LONGTERM,
 				mem->hpages + entry, NULL);
 		if (ret == n) {
@@ -167,9 +167,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	return 0;
 
 free_exit:
-	/* free the reference taken */
-	for (i = 0; i < pinned; i++)
-		put_page(mem->hpages[i]);
+	/* free the references taken */
+	unpin_user_pages(mem->hpages, pinned);
 
 	vfree(mem->hpas);
 	kfree(mem);
@@ -215,7 +214,8 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
 		if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
 			SetPageDirty(page);
 
-		put_page(page);
+		unpin_user_page(page);
+
 		mem->hpas[i] = 0;
 	}
 }

arch/s390/boot/compressed/decompressor.c

Lines changed: 4 additions & 4 deletions

@@ -30,13 +30,13 @@ extern unsigned char _compressed_start[];
 extern unsigned char _compressed_end[];
 
 #ifdef CONFIG_HAVE_KERNEL_BZIP2
-#define HEAP_SIZE	0x400000
+#define BOOT_HEAP_SIZE	0x400000
 #else
-#define HEAP_SIZE	0x10000
+#define BOOT_HEAP_SIZE	0x10000
 #endif
 
 static unsigned long free_mem_ptr = (unsigned long) _end;
-static unsigned long free_mem_end_ptr = (unsigned long) _end + HEAP_SIZE;
+static unsigned long free_mem_end_ptr = (unsigned long) _end + BOOT_HEAP_SIZE;
 
 #ifdef CONFIG_KERNEL_GZIP
 #include "../../../../lib/decompress_inflate.c"
@@ -62,7 +62,7 @@ static unsigned long free_mem_end_ptr = (unsigned long) _end + HEAP_SIZE;
 #include "../../../../lib/decompress_unxz.c"
 #endif
 
-#define decompress_offset ALIGN((unsigned long)_end + HEAP_SIZE, PAGE_SIZE)
+#define decompress_offset ALIGN((unsigned long)_end + BOOT_HEAP_SIZE, PAGE_SIZE)
 
 unsigned long mem_safe_offset(void)
 {

arch/s390/boot/ipl_parm.c

Lines changed: 14 additions & 0 deletions

@@ -14,6 +14,7 @@
 char __bootdata(early_command_line)[COMMAND_LINE_SIZE];
 struct ipl_parameter_block __bootdata_preserved(ipl_block);
 int __bootdata_preserved(ipl_block_valid);
+unsigned int __bootdata_preserved(zlib_dfltcc_support) = ZLIB_DFLTCC_FULL;
 
 unsigned long __bootdata(vmalloc_size) = VMALLOC_DEFAULT_SIZE;
 unsigned long __bootdata(memory_end);
@@ -229,6 +230,19 @@ void parse_boot_command_line(void)
 		if (!strcmp(param, "vmalloc") && val)
 			vmalloc_size = round_up(memparse(val, NULL), PAGE_SIZE);
 
+		if (!strcmp(param, "dfltcc")) {
+			if (!strcmp(val, "off"))
+				zlib_dfltcc_support = ZLIB_DFLTCC_DISABLED;
+			else if (!strcmp(val, "on"))
+				zlib_dfltcc_support = ZLIB_DFLTCC_FULL;
+			else if (!strcmp(val, "def_only"))
+				zlib_dfltcc_support = ZLIB_DFLTCC_DEFLATE_ONLY;
+			else if (!strcmp(val, "inf_only"))
+				zlib_dfltcc_support = ZLIB_DFLTCC_INFLATE_ONLY;
+			else if (!strcmp(val, "always"))
+				zlib_dfltcc_support = ZLIB_DFLTCC_FULL_DEBUG;
+		}
+
 		if (!strcmp(param, "noexec")) {
 			rc = kstrtobool(val, &enabled);
 			if (!rc && !enabled)

arch/s390/include/asm/setup.h

Lines changed: 7 additions & 0 deletions

@@ -79,6 +79,13 @@ struct parmarea {
 	char command_line[ARCH_COMMAND_LINE_SIZE];	/* 0x10480 */
 };
 
+extern unsigned int zlib_dfltcc_support;
+#define ZLIB_DFLTCC_DISABLED		0
+#define ZLIB_DFLTCC_FULL		1
+#define ZLIB_DFLTCC_DEFLATE_ONLY	2
+#define ZLIB_DFLTCC_INFLATE_ONLY	3
+#define ZLIB_DFLTCC_FULL_DEBUG		4
+
 extern int noexec_disabled;
 extern int memory_end_set;
 extern unsigned long memory_end;
