Commit db3b4ca
mm: introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
1 parent cad0cbc commit db3b4ca

File tree

1 file changed: +246 -0 lines changed
Lines changed: 246 additions & 0 deletions
@@ -0,0 +1,246 @@
mm: introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud

jira LE-3557
Rebuild_History Non-Buildable kernel-5.14.0-570.26.1.el9_6
commit-author Peter Xu <[email protected]>
commit 6857be5fecaebd9773ff27b6d29b6fff3b1abbce
Empty-Commit: Cherry-Pick Conflicts during history rebuild.
Will be included in final tarball splat. Ref for failed cherry-pick at:
ciq/ciq_backports/kernel-5.14.0-570.26.1.el9_6/6857be5f.failed

Patch series "mm: Support huge pfnmaps", v2.

Overview
========

This series implements huge pfnmaps support for mm in general. Huge
pfnmap allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels,
similar to what we do with dax / thp / hugetlb so far to benefit from
TLB hits. Now we extend that idea to PFN mappings, e.g. PCI MMIO bars,
which can grow as large as 8GB or even bigger.

Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. The last
patch (from Alex Williamson) will be the first user of huge pfnmap, so
as to enable the vfio-pci driver to fault in huge pfn mappings.

Implementation
==============

In reality, it's relatively simple to add such support compared to many
other types of mappings, because of PFNMAP's specialty of having no
vmemmap backing it, so most of the kernel routines on huge mappings
should simply already fail for them, like GUPs or the old-school
follow_page() (which was recently rewritten into the folio_walk* APIs
by David).

One trick here is that we're still immature on PUDs in generic paths
here and there, as DAX is so far the only user. This patchset will add
the 2nd user of it. Hugetlb can be a 3rd user if the hugetlb
unification work goes on smoothly, but that is to be discussed later.

The other trick is how to keep gup-fast working for such huge mappings
even if there's no direct sign of knowing whether it's a normal page or
an MMIO mapping. This series chose to keep the pte_special solution:
it reuses the same idea by setting a special bit on pfnmap PMDs/PUDs so
that gup-fast will be able to identify them and fail properly.
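
For illustration, a minimal sketch of the idea on the gup-fast side
(names are illustrative, not the exact mm/gup.c helpers):

/*
 * A special leaf PMD has no struct page / vmemmap behind it, so
 * gup-fast cannot pin anything and must bail out to the slow path,
 * which knows how to fail a VM_PFNMAP walk properly.
 */
static int gup_fast_pmd_leaf_sketch(pmd_t orig, unsigned long addr,
				    unsigned long end, unsigned int flags,
				    struct page **pages, int *nr)
{
	if (pmd_special(orig))
		return 0;	/* pfnmap: nothing to pin, fail fast */

	/* ... normal huge page refcounting and pages[] filling ... */
	return 1;
}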

Along the way, we'll also notice that the major pgtable pfn walker,
aka, follow_pte(), will need to retire soon, due to the fact that it
only works with ptes. A new set of simple APIs is introduced (the
follow_pfnmap* API) to do whatever follow_pte() can already do, plus
process huge pfnmaps. Half of this series is about that, and about
converting all existing pfnmap walkers to use the new API properly.
Hopefully the new API also looks better, as it avoids exposing e.g.
pgtable lock details to the callers, so it can be used in an even more
straightforward way.
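
As a hedged usage sketch of that new API (field names follow the
series; treat the exact struct layout as an assumption), with the mmap
lock held for read:

static int read_mapped_pfn(struct vm_area_struct *vma, unsigned long addr,
			   unsigned long *pfn)
{
	struct follow_pfnmap_args args = { .vma = vma, .address = addr };
	int ret;

	ret = follow_pfnmap_start(&args);	/* takes the pgtable lock */
	if (ret)
		return ret;	/* no leaf mapped at this address */
	*pfn = args.pfn;	/* also available: pgprot, writable, special */
	follow_pfnmap_end(&args);	/* drops the pgtable lock */
	return 0;
}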

Here, three more options will be introduced and involved in huge
pfnmap:

- ARCH_SUPPORTS_HUGE_PFNMAP

  Arch developers will need to select this option when huge pfnmap is
  supported in the arch's Kconfig. After this patchset is applied, both
  x86_64 and arm64 will start to enable it by default (see the Kconfig
  select sketch after this list).

- ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP

  These options are for driver developers to identify whether the
  current arch / config supports huge pfnmaps, and to decide whether
  they can use the huge pfnmap APIs to inject them. One can refer to
  the last vfio-pci patch from Alex for how to use them properly in a
  device driver (a condensed handler sketch follows the debug output
  below).
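
For the arch side, opting in is a one-line select from the arch's own
Kconfig; as a sketch (the exact condition is per-arch; this mirrors
what the x86_64 patch does, so treat it as an assumption here):

config X86
	...
	select ARCH_SUPPORTS_HUGE_PFNMAP	if X86_64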

So after the whole set is applied, and if one enables some dynamic
debug lines in the vfio-pci core files, we should observe things like:

vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100

In this specific case, it says that vfio-pci faults in PMDs properly
for a few BAR0 offsets (order 9 means 2^9 base pages, i.e. a 2M PMD
with 4K pages).
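
A condensed sketch of the shape of such a handler (modeled on the
vfio-pci patch; my_bar_pfn() is hypothetical, and real code needs
range checks and locking):

static vm_fault_t my_mmap_huge_fault(struct vm_fault *vmf, unsigned int order)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long pfn = my_bar_pfn(vma, vmf->address);

	switch (order) {
	case 0:
		return vmf_insert_pfn(vma, vmf->address, pfn);
#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
	case PMD_ORDER:
		/* write permission comes from vma->vm_page_prot here */
		return vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn, PFN_DEV),
					  false);
#endif
	default:
		/* let the core retry the fault with a smaller order */
		return VM_FAULT_FALLBACK;
	}
}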

Patch Layout
============

Patch 1:    Introduce the new options mentioned above for huge PFNMAPs
Patch 2:    A tiny cleanup
Patch 3-8:  Preparation patches for huge pfnmap (including introducing
            the special bit for pmd/pud)
Patch 9-16: Introduce the follow_pfnmap*() API, use it everywhere, and
            then drop the follow_pte() API
Patch 17:   Add huge pfnmap support for x86_64
Patch 18:   Add huge pfnmap support for arm64
Patch 19:   Add vfio-pci support for all kinds of huge pfnmaps (Alex)

TODO
====

More architectures / More page sizes
------------------------------------

Currently only x86_64 (2M+1G) and arm64 (2M) are supported. There seem
to be plans to support arm64 1G later on top of this series [2].

Any arch will need to first support THP / THP_1G, then provide a
special bit in pmds/puds to support huge pfnmaps.
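
As a sketch, the per-arch part is mostly wiring an existing software
bit into the new helpers; modeled on the x86_64 patch (treat the
details as illustrative):

static inline bool pmd_special(pmd_t pmd)
{
	return pmd_flags(pmd) & _PAGE_SPECIAL;
}

static inline pmd_t pmd_mkspecial(pmd_t pmd)
{
	return pmd_set_flags(pmd, _PAGE_SPECIAL);
}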

remap_pfn_range() support
-------------------------

Currently, remap_pfn_range() still only maps PTEs. With the new
options, remap_pfn_range() can logically start to inject either PMDs
or PUDs when the alignment requirements match on the VAs.

When the support is there, it should be able to silently benefit all
drivers that use remap_pfn_range() in their mmap() handlers, with a
better TLB hit rate and overall faster MMIO accesses, similar to what
the processor already gains from hugepages.
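
The alignment rule above can be pictured with a hypothetical helper (a
sketch, not part of this series):

/* A PMD-level mapping needs both the VA and the PA aligned to
 * PMD_SIZE, with at least one full PMD worth of range left. */
static bool can_map_pmd_sketch(unsigned long vaddr, unsigned long paddr,
			       unsigned long size)
{
	return IS_ALIGNED(vaddr | paddr, PMD_SIZE) && size >= PMD_SIZE;
}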

More driver support
-------------------

VFIO is so far the only consumer of huge pfnmaps after this series is
applied. Besides the generic remap_pfn_range() optimization above,
device drivers can also try to optimize their mmap() for better VA
alignment at either PMD or PUD sizes. This may, if I understand
correctly, normally require userspace changes, as the driver doesn't
normally decide the VA used to map a bar. But I don't claim to know
all the drivers, so I may not see the full picture.

Credits all go to Alex for helping test the GPU/NIC use cases above.

[0] https://lore.kernel.org/r/[email protected]
[1] https://lore.kernel.org/r/[email protected]
[2] https://lore.kernel.org/r/[email protected]


This patch (of 19):

This patch introduces the option to carry the special pte bit into
pmds/puds. Archs can start to define pmd_special / pud_special when
supported, by selecting the new option. Per-arch support will be added
later.

Before that, create fallbacks for these helpers so that they are always
available.
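
With the fallbacks in place (see the mm.h hunk below), generic code can
test the bit unconditionally; a hypothetical caller, as a sketch:

	/*
	 * On archs without ARCH_SUPPORTS_PMD_PFNMAP this folds to false
	 * at compile time and the branch disappears.
	 */
	if (pmd_special(pmdval))
		return -EFAULT;	/* pfnmap leaf: no struct page behind it */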

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 6857be5fecaebd9773ff27b6d29b6fff3b1abbce)
Signed-off-by: Jonathan Maple <[email protected]>

# Conflicts:
#	mm/Kconfig
diff --cc mm/Kconfig
index a91823e31f45,1aa282e35dc7..000000000000
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@@ -898,6 -870,25 +898,28 @@@ config READ_ONLY_THP_FOR_F
  endif # TRANSPARENT_HUGEPAGE
  
  #
++<<<<<<< HEAD
++=======
+ # The architecture supports pgtable leaves that is larger than PAGE_SIZE
+ #
+ config PGTABLE_HAS_HUGE_LEAVES
+ 	def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
+ 
+ # TODO: Allow to be enabled without THP
+ config ARCH_SUPPORTS_HUGE_PFNMAP
+ 	def_bool n
+ 	depends on TRANSPARENT_HUGEPAGE
+ 
+ config ARCH_SUPPORTS_PMD_PFNMAP
+ 	def_bool y
+ 	depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE
+ 
+ config ARCH_SUPPORTS_PUD_PFNMAP
+ 	def_bool y
+ 	depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+ 
+ #
++>>>>>>> 6857be5fecae (mm: introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud)
  # UP and nommu archs use km based percpu allocator
  #
  config NEED_PER_CPU_KM

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 196c481ec160..7b6f347d05b9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2730,6 +2730,30 @@ static inline pte_t pte_mkspecial(pte_t pte)
 }
 #endif
 
+#ifndef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
+static inline bool pmd_special(pmd_t pmd)
+{
+	return false;
+}
+
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+	return pmd;
+}
+#endif	/* CONFIG_ARCH_SUPPORTS_PMD_PFNMAP */
+
+#ifndef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
+static inline bool pud_special(pud_t pud)
+{
+	return false;
+}
+
+static inline pud_t pud_mkspecial(pud_t pud)
+{
+	return pud;
+}
+#endif	/* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */
+
 #ifndef CONFIG_ARCH_HAS_PTE_DEVMAP
 static inline int pte_devmap(pte_t pte)
 {

* Unmerged path mm/Kconfig
