Commit cec3aae

vfio/pci: Align huge faults to order
jira LE-3557
Rebuild_History Non-Buildable kernel-5.14.0-570.26.1.el9_6
commit-author Alex Williamson <[email protected]>
commit c1d9dac
Empty-Commit: Cherry-Pick Conflicts during history rebuild. Will be included in final tarball splat. Ref for failed cherry-pick at: ciq/ciq_backports/kernel-5.14.0-570.26.1.el9_6/c1d9dac0.failed

The vfio-pci huge_fault handler doesn't make any attempt to insert a mapping containing the faulting address, it only inserts mappings if the faulting address and resulting pfn are aligned. This works in a lot of cases, particularly in conjunction with QEMU where DMA mappings linearly fault the mmap. However, there are configurations where we don't get that linear faulting and pages are faulted on-demand.

The scenario reported in the bug below is such a case, where the physical address width of the CPU is greater than that of the IOMMU, resulting in a VM where guest firmware has mapped device MMIO beyond the address width of the IOMMU. In this configuration, the MMIO is faulted on demand and tracing indicates that occasionally the faults generate a VM_FAULT_OOM. Given the use case, this results in a "error: kvm run failed Bad address", killing the VM.

The host is not under memory pressure in this test, therefore it's suspected that VM_FAULT_OOM is actually the result of a NULL return from __pte_offset_map_lock() in the get_locked_pte() path from insert_pfn(). This suggests a potential race inserting a pte concurrent to a pmd, and maybe indicates some deficiency in the mm layer properly handling such a case.

Nevertheless, Peter noted the inconsistency of vfio-pci's huge_fault handler where our mapping granularity depends on the alignment of the faulting address relative to the order rather than aligning the faulting address to the order to more consistently insert huge mappings. This change not only uses the page tables more consistently and efficiently, but as any fault to an aligned page results in the same mapping, the race condition suspected in the VM_FAULT_OOM is avoided.

Reported-by: Adolfo <[email protected]>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220057
Fixes: 09dfc8a ("vfio/pci: Fallback huge faults for unaligned pfn")
Cc: [email protected]
Tested-by: Adolfo <[email protected]>
Co-developed-by: Peter Xu <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alex Williamson <[email protected]>
(cherry picked from commit c1d9dac)
Signed-off-by: Jonathan Maple <[email protected]>

# Conflicts:
#	drivers/vfio/pci/vfio_pci_core.c
1 parent d844479 commit cec3aae
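
The alignment change described in the message is visible in the incoming side of the conflict hunk recorded in the .failed file below. A minimal sketch of that logic, reconstructed from the hunk, is shown here; the function name vfio_pci_mmap_huge_fault() and the trailing insertion step follow mainline and are assumptions, not the RHEL 9.6 code, and vma_to_pfn() is the helper named in the hunk:

/*
 * Sketch of the aligned-fault computation from upstream commit c1d9dac
 * ("vfio/pci: Align huge faults to order"), reconstructed from the
 * incoming side of the conflict below; not the RHEL backport.
 */
static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
					   unsigned int order)
{
	struct vm_area_struct *vma = vmf->vma;
	/* Round the faulting address down to the start of the order-sized page. */
	unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1);
	unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
	unsigned long pfn = vma_to_pfn(vma) + pgoff;

	/*
	 * Fall back to a smaller order only when the aligned range would
	 * spill outside the vma or the pfn itself is not order-aligned.
	 */
	if (order && (addr < vma->vm_start ||
		      addr + (PAGE_SIZE << order) > vma->vm_end ||
		      pfn & ((1 << order) - 1)))
		return VM_FAULT_FALLBACK;

	/*
	 * Placeholder: the real handler then checks that device memory is
	 * enabled and inserts a pfn mapping of this order, as shown in the
	 * conflict hunk below.
	 */
	return VM_FAULT_NOPAGE;
}

Because every fault inside an aligned huge page now resolves to the same mapping at addr, rather than a mapping that depends on where within the page the fault landed, the suspected pte-versus-pmd race that produced VM_FAULT_OOM is avoided.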

File tree

ciq/ciq_backports/kernel-5.14.0-570.26.1.el9_6/c1d9dac0.failed

1 file changed: +157 -0 lines changed

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
vfio/pci: Align huge faults to order

jira LE-3557
Rebuild_History Non-Buildable kernel-5.14.0-570.26.1.el9_6
commit-author Alex Williamson <[email protected]>
commit c1d9dac0db168198b6f63f460665256dedad9b6e
Empty-Commit: Cherry-Pick Conflicts during history rebuild.
Will be included in final tarball splat. Ref for failed cherry-pick at:
ciq/ciq_backports/kernel-5.14.0-570.26.1.el9_6/c1d9dac0.failed

The vfio-pci huge_fault handler doesn't make any attempt to insert a
mapping containing the faulting address, it only inserts mappings if the
faulting address and resulting pfn are aligned. This works in a lot of
cases, particularly in conjunction with QEMU where DMA mappings linearly
fault the mmap. However, there are configurations where we don't get
that linear faulting and pages are faulted on-demand.

The scenario reported in the bug below is such a case, where the physical
address width of the CPU is greater than that of the IOMMU, resulting in a
VM where guest firmware has mapped device MMIO beyond the address width of
the IOMMU. In this configuration, the MMIO is faulted on demand and
tracing indicates that occasionally the faults generate a VM_FAULT_OOM.
Given the use case, this results in a "error: kvm run failed Bad address",
killing the VM.

The host is not under memory pressure in this test, therefore it's
suspected that VM_FAULT_OOM is actually the result of a NULL return from
__pte_offset_map_lock() in the get_locked_pte() path from insert_pfn().
This suggests a potential race inserting a pte concurrent to a pmd, and
maybe indicates some deficiency in the mm layer properly handling such a
case.

Nevertheless, Peter noted the inconsistency of vfio-pci's huge_fault
handler where our mapping granularity depends on the alignment of the
faulting address relative to the order rather than aligning the faulting
address to the order to more consistently insert huge mappings. This
change not only uses the page tables more consistently and efficiently, but
as any fault to an aligned page results in the same mapping, the race
condition suspected in the VM_FAULT_OOM is avoided.

Reported-by: Adolfo <[email protected]>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220057
Fixes: 09dfc8a5f2ce ("vfio/pci: Fallback huge faults for unaligned pfn")

Tested-by: Adolfo <[email protected]>
Co-developed-by: Peter Xu <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alex Williamson <[email protected]>
(cherry picked from commit c1d9dac0db168198b6f63f460665256dedad9b6e)
Signed-off-by: Jonathan Maple <[email protected]>

# Conflicts:
#	drivers/vfio/pci/vfio_pci_core.c
diff --cc drivers/vfio/pci/vfio_pci_core.c
index ffda816e0119,6328c3a05bcd..000000000000
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@@ -1770,49 -1646,59 +1770,63 @@@ static vm_fault_t vfio_pci_mmap_fault(s
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_core_device *vdev = vma->vm_private_data;
++<<<<<<< HEAD
+ 	struct vfio_pci_mmap_vma *mmap_vma;
+ 	vm_fault_t ret = VM_FAULT_NOPAGE;
++=======
+ 	unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1);
+ 	unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
+ 	unsigned long pfn = vma_to_pfn(vma) + pgoff;
+ 	vm_fault_t ret = VM_FAULT_SIGBUS;
+
+ 	if (order && (addr < vma->vm_start ||
+ 		      addr + (PAGE_SIZE << order) > vma->vm_end ||
+ 		      pfn & ((1 << order) - 1))) {
+ 		ret = VM_FAULT_FALLBACK;
+ 		goto out;
+ 	}
++>>>>>>> c1d9dac0db16 (vfio/pci: Align huge faults to order)

+ 	mutex_lock(&vdev->vma_lock);
	down_read(&vdev->memory_lock);

- 	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
- 		goto out_unlock;
+ 	/*
+ 	 * Memory region cannot be accessed if the low power feature is engaged
+ 	 * or memory access is disabled.
+ 	 */
+ 	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev)) {
+ 		ret = VM_FAULT_SIGBUS;
+ 		goto up_out;
+ 	}

- 	switch (order) {
- 	case 0:
- 		ret = vmf_insert_pfn(vma, vmf->address, pfn);
- 		break;
-#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
- 	case PMD_ORDER:
- 		ret = vmf_insert_pfn_pmd(vmf,
- 					 __pfn_to_pfn_t(pfn, PFN_DEV), false);
- 		break;
-#endif
-#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
- 	case PUD_ORDER:
- 		ret = vmf_insert_pfn_pud(vmf,
- 					 __pfn_to_pfn_t(pfn, PFN_DEV), false);
- 		break;
-#endif
- 	default:
- 		ret = VM_FAULT_FALLBACK;
+ 	/*
+ 	 * We populate the whole vma on fault, so we need to test whether
+ 	 * the vma has already been mapped, such as for concurrent faults
+ 	 * to the same vma. io_remap_pfn_range() will trigger a BUG_ON if
+ 	 * we ask it to fill the same range again.
+ 	 */
+ 	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+ 		if (mmap_vma->vma == vma)
+ 			goto up_out;
	}

-out_unlock:
- 	up_read(&vdev->memory_lock);
-out:
- 	dev_dbg_ratelimited(&vdev->pdev->dev,
- 			    "%s(,order = %d) BAR %ld page offset 0x%lx: 0x%x\n",
- 			    __func__, order,
- 			    vma->vm_pgoff >>
- 				(VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT),
- 			    pgoff, (unsigned int)ret);
+ 	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+ 			       vma->vm_end - vma->vm_start,
+ 			       vma->vm_page_prot)) {
+ 		ret = VM_FAULT_SIGBUS;
+ 		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+ 		goto up_out;
+ 	}

- 	return ret;
-}
+ 	if (__vfio_pci_add_vma(vdev, vma)) {
+ 		ret = VM_FAULT_OOM;
+ 		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+ 	}

-static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf)
-{
- 	return vfio_pci_mmap_huge_fault(vmf, 0);
+up_out:
+ 	up_read(&vdev->memory_lock);
+ 	mutex_unlock(&vdev->vma_lock);
+ 	return ret;
}

static const struct vm_operations_struct vfio_pci_mmap_ops = {
* Unmerged path drivers/vfio/pci/vfio_pci_core.c
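
For context on why the hunk does not apply cleanly: the "-" lines above are the mainline handler that commit c1d9dac modifies, which inserts a single mapping sized by the fault order, while the "+" HEAD lines are the RHEL 9.6 handler, which still populates the entire vma with io_remap_pfn_range() under vma_lock and tracks it on vdev->vma_list. A hedged sketch of the mainline order dispatch, assembled from the removed lines above, follows; the wrapper name vfio_pci_insert_order() is hypothetical and this is not a proposed resolution of the backport:

/*
 * Sketch assembled from the '-' (mainline) lines of the conflict above:
 * insert one mapping of the faulted order.  pfn and vmf->address are
 * assumed to come from the aligned computation shown earlier; PMD_ORDER
 * and PUD_ORDER are the standard mm constants.
 */
static vm_fault_t vfio_pci_insert_order(struct vm_fault *vmf,
					unsigned long pfn, unsigned int order)
{
	struct vm_area_struct *vma = vmf->vma;

	switch (order) {
	case 0:
		/* Base-page PTE insertion. */
		return vmf_insert_pfn(vma, vmf->address, pfn);
#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
	case PMD_ORDER:
		/* One huge PMD entry (typically 2M on x86-64) for the aligned range. */
		return vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn, PFN_DEV), false);
#endif
#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
	case PUD_ORDER:
		/* One huge PUD entry (typically 1G on x86-64). */
		return vmf_insert_pfn_pud(vmf, __pfn_to_pfn_t(pfn, PFN_DEV), false);
#endif
	default:
		/* Unsupported order: let the core mm retry at a smaller size. */
		return VM_FAULT_FALLBACK;
	}
}

A clean backport would have to reconcile this per-order insertion with the vma_list and io_remap_pfn_range() scheme on the HEAD side, which is exactly the unresolved conflict this Empty-Commit records in the .failed file.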
