Commit 8c16ec9
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:

 "Bugfixes, mostly for ARM and AMD, and more documentation.

  Slightly bigger than usual because I couldn't send out what was
  pending for rc4, but there is nothing worrisome going on. I have more
  fixes pending for guest debugging support (gdbstub) but I will send
  them next week"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
  KVM: X86: Declare KVM_CAP_SET_GUEST_DEBUG properly
  KVM: selftests: Fix build for evmcs.h
  kvm: x86: Use KVM CPU capabilities to determine CR4 reserved bits
  KVM: VMX: Explicitly clear RFLAGS.CF and RFLAGS.ZF in VM-Exit RSB path
  docs/virt/kvm: Document configuring and running nested guests
  KVM: s390: Remove false WARN_ON_ONCE for the PQAP instruction
  kvm: ioapic: Restrict lazy EOI update to edge-triggered interrupts
  KVM: x86: Fixes posted interrupt check for IRQs delivery modes
  KVM: SVM: fill in kvm_run->debug.arch.dr[67]
  KVM: nVMX: Replace a BUG_ON(1) with BUG() to squash clang warning
  KVM: arm64: Fix 32bit PC wrap-around
  KVM: arm64: vgic-v4: Initialize GICv4.1 even in the absence of a virtual ITS
  KVM: arm64: Save/restore sp_el0 as part of __guest_enter
  KVM: arm64: Delete duplicated label in invalid_vector
  KVM: arm64: vgic-its: Fix memory leak on the error path of vgic_add_lpi()
  KVM: arm64: vgic-v3: Retire all pending LPIs on vcpu destroy
  KVM: arm: vgic-v2: Only use the virtual state when userspace accesses pending bits
  KVM: arm: vgic: Only use the virtual state when userspace accesses enable bits
  KVM: arm: vgic: Synchronize the whole guest on GIC{D,R}_I{S,C}ACTIVER read
  KVM: arm64: PSCI: Forbid 64bit functions for 32bit guests
  ...
2 parents de268cc + 2673cb6 commit 8c16ec9

25 files changed: +628 −125 lines changed

Documentation/virt/kvm/index.rst

Lines changed: 2 additions & 0 deletions
@@ -28,3 +28,5 @@ KVM
    arm/index

    devices/index
+
+   running-nested-guests
Lines changed: 276 additions & 0 deletions
@@ -0,0 +1,276 @@

==============================
Running nested guests with KVM
==============================

A "nested guest" is a guest that runs inside another guest (the
enclosing hypervisor can be KVM-based or a different one).  The
straightforward example is a KVM guest that in turn runs on a KVM
guest (the rest of this document is built on this example)::

               .----------------.  .----------------.
               |                |  |                |
               |      L2        |  |      L2        |
               | (Nested Guest) |  | (Nested Guest) |
               |                |  |                |
               |----------------'--'----------------|
               |                                    |
               |       L1 (Guest Hypervisor)        |
               |          KVM (/dev/kvm)            |
               |                                    |
      .------------------------------------------------------.
      |                 L0 (Host Hypervisor)                 |
      |                    KVM (/dev/kvm)                    |
      |------------------------------------------------------|
      |      Hardware (with virtualization extensions)       |
      '------------------------------------------------------'

Terminology:

- L0 – level-0; the bare metal host, running KVM

- L1 – level-1 guest; a VM running on L0; also called the "guest
  hypervisor", as it itself is capable of running KVM.

- L2 – level-2 guest; a VM running on L1, this is the "nested guest"

.. note:: The above diagram is modelled after the x86 architecture;
          s390x, ppc64 and other architectures are likely to have
          a different design for nesting.

          For example, s390x always has an LPAR (LogicalPARtition)
          hypervisor running on bare metal, adding another layer and
          resulting in at least four levels in a nested setup: L0 (bare
          metal, running the LPAR hypervisor), L1 (host hypervisor), L2
          (guest hypervisor), L3 (nested guest).

          This document will stick with the three-level terminology (L0,
          L1, and L2) for all architectures, and will largely focus on
          x86.

Use Cases
---------

There are several scenarios where nested KVM can be useful, to name a
few:

- As a developer, you want to test your software on different operating
  systems (OSes).  Instead of renting multiple VMs from a Cloud
  Provider, using nested KVM lets you rent a large enough "guest
  hypervisor" (level-1 guest).  This in turn allows you to create
  multiple nested guests (level-2 guests), running different OSes, on
  which you can develop and test your software.

- Live migration of "guest hypervisors" and their nested guests, for
  load balancing, disaster recovery, etc.

- VM image creation tools (e.g. ``virt-install``, etc) often run
  their own VM, and users expect these to work inside a VM.

- Some OSes use virtualization internally for security (e.g. to let
  applications run safely in isolation).

Enabling "nested" (x86)
-----------------------

From Linux kernel v4.19 onwards, the ``nested`` KVM parameter is enabled
by default for Intel and AMD.  (Though your Linux distribution might
override this default.)

In case you are running a Linux kernel older than v4.19, to enable
nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``.  To
persist this setting across reboots, you can add it in a config file, as
shown below:

1. On the bare metal host (L0), list the kernel modules and ensure that
   the KVM modules are loaded::

     $ lsmod | grep -i kvm
     kvm_intel             133627  0
     kvm                   435079  1 kvm_intel

2. Show information for the ``kvm_intel`` module::

     $ modinfo kvm_intel | grep -i nested
     parm:           nested:bool

3. For the nested KVM configuration to persist across reboots, place the
   below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it
   doesn't exist)::

     $ cat /etc/modprobe.d/kvm_intel.conf
     options kvm-intel nested=y

4. Unload and re-load the KVM Intel module::

     $ sudo rmmod kvm-intel
     $ sudo modprobe kvm-intel

5. Verify that the ``nested`` parameter for KVM is enabled::

     $ cat /sys/module/kvm_intel/parameters/nested
     Y

For AMD hosts, the process is the same as above, except that the module
name is ``kvm-amd``.

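As an illustration, the five steps above can be combined into one
sequence. This is a sketch only, adapted here for an AMD host (the
``kvm-amd``/``kvm_amd`` names mirror the Intel flow described above;
run with root privileges):

```shell
# Illustrative sketch of steps 1-5 for an AMD host; requires root and
# an AMD CPU with SVM. Substitute kvm-intel/kvm_intel on Intel hosts.
lsmod | grep -i kvm                          # 1. KVM modules loaded?
modinfo kvm_amd | grep -i nested             # 2. 'nested' parameter exists?
echo "options kvm-amd nested=1" | sudo tee /etc/modprobe.d/kvm_amd.conf
sudo rmmod kvm-amd                           # 4. unload ...
sudo modprobe kvm-amd                        #    ... and re-load
cat /sys/module/kvm_amd/parameters/nested    # 5. expect 'Y' or '1'
```

Note that unloading ``kvm-amd`` only succeeds while no VMs are running
on the host.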
Additional nested-related kernel parameters (x86)
-------------------------------------------------

If your hardware is sufficiently advanced (Intel Haswell processor or
higher, which has newer hardware virt extensions), the following
additional features will also be enabled by default: "Shadow VMCS
(Virtual Machine Control Structure)", APIC Virtualization on your bare
metal host (L0).  Parameters for Intel hosts::

    $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
    Y

    $ cat /sys/module/kvm_intel/parameters/enable_apicv
    Y

    $ cat /sys/module/kvm_intel/parameters/ept
    Y

.. note:: If you suspect your L2 (i.e. nested guest) is running slower,
          ensure the above are enabled (particularly
          ``enable_shadow_vmcs`` and ``ept``).

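Rather than checking each parameter individually, a small convenience
loop (my own sketch, assuming an Intel host with ``kvm_intel`` loaded)
can dump every module parameter at once:

```shell
# Print every kvm_intel module parameter and its current value.
# Assumes an Intel L0 host with the kvm_intel module loaded.
for p in /sys/module/kvm_intel/parameters/*; do
    printf '%-24s %s\n' "$(basename "$p")" "$(cat "$p")"
done
```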
Starting a nested guest (x86)
-----------------------------

Once your bare metal host (L0) is configured for nesting, you should be
able to start an L1 guest with::

    $ qemu-kvm -cpu host [...]

The above will pass through the host CPU's capabilities as-is to the
guest; or for better live migration compatibility, use a named CPU
model supported by QEMU, e.g.::

    $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on

then the guest hypervisor will subsequently be capable of running a
nested guest with accelerated KVM.

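Inside the L1 guest you can then confirm that nesting actually took
effect before setting up KVM there. A minimal check (the flag and
device-node names are the usual Linux ones, not prescribed by this
document):

```shell
# Inside L1: the CPU model chosen above should expose vmx (Intel) or
# svm (AMD) in /proc/cpuinfo, and /dev/kvm should exist once the KVM
# modules are loaded.
if grep -q -w -E 'vmx|svm' /proc/cpuinfo; then
    echo "virtualization extensions visible in L1"
fi
[ -c /dev/kvm ] && echo "/dev/kvm present: L2 can use accelerated KVM" \
                || echo "/dev/kvm missing: L2 would fall back to TCG"
```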
Enabling "nested" (s390x)
-------------------------

1. On the host hypervisor (L0), enable the ``nested`` parameter on
   s390x::

     $ rmmod kvm
     $ modprobe kvm nested=1

.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
          with the ``nested`` parameter; i.e. to be able to enable
          ``nested``, the ``hpage`` parameter *must* be disabled.

2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
   feature; with QEMU, this can be done by using "host passthrough"
   (via the command-line ``-cpu host``).

3. Now the KVM module can be loaded in the L1 (guest hypervisor)::

     $ modprobe kvm

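The L0 side of the s390x steps can be sketched as one sequence
(illustrative only; requires root on the s390x host, and assumes
``hpage`` is currently disabled):

```shell
# s390x L0: re-load the kvm module with nesting enabled.
# 'nested' and 'hpage' are mutually exclusive, so hpage must be off.
sudo rmmod kvm
sudo modprobe kvm nested=1
cat /sys/module/kvm/parameters/nested    # expect '1' or 'Y'
```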
Live migration with nested KVM
------------------------------

Migrating an L1 guest, with a *live* nested guest in it, to another
bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
Intel x86 systems, and even on older versions for s390x.

On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
should no longer be migrated or saved (refer to QEMU documentation on
"savevm"/"loadvm") until the L2 guest shuts down.  Attempting to migrate
or save-and-load an L1 guest while an L2 guest is running will result in
undefined behavior.  You might see a ``kernel BUG!`` entry in ``dmesg``, a
kernel 'oops', or an outright kernel panic.  Such a migrated or loaded L1
guest can no longer be considered stable or secure, and must be restarted.
Migrating an L1 guest merely configured to support nesting, while not
actually running L2 guests, is expected to function normally even on AMD
systems but may fail once guests are started.

Migrating an L2 guest is always expected to succeed, so all the following
scenarios should work even on AMD systems:

- Migrating a nested guest (L2) to another L1 guest on the *same* bare
  metal host.

- Migrating a nested guest (L2) to another L1 guest on a *different*
  bare metal host.

- Migrating a nested guest (L2) to a bare metal host.

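For instance, with libvirt the last scenario can be driven from inside
the L1 guest with ``virsh migrate``; the guest and host names below are
purely hypothetical:

```shell
# Hypothetical example: live-migrate the L2 guest "l2-vm" from its L1
# guest hypervisor to a bare metal host "metal.example.com" over SSH.
virsh migrate --live --persistent l2-vm \
      qemu+ssh://metal.example.com/system
```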
Reporting bugs from nested setups
---------------------------------

Debugging "nested" problems can involve sifting through log files across
L0, L1 and L2; this can result in tedious back-and-forth between the bug
reporter and the bug fixer.

- Mention that you are in a "nested" setup.  If you are running any kind
  of "nesting" at all, say so.  Unfortunately, this needs to be called
  out because when reporting bugs, people tend to forget to even
  *mention* that they're using nested virtualization.

- Ensure you are actually running KVM on KVM.  Sometimes people do not
  have KVM enabled for their guest hypervisor (L1), which results in
  them running with pure emulation (what QEMU calls "TCG") while they
  think they're running nested KVM, thus confusing "nested virt" (which
  could also mean QEMU on KVM) with "nested KVM" (KVM on KVM).

Information to collect (generic)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is not an exhaustive list, but a very good starting point:

- Kernel, libvirt, and QEMU version from L0

- Kernel, libvirt and QEMU version from L1

- QEMU command-line of L1 -- when using libvirt, you'll find it here:
  ``/var/log/libvirt/qemu/instance.log``

- QEMU command-line of L2 -- as above, when using libvirt, get the
  complete libvirt-generated QEMU command-line

- ``cat /proc/cpuinfo`` from L0

- ``cat /proc/cpuinfo`` from L1

- ``lscpu`` from L0

- ``lscpu`` from L1

- Full ``dmesg`` output from L0

- Full ``dmesg`` output from L1

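The generic items can be gathered with a small script, run once on L0
and once on L1. This is a sketch; the output filename and the
missing-tool fallbacks are my own additions, not from the document:

```shell
#!/bin/sh
# Sketch: collect the generic debug info listed above into one file.
# Run on both L0 and L1; missing tools are reported rather than fatal.
out="nested-kvm-info-$(hostname).txt"
{
    echo "== kernel ==";  uname -a
    echo "== qemu ==";    qemu-system-x86_64 --version 2>/dev/null || echo "qemu not found"
    echo "== libvirt =="; virsh version 2>/dev/null || echo "libvirt not found"
    echo "== cpuinfo =="; cat /proc/cpuinfo
    echo "== lscpu ==";   lscpu 2>/dev/null || echo "lscpu not found"
    echo "== dmesg ==";   dmesg 2>&1
} > "$out"
echo "wrote $out"
```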
x86-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both the below commands, ``x86info`` and ``dmidecode``, should be
available on most Linux distributions with the same name:

- Output of: ``x86info -a`` from L0

- Output of: ``x86info -a`` from L1

- Output of: ``dmidecode`` from L0

- Output of: ``dmidecode`` from L1

s390x-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Along with the earlier mentioned generic details, the below is
also recommended:

- ``/proc/sysinfo`` from L1; this will also include the info from L0

arch/arm64/kvm/guest.c

Lines changed: 7 additions & 0 deletions
@@ -200,6 +200,13 @@ static int set_core_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
 	}

 	memcpy((u32 *)regs + off, valp, KVM_REG_SIZE(reg->id));
+
+	if (*vcpu_cpsr(vcpu) & PSR_MODE32_BIT) {
+		int i;
+
+		for (i = 0; i < 16; i++)
+			*vcpu_reg32(vcpu, i) = (u32)*vcpu_reg32(vcpu, i);
+	}
 out:
 	return err;
 }

arch/arm64/kvm/hyp/entry.S

Lines changed: 23 additions & 0 deletions
@@ -18,6 +18,7 @@

 #define CPU_GP_REG_OFFSET(x)	(CPU_GP_REGS + x)
 #define CPU_XREG_OFFSET(x)	CPU_GP_REG_OFFSET(CPU_USER_PT_REGS + 8*x)
+#define CPU_SP_EL0_OFFSET	(CPU_XREG_OFFSET(30) + 8)

 	.text
 	.pushsection	.hyp.text, "ax"
@@ -47,6 +48,16 @@
 	ldp	x29, lr,  [\ctxt, #CPU_XREG_OFFSET(29)]
 .endm

+.macro save_sp_el0 ctxt, tmp
+	mrs	\tmp,	sp_el0
+	str	\tmp,	[\ctxt, #CPU_SP_EL0_OFFSET]
+.endm
+
+.macro restore_sp_el0 ctxt, tmp
+	ldr	\tmp,	[\ctxt, #CPU_SP_EL0_OFFSET]
+	msr	sp_el0,	\tmp
+.endm
+
 /*
  * u64 __guest_enter(struct kvm_vcpu *vcpu,
  *		     struct kvm_cpu_context *host_ctxt);
@@ -60,6 +71,9 @@ SYM_FUNC_START(__guest_enter)
 	// Store the host regs
 	save_callee_saved_regs x1

+	// Save the host's sp_el0
+	save_sp_el0 x1, x2
+
 	// Now the host state is stored if we have a pending RAS SError it must
 	// affect the host. If any asynchronous exception is pending we defer
 	// the guest entry. The DSB isn't necessary before v8.2 as any SError
@@ -83,6 +97,9 @@ alternative_else_nop_endif
 	// when this feature is enabled for kernel code.
 	ptrauth_switch_to_guest x29, x0, x1, x2

+	// Restore the guest's sp_el0
+	restore_sp_el0 x29, x0
+
 	// Restore guest regs x0-x17
 	ldp	x0, x1,   [x29, #CPU_XREG_OFFSET(0)]
 	ldp	x2, x3,   [x29, #CPU_XREG_OFFSET(2)]
@@ -130,6 +147,9 @@ SYM_INNER_LABEL(__guest_exit, SYM_L_GLOBAL)
 	// Store the guest regs x18-x29, lr
 	save_callee_saved_regs x1

+	// Store the guest's sp_el0
+	save_sp_el0 x1, x2
+
 	get_host_ctxt	x2, x3

 	// Macro ptrauth_switch_to_guest format:
@@ -139,6 +159,9 @@ SYM_INNER_LABEL(__guest_exit, SYM_L_GLOBAL)
 	// when this feature is enabled for kernel code.
 	ptrauth_switch_to_host x1, x2, x3, x4, x5

+	// Restore the host's sp_el0
+	restore_sp_el0 x2, x3
+
 	// Now restore the host regs
 	restore_callee_saved_regs x2

arch/arm64/kvm/hyp/hyp-entry.S

Lines changed: 0 additions & 1 deletion
@@ -198,7 +198,6 @@ SYM_CODE_END(__hyp_panic)
 .macro invalid_vector	label, target = __hyp_panic
 	.align	2
 SYM_CODE_START(\label)
-\label:
 	b \target
 SYM_CODE_END(\label)
 .endm
