==============================
Running nested guests with KVM
==============================

Nested virtualization is the ability to run a guest inside another
guest (the guest hypervisor can be KVM-based or a different
hypervisor).  The straightforward example is a KVM guest that in turn
runs on a KVM guest (the rest of this document is built on this
example)::

      .----------------.  .----------------.
      |                |  |                |
      |      L2        |  |      L2        |
      | (Nested Guest) |  | (Nested Guest) |
      |                |  |                |
      .----------------'--'----------------.
      |                                    |
      |       L1 (Guest Hypervisor)        |
      |          KVM (/dev/kvm)            |
      |                                    |
      .------------------------------------------------------.
      |                 L0 (Host Hypervisor)                  |
      |                    KVM (/dev/kvm)                     |
      .------------------------------------------------------.
      |       Hardware (with virtualization extensions)      |
      '------------------------------------------------------'

Terminology:

- L0 – level-0; the bare metal host, running KVM.

- L1 – level-1 guest; a VM running on L0; also called the "guest
  hypervisor", as it itself is capable of running KVM.

- L2 – level-2 guest; a VM running on L1; this is the "nested guest".

.. note:: The above diagram is modelled after the x86 architecture;
          s390x, ppc64 and other architectures are likely to have
          a different design for nesting.

          For example, s390x always has an LPAR (LogicalPARtition)
          hypervisor running on bare metal, adding another layer and
          resulting in at least four levels in a nested setup — L0
          (bare metal, running the LPAR hypervisor), L1 (host
          hypervisor), L2 (guest hypervisor), L3 (nested guest).

          This document will stick with the three-level terminology
          (L0, L1, and L2) for all architectures; and will largely
          focus on x86.


Use Cases
---------

There are several scenarios where nested KVM can be useful, to name a
few:

- As a developer, you want to test your software on different operating
  systems (OSes).  Instead of renting multiple VMs from a Cloud
  Provider, using nested KVM lets you rent a large enough "guest
  hypervisor" (level-1 guest).  This in turn allows you to create
  multiple nested guests (level-2 guests), running different OSes, on
  which you can develop and test your software.

- Live migration of "guest hypervisors" and their nested guests, for
  load balancing, disaster recovery, etc.

- VM image creation tools (e.g. ``virt-install``) often run their own
  VM, and users expect these to work inside a VM.

- Some OSes use virtualization internally for security (e.g. to let
  applications run safely in isolation).


Enabling "nested" (x86)
-----------------------

From Linux kernel v4.19 onwards, the ``nested`` KVM parameter is enabled
by default for Intel and AMD.  (Though your Linux distribution might
override this default.)

In case you are running a Linux kernel older than v4.19, to enable
nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``.  To
persist this setting across reboots, you can add it in a config file, as
shown below:

1. On the bare metal host (L0), list the kernel modules and ensure that
   the KVM modules are loaded::

      $ lsmod | grep -i kvm
      kvm_intel             133627  0
      kvm                   435079  1 kvm_intel

2. Show information for the ``kvm_intel`` module::

      $ modinfo kvm_intel | grep -i nested
      parm:           nested:bool

3. For the nested KVM configuration to persist across reboots, place the
   below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it
   doesn't exist)::

      $ cat /etc/modprobe.d/kvm_intel.conf
      options kvm-intel nested=y

4. Unload and re-load the KVM Intel module::

      $ sudo rmmod kvm-intel
      $ sudo modprobe kvm-intel

5. Verify if the ``nested`` parameter for KVM is enabled::

      $ cat /sys/module/kvm_intel/parameters/nested
      Y

For AMD hosts, the process is the same as above, except that the module
name is ``kvm-amd``.
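
On such an AMD host, the equivalent persistent config and runtime check
might look like the below (a sketch; the config file name mirrors the
Intel example above, and ``kvm_amd`` exposes ``nested`` as an integer,
so it reads back as ``1`` rather than ``Y``)::

      $ cat /etc/modprobe.d/kvm_amd.conf
      options kvm-amd nested=1

      $ cat /sys/module/kvm_amd/parameters/nested
      1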


Additional nested-related kernel parameters (x86)
-------------------------------------------------

If your hardware is sufficiently advanced (an Intel Haswell processor
or newer, which has newer hardware virt extensions), the following
additional features will also be enabled by default on your bare metal
host (L0): "Shadow VMCS" (Virtual Machine Control Structure), APIC
virtualization, and EPT (Extended Page Tables).  Parameters for Intel
hosts::

      $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
      Y

      $ cat /sys/module/kvm_intel/parameters/enable_apicv
      Y

      $ cat /sys/module/kvm_intel/parameters/ept
      Y

.. note:: If you suspect your L2 (i.e. nested guest) is running slower,
          ensure the above are enabled (particularly
          ``enable_shadow_vmcs`` and ``ept``).

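To check all of these parameters in one go, something like the below
can help (a convenience sketch; the ``Y`` outputs shown assume the
defaults discussed above)::

      $ grep -H . /sys/module/kvm_intel/parameters/{nested,enable_shadow_vmcs,enable_apicv,ept}
      /sys/module/kvm_intel/parameters/nested:Y
      /sys/module/kvm_intel/parameters/enable_shadow_vmcs:Y
      /sys/module/kvm_intel/parameters/enable_apicv:Y
      /sys/module/kvm_intel/parameters/ept:Y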


Starting a nested guest (x86)
-----------------------------

Once your bare metal host (L0) is configured for nesting, you should be
able to start an L1 guest with::

      $ qemu-kvm -cpu host [...]

The above will pass through the host CPU's capabilities as-is to the
guest; or, for better live migration compatibility, use a named CPU
model supported by QEMU, e.g.::

      $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on

then the guest hypervisor will subsequently be capable of running a
nested guest with accelerated KVM.
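
To sanity-check from inside the L1 guest that the virtualization
extensions were actually passed through, look for the ``vmx`` CPU flag
(``svm`` on AMD hosts); the count below assumes a 4-vCPU guest::

      $ grep -cw vmx /proc/cpuinfo
      4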


Enabling "nested" (s390x)
-------------------------

1. On the host hypervisor (L0), enable the ``nested`` parameter on
   s390x::

      $ rmmod kvm
      $ modprobe kvm nested=1

.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
          with the ``nested`` parameter — i.e. to be able to enable
          ``nested``, the ``hpage`` parameter *must* be disabled.

2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
   feature — with QEMU, this can be done by using "host passthrough"
   (via the command-line ``-cpu host``).

3. Now the KVM module can be loaded in the L1 (guest hypervisor)::

      $ modprobe kvm

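To verify on L0 that nesting is active, you can read the parameter back
(on s390x it lives under the generic ``kvm`` module, and is exposed as
an integer)::

      $ cat /sys/module/kvm/parameters/nested
      1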


Live migration with nested KVM
------------------------------

Migrating an L1 guest, with a *live* nested guest in it, to another
bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
Intel x86 systems, and even on older versions for s390x.

On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
should no longer be migrated or saved (refer to QEMU documentation on
"savevm"/"loadvm") until the L2 guest shuts down.  Attempting to migrate
or save-and-load an L1 guest while an L2 guest is running will result in
undefined behavior.  You might see a ``kernel BUG!`` entry in ``dmesg``, a
kernel 'oops', or an outright kernel panic.  Such a migrated or loaded L1
guest can no longer be considered stable or secure, and must be restarted.
Migrating an L1 guest merely configured to support nesting, while not
actually running L2 guests, is expected to function normally even on AMD
systems; the restriction applies only while an L2 guest is running.

Migrating an L2 guest is always expected to succeed, so all the following
scenarios should work even on AMD systems:

- Migrating a nested guest (L2) to another L1 guest on the *same* bare
  metal host.

- Migrating a nested guest (L2) to another L1 guest on a *different*
  bare metal host.

- Migrating a nested guest (L2) to a bare metal host.

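For instance, a live migration of the nested guest via libvirt might
look like the below (a sketch; the domain name ``l2guest`` and the
destination URI are hypothetical)::

      $ virsh migrate --live l2guest qemu+ssh://other-l1.example.com/system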


Reporting bugs from nested setups
---------------------------------

Debugging "nested" problems can involve sifting through log files across
L0, L1 and L2; this can result in tedious back-and-forth between the bug
reporter and the bug fixer.

- Mention that you are in a "nested" setup.  If you are running any kind
  of "nesting" at all, say so.  Unfortunately, this needs to be called
  out because when reporting bugs, people tend to forget to even
  *mention* that they're using nested virtualization.

- Ensure you are actually running KVM on KVM.  Sometimes people do not
  have KVM enabled for their guest hypervisor (L1), which results in
  them running with pure emulation (what QEMU calls "TCG") while they
  think they're running nested KVM; this confuses "nested virt" (which
  could also mean QEMU on KVM) with "nested KVM" (KVM on KVM).

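A quick way to confirm from inside L1 that hardware acceleration is
really available (i.e. that QEMU is not silently falling back to TCG)
is to check for the KVM device node; libvirt's ``virt-host-validate``
tool performs a similar check, among others::

      $ ls /dev/kvm
      /dev/kvm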

Information to collect (generic)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is not an exhaustive list, but a very good starting point:

   - Kernel, libvirt, and QEMU version from L0

   - Kernel, libvirt, and QEMU version from L1

   - QEMU command-line of L1 -- when using libvirt, you'll find it here:
     ``/var/log/libvirt/qemu/instance.log``

   - QEMU command-line of L2 -- as above, when using libvirt, get the
     complete libvirt-generated QEMU command-line

   - ``cat /proc/cpuinfo`` from L0

   - ``cat /proc/cpuinfo`` from L1

   - ``lscpu`` from L0

   - ``lscpu`` from L1

   - Full ``dmesg`` output from L0

   - Full ``dmesg`` output from L1

x86-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both the below commands, ``x86info`` and ``dmidecode``, should be
available on most Linux distributions with the same name:

   - Output of: ``x86info -a`` from L0

   - Output of: ``x86info -a`` from L1

   - Output of: ``dmidecode`` from L0

   - Output of: ``dmidecode`` from L1

s390x-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Along with the earlier mentioned generic details, the below is
also recommended:

   - ``/proc/sysinfo`` from L1; this will also include the info from L0