docs/virt/kvm: Document configuring and running nested guests
This is a rewrite of this[1] Wiki page with further enhancements. The
doc also includes a section on debugging problems in nested
environments, among other improvements.

[1] https://www.linux-kvm.org/page/Nested_Guests

Signed-off-by: Kashyap Chamarthy <kchamart@redhat.com>
Message-Id: <20200505112839.30534-1-kchamart@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
parent 8be8f932e3
commit 27abe57770
@ -28,3 +28,5 @@ KVM
arm/index

devices/index

running-nested-guests
276
Documentation/virt/kvm/running-nested-guests.rst
Normal file
@ -0,0 +1,276 @@
==============================
Running nested guests with KVM
==============================

A nested guest is a guest that runs inside another guest (which can be
KVM-based or use a different hypervisor). The straightforward
example is a KVM guest that in turn runs on a KVM guest (the rest of
this document is built on this example)::

              .----------------.  .----------------.
              |                |  |                |
              |      L2        |  |      L2        |
              | (Nested Guest) |  | (Nested Guest) |
              |                |  |                |
              |----------------'--'----------------|
              |                                    |
              |       L1 (Guest Hypervisor)        |
              |          KVM (/dev/kvm)            |
              |                                    |
      .------------------------------------------------------.
      |                 L0 (Host Hypervisor)                 |
      |                    KVM (/dev/kvm)                    |
      |------------------------------------------------------|
      |      Hardware (with virtualization extensions)       |
      '------------------------------------------------------'

Terminology:

- L0 – level-0; the bare metal host, running KVM

- L1 – level-1 guest; a VM running on L0; also called the "guest
  hypervisor", because it is itself capable of running KVM.

- L2 – level-2 guest; a VM running on L1; this is the "nested guest"

.. note:: The above diagram is modelled after the x86 architecture;
          s390x, ppc64 and other architectures are likely to have
          a different design for nesting.

          For example, s390x always has an LPAR (Logical PARtition)
          hypervisor running on bare metal, adding another layer and
          resulting in at least four levels in a nested setup: L0 (bare
          metal, running the LPAR hypervisor), L1 (host hypervisor), L2
          (guest hypervisor), L3 (nested guest).

          This document will stick with the three-level terminology (L0,
          L1, and L2) for all architectures, and will largely focus on
          x86.


Use Cases
---------

There are several scenarios where nested KVM can be useful, to name a
few:

- As a developer, you want to test your software on different operating
  systems (OSes). Instead of renting multiple VMs from a Cloud
  Provider, using nested KVM lets you rent a large enough "guest
  hypervisor" (level-1 guest). This in turn allows you to create
  multiple nested guests (level-2 guests), running different OSes, on
  which you can develop and test your software.

- Live migration of "guest hypervisors" and their nested guests, for
  load balancing, disaster recovery, etc.

- VM image creation tools (e.g. ``virt-install``) often run
  their own VM, and users expect these to work inside a VM.

- Some OSes use virtualization internally for security (e.g. to let
  applications run safely in isolation).


Enabling "nested" (x86)
-----------------------

From Linux kernel v4.19 onwards, the ``nested`` KVM parameter is enabled
by default for Intel and AMD. (Though your Linux distribution might
override this default.)

If you are running a Linux kernel older than v4.19, enable nesting by
setting the ``nested`` KVM module parameter to ``Y`` or ``1``. To
persist this setting across reboots, you can add it to a config file, as
shown below:

1. On the bare metal host (L0), list the kernel modules and ensure that
   the KVM modules are loaded::

    $ lsmod | grep -i kvm
    kvm_intel             133627  0
    kvm                   435079  1 kvm_intel

2. Show information for the ``kvm_intel`` module::

    $ modinfo kvm_intel | grep -i nested
    parm:           nested:bool

3. For the nested KVM configuration to persist across reboots, place the
   below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it
   doesn't exist)::

    $ cat /etc/modprobe.d/kvm_intel.conf
    options kvm-intel nested=y

4. Unload and re-load the KVM Intel module::

    $ sudo rmmod kvm-intel
    $ sudo modprobe kvm-intel

5. Verify that the ``nested`` parameter for KVM is enabled::

    $ cat /sys/module/kvm_intel/parameters/nested
    Y

For AMD hosts, the process is the same as above, except that the module
name is ``kvm-amd``.
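The verification step above can be wrapped in a small helper that works for either vendor module. The following is only a sketch: the function name ``nested_status`` and its optional directory argument (which defaults to ``/sys/module``) are illustrative and not part of KVM. It accepts ``Y`` as well as ``1``, on the assumption that the parameter may be reported either way depending on the module and kernel version.

```shell
# Sketch: report whether nesting is enabled for whichever KVM vendor
# module is loaded. The optional first argument (hypothetical, mainly
# for testing) overrides the default /sys/module directory.
nested_status() {
    root="${1:-/sys/module}"
    for mod in kvm_intel kvm_amd; do
        param="$root/$mod/parameters/nested"
        if [ -r "$param" ]; then
            value=$(cat "$param")
            case "$value" in
                Y|y|1) echo "$mod: nested enabled" ;;
                *)     echo "$mod: nested disabled ($value)" ;;
            esac
            return 0
        fi
    done
    # Neither vendor module exposes the parameter.
    echo "no KVM vendor module loaded" >&2
    return 1
}
```

Run ``nested_status`` without arguments on L0; a non-zero exit status means that neither ``kvm_intel`` nor ``kvm_amd`` is loaded.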


Additional nested-related kernel parameters (x86)
-------------------------------------------------

If your hardware is sufficiently advanced (Intel Haswell processor or
higher, which has newer hardware virt extensions), the following
additional features will also be enabled by default on your bare metal
host (L0): "Shadow VMCS (Virtual Machine Control Structure)" and APIC
Virtualization. Parameters for Intel hosts::

    $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
    Y

    $ cat /sys/module/kvm_intel/parameters/enable_apicv
    Y

    $ cat /sys/module/kvm_intel/parameters/ept
    Y

.. note:: If you suspect your L2 (i.e. nested guest) is running slower,
          ensure the above are enabled (particularly
          ``enable_shadow_vmcs`` and ``ept``).
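To check all of these parameters in one go, a loop along the following lines may be convenient. This is a sketch only: the function name ``show_intel_nested_params`` and its optional directory argument are hypothetical; the parameter names are the ones listed above.

```shell
# Sketch: print the nested-related kvm_intel parameters in one table.
# The optional first argument (hypothetical, mainly for testing)
# overrides the default sysfs parameter directory.
show_intel_nested_params() {
    dir="${1:-/sys/module/kvm_intel/parameters}"
    for name in nested enable_shadow_vmcs enable_apicv ept; do
        if [ -r "$dir/$name" ]; then
            printf '%-20s %s\n' "$name" "$(cat "$dir/$name")"
        else
            # Module not loaded, or parameter absent on this kernel.
            printf '%-20s %s\n' "$name" "(not available)"
        fi
    done
}
```

Any line that does not show ``Y`` is worth a closer look when an L2 guest feels slow.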


Starting a nested guest (x86)
-----------------------------

Once your bare metal host (L0) is configured for nesting, you should be
able to start an L1 guest with::

    $ qemu-kvm -cpu host [...]

The above will pass through the host CPU's capabilities as-is to the
guest. Alternatively, for better live migration compatibility, use a
named CPU model supported by QEMU, e.g.::

    $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on

The guest hypervisor will then be capable of running a
nested guest with accelerated KVM.
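One way to confirm, from inside L1, that the virtualization extension actually reached the guest is to look for the ``vmx`` (Intel) or ``svm`` (AMD) CPU flag. The helper below is a sketch under that assumption; the name ``has_virt_flag`` and its optional file argument are illustrative.

```shell
# Sketch: succeed if the CPU flags list contains vmx (Intel) or svm (AMD).
# The optional first argument (hypothetical, mainly for testing)
# overrides the default /proc/cpuinfo path.
has_virt_flag() {
    cpuinfo="${1:-/proc/cpuinfo}"
    grep -E -q '^flags[[:space:]]*:.*[[:space:]](vmx|svm)([[:space:]]|$)' "$cpuinfo"
}
```

Inside L1, ``has_virt_flag && [ -c /dev/kvm ]`` succeeding suggests that the guest hypervisor can run accelerated L2 guests; if the flag is missing, revisit the ``-cpu`` option used to start L1.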


Enabling "nested" (s390x)
-------------------------

1. On the host hypervisor (L0), enable the ``nested`` parameter on
   s390x::

    $ rmmod kvm
    $ modprobe kvm nested=1

.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
          with the ``nested`` parameter, i.e. to be able to enable
          ``nested``, the ``hpage`` parameter *must* be disabled.

2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
   feature. With QEMU, this can be done by using "host passthrough"
   (via the command-line ``-cpu host``).

3. Now the KVM module can be loaded in the L1 (guest hypervisor)::

    $ modprobe kvm
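Because ``hpage`` and ``nested`` exclude each other, it can help to check both before starting L1. The sketch below assumes that both parameters are readable under ``/sys/module/kvm/parameters`` and are reported as ``0`` or ``1``; verify this on your system. The function name ``s390x_nested_ready`` and its optional directory argument are hypothetical.

```shell
# Sketch: check the s390x kvm module parameters on L0.
# Assumption: "nested" and "hpage" are exposed as 0/1 under
# /sys/module/kvm/parameters. The optional first argument
# (hypothetical, mainly for testing) overrides that directory.
s390x_nested_ready() {
    dir="${1:-/sys/module/kvm/parameters}"
    # Treat a missing parameter file as 0 (disabled).
    nested=$(cat "$dir/nested" 2>/dev/null || echo 0)
    hpage=$(cat "$dir/hpage" 2>/dev/null || echo 0)
    if [ "$hpage" != "0" ]; then
        echo "hpage=$hpage: disable hpage before enabling nested"
        return 1
    fi
    if [ "$nested" = "0" ]; then
        echo "nested=0: reload kvm with nested=1"
        return 1
    fi
    echo "nested=$nested, hpage=$hpage: ok"
}
```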


Live migration with nested KVM
------------------------------

Migrating an L1 guest, with a *live* nested guest in it, to another
bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
Intel x86 systems, and even on older versions for s390x.

On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
should no longer be migrated or saved (refer to QEMU documentation on
"savevm"/"loadvm") until the L2 guest shuts down. Attempting to migrate
or save-and-load an L1 guest while an L2 guest is running will result in
undefined behavior. You might see a ``kernel BUG!`` entry in ``dmesg``, a
kernel 'oops', or an outright kernel panic. Such a migrated or loaded L1
guest can no longer be considered stable or secure, and must be restarted.
Migrating an L1 guest merely configured to support nesting, while not
actually running L2 guests, is expected to function normally even on AMD
systems, but may fail once L2 guests are started.

Migrating an L2 guest is always expected to succeed, so all the following
scenarios should work even on AMD systems:

- Migrating a nested guest (L2) to another L1 guest on the *same* bare
  metal host.

- Migrating a nested guest (L2) to another L1 guest on a *different*
  bare metal host.

- Migrating a nested guest (L2) to a bare metal host.
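Before attempting to migrate an L1 guest with a live L2 guest on Intel, you may want to compare the running kernel against the minimum version mentioned above. The following sketch relies on ``sort -V`` (available in GNU coreutils); the function names are illustrative. Keep in mind that distribution kernels may backport features, so a version comparison is only a rough indicator.

```shell
# Sketch: succeed if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort, e.g. from GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Sketch: compare a kernel release string (default: `uname -r`) with 5.3,
# the kernel version this document gives for Intel x86.
kernel_supports_nested_migration() {
    kver="${1:-$(uname -r)}"
    # Drop any distribution suffix such as "-8-amd64".
    version_ge "${kver%%-*}" "5.3"
}
```

The same ``version_ge`` helper can be applied to the QEMU version (4.2.0 or newer) reported on both the source and the destination host.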

Reporting bugs from nested setups
---------------------------------

Debugging "nested" problems can involve sifting through log files across
L0, L1 and L2; this can result in tedious back-n-forth between the bug
reporter and the bug fixer.

- Mention that you are in a "nested" setup. If you are running any kind
  of "nesting" at all, say so. Unfortunately, this needs to be called
  out because when reporting bugs, people tend to forget to even
  *mention* that they're using nested virtualization.

- Ensure you are actually running KVM on KVM. Sometimes people do not
  have KVM enabled for their guest hypervisor (L1), which results in
  them running with pure emulation (what QEMU calls "TCG") while
  they think they're running nested KVM. This confuses "nested Virt"
  (which could also mean QEMU on KVM) with "nested KVM" (KVM on KVM).
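A quick way to tell KVM from TCG is to inspect the QEMU command line of the L2 guest for a KVM accelerator option. The sketch below looks for ``-enable-kvm``, ``-accel kvm`` and ``accel=kvm`` (as in ``-machine ...,accel=kvm``); the function name ``qemu_cmdline_uses_kvm`` is hypothetical, and a command line that sets the accelerator in some other way would not be recognised.

```shell
# Sketch: succeed if a QEMU command line (passed as one string)
# requests the KVM accelerator in one of three common spellings.
qemu_cmdline_uses_kvm() {
    case " $1 " in
        *" -enable-kvm "*|*" -accel kvm"*|*"accel=kvm"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On L1, the string can be taken from the libvirt log of the L2 guest, or from ``/proc/$pid/cmdline`` of the running QEMU process (with the NUL separators turned into spaces, for example using ``tr '\0' ' '``).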

Information to collect (generic)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is not an exhaustive list, but a very good starting point:

- Kernel, libvirt, and QEMU version from L0

- Kernel, libvirt, and QEMU version from L1

- QEMU command-line of L1 -- when using libvirt, you'll find it here:
  ``/var/log/libvirt/qemu/instance.log``

- QEMU command-line of L2 -- as above, when using libvirt, get the
  complete libvirt-generated QEMU command-line

- ``cat /proc/cpuinfo`` from L0

- ``cat /proc/cpuinfo`` from L1

- ``lscpu`` from L0

- ``lscpu`` from L1

- Full ``dmesg`` output from L0

- Full ``dmesg`` output from L1
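Most of the items above can be gathered with a short helper that is run once on L0 and once on L1. This is a sketch, not an official tool: the function name ``collect_nested_info``, the default directory ``nested-report`` and the output file names are all made up for illustration, and the binary names (``libvirtd``, ``qemu-system-x86_64``, ``qemu-kvm``) may differ on your distribution. The QEMU command lines still have to be copied from the libvirt logs by hand.

```shell
# Sketch: collect generic debugging information into a directory.
# Run it once on L0 and once on L1. All file names are illustrative.
collect_nested_info() {
    out="${1:-nested-report}"
    mkdir -p "$out" || return 1
    uname -r > "$out/kernel-version.txt"
    # Record the version of each tool that is actually installed.
    for bin in libvirtd qemu-system-x86_64 qemu-kvm; do
        if command -v "$bin" >/dev/null 2>&1; then
            "$bin" --version > "$out/$bin-version.txt" 2>&1 || true
        fi
    done
    cat /proc/cpuinfo > "$out/cpuinfo.txt" 2>/dev/null || true
    lscpu > "$out/lscpu.txt" 2>/dev/null || true
    # Reading the kernel log may require root privileges.
    dmesg > "$out/dmesg.txt" 2>/dev/null || true
    echo "$out"
}
```

For example, ``collect_nested_info report-L0`` on the bare metal host and ``collect_nested_info report-L1`` in the guest hypervisor give two directories that can be attached to a bug report.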

x86-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both of the commands below, ``x86info`` and ``dmidecode``, should be
available under the same names on most Linux distributions:

- Output of: ``x86info -a`` from L0

- Output of: ``x86info -a`` from L1

- Output of: ``dmidecode`` from L0

- Output of: ``dmidecode`` from L1

s390x-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Along with the generic details mentioned earlier, the following is
also recommended:

- ``/proc/sysinfo`` from L1; this will also include the info from L0