4e1c7dddc7
Add documentation for TDX host kernel support. There is already one file Documentation/x86/tdx.rst containing documentation for TDX guest internals. Also reuse it for TDX host kernel support. Introduce a new level menu "TDX Guest Support" and move existing materials under it, and add a new menu for TDX host kernel support. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-19-dave.hansen%40intel.com
447 lines
18 KiB
ReStructuredText
.. SPDX-License-Identifier: GPL-2.0

=====================================
Intel Trust Domain Extensions (TDX)
=====================================

Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
the host and physical attacks by isolating the guest register state and by
encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.

TDX Host Kernel Support
=======================

TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
a new isolated range pointed to by the SEAM Range Register (SEAMRR). A
CPU-attested software module called 'the TDX module' runs inside the new
isolated range to provide the functionality needed to manage and run
protected VMs.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs. TDX reserves part of the MKTME KeyIDs
as TDX private KeyIDs, which are only accessible within SEAM mode.
The BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs.

Before the TDX module can be used to create and run protected VMs, it
must be loaded into the isolated range and properly initialized. The TDX
architecture doesn't require the BIOS to load the TDX module, but the
kernel assumes it is loaded by the BIOS.

TDX boot-time detection
-----------------------

The kernel detects TDX by detecting TDX private KeyIDs during kernel
boot. The following dmesg line shows that TDX is enabled by the BIOS::

  [..] virt/tdx: BIOS enabled: private KeyID range: [16, 64)

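The private KeyID range in the boot message is half-open. As a minimal
sketch of how such a range could be derived, assume the BIOS partitioning is
reported as one 64-bit value with the number of legacy MKTME KeyIDs in the
low 32 bits and the number of TDX KeyIDs in the high 32 bits, and that
KeyID 0 is not counted among the MKTME KeyIDs. The value layout, the struct,
and the function name are assumptions made for illustration; this document
does not specify them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the half-open range [start, end). */
struct keyid_range {
	uint32_t start;
	uint32_t end;
};

/*
 * Decode an assumed partitioning value: low 32 bits = number of legacy
 * MKTME KeyIDs, high 32 bits = number of TDX private KeyIDs. KeyID 0 is
 * assumed to be outside the MKTME count, so the TDX private KeyIDs start
 * at nr_mktme + 1.
 */
static struct keyid_range tdx_private_keyids(uint64_t partitioning)
{
	uint32_t nr_mktme = (uint32_t)partitioning;
	uint32_t nr_tdx = (uint32_t)(partitioning >> 32);
	struct keyid_range r;

	r.start = nr_mktme + 1;
	r.end = r.start + nr_tdx;
	return r;
}
```

Under these assumptions, 15 MKTME KeyIDs and 48 TDX KeyIDs give the range
[16, 64) shown in the example message.
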
TDX module initialization
-------------------------

The kernel talks to the TDX module via the new SEAMCALL instruction. The
TDX module implements SEAMCALL leaf functions to allow the kernel to
initialize it.

If the TDX module isn't loaded, the SEAMCALL instruction fails with a
special error. In this case the kernel fails the module initialization
and reports that the module isn't loaded::

  [..] virt/tdx: module not loaded

Initializing the TDX module consumes roughly 1/256th of system RAM,
which is used as 'metadata' for the TDX memory. It also takes additional
CPU time to initialize that metadata along with the TDX module itself.
Neither cost is trivial, so the kernel initializes the TDX module at
runtime on demand.

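To give a feel for the memory cost, the 1/256 ratio can be turned into a
back-of-the-envelope estimate. This is only a sketch: the function name is
hypothetical, and the amount actually allocated depends on the memory layout
and on the TDX module.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rough metadata estimate, in KB, for a given amount of system RAM in KB,
 * using the approximate 1/256 ratio quoted in the text.
 */
static uint64_t tdx_metadata_estimate_kb(uint64_t ram_kb)
{
	return ram_kb / 256;
}
```

For a machine with 64 GiB of RAM (67108864 KB), the estimate comes to
262144 KB, or 256 MiB.
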
Besides initializing the TDX module, a per-cpu initialization SEAMCALL
must be done on a cpu before any other SEAMCALLs can be made on that
cpu.

The kernel provides two functions, tdx_enable() and tdx_cpu_enable(), to
allow the user of TDX to enable the TDX module and to enable TDX on the
local cpu, respectively.

Making a SEAMCALL requires that VMXON has been done on that CPU. Currently
only KVM implements VMXON. For now both tdx_enable() and tdx_cpu_enable()
don't do VMXON internally (not trivial), but depend on the caller to
guarantee that.

To enable TDX, the caller of TDX should: 1) temporarily disable CPU
hotplug; 2) do VMXON and tdx_cpu_enable() on all online cpus; 3) call
tdx_enable(). For example::

  cpus_read_lock();
  /* vmxon_and_tdx_cpu_enable() is a function supplied by the caller */
  on_each_cpu(vmxon_and_tdx_cpu_enable, NULL, 1);
  ret = tdx_enable();
  cpus_read_unlock();
  if (ret)
          goto no_tdx;
  // TDX is ready to use

The caller of TDX must also guarantee that tdx_cpu_enable() has
succeeded on a cpu before it runs any other SEAMCALL on that cpu.
A typical usage is to do both VMXON and tdx_cpu_enable() in the CPU hotplug
online callback, and to refuse to online the cpu if tdx_cpu_enable() fails.

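The hotplug pattern described here can be modelled in userspace. The sketch
below is a simulation only: every ``mock_`` function and variable is a
hypothetical stand-in, not kernel API, and it merely illustrates the ordering
rule (VMXON, then per-cpu enable, and refusal to online on failure).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MOCK_NR_CPUS 4

/* Simulation state; none of this is kernel API. */
static bool mock_cpu_faulty[MOCK_NR_CPUS];	/* per-cpu init will fail */
static bool mock_cpu_tdx_ready[MOCK_NR_CPUS];	/* per-cpu init succeeded */

static int mock_vmxon(int cpu) { (void)cpu; return 0; }
static void mock_vmxoff(int cpu) { (void)cpu; }

/* Stand-in for tdx_cpu_enable(): the per-cpu initialization SEAMCALL. */
static int mock_tdx_cpu_enable(int cpu)
{
	if (mock_cpu_faulty[cpu])
		return -EIO;
	mock_cpu_tdx_ready[cpu] = true;
	return 0;
}

/* Stand-in for any other SEAMCALL: only allowed after per-cpu init. */
static int mock_seamcall(int cpu)
{
	return mock_cpu_tdx_ready[cpu] ? 0 : -EINVAL;
}

/* Online callback: a non-zero return means "refuse to online this cpu". */
static int mock_tdx_online_cpu(int cpu)
{
	int ret;

	assert(cpu >= 0 && cpu < MOCK_NR_CPUS);

	ret = mock_vmxon(cpu);
	if (ret)
		return ret;

	ret = mock_tdx_cpu_enable(cpu);
	if (ret)
		mock_vmxoff(cpu);	/* undo VMXON before refusing */
	return ret;
}
```

In this model a cpu whose per-cpu initialization fails is never marked
ready, so a later ``mock_seamcall()`` on it is rejected.
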
The user can consult dmesg to see whether the TDX module has been initialized.

If the TDX module is initialized successfully, dmesg shows something
like the following::

  [..] virt/tdx: 262668 KBs allocated for PAMT
  [..] virt/tdx: module initialized

If the TDX module fails to initialize, dmesg reports the failure::

  [..] virt/tdx: module initialization failed ...

TDX Interaction with Other Kernel Components
--------------------------------------------

TDX Memory Policy
~~~~~~~~~~~~~~~~~

TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
kernel which memory is TDX compatible. The kernel needs to build a list
of memory regions (out of CMRs) as "TDX-usable" memory and pass those
regions to the TDX module. Once this is done, those "TDX-usable" memory
regions are fixed during the module's lifetime.

To keep things simple, currently the kernel simply guarantees that all pages
in the page allocator are TDX memory. Specifically, the kernel uses all
system memory in the core-mm "at the time of TDX module initialization"
as TDX memory, and at the same time refuses to online any non-TDX memory
via memory hotplug.

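The "refuse to online non-TDX memory" rule amounts to a containment check
against the regions handed to the TDX module. The following is a minimal
userspace sketch of such a check; the struct and function names are
hypothetical and the kernel's own bookkeeping may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one TDX-usable region: pfns [start_pfn, end_pfn). */
struct tdx_region {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * Return true only if [start_pfn, end_pfn) lies entirely inside one of the
 * regions that were passed to the TDX module at initialization time.
 */
static bool range_is_tdx_memory(const struct tdx_region *regions, size_t nr,
				unsigned long start_pfn, unsigned long end_pfn)
{
	size_t i;

	assert(start_pfn < end_pfn);

	for (i = 0; i < nr; i++) {
		if (start_pfn >= regions[i].start_pfn &&
		    end_pfn <= regions[i].end_pfn)
			return true;
	}
	return false;
}
```

A memory hotplug path could reject an online request when such a check
fails. Note that this simple version also rejects a range that straddles two
adjacent regions.
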
Physical Memory Hotplug
~~~~~~~~~~~~~~~~~~~~~~~

Note that TDX assumes convertible memory is always physically present during
the machine's runtime. A non-buggy BIOS should never support hot-removal of
any convertible memory. This implementation doesn't handle ACPI memory
removal but depends on the BIOS to behave correctly.

CPU Hotplug
~~~~~~~~~~~

The TDX module requires that the per-cpu initialization SEAMCALL be done on
a cpu before any other SEAMCALLs can be made on that cpu. The kernel
provides tdx_cpu_enable() to let the user of TDX do this when the user
wants to use a new cpu for a TDX task.

TDX doesn't support physical (ACPI) CPU hotplug. During machine boot,
TDX verifies that all boot-time present logical CPUs are TDX compatible
before enabling TDX. A non-buggy BIOS should never support hot-add/removal
of physical CPUs. Currently the kernel doesn't handle physical CPU hotplug,
but depends on the BIOS to behave correctly.

Note that TDX works with logical CPU online/offline, so the kernel still
allows a logical CPU to be offlined and onlined again.

Kexec()
~~~~~~~

TDX host support currently lacks the ability to handle kexec. For
simplicity, only one of TDX host support and kexec can be enabled in the
Kconfig. This will be fixed in the future.

Erratum
~~~~~~~

The first few generations of TDX hardware have an erratum. A partial
write to a TDX private memory cacheline will silently "poison" the
line. Subsequent reads will consume the poison and generate a machine
check.

A partial write is a memory write where a write transaction of less than
a cacheline lands at the memory controller. The CPU does these via
non-temporal write instructions (like MOVNTI), or through UC/WC memory
mappings. Devices can also do partial writes via DMA.

Theoretically, a kernel bug could do a partial write to TDX private memory
and trigger an unexpected machine check. What's more, the machine check
code will present these as "Hardware error" when they were, in fact, a
software-triggered issue. But in the end, this issue is hard to trigger.

If the platform has this erratum, the kernel prints an additional message in
the machine check handler to tell the user that the machine check may be
caused by a kernel bug on TDX private memory.

Interaction vs S3 and deeper states
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TDX cannot survive S3 and deeper states. The hardware resets and
disables TDX completely when the platform goes to S3 or deeper. Both TDX
guests and the TDX module get destroyed permanently.

The kernel uses S3 for suspend-to-ram, and uses S4 and deeper states for
hibernation. Currently, for simplicity, the kernel chooses to make TDX
mutually exclusive with S3 and hibernation.

The kernel disables TDX during early boot when hibernation support is
available::

  [..] virt/tdx: initialization failed: Hibernation support is enabled

Add 'nohibernate' to the kernel command line to disable hibernation in
order to use TDX.

ACPI S3 is disabled during early kernel boot if TDX is enabled. The user
needs to turn off TDX in the BIOS in order to use S3.

TDX Guest Support
=================
Since the host cannot directly access guest registers or memory, much
normal functionality of a hypervisor must be moved into the guest. This is
implemented using a Virtualization Exception (#VE) that is handled by the
guest kernel. Some #VEs are handled entirely inside the guest kernel, but
others require the hypervisor to be consulted.

TDX includes new hypercall-like mechanisms for communicating from the
guest to the hypervisor or the TDX module.

New TDX Exceptions
------------------

TDX guests behave differently from bare-metal and traditional VMX guests.
In TDX guests, otherwise normal instructions or memory accesses can cause
#VE or #GP exceptions.

Instructions marked with an '*' conditionally cause exceptions. The
details for these instructions are discussed below.

Instruction-based #VE
~~~~~~~~~~~~~~~~~~~~~

- Port I/O (INS, OUTS, IN, OUT)
- HLT
- MONITOR, MWAIT
- WBINVD, INVD
- VMCALL
- RDMSR*,WRMSR*
- CPUID*

Instruction-based #GP
~~~~~~~~~~~~~~~~~~~~~

- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
- ENCLS, ENCLU
- GETSEC
- RSM
- ENQCMD
- RDMSR*,WRMSR*

RDMSR/WRMSR Behavior
~~~~~~~~~~~~~~~~~~~~

MSR access behavior falls into three categories:

- #GP generated
- #VE generated
- "Just works"

In general, the #GP MSRs should not be used in guests. Their use likely
indicates a bug in the guest. The guest may try to handle the #GP with a
hypercall but it is unlikely to succeed.

The #VE MSRs can typically be handled by the hypervisor. Guests
can make a hypercall to the hypervisor to handle the #VE.

The "just works" MSRs do not need any special guest handling. They might
be implemented by directly passing through the MSR to the hardware or by
trapping and handling in the TDX module. Other than possibly being slow,
these MSRs appear to function just as they would on bare metal.

CPUID Behavior
~~~~~~~~~~~~~~

For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
return values (in guest EAX/EBX/ECX/EDX) are configurable by the
hypervisor. For such cases, the Intel TDX module architecture defines two
virtualization types:

- Bit fields for which the hypervisor controls the value seen by the guest
  TD.

- Bit fields for which the hypervisor configures the value such that the
  guest TD either sees their native value or a value of 0. For these bit
  fields, the hypervisor can mask off the native values, but it can not
  turn *on* values.

A #VE is generated for CPUID leaves and sub-leaves that the TDX module does
not know how to handle. The guest kernel may ask the hypervisor for the
value with a hypercall.

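The two virtualization types can be pictured as simple bit operations on
one 32-bit CPUID output register. The following is a sketch only: the
struct, its field names, and the function are hypothetical and do not
reflect how the TDX module actually stores or applies its configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-register configuration supplied by the hypervisor. */
struct cpuid_reg_config {
	uint32_t direct_mask;	/* bits whose value the hypervisor chooses */
	uint32_t direct_value;	/* chosen values for the bits in direct_mask */
	uint32_t allow_mask;	/* bits allowed to show their native value */
};

/* Compute what the guest TD would see for one CPUID output register. */
static uint32_t guest_cpuid_reg(uint32_t native,
				const struct cpuid_reg_config *cfg)
{
	uint32_t direct, masked;

	/* The two classes of bit fields are assumed not to overlap. */
	assert((cfg->direct_mask & cfg->allow_mask) == 0);

	/* First type: the hypervisor controls the value outright. */
	direct = cfg->direct_value & cfg->direct_mask;

	/* Second type: native value or 0; a natively clear bit stays 0. */
	masked = native & cfg->allow_mask;

	return direct | masked;
}
```

Because the second type is a plain AND with the native value, no choice of
``allow_mask`` can turn on a bit that the hardware reports as 0.
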
#VE on Memory Accesses
----------------------

There are essentially two classes of TDX memory: private and shared.
Private memory receives full TDX protections. Its content is protected
against access from the hypervisor. Shared memory is expected to be
shared between guest and hypervisor and does not receive full TDX
protections.

A TD guest is in control of whether its memory accesses are treated as
private or shared. It selects the behavior with a bit in its page table
entries. This helps ensure that a guest does not place sensitive
information in shared memory, exposing it to the untrusted hypervisor.

#VE on Shared Memory
~~~~~~~~~~~~~~~~~~~~

Access to shared mappings can cause a #VE. The hypervisor ultimately
controls whether a shared memory access causes a #VE, so the guest must be
careful to only reference shared pages where it can safely handle a #VE.
For instance, the guest should be careful not to access shared memory in the
#VE handler before it reads the #VE info structure (TDG.VP.VEINFO.GET).

Shared mapping content is entirely controlled by the hypervisor. The guest
should only use shared mappings for communicating with the hypervisor.
Shared mappings must never be used for sensitive memory content like kernel
stacks. A good rule of thumb is that hypervisor-shared memory should be
treated the same as memory mapped to userspace. Both the hypervisor and
userspace are completely untrusted.

MMIO for virtual devices is implemented as shared memory. The guest must
be careful not to access device MMIO regions unless it is also prepared to
handle a #VE.

#VE on Private Pages
~~~~~~~~~~~~~~~~~~~~

An access to private mappings can also cause a #VE. Since all kernel
memory is also private memory, the kernel might theoretically need to
handle a #VE on arbitrary kernel memory accesses. This is not feasible, so
TDX guests ensure that all guest memory has been "accepted" before memory
is used by the kernel.

A modest amount of memory (typically 512M) is pre-accepted by the firmware
before the kernel runs to ensure that the kernel can start up without
being subjected to a #VE.

The hypervisor is permitted to unilaterally move accepted pages to a
"blocked" state. However, if it does this, page access will not generate a
#VE. It will, instead, cause a "TD Exit" where the hypervisor is required
to handle the exception.

Linux #VE handler
-----------------

Just like page faults or #GP's, #VE exceptions can be either handled or
fatal. Typically, an unhandled userspace #VE results in a SIGSEGV.
An unhandled kernel #VE results in an oops.

Handling nested exceptions on x86 is typically nasty business. A #VE
could be interrupted by an NMI which triggers another #VE and hilarity
ensues. The TDX #VE architecture anticipated this scenario and includes a
feature to make it slightly less nasty.

During #VE handling, the TDX module ensures that all interrupts (including
NMIs) are blocked. The block remains in place until the guest makes a
TDG.VP.VEINFO.GET TDCALL. This allows the guest to control when interrupts
or a new #VE can be delivered.

However, the guest kernel must still be careful to avoid potential
#VE-triggering actions (discussed above) while this block is in place.
While the block is in place, any #VE is elevated to a double fault (#DF)
which is not recoverable.

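The blocking rule can be summarized as a tiny state machine. The toy model
below is a userspace sketch with hypothetical names; it captures only the
ordering constraint (read the #VE info before doing anything that might
raise another #VE) and nothing else about the real architecture.

```c
#include <assert.h>
#include <stdbool.h>

enum ve_outcome {
	VE_DELIVERED,		/* #VE handed to the guest handler */
	VE_DOUBLE_FAULT,	/* #VE while blocked: unrecoverable #DF */
};

/* Toy per-vCPU state; the field name is an assumption of this sketch. */
struct toy_vcpu {
	bool ve_blocked;	/* set from #VE delivery until VEINFO.GET */
};

/* Model of a #VE-triggering action performed by the guest. */
static enum ve_outcome toy_raise_ve(struct toy_vcpu *vcpu)
{
	if (vcpu->ve_blocked)
		return VE_DOUBLE_FAULT;
	vcpu->ve_blocked = true;
	return VE_DELIVERED;
}

/* Model of the TDG.VP.VEINFO.GET TDCALL, which lifts the block. */
static void toy_veinfo_get(struct toy_vcpu *vcpu)
{
	assert(vcpu->ve_blocked);
	vcpu->ve_blocked = false;
}
```

In this model, a handler that touches anything #VE-prone before calling
``toy_veinfo_get()`` ends in ``VE_DOUBLE_FAULT``, which mirrors why the real
handler fetches the #VE info first.
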
MMIO handling
-------------

In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
mapping which will cause a VMEXIT on access, and then the hypervisor
emulates the access. That is not possible in TDX guests because VMEXIT
will expose the register state to the host. TDX guests don't trust the host
and can't have their state exposed to the host.

In TDX, MMIO regions typically trigger a #VE exception in the guest. The
guest #VE handler then emulates the MMIO instruction inside the guest and
converts it into a controlled TDCALL to the host, rather than exposing
guest state to the host.

MMIO addresses on x86 are just special physical addresses. They can
theoretically be accessed with any instruction that accesses memory.
However, the kernel instruction decoding method is limited. It is only
designed to decode instructions like those generated by io.h macros.

MMIO access via other means (like structure overlays) may result in an
oops.

Shared Memory Conversions
-------------------------

All TDX guest memory starts out as private at boot. This memory can not
be accessed by the hypervisor. However, some kernel users like device
drivers might have a need to share data with the hypervisor. To do this,
memory must be converted between shared and private. This can be
accomplished using some existing memory encryption helpers:

* set_memory_decrypted() converts a range of pages to shared.
* set_memory_encrypted() converts memory back to private.

Device drivers are the primary user of shared memory, but there's no need
to touch every driver. DMA buffers and ioremap() do the conversions
automatically.

TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
converted to shared on boot.

For coherent DMA allocations, the DMA buffer is converted at allocation
time. Check force_dma_unencrypted() for details.

Attestation
===========

Attestation is used to verify the TDX guest's trustworthiness to other
entities before provisioning secrets to the guest. For example, a key
server may want to use attestation to verify that the guest is the
desired one before releasing the encryption keys to mount the encrypted
rootfs or a secondary drive.

The TDX module records the state of the TDX guest in various stages of
the guest boot process using the build time measurement register (MRTD)
and runtime measurement registers (RTMR). Measurements related to the
guest initial configuration and firmware image are recorded in the MRTD
register. Measurements related to initial state, kernel image, firmware
image, command line options, initrd, ACPI tables, etc. are recorded in
RTMR registers. For more details, as an example, please refer to the TDX
Virtual Firmware design specification, section titled "TD Measurement".
At TDX guest runtime, the attestation process is used to attest to these
measurements.

The attestation process consists of two steps: TDREPORT generation and
Quote generation.

The TDX guest uses TDCALL[TDG.MR.REPORT] to get the TDREPORT (TDREPORT_STRUCT)
from the TDX module. TDREPORT is a fixed-size data structure generated by
the TDX module which contains guest-specific information (such as build
and boot measurements), platform security version, and the MAC to protect
the integrity of the TDREPORT. A user-provided 64-byte REPORTDATA is used
as input and included in the TDREPORT. Typically it can be some nonce
provided by the attestation service so the TDREPORT can be verified uniquely.
More details about the TDREPORT can be found in the Intel TDX Module
specification, section titled "TDG.MR.REPORT Leaf".

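As a small illustration of the REPORTDATA input, the sketch below copies a
caller-supplied nonce into a zero-padded 64-byte buffer. The function name
and the zero-padding policy are assumptions of this example; how the buffer
is then handed to TDG.MR.REPORT is outside its scope.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REPORTDATA_LEN 64	/* REPORTDATA size stated in the text */

/*
 * Copy a caller-supplied nonce into a zero-padded REPORTDATA buffer.
 * Returns 0 on success, -1 if the nonce does not fit.
 */
static int fill_reportdata(uint8_t reportdata[REPORTDATA_LEN],
			   const uint8_t *nonce, size_t nonce_len)
{
	assert(reportdata != NULL && nonce != NULL);

	if (nonce_len > REPORTDATA_LEN)
		return -1;

	memset(reportdata, 0, REPORTDATA_LEN);
	memcpy(reportdata, nonce, nonce_len);
	return 0;
}
```

A nonce longer than 64 bytes is rejected rather than truncated, so that a
verifier never compares against a silently shortened value. Hashing a longer
nonce down to 64 bytes would be another option.
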
After getting the TDREPORT, the second step of the attestation process
is to send it to the Quoting Enclave (QE) to generate the Quote. TDREPORT
by design can only be verified on the local platform as the MAC key is
bound to the platform. To support remote verification of the TDREPORT,
TDX leverages the Intel SGX Quoting Enclave to verify the TDREPORT locally
and convert it to a remotely verifiable Quote. The method of sending the
TDREPORT to the QE is implementation specific. Attestation software can
choose whatever communication channel is available (e.g. vsock or TCP/IP)
to send the TDREPORT to the QE and receive the Quote.

References
==========

TDX reference material is collected here:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html