e753070234
22 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Kai Huang
|
70060463cb |
x86/mce: Differentiate real hardware #MCs from TDX erratum ones
The first few generations of TDX hardware have an erratum. Triggering it in Linux requires some kind of kernel bug involving relatively exotic memory writes to TDX private memory and will manifest via spurious-looking machine checks when reading the affected memory. Make an effort to detect these TDX-induced machine checks and spit out a new blurb to dmesg so folks do not think their hardware is failing. == Background == Virtually all kernel memory accesses operations happen in full cachelines. In practice, writing a "byte" of memory usually reads a 64 byte cacheline of memory, modifies it, then writes the whole line back. Those operations do not trigger this problem. This problem is triggered by "partial" writes where a write transaction of less than cacheline lands at the memory controller. The CPU does these via non-temporal write instructions (like MOVNTI), or through UC/WC memory mappings. The issue can also be triggered away from the CPU by devices doing partial writes via DMA. == Problem == A partial write to a TDX private memory cacheline will silently "poison" the line. Subsequent reads will consume the poison and generate a machine check. According to the TDX hardware spec, neither of these things should have happened. To add insult to injury, the Linux machine code will present these as a literal "Hardware error" when they were, in fact, a software-triggered issue. == Solution == In the end, this issue is hard to trigger. Rather than do something rash (and incomplete) like unmap TDX private memory from the direct map, improve the machine check handler. Currently, the #MC handler doesn't distinguish whether the memory is TDX private memory or not but just dump, for instance, below message: [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134 [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0} ... [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii' [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel [...] Kernel panic - not syncing: Fatal local machine check Which says "Hardware Error" and "Data load in unrecoverable area of kernel". Ideally, it's better for the log to say "software bug around TDX private memory" instead of "Hardware Error". But in reality the real hardware memory error can happen, and sadly such software-triggered #MC cannot be distinguished from the real hardware error. Also, the error message is used by userspace tool 'mcelog' to parse, so changing the output may break userspace. So keep the "Hardware Error". The "Data load in unrecoverable area of kernel" is also helpful, so keep it too. Instead of modifying above error log, improve the error log by printing additional TDX related message to make the log like: ... [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug. Adding this additional message requires determination of whether the memory page is TDX private memory. There is no existing infrastructure to do that. Add an interface to query the TDX module to fill this gap. == Impact == This issue requires some kind of kernel bug to trigger. TDX private memory should never be mapped UC/WC. A partial write originating from these mappings would require *two* bugs, first mapping the wrong page, then writing the wrong memory. It would also be detectable using traditional memory corruption techniques like DEBUG_PAGEALLOC. MOVNTI (and friends) could cause this issue with something like a simple buffer overrun or use-after-free on the direct map. It should also be detectable with normal debug techniques. The one place where this might get nasty would be if the CPU read data then wrote back the same data. That would trigger this problem but would not, for instance, set off mechanisms like slab redzoning because it doesn't actually corrupt data. With an IOMMU at least, the DMA exposure is similar to the UC/WC issue. TDX private memory would first need to be incorrectly mapped into the I/O space and then a later DMA to that mapping would actually cause the poisoning event. [ dhansen: changelog tweaks ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-18-dave.hansen%40intel.com |
||
Kai Huang
|
1e536e1068 |
x86/cpu: Detect TDX partial write machine check erratum
TDX memory has integrity and confidentiality protections. Violations of this integrity protection are supposed to only affect TDX operations and are never supposed to affect the host kernel itself. In other words, the host kernel should never, itself, see machine checks induced by the TDX integrity hardware. Alas, the first few generations of TDX hardware have an erratum. A partial write to a TDX private memory cacheline will silently "poison" the line. Subsequent reads will consume the poison and generate a machine check. According to the TDX hardware spec, neither of these things should have happened. Virtually all kernel memory accesses operations happen in full cachelines. In practice, writing a "byte" of memory usually reads a 64 byte cacheline of memory, modifies it, then writes the whole line back. Those operations do not trigger this problem. This problem is triggered by "partial" writes where a write transaction of less than cacheline lands at the memory controller. The CPU does these via non-temporal write instructions (like MOVNTI), or through UC/WC memory mappings. The issue can also be triggered away from the CPU by devices doing partial writes via DMA. With this erratum, there are additional things need to be done. To prepare for those changes, add a CPU bug bit to indicate this erratum. Note this bug reflects the hardware thus it is detected regardless of whether the kernel is built with TDX support or not. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-17-dave.hansen%40intel.com |
||
Kai Huang
|
f3f6aa6864 |
x86/virt/tdx: Handle TDX interaction with sleep and hibernation
TDX is incompatible with hibernation and some ACPI sleep states. Users must disable hibernation to use TDX. Users must also disable TDX if they want to use ACPI S3 sleep. This feels a bit wonky and asymmetric, but it avoids adding any new command-line parameters for now. It can be improved if users hate it too much. Long version: TDX cannot survive from S3 and deeper states. The hardware resets and disables TDX completely when platform goes to S3 and deeper. Both TDX guests and the TDX module get destroyed permanently. The kernel uses S3 to support suspend-to-ram, and S4 or deeper states to support hibernation. The kernel also maintains TDX states to track whether it has been initialized and its metadata resource, etc. After resuming from S3 or hibernation, these TDX states won't be correct anymore. Theoretically, the kernel can do more complicated things like resetting TDX internal states and TDX module metadata before going to S3 or deeper, and re-initialize TDX module after resuming, etc, but there is no way to save/restore TDX guests for now. Until TDX supports full save and restore of TDX guests, there is no big value to handle TDX module in suspend and hibernation alone. To make things simple, just choose to make TDX mutually exclusive with S3 and hibernation. Note the TDX module is initialized at runtime. To avoid having to deal with the fuss of determining TDX state at runtime, just choose TDX vs S3 and hibernation at kernel early boot. It's a bad user experience if the choice of TDX and S3/hibernation is done at runtime anyway, i.e., the user can experience being able to do S3/hibernation but later becoming unable to due to TDX being enabled. Disable TDX in kernel early boot when hibernation support is available. Currently there's no mechanism exposed by the hibernation code to allow other kernel code to disable hibernation once for all. Users that want TDX must disable hibernation, like using hibername=no on the command line. Disable ACPI S3 when TDX is enabled by the BIOS. For now the user needs to disable TDX in the BIOS to use ACPI S3. A new kernel command line can be added in the future if there's a need to let user disable TDX host via kernel command line. Alternatively, the kernel could disable TDX when ACPI S3 is supported and request the user to disable S3 to use TDX. But there's no existing kernel command line to do that, and BIOS doesn't always have an option to disable S3. [ dhansen: subject / changelog tweaks ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-16-dave.hansen%40intel.com |
||
Kai Huang
|
0b2bc38131 |
x86/virt/tdx: Initialize all TDMRs
After the global KeyID has been configured on all packages, initialize all TDMRs to make all TDX-usable memory regions that are passed to the TDX module become usable. This is the last step of initializing the TDX module. Initializing TDMRs can be time consuming on large memory systems as it involves initializing all metadata entries for all pages that can be used by TDX guests. Initializing different TDMRs can be parallelized. For now to keep it simple, just initialize all TDMRs one by one. It can be enhanced in the future. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-15-dave.hansen%40intel.com |
||
Kai Huang
|
e56d28df2f |
x86/virt/tdx: Configure global KeyID on all packages
After the list of TDMRs and the global KeyID are configured to the TDX module, the kernel needs to configure the key of the global KeyID on all packages using TDH.SYS.KEY.CONFIG. This SEAMCALL cannot run parallel on different cpus. Loop all online cpus and use smp_call_on_cpu() to call this SEAMCALL on the first cpu of each package. To keep things simple, this implementation takes no affirmative steps to online cpus to make sure there's at least one cpu for each package. The callers (aka. KVM) can ensure success by ensuring sufficient CPUs are online for this to succeed. Intel hardware doesn't guarantee cache coherency across different KeyIDs. The PAMTs are transitioning from being used by the kernel mapping (KeyId 0) to the TDX module's "global KeyID" mapping. This means that the kernel must flush any dirty KeyID-0 PAMT cachelines before the TDX module uses the global KeyID to access the PAMTs. Otherwise, if those dirty cachelines were written back, they would corrupt the TDX module's metadata. Aside: This corruption would be detected by the memory integrity hardware on the next read of the memory with the global KeyID. The result would likely be fatal to the system but would not impact TDX security. Following the TDX module specification, flush cache before configuring the global KeyID on all packages. Given the PAMT size can be large (~1/256th of system RAM), just use WBINVD on all CPUs to flush. If TDH.SYS.KEY.CONFIG fails, the TDX module may already have "converted" some memory for TDX module use. Convert the memory back so that it can be safely used by the kernel again. Note that this is slower than it should be because of the "partial write machine check" erratum which affects TDX-capable hardware. Also refactor and introduce a new helper: tdmr_do_pamt_func(). This takes a TDMR and runs a function on its PAMT. It looks a _bit_ odd to pass a function pointer around like this, but its use is pretty narrow and it does eliminate what would otherwise be some copying and pasting. [ dhansen: * munge changelog as usual * remove weird (*pamd_func)() syntax ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-14-dave.hansen%40intel.com |
||
Kai Huang
|
554ce1c36d |
x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
The TDX module uses a private KeyID as the "global KeyID" for mapping things like the PAMT and other TDX metadata. This KeyID has already been reserved when detecting TDX during the kernel early boot. Now that the "TD Memory Regions" (TDMRs) are fully built, pass them to the TDX module together with the global KeyID. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-13-dave.hansen%40intel.com |
||
Kai Huang
|
dde3b60d57 |
x86/virt/tdx: Designate reserved areas for all TDMRs
As the last step of constructing TDMRs, populate reserved areas for all TDMRs. Cover all memory holes and PAMTs with a TMDR reserved area. [ dhansen: trim down chagnelog ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-12-dave.hansen%40intel.com |
||
Kai Huang
|
ac3a220884 |
x86/virt/tdx: Allocate and set up PAMTs for TDMRs
The TDX module uses additional metadata to record things like which guest "owns" a given page of memory. This metadata, referred as Physical Address Metadata Table (PAMT), essentially serves as the 'struct page' for the TDX module. PAMTs are not reserved by hardware up front. They must be allocated by the kernel and then given to the TDX module during module initialization. TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region" (TDMR) has 3 PAMTs to track the 3 supported page sizes. Each PAMT must be a physically contiguous area from a Convertible Memory Region (CMR). However, the PAMTs which track pages in one TDMR do not need to reside within that TDMR but can be anywhere in CMRs. If one PAMT overlaps with any TDMR, the overlapping part must be reported as a reserved area in that particular TDMR. Use alloc_contig_pages() since PAMT must be a physically contiguous area and it may be potentially large (~1/256th of the size of the given TDMR). The downside is alloc_contig_pages() may fail at runtime. One (bad) mitigation is to launch a TDX guest early during system boot to get those PAMTs allocated at early time, but the only way to fix is to add a boot option to allocate or reserve PAMTs during kernel boot. It is imperfect but will be improved on later. TDX only supports a limited number of reserved areas per TDMR to cover both PAMTs and memory holes within the given TDMR. If many PAMTs are allocated within a single TDMR, the reserved areas may not be sufficient to cover all of them. Adopt the following policies when allocating PAMTs for a given TDMR: - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize the total number of reserved areas consumed for PAMTs. - Try to first allocate PAMT from the local node of the TDMR for better NUMA locality. Also dump out how many pages are allocated for PAMTs when the TDX module is initialized successfully. This helps answer the eternal "where did all my memory go?" questions. [ dhansen: merge in error handling cleanup ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-11-dave.hansen%40intel.com |
||
Kai Huang
|
f3338ac159 |
x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
Start to transit out the "multi-steps" to construct a list of "TD Memory Regions" (TDMRs) to cover all TDX-usable memory regions. The kernel configures TDX-usable memory regions by passing a list of TDMRs "TD Memory Regions" (TDMRs) to the TDX module. Each TDMR contains the information of the base/size of a memory region, the base/size of the associated Physical Address Metadata Table (PAMT) and a list of reserved areas in the region. Do the first step to fill out a number of TDMRs to cover all TDX memory regions. To keep it simple, always try to use one TDMR for each memory region. As the first step only set up the base/size for each TDMR. Each TDMR must be 1G aligned and the size must be in 1G granularity. This implies that one TDMR could cover multiple memory regions. If a memory region spans the 1GB boundary and the former part is already covered by the previous TDMR, just use a new TDMR for the remaining part. TDX only supports a limited number of TDMRs. Disable TDX if all TDMRs are consumed but there is more memory region to cover. There are fancier things that could be done like trying to merge adjacent TDMRs. This would allow more pathological memory layouts to be supported. But, current systems are not even close to exhausting the existing TDMR resources in practice. For now, keep it simple. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-10-dave.hansen%40intel.com |
||
Kai Huang
|
5173d3c5d0 |
x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
After the kernel selects all TDX-usable memory regions, the kernel needs to pass those regions to the TDX module via data structure "TD Memory Region" (TDMR). Add a placeholder to construct a list of TDMRs (in multiple steps) to cover all TDX-usable memory regions. === Long Version === TDX provides increased levels of memory confidentiality and integrity. This requires special hardware support for features like memory encryption and storage of memory integrity checksums. Not all memory satisfies these requirements. As a result, TDX introduced the concept of a "Convertible Memory Region" (CMR). During boot, the firmware builds a list of all of the memory ranges which can provide the TDX security guarantees. The list of these ranges is available to the kernel by querying the TDX module. The TDX architecture needs additional metadata to record things like which TD guest "owns" a given page of memory. This metadata essentially serves as the 'struct page' for the TDX module. The space for this metadata is not reserved by the hardware up front and must be allocated by the kernel and given to the TDX module. Since this metadata consumes space, the VMM can choose whether or not to allocate it for a given area of convertible memory. If it chooses not to, the memory cannot receive TDX protections and can not be used by TDX guests as private memory. For every memory region that the VMM wants to use as TDX memory, it sets up a "TD Memory Region" (TDMR). Each TDMR represents a physically contiguous convertible range and must also have its own physically contiguous metadata table, referred to as a Physical Address Metadata Table (PAMT), to track status for each page in the TDMR range. Unlike a CMR, each TDMR requires 1G granularity and alignment. To support physical RAM areas that don't meet those strict requirements, each TDMR permits a number of internal "reserved areas" which can be placed over memory holes. If PAMT metadata is placed within a TDMR it must be covered by one of these reserved areas. Let's summarize the concepts: CMR - Firmware-enumerated physical ranges that support TDX. CMRs are 4K aligned. TDMR - Physical address range which is chosen by the kernel to support TDX. 1G granularity and alignment required. Each TDMR has reserved areas where TDX memory holes and overlapping PAMTs can be represented. PAMT - Physically contiguous TDX metadata. One table for each page size per TDMR. Roughly 1/256th of TDMR in size. 256G TDMR = ~1G PAMT. As one step of initializing the TDX module, the kernel configures TDX-usable memory regions by passing a list of TDMRs to the TDX module. Constructing the list of TDMRs consists below steps: 1) Fill out TDMRs to cover all memory regions that the TDX module will use for TD memory. 2) Allocate and set up PAMT for each TDMR. 3) Designate reserved areas for each TDMR. Add a placeholder to construct TDMRs to do the above steps. To keep things simple, just allocate enough space to hold maximum number of TDMRs up front. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-9-dave.hansen%40intel.com |
||
Kai Huang
|
cf72bc4816 |
x86/virt/tdx: Get module global metadata for module initialization
The TDX module global metadata provides system-wide information about the module. TL;DR: Use the TDH.SYS.RD SEAMCALL to tell if the module is good or not. Long Version: 1) Only initialize TDX module with version 1.5 and later TDX module 1.0 has some compatibility issues with the later versions of module, as documented in the "Intel TDX module ABI incompatibilities between TDX1.0 and TDX1.5" spec. Don't bother with module versions that do not have a stable ABI. 2) Get the essential global metadata for module initialization TDX reports a list of "Convertible Memory Region" (CMR) to tell the kernel which memory is TDX compatible. The kernel needs to build a list of memory regions (out of CMRs) as "TDX-usable" memory and pass them to the TDX module. The kernel does this by constructing a list of "TD Memory Regions" (TDMRs) to cover all these memory regions and passing them to the TDX module. Each TDMR is a TDX architectural data structure containing the memory region that the TDMR covers, plus the information to track (within this TDMR): a) the "Physical Address Metadata Table" (PAMT) to track each TDX memory page's status (such as which TDX guest "owns" a given page, and b) the "reserved areas" to tell memory holes that cannot be used as TDX memory. The kernel needs to get below metadata from the TDX module to build the list of TDMRs: a) the maximum number of supported TDMRs b) the maximum number of supported reserved areas per TDMR and, c) the PAMT entry size for each TDX-supported page size. == Implementation == The TDX module has two modes of fetching the metadata: a one field at a time, or all in one blob. Use the field at a time for now. It is slower, but there just are not enough fields now to justify the complexity of extra unpacking. The err_free_tdxmem=>out_put_tdxmem goto looks wonky by itself. But it is the first of a bunch of error handling that will get stuck at its site. [ dhansen: clean up changelog and add a struct to map between the TDX module fields and 'struct tdx_tdmr_sysinfo' ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-8-dave.hansen%40intel.com |
||
Kai Huang
|
abe8dbab8f |
x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
Start to transit out the "multi-steps" to initialize the TDX module. TDX provides increased levels of memory confidentiality and integrity. This requires special hardware support for features like memory encryption and storage of memory integrity checksums. Not all memory satisfies these requirements. As a result, TDX introduced the concept of a "Convertible Memory Region" (CMR). During boot, the firmware builds a list of all of the memory ranges which can provide the TDX security guarantees. The list of these ranges is available to the kernel by querying the TDX module. CMRs tell the kernel which memory is TDX compatible. The kernel needs to build a list of memory regions (out of CMRs) as "TDX-usable" memory and pass them to the TDX module. Once this is done, those "TDX-usable" memory regions are fixed during module's lifetime. To keep things simple, assume that all TDX-protected memory will come from the page allocator. Make sure all pages in the page allocator *are* TDX-usable memory. As TDX-usable memory is a fixed configuration, take a snapshot of the memory configuration from memblocks at the time of module initialization (memblocks are modified on memory hotplug). This snapshot is used to enable TDX support for *this* memory configuration only. Use a memory hotplug notifier to ensure that no other RAM can be added outside of this configuration. This approach requires all memblock memory regions at the time of module initialization to be TDX convertible memory to work, otherwise module initialization will fail in a later SEAMCALL when passing those regions to the module. This approach works when all boot-time "system RAM" is TDX convertible memory and no non-TDX-convertible memory is hot-added to the core-mm before module initialization. For instance, on the first generation of TDX machines, both CXL memory and NVDIMM are not TDX convertible memory. Using kmem driver to hot-add any CXL memory or NVDIMM to the core-mm before module initialization will result in failure to initialize the module. The SEAMCALL error code will be available in the dmesg to help user to understand the failure. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-7-dave.hansen%40intel.com |
||
Kai Huang
|
6162b310bc |
x86/virt/tdx: Add skeleton to enable TDX on demand
There are essentially two steps to get the TDX module ready: 1) Get each CPU ready to run TDX 2) Set up the shared TDX module data structures Introduce and export (to KVM) the infrastructure to do both of these pieces at runtime. == Per-CPU TDX Initialization == Track the initialization status of each CPU with a per-cpu variable. This avoids failures in the case of KVM module reloads and handles cases where CPUs come online later. Generally, the per-cpu SEAMCALLs happen first. But there's actually one global call that has to happen before _any_ others (TDH_SYS_INIT). It's analogous to the boot CPU having to do a bit of extra work just because it happens to be the first one. Track if _any_ CPU has done this call and then only actually do it during the first per-cpu init. == Shared TDX Initialization == Create the global state function (tdx_enable()) as a simple placeholder. The TODO list will be pared down as functionality is added. Use a state machine protected by mutex to make sure the work in tdx_enable() will only be done once. This avoids failures if the KVM module is reloaded. A CPU must be made ready to run TDX before it can participate in initializing the shared parts of the module. Any caller of tdx_enable() need to ensure that it can never run on a CPU which is not ready to run TDX. It needs to be wary of CPU hotplug, preemption and the VMX enabling state of any CPU on which it might run. == Why runtime instead of boot time? == The TDX module can be initialized only once in its lifetime. Instead of always initializing it at boot time, this implementation chooses an "on demand" approach to initialize TDX until there is a real need (e.g when requested by KVM). This approach has below pros: 1) It avoids consuming the memory that must be allocated by kernel and given to the TDX module as metadata (~1/256th of the TDX-usable memory), and also saves the CPU cycles of initializing the TDX module (and the metadata) when TDX is not used at all. 2) The TDX module design allows it to be updated while the system is running. The update procedure shares quite a few steps with this "on demand" initialization mechanism. The hope is that much of "on demand" mechanism can be shared with a future "update" mechanism. A boot-time TDX module implementation would not be able to share much code with the update mechanism. 3) Making SEAMCALL requires VMX to be enabled. Currently, only the KVM code mucks with VMX enabling. If the TDX module were to be initialized separately from KVM (like at boot), the boot code would need to be taught how to muck with VMX enabling and KVM would need to be taught how to cope with that. Making KVM itself responsible for TDX initialization lets the rest of the kernel stay blissfully unaware of VMX. [ dhansen: completely reorder/rewrite changelog ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-6-dave.hansen%40intel.com |
||
Kai Huang
|
df01f5ae07 |
x86/virt/tdx: Add SEAMCALL error printing for module initialization
The SEAMCALLs involved during the TDX module initialization are not expected to fail. In fact, they are not expected to return any non-zero code (except the "running out of entropy error", which can be handled internally already). Add yet another set of SEAMCALL wrappers, which treats all non-zero return code as error, to support printing SEAMCALL error upon failure for module initialization. Note the TDX module initialization doesn't use the _saved_ret() variant thus no wrapper is added for it. SEAMCALL assembly can also return kernel-defined error codes for three special cases: 1) TDX isn't enabled by the BIOS; 2) TDX module isn't loaded; 3) CPU isn't in VMX operation. Whether they can legally happen depends on the caller, so leave to the caller to print error message when desired. Also convert the SEAMCALL error codes to the kernel error codes in the new wrappers so that each SEAMCALL caller doesn't have to repeat the conversion. [ dhansen: Align the register dump with show_regs(). Zero-pad the contents, split on two lines and use consistent spacing. ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-5-dave.hansen%40intel.com |
||
Kai Huang
|
765a0542fd |
x86/virt/tdx: Detect TDX during kernel boot
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. A CPU-attested software module called 'the TDX module' runs inside a new isolated memory range as a trusted hypervisor to manage and run protected VMs. Pre-TDX Intel hardware has support for a memory encryption architecture called MKTME. The memory encryption hardware underpinning MKTME is also used for Intel TDX. TDX ends up "stealing" some of the physical address space from the MKTME architecture for crypto-protection to VMs. The BIOS is responsible for partitioning the "KeyID" space between legacy MKTME and TDX. The KeyIDs reserved for TDX are called 'TDX private KeyIDs' or 'TDX KeyIDs' for short. During machine boot, TDX microcode verifies that the BIOS programmed TDX private KeyIDs consistently and correctly programmed across all CPU packages. The MSRs are locked in this state after verification. This is why MSR_IA32_MKTME_KEYID_PARTITIONING gets used for TDX enumeration: it indicates not just that the hardware supports TDX, but that all the boot-time security checks passed. The TDX module is expected to be loaded by the BIOS when it enables TDX, but the kernel needs to properly initialize it before it can be used to create and run any TDX guests. The TDX module will be initialized by the KVM subsystem when KVM wants to use TDX. Detect platform TDX support by detecting TDX private KeyIDs. The TDX module itself requires one TDX KeyID as the 'TDX global KeyID' to protect its metadata. Each TDX guest also needs a TDX KeyID for its own protection. Just use the first TDX KeyID as the global KeyID and leave the rest for TDX guests. If no TDX KeyID is left for TDX guests, disable TDX as initializing the TDX module alone is useless. [ dhansen: add X86_FEATURE, replace helper function ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-1-dave.hansen%40intel.com |
||
Kai Huang
|
7b804135d4 |
x86/virt/tdx: Make TDX_MODULE_CALL handle SEAMCALL #UD and #GP
SEAMCALL instruction causes #UD if the CPU isn't in VMX operation. Currently the TDX_MODULE_CALL assembly doesn't handle #UD, thus making SEAMCALL when VMX is disabled would cause Oops. Unfortunately, there are legal cases that SEAMCALL can be made when VMX is disabled. For instance, VMX can be disabled due to emergency reboot while there are still TDX guests running. Extend the TDX_MODULE_CALL assembly to return an error code for #UD to handle this case gracefully, e.g., KVM can then quietly eat all SEAMCALL errors caused by emergency reboot. SEAMCALL instruction also causes #GP when TDX isn't enabled by the BIOS. Use _ASM_EXTABLE_FAULT() to catch both exceptions with the trap number recorded, and define two new error codes by XORing the trap number to the TDX_SW_ERROR. This opportunistically handles #GP too while using the same simple assembly code. A bonus is when kernel mistakenly calls SEAMCALL when CPU isn't in VMX operation, or when TDX isn't enabled by the BIOS, or when the BIOS is buggy, the kernel can get a nicer error code rather than a less understandable Oops. This is basically based on Peter's code. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/de975832a367f476aab2d0eb0d9de66019a16b54.1692096753.git.kai.huang%40intel.com |
||
Kai Huang
|
c33621b4c5 |
x86/virt/tdx: Wire up basic SEAMCALL functions
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. A CPU-attested software module called 'the TDX module' runs inside a new isolated memory range as a trusted hypervisor to manage and run protected VMs. TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This mode runs only the TDX module itself or other code to load the TDX module. The host kernel communicates with SEAM software via a new SEAMCALL instruction. This is conceptually similar to a guest->host hypercall, except it is made from the host to SEAM software instead. The TDX module establishes a new SEAMCALL ABI which allows the host to initialize the module and to manage VMs. The SEAMCALL ABI is very similar to the TDCALL ABI and leverages much TDCALL infrastructure. Wire up basic functions to make SEAMCALLs for the basic support of running TDX guests: __seamcall(), __seamcall_ret(), and __seamcall_saved_ret() for TDH.VP.ENTER. All SEAMCALLs involved in the basic TDX support don't use "callee-saved" registers as input and output, except the TDH.VP.ENTER. To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for TDX host kernel support. Add a new Kconfig option CONFIG_INTEL_TDX_HOST to opt-in TDX host kernel support (to distinguish with TDX guest kernel support). So far only KVM uses TDX. Make the new config option depend on KVM_INTEL. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Isaku Yamahata <isaku.yamahata@intel.com> Link: https://lore.kernel.org/all/4db7c3fc085e6af12acc2932294254ddb3d320b3.1692096753.git.kai.huang%40intel.com |
||
Kai Huang
|
90f5ecd37f |
x86/tdx: Reimplement __tdx_hypercall() using TDX_MODULE_CALL asm
Now the TDX_HYPERCALL asm is basically identical to the TDX_MODULE_CALL with both '\saved' and '\ret' enabled, with two minor things though: 1) The way to restore the structure pointer is different The TDX_HYPERCALL uses RCX as spare to restore the structure pointer, but the TDX_MODULE_CALL assumes no spare register can be used. In other words, TDX_MODULE_CALL already covers what TDX_HYPERCALL does. 2) TDX_MODULE_CALL only clears shared registers for TDH.VP.ENTER For this just need to make that code available for the non-host case. Thus, remove the TDX_HYPERCALL and reimplement the __tdx_hypercall() using the TDX_MODULE_CALL. Extend the TDX_MODULE_CALL to cover "clear shared registers" for TDG.VP.VMCALL. Introduce a new __tdcall_saved_ret() to replace the temporary __tdcall_hypercall(). The __tdcall_saved_ret() can also be used for those new TDCALLs which require more input/output registers than the basic TDCALLs do. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/e68a2473fb6f5bcd78b078cae7510e9d0753b3df.1692096753.git.kai.huang%40intel.com |
||
Kai Huang
|
12f34ed862 |
x86/tdx: Extend TDX_MODULE_CALL to support more TDCALL/SEAMCALL leafs
The TDX guest live migration support (TDX 1.5) adds new TDCALL/SEAMCALL leaf functions. Those new TDCALLs/SEAMCALLs take additional registers for input (R10-R13) and output (R12-R13). TDG.SERVTD.RD is an example. Also, the current TDX_MODULE_CALL doesn't aim to handle TDH.VP.ENTER SEAMCALL, which monitors the TDG.VP.VMCALL in input/output registers when it returns in case of VMCALL from TDX guest. With those new TDCALLs/SEAMCALLs and the TDH.VP.ENTER covered, the TDX_MODULE_CALL macro basically needs to handle the same input/output registers as the TDX_HYPERCALL does. And as a result, they also share similar logic in the assembly, thus should be unified to use one common assembly. Extend the TDX_MODULE_CALL asm to support the new TDCALLs/SEAMCALLs and also the TDH.VP.ENTER SEAMCALL. Eventually it will be unified with the TDX_HYPERCALL. The new input/output registers fit with the "callee-saved" registers in the x86 calling convention. Add a new "saved" parameter to support those new TDCALLs/SEAMCALLs and TDH.VP.ENTER and keep the existing TDCALLs/SEAMCALLs minimally impacted. For TDH.VP.ENTER, after it returns the registers shared by the guest contain guest's values. Explicitly clear them to prevent speculative use of guest's values. Note most TDX live migration related SEAMCALLs may also clobber AVX* state ("AVX, AVX2 and AVX512 state: may be reset to the architectural INIT state" -- see TDH.EXPORT.MEM for example). And TDH.VP.ENTER also clobbers XMM0-XMM15 when the corresponding bit is set in RCX. Don't handle them in the TDX_MODULE_CALL macro but let the caller save and restore when needed. This is basically based on Peter's code. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/d4785de7c392f7c5684407f6c24a73b92148ec49.1692096753.git.kai.huang%40intel.com |
||
Kai Huang
|
57a420bb81 |
x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and SEAMCALL, takes one parameter for each input register and an optional 'struct tdx_module_output' (a collection of output registers) as output. This is different from the TDX_HYPERCALL macro which uses a single 'struct tdx_hypercall_args' to carry all input/output registers. The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more input/output registers. Also, the TDH.VP.ENTER (which isn't covered by the current TDX_MODULE_CALL macro) basically can use all registers that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't extendible to cover those cases. Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro to use a single structure 'struct tdx_module_args' to carry all the input/output registers. Currently, R10/R11 are only used as output register but not as input by any TDCALL/SEAMCALL. Change to also use R10/R11 as input register to make input/output registers symmetric. Currently, the TDX_MODULE_CALL macro depends on the caller to pass a non-NULL 'struct tdx_module_output' to get additional output registers. Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to take a new 'ret' macro argument to indicate whether to save the output registers to the 'struct tdx_module_args'. Also introduce a new __tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret(). Note the tdcall(), which is a wrapper of __tdcall(), is called by three callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init(). The former two need the additional output but the last one doesn't. For simplicity, make tdcall() always call __tdcall_ret() to avoid another "_ret()" wrapper. The last caller tdx_early_init() isn't performance critical anyway. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com |
||
Kai Huang
|
03a423d40c |
x86/tdx: Skip saving output regs when SEAMCALL fails with VMFailInvalid
If SEAMCALL fails with VMFailInvalid, the SEAM software (e.g., the TDX module) won't have chance to set any output register. Skip saving the output registers to the structure in this case. Also, as '.Lno_output_struct' is the very last symbol before RET, rename it to '.Lout' to make it short. Opportunistically make the asm directives unindented. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/704088f5b4d72c7e24084f7f15bd1ac5005b7213.1692096753.git.kai.huang%40intel.com |
||
Kirill A. Shutemov
|
527a534c73 |
x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers
Secure Arbitration Mode (SEAM) is an extension of VMX architecture. It defines a new VMX root operation (SEAM VMX root) and a new VMX non-root operation (SEAM VMX non-root) which are both isolated from the legacy VMX operation where the host kernel runs. A CPU-attested software module (called 'TDX module') runs in SEAM VMX root to manage and protect VMs running in SEAM VMX non-root. SEAM VMX root is also used to host another CPU-attested software module (called 'P-SEAMLDR') to load and update the TDX module. Host kernel transits to either P-SEAMLDR or TDX module via the new SEAMCALL instruction, which is essentially a VMExit from VMX root mode to SEAM VMX root mode. SEAMCALLs are leaf functions defined by P-SEAMLDR and TDX module around the new SEAMCALL instruction. A guest kernel can also communicate with TDX module via TDCALL instruction. TDCALLs and SEAMCALLs use an ABI different from the x86-64 system-v ABI. RAX is used to carry both the SEAMCALL leaf function number (input) and the completion status (output). Additional GPRs (RCX, RDX, R8-R11) may be further used as both input and output operands in individual leaf. TDCALL and SEAMCALL share the same ABI and require the largely same code to pass down arguments and retrieve results. Define an assembly macro that can be used to implement C wrapper for both TDCALL and SEAMCALL. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-3-kirill.shutemov@linux.intel.com |