linux/Documentation
Paolo Bonzini 6c370dc653 Merge branch 'kvm-guestmemfd' into HEAD
Introduce several new KVM uAPIs to ultimately create a guest-first memory
subsystem within KVM, a.k.a. guest_memfd.  Guest-first memory allows KVM
to provide features, enhancements, and optimizations that are kludgly
or outright impossible to implement in a generic memory subsystem.

The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which
similar to the generic memfd_create(), creates an anonymous file and
returns a file descriptor that refers to it.  Again like "regular"
memfd files, guest_memfd files live in RAM, have volatile storage,
and are automatically released when the last reference is dropped.
The key differences between memfd files (and every other memory subystem)
is that guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be resized.
guest_memfd files do however support PUNCH_HOLE, which can be used to
convert a guest memory area between the shared and guest-private states.

A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to
specify attributes for a given page of guest memory.  In the long term,
it will likely be extended to allow userspace to specify per-gfn RWX
protections, including allowing memory to be writable in the guest
without it also being writable in host userspace.

The immediate and driving use case for guest_memfd are Confidential
(CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM.
For such use cases, being able to map memory into KVM guests without
requiring said memory to be mapped into the host is a hard requirement.
While SEV+ and TDX prevent untrusted software from reading guest private
data by encrypting guest memory, pKVM provides confidentiality and
integrity *without* relying on memory encryption.  In addition, with
SEV-SNP and especially TDX, accessing guest private memory can be fatal
to the host, i.e. KVM must be prevent host userspace from accessing
guest memory irrespective of hardware behavior.

Long term, guest_memfd may be useful for use cases beyond CoCo VMs,
for example hardening userspace against unintentional accesses to guest
memory.  As mentioned earlier, KVM's ABI uses userspace VMA protections to
define the allow guest protection (with an exception granted to mapping
guest memory executable), and similarly KVM currently requires the guest
mapping size to be a strict subset of the host userspace mapping size.
Decoupling the mappings sizes would allow userspace to precisely map
only what is needed and with the required permissions, without impacting
guest performance.

A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to DMA from or into guest memory).

guest_memfd is the result of 3+ years of development and exploration;
taking on memory management responsibilities in KVM was not the first,
second, or even third choice for supporting CoCo VMs.  But after many
failed attempts to avoid KVM-specific backing memory, and looking at
where things ended up, it is quite clear that of all approaches tried,
guest_memfd is the simplest, most robust, and most extensible, and the
right thing to do for KVM and the kernel at-large.

The "development cycle" for this version is going to be very short;
ideally, next week I will merge it as is in kvm/next, taking this through
the KVM tree for 6.8 immediately after the end of the merge window.
The series is still based on 6.6 (plus KVM changes for 6.7) so it
will require a small fixup for changes to get_file_rcu() introduced in
6.7 by commit 0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU").
The fixup will be done as part of the merge commit, and most of the text
above will become the commit message for the merge.

Pending post-merge work includes:
- hugepage support
- looking into using the restrictedmem framework for guest memory
- introducing a testing mechanism to poison memory, possibly using
  the same memory attributes introduced here
- SNP and TDX support

There are two non-KVM patches buried in the middle of this series:

  fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

The first is small and mostly suggested-by Christian Brauner; the second
a bit less so but it was written by an mm person (Vlastimil Babka).
2023-11-14 08:31:31 -05:00
..
ABI vhost,virtio,vdpa: features, fixes, cleanups 2023-11-05 09:02:32 -10:00
accel
accounting
admin-guide IOMMU Updates for Linux v6.7 2023-11-09 13:37:28 -08:00
arch arm64 fixes: 2023-11-10 12:22:14 -08:00
block The number of commits for documentation is not huge this time around, but 2023-11-01 17:11:41 -10:00
bpf bpf: Add __bpf_kfunc_{start,end}_defs macros 2023-11-01 22:33:53 -07:00
cdrom
core-api Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
cpu-freq
crypto crypto: ahash - remove support for nonzero alignmask 2023-10-27 18:04:29 +08:00
dev-tools Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
devicetree Input updates for 6.7 merge window: 2023-11-09 14:18:42 -08:00
doc-guide docs: doc-guide: mention 'make refcheckdocs' 2023-10-22 20:38:55 -06:00
driver-api media updates for v6.7-rc1 2023-11-06 15:06:06 -08:00
fault-injection
fb
features
filesystems vfs-6.7.fsid 2023-11-07 12:11:26 -08:00
firmware_class
firmware-guide
fpga
gpu Merge tag 'drm-misc-next-2023-10-27' of git://anongit.freedesktop.org/drm/drm-misc into drm-next 2023-10-31 10:47:50 +10:00
hid
hwmon hwmon: (aquacomputer_d5next) Add support for Aquacomputer High Flow USB and MPS Flow 2023-10-29 22:22:48 -07:00
i2c Documentation: i2c: add fault code for not supporting 10 bit addresses 2023-10-29 21:03:35 +01:00
iio
images
infiniband
input
isdn
kbuild Kbuild updates for v6.7 2023-11-04 08:07:19 -10:00
kernel-hacking
leds
litmus-tests
livepatch
locking
maintainer
mhi
misc-devices eeprom: remove doc and MAINTAINERS section after driver was removed 2023-10-18 10:01:34 +02:00
mm Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
netlabel
netlink netlink: specs: devlink: add forgotten port function caps enum values 2023-11-01 22:13:43 -07:00
networking Including fixes from netfilter and bpf. 2023-11-09 17:09:35 -08:00
nvdimm
nvme
PCI
pcmcia
peci
power
process Driver core changes for 6.7-rc1 2023-11-03 15:15:47 -10:00
RCU Merge branches 'rcu/torture', 'rcu/fixes', 'rcu/docs', 'rcu/refscale', 'rcu/tasks' and 'rcu/stall' into rcu/next 2023-10-23 15:24:11 +02:00
rust Rust changes for v6.7 2023-10-30 20:30:49 -10:00
scheduler asm-generic updates for v6.7 2023-11-01 15:28:33 -10:00
scsi
security
sound Linux 6.6-rc7 2023-10-23 19:38:22 +01:00
sphinx Documentation/sphinx: Remove the repeated word "the" in comments. 2023-10-22 20:33:38 -06:00
sphinx-static
spi
staging
target
timers
tools
trace Documentation: tracing: Add a note about argument and retval access 2023-11-10 19:59:03 +09:00
translations media updates for v6.7-rc1 2023-11-06 15:06:06 -08:00
usb USB/Thunderbolt changes for 6.7-rc1 2023-11-03 16:00:42 -10:00
userspace-api drm next and fixes for 6.7-rc1 2023-11-07 17:10:02 -08:00
virt KVM: x86: Add support for "protected VMs" that can utilize private memory 2023-11-14 08:01:05 -05:00
w1
watchdog
wmi
.gitignore
atomic_bitops.txt
atomic_t.txt
Changes
CodingStyle
conf.py
docutils.conf
dontdiff
index.rst
Kconfig
Makefile
memory-barriers.txt
SubmittingPatches
subsystem-apis.rst