6c370dc653
Introduce several new KVM uAPIs to ultimately create a guest-first memory subsystem within KVM, a.k.a. guest_memfd. Guest-first memory allows KVM to provide features, enhancements, and optimizations that are kludgy or outright impossible to implement in a generic memory subsystem.

The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which, similar to the generic memfd_create(), creates an anonymous file and returns a file descriptor that refers to it. Again like "regular" memfd files, guest_memfd files live in RAM, have volatile storage, and are automatically released when the last reference is dropped. The key differences from memfd files (and every other memory subsystem) are that guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to convert a guest memory area between the shared and guest-private states.

A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to specify attributes for a given page of guest memory. In the long term, it will likely be extended to allow userspace to specify per-gfn RWX protections, including allowing memory to be writable in the guest without it also being writable in host userspace.

The immediate and driving use case for guest_memfd is Confidential (CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM. For such use cases, being able to map memory into KVM guests without requiring said memory to be mapped into the host is a hard requirement. While SEV+ and TDX prevent untrusted software from reading guest private data by encrypting guest memory, pKVM provides confidentiality and integrity *without* relying on memory encryption. In addition, with SEV-SNP and especially TDX, accessing guest private memory can be fatal to the host, i.e. KVM must prevent host userspace from accessing guest memory irrespective of hardware behavior.

Long term, guest_memfd may be useful for use cases beyond CoCo VMs, for example hardening userspace against unintentional accesses to guest memory. KVM's ABI uses userspace VMA protections to define the allowed guest protections (with an exception granted to mapping guest memory executable), and similarly KVM currently requires the guest mapping size to be a strict subset of the host userspace mapping size. Decoupling the mapping sizes would allow userspace to precisely map only what is needed and with the required permissions, without impacting guest performance.

A guest-first memory subsystem also provides a clearer line of sight to things like a dedicated memory pool (for slice-of-hardware VMs) and elimination of "struct page" (for offload setups where userspace _never_ needs to DMA from or into guest memory).

guest_memfd is the result of 3+ years of development and exploration; taking on memory management responsibilities in KVM was not the first, second, or even third choice for supporting CoCo VMs. But after many failed attempts to avoid KVM-specific backing memory, and looking at where things ended up, it is quite clear that of all approaches tried, guest_memfd is the simplest, most robust, and most extensible, and the right thing to do for KVM and the kernel at large.

The "development cycle" for this version is going to be very short; ideally, next week I will merge it as is in kvm/next, taking this through the KVM tree for 6.8 immediately after the end of the merge window. The series is still based on 6.6 (plus KVM changes for 6.7) so it will require a small fixup for changes to get_file_rcu() introduced in 6.7 by commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU"). The fixup will be done as part of the merge commit, and most of the text above will become the commit message for the merge.

Pending post-merge work includes:

 - hugepage support
 - looking into using the restrictedmem framework for guest memory
 - introducing a testing mechanism to poison memory, possibly using the same memory attributes introduced here
 - SNP and TDX support

There are two non-KVM patches buried in the middle of this series:

  fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

The first is small and mostly suggested by Christian Brauner; the second is a bit less so, but it was written by an mm person (Vlastimil Babka).

289 lines
8.4 KiB
C
// SPDX-License-Identifier: GPL-2.0-only
/*
 * fs/anon_inodes.c
 *
 * Copyright (C) 2007 Davide Libenzi <davidel@xmailserver.org>
 *
 * Thanks to Arnd Bergmann for code review and suggestions.
 * More changes for Thomas Gleixner suggestions.
 *
 */

#include <linux/cred.h>
#include <linux/file.h>
#include <linux/poll.h>
#include <linux/sched.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/mount.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/magic.h>
#include <linux/anon_inodes.h>
#include <linux/pseudo_fs.h>

#include <linux/uaccess.h>

static struct vfsmount *anon_inode_mnt __ro_after_init;
static struct inode *anon_inode_inode __ro_after_init;

/*
 * anon_inodefs_dname() is called from d_path().
 */
static char *anon_inodefs_dname(struct dentry *dentry, char *buffer, int buflen)
{
	return dynamic_dname(buffer, buflen, "anon_inode:%s",
				dentry->d_name.name);
}

static const struct dentry_operations anon_inodefs_dentry_operations = {
	.d_dname	= anon_inodefs_dname,
};

static int anon_inodefs_init_fs_context(struct fs_context *fc)
{
	struct pseudo_fs_context *ctx = init_pseudo(fc, ANON_INODE_FS_MAGIC);
	if (!ctx)
		return -ENOMEM;
	ctx->dops = &anon_inodefs_dentry_operations;
	return 0;
}

static struct file_system_type anon_inode_fs_type = {
	.name		= "anon_inodefs",
	.init_fs_context = anon_inodefs_init_fs_context,
	.kill_sb	= kill_anon_super,
};

static struct inode *anon_inode_make_secure_inode(
	const char *name,
	const struct inode *context_inode)
{
	struct inode *inode;
	const struct qstr qname = QSTR_INIT(name, strlen(name));
	int error;

	inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
	if (IS_ERR(inode))
		return inode;
	inode->i_flags &= ~S_PRIVATE;
	error =	security_inode_init_security_anon(inode, &qname, context_inode);
	if (error) {
		iput(inode);
		return ERR_PTR(error);
	}
	return inode;
}

static struct file *__anon_inode_getfile(const char *name,
					 const struct file_operations *fops,
					 void *priv, int flags,
					 const struct inode *context_inode,
					 bool make_inode)
{
	struct inode *inode;
	struct file *file;

	if (fops->owner && !try_module_get(fops->owner))
		return ERR_PTR(-ENOENT);

	if (make_inode) {
		inode =	anon_inode_make_secure_inode(name, context_inode);
		if (IS_ERR(inode)) {
			file = ERR_CAST(inode);
			goto err;
		}
	} else {
		inode =	anon_inode_inode;
		if (IS_ERR(inode)) {
			file = ERR_PTR(-ENODEV);
			goto err;
		}
		/*
		 * We know the anon_inode inode count is always
		 * greater than zero, so ihold() is safe.
		 */
		ihold(inode);
	}

	file = alloc_file_pseudo(inode, anon_inode_mnt, name,
				 flags & (O_ACCMODE | O_NONBLOCK), fops);
	if (IS_ERR(file))
		goto err_iput;

	file->f_mapping = inode->i_mapping;

	file->private_data = priv;

	return file;

err_iput:
	iput(inode);
err:
	module_put(fops->owner);
	return file;
}

/**
 * anon_inode_getfile - creates a new file instance by hooking it up to an
 *                      anonymous inode, and a dentry that describe the "class"
 *                      of the file
 *
 * @name: [in] name of the "class" of the new file
 * @fops: [in] file operations for the new file
 * @priv: [in] private data for the new file (will be file's private_data)
 * @flags: [in] flags
 *
 * Creates a new file by hooking it on a single inode. This is useful for files
 * that do not need to have a full-fledged inode in order to operate correctly.
 * All the files created with anon_inode_getfile() will share a single inode,
 * hence saving memory and avoiding code duplication for the file/inode/dentry
 * setup. Returns the newly created file* or an error pointer.
 */
struct file *anon_inode_getfile(const char *name,
				const struct file_operations *fops,
				void *priv, int flags)
{
	return __anon_inode_getfile(name, fops, priv, flags, NULL, false);
}
EXPORT_SYMBOL_GPL(anon_inode_getfile);

/**
 * anon_inode_create_getfile - Like anon_inode_getfile(), but creates a new
 *                             !S_PRIVATE anon inode rather than reuse the
 *                             singleton anon inode and calls the
 *                             inode_init_security_anon() LSM hook.
 *
 * @name: [in] name of the "class" of the new file
 * @fops: [in] file operations for the new file
 * @priv: [in] private data for the new file (will be file's private_data)
 * @flags: [in] flags
 * @context_inode:
 *        [in] the logical relationship with the new inode (optional)
 *
 * Create a new anonymous inode and file pair. This can be done for two
 * reasons:
 *
 * - for the inode to have its own security context, so that LSMs can enforce
 *   policy on the inode's creation;
 *
 * - if the caller needs a unique inode, for example in order to customize
 *   the size returned by fstat()
 *
 * The LSM may use @context_inode in inode_init_security_anon(), but a
 * reference to it is not held.
 *
 * Returns the newly created file* or an error pointer.
 */
struct file *anon_inode_create_getfile(const char *name,
				       const struct file_operations *fops,
				       void *priv, int flags,
				       const struct inode *context_inode)
{
	return __anon_inode_getfile(name, fops, priv, flags,
				    context_inode, true);
}
EXPORT_SYMBOL_GPL(anon_inode_create_getfile);

static int __anon_inode_getfd(const char *name,
			      const struct file_operations *fops,
			      void *priv, int flags,
			      const struct inode *context_inode,
			      bool make_inode)
{
	int error, fd;
	struct file *file;

	error = get_unused_fd_flags(flags);
	if (error < 0)
		return error;
	fd = error;

	file = __anon_inode_getfile(name, fops, priv, flags, context_inode,
				    make_inode);
	if (IS_ERR(file)) {
		error = PTR_ERR(file);
		goto err_put_unused_fd;
	}
	fd_install(fd, file);

	return fd;

err_put_unused_fd:
	put_unused_fd(fd);
	return error;
}

/**
 * anon_inode_getfd - creates a new file instance by hooking it up to
 *                    an anonymous inode and a dentry that describe
 *                    the "class" of the file
 *
 * @name: [in] name of the "class" of the new file
 * @fops: [in] file operations for the new file
 * @priv: [in] private data for the new file (will be file's private_data)
 * @flags: [in] flags
 *
 * Creates a new file by hooking it on a single inode. This is
 * useful for files that do not need to have a full-fledged inode in
 * order to operate correctly. All the files created with
 * anon_inode_getfd() will use the same singleton inode, reducing
 * memory use and avoiding code duplication for the file/inode/dentry
 * setup. Returns a newly created file descriptor or an error code.
 */
int anon_inode_getfd(const char *name, const struct file_operations *fops,
		     void *priv, int flags)
{
	return __anon_inode_getfd(name, fops, priv, flags, NULL, false);
}
EXPORT_SYMBOL_GPL(anon_inode_getfd);

/**
 * anon_inode_create_getfd - Like anon_inode_getfd(), but creates a new
 * !S_PRIVATE anon inode rather than reuse the singleton anon inode, and calls
 * the inode_init_security_anon() LSM hook.
 *
 * @name: [in] name of the "class" of the new file
 * @fops: [in] file operations for the new file
 * @priv: [in] private data for the new file (will be file's private_data)
 * @flags: [in] flags
 * @context_inode:
 *        [in] the logical relationship with the new inode (optional)
 *
 * Create a new anonymous inode and file pair. This can be done for two
 * reasons:
 *
 * - for the inode to have its own security context, so that LSMs can enforce
 *   policy on the inode's creation;
 *
 * - if the caller needs a unique inode, for example in order to customize
 *   the size returned by fstat()
 *
 * The LSM may use @context_inode in inode_init_security_anon(), but a
 * reference to it is not held.
 *
 * Returns a newly created file descriptor or an error code.
 */
int anon_inode_create_getfd(const char *name, const struct file_operations *fops,
			    void *priv, int flags,
			    const struct inode *context_inode)
{
	return __anon_inode_getfd(name, fops, priv, flags, context_inode, true);
}

static int __init anon_inode_init(void)
{
	anon_inode_mnt = kern_mount(&anon_inode_fs_type);
	if (IS_ERR(anon_inode_mnt))
		panic("anon_inode_init() kernel mount failed (%ld)\n", PTR_ERR(anon_inode_mnt));

	anon_inode_inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
	if (IS_ERR(anon_inode_inode))
		panic("anon_inode_init() inode allocation failed (%ld)\n", PTR_ERR(anon_inode_inode));

	return 0;
}

fs_initcall(anon_inode_init);
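
The effect of anon_inodefs_dname() can be observed from userspace. The following is a minimal sketch, not part of this file: it assumes a Linux system with /proc mounted, and it assumes (true of mainline kernels as far as I know) that eventfd() is backed by the anon_inode helpers above. The helpers fd_link_name() and eventfd_is_anon_inode() are hypothetical names made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the target of the /proc/self/fd/<fd> symlink
 * into buf as a NUL-terminated string. The kernel builds that target via
 * d_path(), which for anon inode files ends up in anon_inodefs_dname().
 * Returns 0 on success, -1 on error.
 */
static int fd_link_name(int fd, char *buf, size_t buflen)
{
	char path[64];
	ssize_t len;

	assert(buf != NULL);
	if (buflen == 0)
		return -1;

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	len = readlink(path, buf, buflen - 1);
	if (len < 0)
		return -1;

	buf[len] = '\0';
	return 0;
}

/*
 * Hypothetical helper: create an eventfd and check whether its name
 * starts with the "anon_inode:" prefix from the format string above.
 * Returns 1 if it does, 0 if it does not, -1 on error.
 */
static int eventfd_is_anon_inode(void)
{
	static const char prefix[] = "anon_inode:";
	char name[128];
	int fd, ret;

	fd = eventfd(0, EFD_CLOEXEC);
	if (fd < 0)
		return -1;

	ret = fd_link_name(fd, name, sizeof(name));
	close(fd);
	if (ret < 0)
		return -1;

	return strncmp(name, prefix, strlen(prefix)) == 0;
}
```

Calling eventfd_is_anon_inode() from a small main() should return 1 on such a system; the part of the name after the prefix is whatever "class" string the creating subsystem passed as @name.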