linux

iv/linux

Author	SHA1	Message	Date
Sean Christopherson	106ed2cad9	KVM: SVM: WARN, but continue, if misc_cg_set_capacity() fails WARN and continue if misc_cg_set_capacity() fails, as the only scenario in which it can fail is if the specified resource is invalid, which should never happen when CONFIG_KVM_AMD_SEV=y. Deliberately not bailing "fixes" a theoretical bug where KVM would leak the ASID bitmaps on failure, which again can't happen. If the impossible should happen, the end result is effectively the same with respect to SEV and SEV-ES (they are unusable), while continuing on has the advantage of letting KVM load, i.e. userspace can still run non-SEV guests. Reported-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Link: https://lore.kernel.org/r/20230607004449.1421131-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-13 09:20:26 -07:00
Sean Christopherson	0b210faf33	KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages Add a "never" option to the nx_huge_pages module param to allow userspace to do a one-way hard disabling of the mitigation, and don't create the per-VM recovery threads when the mitigation is hard disabled. Letting userspace pinky swear that userspace doesn't want to enable NX mitigation (without reloading KVM) allows certain use cases to avoid the latency problems associated with spawning a kthread for each VM. E.g. in FaaS use cases, the guest kernel is trusted and the host may create 100+ VMs per logical CPU, which can result in 100ms+ latencies when a burst of VMs is created. Reported-by: Li RongQing <lirongqing@baidu.com> Closes: https://lore.kernel.org/all/1679555884-32544-1-git-send-email-lirongqing@baidu.com Cc: Yong He <zhuangel570@gmail.com> Cc: Robert Hoo <robert.hoo.linux@gmail.com> Cc: Kai Huang <kai.huang@intel.com> Reviewed-by: Robert Hoo <robert.hoo.linux@gmail.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Luiz Capitulino <luizcap@amazon.com> Reviewed-by: Li RongQing <lirongqing@baidu.com> Link: https://lore.kernel.org/r/20230602005859.784190-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-13 09:16:03 -07:00
Sean Christopherson	a306425708	KVM: x86: Update comments about MSR lists exposed to userspace Refresh comments about msrs_to_save, emulated_msrs, and msr_based_features to remove stale references left behind by commit 2374b7310b66 (KVM: x86/pmu: Use separate array for defining "PMU MSRs to save"), and to better reflect the current reality, e.g. emulated_msrs is no longer just for MSRs that are "kvm-specific". Reported-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20230607004636.1421424-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-13 09:09:07 -07:00
Jakob Koschel	1e327963cf	x86/sgx: Avoid using iterator after loop in sgx_mmu_notifier_release() If &encl_mm->encl->mm_list does not contain the searched 'encl_mm', 'tmp' will not point to a valid sgx_encl_mm struct. Linus proposed to avoid any use of the list iterator variable after the loop, in the attempt to move the list iterator variable declaration into the macro to avoid any potential misuse after the loop. Using it in a pointer comparison after the loop is undefined behavior and should be omitted if possible, see Link tag. Instead, just use a 'found' boolean to indicate if an element was found. [ bp: Massage, fix typos. ] Signed-off-by: Jakob Koschel <jkl820.git@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ Link: https://lore.kernel.org/r/20230206-sgx-use-after-iter-v2-1-736ca621adc3@gmail.com	2023-06-13 16:21:01 +02:00
Linus Torvalds	fb054096ae	19 hotfixes. 14 are cc:stable and the remainder address issues which were introduced during this -rc cycle or which were considered inappropriate for a backport. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZIdw7QAKCRDdBJ7gKXxA jki4AQCygi1UoqVPq4N/NzJbv2GaNDXNmcJIoLvPpp3MYFhucAEAtQNzAYO9z6CT iLDMosnuh+1KLTaKNGL5iak3NAxnxQw= =mTdI -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "19 hotfixes. 14 are cc:stable and the remainder address issues which were introduced during this development cycle or which were considered inappropriate for a backport" * tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: zswap: do not shrink if cgroup may not zswap page cache: fix page_cache_next/prev_miss off by one ocfs2: check new file size on fallocate call mailmap: add entry for John Keeping mm/damon/core: fix divide error in damon_nr_accesses_to_accesses_bp() epoll: ep_autoremove_wake_function should use list_del_init_careful mm/gup_test: fix ioctl fail for compat task nilfs2: reject devices with insufficient block count ocfs2: fix use-after-free when unmounting read-only filesystem lib/test_vmalloc.c: avoid garbage in page array nilfs2: fix possible out-of-bounds segment allocation in resize ioctl riscv/purgatory: remove PGO flags powerpc/purgatory: remove PGO flags x86/purgatory: remove PGO flags kexec: support purgatories with .text.hot sections mm/uffd: allow vma to merge as much as possible mm/uffd: fix vma operation where start addr cuts part of vma radix-tree: move declarations to header nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()	2023-06-12 16:14:34 -07:00
Ricardo Ribalda	97b6b9cbba	x86/purgatory: remove PGO flags If profile-guided optimization is enabled, the purgatory ends up with multiple .text sections. This is not supported by kexec and crashes the system. Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-2-b05c520b7296@chromium.org Fixes: 930457057abe ("kernel/kexec_file.c: split up __kexec_load_puragory") Signed-off-by: Ricardo Ribalda <ribalda@chromium.org> Cc: <stable@vger.kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Philipp Rudo <prudo@redhat.com> Cc: Ross Zwisler <zwisler@google.com> Cc: Simon Horman <horms@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Rix <trix@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-12 11:31:50 -07:00
Yonghong Song	ad96f1c913	bpf: Fix a bpf_jit_dump issue for x86_64 with sysctl bpf_jit_enable. The sysctl net/core/bpf_jit_enable does not work now due to commit 1022a5498f6f ("bpf, x86_64: Use bpf_jit_binary_pack_alloc"). The commit saved the jitted insns into 'rw_image' instead of 'image' which caused bpf_jit_dump not dumping proper content. With 'echo 2 > /proc/sys/net/core/bpf_jit_enable', run './test_progs -t fentry_test'. Without this patch, one of jitted image for one particular prog is: flen=17 proglen=92 pass=4 image=0000000014c64883 from=test_progs pid=1807 00000000: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 00000010: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 00000020: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 00000030: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 00000040: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 00000050: cc cc cc cc cc cc cc cc cc cc cc cc With this patch, the jitte image for the same prog is: flen=17 proglen=92 pass=4 image=00000000b90254b7 from=test_progs pid=1809 00000000: f3 0f 1e fa 0f 1f 44 00 00 66 90 55 48 89 e5 f3 00000010: 0f 1e fa 31 f6 48 8b 57 00 48 83 fa 07 75 2b 48 00000020: 8b 57 10 83 fa 09 75 22 48 8b 57 08 48 81 e2 ff 00000030: 00 00 00 48 83 fa 08 75 11 48 8b 7f 18 be 01 00 00000040: 00 00 48 83 ff 0a 74 02 31 f6 48 bf 18 d0 14 00 00000050: 00 c9 ff ff 48 89 77 00 31 c0 c9 c3 Fixes: 1022a5498f6f ("bpf, x86_64: Use bpf_jit_binary_pack_alloc") Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20230609005439.3173569-1-yhs@fb.com	2023-06-12 16:47:18 +02:00
Borislav Petkov (AMD)	a32b0f0db3	x86/microcode/AMD: Load late on both threads too Do the same as early loading - load on both threads. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20230605141332.25948-1-bp@alien8.de	2023-06-12 11:02:17 +02:00
Linus Torvalds	4c605260bc	- Set up the kernel CS earlier in the boot process in case EFI boots the kernel after bypassing the decompressor and the CS descriptor used ends up being the EFI one which is not mapped in the identity page table, leading to early SEV/SNP guest communication exceptions resulting in the guest crashing -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSFqi8ACgkQEsHwGGHe VUoJWhAAqSYKHVMuOebeCiS4mF7gKIdwE5UwxF5vVWa7nwOLxRUygdFTweyjqV2n bKoGGNwEquYxhKUoRjrBr+dxZXx6qapDS8oGUL9Ndus93Zs/zIe6KF23EJbkZKoy 4uh37D0G2lgA+E3Ke/hX5ac94AYtcTd8cfSw63GIs10vt/bsupdhreSY3e2Z8zcQ e2OngvL/PUX06g4/wXYsAlQszRwyPwJO++y82OdqisJXJLV65fxui7YRS8g4Koh2 DjKHmQyuFTN3D40C7F7vlH0iq9+kAhDpMaG6lL2/QaWGQSAA4sw9qArxy9JTDATw 0Dk+L4iHimPFopke1z6rPG13TRhtLt0cWGCp1/cKQA6w6/eBvpOdpkxUeL7kF8UP yZMh3XeBRWIOewfXeN+sjhsetVHSK3dMRPppN/pry0bgTUWbGBo7nnl3QgrdYyrk l+BigY0JTOBRzkk3ECqCcR88jYcI1jNm/iaqCuPwqhkpanElKryD268cu4FINz60 UFDlrKiEVmQhMrf6MJji3eJec6CezjDtTfuboPPLyUxz/At/5khcxEtAalmieYpy WZmj2hlG9Mzdfv5TA3JX5BvOt7ODicf7wtxJ3W3qxz2iUMJ3uCO0fSn1b9sJz0L7 LwRZ7uskHz5J122AQhEEDq0T6n0rY4GBdZhzONN64wXttSGJTqY= =yaY/ -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_v6.4_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Borislav Petkov: - Set up the kernel CS earlier in the boot process in case EFI boots the kernel after bypassing the decompressor and the CS descriptor used ends up being the EFI one which is not mapped in the identity page table, leading to early SEV/SNP guest communication exceptions resulting in the guest crashing * tag 'x86_urgent_for_v6.4_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/head/64: Switch to KERNEL_CS as soon as new GDT is installed	2023-06-11 10:14:02 -07:00
Arnd Bergmann	af0a76e126	thread_info: move function declarations to linux/thread_info.h There are a few __weak functions in kernel/fork.c, which architectures can override. If there is no prototype, the compiler warns about them: kernel/fork.c:164:13: error: no previous prototype for 'arch_release_task_struct' [-Werror=missing-prototypes] kernel/fork.c:991:20: error: no previous prototype for 'arch_task_cache_init' [-Werror=missing-prototypes] kernel/fork.c:1086:12: error: no previous prototype for 'arch_dup_task_struct' [-Werror=missing-prototypes] There are already prototypes in a number of architecture specific headers that have addressed those warnings before, but it's much better to have these in a single place so the warning no longer shows up anywhere. Link: https://lkml.kernel.org/r/20230517131102.934196-14-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Paris <eparis@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 17:44:16 -07:00
Arnd Bergmann	ad1a48301f	init: consolidate prototypes in linux/init.h The init/main.c file contains some extern declarations for functions defined in architecture code, and it defines some other functions that are called from architecture code with a custom prototype. Both of those result in warnings with 'make W=1': init/calibrate.c:261:37: error: no previous prototype for 'calibrate_delay_is_known' [-Werror=missing-prototypes] init/main.c:790:20: error: no previous prototype for 'mem_encrypt_init' [-Werror=missing-prototypes] init/main.c:792:20: error: no previous prototype for 'poking_init' [-Werror=missing-prototypes] arch/arm64/kernel/irq.c:122:13: error: no previous prototype for 'init_IRQ' [-Werror=missing-prototypes] arch/arm64/kernel/time.c:55:13: error: no previous prototype for 'time_init' [-Werror=missing-prototypes] arch/x86/kernel/process.c:935:13: error: no previous prototype for 'arch_post_acpi_subsys_init' [-Werror=missing-prototypes] init/calibrate.c:261:37: error: no previous prototype for 'calibrate_delay_is_known' [-Werror=missing-prototypes] kernel/fork.c:991:20: error: no previous prototype for 'arch_task_cache_init' [-Werror=missing-prototypes] Add prototypes for all of these in include/linux/init.h or another appropriate header, and remove the duplicate declarations from architecture specific code. [sfr@canb.auug.org.au: declare time_init_early()] Link: https://lkml.kernel.org/r/20230519124311.5167221c@canb.auug.org.au Link: https://lkml.kernel.org/r/20230517131102.934196-12-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Paris <eparis@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 17:44:16 -07:00
Lorenzo Stoakes	54d020692b	mm/gup: remove unused vmas parameter from get_user_pages() Patch series "remove the vmas parameter from GUP APIs", v6. (pin_/get)_user_pages[_remote]() each provide an optional output parameter for an array of VMA objects associated with each page in the input range. These provide the means for VMAs to be returned, as long as mm->mmap_lock is never released during the GUP operation (i.e. the internal flag FOLL_UNLOCKABLE is not specified). In addition, these VMAs can only be accessed with the mmap_lock held and become invalidated the moment it is released. The vast majority of invocations do not use this functionality and of those that do, all but one case retrieve a single VMA to perform checks upon. It is not egregious in the single VMA cases to simply replace the operation with a vma_lookup(). In these cases we duplicate the (fast) lookup on a slow path already under the mmap_lock, abstracted to a new get_user_page_vma_remote() inline helper function which also performs error checking and reference count maintenance. The special case is io_uring, where io_pin_pages() specifically needs to assert that the VMAs underlying the range do not result in broken long-term GUP file-backed mappings. As GUP now internally asserts that FOLL_LONGTERM mappings are not file-backed in a broken fashion (i.e. requiring dirty tracking) - as implemented in "mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings" - this logic is no longer required and so we can simply remove it altogether from io_uring. Eliminating the vmas parameter eliminates an entire class of danging pointer errors that might have occured should the lock have been incorrectly released. In addition, the API is simplified and now clearly expresses what it is intended for - applying the specified GUP flags and (if pinning) returning pinned pages. This change additionally opens the door to further potential improvements in GUP and the possible marrying of disparate code paths. I have run this series against gup_test with no issues. Thanks to Matthew Wilcox for suggesting this refactoring! This patch (of 6): No invocation of get_user_pages() use the vmas parameter, so remove it. The GUP API is confusing and caveated. Recent changes have done much to improve that, however there is more we can do. Exporting vmas is a prime target as the caller has to be extremely careful to preclude their use after the mmap_lock has expired or otherwise be left with dangling pointers. Removing the vmas parameter focuses the GUP functions upon their primary purpose - pinning (and outputting) pages as well as performing the actions implied by the input flags. This is part of a patch series aiming to remove the vmas parameter altogether. Link: https://lkml.kernel.org/r/cover.1684350871.git.lstoakes@gmail.com Link: https://lkml.kernel.org/r/589e0c64794668ffc799651e8d85e703262b1e9d.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Christian König <christian.koenig@amd.com> (for radeon parts) Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Sean Christopherson <seanjc@google.com> (KVM) Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 16:25:25 -07:00
Nhat Pham	cf264e1329	cachestat: implement cachestat syscall There is currently no good way to query the page cache state of large file sets and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really doesn not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in the following thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This patch implements a new syscall that queries cache state of a file and summarizes the number of cached pages, number of dirty pages, number of pages marked for writeback, number of (recently) evicted pages, etc. in a given range. Currently, the syscall is only wired in for x86 architecture. NAME cachestat - query the page cache statistics of a file. SYNOPSIS #include <sys/mman.h> struct cachestat_range { __u64 off; __u64 len; }; struct cachestat { __u64 nr_cache; __u64 nr_dirty; __u64 nr_writeback; __u64 nr_evicted; __u64 nr_recently_evicted; }; int cachestat(unsigned int fd, struct cachestat_range cstat_range, struct cachestat cstat, unsigned int flags); DESCRIPTION cachestat() queries the number of cached pages, number of dirty pages, number of pages marked for writeback, number of evicted pages, number of recently evicted pages, in the bytes range given by `off` and `len`. An evicted page is a page that is previously in the page cache but has been evicted since. A page is recently evicted if its last eviction was recent enough that its reentry to the cache would indicate that it is actively being used by the system, and that there is memory pressure on the system. These values are returned in a cachestat struct, whose address is given by the `cstat` argument. The `off` and `len` arguments must be non-negative integers. If `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` == 0, we will query in the range from `off` to the end of the file. The `flags` argument is unused for now, but is included for future extensibility. User should pass 0 (i.e no flag specified). Currently, hugetlbfs is not supported. Because the status of a page can change after cachestat() checks it but before it returns to the application, the returned values may contain stale information. RETURN VALUE On success, cachestat returns 0. On error, -1 is returned, and errno is set to indicate the error. ERRORS EFAULT cstat or cstat_args points to an invalid address. EINVAL invalid flags. EBADF invalid file descriptor. EOPNOTSUPP file descriptor is of a hugetlbfs file [nphamcs@gmail.com: replace rounddown logic with the existing helper] Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 16:25:16 -07:00
Ingo Molnar	301cf77e21	x86/orc: Make the is_callthunk() definition depend on CONFIG_BPF_JIT=y Recent commit: 020126239b8f Revert "x86/orc: Make it callthunk aware" Made the only user of is_callthunk() depend on CONFIG_BPF_JIT=y, while the definition of the helper function is unconditional. Move is_callthunk() inside the #ifdef block. Addresses this build failure: arch/x86/kernel/callthunks.c:296:13: error: ‘is_callthunk’ defined but not used [-Werror=unused-function] Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: linux-kernel@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org>	2023-06-09 11:09:04 +02:00
Juergen Gross	78841cd185	x86/mm: Remove Xen-PV leftovers from init_32.c There are still some unneeded paravirt calls in arch/x86/mm/init_32.c. Remove them. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230609055100.12633-1-jgross@suse.com	2023-06-09 11:00:21 +02:00
Michael Kelley	504dba50b0	x86/irq: Add hardcoded hypervisor interrupts to /proc/stat Some hypervisor interrupts (such as for Hyper-V VMbus and Hyper-V timers) have hardcoded interrupt vectors on x86 and don't have Linux IRQs assigned. These interrupts are shown in /proc/interrupts, but are not reported in the first field of the "intr" line in /proc/stat because the x86 version of arch_irq_stat_cpu() doesn't include them. Fix this by adding code to arch_irq_stat_cpu() to include these interrupts, similar to existing interrupts that don't have Linux IRQs. Use #if IS_ENABLED() because unlike all the other nearby #ifdefs, CONFIG_HYPERV can be built as a module. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/1677523568-50263-1-git-send-email-mikelley%40microsoft.com	2023-06-08 08:28:08 -07:00
Josh Poimboeuf	a9da824762	drm/vmwgfx: Add unwind hints around RBP clobber VMware high-bandwidth hypercalls take the RBP register as input. This breaks basic frame pointer convention, as RBP should never be clobbered. So frame pointer unwinding is broken for the instructions surrounding the hypercalls. Fortunately this doesn't break live patching with CONFIG_FRAME_POINTER, as it only unwinds from blocking tasks, and stack traces from preempted tasks are already marked unreliable anyway. However, for live patching with ORC, this could actually be a theoretical problem if vmw_port_hb_{in,out}() were still compiled with a frame pointer due to having an aligned stack. In practice that hasn't seemed to be an issue since the objtool warnings have only been seen with CONFIG_FRAME_POINTER. Add unwind hint annotations to tell the ORC unwinder to mark stack traces as unreliable. Fixes the following warnings: vmlinux.o: warning: objtool: vmw_port_hb_in+0x1df: return with modified stack frame vmlinux.o: warning: objtool: vmw_port_hb_out+0x1dd: return with modified stack frame Fixes: 89da76fde68d ("drm/vmwgfx: Add VMWare host messaging capability") Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305160135.97q0Elax-lkp@intel.com/ Link: https://lore.kernel.org/r/4c795f2d87bc0391cf6543bcb224fa540b55ce4b.1685981486.git.jpoimboe@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>	2023-06-07 10:03:12 -07:00
Josh Poimboeuf	ac27ecf68a	x86/entry: Move thunk restore code into thunk functions There's no need for both thunk functions to jump to the same shared thunk restore code which lives outside the thunk function boundaries. It disrupts i-cache locality and confuses objtool. Keep it simple by keeping each thunk's restore code self-contained within the function. Fixes a bunch of false positive "missing __noreturn" warnings like: vmlinux.o: warning: objtool: do_arch_prctl_common+0xf4: preempt_schedule_thunk() is missing a __noreturn annotation Fixes: fedb724c3db5 ("objtool: Detect missing __noreturn annotations") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202305281037.3PaI3tW4-lkp@intel.com/ Link: https://lore.kernel.org/r/46aa8aeb716f302e22e1673ae15ee6fe050b41f4.1685488050.git.jpoimboe@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>	2023-06-07 09:54:45 -07:00
Josh Poimboeuf	020126239b	Revert "x86/orc: Make it callthunk aware" Commit 396e0b8e09e8 ("x86/orc: Make it callthunk aware") attempted to deal with the fact that function prefix code didn't have ORC coverage. However, it didn't work as advertised. Use of the "null" ORC entry just caused affected unwinds to end early. The root cause has now been fixed with commit 5743654f5e2e ("objtool: Generate ORC data for __pfx code"). Revert most of commit 396e0b8e09e8 ("x86/orc: Make it callthunk aware"). The is_callthunk() function remains as it's now used by other code. Link: https://lore.kernel.org/r/a05b916ef941da872cbece1ab3593eceabd05a79.1684245404.git.jpoimboe@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>	2023-06-07 09:48:57 -07:00
Bo Liu	7e980867ce	x86/mm: Remove repeated word in comments Remove the repeated word "the" in comments. Signed-off-by: Bo Liu <liubo03@inspur.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230504085446.2574-1-liubo03@inspur.com	2023-06-07 12:47:34 +02:00
Peter Newman	8da2b938eb	x86/resctrl: Implement rename op for mon groups To change the resources allocated to a large group of tasks, such as an application container, a container manager must write all of the tasks' IDs into the tasks file interface of the new control group. This is challenging when the container's task list is always changing. In addition, if the container manager is using monitoring groups to separately track the bandwidth of containers assigned to the same control group, when moving a container, it must first move the container's tasks to the default monitoring group of the new control group before it can move these tasks into the container's replacement monitoring group under the destination control group. This is undesirable because it makes bandwidth usage during the move unattributable to the correct tasks and resets monitoring event counters and cache usage information for the group. Implement the rename operation only for resctrlfs monitor groups to enable users to move a monitoring group from one control group to another. This effects a change in resources allocated to all the tasks in the monitoring group while otherwise leaving the monitoring data intact. Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/20230419125015.693566-3-peternewman@google.com	2023-06-07 12:40:36 +02:00
Peter Newman	c45c06d4ae	x86/resctrl: Factor rdtgroup lock for multi-file ops rdtgroup_kn_lock_live() can only release a kernfs reference for a single file before waiting on the rdtgroup_mutex, limiting its usefulness for operations on multiple files, such as rename. Factor the work needed to respectively break and unbreak active protection on an individual file into rdtgroup_kn_{get,put}(). No functional change. Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/20230419125015.693566-2-peternewman@google.com	2023-06-07 12:15:18 +02:00
Like Xu	94cdeebd82	KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022 CPUID leaf 0x80000022 i.e. ExtPerfMonAndDbg advertises some new performance monitoring features for AMD processors. Bit 0 of EAX indicates support for Performance Monitoring Version 2 (PerfMonV2) features. If found to be set during PMU initialization, the EBX bits of the same CPUID function can be used to determine the number of available PMCs for different PMU types. Expose the relevant bits via KVM_GET_SUPPORTED_CPUID so that guests can make use of the PerfMonV2 features. Co-developed-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230603011058.1038821-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	4a2771895c	KVM: x86/svm/pmu: Add AMD PerfMonV2 support If AMD Performance Monitoring Version 2 (PerfMonV2) is detected by the guest, it can use a new scheme to manage the Core PMCs using the new global control and status registers. In addition to benefiting from the PerfMonV2 functionality in the same way as the host (higher precision), the guest also can reduce the number of vm-exits by lowering the total number of MSRs accesses. In terms of implementation details, amd_is_valid_msr() is resurrected since three newly added MSRs could not be mapped to one vPMC. The possibility of emulating PerfMonV2 on the mainframe has also been eliminated for reasons of precision. Co-developed-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: drop "Based on the observed HW." comments] Link: https://lore.kernel.org/r/20230603011058.1038821-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	fe8d76c1a6	KVM: x86/cpuid: Add a KVM-only leaf to redirect AMD PerfMonV2 flag Add a KVM-only leaf for AMD's PerfMonV2 to redirect the kernel's scattered version to its architectural location, e.g. so that KVM can query guest support via guest_cpuid_has(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: massage changelog] Link: https://lore.kernel.org/r/20230603011058.1038821-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	1c2bf8a6b0	KVM: x86/pmu: Constrain the num of guest counters with kvm_pmu_cap Cap the number of general purpose counters enumerated on AMD to what KVM actually supports, i.e. don't allow userspace to coerce KVM into thinking there are more counters than actually exist, e.g. by enumerating X86_FEATURE_PERFCTR_CORE in guest CPUID when its not supported. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: massage changelog] Link: https://lore.kernel.org/r/20230603011058.1038821-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	d338d8789e	KVM: x86/pmu: Advertise PERFCTR_CORE iff the min nr of counters is met Enable and advertise PERFCTR_CORE if and only if the minimum number of required counters are available, i.e. if perf says there are less than six general purpose counters. Opportunistically, use kvm_cpu_cap_check_and_set() instead of open coding the check for host support. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: massage shortlog and changelog] Link: https://lore.kernel.org/r/20230603011058.1038821-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	6a08083f29	KVM: x86/pmu: Disable vPMU if the minimum num of counters isn't met Disable PMU support when running on AMD and perf reports fewer than four general purpose counters. All AMD PMUs must define at least four counters due to AMD's legacy architecture hardcoding the number of counters without providing a way to enumerate the number of counters to software, e.g. from AMD's APM: The legacy architecture defines four performance counters (PerfCtrn) and corresponding event-select registers (PerfEvtSeln). Virtualizing fewer than four counters can lead to guest instability as software expects four counters to be available. Rather than bleed AMD details into the common code, just define a const unsigned int and provide a convenient location to document why Intel and AMD have different mins (in particular, AMD's lack of any way to enumerate less than four counters to the guest). Keep the minimum number of counters at Intel at one, even though old P6 and Core Solo/Duo processor effectively require a minimum of two counters. KVM can, and more importantly has up until this point, supported a vPMU so long as the CPU has at least one counter. Perf's support for P6/Core CPUs does require two counters, but perf will happily chug along with a single counter when running on a modern CPU. Cc: Jim Mattson <jmattson@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: set Intel min to '1', not '2'] Link: https://lore.kernel.org/r/20230603011058.1038821-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	6593039d33	KVM: x86: Explicitly zero cpuid "0xa" leaf when PMU is disabled Add an explicit !enable_pmu check as relying on kvm_pmu_cap to be zeroed isn't obvious. Although when !enable_pmu, KVM will have zero-padded kvm_pmu_cap to do subsequent CPUID leaf assignments. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230603011058.1038821-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	13afa29ae4	KVM: x86/pmu: Provide Intel PMU's pmc_is_enabled() as generic x86 code Move the Intel PMU implementation of pmc_is_enabled() to common x86 code as pmc_is_globally_enabled(), and drop AMD's implementation. AMD PMU currently supports only v1, and thus not PERF_GLOBAL_CONTROL, thus the semantics for AMD are unchanged. And when support for AMD PMU v2 comes along, the common behavior will also Just Work. Signed-off-by: Like Xu <likexu@tencent.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230603011058.1038821-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	c85cdc1cc1	KVM: x86/pmu: Move handling PERF_GLOBAL_CTRL and friends to common x86 Move the handling of GLOBAL_CTRL, GLOBAL_STATUS, and GLOBAL_OVF_CTRL, a.k.a. GLOBAL_STATUS_RESET, from Intel PMU code to generic x86 PMU code. AMD PerfMonV2 defines three registers that have the same semantics as Intel's variants, just with different names and indices. Conveniently, since KVM virtualizes GLOBAL_CTRL on Intel only for PMU v2 and above, and AMD's version shows up in v2, KVM can use common code for the existence check as well. Signed-off-by: Like Xu <likexu@tencent.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230603011058.1038821-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	30dab5c0b6	KVM: x86/pmu: Reject userspace attempts to set reserved GLOBAL_STATUS bits Reject userspace writes to MSR_CORE_PERF_GLOBAL_STATUS that attempt to set reserved bits. Allowing userspace to stuff reserved bits doesn't harm KVM itself, but it's architecturally wrong and the guest can't clear the unsupported bits, e.g. makes the guest's PMI handler very confused. Signed-off-by: Like Xu <likexu@tencent.com> [sean: rewrite changelog to avoid use of #GP, rebase on name change] Link: https://lore.kernel.org/r/20230603011058.1038821-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	8de18543df	KVM: x86/pmu: Move reprogram_counters() to pmu.h Move reprogram_counters() out of Intel specific PMU code and into pmu.h so that it can be used to implement AMD PMU v2 support. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: rewrite changelog] Link: https://lore.kernel.org/r/20230603011058.1038821-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Sean Christopherson	53550b8922	KVM: x86/pmu: Rename global_ovf_ctrl_mask to global_status_mask Rename global_ovf_ctrl_mask to global_status_mask to avoid confusion now that Intel has renamed GLOBAL_OVF_CTRL to GLOBAL_STATUS_RESET in PMU v4. GLOBAL_OVF_CTRL and GLOBAL_STATUS_RESET are the same MSR index, i.e. are just different names for the same thing, but the SDM provides different entries in the IA-32 Architectural MSRs table, which gets really confusing when looking at PMU v4 definitions since it looks like GLOBAL_STATUS has bits that don't exist in GLOBAL_OVF_CTRL, but in reality the bits are simply defined in the GLOBAL_STATUS_RESET entry. No functional change intended. Cc: Like Xu <like.xu.linux@gmail.com> Link: https://lore.kernel.org/r/20230603011058.1038821-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Michal Luczaj	e12fa4b92a	KVM: x86: Clean up: remove redundant bool conversions As test_bit() returns bool, explicitly converting result to bool is unnecessary. Get rid of '!!'. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230605200158.118109-1-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:13:55 -07:00
Kirill A. Shutemov	94142c9d1b	x86/mm: Fix enc_status_change_finish_noop() enc_status_change_finish_noop() is now defined as always-fail, which doesn't make sense for noop. The change has no user-visible effect because it is only called if the platform has CC_ATTR_MEM_ENCRYPT. All platforms with the attribute override the callback with their own implementation. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/all/20230606095622.1939-4-kirill.shutemov%40linux.intel.com	2023-06-06 16:24:27 -07:00
Kirill A. Shutemov	195edce08b	x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad() tl;dr: There is a race in the TDX private<=>shared conversion code which could kill the TDX guest. Fix it by changing conversion ordering to eliminate the window. TDX hardware maintains metadata to track which pages are private and shared. Additionally, TDX guests use the guest x86 page tables to specify whether a given mapping is intended to be private or shared. Bad things happen when the intent and metadata do not match. So there are two thing in play: 1. "the page" -- the physical TDX page metadata 2. "the mapping" -- the guest-controlled x86 page table intent For instance, an unrecoverable exit to VMM occurs if a guest touches a private mapping that points to a shared physical page. In summary: * Private mapping => Private Page == OK (obviously) * Shared mapping => Shared Page == OK (obviously) * Private mapping => Shared Page == BIG BOOM! * Shared mapping => Private Page == OK-ish (It will read generate a recoverable #VE via handle_mmio()) Enter load_unaligned_zeropad(). It can touch memory that is adjacent but otherwise unrelated to the memory it needs to touch. It will cause one of those unrecoverable exits (aka. BIG BOOM) if it blunders into a shared mapping pointing to a private page. This is a problem when __set_memory_enc_pgtable() converts pages from shared to private. It first changes the mapping and second modifies the TDX page metadata. It's moving from: * Shared mapping => Shared Page == OK to: * Private mapping => Shared Page == BIG BOOM! This means that there is a window with a shared mapping pointing to a private page where load_unaligned_zeropad() can strike. Add a TDX handler for guest.enc_status_change_prepare(). This converts the page from shared to private before the page becomes private. This ensures that there is never a private mapping to a shared page. Leave a guest.enc_status_change_finish() in place but only use it for private=>shared conversions. This will delay updating the TDX metadata marking the page private until after the mapping matches the metadata. This also ensures that there is never a private mapping to a shared page. [ dhansen: rewrite changelog ] Fixes: 7dbde7631629 ("x86/mm/cpa: Add support for TDX shared memory") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/all/20230606095622.1939-3-kirill.shutemov%40linux.intel.com	2023-06-06 16:24:02 -07:00
Sean Christopherson	056b9919a1	KVM: x86: Use cpu_feature_enabled() for PKU instead of #ifdef Replace an #ifdef on CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS with a cpu_feature_enabled() check on X86_FEATURE_PKU. The macro magic of DISABLED_MASK_BIT_SET() means that cpu_feature_enabled() provides the same end result (no code generated) when PKU is disabled by Kconfig. No functional change intended. Cc: Jon Kohler <jon@nutanix.com> Reviewed-by: Jon Kohler <jon@nutanix.com> Link: https://lore.kernel.org/r/20230602010550.785722-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 15:15:06 -07:00
Sean Christopherson	0a3869e14d	KVM: x86/mmu: Trigger APIC-access page reload iff vendor code cares Request an APIC-access page reload when the backing page is migrated (or unmapped) if and only if vendor code actually plugs the backing pfn into structures that reside outside of KVM's MMU. This avoids kicking all vCPUs in the (hopefully infrequent) scenario where the backing page is migrated/invalidated. Unlike VMX's APICv, SVM's AVIC doesn't plug the backing pfn directly into the VMCB and so doesn't need a hook to invalidate an out-of-MMU "mapping". Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230602011518.787006-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 15:07:05 -07:00
Sean Christopherson	0a8a5f2c8c	KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access page Now that KVM honors past and in-progress mmu_notifier invalidations when reloading the APIC-access page, use KVM's "standard" invalidation hooks to trigger a reload and delete the one-off usage of invalidate_range(). Aside from eliminating one-off code in KVM, dropping KVM's use of invalidate_range() will allow common mmu_notifier to redefine the API to be more strictly focused on invalidating secondary TLBs that share the primary MMU's page tables. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230602011518.787006-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 15:07:05 -07:00
Sean Christopherson	878940b33d	KVM: VMX: Retry APIC-access page reload if invalidation is in-progress Re-request an APIC-access page reload if there is a relevant mmu_notifier invalidation in-progress when KVM retrieves the backing pfn, i.e. stall vCPUs until the backing pfn for the APIC-access page is "officially" stable. Relying on the primary MMU to not make changes after invoking ->invalidate_range() works, e.g. any additional changes to a PRESENT PTE would also trigger an ->invalidate_range(), but using ->invalidate_range() to fudge around KVM not honoring past and in-progress invalidations is a bit hacky. Honoring invalidations will allow using KVM's standard mmu_notifier hooks to detect APIC-access page reloads, which will in turn allow removing KVM's implementation of ->invalidate_range() (the APIC-access page case is a true one-off). Opportunistically add a comment to explain why doing nothing if a memslot isn't found is functionally correct. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230602011518.787006-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 15:07:04 -07:00
Kirill A. Shutemov	3f6819dd19	x86/mm: Allow guest.enc_status_change_prepare() to fail TDX code is going to provide guest.enc_status_change_prepare() that is able to fail. TDX will use the call to convert the GPA range from shared to private. This operation can fail. Add a way to return an error from the callback. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/all/20230606095622.1939-2-kirill.shutemov%40linux.intel.com	2023-06-06 11:07:01 -07:00
Alexander Mikhalitsyn	6d1bc9754b	KVM: SVM: enhance info printk's in SEV init Let's print available ASID ranges for SEV/SEV-ES guests. This information can be useful for system administrator to debug if SEV/SEV-ES fails to enable. There are a few reasons. SEV: - NPT is disabled (module parameter) - CPU lacks some features (sev, decodeassists) - Maximum SEV ASID is 0 SEV-ES: - mmio_caching is disabled (module parameter) - CPU lacks sev_es feature - Minimum SEV ASID value is 1 (can be adjusted in BIOS/UEFI) Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Stéphane Graber <stgraber@ubuntu.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Link: https://lore.kernel.org/r/20230522161249.800829-3-aleksandr.mikhalitsyn@canonical.com [sean: print '0' for min SEV-ES ASID if there are no available ASIDs] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 10:41:29 -07:00
Tom Lendacky	6c32117963	x86/sev: Add SNP-specific unaccepted memory support Add SNP-specific hooks to the unaccepted memory support in the boot path (__accept_memory()) and the core kernel (accept_memory()) in order to support booting SNP guests when unaccepted memory is present. Without this support, SNP guests will fail to boot and/or panic() when unaccepted memory is present in the EFI memory map. The process of accepting memory under SNP involves invoking the hypervisor to perform a page state change for the page to private memory and then issuing a PVALIDATE instruction to accept the page. Since the boot path and the core kernel paths perform similar operations, move the pvalidate_pages() and vmgexit_psc() functions into sev-shared.c to avoid code duplication. Create the new header file arch/x86/boot/compressed/sev.h because adding the function declaration to any of the existing SEV related header files pulls in too many other header files, causing the build to fail. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/a52fa69f460fd1876d70074b20ad68210dfc31dd.1686063086.git.thomas.lendacky@amd.com	2023-06-06 18:31:37 +02:00
Tom Lendacky	15d9088779	x86/sev: Use large PSC requests if applicable In advance of providing support for unaccepted memory, request 2M Page State Change (PSC) requests when the address range allows for it. By using a 2M page size, more PSC operations can be handled in a single request to the hypervisor. The hypervisor will determine if it can accommodate the larger request by checking the mapping in the nested page table. If mapped as a large page, then the 2M page request can be performed, otherwise the 2M page request will be broken down into 512 4K page requests. This is still more efficient than having the guest perform multiple PSC requests in order to process the 512 4K pages. In conjunction with the 2M PSC requests, attempt to perform the associated PVALIDATE instruction of the page using the 2M page size. If PVALIDATE fails with a size mismatch, then fallback to validating 512 4K pages. To do this, page validation is modified to work with the PSC structure and not just a virtual address range. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/050d17b460dfc237b51d72082e5df4498d3513cb.1686063086.git.thomas.lendacky@amd.com	2023-06-06 18:29:35 +02:00
Tom Lendacky	7006b75592	x86/sev: Allow for use of the early boot GHCB for PSC requests Using a GHCB for a page stage change (as opposed to the MSR protocol) allows for multiple pages to be processed in a single request. In prep for early PSC requests in support of unaccepted memory, update the invocation of vmgexit_psc() to be able to use the early boot GHCB and not just the per-CPU GHCB structure. In order to use the proper GHCB (early boot vs per-CPU), set a flag that indicates when the per-CPU GHCBs are available and registered. For APs, the per-CPU GHCBs are created before they are started and registered upon startup, so this flag can be used globally for the BSP and APs instead of creating a per-CPU flag. This will allow for a significant reduction in the number of MSR protocol page state change requests when accepting memory. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/d6cbb21f87f81eb8282dd3bf6c34d9698c8a4bbc.1686063086.git.thomas.lendacky@amd.com	2023-06-06 18:29:00 +02:00
Tom Lendacky	69dcb1e3bb	x86/sev: Put PSC struct on the stack in prep for unaccepted memory support In advance of providing support for unaccepted memory, switch from using kmalloc() for allocating the Page State Change (PSC) structure to using a local variable that lives on the stack. This is needed to avoid a possible recursive call into set_pages_state() if the kmalloc() call requires (more) memory to be accepted, which would result in a hang. The current size of the PSC struct is 2,032 bytes. To make the struct more stack friendly, reduce the number of PSC entries from 253 down to 64, resulting in a size of 520 bytes. This is a nice compromise on struct size and total PSC requests while still allowing parallel PSC operations across vCPUs. If the reduction in PSC entries results in any kind of performance issue (that is not seen at the moment), use of a larger static PSC struct, with fallback to the smaller stack version, can be investigated. For more background info on this decision, see the subthread in the Link: tag below. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/lkml/658c455c40e8950cb046dd885dd19dc1c52d060a.1659103274.git.thomas.lendacky@amd.com	2023-06-06 18:28:25 +02:00
Tom Lendacky	5dee19b6b2	x86/sev: Fix calculation of end address based on number of pages When calculating an end address based on an unsigned int number of pages, any value greater than or equal to 0x100000 that is shift PAGE_SHIFT bits results in a 0 value, resulting in an invalid end address. Change the number of pages variable in various routines from an unsigned int to an unsigned long to calculate the end address correctly. Fixes: 5e5ccff60a29 ("x86/sev: Add helper for validating pages in early enc attribute changes") Fixes: dc3f3d2474b8 ("x86/mm: Validate memory when changing the C-bit") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/6a6e4eea0e1414402bac747744984fa4e9c01bb6.1686063086.git.thomas.lendacky@amd.com	2023-06-06 18:27:20 +02:00
Kirill A. Shutemov	75d090fd16	x86/tdx: Add unaccepted memory support Hookup TDX-specific code to accept memory. Accepting the memory is done with ACCEPT_PAGE module call on every page in the range. MAP_GPA hypercall is not required as the unaccepted memory is considered private already. Extract the part of tdx_enc_status_changed() that does memory acceptance in a new helper. Move the helper tdx-shared.c. It is going to be used by both main kernel and decompressor. [ bp: Fix the INTEL_TDX_GUEST=y, KVM_GUEST=n build. ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230606142637.5171-10-kirill.shutemov@linux.intel.com	2023-06-06 18:25:57 +02:00
Chao Gao	02f1b0b736	KVM: x86: Correct the name for skipping VMENTER l1d flush There is no VMENTER_L1D_FLUSH_NESTED_VM. It should be ARCH_CAP_SKIP_VMENTRY_L1DFLUSH. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230524061634.54141-3-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 09:14:52 -07:00

1 2 3 4 5 ...

44245 Commits