linux

iv/linux

Author	SHA1	Message	Date
Thomas Gleixner	c27c753ea6	x86/static_call: Serialize __static_call_fixup() properly __static_call_fixup() invokes __static_call_transform() without holding text_mutex, which causes lockdep to complain in text_poke_bp(). Adding the proper locking cures that, but as this is either used during early boot or during module finalizing, it's not required to use text_poke_bp(). Add an argument to __static_call_transform() which tells it to use text_poke_early() for it. Fixes: ee88d363d156 ("x86,static_call: Use alternative RET encoding") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-07-12 14:23:32 +02:00
Linus Torvalds	ce114c8668	Just when you thought that all the speculation bugs were addressed and solved and the nightmare is complete, here's the next one: speculating after RET instructions and leaking privileged information using the now pretty much classical covert channels. It is called RETBleed and the mitigation effort and controlling functionality has been modelled similar to what already existing mitigations provide. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLKqAgACgkQEsHwGGHe VUoM5w/8CSvwPZ3otkhmu8MrJPtWc7eLDPjYN4qQP+19e+bt094MoozxeeWG2wmp hkDJAYHT2Oik/qDuEdhFgNYwS7XGgbV3Py3B8syO4//5SD5dkOSG+QqFXvXMdFri YsVqqNkjJOWk/YL9Ql5RS/xQewsrr0OqEyWWocuI6XAvfWV4kKvlRSd+6oPqtZEO qYlAHTXElyIrA/gjmxChk1HTt5HZtK3uJLf4twNlUfzw7LYFf3+sw3bdNuiXlyMr WcLXMwGpS0idURwP3mJa7JRuiVBzb4+kt8mWwWqA02FkKV45FRRRFhFUsy667r00 cdZBaWdy+b7dvXeliO3FN/x1bZwIEUxmaNy1iAClph4Ifh0ySPUkxAr8EIER7YBy bstDJEaIqgYg8NIaD4oF1UrG0ZbL0ImuxVaFdhG1hopQsh4IwLSTLgmZYDhfn/0i oSqU0Le+A7QW9s2A2j6qi7BoAbRW+gmBuCgg8f8ECYRkFX1ZF6mkUtnQxYrU7RTq rJWGW9nhwM9nRxwgntZiTjUUJ2HtyXEgYyCNjLFCbEBfeG5QTg7XSGFhqDbgoymH 85vsmSXYxgTgQ/kTW7Fs26tOqnP2h1OtLJZDL8rg49KijLAnISClEgohYW01CWQf ZKMHtz3DM0WBiLvSAmfGifScgSrLB5AjtvFHT0hF+5/okEkinVk= =09fW -----END PGP SIGNATURE----- Merge tag 'x86_bugs_retbleed' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 retbleed fixes from Borislav Petkov: "Just when you thought that all the speculation bugs were addressed and solved and the nightmare is complete, here's the next one: speculating after RET instructions and leaking privileged information using the now pretty much classical covert channels. It is called RETBleed and the mitigation effort and controlling functionality has been modelled similar to what already existing mitigations provide" * tag 'x86_bugs_retbleed' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) x86/speculation: Disable RRSBA behavior x86/kexec: Disable RET on kexec x86/bugs: Do not enable IBPB-on-entry when IBPB is not supported x86/entry: Move PUSH_AND_CLEAR_REGS() back into error_entry x86/bugs: Add Cannon lake to RETBleed affected CPU list x86/retbleed: Add fine grained Kconfig knobs x86/cpu/amd: Enumerate BTC_NO x86/common: Stamp out the stepping madness KVM: VMX: Prevent RSB underflow before vmenter x86/speculation: Fill RSB on vmexit for IBRS KVM: VMX: Fix IBRS handling after vmexit KVM: VMX: Prevent guest RSB poisoning attacks with eIBRS KVM: VMX: Convert launched argument to flags KVM: VMX: Flatten __vmx_vcpu_run() objtool: Re-add UNWIND_HINT_{SAVE_RESTORE} x86/speculation: Remove x86_spec_ctrl_mask x86/speculation: Use cached host SPEC_CTRL value for guest entry/exit x86/speculation: Fix SPEC_CTRL write on SMT state change x86/speculation: Fix firmware entry SPEC_CTRL handling x86/speculation: Fix RSB filling with CONFIG_RETPOLINE=n ...	2022-07-11 18:15:25 -07:00
Jason A. Donenfeld	68b8e9713c	x86/setup: Use rng seeds from setup_data Currently, the only way x86 can get an early boot RNG seed is via EFI, which is generally always used now for physical machines, but is very rarely used in VMs, especially VMs that are optimized for starting "instantaneously", such as Firecracker's MicroVM. For tiny fast booting VMs, EFI is not something you generally need or want. Rather, the image loader or firmware should be able to pass a single random seed, exactly as device tree platforms do with the "rng-seed" property. Additionally, this is something that bootloaders can append, with their own seed file management, which is something every other major OS ecosystem has that Linux does not (yet). Add SETUP_RNG_SEED, similar to the other eight setup_data entries that are parsed at boot. It also takes care to zero out the seed immediately after using, in order to retain forward secrecy. This all takes about 7 trivial lines of code. Then, on kexec_file_load(), a new fresh seed is generated and passed to the next kernel, just as is done on device tree architectures when using kexec. And, importantly, I've tested that QEMU is able to properly pass SETUP_RNG_SEED as well, making this work for every step of the way. This code too is pretty straight forward. Together these measures ensure that VMs and nested kexec()'d kernels always receive a proper boot time RNG seed at the earliest possible stage from their parents: - Host [already has strongly initialized RNG] - QEMU [passes fresh seed in SETUP_RNG_SEED field] - Linux [uses parent's seed and gathers entropy of its own] - kexec [passes this in SETUP_RNG_SEED field] - Linux [uses parent's seed and gathers entropy of its own] - kexec [passes this in SETUP_RNG_SEED field] - Linux [uses parent's seed and gathers entropy of its own] - kexec [passes this in SETUP_RNG_SEED field] - ... I've verified in several scenarios that this works quite well from a host kernel to QEMU and down inwards, mixing and matching loaders, with every layer providing a seed to the next. [ bp: Massage commit message. ] Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com> Link: https://lore.kernel.org/r/20220630113300.1892799-1-Jason@zx2c4.com	2022-07-11 09:59:31 +02:00
Borislav Petkov	5a88c48f41	Linux 5.19-rc6 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmLLR2MeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG+hMH/jKGMOAbicR/CRq8 WLKmpb1eTJP2dbeiEs5amBk9DZQhqjx6tIQRCpZoGxBL+XWq7DX2fRLkAT56yS5/ NwferpR6IR9GlhjbfczF0JuQkP6eRUXnLrIKS5MViLI5QrCI80kkj4/mdqUXSiBV cMfXl5T1j+pb3zHUVXjnmvY+77q6rZTPoGxa/l8d6MaIhAg+jhu2E1HaSaSCX/YK TViq7ciI9cXoFV9yqhLkkBdGjBV8VQsKmeWEcA738bdSy1WAJSV1SVTJqLFvwdPI PM1asxkPoQ7jRrwsY4G8pZ3zPskJMS4Qwdn64HK+no2AKhJt2p6MePD1XblcrGHK QNStMY0= =LfuD -----END PGP SIGNATURE----- Merge tag 'v5.19-rc6' into tip:x86/kdump Merge rc6 to pick up dependent changes to the bootparam UAPI header. Signed-off-by: Borislav Petkov <bp@suse.de>	2022-07-11 09:58:01 +02:00
Masahiro Yamada	8b979924b9	x86/build: Remove unused OBJECT_FILES_NON_STANDARD_test_nx.o Commit 3ad38ceb2769 ("x86/mm: Remove CONFIG_DEBUG_NX_TEST") removed arch/x86/kernel/test_nx.c Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220711041247.119357-1-masahiroy@kernel.org	2022-07-11 08:53:28 +02:00
Linus Torvalds	74a0032b85	- Prepare for and clear .brk early in order to address XenPV guests failures where the hypervisor verifies page tables and uninitialized data in that range leads to bogus failures in those checks - Add any potential setup_data entries supplied at boot to the identity pagetable mappings to prevent kexec kernel boot failures. Usually, this is not a problem for the normal kernel as those mappings are part of the initially mapped 2M pages but if kexec gets to allocate the second kernel somewhere else, those setup_data entries need to be mapped there too. - Fix objtool not to discard text references from the __tracepoints section so that ENDBR validation still works - Correct the setup_data types limit as it is user-visible, before 5.19 releases -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLKpf8ACgkQEsHwGGHe VUrc5w/8DIVLQ8w+Balf2TGfp5Sl3mPkg+eoARH29qtXhvVBs5KJB9sbT1IGnxao nE4yNeiIKhH5SEd17l11E7eWuUtNgZENLsUb3aiAdsItNS+MzOWQuEOPbnAwgJmk oKdxiI1SHiVoPy5KVXOcyAS90PSJIkhhxwgR5MInGdmpSUzEFsx5SY82ZfOjOkZU L7zCsJzeDfhJdWiR4N0MXWRaFbIvRxI1uXyqgv+Lo6JK5l8dyUUSEdWyLUqZ7E4M GFo6LwR3lskQM2bE9vBWS0h1X00d5oDMzfono8kZzRGA/11plZHRI007PCez8yZh 4sUnnxsfCy2YF8/8hs4IhrHZdcWW9XoN4gTUsjD0wekGTHhOEqu5qpAnVSrXbvvM ZfPF8vM+DLPTWQqAT0a4aj1vd1RflDIQPSXKDzJDjeF49zouAj1ae/3KSOYJDzN9 V6NGiKBnzj1rbtm0+8jOsTQusmh/oDage7uLlmel3hTfNOc2Ay0LXrJWcvqhj66V 4CtCd12sLeavin+mGptni6lXbsue61EolRtH44RvZJsXLVY8iclM4onl728xOrxj CBtJo6bd3oQYy0SQsysXGDVR7BSXtwAYfArYR8BrMTtgHxuyULt/BDoew4r7XADB Xxz7ADJZ3DI3Gqza5H6r89Tj6Oi3yXiBWUVUNXFCMYc6ZrqvZc0= =tOvF -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_v5.19_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Prepare for and clear .brk early in order to address XenPV guests failures where the hypervisor verifies page tables and uninitialized data in that range leads to bogus failures in those checks - Add any potential setup_data entries supplied at boot to the identity pagetable mappings to prevent kexec kernel boot failures. Usually, this is not a problem for the normal kernel as those mappings are part of the initially mapped 2M pages but if kexec gets to allocate the second kernel somewhere else, those setup_data entries need to be mapped there too. - Fix objtool not to discard text references from the __tracepoints section so that ENDBR validation still works - Correct the setup_data types limit as it is user-visible, before 5.19 releases * tag 'x86_urgent_for_v5.19_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot: Fix the setup data types max limit x86/ibt, objtool: Don't discard text references from tracepoint section x86/compressed/64: Add identity mappings for setup_data entries x86: Fix .brk attribute in linker script x86: Clear .brk area at early boot x86/xen: Use clear_bss() for Xen PV guests	2022-07-10 08:43:52 -07:00
Pawan Gupta	4ad3278df6	x86/speculation: Disable RRSBA behavior Some Intel processors may use alternate predictors for RETs on RSB-underflow. This condition may be vulnerable to Branch History Injection (BHI) and intramode-BTI. Kernel earlier added spectre_v2 mitigation modes (eIBRS+Retpolines, eIBRS+LFENCE, Retpolines) which protect indirect CALLs and JMPs against such attacks. However, on RSB-underflow, RET target prediction may fallback to alternate predictors. As a result, RET's predicted target may get influenced by branch history. A new MSR_IA32_SPEC_CTRL bit (RRSBA_DIS_S) controls this fallback behavior when in kernel mode. When set, RETs will not take predictions from alternate predictors, hence mitigating RETs as well. Support for this is enumerated by CPUID.7.2.EDX[RRSBA_CTRL] (bit2). For spectre v2 mitigation, when a user selects a mitigation that protects indirect CALLs and JMPs against BHI and intramode-BTI, set RRSBA_DIS_S also to protect RETs for RSB-underflow case. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-07-09 13:12:45 +02:00
Konrad Rzeszutek Wilk	697977d841	x86/kexec: Disable RET on kexec All the invocations unroll to __x86_return_thunk and this file must be PIC independent. This fixes kexec on 64-bit AMD boxes. [ bp: Fix 32-bit build. ] Reported-by: Edward Tran <edward.tran@oracle.com> Reported-by: Awais Tanveer <awais.tanveer@oracle.com> Suggested-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-07-09 13:12:32 +02:00
Sean Christopherson	e0a5915f1c	x86/sgx: Drop 'page_index' from sgx_backing Storing the 'page_index' value in the sgx_backing struct is dead code and no longer needed. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220708162124.8442-1-kristen@linux.intel.com	2022-07-08 09:31:11 -07:00
Thadeu Lima de Souza Cascardo	2259da159f	x86/bugs: Do not enable IBPB-on-entry when IBPB is not supported There are some VM configurations which have Skylake model but do not support IBPB. In those cases, when using retbleed=ibpb, userspace is going to be killed and kernel is going to panic. If the CPU does not support IBPB, warn and proceed with the auto option. Also, do not fallback to IBPB on AMD/Hygon systems if it is not supported. Fixes: 3ebc17006888 ("x86/bugs: Add retbleed=ibpb") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-07-08 12:50:52 +02:00
Reinette Chatre	a0506b3b06	x86/sgx: Free up EPC pages directly to support large page ranges The page reclaimer ensures availability of EPC pages across all enclaves. In support of this it runs independently from the individual enclaves in order to take locks from the different enclaves as it writes pages to swap. When needing to load a page from swap an EPC page needs to be available for its contents to be loaded into. Loading an existing enclave page from swap does not reclaim EPC pages directly if none are available, instead the reclaimer is woken when the available EPC pages are found to be below a watermark. When iterating over a large number of pages in an oversubscribed environment there is a race between the reclaimer woken up and EPC pages reclaimed fast enough for the page operations to proceed. Ensure there are EPC pages available before attempting to load a page that may potentially be pulled from swap into an available EPC page. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/a0d8f037c4a075d56bf79f432438412985f7ff7a.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:03 -07:00
Reinette Chatre	9849bb2715	x86/sgx: Support complete page removal The SGX2 page removal flow was introduced in previous patch and is as follows: 1) Change the type of the pages to be removed to SGX_PAGE_TYPE_TRIM using the ioctl() SGX_IOC_ENCLAVE_MODIFY_TYPES introduced in previous patch. 2) Approve the page removal by running ENCLU[EACCEPT] from within the enclave. 3) Initiate actual page removal using the ioctl() SGX_IOC_ENCLAVE_REMOVE_PAGES introduced here. Support the final step of the SGX2 page removal flow with ioctl() SGX_IOC_ENCLAVE_REMOVE_PAGES. With this ioctl() the user specifies a page range that should be removed. All pages in the provided range should have the SGX_PAGE_TYPE_TRIM page type and the request will fail with EPERM (Operation not permitted) if a page that does not have the correct type is encountered. Page removal can fail on any page within the provided range. Support partial success by returning the number of pages that were successfully removed. Since actual page removal will succeed even if ENCLU[EACCEPT] was not run from within the enclave the ENCLU[EMODPR] instruction with RWX permissions is used as a no-op mechanism to ensure ENCLU[EACCEPT] was successfully run from within the enclave before the enclave page is removed. If the user omits running SGX_IOC_ENCLAVE_REMOVE_PAGES the pages will still be removed when the enclave is unloaded. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Haitao Huang <haitao.huang@intel.com> Tested-by: Vijay Dhanraj <vijay.dhanraj@intel.com> Tested-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/b75ee93e96774e38bb44a24b8e9bbfb67b08b51b.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:03 -07:00
Reinette Chatre	45d546b8c1	x86/sgx: Support modifying SGX page type Every enclave contains one or more Thread Control Structures (TCS). The TCS contains meta-data used by the hardware to save and restore thread specific information when entering/exiting the enclave. With SGX1 an enclave needs to be created with enough TCSs to support the largest number of threads expecting to use the enclave and enough enclave pages to meet all its anticipated memory demands. In SGX1 all pages remain in the enclave until the enclave is unloaded. SGX2 introduces a new function, ENCLS[EMODT], that is used to change the type of an enclave page from a regular (SGX_PAGE_TYPE_REG) enclave page to a TCS (SGX_PAGE_TYPE_TCS) page or change the type from a regular (SGX_PAGE_TYPE_REG) or TCS (SGX_PAGE_TYPE_TCS) page to a trimmed (SGX_PAGE_TYPE_TRIM) page (setting it up for later removal). With the existing support of dynamically adding regular enclave pages to an initialized enclave and changing the page type to TCS it is possible to dynamically increase the number of threads supported by an enclave. Changing the enclave page type to SGX_PAGE_TYPE_TRIM is the first step of dynamically removing pages from an initialized enclave. The complete page removal flow is: 1) Change the type of the pages to be removed to SGX_PAGE_TYPE_TRIM using the SGX_IOC_ENCLAVE_MODIFY_TYPES ioctl() introduced here. 2) Approve the page removal by running ENCLU[EACCEPT] from within the enclave. 3) Initiate actual page removal using the ioctl() introduced in the following patch. Add ioctl() SGX_IOC_ENCLAVE_MODIFY_TYPES to support changing SGX enclave page types within an initialized enclave. With SGX_IOC_ENCLAVE_MODIFY_TYPES the user specifies a page range and the enclave page type to be applied to all pages in the provided range. The ioctl() itself can return an error code based on failures encountered by the kernel. It is also possible for SGX specific failures to be encountered. Add a result output parameter to communicate the SGX return code. It is possible for the enclave page type change request to fail on any page within the provided range. Support partial success by returning the number of pages that were successfully changed. After the page type is changed the page continues to be accessible from the kernel perspective with page table entries and internal state. The page may be moved to swap. Any access until ENCLU[EACCEPT] will encounter a page fault with SGX flag set in error code. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Haitao Huang <haitao.huang@intel.com> Tested-by: Vijay Dhanraj <vijay.dhanraj@intel.com> Link: https://lkml.kernel.org/r/babe39318c5bf16fc65fbfb38896cdee72161575.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:03 -07:00
Reinette Chatre	7b013e723a	x86/sgx: Tighten accessible memory range after enclave initialization Before an enclave is initialized the enclave's memory range is unknown. The enclave's memory range is learned at the time it is created via the SGX_IOC_ENCLAVE_CREATE ioctl() where the provided memory range is obtained from an earlier mmap() of /dev/sgx_enclave. After an enclave is initialized its memory can be mapped into user space (mmap()) from where it can be entered at its defined entry points. With the enclave's memory range known after it is initialized there is no reason why it should be possible to map memory outside this range. Lock down access to the initialized enclave's memory range by denying any attempt to map memory outside its memory range. Locking down the memory range also makes adding pages to an initialized enclave more efficient. Pages are added to an initialized enclave by accessing memory that belongs to the enclave's memory range but not yet backed by an enclave page. If it is possible for user space to map memory that does not form part of the enclave then an access to this memory would eventually fail. Failures range from a prompt general protection fault if the access was an ENCLU[EACCEPT] from within the enclave, or a page fault via the vDSO if it was another access from within the enclave, or a SIGBUS (also resulting from a page fault) if the access was from outside the enclave. Disallowing invalid memory to be mapped in the first place avoids preventable failures. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/6391460d75ae79cea2e81eef0f6ffc03c6e9cfe7.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:03 -07:00
Reinette Chatre	5a90d2c3f5	x86/sgx: Support adding of pages to an initialized enclave With SGX1 an enclave needs to be created with its maximum memory demands allocated. Pages cannot be added to an enclave after it is initialized. SGX2 introduces a new function, ENCLS[EAUG], that can be used to add pages to an initialized enclave. With SGX2 the enclave still needs to set aside address space for its maximum memory demands during enclave creation, but all pages need not be added before enclave initialization. Pages can be added during enclave runtime. Add support for dynamically adding pages to an initialized enclave, architecturally limited to RW permission at creation but allowed to obtain RWX permissions after trusted enclave runs EMODPE. Add pages via the page fault handler at the time an enclave address without a backing enclave page is accessed, potentially directly reclaiming pages if no free pages are available. The enclave is still required to run ENCLU[EACCEPT] on the page before it can be used. A useful flow is for the enclave to run ENCLU[EACCEPT] on an uninitialized address. This will trigger the page fault handler that will add the enclave page and return execution to the enclave to repeat the ENCLU[EACCEPT] instruction, this time successful. If the enclave accesses an uninitialized address in another way, for example by expanding the enclave stack to a page that has not yet been added, then the page fault handler would add the page on the first write but upon returning to the enclave the instruction that triggered the page fault would be repeated and since ENCLU[EACCEPT] was not run yet it would trigger a second page fault, this time with the SGX flag set in the page fault error code. This can only be recovered by entering the enclave again and directly running the ENCLU[EACCEPT] instruction on the now initialized address. Accessing an uninitialized address from outside the enclave also triggers this flow but the page will remain inaccessible (access will result in #PF) until accepted from within the enclave via ENCLU[EACCEPT]. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Haitao Huang <haitao.huang@intel.com> Tested-by: Vijay Dhanraj <vijay.dhanraj@intel.com> Link: https://lkml.kernel.org/r/a254a58eabea053803277449b24b6e4963a3883b.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:03 -07:00
Reinette Chatre	ff08530a52	x86/sgx: Support restricting of enclave page permissions In the initial (SGX1) version of SGX, pages in an enclave need to be created with permissions that support all usages of the pages, from the time the enclave is initialized until it is unloaded. For example, pages used by a JIT compiler or when code needs to otherwise be relocated need to always have RWX permissions. SGX2 includes a new function ENCLS[EMODPR] that is run from the kernel and can be used to restrict the EPCM permissions of regular enclave pages within an initialized enclave. Introduce ioctl() SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS to support restricting EPCM permissions. With this ioctl() the user specifies a page range and the EPCM permissions to be applied to all pages in the provided range. ENCLS[EMODPR] is run to restrict the EPCM permissions followed by the ENCLS[ETRACK] flow that will ensure no cached linear-to-physical address mappings to the changed pages remain. It is possible for the permission change request to fail on any page within the provided range, either with an error encountered by the kernel or by the SGX hardware while running ENCLS[EMODPR]. To support partial success the ioctl() returns an error code based on failures encountered by the kernel as well as two result output parameters: one for the number of pages that were successfully changed and one for the SGX return code. The page table entry permissions are not impacted by the EPCM permission changes. VMAs and PTEs will continue to allow the maximum vetted permissions determined at the time the pages are added to the enclave. The SGX error code in a page fault will indicate if it was an EPCM permission check that prevented an access attempt. No checking is done to ensure that the permissions are actually being restricted. This is because the enclave may have relaxed the EPCM permissions from within the enclave without the kernel knowing. An attempt to relax permissions using this call will be ignored by the hardware. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Haitao Huang <haitao.huang@intel.com> Tested-by: Vijay Dhanraj <vijay.dhanraj@intel.com> Link: https://lkml.kernel.org/r/082cee986f3c1a2f4fdbf49501d7a8c5a98446f8.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:03 -07:00
Reinette Chatre	a76e7f1f18	x86/sgx: Support VA page allocation without reclaiming struct sgx_encl should be protected with the mutex sgx_encl->lock. One exception is sgx_encl->page_cnt that is incremented (in sgx_encl_grow()) when an enclave page is added to the enclave. The reason the mutex is not held is to allow the reclaimer to be called directly if there are no EPC pages (in support of a new VA page) available at the time. Incrementing sgx_encl->page_cnt without sgc_encl->lock held is currently (before SGX2) safe from concurrent updates because all paths in which sgx_encl_grow() is called occur before enclave initialization and are protected with an atomic operation on SGX_ENCL_IOCTL. SGX2 includes support for dynamically adding pages after enclave initialization where the protection of SGX_ENCL_IOCTL is not available. Make direct reclaim of EPC pages optional when new VA pages are added to the enclave. Essentially the existing "reclaim" flag used when regular EPC pages are added to an enclave becomes available to the caller when used to allocate VA pages instead of always being "true". When adding pages without invoking the reclaimer it is possible to do so with sgx_encl->lock held, gaining its protection against concurrent updates to sgx_encl->page_cnt after enclave initialization. No functional change. Reported-by: Haitao Huang <haitao.huang@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/42c5934c229982ee67982bb97c6ab34bde758620.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:02 -07:00
Jarkko Sakkinen	8123073c43	x86/sgx: Export sgx_encl_page_alloc() Move sgx_encl_page_alloc() to encl.c and export it so that it can be used in the implementation for support of adding pages to initialized enclaves, which requires to allocate new enclave pages. Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/57ae71b4ea17998467670232e12d6617b95c6811.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:02 -07:00
Reinette Chatre	3a53514152	x86/sgx: Export sgx_encl_{grow,shrink}() In order to use sgx_encl_{grow,shrink}() in the page augmentation code located in encl.c, export these functions. Suggested-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/d51730acf54b6565710b2261b3099517b38c2ec4.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:02 -07:00
Reinette Chatre	8cb7b502f3	x86/sgx: Keep record of SGX page type SGX2 functions are not allowed on all page types. For example, ENCLS[EMODPR] is only allowed on regular SGX enclave pages and ENCLS[EMODPT] is only allowed on TCS and regular pages. If these functions are attempted on another type of page the hardware would trigger a fault. Keep a record of the SGX page type so that there is more certainty whether an SGX2 instruction can succeed and faults can be treated as real failures. The page type is a property of struct sgx_encl_page and thus does not cover the VA page type. VA pages are maintained in separate structures and their type can be determined in a different way. The SGX2 instructions needing the page type do not operate on VA pages and this is thus not a scenario needing to be covered at this time. struct sgx_encl_page hosting this information is maintained for each enclave page so the space consumed by the struct is important. The existing sgx_encl_page->vm_max_prot_bits is already unsigned long while only using three bits. Transition to a bitfield for the two members to support the additional information without increasing the space consumed by the struct. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/a0a6939eefe7ba26514f6c49723521cde372de64.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:02 -07:00
Reinette Chatre	dda03e2c33	x86/sgx: Create utility to validate user provided offset and length User provided offset and length is validated when parsing the parameters of the SGX_IOC_ENCLAVE_ADD_PAGES ioctl(). Extract this validation (with consistent use of IS_ALIGNED) into a utility that can be used by the SGX2 ioctl()s that will also provide these values. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/767147bc100047abed47fe27c592901adfbb93a2.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:02 -07:00
Reinette Chatre	c7c6a8a61b	x86/sgx: Make sgx_ipi_cb() available internally The ETRACK function followed by an IPI to all CPUs within an enclave is a common pattern with more frequent use in support of SGX2. Make the (empty) IPI callback function available internally in preparation for usage by SGX2. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/1179ed4a9c3c1c2abf49d51bfcf2c30b493181cc.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:02 -07:00
Reinette Chatre	f89c2f9bf5	x86/sgx: Move PTE zap code to new sgx_zap_enclave_ptes() The SGX reclaimer removes page table entries pointing to pages that are moved to swap. SGX2 enables changes to pages belonging to an initialized enclave, thus enclave pages may have their permission or type changed while the page is being accessed by an enclave. Supporting SGX2 requires page table entries to be removed so that any cached mappings to changed pages are removed. For example, with the ability to change enclave page types a regular enclave page may be changed to a Thread Control Structure (TCS) page that may not be accessed by an enclave. Factor out the code removing page table entries to a separate function sgx_zap_enclave_ptes(), fixing accuracy of comments in the process, and make it available to the upcoming SGX2 code. Place sgx_zap_enclave_ptes() with the rest of the enclave code in encl.c interacting with the page table since this code is no longer unique to the reclaimer. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/b010cdf01d7ce55dd0f00e883b7ccbd9db57160a.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:02 -07:00
Reinette Chatre	bdaa8799f6	x86/sgx: Rename sgx_encl_ewb_cpumask() as sgx_encl_cpumask() sgx_encl_ewb_cpumask() is no longer unique to the reclaimer where it is used during the EWB ENCLS leaf function when EPC pages are written out to main memory and sgx_encl_ewb_cpumask() is used to learn which CPUs might have executed the enclave to ensure that TLBs are cleared. Upcoming SGX2 enabling will use sgx_encl_ewb_cpumask() during the EMODPR and EMODT ENCLS leaf functions that make changes to enclave pages. The function is needed for the same reason it is used now: to learn which CPUs might have executed the enclave to ensure that TLBs no longer point to the changed pages. Rename sgx_encl_ewb_cpumask() to sgx_encl_cpumask() to reflect the broader usage. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/d4d08c449450a13d8dd3bb6c2b1af03895586d4f.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:02 -07:00
Reinette Chatre	7f391752d4	x86/sgx: Export sgx_encl_ewb_cpumask() Using sgx_encl_ewb_cpumask() to learn which CPUs might have executed an enclave is useful to ensure that TLBs are cleared when changes are made to enclave pages. sgx_encl_ewb_cpumask() is used within the reclaimer when an enclave page is evicted. The upcoming SGX2 support enables changes to be made to enclave pages and will require TLBs to not refer to the changed pages and thus will be needing sgx_encl_ewb_cpumask(). Relocate sgx_encl_ewb_cpumask() to be with the rest of the enclave code in encl.c now that it is no longer unique to the reclaimer. Take care to ensure that any future usage maintains the current context requirement that ETRACK has been called first. Expand the existing comments to highlight this while moving them to a more prominent location before the function. No functional change. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/05b60747fd45130cf9fc6edb1c373a69a18a22c5.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:01 -07:00
Reinette Chatre	b3fb517dc6	x86/sgx: Support loading enclave page without VMA permissions check sgx_encl_load_page() is used to find and load an enclave page into enclave (EPC) memory, potentially loading it from the backing storage. Both usages of sgx_encl_load_page() are during an access to the enclave page from a VMA and thus the permissions of the VMA are considered before the enclave page is loaded. SGX2 functions operating on enclave pages belonging to an initialized enclave requiring the page to be in EPC. It is thus required to support loading enclave pages into the EPC independent from a VMA. Split the current sgx_encl_load_page() to support the two usages: A new call, sgx_encl_load_page_in_vma(), behaves exactly like the current sgx_encl_load_page() that takes VMA permissions into account, while sgx_encl_load_page() just loads an enclave page into EPC. VMA, PTE, and EPCM permissions continue to dictate whether the pages can be accessed from within an enclave. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/d4393513c1f18987c14a490bcf133bfb71a5dc43.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:01 -07:00
Reinette Chatre	61416b294a	x86/sgx: Add wrapper for SGX2 EAUG function Add a wrapper for the EAUG ENCLS leaf function used to add a page to an initialized enclave. EAUG: 1) Stores all properties of the new enclave page in the SGX hardware's Enclave Page Cache Map (EPCM). 2) Sets the PENDING bit in the EPCM entry of the enclave page. This bit is cleared by the enclave by invoking ENCLU leaf function EACCEPT or EACCEPTCOPY. Access from within the enclave to the new enclave page is not possible until the PENDING bit is cleared. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/97a46754fe4764e908651df63694fb760f783d6e.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:01 -07:00
Reinette Chatre	09b38d0b41	x86/sgx: Add wrapper for SGX2 EMODT function Add a wrapper for the EMODT ENCLS leaf function used to change the type of an enclave page as maintained in the SGX hardware's Enclave Page Cache Map (EPCM). EMODT: 1) Updates the EPCM page type of the enclave page. 2) Sets the MODIFIED bit in the EPCM entry of the enclave page. This bit is reset by the enclave by invoking ENCLU leaf function EACCEPT or EACCEPTCOPY. Access from within the enclave to the enclave page is not possible while the MODIFIED bit is set. After changing the enclave page type by issuing EMODT the kernel needs to collaborate with the hardware to ensure that no logical processor continues to hold a reference to the changed page. This is required to ensure no required security checks are circumvented and is required for the enclave's EACCEPT/EACCEPTCOPY to succeed. Ensuring that no references to the changed page remain is accomplished with the ETRACK flow. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/dba63a8c0db1d510b940beee1ba2a8207efeb1f1.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:01 -07:00
Reinette Chatre	0fb2126db8	x86/sgx: Add wrapper for SGX2 EMODPR function Add a wrapper for the EMODPR ENCLS leaf function used to restrict enclave page permissions as maintained in the SGX hardware's Enclave Page Cache Map (EPCM). EMODPR: 1) Updates the EPCM permissions of an enclave page by treating the new permissions as a mask. Supplying a value that attempts to relax EPCM permissions has no effect on EPCM permissions (PR bit, see below, is changed). 2) Sets the PR bit in the EPCM entry of the enclave page to indicate that permission restriction is in progress. The bit is reset by the enclave by invoking ENCLU leaf function EACCEPT or EACCEPTCOPY. The enclave may access the page throughout the entire process if conforming to the EPCM permissions for the enclave page. After performing the permission restriction by issuing EMODPR the kernel needs to collaborate with the hardware to ensure that all logical processors sees the new restricted permissions. This is required for the enclave's EACCEPT/EACCEPTCOPY to succeed and is accomplished with the ETRACK flow. Expand enum sgx_return_code with the possible EMODPR return values. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/d15e7a769e13e4ca671fa2d0a0d3e3aec5aedbd4.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:01 -07:00
Reinette Chatre	4c3f73584c	x86/sgx: Add short descriptions to ENCLS wrappers The SGX ENCLS instruction uses EAX to specify an SGX function and may require additional registers, depending on the SGX function. ENCLS invokes the specified privileged SGX function for managing and debugging enclaves. Macros are used to wrap the ENCLS functionality and several wrappers are used to wrap the macros to make the different SGX functions accessible in the code. The wrappers of the supported SGX functions are cryptic. Add short descriptions of each as a comment. Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/5e78a1126711cbd692d5b8132e0683873398f69e.1652137848.git.reinette.chatre@intel.com	2022-07-07 10:13:01 -07:00
Pawan Gupta	f54d45372c	x86/bugs: Add Cannon lake to RETBleed affected CPU list Cannon lake is also affected by RETBleed, add it to the list. Fixes: 6ad0ad2bf8a6 ("x86/bugs: Report Intel retbleed vulnerability") Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-07-07 12:33:53 +02:00
Mario Limonciello	8b356e536e	ACPI: CPPC: Don't require _OSC if X86_FEATURE_CPPC is supported commit 72f2ecb7ece7 ("ACPI: bus: Set CPPC _OSC bits for all and when CPPC_LIB is supported") added support for claiming to support CPPC in _OSC on non-Intel platforms. This unfortunately caused a regression on a vartiety of AMD platforms in the field because a number of AMD platforms don't set the `_OSC` bit 5 or 6 to indicate CPPC or CPPC v2 support. As these AMD platforms already claim CPPC support via a dedicated MSR from `X86_FEATURE_CPPC`, use this enable this feature rather than requiring the `_OSC` on platforms with a dedicated MSR. If there is additional breakage on the shared memory designs also missing this _OSC, additional follow up changes may be needed. Fixes: 72f2ecb7ece7 ("Set CPPC _OSC bits for all and when CPPC_LIB is supported") Reported-by: Perry Yuan <perry.yuan@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2022-07-05 20:36:11 +02:00
Jonathan McDowell	b69a2afd5a	x86/kexec: Carry forward IMA measurement log on kexec On kexec file load, the Integrity Measurement Architecture (IMA) subsystem may verify the IMA signature of the kernel and initramfs, and measure it. The command line parameters passed to the kernel in the kexec call may also be measured by IMA. A remote attestation service can verify a TPM quote based on the TPM event log, the IMA measurement list and the TPM PCR data. This can be achieved only if the IMA measurement log is carried over from the current kernel to the next kernel across the kexec call. PowerPC and ARM64 both achieve this using device tree with a "linux,ima-kexec-buffer" node. x86 platforms generally don't make use of device tree, so use the setup_data mechanism to pass the IMA buffer to the new kernel. Signed-off-by: Jonathan McDowell <noodles@fb.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> # IMA function definitions Link: https://lore.kernel.org/r/YmKyvlF3my1yWTvK@noodles-fedora-PC23Y6EG	2022-07-01 15:22:16 +02:00
Juergen Gross	7e09ac27f4	x86: Fix .brk attribute in linker script Commit in Fixes added the "NOLOAD" attribute to the .brk section as a "failsafe" measure. Unfortunately, this leads to the linker no longer covering the .brk section in a program header, resulting in the kernel loader not knowing that the memory for the .brk section must be reserved. This has led to crashes when loading the kernel as PV dom0 under Xen, but other scenarios could be hit by the same problem (e.g. in case an uncompressed kernel is used and the initrd is placed directly behind it). So drop the "NOLOAD" attribute. This has been verified to correctly cover the .brk section by a program header of the resulting ELF file. Fixes: e32683c6f7d2 ("x86/mm: Fix RESERVE_BRK() for older binutils") Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20220630071441.28576-4-jgross@suse.com	2022-07-01 11:12:43 +02:00
Juergen Gross	38fa5479b4	x86: Clear .brk area at early boot The .brk section has the same properties as .bss: it is an alloc-only section and should be cleared before being used. Not doing so is especially a problem for Xen PV guests, as the hypervisor will validate page tables (check for writable page tables and hypervisor private bits) before accepting them to be used. Make sure .brk is initially zero by letting clear_bss() clear the brk area, too. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220630071441.28576-3-jgross@suse.com	2022-07-01 11:11:34 +02:00
Juergen Gross	96e8fc5818	x86/xen: Use clear_bss() for Xen PV guests Instead of clearing the bss area in assembly code, use the clear_bss() function. This requires to pass the start_info address as parameter to xen_start_kernel() in order to avoid the xen_start_info being zeroed again. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20220630071441.28576-2-jgross@suse.com	2022-07-01 10:57:52 +02:00
Peter Zijlstra	f43b9876e8	x86/retbleed: Add fine grained Kconfig knobs Do fine-grained Kconfig for all the various retbleed parts. NOTE: if your compiler doesn't support return thunks this will silently 'upgrade' your mitigation to IBPB, you might not like this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-29 17:43:41 +02:00
Smita Koralahalli	891e465a1b	x86/mce: Check whether writes to MCA_STATUS are getting ignored The platform can sometimes - depending on its settings - cause writes to MCA_STATUS MSRs to get ignored, regardless of HWCR[McStatusWrEn]'s value. For further info see PPR for AMD Family 19h, Model 01h, Revision B1 Processors, doc ID 55898 at https://bugzilla.kernel.org/show_bug.cgi?id=206537. Therefore, probe for ignored writes to MCA_STATUS to determine if hardware error injection is at all possible. [ bp: Heavily massage commit message and patch. ] Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220214233640.70510-2-Smita.KoralahalliChannabasappa@amd.com	2022-06-28 12:08:10 +02:00
Andrew Cooper	26aae8ccbc	x86/cpu/amd: Enumerate BTC_NO BTC_NO indicates that hardware is not susceptible to Branch Type Confusion. Zen3 CPUs don't suffer BTC. Hypervisors are expected to synthesise BTC_NO when it is appropriate given the migration pool, to prevent kernels using heuristics. [ bp: Massage. ] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:34:01 +02:00
Peter Zijlstra	7a05bc95ed	x86/common: Stamp out the stepping madness The whole MMIO/RETBLEED enumeration went overboard on steppings. Get rid of all that and simply use ANY. If a future stepping of these models would not be affected, it had better set the relevant ARCH_CAP_$FOO_NO bit in IA32_ARCH_CAPABILITIES. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:34:01 +02:00
Josh Poimboeuf	07853adc29	KVM: VMX: Prevent RSB underflow before vmenter On VMX, there are some balanced returns between the time the guest's SPEC_CTRL value is written, and the vmenter. Balanced returns (matched by a preceding call) are usually ok, but it's at least theoretically possible an NMI with a deep call stack could empty the RSB before one of the returns. For maximum paranoia, don't allow any returns (balanced or otherwise) between the SPEC_CTRL write and the vmenter. [ bp: Fix 32-bit build. ] Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:34:00 +02:00
Josh Poimboeuf	9756bba284	x86/speculation: Fill RSB on vmexit for IBRS Prevent RSB underflow/poisoning attacks with RSB. While at it, add a bunch of comments to attempt to document the current state of tribal knowledge about RSB attacks and what exactly is being mitigated. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:34:00 +02:00
Josh Poimboeuf	fc02735b14	KVM: VMX: Prevent guest RSB poisoning attacks with eIBRS On eIBRS systems, the returns in the vmexit return path from __vmx_vcpu_run() to vmx_vcpu_run() are exposed to RSB poisoning attacks. Fix that by moving the post-vmexit spec_ctrl handling to immediately after the vmexit. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:34:00 +02:00
Josh Poimboeuf	acac5e98ef	x86/speculation: Remove x86_spec_ctrl_mask This mask has been made redundant by kvm_spec_ctrl_test_value(). And it doesn't even work when MSR interception is disabled, as the guest can just write to SPEC_CTRL directly. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:34:00 +02:00
Josh Poimboeuf	bbb69e8bee	x86/speculation: Use cached host SPEC_CTRL value for guest entry/exit There's no need to recalculate the host value for every entry/exit. Just use the cached value in spec_ctrl_current(). Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:34:00 +02:00
Josh Poimboeuf	56aa4d221f	x86/speculation: Fix SPEC_CTRL write on SMT state change If the SMT state changes, SSBD might get accidentally disabled. Fix that. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:34:00 +02:00
Peter Zijlstra	d7caac991f	x86/cpu/amd: Add Spectral Chicken Zen2 uarchs have an undocumented, unnamed, MSR that contains a chicken bit for some speculation behaviour. It needs setting. Note: very belatedly AMD released naming; it's now officially called MSR_AMD64_DE_CFG2 and MSR_AMD64_DE_CFG2_SUPPRESS_NOBR_PRED_BIT but shall remain the SPECTRAL CHICKEN. Suggested-by: Andrew Cooper <Andrew.Cooper3@citrix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:34:00 +02:00
Peter Zijlstra	a09a6e2399	objtool: Add entry UNRET validation Since entry asm is tricky, add a validation pass that ensures the retbleed mitigation has been done before the first actual RET instruction. Entry points are those that either have UNWIND_HINT_ENTRY, which acts as UNWIND_HINT_EMPTY but marks the instruction as an entry point, or those that have UWIND_HINT_IRET_REGS at +0. This is basically a variant of validate_branch() that is intra-function and it will simply follow all branches from marked entry points and ensures that all paths lead to ANNOTATE_UNRET_END. If a path hits RET or an indirection the path is a fail and will be reported. There are 3 ANNOTATE_UNRET_END instances: - UNTRAIN_RET itself - exception from-kernel; this path doesn't need UNTRAIN_RET - all early exceptions; these also don't need UNTRAIN_RET Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:34:00 +02:00
Josh Poimboeuf	0fe4aeea9c	x86/bugs: Do IBPB fallback check only once When booting with retbleed=auto, if the kernel wasn't built with CONFIG_CC_HAS_RETURN_THUNK, the mitigation falls back to IBPB. Make sure a warning is printed in that case. The IBPB fallback check is done twice, but it really only needs to be done once. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:34:00 +02:00
Peter Zijlstra	3ebc170068	x86/bugs: Add retbleed=ibpb jmp2ret mitigates the easy-to-attack case at relatively low overhead. It mitigates the long speculation windows after a mispredicted RET, but it does not mitigate the short speculation window from arbitrary instruction boundaries. On Zen2, there is a chicken bit which needs setting, which mitigates "arbitrary instruction boundaries" down to just "basic block boundaries". But there is no fix for the short speculation window on basic block boundaries, other than to flush the entire BTB to evict all attacker predictions. On the spectrum of "fast & blurry" -> "safe", there is (on top of STIBP or no-SMT): 1) Nothing System wide open 2) jmp2ret May stop a script kiddy 3) jmp2ret+chickenbit Raises the bar rather further 4) IBPB Only thing which can count as "safe". Tentative numbers put IBPB-on-entry at a 2.5x hit on Zen2, and a 10x hit on Zen1 according to lmbench. [ bp: Fixup feature bit comments, document option, 32-bit build fix. ] Suggested-by: Andrew Cooper <Andrew.Cooper3@citrix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:34:00 +02:00

1 2 3 4 5 ...

18078 Commits