linux

iv/linux

Author	SHA1	Message	Date
David Spinadel	234c8e6043	mac80211: Add RX flag to indicate ICV stripped commit cef0acd4d7d4811d2d19cd0195031bf0dfe41249 upstream. Add a flag that indicates that the WEP ICV was stripped from an RX packet, allowing the device to not transfer that if it's already checked. Signed-off-by: David Spinadel <david.spinadel@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Cc: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-17 09:38:49 +01:00
Suren Baghdasaryan	b58aa24edb	dm bufio: fix shrinker scans when (nr_to_scan < retain_target) commit fbc7c07ec23c040179384a1f16b62b6030eb6bdd upstream. When system is under memory pressure it is observed that dm bufio shrinker often reclaims only one buffer per scan. This change fixes the following two issues in dm bufio shrinker that cause this behavior: 1. ((nr_to_scan - freed) <= retain_target) condition is used to terminate slab scan process. This assumes that nr_to_scan is equal to the LRU size, which might not be correct because do_shrink_slab() in vmscan.c calculates nr_to_scan using multiple inputs. As a result when nr_to_scan is less than retain_target (64) the scan will terminate after the first iteration, effectively reclaiming one buffer per scan and making scans very inefficient. This hurts vmscan performance especially because mutex is acquired/released every time dm_bufio_shrink_scan() is called. New implementation uses ((LRU size - freed) <= retain_target) condition for scan termination. LRU size can be safely determined inside __scan() because this function is called after dm_bufio_lock(). 2. do_shrink_slab() uses value returned by dm_bufio_shrink_count() to determine number of freeable objects in the slab. However dm_bufio always retains retain_target buffers in its LRU and will terminate a scan when this mark is reached. Therefore returning the entire LRU size from dm_bufio_shrink_count() is misleading because that does not represent the number of freeable objects that slab will reclaim during a scan. Returning (LRU size - retain_target) better represents the number of freeable objects in the slab. This way do_shrink_slab() returns 0 when (LRU size < retain_target) and vmscan will not try to scan this shrinker avoiding scans that will not reclaim any memory. Test: tested using Android device running <AOSP>/system/extras/alloc-stress that generates memory pressure and causes intensive shrinker scans Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-17 09:38:49 +01:00
Greg Kroah-Hartman	7bbc6ca488	Linux 4.9.76	2018-01-10 09:29:55 +01:00
Boris Brezillon	5e1f377fc8	mtd: nand: pxa3xx: Fix READOOB implementation commit fee4380f368e84ed216b62ccd2fbc4126f2bf40b upstream. In the current driver, OOB bytes are accessed in raw mode, and when a page access is done with NDCR_SPARE_EN set and NDCR_ECC_EN cleared, the driver must read the whole spare area (64 bytes in case of a 2k page, 16 bytes for a 512 page). The driver was only reading the free OOB bytes, which was leaving some unread data in the FIFO and was somehow leading to a timeout. We could patch the driver to read ->spare_size + ->ecc_size instead of just ->spare_size when READOOB is requested, but we'd better make in-band and OOB accesses consistent. Since the driver is always accessing in-band data in non-raw mode (with the ECC engine enabled), we should also access OOB data in this mode. That's particularly useful when using the BCH engine because in this mode the free OOB bytes are also ECC protected. Fixes: 43bcfd2bb24a ("mtd: nand: pxa3xx: Add driver-specific ECC BCH support") Reported-by: Sean Nyekjær <sean.nyekjaer@prevas.dk> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com> Acked-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar> Tested-by: Sean Nyekjaer <sean.nyekjaer@prevas.dk> Acked-by: Robert Jarzmik <robert.jarzmik@free.fr> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:54 +01:00
Borislav Petkov	beca4e2d99	Map the vsyscall page with _PAGE_USER This needs to happen early in kaiser_pagetable_walk(), before the hierarchy is established so that _PAGE_USER permission can be really set. A proper fix would be to teach kaiser_pagetable_walk() to update those permissions but the vsyscall page is the only exception here so ... Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:54 +01:00
Thomas Gleixner	47f3cea393	x86/tlb: Drop the _GPL from the cpu_tlbstate export commit 1e5476815fd7f98b888e01a0f9522b63085f96c9 upstream. The recent changes for PTI touch cpu_tlbstate from various tlb_flush inlines. cpu_tlbstate is exported as GPL symbol, so this causes a regression when building out of tree drivers for certain graphics cards. Aside of that the export was wrong since it was introduced as it should have been EXPORT_PER_CPU_SYMBOL_GPL(). Use the correct PER_CPU export and drop the _GPL to restore the previous state which allows users to utilize the cards they payed for. As always I'm really thrilled to make this kind of change to support the #friends (or however the hot hashtag of today is spelled) from that closet sauce graphics corp. Fixes: 1e02ce4cccdc ("x86: Store a per-cpu shadow copy of CR4") Fixes: 6fd166aae78c ("x86/mm: Use/Fix PCID to optimize user/kernel switches") Reported-by: Kees Cook <keescook@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Backlund <tmb@mageia.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:54 +01:00
Helge Deller	91dfc41e75	parisc: qemu idle sleep support commit 310d82784fb4d60c80569f5ca9f53a7f3bf1d477 upstream. Add qemu idle sleep support when running under qemu with SeaBIOS PDC firmware. Like the power architecture we use the "or" assembler instructions, which translate to nops on real hardware, to indicate that qemu shall idle sleep. Signed-off-by: Helge Deller <deller@gmx.de> Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:54 +01:00
Helge Deller	14c06206b9	parisc: Fix alignment of pa_tlb_lock in assembly on 32-bit SMP kernel commit 88776c0e70be0290f8357019d844aae15edaa967 upstream. Qemu for PARISC reported on a 32bit SMP parisc kernel strange failures about "Not-handled unaligned insn 0x0e8011d6 and 0x0c2011c9." Those opcodes evaluate to the ldcw() assembly instruction which requires (on 32bit) an alignment of 16 bytes to ensure atomicity. As it turns out, qemu is correct and in our assembly code in entry.S and pacache.S we don't pay attention to the required alignment. This patch fixes the problem by aligning the lock offset in assembly code in the same manner as we do in our C-code. Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:54 +01:00
Tom Lendacky	dd43c465ba	x86/microcode/AMD: Add support for fam17h microcode loading commit f4e9b7af0cd58dd039a0fb2cd67d57cea4889abf upstream. The size for the Microcode Patch Block (MPB) for an AMD family 17h processor is 3200 bytes. Add a #define for fam17h so that it does not default to 2048 bytes and fail a microcode load/update. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@alien8.de> Link: https://lkml.kernel.org/r/20171130224640.15391.40247.stgit@tlendack-t1.amdoffice.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Alice Ferrazzi <alicef@gentoo.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:53 +01:00
Aaron Ma	2b009d33f4	Input: elantech - add new icbody type 15 commit 10d900303f1c3a821eb0bef4e7b7ece16768fba4 upstream. The touchpad of Lenovo Thinkpad L480 reports it's version as 15. Signed-off-by: Aaron Ma <aaron.ma@canonical.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:53 +01:00
Vineet Gupta	cc1349fa9c	ARC: uaccess: dont use "l" gcc inline asm constraint modifier commit 79435ac78d160e4c245544d457850a56f805ac0d upstream. This used to setup the LP_COUNT register automatically, but now has been removed. There was an earlier fix 3c7c7a2fc8811 which fixed instance in delay.h but somehow missed this one as gcc change had not made its way into production toolchains and was not pedantic as it is now ! Signed-off-by: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:53 +01:00
Robin Murphy	e6a897a684	iommu/arm-smmu-v3: Cope with duplicated Stream IDs commit 563b5cbe334e9503ab2b234e279d500fc4f76018 upstream. For PCI devices behind an aliasing PCIe-to-PCI/X bridge, the bridge alias to DevFn 0.0 on the subordinate bus may match the original RID of the device, resulting in the same SID being present in the device's fwspec twice. This causes trouble later in arm_smmu_write_strtab_ent() when we wind up visiting the STE a second time and find it already live. Avoid the issue by giving arm_smmu_install_ste_for_dev() the cleverness to skip over duplicates. It seems mildly counterintuitive compared to preventing the duplicates from existing in the first place, but since the DT and ACPI probe paths build their fwspecs differently, this is actually the cleanest and most self-contained way to deal with it. Fixes: 8f78515425da ("iommu/arm-smmu: Implement of_xlate() for SMMUv3") Reported-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com> Tested-by: Tomasz Nowicki <Tomasz.Nowicki@cavium.com> Tested-by: Jayachandran C. <jnair@caviumnetworks.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:53 +01:00
Jean-Philippe Brucker	03975faee7	iommu/arm-smmu-v3: Don't free page table ops twice commit 57d72e159b60456c8bb281736c02ddd3164037aa upstream. Kasan reports a double free when finalise_stage_fn fails: the io_pgtable ops are freed by arm_smmu_domain_finalise and then again by arm_smmu_domain_free. Prevent this by leaving pgtbl_ops empty on failure. Fixes: 48ec83bcbcf5 ("iommu/arm-smmu: Add initial driver support for ARM SMMUv3 devices") Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:53 +01:00
Oleg Nesterov	4d53eb4949	kernel/signal.c: remove the no longer needed SIGNAL_UNKILLABLE check in complete_signal() commit 426915796ccaf9c2bd9bb06dc5702225957bc2e5 upstream. complete_signal() checks SIGNAL_UNKILLABLE before it starts to destroy the thread group, today this is wrong in many ways. If nothing else, fatal_signal_pending() should always imply that the whole thread group (except ->group_exit_task if it is not NULL) is killed, this check breaks the rule. After the previous changes we can rely on sig_task_ignored(); sig_fatal(sig) && SIGNAL_UNKILLABLE can only be true if we actually want to kill this task and sig == SIGKILL OR it is traced and debugger can intercept the signal. This should hopefully fix the problem reported by Dmitry. This test-case static int init(void arg) { for (;;) pause(); } int main(void) { char stack[16 1024]; for (;;) { int pid = clone(init, stack + sizeof(stack)/2, CLONE_NEWPID \| SIGCHLD, NULL); assert(pid > 0); assert(ptrace(PTRACE_ATTACH, pid, 0, 0) == 0); assert(waitpid(-1, NULL, WSTOPPED) == pid); assert(ptrace(PTRACE_DETACH, pid, 0, SIGSTOP) == 0); assert(syscall(__NR_tkill, pid, SIGKILL) == 0); assert(pid == wait(NULL)); } } triggers the WARN_ON_ONCE(!(task->jobctl & JOBCTL_STOP_PENDING)) in task_participate_group_stop(). do_signal_stop()->signal_group_exit() checks SIGNAL_GROUP_EXIT and return false, but task_set_jobctl_pending() checks fatal_signal_pending() and does not set JOBCTL_STOP_PENDING. And his should fix the minor security problem reported by Kyle, SECCOMP_RET_TRACE can miss fatal_signal_pending() the same way if the task is the root of a pid namespace. Link: http://lkml.kernel.org/r/20171103184246.GD21036@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Kyle Huey <me@kylehuey.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kyle Huey <me@kylehuey.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:53 +01:00
Oleg Nesterov	794ac8ef9b	kernel/signal.c: protect the SIGNAL_UNKILLABLE tasks from !sig_kernel_only() signals commit ac25385089f673560867eb5179228a44ade0cfc1 upstream. Change sig_task_ignored() to drop the SIG_DFL && !sig_kernel_only() signals even if force == T. This simplifies the next change and this matches the same check in get_signal() which will drop these signals anyway. Link: http://lkml.kernel.org/r/20171103184227.GC21036@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Kyle Huey <me@kylehuey.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:52 +01:00
Oleg Nesterov	1453b3ac6c	kernel/signal.c: protect the traced SIGNAL_UNKILLABLE tasks from SIGKILL commit 628c1bcba204052d19b686b5bac149a644cdb72e upstream. The comment in sig_ignored() says "Tracers may want to know about even ignored signals" but SIGKILL can not be reported to debugger and it is just wrong to return 0 in this case: SIGKILL should only kill the SIGNAL_UNKILLABLE task if it comes from the parent ns. Change sig_ignored() to ignore ->ptrace if sig == SIGKILL and rely on sig_task_ignored(). SISGTOP coming from within the namespace is not really right too but at least debugger can intercept it, and we can't drop it here because this will break "gdb -p 1": ptrace_attach() won't work. Perhaps we will add another ->ptrace check later, we will see. Link: http://lkml.kernel.org/r/20171103184206.GB21036@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Kyle Huey <me@kylehuey.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:52 +01:00
Thiago Rafael Becker	79258d9834	kernel: make groups_sort calling a responsibility group_info allocators commit bdcf0a423ea1c40bbb40e7ee483b50fc8aa3d758 upstream. In testing, we found that nfsd threads may call set_groups in parallel for the same entry cached in auth.unix.gid, racing in the call of groups_sort, corrupting the groups for that entry and leading to permission denials for the client. This patch: - Make groups_sort globally visible. - Move the call to groups_sort to the modifiers of group_info - Remove the call to groups_sort from set_groups Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: NeilBrown <neilb@suse.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:52 +01:00
Jens Axboe	3a381abc5b	nbd: fix use-after-free of rq/bio in the xmit path commit 429a787be6793554ee02aacc7e1f11ebcecc4453 upstream. For writes, we can get a completion in while we're still iterating the request and bio chain. If that happens, we're reading freed memory and we can crash. Break out after the last segment and avoid having the iterator read freed memory. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:52 +01:00
David Howells	2b9b2002e0	fscache: Fix the default for fscache_maybe_release_page() commit 98801506552593c9b8ac11021b0cdad12cab4f6b upstream. Fix the default for fscache_maybe_release_page() for when the cookie isn't valid or the page isn't cached. It mustn't return false as that indicates the page cannot yet be freed. The problem with the default is that if, say, there's no cache, but a network filesystem's pages are using up almost all the available memory, a system can OOM because the filesystem ->releasepage() op will not allow them to be released as fscache_maybe_release_page() incorrectly prevents it. This can be tested by writing a sequence of 512MiB files to an AFS mount. It does not affect NFS or CIFS because both of those wrap the call in a check of PG_fscache and it shouldn't bother Ceph as that only has PG_private set whilst writeback is in progress. This might be an issue for 9P, however. Note that the pages aren't entirely stuck. Removing a file or unmounting will clear things because that uses ->invalidatepage() instead. Fixes: 201a15428bd5 ("FS-Cache: Handle pages pending storage that get evicted under OOM conditions") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:52 +01:00
Stefan Brüns	34fa2eede0	sunxi-rsb: Include OF based modalias in device uevent commit e2bf801ecd4e62222a46d1ba9e57e710171d29c1 upstream. Include the OF-based modalias in the uevent sent when registering devices on the sunxi RSB bus, so that user space has a chance to autoload the kernel module for the device. Fixes a regression caused by commit 3f241bfa60bd ("arm64: allwinner: a64: pine64: Use dcdc1 regulator for mmc0"). When the axp20x-rsb module for the AXP803 PMIC is built as a module, it is not loaded and the system ends up with an disfunctional MMC controller. Fixes: d787dcdb9c8f ("bus: sunxi-rsb: Add driver for Allwinner Reduced Serial Bus") Acked-by: Chen-Yu Tsai <wens@csie.org> Signed-off-by: Stefan Brüns <stefan.bruens@rwth-aachen.de> Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:52 +01:00
Eric Biggers	c195a4c023	crypto: pcrypt - fix freeing pcrypt instances commit d76c68109f37cb85b243a1cf0f40313afd2bae68 upstream. pcrypt is using the old way of freeing instances, where the ->free() method specified in the 'struct crypto_template' is passed a pointer to the 'struct crypto_instance'. But the crypto_instance is being kfree()'d directly, which is incorrect because the memory was actually allocated as an aead_instance, which contains the crypto_instance at a nonzero offset. Thus, the wrong pointer was being kfree()'d. Fix it by switching to the new way to free aead_instance's where the ->free() method is specified in the aead_instance itself. Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: 0496f56065e0 ("crypto: pcrypt - Add support for new AEAD interface") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:51 +01:00
Eric Biggers	868f50b95d	crypto: chacha20poly1305 - validate the digest size commit e57121d08c38dabec15cf3e1e2ad46721af30cae upstream. If the rfc7539 template was instantiated with a hash algorithm with digest size larger than 16 bytes (POLY1305_DIGEST_SIZE), then the digest overran the 'tag' buffer in 'struct chachapoly_req_ctx', corrupting the subsequent memory, including 'cryptlen'. This caused a crash during crypto_skcipher_decrypt(). Fix it by, when instantiating the template, requiring that the underlying hash algorithm has the digest size expected for Poly1305. Reproducer: #include <linux/if_alg.h> #include <sys/socket.h> #include <unistd.h> int main() { int algfd, reqfd; struct sockaddr_alg addr = { .salg_type = "aead", .salg_name = "rfc7539(chacha20,sha256)", }; unsigned char buf[32] = { 0 }; algfd = socket(AF_ALG, SOCK_SEQPACKET, 0); bind(algfd, (void *)&addr, sizeof(addr)); setsockopt(algfd, SOL_ALG, ALG_SET_KEY, buf, sizeof(buf)); reqfd = accept(algfd, 0, 0); write(reqfd, buf, 16); read(reqfd, buf, 16); } Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: 71ebc4d1b27d ("crypto: chacha20poly1305 - Add a ChaCha20-Poly1305 AEAD construction, RFC7539") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:51 +01:00
Jan Engelhardt	f6db86f31b	crypto: n2 - cure use after free commit 203f45003a3d03eea8fa28d74cfc74c354416fdb upstream. queue_cache_init is first called for the Control Word Queue (n2_crypto_probe). At that time, queue_cache[0] is NULL and a new kmem_cache will be allocated. If the subsequent n2_register_algs call fails, the kmem_cache will be released in queue_cache_destroy, but queue_cache_init[0] is not set back to NULL. So when the Module Arithmetic Unit gets probed next (n2_mau_probe), queue_cache_init will not allocate a kmem_cache again, but leave it as its bogus value, causing a BUG() to trigger when queue_cache[0] is eventually passed to kmem_cache_zalloc: n2_crypto: Found N2CP at /virtual-devices@100/n2cp@7 n2_crypto: Registered NCS HVAPI version 2.0 called queue_cache_init n2_crypto: md5 alg registration failed n2cp f028687c: /virtual-devices@100/n2cp@7: Unable to register algorithms. called queue_cache_destroy n2cp: probe of f028687c failed with error -22 n2_crypto: Found NCP at /virtual-devices@100/ncp@6 n2_crypto: Registered NCS HVAPI version 2.0 called queue_cache_init kernel BUG at mm/slab.c:2993! Call Trace: [0000000000604488] kmem_cache_alloc+0x1a8/0x1e0 (inlined) kmem_cache_zalloc (inlined) new_queue (inlined) spu_queue_setup (inlined) handle_exec_unit [0000000010c61eb4] spu_mdesc_scan+0x1f4/0x460 [n2_crypto] [0000000010c62b80] n2_mau_probe+0x100/0x220 [n2_crypto] [000000000084b174] platform_drv_probe+0x34/0xc0 Signed-off-by: Jan Engelhardt <jengelh@inai.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:51 +01:00
Oleg Nesterov	790080ce0e	kernel/acct.c: fix the acct->needcheck check in check_free_space() commit 4d9570158b6260f449e317a5f9ed030c2504a615 upstream. As Tsukada explains, the time_is_before_jiffies(acct->needcheck) check is very wrong, we need time_is_after_jiffies() to make sys_acct() work. Ignoring the overflows, the code should "goto out" if needcheck > jiffies, while currently it checks "needcheck < jiffies" and thus in the likely case check_free_space() does nothing until jiffies overflow. In particular this means that sys_acct() is simply broken, acct_on() sets acct->needcheck = jiffies and expects that check_free_space() should set acct->active = 1 after the free-space check, but this won't happen if jiffies increments in between. This was broken by commit 32dc73086015 ("get rid of timer in kern/acct.c") in 2011, then another (correct) commit 795a2f22a8ea ("acct() should honour the limits from the very beginning") made the problem more visible. Link: http://lkml.kernel.org/r/20171213133940.GA6554@redhat.com Fixes: 32dc73086015 ("get rid of timer in kern/acct.c") Reported-by: TSUKADA Koutaro <tsukada@ascade.co.jp> Suggested-by: TSUKADA Koutaro <tsukada@ascade.co.jp> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-10 09:29:51 +01:00
Greg Kroah-Hartman	9f74755895	Linux 4.9.75	2018-01-05 15:46:36 +01:00
Guenter Roeck	92fd81f772	kaiser: Set _PAGE_NX only if supported This resolves a crash if loaded under qemu + haxm under windows. See https://www.spinics.net/lists/kernel/msg2689835.html for details. Here is a boot log (the log is from chromeos-4.4, but Tao Wu says that the same log is also seen with vanilla v4.4.110-rc1). [ 0.712750] Freeing unused kernel memory: 552K [ 0.721821] init: Corrupted page table at address 57b029b332e0 [ 0.722761] PGD 80000000bb238067 PUD bc36a067 PMD bc369067 PTE 45d2067 [ 0.722761] Bad pagetable: 000b [#1] PREEMPT SMP [ 0.722761] Modules linked in: [ 0.722761] CPU: 1 PID: 1 Comm: init Not tainted 4.4.96 #31 [ 0.722761] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014 [ 0.722761] task: ffff8800bc290000 ti: ffff8800bc28c000 task.ti: ffff8800bc28c000 [ 0.722761] RIP: 0010:[<ffffffff83f4129e>] [<ffffffff83f4129e>] __clear_user+0x42/0x67 [ 0.722761] RSP: 0000:ffff8800bc28fcf8 EFLAGS: 00010202 [ 0.722761] RAX: 0000000000000000 RBX: 00000000000001a4 RCX: 00000000000001a4 [ 0.722761] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000057b029b332e0 [ 0.722761] RBP: ffff8800bc28fd08 R08: ffff8800bc290000 R09: ffff8800bb2f4000 [ 0.722761] R10: ffff8800bc290000 R11: ffff8800bb2f4000 R12: 000057b029b332e0 [ 0.722761] R13: 0000000000000000 R14: 000057b029b33340 R15: ffff8800bb1e2a00 [ 0.722761] FS: 0000000000000000(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000 [ 0.722761] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 0.722761] CR2: 000057b029b332e0 CR3: 00000000bb2f8000 CR4: 00000000000006e0 [ 0.722761] Stack: [ 0.722761] 000057b029b332e0 ffff8800bb95fa80 ffff8800bc28fd18 ffffffff83f4120c [ 0.722761] ffff8800bc28fe18 ffffffff83e9e7a1 ffff8800bc28fd68 0000000000000000 [ 0.722761] ffff8800bc290000 ffff8800bc290000 ffff8800bc290000 ffff8800bc290000 [ 0.722761] Call Trace: [ 0.722761] [<ffffffff83f4120c>] clear_user+0x2e/0x30 [ 0.722761] [<ffffffff83e9e7a1>] load_elf_binary+0xa7f/0x18f7 [ 0.722761] [<ffffffff83de2088>] search_binary_handler+0x86/0x19c [ 0.722761] [<ffffffff83de389e>] do_execveat_common.isra.26+0x909/0xf98 [ 0.722761] [<ffffffff844febe0>] ? rest_init+0x87/0x87 [ 0.722761] [<ffffffff83de40be>] do_execve+0x23/0x25 [ 0.722761] [<ffffffff83c002e3>] run_init_process+0x2b/0x2d [ 0.722761] [<ffffffff844fec4d>] kernel_init+0x6d/0xda [ 0.722761] [<ffffffff84505b2f>] ret_from_fork+0x3f/0x70 [ 0.722761] [<ffffffff844febe0>] ? rest_init+0x87/0x87 [ 0.722761] Code: 86 84 be 12 00 00 00 e8 87 0d e8 ff 66 66 90 48 89 d8 48 c1 eb 03 4c 89 e7 83 e0 07 48 89 d9 be 08 00 00 00 31 d2 48 85 c9 74 0a <48> 89 17 48 01 f7 ff c9 75 f6 48 89 c1 85 c9 74 09 88 17 48 ff [ 0.722761] RIP [<ffffffff83f4129e>] __clear_user+0x42/0x67 [ 0.722761] RSP <ffff8800bc28fcf8> [ 0.722761] ---[ end trace def703879b4ff090 ]--- [ 0.722761] BUG: sleeping function called from invalid context at /mnt/host/source/src/third_party/kernel/v4.4/kernel/locking/rwsem.c:21 [ 0.722761] in_atomic(): 0, irqs_disabled(): 1, pid: 1, name: init [ 0.722761] CPU: 1 PID: 1 Comm: init Tainted: G D 4.4.96 #31 [ 0.722761] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014 [ 0.722761] 0000000000000086 dcb5d76098c89836 ffff8800bc28fa30 ffffffff83f34004 [ 0.722761] ffffffff84839dc2 0000000000000015 ffff8800bc28fa40 ffffffff83d57dc9 [ 0.722761] ffff8800bc28fa68 ffffffff83d57e6a ffffffff84a53640 0000000000000000 [ 0.722761] Call Trace: [ 0.722761] [<ffffffff83f34004>] dump_stack+0x4d/0x63 [ 0.722761] [<ffffffff83d57dc9>] ___might_sleep+0x13a/0x13c [ 0.722761] [<ffffffff83d57e6a>] __might_sleep+0x9f/0xa6 [ 0.722761] [<ffffffff84502788>] down_read+0x20/0x31 [ 0.722761] [<ffffffff83cc5d9b>] __blocking_notifier_call_chain+0x35/0x63 [ 0.722761] [<ffffffff83cc5ddd>] blocking_notifier_call_chain+0x14/0x16 [ 0.800374] usb 1-1: new full-speed USB device number 2 using uhci_hcd [ 0.722761] [<ffffffff83cefe97>] profile_task_exit+0x1a/0x1c [ 0.802309] [<ffffffff83cac84e>] do_exit+0x39/0xe7f [ 0.802309] [<ffffffff83ce5938>] ? vprintk_default+0x1d/0x1f [ 0.802309] [<ffffffff83d7bb95>] ? printk+0x57/0x73 [ 0.802309] [<ffffffff83c46e25>] oops_end+0x80/0x85 [ 0.802309] [<ffffffff83c7b747>] pgtable_bad+0x8a/0x95 [ 0.802309] [<ffffffff83ca7f4a>] __do_page_fault+0x8c/0x352 [ 0.802309] [<ffffffff83eefba5>] ? file_has_perm+0xc4/0xe5 [ 0.802309] [<ffffffff83ca821c>] do_page_fault+0xc/0xe [ 0.802309] [<ffffffff84507682>] page_fault+0x22/0x30 [ 0.802309] [<ffffffff83f4129e>] ? __clear_user+0x42/0x67 [ 0.802309] [<ffffffff83f4127f>] ? __clear_user+0x23/0x67 [ 0.802309] [<ffffffff83f4120c>] clear_user+0x2e/0x30 [ 0.802309] [<ffffffff83e9e7a1>] load_elf_binary+0xa7f/0x18f7 [ 0.802309] [<ffffffff83de2088>] search_binary_handler+0x86/0x19c [ 0.802309] [<ffffffff83de389e>] do_execveat_common.isra.26+0x909/0xf98 [ 0.802309] [<ffffffff844febe0>] ? rest_init+0x87/0x87 [ 0.802309] [<ffffffff83de40be>] do_execve+0x23/0x25 [ 0.802309] [<ffffffff83c002e3>] run_init_process+0x2b/0x2d [ 0.802309] [<ffffffff844fec4d>] kernel_init+0x6d/0xda [ 0.802309] [<ffffffff84505b2f>] ret_from_fork+0x3f/0x70 [ 0.802309] [<ffffffff844febe0>] ? rest_init+0x87/0x87 [ 0.830559] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 [ 0.830559] [ 0.831305] Kernel Offset: 0x2c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 0.831305] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 The crash part of this problem may be solved with the following patch (thanks to Hugh for the hint). There is still another problem, though - with this patch applied, the qemu session aborts with "VCPU Shutdown request", whatever that means. Cc: lepton <ytht.net@gmail.com> Signed-off-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:36 +01:00
Kees Cook	ea6cd39d23	KPTI: Report when enabled Make sure dmesg reports when KPTI is enabled. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:36 +01:00
Kees Cook	e71fac0172	KPTI: Rename to PAGE_TABLE_ISOLATION This renames CONFIG_KAISER to CONFIG_PAGE_TABLE_ISOLATION. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:35 +01:00
Borislav Petkov	59094faf3f	x86/kaiser: Move feature detection up ... before the first use of kaiser_enabled as otherwise funky things happen: about to get started... (XEN) d0v0 Unhandled page fault fault/trap [#14, ec=0000] (XEN) Pagetable walk from ffff88022a449090: (XEN) L4[0x110] = 0000000229e0e067 0000000000001e0e (XEN) L3[0x008] = 0000000000000000 ffffffffffffffff (XEN) domain_crash_sync called from entry.S: fault at ffff82d08033fd08 entry.o#create_bounce_frame+0x135/0x14d (XEN) Domain 0 (vcpu#0) crashed on cpu#0: (XEN) ----[ Xen-4.9.1_02-3.21 x86_64 debug=n Not tainted ]---- (XEN) CPU: 0 (XEN) RIP: e033:[<ffffffff81007460>] (XEN) RFLAGS: 0000000000000286 EM: 1 CONTEXT: pv guest (d0v0) Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:35 +01:00
Jiri Kosina	402e63de94	kaiser: disabled on Xen PV Kaiser cannot be used on paravirtualized MMUs (namely reading and writing CR3). This does not work with KAISER as the CR3 switch from and to user space PGD would require to map the whole XEN_PV machinery into both. More importantly, enabling KAISER on Xen PV doesn't make too much sense, as PV guests use distinct %cr3 values for kernel and user already. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:35 +01:00
Borislav Petkov	2c2721754a	x86/kaiser: Reenable PARAVIRT Now that the required bits have been addressed, reenable PARAVIRT. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:35 +01:00
Thomas Gleixner	1817d2c2fa	x86/paravirt: Dont patch flush_tlb_single commit a035795499ca1c2bd1928808d1a156eda1420383 upstream native_flush_tlb_single() will be changed with the upcoming PAGE_TABLE_ISOLATION feature. This requires to have more code in there than INVLPG. Remove the paravirt patching for it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: linux-mm@kvack.org Cc: michael.schwarz@iaik.tugraz.at Cc: moritz.lipp@iaik.tugraz.at Cc: richard.fellner@student.tugraz.at Link: https://lkml.kernel.org/r/20171204150606.828111617@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:35 +01:00
Hugh Dickins	fe5cb75fd2	kaiser: kaiser_flush_tlb_on_return_to_user() check PCID Let kaiser_flush_tlb_on_return_to_user() do the X86_FEATURE_PCID check, instead of each caller doing it inline first: nobody needs to optimize for the noPCID case, it's clearer this way, and better suits later changes. Replace those no-op X86_CR3_PCID_KERN_FLUSH lines by a BUILD_BUG_ON() in load_new_mm_cr3(), in case something changes. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:35 +01:00
Hugh Dickins	b72c26e911	kaiser: asm/tlbflush.h handle noPGE at lower level I found asm/tlbflush.h too twisty, and think it safer not to avoid __native_flush_tlb_global_irq_disabled() in the kaiser_enabled case, but instead let it handle kaiser_enabled along with cr3: it can just use __native_flush_tlb() for that, no harm in re-disabling preemption. (This is not the same change as Kirill and Dave have suggested for upstream, flipping PGE in cr4: that's neat, but needs a cpu_has_pge check; cr3 is enough for kaiser, and thought to be cheaper than cr4.) Also delete the X86_FEATURE_INVPCID invpcid_flush_all_nonglobals() preference from __native_flush_tlb(): unlike the invpcid_flush_all() preference in __native_flush_tlb_global(), it's not seen in upstream 4.14, and was recently reported to be surprisingly slow. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:35 +01:00
Hugh Dickins	8c2f8a5cc1	kaiser: drop is_atomic arg to kaiser_pagetable_walk() I have not observed a might_sleep() warning from setup_fixmap_gdt()'s use of kaiser_add_mapping() in our tree (why not?), but like upstream we have not provided a way for that to pass is_atomic true down to kaiser_pagetable_walk(), and at startup it's far from a likely source of trouble: so just delete the walk's is_atomic arg and might_sleep(). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:35 +01:00
Hugh Dickins	169b369f99	kaiser: use ALTERNATIVE instead of x86_cr3_pcid_noflush Now that we're playing the ALTERNATIVE game, use that more efficient method: instead of user-mapping an extra page, and reading an extra cacheline each time for x86_cr3_pcid_noflush. Neel has found that __stringify(bts $X86_CR3_PCID_NOFLUSH_BIT, %rax) is a working substitute for the "bts $63, %rax" in these ALTERNATIVEs; but the one line with $63 in looks clearer, so let's stick with that. Worried about what happens with an ALTERNATIVE between the jump and jump label in another ALTERNATIVE? I was, but have checked the combinations in SWITCH_KERNEL_CR3_NO_STACK at entry_SYSCALL_64, and it does a good job. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:35 +01:00
Borislav Petkov	8018307a45	x86/kaiser: Check boottime cmdline params AMD (and possibly other vendors) are not affected by the leak KAISER is protecting against. Keep the "nopti" for traditional reasons and add pti=<on\|off\|auto> like upstream. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:35 +01:00
Borislav Petkov	50624dd12d	x86/kaiser: Rename and simplify X86_FEATURE_KAISER handling Concentrate it in arch/x86/mm/kaiser.c and use the upstream string "nopti". Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:34 +01:00
Hugh Dickins	23e09439aa	kaiser: add "nokaiser" boot option, using ALTERNATIVE Added "nokaiser" boot option: an early param like "noinvpcid". Most places now check int kaiser_enabled (#defined 0 when not CONFIG_KAISER) instead of #ifdef CONFIG_KAISER; but entry_64.S and entry_64_compat.S are using the ALTERNATIVE technique, which patches in the preferred instructions at runtime. That technique is tied to x86 cpu features, so X86_FEATURE_KAISER is fabricated. Prior to "nokaiser", Kaiser #defined _PAGE_GLOBAL 0: revert that, but be careful with both _PAGE_GLOBAL and CR4.PGE: setting them when nokaiser like when !CONFIG_KAISER, but not setting either when kaiser - neither matters on its own, but it's hard to be sure that _PAGE_GLOBAL won't get set in some obscure corner, or something add PGE into CR4. By omitting _PAGE_GLOBAL from __supported_pte_mask when kaiser_enabled, all page table setup which uses pte_pfn() masks it out of the ptes. It's slightly shameful that the same declaration versus definition of kaiser_enabled appears in not one, not two, but in three header files (asm/kaiser.h, asm/pgtable.h, asm/tlbflush.h). I felt safer that way, than with #including any of those in any of the others; and did not feel it worth an asm/kaiser_enabled.h - kernel/cpu/common.c includes them all, so we shall hear about it if they get out of synch. Cleanups while in the area: removed the silly #ifdef CONFIG_KAISER from kaiser.c; removed the unused native_get_normal_pgd(); removed the spurious reg clutter from SWITCH_*_CR3 macro stubs; corrected some comments. But more interestingly, set CR4.PSE in secondary_startup_64: the manual is clear that it does not matter whether it's 0 or 1 when 4-level-pts are enabled, but I was distracted to find cr4 different on BSP and auxiliaries - BSP alone was adding PSE, in probe_page_size_mask(). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:34 +01:00
Hugh Dickins	cb7d8d7e67	kaiser: fix unlikely error in alloc_ldt_struct() An error from kaiser_add_mapping() here is not at all likely, but Eric Biggers rightly points out that __free_ldt_struct() relies on new_ldt->size being initialized: move that up. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:34 +01:00
Hugh Dickins	3df1461787	kaiser: kaiser_remove_mapping() move along the pgd When removing the bogus comment from kaiser_remove_mapping(), I really ought to have checked the extent of its bogosity: as Neel points out, there is nothing to stop unmap_pud_range_nofree() from continuing beyond the end of a pud (and starting in the wrong position on the next). Fix kaiser_remove_mapping() to constrain the extent and advance pgd pointer correctly: use pgd_addr_end() macro as used throughout base mm (but don't assume page-rounded start and size in this case). But this bug was very unlikely to trigger in this backport: since any buddy allocation is contained within a single pud extent, and we are not using vmapped stacks (and are only mapping one page of stack anyway): the only way to hit this bug here would be when freeing a large modified ldt. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:34 +01:00
Hugh Dickins	05ddad146d	kaiser: paranoid_entry pass cr3 need to paranoid_exit Neel Natu points out that paranoid_entry() was wrong to assume that an entry that did not need swapgs would not need SWITCH_KERNEL_CR3: paranoid_entry (used for debug breakpoint, int3, double fault or MCE; though I think it's only the MCE case that is cause for concern here) can break in at an awkward time, between cr3 switch and swapgs, but its handling always needs kernel gs and kernel cr3. Easy to fix in itself, but paranoid_entry() also needs to convey to paranoid_exit() (and my reading of macro idtentry says paranoid_entry and paranoid_exit are always paired) how to restore the prior state. The swapgs state is already conveyed by %ebx (0 or 1), so extend that also to convey when SWITCH_USER_CR3 will be needed (2 or 3). (Yes, I'd much prefer that 0 meant no swapgs, whereas it's the other way round: and a convention shared with error_entry() and error_exit(), which I don't want to touch. Perhaps I should have inverted the bit for switch cr3 too, but did not.) paranoid_exit() would be straightforward, except for TRACE_IRQS: it did TRACE_IRQS_IRETQ when doing swapgs, but TRACE_IRQS_IRETQ_DEBUG when not: which is it supposed to use when SWITCH_USER_CR3 is split apart from that? As best as I can determine, commit 5963e317b1e9 ("ftrace/x86: Do not change stacks in DEBUG when calling lockdep") missed the swapgs case, and should have used TRACE_IRQS_IRETQ_DEBUG there too (the discrepancy has nothing to do with the liberal use of _NO_STACK and _UNSAFE_STACK hereabouts: TRACE_IRQS_OFF_DEBUG has just been used in all cases); discrepancy lovingly preserved across several paranoid_exit() cleanups, but I'm now removing it. Neel further indicates that to use SWITCH_USER_CR3_NO_STACK there in paranoid_exit() is now not only unnecessary but unsafe: might corrupt syscall entry's unsafe_stack_register_backup of %rax. Just use SWITCH_USER_CR3: and delete SWITCH_USER_CR3_NO_STACK altogether, before we make the mistake of using it again. hughd adds: this commit fixes an issue in the Kaiser-without-PCIDs part of the series, and ought to be moved earlier, if you decided to make a release of Kaiser-without-PCIDs. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:34 +01:00
Hugh Dickins	d0142ceb79	kaiser: x86_cr3_pcid_noflush and x86_cr3_pcid_user Mostly this commit is just unshouting X86_CR3_PCID_KERN_VAR and X86_CR3_PCID_USER_VAR: we usually name variables in lower-case. But why does x86_cr3_pcid_noflush need to be __aligned(PAGE_SIZE)? Ah, it's a leftover from when kaiser_add_user_map() once complained about mapping the same page twice. Make it __read_mostly instead. (I'm a little uneasy about all the unrelated data which shares its page getting user-mapped too, but that was so before, and not a big deal: though we call it user-mapped, it's not mapped with _PAGE_USER.) And there is a little change around the two calls to do_nmi(). Previously they set the NOFLUSH bit (if PCID supported) when forcing to kernel context before do_nmi(); now they also have the NOFLUSH bit set (if PCID supported) when restoring context after: nothing done in do_nmi() should require a TLB to be flushed here. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:34 +01:00
Hugh Dickins	6a2b411761	kaiser: PCID 0 for kernel and 128 for user Why was 4 chosen for kernel PCID and 6 for user PCID? No good reason in a backport where PCIDs are only used for Kaiser. If we continue with those, then we shall need to add Andy Lutomirski's 4.13 commit 6c690ee1039b ("x86/mm: Split read_cr3() into read_cr3_pa() and __read_cr3()"), which deals with the problem of read_cr3() callers finding stray bits in the cr3 that they expected to be page-aligned; and for hibernation, his 4.14 commit f34902c5c6c0 ("x86/hibernate/64: Mask off CR3's PCID bits in the saved CR3"). But if 0 is used for kernel PCID, then there's no need to add in those commits - whenever the kernel looks, it sees 0 in the lower bits; and 0 for kernel seems an obvious choice. And I naughtily propose 128 for user PCID. Because there's a place in _SWITCH_TO_USER_CR3 where it takes note of the need for TLB FLUSH, but needs to reset that to NOFLUSH for the next occasion. Currently it does so with a "movb $(0x80)" into the high byte of the per-cpu quadword, but that will cause a machine without PCID support to crash. Now, if %al just happened to have 0x80 in it at that point, on a machine with PCID support, but 0 on a machine without PCID support... (That will go badly wrong once the pgd can be at a physical address above 2^56, but even with 5-level paging, physical goes up to 2^52.) Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:34 +01:00
Hugh Dickins	0b5ca9d995	kaiser: load_new_mm_cr3() let SWITCH_USER_CR3 flush user We have many machines (Westmere, Sandybridge, Ivybridge) supporting PCID but not INVPCID: on these load_new_mm_cr3() simply crashed. Flushing user context inside load_new_mm_cr3() without the use of invpcid is difficult: momentarily switch from kernel to user context and back to do so? I'm not sure whether that can be safely done at all, and would risk polluting user context with kernel internals, and kernel context with stale user externals. Instead, follow the hint in the comment that was there: change X86_CR3_PCID_USER_VAR to be a per-cpu variable, then load_new_mm_cr3() can leave a note in it, for SWITCH_USER_CR3 on return to userspace to flush user context TLB, instead of default X86_CR3_PCID_USER_NOFLUSH. Which works well enough that there's no need to do it this way only when invpcid is unsupported: it's a good alternative to invpcid here. But there's a couple of inlines in asm/tlbflush.h that need to do the same trick, so it's best to localize all this per-cpu business in mm/kaiser.c: moving that part of the initialization from setup_pcid() to kaiser_setup_pcid(); with kaiser_flush_tlb_on_return_to_user() the function for noting an X86_CR3_PCID_USER_FLUSH. And let's keep a KAISER_SHADOW_PGD_OFFSET in there, to avoid the extra OR on exit. I did try to make the feature tests in asm/tlbflush.h more consistent with each other: there seem to be far too many ways of performing such tests, and I don't have a good grasp of their differences. At first I converted them all to be static_cpu_has(): but that proved to be a mistake, as the comment in __native_flush_tlb_single() hints; so then I reversed and made them all this_cpu_has(). Probably all gratuitous change, but that's the way it's working at present. I am slightly bothered by the way non-per-cpu X86_CR3_PCID_KERN_VAR gets re-initialized by each cpu (before and after these changes): no problem when (as usual) all cpus on a machine have the same features, but in principle incorrect. However, my experiment to per-cpu-ify that one did not end well... Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:34 +01:00
Hugh Dickins	2684b12a16	kaiser: enhanced by kernel and user PCIDs Merged performance improvements to Kaiser, using distinct kernel and user Process Context Identifiers to minimize the TLB flushing. [This work actually all from Dave Hansen 2017-08-30: still omitting trackswitch mods, and KAISER_REAL_SWITCH deleted.] Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:34 +01:00
Hugh Dickins	1972bb9d92	kaiser: vmstat show NR_KAISERTABLE as nr_overhead The kaiser update made an interesting choice, never to free any shadow page tables. Contention on global spinlock was worrying, particularly with it held across page table scans when freeing. Something had to be done: I was going to add refcounting; but simply never to free them is an appealing choice, minimizing contention without complicating the code (the more a page table is found already, the less the spinlock is used). But leaking pages in this way is also a worry: can we get away with it? At the very least, we need a count to show how bad it actually gets: in principle, one might end up wasting about 1/256 of memory that way (1/512 for when direct-mapped pages have to be user-mapped, plus 1/512 for when they are user-mapped from the vmalloc area on another occasion (but we don't have vmalloc'ed stacks, so only large ldts are vmalloc'ed). Add per-cpu stat NR_KAISERTABLE: including 256 at startup for the shared pgd entries, and 1 for each intermediate page table added thereafter for user-mapping - but leave out the 1 per mm, for its shadow pgd, because that distracts from the monotonic increase. Shown in /proc/vmstat as nr_overhead (0 if kaiser not enabled). In practice, it doesn't look so bad so far: more like 1/12000 after nine hours of gtests below; and movable pageblock segregation should tend to cluster the kaiser tables into a subset of the address space (if not, they will be bad for compaction too). But production may tell a different story: keep an eye on this number, and bring back lighter freeing if it gets out of control (maybe a shrinker). ["nr_overhead" should of course say "nr_kaisertable", if it needs to stay; but for the moment we are being coy, preferring that when Joe Blow notices a new line in his /proc/vmstat, he does not get too curious about what this "kaiser" stuff might be.] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:33 +01:00
Hugh Dickins	1ce27de401	kaiser: delete KAISER_REAL_SWITCH option We fail to see what CONFIG_KAISER_REAL_SWITCH is for: it seems to be left over from early development, and now just obscures tricky parts of the code. Delete it before adding PCIDs, or nokaiser boot option. (Or if there is some good reason to keep the option, then it needs a help text - and a "depends on KAISER", so that all those without KAISER are not asked the question. But we'd much rather delete it.) Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:33 +01:00
Hugh Dickins	c27cdea56c	kaiser: name that 0x1000 KAISER_SHADOW_PGD_OFFSET There's a 0x1000 in various places, which looks better with a name. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:33 +01:00
Hugh Dickins	61b7a404fa	kaiser: cleanups while trying for gold link While trying to get our gold link to work, four cleanups: matched the gdt_page declaration to its definition; in fiddling unsuccessfully with PERCPU_INPUT(), lined up backslashes; lined up the backslashes according to convention in percpu-defs.h; deleted the unused irq_stack_pointer addition to irq_stack_union. Sad to report that aligning backslashes does not appear to help gold align to 8192: but while these did not help, they are worth keeping. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-05 15:46:33 +01:00

1 2 3 4 5 ...

640954 Commits