linux

iv/linux

Author	SHA1	Message	Date
Linus Torvalds	5465a324af	A single fix for a printk format warning in RCU. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8B8gcTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoXDoEADCdEVs/qLktFXrN17i6Oeju7h6oQ2m 8iI1GDStY5zj4Jk6FdQB864XXHbkNO7C096IqJOZud66k+lj3sEk9lE24tpgZqi/ gHVCueUoKF5ZYNyEtPkSDzHcr/IJg3iueQyShTbGotvGbF/gBAWJtuIq3sVpaD+Q qvZYASQMkBNrRcEgxzaTr286MJ4lIQ61ujwRWQJV4woQgAqjeTrOKQ+qOoKCZVfB c1glieDNLwZvs/534zsBLRj7ApvuJ2SyHXhfC9byIitUb1RdZ/1gAkiteX/K6ici PXoPamBsd+gSEdfWN69HB+cWqPqJ8Gq8M2zcmp3KSrg4IrXTVrnYHmymH3tN5Vbe p3my9/rH/yDv1kgcRgOlgL7ykz6W2oCr7LrTrQ7fupOXrR7UrW0dSsEcFRbWwoBn 7dlfdEI4Q7ay9GPN2f7QOiaGGE+Bi76iCXTjRTFzcEQHiwO6W1bLoSu8qtncYvke 2PaDrE4V/2CWjOuE37mw3IPsjUEOJNKC2+H3y+J9ma94CX4kAuZzH2LS4oHO8ww1 eyF1HSHOKh1tuY9RhNnsyh+1V2Iao6T0BjUMnG/c01xFeEz+lE+e1JRdTSxa5BOr zKSSuEei7z6m9t7Yn2DSU8YaKn8ZbF8JpfC5nGFZTMBh64EOEaWGhQUr7ttuFQxi 7JF6WVarhn5FHw== =p6ry -----END PGP SIGNATURE----- Merge tag 'core-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull rcu fixlet from Thomas Gleixner: "A single fix for a printk format warning in RCU" * tag 'core-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rcuperf: Fix printk format warning	2020-07-05 12:21:28 -07:00
David S. Miller	f91c031e65	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2020-07-04 The following pull-request contains BPF updates for your net-next tree. We've added 73 non-merge commits during the last 17 day(s) which contain a total of 106 files changed, 5233 insertions(+), 1283 deletions(-). The main changes are: 1) bpftool ability to show PIDs of processes having open file descriptors for BPF map/program/link/BTF objects, relying on BPF iterator progs to extract this info efficiently, from Andrii Nakryiko. 2) Addition of BPF iterator progs for dumping TCP and UDP sockets to seq_files, from Yonghong Song. 3) Support access to BPF map fields in struct bpf_map from programs through BTF struct access, from Andrey Ignatov. 4) Add a bpf_get_task_stack() helper to be able to dump /proc/*/stack via seq_file from BPF iterator progs, from Song Liu. 5) Make SO_KEEPALIVE and related options available to bpf_setsockopt() helper, from Dmitry Yakunin. 6) Optimize BPF sk_storage selection of its caching index, from Martin KaFai Lau. 7) Removal of redundant synchronize_rcu()s from BPF map destruction which has been a historic leftover, from Alexei Starovoitov. 8) Several improvements to test_progs to make it easier to create a shell loop that invokes each test individually which is useful for some CIs, from Jesper Dangaard Brouer. 9) Fix bpftool prog dump segfault when compiled without skeleton code on older clang versions, from John Fastabend. 10) Bunch of cleanups and minor improvements, from various others. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-07-04 17:48:34 -07:00
Christian Brauner	714acdbd1c	arch: rename copy_thread_tls() back to copy_thread() Now that HAVE_COPY_THREAD_TLS has been removed, rename copy_thread_tls() back simply copy_thread(). It's a simpler name, and doesn't imply that only tls is copied here. This finishes an outstanding chunk of internal process creation work since we've added clone3(). Cc: linux-arch@vger.kernel.org Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>A Acked-by: Stafford Horne <shorne@gmail.com> Acked-by: Greentime Hu <green.hu@gmail.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>A Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2020-07-04 23:41:37 +02:00
Christian Brauner	140c8180eb	arch: remove HAVE_COPY_THREAD_TLS All architectures support copy_thread_tls() now, so remove the legacy copy_thread() function and the HAVE_COPY_THREAD_TLS config option. Everyone uses the same process creation calling convention based on copy_thread_tls() and struct kernel_clone_args. This will make it easier to maintain the core process creation code under kernel/, simplifies the callpaths and makes the identical for all architectures. Cc: linux-arch@vger.kernel.org Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Acked-by: Greentime Hu <green.hu@gmail.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2020-07-04 23:41:37 +02:00
Christian Brauner	ff2a91127b	fork: remove do_fork() Now that all architectures have been switched to use _do_fork() and the new struct kernel_clone_args calling convention we can remove the legacy do_fork() helper completely. The calling convention used to be brittle and do_fork() didn't buy us anything. The only calling convention accepted should be based on struct kernel_clone_args going forward. It's cleaner and uniform. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2020-07-04 23:41:36 +02:00
Eric W. Biederman	1c340ead18	umd: Track user space drivers with struct pid Use struct pid instead of user space pid values that are prone to wrap araound. In addition track the entire thread group instead of just the first thread that is started by exec. There are no multi-threaded user mode drivers today but there is nothing preclucing user drivers from being multi-threaded, so it is just a good idea to track the entire process. Take a reference count on the tgid's in question to make it possible to remove exit_umh in a future change. As a struct pid is available directly use kill_pid_info. The prior process signalling code was iffy in using a userspace pid known to be in the initial pid namespace and then looking up it's task in whatever the current pid namespace is. It worked only because kernel threads always run in the initial pid namespace. As the tgid is now refcounted verify the tgid is NULL at the start of fork_usermode_driver to avoid the possibility of silent pid leaks. v1: https://lkml.kernel.org/r/87mu4qdlv2.fsf_-_@x220.int.ebiederm.org v2: https://lkml.kernel.org/r/a70l4oy8.fsf_-_@x220.int.ebiederm.org Link: https://lkml.kernel.org/r/20200702164140.4468-12-ebiederm@xmission.com Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-04 09:35:56 -05:00
Eric W. Biederman	55e6074e3f	umh: Stop calling do_execve_file With the user mode driver code changed to not set subprocess_info.file there are no more users of subproces_info.file. Remove this field from struct subprocess_info and remove the only user in call_usermodehelper_exec_async that would call do_execve_file instead of do_execve if file was set. v1: https://lkml.kernel.org/r/877dvuf0i7.fsf_-_@x220.int.ebiederm.org v2: https://lkml.kernel.org/r/87r1tx4p2a.fsf_-_@x220.int.ebiederm.org Link: https://lkml.kernel.org/r/20200702164140.4468-9-ebiederm@xmission.com Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-04 09:35:36 -05:00
Eric W. Biederman	e2dc9bf3f5	umd: Transform fork_usermode_blob into fork_usermode_driver Instead of loading a binary blob into a temporary file with shmem_kernel_file_setup load a binary blob into a temporary tmpfs filesystem. This means that the blob can be stored in an init section and discared, and it means the binary blob will have a filename so can be executed normally. The only tricky thing about this code is that in the helper function blob_to_mnt __fput_sync is used. That is because a file can not be executed if it is still open for write, and the ordinary delayed close for kernel threads does not happen soon enough, which causes the following exec to fail. The function umd_load_blob is not called with any locks so this should be safe. Executing the blob normally winds up correcting several problems with the user mode driver code discovered by Tetsuo Handa[1]. By passing an ordinary filename into the exec, it is no longer necessary to figure out how to turn a O_RDWR file descriptor into a properly referende counted O_EXEC file descriptor that forbids all writes. For path based LSMs there are no new special cases. [1] https://lore.kernel.org/linux-fsdevel/2a8775b4-1dd5-9d5c-aa42-9872445e0942@i-love.sakura.ne.jp/ v1: https://lkml.kernel.org/r/87d05mf0j9.fsf_-_@x220.int.ebiederm.org v2: https://lkml.kernel.org/r/87wo3p4p35.fsf_-_@x220.int.ebiederm.org Link: https://lkml.kernel.org/r/20200702164140.4468-8-ebiederm@xmission.com Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-04 09:35:29 -05:00
Eric W. Biederman	1199c6c3da	umd: Rename umd_info.cmdline umd_info.driver_name The only thing supplied in the cmdline today is the driver name so rename the field to clarify the code. As this value is always supplied stop trying to handle the case of a NULL cmdline. Additionally since we now have a name we can count on use the driver_name any place where the code is looking for a name of the binary. v1: https://lkml.kernel.org/r/87imfef0k3.fsf_-_@x220.int.ebiederm.org v2: https://lkml.kernel.org/r/87366d63os.fsf_-_@x220.int.ebiederm.org Link: https://lkml.kernel.org/r/20200702164140.4468-7-ebiederm@xmission.com Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-04 09:35:13 -05:00
Eric W. Biederman	74be2d3b80	umd: For clarity rename umh_info umd_info This structure is only used for user mode drivers so change the prefix from umh to umd to make that clear. v1: https://lkml.kernel.org/r/87o8p6f0kw.fsf_-_@x220.int.ebiederm.org v2: https://lkml.kernel.org/r/878sg563po.fsf_-_@x220.int.ebiederm.org Link: https://lkml.kernel.org/r/20200702164140.4468-6-ebiederm@xmission.com Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-04 09:34:39 -05:00
Eric W. Biederman	884c5e683b	umh: Separate the user mode driver and the user mode helper support This makes it clear which code is part of the core user mode helper support and which code is needed to implement user mode drivers. This makes the kernel smaller for everyone who does not use a usermode driver. v1: https://lkml.kernel.org/r/87tuyyf0ln.fsf_-_@x220.int.ebiederm.org v2: https://lkml.kernel.org/r/87imf963s6.fsf_-_@x220.int.ebiederm.org Link: https://lkml.kernel.org/r/20200702164140.4468-5-ebiederm@xmission.com Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-04 09:34:32 -05:00
Eric W. Biederman	21d5982806	umh: Remove call_usermodehelper_setup_file. The only caller of call_usermodehelper_setup_file is fork_usermode_blob. In fork_usermode_blob replace call_usermodehelper_setup_file with call_usermodehelper_setup and delete fork_usermodehelper_setup_file. For this to work the argv_free is moved from umh_clean_and_save_pid to fork_usermode_blob. v1: https://lkml.kernel.org/r/87zh8qf0mp.fsf_-_@x220.int.ebiederm.org v2: https://lkml.kernel.org/r/87o8p163u1.fsf_-_@x220.int.ebiederm.org Link: https://lkml.kernel.org/r/20200702164140.4468-4-ebiederm@xmission.com Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-04 09:34:26 -05:00
Eric W. Biederman	3a171042ae	umh: Rename the user mode driver helpers for clarity Now that the functionality of umh_setup_pipe and umh_clean_and_save_pid has changed their names are too specific and don't make much sense. Instead name them umd_setup and umd_cleanup for the functional role in setting up user mode drivers. v1: https://lkml.kernel.org/r/875zbegf82.fsf_-_@x220.int.ebiederm.org v2: https://lkml.kernel.org/r/87tuyt63x3.fsf_-_@x220.int.ebiederm.org Link: https://lkml.kernel.org/r/20200702164140.4468-3-ebiederm@xmission.com Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-04 09:34:18 -05:00
Eric W. Biederman	b044fa2ae5	umh: Move setting PF_UMH into umh_pipe_setup I am separating the code specific to user mode drivers from the code for ordinary user space helpers. Move setting of PF_UMH from call_usermodehelper_exec_async which is core user mode helper code into umh_pipe_setup which is user mode driver code. The code is equally as easy to write in one location as the other and the movement minimizes the impact of the user mode driver code on the core of the user mode helper code. Setting PF_UMH unconditionally is harmless as an action will only happen if it is paired with an entry on umh_list. v1: https://lkml.kernel.org/r/87bll6gf8t.fsf_-_@x220.int.ebiederm.org v2: https://lkml.kernel.org/r/87zh8l63xs.fsf_-_@x220.int.ebiederm.org Link: https://lkml.kernel.org/r/20200702164140.4468-2-ebiederm@xmission.com Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-04 09:34:06 -05:00
Eric W. Biederman	5fec25f2cb	umh: Capture the pid in umh_pipe_setup The pid in struct subprocess_info is only used by umh_clean_and_save_pid to write the pid into umh_info. Instead always capture the pid on struct umh_info in umh_pipe_setup, removing code that is specific to user mode drivers from the common user path of user mode helpers. v1: https://lkml.kernel.org/r/87h7uygf9i.fsf_-_@x220.int.ebiederm.org v2: https://lkml.kernel.org/r/875zb97iix.fsf_-_@x220.int.ebiederm.org Link: https://lkml.kernel.org/r/20200702164140.4468-1-ebiederm@xmission.com Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-04 09:33:40 -05:00
Valentin Schneider	8fa88a88d5	genirq: Remove preflow handler support That was put in place for sparc64, and blackfin also used it for some time; sparc64 no longer uses those, and blackfin is dead. As there are no more users, remove preflow handlers. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200703155645.29703-3-valentin.schneider@arm.com	2020-07-04 10:02:06 +02:00
Christoph Hellwig	a3a66c3822	vmalloc: fix the owner argument for the new __vmalloc_node_range callers Fix the recently added new __vmalloc_node_range callers to pass the correct values as the owner for display in /proc/vmallocinfo. Fixes: 800e26b81311 ("x86/hyperv: allocate the hypercall page with only read and execute bits") Fixes: 10d5e97c1bf8 ("arm64: use PAGE_KERNEL_ROX directly in alloc_insn_page") Fixes: 7a0e27b2a0ce ("mm: remove vmalloc_exec") Reported-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200627075649.2455097-1-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-07-03 16:15:25 -07:00
Linus Torvalds	45564bcd57	for-linus-2020-07-02 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXv20ggAKCRCRxhvAZXjc osYAAQD4Lp5ORPJv5jtYScxRM1RiUh8hp2MBle+4iYCAQX/UowD/QtMPtiLWaMIw 6bMMBAj1gVBmlIOmu7DhBs8hVtKugQY= =kNpi -----END PGP SIGNATURE----- Merge tag 'for-linus-2020-07-02' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull data race annotation from Christian Brauner: "This contains an annotation patch for a data race in copy_process() reported by KCSAN when reading and writing nr_threads. The data race is intentional and benign. This is obvious from the comment above the relevant code and based on general consensus when discussing this issue. So simply using data_race() to annotate this as an intentional race seems the best option" * tag 'for-linus-2020-07-02' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fork: annotate data race in copy_process()	2020-07-02 22:40:06 -07:00
Song Liu	046cc3dd9a	bpf: Fix build without CONFIG_STACKTRACE Without CONFIG_STACKTRACE stack_trace_save_tsk() is not defined. Let get_callchain_entry_for_task() to always return NULL in such cases. Fixes: fa28dcb82a38 ("bpf: Introduce helper bpf_get_task_stack()") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200703024537.79971-1-songliubraving@fb.com	2020-07-02 20:21:56 -07:00
Linus Torvalds	c93493b7cd	io_uring-5.8-2020-07-01 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl79YU0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplHKD/9rgv0c1I7dCh6MgQKxT+2z/eZcaPO3PekW sbn8yC8RiSIL85Av1zEfC1wAp+Mp21QlFKXFiZ6BJj5bdDbbshLk0WdbnxvuM+9I gyngTI/+em5D/WCcetAkPjnMTDq0m4l0UXd91fyNAeErmYZbvhL5dXihZsBJ3T9c Bprn4RzWwrUsUwGn8qIEZhx2UovMrzXJHGFxWXh/81YHkh7Y4mjvATKxtECIliW/ +QQJDU7Tf3gZw+ETPIDOEB9Hl9c9W+9fcWWzmrXzViUyy54IMbF4qyJpWcGaRh6c sO3apymwu7wwAUbQcE8IWr3ZLZDtw68AgUdZ5b/T0c2fEwqsI/UDMhBbELiuqcT0 MAoQdUSNNqZTti0PX5vg5CQlCFzjnl2uIwHF6LVSbrqgyqxiC3Qrus/FYSaf3x9h bAmNgWC9DeKp/wtEKMuBXaOm7RjrEutD5hjJYfVK/AkvKTZyZDx3vZ9FRH8WtrII 7KhUI3DPSZCeWlcpDtK+0fEqtqTw6OtCQ8U5vKSnJjoRSXLUtuk6IYbp/tqNxwe/ 0d+U6R+w513jVlXARUP48mV7tzpESp2MLP6Nd2Is/OD5tePWzQEZinpKzsFP4djH d2PT5FFGPCw9yBk03sI1Je/CFqVYwCGqav6h8dKKVBanMjoEdL4U1PMhI48Zua+9 M8pqRHoeDA== =4lvI -----END PGP SIGNATURE----- Merge tag 'io_uring-5.8-2020-07-01' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "One fix in here, for a regression in 5.7 where a task is waiting in the kernel for a condition, but that condition won't become true until task_work is run. And the task_work can't be run exactly because the task is waiting in the kernel, so we'll never make any progress. One example of that is registering an eventfd and queueing io_uring work, and then the task goes and waits in eventfd read with the expectation that it'll get woken (and read an event) when the io_uring request completes. The io_uring request is finished through task_work, which won't get run while the task is looping in eventfd read" * tag 'io_uring-5.8-2020-07-01' of git://git.kernel.dk/linux-block: io_uring: use signal based task_work running task_work: teach task_work_add() to do signal_wake_up()	2020-07-02 14:56:22 -07:00
Bhupesh Sharma	1d50e5d0c5	crash_core, vmcoreinfo: Append 'MAX_PHYSMEM_BITS' to vmcoreinfo Right now user-space tools like 'makedumpfile' and 'crash' need to rely on a best-guess method of determining value of 'MAX_PHYSMEM_BITS' supported by underlying kernel. This value is used in user-space code to calculate the bit-space required to store a section for SPARESMEM (similar to the existing calculation method used in the kernel implementation): #define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) Now, regressions have been reported in user-space utilities like 'makedumpfile' and 'crash' on arm64, with the recently added kernel support for 52-bit physical address space, as there is no clear method of determining this value in user-space (other than reading kernel CONFIG flags). As per suggestion from makedumpfile maintainer (Kazu), it makes more sense to append 'MAX_PHYSMEM_BITS' to vmcoreinfo in the core code itself rather than in arch-specific code, so that the user-space code for other archs can also benefit from this addition to the vmcoreinfo and use it as a standard way of determining 'SECTIONS_SHIFT' value in user-land. A reference 'makedumpfile' implementation which reads the 'MAX_PHYSMEM_BITS' value from vmcoreinfo in a arch-independent fashion is available here: While at it also update vmcoreinfo documentation for 'MAX_PHYSMEM_BITS' variable being added to vmcoreinfo. 'MAX_PHYSMEM_BITS' defines the maximum supported physical address space memory. Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com> Tested-by: John Donnelly <john.p.donnelly@oracle.com> Acked-by: Dave Young <dyoung@redhat.com> Cc: Boris Petkov <bp@alien8.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: James Morse <james.morse@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Anderson <anderson@redhat.com> Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com> Cc: x86@kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: kexec@lists.infradead.org Link: https://lore.kernel.org/r/1589395957-24628-2-git-send-email-bhsharma@redhat.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2020-07-02 17:56:11 +01:00
Peter Zijlstra	78c2141b65	Merge branch 'perf/vlbr'	2020-07-02 15:51:48 +02:00
Quentin Perret	10dd8573b0	cpufreq: Register governors at core_initcall Currently, most CPUFreq governors are registered at the core_initcall time when the given governor is the default one, and the module_init time otherwise. In preparation for letting users specify the default governor on the kernel command line, change all of them to be registered at the core_initcall unconditionally, as it is already the case for the schedutil and performance governors. This will allow us to assume that builtin governors have been registered before the built-in CPUFreq drivers probe. And since all governors have similar init/exit patterns now, introduce two new macros, cpufreq_governor_{init,exit}(), to factorize the code. Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> [ rjw: Changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-07-02 13:03:30 +02:00
Steven Rostedt (VMware)	29ce24519c	ring-buffer: Do not trigger a WARN if clock going backwards is detected After tweaking the ring buffer to be a bit faster, a warning is triggering on one of my machines, and causing my tests to fail. This warning is caused when the delta (current time stamp minus previous time stamp), is larger than the max time held by the ring buffer (59 bits). If the clock were to go backwards slightly, this would then easily trigger this warning. The machine that it triggered on, the clock did go backwards by around 450 nanoseconds, and this happened after a recalibration of the TSC clock. Now that the ring buffer is faster, it detects this, and the delta that is used larger than the max, the warning is triggered and my test fails. To handle the clock going backwards, look at the saved before and after time stamps. If they are the same, it means that the current event did not interrupt another event, and that those timestamp are of a previous event that was recorded. If the max delta is triggered, look at those time stamps, make sure they are the same, then use them to compare with the current timestamp. If the current timestamp is less than the before/after time stamps, then that means the clock being used went backward. Print out a message that this has happened, but do not warn about it (and only print the message once). Still do the warning if the delta is indeed larger than what can be used. Also remove the unneeded KERN_WARNING from the WARN_ONCE() print. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-07-01 22:12:07 -04:00
Steven Rostedt (VMware)	bbeba3e58f	ring-buffer: Call trace_clock_local() directly for RETPOLINE kernels After doing some benchmarks and examining the code, I found that the ring buffer clock calls were quite expensive, and noticed that it uses retpolines. This is because the ring buffer clock is programmable, and can be set. But in most cases it simply uses the fastest ns unit clock which is the trace_clock_local(). For RETPOLINE builds, checking if the ring buffer clock is set to trace_clock_local() and then calling it directly has brought the time of an event on my i7 box from an average of 93 nanoseconds an event down to 83 nanoseconds an event, and the minimum time from 81 nanoseconds to 68 nanoseconds! Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-07-01 22:12:07 -04:00
Steven Rostedt (VMware)	74e879373b	ring-buffer: Move the add_timestamp into its own function Make a helper function rb_add_timestamp() that moves the adding of the extended time stamps into its own function. Also, remove the noinline and inline for the functions it calls, as recent benchmarks appear they do not make a difference (just let gcc decide). Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-07-01 22:12:06 -04:00
Steven Rostedt (VMware)	58fbc3c632	ring-buffer: Consolidate add_timestamp to remove some branches Reorganize a little the logic to handle adding the absolute time stamp, extended and forced time stamps, in such a way to remove a branch or two. This is just a micro optimization. Also add before and after time stamps to the rb_event_info structure to display those values in the rb_check_timestamps() code, if something were to go wrong. Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-07-01 22:11:22 -04:00
Song Liu	2df6bb5493	bpf: Allow %pB in bpf_seq_printf() and bpf_trace_printk() This makes it easy to dump stack trace in text. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200630062846.664389-4-songliubraving@fb.com	2020-07-01 08:23:59 -07:00
Song Liu	fa28dcb82a	bpf: Introduce helper bpf_get_task_stack() Introduce helper bpf_get_task_stack(), which dumps stack trace of given task. This is different to bpf_get_stack(), which gets stack track of current task. One potential use case of bpf_get_task_stack() is to call it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file. bpf_get_task_stack() uses stack_trace_save_tsk() instead of get_perf_callchain() for kernel stack. The benefit of this choice is that stack_trace_save_tsk() doesn't require changes in arch/. The downside of using stack_trace_save_tsk() is that stack_trace_save_tsk() dumps the stack trace to unsigned long array. For 32-bit systems, we need to translate it to u64 array. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200630062846.664389-3-songliubraving@fb.com	2020-07-01 08:23:19 -07:00
Song Liu	d141b8bc57	perf: Expose get/put_callchain_entry() Sanitize and expose get/put_callchain_entry(). This would be used by bpf stack map. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200630062846.664389-2-songliubraving@fb.com	2020-07-01 08:22:08 -07:00
Alexei Starovoitov	bba1dc0b55	bpf: Remove redundant synchronize_rcu. bpf_free_used_maps() or close(map_fd) will trigger map_free callback. bpf_free_used_maps() is called after bpf prog is no longer executing: bpf_prog_put->call_rcu->bpf_prog_free->bpf_free_used_maps. Hence there is no need to call synchronize_rcu() to protect map elements. Note that hash_of_maps and array_of_maps update/delete inner maps via sys_bpf() that calls maybe_wait_bpf_programs() and synchronize_rcu(). Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/bpf/20200630043343.53195-2-alexei.starovoitov@gmail.com	2020-07-01 08:07:13 -07:00
Mike Rapoport	fb37409a01	arch: remove unicore32 port The unicore32 port do not seem maintained for a long time now, there is no upstream toolchain that can create unicore32 binaries and all the links to prebuilt toolchains for unicore32 are dead. Even compilers that were available are not supported by the kernel anymore. Guenter Roeck says: I have stopped building unicore32 images since v4.19 since there is no available compiler that is still supported by the kernel. I am surprised that support for it has not been removed from the kernel. Remove unicore32 port. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Guenter Roeck <linux@roeck-us.net>	2020-07-01 12:09:13 +03:00
David S. Miller	e708e2bd55	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2020-06-30 The following pull-request contains BPF updates for your net tree. We've added 28 non-merge commits during the last 9 day(s) which contain a total of 35 files changed, 486 insertions(+), 232 deletions(-). The main changes are: 1) Fix an incorrect verifier branch elimination for PTR_TO_BTF_ID pointer types, from Yonghong Song. 2) Fix UAPI for sockmap and flow_dissector progs that were ignoring various arguments passed to BPF_PROG_{ATTACH,DETACH}, from Lorenz Bauer & Jakub Sitnicki. 3) Fix broken AF_XDP DMA hacks that are poking into dma-direct and swiotlb internals and integrate it properly into DMA core, from Christoph Hellwig. 4) Fix RCU splat from recent changes to avoid skipping ingress policy when kTLS is enabled, from John Fastabend. 5) Fix BPF ringbuf map to enforce size to be the power of 2 in order for its position masking to work, from Andrii Nakryiko. 6) Fix regression from CAP_BPF work to re-allow CAP_SYS_ADMIN for loading of network programs, from Maciej Żenczykowski. 7) Fix libbpf section name prefix for devmap progs, from Jesper Dangaard Brouer. 8) Fix formatting in UAPI documentation for BPF helpers, from Quentin Monnet. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-06-30 14:20:45 -07:00
Steven Rostedt (VMware)	75b21c6dfa	ring-buffer: Mark the !tail (crossing a page) as unlikely It is the uncommon case where an event crosses a sub buffer boundary (page) mark that check at the end of reserving an event as unlikely. Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-06-30 17:18:56 -04:00
Nicholas Piggin	b23d7a5f4a	ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU On a 144 thread system, `perf ftrace` takes about 20 seconds to start up, due to calling synchronize_rcu() for each CPU. cat /proc/108560/stack 0xc0003e7eb336f470 __switch_to+0x2e0/0x480 __wait_rcu_gp+0x20c/0x220 synchronize_rcu+0x9c/0xc0 ring_buffer_reset_cpu+0x88/0x2e0 tracing_reset_online_cpus+0x84/0xe0 tracing_open+0x1d4/0x1f0 On a system with 10x more threads, it starts to become an annoyance. Batch these up so we disable all the per-cpu buffers first, then synchronize_rcu() once, then reset each of the buffers. This brings the time down to about 0.5s. Link: https://lkml.kernel.org/r/20200625053403.2386972-1-npiggin@gmail.com Tested-by: Anton Blanchard <anton@ozlabs.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-06-30 17:18:56 -04:00
Steven Rostedt (VMware)	10464b4aa6	ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit After a discussion with the new time algorithm to have nested events still have proper time keeping but required using local64_t atomic operations. Mathieu was concerned about the performance this would have on 32 bit machines, as in most cases, atomic 64 bit operations on them can be expensive. As the ring buffer's timing needs do not require full features of local64_t, a wrapper is made to implement a new rb_time_t operation that uses two longs on 32 bit machines but still uses the local64_t operations on 64 bit machines. There's a switch that can be made in the file to force 64 bit to use the 32 bit version just for testing purposes. All reads do not need to succeed if a read happened while the stamp being read is in the process of being updated. The requirement is that all reads must succed that were done by an interrupting event (where this event was interrupted by another event that did the write). Or if the event itself did the write first. That is: rb_time_set(t, x) followed by rb_time_read(t) will always succeed (even if it gets interrupted by another event that writes to t. The result of the read will be either the previous set, or a set performed by an interrupting event. If the read is done by an event that interrupted another event that was in the process of setting the time stamp, and no other event came along to write to that time stamp, it will fail and the rb_time_read() will return that it failed (the value to read will be undefined). A set will always write to the time stamp and return with a valid time stamp, such that any read after it will be valid. A cmpxchg may fail if it interrupted an event that was in the process of updating the time stamp just like the reads do. Other than that, it will act like a normal cmpxchg. The way this works is that the rb_time_t is made of of three fields. A cnt, that gets updated atomically everyting a modification is made. A top that represents the most significant 30 bits of the time, and a bottom to represent the least significant 30 bits of the time. Notice, that the time values is only 60 bits long (where the ring buffer only uses 59 bits, which gives us 18 years of nanoseconds!). The top two bits of both the top and bottom is a 2 bit counter that gets set by the value of the least two significant bits of the cnt. A read of the top and the bottom where both the top and bottom have the same most significant top 2 bits, are considered a match and a valid 60 bit number can be created from it. If they do not match, then the number is considered invalid, and this must only happen if an event interrupted another event in the midst of updating the time stamp. This is only used for 32 bits machines as 64 bit machines can get better performance out of the local64_t. This has been tested heavily by forcing 64 bit to use this logic. Link: https://lore.kernel.org/r/20200625225345.18cf5881@oasis.local.home Link: http://lkml.kernel.org/r/20200629025259.309232719@goodmis.org Inspired-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-06-30 17:18:51 -04:00
Yonghong Song	01c66c48d4	bpf: Fix an incorrect branch elimination by verifier Wenbo reported an issue in [1] where a checking of null pointer is evaluated as always false. In this particular case, the program type is tp_btf and the pointer to compare is a PTR_TO_BTF_ID. The current verifier considers PTR_TO_BTF_ID always reprents a non-null pointer, hence all PTR_TO_BTF_ID compares to 0 will be evaluated as always not-equal, which resulted in the branch elimination. For example, struct bpf_fentry_test_t { struct bpf_fentry_test_t a; }; int BPF_PROG(test7, struct bpf_fentry_test_t arg) { if (arg == 0) test7_result = 1; return 0; } int BPF_PROG(test8, struct bpf_fentry_test_t *arg) { if (arg->a == 0) test8_result = 1; return 0; } In above bpf programs, both branch arg == 0 and arg->a == 0 are removed. This may not be what developer expected. The bug is introduced by Commit cac616db39c2 ("bpf: Verifier track null pointer branch_taken with JNE and JEQ"), where PTR_TO_BTF_ID is considered to be non-null when evaluting pointer vs. scalar comparison. This may be added considering we have PTR_TO_BTF_ID_OR_NULL in the verifier as well. PTR_TO_BTF_ID_OR_NULL is added to explicitly requires a non-NULL testing in selective cases. The current generic pointer tracing framework in verifier always assigns PTR_TO_BTF_ID so users does not need to check NULL pointer at every pointer level like a->b->c->d. We may not want to assign every PTR_TO_BTF_ID as PTR_TO_BTF_ID_OR_NULL as this will require a null test before pointer dereference which may cause inconvenience for developers. But we could avoid branch elimination to preserve original code intention. This patch simply removed PTR_TO_BTD_ID from reg_type_not_null() in verifier, which prevented the above branches from being eliminated. [1]: https://lore.kernel.org/bpf/79dbb7c0-449d-83eb-5f4f-7af0cc269168@fb.com/T/ Fixes: cac616db39c2 ("bpf: Verifier track null pointer branch_taken with JNE and JEQ") Reported-by: Wenbo Zhang <ethercflow@gmail.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200630171240.2523722-1-yhs@fb.com	2020-06-30 22:21:05 +02:00
Steven Rostedt (VMware)	7c4b4a5164	ring-buffer: Incorporate absolute timestamp into add_timestamp logic Instead of calling out the absolute test for each time to check if the ring buffer wants absolute time stamps for all its recording, incorporate it with the add_timestamp field and turn it into flags for faster processing between wanting a absolute tag and needing to force one. Link: http://lkml.kernel.org/r/20200629025259.154892368@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-06-30 16:16:14 -04:00
Steven Rostedt (VMware)	a389d86f7f	ring-buffer: Have nested events still record running time stamp Up until now, if an event is interrupted while it is recorded by an interrupt, and that interrupt records events, the time of those events will all be the same. This is because events only record the delta of the time since the previous event (or beginning of a page), and to handle updating the time keeping for that of nested events is extremely racy. After years of thinking about this and several failed attempts, I finally have a solution to solve this puzzle. The problem is that you need to atomically calculate the delta and then update the time stamp you made the delta from, as well as then record it into the buffer, all this while at any time an interrupt can come in and do the same thing. This is easy to solve with heavy weight atomics, but that would be detrimental to the performance of the ring buffer. The current state of affairs sacrificed the time deltas for nested events for performance. The reason for previous failed attempts at solving this puzzle was because I was trying to completely avoid slow atomic operations like cmpxchg. I final came to the conclusion to always avoid cmpxchg is not possible, which is why those previous attempts always failed. But it is possible to pick one path (the most common case) and avoid cmpxchg in that path, which is the "fast path". The most common case is that an event will not be interrupted and have other events added into it. An event can detect if it has interrupted another event, and for these cases we can make it the slow path and use the heavy operations like cmpxchg. One more player was added to the game that made this possible, and that is the "absolute timestamp" (by Tom Zanussi) that allows us to inject a full 59 bit time stamp. (Of course this breaks if a machine is running for more than 18 years without a reboot!). There's barrier() placements around for being paranoid, even when they are not needed because of other atomic functions near by. But those should not hurt, as if they are not needed, they basically become a nop. Note, this also makes the race window much smaller, which means there are less slow paths to slow down the performance. The basic idea is that there's two main paths taken. 1) Not being interrupted between time stamps and reserving buffer space. In this case, the time stamps taken are true to the location in the buffer. 2) Was interrupted by another path between taking time stamps and reserving buffer space. The objective is to know what the delta is from the last reserved location in the buffer. As it is possible to detect if an event is interrupting another event before reserving data, space is added to the length to be reserved to inject a full time stamp along with the event being reserved. When an event is not interrupted, the write stamp is always the time of the last event written to the buffer. In path 1, there's two sub paths we care about: a) The event did not interrupt another event. b) The event interrupted another event. In case a, as the write stamp was read and known to be correct, the delta between the current time stamp and the write stamp is the delta between the current event and the previously recorded event. In case b, extra space was reserved to just put the full time stamp into the buffer. Which is done, as stated, in this path the time stamp taken is known to match the location in the buffer. In path 2, there's also two sub paths we care about: a) The event was not interrupted by another event since it reserved space on the buffer and re-reading the write stamp. b) The event was interrupted by another event. In case a, the write stamp is that of the last event that interrupted this event between taking the time stamps and reserving. As no event came in after re-reading the write stamp, that event is known to be the time of the event directly before this event and the delta can be the new time stamp and the write stamp. In case b, one or more events came in between reserving the event and re-reading he write stamp. Since this event's buffer reservation is between other events at this path, there's no way to know what the delta is. But because an event interrupted this event after it started, its fine to just give a zero delta, and take the same time stamp as the events that happened within the event being recorded. Here's the implementation of the design of this solution: All this is per cpu, and only needs to worry about nested events (not parallel events). The players: write_tail: The index in the buffer where new events can be written to. It is incremented via local_add() to reserve space for a new event. before_stamp: A time stamp set by all events before reserving space. write_stamp: A time stamp updated by events after it has successfully reserved space. /* Save the current position of write / [A] w = local_read(write_tail); barrier(); / Read both before and write stamps before touching anything / before = local_read(before_stamp); after = local_read(write_stamp); barrier(); / * If before and after are the same, then this event is not * interrupting a time update. If it is, then reserve space for adding * a full time stamp (this can turn into a time extend which is * just an extended time delta but fill up the extra space). / if (after != before) abs = true; ts = clock(); / Now update the before_stamp (everyone does this!) / [B] local_set(before_stamp, ts); / Now reserve space on the buffer / [C] write = local_add_return(len, write_tail); / Set tail to be were this event's data is / tail = write - len; if (w == tail) { / Nothing interrupted this between A and C / [D] local_set(write_stamp, ts); barrier(); [E] save_before = local_read(before_stamp); if (!abs) { / This did not interrupt a time update / delta = ts - after; } else { delta = ts; / The full time stamp will be in use / } if (ts != save_before) { / slow path - Was interrupted between C and E / / The update to write_stamp could have overwritten the update to * it by the interrupting event, but before and after should be * the same for all completed top events / after = local_read(write_stamp); if (save_before > after) local_cmpxchg(write_stamp, after, save_before); } } else { / slow path - Interrupted between A and C / after = local_read(write_stamp); temp_ts = clock(); barrier(); [F] if (write == local_read(write_tail) && after < temp_ts) { / This was not interrupted since C and F * The last write_stamp is still valid for the previous event * in the buffer. / delta = temp_ts - after; / OK to keep this new time stamp / ts = temp_ts; } else { / Interrupted between C and F * Well, there's no use to try to know what the time stamp * is for the previous event. Just set delta to zero and * be the same time as that event that interrupted us before * the reservation of the buffer. / delta = 0; } / No need to use full timestamps here */ abs = 0; } Link: https://lkml.kernel.org/r/20200625094454.732790f7@oasis.local.home Link: https://lore.kernel.org/r/20200627010041.517736087@goodmis.org Link: http://lkml.kernel.org/r/20200629025258.957440797@goodmis.org Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-06-30 14:29:33 -04:00
Steven Rostedt (VMware)	7ef282e051	tracing: Move pipe reference to trace array instead of current_tracer If a process has the trace_pipe open on a trace_array, the current tracer for that trace array should not be changed. This was original enforced by a global lock, but when instances were introduced, it was moved to the current_trace. But this structure is shared by all instances, and a trace_pipe is for a single instance. There's no reason that a process that has trace_pipe open on one instance should prevent another instance from changing its current tracer. Move the reference counter to the trace_array instead. This is marked as "Fixes" but is more of a clean up than a true fix. Backport if you want, but its not critical. Fixes: cf6ab6d9143b1 ("tracing: Add ref count to tracer for when they are being read by pipe") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-06-30 14:29:33 -04:00
Oleg Nesterov	e91b481623	task_work: teach task_work_add() to do signal_wake_up() So that the target task will exit the wait_event_interruptible-like loop and call task_work_run() asap. The patch turns "bool notify" into 0,TWA_RESUME,TWA_SIGNAL enum, the new TWA_SIGNAL flag implies signal_wake_up(). However, it needs to avoid the race with recalc_sigpending(), so the patch also adds the new JOBCTL_TASK_WORK bit included in JOBCTL_PENDING_MASK. TODO: once this patch is merged we need to change all current users of task_work_add(notify = true) to use TWA_RESUME. Cc: stable@vger.kernel.org # v5.7 Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 12:18:08 -06:00
Jakub Sitnicki	2576f87066	bpf, netns: Fix use-after-free in pernet pre_exit callback Iterating over BPF links attached to network namespace in pre_exit hook is not safe, even if there is just one. Once link gets auto-detached, that is its back-pointer to net object is set to NULL, the link can be released and freed without waiting on netns_bpf_mutex, effectively causing the list element we are operating on to be freed. This leads to use-after-free when trying to access the next element on the list, as reported by KASAN. Bug can be triggered by destroying a network namespace, while also releasing a link attached to this network namespace. \| ================================================================== \| BUG: KASAN: use-after-free in netns_bpf_pernet_pre_exit+0xd9/0x130 \| Read of size 8 at addr ffff888119e0d778 by task kworker/u8:2/177 \| \| CPU: 3 PID: 177 Comm: kworker/u8:2 Not tainted 5.8.0-rc1-00197-ga0c04c9d1008-dirty #776 \| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 \| Workqueue: netns cleanup_net \| Call Trace: \| dump_stack+0x9e/0xe0 \| print_address_description.constprop.0+0x3a/0x60 \| ? netns_bpf_pernet_pre_exit+0xd9/0x130 \| kasan_report.cold+0x1f/0x40 \| ? netns_bpf_pernet_pre_exit+0xd9/0x130 \| netns_bpf_pernet_pre_exit+0xd9/0x130 \| cleanup_net+0x30b/0x5b0 \| ? unregister_pernet_device+0x50/0x50 \| ? rcu_read_lock_bh_held+0xb0/0xb0 \| ? _raw_spin_unlock_irq+0x24/0x50 \| process_one_work+0x4d1/0xa10 \| ? lock_release+0x3e0/0x3e0 \| ? pwq_dec_nr_in_flight+0x110/0x110 \| ? rwlock_bug.part.0+0x60/0x60 \| worker_thread+0x7a/0x5c0 \| ? process_one_work+0xa10/0xa10 \| kthread+0x1e3/0x240 \| ? kthread_create_on_node+0xd0/0xd0 \| ret_from_fork+0x1f/0x30 \| \| Allocated by task 280: \| save_stack+0x1b/0x40 \| __kasan_kmalloc.constprop.0+0xc2/0xd0 \| netns_bpf_link_create+0xfe/0x650 \| __do_sys_bpf+0x153a/0x2a50 \| do_syscall_64+0x59/0x300 \| entry_SYSCALL_64_after_hwframe+0x44/0xa9 \| \| Freed by task 198: \| save_stack+0x1b/0x40 \| __kasan_slab_free+0x12f/0x180 \| kfree+0xed/0x350 \| process_one_work+0x4d1/0xa10 \| worker_thread+0x7a/0x5c0 \| kthread+0x1e3/0x240 \| ret_from_fork+0x1f/0x30 \| \| The buggy address belongs to the object at ffff888119e0d700 \| which belongs to the cache kmalloc-192 of size 192 \| The buggy address is located 120 bytes inside of \| 192-byte region [ffff888119e0d700, ffff888119e0d7c0) \| The buggy address belongs to the page: \| page:ffffea0004678340 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 \| flags: 0x2fffe0000000200(slab) \| raw: 02fffe0000000200 ffffea00045ba8c0 0000000600000006 ffff88811a80ea80 \| raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 \| page dumped because: kasan: bad access detected \| \| Memory state around the buggy address: \| ffff888119e0d600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb \| ffff888119e0d680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc \| >ffff888119e0d700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb \| ^ \| ffff888119e0d780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc \| ffff888119e0d800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb \| ================================================================== Remove the "fast-path" for releasing a link that got auto-detached by a dying network namespace to fix it. This way as long as link is on the list and netns_bpf mutex is held, we have a guarantee that link memory can be accessed. An alternative way to fix this issue would be to safely iterate over the list of links and ensure there is no access to link object after detaching it. But, at the moment, optimizing synchronization overhead on link release without a workload in mind seems like an overkill. Fixes: ab53cad90eb1 ("bpf, netns: Keep a list of attached bpf_link's") Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200630164541.1329993-1-jakub@cloudflare.com	2020-06-30 10:53:42 -07:00
Lorenz Bauer	bb0de3131f	bpf: sockmap: Require attach_bpf_fd when detaching a program The sockmap code currently ignores the value of attach_bpf_fd when detaching a program. This is contrary to the usual behaviour of checking that attach_bpf_fd represents the currently attached program. Ensure that attach_bpf_fd is indeed the currently attached program. It turns out that all sockmap selftests already do this, which indicates that this is unlikely to cause breakage. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-5-lmb@cloudflare.com	2020-06-30 10:46:39 -07:00
Lorenz Bauer	4ac2add659	bpf: flow_dissector: Check value of unused flags to BPF_PROG_DETACH Using BPF_PROG_DETACH on a flow dissector program supports neither attach_flags nor attach_bpf_fd. Yet no value is enforced for them. Enforce that attach_flags are zero, and require the current program to be passed via attach_bpf_fd. This allows us to remove the check for CAP_SYS_ADMIN, since userspace can now no longer remove arbitrary flow dissector programs. Fixes: b27f7bb590ba ("flow_dissector: Move out netns_bpf prog callbacks") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-3-lmb@cloudflare.com	2020-06-30 10:46:38 -07:00
Lorenz Bauer	1b514239e8	bpf: flow_dissector: Check value of unused flags to BPF_PROG_ATTACH Using BPF_PROG_ATTACH on a flow dissector program supports neither target_fd, attach_flags or replace_bpf_fd but accepts any value. Enforce that all of them are zero. This is fine for replace_bpf_fd since its presence is indicated by BPF_F_REPLACE. It's more problematic for target_fd, since zero is a valid fd. Should we want to use the flag later on we'd have to add an exception for fd 0. The alternative is to force a value like -1. This requires more changes to tests. There is also precedent for using 0, since bpf_iter uses this for target_fd as well. Fixes: b27f7bb590ba ("flow_dissector: Move out netns_bpf prog callbacks") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-2-lmb@cloudflare.com	2020-06-30 10:46:38 -07:00
Jakub Sitnicki	ab53cad90e	bpf, netns: Keep a list of attached bpf_link's To support multi-prog link-based attachments for new netns attach types, we need to keep track of more than one bpf_link per attach type. Hence, convert net->bpf.links into a list, that currently can be either empty or have just one item. Instead of reusing bpf_prog_list from bpf-cgroup, we link together bpf_netns_link's themselves. This makes list management simpler as we don't have to allocate, initialize, and later release list elements. We can do this because multi-prog attachment will be available only for bpf_link, and we don't need to build a list of programs attached directly and indirectly via links. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-4-jakub@cloudflare.com	2020-06-30 10:45:08 -07:00
Jakub Sitnicki	695c12147a	bpf, netns: Keep attached programs in bpf_prog_array Prepare for having multi-prog attachments for new netns attach types by storing programs to run in a bpf_prog_array, which is well suited for iterating over programs and running them in sequence. After this change bpf(PROG_QUERY) may block to allocate memory in bpf_prog_array_copy_to_user() for collected program IDs. This forces a change in how we protect access to the attached program in the query callback. Because bpf_prog_array_copy_to_user() can sleep, we switch from an RCU read lock to holding a mutex that serializes updaters. Because we allow only one BPF flow_dissector program to be attached to netns at all times, the bpf_prog_array pointed by net->bpf.run_array is always either detached (null) or one element long. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-3-jakub@cloudflare.com	2020-06-30 10:45:08 -07:00
Jakub Sitnicki	3b7016996c	flow_dissector: Pull BPF program assignment up to bpf-netns Prepare for using bpf_prog_array to store attached programs by moving out code that updates the attached program out of flow dissector. Managing bpf_prog_array is more involved than updating a single bpf_prog pointer. This will let us do it all from one place, bpf/net_namespace.c, in the subsequent patch. No functional change intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-2-jakub@cloudflare.com	2020-06-30 10:45:07 -07:00
Andrii Nakryiko	517bbe1994	bpf: Enforce BPF ringbuf size to be the power of 2 BPF ringbuf assumes the size to be a multiple of page size and the power of 2 value. The latter is important to avoid division while calculating position inside the ring buffer and using (N-1) mask instead. This patch fixes omission to enforce power-of-2 size rule. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200630061500.1804799-1-andriin@fb.com	2020-06-30 16:31:55 +02:00
Christoph Hellwig	3aa9162500	dma-mapping: Add a new dma_need_sync API Add a new API to check if calls to dma_sync_single_for_{device,cpu} are required for a given DMA streaming mapping. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200629130359.2690853-2-hch@lst.de	2020-06-30 15:44:03 +02:00

... 6 7 8 9 10 ...

34080 Commits