linux

iv/linux

Author	SHA1	Message	Date
Sargun Dhillon	7cf97b1254	seccomp: Introduce addfd ioctl to seccomp user notifier The current SECCOMP_RET_USER_NOTIF API allows for syscall supervision over an fd. It is often used in settings where a supervising task emulates syscalls on behalf of a supervised task in userspace, either to further restrict the supervisee's syscall abilities or to circumvent kernel enforced restrictions the supervisor deems safe to lift (e.g. actually performing a mount(2) for an unprivileged container). While SECCOMP_RET_USER_NOTIF allows for the interception of any syscall, only a certain subset of syscalls could be correctly emulated. Over the last few development cycles, the set of syscalls which can't be emulated has been reduced due to the addition of pidfd_getfd(2). With this we are now able to, for example, intercept syscalls that require the supervisor to operate on file descriptors of the supervisee such as connect(2). However, syscalls that cause new file descriptors to be installed can not currently be correctly emulated since there is no way for the supervisor to inject file descriptors into the supervisee. This patch adds a new addfd ioctl to remove this restriction by allowing the supervisor to install file descriptors into the intercepted task. By implementing this feature via seccomp the supervisor effectively instructs the supervisee to install a set of file descriptors into its own file descriptor table during the intercepted syscall. This way it is possible to intercept syscalls such as open() or accept(), and install (or replace, like dup2(2)) the supervisor's resulting fd into the supervisee. One replacement use-case would be to redirect the stdout and stderr of a supervisee into log file descriptors opened by the supervisor. The ioctl handling is based on the discussions[1] of how Extensible Arguments should interact with ioctls. Instead of building size into the addfd structure, make it a function of the ioctl command (which is how sizes are normally passed to ioctls). To support forward and backward compatibility, just mask out the direction and size, and match everything. The size (and any future direction) checks are done along with copy_struct_from_user() logic. As a note, the seccomp_notif_addfd structure is laid out based on 8-byte alignment without requiring packing as there have been packing issues with uapi highlighted before[2][3]. Although we could overload the newfd field and use -1 to indicate that it is not to be used, doing so requires changing the size of the fd field, and introduces struct packing complexity. [1]: https://lore.kernel.org/lkml/87o8w9bcaf.fsf@mid.deneb.enyo.de/ [2]: https://lore.kernel.org/lkml/a328b91d-fd8f-4f27-b3c2-91a9c45f18c0@rasmusvillemoes.dk/ [3]: https://lore.kernel.org/lkml/20200612104629.GA15814@ircssh-2.c.rugged-nimbus-611.internal Cc: Christoph Hellwig <hch@lst.de> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Tycho Andersen <tycho@tycho.ws> Cc: Jann Horn <jannh@google.com> Cc: Robert Sesek <rsesek@google.com> Cc: Chris Palmer <palmer@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-api@vger.kernel.org Suggested-by: Matt Denton <mpdenton@google.com> Link: https://lore.kernel.org/r/20200603011044.7972-4-sargun@sargun.me Signed-off-by: Sargun Dhillon <sargun@sargun.me> Reviewed-by: Will Drewry <wad@chromium.org> Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-14 16:29:42 -07:00
Alexei Starovoitov	ec2ffdf65f	Merge branch 'usermode-driver-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace into bpf-next	2020-07-14 12:18:01 -07:00
Alexey Dobriyan	02d7f4005e	PM: sleep: spread "const char " correctness Fixed string literals can be referred to as "const char ". Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> [ rjw: Minor subject edit ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-07-14 19:25:41 +02:00
Xiang Chen	6ada7ba2fa	PM: hibernate: fix white space in a few places In hibernate.c, some places lack of spaces while some places have redundant spaces. So fix them. Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-07-14 19:25:41 +02:00
Nicolas Saenz Julienne	d9765e41d8	dma-pool: do not allocate pool memory from CMA There is no guarantee to CMA's placement, so allocating a zone specific atomic pool from CMA might return memory from a completely different memory zone. So stop using it. Fixes: `c84dc6e68a` ("dma-pool: add additional coherent pools to map to gfp mask") Reported-by: Jeremy Linton <jeremy.linton@arm.com> Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de> Tested-by: Jeremy Linton <jeremy.linton@arm.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-14 15:46:32 +02:00
Nicolas Saenz Julienne	81e9d894e0	dma-pool: make sure atomic pool suits device When allocating DMA memory from a pool, the core can only guess which atomic pool will fit a device's constraints. If it doesn't, get a safer atomic pool and try again. Fixes: `c84dc6e68a` ("dma-pool: add additional coherent pools to map to gfp mask") Reported-by: Jeremy Linton <jeremy.linton@arm.com> Suggested-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-14 15:42:33 +02:00
Nicolas Saenz Julienne	48b6703858	dma-pool: introduce dma_guess_pool() dma-pool's dev_to_pool() creates the false impression that there is a way to grantee a mapping between a device's DMA constraints and an atomic pool. It tuns out it's just a guess, and the device might need to use an atomic pool containing memory from a 'safer' (or lower) memory zone. To help mitigate this, introduce dma_guess_pool() which can be fed a device's DMA constraints and atomic pools already known to be faulty, in order for it to provide an better guess on which pool to use. Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-14 15:42:30 +02:00
Nicolas Saenz Julienne	23e469be62	dma-pool: get rid of dma_in_atomic_pool() The function is only used once and can be simplified to a one-liner. Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-14 15:42:27 +02:00
Nicolas Saenz Julienne	567f6a6eba	dma-direct: provide function to check physical memory area validity dma_coherent_ok() checks if a physical memory area fits a device's DMA constraints. Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-14 15:42:24 +02:00
David S. Miller	07dd1b7e68	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-07-13 The following pull-request contains BPF updates for your net-next tree. We've added 36 non-merge commits during the last 7 day(s) which contain a total of 62 files changed, 2242 insertions(+), 468 deletions(-). The main changes are: 1) Avoid trace_printk warning banner by switching bpf_trace_printk to use its own tracing event, from Alan. 2) Better libbpf support on older kernels, from Andrii. 3) Additional AF_XDP stats, from Ciara. 4) build time resolution of BTF IDs, from Jiri. 5) BPF_CGROUP_INET_SOCK_RELEASE hook, from Stanislav. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-07-13 18:04:05 -07:00
Alan Maguire	ac5a72ea5c	bpf: Use dedicated bpf_trace_printk event instead of trace_printk() The bpf helper bpf_trace_printk() uses trace_printk() under the hood. This leads to an alarming warning message originating from trace buffer allocation which occurs the first time a program using bpf_trace_printk() is loaded. We can instead create a trace event for bpf_trace_printk() and enable it in-kernel when/if we encounter a program using the bpf_trace_printk() helper. With this approach, trace_printk() is not used directly and no warning message appears. This work was started by Steven (see Link) and finished by Alan; added Steven's Signed-off-by with his permission. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/r/20200628194334.6238b933@oasis.local.home Link: https://lore.kernel.org/bpf/1594641154-18897-2-git-send-email-alan.maguire@oracle.com	2020-07-13 16:55:49 -07:00
Kees Cook	910d2f16ac	pidfd: Replace open-coded receive_fd() Replace the open-coded version of receive_fd() with a call to the new helper. Thanks to Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com> for catching a missed fput() in an earlier version of this patch. Cc: Christoph Hellwig <hch@lst.de> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-13 11:03:45 -07:00
Kees Cook	4969f8a073	pidfd: Add missing sock updates for pidfd_getfd() The sock counting (sock_update_netprioidx() and sock_update_classid()) was missing from pidfd's implementation of received fd installation. Add a call to the new __receive_sock() helper. Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: `8649c322f7` ("pid: Implement pidfd_getfd syscall") Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-13 11:03:44 -07:00
Jiri Olsa	49f4e67207	bpf: Use BTF_ID to resolve bpf_ctx_convert struct This way the ID is resolved during compile time, and we can remove the runtime name search. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200711215329.41165-7-jolsa@kernel.org	2020-07-13 10:42:03 -07:00
Jiri Olsa	138b9a0511	bpf: Remove btf_id helpers resolving Now when we moved the helpers btf_id arrays into .BTF_ids section, we can remove the code that resolve those IDs in runtime. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200711215329.41165-6-jolsa@kernel.org	2020-07-13 10:42:02 -07:00
Jiri Olsa	c9a0f3b85e	bpf: Resolve BTF IDs in vmlinux image Using BTF_ID_LIST macro to define lists for several helpers using BTF arguments. And running resolve_btfids on vmlinux elf object during linking, so the .BTF_ids section gets the IDs resolved. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200711215329.41165-5-jolsa@kernel.org	2020-07-13 10:42:02 -07:00
Bruno Meneguele	bc885f1ab6	doc:kmsg: explicitly state the return value in case of SEEK_CUR The commit `625d344978` ("Revert "kernel/printk: add kmsg SEEK_CUR handling"") reverted a change done to the return value in case a SEEK_CUR operation was performed for kmsg buffer based on the fact that different userspace apps were handling the new return value (-ESPIPE) in different ways, breaking them. At the same time -ESPIPE was the wrong decision because kmsg /does support/ seek() but doesn't follow the "normal" behavior userspace is used to. Because of that and also considering the time -EINVAL has been used, it was decided to keep this way to avoid more userspace breakage. This patch adds an official statement to the kmsg documentation pointing to the current return value for SEEK_CUR, -EINVAL, thus userspace libraries and apps can refer to it for a definitive guide on what to expect. Signed-off-by: Bruno Meneguele <bmeneg@redhat.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200710174423.10480-1-bmeneg@redhat.com	2020-07-13 15:07:45 +02:00
Linus Torvalds	9901a6bd15	Merge tag 'riscv-for-linus-5.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: "I have a few KGDB-related fixes. They're mostly fixes for build warnings, but there's also: - Support for the qSupported and qXfer packets, which are necessary to pass around GDB XML information which we need for the RISC-V GDB port to fully function. - Users can now select STRICT_KERNEL_RWX instead of forcing it on" * tag 'riscv-for-linus-5.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: Avoid kgdb.h including gdb_xml.h to solve unused-const-variable warning kgdb: Move the extern declaration kgdb_has_hit_break() to generic kgdb.h riscv: Fix "no previous prototype" compile warning in kgdb.c file riscv: enable the Kconfig prompt of STRICT_KERNEL_RWX kgdb: enable arch to support XML packet.	2020-07-11 19:22:46 -07:00
David S. Miller	71930d6102	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net All conflicts seemed rather trivial, with some guidance from Saeed Mameed on the tc_ct.c one. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-07-11 00:46:00 -07:00
Linus Torvalds	5a764898af	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from David Miller: 1) Restore previous behavior of CAP_SYS_ADMIN wrt loading networking BPF programs, from Maciej Żenczykowski. 2) Fix dropped broadcasts in mac80211 code, from Seevalamuthu Mariappan. 3) Slay memory leak in nl80211 bss color attribute parsing code, from Luca Coelho. 4) Get route from skb properly in ip_route_use_hint(), from Miaohe Lin. 5) Don't allow anything other than ARPHRD_ETHER in llc code, from Eric Dumazet. 6) xsk code dips too deeply into DMA mapping implementation internals. Add dma_need_sync and use it. From Christoph Hellwig 7) Enforce power-of-2 for BPF ringbuf sizes. From Andrii Nakryiko. 8) Check for disallowed attributes when loading flow dissector BPF programs. From Lorenz Bauer. 9) Correct packet injection to L3 tunnel devices via AF_PACKET, from Jason A. Donenfeld. 10) Don't advertise checksum offload on ipa devices that don't support it. From Alex Elder. 11) Resolve several issues in TCP MD5 signature support. Missing memory barriers, bogus options emitted when using syncookies, and failure to allow md5 key changes in established states. All from Eric Dumazet. 12) Fix interface leak in hsr code, from Taehee Yoo. 13) VF reset fixes in hns3 driver, from Huazhong Tan. 14) Make loopback work again with ipv6 anycast, from David Ahern. 15) Fix TX starvation under high load in fec driver, from Tobias Waldekranz. 16) MLD2 payload lengths not checked properly in bridge multicast code, from Linus Lüssing. 17) Packet scheduler code that wants to find the inner protocol currently only works for one level of VLAN encapsulation. Allow Q-in-Q situations to work properly here, from Toke Høiland-Jørgensen. 18) Fix route leak in l2tp, from Xin Long. 19) Resolve conflict between the sk->sk_user_data usage of bpf reuseport support and various protocols. From Martin KaFai Lau. 20) Fix socket cgroup v2 reference counting in some situations, from Cong Wang. 21) Cure memory leak in mlx5 connection tracking offload support, from Eli Britstein. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits) mlxsw: pci: Fix use-after-free in case of failed devlink reload mlxsw: spectrum_router: Remove inappropriate usage of WARN_ON() net: macb: fix call to pm_runtime in the suspend/resume functions net: macb: fix macb_suspend() by removing call to netif_carrier_off() net: macb: fix macb_get/set_wol() when moving to phylink net: macb: mark device wake capable when "magic-packet" property present net: macb: fix wakeup test in runtime suspend/resume routines bnxt_en: fix NULL dereference in case SR-IOV configuration fails libbpf: Fix libbpf hashmap on (I)LP32 architectures net/mlx5e: CT: Fix memory leak in cleanup net/mlx5e: Fix port buffers cell size value net/mlx5e: Fix 50G per lane indication net/mlx5e: Fix CPU mapping after function reload to avoid aRFS RX crash net/mlx5e: Fix VXLAN configuration restore after function reload net/mlx5e: Fix usage of rcu-protected pointer net/mxl5e: Verify that rpriv is not NULL net/mlx5: E-Switch, Fix vlan or qos setting in legacy mode net/mlx5: Fix eeprom support for SFP module cgroup: Fix sock_cgroup_data on big-endian. selftests: bpf: Fix detach from sockmap tests ...	2020-07-10 18:16:22 -07:00
Kees Cook	fe4bfff86e	seccomp: Use -1 marker for end of mode 1 syscall list The terminator for the mode 1 syscalls list was a 0, but that could be a valid syscall number (e.g. x86_64 __NR_read). By luck, __NR_read was listed first and the loop construct would not test it, so there was no bug. However, this is fragile. Replace the terminator with -1 instead, and make the variable name for mode 1 syscall lists more descriptive. Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-10 16:01:52 -07:00
Kees Cook	47e33c05f9	seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced it had the wrong direction flag set. While this isn't a big deal as nothing currently enforces these bits in the kernel, it should be defined correctly. Fix the define and provide support for the old command until it is no longer needed for backward compatibility. Fixes: `6a21cc50f0` ("seccomp: add a return code to trap to userspace") Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-10 16:01:52 -07:00
Kees Cook	e68f9d49dd	seccomp: Use pr_fmt Avoid open-coding "seccomp: " prefixes for pr_*() calls. Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-10 16:01:52 -07:00
Christian Brauner	99cdb8b9a5	seccomp: notify about unused filter We've been making heavy use of the seccomp notifier to intercept and handle certain syscalls for containers. This patch allows a syscall supervisor listening on a given notifier to be notified when a seccomp filter has become unused. A container is often managed by a singleton supervisor process the so-called "monitor". This monitor process has an event loop which has various event handlers registered. If the user specified a seccomp profile that included a notifier for various syscalls then we also register a seccomp notify even handler. For any container using a separate pid namespace the lifecycle of the seccomp notifier is bound to the init process of the pid namespace, i.e. when the init process exits the filter must be unused. If a new process attaches to a container we force it to assume a seccomp profile. This can either be the same seccomp profile as the container was started with or a modified one. If the attaching process makes use of the seccomp notifier we will register a new seccomp notifier handler in the monitor's event loop. However, when the attaching process exits we can't simply delete the handler since other child processes could've been created (daemons spawned etc.) that have inherited the seccomp filter and so we need to keep the seccomp notifier fd alive in the event loop. But this is problematic since we don't get a notification when the seccomp filter has become unused and so we currently never remove the seccomp notifier fd from the event loop and just keep accumulating fds in the event loop. We've had this issue for a while but it has recently become more pressing as more and larger users make use of this. To fix this, we introduce a new "users" reference counter that tracks any tasks and dependent filters making use of a filter. When a notifier is registered waiting tasks will be notified that the filter is now empty by receiving a (E)POLLHUP event. The concept in this patch introduces is the same as for signal_struct, i.e. reference counting for life-cycle management is decoupled from reference counting taks using the object. There's probably some trickery possible but the second counter is just the correct way of doing this IMHO and has precedence. Cc: Tycho Andersen <tycho@tycho.ws> Cc: Kees Cook <keescook@chromium.org> Cc: Matt Denton <mpdenton@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jann Horn <jannh@google.com> Cc: Chris Palmer <palmer@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Robert Sesek <rsesek@google.com> Cc: Jeffrey Vander Stoep <jeffv@google.com> Cc: Linux Containers <containers@lists.linux-foundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20200531115031.391515-3-christian.brauner@ubuntu.com Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-10 16:01:51 -07:00
Christian Brauner	76194c4e83	seccomp: Lift wait_queue into struct seccomp_filter Lift the wait_queue from struct notification into struct seccomp_filter. This is cleaner overall and lets us avoid having to take the notifier mutex in the future for EPOLLHUP notifications since we need to neither read nor modify the notifier specific aspects of the seccomp filter. In the exit path I'd very much like to avoid having to take the notifier mutex for each filter in the task's filter hierarchy. Cc: Tycho Andersen <tycho@tycho.ws> Cc: Kees Cook <keescook@chromium.org> Cc: Matt Denton <mpdenton@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jann Horn <jannh@google.com> Cc: Chris Palmer <palmer@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Robert Sesek <rsesek@google.com> Cc: Jeffrey Vander Stoep <jeffv@google.com> Cc: Linux Containers <containers@lists.linux-foundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-10 16:01:51 -07:00
Christian Brauner	3a15fb6ed9	seccomp: release filter after task is fully dead The seccomp filter used to be released in free_task() which is called asynchronously via call_rcu() and assorted mechanisms. Since we need to inform tasks waiting on the seccomp notifier when a filter goes empty we will notify them as soon as a task has been marked fully dead in release_task(). To not split seccomp cleanup into two parts, move filter release out of free_task() and into release_task() after we've unhashed struct task from struct pid, exited signals, and unlinked it from the threadgroups' thread list. We'll put the empty filter notification infrastructure into it in a follow up patch. This also renames put_seccomp_filter() to seccomp_filter_release() which is a more descriptive name of what we're doing here especially once we've added the empty filter notification mechanism in there. We're also NULL-ing the task's filter tree entrypoint which seems cleaner than leaving a dangling pointer in there. Note that this shouldn't need any memory barriers since we're calling this when the task is in release_task() which means it's EXIT_DEAD. So it can't modify its seccomp filters anymore. You can also see this from the point where we're calling seccomp_filter_release(). It's after __exit_signal() and at this point, tsk->sighand will already have been NULLed which is required for thread-sync and filter installation alike. Cc: Tycho Andersen <tycho@tycho.ws> Cc: Kees Cook <keescook@chromium.org> Cc: Matt Denton <mpdenton@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jann Horn <jannh@google.com> Cc: Chris Palmer <palmer@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Robert Sesek <rsesek@google.com> Cc: Jeffrey Vander Stoep <jeffv@google.com> Cc: Linux Containers <containers@lists.linux-foundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20200531115031.391515-2-christian.brauner@ubuntu.com Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-10 16:01:51 -07:00
Christian Brauner	b707ddee11	seccomp: rename "usage" to "refs" and document Naming the lifetime counter of a seccomp filter "usage" suggests a little too strongly that its about tasks that are using this filter while it also tracks other references such as the user notifier or ptrace. This also updates the documentation to note this fact. We'll be introducing an actual usage counter in a follow-up patch. Cc: Tycho Andersen <tycho@tycho.ws> Cc: Kees Cook <keescook@chromium.org> Cc: Matt Denton <mpdenton@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jann Horn <jannh@google.com> Cc: Chris Palmer <palmer@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Robert Sesek <rsesek@google.com> Cc: Jeffrey Vander Stoep <jeffv@google.com> Cc: Linux Containers <containers@lists.linux-foundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20200531115031.391515-1-christian.brauner@ubuntu.com Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-10 16:01:51 -07:00
Sargun Dhillon	9f87dcf14b	seccomp: Add find_notification helper This adds a helper which can iterate through a seccomp_filter to find a notification matching an ID. It removes several replicated chunks of code. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Tycho Andersen <tycho@tycho.ws> Cc: Matt Denton <mpdenton@google.com> Cc: Kees Cook <keescook@google.com>, Cc: Jann Horn <jannh@google.com>, Cc: Robert Sesek <rsesek@google.com>, Cc: Chris Palmer <palmer@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Tycho Andersen <tycho@tycho.ws> Link: https://lore.kernel.org/r/20200601112532.150158-1-sargun@sargun.me Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-10 16:01:51 -07:00
Kees Cook	c818c03b66	seccomp: Report number of loaded filters in /proc/$pid/status A common question asked when debugging seccomp filters is "how many filters are attached to your process?" Provide a way to easily answer this question through /proc/$pid/status with a "Seccomp_filters" line. Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-10 16:01:51 -07:00
Linus Torvalds	1bfde03742	Merge tag 'dma-mapping-5.8-5' of git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: - add a warning when the atomic pool is depleted (David Rientjes) - protect the parameters of the new scatterlist helper macros (Marek Szyprowski ) * tag 'dma-mapping-5.8-5' of git://git.infradead.org/users/hch/dma-mapping: scatterlist: protect parameters of the sg_table related macros dma-mapping: warn when coherent pool is depleted	2020-07-10 09:36:03 -07:00
Peter Zijlstra	f9ad4a5f3f	lockdep: Remove lockdep_hardirq{s_enabled,_context}() argument Now that the macros use per-cpu data, we no longer need the argument. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200623083721.571835311@infradead.org	2020-07-10 12:00:02 +02:00
Peter Zijlstra	a21ee6055c	lockdep: Change hardirq{s_enabled,_context} to per-cpu variables Currently all IRQ-tracking state is in task_struct, this means that task_struct needs to be defined before we use it. Especially for lockdep_assert_irq*() this can lead to header-hell. Move the hardirq state into per-cpu variables to avoid the task_struct dependency. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200623083721.512673481@infradead.org	2020-07-10 12:00:02 +02:00
Peter Zijlstra	859d069ee1	lockdep: Prepare for NMI IRQ state tracking There is no reason not to always, accurately, track IRQ state. This change also makes IRQ state tracking ignore lockdep_off(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200623083721.155449112@infradead.org	2020-07-10 12:00:01 +02:00
Marco Elver	248591f5d2	kcsan: Make KCSAN compatible with new IRQ state tracking The new IRQ state tracking code does not honor lockdep_off(), and as such we should again permit tracing by using non-raw functions in core.c. Update the lockdep_off() comment in report.c, to reflect the fact there is still a potential risk of deadlock due to using printk() from scheduler code. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200624113246.GA170324@elver.google.com	2020-07-10 12:00:00 +02:00
John Ogness	896fbe20b4	printk: use the lockless ringbuffer Replace the existing ringbuffer usage and implementation with lockless ringbuffer usage. Even though the new ringbuffer does not require locking, all existing locking is left in place. Therefore, this change is purely replacing the underlining ringbuffer. Changes that exist due to the ringbuffer replacement: - The VMCOREINFO has been updated for the new structures. - Dictionary data is now stored in a separate data buffer from the human-readable messages. The dictionary data buffer is set to the same size as the message buffer. Therefore, the total required memory for both dictionary and message data is 2 * (2 ^ CONFIG_LOG_BUF_SHIFT) for the initial static buffers and 2 * log_buf_len (the kernel parameter) for the dynamic buffers. - Record meta-data is now stored in a separate array of descriptors. This is an additional 72 * (2 ^ (CONFIG_LOG_BUF_SHIFT - 5)) bytes for the static array and 72 * (log_buf_len >> 5) bytes for the dynamic array. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200709132344.760-5-john.ogness@linutronix.de	2020-07-10 08:48:55 +02:00
John Ogness	8749efc0c0	Revert "printk: lock/unlock console only for new logbuf entries" This reverts commit `3ac37a93fa`. This optimization will not apply once the transition to a lockless printk is complete. Rather than porting this optimization through the transition only to remove it anyway, just revert it now to simplify the transition. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200709132344.760-4-john.ogness@linutronix.de	2020-07-10 08:48:35 +02:00
John Ogness	b6cf8b3f33	printk: add lockless ringbuffer Introduce a multi-reader multi-writer lockless ringbuffer for storing the kernel log messages. Readers and writers may use their API from any context (including scheduler and NMI). This ringbuffer will make it possible to decouple printk() callers from any context, locking, or console constraints. It also makes it possible for readers to have full access to the ringbuffer contents at any time and context (for example from any panic situation). The printk_ringbuffer is made up of 3 internal ringbuffers: desc_ring: A ring of descriptors. A descriptor contains all record meta data (sequence number, timestamp, loglevel, etc.) as well as internal state information about the record and logical positions specifying where in the other ringbuffers the text and dictionary strings are located. text_data_ring: A ring of data blocks. A data block consists of an unsigned long integer (ID) that maps to a desc_ring index followed by the text string of the record. dict_data_ring: A ring of data blocks. A data block consists of an unsigned long integer (ID) that maps to a desc_ring index followed by the dictionary string of the record. The internal state information of a descriptor is the key element to allow readers and writers to locklessly synchronize access to the data. Co-developed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200709132344.760-3-john.ogness@linutronix.de	2020-07-10 08:48:19 +02:00
Vincent Chen	8c080d3a97	kgdb: enable arch to support XML packet. The XML packet could be supported by required architecture if the architecture defines CONFIG_HAVE_ARCH_KGDB_QXFER_PKT and implement its own kgdb_arch_handle_qxfer_pkt(). Except for the kgdb_arch_handle_qxfer_pkt(), the architecture also needs to record the feature supported by gdb stub into the kgdb_arch_gdb_stub_feature, and these features will be reported to host gdb when gdb stub receives the qSupported packet. Signed-off-by: Vincent Chen <vincent.chen@sifive.com> Acked-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>	2020-07-09 20:09:28 -07:00
Wei Yang	36b8aacf2a	tracing: Save one trace_event->type by using __TRACE_LAST_TYPE Static defined trace_event->type stops at (__TRACE_LAST_TYPE - 1) and dynamic trace_event->type starts from (__TRACE_LAST_TYPE + 1). To save one trace_event->type index, let's use __TRACE_LAST_TYPE. Link: https://lkml.kernel.org/r/20200703020612.12930-3-richard.weiyang@linux.alibaba.com Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-07-09 18:14:58 -04:00
Wei Yang	746cf3459f	tracing: Simplify defining of the next event id The value to be used and compared in trace_search_list() is "last + 1". Let's just define next to be "last + 1" instead of doing the addition each time. Link: https://lkml.kernel.org/r/20200703020612.12930-2-richard.weiyang@linux.alibaba.com Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-07-09 18:00:47 -04:00
Linus Torvalds	ce69fb3b39	Merge tag 'kallsyms_show_value-v5.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kallsyms fix from Kees Cook: "Refactor kallsyms_show_value() users for correct cred. I'm not delighted by the timing of getting these changes to you, but it does fix a handful of kernel address exposures, and no one has screamed yet at the patches. Several users of kallsyms_show_value() were performing checks not during "open". Refactor everything needed to gain proper checks against file->f_cred for modules, kprobes, and bpf" * tag 'kallsyms_show_value-v5.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: selftests: kmod: Add module address visibility test bpf: Check correct cred for CAP_SYSLOG in bpf_dump_raw_ok() kprobes: Do not expose probe addresses to non-CAP_SYSLOG module: Do not expose section addresses to non-CAP_SYSLOG module: Refactor section attr into bin attribute kallsyms: Refactor kallsyms_show_value() to take cred	2020-07-09 13:09:30 -07:00
Martin KaFai Lau	c9a368f1c0	bpf: net: Avoid incorrect bpf_sk_reuseport_detach call bpf_sk_reuseport_detach is currently called when sk->sk_user_data is not NULL. It is incorrect because sk->sk_user_data may not be managed by the bpf's reuseport_array. It has been reported in [1] that, the bpf_sk_reuseport_detach() which is called from udp_lib_unhash() has corrupted the sk_user_data managed by l2tp. This patch solves it by using another bit (defined as SK_USER_DATA_BPF) of the sk_user_data pointer value. It marks that a sk_user_data is managed/owned by BPF. The patch depends on a PTRMASK introduced in commit `f1ff5ce2cd` ("net, sk_msg: Clear sk_user_data pointer on clone if tagged"). [ Note: sk->sk_user_data is used by bpf's reuseport_array only when a sk is added to the bpf's reuseport_array. i.e. doing setsockopt(SO_REUSEPORT) and having "sk->sk_reuseport == 1" alone will not stop sk->sk_user_data being used by other means. ] [1]: https://lore.kernel.org/netdev/20200706121259.GA20199@katalix.com/ Fixes: `5dc4c4b7d4` ("bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY") Reported-by: James Chapman <jchapman@katalix.com> Reported-by: syzbot+9f092552ba9a5efca5df@syzkaller.appspotmail.com Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: James Chapman <jchapman@katalix.com> Acked-by: James Chapman <jchapman@katalix.com> Link: https://lore.kernel.org/bpf/20200709061110.4019316-1-kafai@fb.com	2020-07-09 22:03:31 +02:00
Martin KaFai Lau	f3dda7a679	bpf: net: Avoid copying sk_user_data of reuseport_array during sk_clone It makes little sense for copying sk_user_data of reuseport_array during sk_clone_lock(). This patch reuses the SK_USER_DATA_NOCOPY bit introduced in commit `f1ff5ce2cd` ("net, sk_msg: Clear sk_user_data pointer on clone if tagged"). It is used to mark the sk_user_data is not supposed to be copied to its clone. Although the cloned sk's sk_user_data will not be used/freed in bpf_sk_reuseport_detach(), this change can still allow the cloned sk's sk_user_data to be used by some other means. Freeing the reuseport_array's sk_user_data does not require a rcu grace period. Thus, the existing rcu_assign_sk_user_data_nocopy() is not used. Fixes: `5dc4c4b7d4` ("bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20200709061104.4018798-1-kafai@fb.com	2020-07-09 22:03:31 +02:00
Frederic Weisbecker	30c66fc30e	timer: Prevent base->clk from moving backward When a timer is enqueued with a negative delta (ie: expiry is below base->clk), it gets added to the wheel as expiring now (base->clk). Yet the value that gets stored in base->next_expiry, while calling trigger_dyntick_cpu(), is the initial timer->expires value. The resulting state becomes: base->next_expiry < base->clk On the next timer enqueue, forward_timer_base() may accidentally rewind base->clk. As a possible outcome, timers may expire way too early, the worst case being that the highest wheel levels get spuriously processed again. To prevent from that, make sure that base->next_expiry doesn't get below base->clk. Fixes: `a683f390b9` ("timers: Forward the wheel clock whenever possible") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Tested-by: Juri Lelli <juri.lelli@redhat.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200703010657.2302-1-frederic@kernel.org	2020-07-09 11:56:57 +02:00
Richard Guy Briggs	d7481b24b8	audit: issue CWD record to accompany LSM_AUDIT_DATA_* records The LSM_AUDIT_DATA_* records for PATH, FILE, IOCTL_OP, DENTRY and INODE are incomplete without the task context of the AUDIT Current Working Directory record. Add it. This record addition can't use audit_dummy_context to determine whether or not to store the record information since the LSM_AUDIT_DATA_* records are initiated by various LSMs independent of any audit rules. context->in_syscall is used to determine if it was called in user context like audit_getname. Please see the upstream issue https://github.com/linux-audit/audit-kernel/issues/96 Adapted from Vladis Dronov's v2 patch. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2020-07-08 19:02:11 -04:00
Kees Cook	6396026045	bpf: Check correct cred for CAP_SYSLOG in bpf_dump_raw_ok() When evaluating access control over kallsyms visibility, credentials at open() time need to be used, not the "current" creds (though in BPF's case, this has likely always been the same). Plumb access to associated file->f_cred down through bpf_dump_raw_ok() and its callers now that kallsysm_show_value() has been refactored to take struct cred. Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: bpf@vger.kernel.org Cc: stable@vger.kernel.org Fixes: `7105e828c0` ("bpf: allow for correlation of maps and helpers in dump") Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-08 16:01:21 -07:00
Kees Cook	60f7bb66b8	kprobes: Do not expose probe addresses to non-CAP_SYSLOG The kprobe show() functions were using "current"'s creds instead of the file opener's creds for kallsyms visibility. Fix to use seq_file->file->f_cred. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: stable@vger.kernel.org Fixes: `81365a947d` ("kprobes: Show address of kprobes if kallsyms does") Fixes: `ffb9bd68eb` ("kprobes: Show blacklist addresses as same as kallsyms does") Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-08 16:00:22 -07:00
Kees Cook	b25a7c5af9	module: Do not expose section addresses to non-CAP_SYSLOG The printing of section addresses in /sys/module//sections/ was not using the correct credentials to evaluate visibility. Before: # cat /sys/module//sections/.text 0xffffffffc0458000 ... # capsh --drop=CAP_SYSLOG -- -c "cat /sys/module//sections/.text" 0xffffffffc0458000 ... After: # cat /sys/module//sections/.text 0xffffffffc0458000 ... # capsh --drop=CAP_SYSLOG -- -c "cat /sys/module//sections/.text" 0x0000000000000000 ... Additionally replaces the existing (safe) /proc/modules check with file->f_cred for consistency. Reported-by: Dominik Czarnota <dominik.czarnota@trailofbits.com> Fixes: `be71eda538` ("module: Fix display of wrong module .text address") Cc: stable@vger.kernel.org Tested-by: Jessica Yu <jeyu@kernel.org> Acked-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-08 16:00:19 -07:00
Kees Cook	ed66f991bb	module: Refactor section attr into bin attribute In order to gain access to the open file's f_cred for kallsym visibility permission checks, refactor the module section attributes to use the bin_attribute instead of attribute interface. Additionally removes the redundant "name" struct member. Cc: stable@vger.kernel.org Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-by: Jessica Yu <jeyu@kernel.org> Acked-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-08 16:00:17 -07:00
Kees Cook	160251842c	kallsyms: Refactor kallsyms_show_value() to take cred In order to perform future tests against the cred saved during open(), switch kallsyms_show_value() to operate on a cred, and have all current callers pass current_cred(). This makes it very obvious where callers are checking the wrong credential in their "read" contexts. These will be fixed in the coming patches. Additionally switch return value to bool, since it is always used as a direct permission check, not a 0-on-success, negative-on-error style function return. Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-08 15:59:57 -07:00

... 17 18 19 20 21 ...

34700 Commits