IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Rework irq_find_mapping() to return an both an irq_desc pointer,
optionally the virtual irq number, and rename the result to
__irq_resolve_mapping(). a new helper called irq_resolve_mapping()
is provided for code that doesn't need the virtual irq number.
irq_find_mapping() is also rewritten in terms of __irq_resolve_mapping().
Signed-off-by: Marc Zyngier <maz@kernel.org>
It is pretty odd that the radix tree uses RCU while the linear
portion doesn't, leading to potential surprises for the users,
depending on how the irqdomain has been created.
Fix this by moving the update of the linear revmap under
the mutex, and the lookup under the RCU read-side lock.
The mutex name is updated to reflect that it doesn't only
cover the radix-tree anymore.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Caching a virq number in the revmap is pretty inefficient, as
it means we will need to convert it back to either an irq_data
or irq_desc to do anything with it.
It is also a bit odd, as the radix tree does cache irq_data
pointers.
Change the revmap type to be an irq_data pointer instead of
an unsigned int, and preserve the current API for now.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Instead of open-coding the size computation of struct irqdomain,
use the struct_size() helper instead.
This is going to be handy as we change the type of the revmap
array.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Direct mappings are completely exclusive of normal mappings, meaning
that we can refactor the code slightly so that we can get rid of
the revmap_direct_max_irq field and use the revmap_size field
instead, reducing the size of the irqdomain structure.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Only a handful of old PPC systems are still using the old 'nomap'
variant of the irqdomain library. Move the associated definitions
behind a configuration option, which will allow us to make some
more radical changes.
Signed-off-by: Marc Zyngier <maz@kernel.org>
irq_linear_revmap() is supposed to be a fast path for domain
lookups, but it only exposes low-level details of the irqdomain
implementation, details which are better kept private.
The *overhead* between the two is only a function call and
a couple of tests, so it is likely that noone can show any
meaningful difference compared to the cost of taking an
interrupt.
Reimplement irq_linear_revmap() with irq_find_mapping()
in order to preserve source code compatibility, and
rename the internal field for a measure.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Someone carelessly put NMI unsafe code in irq_work_queue(), breaking
just about every single user. Also, someone has a terrible comment
style.
Fixes: e2b5bcf9f5ba ("irq_work: record irq_work_queue() call stack")
Reported-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YL+uBq8LzXXZsYVf@hirez.programming.kicks-ass.net
Using list_move() instead of list_del() + list_add().
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
AUDIT_DISABLED defined in kernel/audit.h as element of enum audit_state
and redefined in kernel/audit.c. This produces a warning when kernel builds
with syscalls audit disabled and brokes kernel build if -Werror used.
enum audit_state used in syscall audit code only. This patch changes
enum audit_state constants prefix AUDIT to AUDIT_STATE to avoid
AUDIT_DISABLED redefinition.
Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Commit 545fbd0775ba ("rq-qos: fix missed wake-ups in rq_qos_throttle")
tried to fix a problem that a process could be sleeping in rq_qos_wait()
without anyone to wake it up. However the fix is not complete and the
following can still happen:
CPU1 (waiter1) CPU2 (waiter2) CPU3 (waker)
rq_qos_wait() rq_qos_wait()
acquire_inflight_cb() -> fails
acquire_inflight_cb() -> fails
completes IOs, inflight
decreased
prepare_to_wait_exclusive()
prepare_to_wait_exclusive()
has_sleeper = !wq_has_single_sleeper() -> true as there are two sleepers
has_sleeper = !wq_has_single_sleeper() -> true
io_schedule() io_schedule()
Deadlock as now there's nobody to wakeup the two waiters. The logic
automatically blocking when there are already sleepers is really subtle
and the only way to make it work reliably is that we check whether there
are some waiters in the queue when adding ourselves there. That way, we
are guaranteed that at least the first process to enter the wait queue
will recheck the waiting condition before going to sleep and thus
guarantee forward progress.
Fixes: 545fbd0775ba ("rq-qos: fix missed wake-ups in rq_qos_throttle")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210607112613.25344-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It was reported that a bug on arm64 caused a bad ip address to be used for
updating into a nop in ftrace_init(), but the error path (rightfully)
returned -EINVAL and not -EFAULT, as the bug caused more than one error to
occur. But because -EINVAL was returned, the ftrace_bug() tried to report
what was at the location of the ip address, and read it directly. This
caused the machine to panic, as the ip was not pointing to a valid memory
address.
Instead, read the ip address with copy_from_kernel_nofault() to safely
access the memory, and if it faults, report that the address faulted,
otherwise report what was in that location.
Link: https://lore.kernel.org/lkml/20210607032329.28671-1-mark-pk.tsai@mediatek.com/
Cc: stable@vger.kernel.org
Fixes: 05736a427f7e1 ("ftrace: warn on failure to disable mcount callers")
Reported-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
PSI accounts stalls for each cgroup separately and aggregates it at each
level of the hierarchy. This causes additional overhead with psi_avgs_work
being called for each cgroup in the hierarchy. psi_avgs_work has been
highly optimized, however on systems with large number of cgroups the
overhead becomes noticeable.
Systems which use PSI only at the system level could avoid this overhead
if PSI can be configured to skip per-cgroup stall accounting.
Add "cgroup_disable=pressure" kernel command-line option to allow
requesting system-wide only pressure stall accounting. When set, it
keeps system-wide accounting under /proc/pressure/ but skips accounting
for individual cgroups and does not expose PSI nodes in cgroup hierarchy.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Some users have pointed out that path-based syscalls are problematic in
some environments and at least directory fd argument and possibly also
resolve flags are desirable for such syscalls. Rather than
reimplementing all details of pathname lookup and following where it may
eventually evolve, let's go for full file descriptor based syscall
similar to how ioctl(2) works since the beginning. Managing of quotas
isn't performance sensitive so the extra overhead of open does not
matter and we are able to consume O_PATH descriptors as well which makes
open cheap anyway. Also for frequent operations (such as retrieving
usage information for all users) we can reuse single fd and in fact get
even better performance as well as avoiding races with possible remounts
etc.
Tested-by: Sascha Hauer <s.hauer@pengutronix.de>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
netfilter and wireguard trees.
The bpf vs lockdown+audit fix is the most notable.
Current release - regressions:
- virtio-net: fix page faults and crashes when XDP is enabled
- mlx5e: fix HW timestamping with CQE compression, and make sure they
are only allowed to coexist with capable devices
- stmmac:
- fix kernel panic due to NULL pointer dereference of mdio_bus_data
- fix double clk unprepare when no PHY device is connected
Current release - new code bugs:
- mt76: a few fixes for the recent MT7921 devices and runtime
power management
Previous releases - regressions:
- ice: - track AF_XDP ZC enabled queues in bitmap to fix copy mode Tx
- fix allowing VF to request more/less queues via virtchnl
- correct supported and advertised autoneg by using PHY capabilities
- allow all LLDP packets from PF to Tx
- kbuild: quote OBJCOPY var to avoid a pahole call break the build
Previous releases - always broken:
- bpf, lockdown, audit: fix buggy SELinux lockdown permission checks
- mt76: address the recent FragAttack vulnerabilities not covered
by generic fixes
- ipv6: fix KASAN: slab-out-of-bounds Read in fib6_nh_flush_exceptions
- Bluetooth:
- fix the erroneous flush_work() order, to avoid double free
- use correct lock to prevent UAF of hdev object
- nfc: fix NULL ptr dereference in llcp_sock_getname() after failed connect
- ieee802154: multiple fixes to error checking and return values
- igb: fix XDP with PTP enabled
- intel: add correct exception tracing for XDP
- tls: fix use-after-free when TLS offload device goes down and back up
- ipvs: ignore IP_VS_SVC_F_HASHED flag when adding service
- netfilter: nft_ct: skip expectations for confirmed conntrack
- mptcp: fix falling back to TCP in presence of out of order packets
early in connection lifetime
- wireguard: switch from O(n) to a O(1) algorithm for maintaining peers,
fixing stalls and a large memory leak in the process
Misc:
- devlink: correct VIRTUAL port to not have phys_port attributes
- Bluetooth: fix VIRTIO_ID_BT assigned number
- net: return the correct errno code ENOBUF -> ENOMEM
- wireguard:
- peer: allocate in kmem_cache saving 25% on peer memory
- do not use -O3
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmC6yGMACgkQMUZtbf5S
Irv67w//ZpT4+KHETUIS+CgeUIgjAQD0FTmO4iboHFGG7BadWEZpEVswUU0xBfY/
RJrSWAEqTga8zbjWqRaLRx5Qii99F2hHPZ502VR6x6NbPu1mNdS5rUOa61YbtGCv
v4sC45eOvG7T/y5mceq4rQaPsQKEUUAIgYzIOpjSiDoMfgFCT3UUF/UrBhgLzybj
aMXd12rg17dN+RJeNOZjQKligNENX9A0tBtSGXxs9hhYYbY25O+uECOsESrA1RKt
uHeh003iqApT5x8hmJsdMDtis05n7S/Bq1/4RZfAdbTcgJngepw570bQ999tbXqE
HeB3Ls9k3Vi9W6svfUkYjFGt3GYygsVGPjFAVhC+g0TZXAgdsh5w2SPQAgcIrzIr
WOfDL9hu7OJp/XRsPiB9pg8cul7a4Q5Yhp29bvN33u43AMij2TWD0CpKCQt9UQdi
8V0KOLAGC8bzXx35VTP/pbbwAI21PIYxVKfe/0cOJKShTMtfPePx1a2cuYRWoQSP
PYYbQaY6WhfUniV3DEmvL1Z+dgL0yyaJKIV2IdBHR8MPKKy+5kD+6HDaNo2lO75J
wWSN1LtoVKrc5msCD375epGmkbjatpWdfzOE+pljWHz5LnW+2cGwFhCo7+UJhAG5
XwE8+G9YUyYH51PjFpGBsoPBWEmYmIMnY34p20A1Pz1M7/HFfXc=
=sNP5
-----END PGP SIGNATURE-----
Merge tag 'net-5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from bpf, wireless, netfilter and
wireguard trees.
The bpf vs lockdown+audit fix is the most notable.
Things haven't slowed down just yet, both in terms of regressions in
current release and largish fixes for older code, but we usually see a
slowdown only after -rc5.
Current release - regressions:
- virtio-net: fix page faults and crashes when XDP is enabled
- mlx5e: fix HW timestamping with CQE compression, and make sure they
are only allowed to coexist with capable devices
- stmmac:
- fix kernel panic due to NULL pointer dereference of
mdio_bus_data
- fix double clk unprepare when no PHY device is connected
Current release - new code bugs:
- mt76: a few fixes for the recent MT7921 devices and runtime power
management
Previous releases - regressions:
- ice:
- track AF_XDP ZC enabled queues in bitmap to fix copy mode Tx
- fix allowing VF to request more/less queues via virtchnl
- correct supported and advertised autoneg by using PHY
capabilities
- allow all LLDP packets from PF to Tx
- kbuild: quote OBJCOPY var to avoid a pahole call break the build
Previous releases - always broken:
- bpf, lockdown, audit: fix buggy SELinux lockdown permission checks
- mt76: address the recent FragAttack vulnerabilities not covered by
generic fixes
- ipv6: fix KASAN: slab-out-of-bounds Read in
fib6_nh_flush_exceptions
- Bluetooth:
- fix the erroneous flush_work() order, to avoid double free
- use correct lock to prevent UAF of hdev object
- nfc: fix NULL ptr dereference in llcp_sock_getname() after failed
connect
- ieee802154: multiple fixes to error checking and return values
- igb: fix XDP with PTP enabled
- intel: add correct exception tracing for XDP
- tls: fix use-after-free when TLS offload device goes down and back
up
- ipvs: ignore IP_VS_SVC_F_HASHED flag when adding service
- netfilter: nft_ct: skip expectations for confirmed conntrack
- mptcp: fix falling back to TCP in presence of out of order packets
early in connection lifetime
- wireguard: switch from O(n) to a O(1) algorithm for maintaining
peers, fixing stalls and a large memory leak in the process
Misc:
- devlink: correct VIRTUAL port to not have phys_port attributes
- Bluetooth: fix VIRTIO_ID_BT assigned number
- net: return the correct errno code ENOBUF -> ENOMEM
- wireguard:
- peer: allocate in kmem_cache saving 25% on peer memory
- do not use -O3"
* tag 'net-5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
cxgb4: avoid link re-train during TC-MQPRIO configuration
sch_htb: fix refcount leak in htb_parent_to_leaf_offload
wireguard: allowedips: free empty intermediate nodes when removing single node
wireguard: allowedips: allocate nodes in kmem_cache
wireguard: allowedips: remove nodes in O(1)
wireguard: allowedips: initialize list head in selftest
wireguard: peer: allocate in kmem_cache
wireguard: use synchronize_net rather than synchronize_rcu
wireguard: do not use -O3
wireguard: selftests: make sure rp_filter is disabled on vethc
wireguard: selftests: remove old conntrack kconfig value
virtchnl: Add missing padding to virtchnl_proto_hdrs
ice: Allow all LLDP packets from PF to Tx
ice: report supported and advertised autoneg using PHY capabilities
ice: handle the VF VSI rebuild failure
ice: Fix VFR issues for AVF drivers that expect ATQLEN cleared
ice: Fix allowing VF to request more/less queues via virtchnl
virtio-net: fix for skb_over_panic inside big mode
ipv6: Fix KASAN: slab-out-of-bounds Read in fib6_nh_flush_exceptions
fib: Return the correct errno code
...
Fix function name in cgroup.c and rstat.c kernel-doc comment
to remove these warnings found by clang_w1.
kernel/cgroup/cgroup.c:2401: warning: expecting prototype for
cgroup_taskset_migrate(). Prototype was for cgroup_migrate_execute()
instead.
kernel/cgroup/rstat.c:233: warning: expecting prototype for
cgroup_rstat_flush_begin(). Prototype was for cgroup_rstat_flush_hold()
instead.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: 'commit e595cd706982 ("cgroup: track migration context in cgroup_mgctx")'
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Revert commit 4698f88c06b8 ("sched/debug: Fix 'schedstats=enable'
cmdline option").
After commit 6041186a3258 ("init: initialize jump labels before
command line option parsing") we can rely on jump label infra being
ready for use when setup_schedstats() is called.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20210602112108.1709635-1-eric.dumazet@gmail.com
The util_est internal UTIL_AVG_UNCHANGED flag which is used to prevent
unnecessary util_est updates uses the LSB of util_est.enqueued. It is
exposed via _task_util_est() (and task_util_est()).
Commit 92a801e5d5b7 ("sched/fair: Mask UTIL_AVG_UNCHANGED usages")
mentions that the LSB is lost for util_est resolution but
find_energy_efficient_cpu() checks if task_util_est() returns 0 to
return prev_cpu early.
_task_util_est() returns the max value of util_est.ewma and
util_est.enqueued or'ed w/ UTIL_AVG_UNCHANGED.
So task_util_est() returning the max of task_util() and
_task_util_est() will never return 0 under the default
SCHED_FEAT(UTIL_EST, true).
To fix this use the MSB of util_est.enqueued instead and keep the flag
util_est internal, i.e. don't export it via _task_util_est().
The maximal possible util_avg value for a task is 1024 so the MSB of
'unsigned int util_est.enqueued' isn't used to store a util value.
As a caveat the code behind the util_est_se trace point has to filter
UTIL_AVG_UNCHANGED to see the real util_est.enqueued value which should
be easy to do.
This also fixes an issue report by Xuewen Yan that util_est_update()
only used UTIL_AVG_UNCHANGED for the subtrahend of the equation:
last_enqueued_diff = ue.enqueued - (task_util() | UTIL_AVG_UNCHANGED)
Fixes: b89997aa88f0b sched/pelt: Fix task util_est update filtering
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210602145808.1562603-1-dietmar.eggemann@arm.com
Rounding in PELT calculation happening when entities are attached/detached
of a cfs_rq can result into situations where util/runnable_avg is not null
but util/runnable_sum is. This is normally not possible so we need to
ensure that util/runnable_sum stays synced with util/runnable_avg.
detach_entity_load_avg() is the last place where we don't sync
util/runnable_sum with util/runnbale_avg when moving some sched_entities
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210601085832.12626-1-vincent.guittot@linaro.org
Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
added an implementation of the locked_down LSM hook to SELinux, with the aim
to restrict which domains are allowed to perform operations that would breach
lockdown. This is indirectly also getting audit subsystem involved to report
events. The latter is problematic, as reported by Ondrej and Serhei, since it
can bring down the whole system via audit:
1) The audit events that are triggered due to calls to security_locked_down()
can OOM kill a machine, see below details [0].
2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
when trying to wake up kauditd, for example, when using trace_sched_switch()
tracepoint, see details in [1]. Triggering this was not via some hypothetical
corner case, but with existing tools like runqlat & runqslower from bcc, for
example, which make use of this tracepoint. Rough call sequence goes like:
rq_lock(rq) -> -------------------------+
trace_sched_switch() -> |
bpf_prog_xyz() -> +-> deadlock
selinux_lockdown() -> |
audit_log_end() -> |
wake_up_interruptible() -> |
try_to_wake_up() -> |
rq_lock(rq) --------------+
What's worse is that the intention of 59438b46471a to further restrict lockdown
settings for specific applications in respect to the global lockdown policy is
completely broken for BPF. The SELinux policy rule for the current lockdown check
looks something like this:
allow <who> <who> : lockdown { <reason> };
However, this doesn't match with the 'current' task where the security_locked_down()
is executed, example: httpd does a syscall. There is a tracing program attached
to the syscall which triggers a BPF program to run, which ends up doing a
bpf_probe_read_kernel{,_str}() helper call. The selinux_lockdown() hook does
the permission check against 'current', that is, httpd in this example. httpd
has literally zero relation to this tracing program, and it would be nonsensical
having to write an SELinux policy rule against httpd to let the tracing helper
pass. The policy in this case needs to be against the entity that is installing
the BPF program. For example, if bpftrace would generate a histogram of syscall
counts by user space application:
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
bpftrace would then go and generate a BPF program from this internally. One way
of doing it [for the sake of the example] could be to call bpf_get_current_task()
helper and then access current->comm via one of bpf_probe_read_kernel{,_str}()
helpers. So the program itself has nothing to do with httpd or any other random
app doing a syscall here. The BPF program _explicitly initiated_ the lockdown
check. The allow/deny policy belongs in the context of bpftrace: meaning, you
want to grant bpftrace access to use these helpers, but other tracers on the
system like my_random_tracer _not_.
Therefore fix all three issues at the same time by taking a completely different
approach for the security_locked_down() hook, that is, move the check into the
program verification phase where we actually retrieve the BPF func proto. This
also reliably gets the task (current) that is trying to install the BPF tracing
program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also fixes the OOM since
we're moving this out of the BPF helper's fast-path which can be called several
millions of times per second.
The check is then also in line with other security_locked_down() hooks in the
system where the enforcement is performed at open/load time, for example,
open_kcore() for /proc/kcore access or module_sig_check() for module signatures
just to pick few random ones. What's out of scope in the fix as well as in
other security_locked_down() hook locations /outside/ of BPF subsystem is that
if the lockdown policy changes on the fly there is no retrospective action.
This requires a different discussion, potentially complex infrastructure, and
it's also not clear whether this can be solved generically. Either way, it is
out of scope for a suitable stable fix which this one is targeting. Note that
the breakage is specifically on 59438b46471a where it started to rely on 'current'
as UAPI behavior, and _not_ earlier infrastructure such as 9d1f8be5cf42 ("bpf:
Restrict bpf when kernel lockdown is in confidentiality mode").
[0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says:
I starting seeing this with F-34. When I run a container that is traced with
BPF to record the syscalls it is doing, auditd is flooded with messages like:
type=AVC msg=audit(1619784520.593:282387): avc: denied { confidentiality }
for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0
tclass=lockdown permissive=0
This seems to be leading to auditd running out of space in the backlog buffer
and eventually OOMs the machine.
[...]
auditd running at 99% CPU presumably processing all the messages, eventually I get:
Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
[...]
[1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/,
Serhei Makarov says:
Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
bpf_probe_read_compat() call within a sched_switch tracepoint. The problem
is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend
testsuite on x86_64 as well as the runqlat, runqslower tools from bcc on
ppc64le. Example stack trace:
[...]
[ 730.868702] stack backtrace:
[ 730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
[ 730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
[ 730.873278] Call Trace:
[ 730.873770] dump_stack+0x7f/0xa1
[ 730.874433] check_noncircular+0xdf/0x100
[ 730.875232] __lock_acquire+0x1202/0x1e10
[ 730.876031] ? __lock_acquire+0xfc0/0x1e10
[ 730.876844] lock_acquire+0xc2/0x3a0
[ 730.877551] ? __wake_up_common_lock+0x52/0x90
[ 730.878434] ? lock_acquire+0xc2/0x3a0
[ 730.879186] ? lock_is_held_type+0xa7/0x120
[ 730.880044] ? skb_queue_tail+0x1b/0x50
[ 730.880800] _raw_spin_lock_irqsave+0x4d/0x90
[ 730.881656] ? __wake_up_common_lock+0x52/0x90
[ 730.882532] __wake_up_common_lock+0x52/0x90
[ 730.883375] audit_log_end+0x5b/0x100
[ 730.884104] slow_avc_audit+0x69/0x90
[ 730.884836] avc_has_perm+0x8b/0xb0
[ 730.885532] selinux_lockdown+0xa5/0xd0
[ 730.886297] security_locked_down+0x20/0x40
[ 730.887133] bpf_probe_read_compat+0x66/0xd0
[ 730.887983] bpf_prog_250599c5469ac7b5+0x10f/0x820
[ 730.888917] trace_call_bpf+0xe9/0x240
[ 730.889672] perf_trace_run_bpf_submit+0x4d/0xc0
[ 730.890579] perf_trace_sched_switch+0x142/0x180
[ 730.891485] ? __schedule+0x6d8/0xb20
[ 730.892209] __schedule+0x6d8/0xb20
[ 730.892899] schedule+0x5b/0xc0
[ 730.893522] exit_to_user_mode_prepare+0x11d/0x240
[ 730.894457] syscall_exit_to_user_mode+0x27/0x70
[ 730.895361] entry_SYSCALL_64_after_hwframe+0x44/0xae
[...]
Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Reported-by: Jakub Hrozek <jhrozek@redhat.com>
Reported-by: Serhei Makarov <smakarov@redhat.com>
Reported-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Frank Eigler <fche@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/bpf/01135120-8bf7-df2e-cff0-1d73f1f841c3@iogearbox.net
Will reported that the 'XXX __migrate_task() can fail' in migration_cpu_stop()
can happen, and it *is* sort of a big deal. Looking at it some more, one
will note there is a glaring hole in the deferred CPU selection:
(w/ CONFIG_CPUSET=n, so that the affinity mask passed via taskset doesn't
get AND'd with cpu_online_mask)
$ taskset -pc 0-2 $PID
# offline CPUs 3-4
$ taskset -pc 3-5 $PID
`\
$PID may stay on 0-2 due to the cpumask_any_distribute() picking an
offline CPU and __migrate_task() refusing to do anything due to
cpu_is_allowed().
set_cpus_allowed_ptr() goes to some length to pick a dest_cpu that matches
the right constraints vs affinity and the online/active state of the
CPUs. Reuse that instead of discarding it in the affine_move_task() case.
Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Reported-by: Will Deacon <will@kernel.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210526205751.842360-2-valentin.schneider@arm.com
When using something other than 8 spaces per tab, this ascii art
makes not sense, and the reader might end up wondering what this
advanced equation "is".
Signed-off-by: Odin Ugedal <odin@uged.al>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210518125202.78658-4-odin@uged.al
Extend 8fb12156b8db ("init: Pin init task to the boot CPU, initially")
to cover the new PF_NO_SETAFFINITY requirement.
While there, move wait_for_completion(&kthreadd_done) into kernel_init()
to make it absolutely clear it is the very first thing done by the init
thread.
Fixes: 570a752b7a9b ("lib/smp_processor_id: Use is_percpu_thread() instead of nr_cpus_allowed")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/YLS4mbKUrA3Gnb4t@hirez.programming.kicks-ass.net
Hugh noted that the SCHED_CORE Kconfig option could do with a help
text.
Requested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Hugh Dickins <hughd@google.com>
Link: https://lkml.kernel.org/r/YKyhtwhEgvtUDOyl@hirez.programming.kicks-ass.net
The reason for kprobe::fault_handler(), as given by their comment:
* We come here because instructions in the pre/post
* handler caused the page_fault, this could happen
* if handler tries to access user space by
* copy_from_user(), get_user() etc. Let the
* user-specified handler try to fix it first.
Is just plain bad. Those other handlers are ran from non-preemptible
context and had better use _nofault() functions. Also, there is no
upstream usage of this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20210525073213.561116662@infradead.org
This patch introduces a new algorithm for multiplication of tristate
numbers (tnums) that is provably sound. It is faster and more precise when
compared to the existing method.
Like the existing method, this new algorithm follows the long
multiplication algorithm. The idea is to generate partial products by
multiplying each bit in the multiplier (tnum a) with the multiplicand
(tnum b), and adding the partial products after appropriately bit-shifting
them. The new algorithm, however, uses just a single loop over the bits of
the multiplier (tnum a) and accumulates only the uncertain components of
the multiplicand (tnum b) into a mask-only tnum. The following paper
explains the algorithm in more detail: https://arxiv.org/abs/2105.05398.
A natural way to construct the tnum product is by performing a tnum
addition on all the partial products. This algorithm presents another
method of doing this: decompose each partial product into two tnums,
consisting of the values and the masks separately. The mask-sum is
accumulated within the loop in acc_m. The value-sum tnum is generated
using a.value * b.value. The tnum constructed by tnum addition of the
value-sum and the mask-sum contains all possible summations of concrete
values drawn from the partial product tnums pairwise. We prove this result
in the paper.
Our evaluations show that the new algorithm is overall more precise
(producing tnums with less uncertain components) than the existing method.
As an illustrative example, consider the input tnums A and B. The numbers
in the parenthesis correspond to (value;mask).
A = 000000x1 (1;2)
B = 0010011x (38;1)
A * B (existing) = xxxxxxxx (0;255)
A * B (new) = 0x1xxxxx (32;95)
Importantly, we present a proof of soundness of the new algorithm in the
aforementioned paper. Additionally, we show that this new algorithm is
empirically faster than the existing method.
Co-developed-by: Matan Shachnai <m.shachnai@rutgers.edu>
Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Matan Shachnai <m.shachnai@rutgers.edu>
Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@rutgers.edu>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://arxiv.org/abs/2105.05398
Link: https://lore.kernel.org/bpf/20210531020157.7386-1-harishankar.vishwanathan@rutgers.edu
With the introduction of per-cpu wakeup devices that can be used in
preference to the broadcast timer, print the name of such devices when
they are available.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210524221818.15850-6-will@kernel.org
When configuring the broadcast timer on entry to and exit from deep idle
states, prefer a per-CPU wakeup timer if one exists.
On entry to idle, stop the tick device and transfer the next event into
the oneshot wakeup device, which will serve as the wakeup from idle. To
avoid the overhead of additional hardware accesses on exit from idle,
leave the timer armed and treat the inevitable interrupt as a (possibly
spurious) tick event.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210524221818.15850-5-will@kernel.org
Some SoCs have two per-cpu timer implementations where the timer with the
higher rating stops in deep idle (i.e. suffers from CLOCK_EVT_FEAT_C3STOP)
but is otherwise preferable to the timer with the lower rating. In such a
design, selecting the higher rated devices relies on a global broadcast
timer and IPIs to wake up from deep idle states.
To avoid the reliance on a global broadcast timer and also to reduce the
overhead associated with the IPI wakeups, extend
tick_install_broadcast_device() to manage per-cpu wakeup timers separately
from the broadcast device.
For now, these timers remain unused.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210524221818.15850-4-will@kernel.org
In preparation for adding support for per-cpu wakeup timers, split
_tick_broadcast_oneshot_control() into a helper function which deals
only with the broadcast timer management across idle transitions.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210524221818.15850-3-will@kernel.org
tick-broadcast.o is only built if CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
so remove the redundant #ifdef guards around the definition of
tick_receive_broadcast().
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210524221818.15850-2-will@kernel.org
Use the DEVICE_ATTR_[RO|WO] helpers instead of plain DEVICE_ATTR, which
makes the code a bit shorter and easier to read.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210523065825.19684-1-yuehaibing@huawei.com
KCSAN reports a data race between increment and decrement of pin_count:
write to 0xffff888237c2d4e0 of 4 bytes by task 15740 on cpu 1:
find_get_context kernel/events/core.c:4617
__do_sys_perf_event_open kernel/events/core.c:12097 [inline]
__se_sys_perf_event_open kernel/events/core.c:11933
...
read to 0xffff888237c2d4e0 of 4 bytes by task 15743 on cpu 0:
perf_unpin_context kernel/events/core.c:1525 [inline]
__do_sys_perf_event_open kernel/events/core.c:12328 [inline]
__se_sys_perf_event_open kernel/events/core.c:11933
...
Because neither read-modify-write here is atomic, this can lead to one
of the operations being lost, resulting in an inconsistent pin_count.
Fix it by adding the missing locking in the CPU-event case.
Fixes: fe4b04fa31a6 ("perf: Cure task_oncpu_function_call() races")
Reported-by: syzbot+142c9018f5962db69c7e@syzkaller.appspotmail.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210527104711.2671610-1-elver@google.com
Checking for and processing RCU-nocb deferred wakeup upon user/guest
entry is only relevant when nohz_full runs on the local CPU, otherwise
the periodic tick should take care of it.
Make sure we don't needlessly pollute these fast-paths as a -3%
performance regression on a will-it-scale.per_process_ops has been
reported so far.
Fixes: 47b8ff194c1f (entry: Explicitly flush pending rcuog wakeup before last rescheduling point)
Fixes: 4ae7dc97f726 (entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point)
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210527113441.465489-1-frederic@kernel.org
During the update of fair blocked load (__update_blocked_fair()), we
update the contribution of the cfs in tg->load_avg if cfs_rq's pelt
has decayed. Nevertheless, the pelt values of a cfs_rq could have
been recently updated while propagating the change of a child. In this
case, cfs_rq's pelt will not decayed because it has already been
updated and we don't update tg->load_avg.
__update_blocked_fair
...
for_each_leaf_cfs_rq_safe: child cfs_rq
update cfs_rq_load_avg() for child cfs_rq
...
update_load_avg(cfs_rq_of(se), se, 0)
...
update cfs_rq_load_avg() for parent cfs_rq
-propagation of child's load makes parent cfs_rq->load_sum
becoming null
-UPDATE_TG is not set so it doesn't update parent
cfs_rq->tg_load_avg_contrib
..
for_each_leaf_cfs_rq_safe: parent cfs_rq
update cfs_rq_load_avg() for parent cfs_rq
- nothing to do because parent cfs_rq has already been updated
recently so cfs_rq->tg_load_avg_contrib is not updated
...
parent cfs_rq is decayed
list_del_leaf_cfs_rq parent cfs_rq
- but it still contibutes to tg->load_avg
we must set UPDATE_TG flags when propagting pending load to the parent
Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Reported-by: Odin Ugedal <odin@uged.al>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Odin Ugedal <odin@uged.al>
Link: https://lkml.kernel.org/r/20210527122916.27683-3-vincent.guittot@linaro.org
when removing a cfs_rq from the list we only check _sum value so we must
ensure that _avg and _sum stay synced so load_sum can't be null whereas
load_avg is not after propagating load in the cgroup hierarchy.
Use load_avg to compute load_sum similarly to what is done for util_sum
and runnable_sum.
Fixes: 0e2d2aaaae52 ("sched/fair: Rewrite PELT migration propagation")
Reported-by: Odin Ugedal <odin@uged.al>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Odin Ugedal <odin@uged.al>
Link: https://lkml.kernel.org/r/20210527122916.27683-2-vincent.guittot@linaro.org
- Fix addfd notification race condition (Sargun Dhillon)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmCyhNIACgkQiXL039xt
wCbExBAAoniF2+pW8sN32KK6a4uLGJCPCcbwZqWGw2zINqn6+I6KGAld37lGPu3E
ASuu28O45NXcP9SpHLxNT1jRhAet57G6OjSV78jEzVII2EogUIBOyRji7yTk8xCt
kCp21/9RaQ3DitYe2vh9R2neNIZh/PodmY8V5tkP2HacgaEuf5+yRhB/1QbTm7HG
+mMZsejw1eEryJ49cw7XkYpWNjyz5vxwvXWJt6nfgm7wTnNopUQUKJGwnp2bX9cZ
LUgstLq0SpHW7uxwEq4NYux3qsD9kaj5SgZxb/6KkHNmg5q6WUXxm0FljipEIhq1
RBTLdH+6Ct+DcDryno2VDoRNP/Q3pim9jxTpfQQ5V6f4dVqNv6pVuR2uNfK/iEX2
mk7Rc99IifaXeOLITKGusZrm16msVg+o7wAu0B1iT0vyacPcwRXJtIWy829Z+gCP
r5OsBguxPPTkxfoRWYX4WDNcZmuBC5hkyqzN8toiQjOGghdm9nXdH4jFl8kcqZps
I7i0Me3JBWVskx1d8AKlkJv3ctbdUX7QV/HaPdsMLlXTLyqBR76D/uqeUFgmWpUq
2ib3bkJzRNYgm2nron1fmDOLTiJGVfEha5hmbThPrVziYv7+jwamHzPf8jPvB+tg
nOpw/HEfoVQtuq/e+Ocdv6TLnZAZWnvxYC/RB3aTBq5xz+74nYA=
=c9Hd
-----END PGP SIGNATURE-----
Merge tag 'seccomp-fixes-v5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp fixes from Kees Cook:
"This fixes a hard-to-hit race condition in the addfd user_notif
feature of seccomp, visible since v5.9.
And a small documentation fix"
* tag 'seccomp-fixes-v5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
seccomp: Refactor notification handler to prepare for new semantics
Documentation: seccomp: Fix user notification documentation
This refactors the user notification code to have a do / while loop around
the completion condition. This has a small change in semantic, in that
previously we ignored addfd calls upon wakeup if the notification had been
responded to, but instead with the new change we check for an outstanding
addfd calls prior to returning to userspace.
Rodrigo Campos also identified a bug that can result in addfd causing
an early return, when the supervisor didn't actually handle the
syscall [1].
[1]: https://lore.kernel.org/lkml/20210413160151.3301-1-rodrigo@kinvolk.io/
Fixes: 7cf97b125455 ("seccomp: Introduce addfd ioctl to seccomp user notifier")
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Tycho Andersen <tycho@tycho.pizza>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Rodrigo Campos <rodrigo@kinvolk.io>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210517193908.3113-3-sargun@sargun.me
As Colin pointed out, the first drops assignment after declaration will
be overwritten by the second drops assignment before using, which makes
it useless.
Since the drops variable will be used only once. Just remove it and
use "cnt - sent" in trace_xdp_devmap_xmit().
Fixes: cb261b594b41 ("bpf: Run devmap xdp_prog on flush instead of bulk enqueue")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210528024356.24333-1-liuhangbin@gmail.com
If set_cred_ucounts() failed, we need return the error code.
Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Link: https://lkml.kernel.org/r/20210526143805.2549649-1-yangyingliang@huawei.com
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Fix the following W=1 kernel build warning(s):
kernel/events/hw_breakpoint.c:461: warning: Function parameter or member 'context' not described in 'register_user_hw_breakpoint'
kernel/events/hw_breakpoint.c:560: warning: Function parameter or member 'context' not described in 'register_wide_hw_breakpoint'
Signed-off-by: Haocheng Xie <xiehaocheng.cn@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210527031947.1801-4-xiehaocheng.cn@gmail.com
Fix the following W=1 kernel build warning(s):
kernel/events/core.c:143: warning: Function parameter or member 'cpu' not described in 'cpu_function_call'
kernel/events/core.c:11924: warning: Function parameter or member 'flags' not described in 'sys_perf_event_open'
kernel/events/core.c:12382: warning: Function parameter or member 'overflow_handler' not described in 'perf_event_create_kernel_counter'
kernel/events/core.c:12382: warning: Function parameter or member 'context' not described in 'perf_event_create_kernel_counter'
Signed-off-by: Haocheng Xie <xiehaocheng.cn@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210527031947.1801-3-xiehaocheng.cn@gmail.com
can and wireless trees. Notably including fixes for the recently
announced "FragAttacks" WiFi vulnerabilities. Rather large batch,
touching some core parts of the stack, too, but nothing hair-raising.
Current release - regressions:
- tipc: make node link identity publish thread safe
- dsa: felix: re-enable TAS guard band mode
- stmmac: correct clocks enabled in stmmac_vlan_rx_kill_vid()
- stmmac: fix system hang if change mac address after interface ifdown
Current release - new code bugs:
- mptcp: avoid OOB access in setsockopt()
- bpf: Fix nested bpf_bprintf_prepare with more per-cpu buffers
- ethtool: stats: fix a copy-paste error - init correct array size
Previous releases - regressions:
- sched: fix packet stuck problem for lockless qdisc
- net: really orphan skbs tied to closing sk
- mlx4: fix EEPROM dump support
- bpf: fix alu32 const subreg bound tracking on bitwise operations
- bpf: fix mask direction swap upon off reg sign change
- bpf, offload: reorder offload callback 'prepare' in verifier
- stmmac: Fix MAC WoL not working if PHY does not support WoL
- packetmmap: fix only tx timestamp on request
- tipc: skb_linearize the head skb when reassembling msgs
Previous releases - always broken:
- mac80211: address recent "FragAttacks" vulnerabilities
- mac80211: do not accept/forward invalid EAPOL frames
- mptcp: avoid potential error message floods
- bpf, ringbuf: deny reserve of buffers larger than ringbuf to prevent
out of buffer writes
- bpf: forbid trampoline attach for functions with variable arguments
- bpf: add deny list of functions to prevent inf recursion of tracing
programs
- tls splice: check SPLICE_F_NONBLOCK instead of MSG_DONTWAIT
- can: isotp: prevent race between isotp_bind() and isotp_setsockopt()
- netfilter: nft_set_pipapo_avx2: Add irq_fpu_usable() check,
fallback to non-AVX2 version
Misc:
- bpf: add kconfig knob for disabling unpriv bpf by default
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmCuy2gACgkQMUZtbf5S
IruE5BAAhihia5EaiV71Bz/Cqr/d+osv5u283riKT8kBft0bWFVFFnT3iweWyR0/
5X+bB6zmr80Cuqh45ZeYyq+zJtiAAlsbD5hqBIGdMriSWLxciNKjVJRzuEjuqnek
USMW/LqGyf4NhmLogmQKpx8XcKSG7VYuK7vPrsH8us1dL5vIssceIXn8R9Dzj9NN
P77K5Z+Oka8XQJgetNLxR3tDAM/92RwIshotkhJbRwgiUvzb+wbnrnSOAZCIPgku
ydJyOxOklln1Sx07SejgzEl33ri0CkioDPThBWpOn7Mu0JrYKukXPKludoZcRYuJ
2jNLYfbH0ZS5EkOfk89h7j7MDoAJMUK72M+S1w5DEYz6eH2EjhAq9noZ6E1iQH+U
9vfoIvQjPh6Zhyk5QeM4dpt0cvR7rSElXkLVxo/x0dSBAi2rIng1bKeCUtv2J689
CsoD0oghtEzvUTYVxY6iNr15OFGl6KsZv4tVQ709gGA36sDlK8ozGbJH5WReobBl
f8H2WJlj2tVW5V75yUoio8TumDw34yk/5xlJFzm9GOwkqBrUcqOraHtHdUIsa4qr
KbELQQ9QVt4zYdLAiWy5BL/QLycp0ibmA1IB8W1bxEVSK1JXzREHzPxv85KOfZkn
8+vzNHmk2PEZYYsExiEykc5jXKOCPs8L0rJ6p4OverlbpDZcwIg=
=peMK
-----END PGP SIGNATURE-----
Merge tag 'net-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.13-rc4, including fixes from bpf, netfilter,
can and wireless trees. Notably including fixes for the recently
announced "FragAttacks" WiFi vulnerabilities. Rather large batch,
touching some core parts of the stack, too, but nothing hair-raising.
Current release - regressions:
- tipc: make node link identity publish thread safe
- dsa: felix: re-enable TAS guard band mode
- stmmac: correct clocks enabled in stmmac_vlan_rx_kill_vid()
- stmmac: fix system hang if change mac address after interface
ifdown
Current release - new code bugs:
- mptcp: avoid OOB access in setsockopt()
- bpf: Fix nested bpf_bprintf_prepare with more per-cpu buffers
- ethtool: stats: fix a copy-paste error - init correct array size
Previous releases - regressions:
- sched: fix packet stuck problem for lockless qdisc
- net: really orphan skbs tied to closing sk
- mlx4: fix EEPROM dump support
- bpf: fix alu32 const subreg bound tracking on bitwise operations
- bpf: fix mask direction swap upon off reg sign change
- bpf, offload: reorder offload callback 'prepare' in verifier
- stmmac: Fix MAC WoL not working if PHY does not support WoL
- packetmmap: fix only tx timestamp on request
- tipc: skb_linearize the head skb when reassembling msgs
Previous releases - always broken:
- mac80211: address recent "FragAttacks" vulnerabilities
- mac80211: do not accept/forward invalid EAPOL frames
- mptcp: avoid potential error message floods
- bpf, ringbuf: deny reserve of buffers larger than ringbuf to
prevent out of buffer writes
- bpf: forbid trampoline attach for functions with variable arguments
- bpf: add deny list of functions to prevent inf recursion of tracing
programs
- tls splice: check SPLICE_F_NONBLOCK instead of MSG_DONTWAIT
- can: isotp: prevent race between isotp_bind() and
isotp_setsockopt()
- netfilter: nft_set_pipapo_avx2: Add irq_fpu_usable() check,
fallback to non-AVX2 version
Misc:
- bpf: add kconfig knob for disabling unpriv bpf by default"
* tag 'net-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (172 commits)
net: phy: Document phydev::dev_flags bits allocation
mptcp: validate 'id' when stopping the ADD_ADDR retransmit timer
mptcp: avoid error message on infinite mapping
mptcp: drop unconditional pr_warn on bad opt
mptcp: avoid OOB access in setsockopt()
nfp: update maintainer and mailing list addresses
net: mvpp2: add buffer header handling in RX
bnx2x: Fix missing error code in bnx2x_iov_init_one()
net: zero-initialize tc skb extension on allocation
net: hns: Fix kernel-doc
sctp: fix the proc_handler for sysctl encap_port
sctp: add the missing setting for asoc encap_port
bpf, selftests: Adjust few selftest result_unpriv outcomes
bpf: No need to simulate speculative domain for immediates
bpf: Fix mask direction swap upon off reg sign change
bpf: Wrap aux data inside bpf_sanitize_info container
bpf: Fix BPF_LSM kconfig symbol dependency
selftests/bpf: Add test for l3 use of bpf_redirect_peer
bpftool: Add sock_release help info for cgroup attach/prog load command
net: dsa: microchip: enable phy errata workaround on 9567
...
These macros are convenient wrappers around the bpf_seq_printf and
bpf_snprintf helpers. They are currently provided by bpf_tracing.h which
targets low level tracing primitives. bpf_helpers.h is a better fit.
The __bpf_narg and __bpf_apply are needed in both files and provided
twice. __bpf_empty isn't used anywhere and is removed from bpf_tracing.h
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210526164643.2881368-1-revest@chromium.org