IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Reject explicit requests to change the affinity mask of a task via
set_cpus_allowed_ptr() if the requested mask is not a subset of the
mask returned by task_cpu_possible_mask(). This ensures that the
'cpus_mask' for a given task cannot contain CPUs which are incapable of
executing it, except in cases where the affinity is forced.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210730112443.23245-6-will@kernel.org
select_fallback_rq() only needs to recheck for an allowed CPU if the
affinity mask of the task has changed since the last check.
Return a 'bool' from cpuset_cpus_allowed_fallback() to indicate whether
the affinity mask was updated, and use this to elide the allowed check
when the mask has been left alone.
No functional change.
Suggested-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20210730112443.23245-5-will@kernel.org
Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.
Modify guarantee_online_cpus() to take task_cpu_possible_mask() into
account when trying to find a suitable set of online CPUs for a given
task. This will avoid passing an invalid mask to set_cpus_allowed_ptr()
during ->attach() and will subsequently allow the cpuset hierarchy to be
taken into account when forcefully overriding the affinity mask for a
task which requires migration to a compatible CPU.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Link: https://lkml.kernel.org/r/20210730112443.23245-4-will@kernel.org
If the scheduler cannot find an allowed CPU for a task,
cpuset_cpus_allowed_fallback() will widen the affinity to cpu_possible_mask
if cgroup v1 is in use.
In preparation for allowing architectures to provide their own fallback
mask, just return early if we're either using cgroup v1 or we're using
cgroup v2 with a mask that contains invalid CPUs. This will allow
select_fallback_rq() to figure out the mask by itself.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lkml.kernel.org/r/20210730112443.23245-3-will@kernel.org
Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.
On such a system, we must take care not to migrate a task to an
unsupported CPU when forcefully moving tasks in select_fallback_rq()
in response to a CPU hot-unplug operation.
Introduce a task_cpu_possible_mask() hook which, given a task argument,
allows an architecture to return a cpumask of CPUs that are capable of
executing that task. The default implementation returns the
cpu_possible_mask, since sane machines do not suffer from per-cpu ISA
limitations that affect scheduling. The new mask is used when selecting
the fallback runqueue as a last resort before forcing a migration to the
first active CPU.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210730112443.23245-2-will@kernel.org
This extends SCHED_IDLE to cgroups.
Interface: cgroup/cpu.idle.
0: default behavior
1: SCHED_IDLE
Extending SCHED_IDLE to cgroups means that we incorporate the existing
aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its
descendant threads towards the idle_h_nr_running count of all of its
ancestor cgroups. Thus, sched_idle_rq() will work properly.
Additionally, SCHED_IDLE cgroups are configured with minimum weight.
There are two key differences between the per-task and per-cgroup
SCHED_IDLE interface:
- The cgroup interface allows tasks within a SCHED_IDLE hierarchy to
maintain their relative weights. The entity that is "idle" is the
cgroup, not the tasks themselves.
- Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption
decision is not made by comparing the current task with the woken
task, but rather by comparing their matching sched_entity.
A typical use-case for this is a user that creates an idle and a
non-idle subtree. The non-idle subtree will dominate competition vs
the idle subtree, but the idle subtree will still be high priority vs
other users on the system. The latter is accomplished via comparing
matching sched_entity in the waken preemption path (this could also be
improved by making the sched_idle_rq() decision dependent on the
perspective of a specific task).
For now, we maintain the existing SCHED_IDLE semantics. Future patches
may make improvements that extend how we treat SCHED_IDLE entities.
The per-task_group idle field is an integer that currently only holds
either a 0 or a 1. This is explicitly typed as an integer to allow for
further extensions to this API. For example, a negative value may
indicate a highly latency-sensitive cgroup that should be preferred
for preemption/placement/etc.
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210730020019.1487127-2-joshdon@google.com
The scheduler currently expects NUMA node distances to be stable from
init onwards, and as a consequence builds the related data structures
once-and-for-all at init (see sched_init_numa()).
Unfortunately, on some architectures node distance is unreliable for
offline nodes and may very well change upon onlining.
Skip over offline nodes during sched_init_numa(). Track nodes that have
been onlined at least once, and trigger a build of a node's NUMA masks
when it is first onlined post-init.
Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210818074333.48645-1-srikar@linux.vnet.ibm.com
Eugene tripped over the case where rq_lock(), as called in a
for_each_possible_cpu() loop came apart because rq->core hadn't been
setup yet.
This is a somewhat unusual, but valid case.
Rework things such that rq->core is initialized to point at itself. IOW
initialize each CPU as a single threaded Core. CPU online will then join
the new CPU (thread) to an existing Core where needed.
For completeness sake, have CPU offline fully undo the state so as to
not presume the topology will match the next time it comes online.
Fixes: 9edeaea1bc45 ("sched: Core-wide rq->lock")
Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Don <joshdon@google.com>
Tested-by: Eugene Syromiatnikov <esyr@redhat.com>
Link: https://lkml.kernel.org/r/YR473ZGeKqMs6kw+@hirez.programming.kicks-ass.net
The consolidation of the debug code for mutex waiter intialization sets
waiter::ww_ctx to a poison value unconditionally. For regular mutexes this
is intended to catch the case where waiter_ww_ctx is dereferenced
accidentally.
For ww_mutex the poison value has to be overwritten either with a context
pointer or NULL for ww_mutexes without context.
The rework broke this as it made the store conditional on the context
pointer instead of the argument which signals whether ww_mutex code should
be compiled in or optiized out. As a result waiter::ww_ctx ends up with the
poison pointer for contextless ww_mutexes which causes a later dereference of
the poison pointer because it is != NULL.
Use the build argument instead so for ww_mutex the poison value is always
overwritten.
Fixes: c0afb0ffc06e6 ("locking/ww_mutex: Gather mutex_waiter initialization")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210819193030.zpwrpvvrmy7xxxiy@linutronix.de
Add logic to call bpf_setsockopt() and bpf_getsockopt() from setsockopt BPF
programs. An example use case is when the user sets the IPV6_TCLASS socket
option, we would also like to change the tcp-cc for that socket.
We don't have any use case for calling bpf_setsockopt() from supposedly read-
only sys_getsockopt(), so it is made available to BPF_CGROUP_SETSOCKOPT only
at this point.
Signed-off-by: Prankur Gupta <prankgup@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210817224221.3257826-2-prankgup@fb.com
Same as previous patch but for the keys. memdup_bpfptr is renamed
to kvmemdup_bpfptr (and converted to kvmalloc).
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210818235216.1159202-2-sdf@google.com
Use kvmalloc/kvfree for temporary value when manipulating a map via
syscall. kmalloc might not be sufficient for percpu maps where the value
is big (and further multiplied by hundreds of CPUs).
Can be reproduced with netcnt test on qemu with "-smp 255".
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210818235216.1159202-1-sdf@google.com
mac80211 trees.
Current release - regressions:
- tipc: call tipc_wait_for_connect only when dlen is not 0
- mac80211: fix locking in ieee80211_restart_work()
Current release - new code bugs:
- bpf: add rcu_read_lock in bpf_get_current_[ancestor_]cgroup_id()
- ethernet: ice: fix perout start time rounding
- wwan: iosm: prevent underflow in ipc_chnl_cfg_get()
Previous releases - regressions:
- bpf: clear zext_dst of dead insns
- sch_cake: fix srchost/dsthost hashing mode
- vrf: reset skb conntrack connection on VRF rcv
- net/rds: dma_map_sg is entitled to merge entries
Previous releases - always broken:
- ethernet: bnxt: fix Tx path locking and races, add Rx path barriers
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmEepAYACgkQMUZtbf5S
IrugdhAAo6/T9A8MmxU711gwq9BKcubOr2BQC0GYGjhM6Pnhign44kDGdBbl1fmL
lt4LTgOwIKZ4bpLYJuwXnq0KpwWG8YKpgdgjvXsc0jXYtDHJvFjzL42vjGGGvAc4
rwIe784dNt3LYppEsADAad2ZdQO25tB+jhQO8MttipNGq8Qvkjib2ttvmLri/hEl
hKRHnyQMEZCXbMV4zn+ILjlqLxZ5a2ZPk97qGL9fAafi1O3cjfv5ZDwvOjM3TGE3
DPLdEvYFRDTECGJO3QOK7SzC1NoQA49Bj1hqwbWRUi9tm8A0lPXEUjNleCbC2i7n
qi6JBexmme6ZiimJJPeKacn8BcgorPDWm6friL9jiWpndvrWycBmqw5LurgdLVHC
nFHjnbji885P6DBZLx8tDbTSXGSpcLoHuv7M3aQbD2DZqQKk/9irKaJMUZ6XwOYm
EXM6oqQ13vaSjFH1GOsK0wx/XjcL5t42uRAtG9INlYusaU+ZvFu12TzEt1+PcWpO
nF7VUQWQatcidSYdYolciw383siKRiSV1F8d6IEon907GHVrod9ty+yii33CamBf
/aUyFULMEf/vaXjmJzhzWABV9vu3QVWd0IxL9oqbbRqNTwFr86hnnRhntqUCje/c
vY0VFWQ4CuaTvohKl0I+IhyvUPLUp1iTD51qSbOSKybG6IZ77+U=
=ofv+
-----END PGP SIGNATURE-----
Merge tag 'net-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from bpf, wireless and mac80211
trees.
Current release - regressions:
- tipc: call tipc_wait_for_connect only when dlen is not 0
- mac80211: fix locking in ieee80211_restart_work()
Current release - new code bugs:
- bpf: add rcu_read_lock in bpf_get_current_[ancestor_]cgroup_id()
- ethernet: ice: fix perout start time rounding
- wwan: iosm: prevent underflow in ipc_chnl_cfg_get()
Previous releases - regressions:
- bpf: clear zext_dst of dead insns
- sch_cake: fix srchost/dsthost hashing mode
- vrf: reset skb conntrack connection on VRF rcv
- net/rds: dma_map_sg is entitled to merge entries
Previous releases - always broken:
- ethernet: bnxt: fix Tx path locking and races, add Rx path
barriers"
* tag 'net-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (42 commits)
net: dpaa2-switch: disable the control interface on error path
Revert "flow_offload: action should not be NULL when it is referenced"
iavf: Fix ping is lost after untrusted VF had tried to change MAC
i40e: Fix ATR queue selection
r8152: fix the maximum number of PLA bp for RTL8153C
r8152: fix writing USB_BP2_EN
mptcp: full fully established support after ADD_ADDR
mptcp: fix memory leak on address flush
net/rds: dma_map_sg is entitled to merge entries
net: mscc: ocelot: allow forwarding from bridge ports to the tag_8021q CPU port
net: asix: fix uninit value bugs
ovs: clear skb->tstamp in forwarding path
net: mdio-mux: Handle -EPROBE_DEFER correctly
net: mdio-mux: Don't ignore memory allocation errors
net: mdio-mux: Delete unnecessary devm_kfree
net: dsa: sja1105: fix use-after-free after calling of_find_compatible_node, or worse
sch_cake: fix srchost/dsthost hashing mode
ixgbe, xsk: clean up the resources in ixgbe_xsk_pool_enable error path
net: qlcnic: add missed unlock in qlcnic_83xx_flash_read32
mac80211: fix locking in ieee80211_restart_work()
...
The BPF interpreter as well as x86-64 BPF JIT were both in line by allowing
up to 33 tail calls (however odd that number may be!). Recently, this was
changed for the interpreter to reduce it down to 32 with the assumption that
this should have been the actual limit "which is in line with the behavior of
the x86 JITs" according to b61a28cf11d61 ("bpf: Fix off-by-one in tail call
count limiting").
Paul recently reported:
I'm a bit surprised by this because I had previously tested the tail call
limit of several JIT compilers and found it to be 33 (i.e., allowing chains
of up to 34 programs). I've just extended a test program I had to validate
this again on the x86-64 JIT, and found a limit of 33 tail calls again [1].
Also note we had previously changed the RISC-V and MIPS JITs to allow up to
33 tail calls [2, 3], for consistency with other JITs and with the interpreter.
We had decided to increase these two to 33 rather than decrease the other
JITs to 32 for backward compatibility, though that probably doesn't matter
much as I'd expect few people to actually use 33 tail calls.
[1] ae78874829
[2] 96bc4432f5ad ("bpf, riscv: Limit to 33 tail calls")
[3] e49e6f6db04e ("bpf, mips: Limit to 33 tail calls")
Therefore, revert b61a28cf11d61 to re-align interpreter to limit a maximum of
33 tail calls. While it is unlikely to hit the limit for the vast majority,
programs in the wild could one way or another depend on this, so lets rather
be a bit more conservative, and lets align the small remainder of JITs to 33.
If needed in future, this limit could be slightly increased, but not decreased.
Fixes: b61a28cf11d61 ("bpf: Fix off-by-one in tail call count limiting")
Reported-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/CAO5pjwTWrC0_dzTbTHFPSqDwA56aVH+4KFGVqdq8=ASs0MqZGQ@mail.gmail.com
Daniel Borkmann says:
====================
pull-request: bpf 2021-08-19
We've added 3 non-merge commits during the last 3 day(s) which contain
a total of 3 files changed, 29 insertions(+), 6 deletions(-).
The main changes are:
1) Fix to clear zext_dst for dead instructions which was causing invalid program
rejections on JITs with bpf_jit_needs_zext such as s390x, from Ilya Leoshkevich.
2) Fix RCU splat in bpf_get_current_{ancestor_,}cgroup_id() helpers when they are
invoked from sleepable programs, from Yonghong Song.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests, bpf: Test that dead ldx_w insns are accepted
bpf: Clear zext_dst of dead insns
bpf: Add rcu_read_lock in bpf_get_current_[ancestor_]cgroup_id() helpers
====================
Link: https://lore.kernel.org/r/20210819144904.20069-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since kprobe_events and uprobe_events only check whether the
other same-type probe event has the same name or not, if the
user gives the same name of the existing tracepoint event (or
the other type of probe events), it silently fails to create
the tracefs entry (but registered.) as below.
/sys/kernel/tracing # ls events/task/task_rename
enable filter format hist id trigger
/sys/kernel/tracing # echo p:task/task_rename vfs_read >> kprobe_events
[ 113.048508] Could not create tracefs 'task_rename' directory
/sys/kernel/tracing # cat kprobe_events
p:task/task_rename vfs_read
To fix this issue, check whether the existing events have the
same name or not in trace_probe_register_event_call(). If exists,
it rejects to register the new event.
Link: https://lkml.kernel.org/r/162936876189.187130.17558311387542061930.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In preparation to allow event probes to use the process_fetch_insn()
callback in trace_probe_tmpl.h, change the data passed to it from a
pointer to pt_regs, as the event probe will not be using regs, and make it
a void pointer instead.
Update the process_fetch_insn() callers for kprobe and uprobe events to
have the regs defined in the function and just typecast the void pointer
parameter.
Link: https://lkml.kernel.org/r/20210819041842.291622924@goodmis.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Instead of a boolean "is_return" have traceprobe_set_print_fmt() take a
type (currently just PROBE_PRINT_NORMAL and PROBE_PRINT_RETURN). This will
simplify adding different types. For example, the development of the
event_probe, will need its own type as it prints an event, and not an IP.
Link: https://lkml.kernel.org/r/20210819041842.104626301@goodmis.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: live-patching@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Only build the code to support the global coherent pool if support for
it is enabled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dillon Min <dillon.minfei@gmail.com>
Delete/fixup few includes in anticipation of global -isystem compile
option removal.
Note: crypto/aegis128-neon-inner.c keeps <stddef.h> due to redefinition
of uintptr_t error (one definition comes from <stddef.h>, another from
<linux/types.h>).
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Remove SIZEOF_TRACE_KPROBE() and SIZEOF_TRACE_UPROBE() and use
struct_size() as that's what it is made for. No need to have custom
macros. Especially since struct_size() has some extra memory checks for
correctness.
Link: https://lkml.kernel.org/r/20210817035027.795000217@goodmis.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Kprobe and uprobe events can add a "system" to the events that are created
via the kprobe_events and uprobe_events files respectively. If they do not
include a "system" in the name, then the default "kprobes" or "uprobes" is
used. The current notation to specify a system for one of these probe
events is to add a '/' delimiter in the name, where the content before the
'/' will be the system to use, and the content after will be the event
name.
echo 'p:my_system/my_event' > kprobe_events
But this is inconsistent with the way histogram triggers separate their
system / event names. The histogram triggers use a '.' delimiter, which
can be confusing.
To allow this to be more consistent, as well as keep backward
compatibility, allow the kprobe and uprobe events to denote a system name
with either a '/' or a '.'.
That is:
echo 'p:my_system/my_event' > kprobe_events
is equivalent to:
echo 'p:my_system.my_event' > kprobe_events
Link: https://lore.kernel.org/linux-trace-devel/20210813004448.51c7de69ce432d338f4d226b@kernel.org/
Link: https://lkml.kernel.org/r/20210817035027.580493202@goodmis.org
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The two places that call traceprobe_parse_probe_arg() allocate a temporary
buffer to copy the argv[i] into, because argv[i] is constant and the
traceprobe_parse_probe_arg() will modify it to do the parsing. These two
places allocate this buffer and then free it right after calling this
function, leaving the onus of this allocation to the caller.
As there's about to be a third user of this function that will have to do
the same thing, instead of having the caller allocate the temporary
buffer, simply move that allocation into the traceprobe_parse_probe_arg()
itself, which will simplify the code of the callers.
Link: https://lkml.kernel.org/r/20210817035027.385422828@goodmis.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As dynamic events are not created by modules, if something is attached to
one, calling "try_module_get()" on its "mod" field, is not going to keep
the dynamic event from going away.
Since dynamic events do not need the "mod" pointer of the event structure,
make a union out of it in order to save memory (there's one structure for
each of the thousand+ events in the kernel), and have any event with the
DYNAMIC flag set to use a ref counter instead.
Link: https://lore.kernel.org/linux-trace-devel/20210813004448.51c7de69ce432d338f4d226b@kernel.org/
Link: https://lkml.kernel.org/r/20210817035027.174869074@goodmis.org
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
To differentiate between static and dynamic events, add a new flag
DYNAMIC to the event flags that all dynamic events have set. This will
allow to differentiate when attaching to a dynamic event from a static
event.
Static events have a mod pointer that references the module they were
created in (or NULL for core kernel). This can be incremented when the
event has something attached to it. But there exists no such mechanism for
dynamic events. This is dangerous as the dynamic events may now disappear
without the "attachment" knowing that it no longer exists.
To enforce the dynamic flag, change dyn_event_add() to pass the event that
is being created such that it can set the DYNAMIC flag of the event. This
helps make sure that no location that creates a dynamic event misses
setting this flag.
Link: https://lore.kernel.org/linux-trace-devel/20210813004448.51c7de69ce432d338f4d226b@kernel.org/
Link: https://lkml.kernel.org/r/20210817035026.936958254@goodmis.org
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a new helper to initialize the global coherent pool. This both
cleans up the existing initialization which indirects through the
reserved_mem_ops that are normally only used for struct device, and
also allows using the global pool for non-devicetree architectures.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dillon Min <dillon.minfei@gmail.com>
Return the allocated dma_coherent_mem structure, set the
use_dma_pfn_offset and print the failure warning inside of
dma_init_coherent_memory instead of leaving that to the callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dillon Min <dillon.minfei@gmail.com>
Switch an ifdef so that the global coherent pool is initialized for
any architecture that selects the DMA_GLOBAL_POOL symbol insted of
hardcoding ARM.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dillon Min <dillon.minfei@gmail.com>
Add an option to allocate uncached memory for dma_alloc_coherent from
the global dma_coherent_default_memory. This will allow to move
arm-nommu (and eventually other platforms) to use generic code for
allocating uncached memory from a pre-populated pool.
Note that this is a different pool from the one that platforms that
can remap at runtime use for GFP_ATOMIC allocations for now, although
there might be opportunities to eventually end up with a common codebase
for the two use cases.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dillon Min <dillon.minfei@gmail.com>
Pull KCSAN updates from Paul E. McKenney:
- improve comments
- introduce CONFIG_KCSAN_STRICT (which RCU uses)
- optimize use of get_ctx() by kcsan_found_watchpoint()
- rework atomic.h into permissive.h
- add the ability to ignore writes that change only one bit of a given data-racy variable.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The variable allow is being initialized with a value that is never read, it
is being updated later on. The assignment is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817170842.495440-1-colin.king@canonical.com
The "tp_printk" option redirects the trace event output to printk at boot
up. This is useful when a machine crashes before boot where the trace events
can not be retrieved by the in kernel ring buffer. But it can be "dangerous"
because trace events can be located in high frequency locations such as
interrupts and the scheduler, where a printk can slow it down that it live
locks the machine (because by the time the printk finishes, the next event
is triggered). Thus tp_printk must be used with care.
It was discovered that the filter logic to trace events does not apply to
the tp_printk events. This can cause a surprise and live lock when the user
expects it to be filtered to limit the amount of events printed to the
console when in fact it still prints everything.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYRwL+RQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qkRHAP9gvTYH1es9l4V5SNFEQ7+GEwknsaq7
B5q4znVKQKgajQD/cd5Cm/alTIbxXdrQ9nxJ7lfffrvk46iqAb9PRX9vhAQ=
=8QxT
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Limit the shooting in the foot of tp_printk
The "tp_printk" option redirects the trace event output to printk at
boot up. This is useful when a machine crashes before boot where the
trace events can not be retrieved by the in kernel ring buffer. But it
can be "dangerous" because trace events can be located in high
frequency locations such as interrupts and the scheduler, where a
printk can slow it down that it live locks the machine (because by the
time the printk finishes, the next event is triggered). Thus tp_printk
must be used with care.
It was discovered that the filter logic to trace events does not apply
to the tp_printk events. This can cause a surprise and live lock when
the user expects it to be filtered to limit the amount of events
printed to the console when in fact it still prints everything"
* tag 'trace-v5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Apply trace filters on all output channels
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Link: https://lkml.kernel.org/r/20210803141621.780504-37-bigeasy@linutronix.de
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There was no strong reason to or not to flush barrier work items in
flush_workqueue(). And we have to make barrier work items not participate
in nr_active so we had been using WORK_NO_COLOR for them which also makes
them can't be flushed by flush_workqueue().
And the users of flush_workqueue() often do not intend to wait barrier work
items issued by flush_work(). That made the choice sound perfect.
But barrier work items have reference to internal structure (pool_workqueue)
and the worker thread[s] is/are still busy for the workqueue user when the
barrrier work items are not done. So it is reasonable to make flush_workqueue()
also watch for flush_work() to make it more robust.
And a problem[1] reported by Li Zhe shows that we need such robustness.
The warning logs are listed below:
WARNING: CPU: 0 PID: 19336 at kernel/workqueue.c:4430 destroy_workqueue+0x11a/0x2f0
*****
destroy_workqueue: test_workqueue9 has the following busy pwq
pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=0/1 refcnt=2
in-flight: 5658:wq_barrier_func
Showing busy workqueues and worker pools:
*****
It shows that even after drain_workqueue() returns, the barrier work item
is still in flight and the pwq (and a worker) is still busy on it.
The problem is caused by flush_workqueue() not watching flush_work():
Thread A Worker
/* normal work item with linked */
process_scheduled_works()
destroy_workqueue() process_one_work()
drain_workqueue() /* run normal work item */
/-- pwq_dec_nr_in_flight()
flush_workqueue() <---/
/* the last normal work item is done */
sanity_check process_one_work()
/-- raw_spin_unlock_irq(&pool->lock)
raw_spin_lock_irq(&pool->lock) <-/ /* maybe preempt */
*WARNING* wq_barrier_func()
/* maybe preempt by cond_resched() */
Thread A can get the pool lock after the Worker unlocks the pool lock before
running wq_barrier_func(). And if there is any preemption happen around
wq_barrier_func(), destroy_workqueue()'s sanity check is more likely to
get the lock and catch it. (Note: preemption is not necessary to cause the bug,
the unlocking is enough to possibly trigger the WARNING.)
A simple solution might be just executing all linked barrier work items
once without releasing pool lock after the head work item's
pwq_dec_nr_in_flight(). But this solution has two problems:
1) the head work item might also be barrier work item when the user-queued
work item is cancelled. For example:
thread 1: thread 2:
queue_work(wq, &my_work)
flush_work(&my_work)
cancel_work_sync(&my_work);
/* Neiter my_work nor the barrier work is scheduled. */
destroy_workqueue(wq);
/* This is an easier way to catch the WARNING. */
2) there might be too much linked barrier work items and running them
all once without releasing pool lock just causes trouble.
The only solution is to make flush_workqueue() aslo watch barrier work
items. So we have to assign a color to these barrier work items which
is the color of the head (user-queued) work item.
Assigning a color doesn't cause any problem in ative management, because
the prvious patch made barrier work items not participate in nr_active
via WORK_STRUCT_INACTIVE rather than reliance on the (old) WORK_NO_COLOR.
[1]: https://lore.kernel.org/lkml/20210812083814.32453-1-lizhe.67@bytedance.com/
Reported-by: Li Zhe <lizhe.67@bytedance.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, WORK_NO_COLOR has two meanings:
Not participate in flushing
Not participate in nr_active
And only non-barrier work items are marked with WORK_STRUCT_INACTIVE
when they are in inactive_works list. The barrier work items are not
marked INACTIVE even linked in inactive_works list since these tail
items are always moved together with the head work item.
These definitions are simple, clean and practical. (Except a small
blemish that only the first meaning of WORK_NO_COLOR is documented in
include/linux/workqueue.h while both meanings are in workqueue.c)
But dual-purpose WORK_NO_COLOR used for barrier work items has proven to
be problematical[1]. Only the second purpose is obligatory. So we plan
to make barrier work items participate in flushing but keep them still
not participating in nr_active.
So the plan is to mark barrier work items inactive without using
WORK_NO_COLOR in this patch so that we can assign a flushing color to
them in next patch.
The reasonable way is to add or reuse a bit in work data of the work
item. But adding a bit will double the size of pool_workqueue.
Currently, WORK_STRUCT_INACTIVE is only used in try_to_grab_pending()
for user-queued work items and try_to_grab_pending() can't work for
barrier work items. So we extend WORK_STRUCT_INACTIVE to also mark
barrier work items no matter which list they are in because we don't
need to determind which list a barrier work item is in.
So the meaning of WORK_STRUCT_INACTIVE becomes just "the work items don't
participate in nr_active" (no matter whether it is a barrier work item or
a user-queued work item). And WORK_STRUCT_INACTIVE for user-queued work
items means they are in inactive_works list.
This patch does it by setting WORK_STRUCT_INACTIVE for barrier work items
in insert_wq_barrier() and checking WORK_STRUCT_INACTIVE first in
pwq_dec_nr_in_flight(). And the meaning of WORK_NO_COLOR is reduced to
only "not participating in flushing".
There is no functionality change intended in this patch. Because
WORK_NO_COLOR+WORK_STRUCT_INACTIVE represents the previous WORK_NO_COLOR
in meaning and try_to_grab_pending() doesn't use for barrier work items
and avoids being confused by this extended WORK_STRUCT_INACTIVE.
A bunch of comment for nr_active & WORK_STRUCT_INACTIVE is also added for
documenting how WORK_STRUCT_INACTIVE works in nr_active management.
[1]: https://lore.kernel.org/lkml/20210812083814.32453-1-lizhe.67@bytedance.com/
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a local var @work_flags to calculate work_flags step by step, so that
we don't need to squeeze several flags in only the last line of code.
Parepare for next patch to add a bit to barrier work item's flag. Not
squshing this to next patch makes it clear that what it will have changed.
No functional change intended.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Make pwq_dec_nr_in_flight() use work_data rather just work_color.
Prepare for later patch to get WORK_STRUCT_INACTIVE bit from work_data
in pwq_dec_nr_in_flight().
No functional change intended.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There are two kinds of "delayed" work items in workqueue subsystem.
One is for timer-delayed work items which are visible to workqueue users.
The other kind is for work items delayed by active management which can
not be directly visible to workqueue users. We mixed the word "delayed"
for both kinds and caused somewhat ambiguity.
This patch renames the later one (delayed by active management) to
"inactive", because it is used for workqueue active management and
most of its related symbols are named with "active" or "activate".
All "delayed" and "DELAYED" are carefully checked and renamed one by
one to avoid accidentally changing the name of the other kind for
timer-delayed.
No functional change intended.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add the static and runtime initializer mechanics to support the RT variant
of local_lock, which requires the lock type in the lockdep map to be set
to LD_LOCK_PERCPU.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211305.967526724@linutronix.de
Going to sleep when locks are contended can be quite inefficient when the
contention time is short and the lock owner is running on a different CPU.
The MCS mechanism cannot be used because MCS is strictly FIFO ordered while
for rtmutex based locks the waiter ordering is priority based.
Provide a simple adaptive spinwait mechanism which currently restricts the
spinning to the top priority waiter.
[ tglx: Provide a contemporary changelog, extended it to all rtmutex based
locks and updated it to match the other spin on owner implementations ]
Originally-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211305.912050691@linutronix.de
The current logic only allows lock stealing to occur if the current task is
of higher priority than the pending owner.
Significant throughput improvements can be gained by allowing the lock
stealing to include tasks of equal priority when the contended lock is a
spin_lock or a rw_lock and the tasks are not in a RT scheduling task.
The assumption was that the system will make faster progress by allowing
the task already on the CPU to take the lock rather than waiting for the
system to wake up a different task.
This does add a degree of unfairness, but in reality no negative side
effects have been observed in the many years that this has been used in the
RT kernel.
[ tglx: Refactored and rewritten several times by Steve Rostedt, Sebastian
Siewior and myself ]
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211305.857240222@linutronix.de
On PREEMPT_RT the futex hashbucket spinlock becomes 'sleeping' and rtmutex
based. That causes a lockdep false positive because some of the futex
functions invoke spin_unlock(&hb->lock) with the wait_lock of the rtmutex
associated to the pi_futex held. spin_unlock() in turn takes wait_lock of
the rtmutex on which the spinlock is based which makes lockdep notice a
lock recursion.
Give the futex/rtmutex wait_lock a separate key.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211305.750701219@linutronix.de
The requeue_pi() operation on RT kernels creates a problem versus the
task::pi_blocked_on state when a waiter is woken early (signal, timeout)
and that early wake up interleaves with the requeue_pi() operation.
When the requeue manages to block the waiter on the rtmutex which is
associated to the second futex, then a concurrent early wakeup of that
waiter faces the problem that it has to acquire the hash bucket spinlock,
which is not an issue on non-RT kernels, but on RT kernels spinlocks are
substituted by 'sleeping' spinlocks based on rtmutex. If the hash bucket
lock is contended then blocking on that spinlock would result in a
impossible situation: blocking on two locks at the same time (the hash
bucket lock and the rtmutex representing the PI futex).
It was considered to make the hash bucket locks raw_spinlocks, but
especially requeue operations with a large amount of waiters can introduce
significant latencies, so that's not an option for RT.
The RT tree carried a solution which (ab)used task::pi_blocked_on to store
the information about an ongoing requeue and an early wakeup which worked,
but required to add checks for these special states all over the place.
The distangling of an early wakeup of a waiter for a requeue_pi() operation
is already looking at quite some different states and the task::pi_blocked_on
magic just expanded that to a hard to understand 'state machine'.
This can be avoided by keeping track of the waiter/requeue state in the
futex_q object itself.
Add a requeue_state field to struct futex_q with the following possible
states:
Q_REQUEUE_PI_NONE
Q_REQUEUE_PI_IGNORE
Q_REQUEUE_PI_IN_PROGRESS
Q_REQUEUE_PI_WAIT
Q_REQUEUE_PI_DONE
Q_REQUEUE_PI_LOCKED
The waiter starts with state = NONE and the following state transitions are
valid:
On the waiter side:
Q_REQUEUE_PI_NONE -> Q_REQUEUE_PI_IGNORE
Q_REQUEUE_PI_IN_PROGRESS -> Q_REQUEUE_PI_WAIT
On the requeue side:
Q_REQUEUE_PI_NONE -> Q_REQUEUE_PI_INPROGRESS
Q_REQUEUE_PI_IN_PROGRESS -> Q_REQUEUE_PI_DONE/LOCKED
Q_REQUEUE_PI_IN_PROGRESS -> Q_REQUEUE_PI_NONE (requeue failed)
Q_REQUEUE_PI_WAIT -> Q_REQUEUE_PI_DONE/LOCKED
Q_REQUEUE_PI_WAIT -> Q_REQUEUE_PI_IGNORE (requeue failed)
The requeue side ignores a waiter with state Q_REQUEUE_PI_IGNORE as this
signals that the waiter is already on the way out. It also means that
the waiter is still on the 'wait' futex, i.e. uaddr1.
The waiter side signals early wakeup to the requeue side either through
setting state to Q_REQUEUE_PI_IGNORE or to Q_REQUEUE_PI_WAIT depending
on the current state. In case of Q_REQUEUE_PI_IGNORE it can immediately
proceed to take the hash bucket lock of uaddr1. If it set state to WAIT,
which means the wakeup is interleaving with a requeue in progress it has
to wait for the requeue side to change the state. Either to DONE/LOCKED
or to IGNORE. DONE/LOCKED means the waiter q is now on the uaddr2 futex
and either blocked (DONE) or has acquired it (LOCKED). IGNORE is set by
the requeue side when the requeue attempt failed via deadlock detection
and therefore the waiter's futex_q is still on the uaddr1 futex.
While this is not strictly required on !RT making this unconditional has
the benefit of common code and it also allows the waiter to avoid taking
the hash bucket lock on the way out in certain cases, which reduces
contention.
Add the required helpers required for the state transitions, invoke them at
the right places and restructure the futex_wait_requeue_pi() code to handle
the return from wait (early or not) based on the state machine values.
On !RT enabled kernels the waiter spin waits for the state going from
Q_REQUEUE_PI_WAIT to some other state, on RT enabled kernels this is
handled by rcuwait_wait_event() and the corresponding wake up on the
requeue side.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211305.693317658@linutronix.de
Move the futex key match out of handle_early_requeue_pi_wakeup() which
allows to simplify that function. The upcoming state machine for
requeue_pi() will make that go away.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211305.638938670@linutronix.de
No point in allocating memory when the input parameters are bogus.
Validate all parameters before proceeding.
Suggested-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211305.581789253@linutronix.de
The comment about the restriction of the number of waiters to wake for the
REQUEUE_PI case is confusing at best. Rewrite it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211305.524990421@linutronix.de