44220 Commits

Author SHA1 Message Date
Linus Torvalds
56897d5188 Tracing and eventfs fixes for v6.8:
- Fix the return code for ring_buffer_poll_wait()
   It was returing a -EINVAL instead of EPOLLERR.
 
 - Zero out the tracefs_inode so that all fields are initialized.
   The ti->private could have had stale data, but instead of
   just initializing it to NULL, clear out the entire structure
   when it is allocated.
 
 - Fix a crash in timerlat
   The hrtimer was initialized at read and not open, but is
   canceled at close. If the file was opened and never read
   the close will pass a NULL pointer to hrtime_cancel().
 
 - Rewrite of eventfs.
   Linus wrote a patch series to remove the dentry references in the
   eventfs_inode and to use ref counting and more of proper VFS
   interfaces to make it work.
 
 - Add warning to put_ei() if ei is not set to free. That means
   something is about to free it when it shouldn't.
 
 - Restructure the eventfs_inode to make it more compact, and remove
   the unused llist field.
 
 - Remove the fsnotify*() funtions for when the inodes were being created
   in the lookup code. It doesn't make sense to notify about creation
   just because something is being looked up.
 
 - The inode hard link count was not accurate. It was being updated
   when a file was looked up. The inodes of directories were updating
   their parent inode hard link count every time the inode was created.
   That means if memory reclaim cleaned a stale directory inode and
   the inode was lookup up again, it would increment the parent inode
   again as well. Al Viro said to just have all eventfs directories
   have a hard link count of 1. That tells user space not to trust it.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZb1l/RQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qk6jAQDmecDOnx+j/Rm5krbX/meVPYXFj2CU
 1wO7w1HBzopsBwEA5AjTKm9IGrl/eVG/+jViS165b+sJfwEcblHEFPWcIwo=
 =uUzb
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing and eventfs fixes from Steven Rostedt:

 - Fix the return code for ring_buffer_poll_wait()

   It was returing a -EINVAL instead of EPOLLERR.

 - Zero out the tracefs_inode so that all fields are initialized.

   The ti->private could have had stale data, but instead of just
   initializing it to NULL, clear out the entire structure when it is
   allocated.

 - Fix a crash in timerlat

   The hrtimer was initialized at read and not open, but is canceled at
   close. If the file was opened and never read the close will pass a
   NULL pointer to hrtime_cancel().

 - Rewrite of eventfs.

   Linus wrote a patch series to remove the dentry references in the
   eventfs_inode and to use ref counting and more of proper VFS
   interfaces to make it work.

 - Add warning to put_ei() if ei is not set to free. That means
   something is about to free it when it shouldn't.

 - Restructure the eventfs_inode to make it more compact, and remove the
   unused llist field.

 - Remove the fsnotify*() funtions for when the inodes were being
   created in the lookup code. It doesn't make sense to notify about
   creation just because something is being looked up.

 - The inode hard link count was not accurate.

   It was being updated when a file was looked up. The inodes of
   directories were updating their parent inode hard link count every
   time the inode was created. That means if memory reclaim cleaned a
   stale directory inode and the inode was lookup up again, it would
   increment the parent inode again as well. Al Viro said to just have
   all eventfs directories have a hard link count of 1. That tells user
   space not to trust it.

* tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  eventfs: Keep all directory links at 1
  eventfs: Remove fsnotify*() functions from lookup()
  eventfs: Restructure eventfs_inode structure to be more condensed
  eventfs: Warn if an eventfs_inode is freed without is_freed being set
  tracing/timerlat: Move hrtimer_init to timerlat_fd open()
  eventfs: Get rid of dentry pointers without refcounts
  eventfs: Clean up dentry ops and add revalidate function
  eventfs: Remove unused d_parent pointer field
  tracefs: dentry lookup crapectomy
  tracefs: Avoid using the ei->dentry pointer unnecessarily
  eventfs: Initialize the tracefs inode properly
  tracefs: Zero out the tracefs_inode when allocating it
  ring-buffer: Clean ring_buffer_poll_wait() error return
2024-02-02 15:32:58 -08:00
Eduard Zingerman
6efbde200b bpf: Handle scalar spill vs all MISC in stacksafe()
When check_stack_read_fixed_off() reads value from an spi
all stack slots of which are set to STACK_{MISC,INVALID},
the destination register is set to unbound SCALAR_VALUE.

Exploit this fact by allowing stacksafe() to use a fake
unbound scalar register to compare 'mmmm mmmm' stack value
in old state vs spilled 64-bit scalar in current state
and vice versa.

Veristat results after this patch show some gains:

./veristat -C -e file,prog,states -f 'states_pct>10'  not-opt after
File                     Program                States   (DIFF)
-----------------------  ---------------------  ---------------
bpf_overlay.o            tail_rev_nodeport_lb4    -45 (-15.85%)
bpf_xdp.o                tail_lb_ipv4            -541 (-19.57%)
pyperf100.bpf.o          on_event                -680 (-10.42%)
pyperf180.bpf.o          on_event               -2164 (-19.62%)
pyperf600.bpf.o          on_event               -9799 (-24.84%)
strobemeta.bpf.o         on_event               -9157 (-65.28%)
xdp_synproxy_kern.bpf.o  syncookie_tc             -54 (-19.29%)
xdp_synproxy_kern.bpf.o  syncookie_xdp            -74 (-24.50%)

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240127175237.526726-6-maxtram95@gmail.com
2024-02-02 13:22:14 -08:00
Maxim Mikityanskiy
c1e6148cb4 bpf: Preserve boundaries and track scalars on narrowing fill
When the width of a fill is smaller than the width of the preceding
spill, the information about scalar boundaries can still be preserved,
as long as it's coerced to the right width (done by coerce_reg_to_size).
Even further, if the actual value fits into the fill width, the ID can
be preserved as well for further tracking of equal scalars.

Implement the above improvements, which makes narrowing fills behave the
same as narrowing spills and MOVs between registers.

Two tests are adjusted to accommodate for endianness differences and to
take into account that it's now allowed to do a narrowing fill from the
least significant bits.

reg_bounds_sync is added to coerce_reg_to_size to correctly adjust
umin/umax boundaries after the var_off truncation, for example, a 64-bit
value 0xXXXXXXXX00000000, when read as a 32-bit, gets umin = 0, umax =
0xFFFFFFFF, var_off = (0x0; 0xffffffff00000000), which needs to be
synced down to umax = 0, otherwise reg_bounds_sanity_check doesn't pass.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240127175237.526726-4-maxtram95@gmail.com
2024-02-02 13:22:14 -08:00
Maxim Mikityanskiy
e67ddd9b1c bpf: Track spilled unbounded scalars
Support the pattern where an unbounded scalar is spilled to the stack,
then boundary checks are performed on the src register, after which the
stack frame slot is refilled into a register.

Before this commit, the verifier didn't treat the src register and the
stack slot as related if the src register was an unbounded scalar. The
register state wasn't copied, the id wasn't preserved, and the stack
slot was marked as STACK_MISC. Subsequent boundary checks on the src
register wouldn't result in updating the boundaries of the spilled
variable on the stack.

After this commit, the verifier will preserve the bond between src and
dst even if src is unbounded, which permits to do boundary checks on src
and refill dst later, still remembering its boundaries. Such a pattern
is sometimes generated by clang when compiling complex long functions.

One test is adjusted to reflect that now unbounded scalars are tracked.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240127175237.526726-2-maxtram95@gmail.com
2024-02-02 13:22:14 -08:00
Christophe Leroy
315df9c476 modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around rodata_enabled
Now that rodata_enabled is declared at all time, the #ifdef
CONFIG_STRICT_MODULE_RWX can be removed.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-02-02 10:21:25 -08:00
Oleg Nesterov
a1c6d5439f pid: kill the obsolete PIDTYPE_PID code in transfer_pid()
transfer_pid() must be never called with pid == PIDTYPE_PID,
new_leader->thread_pid should be changed by exchange_tids().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240202131255.GA26025@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-02 14:57:53 +01:00
Oleg Nesterov
43f0df54c9 pidfd_poll: report POLLHUP when pid_task() == NULL
Add another wake_up_all(wait_pidfd) into __change_pid() and change
pidfd_poll() to include EPOLLHUP if task == NULL.

This allows to wait until the target process/thread is reaped.

TODO: change do_notify_pidfd() to use the keyed wakeups.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240202131226.GA26018@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-02 14:57:53 +01:00
Oleg Nesterov
64bef697d3 pidfd: implement PIDFD_THREAD flag for pidfd_open()
With this flag:

	- pidfd_open() doesn't require that the target task must be
	  a thread-group leader

	- pidfd_poll() succeeds when the task exits and becomes a
	  zombie (iow, passes exit_notify()), even if it is a leader
	  and thread-group is not empty.

	  This means that the behaviour of pidfd_poll(PIDFD_THREAD,
	  pid-of-group-leader) is not well defined if it races with
	  exec() from its sub-thread; pidfd_poll() can succeed or not
	  depending on whether pidfd_task_exited() is called before
	  or after exchange_tids().

	  Perhaps we can improve this behaviour later, pidfd_poll()
	  can probably take sig->group_exec_task into account. But
	  this doesn't really differ from the case when the leader
	  exits before other threads (so pidfd_poll() succeeds) and
	  then another thread execs and pidfd_poll() will block again.

thread_group_exited() is no longer used, perhaps it can die.

Co-developed-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240131132602.GA23641@redhat.com
Tested-by: Tycho Andersen <tandersen@netflix.com>
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-02 13:12:28 +01:00
Oleg Nesterov
21e25205d7 pidfd: don't do_notify_pidfd() if !thread_group_empty()
do_notify_pidfd() makes no sense until the whole thread group exits, change
do_notify_parent() to check thread_group_empty().

This avoids the unnecessary do_notify_pidfd() when tsk is not a leader, or
it exits before other threads, or it has a ptraced EXIT_ZOMBIE sub-thread.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240127132407.GA29136@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-02 13:12:28 +01:00
Oleg Nesterov
cdefbf2324 pidfd: cleanup the usage of __pidfd_prepare's flags
- make pidfd_create() static.

- Don't pass O_RDWR | O_CLOEXEC to __pidfd_prepare() in copy_process(),
  __pidfd_prepare() adds these flags unconditionally.

- Kill the flags check in __pidfd_prepare(). sys_pidfd_open() checks the
  flags itself, all other users of pidfd_prepare() pass flags = 0.

  If we need a sanity check for those other in kernel users then
  WARN_ON_ONCE(flags & ~PIDFD_NONBLOCK) makes more sense.

- Don't pass O_RDWR to get_unused_fd_flags(), it ignores everything except
  O_CLOEXEC.

- Don't pass O_CLOEXEC to anon_inode_getfile(), it ignores everything except
  O_ACCMODE | O_NONBLOCK.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240125161734.GA778@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-02 13:12:28 +01:00
Wang Jinchao
b639585e71 fork: Using clone_flags for legacy clone check
In the current implementation of clone(), there is a line that
initializes `u64 clone_flags = args->flags` at the top.
This means that there is no longer a need to use args->flags
for the legacy clone check.

Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Link: https://lore.kernel.org/r/202401311054+0800-wangjinchao@xfusion.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-02 13:12:28 +01:00
Jakub Kicinski
cf244463a2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-01 15:12:37 -08:00
Jingzi Meng
09ce61e27d cap_syslog: remove CAP_SYS_ADMIN when dmesg_restrict
CAP_SYSLOG was separated from CAP_SYS_ADMIN and introduced in Linux
2.6.37 (2010-11). For a long time, certain syslog actions required
CAP_SYS_ADMIN or CAP_SYSLOG. Maybe it’s time to officially remove
CAP_SYS_ADMIN for more fine-grained control.

CAP_SYS_ADMIN was once removed but added back for backwards
compatibility reasons. In commit 38ef4c2e437d ("syslog: check cap_syslog
when dmesg_restrict") (2010-12), CAP_SYS_ADMIN was no longer needed. And
in commit ee24aebffb75 ("cap_syslog: accept CAP_SYS_ADMIN for now")
(2011-02), it was accepted again. Since then, CAP_SYS_ADMIN has been
preserved.

Now that almost 13 years have passed, the legacy application may have
had enough time to be updated.

Signed-off-by: Jingzi Meng <mengjingzi@iie.ac.cn>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20240105062007.26965-1-mengjingzi@iie.ac.cn
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-02-01 10:04:58 -08:00
Matt Bobrowski
1581e5118e bpf: Minor clean-up to sleepable_lsm_hooks BTF set
There's already one main CONFIG_SECURITY_NETWORK ifdef block within
the sleepable_lsm_hooks BTF set. Consolidate this duplicated ifdef
block as there's no need for it and all things guarded by it should
remain in one place in this specific context.

Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/Zbt1smz43GDMbVU3@google.com
2024-02-01 18:37:45 +01:00
Daniel Bristot de Oliveira
1389358bb0 tracing/timerlat: Move hrtimer_init to timerlat_fd open()
Currently, the timerlat's hrtimer is initialized at the first read of
timerlat_fd, and destroyed at close(). It works, but it causes an error
if the user program open() and close() the file without reading.

Here's an example:

 # echo NO_OSNOISE_WORKLOAD > /sys/kernel/debug/tracing/osnoise/options
 # echo timerlat > /sys/kernel/debug/tracing/current_tracer

 # cat <<EOF > ./timerlat_load.py
 # !/usr/bin/env python3

 timerlat_fd = open("/sys/kernel/tracing/osnoise/per_cpu/cpu0/timerlat_fd", 'r')
 timerlat_fd.close();
 EOF

 # ./taskset -c 0 ./timerlat_load.py
<BOOM>

 BUG: kernel NULL pointer dereference, address: 0000000000000010
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP NOPTI
 CPU: 1 PID: 2673 Comm: python3 Not tainted 6.6.13-200.fc39.x86_64 #1
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
 RIP: 0010:hrtimer_active+0xd/0x50
 Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 57 30 <8b> 42 10 a8 01 74 09 f3 90 8b 42 10 a8 01 75 f7 80 7f 38 00 75 1d
 RSP: 0018:ffffb031009b7e10 EFLAGS: 00010286
 RAX: 000000000002db00 RBX: ffff9118f786db08 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffff9117a0e64400 RDI: ffff9118f786db08
 RBP: ffff9118f786db80 R08: ffff9117a0ddd420 R09: ffff9117804d4f70
 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9118f786db08
 R13: ffff91178fdd5e20 R14: ffff9117840978c0 R15: 0000000000000000
 FS:  00007f2ffbab1740(0000) GS:ffff9118f7840000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000010 CR3: 00000001b402e000 CR4: 0000000000750ee0
 PKRU: 55555554
 Call Trace:
  <TASK>
  ? __die+0x23/0x70
  ? page_fault_oops+0x171/0x4e0
  ? srso_alias_return_thunk+0x5/0x7f
  ? avc_has_extended_perms+0x237/0x520
  ? exc_page_fault+0x7f/0x180
  ? asm_exc_page_fault+0x26/0x30
  ? hrtimer_active+0xd/0x50
  hrtimer_cancel+0x15/0x40
  timerlat_fd_release+0x48/0xe0
  __fput+0xf5/0x290
  __x64_sys_close+0x3d/0x80
  do_syscall_64+0x60/0x90
  ? srso_alias_return_thunk+0x5/0x7f
  ? __x64_sys_ioctl+0x72/0xd0
  ? srso_alias_return_thunk+0x5/0x7f
  ? syscall_exit_to_user_mode+0x2b/0x40
  ? srso_alias_return_thunk+0x5/0x7f
  ? do_syscall_64+0x6c/0x90
  ? srso_alias_return_thunk+0x5/0x7f
  ? exit_to_user_mode_prepare+0x142/0x1f0
  ? srso_alias_return_thunk+0x5/0x7f
  ? syscall_exit_to_user_mode+0x2b/0x40
  ? srso_alias_return_thunk+0x5/0x7f
  ? do_syscall_64+0x6c/0x90
  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
 RIP: 0033:0x7f2ffb321594
 Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d d5 cd 0d 00 00 74 13 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d
 RSP: 002b:00007ffe8d8eef18 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
 RAX: ffffffffffffffda RBX: 00007f2ffba4e668 RCX: 00007f2ffb321594
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
 RBP: 00007ffe8d8eef40 R08: 0000000000000000 R09: 0000000000000000
 R10: 55c926e3167eae79 R11: 0000000000000202 R12: 0000000000000003
 R13: 00007ffe8d8ef030 R14: 0000000000000000 R15: 00007f2ffba4e668
  </TASK>
 CR2: 0000000000000010
 ---[ end trace 0000000000000000 ]---

Move hrtimer_init to timerlat_fd open() to avoid this problem.

Link: https://lore.kernel.org/linux-trace-kernel/7324dd3fc0035658c99b825204a66049389c56e3.1706798888.git.bristot@kernel.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
Fixes: e88ed227f639 ("tracing/timerlat: Add user-space interface")
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-01 11:50:13 -05:00
Daniel Xu
6f3189f38a bpf: treewide: Annotate BPF kfuncs in BTF
This commit marks kfuncs as such inside the .BTF_ids section. The upshot
of these annotations is that we'll be able to automatically generate
kfunc prototypes for downstream users. The process is as follows:

1. In source, use BTF_KFUNCS_START/END macro pair to mark kfuncs
2. During build, pahole injects into BTF a "bpf_kfunc" BTF_DECL_TAG for
   each function inside BTF_KFUNCS sets
3. At runtime, vmlinux or module BTF is made available in sysfs
4. At runtime, bpftool (or similar) can look at provided BTF and
   generate appropriate prototypes for functions with "bpf_kfunc" tag

To ensure future kfunc are similarly tagged, we now also return error
inside kfunc registration for untagged kfuncs. For vmlinux kfuncs,
we also WARN(), as initcall machinery does not handle errors.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/e55150ceecbf0a5d961e608941165c0bee7bc943.1706491398.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-31 20:40:56 -08:00
Vincent Donnefort
66bbea9ed6 ring-buffer: Clean ring_buffer_poll_wait() error return
The return type for ring_buffer_poll_wait() is __poll_t. This is behind
the scenes an unsigned where we can set event bits. In case of a
non-allocated CPU, we do return instead -EINVAL (0xffffffea). Lucky us,
this ends up setting few error bits (EPOLLERR | EPOLLHUP | EPOLLNVAL), so
user-space at least is aware something went wrong.

Nonetheless, this is an incorrect code. Replace that -EINVAL with a
proper EPOLLERR to clean that output. As this doesn't change the
behaviour, there's no need to treat this change as a bug fix.

Link: https://lore.kernel.org/linux-trace-kernel/20240131140955.3322792-1-vdonnefort@google.com

Cc: stable@vger.kernel.org
Fixes: 6721cb6002262 ("ring-buffer: Do not poll non allocated cpu buffers")
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-01-31 14:10:24 -05:00
Tejun Heo
c5f8cd6c62 workqueue: Avoid premature init of wq->node_nr_active[].max
System workqueues are allocated early during boot from
workqueue_init_early(). While allocating unbound workqueues,
wq_update_node_max_active() is invoked from apply_workqueue_attrs() and
accesses NUMA topology to initialize wq->node_nr_active[].max.

However, topology information may not be set up at this point.
wq_update_node_max_active() is explicitly invoked from
workqueue_init_topology() later when topology information is known to be
available.

This doesn't seem to crash anything but it's doing useless work with dubious
data. Let's skip the premature and duplicate node_max_active updates by
initializing the field to WQ_DFL_MIN_ACTIVE on allocation and making
wq_update_node_max_active() noop until workqueue_init_topology().

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9221a4c57ae1..a65081ec6780 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -386,6 +386,8 @@ static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
 	[WQ_AFFN_SYSTEM]		= "system",
 };

+static bool wq_topo_initialized = false;
+
 /*
  * Per-cpu work items which run for longer than the following threshold are
  * automatically considered CPU intensive and excluded from concurrency
@@ -1510,6 +1512,9 @@ static void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu)

 	lockdep_assert_held(&wq->mutex);

+	if (!wq_topo_initialized)
+		return;
+
 	if (!cpumask_test_cpu(off_cpu, effective))
 		off_cpu = -1;

@@ -4356,6 +4361,7 @@ static void free_node_nr_active(struct wq_node_nr_active **nna_ar)

 static void init_node_nr_active(struct wq_node_nr_active *nna)
 {
+	nna->max = WQ_DFL_MIN_ACTIVE;
 	atomic_set(&nna->nr, 0);
 	raw_spin_lock_init(&nna->lock);
 	INIT_LIST_HEAD(&nna->pending_pwqs);
@@ -7400,6 +7406,8 @@ void __init workqueue_init_topology(void)
 	init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
 	init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);

+	wq_topo_initialized = true;
+
 	mutex_lock(&wq_pool_mutex);

 	/*
2024-01-30 19:17:00 -10:00
Tejun Heo
15930da42f workqueue: Don't call cpumask_test_cpu() with -1 CPU in wq_update_node_max_active()
For wq_update_node_max_active(), @off_cpu of -1 indicates that no CPU is
going down. The function was incorrectly calling cpumask_test_cpu() with -1
CPU leading to oopses like the following on some archs:

  Unable to handle kernel paging request at virtual address ffff0002100296e0
  ..
  pc : wq_update_node_max_active+0x50/0x1fc
  lr : wq_update_node_max_active+0x1f0/0x1fc
  ...
  Call trace:
    wq_update_node_max_active+0x50/0x1fc
    apply_wqattrs_commit+0xf0/0x114
    apply_workqueue_attrs_locked+0x58/0xa0
    alloc_workqueue+0x5ac/0x774
    workqueue_init_early+0x460/0x540
    start_kernel+0x258/0x684
    __primary_switched+0xb8/0xc0
  Code: 9100a273 35000d01 53067f00 d0016dc1 (f8607a60)
  ---[ end trace 0000000000000000 ]---
  Kernel panic - not syncing: Attempted to kill the idle task!
  ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: http://lkml.kernel.org/r/91eacde0-df99-4d5c-a980-91046f66e612@samsung.com
Fixes: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
2024-01-30 18:55:55 -10:00
Andrii Nakryiko
8f2b44cd9d bpf: add arg:nullable tag to be combined with trusted pointers
Add ability to mark arg:trusted arguments with optional arg:nullable
tag to mark it as PTR_TO_BTF_ID_OR_NULL variant, which will allow
callers to pass NULL, and subsequently will force global subprog's code
to do NULL check. This allows to have "optional" PTR_TO_BTF_ID values
passed into global subprogs.

For now arg:nullable cannot be combined with anything else.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240130000648.2144827-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-30 09:41:50 -08:00
Andrii Nakryiko
e2b3c4ff5d bpf: add __arg_trusted global func arg tag
Add support for passing PTR_TO_BTF_ID registers to global subprogs.
Currently only PTR_TRUSTED flavor of PTR_TO_BTF_ID is supported.
Non-NULL semantics is assumed, so caller will be forced to prove
PTR_TO_BTF_ID can't be NULL.

Note, we disallow global subprogs to destroy passed in PTR_TO_BTF_ID
arguments, even the trusted one. We achieve that by not setting
ref_obj_id when validating subprog code. This basically enforces (in
Rust terms) borrowing semantics vs move semantics. Borrowing semantics
seems to be a better fit for isolated global subprog validation
approach.

Implementation-wise, we utilize existing logic for matching
user-provided BTF type to kernel-side BTF type, used by BPF CO-RE logic
and following same matching rules. We enforce a unique match for types.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240130000648.2144827-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-30 09:41:50 -08:00
Haiyue Wang
6668e818f9 bpf,token: Use BIT_ULL() to convert the bit mask
Replace the '(1ULL << *)' with the macro BIT_ULL(nr).

Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240127134901.3698613-1-haiyue.wang@intel.com
2024-01-29 20:04:55 -08:00
Linus Torvalds
9b7bd05beb tracing: Two small fixes for tracefs and eventfs:
- Fix register_snapshot_trigger() on allocation error
   If the snapshot fails to allocate, the register_snapshot_trigger() can
   still return success. If the call to tracing_alloc_snapshot_instance()
   returned anything but 0, it returned 0, but it should have been returning
   the error code from that allocation function.
 
 - Remove leftover code from tracefs doing a dentry walk on remount.
   The update_gid() function was called by the tracefs code on remount
   to update the gid of eventfs, but that is no longer the case, but that
   code wasn't deleted. Nothing calls it. Remove it.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZbcbZBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qqEsAQCWTIQ/yAktP2doLvuccZbZSc5NFHZt
 hZNRHep/SMQqKwD6AlesgiXlXM39HiRTiV17jPzrL65brcTEYMJmqCaWugk=
 =8nia
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Two small fixes for tracefs and eventfs:

   - Fix register_snapshot_trigger() on allocation error

     If the snapshot fails to allocate, the register_snapshot_trigger()
     can still return success. If the call to
     tracing_alloc_snapshot_instance() returned anything but 0, it
     returned 0, but it should have been returning the error code from
     that allocation function.

   - Remove leftover code from tracefs doing a dentry walk on remount.

     The update_gid() function was called by the tracefs code on remount
     to update the gid of eventfs, but that is no longer the case, but
     that code wasn't deleted. Nothing calls it. Remove it"

* tag 'trace-v6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracefs: remove stale 'update_gid' code
  tracing/trigger: Fix to return error if failed to alloc snapshot
2024-01-29 18:42:34 -08:00
Leonardo Bras
aae17ebb53 workqueue: Avoid using isolated cpus' timers on queue_delayed_work
When __queue_delayed_work() is called, it chooses a cpu for handling the
timer interrupt. As of today, it will pick either the cpu passed as
parameter or the last cpu used for this.

This is not good if a system does use CPU isolation, because it can take
away some valuable cpu time to:
1 - deal with the timer interrupt,
2 - schedule-out the desired task,
3 - queue work on a random workqueue, and
4 - schedule the desired task back to the cpu.

So to fix this, during __queue_delayed_work(), if cpu isolation is in
place, pick a random non-isolated cpu to handle the timer interrupt.

As an optimization, if the current cpu is not isolated, use it instead
of looking for another candidate.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-29 15:21:37 -10:00
Linus Torvalds
6f3d7d5ced 22 hotfixes. 11 are cc:stable and the remainder address post-6.7 issues
or aren't considered appropriate for backporting.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZbdSnwAKCRDdBJ7gKXxA
 jv49AQCY8eLOgE0L+25HZm99HleBwapbKJozcmsXgMPlgeFZHgEA8saExeL+Nzae
 6ktxmGXoVw2t3FJ67Zr66VE3EyHVKAY=
 =HWuo
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-01-28-23-21' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "22 hotfixes. 11 are cc:stable and the remainder address post-6.7
  issues or aren't considered appropriate for backporting"

* tag 'mm-hotfixes-stable-2024-01-28-23-21' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
  mm: thp_get_unmapped_area must honour topdown preference
  mm: huge_memory: don't force huge page alignment on 32 bit
  userfaultfd: fix mmap_changing checking in mfill_atomic_hugetlb
  selftests/mm: ksm_tests should only MADV_HUGEPAGE valid memory
  scs: add CONFIG_MMU dependency for vfree_atomic()
  mm/memory: fix folio_set_dirty() vs. folio_mark_dirty() in zap_pte_range()
  mm/huge_memory: fix folio_set_dirty() vs. folio_mark_dirty()
  selftests/mm: Update va_high_addr_switch.sh to check CPU for la57 flag
  selftests: mm: fix map_hugetlb failure on 64K page size systems
  MAINTAINERS: supplement of zswap maintainers update
  stackdepot: make fast paths lock-less again
  stackdepot: add stats counters exported via debugfs
  mm, kmsan: fix infinite recursion due to RCU critical section
  mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again
  selftests/mm: switch to bash from sh
  MAINTAINERS: add man-pages git trees
  mm: memcontrol: don't throttle dying tasks on memory.high
  mm: mmap: map MAP_STACK to VM_NOHUGEPAGE
  uprobes: use pagesize-aligned virtual address when replacing pages
  selftests/mm: mremap_test: fix build warning
  ...
2024-01-29 17:12:16 -08:00
Florian Lehner
aecaa3ed48 perf/bpf: Fix duplicate type check
Remove the duplicate check on type and unify result.

Signed-off-by: Florian Lehner <dev@der-flo.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20240120150920.3370-1-dev@der-flo.net
2024-01-29 22:40:37 +01:00
Andrii Nakryiko
add9c58cd4 bpf: move arg:ctx type enforcement check inside the main logic loop
Now that bpf and bpf-next trees converged and we don't run the risk of
merge conflicts, move btf_validate_prog_ctx_type() into its most logical
place inside the main logic loop.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240125205510.3642094-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-29 12:34:13 -08:00
Christophe Leroy
3559ad395b module: Change module_enable_{nx/x/ro}() to more explicit names
It's a bit puzzling to see a call to module_enable_nx() followed by a
call to module_enable_x(). This is because one applies on text while
the other applies on data.

Change name to make that more clear.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-01-29 12:00:31 -08:00
Christophe Leroy
ac88ee7d2b module: Use set_memory_rox()
A couple of architectures seem concerned about calling set_memory_ro()
and set_memory_x() too frequently and have implemented a version of
set_memory_rox(), see commit 60463628c9e0 ("x86/mm: Implement native
set_memory_rox()") and commit 22e99fa56443 ("s390/mm: implement
set_memory_rox()")

Use set_memory_rox() in modules when STRICT_MODULES_RWX is set.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-01-29 12:00:31 -08:00
Tejun Heo
5797b1c189 workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.

In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.

However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.

While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.

636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulates concurrency level which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.

Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. ie. The acceptable lowest max_active is higher
than the acceptable highest max_active.

Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:

- One max_active enforcement decouples from pool boundaires, chaining
  execution after a work item finishes requires inter-pool operations which
  would require lock dancing, which is nasty.

- Sharing a single nr_active count across the whole system can be pretty
  expensive on NUMA machines.

- Per-pwq enforcement had been more or less okay while we were using
  per-node pools.

It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:

- To avoid sharing a single counter across multiple nodes, the configured
  max_active is split across nodes according to the proportion of each
  workqueue's online effective CPUs per node. e.g. A node with twice more
  online effective CPUs will get twice higher portion of max_active.

- Workqueue used to be able to process a chain of interdependent work items
  which is as long as max_active. We can't do this anymore as max_active is
  distributed across the nodes. Instead, a new parameter min_active is
  introduced which determines the minimum level of concurrency within a node
  regardless of how max_active distribution comes out to be.

  It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
  This can lead to higher effective max_weight than configured and also
  deadlocks if a workqueue was depending on being able to handle chains of
  interdependent work items that are longer than 8.

  I believe these should be fine given that the number of CPUs in each NUMA
  node is usually higher than 8 and work item chain longer than 8 is pretty
  unlikely. However, if these assumptions turn out to be wrong, we'll need
  to add an interface to adjust min_active.

- Each unbound wq has an array of struct wq_node_nr_active which tracks
  per-node nr_active. When its pwq wants to run a work item, it has to
  obtain the matching node's nr_active. If over the node's max_active, the
  pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
  the completion path round-robins the pending pwqs activating the first
  inactive work item of each, which involves some pool lock dancing and
  kicking other pools. It's not the simplest code but doesn't look too bad.

v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().

    - wq_adjust_max_active() is now protected by wq->mutex instead of
      wq_pool_mutex.

v3: - wq_node_max_active() used to calculate per-node max_active on the fly
      based on system-wide CPU online states. Lai pointed out that this can
      lead to skewed distributions for workqueues with restricted cpumasks.
      Update the max_active distribution to use per-workqueue effective
      online CPU counts instead of system-wide and cache the calculation
      results in node_nr_active->max.

v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:25 -10:00
Tejun Heo
91ccc6e723 workqueue: Introduce struct wq_node_nr_active
Currently, for both percpu and unbound workqueues, max_active applies
per-cpu, which is a recent change for unbound workqueues. The change for
unbound workqueues was a significant departure from the previous behavior of
per-node application. It made some use cases create undesirable number of
concurrent work items and left no good way of fixing them. To address the
problem, workqueue is implementing a NUMA node segmented global nr_active
mechanism, which will be explained further in the next patch.

As a preparation, this patch introduces struct wq_node_nr_active. It's a
data structured allocated for each workqueue and NUMA node pair and
currently only tracks the workqueue's number of active work items on the
node. This is split out from the next patch to make it easier to understand
and review.

Note that there is an extra wq_node_nr_active allocated for the invalid node
nr_node_ids which is used to track nr_active for pools which don't have NUMA
node associated such as the default fallback system-wide pool.

This doesn't cause any behavior changes visible to userland yet. The next
patch will expand to implement the control mechanism on top.

v4: - Fixed out-of-bound access when freeing per-cpu workqueues.

v3: - Use flexible array for wq->node_nr_active as suggested by Lai.

v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.

    - Lai pointed out that pwq_tryinc_nr_active() incorrectly dropped
      pwq->max_active check. Restored. As the next patch replaces the
      max_active enforcement mechanism, this doesn't change the end result.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo
dd6c3c5441 workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling
The planned shared nr_active handling for unbound workqueues will make
pwq_dec_nr_active() sometimes drop the pool lock temporarily to acquire
other pool locks, which is necessary as retirement of an nr_active count
from one pool may need kick off an inactive work item in another pool.

This patch moves pwq_dec_nr_in_flight() call in try_to_grab_pending() to the
end of work item handling so that work item state changes stay atomic.
process_one_work() which is the other user of pwq_dec_nr_in_flight() already
calls it at the end of work item handling. Comments are added to both call
sites and pwq_dec_nr_in_flight().

This shouldn't cause any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo
9f66cff212 workqueue: RCU protect wq->dfl_pwq and implement accessors for it
wq->cpu_pwq is RCU protected but wq->dfl_pwq isn't. This is okay because
currently wq->dfl_pwq is used only accessed to install it into wq->cpu_pwq
which doesn't require RCU access. However, we want to be able to access
wq->dfl_pwq under RCU in the future to access its __pod_cpumask and the code
can be made easier to read by making the two pwq fields behave in the same
way.

- Make wq->dfl_pwq RCU protected.

- Add unbound_pwq_slot() and unbound_pwq() which can access both ->dfl_pwq
  and ->cpu_pwq. The former returns the double pointer that can be used
  access and update the pwqs. The latter performs locking check and
  dereferences the double pointer.

- pwq accesses and updates are converted to use unbound_pwq[_slot]().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo
c5404d4e6d workqueue: Make wq_adjust_max_active() round-robin pwqs while activating
wq_adjust_max_active() needs to activate work items after max_active is
increased. Previously, it did that by visiting each pwq once activating all
that could be activated. While this makes sense with per-pwq nr_active,
nr_active will be shared across multiple pwqs for unbound wqs. Then, we'd
want to round-robin through pwqs to be fairer.

In preparation, this patch makes wq_adjust_max_active() round-robin pwqs
while activating. While the activation ordering changes, this shouldn't
cause user-noticeable behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo
1c270b79ce workqueue: Move nr_active handling into helpers
__queue_work(), pwq_dec_nr_in_flight() and wq_adjust_max_active() were
open-coding nr_active handling, which is fine given that the operations are
trivial. However, the planned unbound nr_active update will make them more
complicated, so let's move them into helpers.

- pwq_tryinc_nr_active() is added. It increments nr_active if under
  max_active limit and return a boolean indicating whether inc was
  successful. Note that the function is structured to accommodate future
  changes. __queue_work() is updated to use the new helper.

- pwq_activate_first_inactive() is updated to use pwq_tryinc_nr_active() and
  thus no longer assumes that nr_active is under max_active and returns a
  boolean to indicate whether a work item has been activated.

- wq_adjust_max_active() no longer tests directly whether a work item can be
  activated. Instead, it's updated to use the return value of
  pwq_activate_first_inactive() to tell whether a work item has been
  activated.

- nr_active decrement and activating the first inactive work item is
  factored into pwq_dec_nr_active().

v3: - WARN_ON_ONCE(!WORK_STRUCT_INACTIVE) added to __pwq_activate_work() as
      now we're calling the function unconditionally from
      pwq_activate_first_inactive().

v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo
4c6380305d workqueue: Replace pwq_activate_inactive_work() with [__]pwq_activate_work()
To prepare for unbound nr_active handling improvements, move work activation
part of pwq_activate_inactive_work() into __pwq_activate_work() and add
pwq_activate_work() which tests WORK_STRUCT_INACTIVE and updates nr_active.

pwq_activate_first_inactive() and try_to_grab_pending() are updated to use
pwq_activate_work(). The latter conversion is functionally identical. For
the former, this conversion adds an unnecessary WORK_STRUCT_INACTIVE
testing. This is temporary and will be removed by the next patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo
afa87ce853 workqueue: Factor out pwq_is_empty()
"!pwq->nr_active && list_empty(&pwq->inactive_works)" test is repeated
multiple times. Let's factor it out into pwq_is_empty().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo
a045a272d8 workqueue: Move pwq->max_active to wq->max_active
max_active is a workqueue-wide setting and the configured value is stored in
wq->saved_max_active; however, the effective value was stored in
pwq->max_active. While this is harmless, it makes max_active update process
more complicated and gets in the way of the planned max_active semantic
updates for unbound workqueues.

This patches moves pwq->max_active to wq->max_active. This simplifies the
code and makes freezing and noop max_active updates cheaper too. No
user-visible behavior change is intended.

As wq->max_active is updated while holding wq mutex but read without any
locking, it now uses WRITE/READ_ONCE(). A new locking locking rule WO is
added for it.

v2: wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Bartosz Golaszewski
aafd753555 genirq/irq_sim: Shrink code by using <linux/cleanup.h> helpers
Use the new __free() mechanism to remove all gotos and simplify the error
paths.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240122124243.44002-5-brgl@bgdev.pl
2024-01-29 11:07:57 +01:00
Linus Torvalds
648f575d5e - Prevent an inconsistent futex operation leading to stale state
exposure
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmW2LAgACgkQEsHwGGHe
 VUqowBAAiW9aPQmp401DSXLX+bX0oS5IQVEZnAEE3hQTWxdvDoIdmX+SBReSutXy
 PDm8mZgVtIiUg3V5bu7/9Dgpu7ovRuChJPjjkYFUDcEmzmsMI11W6u8+8eyt8yRd
 X9LuGUeXPJSI1kadYudhFUhl6X6KcXj4Y+XUqNcyp8yClSEcLriYeiumNApSEzj6
 BneO5VBbXTpJq1b7GOlC4MNhNXhx+WlUdJUb3VPLlxy/akxrNs9x0ASdOuqslCq8
 X9SJPnKeRh0mpezmWDgU72eQ/3vpvWQzwyXvp2pQGbjArCx7IwwD765NDu0P6651
 C/+4ruXmcd+Jp3wuobdHG8/J2NlZQy8tZQm284YkS5vyBQDi4s17hycXw/aeUFpu
 /3LR1Hppl//u7hkaHszE8vE5l6in4a2XAbk9EozChVj/aHRJqIaLn8TGQRquK4Tg
 uRjIC3O2ubJCsIlNIczysjCobSCO+cELwUuFVHh7cdmQAgUwF3efDab0+pJ7MHFb
 ZEcqQbIt4FGea4BGzvRYCYj6W9bkhzttnH+68ef+mDA3BcdGoYnHcQ143M8duNhe
 0inWCibQXMFC9EGPjC8Sz8WvzF/L5KL9bPQmO1sitIzH6kbU3o7PBk2Fe4V6+KP9
 THK865SJ/9QirjXrGmp9Sle6dqJRUylmt1ts8reOWACZ98LKeWU=
 =ibuM
 -----END PGP SIGNATURE-----

Merge tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Borislav Petkov:

 - Prevent an inconsistent futex operation leading to stale state
   exposure

* tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Prevent the reuse of stale pi_state
2024-01-28 10:38:16 -08:00
Linus Torvalds
0e4363ac1a - Initialize the resend node of each IRQ descriptor, not only the first one
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmW2KrEACgkQEsHwGGHe
 VUr1WA//Qsi2JkxO1lyUQgnyuXqs0+oVZJxFyH2dFYzWkfSaxgsyPZ0H+wsweDfP
 OgoNzwwDf1IaNbVz2voV6lSM/30ujJMx4aAucT5WTEXa12cJsvipxRiNd8WU8GqQ
 buBz+vnS9IJ2WfM7UxhIVevYFU8H/ERcSO9WCII0YjlcVxmlwMK3B7kFdpBPdT5Q
 m5hvBorZzIa9wD3TI7e+VvEVbCx0WjsYYEpXDXM/yCf1Juc9952pjjunzx3YmJES
 5JG2WpnEvmNdWwIPO0NAjs7Shw/MNViXTy5Ls5jcbswiAcBoUxNHQlUsvNVDaVyv
 8eMCkPzuSipY8HSoetQSTJl+mr3LyYRvevKahuTgwbS8K+kxgClqHFoLZVqolWnk
 2IDo63R6Ex6lb1Xpb/Rpg/4j4NqUVWcvPHf6Z2CmMRq/XbSk2DIFl1Wxjgy/Cjnu
 +nNLw2FYayEBrKF3VlYgERGoCfBrEsksxzljjeHFn5XWr+G2x1ykF37xaWjQ2+oV
 sFl6UYwIsdqPCjHmpT6R1lwCdeEC3o3Zc2Kf5uEVj+pXacKJkxZU0L6ZneO8UiEc
 rtc0gTgm9ZNd8oDsjsaBU1A3KxH9lOfVz82ZV0tipz94dcN4zrB9Qag0Yw+64YOC
 cQ9cRKiFiCVCeD1ksDLZe1IUX++T2Y9O8MZv06ZDrbaFC56+lU8=
 =IgON
 -----END PGP SIGNATURE-----

Merge tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Borislav Petkov:

 - Initialize the resend node of each IRQ descriptor, not only the first
   one

* tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Initialize resend_node hlist for all interrupt descriptors
2024-01-28 10:34:55 -08:00
Linus Torvalds
90db544eba - Preserve the number of idle calls and sleep entries across CPU hotplug
events in order to be able to compute correct averages
 
 - Limit the duration of the clocksource watchdog checking interval as
   too long intervals lead to wrongly marking the TSC as unstable
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmW2KYEACgkQEsHwGGHe
 VUoX7w//Ulls1tp3m1oiejTBtUmkewSmnhNAfkHJv3MlKNe+ttG6LvQVh9g2bf1Z
 FOu2M2se0ge8G5xf3+I5E4rpqlJZSuhPmNmIET+aj+2a61UJq/6zE1Zw6mxjrJOK
 emjOYKTxZ/HxvKJJGO6NiH8Iv5Aj3nQR3Y6oyb/FyP5TLJ6MCT21iEaqyqU7P+Ix
 AHIS3cL97M5R/tFtP2CY3PV2M6hJ0lqapSi9t75hT8DfJN1TNQ5SvFkKgmOIrGFw
 2WxPTSTEZAnXlvI4cC3Nru9i64QQRw9S05FFelX2pwxE/7wVzBvfh8cjuGZJBve/
 KQhNnQ4/fzv6E/hUcavKuOyk1lx5XonfCuG4RFoLl67LjLbLh+Q55RBdXflBPF4T
 Ow9BSyQNFu391C2Bl5gJUYVd2JMv+IVpi2wUiwrXJ/Mxj+A2J7Fj0jz7hMbNCmsU
 EaA+QyfkAGsoa99xP3UDhPzxoCr2s5YTAxH+IUeSWeI25PMq9f+6fifXBwG+GaVa
 FS6Ei1VI0GCNmcYFYawHJbdM2ui5h7lZ96aEpOBSVcAv/2yBNgqxuYZ+icO/wI6N
 JM0DSEEOrWcytfxftl7LmglJauhXSKZH4UTG4RCz0IkDgR72Wn0QF4cqm4wWQ5yh
 n5/xO+SbkzE57bltsnAkvpu0a110fdK5ec+vkFIy4PrxyO83XhY=
 =9RXx
 -----END PGP SIGNATURE-----

Merge tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Borislav Petkov:

 - Preserve the number of idle calls and sleep entries across CPU
   hotplug events in order to be able to compute correct averages

 - Limit the duration of the clocksource watchdog checking interval as
   too long intervals lead to wrongly marking the TSC as unstable

* tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/sched: Preserve number of idle sleeps across CPU hotplug events
  clocksource: Skip watchdog check for large watchdog intervals
2024-01-28 10:33:14 -08:00
Jakub Kicinski
92046e83c0 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZbQV+gAKCRDbK58LschI
 g2OeAP0VvhZS9SPiS+/AMAFuw2W1BkMrFNbfBTc3nzRnyJSmNAD+NG4CLLJvsKI9
 olu7VC20B8pLTGLUGIUSwqnjOC+Kkgc=
 =wVMl
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-01-26

We've added 107 non-merge commits during the last 4 day(s) which contain
a total of 101 files changed, 6009 insertions(+), 1260 deletions(-).

The main changes are:

1) Add BPF token support to delegate a subset of BPF subsystem
   functionality from privileged system-wide daemons such as systemd
   through special mount options for userns-bound BPF fs to a trusted
   & unprivileged application. With addressed changes from Christian
   and Linus' reviews, from Andrii Nakryiko.

2) Support registration of struct_ops types from modules which helps
   projects like fuse-bpf that seeks to implement a new struct_ops type,
   from Kui-Feng Lee.

3) Add support for retrieval of cookies for perf/kprobe multi links,
   from Jiri Olsa.

4) Bigger batch of prep-work for the BPF verifier to eventually support
   preserving boundaries and tracking scalars on narrowing fills,
   from Maxim Mikityanskiy.

5) Extend the tc BPF flavor to support arbitrary TCP SYN cookies to help
   with the scenario of SYN floods, from Kuniyuki Iwashima.

6) Add code generation to inline the bpf_kptr_xchg() helper which
   improves performance when stashing/popping the allocated BPF objects,
   from Hou Tao.

7) Extend BPF verifier to track aligned ST stores as imprecise spilled
   registers, from Yonghong Song.

8) Several fixes to BPF selftests around inline asm constraints and
   unsupported VLA code generation, from Jose E. Marchesi.

9) Various updates to the BPF IETF instruction set draft document such
   as the introduction of conformance groups for instructions,
   from Dave Thaler.

10) Fix BPF verifier to make infinite loop detection in is_state_visited()
    exact to catch some too lax spill/fill corner cases,
    from Eduard Zingerman.

11) Refactor the BPF verifier pointer ALU check to allow ALU explicitly
    instead of implicitly for various register types, from Hao Sun.

12) Fix the flaky tc_redirect_dtime BPF selftest due to slowness
    in neighbor advertisement at setup time, from Martin KaFai Lau.

13) Change BPF selftests to skip callback tests for the case when the
    JIT is disabled, from Tiezhu Yang.

14) Add a small extension to libbpf which allows to auto create
    a map-in-map's inner map, from Andrey Grafin.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (107 commits)
  selftests/bpf: Add missing line break in test_verifier
  bpf, docs: Clarify definitions of various instructions
  bpf: Fix error checks against bpf_get_btf_vmlinux().
  bpf: One more maintainer for libbpf and BPF selftests
  selftests/bpf: Incorporate LSM policy to token-based tests
  selftests/bpf: Add tests for LIBBPF_BPF_TOKEN_PATH envvar
  libbpf: Support BPF token path setting through LIBBPF_BPF_TOKEN_PATH envvar
  selftests/bpf: Add tests for BPF object load with implicit token
  selftests/bpf: Add BPF object loading tests with explicit token passing
  libbpf: Wire up BPF token support at BPF object level
  libbpf: Wire up token_fd into feature probing logic
  libbpf: Move feature detection code into its own file
  libbpf: Further decouple feature checking logic from bpf_object
  libbpf: Split feature detectors definitions from cached results
  selftests/bpf: Utilize string values for delegate_xxx mount options
  bpf: Support symbolic BPF FS delegation mount options
  bpf: Fail BPF_TOKEN_CREATE if no delegation option was set on BPF FS
  bpf,selinux: Allocate bpf_security_struct per BPF token
  selftests/bpf: Add BPF token-enabled tests
  libbpf: Add BPF token support to bpf_prog_load() API
  ...
====================

Link: https://lore.kernel.org/r/20240126215710.19855-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-26 21:08:22 -08:00
Tejun Heo
e563d0a7cd workqueue: Break up enum definitions and give names to the types
workqueue is collecting different sorts of enums into a single unnamed enum
type which can increase confusion around enum width. Also, unnamed enums
can't be accessed from BPF. Let's break up enum definitions according to
their purposes and give them type names.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-26 11:55:50 -10:00
Tejun Heo
6a229b0e2f workqueue: Drop unnecessary kick_pool() in create_worker()
After creating a new worker, create_worker() is calling kick_pool() to wake
up the new worker task. However, as kick_pool() doesn't do anything if there
is no work pending, it also calls wake_up_process() explicitly. There's no
reason to call kick_pool() at all. wake_up_process() is enough by itself.
Drop the unnecessary kick_pool() call.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-26 11:55:46 -10:00
Masami Hiramatsu (Google)
0958b33ef5 tracing/trigger: Fix to return error if failed to alloc snapshot
Fix register_snapshot_trigger() to return error code if it failed to
allocate a snapshot instead of 0 (success). Unless that, it will register
snapshot trigger without an error.

Link: https://lore.kernel.org/linux-trace-kernel/170622977792.270660.2789298642759362200.stgit@devnote2

Fixes: 0bbe7f719985 ("tracing: Fix the race between registering 'snapshot' event trigger and triggering 'snapshot' operation")
Cc: stable@vger.kernel.org
Cc: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-01-26 15:10:24 -05:00
Li Zhijian
effe6d278e kernel/cpu: Convert snprintf() to sysfs_emit()
Per filesystems/sysfs.rst, show() should only use sysfs_emit()
or sysfs_emit_at() when formatting the value to be returned to user space.

coccinelle complains that there are still a couple of functions that use
snprintf(). Convert them to sysfs_emit().

No functional change intended.

Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240116045151.3940401-40-lizhijian@fujitsu.com
2024-01-26 18:25:16 +01:00
Randy Dunlap
ef7e585bf4 cpu/hotplug: Delete an extraneous kernel-doc description
struct cpuhp_cpu_state has an extraneous kernel-doc comment for @cpu.
There is no struct member by that name, so remove the comment to
prevent the kernel-doc warning:

  kernel/cpu.c:85: warning: Excess struct member 'cpu' description in 'cpuhp_cpu_state'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240114030615.30441-1-rdunlap@infradead.org
2024-01-26 17:44:42 +01:00
Bartosz Golaszewski
8dab7fd47e genirq/irq_sim: Order headers alphabetically
For better readability and maintenance keep headers in alphabetical
order.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240122124243.44002-4-brgl@bgdev.pl
2024-01-26 13:44:48 +01:00
Bartosz Golaszewski
3832f39042 genirq/irq_sim: Remove unused field from struct irq_sim_irq_ctx
The irqnum field is unused. Remove it.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240122124243.44002-3-brgl@bgdev.pl
2024-01-26 13:44:48 +01:00