Oleg Nesterov
daa694e413 getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand()
Patch series "getrusage: use sig->stats_lock", v2.


This patch (of 2):

thread_group_cputime() does its own locking, so we can safely shift
thread_group_cputime_adjusted(), which does another for_each_thread
loop, outside of the ->siglock protected section.

This is also preparation for the next patch, which changes getrusage() to
use stats_lock instead of siglock; thread_group_cputime() takes the same
lock.  With the current implementation recursive read_seqbegin_or_lock()
is fine because thread_group_cputime() can't enter the slow mode if the
caller holds stats_lock, yet this looks safer and better performance-wise.
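
A minimal sketch of the resulting shape of getrusage() (illustrative, not
the actual patch):

	/* before: called under ->siglock */
	if (lock_task_sighand(p, &flags)) {
		thread_group_cputime_adjusted(p, &tgutime, &tgstime);
		/* ... accumulate the rest of the rusage ... */
		unlock_task_sighand(p, &flags);
	}

	/* after: thread_group_cputime() does its own locking */
	if (lock_task_sighand(p, &flags)) {
		/* ... accumulate the rest of the rusage ... */
		unlock_task_sighand(p, &flags);
	}
	thread_group_cputime_adjusted(p, &tgutime, &tgstime);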

Link: https://lkml.kernel.org/r/20240122155023.GA26169@redhat.com
Link: https://lkml.kernel.org/r/20240122155050.GA26205@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dylan Hatch <dylanbhatch@google.com>
Tested-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-07 21:20:32 -08:00
Masahiro Yamada
e55dad12ab bpf: Merge two CONFIG_BPF entries
'config BPF' exists in both init/Kconfig and kernel/bpf/Kconfig.

Commit b24abcff918a ("bpf, kconfig: Add consolidated menu entry for bpf
with core options") added the second one to kernel/bpf/Kconfig instead
of moving the existing one.

Merge them together.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240204075634.32969-1-masahiroy@kernel.org
2024-02-07 16:38:20 -08:00
John Ogness
7412dc6d55 dump_stack: Do not get cpu_sync for panic CPU
dump_stack() is called in panic(). If for some reason another CPU
is holding the printk_cpu_sync and is unable to release it, the
panic CPU will be unable to continue and print the stacktrace.

Since non-panic CPUs are not allowed to store new printk messages
anyway, there is no need to synchronize the stacktrace output in
a panic situation.

For the panic CPU, do not get the printk_cpu_sync because it is
not needed; this also avoids a potential deadlock scenario in panic().
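
A minimal sketch of the idea in dump_stack_lvl(), using the
this_cpu_in_panic() helper added earlier in this series:

	bool in_panic = this_cpu_in_panic();
	unsigned long flags;

	if (!in_panic)
		printk_cpu_sync_get_irqsave(flags);
	__dump_stack(log_lvl);
	if (!in_panic)
		printk_cpu_sync_put_irqrestore(flags);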

Link: https://lore.kernel.org/lkml/ZcIGKU8sxti38Kok@alley
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-15-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-02-07 17:23:19 +01:00
John Ogness
d988d9a9b9 panic: Flush kernel log buffer at the end
If the kernel crashes in a context where printk() calls always
defer printing (such as in NMI or inside a printk_safe section)
then the final panic messages will be deferred to irq_work. But
if irq_work is not available, the messages will not get printed
unless explicitly flushed. The result is that the final
"end Kernel panic" banner does not get printed.

Add one final flush after the last printk() call to make sure
the final panic messages make it out as well.
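
Sketched, this amounts to one extra flush at the very end of panic():

	pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf);
	/* make sure any deferred messages (NMI, printk_safe) are printed */
	console_flush_on_panic(CONSOLE_FLUSH_PENDING);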

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-14-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-02-07 17:23:19 +01:00
John Ogness
779dbc2e78 printk: Avoid non-panic CPUs writing to ringbuffer
Commit 13fb0f74d702 ("printk: Avoid livelock with heavy printk
during panic") introduced a mechanism to silence non-panic CPUs
if too many messages are being dropped. Aside from trying to
work around the livelock bugs of legacy consoles, it was also
intended to avoid losing panic messages. However, if non-panic
CPUs are writing to the ringbuffer, then reacting to dropped
messages is too late.

Another motivation is that non-finalized messages already might
be skipped in panic(). In other words, random messages from
non-panic CPUs might already get lost. It is better to ignore
all to avoid confusion.

To avoid losing panic CPU messages, silence non-panic CPUs
immediately on panic.
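
A minimal sketch of the idea; the placement early in the message-store
path (vprintk_store() here) is an assumption:

	/* silently drop messages from non-panic CPUs once a panic starts */
	if (other_cpu_in_panic())
		return 0;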

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-13-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-02-07 17:23:19 +01:00
Petr Mladek
d04d5882cd printk: Disable passing console lock owner completely during panic()
The commit d51507098ff91 ("printk: disable optimistic spin
during panic") added checks to avoid becoming a console waiter
if a panic is in progress.

However, the transition to panic can occur while there is
already a waiter. The current owner should not pass the lock to
the waiter because it might get stopped or blocked anytime.

Also the panic context might pass the console lock owner to an
already stopped waiter by mistake. It might happen when
console_flush_on_panic() ignores the current lock owner, for
example:

CPU0                                CPU1
----                                ----
console_lock_spinning_enable()
                                    console_trylock_spinning()
                                      [CPU1 now console waiter]
NMI: panic()
  panic_other_cpus_shutdown()
                                    [stopped as console waiter]
  console_flush_on_panic()
    console_lock_spinning_enable()
    [print 1 record]
    console_lock_spinning_disable_and_check()
      [handover to stopped CPU1]

This results in panic() not flushing the panic messages.

Fix these problems by disabling all spinning operations
completely during panic().

Another advantage is that it prevents possible deadlocks caused
by "console_owner_lock". The panic() context does not need to
take it any longer. The lockless checks are safe because the
functions become NOPs when they see the panic in progress. All
operations manipulating the state are still synchronized by the
lock even when non-panic CPUs would notice the panic
synchronously.

The current owner might stay spinning. But non-panic CPUs
would get stopped anyway and the panic context will never start
spinning.
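
Sketched, both spinning operations simply back off during panic
(illustrative only):

	static void console_lock_spinning_enable(void)
	{
		/* lockless check; the function becomes a NOP during panic */
		if (panic_in_progress())
			return;

		raw_spin_lock(&console_owner_lock);
		/* ... */
	}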

Fixes: dbdda842fe96 ("printk: Add console owner and waiter logic to load balance console writes")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/20240207134103.1357162-12-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-02-07 17:23:18 +01:00
John Ogness
b1c4c67a5e printk: ringbuffer: Skip non-finalized records in panic
Normally a reader will stop once reaching a non-finalized
record. However, when a panic happens, writers from other CPUs
(or an interrupted context on the panic CPU) may have been
writing a record and were unable to finalize it. The panic CPU
will reserve/commit/finalize its panic records, but these will
be located after the non-finalized records. This results in
panic() not flushing the panic messages.

Extend _prb_read_valid() to skip over non-finalized records if
on the panic CPU.
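
Illustrative pseudocode of the new reader behaviour (the helper name is
hypothetical):

	while (!prb_record_finalized(rb, *seq)) {
		if (!this_cpu_in_panic())
			return false;	/* normal readers stop here */
		(*seq)++;		/* the panic CPU skips ahead */
	}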

Fixes: 896fbe20b4e2 ("printk: use the lockless ringbuffer")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-11-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-02-07 17:23:18 +01:00
John Ogness
ac7d7844c6 printk: Wait for all reserved records with pr_flush()
Currently pr_flush() will only wait for records that were
available to readers at the time of the call (using
prb_next_seq()). But there may be more records (non-finalized)
that are followed by finalized records. pr_flush() should wait
for these to print as well. Particularly because any trailing
finalized records may be the messages that the calling context
wants to ensure are printed.

Add a new ringbuffer function prb_next_reserve_seq() to return
the sequence number following the most recently reserved record.
This guarantees that pr_flush() will wait until all current
printk() messages (completed or in progress) have been printed.

Fixes: 3b604ca81202 ("printk: add pr_flush()")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-10-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-02-07 17:23:18 +01:00
John Ogness
584528d621 printk: ringbuffer: Cleanup reader terminology
With the lockless ringbuffer, it is allowed that multiple
CPUs/contexts write simultaneously into the buffer. This creates
an ambiguity as some writers will finalize sooner.

The documentation for the prb_read functions is not clear as it
refers to "not yet written" and "no data available". Clarify the
return values and language to be in terms of the reader: records
available for reading.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-9-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-02-07 17:23:18 +01:00
John Ogness
36652d0f3b printk: Add this_cpu_in_panic()
There is already panic_in_progress() and other_cpu_in_panic(),
but checking if the current CPU is the panic CPU must still be
open coded.

Add this_cpu_in_panic() to complete the set.
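
The helper presumably mirrors other_cpu_in_panic(); a sketch:

	static bool this_cpu_in_panic(void)
	{
		return unlikely(atomic_read(&panic_cpu) == raw_smp_processor_id());
	}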

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-8-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-02-07 17:23:18 +01:00
John Ogness
0ab7cdd004 printk: For @suppress_panic_printk check for other CPU in panic
Currently @suppress_panic_printk is checked along with
non-matching @panic_cpu and current CPU. This works
because @suppress_panic_printk is only set when
panic_in_progress() is true.

Rather than relying on the @suppress_panic_printk semantics,
use the concise helper function other_cpu_in_panic(). The
helper function exists to avoid open coding such tests.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-7-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-02-07 17:23:18 +01:00
John Ogness
5113cf5f4c printk: ringbuffer: Clarify special lpos values
For empty line records, no data blocks are created. Instead,
these valid records are identified by special logical position
values (in fields of @prb_desc.text_blk_lpos).

Currently the macro NO_LPOS is used for empty line records.
This name is confusing because it does not imply _why_ there is
no data block.

Rename NO_LPOS to EMPTY_LINE_LPOS so that it is clear why there
is no data block.

Also add comments explaining the use of EMPTY_LINE_LPOS as well
as clarification to the values used to represent data-less
blocks.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-6-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-02-07 17:23:18 +01:00
John Ogness
5f72e52ba9 printk: ringbuffer: Do not skip non-finalized records with prb_next_seq()
Commit f244b4dc53e5 ("printk: ringbuffer: Improve
prb_next_seq() performance") introduced an optimization for
prb_next_seq() by using best-effort to track recently finalized
records. However, the order of finalization does not
necessarily match the order of the records. The optimization
changed prb_next_seq() to return inconsistent results, possibly
yielding sequence numbers that are not available to readers
because they are preceded by non-finalized records or they are
not yet visible to the reader CPU.

Rather than simply best-effort tracking recently finalized
records, force the committing writer to read records and
increment the last "contiguous block" of finalized records. In
order to do this, the sequence number instead of the ID must be
stored because IDs cannot be directly compared.

A new memory barrier pair is introduced to guarantee that a
reader can always read the records up until the sequence number
returned by prb_next_seq() (unless the records have since
been overwritten in the ringbuffer).

This restores the original functionality of prb_next_seq()
while also keeping the optimization.

For 32bit systems, only the lower 32 bits of the sequence
number are stored. When reading the value, it is expanded to
the full 64bit sequence number using the 32bit seq macros,
which fold in the value returned by prb_first_seq().

Fixes: f244b4dc53e5 ("printk: ringbuffer: Improve prb_next_seq() performance")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-5-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-02-07 17:23:18 +01:00
John Ogness
90ad525c2d printk: Use prb_first_seq() as base for 32bit seq macros
Note: This change only applies to 32bit architectures. On 64bit
      architectures the macros are NOPs.

Currently prb_next_seq() is used as the base for the 32bit seq
macros __u64seq_to_ulseq() and __ulseq_to_u64seq(). However, in
a follow-up commit, prb_next_seq() will need to make use of the
32bit seq macros.

Use prb_first_seq() as the base for the 32bit seq macros instead
because it is guaranteed to return 64bit sequence numbers without
relying on any 32bit seq macros.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-4-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-02-07 17:23:18 +01:00
Sebastian Andrzej Siewior
418ec1961c printk: Adjust mapping for 32bit seq macros
Note: This change only applies to 32bit architectures. On 64bit
      architectures the macros are NOPs.

__ulseq_to_u64seq() computes the upper 32 bits of the passed
argument value (@ulseq). The upper bits are derived from a base
value (@rb_next_seq) in a way that assumes @ulseq represents a
64bit number that is less than or equal to @rb_next_seq.

Until now this mapping has been correct for all call sites. However,
in a follow-up commit, values of @ulseq will be passed in that are
higher than the base value. This requires a change to how the 32bit
value is mapped to a 64bit sequence number.

Rather than mapping @ulseq such that the base value is the end of a
32bit block, map @ulseq such that the base value is in the middle of
a 32bit block. This allows supporting 31 bits before and after the
base value, which is deemed acceptable for the console sequence
number during runtime.

Here is an example to illustrate the previous and new mappings.

For a base value (@rb_next_seq) of 2 2000 0000...

Before this change the range of possible return values was:

1 2000 0001 to 2 2000 0000

__ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
__ulseq_to_u64seq(2000 0000) => 2 2000 0000
__ulseq_to_u64seq(2000 0001) => 1 2000 0001
__ulseq_to_u64seq(9fff ffff) => 1 9fff ffff
__ulseq_to_u64seq(a000 0000) => 1 a000 0000
__ulseq_to_u64seq(a000 0001) => 1 a000 0001

After this change the range of possible return values is:

1 a000 0001 to 2 a000 0000

__ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
__ulseq_to_u64seq(2000 0000) => 2 2000 0000
__ulseq_to_u64seq(2000 0001) => 2 2000 0001
__ulseq_to_u64seq(9fff ffff) => 2 9fff ffff
__ulseq_to_u64seq(a000 0000) => 2 a000 0000
__ulseq_to_u64seq(a000 0001) => 1 a000 0001
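
A sketch of a mapping with this property, where @base stands for the base
value (illustrative; argument list simplified):

	static inline u64 __ulseq_to_u64seq(u64 base, u32 ulseq)
	{
		/* signed 32bit distance: 31 bits before and after @base */
		return base + (s32)(ulseq - (u32)base);
	}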

[ john.ogness: Rewrite commit message. ]

Reported-by: Francesco Dolcini <francesco@dolcini.it>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-3-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-02-07 17:23:17 +01:00
John Ogness
5b73e706f0 printk: nbcon: Relocate 32bit seq macros
The macros __seq_to_nbcon_seq() and __nbcon_seq_to_seq() are
used to provide support for atomic handling of sequence numbers
on 32bit systems. Until now this was only used by nbcon.c,
which is why they were located in nbcon.c and include nbcon in
the name.

In a follow-up commit this functionality is also needed by
printk_ringbuffer. Rather than duplicating the functionality,
relocate the macros to printk_ringbuffer.h.

Also, since the macros will be no longer nbcon-specific, rename
them to __u64seq_to_ulseq() and __ulseq_to_u64seq().

This does not result in any functional change.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-2-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-02-07 17:23:17 +01:00
Peter Hilber
4b7f521229 timekeeping: Evaluate system_counterval_t.cs_id instead of .cs
Clocksource pointers can be problematic to obtain for drivers which are not
clocksource drivers themselves. In particular, the RFC virtio_rtc driver
[1] would require a new helper function to obtain a pointer to the ARM
Generic Timer clocksource. The ptp_kvm driver also required a similar
workaround.

Address this by evaluating the clocksource ID, rather than the clocksource
pointer, of struct system_counterval_t. By this, setting the clocksource
pointer becomes unneeded, and get_device_system_crosststamp() callers will
no longer need to supply clocksource pointers.

All relevant clocksource drivers provide the ID, so this change does
not alter the behaviour.
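
Sketched, a driver can now fill in the ID and leave the pointer alone
(CSID_ARM_ARCH_COUNTER is one of the existing clocksource IDs):

	struct system_counterval_t scv = {
		.cycles = cycles,
		.cs_id	= CSID_ARM_ARCH_COUNTER, /* instead of a clocksource pointer */
	};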

[1] https://lore.kernel.org/lkml/20231218073849.35294-1-peter.hilber@opensynergy.com/

Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240201010453.2212371-7-peter.hilber@opensynergy.com
2024-02-07 17:05:21 +01:00
Ricardo B. Marliere
49f1ff50d4 clockevents: Make clockevents_subsys const
Now that the driver core can properly handle constant struct bus_type,
move the clockevents_subsys variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.
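
The likely shape of the change (a one-word diff; field values assumed):

	-static struct bus_type clockevents_subsys = {
	+static const struct bus_type clockevents_subsys = {
	 	.name		= "clockevents",
	 	.dev_name	= "clockevent",
	 };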

Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20240204-bus_cleanup-time-v1-2-207ec18e24b8@marliere.net
2024-02-07 15:11:24 +01:00
Ricardo B. Marliere
2bc7fc24f9 clocksource: Make clocksource_subsys const
Now that the driver core can properly handle constant struct bus_type,
move the clocksource_subsys variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.

Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20240204-bus_cleanup-time-v1-1-207ec18e24b8@marliere.net
2024-02-07 15:11:24 +01:00
Tycho Andersen
0c9bd6bc4b pidfd: getfd should always report ESRCH if a task is exiting
We can get EBADF from pidfd_getfd() if a task is currently exiting,
which might be confusing. Let's check PF_EXITING, and just report ESRCH
if so.

I chose PF_EXITING, because it is set in exit_signals(), which is called
before exit_files(). Since ->exit_status is mostly set after
exit_files() in exit_notify(), using that still leaves a window open for
the race.
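
A rough sketch of the check (placement and the lookup helper are
assumptions):

	file = task_lookup_fdget_rcu(task, fd);
	if (!file)
		return (task->flags & PF_EXITING) ? -ESRCH : -EBADF;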

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tycho Andersen <tandersen@netflix.com>
Link: https://lore.kernel.org/r/20240206192357.81942-1-tycho@tycho.pizza
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-07 12:09:44 +01:00
Oleg Nesterov
83b290c9e3 pidfd: clone: allow CLONE_THREAD | CLONE_PIDFD together
copy_process() just needs to pass PIDFD_THREAD to __pidfd_prepare()
if clone_flags & CLONE_THREAD.

We can also add another CLONE_ flag (or perhaps reuse CLONE_DETACHED)
to enforce PIDFD_THREAD without CLONE_THREAD.

Originally-from: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240205145532.GA28823@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-06 14:39:32 +01:00
Oleg Nesterov
e2e8a142fb pidfd: exit: kill the no longer used thread_group_exited()
It was used by pidfd_poll() but now it has no callers.

If it finally finds a modular user we can revert this change, but note
that the comment above this helper and the changelog in 38fd525a4c61
("exit: Factor thread_group_exited out of pidfd_poll") are not accurate,
thread_group_exited() won't return true if all other threads have passed
exit_notify() and are zombies, it returns true only when all other threads
are completely gone. Not to mention that it can only work if the task
identified by @pid is a thread-group leader.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240205174347.GA31461@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-06 14:02:51 +01:00
Oleg Nesterov
9ed52108f6 pidfd: change do_notify_pidfd() to use __wake_up(poll_to_key(EPOLLIN))
rather than wake_up_all(). This way do_notify_pidfd() won't wake up the
POLLHUP-only waiters which wait for pid_task() == NULL.
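
Sketched as a diff (illustrative):

	-	wake_up_all(&pid->wait_pidfd);
	+	__wake_up(&pid->wait_pidfd, TASK_NORMAL, 0, poll_to_key(EPOLLIN));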

TODO:
    - as Christian pointed out, this asks for the new wake_up_all_poll()
      helper, it can already have other users.

    - we can probably discriminate the PIDFD_THREAD and non-PIDFD_THREAD
      waiters, but this needs more work. See
      https://lore.kernel.org/all/20240205140848.GA15853@redhat.com/

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240205141348.GA16539@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-06 13:52:51 +01:00
Frederic Weisbecker
dad6a09f31 hrtimer: Report offline hrtimer enqueue
The hrtimers migration on CPU-down hotplug process has been moved
earlier, before the CPU actually goes to die. This leaves a small window
of opportunity to queue an hrtimer in a blind spot, leaving it ignored.

For example a practical case has been reported with RCU waking up a
SCHED_FIFO task right before the CPUHP_AP_IDLE_DEAD stage, queuing that
way a sched/rt timer to the local offline CPU.

Make sure such situations never go unnoticed and warn when that happens.
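
Sketched, the change boils down to a warning at enqueue time (the field
name is an assumption):

	/* in enqueue_hrtimer() */
	WARN_ON_ONCE(!base->cpu_base->online);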

Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240129235646.3171983-4-boqun.feng@gmail.com
2024-02-06 10:56:35 +01:00
Kumar Kartikeya Dwivedi
6fceea0fa5 bpf: Transfer RCU lock state between subprog calls
Allow transferring an imbalanced RCU lock state between subprog calls
during verification. This allows patterns where a subprog call returns
with an RCU lock held, or a subprog call releases an RCU lock held by
the caller. Currently, the verifier would end up complaining if the RCU
lock is not released when processing an exit from a subprog, which is
non-ideal if its execution is supposed to be enclosed in an RCU read
section of the caller.

Instead, simply only check whether we are processing exit for frame#0
and do not complain on an active RCU lock otherwise. We only need to
update the check when processing BPF_EXIT insn, as copy_verifier_state
is already set up to do the right thing.
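
An illustrative BPF-side pattern this permits (program skeleton is
hypothetical):

	static int subprog_lock(void *ctx)
	{
		bpf_rcu_read_lock();	/* returns with the RCU lock held */
		return 0;
	}

	int prog(void *ctx)
	{
		subprog_lock(ctx);
		/* ... access RCU-protected data ... */
		bpf_rcu_read_unlock();
		return 0;
	}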

Suggested-by: David Vernet <void@manifault.com>
Tested-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20240205055646.1112186-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-05 20:00:14 -08:00
Kumar Kartikeya Dwivedi
a44b1334aa bpf: Allow calling static subprogs while holding a bpf_spin_lock
Currently, calling any helpers, kfuncs, or subprogs except the graph
data structure (lists, rbtrees) API kfuncs while holding a bpf_spin_lock
is not allowed. One of the original motivations of this decision was to
force the BPF programmer's hand into keeping the bpf_spin_lock critical
section small, and to ensure the execution time of the program does not
increase due to lock waiting times. In addition to this, some of the
helpers and kfuncs may be unsafe to call while holding a bpf_spin_lock.

However, when it comes to subprog calls, at least for static subprogs,
the verifier is able to explore their instructions during verification.
Therefore, it is similar in effect to having the same code inlined into
the critical section. Hence, not allowing static subprog calls in the
bpf_spin_lock critical section is mostly an annoyance that needs to be
worked around, without providing any tangible benefit.

Unlike static subprog calls, global subprog calls are not safe to permit
within the critical section, as the verifier does not explore them
during verification; therefore, whether the same lock will be taken
again, or unlocked, cannot be ascertained.

Therefore, allow calling static subprogs within a bpf_spin_lock critical
section, and only reject it in case the subprog linkage is global.
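
An illustrative now-accepted pattern (names hypothetical):

	static int nudge(struct elem *e)	/* static: the verifier explores it */
	{
		e->cnt++;
		return 0;
	}

	bpf_spin_lock(&e->lock);
	nudge(e);			/* allowed by this change */
	bpf_spin_unlock(&e->lock);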

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240204222349.938118-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-05 19:58:47 -08:00
Tejun Heo
40911d4457 Merge branch 'for-6.8-fixes' into for-6.9
The for-6.8-fixes commit ae9cc8956944 ("Revert "workqueue: Override implicit
ordered attribute in workqueue_apply_unbound_cpumask()") also fixes build for
for-6.9.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-05 15:49:47 -10:00
Tejun Heo
aac8a59537 Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()"
This reverts commit ca10d851b9ad0338c19e8e3089e24d565ebfffd7.

The commit allowed workqueue_apply_unbound_cpumask() to clear __WQ_ORDERED
on now removed implicitly ordered workqueues. This was incorrect in that
system-wide config change shouldn't break ordering properties of all
workqueues. The reason why apply_workqueue_attrs() path was allowed to do so
was because it was targeting the specific workqueue - either the workqueue
had WQ_SYSFS set or the workqueue user specifically tried to change
max_active, both of which indicate that the workqueue doesn't need to be
ordered.

The implicitly ordered workqueue promotion was removed by the previous
commit 3bc1e711c26b ("workqueue: Don't implicitly make UNBOUND workqueues w/
@max_active==1 ordered"). However, it didn't update this path and broke
build. Let's revert the commit which was incorrect in the first place which
also fixes build.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3bc1e711c26b ("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered")
Fixes: ca10d851b9ad ("workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()")
Cc: stable@vger.kernel.org # v6.6+
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-05 15:49:06 -10:00
Tejun Heo
3bc1e711c2 workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered
5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
automatically promoted UNBOUND workqueues w/ @max_active==1 to ordered
workqueues because UNBOUND workqueues w/ @max_active==1 used to be the way
to create ordered workqueues and the new NUMA support broke it. These
problems can be subtle and the fact that they can only trigger on NUMA
machines made them even more difficult to debug.

However, overloading the UNBOUND allocation interface this way creates other
issues. It's difficult to tell whether a given workqueue actually needs to
be ordered and users that legitimately want a min concurrency level wq
unexpectedly get an ordered one instead. With planned UNBOUND workqueue
updates to improve execution locality and more prevalence of chiplet designs
which can benefit from such improvements, this isn't a state we wanna be in
forever.

There aren't that many UNBOUND w/ @max_active==1 users in the tree and the
preceding patches audited all and converted them to
alloc_ordered_workqueue() as appropriate. This patch removes the implicit
promotion of UNBOUND w/ @max_active==1 workqueues to ordered ones.
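
For reference, the conversion applied by the preceding patches looks
roughly like this ("my_wq" is a placeholder):

	-	wq = alloc_workqueue("my_wq", WQ_UNBOUND, 1);
	+	wq = alloc_ordered_workqueue("my_wq", 0);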

v2: v1 patch incorrectly dropped !list_empty(&wq->pwqs) condition in
    apply_workqueue_attrs_locked() which spuriously triggers WARNING and
    fails workqueue creation. Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202304251050.45a5df1f-oliver.sang@intel.com
2024-02-05 14:19:10 -10:00
Tejun Heo
7245d24f87 backtracetest: Convert from tasklet to BH workqueue
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This patch converts backtracetest from tasklet to BH workqueue.

- Replace "irq" with "bh" in names and message to better reflect what's
  happening.

- Replace completion usage with a flush_work() call.
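
The generic shape of such a conversion (work item and callback names
hypothetical):

	static void backtrace_test_bh_fn(struct work_struct *work)
	{
		/* was the tasklet callback */
	}
	static DECLARE_WORK(backtrace_bh_work, backtrace_test_bh_fn);

	queue_work(system_bh_wq, &backtrace_bh_work);
	flush_work(&backtrace_bh_work);	/* replaces the completion */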

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
2024-02-05 13:22:34 -10:00
Kui-Feng Lee
df9705eaa0 bpf: Remove an unnecessary check.
The "i" here is always equal to "btf_type_vlen(t)" since
the "for_each_member()" loop never breaks.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240203055119.2235598-1-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-05 10:25:08 -08:00
Waiman Long
8eb17dc1a6 workqueue: Skip __WQ_DESTROYING workqueues when updating global unbound cpumask
Skip updating workqueues with __WQ_DESTROYING bit set when updating
global unbound cpumask to avoid unnecessary work and other complications.
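
Sketched, inside the loop over all workqueues (exact placement assumed):

	list_for_each_entry(wq, &workqueues, list) {
		if (wq->flags & __WQ_DESTROYING)
			continue;
		/* ... apply the new unbound cpumask ... */
	}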

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-05 07:52:22 -10:00
Wang Jinchao
96068b6030 workqueue: fix a typo in comment
There should be three, fix it.

Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-05 07:48:25 -10:00
Tejun Heo
4f19b8e01e Revert "workqueue: make wq_subsys const"
This reverts commit d412ace11144aa2bf692c7cf9778351efc15c827. This leads to
build failures as it depends on a driver-core commit 32f78abe59c7 ("driver
core: bus: constantify subsys_register() calls"). Let's drop it from wq tree
and route it through driver-core tree.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202402051505.kM9Rr3CJ-lkp@intel.com/
2024-02-05 07:19:54 -10:00
Jens Axboe
06b23f92af block: update cached timestamp post schedule/preemption
Mark the task as having a cached timestamp when assigning it, so we
can efficiently check whether it needs updating after being scheduled back in.
This covers both the actual schedule out case, which would've flushed
the plug, and the preemption case which doesn't touch the plugged
requests (for many reasons, one of them being then we'd need to have
preemption disabled around plug state manipulation).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-05 10:07:34 -07:00
Nikhil V
8bc2973635 PM: hibernate: Add support for LZ4 compression for hibernation
Extend the support for LZ4 compression to be used with hibernation.
The main idea is that different compression algorithms have different
characteristics, and hibernation may benefit when it uses any of these
algorithms: a default algorithm with a higher compression rate but
slower compression/decompression, and a secondary algorithm that is
faster at compression/decompression but has a lower compression rate.

LZ4 algorithm has better decompression speeds over LZO. This reduces
the hibernation image restore time.
As per test results:
                                     LZO              LZ4
 Size before compression (bytes)     682696704        682393600
 Size after compression (bytes)      146502402        155993547
 Decompression rate                  335.02 MB/s      501.05 MB/s
 Restore time                        4.4s             3.8s

LZO is the default compression algorithm used for hibernation. Enable
CONFIG_HIBERNATION_COMP_LZ4 to set the default compressor as LZ4.

Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-02-05 14:30:35 +01:00
Nikhil V
a06c6f5d3c PM: hibernate: Move to crypto APIs for LZO compression
Currently for hibernation, LZO is the only compression algorithm
available and uses the existing LZO library calls. However, there
is no flexibility to switch to other algorithms which provides better
results. The main idea is that different compression algorithms have
different characteristics and hibernation may benefit when it uses
alternate algorithms.

By moving to crypto based APIs, it lays a foundation to use other
compression algorithms for hibernation. There are no functional changes
introduced by this approach.
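
For reference, the crypto compression API pattern in question looks
roughly like this (buffer names illustrative):

	struct crypto_comp *cc = crypto_alloc_comp("lzo", 0, 0);
	unsigned int dlen = cmp_size;

	crypto_comp_compress(cc, unc, unc_len, cmp, &dlen);
	crypto_free_comp(cc);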

Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-02-05 14:28:54 +01:00
Nikhil V
89a807625f PM: hibernate: Rename lzo* to make it generic
Renaming lzo* to generic names, except for lzo_xxx() APIs. This is
used in the next patch where we move to crypto based APIs for
compression. There are no functional changes introduced by this
approach.

Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-02-05 14:28:54 +01:00
Rafael J. Wysocki
a6d38e991d PM: sleep: stats: Use locking in dpm_save_failed_dev()
Because dpm_save_failed_dev() may be called simultaneously by multiple
failing device PM functions, the state of the suspend_stats fields
updated by it may become inconsistent.

Prevent that from happening by using a lock in dpm_save_failed_dev().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-02-05 14:28:54 +01:00
Rafael J. Wysocki
9ff544fa5f PM: sleep: stats: Define suspend_stats next to the code using it
It is not necessary to define struct suspend_stats in a header file and the
suspend_stats variable in the core device system-wide PM code.  They both
can be defined in kernel/power/main.c, next to the sysfs and debugfs code
accessing suspend_stats, which can be static.

Modify the code in question in accordance with the above observation and
replace the static inline functions manipulating suspend_stats with
regular ones defined in kernel/power/main.c.

While at it, move the enum suspend_stat_step to the end of suspend.h which
is a more suitable place for it.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-02-05 14:28:19 +01:00
Rafael J. Wysocki
2231f78d3e PM: sleep: stats: Use unsigned int for success and failure counters
Change the type of the "success" and "fail" fields in struct
suspend_stats to unsigned int, because they cannot be negative.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-02-05 14:26:27 +01:00
Rafael J. Wysocki
b730bab0b9 PM: sleep: stats: Use an array of step failure counters
Instead of using a set of individual struct suspend_stats fields
representing suspend step failure counters, use an array of counters
indexed by enum suspend_stat_step for this purpose, which allows
dpm_save_failed_step() to increment the appropriate counter
automatically, so that its callers don't need to do that directly.

It also allows suspend_stats_show() to carry out a loop over the
counters array to print their values.

Because the counters cannot become negative, use unsigned int for
representing them.
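
One plausible shape (a sketch, not the exact patch):

	struct suspend_stats {
		unsigned int step_failures[SUSPEND_NR_STEPS];
		/* ... */
	};

	void dpm_save_failed_step(enum suspend_stat_step step)
	{
		suspend_stats.step_failures[step-1]++;
	}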

The only user-observable impact of this change is a different
ordering of entries in the suspend_stats debugfs file which is not
expected to matter.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-02-05 14:25:56 +01:00
Rafael J. Wysocki
bc88528cda PM: sleep: stats: Use array of suspend step names
Replace suspend_step_name() in the suspend statistics code with an array
of suspend step names which has fewer lines of code and less overhead.

While at it, remove two unnecessary line breaks in suspend_stats_show()
and adjust some white space in there to the kernel coding style for a
more consistent code layout.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-02-05 14:21:24 +01:00
Tejun Heo
4cb1ef6460 workqueue: Implement BH workqueues to eventually replace tasklets
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws, such as
the execution code accessing the tasklet item after the execution is
complete, which can lead to subtle use-after-free in certain usage scenarios,
and less-developed flush and cancel mechanisms.

This patch implements BH workqueues which share the same semantics and
features of regular workqueues but execute their work items in the softirq
context. As there is always only one BH execution context per CPU, none of
the concurrency management mechanisms applies and a BH workqueue can be
thought of as a convenience wrapper around softirq.

Except for the inability to sleep while executing and lack of max_active
adjustments, BH workqueues and work items should behave the same as regular
workqueues and work items.

Currently, the execution is hooked to tasklet[_hi]. However, the goal is to
convert all tasklet users over to BH workqueues. Once the conversion is
complete, tasklet can be removed and BH workqueues can directly take over
the tasklet softirqs.

system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in
tasklet, all existing tasklet users should be able to use the system BH
workqueues without creating their own workqueues.
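
Usage then looks like regular workqueues (sketch; the callback and work
item names are hypothetical):

	static void my_bh_fn(struct work_struct *work)
	{
		/* executes in softirq (BH) context; must not sleep */
	}
	static DECLARE_WORK(my_bh_work, my_bh_fn);

	queue_work(system_bh_wq, &my_bh_work);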

v3: - Add missing interrupt.h include.

v2: - Instead of using tasklets, hook directly into its softirq action
      functions - tasklet[_hi]_action(). This is slightly cheaper and closer
      to the eventual code structure we want to arrive at. Suggested by Lai.

    - Lai also pointed out several places which need NULL worker->task
      handling or can use clarification. Updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
Tested-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-02-04 11:28:06 -10:00
Tejun Heo
2fcdb1b444 workqueue: Factor out init_cpu_worker_pool()
Factor out init_cpu_worker_pool() from workqueue_init_early(). This is pure
reorganization in preparation of BH workqueue support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Allen Pais <allen.lkml@gmail.com>
2024-02-04 11:28:06 -10:00
Tejun Heo
c35aea39d1 workqueue: Update lock debugging code
These changes are in preparation of BH workqueue which will execute work
items from BH context.

- Update lock and RCU depth checks in process_one_work() so that it
  remembers and checks against the starting depths and prints out the depth
  changes.

- Factor out lockdep annotations in the flush paths into
  touch_{wq|work}_lockdep_map(). The work->lockdep_map touching is moved
  from __flush_work() to its callee - start_flush_work(). This brings it
  closer to the wq counterpart and will allow testing the associated wq's
  flags which will be needed to support BH workqueues. This is not expected
  to cause any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Allen Pais <allen.lkml@gmail.com>
2024-02-04 11:28:06 -10:00
Ricardo B. Marliere
d412ace111 workqueue: make wq_subsys const
Now that the driver core can properly handle constant struct bus_type,
move the wq_subsys variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-and-reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-04 11:23:25 -10:00
Tejun Heo
c70e1779b7 workqueue: Fix pwq->nr_in_flight corruption in try_to_grab_pending()
dd6c3c544126 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work
item handling") relocated pwq_dec_nr_in_flight() after
set_work_pool_and_keep_pending(). However, the latter destroys information
contained in work->data that's needed by pwq_dec_nr_in_flight() including
the flush color. With flush color destroyed, flush_workqueue() can stall
easily when mixed with cancel_work*() usages.

This is easily triggered by running xfstests generic/001 test on xfs:

     INFO: task umount:6305 blocked for more than 122 seconds.
     ...
     task:umount          state:D stack:13008 pid:6305  tgid:6305  ppid:6301   flags:0x00004000
     Call Trace:
      <TASK>
      __schedule+0x2f6/0xa20
      schedule+0x36/0xb0
      schedule_timeout+0x20b/0x280
      wait_for_completion+0x8a/0x140
      __flush_workqueue+0x11a/0x3b0
      xfs_inodegc_flush+0x24/0xf0
      xfs_unmountfs+0x14/0x180
      xfs_fs_put_super+0x3d/0x90
      generic_shutdown_super+0x7c/0x160
      kill_block_super+0x1b/0x40
      xfs_kill_sb+0x12/0x30
      deactivate_locked_super+0x35/0x90
      deactivate_super+0x42/0x50
      cleanup_mnt+0x109/0x170
      __cleanup_mnt+0x12/0x20
      task_work_run+0x60/0x90
      syscall_exit_to_user_mode+0x146/0x150
      do_syscall_64+0x5d/0x110
      entry_SYSCALL_64_after_hwframe+0x6c/0x74

Fix it by stashing work_data before calling set_work_pool_and_keep_pending()
and using the stashed value for pwq_dec_nr_in_flight().
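
Sketched (argument lists simplified):

	unsigned long work_data = *work_data_bits(work);	/* stash first */

	set_work_pool_and_keep_pending(work, pool->id);	/* destroys work->data */
	pwq_dec_nr_in_flight(pwq, work_data);		/* use the stashed copy */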

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Chandan Babu R <chandanbabu@kernel.org>
Link: http://lkml.kernel.org/r/87o7cxeehy.fsf@debian-BULLSEYE-live-builder-AMD64
Fixes: dd6c3c544126 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling")
2024-02-04 11:14:49 -10:00
Andrii Nakryiko
1eb986746a bpf: don't emit warnings intended for global subprogs for static subprogs
When btf_prepare_func_args() was generalized to handle both static and
global subprogs, a few warnings/errors that are meant only for global
subprog cases started to be emitted for static subprogs, where they are
sort of expected and irrelevant.

Stop polluting verifier logs with irrelevant scary-looking messages.

Fixes: e26080d0da87 ("bpf: prepare btf_prepare_func_args() for handling static subprogs")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240202190529.2374377-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-02 18:08:59 -08:00
Andrii Nakryiko
8f13c34087 bpf: handle trusted PTR_TO_BTF_ID_OR_NULL in argument check logic
Add PTR_TRUSTED | PTR_MAYBE_NULL modifiers for PTR_TO_BTF_ID to
check_reg_type() to support passing trusted nullable PTR_TO_BTF_ID
registers into global functions accepting `__arg_trusted __arg_nullable`
arguments. This hasn't been caught earlier because tests were either
passing known non-NULL PTR_TO_BTF_ID registers or known NULL (SCALAR)
registers.

When utilizing this functionality in a complicated real-world BPF
application that passes around PTR_TO_BTF_ID_OR_NULL, it became apparent
that the verifier rejects a valid case because check_reg_type() doesn't handle
this case explicitly. Existing check_reg_type() logic is already
anticipating this combination, so we just need to explicitly list this
combo in the switch statement.
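
Sketched, it is one more case label in the check_reg_type() switch
(illustrative):

	case PTR_TO_BTF_ID | PTR_TRUSTED:
	case PTR_TO_BTF_ID | PTR_TRUSTED | PTR_MAYBE_NULL:	/* newly listed */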

Fixes: e2b3c4ff5d18 ("bpf: add __arg_trusted global func arg tag")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240202190529.2374377-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-02 18:08:58 -08:00