linux

iv/linux

Author	SHA1	Message	Date
Linus Torvalds	161671a6eb	Probes fixes for v6.8-rc5: - fprobe: Fix to allocate entry_data_size buffer for each rethook instance. This fixes a buffer overrun bug (which leads a kernel crash) when fprobe user uses its entry_data in the entry_handler. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmXhIPgbHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bhwIH/1h5q2ZqNwNplDGVQpWU G1uuRHLlt47jwbGR3gfeYqVELtX0gFigBsmVouCKK3u3qerB1pDscYhULzKeHjS4 1HAsonj+vKY2pbdCaRnxRT7ejlEioN8CwPb1eqY6Bf6XQ2tJqS5gUqdej8JDJuY5 tpNAhHWqAnRvf1V5muclGAIU+9zavrAjbetpgrPEDIjE5idFvN+6D+4PXiM1cRIW KXW1oA7VlShdfY7xprSZ33Lx7C/dLWojM2P/z/BvqyXOf4f1ovqtGFJegW4n7V5b ZgamgOcSBwFLTVOTpOzn0peucduLFTfEWyC7fFGkHjBxTl2JypsQLEupdoaWLvBB el4= =bUgZ -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull fprobe fix from Masami Hiramatsu: - allocate entry_data_size buffer for each rethook instance. This fixes a buffer overrun bug (which leads a kernel crash) when fprobe user uses its entry_data in the entry_handler. * tag 'probes-fixes-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fprobe: Fix to allocate entry_data_size buffer with rethook instances	2024-03-01 11:17:23 -08:00
Uros Bizjak	ce3576ebd6	locking/rtmutex: Use try_cmpxchg_relaxed() in mark_rt_mutex_waiters() Use try_cmpxchg() instead of cmpxchg(ptr, old, new) == old. The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after CMPXCHG (and related move instruction in front of CMPXCHG). Also, try_cmpxchg() implicitly assigns old ptr value to "old" when CMPXCHG fails. There is no need to re-read the value in the loop. Note that the value from *ptr should be read using READ_ONCE() to prevent the compiler from merging, refetching or reordering the read. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240124104953.612063-1-ubizjak@gmail.com	2024-03-01 13:02:05 +01:00
Christian Brauner	b28ddcc32d	pidfs: convert to path_from_stashed() helper Moving pidfds from the anonymous inode infrastructure to a separate tiny in-kernel filesystem similar to sockfs, pipefs, and anon_inodefs causes selinux denials and thus various userspace components that make heavy use of pidfds to fail as pidfds used anon_inode_getfile() which aren't subject to any LSM hooks. But dentry_open() is and that would cause regressions. The failures that are seen are selinux denials. But the core failure is dbus-broker. That cascades into other services failing that depend on dbus-broker. For example, when dbus-broker fails to start polkit and all the others won't be able to work because they depend on dbus-broker. The reason for dbus-broker failing is because it doesn't handle failures for SO_PEERPIDFD correctly. Last kernel release we introduced SO_PEERPIDFD (and SCM_PIDFD). SO_PEERPIDFD allows dbus-broker and polkit and others to receive a pidfd for the peer of an AF_UNIX socket. This is the first time in the history of Linux that we can safely authenticate clients in a race-free manner. dbus-broker immediately made use of this but messed up the error checking. It only allowed EINVAL as a valid failure for SO_PEERPIDFD. That's obviously problematic not just because of LSM denials but because of seccomp denials that would prevent SO_PEERPIDFD from working; or any other new error code from there. So this is catching a flawed implementation in dbus-broker as well. It has to fallback to the old pid-based authentication when SO_PEERPIDFD doesn't work no matter the reasons otherwise it'll always risk such failures. So overall that LSM denial should not have caused dbus-broker to fail. It can never assume that a feature released one kernel ago like SO_PEERPIDFD can be assumed to be available. So, the next fix separate from the selinux policy update is to try and fix dbus-broker at [3]. That should make it into Fedora as well. In addition the selinux reference policy should also be updated. See [4] for that. If Selinux is in enforcing mode in userspace and it encounters anything that it doesn't know about it will deny it by default. And the policy is entirely in userspace including declaring new types for stuff like nsfs or pidfs to allow it. For now we continue to raise S_PRIVATE on the inode if it's a pidfs inode which means things behave exactly like before. Link: https://bugzilla.redhat.com/show_bug.cgi?id=2265630 Link: https://github.com/fedora-selinux/selinux-policy/pull/2050 Link: https://github.com/bus1/dbus-broker/pull/343 [3] Link: https://github.com/SELinuxProject/refpolicy/pull/762 [4] Reported-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20240222190334.GA412503@dev-arch.thelio-3990X Link: https://lore.kernel.org/r/20240218-neufahrzeuge-brauhaus-fb0eb6459771@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-03-01 12:24:53 +01:00
Christian Brauner	cb12fd8e0d	pidfd: add pidfs This moves pidfds from the anonymous inode infrastructure to a tiny pseudo filesystem. This has been on my todo for quite a while as it will unblock further work that we weren't able to do simply because of the very justified limitations of anonymous inodes. Moving pidfds to a tiny pseudo filesystem allows: * statx() on pidfds becomes useful for the first time. * pidfds can be compared simply via statx() and then comparing inode numbers. * pidfds have unique inode numbers for the system lifetime. * struct pid is now stashed in inode->i_private instead of file->private_data. This means it is now possible to introduce concepts that operate on a process once all file descriptors have been closed. A concrete example is kill-on-last-close. * file->private_data is freed up for per-file options for pidfds. * Each struct pid will refer to a different inode but the same struct pid will refer to the same inode if it's opened multiple times. In contrast to now where each struct pid refers to the same inode. Even if we were to move to anon_inode_create_getfile() which creates new inodes we'd still be associating the same struct pid with multiple different inodes. The tiny pseudo filesystem is not visible anywhere in userspace exactly like e.g., pipefs and sockfs. There's no lookup, there's no complex inode operations, nothing. Dentries and inodes are always deleted when the last pidfd is closed. We allocate a new inode for each struct pid and we reuse that inode for all pidfds. We use iget_locked() to find that inode again based on the inode number which isn't recycled. We allocate a new dentry for each pidfd that uses the same inode. That is similar to anonymous inodes which reuse the same inode for thousands of dentries. For pidfds we're talking way less than that. There usually won't be a lot of concurrent openers of the same struct pid. They can probably often be counted on two hands. I know that systemd does use separate pidfd for the same struct pid for various complex process tracking issues. So I think with that things actually become way simpler. Especially because we don't have to care about lookup. Dentries and inodes continue to be always deleted. The code is entirely optional and fairly small. If it's not selected we fallback to anonymous inodes. Heavily inspired by nsfs which uses a similar stashing mechanism just for namespaces. Link: https://lore.kernel.org/r/20240213-vfs-pidfd_fs-v1-2-f863f58cfce1@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-03-01 12:23:37 +01:00
Masami Hiramatsu (Google)	6572786006	fprobe: Fix to allocate entry_data_size buffer with rethook instances Fix to allocate fprobe::entry_data_size buffer with rethook instances. If fprobe doesn't allocate entry_data_size buffer for each rethook instance, fprobe entry handler can cause a buffer overrun when storing entry data in entry handler. Link: https://lore.kernel.org/all/170920576727.107552.638161246679734051.stgit@devnote2/ Reported-by: Jiri Olsa <olsajiri@gmail.com> Closes: https://lore.kernel.org/all/Zd9eBn2FTQzYyg7L@krava/ Fixes: 4bbd93455659 ("kprobes: kretprobe scalability improvement") Cc: stable@vger.kernel.org Tested-by: Jiri Olsa <olsajiri@gmail.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2024-03-01 09:18:24 +09:00
Kees Cook	896880ff30	bpf: Replace bpf_lpm_trie_key 0-length array with flexible array Replace deprecated 0-length array in struct bpf_lpm_trie_key with flexible array. Found with GCC 13: ../kernel/bpf/lpm_trie.c:207:51: warning: array subscript i is outside array bounds of 'const __u8[0]' {aka 'const unsigned char[]'} [-Warray-bounds=] 207 \| (__be16 )&key->data[i]); \| ^~~~~~~~~~~~~ ../include/uapi/linux/swab.h:102:54: note: in definition of macro '__swab16' 102 \| #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x)) \| ^ ../include/linux/byteorder/generic.h:97:21: note: in expansion of macro '__be16_to_cpu' 97 \| #define be16_to_cpu __be16_to_cpu \| ^~~~~~~~~~~~~ ../kernel/bpf/lpm_trie.c:206:28: note: in expansion of macro 'be16_to_cpu' 206 \| u16 diff = be16_to_cpu((__be16 )&node->data[i] ^ \| ^~~~~~~~~~~ In file included from ../include/linux/bpf.h:7: ../include/uapi/linux/bpf.h:82:17: note: while referencing 'data' 82 \| __u8 data[0]; /* Arbitrary size / \| ^~~~ And found at run-time under CONFIG_FORTIFY_SOURCE: UBSAN: array-index-out-of-bounds in kernel/bpf/lpm_trie.c:218:49 index 0 is out of range for type '__u8 []' Changing struct bpf_lpm_trie_key is difficult since has been used by userspace. For example, in Cilium: struct egress_gw_policy_key { struct bpf_lpm_trie_key lpm_key; __u32 saddr; __u32 daddr; }; While direct references to the "data" member haven't been found, there are static initializers what include the final member. For example, the "{}" here: struct egress_gw_policy_key in_key = { .lpm_key = { 32 + 24, {} }, .saddr = CLIENT_IP, .daddr = EXTERNAL_SVC_IP & 0Xffffff, }; To avoid the build time and run time warnings seen with a 0-sized trailing array for struct bpf_lpm_trie_key, introduce a new struct that correctly uses a flexible array for the trailing bytes, struct bpf_lpm_trie_key_u8. As part of this, include the "header" portion (which is just the "prefixlen" member), so it can be used by anything building a bpf_lpr_trie_key that has trailing members that aren't a u8 flexible array (like the self-test[1]), which is named struct bpf_lpm_trie_key_hdr. Unfortunately, C++ refuses to parse the __struct_group() helper, so it is not possible to define struct bpf_lpm_trie_key_hdr directly in struct bpf_lpm_trie_key_u8, so we must open-code the union directly. Adjust the kernel code to use struct bpf_lpm_trie_key_u8 through-out, and for the selftest to use struct bpf_lpm_trie_key_hdr. Add a comment to the UAPI header directing folks to the two new options. Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Closes: https://paste.debian.net/hidden/ca500597/ Link: https://lore.kernel.org/all/202206281009.4332AA33@keescook/ [1] Link: https://lore.kernel.org/bpf/20240222155612.it.533-kees@kernel.org	2024-02-29 22:52:43 +01:00
Tejun Heo	1acd92d95f	workqueue: Drain BH work items on hot-unplugged CPUs Boqun pointed out that workqueues aren't handling BH work items on offlined CPUs. Unlike tasklet which transfers out the pending tasks from CPUHP_SOFTIRQ_DEAD, BH workqueue would just leave them pending which is problematic. Note that this behavior is specific to BH workqueues as the non-BH per-CPU workers just become unbound when the CPU goes offline. This patch fixes the issue by draining the pending BH work items from an offlined CPU from CPUHP_SOFTIRQ_DEAD. Because work items carry more context, it's not as easy to transfer the pending work items from one pool to another. Instead, run BH work items which execute the offlined pools on an online CPU. Note that this assumes that no further BH work items will be queued on the offlined CPUs. This assumption is shared with tasklet and should be fine for conversions. However, this issue also exists for per-CPU workqueues which will just keep executing work items queued after CPU offline on unbound workers and workqueue should reject per-CPU and BH work items queued on offline CPUs. This will be addressed separately later. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-reviewed-by: Boqun Feng <boqun.feng@gmail.com> Link: http://lkml.kernel.org/r/Zdvw0HdSXcU3JZ4g@boqun-archlinux	2024-02-29 11:51:24 -10:00
Kamalesh Babulal	25125a4762	cgroup/cpuset: Fix retval in update_cpumask() The update_cpumask(), checks for newly requested cpumask by calling validate_change(), which returns an error on passing an invalid set of cpu(s). Independent of the error returned, update_cpumask() always returns zero, suppressing the error and returning success to the user on writing an invalid cpu range for a cpuset. Fix it by returning retval instead, which is returned by validate_change(). Fixes: 99fe36ba6fc1 ("cgroup/cpuset: Improve temporary cpumasks handling") Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Reviewed-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org # v6.6+ Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-29 10:30:35 -10:00
Xiongwei Song	3ab67a9ce8	cgroup/cpuset: Mark memory_spread_slab as obsolete We've removed the SLAB allocator, cpuset_do_slab_mem_spread() and SLAB_MEM_SPREAD, memory_spread_slab is a no-op now. We can mark memory_spread_slab as obsolete in case someone still wants to use it after cpuset_do_slab_mem_spread() removed. For more details, please check [1]. [1] https://lore.kernel.org/lkml/32bc1403-49da-445a-8c00-9686a3b0d6a3@redhat.com/T/#m8e292e21b00f95a4bb8086371fa7387fa4ea8f60 tj: Description and cosmetic updates. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-29 10:28:19 -10:00
Maulik Shah	9bc4ffd32e	PM: suspend: Set mem_sleep_current during kernel command line setup psci_init_system_suspend() invokes suspend_set_ops() very early during bootup even before kernel command line for mem_sleep_default is setup. This leads to kernel command line mem_sleep_default=s2idle not working as mem_sleep_current gets changed to deep via suspend_set_ops() and never changes back to s2idle. Set mem_sleep_current along with mem_sleep_default during kernel command line setup as default suspend mode. Fixes: faf7ec4a92c0 ("drivers: firmware: psci: add system suspend support") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Maulik Shah <quic_mkshah@quicinc.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-29 20:33:00 +01:00
Arnd Bergmann	a184d9835a	tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n In configurations with CONFIG_TICK_ONESHOT but no CONFIG_NO_HZ or CONFIG_HIGH_RES_TIMERS, tick_sched_timer_dying() is stubbed out, but still defined as a global function as well: kernel/time/tick-sched.c:1599:6: error: redefinition of 'tick_sched_timer_dying' 1599 \| void tick_sched_timer_dying(int cpu) \| ^ kernel/time/tick-sched.h:111:20: note: previous definition is here 111 \| static inline void tick_sched_timer_dying(int cpu) { } \| ^ This configuration only appears with ARM CONFIG_ARCH_BCM_MOBILE, which should not actually select CONFIG_TICK_ONESHOT. Adjust the #ifdef for the stub to match the condition for building the tick-sched.c file for consistency with the definition and to avoid the build regression. Fixes: 3aedb7fcd88a ("tick/sched: Remove useless oneshot ifdeffery") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240228123850.3499024-1-arnd@kernel.org	2024-02-29 17:41:29 +01:00
Waiman Long	66f40b926d	cgroup/cpuset: Fix a memory leak in update_exclusive_cpumask() Fix a possible memory leak in update_exclusive_cpumask() by moving the alloc_cpumasks() down after the validate_change() check which can fail and still before the temporary cpumasks are needed. Fixes: e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") Reported-and-tested-by: Mirsad Todorovac <mirsad.todorovac@alu.hr> Closes: https://lore.kernel.org/lkml/14915689-27a3-4cd8-80d2-9c30d0c768b6@alu.unizg.hr Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v6.7+	2024-02-28 08:02:55 -10:00
Christian Brauner	50f4f2d197	pidfd: move struct pidfd_fops Move the pidfd file operations over to their own file in preparation of implementing pidfs and to isolate them from other mostly unrelated functionality in other files. Link: https://lore.kernel.org/r/20240213-vfs-pidfd_fs-v1-1-f863f58cfce1@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-28 17:17:07 +01:00
Alex Shi	54de442747	sched/topology: Rename SD_SHARE_PKG_RESOURCES to SD_SHARE_LLC SD_SHARE_PKG_RESOURCES is a bit of a misnomer: its naming suggests that it's sharing all 'package resources' - while in reality it's specifically for sharing the LLC only. Rename it to SD_SHARE_LLC to reduce confusion. [ mingo: Rewrote the confusing changelog as well. ] Suggested-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Alex Shi <alexs@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Barry Song <baohua@kernel.org> Link: https://lore.kernel.org/r/20240210113924.1130448-5-alexs@kernel.org	2024-02-28 15:43:17 +01:00
Alex Shi	fbc449864e	sched/fair: Check the SD_ASYM_PACKING flag in sched_use_asym_prio() sched_use_asym_prio() checks whether CPU priorities should be used. It makes sense to check for the SD_ASYM_PACKING() inside the function. Since both sched_asym() and sched_group_asym() use sched_use_asym_prio(), remove the now superfluous checks for the flag in various places. Signed-off-by: Alex Shi <alexs@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240210113924.1130448-4-alexs@kernel.org	2024-02-28 15:43:17 +01:00
Alex Shi	45de206234	sched/fair: Rework sched_use_asym_prio() and sched_asym_prefer() sched_use_asym_prio() and sched_asym_prefer() are used together in various places. Consolidate them into a single function sched_asym(). The existing sched_asym() function is only used when collecting statistics of a scheduling group. Rename it as sched_group_asym(), and remove the obsolete function description. This makes the code easier to read. No functional changes. Signed-off-by: Alex Shi <alexs@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240210113924.1130448-3-alexs@kernel.org	2024-02-28 15:43:17 +01:00
Alex Shi	5a64983731	sched/fair: Remove unused parameter from sched_asym() The 'sds' argument is not used in the sched_asym() function anymore, remove it. Fixes: c9ca07886aaa ("sched/fair: Do not even the number of busy CPUs via asym_packing") Signed-off-by: Alex Shi <alexs@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240210113924.1130448-2-alexs@kernel.org	2024-02-28 15:43:08 +01:00
Alex Shi	d654c8ddde	sched/topology: Remove duplicate descriptions from TOPOLOGY_SD_FLAGS These flags are already documented in include/linux/sched/sd_flags.h. Also, add missing SD_CLUSTER and keep the comment on SD_ASYM_PACKING as it is a special case. Suggested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Alex Shi <alexs@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240210113924.1130448-1-alexs@kernel.org	2024-02-28 15:29:21 +01:00
David Vernet	7e9f7d17fe	sched/fair: Simplify the update_sd_pick_busiest() logic When comparing the current struct sched_group with the yet-busiest domain in update_sd_pick_busiest(), if the two groups have the same group type, we're currently doing a bit of unnecessary work for any group >= group_misfit_task. We're comparing the two groups, and then returning only if false (the group in question is not the busiest). Otherwise, we break out, do an extra unnecessary conditional check that's vacuously false for any group type > group_fully_busy, and then always return true. Let's just return directly in the switch statement instead. This doesn't change the size of vmlinux with llvm 17 (not surprising given that all of this is inlined in load_balance()), but it does shrink load_balance() by 88 bytes on x86. Given that it also improves readability, this seems worth doing. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20240206043921.850302-4-void@manifault.com	2024-02-28 15:19:26 +01:00
David Vernet	7f1a722971	sched/fair: Do strict inequality check for busiest misfit task group In update_sd_pick_busiest(), when comparing two sched groups that are both of type group_misfit_task, we currently consider the new group as busier than the current busiest group even if the new group has the same misfit task load as the current busiest group. We can avoid some unnecessary writes if we instead only consider the newest group to be the busiest if it has a higher load than the current busiest. This matches the behavior of other group types where we compare load, such as two groups that are both overloaded. Let's update the group_misfit_task type comparison to also only update the busiest group in the event of strict inequality. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20240206043921.850302-3-void@manifault.com	2024-02-28 15:19:24 +01:00
David Vernet	9dfbc26d27	sched/fair: Remove unnecessary goto in update_sd_lb_stats() In update_sd_lb_stats(), when we're iterating over the sched groups that comprise a sched domain, we're skipping the call to update_sd_pick_busiest() for the sched group that contains the local / destination CPU. We use a goto to skip the call, but we could just as easily check !local_group, as there's no other logic that we need to skip with the goto. Let's remove the goto, and check for !local_group in the if statement instead. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20240206043921.850302-2-void@manifault.com	2024-02-28 15:19:23 +01:00
Keisuke Nishimura	23d04d8c6b	sched/fair: Take the scheduling domain into account in select_idle_core() When picking a CPU on task wakeup, select_idle_core() has to take into account the scheduling domain where the function looks for the CPU. This is because the "isolcpus" kernel command line option can remove CPUs from the domain to isolate them from other SMT siblings. This change replaces the set of CPUs allowed to run the task from p->cpus_ptr by the intersection of p->cpus_ptr and sched_domain_span(sd) which is stored in the 'cpus' argument provided by select_idle_cpu(). Fixes: 9fe1f127b913 ("sched/fair: Merge select_idle_core/cpu()") Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Signed-off-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240110131707.437301-2-keisuke.nishimura@inria.fr	2024-02-28 15:15:49 +01:00
Keisuke Nishimura	8aeaffef8c	sched/fair: Take the scheduling domain into account in select_idle_smt() When picking a CPU on task wakeup, select_idle_smt() has to take into account the scheduling domain of @target. This is because the "isolcpus" kernel command line option can remove CPUs from the domain to isolate them from other SMT siblings. This fix checks if the candidate CPU is in the target scheduling domain. Commit: df3cb4ea1fb6 ("sched/fair: Fix wrong cpu selecting from isolated domain") ... originally introduced this fix by adding the check of the scheduling domain in the loop. However, commit: 3e6efe87cd5cc ("sched/fair: Remove redundant check in select_idle_smt()") ... accidentally removed the check. Bring it back. Fixes: 3e6efe87cd5c ("sched/fair: Remove redundant check in select_idle_smt()") Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Signed-off-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240110131707.437301-1-keisuke.nishimura@inria.fr	2024-02-28 15:15:48 +01:00
Shrikanth Hegde	a6965b3188	sched/fair: Add READ_ONCE() and use existing helper function to access ->avg_irq Use existing helper function cpu_util_irq() instead of open-coding access to ->avg_irq. During review it was noted that ->avg_irq could be updated by a different CPU than the one which is trying to access it. ->avg_irq is updated with WRITE_ONCE(), use READ_ONCE to access it in order to avoid any compiler optimizations. Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240101154624.100981-3-sshegde@linux.vnet.ibm.com	2024-02-28 15:11:15 +01:00
Shrikanth Hegde	8b936fc1d8	sched/fair: Use existing helper functions to access ->avg_rt and ->avg_dl There are helper functions called cpu_util_dl() and cpu_util_rt() which give the average utilization of DL and RT respectively. But there are a few places in code where access to these variables is open-coded. Instead use the helper function so that code becomes simpler and easier to maintain later on. No functional changes intended. Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240101154624.100981-2-sshegde@linux.vnet.ibm.com	2024-02-28 15:11:14 +01:00
Rick Edgecombe	b9fa16949d	dma-direct: Leak pages on dma_set_decrypted() failure On TDX it is possible for the untrusted host to cause set_memory_encrypted() or set_memory_decrypted() to fail such that an error is returned and the resulting memory is shared. Callers need to take care to handle these errors to avoid returning decrypted (shared) memory to the page allocator, which could lead to functional or security issues. DMA could free decrypted/shared pages if dma_set_decrypted() fails. This should be a rare case. Just leak the pages in this case instead of freeing them. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2024-02-28 05:31:38 -08:00
ZhangPeng	02e7656970	swiotlb: add debugfs to track swiotlb transient pool usage Introduce a new debugfs interface io_tlb_transient_nslabs. The device driver can create a new swiotlb transient memory pool once default memory pool is full. To export the swiotlb transient memory pool usage via debugfs would help the user estimate the size of transient swiotlb memory pool or analyze device driver memory leak issue. Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2024-02-28 05:31:38 -08:00
Namhyung Kim	f3e3620f1a	locking/percpu-rwsem: Trigger contention tracepoints only if contended We mistakenly always fire lock contention tracepoints in the writer path, while it should be conditional on the trylock result. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20231108215322.2845536-1-namhyung@kernel.org	2024-02-28 13:10:29 +01:00
Waiman Long	d566c78659	locking/rwsem: Clarify that RWSEM_READER_OWNED is just a hint Clarify in the comments that the RWSEM_READER_OWNED bit in the owner field is just a hint, not an authoritative state of the rwsem. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Link: https://lore.kernel.org/r/20240222150540.79981-4-longman@redhat.com	2024-02-28 13:08:38 +01:00
Waiman Long	ca4bc2e07b	locking/qspinlock: Fix 'wait_early' set but not used warning When CONFIG_LOCK_EVENT_COUNTS is off, the wait_early variable will be set but not used. This is expected. Recent compilers will not generate wait_early code in this case. Add the __maybe_unused attribute to wait_early for suppressing this W=1 warning. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240222150540.79981-2-longman@redhat.com Closes: https://lore.kernel.org/oe-kbuild-all/202312260422.f4pK3f9m-lkp@intel.com/	2024-02-28 13:08:37 +01:00
David Gow	133e267ef4	time: test: Fix incorrect format specifier 'days' is a s64 (from div_s64), and so should use a %lld specifier. This was found by extending KUnit's assertion macros to use gcc's __printf attribute. Fixes: 276010551664 ("time: Improve performance of time64_to_tm()") Signed-off-by: David Gow <davidgow@google.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>	2024-02-27 15:26:08 -07:00
Ingo Molnar	9b9c280b9a	Merge branch 'x86/urgent' into x86/apic, to resolve conflicts Conflicts: arch/x86/kernel/cpu/common.c arch/x86/kernel/cpu/intel.c Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-02-27 10:09:49 +01:00
Ingo Molnar	4c8a498541	smp: Avoid 'setup_max_cpus' namespace collision/shadowing bringup_nonboot_cpus() gets passed the 'setup_max_cpus' variable in init/main.c - which is also the name of the parameter, shadowing the name. To reduce confusion and to allow the 'setup_max_cpus' value to be #defined in the <linux/smp.h> header, use the 'max_cpus' name for the function parameter name. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org	2024-02-27 10:05:32 +01:00
Boqun Feng	3add00be5f	Merge branches 'rcu-doc.2024.02.14a', 'rcu-nocb.2024.02.14a', 'rcu-exp.2024.02.14a', 'rcu-tasks.2024.02.26a' and 'rcu-misc.2024.02.14a' into rcu.2024.02.26a	2024-02-26 17:37:25 -08:00
Frederic Weisbecker	19b344a91f	timers: Assert no next dyntick timer look-up while CPU is offline The next timer (re-)evaluation, with the purpose of entering/updating the dyntick mode, can happen from 3 sites and none of them are relevant while the CPU is offline: 1) The idle loop: a) From the quick check helping the cpuidle governor to heuristically predict the best C-state. b) While stopping the tick. But if the CPU is offline, the tick has been cancelled and there is consequently no need to further stop the tick. 2) Remote expiry: when a CPU remotely expires global timers on behalf of another CPU, the latter target's next timer is re-evaluated afterwards. However remote expîry doesn't happen on offline CPUs. 3) IRQ exit: on nohz_full mode, the tick is (re-)evaluated on IRQ exit. But full dynticks is disabled on offline CPUs. Therefore it is safe to assume that no next dyntick timer lookup can be performed on offline CPUs. Assert this expectation to report any surprise. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-17-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	500f8f9bce	tick: Assume timekeeping is correctly handed over upon last offline idle call The timekeeping duty is handed over from the outgoing CPU on stop machine, then the oneshot tick is stopped right after. Therefore it's guaranteed that the current CPU isn't the timekeeper upon its last call to idle. Besides, calling tick_nohz_idle_stop_tick() while the dying CPU goes into idle suggests that the tick is going to be stopped while it is actually stopped already from the appropriate CPU hotplug state. Remove the confusing call and the obsolete case handling and convert it to a sanity check that verifies the above assumption. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-16-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	3f69d04e14	tick: Shut down low-res tick from dying CPU The timekeeping duty is handed over from the outgoing CPU within stop machine. This works well if CONFIG_NO_HZ_COMMON=n or the tick is in high-res mode. However in low-res dynticks mode, the tick isn't cancelled until the clockevent is shut down, which can happen later. The tick may therefore fire again once IRQs are re-enabled on stop machine and until IRQs are disabled for good upon the last call to idle. That's so many opportunities for a timekeeper to go idle and the outgoing CPU to take over that duty. This is why tick_nohz_idle_stop_tick() is called one last time on idle if the CPU is seen offline: so that the timekeeping duty is handed over again in case the CPU has re-taken the duty. This means there are two timekeeping handovers on CPU down hotplug with different undocumented constraints and purposes: 1) A handover on stop machine for !dynticks \|\| highres. All online CPUs are guaranteed to be non-idle and the timekeeping duty can be safely handed-over. The hrtimer tick is cancelled so it is guaranteed that in dynticks mode the outgoing CPU won't take again the duty. 2) A handover on last idle call for dynticks && lowres. Setting the duty to TICK_DO_TIMER_NONE makes sure that a CPU will take over the timekeeping. Prepare for consolidating the handover to a single place (the first one) with shutting down the low-res tick as well from tick_cancel_sched_timer() as well. This will simplify the handover and unify the tick cancellation between high-res and low-res. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-15-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	7988e5ae2b	tick: Split nohz and highres features from nohz_mode The nohz mode field tells about low resolution nohz mode or high resolution nohz mode but it doesn't tell about high resolution non-nohz mode. In order to retrieve the latter state, tick_cancel_sched_timer() must fiddle with struct hrtimer's internals to guess if the tick has been initialized in high resolution. Move instead the nohz mode field information into the tick flags and provide two new bits: one to know if the tick is in nohz mode and another one to know if the tick is in high resolution. The combination of those two flags provides all the needed informations to determine which of the three tick modes is running. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-14-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	a478ffb2ae	tick: Move individual bit features to debuggable mask accesses The individual bitfields of struct tick_sched must be modified from IRQs disabled places, otherwise local modifications can race due to them sharing the same memory storage. The recent move of the "got_idle_tick" bitfield to its own storage shows that the use of these bitfields, as pretty as they look, can be as much error prone. In order to avoid future issues of the like and make sure that those bitfields are safely accessed, move those flags to an explicit mask along with a mutator function performing the basic IRQs disabled sanity check. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-13-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	3ce74f1a85	tick: Move got_idle_tick away from common flags tick_nohz_idle_got_tick() is called by cpuidle_reflect() within the idle loop with interrupts enabled. This function modifies the struct tick_sched's bitfield "got_idle_tick". However this bitfield is stored within the same mask as other bitfields that can be modified from interrupts. Fortunately so far it looks like the only race that can happen is while writing ->got_idle_tick to 0, an interrupt fires and writes the ->idle_active field to 0. It's then possible that the interrupted write to ->got_idle_tick writes back the old value of ->idle_active back to 1. However if that happens, the worst possible outcome is that the time spent between that interrupt and the upcoming call to tick_nohz_idle_exit() is accounted as idle, which is negligible quantity. Still all the bitfield writes within this struct tick_sched's shadow mask should be IRQ-safe. Therefore move this bitfield out to its own storage to avoid further suprises. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-12-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	d9b1865c86	tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode The full-nohz update function checks if the nohz mode is active before proceeding. It considers one exception though: if the tick is already stopped even though the nohz mode is inactive, it still moves on in order to update/restart the tick if needed. However in order for the tick to be stopped, the nohz_mode has to be either NOHZ_MODE_LOWRES or NOHZ_MODE_HIGHRES. Therefore it doesn't make sense to test if the tick is stopped before verifying NOHZ_MODE_INACTIVE mode. Remove the needless related condition. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-11-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	ef8969bb55	tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING The broadcast shutdown code is executed through a random explicit call within stop machine from the outgoing CPU. However the tick broadcast is a midware between the tick callback and the clocksource, therefore it makes more sense to shut it down after the tick callback and before the clocksource drivers. Move it instead to the common tick shutdown CPU hotplug state where related operations can be ordered from highest to lowest level. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-10-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	f04e51220a	tick: Move tick cancellation up to CPUHP_AP_TICK_DYING The tick hrtimer is cancelled right before hrtimers are migrated. This is done from the hrtimer subsystem even though it shouldn't know about its actual users. Move instead the tick hrtimer cancellation to the relevant CPU hotplug state that aims at centralizing high level tick shutdown operations so that the related flow is easy to follow. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-9-frederic@kernel.org	2024-02-26 11:37:31 +01:00
Frederic Weisbecker	3ad6eb0683	tick: Start centralizing tick related CPU hotplug operations During the CPU offlining process, the various timer tick features are shut down from scattered places, sometimes from teardown callbacks on stop machine, sometimes through explicit calls, sometimes from the control CPU after the CPU died. The reason why these shutdown operations are spread around is not always clear and it makes the tick lifecycle hard to follow. The tick should be shut down in order from highest to lowest level: On stop machine from the dying CPU (high-level): 1) Hand-over the timekeeping duty (tick_handover_do_timer()) 2) Cancel the tick implementation called by the clockevent callback (tick_cancel_sched_timer()) 3) Shutdown broadcasting (tick_offline_cpu() / tick_broadcast_offline()) On stop machine from the dying CPU (low-level): 4) Shutdown clockevents drivers (CPUHP_AP_*_TIMER_STARTING states) From the control CPU after the CPU died (low-level): 5) Shutdown/unregister/cleanup clockevents for the dead CPU (tick_cleanup_dead_cpu()) Instead the current order is 2, 4 (both from CPU hotplug states), then 1 and 3 through direct calls. This layout and order don't make much sense. The operations 1, 2, 3 should be gathered together and in order. Sort this situation with creating a new TICK shut-down CPU hotplug state and start with introducing the timekeeping duty hand-over there. The state must precede hrtimers migration because the tick hrtimer will be stopped from it in a further patch. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-8-frederic@kernel.org	2024-02-26 11:37:31 +01:00
Frederic Weisbecker	60313c21c3	tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick() The tick sched structure is already cleared from tick_cancel_sched_timer(), so there is no need to clear that field again. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-7-frederic@kernel.org	2024-02-26 11:37:31 +01:00
Frederic Weisbecker	3650f49bfb	tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick() tick_nohz_stop_sched_tick() is only about NOHZ_full and not about dynticks-idle. Reflect that in the function name to avoid confusion. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-6-frederic@kernel.org	2024-02-26 11:37:31 +01:00
Frederic Weisbecker	27dc08096c	tick: Use IS_ENABLED() whenever possible Avoid ifdeferry if it can be converted to IS_ENABLED() whenever possible Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-5-frederic@kernel.org	2024-02-26 11:37:31 +01:00
Frederic Weisbecker	3aedb7fcd8	tick/sched: Remove useless oneshot ifdeffery tick-sched.c is only built when CONFIG_TICK_ONESHOT=y, which is selected only if CONFIG_NO_HZ_COMMON=y or CONFIG_HIGH_RES_TIMERS=y. Therefore the related ifdeferry in this file is needless and can be removed. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-4-frederic@kernel.org	2024-02-26 11:37:31 +01:00
Peng Liu	37263ba0c4	tick/nohz: Remove duplicate between lowres and highres handlers tick_nohz_lowres_handler() does the same work as tick_nohz_highres_handler() plus the clockevent device reprogramming, so make the former reuse the latter and rename it accordingly. Signed-off-by: Peng Liu <liupeng17@lenovo.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-3-frederic@kernel.org	2024-02-26 11:37:31 +01:00
Peng Liu	ffb7e01c4e	tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer() The ts->sched_timer initialization work of tick_nohz_switch_to_nohz() is almost the same as that of tick_setup_sched_timer(), so adjust the latter to get it reused by tick_nohz_switch_to_nohz(). This also makes the low resolution mode sched_timer benefit from the tick skew boot option. Signed-off-by: Peng Liu <liupeng17@lenovo.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-2-frederic@kernel.org	2024-02-26 11:37:31 +01:00

... 2 3 4 5 6 ...

44261 Commits