linux

iv/linux

Author	SHA1	Message	Date
Benjamin Tissoires	b98a5c68cc	bpf: Do not walk twice the map on free If someone stores both a timer and a workqueue in a map, on free we would walk it twice. Add a check in array_map_free_timers_wq and free the timers and workqueues if they are present. Fixes: `246331e3f1` ("bpf: allow struct bpf_wq to be embedded in arraymaps and hashmaps") Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20240430-bpf-next-v3-1-27afe7f3b17c@kernel.org	2024-04-30 16:28:33 +02:00
Justin Stitt	7b831bd3cf	PM: hibernate: replace deprecated strncpy() with strscpy() strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. This kernel config option is simply assigned with the resume_file buffer. It should be NUL-terminated but not necessarily NUL-padded as per its further usage with other string apis: \| static int __init find_resume_device(void) \| { \| if (!strlen(resume_file)) \| return -ENOENT; \| \| pm_pr_dbg("Checking hibernation image partition %s\n", resume_file); Use strscpy() [2] as it guarantees NUL-termination on the destination buffer. Specifically, use the new 2-argument version of strscpy() introduced in Commit `e6584c3964` ("string: Allow 2-argument strscpy()"). Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Dhruva Gole <d-gole@ti.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-04-30 12:59:30 +02:00
Andy Shevchenko	a3034872cd	bpf: Switch to krealloc_array() Let the krealloc_array() copy the original data and check for a multiplication overflow. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240429120005.3539116-1-andriy.shevchenko@linux.intel.com	2024-04-29 16:13:14 -07:00
Andy Shevchenko	cb01621b6d	bpf: Use struct_size() Use struct_size() instead of hand writing it. This is less verbose and more robust. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240429121323.3818497-1-andriy.shevchenko@linux.intel.com	2024-04-29 16:12:03 -07:00
Linus Torvalds	98369dccd2	workqueue: Fixes for v6.9-rc6 Two doc update patches and the following three fixes: - On single node systems, the default pool is used but the node_nr_active for the default pool was set to min_active. This effectively limited the max concurrency of unbound pools on single node systems to 8 causing performance regressions on some workloads. Fixed by setting the default pool's node_nr_active to max_active. - wq_update_node_max_active() could trigger divide-by-zero if the intersection between the allowed CPUs for an unbound workqueue and online CPUs becomes empty. - When kick_pool() was trying to repatriate a worker to a CPU in its pod by setting task->wake_cpu, it didn't consider whether the CPU being selected is online or not which obviously can lead to subobtimal behaviors. On s390, this triggered a crash in arch code. The workqueue patch removes the gross misbehavior but doesn't fix the crash completely as there's a race window in which CPUs can go down after wake_cpu is set. Need to decide whether the fix should be on the core or arch side. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZjAaug4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGT4fAQC5d8dNCDrAJmMgI0OBCwVgGGISTPalI+/ix4zu 5muBLwEAszuSZ4hEmg4L/jseTk+gZV0vIi4/IHjOzWwYczzLxQA= =SeX1 -----END PGP SIGNATURE----- Merge tag 'wq-for-6.9-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "Two doc update patches and the following three fixes: - On single node systems, the default pool is used but the node_nr_active for the default pool was set to min_active. This effectively limited the max concurrency of unbound pools on single node systems to 8 causing performance regressions on some workloads. Fixed by setting the default pool's node_nr_active to max_active. - wq_update_node_max_active() could trigger divide-by-zero if the intersection between the allowed CPUs for an unbound workqueue and online CPUs becomes empty. - When kick_pool() was trying to repatriate a worker to a CPU in its pod by setting task->wake_cpu, it didn't consider whether the CPU being selected is online or not which obviously can lead to subobtimal behaviors. On s390, this triggered a crash in arch code. The workqueue patch removes the gross misbehavior but doesn't fix the crash completely as there's a race window in which CPUs can go down after wake_cpu is set. Need to decide whether the fix should be on the core or arch side" * tag 'wq-for-6.9-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Fix divide error in wq_update_node_max_active() workqueue: The default node_nr_active should have its max set to max_active workqueue: Fix selection of wake_cpu in kick_pool() docs/zh_CN: core-api: Update translation of workqueue.rst to 6.9-rc1 Documentation/core-api: Update events_freezable_power references.	2024-04-29 15:57:37 -07:00
Borislav Petkov (AMD)	54db412e61	clocksource: Make the int help prompt unit readable in ncurses When doing make menuconfig and searching for the CLOCKSOURCE_WATCHDOG_MAX_SKEW_US config item, the help says: │ Symbol: CLOCKSOURCE_WATCHDOG_MAX_SKEW_US [=125] │ Type : integer │ Range : [50 1000] │ Defined at kernel/time/Kconfig:204 │ Prompt: Clocksource watchdog maximum allowable skew (in s) ^^^ │ Depends on: GENERIC_CLOCKEVENTS [=y] && CLOCKSOURCE_WATCHDOG [=y] because on some terminals, it cannot display the 'μ' char, unicode number 0x3bc. So simply write it out so that there's no trouble. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240428102143.26764-1-bp@kernel.org	2024-04-30 00:12:22 +02:00
Alexei Starovoitov	0db63c0b86	bpf: Fix verifier assumptions about socket->sk The verifier assumes that 'sk' field in 'struct socket' is valid and non-NULL when 'socket' pointer itself is trusted and non-NULL. That may not be the case when socket was just created and passed to LSM socket_accept hook. Fix this verifier assumption and adjust tests. Reported-by: Liam Wisehart <liamwisehart@meta.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Fixes: `6fcd486b3a` ("bpf: Refactor RCU enforcement in the verifier.") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20240427002544.68803-1-alexei.starovoitov@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-04-29 14:16:41 -07:00
Jakub Kicinski	89de2db193	bpf-next-for-netdev -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZi9+AAAKCRDbK58LschI g0nEAP487m7L0nLVriC2oIOWsi29tklW3etm6DO7gmGRGIHgrgEAnMyV1xBj3bGj v6jJwDcybCym1hLx+1x1JCZ4eoAFswE= =xbna -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-04-29 We've added 147 non-merge commits during the last 32 day(s) which contain a total of 158 files changed, 9400 insertions(+), 2213 deletions(-). The main changes are: 1) Add an internal-only BPF per-CPU instruction for resolving per-CPU memory addresses and implement support in x86 BPF JIT. This allows inlining per-CPU array and hashmap lookups and the bpf_get_smp_processor_id() helper, from Andrii Nakryiko. 2) Add BPF link support for sk_msg and sk_skb programs, from Yonghong Song. 3) Optimize x86 BPF JIT's emit_mov_imm64, and add support for various atomics in bpf_arena which can be JITed as a single x86 instruction, from Alexei Starovoitov. 4) Add support for passing mark with bpf_fib_lookup helper, from Anton Protopopov. 5) Add a new bpf_wq API for deferring events and refactor sleepable bpf_timer code to keep common code where possible, from Benjamin Tissoires. 6) Fix BPF_PROG_TEST_RUN infra with regards to bpf_dummy_struct_ops programs to check when NULL is passed for non-NULLable parameters, from Eduard Zingerman. 7) Harden the BPF verifier's and/or/xor value tracking, from Harishankar Vishwanathan. 8) Introduce crypto kfuncs to make BPF programs able to utilize the kernel crypto subsystem, from Vadim Fedorenko. 9) Various improvements to the BPF instruction set standardization doc, from Dave Thaler. 10) Extend libbpf APIs to partially consume items from the BPF ringbuffer, from Andrea Righi. 11) Bigger batch of BPF selftests refactoring to use common network helpers and to drop duplicate code, from Geliang Tang. 12) Support bpf_tail_call_static() helper for BPF programs with GCC 13, from Jose E. Marchesi. 13) Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF program to have code sections where preemption is disabled, from Kumar Kartikeya Dwivedi. 14) Allow invoking BPF kfuncs from BPF_PROG_TYPE_SYSCALL programs, from David Vernet. 15) Extend the BPF verifier to allow different input maps for a given bpf_for_each_map_elem() helper call in a BPF program, from Philo Lu. 16) Add support for PROBE_MEM32 and bpf_addr_space_cast instructions for riscv64 and arm64 JITs to enable BPF Arena, from Puranjay Mohan. 17) Shut up a false-positive KMSAN splat in interpreter mode by unpoison the stack memory, from Martin KaFai Lau. 18) Improve xsk selftest coverage with new tests on maximum and minimum hardware ring size configurations, from Tushar Vyavahare. 19) Various ReST man pages fixes as well as documentation and bash completion improvements for bpftool, from Rameez Rehman & Quentin Monnet. 20) Fix libbpf with regards to dumping subsequent char arrays, from Quentin Deslandes. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (147 commits) bpf, docs: Clarify PC use in instruction-set.rst bpf_helpers.h: Define bpf_tail_call_static when building with GCC bpf, docs: Add introduction for use in the ISA Internet Draft selftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args selftests/bpf: dummy_st_ops should reject 0 for non-nullable params bpf: check bpf_dummy_struct_ops program params for test runs selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops selftests/bpf: adjust dummy_st_ops_success to detect additional error bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable selftests/bpf: Add ring_buffer__consume_n test. bpf: Add bpf_guard_preempt() convenience macro selftests: bpf: crypto: add benchmark for crypto functions selftests: bpf: crypto skcipher algo selftests bpf: crypto: add skcipher to bpf crypto bpf: make common crypto API for TC/XDP programs bpf: update the comment for BTF_FIELDS_MAX selftests/bpf: Fix wq test. selftests/bpf: Use make_sockaddr in test_sock_addr selftests/bpf: Use connect_to_addr in test_sock_addr ... ==================== Link: https://lore.kernel.org/r/20240429131657.19423-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-29 13:12:19 -07:00
Matthew Wilcox (Oracle)	5af385f5f4	bounds: Use the right number of bits for power-of-two CONFIG_NR_CPUS bits_per() rounds up to the next power of two when passed a power of two. This causes crashes on some machines and configurations. Reported-by: Михаил Новоселов <m.novosyolov@rosalinux.ru> Tested-by: Ильфат Гаптрахманов <i.gaptrakhmanov@rosalinux.ru> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3347 Link: https://lore.kernel.org/all/1c978cf1-2934-4e66-e4b3-e81b04cb3571@rosalinux.ru/ Fixes: `f2d5dcb48f` (bounds: support non-power-of-two CONFIG_NR_CPUS) Cc: <stable@vger.kernel.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-04-29 08:29:29 -07:00
LuMingYin	dce3696271	tracing/probes: Fix memory leak in traceprobe_parse_probe_arg_body() If traceprobe_parse_probe_arg_body() failed to allocate 'parg->fmt', it jumps to the label 'out' instead of 'fail' by mistake.In the result, the buffer 'tmp' is not freed in this case and leaks its memory. Thus jump to the label 'fail' in that error case. Link: https://lore.kernel.org/all/20240427072347.1421053-1-lumingyindetect@126.com/ Fixes: `032330abd0` ("tracing/probes: Cleanup probe argument parser") Signed-off-by: LuMingYin <lumingyindetect@126.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2024-04-29 22:30:46 +09:00
Zqiang	1dd1eff161	softirq: Fix suspicious RCU usage in __do_softirq() Currently, the condition "__this_cpu_read(ksoftirqd) == current" is used to invoke rcu_softirq_qs() in ksoftirqd tasks context for non-RT kernels. This works correctly as long as the context is actually task context but this condition is wrong when: - the current task is ksoftirqd - the task is interrupted in a RCU read side critical section - __do_softirq() is invoked on return from interrupt Syzkaller triggered the following scenario: -> finish_task_switch() -> put_task_struct_rcu_user() -> call_rcu(&task->rcu, delayed_put_task_struct) -> __kasan_record_aux_stack() -> pfn_valid() -> rcu_read_lock_sched() <interrupt> __irq_exit_rcu() -> __do_softirq)() -> if (!IS_ENABLED(CONFIG_PREEMPT_RT) && __this_cpu_read(ksoftirqd) == current) -> rcu_softirq_qs() -> RCU_LOCKDEP_WARN(lock_is_held(&rcu_sched_lock_map)) The rcu quiescent state is reported in the rcu-read critical section, so the lockdep warning is triggered. Fix this by splitting out the inner working of __do_softirq() into a helper function which takes an argument to distinguish between ksoftirqd task context and interrupted context and invoke it from the relevant call sites with the proper context information and use that for the conditional invocation of rcu_softirq_qs(). Reported-by: syzbot+dce04ed6d1438ad69656@syzkaller.appspotmail.com Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240427102808.29356-1-qiang.zhang1211@gmail.com Link: https://lore.kernel.org/lkml/8f281a10-b85a-4586-9586-5bbc12dc784f@paulmck-laptop/T/#mea8aba4abfcb97bbf499d169ce7f30c4cff1b0e3	2024-04-29 05:03:51 +02:00
Linus Torvalds	245c8e8174	Misc fixes: - Fix EEVDF corner cases - Fix two nohz_full= related bugs that can cause boot crashes and warnings. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYuBxcRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1im6A/+JfNAwxPghp9zM43ERLadl3MUbH0hsdV9 54xhQm58Fi8wzXhxhRiOcLqrhFDNsy91mRWHxt9/PjvdFXhp9GiNpehMsHCmTsS8 7ywJKcAeKTM1+7nq4RFbDFSSpr1J5aUYKXfuhWwr0QVF3mNoRkmZaLdlnVjxebbA sKXXtEKbn0yCeIsdPwmZlLxzNyOV2j0p8Xck8DKrLjW57pbebiBHyt2N59PsARb4 Yt9wNbyb48DqGNy2FaRCWlm/8OL0BLMB0tMnXkIDtW89uVuP4V6fQF0vau3re+vy A8+OMD8gpeYjNV5WKrT5r3+EQyFJGI7nr6PbWTY8KLIGCjSSu9iGojn0hdVMGTj7 rQe6LJNSMe6xW53ZrecMh6OGZ3esgkaZKafXrMcczcSq/CCX0wSVSAbANCkhyANx VFZsCgxX/zdRSwSRZiyiHLnP/3/lw0sOxoBS/m0hDSJulJF7fbQGLAfLx+Zccnoe 2KBra2DXk/49OH+jehrj2C1m2ozWp2+4Kb7mwYISrTJVp0ylgjNiznAKkmB5R8XN UOfio5nr09KJWpRKW3UoR2CpaPu/BXUB249DDm36zK1I9V/ljYzrCHKjw+TTWgdS nPEVVYR9aj4t/De8wPm0gk/Orv9KaQkpdsOCgezRB0hJGuLpABcA9FGlTJntQ+n9 UPLMOgN36Q4= =Zhc/ -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2024-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix EEVDF corner cases - Fix two nohz_full= related bugs that can cause boot crashes and warnings * tag 'sched-urgent-2024-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU sched/isolation: Prevent boot crash when the boot CPU is nohz_full sched/eevdf: Prevent vlag from going out of bounds in reweight_eevdf() sched/eevdf: Fix miscalculation in reweight_entity() when se is not curr sched/eevdf: Always update V if se->on_rq when reweighting	2024-04-28 12:11:26 -07:00
Linus Torvalds	aec147c188	Misc fixes: - Make the CPU_MITIGATIONS=n interaction with conflicting mitigation-enabling boot parameters a bit saner. - Re-enable CPU mitigations by default on non-x86 - Fix TDX shared bit propagation on mprotect() - Fix potential show_regs() system hang when PKE initialization is not fully finished yet. - Add the 0x10-0x1f model IDs to the Zen5 range - Harden #VC instruction emulation some more Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJEBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYuCVMRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1h0Hw/1HVlmRGTrQQBvVMlzt6Y3GlUk2uHSiSh0 pO57sh9tMu/3kWdcrUi4xkEVHmfBjMxXY5sw/7VXQ9mG7wv+SVgF3gAaAl+5q73K JKPPAhkPqUmXP3Sm1rqTt8iZtTViY3ilP6QEZaOIfL2Pwa7X3QP8TJRBKAJCrXEM hOEMXSd1W1Escs/uPlhCXHx8TRVTr9f4bv8TdHBXZGHTida5vejj+yhMSdaM94qw ywZ4an1NOnLGcNEMMYhOQ6Kbh9Ckj46JRjpodTfmjodLd/jOhVU5C7nTZfHRXSRU 3UQBZtTZIYYCs8Urg2l/W5IhywWV3P9Jg+D+vl/bdEKJ+yINLAnOgVhVPqeG2GWt Ww3FelgRz0AkQKTegRCK2jQWnHActSrYmkr4M24wa/cVkMrcpXT3LHj8PgRnllx5 q5JqQ37G3QYHMzslbBqyUHzJv8KzgdZdgyFTN3dX1q9n5FPy7Ul9Ue1Zp2SoId8i K6u+IjCkftWwIbv8AhXiEVo0ynfBkmV4UNVGJks1xIPA3lmNv3ax5nQMJLvZzJ48 n+Id8ALEWxyOrKR6bdWdPtJqd0Nw/q4e6AOTzVYE94X8+uVuug4m4X7QPo+Ctbz1 IkhTxmBbHzgKylbddK6LkdnXnHCGidOmXsF3VS6TRfz7ALaMUgpaHw34reEhiOlT xsIw+XVOKg== =AfRR -----END PGP SIGNATURE----- Merge tag 'x86-urgent-2024-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - Make the CPU_MITIGATIONS=n interaction with conflicting mitigation-enabling boot parameters a bit saner. - Re-enable CPU mitigations by default on non-x86 - Fix TDX shared bit propagation on mprotect() - Fix potential show_regs() system hang when PKE initialization is not fully finished yet. - Add the 0x10-0x1f model IDs to the Zen5 range - Harden #VC instruction emulation some more * tag 'x86-urgent-2024-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu: Ignore "mitigations" kernel parameter if CPU_MITIGATIONS=n cpu: Re-enable CPU mitigations by default for !X86 architectures x86/tdx: Preserve shared bit on mprotect() x86/cpu: Fix check for RDPKRU in __show_regs() x86/CPU/AMD: Add models 0x10-0x1f to the Zen5 range x86/sev: Check for MWAITX and MONITORX opcodes in the #VC handler	2024-04-28 11:58:16 -07:00
Oleg Nesterov	257bf89d84	sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU housekeeping_setup() checks cpumask_intersects(present, online) to ensure that the kernel will have at least one housekeeping CPU after smp_init(), but this doesn't work if the maxcpus= kernel parameter limits the number of processors available after bootup. For example, a kernel with "maxcpus=2 nohz_full=0-2" parameters crashes at boot time on a virtual machine with 4 CPUs. Change housekeeping_setup() to use cpumask_first_and() and check that the returned CPU number is valid and less than setup_max_cpus. Another corner case is "nohz_full=0" on a machine with a single CPU or with the maxcpus=1 kernel argument. In this case non_housekeeping_mask is empty and tick_nohz_full_setup() makes no sense. And indeed, the kernel hits the WARN_ON(tick_nohz_full_running) in tick_sched_do_timer(). And how should the kernel interpret the "nohz_full=" parameter? It should be silently ignored, but currently cpulist_parse() happily returns the empty cpumask and this leads to the same problem. Change housekeeping_setup() to check cpumask_empty(non_housekeeping_mask) and do nothing in this case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Phil Auld <pauld@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240413141746.GA10008@redhat.com	2024-04-28 10:08:21 +02:00
Oleg Nesterov	5097cbcb38	sched/isolation: Prevent boot crash when the boot CPU is nohz_full Documentation/timers/no_hz.rst states that the "nohz_full=" mask must not include the boot CPU, which is no longer true after: `08ae95f4fd` ("nohz_full: Allow the boot CPU to be nohz_full"). However after: `aae17ebb53` ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work") the kernel will crash at boot time in this case; housekeeping_any_cpu() returns an invalid CPU number until smp_init() brings the first housekeeping CPU up. Change housekeeping_any_cpu() to check the result of cpumask_any_and() and return smp_processor_id() in this case. This is just the simple and backportable workaround which fixes the symptom, but smp_processor_id() at boot time should be safe at least for type == HK_TYPE_TIMER, this more or less matches the tick_do_timer_boot_cpu logic. There is no worry about cpu_down(); tick_nohz_cpu_down() will not allow to offline tick_do_timer_cpu (the 1st online housekeeping CPU). Fixes: `aae17ebb53` ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work") Reported-by: Chris von Recklinghausen <crecklin@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Phil Auld <pauld@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240411143905.GA19288@redhat.com Closes: https://lore.kernel.org/all/20240402105847.GA24832@redhat.com/	2024-04-28 10:07:12 +02:00
Tetsuo Handa	2e5449f4f2	profiling: Remove create_prof_cpu_mask(). create_prof_cpu_mask() is no longer used after commit `1f44a22577` ("s390: convert interrupt handling to use generic hardirq"). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-04-27 11:17:48 -07:00
Jakub Kicinski	b2ff42c6d3	bpf-for-netdev -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZiwdfQAKCRDbK58LschI g1oqAP9mjayeIHCfYMQZa2eevy1PmVlgdNdFdMDWZFS/pHv9cgD/ZdmGzbUDKCAQ Y/KiTajitZw3kxtHX45v8/Ugtlsh9Qg= =Ewiw -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-04-26 We've added 12 non-merge commits during the last 22 day(s) which contain a total of 14 files changed, 168 insertions(+), 72 deletions(-). The main changes are: 1) Fix BPF_PROBE_MEM in verifier and JIT to skip loads from vsyscall page, from Puranjay Mohan. 2) Fix a crash in XDP with devmap broadcast redirect when the latter map is in process of being torn down, from Toke Høiland-Jørgensen. 3) Fix arm64 and riscv64 BPF JITs to properly clear start time for BPF program runtime stats, from Xu Kuohai. 4) Fix a sockmap KCSAN-reported data race in sk_psock_skb_ingress_enqueue, from Jason Xing. 5) Fix BPF verifier error message in resolve_pseudo_ldimm64, from Anton Protopopov. 6) Fix missing DEBUG_INFO_BTF_MODULES Kconfig menu item, from Andrii Nakryiko. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Test PROBE_MEM of VSYSCALL_ADDR on x86-64 bpf, x86: Fix PROBE_MEM runtime load check bpf: verifier: prevent userspace memory access xdp: use flags field to disambiguate broadcast redirect arm32, bpf: Reimplement sign-extension mov instruction riscv, bpf: Fix incorrect runtime stats bpf, arm64: Fix incorrect runtime stats bpf: Fix a verifier verbose message bpf, skmsg: Fix NULL pointer dereference in sk_psock_skb_ingress_enqueue MAINTAINERS: bpf: Add Lehui and Puranjay as riscv64 reviewers MAINTAINERS: Update email address for Puranjay Mohan bpf, kconfig: Fix DEBUG_INFO_BTF_MODULES Kconfig definition ==================== Link: https://lore.kernel.org/r/20240426224248.26197-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-26 17:36:53 -07:00
Linus Torvalds	e6ebf01172	11 hotfixes. 8 are cc:stable and the remaining 3 (nice ratio!) address post-6.8 issues or aren't considered suitable for backporting. All except one of these are for MM. I see no particular theme - it's singletons all over. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZiwPZwAKCRDdBJ7gKXxA jmcQAPkB6UT/rBUMvFZb1dom9R6SDYl5ZBr20Vj1HvfakCLxmQEAqEd0N7QoWvKS hKNCMDujiEKqDUWeUaJen4cqXFFE2Qg= =1wP7 -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-04-26-13-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "11 hotfixes. 8 are cc:stable and the remaining 3 (nice ratio!) address post-6.8 issues or aren't considered suitable for backporting. All except one of these are for MM. I see no particular theme - it's singletons all over" * tag 'mm-hotfixes-stable-2024-04-26-13-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio() selftests: mm: protection_keys: save/restore nr_hugepages value from launch script stackdepot: respect __GFP_NOLOCKDEP allocation flag hugetlb: check for anon_vma prior to folio allocation mm: zswap: fix shrinker NULL crash with cgroup_disable=memory mm: turn folio_test_hugetlb into a PageType mm: support page_mapcount() on page_has_type() pages mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros mm/hugetlb: fix missing hugetlb_lock for resv uncharge selftests: mm: fix unused and uninitialized variable warning selftests/harness: remove use of LINE_MAX	2024-04-26 13:48:03 -07:00
Puranjay Mohan	66e13b615a	bpf: verifier: prevent userspace memory access With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer. To thwart invalid memory accesses, the JITs add an exception table entry for all such accesses. But in case the src_reg + offset is a userspace address, the BPF program might read that memory if the user has mapped it. Make the verifier add guard instructions around such memory accesses and skip the load if the address falls into the userspace region. The JITs need to implement bpf_arch_uaddress_limit() to define where the userspace addresses end for that architecture or TASK_SIZE is taken as default. The implementation is as follows: REG_AX = SRC_REG if(offset) REG_AX += offset; REG_AX >>= 32; if (REG_AX <= (uaddress_limit >> 32)) DST_REG = 0; else DST_REG = (size )(SRC_REG + offset); Comparing just the upper 32 bits of the load address with the upper 32 bits of uaddress_limit implies that the values are being aligned down to a 4GB boundary before comparison. The above means that all loads with address <= uaddress_limit + 4GB are skipped. This is acceptable because there is a large hole (much larger than 4GB) between userspace and kernel space memory, therefore a correctly functioning BPF program should not access this 4GB memory above the userspace. Let's analyze what this patch does to the following fentry program dereferencing an untrusted pointer: SEC("fentry/tcp_v4_connect") int BPF_PROG(fentry_tcp_v4_connect, struct sock sk) { (volatile long )sk; return 0; } BPF Program before \| BPF Program after ------------------ \| ----------------- 0: (79) r1 = (u64 )(r1 +0) 0: (79) r1 = (u64 )(r1 +0) ----------------------------------------------------------------------- 1: (79) r1 = (u64 )(r1 +0) --\ 1: (bf) r11 = r1 ----------------------------\ \ 2: (77) r11 >>= 32 2: (b7) r0 = 0 \ \ 3: (b5) if r11 <= 0x8000 goto pc+2 3: (95) exit \ \-> 4: (79) r1 = (u64 *)(r1 +0) \ 5: (05) goto pc+1 \ 6: (b7) r1 = 0 \-------------------------------------- 7: (b7) r0 = 0 8: (95) exit As you can see from above, in the best case (off=0), 5 extra instructions are emitted. Now, we analyze the same program after it has gone through the JITs of ARM64 and RISC-V architectures. We follow the single load instruction that has the untrusted pointer and see what instrumentation has been added around it. x86-64 JIT ========== JIT's Instrumentation (upstream) --------------------- 0: nopl 0x0(%rax,%rax,1) 5: xchg %ax,%ax 7: push %rbp 8: mov %rsp,%rbp b: mov 0x0(%rdi),%rdi --------------------------------- f: movabs $0x800000000000,%r11 19: cmp %r11,%rdi 1c: jb 0x000000000000002a 1e: mov %rdi,%r11 21: add $0x0,%r11 28: jae 0x000000000000002e 2a: xor %edi,%edi 2c: jmp 0x0000000000000032 2e: mov 0x0(%rdi),%rdi --------------------------------- 32: xor %eax,%eax 34: leave 35: ret The x86-64 JIT already emits some instructions to protect against user memory access. This patch doesn't make any changes for the x86-64 JIT. ARM64 JIT ========= No Intrumentation Verifier's Instrumentation (upstream) (This patch) ----------------- -------------------------- 0: add x9, x30, #0x0 0: add x9, x30, #0x0 4: nop 4: nop 8: paciasp 8: paciasp c: stp x29, x30, [sp, #-16]! c: stp x29, x30, [sp, #-16]! 10: mov x29, sp 10: mov x29, sp 14: stp x19, x20, [sp, #-16]! 14: stp x19, x20, [sp, #-16]! 18: stp x21, x22, [sp, #-16]! 18: stp x21, x22, [sp, #-16]! 1c: stp x25, x26, [sp, #-16]! 1c: stp x25, x26, [sp, #-16]! 20: stp x27, x28, [sp, #-16]! 20: stp x27, x28, [sp, #-16]! 24: mov x25, sp 24: mov x25, sp 28: mov x26, #0x0 28: mov x26, #0x0 2c: sub x27, x25, #0x0 2c: sub x27, x25, #0x0 30: sub sp, sp, #0x0 30: sub sp, sp, #0x0 34: ldr x0, [x0] 34: ldr x0, [x0] -------------------------------------------------------------------------------- 38: ldr x0, [x0] ----------\ 38: add x9, x0, #0x0 -----------------------------------\\ 3c: lsr x9, x9, #32 3c: mov x7, #0x0 \\ 40: cmp x9, #0x10, lsl #12 40: mov sp, sp \\ 44: b.ls 0x0000000000000050 44: ldp x27, x28, [sp], #16 \\--> 48: ldr x0, [x0] 48: ldp x25, x26, [sp], #16 \ 4c: b 0x0000000000000054 4c: ldp x21, x22, [sp], #16 \ 50: mov x0, #0x0 50: ldp x19, x20, [sp], #16 \--------------------------------------- 54: ldp x29, x30, [sp], #16 54: mov x7, #0x0 58: add x0, x7, #0x0 58: mov sp, sp 5c: autiasp 5c: ldp x27, x28, [sp], #16 60: ret 60: ldp x25, x26, [sp], #16 64: nop 64: ldp x21, x22, [sp], #16 68: ldr x10, 0x0000000000000070 68: ldp x19, x20, [sp], #16 6c: br x10 6c: ldp x29, x30, [sp], #16 70: add x0, x7, #0x0 74: autiasp 78: ret 7c: nop 80: ldr x10, 0x0000000000000088 84: br x10 There are 6 extra instructions added in ARM64 in the best case. This will become 7 in the worst case (off != 0). RISC-V JIT (RISCV_ISA_C Disabled) ========== No Intrumentation Verifier's Instrumentation (upstream) (This patch) ----------------- -------------------------- 0: nop 0: nop 4: nop 4: nop 8: li a6, 33 8: li a6, 33 c: addi sp, sp, -16 c: addi sp, sp, -16 10: sd s0, 8(sp) 10: sd s0, 8(sp) 14: addi s0, sp, 16 14: addi s0, sp, 16 18: ld a0, 0(a0) 18: ld a0, 0(a0) --------------------------------------------------------------- 1c: ld a0, 0(a0) --\ 1c: mv t0, a0 --------------------------\ \ 20: srli t0, t0, 32 20: li a5, 0 \ \ 24: lui t1, 4096 24: ld s0, 8(sp) \ \ 28: sext.w t1, t1 28: addi sp, sp, 16 \ \ 2c: bgeu t1, t0, 12 2c: sext.w a0, a5 \ \--> 30: ld a0, 0(a0) 30: ret \ 34: j 8 \ 38: li a0, 0 \------------------------------ 3c: li a5, 0 40: ld s0, 8(sp) 44: addi sp, sp, 16 48: sext.w a0, a5 4c: ret There are 7 extra instructions added in RISC-V. Fixes: `8008342853` ("bpf, arm64: Add BPF exception tables") Reported-by: Breno Leitao <leitao@debian.org> Suggested-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Link: https://lore.kernel.org/r/20240424100210.11982-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-26 09:45:18 -07:00
Jakub Kicinski	2bd87951de	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/ti/icssg/icssg_prueth.c net/mac80211/chan.c `89884459a0` ("wifi: mac80211: fix idle calculation with multi-link") `87f5500285` ("wifi: mac80211: simplify ieee80211_assign_link_chanctx()") https://lore.kernel.org/all/20240422105623.7b1fbda2@canb.auug.org.au/ net/unix/garbage.c `1971d13ffa` ("af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().") `4090fa373f` ("af_unix: Replace garbage collection algorithm.") drivers/net/ethernet/ti/icssg/icssg_prueth.c drivers/net/ethernet/ti/icssg/icssg_common.c `4dcd0e83ea` ("net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()") `e2dc7bfd67` ("net: ti: icssg-prueth: Move common functions into a separate file") No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-25 12:41:37 -07:00
Xiu Jianfeng	b7d56d953a	cgroup/cpuset: Remove outdated comment in sched_partition_write() The comment here is outdated and can cause confusion, from the code perspective, there’s also no need for new comment, so just remove it. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-04-25 07:16:19 -10:00
Sean Christopherson	ce0abef6a1	cpu: Ignore "mitigations" kernel parameter if CPU_MITIGATIONS=n Explicitly disallow enabling mitigations at runtime for kernels that were built with CONFIG_CPU_MITIGATIONS=n, as some architectures may omit code entirely if mitigations are disabled at compile time. E.g. on x86, a large pile of Kconfigs are buried behind CPU_MITIGATIONS, and trying to provide sane behavior for retroactively enabling mitigations is extremely difficult, bordering on impossible. E.g. page table isolation and call depth tracking require build-time support, BHI mitigations will still be off without additional kernel parameters, etc. [ bp: Touchups. ] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240420000556.2645001-3-seanjc@google.com	2024-04-25 15:47:39 +02:00
Sean Christopherson	fe42754b94	cpu: Re-enable CPU mitigations by default for !X86 architectures Rename x86's to CPU_MITIGATIONS, define it in generic code, and force it on for all architectures exception x86. A recent commit to turn mitigations off by default if SPECULATION_MITIGATIONS=n kinda sorta missed that "cpu_mitigations" is completely generic, whereas SPECULATION_MITIGATIONS is x86-specific. Rename x86's SPECULATIVE_MITIGATIONS instead of keeping both and have it select CPU_MITIGATIONS, as having two configs for the same thing is unnecessary and confusing. This will also allow x86 to use the knob to manage mitigations that aren't strictly related to speculative execution. Use another Kconfig to communicate to common code that CPU_MITIGATIONS is already defined instead of having x86's menu depend on the common CPU_MITIGATIONS. This allows keeping a single point of contact for all of x86's mitigations, and it's not clear that other architectures want to allow disabling mitigations at compile-time. Fixes: `f337a6a21e` ("x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n") Closes: https://lkml.kernel.org/r/20240413115324.53303a68%40canb.auug.org.au Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Michael Ellerman <mpe@ellerman.id.au> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240420000556.2645001-2-seanjc@google.com	2024-04-25 15:47:35 +02:00
Matthew Wilcox (Oracle)	d99e3140a4	mm: turn folio_test_hugetlb into a PageType The current folio_test_hugetlb() can be fooled by a concurrent folio split into returning true for a folio which has never belonged to hugetlbfs. This can't happen if the caller holds a refcount on it, but we have a few places (memory-failure, compaction, procfs) which do not and should not take a speculative reference. Since hugetlb pages do not use individual page mapcounts (they are always fully mapped and use the entire_mapcount field to record the number of mappings), the PageType field is available now that page_mapcount() ignores the value in this field. In compaction and with CONFIG_DEBUG_VM enabled, the current implementation can result in an oops, as reported by Luis. This happens since `9c5ccf2db0` ("mm: remove HUGETLB_PAGE_DTOR") effectively added some VM_BUG_ON() checks in the PageHuge() testing path. [willy@infradead.org: update vmcoreinfo] Link: https://lkml.kernel.org/r/ZgGZUvsdhaT1Va-T@casper.infradead.org Link: https://lkml.kernel.org/r/20240321142448.1645400-6-willy@infradead.org Fixes: `9c5ccf2db0` ("mm: remove HUGETLB_PAGE_DTOR") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Luis Chamberlain <mcgrof@kernel.org> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218227 Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-24 19:34:26 -07:00
Vadim Fedorenko	3e1c6f3540	bpf: make common crypto API for TC/XDP programs Add crypto API support to BPF to be able to decrypt or encrypt packets in TC/XDP BPF programs. Special care should be taken for initialization part of crypto algo because crypto alloc) doesn't work with preemtion disabled, it can be run only in sleepable BPF program. Also async crypto is not supported because of the very same issue - TC/XDP BPF programs are not sleepable. Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Link: https://lore.kernel.org/r/20240422225024.2847039-2-vadfed@meta.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-04-24 16:01:10 -07:00
Jinjie Ruan	6678ae1918	genirq: Reuse irq_is_nmi() Move irq_is_nmi() to the internal header file and reuse it all over the place. Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423024037.3331215-1-ruanjinjie@huawei.com	2024-04-24 20:42:57 +02:00
Dongli Zhang	88d724e230	genirq/cpuhotplug: Retry with cpu_online_mask when migration fails When a CPU goes offline, the interrupts affine to that CPU are re-configured. Managed interrupts undergo either migration to other CPUs or shutdown if all CPUs listed in the affinity are offline. The migration of managed interrupts is guaranteed on x86 because there are interrupt vectors reserved. Regular interrupts are migrated to a still online CPU in the affinity mask or if there is no online CPU to any online CPU. This works as long as the still online CPUs in the affinity mask have interrupt vectors available, but in case that none of those CPUs has a vector available the migration fails and the device interrupt becomes stale. This is not any different from the case where the affinity mask does not contain any online CPU, but there is no fallback operation for this. Instead of giving up, retry the migration attempt with the online CPU mask if the interrupt is not managed, as managed interrupts cannot be affected by this problem. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423073413.79625-1-dongli.zhang@oracle.com	2024-04-24 20:42:57 +02:00
David Stevens	a60dd06af6	genirq/cpuhotplug: Skip suspended interrupts when restoring affinity irq_restore_affinity_of_irq() restarts managed interrupts unconditionally when the first CPU in the affinity mask comes online. That's correct during normal hotplug operations, but not when resuming from S3 because the drivers are not resumed yet and interrupt delivery is not expected by them. Skip the startup of suspended interrupts and let resume_device_irqs() deal with restoring them. This ensures that irqs are not delivered to drivers during the noirq phase of resuming from S3, after non-boot CPUs are brought back online. Signed-off-by: David Stevens <stevensd@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240424090341.72236-1-stevensd@chromium.org	2024-04-24 20:42:57 +02:00
Lai Jiangshan	91f098704c	workqueue: Fix divide error in wq_update_node_max_active() Yue Sun and xingwei lee reported a divide error bug in wq_update_node_max_active(): divide error: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 1 PID: 21 Comm: cpuhp/1 Not tainted 6.9.0-rc5 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:wq_update_node_max_active+0x369/0x6b0 kernel/workqueue.c:1605 Code: 24 bf 00 00 00 80 44 89 fe e8 83 27 33 00 41 83 fc ff 75 0d 41 81 ff 00 00 00 80 0f 84 68 01 00 00 e8 fb 22 33 00 44 89 f8 99 <41> f7 fc 89 c5 89 c7 44 89 ee e8 a8 24 33 00 89 ef 8b 5c 24 04 89 RSP: 0018:ffffc9000018fbb0 EFLAGS: 00010293 RAX: 00000000000000ff RBX: 0000000000000001 RCX: ffff888100ada500 RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000080000000 RBP: 0000000000000001 R08: ffffffff815b1fcd R09: 1ffff1100364ad72 R10: dffffc0000000000 R11: ffffed100364ad73 R12: 0000000000000000 R13: 0000000000000100 R14: 0000000000000000 R15: 00000000000000ff FS: 0000000000000000(0000) GS:ffff888135c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb8c06ca6f8 CR3: 000000010d6c6000 CR4: 0000000000750ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> workqueue_offline_cpu+0x56f/0x600 kernel/workqueue.c:6525 cpuhp_invoke_callback+0x4e1/0x870 kernel/cpu.c:194 cpuhp_thread_fun+0x411/0x7d0 kernel/cpu.c:1092 smpboot_thread_fn+0x544/0xa10 kernel/smpboot.c:164 kthread+0x2ed/0x390 kernel/kthread.c:388 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:244 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- After analysis, it happens when all of the CPUs in a workqueue's affinity get offine. The problem can be easily reproduced by: # echo 8 > /sys/devices/virtual/workqueue/<any-wq-name>/cpumask # echo 0 > /sys/devices/system/cpu/cpu3/online Use the default max_actives for nodes when all of the CPUs in the workqueue's affinity get offline to fix the problem. Reported-by: Yue Sun <samsun1006219@gmail.com> Reported-by: xingwei lee <xrivendell7@gmail.com> Link: https://lore.kernel.org/lkml/CAEkJfYPGS1_4JqvpSo0=FM0S1ytB8CEbyreLTtWpR900dUZymw@mail.gmail.com/ Fixes: `5797b1c189` ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues") Cc: stable@vger.kernel.org Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-04-24 07:23:06 -10:00
Kumar Kartikeya Dwivedi	fc7566ad0a	bpf: Introduce bpf_preempt_[disable,enable] kfuncs Introduce two new BPF kfuncs, bpf_preempt_disable and bpf_preempt_enable. These kfuncs allow disabling preemption in BPF programs. Nesting is allowed, since the intended use cases includes building native BPF spin locks without kernel helper involvement. Apart from that, this can be used to per-CPU data structures for cases where programs (or userspace) may preempt one or the other. Currently, while per-CPU access is stable, whether it will be consistent is not guaranteed, as only migration is disabled for BPF programs. Global functions are disallowed from being called, but support for them will be added as a follow up not just preempt kfuncs, but rcu_read_lock kfuncs as well. Static subprog calls are permitted. Sleepable helpers and kfuncs are disallowed in non-preemptible regions. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20240424031315.2757363-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-24 09:47:49 -07:00
Alexei Starovoitov	dc92febf7b	bpf: Don't check for recursion in bpf_wq_work. __bpf_prog_enter_sleepable_recur does recursion check which is not applicable to wq callback. The callback function is part of bpf program and bpf prog might be running on the same cpu. So recursion check would incorrectly prevent callback from running. The code can call __bpf_prog_enter_sleepable(), but run_ctx would be fake, hence use explicit rcu_read_lock_trace(); migrate_disable(); to address this problem. Another reason to open code is __bpf_prog_enter* are not available in !JIT configs. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202404241719.IIGdpAku-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202404241811.FFV4Bku3-lkp@intel.com/ Fixes: `eb48f6cd41` ("bpf: wq: add bpf_wq_init") Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-24 09:06:50 -07:00
Vincent Guittot	97450eb909	sched/pelt: Remove shift of thermal clock The optional shift of the clock used by thermal/hw load avg has been introduced to handle case where the signal was not always a high frequency hw signal. Now that cpufreq provides a signal for firmware and SW pressure, we can remove this exception and always keep this PELT signal aligned with other signals. Mark sysctl_sched_migration_cost boot parameter as deprecated Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lore.kernel.org/r/20240326091616.3696851-6-vincent.guittot@linaro.org	2024-04-24 12:08:02 +02:00
Vincent Guittot	d4dbc99171	sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure() Now that cpufreq provides a pressure value to the scheduler, rename arch_update_thermal_pressure into HW pressure to reflect that it returns a pressure applied by HW (i.e. with a high frequency change) and not always related to thermal mitigation but also generated by max current limitation as an example. Such high frequency signal needs filtering to be smoothed and provide an value that reflects the average available capacity into the scheduler time scale. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lore.kernel.org/r/20240326091616.3696851-5-vincent.guittot@linaro.org	2024-04-24 12:08:01 +02:00
Vincent Guittot	f1f8d0a224	sched/cpufreq: Take cpufreq feedback into account Aggregate the different pressures applied on the capacity of CPUs and create a new function that returns the actual capacity of the CPU: get_actual_cpu_capacity(). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Link: https://lore.kernel.org/r/20240326091616.3696851-3-vincent.guittot@linaro.org	2024-04-24 12:07:59 +02:00
Vincent Guittot	cd18bec668	sched/fair: Fix update of rd->sg_overutilized sg_overloaded is used instead of sg_overutilized to update rd->sg_overutilized. Fixes: `4475cd8bfd` ("sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240404155738.2866102-1-vincent.guittot@linaro.org	2024-04-24 12:02:51 +02:00
Thomas Weißschuh	795f90c6f1	sysctl: treewide: constify argument ctl_table_root::permissions(table) The permissions callback should not modify the ctl_table. Enforce this expectation via the typesystem. This is a step to put "struct ctl_table" into .rodata throughout the kernel. The patch was created with the following coccinelle script: @@ identifier func, head, ctl; @@ int func( struct ctl_table_header head, - struct ctl_table ctl) + const struct ctl_table *ctl) { ... } (insert_entry() from fs/proc/proc_sysctl.c is a false-positive) No additional occurrences of '.permissions =' were found after a tree-wide search for places missed by the conccinelle script. Reviewed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-04-24 09:43:54 +02:00
Joel Granados	1adb825af9	bpf: Remove the now superfluous sentinel elements from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel element from bpf_syscall_table. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-04-24 09:43:54 +02:00
Joel Granados	f15843f725	delayacct: Remove the now superfluous sentinel elements from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel element from kern_delayacct_table Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-04-24 09:43:54 +02:00
Joel Granados	f884cd3862	kprobes: Remove the now superfluous sentinel elements from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel element from kprobe_sysclts Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-04-24 09:43:54 +02:00
Joel Granados	f842d9a96e	printk: Remove the now superfluous sentinel elements from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) rm sentinel element from printk_sysctls Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-04-24 09:43:54 +02:00
Joel Granados	f532376e88	scheduler: Remove the now superfluous sentinel elements from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) rm sentinel element from ctl_table arrays Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-04-24 09:43:54 +02:00
Joel Granados	e822582eff	seccomp: Remove the now superfluous sentinel elements from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel element from seccomp_sysctl_table. Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-04-24 09:43:54 +02:00
Joel Granados	fe6fc8e11b	timekeeping: Remove the now superfluous sentinel elements from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel element from time_sysctl Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-04-24 09:43:54 +02:00
Joel Granados	66f20b11d3	ftrace: Remove the now superfluous sentinel elements from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel elements from ftrace_sysctls and user_event_sysctls Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-04-24 09:43:53 +02:00
Joel Granados	7fd9c63f87	umh: Remove the now superfluous sentinel elements from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel element from usermodehelper_table Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-04-24 09:43:53 +02:00
Joel Granados	11a921909f	kernel misc: Remove the now superfluous sentinel elements from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove the sentinel from ctl_table arrays. Reduce by one the values used to compare the size of the adjusted arrays. Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-04-24 09:43:53 +02:00
Tejun Heo	d40f92020c	workqueue: The default node_nr_active should have its max set to max_active The default nna (node_nr_active) is used when the pool isn't tied to a specific NUMA node. This can happen in the following cases: 1. On NUMA, if per-node pwq init failure and the fallback pwq is used. 2. On NUMA, if a pool is configured to span multiple nodes. 3. On single node setups. `5797b1c189` ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues") set the default nna->max to min_active because only #1 was being considered. For #2 and #3, using min_active means that the max concurrency in normal operation is pushed down to min_active which is currently 8, which can obviously lead to performance issues. exact value nna->max is set to doesn't really matter. #2 can only happen if the workqueue is intentionally configured to ignore NUMA boundaries and there's no good way to distribute max_active in this case. #3 is the default behavior on single node machines. Let's set it the default nna->max to max_active. This fixes the artificially lowered concurrency problem on single node machines and shouldn't hurt anything for other cases. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: `5797b1c189` ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues") Link: https://lore.kernel.org/dm-devel/20240410084531.2134621-1-shinichiro.kawasaki@wdc.com/ Signed-off-by: Tejun Heo <tj@kernel.org>	2024-04-23 17:32:59 -10:00
Waiman Long	04d63da4da	cgroup/cpuset: Fix incorrect top_cpuset flags Commit `8996f93fc3` ("cgroup/cpuset: Statically initialize more members of top_cpuset") uses an incorrect "<" relational operator for the CS_SCHED_LOAD_BALANCE bit when initializing the top_cpuset. This results in load_balancing turned off by default in the top cpuset which is bad for performance. Fix this by using the BIT() helper macro to set the desired top_cpuset flags and avoid similar mistake from being made in the future. Fixes: `8996f93fc3` ("cgroup/cpuset: Statically initialize more members of top_cpuset") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-04-23 17:31:18 -10:00
Benjamin Tissoires	8e83da9732	bpf: add bpf_wq_start again, copy/paste from bpf_timer_start(). Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-15-6c986a5a741f@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-23 19:46:57 -07:00
Benjamin Tissoires	81f1d7a583	bpf: wq: add bpf_wq_set_callback_impl To support sleepable async callbacks, we need to tell push_async_cb() whether the cb is sleepable or not. The verifier now detects that we are in bpf_wq_set_callback_impl and can allow a sleepable callback to happen. Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-13-6c986a5a741f@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-23 19:46:57 -07:00
Benjamin Tissoires	eb48f6cd41	bpf: wq: add bpf_wq_init We need to teach the verifier about the second argument which is declared as void * but which is of type KF_ARG_PTR_TO_MAP. We could have dropped this extra case if we declared the second argument as struct bpf_map *, but that means users will have to do extra casting to have their program compile. We also need to duplicate the timer code for the checking if the map argument is matching the provided workqueue. Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-11-6c986a5a741f@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-23 19:46:57 -07:00
Benjamin Tissoires	246331e3f1	bpf: allow struct bpf_wq to be embedded in arraymaps and hashmaps Currently bpf_wq_cancel_and_free() is just a placeholder as there is no memory allocation for bpf_wq just yet. Again, duplication of the bpf_timer approach Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-9-6c986a5a741f@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-23 18:31:25 -07:00
Benjamin Tissoires	d940c9b94d	bpf: add support for KF_ARG_PTR_TO_WORKQUEUE Introduce support for KF_ARG_PTR_TO_WORKQUEUE. The kfuncs will use bpf_wq as argument and that will be recognized as workqueue argument by verifier. bpf_wq_kern casting can happen inside kfunc, but using bpf_wq in argument makes life easier for users who work with non-kern type in BPF progs. Duplicate process_timer_func into process_wq_func. meta argument is only needed to ensure bpf_wq_init's workqueue and map arguments are coming from the same map (map_uid logic is necessary for correct inner-map handling), so also amend check_kfunc_args() to match what helpers functions check is doing. Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-8-6c986a5a741f@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-23 18:31:25 -07:00
Benjamin Tissoires	ad2c03e691	bpf: verifier: bail out if the argument is not a map When a kfunc is declared with a KF_ARG_PTR_TO_MAP, we should have reg->map_ptr set to a non NULL value, otherwise, that means that the underlying type is not a map. Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-7-6c986a5a741f@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-23 18:31:24 -07:00
Benjamin Tissoires	d56b63cf0c	bpf: add support for bpf_wq user type Mostly a copy/paste from the bpf_timer API, without the initialization and free, as they will be done in a separate patch. Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-5-6c986a5a741f@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-23 18:31:24 -07:00
Benjamin Tissoires	fc22d9495f	bpf: replace bpf_timer_cancel_and_free with a generic helper Same reason than most bpf_timer* functions, we need almost the same for workqueues. So extract the generic part out of it so bpf_wq_cancel_and_free can reuse it. Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-4-6c986a5a741f@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-23 18:31:24 -07:00
Benjamin Tissoires	073f11b026	bpf: replace bpf_timer_set_callback with a generic helper In the same way we have a generic __bpf_async_init(), we also need to share code between timer and workqueue for the set_callback call. We just add an unused flags parameter, as it will be used for workqueues. Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-3-6c986a5a741f@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-23 18:31:24 -07:00
Benjamin Tissoires	56b4a177ae	bpf: replace bpf_timer_init with a generic helper No code change except for the new flags argument being stored in the local data struct. Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-2-6c986a5a741f@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-23 18:31:24 -07:00
Benjamin Tissoires	be2749beff	bpf: make timer data struct more generic To be able to add workqueues and reuse most of the timer code, we need to make bpf_hrtimer more generic. There is no code change except that the new struct gets a new u64 flags attribute. We are still below 2 cache lines, so this shouldn't impact the current running codes. The ordering is also changed. Everything related to async callback is now on top of bpf_hrtimer. Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-1-6c986a5a741f@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-23 18:31:24 -07:00
Sven Schnelle	57a01eafdc	workqueue: Fix selection of wake_cpu in kick_pool() With cpu_possible_mask=0-63 and cpu_online_mask=0-7 the following kernel oops was observed: smp: Bringing up secondary CPUs ... smp: Brought up 1 node, 8 CPUs Unable to handle kernel pointer dereference in virtual kernel address space Failing address: 0000000000000000 TEID: 0000000000000803 [..] Call Trace: arch_vcpu_is_preempted+0x12/0x80 select_idle_sibling+0x42/0x560 select_task_rq_fair+0x29a/0x3b0 try_to_wake_up+0x38e/0x6e0 kick_pool+0xa4/0x198 __queue_work.part.0+0x2bc/0x3a8 call_timer_fn+0x36/0x160 __run_timers+0x1e2/0x328 __run_timer_base+0x5a/0x88 run_timer_softirq+0x40/0x78 __do_softirq+0x118/0x388 irq_exit_rcu+0xc0/0xd8 do_ext_irq+0xae/0x168 ext_int_handler+0xbe/0xf0 psw_idle_exit+0x0/0xc default_idle_call+0x3c/0x110 do_idle+0xd4/0x158 cpu_startup_entry+0x40/0x48 rest_init+0xc6/0xc8 start_kernel+0x3c4/0x5e0 startup_continue+0x3c/0x50 The crash is caused by calling arch_vcpu_is_preempted() for an offline CPU. To avoid this, select the cpu with cpumask_any_and_distribute() to mask __pod_cpumask with cpu_online_mask. In case no cpu is left in the pool, skip the assignment. tj: This doesn't fully fix the bug as CPUs can still go down between picking the target CPU and the wake call. Fixing that likely requires adding cpu_online() test to either the sched or s390 arch code. However, regardless of how that is fixed, workqueue shouldn't be picking a CPU which isn't online as that would result in unpredictable and worse behavior. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Fixes: `8639ecebc9` ("workqueue: Implement non-strict affinity scope for unbound workqueues") Cc: stable@vger.kernel.org # v6.6+ Signed-off-by: Tejun Heo <tj@kernel.org>	2024-04-23 06:22:40 -10:00
Xiu Jianfeng	e8784765fa	cgroup/cpuset: Avoid clearing CS_SCHED_LOAD_BALANCE twice In cpuset_css_online(), CS_SCHED_LOAD_BALANCE will be cleared twice, the former one in the is_in_v2_mode() case could be removed because is_in_v2_mode() can be true for cgroup v1 if the "cpuset_v2_mode" mount option is specified, that balance flag change isn't appropriate for this particular case. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-04-23 06:00:43 -10:00
Sourabh Jain	79365026f8	crash: add a new kexec flag for hotplug support Commit `a72bbec70d` ("crash: hotplug support for kexec_load()") introduced a new kexec flag, `KEXEC_UPDATE_ELFCOREHDR`. Kexec tool uses this flag to indicate to the kernel that it is safe to modify the elfcorehdr of the kdump image loaded using the kexec_load system call. However, it is possible that architectures may need to update kexec segments other then elfcorehdr. For example, FDT (Flatten Device Tree) on PowerPC. Introducing a new kexec flag for every new kexec segment may not be a good solution. Hence, a generic kexec flag bit, `KEXEC_CRASH_HOTPLUG_SUPPORT`, is introduced to share the CPU/Memory hotplug support intent between the kexec tool and the kernel for the kexec_load system call. Now we have two kexec flags that enables crash hotplug support for kexec_load system call. First is KEXEC_UPDATE_ELFCOREHDR (only used in x86), and second is KEXEC_CRASH_HOTPLUG_SUPPORT (for all architectures). To simplify the process of finding and reporting the crash hotplug support the following changes are introduced. 1. Define arch specific function to process the kexec flags and determine crash hotplug support 2. Rename the @update_elfcorehdr member of struct kimage to @hotplug_support and populate it for both kexec_load and kexec_file_load syscalls, because architecture can update more than one kexec segment 3. Let generic function crash_check_hotplug_support report hotplug support for loaded kdump image based on value of @hotplug_support To bring the x86 crash hotplug support in line with the above points, the following changes have been made: - Introduce the arch_crash_hotplug_support function to process kexec flags and determine crash hotplug support - Remove the arch_crash_hotplug_[cpu\|memory]_support functions Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240326055413.186534-3-sourabhjain@linux.ibm.com	2024-04-23 14:59:01 +10:00
Sourabh Jain	118005713e	crash: forward memory_notify arg to arch crash hotplug handler In the event of memory hotplug or online/offline events, the crash memory hotplug notifier `crash_memhp_notifier()` receives a `memory_notify` object but doesn't forward that object to the generic and architecture-specific crash hotplug handler. The `memory_notify` object contains the starting PFN (Page Frame Number) and the number of pages in the hot-removed memory. This information is necessary for architectures like PowerPC to update/recreate the kdump image, specifically `elfcorehdr`. So update the function signature of `crash_handle_hotplug_event()` and `arch_crash_handle_hotplug_event()` to accept the `memory_notify` object as an argument from crash memory hotplug notifier. Since no such object is available in the case of CPU hotplug event, the crash CPU hotplug notifier `crash_cpuhp_online()` passes NULL to the crash hotplug handler. Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240326055413.186534-2-sourabhjain@linux.ibm.com	2024-04-23 14:59:01 +10:00
Jinjie Ruan	bb58c1baa5	genirq: Simplify the checks for irq_set_percpu_devid_partition() Since whether desc is NULL or desc->percpu_enabled is true, it returns -EINVAL, check them together, and assign desc->percpu_affinity using a ternary to simplify the code. Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240417085356.3785381-1-ruanjinjie@huawei.com	2024-04-23 00:17:07 +02:00
Xiu Jianfeng	8996f93fc3	cgroup/cpuset: Statically initialize more members of top_cpuset Initializing top_cpuset.relax_domain_level and setting CS_SCHED_LOAD_BALANCE to top_cpuset.flags in cpuset_init() could be completed at the time of top_cpuset definition by compiler. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-04-22 09:51:33 -10:00
Rafael Passos	a7de265cb2	bpf: Fix typos in comments Found the following typos in comments, and fixed them: s/unpriviledged/unprivileged/ s/reponsible/responsible/ s/possiblities/possibilities/ s/Divison/Division/ s/precsion/precision/ s/havea/have a/ s/reponsible/responsible/ s/responsibile/responsible/ s/tigher/tighter/ s/respecitve/respective/ Signed-off-by: Rafael Passos <rafael@rcpassos.me> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/6af7deb4-bb24-49e8-b3f1-8dd410597337@smtp-relay.sendinblue.com	2024-04-22 17:48:08 +02:00
Rafael Passos	e1a7545981	bpf: Fix typo in function save_aux_ptr_type I found this typo in the save_aux_ptr_type function. s/allow_trust_missmatch/allow_trust_mismatch/ I did not find this anywhere else in the codebase. Signed-off-by: Rafael Passos <rafael@rcpassos.me> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/fbe1d636-8172-4698-9a5a-5a3444b55322@smtp-relay.sendinblue.com	2024-04-22 17:12:05 +02:00
Jiapeng Chong	b7c8e1f8a7	hrtimer: Rename __hrtimer_hres_active() to hrtimer_hres_active() The function hrtimer_hres_active() are defined in the hrtimer.c file, but not called elsewhere, so rename __hrtimer_hres_active() to hrtimer_hres_active() and remove the old hrtimer_hres_active() function. kernel/time/hrtimer.c:653:19: warning: unused function 'hrtimer_hres_active'. Fixes: `82ccdf062a` ("hrtimer: Remove unused function") Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Link: https://lore.kernel.org/r/20240418023000.130324-1-jiapeng.chong@linux.alibaba.com Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8778	2024-04-22 16:13:19 +02:00
Xuewen Yan	1560d1f6eb	sched/eevdf: Prevent vlag from going out of bounds in reweight_eevdf() It was possible to have pick_eevdf() return NULL, which then causes a NULL-deref. This turned out to be due to entity_eligible() returning falsely negative because of a s64 multiplcation overflow. Specifically, reweight_eevdf() computes the vlag without considering the limit placed upon vlag as update_entity_lag() does, and then the scaling multiplication (remember that weight is 20bit fixed point) can overflow. This then leads to the new vruntime being weird which then causes the above entity_eligible() to go side-ways and claim nothing is eligible. Thus limit the range of vlag accordingly. All this was quite rare, but fatal when it does happen. Closes: https://lore.kernel.org/all/ZhuYyrh3mweP_Kd8@nz.home/ Closes: https://lore.kernel.org/all/CA+9S74ih+45M_2TPUY_mPPVDhNvyYfy1J1ftSix+KjiTVxg8nw@mail.gmail.com/ Closes: https://lore.kernel.org/lkml/202401301012.2ed95df0-oliver.sang@intel.com/ Fixes: `eab03c23c2` ("sched/eevdf: Fix vruntime adjustment on reweight") Reported-by: Sergei Trofimovich <slyich@gmail.com> Reported-by: Igor Raits <igor@gooddata.com> Reported-by: Breno Leitao <leitao@debian.org> Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Yujie Liu <yujie.liu@intel.com> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Reviewed-and-tested-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240422082238.5784-1-xuewen.yan@unisoc.com	2024-04-22 13:01:27 +02:00
Tianchen Ding	afae8002b4	sched/eevdf: Fix miscalculation in reweight_entity() when se is not curr reweight_eevdf() only keeps V unchanged inside itself. When se != cfs_rq->curr, it would be dequeued from rb tree first. So that V is changed and the result is wrong. Pass the original V to reweight_eevdf() to fix this issue. Fixes: `eab03c23c2` ("sched/eevdf: Fix vruntime adjustment on reweight") Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> [peterz: flip if() condition for clarity] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Abel Wu <wuyun.abel@bytedance.com> Link: https://lkml.kernel.org/r/20240306022133.81008-3-dtcccc@linux.alibaba.com	2024-04-22 13:01:26 +02:00
Tianchen Ding	11b1b8bc2b	sched/eevdf: Always update V if se->on_rq when reweighting reweight_eevdf() needs the latest V to do accurate calculation for new ve and vd. So update V unconditionally when se is runnable. Fixes: `eab03c23c2` ("sched/eevdf: Fix vruntime adjustment on reweight") Suggested-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Abel Wu <wuyun.abel@bytedance.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Link: https://lore.kernel.org/r/20240306022133.81008-2-dtcccc@linux.alibaba.com	2024-04-22 13:01:26 +02:00
Thomas Weißschuh	bfa858f220	sysctl: treewide: constify ctl_table_header::ctl_table_arg To be able to constify instances of struct ctl_tables it is necessary to remove ways through which non-const versions are exposed from the sysctl core. One of these is the ctl_table_arg member of struct ctl_table_header. Constify this reference as a prerequisite for the full constification of struct ctl_table instances. No functional change. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-22 08:56:31 +01:00
Linus Torvalds	3b68086599	- Add a missing memory barrier in the concurrency ID mm switching -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmYk064ACgkQEsHwGGHe VUo+Gg/9GvZD7InmNHQx+TZORbCphljcx110JxB0wh6l1hTdqy57VbLjZtnztCVt EkFGku/5DEDLQVDD1VHvR6w8w3rayqBAhbGApnJfVGEDJfqyYBGXG4Nhx6qsJD5F YBtgk+HZjZ7G5kmGlokgNKXmh4NadUwADbkBaV+iSXBQs+Teld1THJFylO7ju+rZ jUttExna3YhsiHdcRwBhlxFwHm0SXsd591n9slTxNoIRg19SVAvfkYmcoL2W70c0 qtHtaVS37sHInFuVzI53W82hNfGLLgPt+vkQEtiG00CwR4956I2LkorsfAG3JwAA CPlbgO3ypwWSJjSdRrvYkjn7pw9JtlMQMpVh+ypaMgQ1zucb6FpDEzHnAA4lbgNW u31ALXlqhUcnKC/3HZC+vPQosRNSyBP2rWtNNdMZ3YKrW4vU/5zII+4foZj9kv3u irH27Suh6aKMc1y6THNDZmwcgJIFRtvr3K1fm7WwM0z5qz5uNq3sOW5RIQTTL3ti +lW7pK1NqUHZhTsOs09+kt/M7vfxyrMqk5qma18kgdgkaRCwCSJLcPGICVCM5WmD mnLsbs15PIKdN4cwfqeC0KzWr/xroPn1VHiWgU9zZeXcEVss0+UDcMbopQERsXH0 6AJV5A0/H/PVfDFKs5aTdwzQXB3LKG2AGdGoX7KpN8vtHf5TFt4= =YLBT -----END PGP SIGNATURE----- Merge tag 'sched_urgent_for_v6.9_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: - Add a missing memory barrier in the concurrency ID mm switching * tag 'sched_urgent_for_v6.9_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Add missing memory barrier in switch_mm_cid	2024-04-21 09:39:36 -07:00
Linus Torvalds	2d412262cc	hardening fixes for v6.9-rc5 - Correctly disable UBSAN configs in configs/hardening (Nathan Chancellor) - Add missing signed integer overflow trap types to arm64 handler -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmYi0M8WHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJjPrEACrLlPZUPLiJPlPdYC5bW4lLSgZ v6z5XjeEVWVvIlzW3DKPKvzMmIl6D3CTN6KbgdjHR+s5VYGYVlkoQJw09SSBu1OX yFC2i0lyqUKuAmh6jK1T46kXvbrgK/3ClO3nQk7KotTfvuRcorAcGEmayTYnaWqd JX3qyry2oEiQG9pWDHl9bRQ1ZgbNdNkxR2YYIhy88lrMWORdVNG7PFkgNnVsEbnb UAbXl817//TSTuXUwklTllz0UNInmDmQTjrmMRUhiwTKEs8aRS5VX6biSyc1Fucz KYXNeK9ciV80mQYnj7jDxgXC5jNThtrjEokzht8vvZGHcBp3WMr6CJLmwj9aaSXE edib7mJf/YveJTCPN17xAvIMHZAFZyoyeiVIOE1Ys2lWSj8rXH5TvWnn/E4QPxHK 77lOKGZNwNMYmIa+L6gb3OOWpiZpOMTLCGMuJh6VSDf5BcA0i45yTxAlAe5JYpgw txxDscFu5MtrabR4Z28J+VY/wnWqQAC89D6qYsJOPH8kL0o3XhELCDKPNUZoY094 LV7XuhAB+xDqNdvZi7SHTmTtZSLPqRBlNrOUqQSmXrwjp11naya26l7fn1Y0cpQM K8o3ioUkSg0PJNox/kGxryouHXXMqtN/k52JPotkfa6XEQDpN82uo0xJD9r21Viu qhA7A8vcQ3KIb0cUbw== =Q6Hg -----END PGP SIGNATURE----- Merge tag 'hardening-v6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fixes from Kees Cook: - Correctly disable UBSAN configs in configs/hardening (Nathan Chancellor) - Add missing signed integer overflow trap types to arm64 handler * tag 'hardening-v6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: ubsan: Add awareness of signed integer overflow traps configs/hardening: Disable CONFIG_UBSAN_SIGNED_WRAP configs/hardening: Fix disabling UBSAN configurations	2024-04-19 14:10:11 -07:00
Xiu Jianfeng	19fc8a8965	cgroup: Avoid unnecessary looping in cgroup_no_v1() No need to continue the for_each_subsys loop after the token matches the name of subsys and cgroup_no_v1_mask is set. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-04-19 05:43:36 -10:00
Paolo Bonzini	a96cb3bf39	Merge x86 bugfixes from Linux 6.9-rc3 Pull fix for SEV-SNP late disable bugs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-19 09:02:22 -04:00
Jakub Kicinski	41e3ddb291	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: include/trace/events/rpcgss.h `386f4a7379` ("trace: events: cleanup deprecated strncpy uses") `a4833e3aba` ("SUNRPC: Fix rpcgss_context trace event acceptor field") Adjacent changes: drivers/net/ethernet/intel/ice/ice_tc_lib.c `2cca35f5dd` ("ice: Fix checking for unsupported keys on non-tunnel device") `784feaa65d` ("ice: Add support for PFCP hardware offload in switchdev") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-18 13:12:24 -07:00
Xiu Jianfeng	f71bfbe1e2	cgroup, legacy_freezer: update comment for freezer_css_offline() After commit `f5d39b0208` ("freezer,sched: Rewrite core freezer logic"), system_freezing_count was replaced by freezer_active. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-04-18 06:01:50 -10:00
Xiu Jianfeng	15a0b5fe1a	cgroup: don't call cgroup1_pidlist_destroy_all() for v2 Currently cgroup1_pidlist_destroy_all() will be called when releasing cgroup even if the cgroup is on default hierarchy, however it doesn't make any sense for v2 to destroy pidlist of v1. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-04-18 05:56:58 -10:00
Xiu Jianfeng	a6b8daba00	cgroup_freezer: update comment for freezer_css_online() The freezer->lock was replaced by freezer_mutex in commit `e5ced8ebb1` ("cgroup_freezer: replace freezer->lock with freezer_mutex"), so the comment here is out-of-date, update it. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-04-17 15:59:12 -10:00
Jesper Dangaard Brouer	97a46a66ad	cgroup/rstat: desc member cgrp in cgroup_rstat_flush_release Recent change to cgroup_rstat_flush_release added a parameter cgrp, which is used by tracepoint to correlate with other tracepoints that also have this cgrp. The kernel test robot detected kernel doc was missing a description of this member. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202404170821.HwZGISTY-lkp@intel.com/ Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-04-17 15:56:30 -10:00
Alexander Gordeev	89d6910cc5	sched/vtime: Get rid of generic vtime_task_switch() implementation The generic vtime_task_switch() implementation gets built only if __ARCH_HAS_VTIME_TASK_SWITCH is not defined, but requires an architecture to implement arch_vtime_task_switch() callback at the same time, which is confusing. Further, arch_vtime_task_switch() is implemented for 32-bit PowerPC architecture only and vtime_task_switch() generic variant is rather superfluous. Simplify the whole vtime_task_switch() wiring by moving the existing generic implementation to PowerPC. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/2cb6e3caada93623f6d4f78ad938ac6cd0e2fda8.1712760275.git.agordeev@linux.ibm.com	2024-04-17 13:37:20 +02:00
Miaohe Lin	35e351780f	fork: defer linking file vma until vma is fully initialized Thorvald reported a WARNING [1]. And the root cause is below race: CPU 1 CPU 2 fork hugetlbfs_fallocate dup_mmap hugetlbfs_punch_hole i_mmap_lock_write(mapping); vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree. i_mmap_unlock_write(mapping); hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem! i_mmap_lock_write(mapping); hugetlb_vmdelete_list vma_interval_tree_foreach hugetlb_vma_trylock_write -- Vma_lock is cleared. tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem! hugetlb_vma_unlock_write -- Vma_lock is assigned!!! i_mmap_unlock_write(mapping); hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside i_mmap_rwsem lock while vma lock can be used in the same time. Fix this by deferring linking file vma until vma is fully initialized. Those vmas should be initialized first before they can be used. Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@huawei.com Fixes: `8d9bfb2608` ("hugetlb: add vma based lock for pmd sharing") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reported-by: Thorvald Natvig <thorvald@google.com> Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1] Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Tycho Andersen <tandersen@netflix.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-16 15:39:51 -07:00
Jesper Dangaard Brouer	fc29e04ae1	cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints This commit enhances the ability to troubleshoot the global cgroup_rstat_lock by introducing wrapper helper functions for the lock along with associated tracepoints. Although global, the cgroup_rstat_lock helper APIs and tracepoints take arguments such as cgroup pointer and cpu_in_loop variable. This adjustment is made because flushing occurs per cgroup despite the lock being global. Hence, when troubleshooting, it's important to identify the relevant cgroup. The cpu_in_loop variable is necessary because the global lock may be released within the main flushing loop that traverses CPUs. In the tracepoints, the cpu_in_loop value is set to -1 when acquiring the main lock; otherwise, it denotes the CPU number processed last. The new feature in this patchset is detecting when lock is contended. The tracepoints are implemented with production in mind. For minimum overhead attach to cgroup:cgroup_rstat_lock_contended, which only gets activated when trylock detects lock is contended. A quick production check for issues could be done via this perf commands: perf record -g -e cgroup:cgroup_rstat_lock_contended Next natural question would be asking how long time do lock contenders wait for obtaining the lock. This can be answered by measuring the time between cgroup:cgroup_rstat_lock_contended and cgroup:cgroup_rstat_locked when args->contended is set. Like this bpftrace script: bpftrace -e ' tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs} tracepoint:cgroup:cgroup_rstat_locked { if (args->contended) { @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}} interval:s:1 {time("%H:%M:%S "); print(@wait_ns); }' Extending with time spend holding the lock will be more expensive as this also looks at all the non-contended cases. Like this bpftrace script: bpftrace -e ' tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs} tracepoint:cgroup:cgroup_rstat_locked { @locked[tid]=nsecs; if (args->contended) { @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}} tracepoint:cgroup:cgroup_rstat_unlock { @locked_ns=hist(nsecs-@locked[tid]); delete(@locked[tid]);} interval:s:1 {time("%H:%M:%S "); print(@wait_ns);print(@locked_ns); }' Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-04-16 12:10:42 -10:00
Michal Koutný	15b8b9ab50	cgroup/pids: Remove superfluous zeroing Atomic counters are in kzalloc'd struct. They are zeroed already and atomic64_t does not need special initialization (cf kernel/trace/trace_clock.c:trace_counter). Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-04-16 09:18:36 -10:00
Harishankar Vishwanathan	1f586614f3	bpf: Harden and/or/xor value tracking in verifier This patch addresses a latent unsoundness issue in the scalar(32)_min_max_and/or/xor functions. While it is not a bugfix, it ensures that the functions produce sound outputs for all inputs. The issue occurs in these functions when setting signed bounds. The following example illustrates the issue for scalar_min_max_and(), but it applies to the other functions. In scalar_min_max_and() the following clause is executed when ANDing positive numbers: /* ANDing two positives gives a positive, so safe to * cast result into s64. / dst_reg->smin_value = dst_reg->umin_value; dst_reg->smax_value = dst_reg->umax_value; However, if umin_value and umax_value of dst_reg cross the sign boundary (i.e., if (s64)dst_reg->umin_value > (s64)dst_reg->umax_value), then we will end up with smin_value > smax_value, which is unsound. Previous works [1, 2] have discovered and reported this issue. Our tool Agni [2, 3] consideres it a false positive. This is because, during the verification of the abstract operator scalar_min_max_and(), Agni restricts its inputs to those passing through reg_bounds_sync(). This mimics real-world verifier behavior, as reg_bounds_sync() is invariably executed at the tail of every abstract operator. Therefore, such behavior is unlikely in an actual verifier execution. However, it is still unsound for an abstract operator to set signed bounds such that smin_value > smax_value. This patch fixes it, making the abstract operator sound for all (well-formed) inputs. It is worth noting that while the previous code updated the signed bounds (using the output unsigned bounds) only when the input signed* bounds were positive, the new code updates them whenever the output unsigned bounds do not cross the sign boundary. An alternative approach to fix this latent unsoundness would be to unconditionally set the signed bounds to unbounded [S64_MIN, S64_MAX], and let reg_bounds_sync() refine the signed bounds using the unsigned bounds and the tnum. We found that our approach produces more precise (tighter) bounds. For example, consider these inputs to BPF_AND: /* dst_reg / var_off.value: 8608032320201083347 var_off.mask: 615339716653692460 smin_value: 8070450532247928832 smax_value: 8070450532247928832 umin_value: 13206380674380886586 umax_value: 13206380674380886586 s32_min_value: -2110561598 s32_max_value: -133438816 u32_min_value: 4135055354 u32_max_value: 4135055354 / src_reg */ var_off.value: 8584102546103074815 var_off.mask: 9862641527606476800 smin_value: 2920655011908158522 smax_value: 7495731535348625717 umin_value: 7001104867969363969 umax_value: 8584102543730304042 s32_min_value: -2097116671 s32_max_value: 71704632 u32_min_value: 1047457619 u32_max_value: 4268683090 After going through tnum_and() -> scalar32_min_max_and() -> scalar_min_max_and() -> reg_bounds_sync(), our patch produces the following bounds for s32: s32_min_value: -1263875629 s32_max_value: -159911942 Whereas, setting the signed bounds to unbounded in scalar_min_max_and() produces: s32_min_value: -1263875629 s32_max_value: -1 As observed, our patch produces a tighter s32 bound. We also confirmed using Agni and SMT verification that our patch always produces signed bounds that are equal to or more precise than setting the signed bounds to unbounded in scalar_min_max_and(). [1] https://sanjit-bhat.github.io/assets/pdf/ebpf-verifier-range-analysis22.pdf [2] https://link.springer.com/chapter/10.1007/978-3-031-37709-9_12 [3] https://github.com/bpfverif/agni Co-developed-by: Matan Shachnai <m.shachnai@rutgers.edu> Signed-off-by: Matan Shachnai <m.shachnai@rutgers.edu> Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240402212039.51815-1-harishankar.vishwanathan@gmail.com Link: https://lore.kernel.org/bpf/20240416115303.331688-1-harishankar.vishwanathan@gmail.com	2024-04-16 17:55:27 +02:00
Ard Biesheuvel	fc5eb4a84e	btf: Avoid weak external references If the BTF code is enabled in the build configuration, the start/stop BTF markers are guaranteed to exist. Only when CONFIG_DEBUG_INFO_BTF=n, the references in btf_parse_vmlinux() will remain unsatisfied, relying on the weak linkage of the external references to avoid breaking the build. Avoid GOT based relocations to these markers in the final executable by dropping the weak attribute and instead, make btf_parse_vmlinux() return ERR_PTR(-ENOENT) directly if CONFIG_DEBUG_INFO_BTF is not enabled to begin with. The compiler will drop any subsequent references to __start_BTF and __stop_BTF in that case, allowing the link to succeed. Note that Clang will notice that taking the address of __start_BTF can no longer yield NULL, so testing for that condition becomes unnecessary. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240415162041.2491523-8-ardb+git@google.com	2024-04-16 16:35:13 +02:00
Mathieu Desnoyers	fe90f3967b	sched: Add missing memory barrier in switch_mm_cid Many architectures' switch_mm() (e.g. arm64) do not have an smp_mb() which the core scheduler code has depended upon since commit: commit `223baf9d17` ("sched: Fix performance regression introduced by mm_cid") If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear() can unset the actively used cid when it fails to observe active task after it sets lazy_put. There is a memory barrier between storing to rq->curr and _return to userspace_ (as required by membarrier), but the rseq mm_cid has stricter requirements: the barrier needs to be issued between store to rq->curr and switch_mm_cid(), which happens earlier than: - spin_unlock(), - switch_to(). So it's fine when the architecture switch_mm() happens to have that barrier already, but less so when the architecture only provides the full barrier in switch_to() or spin_unlock(). It is a bug in the rseq switch_mm_cid() implementation. All architectures that don't have memory barriers in switch_mm(), but rather have the full barrier either in finish_lock_switch() or switch_to() have them too late for the needs of switch_mm_cid(). Introduce a new smp_mb__after_switch_mm(), defined as smp_mb() in the generic barrier.h header, and use it in switch_mm_cid() for scheduler transitions where switch_mm() is expected to provide a memory barrier. Architectures can override smp_mb__after_switch_mm() if their switch_mm() implementation provides an implicit memory barrier. Override it with a no-op on x86 which implicitly provide this memory barrier by writing to CR3. Fixes: `223baf9d17` ("sched: Fix performance regression introduced by mm_cid") Reported-by: levi.yun <yeoreum.yun@arm.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> # for arm64 Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for x86 Cc: <stable@vger.kernel.org> # 6.4.x Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240415152114.59122-2-mathieu.desnoyers@efficios.com	2024-04-16 13:59:45 +02:00
Zqiang	1c67318b3d	rcutorture: Use rcu_gp_slow_register/unregister() only for rcutype test The rcu_gp_slow_register/unregister() is only useful in tests where torture_type=rcu, so this commit therefore generates ->gp_slow_register() and ->gp_slow_unregister() function pointers in the rcu_torture_ops structure, and slows grace periods only when these function pointers exist. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-16 11:16:36 +02:00
Zqiang	668c0406d8	rcutorture: Fix invalid context warning when enable srcu barrier testing When the torture_type is set srcu or srcud and cb_barrier is non-zero, running the rcutorture test will trigger the following warning: [ 163.910989][ C1] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 [ 163.910994][ C1] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1 [ 163.910999][ C1] preempt_count: 10001, expected: 0 [ 163.911002][ C1] RCU nest depth: 0, expected: 0 [ 163.911005][ C1] INFO: lockdep is turned off. [ 163.911007][ C1] irq event stamp: 30964 [ 163.911010][ C1] hardirqs last enabled at (30963): [<ffffffffabc7df52>] do_idle+0x362/0x500 [ 163.911018][ C1] hardirqs last disabled at (30964): [<ffffffffae616eff>] sysvec_call_function_single+0xf/0xd0 [ 163.911025][ C1] softirqs last enabled at (0): [<ffffffffabb6475f>] copy_process+0x16ff/0x6580 [ 163.911033][ C1] softirqs last disabled at (0): [<0000000000000000>] 0x0 [ 163.911038][ C1] Preemption disabled at: [ 163.911039][ C1] [<ffffffffacf1964b>] stack_depot_save_flags+0x24b/0x6c0 [ 163.911063][ C1] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 6.8.0-rc4-rt4-yocto-preempt-rt+ #3 1e39aa9a737dd024a3275c4f835a872f673a7d3a [ 163.911071][ C1] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 163.911075][ C1] Call Trace: [ 163.911078][ C1] <IRQ> [ 163.911080][ C1] dump_stack_lvl+0x88/0xd0 [ 163.911089][ C1] dump_stack+0x10/0x20 [ 163.911095][ C1] __might_resched+0x36f/0x530 [ 163.911105][ C1] rt_spin_lock+0x82/0x1c0 [ 163.911112][ C1] spin_lock_irqsave_ssp_contention+0xb8/0x100 [ 163.911121][ C1] srcu_gp_start_if_needed+0x782/0xf00 [ 163.911128][ C1] ? _raw_spin_unlock_irqrestore+0x46/0x70 [ 163.911136][ C1] ? debug_object_active_state+0x336/0x470 [ 163.911148][ C1] ? __pfx_srcu_gp_start_if_needed+0x10/0x10 [ 163.911156][ C1] ? __pfx_lock_release+0x10/0x10 [ 163.911165][ C1] ? __pfx_rcu_torture_barrier_cbf+0x10/0x10 [ 163.911188][ C1] __call_srcu+0x9f/0xe0 [ 163.911196][ C1] call_srcu+0x13/0x20 [ 163.911201][ C1] srcu_torture_call+0x1b/0x30 [ 163.911224][ C1] rcu_torture_barrier1cb+0x4a/0x60 [ 163.911247][ C1] __flush_smp_call_function_queue+0x267/0xca0 [ 163.911256][ C1] ? __pfx_rcu_torture_barrier1cb+0x10/0x10 [ 163.911281][ C1] generic_smp_call_function_single_interrupt+0x13/0x20 [ 163.911288][ C1] __sysvec_call_function_single+0x7d/0x280 [ 163.911295][ C1] sysvec_call_function_single+0x93/0xd0 [ 163.911302][ C1] </IRQ> [ 163.911304][ C1] <TASK> [ 163.911308][ C1] asm_sysvec_call_function_single+0x1b/0x20 [ 163.911313][ C1] RIP: 0010:default_idle+0x17/0x20 [ 163.911326][ C1] RSP: 0018:ffff888001997dc8 EFLAGS: 00000246 [ 163.911333][ C1] RAX: 0000000000000000 RBX: dffffc0000000000 RCX: ffffffffae618b51 [ 163.911337][ C1] RDX: 0000000000000000 RSI: ffffffffaea80920 RDI: ffffffffaec2de80 [ 163.911342][ C1] RBP: ffff888001997dc8 R08: 0000000000000001 R09: ffffed100d740cad [ 163.911346][ C1] R10: ffffed100d740cac R11: ffff88806ba06563 R12: 0000000000000001 [ 163.911350][ C1] R13: ffffffffafe460c0 R14: ffffffffafe460c0 R15: 0000000000000000 [ 163.911358][ C1] ? ct_kernel_exit.constprop.3+0x121/0x160 [ 163.911369][ C1] ? lockdep_hardirqs_on+0xc4/0x150 [ 163.911376][ C1] arch_cpu_idle+0x9/0x10 [ 163.911383][ C1] default_idle_call+0x7a/0xb0 [ 163.911390][ C1] do_idle+0x362/0x500 [ 163.911398][ C1] ? __pfx_do_idle+0x10/0x10 [ 163.911404][ C1] ? complete_with_flags+0x8b/0xb0 [ 163.911416][ C1] cpu_startup_entry+0x58/0x70 [ 163.911423][ C1] start_secondary+0x221/0x280 [ 163.911430][ C1] ? __pfx_start_secondary+0x10/0x10 [ 163.911440][ C1] secondary_startup_64_no_verify+0x17f/0x18b [ 163.911455][ C1] </TASK> This commit therefore use smp_call_on_cpu() instead of smp_call_function_single(), make rcu_torture_barrier1cb() invoked happens on task-context. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-16 11:16:36 +02:00
Zqiang	431315a563	rcutorture: Make stall-tasks directly exit when rcutorture tests end When the rcutorture tests start to exit, the rcu_torture_cleanup() is invoked to stop kthreads and release resources, if the stall-task kthreads exist, cpu-stall has started and the rcutorture.stall_cpu is set to a larger value, the rcu_torture_cleanup() will be blocked for a long time and the hung-task may occur, this commit therefore add kthread_should_stop() to the loop of cpu-stall operation, when rcutorture tests ends, no need to wait for cpu-stall to end, exit directly. Use the following command to test: insmod rcutorture.ko torture_type=srcu fwd_progress=0 stat_interval=4 stall_cpu_block=1 stall_cpu=200 stall_cpu_holdoff=10 read_exit_burst=0 object_debug=1 rmmod rcutorture [15361.918610] INFO: task rmmod:878 blocked for more than 122 seconds. [15361.918613] Tainted: G W 6.8.0-rc2-yoctodev-standard+ #25 [15361.918615] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [15361.918616] task:rmmod state:D stack:0 pid:878 tgid:878 ppid:773 flags:0x00004002 [15361.918621] Call Trace: [15361.918623] <TASK> [15361.918626] __schedule+0xc0d/0x28f0 [15361.918631] ? __pfx___schedule+0x10/0x10 [15361.918635] ? rcu_is_watching+0x19/0xb0 [15361.918638] ? schedule+0x1f6/0x290 [15361.918642] ? __pfx_lock_release+0x10/0x10 [15361.918645] ? schedule+0xc9/0x290 [15361.918648] ? schedule+0xc9/0x290 [15361.918653] ? trace_preempt_off+0x54/0x100 [15361.918657] ? schedule+0xc9/0x290 [15361.918661] schedule+0xd0/0x290 [15361.918665] schedule_timeout+0x56d/0x7d0 [15361.918669] ? debug_smp_processor_id+0x1b/0x30 [15361.918672] ? rcu_is_watching+0x19/0xb0 [15361.918676] ? __pfx_schedule_timeout+0x10/0x10 [15361.918679] ? debug_smp_processor_id+0x1b/0x30 [15361.918683] ? rcu_is_watching+0x19/0xb0 [15361.918686] ? wait_for_completion+0x179/0x4c0 [15361.918690] ? __pfx_lock_release+0x10/0x10 [15361.918693] ? __kasan_check_write+0x18/0x20 [15361.918696] ? wait_for_completion+0x9d/0x4c0 [15361.918700] ? _raw_spin_unlock_irq+0x36/0x50 [15361.918703] ? wait_for_completion+0x179/0x4c0 [15361.918707] ? _raw_spin_unlock_irq+0x36/0x50 [15361.918710] ? wait_for_completion+0x179/0x4c0 [15361.918714] ? trace_preempt_on+0x54/0x100 [15361.918718] ? wait_for_completion+0x179/0x4c0 [15361.918723] wait_for_completion+0x181/0x4c0 [15361.918728] ? __pfx_wait_for_completion+0x10/0x10 [15361.918738] kthread_stop+0x152/0x470 [15361.918742] _torture_stop_kthread+0x44/0xc0 [torture 7af7f9cbba28271a10503b653f9e05d518fbc8c3] [15361.918752] rcu_torture_cleanup+0x2ac/0xe90 [rcutorture f2cb1f556ee7956270927183c4c2c7749a336529] [15361.918766] ? __pfx_rcu_torture_cleanup+0x10/0x10 [rcutorture f2cb1f556ee7956270927183c4c2c7749a336529] [15361.918777] ? __kasan_check_write+0x18/0x20 [15361.918781] ? __mutex_unlock_slowpath+0x17c/0x670 [15361.918789] ? __might_fault+0xcd/0x180 [15361.918793] ? find_module_all+0x104/0x1d0 [15361.918799] __x64_sys_delete_module+0x2a4/0x3f0 [15361.918803] ? __pfx___x64_sys_delete_module+0x10/0x10 [15361.918807] ? syscall_exit_to_user_mode+0x149/0x280 Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-16 11:16:36 +02:00
Zqiang	710cf51d37	rcutorture: Removing redundant function pointer initialization For these rcu_torture_ops structure's objects defined by using static, if the value of the function pointer in its member is not set, the default value will be NULL, this commit therefore remove the pre-existing initialization of function pointers to NULL. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-16 11:16:36 +02:00
Zqiang	dddcddef14	rcutorture: Make rcutorture support print rcu-tasks gp state This commit make rcu-tasks related rcutorture test support rcu-tasks gp state printing when the writer stall occurs or the at the end of rcutorture test, and generate rcu_ops->get_gp_data() operation to simplify the acquisition of gp state for different types of rcutorture tests. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-16 11:16:35 +02:00
Zqiang	e38bf06d50	rcutorture: Use the gp_kthread_dbg operation specified by cur_ops Despite there being a cur_ops->gp_kthread_dbg(), rcu_torture_writer() unconditionally invokes vanilla RCU's show_rcu_gp_kthreads(). This is not at all helpful when some other flavor of RCU is being tested. This commit therefore makes rcu_torture_writer() invoke cur_ops->gp_kthread_dbg() for RCU implementations providing this function. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-16 11:16:35 +02:00
linke li	a10e3cbf32	rcutorture: Re-use value stored to ->rtort_pipe_count instead of re-reading Currently, the rcu_torture_pipe_update_one() writes the value (i + 1) to rp->rtort_pipe_count, then immediately re-reads it in order to compare it to RCU_TORTURE_PIPE_LEN. This re-read is pointless because no other update to rp->rtort_pipe_count can occur at this point. This commit therefore instead re-uses the (i + 1) value stored in the comparison instead of re-reading rp->rtort_pipe_count. Signed-off-by: linke li <lilinke99@qq.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-16 11:16:35 +02:00
Paul E. McKenney	8b9b443fa8	rcutorture: Fix rcu_torture_one_read() pipe_count overflow comment The "pipe_count > RCU_TORTURE_PIPE_LEN" check has a comment saying "Should not happen, but...". This is only true when testing an RCU whose grace periods are always long enough. This commit therefore fixes this comment. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Closes: https://lore.kernel.org/lkml/CAHk-=wi7rJ-eGq+xaxVfzFEgbL9tdf6Kc8Z89rCpfcQOKm74Tw@mail.gmail.com/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-16 11:16:35 +02:00
Paul E. McKenney	8d0f9a6639	rcutorture: Remove extraneous rcu_torture_pipe_update_one() READ_ONCE() The rcu_torture_pipe_update_one() cannot run concurrently with any updates of ->rtort_pipe_count, so this commit removes the extraneous READ_ONCE() from the read from this field. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Closes: https://lore.kernel.org/lkml/CAHk-=wiX_zF5Mpt8kUm_LFQpYY-mshrXJPOe+wKNwiVhEUcU9g@mail.gmail.com/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-16 11:16:25 +02:00
Nathan Chancellor	7fcb91d94e	configs/hardening: Disable CONFIG_UBSAN_SIGNED_WRAP kernel/configs/hardening.config turns on UBSAN for the bounds sanitizer, as that in combination with trapping can stop the exploitation of buffer overflows within the kernel. At the same time, hardening.config turns off every other UBSAN sanitizer because trapping means all UBSAN reports will be fatal and the problems brought up by other sanitizers generally do not have security implications. The signed integer overflow sanitizer was recently added back to the kernel and it is default on with just CONFIG_UBSAN=y, meaning that it gets enabled when merging hardening.config into another configuration. While this sanitizer does have security implications like the array bounds sanitizer, work to clean up enough instances to allow this to run in production environments is still ramping up, which means regular users and testers may be broken by these instances with CONFIG_UBSAN_TRAP=y. Disable CONFIG_UBSAN_SIGNED_WRAP in hardening.config to avoid this situation. Fixes: `557f8c582a` ("ubsan: Reintroduce signed overflow sanitizer") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20240411-fix-ubsan-in-hardening-config-v1-2-e0177c80ffaa@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2024-04-15 11:08:24 -07:00
Nathan Chancellor	e048d668f2	configs/hardening: Fix disabling UBSAN configurations The initial change that added kernel/configs/hardening.config attempted to disable all UBSAN sanitizers except for the array bounds one while turning on UBSAN_TRAP. Unfortunately, it only got the syntax for CONFIG_UBSAN_SHIFT correct, so configurations that are on by default with CONFIG_UBSAN=y such as CONFIG_UBSAN_{BOOL,ENUM} do not get disabled properly. CONFIG_ARCH_HAS_UBSAN=y CONFIG_UBSAN=y CONFIG_UBSAN_TRAP=y CONFIG_CC_HAS_UBSAN_BOUNDS_STRICT=y CONFIG_UBSAN_BOUNDS=y CONFIG_UBSAN_BOUNDS_STRICT=y # CONFIG_UBSAN_SHIFT is not set # CONFIG_UBSAN_DIV_ZERO is not set # CONFIG_UBSAN_UNREACHABLE is not set CONFIG_UBSAN_SIGNED_WRAP=y CONFIG_UBSAN_BOOL=y CONFIG_UBSAN_ENUM=y # CONFIG_TEST_UBSAN is not set Add the missing 'is not set' to each configuration that needs it so that they get disabled as intended. CONFIG_ARCH_HAS_UBSAN=y CONFIG_UBSAN=y CONFIG_UBSAN_TRAP=y CONFIG_CC_HAS_UBSAN_BOUNDS_STRICT=y CONFIG_UBSAN_BOUNDS=y CONFIG_UBSAN_BOUNDS_STRICT=y # CONFIG_UBSAN_SHIFT is not set # CONFIG_UBSAN_DIV_ZERO is not set # CONFIG_UBSAN_UNREACHABLE is not set CONFIG_UBSAN_SIGNED_WRAP=y # CONFIG_UBSAN_BOOL is not set # CONFIG_UBSAN_ENUM is not set # CONFIG_TEST_UBSAN is not set Fixes: `215199e3d9` ("hardening: Provide Kconfig fragments for basic options") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20240411-fix-ubsan-in-hardening-config-v1-1-e0177c80ffaa@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2024-04-15 11:08:24 -07:00
Uladzislau Rezki (Sony)	0fd210baa0	rcu: Allocate WQ with WQ_MEM_RECLAIM bit set synchronize_rcu() users have to be processed regardless of memory pressure so our private WQ needs to have at least one execution context what WQ_MEM_RECLAIM flag guarantees. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 19:51:26 +02:00
Uladzislau Rezki (Sony)	462df2f543	rcu: Support direct wake-up of synchronize_rcu() users This patch introduces a small enhancement which allows to do a direct wake-up of synchronize_rcu() callers. It occurs after a completion of grace period, thus by the gp-kthread. Number of clients is limited by the hard-coded maximum allowed threshold. The remaining part, if still exists is deferred to a main worker. Link: https://lore.kernel.org/lkml/Zd0ZtNu+Rt0qXkfS@lothringen/ Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 19:47:51 +02:00
Uladzislau Rezki (Sony)	2053937a31	rcu: Add a trace event for synchronize_rcu_normal() Add an rcu_sr_normal() trace event. It takes three arguments first one is the name of RCU flavour, second one is a user id which triggeres synchronize_rcu_normal() and last one is an event. There are two traces in the synchronize_rcu_normal(). On entry, when a new request is registered and on exit point when request is completed. Please note, CONFIG_RCU_TRACE=y is required to activate traces. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 19:47:51 +02:00
Uladzislau Rezki (Sony)	988f569ae0	rcu: Reduce synchronize_rcu() latency A call to a synchronize_rcu() can be optimized from a latency point of view. Workloads which depend on this can benefit of it. The delay of wakeme_after_rcu() callback, which unblocks a waiter, depends on several factors: - how fast a process of offloading is started. Combination of: - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU; - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY; - other. - when started, invoking path is interrupted due to: - time limit; - need_resched(); - if limit is reached. - where in a nocb list it is located; - how fast previous callbacks completed; Example: 1. On our embedded devices i can easily trigger the scenario when it is a last in the list out of ~3600 callbacks: <snip> <...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28 ... <...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt <...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt <...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt <...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt <...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt <...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt <...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=.... <snip> 2. We use cpuset/cgroup to classify tasks and assign them into different cgroups. For example "backgrond" group which binds tasks only to little CPUs or "foreground" which makes use of all CPUs. Tasks can be migrated between groups by a request if an acceleration is needed. See below an example how "surfaceflinger" task gets migrated. Initially it is located in the "system-background" cgroup which allows to run only on little cores. In order to speed it up it can be temporary moved into "foreground" cgroup which allows to use big/all CPUs: cgroup_attach_task(): -> cgroup_migrate_execute() -> cpuset_can_attach() -> percpu_down_write() -> rcu_sync_enter() -> synchronize_rcu() -> now move tasks to the new cgroup. -> cgroup_migrate_finish() <snip> rcuop/1-29 [000] ..... 7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt PERFD-SERVER-1855 [000] d..1. 7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger TimerDispatch-2768 [002] d..5. 7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4 <snip> "Boosting a task" depends on synchronize_rcu() latency: - first trace shows a completion of synchronize_rcu(); - second shows attaching a task to a new group; - last shows a final step when migration occurs. 3. To address this drawback, maintain a separate track that consists of synchronize_rcu() callers only. After completion of a grace period users are deferred to a dedicated worker to process requests. 4. This patch reduces the latency of synchronize_rcu() approximately by ~30-40% on synthetic tests. The real test case, camera launch time, shows(time is in milliseconds): 1-run 542 vs 489 improvement 9% 2-run 540 vs 466 improvement 13% 3-run 518 vs 468 improvement 9% 4-run 531 vs 457 improvement 13% 5-run 548 vs 475 improvement 13% 6-run 509 vs 484 improvement 4% Synthetic test(no "noise" from other callbacks): Hardware: x86_64 64 CPUs, 64GB of memory Linux-6.6 - 10K tasks(simultaneous); - each task does(1000 loops) synchronize_rcu(); kfree(p); default: CONFIG_RCU_NOCB_CPU: takes 54 seconds to complete all users; patch: CONFIG_RCU_NOCB_CPU: takes 35 seconds to complete all users. Running 60K gives approximately same results on my setup. Please note it is without any interaction with another type of callbacks, otherwise it will impact a lot a default case. 5. By default it is disabled. To enable this perform one of the below sequence: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1" Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Co-developed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 19:47:49 +02:00
Nikita Kiryushin	3758f7d991	rcu: Fix buffer overflow in print_cpu_stall_info() The rcuc-starvation output from print_cpu_stall_info() might overflow the buffer if there is a huge difference in jiffies difference. The situation might seem improbable, but computers sometimes get very confused about time, which can result in full-sized integers, and, in this case, buffer overflow. Also, the unsigned jiffies difference is printed using %ld, which is normally for signed integers. This is intentional for debugging purposes, but it is not obvious from the code. This commit therefore changes sprintf() to snprintf() and adds a clarifying comment about intention of %ld format. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: `245a629825` ("rcu: Dump rcuc kthread status for CPUs not reporting quiescent state") Signed-off-by: Nikita Kiryushin <kiryushin@ancud.ru> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 19:43:50 +02:00
Nikita Kiryushin	cc5645fddb	rcu-tasks: Fix show_rcu_tasks_trace_gp_kthread buffer overflow There is a possibility of buffer overflow in show_rcu_tasks_trace_gp_kthread() if counters, passed to sprintf() are huge. Counter numbers, needed for this are unrealistically high, but buffer overflow is still possible. Use snprintf() with buffer size instead of sprintf(). Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: `edf3775f0a` ("rcu-tasks: Add count for idle tasks on offline CPUs") Signed-off-by: Nikita Kiryushin <kiryushin@ancud.ru> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 19:36:55 +02:00
Zqiang	5f48fa85fd	rcu-tasks: Fix the comments for tasks_rcu_exit_srcu_stall_timer The synchronize_srcu() has been removed by commit("rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks") in rcu_tasks_postscan. This commit therefore fixes the tasks_rcu_exit_srcu_stall_timer comment. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 19:36:55 +02:00
Paul E. McKenney	8db610c3bd	rcu-tasks: Replace exit_tasks_rcu_start() initialization with WARN_ON_ONCE() Because the Tasks RCU ->rtp_exit_list is initialized at rcu_init() time while there is only one CPU running with interrupts disabled, it is not possible for an exiting task to encounter an uninitialized list. This commit therefore replaces the conditional initialization with a WARN_ON_ONCE(). Reported-by: Frederic Weisbecker <frederic@kernel.org> Closes: https://lore.kernel.org/all/ZdiNXmO3wRvmzPsr@lothringen/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 19:36:41 +02:00
Paul E. McKenney	fc2897d2ab	rcu: Inform KCSAN of one-byte cmpxchg() in rcu_trc_cmpxchg_need_qs() Tasks Trace RCU needs a single-byte cmpxchg(), but no such thing exists. Therefore, rcu_trc_cmpxchg_need_qs() emulates one using field substitution and a four-byte cmpxchg(), such that the other three bytes are always atomically updated to their old values. This works, but results in false-positive KCSAN failures because as far as KCSAN knows, this cmpxchg() operation is updating all four bytes. This commit therefore encloses the cmpxchg() in a data_race() and adds a single-byte instrument_atomic_read_write(), thus telling KCSAN exactly what is going on so as to avoid the false positives. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Marco Elver <elver@google.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 18:12:18 +02:00
Paul E. McKenney	ae2b217ab5	rcu: Make hotplug operations track GP state, not flags Currently, there are rcu_data structure fields named ->rcu_onl_gp_seq and ->rcu_ofl_gp_seq that track the rcu_state.gp_flags field at the time of the corresponding CPU's last online or offline operation, respectively. However, this information is not particularly useful. It would be better to instead track the grace period state kept in rcu_state.gp_state. This would also be consistent with the initialization in rcu_boot_init_percpu_data(), which is to RCU_GP_CLEANED (an rcu_state.gp_state value), and also with the diagnostics in rcu_implicit_dynticks_qs(), whose format is consistent with an integer, not a bitmask. This commit therefore makes this change and changes the names to ->rcu_onl_gp_flags and ->rcu_ofl_gp_flags, respectively. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 16:48:28 +02:00
Paul E. McKenney	09e077cf22	rcu: Mark loads from rcu_state.n_online_cpus The rcu_state.n_online_cpus value is only ever updated by CPU-hotplug operations, which are serialized. However, this value is read locklessly. This commit therefore marks those reads. While in the area, it also adds ASSERT_EXCLUSIVE_WRITER() calls just in case parallel CPU hotplug becomes a thing. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 16:48:26 +02:00
Paul E. McKenney	a542d116ba	rcu: Mark writes to rcu_sync ->gp_count field The rcu_sync structure's ->gp_count field is updated under the protection of ->rss_lock, but read locklessly, and KCSAN noted the data race. This commit therefore uses WRITE_ONCE() to do this update to clearly document its racy nature. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 16:28:46 +02:00
Paul E. McKenney	c90b9e4978	rcu: Bring diagnostic read of rcu_state.gp_flags into alignment This commit adds READ_ONCE() to a lockless diagnostic read from rcu_state.gp_flags to avoid giving the compiler any chance whatsoever of confusing the diagnostic state printed. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 16:28:44 +02:00
Paul E. McKenney	62bb24c4b0	rcu: Remove redundant READ_ONCE() of rcu_state.gp_flags in tree.c Although it is functionally OK to do READ_ONCE() of a variable that cannot change, it is confusing and at best an accident waiting to happen. This commit therefore removes a number of READ_ONCE(rcu_state.gp_flags) instances from kernel/rcu/tree.c that are not needed due to updates to this field being excluded by virtue of holding the root rcu_node structure's ->lock. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Closes: https://lore.kernel.org/lkml/4857c5ef-bd8f-4670-87ac-0600a1699d05@paulmck-laptop/T/#mccb23c2a4902da4d3c750165329f8de056903c58 Reported-by: Julia Lawall <julia.lawall@inria.fr> Closes: https://lore.kernel.org/lkml/4857c5ef-bd8f-4670-87ac-0600a1699d05@paulmck-laptop/T/#md1b5c026584f9c3c7b0fbc9240dd7de584597b73 Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 13:07:14 +02:00
Paul E. McKenney	11b8b378c5	rcu: Make Tiny RCU explicitly disable preemption Because Tiny RCU is used only in kernels built with either CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y, there has not been any need for TINY RCU to explicitly disable preemption. However, the prospect of lazy preemption changes that, and preemption means that the non-atomic increment in synchronize_rcu() can be preempted, with the possibility that one of the increments is lost. This could cause failures for users of the APIs that poll RCU grace periods. This commit therefore adds the needed preempt_disable() and preempt_enable() call to Tiny RCU. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 11:29:48 +02:00
Paul E. McKenney	3dbd8652f8	rcu: Remove redundant BH disabling in TINY_RCU The TINY_RCU rcu_process_callbacks() function is only ever invoked from a softirq handler, which means that BH is already disabled. This commit therefore removes the redundant local_bh_disable() and local_bh_ennable() from this function. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 11:29:48 +02:00
Paul E. McKenney	1b4e9fdf9e	rcu: Create NEED_TASKS_RCU to factor out enablement logic Currently, if a Kconfig option depends on TASKS_RCU, it conditionally does "select TASKS_RCU if PREEMPTION". This works, but requires any change in this enablement logic to be replicated across all such "select" clauses. This commit therefore creates a new NEED_TASKS_RCU Kconfig option so that the default value of TASKS_RCU can depend on a combination of this new option and any needed enablement logic, so that this logic is in one place. While in the area, also anticipate a likely future change by adding PREEMPT_AUTO to that logic. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 11:29:48 +02:00
Paul E. McKenney	65b4a59557	srcu: Make Tiny SRCU explicitly disable preemption Because Tiny SRCU is used only in kernels built with either CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y, there has not been any need for TINY SRCU to explicitly disable preemption. However, the prospect of lazy preemption changes that, and the lazy-preemption patches do result in rcutorture runs finding both too-short grace periods and grace-period hangs for Tiny SRCU. This commit therefore adds the needed preempt_disable() and preempt_enable() calls to Tiny SRCU. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 11:29:48 +02:00
Paul E. McKenney	c1ec7c1580	rcu: Make TINY_RCU depend on !PREEMPT_RCU rather than !PREEMPTION Right now, TINY_RCU depends on (!PREEMPTION && !SMP), which has served the kernel well for many years due to the fact that PREEMPT_RCU is normally a synonym for PREEMPTION. But with the advent of lazy preemption, it will be possible to have non-preemptible RCU in a preemptible kernel, so that kernels could be built with PREEMPT_RCU=n and PREEMPTION=y. This commit therefore makes TINY_RCU depend on (!PREEMPT_RCU && !SMP), thus allowing for a non-preemptible RCU in preemptible kernels. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-15 11:29:47 +02:00
Ingo Molnar	854dd99b5d	perf/bpf: Mark perf_event_set_bpf_handler() and perf_event_free_bpf_handler() as inline too They can be unused with certain Kconfig variations: kernel/events/core.c:9622:13: warning: ‘perf_event_free_bpf_handler’ defined but not used [-Wunused-function] kernel/events/core.c:9586:12: warning: ‘perf_event_set_bpf_handler’ defined but not used [-Wunused-function] Since they are both single-use, mark them inline. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: linux-kernel@vger.kernel.org Cc: Kyle Huey <khuey@kylehuey.com>	2024-04-14 22:35:26 +02:00
Kyle Huey	fd20bb51ed	perf/ring_buffer: Trigger IO signals for watermark_wakeup perf_output_wakeup() already marks the perf event fd available for polling. Trigger IO signals with FASYNC too. Signed-off-by: Kyle Huey <khuey@kylehuey.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240413141618.4160-3-khuey@kylehuey.com	2024-04-14 22:26:32 +02:00
Kyle Huey	4a01398066	perf: Move perf_event_fasync() to perf_event.h This will allow it to be called from perf_output_wakeup(). Signed-off-by: Kyle Huey <khuey@kylehuey.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240413141618.4160-2-khuey@kylehuey.com	2024-04-14 22:26:32 +02:00
Ingo Molnar	d0331aa978	Merge branch 'linus' into perf/core, to pick up perf/urgent fixes Pick up perf/urgent fixes that are upstream already, but not yet in the perf/core development branch. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-04-14 22:25:18 +02:00
Linus Torvalds	27fd80851d	Misc x86 fixes: - Follow up fixes for the BHI mitigations code. - Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as expected. - Work around an APIC emulation bug when the kernel is built with Clang and run as a SEV guest. - Follow up x86 topology fixes. Note that there's minor cleanups included in the BHI fixes, which we'd normally delay to the next merge window, but the BHI mitigations code is new and will be backported widely, so we thought it would be better to have a unified codebase at this stage. (Let me know if that assumption is wrong and I'll rebase it.) Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYbmmkRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iArg//R2VYSVgAfUf4cN3rTm8RuX8U/8V4A1C/ FklDWziEWTw8DhQCidiIt7ETuMjvOf7HFzVTfLP+oOiBxHGMCEjxtf74mHRaeZ+f avPvwOfp3mrIdn/h2+t7pSYxZDf310EDvarliIXq0z2hmNjmQjKXo3dyWNiWJe9i LFzST8dRyR0Tg4MjLuY2g9ZEauRHpY6aAVk9UrRi7uaiLyoHnLBASn1DCL1pmMhT cqKfHhilkeUMYwTXbDTu1iIQwBHqcpCmUp2h6VPpxPkTxJXwzo0E4lVuFVu7tzUM yrMZrNxhtC5Cg9NVF56AqZocKjutlJfWXnmuvpc+dM7z1dF/EwFbx2ZM+3PDuw8Z uODs6bVzYlhzWxTMb9Obsp3RvHe1B7ZCFCZ8uo3G9lMFXqu047UwfuwqnZw4YpX6 CDEKz3zLbV4s64HPlvbju3CpX+m9CXhg5cR6HfCJiwYCytb1bxZU0ottnPnYxVCb 9Th7wi2f1sCZvtQ/T8aZpF7FhYe7abt/CDvlDoDxiRg1f+Z2nduznfDJH81nn1KS duZEicG0O+iYi9OCPVBTVutRxSWA6D8D4F0VN7cDRr4QHyXpOFJ1KsZyHBbZo9I6 ubp9RiFNaocgdWiWVgEpvvmlYpVPDnR38oVHNPpNAyIYHu3mlCoS76znZTNX8WfJ GlvHqOp+EM0= =2ig7 -----END PGP SIGNATURE----- Merge tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Ingo Molnar: - Follow up fixes for the BHI mitigations code - Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as expected - Work around an APIC emulation bug when the kernel is built with Clang and run as a SEV guest - Follow up x86 topology fixes * tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu/amd: Move TOPOEXT enablement into the topology parser x86/cpu/amd: Make the NODEID_MSR union actually work x86/cpu/amd: Make the CPUID 0x80000008 parser correct x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto x86/bugs: Clarify that syscall hardening isn't a BHI mitigation x86/bugs: Fix BHI handling of RRSBA x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr' x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES x86/bugs: Fix BHI documentation x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n x86/topology: Don't update cpu_possible_map in topo_set_cpuids() x86/bugs: Fix return type of spectre_bhi_state() x86/apic: Force native_apic_mem_read() to use the MOV instruction	2024-04-14 10:48:51 -07:00
Linus Torvalds	c748fc3b1f	Misc timer fixes: - Address a (valid) W=1 build warning - Fix timer self-tests - Annotate a KCSAN warning wrt. accesses to the tick_do_timer_cpu global variable. - Address a !CONFIG_BUG build warning Heads up for the !CONFIG_BUG warning patch, which we addressed with: `5284984a4f` bug: Fix no-return-statement warning with !CONFIG_BUG Not everyone agreed though, see: https://lore.kernel.org/all/20240410153212.127477-1-adrian.hunter@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYblgkRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jYWg//eNeJkdzJVdbj6g4n2t3WDDuX7dxuRqdG AQHJdZctG+kNZBp+U2Zvbb8BDZfDRSQDBDfQI0ck3xG314pzXzNg92YMJB95r/Zf aRcxMSFc3a2dN3vW97UDKquPuCarCPsZQvbQKmZ55OmgW6ZRhhsjed0f18Nq63xR oWrQ0rotNhMJ98dpSOfPqrMoCXza78P/7nA49LxVIQcuDb+dtyqVTuAbENOOkFYq nqAkvuieZGzLb4nKH2d1rK4agYuXwnMLJ71MOcCNWFp8njuRRx+Yc+3gyoNl7e9E ipd6DcelOEl/DaYRao9rRy3ij0veJoUvshKZBTEWPw9FQU24odwqX4p/Mj2vF1iN KExtF+S7LBxdJAdivHyuPtt9B0rKRmgIp/Q8Ytgzuxu9rZ3LNev+7l80qDOIM8MF Mozv6JsJN2sVOMWvnzF9B1WNjVSikcyuvd2JRPbQYh1zy8aCpFHhZY+LcvK3vYBQ qdzY8o5dmIW0JrtHZw4H7tqKByUKEbJMsslPefD9qNIq5bpAUgHi7HFOMTU0kOvx 2rFDnC6cJk39CXyJrpLMyKDqZzDTHGV/J4nV7/L7vQzy3iIOcfVcszfGESaM/txk 6cgdncf9pr8aOE34A6/5Kr4L45vgh7B6YGc4oqHpdlvFLR0ve0gi+BIjNja8Jy7C IwGsS2uloCA= =oEym -----END PGP SIGNATURE----- Merge tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: - Address a (valid) W=1 build warning - Fix timer self-tests - Annotate a KCSAN warning wrt. accesses to the tick_do_timer_cpu global variable - Address a !CONFIG_BUG build warning * tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: selftests: kselftest: Fix build failure with NOLIBC selftests: timers: Fix abs() warning in posix_timers test selftests: kselftest: Mark functions that unconditionally call exit() as __noreturn selftests: timers: Fix posix_timers ksft_print_msg() warning selftests: timers: Fix valid-adjtimex signed left-shift undefined behavior bug: Fix no-return-statement warning with !CONFIG_BUG timekeeping: Use READ/WRITE_ONCE() for tick_do_timer_cpu selftests/timers/posix_timers: Reimplement check_timer_distribution() irqflags: Explicitly ignore lockdep_hrtimer_exit() argument	2024-04-14 10:32:22 -07:00
Linus Torvalds	ddd7ad5cf1	dma-mapping fixes for Linux 6.9 - fix up swiotlb buffer padding even more (Petr Tesarik) - fix for partial dma_sync on swiotlb (Michael Kelley) - swiotlb debugfs fix (Dexuan Cui) -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmYbZMsLHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYMzpw/9EA2Hfo0QRokyvYzHA9ks1JeMZMeV5fzJEb/NYb+V 6cX2U0jRsdRqsZZXKl9c/oXwhiGLnQtuA9pIsMc18hqZooWxzoFWuc438UUihfvk 6BVmciBl13gm0Addd5ajDVV7tPLWGmFY/rSNyD0yhPRlRy4PmVYMtaWpX+ec0jnW YUTSI+UTjknl7H5TXVDg/JmyHB7xTjGCAI75kMgxNK4PCX5oF5uqVpXZcF1YH5sj STQuUrLU/jE54G6X3sM44GJntDWq3RBRI2pTsgQC70MAdrrPm2g60BorccA8qx38 Kz5CII8KuUvkIOtvncNmEzlGlfwflx91nF5ItQtJNobrd5GMSNtzaSLDR0LyfzLF w+awHYZyReEwE5oYcN8VQQR4mOZnqi3VRK+vTlTjcHXyXY3k0LdW+Jxzd6TkAeqr 49ECkVHpw4QaQBKH6RWkED1Xd0CDAABx2HJ2QkHYEwwLL/KPvaehExY9LjDpQut4 qJaAqLPKzxvAJdFjvH4HlJDhpw1BH0Pj3H6i4y0UgEzOKReMGun1FoqCseSJhhGc laFtFrIpooLqOD/h0e21XpVfEcRDkLwEjAtpkZXrfBZenZF8rFDm+vTOpBiyAK2a 2KiU0+z/f1Flx5E0gLtxn5yPsWo2cuIjKfsVSewIwjrh6p/XFBF+iQi4dwISFxKz JrE= =U10z -----END PGP SIGNATURE----- Merge tag 'dma-maping-6.9-2024-04-14' of git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: - fix up swiotlb buffer padding even more (Petr Tesarik) - fix for partial dma_sync on swiotlb (Michael Kelley) - swiotlb debugfs fix (Dexuan Cui) * tag 'dma-maping-6.9-2024-04-14' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: do not set total_used to 0 in swiotlb_create_debugfs_files() swiotlb: fix swiotlb_bounce() to do partial sync's correctly swiotlb: extend buffer pre-padding to alloc_align_mask if necessary	2024-04-14 10:02:40 -07:00
Anton Protopopov	37eacb9f6e	bpf: Fix a verifier verbose message Long ago a map file descriptor in a pseudo ldimm64 instruction could only be present as an immediate value insn[0].imm, and thus this value was used in a verbose verifier message printed when the file descriptor wasn't valid. Since addition of BPF_PSEUDO_MAP_IDX_VALUE/BPF_PSEUDO_MAP_IDX the insn[0].imm field can also contain an index pointing to the file descriptor in the attr.fd_array array. However, if the file descriptor is invalid, the verifier still prints the verbose message containing value of insn[0].imm. Patch the verifier message to always print the actual file descriptor value. Fixes: `387544bfa2` ("bpf: Introduce fd_idx") Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240412141100.3562942-1-aspsk@isovalent.com	2024-04-12 18:37:20 +02:00
Linus Torvalds	5939d45155	Tracing fixes for 6.9: - Fix the buffer_percent accounting as it is dependent on three variables: 1) pages_read - number of subbuffers read 2) pages_lost - number of subbuffers lost due to overwrite 3) pages_touched - number of pages that a writer entered These three counters only increment, and to know how many active pages there are on the buffer at any given time, the pages_read and pages_lost are subtracted from pages_touched. But the pages touched was incremented whenever any writer went to the next subbuffer even if it wasn't the only one, so it was incremented more than it should be causing the counter for how many subbuffers currently have content incorrect, which caused the buffer_percent that holds waiters until the ring buffer is filled to a given percentage to wake up early. - Fix warning of unused functions when PERF_EVENTS is not configured in - Replace bad tab with space in Kconfig for FTRACE_RECORD_RECURSION_SIZE - Fix to some kerneldoc function comments in eventfs code. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZhk+khQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qvs0AP98c226UFU6Dha4wvgSulC/wKVvHN3X jeclMTdn8RGs2gD/b9OULKNv1//6fP16ZRun7ntRQkotVhlNhf9Ee0smiwU= =UYrk -----END PGP SIGNATURE----- Merge tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix the buffer_percent accounting as it is dependent on three variables: 1) pages_read - number of subbuffers read 2) pages_lost - number of subbuffers lost due to overwrite 3) pages_touched - number of pages that a writer entered These three counters only increment, and to know how many active pages there are on the buffer at any given time, the pages_read and pages_lost are subtracted from pages_touched. But the pages touched was incremented whenever any writer went to the next subbuffer even if it wasn't the only one, so it was incremented more than it should be causing the counter for how many subbuffers currently have content incorrect, which caused the buffer_percent that holds waiters until the ring buffer is filled to a given percentage to wake up early. - Fix warning of unused functions when PERF_EVENTS is not configured in - Replace bad tab with space in Kconfig for FTRACE_RECORD_RECURSION_SIZE - Fix to some kerneldoc function comments in eventfs code. * tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Only update pages_touched when a new page is touched tracing: hide unused ftrace_event_id_fops tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig entry eventfs: Fix kernel-doc comments to functions	2024-04-12 09:02:24 -07:00
Bitao Hu	e9a9292e23	watchdog/softlockup: Report the most frequent interrupts When the watchdog determines that the current soft lockup is due to an interrupt storm based on CPU utilization, reporting the most frequent interrupts could be good enough for further troubleshooting. Below is an example of interrupt storm. The call tree does not provide useful information, but analyzing which interrupt caused the soft lockup by comparing the counts of interrupts during the lockup period allows to identify the culprit. [ 638.870231] watchdog: BUG: soft lockup - CPU#9 stuck for 26s! [swapper/9:0] [ 638.870825] CPU#9 Utilization every 4s during lockup: [ 638.871194] #1: 0% system, 0% softirq, 100% hardirq, 0% idle [ 638.871652] #2: 0% system, 0% softirq, 100% hardirq, 0% idle [ 638.872107] #3: 0% system, 0% softirq, 100% hardirq, 0% idle [ 638.872563] #4: 0% system, 0% softirq, 100% hardirq, 0% idle [ 638.873018] #5: 0% system, 0% softirq, 100% hardirq, 0% idle [ 638.873494] CPU#9 Detect HardIRQ Time exceeds 50%. Most frequent HardIRQs: [ 638.873994] #1: 330945 irq#7 [ 638.874236] #2: 31 irq#82 [ 638.874493] #3: 10 irq#10 [ 638.874744] #4: 2 irq#89 [ 638.874992] #5: 1 irq#102 ... [ 638.875313] Call trace: [ 638.875315] __do_softirq+0xa8/0x364 Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20240411074134.30922-6-yaoma@linux.alibaba.com	2024-04-12 17:08:06 +02:00
Bitao Hu	d7037381d0	watchdog/softlockup: Low-overhead detection of interrupt storm The following softlockup is caused by interrupt storm, but it cannot be identified from the call tree. Because the call tree is just a snapshot and doesn't fully capture the behavior of the CPU during the soft lockup. watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921] ... Call trace: __do_softirq+0xa0/0x37c __irq_exit_rcu+0x108/0x140 irq_exit+0x14/0x20 __handle_domain_irq+0x84/0xe0 gic_handle_irq+0x80/0x108 el0_irq_naked+0x50/0x58 Therefore, it is necessary to report CPU utilization during the softlockup_threshold period (report once every sample_period, for a total of 5 reportings), like this: watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921] CPU#28 Utilization every 4s during lockup: #1: 0% system, 0% softirq, 100% hardirq, 0% idle #2: 0% system, 0% softirq, 100% hardirq, 0% idle #3: 0% system, 0% softirq, 100% hardirq, 0% idle #4: 0% system, 0% softirq, 100% hardirq, 0% idle #5: 0% system, 0% softirq, 100% hardirq, 0% idle ... This is helpful in determining whether an interrupt storm has occurred or in identifying the cause of the softlockup. The criteria for determination are as follows: a. If the hardirq utilization is high, then interrupt storm should be considered and the root cause cannot be determined from the call tree. b. If the softirq utilization is high, then the call might not necessarily point at the root cause. c. If the system utilization is high, then analyzing the root cause from the call tree is possible in most cases. The mechanism requires a considerable amount of global storage space when configured for the maximum number of CPUs. Therefore, adding a SOFTLOCKUP_DETECTOR_INTR_STORM Kconfig knob that defaults to "yes" if the max number of CPUs is <= 128. Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Link: https://lore.kernel.org/r/20240411074134.30922-5-yaoma@linux.alibaba.com	2024-04-12 17:08:05 +02:00
Bitao Hu	25a4a01511	genirq: Avoid summation loops for /proc/interrupts show_interrupts() unconditionally accumulates the per CPU interrupt statistics to determine whether an interrupt was ever raised. This can be avoided for all interrupts which are not strictly per CPU and not of type NMI because those interrupts provide already an accumulated counter. The required logic is already implemented in kstat_irqs(). Split the inner access logic out of kstat_irqs() and use it for kstat_irqs() and show_interrupts() to avoid the accumulation loop when possible. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20240411074134.30922-4-yaoma@linux.alibaba.com	2024-04-12 17:08:05 +02:00
Bitao Hu	99cf63c566	genirq: Provide a snapshot mechanism for interrupt statistics The soft lockup detector lacks a mechanism to identify interrupt storms as root cause of a lockup. To enable this the detector needs a mechanism to snapshot the interrupt count statistics on a CPU when the detector observes a potential lockup scenario and compare that against the interrupt count when it warns about the lockup later on. The number of interrupts in that period give a hint whether the lockup might have been caused by an interrupt storm. Instead of having extra storage in the lockup detector and accessing the internals of the interrupt descriptor directly, add a snapshot member to the per CPU irq_desc::kstat_irq structure and provide interfaces to take a snapshot of all interrupts on the current CPU and to retrieve the delta of a specific interrupt later on. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240411074134.30922-3-yaoma@linux.alibaba.com	2024-04-12 17:08:05 +02:00
Bitao Hu	86d2a2f51f	genirq: Convert kstat_irqs to a struct The irq_desc::kstat_irqs member is a per-CPU variable of type int, which is only capable of counting. A snapshot mechanism for interrupt statistics will be added soon, which requires an additional variable to store the snapshot. To facilitate expansion, convert kstat_irqs here to a struct containing only the count. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240411074134.30922-2-yaoma@linux.alibaba.com	2024-04-12 17:08:05 +02:00
Ingo Molnar	93d3fde7fd	perf/bpf: Change the !CONFIG_BPF_SYSCALL stubs to static inlines Otherwise the compiler will be unhappy if they go unused, which they do on allnoconfigs. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Kyle Huey <me@kylehuey.com> Link: https://lore.kernel.org/r/ZhkE9F4dyfR2dH2D@gmail.com	2024-04-12 11:56:00 +02:00
Kyle Huey	c4fcc7d1f4	perf/bpf: Allow a BPF program to suppress all sample side effects Returning zero from a BPF program attached to a perf event already suppresses any data output. Return early from __perf_event_overflow() in this case so it will also suppress event_limit accounting, SIGTRAP generation, and F_ASYNC signalling. Signed-off-by: Kyle Huey <khuey@kylehuey.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240412015019.7060-7-khuey@kylehuey.com	2024-04-12 11:49:50 +02:00
Kyle Huey	f11f10bfa1	perf/bpf: Call BPF handler directly, not through overflow machinery To ultimately allow BPF programs attached to perf events to completely suppress all of the effects of a perf event overflow (rather than just the sample output, as they do today), call bpf_overflow_handler() from __perf_event_overflow() directly rather than modifying struct perf_event's overflow_handler. Return the BPF program's return value from bpf_overflow_handler() so that __perf_event_overflow() knows how to proceed. Remove the now unnecessary orig_overflow_handler from struct perf_event. This patch is solely a refactoring and results in no behavior change. Suggested-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Kyle Huey <khuey@kylehuey.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240412015019.7060-5-khuey@kylehuey.com	2024-04-12 11:49:49 +02:00
Kyle Huey	924d934393	perf/bpf: Create bpf_overflow_handler() stub for !CONFIG_BPF_SYSCALL This will allow __perf_event_overflow() (which is independent of CONFIG_BPF_SYSCALL) to call bpf_overflow_handler(). Signed-off-by: Kyle Huey <khuey@kylehuey.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240412015019.7060-3-khuey@kylehuey.com	2024-04-12 11:49:48 +02:00
Kyle Huey	4c03fe11b9	perf/bpf: Reorder bpf_overflow_handler() ahead of __perf_event_overflow() This will allow __perf_event_overflow() to call bpf_overflow_handler(). Signed-off-by: Kyle Huey <khuey@kylehuey.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240412015019.7060-2-khuey@kylehuey.com	2024-04-12 11:49:48 +02:00
Uros Bizjak	fea0e1820b	locking/pvqspinlock: Use try_cmpxchg() in qspinlock_paravirt.h Use try_cmpxchg(ptr, &old, new) instead of cmpxchg(ptr, old, new) == old in qspinlock_paravirt.h x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240411192317.25432-2-ubizjak@gmail.com	2024-04-12 11:42:39 +02:00
Uros Bizjak	6a97734f22	locking/pvqspinlock: Use try_cmpxchg_acquire() in trylock_clear_pending() Replace this pattern in trylock_clear_pending(): cmpxchg_acquire(ptr, old, new) == old ... with the simpler and faster: try_cmpxchg_acquire(ptr, &old, new) The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after the CMPXCHG. Also change the return type of the function to bool and streamline the control flow in the _Q_PENDING_BITS == 8 variant a bit. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240325140943.815051-1-ubizjak@gmail.com	2024-04-12 11:40:51 +02:00
Paul E. McKenney	64ec8b6ad6	ftrace: Choose RCU Tasks based on TASKS_RCU rather than PREEMPTION The advent of CONFIG_PREEMPT_AUTO, AKA lazy preemption, will mean that even kernels built with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY might see the occasional preemption, and that this preemption just might happen within a trampoline. Therefore, update ftrace_shutdown() to invoke synchronize_rcu_tasks() based on CONFIG_TASKS_RCU instead of CONFIG_PREEMPTION. [ paulmck: Apply Steven Rostedt feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <linux-trace-kernel@vger.kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-12 11:23:42 +02:00
Paul E. McKenney	1e52af7f02	bpf: Choose RCU Tasks based on TASKS_RCU rather than PREEMPTION The advent of CONFIG_PREEMPT_AUTO, AKA lazy preemption, will mean that even kernels built with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY might see the occasional preemption, and that this preemption just might happen within a trampoline. Therefore, update bpf_tramp_image_put() to choose call_rcu_tasks() based on CONFIG_TASKS_RCU instead of CONFIG_PREEMPTION. This change might enable further simplifications, but the goal of this effort is to make the code safe, not necessarily optimal. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Song Liu <song@kernel.org> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: KP Singh <kpsingh@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <bpf@vger.kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-12 11:23:25 +02:00
Paolo Bonzini	f7842747d1	mm: replace set_pte_at_notify() with just set_pte_at() With the demise of the .change_pte() MMU notifier callback, there is no notification happening in set_pte_at_notify(). It is a synonym of set_pte_at() and can be replaced with it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-ID: <20240405115815.3226315-5-pbonzini@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-12 04:40:27 -04:00
Herbert Xu	58329c4312	padata: Disable BH when taking works lock on MT path As the old padata code can execute in softirq context, disable softirqs for the new padata_do_mutithreaded code too as otherwise lockdep will get antsy. Reported-by: syzbot+0cb5bb0f4bf9e79db3b3@syzkaller.appspotmail.com Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-12 15:07:51 +08:00
Steven Rostedt (Google)	ffe3986fec	ring-buffer: Only update pages_touched when a new page is touched The "buffer_percent" logic that is used by the ring buffer splice code to only wake up the tasks when there's no data after the buffer is filled to the percentage of the "buffer_percent" file is dependent on three variables that determine the amount of data that is in the ring buffer: 1) pages_read - incremented whenever a new sub-buffer is consumed 2) pages_lost - incremented every time a writer overwrites a sub-buffer 3) pages_touched - incremented when a write goes to a new sub-buffer The percentage is the calculation of: (pages_touched - (pages_lost + pages_read)) / nr_pages Basically, the amount of data is the total number of sub-bufs that have been touched, minus the number of sub-bufs lost and sub-bufs consumed. This is divided by the total count to give the buffer percentage. When the percentage is greater than the value in the "buffer_percent" file, it wakes up splice readers waiting for that amount. It was observed that over time, the amount read from the splice was constantly decreasing the longer the trace was running. That is, if one asked for 60%, it would read over 60% when it first starts tracing, but then it would be woken up at under 60% and would slowly decrease the amount of data read after being woken up, where the amount becomes much less than the buffer percent. This was due to an accounting of the pages_touched incrementation. This value is incremented whenever a writer transfers to a new sub-buffer. But the place where it was incremented was incorrect. If a writer overflowed the current sub-buffer it would go to the next one. If it gets preempted by an interrupt at that time, and the interrupt performs a trace, it too will end up going to the next sub-buffer. But only one should increment the counter. Unfortunately, that was not the case. Change the cmpxchg() that does the real switch of the tail-page into a try_cmpxchg(), and on success, perform the increment of pages_touched. This will only increment the counter once for when the writer moves to a new sub-buffer, and not when there's a race and is incremented for when a writer and its preempting writer both move to the same new sub-buffer. Link: https://lore.kernel.org/linux-trace-kernel/20240409151309.0d0e5056@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: `2c2b0a78b3` ("ring-buffer: Add percentage of ring buffer full to wake up reader") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-04-11 17:49:57 -04:00
Arnd Bergmann	5281ec8345	tracing: hide unused ftrace_event_id_fops When CONFIG_PERF_EVENTS, a 'make W=1' build produces a warning about the unused ftrace_event_id_fops variable: kernel/trace/trace_events.c:2155:37: error: 'ftrace_event_id_fops' defined but not used [-Werror=unused-const-variable=] 2155 \| static const struct file_operations ftrace_event_id_fops = { Hide this in the same #ifdef as the reference to it. Link: https://lore.kernel.org/linux-trace-kernel/20240403080702.3509288-7-arnd@kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Zheng Yejian <zhengyejian1@huawei.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ajay Kaher <akaher@vmware.com> Cc: Jinjie Ruan <ruanjinjie@huawei.com> Cc: Clément Léger <cleger@rivosinc.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com> Fixes: `620a30e97f` ("tracing: Don't pass file_operations array to event_create_dir()") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-04-11 17:46:55 -04:00
Prasad Pandit	d96c36004e	tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig entry Fix FTRACE_RECORD_RECURSION_SIZE entry, replace tab with a space character. It helps Kconfig parsers to read file without error. Link: https://lore.kernel.org/linux-trace-kernel/20240322121801.1803948-1-ppandit@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: `773c167050` ("ftrace: Add recording of functions that caused recursion") Signed-off-by: Prasad Pandit <pjp@fedoraproject.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-04-11 17:45:18 -04:00
Jakub Kicinski	94426ed213	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: net/unix/garbage.c `47d8ac011f` ("af_unix: Fix garbage collector racing against connect()") `4090fa373f` ("af_unix: Replace garbage collection algorithm.") Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.c `faa12ca245` ("bnxt_en: Reset PTP tx_avail after possible firmware reset") `b3d0083caf` ("bnxt_en: Support RSS contexts in ethtool .{get\|set}_rxfh()") drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c `7ac10c7d72` ("bnxt_en: Fix possible memory leak in bnxt_rdma_aux_device_init()") `194fad5b27` ("bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions") drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c `958f56e483` ("net/mlx5e: Un-expose functions in en.h") `49e6c93870` ("net/mlx5e: RSS, Block XOR hash with over 128 channels") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-11 14:23:47 -07:00
Linus Torvalds	136eb5fd6a	Power management fix for 6.9-rc4 Fix the suspend-to-idle core code to guarantee that timers queued on CPUs other than the one that has first left the idle state, which should expire directly after resume, will be handled (Anna-Maria Behnsen). -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmYYIvYSHHJqd0Byand5 c29ja2kubmV0AAoJEILEb/54YlRxq5oP/RS98YTjjrhqieQ2mfKN+thVpgTEeYBa C4NoRbQf0x8kyhHbqEmjEdxlIUUmLDpolWMxFtlYdhTSZB7+k0z9GL3fZfTKRTVq 8lYSuxEzyicylve+b+gVcQ6m61kaV3M2jwP23Myaf5MOIx2OvQy7SnNS02p3yfJ8 NhBTqqTXbupWnnRuSPZvejnZaU/3wW1UvVmbkBUNwhR0O+6J0bxC5fNrvgip//kG Iyx04j88787nwS9rUL68A9xSA9WVRuLNZRMb7RE2mZ9YVR6qLhxhUxfA8WrKignl SxFPPmzGNkWdwtlWeIsQvcAOOhTKnWqOm7aggBfcf7jGoYBOJ7Uo4gfxvEvA1gnc 36UTpJXc/5HQDeVKYcSx5O9VSm7/cKx4sP5jiDKT/iBPLnSXAKoAflgl1RvkchvA Gg0aJ37MRIvoP53OidNACkHN2VsikTMuYA4b21G+we/ib7q8kIbdI9yqDF4aPcCU vV0rMlSGSfM5+PGn8fDei2sbZ4E6Mk3XRwzF386xJDIyrvnqr6hhmVxDmWOx7oEb Jpv80971cbr1lmtP2SKFVdsRxElhRWfXk3OwLOGRbLbQI1BP+SAWeyICAYUanPJI W6vmJwDqylCUFRy4mhufDe7fceXHnvFUYMhi0XWkGvCgWEgLffKHTzFNup12hU1Y 65Rcv19bRP8h =8DDE -----END PGP SIGNATURE----- Merge tag 'pm-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "Fix the suspend-to-idle core code to guarantee that timers queued on CPUs other than the one that has first left the idle state, which should expire directly after resume, will be handled (Anna-Maria Behnsen)" * tag 'pm-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: s2idle: Make sure CPUs will wakeup directly on resume	2024-04-11 12:00:25 -07:00
Paul E. McKenney	02b3c5fcdf	tracing: Select new NEED_TASKS_RCU Kconfig option Currently, if a Kconfig option depends on TASKS_RCU, it conditionally does "select TASKS_RCU if PREEMPTION". This works, but requires any change in this enablement logic to be replicated across all such "select" clauses. A new NEED_TASKS_RCU Kconfig option has been created to allow this enablement logic to be in one place in kernel/rcu/Kconfig. Therefore, select the new NEED_TASKS_RCU Kconfig option instead of the old TASKS_RCU option. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: <linux-trace-kernel@vger.kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-11 18:18:47 +02:00
Uladzislau Rezki (Sony)	dfd458a95d	rcu: Add data structures for synchronize_rcu() The synchronize_rcu() call is going to be reworked, thus this patch adds dedicated fields into the rcu_state structure. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2024-04-11 17:57:15 +02:00

1 2 3 4 5 ...

44650 Commits