linux

iv/linux

Author	SHA1	Message	Date
Florian Westphal	132328e8e8	bpf: netfilter: Add BPF_NETFILTER bpf_attach_type Andrii Nakryiko writes: And we currently don't have an attach type for NETLINK BPF link. Thankfully it's not too late to add it. I see that link_create() in kernel/bpf/syscall.c just bypasses attach_type check. We shouldn't have done that. Instead we need to add BPF_NETLINK attach type to enum bpf_attach_type. And wire all that properly throughout the kernel and libbpf itself. This adds BPF_NETFILTER and uses it. This breaks uabi but this wasn't in any non-rc release yet, so it should be fine. v2: check link_attack prog type in link_create too Fixes: 84601d6ee68a ("bpf: add bpf_link support for BPF_NETFILTER programs") Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/CAEf4BzZ69YgrQW7DHCJUT_X+GqMq_ZQQPBwopaJJVGFD5=d5Vg@mail.gmail.com/ Link: https://lore.kernel.org/bpf/20230605131445.32016-1-fw@strlen.de	2023-06-05 15:01:43 -07:00
David Vernet	51302c951c	bpf: Teach verifier that trusted PTR_TO_BTF_ID pointers are non-NULL In reg_type_not_null(), we currently assume that a pointer may be NULL if it has the PTR_MAYBE_NULL modifier, or if it doesn't belong to one of several base type of pointers that are never NULL-able. For example, PTR_TO_CTX, PTR_TO_MAP_VALUE, etc. It turns out that in some cases, PTR_TO_BTF_ID can never be NULL as well, though we currently don't specify it. For example, if you had the following program: SEC("tc") long example_refcnt_fail(void ctx) { struct bpf_cpumask mask1, mask2; mask1 = bpf_cpumask_create(); mask2 = bpf_cpumask_create(); if (!mask1 \|\| !mask2) goto error_release; bpf_cpumask_test_cpu(0, (const struct cpumask )mask1); bpf_cpumask_test_cpu(0, (const struct cpumask *)mask2); error_release: if (mask1) bpf_cpumask_release(mask1); if (mask2) bpf_cpumask_release(mask2); return ret; } The verifier will incorrectly fail to load the program, thinking (unintuitively) that we have a possibly-unreleased reference if the mask is NULL, because we (correctly) don't issue a bpf_cpumask_release() on the NULL path. The reason the verifier gets confused is due to the fact that we don't explicitly tell the verifier that trusted PTR_TO_BTF_ID pointers can never be NULL. Basically, if we successfully get past the if check (meaning both pointers go from ptr_or_null_bpf_cpumask to ptr_bpf_cpumask), the verifier will correctly assume that the references need to be dropped on any possible branch that leads to program exit. However, it will _incorrectly_ think that the ptr == NULL branch is possible, and will erroneously detect it as a branch on which we failed to drop the reference. The solution is of course to teach the verifier that trusted PTR_TO_BTF_ID pointers can never be NULL, so that it doesn't incorrectly think it's possible for the reference to be present on the ptr == NULL branch. A follow-on patch will add a selftest that verifies this behavior. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230602150112.1494194-1-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-06-05 14:36:57 -07:00
Daniel T. Lee	503e4def54	bpf: Replace open code with for allocated object check >From commit 282de143ead9 ("bpf: Introduce allocated objects support"), With this allocated object with BPF program, (PTR_TO_BTF_ID \| MEM_ALLOC) has been a way of indicating to check the type is the allocated object. commit d8939cb0a03c ("bpf: Loosen alloc obj test in verifier's reg_btf_record") >From the commit, there has been helper function for checking this, named type_is_ptr_alloc_obj(). But still, some of the code use open code to retrieve this info. This commit replaces the open code with the type_is_alloc(), and the type_is_ptr_alloc_obj() function. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230527122706.59315-1-danieltimlee@gmail.com	2023-06-05 14:33:17 -07:00
Miaohe Lin	0dad9b072b	cgroup: make cgroup_is_threaded() and cgroup_is_thread_root() static They're only called inside cgroup.c. Make them static. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-06-05 10:31:41 -10:00
Dave Marchevsky	7793fc3bab	bpf: Make bpf_refcount_acquire fallible for non-owning refs This patch fixes an incorrect assumption made in the original bpf_refcount series [0], specifically that the BPF program calling bpf_refcount_acquire on some node can always guarantee that the node is alive. In that series, the patch adding failure behavior to rbtree_add and list_push_{front, back} breaks this assumption for non-owning references. Consider the following program: n = bpf_kptr_xchg(&mapval, NULL); /* skip error checking / bpf_spin_lock(&l); if(bpf_rbtree_add(&t, &n->rb, less)) { bpf_refcount_acquire(n); / Failed to add, do something else with the node */ } bpf_spin_unlock(&l); It's incorrect to assume that bpf_refcount_acquire will always succeed in this scenario. bpf_refcount_acquire is being called in a critical section here, but the lock being held is associated with rbtree t, which isn't necessarily the lock associated with the tree that the node is already in. So after bpf_rbtree_add fails to add the node and calls bpf_obj_drop in it, the program has no ownership of the node's lifetime. Therefore the node's refcount can be decr'd to 0 at any time after the failing rbtree_add. If this happens before the refcount_acquire above, the node might be free'd, and regardless refcount_acquire will be incrementing a 0 refcount. Later patches in the series exercise this scenario, resulting in the expected complaint from the kernel (without this patch's changes): refcount_t: addition on 0; use-after-free. WARNING: CPU: 1 PID: 207 at lib/refcount.c:25 refcount_warn_saturate+0xbc/0x110 Modules linked in: bpf_testmod(O) CPU: 1 PID: 207 Comm: test_progs Tainted: G O 6.3.0-rc7-02231-g723de1a718a2-dirty #371 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 RIP: 0010:refcount_warn_saturate+0xbc/0x110 Code: 6f 64 f6 02 01 e8 84 a3 5c ff 0f 0b eb 9d 80 3d 5e 64 f6 02 00 75 94 48 c7 c7 e0 13 d2 82 c6 05 4e 64 f6 02 01 e8 64 a3 5c ff <0f> 0b e9 7a ff ff ff 80 3d 38 64 f6 02 00 0f 85 6d ff ff ff 48 c7 RSP: 0018:ffff88810b9179b0 EFLAGS: 00010082 RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000 RDX: 0000000000000202 RSI: 0000000000000008 RDI: ffffffff857c3680 RBP: ffff88810027d3c0 R08: ffffffff8125f2a4 R09: ffff88810b9176e7 R10: ffffed1021722edc R11: 746e756f63666572 R12: ffff88810027d388 R13: ffff88810027d3c0 R14: ffffc900005fe030 R15: ffffc900005fe048 FS: 00007fee0584a700(0000) GS:ffff88811b280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005634a96f6c58 CR3: 0000000108ce9002 CR4: 0000000000770ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> bpf_refcount_acquire_impl+0xb5/0xc0 (rest of output snipped) The patch addresses this by changing bpf_refcount_acquire_impl to use refcount_inc_not_zero instead of refcount_inc and marking bpf_refcount_acquire KF_RET_NULL. For owning references, though, we know the above scenario is not possible and thus that bpf_refcount_acquire will always succeed. Some verifier bookkeeping is added to track "is input owning ref?" for bpf_refcount_acquire calls and return false from is_kfunc_ret_null for bpf_refcount_acquire on owning refs despite it being marked KF_RET_NULL. Existing selftests using bpf_refcount_acquire are modified where necessary to NULL-check its return value. [0]: https://lore.kernel.org/bpf/20230415201811.343116-1-davemarchevsky@fb.com/ Fixes: d2dcc67df910 ("bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail") Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230602022647.1571784-5-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-06-05 13:17:20 -07:00
Dave Marchevsky	cc0d76cafe	bpf: Fix __bpf_{list,rbtree}_add's beginning-of-node calculation Given the pointer to struct bpf_{rb,list}_node within a local kptr and the byte offset of that field within the kptr struct, the calculation changed by this patch is meant to find the beginning of the kptr so that it can be passed to bpf_obj_drop. Unfortunately instead of doing ptr_to_kptr = ptr_to_node_field - offset_bytes the calculation is erroneously doing ptr_to_ktpr = ptr_to_node_field - (offset_bytes * sizeof(struct bpf_rb_node)) or the bpf_list_node equivalent. This patch fixes the calculation. Fixes: d2dcc67df910 ("bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail") Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230602022647.1571784-4-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-06-05 13:17:19 -07:00
Dave Marchevsky	2140a6e342	bpf: Set kptr_struct_meta for node param to list and rbtree insert funcs In verifier.c, fixup_kfunc_call uses struct bpf_insn_aux_data's kptr_struct_meta field to pass information about local kptr types to various helpers and kfuncs at runtime. The recent bpf_refcount series added a few functions to the set that need this information: * bpf_refcount_acquire * Needs to know where the refcount field is in order to increment * Graph collection insert kfuncs: bpf_rbtree_add, bpf_list_push_{front,back} * Were migrated to possibly fail by the bpf_refcount series. If insert fails, the input node is bpf_obj_drop'd. bpf_obj_drop needs the kptr_struct_meta in order to decr refcount and properly free special fields. Unfortunately the verifier handling of collection insert kfuncs was not modified to actually populate kptr_struct_meta. Accordingly, when the node input to those kfuncs is passed to bpf_obj_drop, it is done so without the information necessary to decr refcount. This patch fixes the issue by populating kptr_struct_meta for those kfuncs. Fixes: d2dcc67df910 ("bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail") Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230602022647.1571784-3-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-06-05 13:17:19 -07:00
Gaosheng Cui	c20d4d889f	rdmacg: fix kernel-doc warnings in rdmacg Fix all kernel-doc warnings in rdmacg: kernel/cgroup/rdma.c:209: warning: Function parameter or member 'cg' not described in 'rdmacg_uncharge_hierarchy' kernel/cgroup/rdma.c:230: warning: Function parameter or member 'cg' not described in 'rdmacg_uncharge' Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-06-05 09:45:14 -10:00
Gaosheng Cui	7bf11e90a3	cgroup: Replace the css_set call with cgroup_get We will release the refcnt of cgroup via cgroup_put, for example in the cgroup_lock_and_drain_offline function, so replace css_get with the cgroup_get function for better readability. Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-06-05 09:42:42 -10:00
Miaohe Lin	a49a11dc64	cgroup: remove unused macro for_each_e_css() for_each_e_css() is unused now. Remove it. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-06-05 09:20:20 -10:00
Dietmar Eggemann	7d0583cf9e	sched/fair, cpufreq: Introduce 'runnable boosting' The responsiveness of the Per Entity Load Tracking (PELT) util_avg in mobile devices is still considered too low for utilization changes during task ramp-up. In Android this manifests in the fact that the first frames of a UI activity are very prone to be jankframes (a frame which doesn't meet the required frame rendering time, e.g. 16ms@60Hz) since the CPU frequency is normally low at this point and has to ramp up quickly. The beginning of an UI activity is also characterized by the occurrence of CPU contention, especially on little CPUs. Current little CPUs can have an original CPU capacity of only ~ 150 which means that the actual CPU capacity at lower frequency can even be much smaller. Schedutil maps CPU util_avg into CPU frequency request via: util = effective_cpu_util(..., cpu_util_cfs(cpu), ...) -> util = map_util_perf(util) -> freq = map_util_freq(util, ...) CPU contention for CFS tasks can be detected by 'CPU runnable > CPU utililization' in cpu_util_cfs_boost() -> cpu_util(..., boost = 1). Schedutil uses 'runnable boosting' by calling cpu_util_cfs_boost(). To be in sync with schedutil's CPU frequency selection, Energy Aware Scheduling (EAS) also calls cpu_util(..., boost = 1) during max util detection. Moreover, 'runnable boosting' is also used in load-balance for busiest CPU selection when the migration type is 'migrate_util', i.e. only at sched domains which don't have the SD_SHARE_PKG_RESOURCES flag set. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230515115735.296329-3-dietmar.eggemann@arm.com	2023-06-05 21:13:44 +02:00
Dietmar Eggemann	3eb6d6ecec	sched/fair: Refactor CPU utilization functions There is a lot of code duplication in cpu_util_next() & cpu_util_cfs(). Remove this by allowing cpu_util_next() to be called with p = NULL. Rename cpu_util_next() to cpu_util() since the '_next' suffix is no longer necessary to distinct cpu utilization related functions. Implement cpu_util_cfs(cpu) as cpu_util(cpu, p = NULL, -1). This will allow to code future related cpu util changes only in one place, namely in cpu_util(). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230515115735.296329-2-dietmar.eggemann@arm.com	2023-06-05 21:13:43 +02:00
Peter Zijlstra	fb7d4948c4	sched/clock: Provide local_clock_noinstr() Now that all ARCH_WANTS_NO_INSTR architectures (arm64, loongarch, s390, x86) provide sched_clock_noinstr(), use this to provide local_clock_noinstr(). This local_clock_noinstr() will be safe to use from noinstr code with the assumption that any such noinstr code is non-preemptible (it had better be, entry code will have IRQs disabled while __cpuidle must have preemption disabled). Specifically, preempt_enable_notrace(), a common part of many a sched_clock() implementation calls out to schedule() -- even though, per the above, it will never trigger -- which frustrates noinstr validation. vmlinux.o: warning: objtool: local_clock+0xb5: call to preempt_schedule_notrace_thunk() leaves .noinstr.text section Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.978624636@infradead.org	2023-06-05 21:11:09 +02:00
Peter Zijlstra	5949a68c73	time/sched_clock: Provide sched_clock_noinstr() With the intent to provide local_clock_noinstr(), a variant of local_clock() that's safe to be called from noinstr code (with the assumption that any such code will already be non-preemptible), prepare for things by providing a noinstr sched_clock_noinstr() function. Specifically, preempt_enable_*() calls out to schedule(), which upsets noinstr validation efforts. As such, pull out the preempt_{dis,en}able_notrace() requirements from the sched_clock_read() implementations by explicitly providing it in the sched_clock() function. This further requires said sched_clock_read() functions to be noinstr themselves, for ARCH_WANTS_NO_INSTR users. See the next few patches. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.302350330@infradead.org	2023-06-05 21:11:04 +02:00
Peter Zijlstra	d16317de9b	seqlock/latch: Provide raw_read_seqcount_latch_retry() The read side of seqcount_latch consists of: do { seq = raw_read_seqcount_latch(&latch->seq); ... } while (read_seqcount_latch_retry(&latch->seq, seq)); which is asymmetric in the raw_ department, and sure enough, read_seqcount_latch_retry() includes (explicit) instrumentation where raw_read_seqcount_latch() does not. This inconsistency becomes a problem when trying to use it from noinstr code. As such, fix it by renaming and re-implementing raw_read_seqcount_latch_retry() without the instrumentation. Specifically the instrumentation in question is kcsan_atomic_next(0) in do___read_seqcount_retry(). Loosing this annotation is not a problem because raw_read_seqcount_latch() does not pass through kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.233598176@infradead.org	2023-06-05 21:11:03 +02:00
Peter Zijlstra	1c06918788	sched: Consider task_struct::saved_state in wait_task_inactive() With the introduction of task_struct::saved_state in commit 5f220be21418 ("sched/wakeup: Prepare for RT sleeping spin/rwlocks") matching the task state has gotten more complicated. That same commit changed try_to_wake_up() to consider both states, but wait_task_inactive() has been neglected. Sebastian noted that the wait_task_inactive() usage in ptrace_check_attach() can misbehave when ptrace_stop() is blocked on the tasklist_lock after it sets TASK_TRACED. Therefore extract a common helper from ttwu_state_match() and use that to teach wait_task_inactive() about the PREEMPT_RT locks. Originally-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20230601091234.GW83892@hirez.programming.kicks-ass.net	2023-06-05 21:11:03 +02:00
Peter Zijlstra	d5e1586617	sched: Unconditionally use full-fat wait_task_inactive() While modifying wait_task_inactive() for PREEMPT_RT; the build robot noted that UP got broken. This led to audit and consideration of the UP implementation of wait_task_inactive(). It looks like the UP implementation is also broken for PREEMPT; consider task_current_syscall() getting preempted between the two calls to wait_task_inactive(). Therefore move the wait_task_inactive() implementation out of CONFIG_SMP and unconditionally use it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230602103731.GA630648%40hirez.programming.kicks-ass.net	2023-06-05 21:11:02 +02:00
Yicong Yang	0dd37d6dd3	sched/fair: Don't balance task to its current running CPU We've run into the case that the balancer tries to balance a migration disabled task and trigger the warning in set_task_cpu() like below: ------------[ cut here ]------------ WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240 Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip> CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1 Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021 pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : set_task_cpu+0x188/0x240 lr : load_balance+0x5d0/0xc60 sp : ffff80000803bc70 x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040 x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001 x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78 x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000 x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000 x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530 x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001 Call trace: set_task_cpu+0x188/0x240 load_balance+0x5d0/0xc60 rebalance_domains+0x26c/0x380 _nohz_idle_balance.isra.0+0x1e0/0x370 run_rebalance_domains+0x6c/0x80 __do_softirq+0x128/0x3d8 ____do_softirq+0x18/0x24 call_on_irq_stack+0x2c/0x38 do_softirq_own_stack+0x24/0x3c __irq_exit_rcu+0xcc/0xf4 irq_exit_rcu+0x18/0x24 el1_interrupt+0x4c/0xe4 el1h_64_irq_handler+0x18/0x2c el1h_64_irq+0x74/0x78 arch_cpu_idle+0x18/0x4c default_idle_call+0x58/0x194 do_idle+0x244/0x2b0 cpu_startup_entry+0x30/0x3c secondary_start_kernel+0x14c/0x190 __secondary_switched+0xb0/0xb4 ---[ end trace 0000000000000000 ]--- Further investigation shows that the warning is superfluous, the migration disabled task is just going to be migrated to its current running CPU. This is because that on load balance if the dst_cpu is not allowed by the task, we'll re-select a new_dst_cpu as a candidate. If no task can be balanced to dst_cpu we'll try to balance the task to the new_dst_cpu instead. In this case when the migration disabled task is not on CPU it only allows to run on its current CPU, load balance will select its current CPU as new_dst_cpu and later triggers the warning above. The new_dst_cpu is chosen from the env->dst_grpmask. Currently it contains CPUs in sched_group_span() and if we have overlapped groups it's possible to run into this case. This patch makes env->dst_grpmask of group_balance_mask() which exclude any CPUs from the busiest group and solve the issue. For balancing in a domain with no overlapped groups the behaviour keeps same as before. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230530082507.10444-1-yangyicong@huawei.com	2023-06-05 21:08:24 +02:00
Christoph Hellwig	1e8c813b08	PM: hibernate: don't use early_lookup_bdev in resume_store resume_store is a sysfs attribute written during normal kernel runtime, and it should not use the early_lookup_bdev API that bypasses all normal path based permission checking, and might cause problems with certain container environments renaming devices. Switch to lookup_bdev, which does a normal path lookup instead, and fall back to trying to parse a numeric dev_t just like early_lookup_bdev did. Note that this strictly speaking changes the kernel ABI as the PARTUUID= and PARTLABEL= style syntax is now not available during a running systems. They never were intended for that, but this breaks things we'll have to figure out a way to make them available again. But if avoidable in any way I'd rather avoid that. Fixes: 421a5fa1a6cf ("PM / hibernate: use name_to_dev_t to parse resume") Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20230531125535.676098-22-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:57:40 -06:00
Christoph Hellwig	cf056a4312	init: improve the name_to_dev_t interface name_to_dev_t has a very misleading name, that doesn't make clear it should only be used by the early init code, and also has a bad calling convention that doesn't allow returning different kinds of errors. Rename it to early_lookup_bdev to make the use case clear, and return an errno, where -EINVAL means the string could not be parsed, and -ENODEV means it the string was valid, but there was no device found for it. Also stub out the whole call for !CONFIG_BLOCK as all the non-block root cases are always covered in the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230531125535.676098-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:56:46 -06:00
Christoph Hellwig	cc89c63e2f	PM: hibernate: move finding the resume device out of software_resume software_resume can be called either from an init call in the boot code, or from sysfs once the system has finished booting, and the two invocation methods this can't race with each other. For the latter case we did just parse the suspend device manually, while the former might not have one. Split software_resume so that the search only happens for the boot case, which also means the special lockdep nesting annotation can go away as the system transition mutex can be taken a little later and doesn't have the sysfs locking nest inside it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20230531125535.676098-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:55:20 -06:00
Christoph Hellwig	d6545e6872	PM: hibernate: remove the global snapshot_test variable Passing call dependent variable in global variables is a huge antipattern. Fix it up. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20230531125535.676098-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:55:20 -06:00
Christoph Hellwig	02b42d58f3	PM: hibernate: factor out a helper to find the resume device Split the logic to find the resume device out software_resume and into a separate helper to start unwindig the convoluted goto logic. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20230531125535.676098-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:55:20 -06:00
Christoph Hellwig	0718afd47f	block: introduce holder ops Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and installed in the block_device for exclusive claims. It will be used to allow the block layer to call back into the user of the block device for thing like notification of a removed device or a device resize. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:53:04 -06:00
Josh Poimboeuf	42cffe980c	livepatch: Make 'klp_stack_entries' static The 'klp_stack_entries' percpu array is only used in transition.c. Make it static. Fixes: e92606fa172f ("livepatch: Convert stack entries array to percpu") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202305171329.i0UQ4TJa-lkp@intel.com/ Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/5115752fca6537720700f4bf5b178959dfbca41a.1685488550.git.jpoimboe@kernel.org	2023-06-05 13:56:52 +02:00
Mark Rutland	0f613bfa82	locking/atomic: treewide: use raw_atomic_<op>() Now that we have raw_atomic_<op>() definitions, there's no need to use arch_atomic_<op>() definitions outside of the low-level atomic definitions. Move treewide users of arch_atomic_<op>() over to the equivalent raw_atomic*_<op>(). There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230605070124.3741859-19-mark.rutland@arm.com	2023-06-05 09:57:20 +02:00
Linus Torvalds	51f269a6ec	Probes fixes for 6.4-rc4: - Return NULL if the trace_probe list on trace_probe_event is empty. - selftests/ftrace: Choose testing symbol name for filtering feature from sample data instead of fixed symbol. -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmR640AACgkQ2/sHvwUr PxugGgf/YwwocmUqiEtTukTB7fzoAjYyQXr0YaJM+DjeZXMqAJ4dl9tV1/AmAL4j iWtZd53aolTym/3P2VADfSc4xiyWjFdkYv7zRPjpqfMg3XsELJgshwz+12dmmMdx 0uco1l2/Ge3JNPK6BuWaO3V44QjoPSgiRsmxxKLh5K7M9V5swL7fadoLtins1B0r TVVqdyEHQkZLTByexg7wHYd/ro+4lexv1yhvyP4rEmYRPDoR56eOF2zwcQMHPvaY qstdP2ce6m5rG0gp4TsY7oRkezb64y903hNQuumoU6VR9nI3IK4PZjuX5/xns2By G9mRaOqb02+UmP+HhX4QGmr92G9Vyw== =o07w -----END PGP SIGNATURE----- Merge tag 'probes-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - Return NULL if the trace_probe list on trace_probe_event is empty - selftests/ftrace: Choose testing symbol name for filtering feature from sample data instead of fixed symbol * tag 'probes-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: selftests/ftrace: Choose target function for filter test from samples tracing/probe: trace_probe_primary_from_call(): checked list_first_entry	2023-06-03 08:23:16 -04:00
Rhys Rustad-Elliott	cba41bb78d	bpf: Fix elem_size not being set for inner maps Commit d937bc3449fa ("bpf: make uniform use of array->elem_size everywhere in arraymap.c") changed array_map_gen_lookup to use array->elem_size instead of round_up(map->value_size, 8) as the element size when generating code to access a value in an array map. array->elem_size, however, is not set by bpf_map_meta_alloc when initializing an BPF_MAP_TYPE_ARRAY_OF_MAPS or BPF_MAP_TYPE_HASH_OF_MAPS. This results in array_map_gen_lookup incorrectly outputting code that always accesses index 0 in the array (as the index will be calculated via a multiplication with the element size, which is incorrectly set to 0). Set elem_size on the bpf_array object when allocating an array or hash of maps to fix this. Fixes: d937bc3449fa ("bpf: make uniform use of array->elem_size everywhere in arraymap.c") Signed-off-by: Rhys Rustad-Elliott <me@rhysre.net> Link: https://lore.kernel.org/r/20230602190110.47068-2-me@rhysre.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-06-02 16:22:12 -07:00
KP Singh	b0fd1852bc	bpf: Fix UAF in task local storage When task local storage was generalized for tracing programs, the bpf_task_local_storage callback was moved from a BPF LSM hook callback for security_task_free LSM hook to it's own callback. But a failure case in bad_fork_cleanup_security was missed which, when triggered, led to a dangling task owner pointer and a subsequent use-after-free. Move the bpf_task_storage_free to the very end of free_task to handle all failure cases. This issue was noticed when a BPF LSM program was attached to the task_alloc hook on a kernel with KASAN enabled. The program used bpf_task_storage_get to copy the task local storage from the current task to the new task being created. Fixes: a10787e6d58c ("bpf: Enable task local storage for tracing programs") Reported-by: Kuba Piecuch <jpiecuch@google.com> Signed-off-by: KP Singh <kpsingh@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230602002612.1117381-1-kpsingh@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-06-02 10:18:07 -07:00
Linus Torvalds	c43a6ff9f9	modules-6.4-rc5-second-pull A zstd fix by lucas as he tested zstd decompression support -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmR5D8USHG1jZ3JvZkBr ZXJuZWwub3JnAAoJEM4jHQowkoingK8P/3XmCG83G1JULOgFso657i1qdsVbqQek Qzo7t9IniifEYZPu8+BTplrtwP3G7x9xlci94JWSftaDvRCWh7kULJ2EMYe61aQk eEiYu79AYQYDAYx2mUacvf8QeSWoJmxY7DyUHt9eZQTJ90VN6/svKThxLk2HbQvR LtDryHVJqVnH2iSdkP3GA/kVulvLfrQ4+entVhr609kFsvZz21WICtnPU5/xu3PJ /Cb8T5FvTjQO6+nCjKMWlBM5sfXSQDxMqcnppZO/sed0yPP47hGzljJM2tlWNkK7 qb7IrXYW4SowjcIGzQIFHbPSBRM+02hhIF0iSK66kGVrDHocJ1u2pKhhzgsnZdfk Wjr1R1CHAthjr1e1S4sFAnl3VTXBCPtC2L2SdCR3aGb1EB2bEjPmZTex6HWWNeCV iBVLdJxyt0K6NQTlFe4b5ZE5j0JB2h/uDSpY9OrwMwkVA4BptKRlkUt+4EliPeaf lBNABLFDSK82x7bL9MSurzJhLOumj/9CMpl/WkwJYsT5RK5zkfcLpmh6YuRQlA/1 xR4NKlJn3pVGjXrmAl2t6VJrSa/wCQjkUzd1xyilLI/oxk4uxelGFJiofCyl4zSF +A0MHbvrL6N8x6conbfWkQ4cFgSGLSneB5UgVe9myl2b/P3n6PCgyghulo7F9JYj G65Ty9CqKJRm =FVHd -----END PGP SIGNATURE----- Merge tag 'modules-6.4-rc5-second-pull' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull modules fix from Luis Chamberlain: "A zstd fix by lucas as he tested zstd decompression support" * tag 'modules-6.4-rc5-second-pull' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: module/decompress: Fix error checking on zstd decompression	2023-06-01 20:48:16 -04:00
Jakub Kicinski	a03a91bd68	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/sfc/tc.c 622ab656344a ("sfc: fix error unwinds in TC offload") b6583d5e9e94 ("sfc: support TC decap rules matching on enc_src_port") net/mptcp/protocol.c 5b825727d087 ("mptcp: add annotations around msk->subflow accesses") e76c8ef5cc5b ("mptcp: refactor mptcp_stream_accept()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-06-01 15:38:26 -07:00
Lucas De Marchi	fadb74f9f2	module/decompress: Fix error checking on zstd decompression While implementing support for in-kernel decompression in kmod, finit_module() was returning a very suspicious value: finit_module(3, "", MODULE_INIT_COMPRESSED_FILE) = 18446744072717407296 It turns out the check for module_get_next_page() failing is wrong, and hence the decompression was not really taking place. Invert the condition to fix it. Fixes: 169a58ad824d ("module/decompress: Support zstd in-kernel decompression") Cc: stable@kernel.org Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-06-01 14:36:46 -07:00
Mike Christie	f9010dbdce	fork, vhost: Use CLONE_THREAD to fix freezer/ps regression When switching from kthreads to vhost_tasks two bugs were added: 1. The vhost worker tasks's now show up as processes so scripts doing ps or ps a would not incorrectly detect the vhost task as another process. 2. kthreads disabled freeze by setting PF_NOFREEZE, but vhost tasks's didn't disable or add support for them. To fix both bugs, this switches the vhost task to be thread in the process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that SIGKILL/STOP support is required because CLONE_THREAD requires CLONE_SIGHAND which requires those 2 signals to be supported. This is a modified version of the patch written by Mike Christie <michael.christie@oracle.com> which was a modified version of patch originally written by Linus. Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER. Including ignoring signals, setting up the register state, and having get_signal return instead of calling do_group_exit. Tidied up the vhost_task abstraction so that the definition of vhost_task only needs to be visible inside of vhost_task.c. Making it easier to review the code and tell what needs to be done where. As part of this the main loop has been moved from vhost_worker into vhost_task_fn. vhost_worker now returns true if work was done. The main loop has been updated to call get_signal which handles SIGSTOP, freezing, and collects the message that tells the thread to exit as part of process exit. This collection clears __fatal_signal_pending. This collection is not guaranteed to clear signal_pending() so clear that explicitly so the schedule() sleeps. For now the vhost thread continues to exist and run work until the last file descriptor is closed and the release function is called as part of freeing struct file. To avoid hangs in the coredump rendezvous and when killing threads in a multi-threaded exec. The coredump code and de_thread have been modified to ignore vhost threads. Remvoing the special case for exec appears to require teaching vhost_dev_flush how to directly complete transactions in case the vhost thread is no longer running. Removing the special case for coredump rendezvous requires either the above fix needed for exec or moving the coredump rendezvous into get_signal. Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Co-developed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2023-06-01 17:15:33 -04:00
Azeem Shaikh	76edc27eda	clocksource: Replace all non-returning strlcpy with strscpy strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Acked-by: John Stultz <jstultz@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230530163546.986188-1-azeemshaikh38@gmail.com	2023-06-01 11:24:50 -07:00
Azeem Shaikh	ffadc37252	bpf: Replace all non-returning strlcpy with strscpy strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. This is not the case here, however, in an effort to remove strlcpy() completely [2], lets replace strlcpy() here with strscpy(). No return values were used, so a direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/bpf/20230530155659.309657-1-azeemshaikh38@gmail.com	2023-05-31 13:04:20 +02:00
Pietro Borrello	81d0fa4cb4	tracing/probe: trace_probe_primary_from_call(): checked list_first_entry All callers of trace_probe_primary_from_call() check the return value to be non NULL. However, the function returns list_first_entry(&tpe->probes, ...) which can never be NULL. Additionally, it does not check for the list being possibly empty, possibly causing a type confusion on empty lists. Use list_first_entry_or_null() which solves both problems. Link: https://lore.kernel.org/linux-trace-kernel/20230128-list-entry-null-check-v1-1-8bde6a3da2ef@diag.uniroma1.it/ Fixes: 60d53e2c3b75 ("tracing/probe: Split trace_event related data from trace_probe") Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Mukesh Ojha <quic_mojha@quicinc.com> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2023-05-31 18:47:10 +09:00
Azeem Shaikh	d0c2d66fcc	ftrace: Replace all non-returning strlcpy with strscpy strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230517145323.1522010-1-azeemshaikh38@gmail.com	2023-05-30 16:42:00 -07:00
Luis Chamberlain	01e6aac78b	signal: move show_unhandled_signals sysctl to its own file The show_unhandled_signals sysctl is the only sysctl for debug left on kernel/sysctl.c. We've been moving the syctls out from kernel/sysctl.c so to help avoid merge conflicts as the shared array gets out of hand. This change incurs simplifies sysctl registration by localizing it where it should go for a penalty in size of increasing the kernel by 23 bytes, we accept this given recent cleanups have actually already saved us 1465 bytes in the prior commits. ./scripts/bloat-o-meter vmlinux.3-remove-dev-table vmlinux.4-remove-debug-table add/remove: 3/1 grow/shrink: 0/1 up/down: 177/-154 (23) Function old new delta signal_debug_table - 128 +128 init_signal_sysctls - 33 +33 __pfx_init_signal_sysctls - 16 +16 sysctl_init_bases 85 59 -26 debug_table 128 - -128 Total: Before=21256967, After=21256990, chg +0.00% Reviewed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-05-30 15:46:31 -07:00
Luis Chamberlain	996ef312f2	sysctl: remove empty dev table Now that all the dev sysctls have been moved out we can remove the dev sysctl base directory. We don't need to create base directories, they are created for you as if using 'mkdir -p' with register_syctl() and register_sysctl_init(). For details refer to sysctl_mkdir_p() usage. We save 90 bytes with this changes: ./scripts/bloat-o-meter vmlinux.2.remove-sysctl-table vmlinux.3-remove-dev-table add/remove: 0/1 grow/shrink: 0/1 up/down: 0/-90 (-90) Function old new delta sysctl_init_bases 111 85 -26 dev_table 64 - -64 Total: Before=21257057, After=21256967, chg -0.00% The empty dev table has been in place since the v2.5.0 days because back then ordering was essentialy. But later commit 7ec66d06362d ("sysctl: Stop requiring explicit management of sysctl directories"), merged as of v3.4-rc1, the entire ordering of directories was replaced by allowing sysctl directory autogeneration. This new mechanism introduced on v3.4 allows for sysctl directories to automatically be created for sysctl tables when they are needed and automatically removes them when no sysctl tables use them. That commit also added a dedicated struct ctl_dir as a new type for these autogenerated directories. Reviewed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-05-30 15:46:11 -07:00
Yonghong Song	e6c2f594ed	bpf: Silence a warning in btf_type_id_size() syzbot reported a warning in [1] with the following stacktrace: WARNING: CPU: 0 PID: 5005 at kernel/bpf/btf.c:1988 btf_type_id_size+0x2d9/0x9d0 kernel/bpf/btf.c:1988 ... RIP: 0010:btf_type_id_size+0x2d9/0x9d0 kernel/bpf/btf.c:1988 ... Call Trace: <TASK> map_check_btf kernel/bpf/syscall.c:1024 [inline] map_create+0x1157/0x1860 kernel/bpf/syscall.c:1198 __sys_bpf+0x127f/0x5420 kernel/bpf/syscall.c:5040 __do_sys_bpf kernel/bpf/syscall.c:5162 [inline] __se_sys_bpf kernel/bpf/syscall.c:5160 [inline] __x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5160 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd With the following btf [1] DECL_TAG 'a' type_id=4 component_idx=-1 [2] PTR '(anon)' type_id=0 [3] TYPE_TAG 'a' type_id=2 [4] VAR 'a' type_id=3, linkage=static and when the bpf_attr.btf_key_type_id = 1 (DECL_TAG), the following WARN_ON_ONCE in btf_type_id_size() is triggered: if (WARN_ON_ONCE(!btf_type_is_modifier(size_type) && !btf_type_is_var(size_type))) return NULL; Note that 'return NULL' is the correct behavior as we don't want a DECL_TAG type to be used as a btf_{key,value}_type_id even for the case like 'DECL_TAG -> STRUCT'. So there is no correctness issue here, we just want to silence warning. To silence the warning, I added DECL_TAG as one of kinds in btf_type_nosize() which will cause btf_type_id_size() returning NULL earlier without the warning. [1] https://lore.kernel.org/bpf/000000000000e0df8d05fc75ba86@google.com/ Reported-by: syzbot+958967f249155967d42a@syzkaller.appspotmail.com Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230530205029.264910-1-yhs@fb.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-05-30 14:34:46 -07:00
Miaohe Lin	3f4bf7aa31	sched/deadline: remove unused dl_bandwidth The default deadline bandwidth control structure has been removed since commit eb77cf1c151c ("sched/deadline: Remove unused def_dl_bandwidth") leading to unused init_dl_bandwidth() and struct dl_bandwidth. Remove them to clean up the code. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20230524102514.407486-1-linmiaohe@huawei.com	2023-05-30 22:46:26 +02:00
Arnd Bergmann	7aa55f2a59	sched/fair: Move unused stub functions to header These four functions have a normal definition for CONFIG_FAIR_GROUP_SCHED, and empty one that is only referenced when FAIR_GROUP_SCHED is disabled but CGROUP_SCHED is still enabled. If both are turned off, the functions are still defined but the misisng prototype causes a W=1 warning: kernel/sched/fair.c:12544:6: error: no previous prototype for 'free_fair_sched_group' kernel/sched/fair.c:12546:5: error: no previous prototype for 'alloc_fair_sched_group' kernel/sched/fair.c:12553:6: error: no previous prototype for 'online_fair_sched_group' kernel/sched/fair.c:12555:6: error: no previous prototype for 'unregister_fair_sched_group' Move the alternatives into the header as static inline functions with the correct combination of #ifdef checks to avoid the warning without adding even more complexity. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230522195021.3456768-6-arnd@kernel.org	2023-05-30 22:46:26 +02:00
Arnd Bergmann	f7df852ad6	sched: Make task_vruntime_update() prototype visible Having the prototype next to the caller but not visible to the callee causes a W=1 warning: kernel/sched/fair.c:11985:6: error: no previous prototype for 'task_vruntime_update' [-Werror=missing-prototypes] Move this to a header, as we do for all other function declarations. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230522195021.3456768-5-arnd@kernel.org	2023-05-30 22:46:26 +02:00
Arnd Bergmann	c0bdfd72fb	sched/fair: Hide unused init_cfs_bandwidth() stub init_cfs_bandwidth() is only used when CONFIG_FAIR_GROUP_SCHED is enabled, and without this causes a W=1 warning for the missing prototype: kernel/sched/fair.c:6131:6: error: no previous prototype for 'init_cfs_bandwidth' The normal implementation is only defined for CONFIG_CFS_BANDWIDTH, so the stub exists when CFS_BANDWIDTH is disabled but FAIR_GROUP_SCHED is enabled. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230522195021.3456768-4-arnd@kernel.org	2023-05-30 22:46:25 +02:00
Arnd Bergmann	378be384e0	sched: Add schedule_user() declaration The schedule_user() function is used on powerpc and sparc architectures, but only ever called from assembler, so it has no prototype, causing a harmless W=1 warning: kernel/sched/core.c:6730:35: error: no previous prototype for 'schedule_user' [-Werror=missing-prototypes] Add a prototype in sched/sched.h to shut up the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230522195021.3456768-3-arnd@kernel.org	2023-05-30 22:46:25 +02:00
Arnd Bergmann	d55ebae3f3	sched: Hide unused sched_update_scaling() This function is only used when CONFIG_SMP is enabled, without that there is no caller and no prototype: kernel/sched/fair.c:688:5: error: no previous prototype for 'sched_update_scaling' [-Werror=missing-prototypes Hide the definition in the same #ifdef check as the declaration. Fixes: 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230522195021.3456768-2-arnd@kernel.org	2023-05-30 22:46:24 +02:00
Linus Torvalds	6d86b56f54	modules-6.4-rc5 A fix is provided for ia64. Even though ia64 is on life support it helps to fix issues if we can. Thanks to Linus for doing tons of the ia64 debugging. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmR2JPsSHG1jZ3JvZkBr ZXJuZWwub3JnAAoJEM4jHQowkoinwGIQAIrm0qpwVgDvh98B/anYDVbPpdn1U+/I s9LBUMZCMIVEHGoOUU7YV3QGMh27OFvsNW72ggrgPuCbOgPzAfyxLJZoNY1nLURO ZWSi2Jg4Om82BTP/Vw79yDjikLjJWAUKT/nHuyJxOnLKVsjKrlKuX+UAPXCqMv1O yukZnCoQx6c57iRFLrpGq/+OM4Y/vZ2w8zeb5/HOSvVglkIITlXvcMUXmk0JxtS5 Rr5R14F59BZnpQD5F3hYYvIWycM2DYNdHA5FFPLt1US6TAXjYlk/hf6jgEmHxvHI jN6U3qG0Wm8VO9ZlPXzwKYTPmHhc5llXqSlkXILuYke79w1dS1BINJb7dg3LcGna eq2IX0ZwxC4SicvDDFWZGmriwIIYbR7jWrSmtTr7W1HIb5mICIrXSuG0150xaCYv fc79brpBRm5me+py+WYkFKqT3DKkGHdB85+c7iXPToZeHY8v6581t3R4XQSHzPF6 JY0Uhlsb17Korbsl8Iw2FFQ8K8DRjLpeeQnDU7iwEuGpZrPSx872o8pDNH1rnmIp wIOjfnjWc5arlCPjWWc0Ga/np7ewBP4Zw4Ff4AFyZG1ISrTmzFBWQI7TqyemjC9v RTIBN5QmJIW6Zai/7oL5xNapRJCO250+dg1G8uOka2dklfnLdKxIBafWBa+/Znmn ZpmhFljq1w3y =h/VF -----END PGP SIGNATURE----- Merge tag 'modules-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull modules fix from Luis Chamberlain: "A fix is provided for ia64. Even though ia64 is on life support it helps to fix issues if we can. Thanks to Linus for doing tons of the ia64 debugging" * tag 'modules-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: module: fix module load for ia64	2023-05-30 15:49:40 -04:00
Jonathan Cameron	143f83e200	perf: Allow a PMU to have a parent Some PMUs have well defined parents such as PCI devices. As the device_initialize() and device_add() are all within pmu_dev_alloc() which is called from perf_pmu_register() there is no opportunity to set the parent from within a driver. Add a struct device *parent field to struct pmu and use that to set the parent. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20230526095824.16336-2-Jonathan.Cameron@huawei.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2023-05-30 11:20:35 -07:00
Song Liu	db3e33dd8b	module: fix module load for ia64 Frank reported boot regression in ia64 as: ELILO v3.16 for EFI/IA-64 .. Uncompressing Linux... done Loading file AC100221.initrd.img...done [ 0.000000] Linux version 6.4.0-rc3 (root@x4270) (ia64-linux-gcc (GCC) 12.2.0, GNU ld (GNU Binutils) 2.39) #1 SMP Thu May 25 15:52:20 CEST 2023 [ 0.000000] efi: EFI v1.1 by HP [ 0.000000] efi: SALsystab=0x3ee7a000 ACPI 2.0=0x3fe2a000 ESI=0x3ee7b000 SMBIOS=0x3ee7c000 HCDP=0x3fe28000 [ 0.000000] PCDP: v3 at 0x3fe28000 [ 0.000000] earlycon: uart8250 at MMIO 0x00000000f4050000 (options '9600n8') [ 0.000000] printk: bootconsole [uart8250] enabled [ 0.000000] ACPI: Early table checksum verification disabled [ 0.000000] ACPI: RSDP 0x000000003FE2A000 000028 (v02 HP ) [ 0.000000] ACPI: XSDT 0x000000003FE2A02C 0000CC (v01 HP rx2620 00000000 HP 00000000) [...] [ 3.793350] Run /init as init process Loading, please wait... Starting systemd-udevd version 252.6-1 [ 3.951100] ------------[ cut here ]------------ [ 3.951100] WARNING: CPU: 6 PID: 140 at kernel/module/main.c:1547 __layout_sections+0x370/0x3c0 [ 3.949512] Unable to handle kernel paging request at virtual address 1000000000000000 [ 3.951100] Modules linked in: [ 3.951100] CPU: 6 PID: 140 Comm: (udev-worker) Not tainted 6.4.0-rc3 #1 [ 3.956161] (udev-worker)[142]: Oops 11003706212352 [1] [ 3.951774] Hardware name: hp server rx2620 , BIOS 04.29 11/30/2007 [ 3.951774] [ 3.951774] Call Trace: [ 3.958339] Unable to handle kernel paging request at virtual address 1000000000000000 [ 3.956161] Modules linked in: [ 3.951774] [<a0000001000156d0>] show_stack.part.0+0x30/0x60 [ 3.951774] sp=e000000183a67b20 bsp=e000000183a61628 [ 3.956161] [ 3.956161] which bisect to module_memory change [1]. Debug showed that ia64 uses some special sections: __layout_sections: section .got (sh_flags 10000002) matched to MOD_INVALID __layout_sections: section .sdata (sh_flags 10000003) matched to MOD_INVALID __layout_sections: section .sbss (sh_flags 10000003) matched to MOD_INVALID All these sections are loaded to module core memory before [1]. Fix ia64 boot by loading these sections to MOD_DATA (core rw data). [1] commit ac3b43283923 ("module: replace module_layout with module_memory") Fixes: ac3b43283923 ("module: replace module_layout with module_memory") Reported-by: Frank Scheiner <frank.scheiner@web.de> Closes: https://lists.debian.org/debian-ia64/2023/05/msg00010.html Closes: https://marc.info/?l=linux-ia64&m=168509859125505 Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Song Liu <song@kernel.org> Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-05-30 09:29:40 -07:00
Linus Torvalds	8b817fded4	Tracing fixes: User events: - Use long instead of int for storing the enable set/clear bit, as it was found that big endian machines could end up using the wrong bits. - Split allocating mm and attaching it. This keeps the allocation separate from the registration and avoids various races. - Remove RCU locking around pin_user_pages_remote() as that can schedule. The RCU protection is no longer needed with the above split of mm allocation and attaching. - Rename the "link" fields of the various structs to something more meaningful. - Add comments around user_event_mm struct usage and locking requirements. Timerlat tracer: - Fix missed wakeup of timerlat thread caused by the timerlat interrupt triggering when tracing is off. The timer interrupt handler needs to always wake up the timerlat thread regardless if tracing is enabled or not, otherwise, it will never wake up. Histograms: - Fix regression of breaking the "stacktrace" modifier for variables. That modifier cannot be used for values, but can be used for variables that are passed from one histogram to the next. This was broken when adding the restriction to values as the variable logic used the same code. - Rename the special field "stacktrace" to "common_stacktrace". Special fields (that are not actually part of the event, but can act just like event fields, like 'comm' and 'timestamp') should be prefixed with 'common_' for consistency. To keep backward compatibility, 'stacktrace' can still be used (as with the special field 'cpu'), but can be overridden if the event has a field called 'stacktrace'. - Update the synthetic event selftests to use the new name (synthetic events are created by histograms) Tracing bootup selftests: - Reorganize the code to keep artifacts of the selftests not compiled in when selftests are not configured. - Add various cond_resched() around the selftest code, as the softlock watchdog was triggering much more often. It appears that the kernel runs slower now with full debugging enabled. - While debugging ftrace with ftrace (using an instance ring buffer instead of the top level one), I found that the selftests were disabling prints to the debug instance. This should not happen, as the selftests only disable printing to the main buffer as the selftests examine the main buffer to see if it has what it expects, and prints can make the tests fail. Make the selftests only disable printing to the toplevel buffer, and leave the instance buffers alone. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZHQGJBQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qu6hAQCJ1WebZUTJ/s7pFo36mXirLnrW4afB Ua6sALseqKNesgEAyhLmd2+sMeqmAbCCIUWtcWJb/Pod0jGOt0U8+cBxfw8= =PhaX -----END PGP SIGNATURE----- Merge tag 'trace-v6.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: "User events: - Use long instead of int for storing the enable set/clear bit, as it was found that big endian machines could end up using the wrong bits. - Split allocating mm and attaching it. This keeps the allocation separate from the registration and avoids various races. - Remove RCU locking around pin_user_pages_remote() as that can schedule. The RCU protection is no longer needed with the above split of mm allocation and attaching. - Rename the "link" fields of the various structs to something more meaningful. - Add comments around user_event_mm struct usage and locking requirements. Timerlat tracer: - Fix missed wakeup of timerlat thread caused by the timerlat interrupt triggering when tracing is off. The timer interrupt handler needs to always wake up the timerlat thread regardless if tracing is enabled or not, otherwise, it will never wake up. Histograms: - Fix regression of breaking the "stacktrace" modifier for variables. That modifier cannot be used for values, but can be used for variables that are passed from one histogram to the next. This was broken when adding the restriction to values as the variable logic used the same code. - Rename the special field "stacktrace" to "common_stacktrace". Special fields (that are not actually part of the event, but can act just like event fields, like 'comm' and 'timestamp') should be prefixed with 'common_' for consistency. To keep backward compatibility, 'stacktrace' can still be used (as with the special field 'cpu'), but can be overridden if the event has a field called 'stacktrace'. - Update the synthetic event selftests to use the new name (synthetic events are created by histograms) Tracing bootup selftests: - Reorganize the code to keep artifacts of the selftests not compiled in when selftests are not configured. - Add various cond_resched() around the selftest code, as the softlock watchdog was triggering much more often. It appears that the kernel runs slower now with full debugging enabled. - While debugging ftrace with ftrace (using an instance ring buffer instead of the top level one), I found that the selftests were disabling prints to the debug instance. This should not happen, as the selftests only disable printing to the main buffer as the selftests examine the main buffer to see if it has what it expects, and prints can make the tests fail. Make the selftests only disable printing to the toplevel buffer, and leave the instance buffers alone" * tag 'trace-v6.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Have function_graph selftest call cond_resched() tracing: Only make selftest conditionals affect the global_trace tracing: Make tracing_selftest_running/delete nops when not used tracing: Have tracer selftests call cond_resched() before running tracing: Move setting of tracing_selftest_running out of register_tracer() tracing/selftests: Update synthetic event selftest to use common_stacktrace tracing: Rename stacktrace field to common_stacktrace tracing/histograms: Allow variables to have some modifiers tracing/user_events: Document user_event_mm one-shot list usage tracing/user_events: Rename link fields for clarity tracing/user_events: Remove RCU lock while pinning pages tracing/user_events: Split up mm alloc and attach tracing/timerlat: Always wakeup the timerlat thread tracing/user_events: Use long vs int for atomic bit ops	2023-05-29 07:20:13 -04:00

... 5 6 7 8 9 ...

42274 Commits