IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Expected canonical argument type for global function arguments
representing PTR_TO_CTX is `bpf_user_pt_regs_t *ctx`. This currently
works on s390x by accident because kernel resolves such typedef to
underlying struct (which is anonymous on s390x), and erroneously
accepting it as expected context type. We are fixing this problem next,
which would break s390x arch, so we need to handle `bpf_user_pt_regs_t`
case explicitly for KPROBE programs.
Fixes: 91cc1a9974 ("bpf: Annotate context types")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240212233221.2575350-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Return result of btf_get_prog_ctx_type() is never used and callers only
check NULL vs non-NULL case to determine if given type matches expected
PTR_TO_CTX type. So rename function to `btf_is_prog_ctx_type()` and
return a simple true/false. We'll use this simpler interface to handle
kprobe program type's special typedef case in the next patch.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240212233221.2575350-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Originally, this patch removed a redundant check in
BPF_CGROUP_RUN_PROG_INET_EGRESS, as the check was already being done in
the function it called, __cgroup_bpf_run_filter_skb. For v2, it was
reccomended that I remove the check from __cgroup_bpf_run_filter_skb,
and add the checks to the other macro that calls that function,
BPF_CGROUP_RUN_PROG_INET_INGRESS.
To sum it up, checking that the socket exists and that it is a full
socket is now part of both macros BPF_CGROUP_RUN_PROG_INET_EGRESS and
BPF_CGROUP_RUN_PROG_INET_INGRESS, and it is no longer part of the
function they call, __cgroup_bpf_run_filter_skb.
v3->v4: Fixed weird merge conflict.
v2->v3: Sent to bpf-next instead of generic patch
v1->v2: Addressed feedback about where check should be removed.
Signed-off-by: Oliver Crumrine <ozlinuxc@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/7lv62yiyvmj5a7eozv2iznglpkydkdfancgmbhiptrgvgan5sy@3fl3onchgdz3
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Collect argument information from the type information of stub functions to
mark arguments of BPF struct_ops programs with PTR_MAYBE_NULL if they are
nullable. A nullable argument is annotated by suffixing "__nullable" at
the argument name of stub function.
For nullable arguments, this patch sets a struct bpf_ctx_arg_aux to label
their reg_type with PTR_TO_BTF_ID | PTR_TRUSTED | PTR_MAYBE_NULL. This
makes the verifier to check programs and ensure that they properly check
the pointer. The programs should check if the pointer is null before
accessing the pointed memory.
The implementer of a struct_ops type should annotate the arguments that can
be null. The implementer should define a stub function (empty) as a
placeholder for each defined operator. The name of a stub function should
be in the pattern "<st_op_type>__<operator name>". For example, for
test_maybe_null of struct bpf_testmod_ops, it's stub function name should
be "bpf_testmod_ops__test_maybe_null". You mark an argument nullable by
suffixing the argument name with "__nullable" at the stub function.
Since we already has stub functions for kCFI, we just reuse these stub
functions with the naming convention mentioned earlier. These stub
functions with the naming convention is only required if there are nullable
arguments to annotate. For functions having not nullable arguments, stub
functions are not necessary for the purpose of this patch.
This patch will prepare a list of struct bpf_ctx_arg_aux, aka arg_info, for
each member field of a struct_ops type. "arg_info" will be assigned to
"prog->aux->ctx_arg_info" of BPF struct_ops programs in
check_struct_ops_btf_id() so that it can be used by btf_ctx_access() later
to set reg_type properly for the verifier.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240209023750.1153905-4-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Move __kfunc_param_match_suffix() to btf.c and rename it as
btf_param_match_suffix(). It can be reused by bpf_struct_ops later.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240209023750.1153905-3-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Compiling with CONFIG_BPF_SYSCALL & !CONFIG_BPF_JIT throws the below
warning:
"WARN: resolve_btfids: unresolved symbol bpf_cpumask"
Fix it by adding the appropriate #ifdef.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20240208100115.602172-1-hbathini@linux.ibm.com
Currently tracing is supposed not to allow for bpf_spin_{lock,unlock}()
helper calls. This is to prevent deadlock for the following cases:
- there is a prog (prog-A) calling bpf_spin_{lock,unlock}().
- there is a tracing program (prog-B), e.g., fentry, attached
to bpf_spin_lock() and/or bpf_spin_unlock().
- prog-B calls bpf_spin_{lock,unlock}().
For such a case, when prog-A calls bpf_spin_{lock,unlock}(),
a deadlock will happen.
The related source codes are below in kernel/bpf/helpers.c:
notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock)
notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock)
notrace is supposed to prevent fentry prog from attaching to
bpf_spin_{lock,unlock}().
But actually this is not the case and fentry prog can successfully
attached to bpf_spin_lock(). Siddharth Chintamaneni reported
the issue in [1]. The following is the macro definition for
above BPF_CALL_1:
#define BPF_CALL_x(x, name, ...) \
static __always_inline \
u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \
typedef u64 (*btf_##name)(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \
u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)); \
u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)) \
{ \
return ((btf_##name)____##name)(__BPF_MAP(x,__BPF_CAST,__BPF_N,__VA_ARGS__));\
} \
static __always_inline \
u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__))
#define BPF_CALL_1(name, ...) BPF_CALL_x(1, name, __VA_ARGS__)
The notrace attribute is actually applied to the static always_inline function
____bpf_spin_{lock,unlock}(). The actual callback function
bpf_spin_{lock,unlock}() is not marked with notrace, hence
allowing fentry prog to attach to two helpers, and this
may cause the above mentioned deadlock. Siddharth Chintamaneni
actually has a reproducer in [2].
To fix the issue, a new macro NOTRACE_BPF_CALL_1 is introduced which
will add notrace attribute to the original function instead of
the hidden always_inline function and this fixed the problem.
[1] https://lore.kernel.org/bpf/CAE5sdEigPnoGrzN8WU7Tx-h-iFuMZgW06qp0KHWtpvoXxf1OAQ@mail.gmail.com/
[2] https://lore.kernel.org/bpf/CAE5sdEg6yUc_Jz50AnUXEEUh6O73yQ1Z6NV2srJnef0ZrQkZew@mail.gmail.com/
Fixes: d83525ca62 ("bpf: introduce bpf_spin_lock")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240207070102.335167-1-yonghong.song@linux.dev
Since 20d59ee551 ("libbpf: add bpf_core_cast() macro"), libbpf is now
exporting a const arg version of bpf_rdonly_cast(). This causes the
following conflicting type error when generating kfunc prototypes from
BTF:
In file included from skeleton/pid_iter.bpf.c:5:
/home/dxu/dev/linux/tools/bpf/bpftool/bootstrap/libbpf/include/bpf/bpf_core_read.h:297:14: error: conflicting types for 'bpf_rdonly_cast'
extern void *bpf_rdonly_cast(const void *obj__ign, __u32 btf_id__k) __ksym __weak;
^
./vmlinux.h:135625:14: note: previous declaration is here
extern void *bpf_rdonly_cast(void *obj__ign, u32 btf_id__k) __weak __ksym;
This is b/c the kernel defines bpf_rdonly_cast() with non-const arg.
Since const arg is more permissive and thus backwards compatible, we
change the kernel definition as well to avoid conflicting type errors.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/dfd3823f11ffd2d4c838e961d61ec9ae8a646773.1707080349.git.dxu@dxuuu.xyz
tracer_tracing_is_on() checks whether record_disabled is not zero. This
checks both the record_disabled counter and the RB_BUFFER_OFF flag.
Reading the source it looks like this function should only check for
the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on().
This fixes spurious fails in the 'test for function traceon/off triggers'
test from the ftrace testsuite when the system is under load.
Link: https://lore.kernel.org/linux-trace-kernel/20240205065340.2848065-1-svens@linux.ibm.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Tested-By: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Commit a8b9cf62ad ("ftrace: Fix DIRECT_CALLS to use SAVE_REGS by
default") attempted to fix an issue with direct trampolines on x86, see
its description for details. However, it wrongly referenced the
HAVE_DYNAMIC_FTRACE_WITH_REGS config option and the problem is still
present.
Add the missing "CONFIG_" prefix for the logic to work as intended.
Link: https://lore.kernel.org/linux-trace-kernel/20240213132434.22537-1-petr.pavlu@suse.com
Fixes: a8b9cf62ad ("ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
timers have been migrated on the CPU down path and thus said timer
will get ignored
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmXIpjEACgkQEsHwGGHe
VUrG8g/7BuGzGzFHYli7LZKuzI76tN0CU44v7oPauqZvkcaSUmF/E+/RAeHxMjUa
20Mlo2AUGrPkKbRUgN93IrRKZAfdjqKQ6UZiH/FTyFUtFfs9XNv2G0cCIVQAepXL
WbKPxL/M9vbnJwK6CC5prSLHazGH2y5vg0zbY9RycGKvxza+HgkIrZoYp7ctHX1A
xZFF5EyLu6g0x+yz7Tt0Zf93tADJxFSHmfz7Nmx1RFh0GJzceuKUvC2ZVyUr63fv
ttQn4TLm0NKySaR+SPYPKKp1lLkHvfh9pFV2kdI/c7oo4Pig6bFTjMJpf0o541A2
s87sz2w6P16LMi2sjf/ASQmgMHmGiIQlmjjFbVX8sKeibdtUM3Vg7s/Hs/EilY7Q
P7ANSmZTtQBoQsWd/E+8aOBUkC263Ua0uoOufH7dcfL9mSJxUos1SPleRnaO4o5n
mm5GVDxggNj879nHZUBh9g7+JsLdZ5yozWne0xAyrI0WycsK0hzWuW0B5p4QMK3T
4zamSuZNObBUdbwb5cD1fL5X5aRkPvorj9iLliui1X0wfA+nByR5cjlcptXirCq4
8D2WCiQ+tbP3w4UJ0gtrC3mlXokWorMV8XoLZm9+RtWLi2LFAhMCZ9U3B0EgsziD
whMGJBXKRaeeFnCnxPjlHC/5a1nXTCqFlks+PSUVT6RiavYJ7VA=
=+SjG
-----END PGP SIGNATURE-----
Merge tag 'timers_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:
- Make sure a warning is issued when a hrtimer gets queued after the
timers have been migrated on the CPU down path and thus said timer
will get ignored
* tag 'timers_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimer: Report offline hrtimer enqueue
issues or aren't considered to be needed in earlier kernel versions.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZcfLvgAKCRDdBJ7gKXxA
joCTAP4/XdBXA7Sj3GyjSAkYjg2U0quwX9oRhsx2Qy9duPDaLAD+NRl9XG14YSOB
f/7OiTQoDfnwVgHAOVBHY/ylrcgZRQg=
=2wdS
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"21 hotfixes. 12 are cc:stable and the remainder pertain to post-6.7
issues or aren't considered to be needed in earlier kernel versions"
* tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
nilfs2: fix potential bug in end_buffer_async_write
mm/damon/sysfs-schemes: fix wrong DAMOS tried regions update timeout setup
nilfs2: fix hang in nilfs_lookup_dirty_data_buffers()
MAINTAINERS: Leo Yan has moved
mm/zswap: don't return LRU_SKIP if we have dropped lru lock
fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super
mailmap: switch email address for John Moon
mm: zswap: fix objcg use-after-free in entry destruction
mm/madvise: don't forget to leave lazy MMU mode in madvise_cold_or_pageout_pte_range()
arch/arm/mm: fix major fault accounting when retrying under per-VMA lock
selftests: core: include linux/close_range.h for CLOSE_RANGE_* macros
mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page
mm: memcg: optimize parent iteration in memcg_rstat_updated()
nilfs2: fix data corruption in dsync block recovery for small block sizes
mm/userfaultfd: UFFDIO_MOVE implementation should use ptep_get()
exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock)
fs/proc: do_task_stat: use sig->stats_lock to gather the threads/children stats
fs/proc: do_task_stat: move thread_group_cputime_adjusted() outside of lock_task_sighand()
getrusage: use sig->stats_lock rather than lock_task_sighand()
getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand()
...
Turn kill_pid_info() into kill_pid_info_type(), this allows to pass any
pid_type to group_send_sig_info(), despite its name it should work fine
even if type = PIDTYPE_PID.
Change pidfd_send_signal() to use PIDTYPE_PID or PIDTYPE_TGID depending
on PIDFD_THREAD.
While at it kill another TODO comment in pidfd_show_fdinfo(). As Christian
expains fdinfo reports f_flags, userspace can already detect PIDFD_THREAD.
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240209130650.GA8048@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
So that do_tkill() can use this helper too. This also simplifies
the next patch.
TODO: perhaps we can kill prepare_kill_siginfo() and change the
callers to use SEND_SIG_NOINFO, but this needs some changes in
__send_signal_locked() and TP_STORE_SIGINFO().
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240209130620.GA8039@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Async can schedule a number of interdependent work items. However, since
5797b1c189 ("workqueue: Implement system-wide nr_active enforcement for
unbound workqueues"), unbound workqueues have separate min_active which sets
the number of interdependent work items that can be handled. This default
value is 8 which isn't sufficient for async and can lead to stalls during
resume from suspend in some cases.
Let's use a dedicated unbound workqueue with raised min_active.
Link: http://lkml.kernel.org/r/708a65cc-79ec-44a6-8454-a93d0f3114c3@samsung.com
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since 5797b1c189 ("workqueue: Implement system-wide nr_active enforcement
for unbound workqueues"), unbound workqueues have separate min_active which
sets the number of interdependent work items that can be handled. This value
is currently initialized to WQ_DFL_MIN_ACTIVE which is 8. This isn't high
enough for some users, let's add an interface to adjust the setting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix the kernel-doc comment of the unplug_oldest_pwq() function to enable
proper processing and formatting of the embedded ASCII diagram.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Fix broken direct trampolines being called when another callback is
attached the same function. ARM 64 does not support FTRACE_WITH_REGS, and
when it added direct trampoline calls from ftrace, it removed the
"WITH_REGS" flag from the ftrace_ops for direct trampolines. This broke
x86 as x86 requires direct trampolines to have WITH_REGS. This wasn't
noticed because direct trampolines work as long as the function it is
attached to is not shared with other callbacks (like the function tracer).
When there's other callbacks, a helper trampoline is called, to call all
the non direct callbacks and when it returns, the direct trampoline is
called. For x86, the direct trampoline sets a flag in the regs field to
tell the x86 specific code to call the direct trampoline. But this only
works if the ftrace_ops had WITH_REGS set. ARM does things differently
that does not require this. For now, set WITH_REGS if the arch supports
WITH_REGS (which ARM does not), and this makes it work for both ARM64 and
x86.
- Fix wasted memory in the saved_cmdlines logic.
The saved_cmdlines is a cache that maps PIDs to COMMs that tracing can
use. Most trace events only save the PID in the event. The saved_cmdlines
file lists PIDs to COMMs so that the tracing tools can show an actual name
and not just a PID for each event. There's an array of PIDs that map to a
small set of saved COMM strings. The array is set to PID_MAX_DEFAULT which
is usually set to 32768. When a PID comes in, it will add itself to this
array along with the index into the COMM array (note if the system allows
more than PID_MAX_DEFAULT, this cache is similar to cache lines as an
update of a PID that has the same PID_MAX_DEFAULT bits set will flush out
another task with the same matching bits set).
A while ago, the size of this cache was changed to be dynamic and the
array was moved into a structure and created with kmalloc(). But this
new structure had the size of 131104 bytes, or 0x20020 in hex. As kmalloc
allocates in powers of two, it was actually allocating 0x40000 bytes
(262144) leaving 131040 bytes of wasted memory. The last element of this
structure was a pointer to the COMM string array which defaulted to just
saving 128 COMMs.
By changing the last field of this structure to a variable length string,
and just having it round up to fill the allocated memory, the default
size of the saved COMM cache is now 8190. This not only uses the wasted
space, but actually saves space by removing the extra allocation for the
COMM names.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZcYi8RQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qqENAQD6xGE9EPkbHArElKfgpSuQOfGhcyyP
LjgVhqVgmIoqUwD8CeVpxk3VwZIOQYvPn5XictcZgkYSeEWUZcKYg4c/3gs=
=iIBv
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix broken direct trampolines being called when another callback is
attached the same function.
ARM 64 does not support FTRACE_WITH_REGS, and when it added direct
trampoline calls from ftrace, it removed the "WITH_REGS" flag from
the ftrace_ops for direct trampolines. This broke x86 as x86 requires
direct trampolines to have WITH_REGS.
This wasn't noticed because direct trampolines work as long as the
function it is attached to is not shared with other callbacks (like
the function tracer). When there are other callbacks, a helper
trampoline is called, to call all the non direct callbacks and when
it returns, the direct trampoline is called.
For x86, the direct trampoline sets a flag in the regs field to tell
the x86 specific code to call the direct trampoline. But this only
works if the ftrace_ops had WITH_REGS set. ARM does things
differently that does not require this. For now, set WITH_REGS if the
arch supports WITH_REGS (which ARM does not), and this makes it work
for both ARM64 and x86.
- Fix wasted memory in the saved_cmdlines logic.
The saved_cmdlines is a cache that maps PIDs to COMMs that tracing
can use. Most trace events only save the PID in the event. The
saved_cmdlines file lists PIDs to COMMs so that the tracing tools can
show an actual name and not just a PID for each event. There's an
array of PIDs that map to a small set of saved COMM strings. The
array is set to PID_MAX_DEFAULT which is usually set to 32768. When a
PID comes in, it will add itself to this array along with the index
into the COMM array (note if the system allows more than
PID_MAX_DEFAULT, this cache is similar to cache lines as an update of
a PID that has the same PID_MAX_DEFAULT bits set will flush out
another task with the same matching bits set).
A while ago, the size of this cache was changed to be dynamic and the
array was moved into a structure and created with kmalloc(). But this
new structure had the size of 131104 bytes, or 0x20020 in hex. As
kmalloc allocates in powers of two, it was actually allocating
0x40000 bytes (262144) leaving 131040 bytes of wasted memory. The
last element of this structure was a pointer to the COMM string array
which defaulted to just saving 128 COMMs.
By changing the last field of this structure to a variable length
string, and just having it round up to fill the allocated memory, the
default size of the saved COMM cache is now 8190. This not only uses
the wasted space, but actually saves space by removing the extra
allocation for the COMM names.
* tag 'trace-v6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix wasted memory in saved_cmdlines logic
ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default
While looking at improving the saved_cmdlines cache I found a huge amount
of wasted memory that should be used for the cmdlines.
The tracing data saves pids during the trace. At sched switch, if a trace
occurred, it will save the comm of the task that did the trace. This is
saved in a "cache" that maps pids to comms and exposed to user space via
the /sys/kernel/tracing/saved_cmdlines file. Currently it only caches by
default 128 comms.
The structure that uses this creates an array to store the pids using
PID_MAX_DEFAULT (which is usually set to 32768). This causes the structure
to be of the size of 131104 bytes on 64 bit machines.
In hex: 131104 = 0x20020, and since the kernel allocates generic memory in
powers of two, the kernel would allocate 0x40000 or 262144 bytes to store
this structure. That leaves 131040 bytes of wasted space.
Worse, the structure points to an allocated array to store the comm names,
which is 16 bytes times the amount of names to save (currently 128), which
is 2048 bytes. Instead of allocating a separate array, make the structure
end with a variable length string and use the extra space for that.
This is similar to a recommendation that Linus had made about eventfs_inode names:
https://lore.kernel.org/all/20240130190355.11486-5-torvalds@linux-foundation.org/
Instead of allocating a separate string array to hold the saved comms,
have the structure end with: char saved_cmdlines[]; and round up to the
next power of two over sizeof(struct saved_cmdline_buffers) + num_cmdlines * TASK_COMM_LEN
It will use this extra space for the saved_cmdline portion.
Now, instead of saving only 128 comms by default, by using this wasted
space at the end of the structure it can save over 8000 comms and even
saves space by removing the need for allocating the other array.
Link: https://lore.kernel.org/linux-trace-kernel/20240209063622.1f7b6d5f@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Mete Durlu <meted@linux.ibm.com>
Fixes: 939c7a4f04 ("tracing: Introduce saved_cmdlines_size file")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The commit 60c8971899 ("ftrace: Make DIRECT_CALLS work WITH_ARGS
and !WITH_REGS") changed DIRECT_CALLS to use SAVE_ARGS when there
are multiple ftrace_ops at the same function, but since the x86 only
support to jump to direct_call from ftrace_regs_caller, when we set
the function tracer on the same target function on x86, ftrace-direct
does not work as below (this actually works on arm64.)
At first, insmod ftrace-direct.ko to put a direct_call on
'wake_up_process()'.
# insmod kernel/samples/ftrace/ftrace-direct.ko
# less trace
...
<idle>-0 [006] ..s1. 564.686958: my_direct_func: waking up rcu_preempt-17
<idle>-0 [007] ..s1. 564.687836: my_direct_func: waking up kcompactd0-63
<idle>-0 [006] ..s1. 564.690926: my_direct_func: waking up rcu_preempt-17
<idle>-0 [006] ..s1. 564.696872: my_direct_func: waking up rcu_preempt-17
<idle>-0 [007] ..s1. 565.191982: my_direct_func: waking up kcompactd0-63
Setup a function filter to the 'wake_up_process' too, and enable it.
# cd /sys/kernel/tracing/
# echo wake_up_process > set_ftrace_filter
# echo function > current_tracer
# less trace
...
<idle>-0 [006] ..s3. 686.180972: wake_up_process <-call_timer_fn
<idle>-0 [006] ..s3. 686.186919: wake_up_process <-call_timer_fn
<idle>-0 [002] ..s3. 686.264049: wake_up_process <-call_timer_fn
<idle>-0 [002] d.h6. 686.515216: wake_up_process <-kick_pool
<idle>-0 [002] d.h6. 686.691386: wake_up_process <-kick_pool
Then, only function tracer is shown on x86.
But if you enable 'kprobe on ftrace' event (which uses SAVE_REGS flag)
on the same function, it is shown again.
# echo 'p wake_up_process' >> dynamic_events
# echo 1 > events/kprobes/p_wake_up_process_0/enable
# echo > trace
# less trace
...
<idle>-0 [006] ..s2. 2710.345919: p_wake_up_process_0: (wake_up_process+0x4/0x20)
<idle>-0 [006] ..s3. 2710.345923: wake_up_process <-call_timer_fn
<idle>-0 [006] ..s1. 2710.345928: my_direct_func: waking up rcu_preempt-17
<idle>-0 [006] ..s2. 2710.349931: p_wake_up_process_0: (wake_up_process+0x4/0x20)
<idle>-0 [006] ..s3. 2710.349934: wake_up_process <-call_timer_fn
<idle>-0 [006] ..s1. 2710.349937: my_direct_func: waking up rcu_preempt-17
To fix this issue, use SAVE_REGS flag for multiple ftrace_ops flag of
direct_call by default.
Link: https://lore.kernel.org/linux-trace-kernel/170484558617.178953.1590516949390270842.stgit@devnote2
Fixes: 60c8971899 ("ftrace: Make DIRECT_CALLS work WITH_ARGS and !WITH_REGS")
Cc: stable@vger.kernel.org
Cc: Florent Revest <revest@chromium.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64]
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/stmicro/stmmac/common.h
38cc3c6dcc ("net: stmmac: protect updates of 64-bit statistics counters")
fd5a6a7131 ("net: stmmac: est: Per Tx-queue error count for HLBF")
c5c3e1bfc9 ("net: stmmac: Offload queueMaxSDU from tc-taprio")
drivers/net/wireless/microchip/wilc1000/netdev.c
c901388028 ("wifi: fill in MODULE_DESCRIPTION()s for wilc1000")
328efda22a ("wifi: wilc1000: do not realloc workqueue everytime an interface is added")
net/unix/garbage.c
11498715f2 ("af_unix: Remove io_uring code for GC.")
1279f9d9de ("af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to the handling in the functions __register_btf_kfunc_id_set()
and register_btf_id_dtor_kfuncs(), this patch uses the newly added
helper check_btf_kconfigs() to handle module with its btf section
stripped.
While at it, the patch also adds the missed IS_ERR() check to fix the
commit f6be98d199 ("bpf, net: switch to dynamic registration")
Fixes: f6be98d199 ("bpf, net: switch to dynamic registration")
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Link: https://lore.kernel.org/r/69082b9835463fe36f9e354bddf2d0a97df39c2b.1707373307.git.tanggeliang@kylinos.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Commit 85f0ab43f9 ("kernel/workqueue: Bind rescuer to unbound
cpumask for WQ_UNBOUND") modified init_rescuer() to bind rescuer of
an unbound workqueue to the cpumask in wq->unbound_attrs. However
unbound_attrs->cpumask's of all workqueues are initialized to
cpu_possible_mask and will only be changed if it has the WQ_SYSFS flag
to expose a cpumask sysfs file to be written by users. So this patch
doesn't achieve what it is intended to do.
If an unbound workqueue is created after wq_unbound_cpumask is modified
and there is no more unbound cpumask update after that, the unbound
rescuer will be bound to all CPUs unless the workqueue is created
with the WQ_SYSFS flag and a user explicitly modified its cpumask
sysfs file. Fix this problem by binding directly to wq_unbound_cpumask
in init_rescuer().
Fixes: 85f0ab43f9 ("kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When workqueue cpumask changes are committed the associated rescuer (if
one exists) affinity is not touched and this might be a problem down the
line for isolated setups.
Make sure rescuers affinity is updated every time a workqueue cpumask
changes, so that rescuers can't break isolation.
[longman: set_cpus_allowed_ptr() will block until the designated task
is enqueued on an allowed CPU, no wake_up_process() needed. Also use
the unbound_effective_cpumask() helper as suggested by Tejun.]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch extracts duplicate code on error path when btf_get_module_btf()
returns NULL from the functions __register_btf_kfunc_id_set() and
register_btf_id_dtor_kfuncs() into a new helper named check_btf_kconfigs()
to check CONFIG_DEBUG_INFO_BTF and CONFIG_DEBUG_INFO_BTF_MODULES in it.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/fa5537fc55f1e4d0bfd686598c81b7ab9dbd82b7.1707373307.git.tanggeliang@kylinos.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Ordered workqueues does not currently follow changes made to the
global unbound cpumask because per-pool workqueue changes may break
the ordering guarantee. IOW, a work function in an ordered workqueue
may run on an isolated CPU.
This patch enables ordered workqueues to follow changes made to the
global unbound cpumask by temporaily plug or suspend the newly allocated
pool_workqueue from executing newly queued work items until the old
pwq has been properly drained. For ordered workqueues, there should
only be one pwq that is unplugged, the rests should be plugged.
This enables ordered workqueues to follow the unbound cpumask changes
like other unbound workqueues at the expense of some delay in execution
of work functions during the transition period.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a new pwq into the tail of wq->pwqs so that pwq iteration will
start from the oldest pwq to the newest. This ordering will facilitate
the inclusion of ordered workqueues in a wq_unbound_cpumask update.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The same as __register_btf_kfunc_id_set(), to let the modules with
stripped btf section loaded, this patch changes the return value of
register_btf_id_dtor_kfuncs() too from -ENOENT to 0 when btf is NULL.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Link: https://lore.kernel.org/r/eab65586d7fb0e72f2707d3747c7d4a5d60c823f.1707373307.git.tanggeliang@kylinos.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
ri and sym is assigned first, so it does not need to initialize the
assignment.
Link: https://lore.kernel.org/all/20230919012823.7815-1-zeming@nfschina.com/
Signed-off-by: Li zeming <zeming@nfschina.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fix to show a parse error for bad type (non-string) for $comm/$COMM and
immediate-string. With this fix, error_log file shows appropriate error
message as below.
/sys/kernel/tracing # echo 'p vfs_read $comm:u32' >> kprobe_events
sh: write error: Invalid argument
/sys/kernel/tracing # echo 'p vfs_read \"hoge":u32' >> kprobe_events
sh: write error: Invalid argument
/sys/kernel/tracing # cat error_log
[ 30.144183] trace_kprobe: error: $comm and immediate-string only accepts string type
Command: p vfs_read $comm:u32
^
[ 62.618500] trace_kprobe: error: $comm and immediate-string only accepts string type
Command: p vfs_read \"hoge":u32
^
Link: https://lore.kernel.org/all/170602215411.215583.2238016352271091852.stgit@devnote2/
Fixes: 3dd1f7f24f ("tracing: probeevent: Fix to make the type of $comm string")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The device drivers can modify EM at runtime by providing a new EM table.
The EM is used by the EAS and the em_perf_state::cost stores
pre-calculated value to avoid overhead. This patch provides the API for
device drivers to calculate the cost values properly (and not duplicate
the same code).
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Remove the old EM table which wasn't able to modify the data. Clean the
unneeded function and refactor the code a bit.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Dump the runtime EM table values which can be modified in time. In order
to do that allocate chunk of debug memory which can be later freed
automatically thanks to devm_kcalloc().
This design can handle the fact that the EM table memory can change
after EM update, so debug code cannot use the pointer from initialization
phase.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Energy Model (EM) can be modified at runtime which brings new
possibilities. The em_cpu_energy() is called by the Energy Aware Scheduler
(EAS) in its hot path. The energy calculation uses power value for
a given performance state (ps) and the CPU busy time as percentage for that
given frequency.
It is possible to avoid the division by 'scale_cpu' at runtime, because
EM is updated whenever new max capacity CPU is set in the system.
Use that feature and do the needed division during the calculation of the
coefficient 'ps->cost'. That enhanced 'ps->cost' value can be then just
multiplied simply by utilization:
pd_nrg = ps->cost * \Sum cpu_util
to get the needed energy for whole Performance Domain (PD).
With this optimization and earlier removal of map_util_freq(), the
em_cpu_energy() should run faster on the Big CPU by 1.43x and on the Little
CPU by 1.69x (RockPi 4B board).
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The patch adds needed infrastructure to handle the late CPUs boot, which
might change the previous CPUs capacity values. With this changes the new
CPUs which try to register EM will trigger the needed re-calculations for
other CPUs EMs. Thanks to that the em_per_state::performance values will
be aligned with the CPU capacity information after all CPUs finish the
boot and EM registrations.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The performance doesn't scale linearly with the frequency. Also, it may
be different in different workloads. Some CPUs are designed to be
particularly good at some applications e.g. images or video processing
and other CPUs in different. When those different types of CPUs are
combined in one SoC they should be properly modeled to get max of the HW
in Energy Aware Scheduler (EAS). The Energy Model (EM) provides the
power vs. performance curves to the EAS, but assumes the CPUs capacity
is fixed and scales linearly with the frequency. This patch allows to
adjust the curve on the 'performance' axis as well.
Code speed optimization:
Removing map_util_freq() allows to avoid one division and one
multiplication operations from the EAS hot code path.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add API function em_dev_update_perf_domain() which allows the EM to be
changed safely.
Concurrent updaters are serialized with a mutex and the removal of memory
that will not be used any more is carried out with the help of RCU.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The runtime modified EM table can be provided from drivers. Create
mechanism which allows safely allocate and free the table for device
drivers. The same table can be used by the EAS in task scheduler code
paths, so make sure the memory is not freed when the device driver module
is unloaded.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The new runtime table can be populated with a new power data to better
reflect the actual efficiency of the device e.g. CPU. The power can vary
over time e.g. due to the SoC temperature change. Higher temperature can
increase power values. For longer running scenarios, such as game or
camera, when also other devices are used (e.g. GPU, ISP) the CPU power can
change. The new EM framework is able to addresses this issue and change
the EM data at runtime safely.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Split the process of allocation and data initialization for the EM table.
The upcoming changes for modifiable EM will use it.
This change is not expected to alter the general functionality.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subsequent changes will introduce a case in which 'cb->get_cost' may
not be set in em_compute_costs(), so add a check to ensure that it is
not NULL before attempting to dereference it.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move the EM costs computation code into a new dedicated function,
em_compute_costs(), that can be reused in other places in the future.
This change is not expected to alter the general functionality.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Energy Model might be updated at runtime and the energy efficiency
for each OPP may change. Thus, there is a need to update also the
cpufreq framework and make it aligned to the new values. In order to
do that, use a first active CPU from the Performance Domain. This is
needed since the first CPU in the cpumask might be offline when we
run this code path.
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In order to prepare the code for the modifiable EM perf_state table,
make em_cpufreq_update_efficiencies() take a pointer to the EM table
as its second argument and modify it to use that new argument instead
of the 'table' member of dev->em_pd.
No functional impact.
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fix missing newline for the string long in the error code path.
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After the recent changes nobody use siglock to read the values protected
by stats_lock, we can kill spin_lock_irq(¤t->sighand->siglock) and
update the comment.
With this patch only __exit_signal() and thread_group_start_cputime() take
stats_lock under siglock.
Link: https://lkml.kernel.org/r/20240123153359.GA21866@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
lock_task_sighand() can trigger a hard lockup. If NR_CPUS threads call
getrusage() at the same time and the process has NR_THREADS, spin_lock_irq
will spin with irqs disabled O(NR_CPUS * NR_THREADS) time.
Change getrusage() to use sig->stats_lock, it was specifically designed
for this type of use. This way it runs lockless in the likely case.
TODO:
- Change do_task_stat() to use sig->stats_lock too, then we can
remove spin_lock_irq(siglock) in wait_task_zombie().
- Turn sig->stats_lock into seqcount_rwlock_t, this way the
readers in the slow mode won't exclude each other. See
https://lore.kernel.org/all/20230913154907.GA26210@redhat.com/
- stats_lock has to disable irqs because ->siglock can be taken
in irq context, it would be very nice to change __exit_signal()
to avoid the siglock->stats_lock dependency.
Link: https://lkml.kernel.org/r/20240122155053.GA26214@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dylan Hatch <dylanbhatch@google.com>
Tested-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "getrusage: use sig->stats_lock", v2.
This patch (of 2):
thread_group_cputime() does its own locking, we can safely shift
thread_group_cputime_adjusted() which does another for_each_thread loop
outside of ->siglock protected section.
This is also preparation for the next patch which changes getrusage() to
use stats_lock instead of siglock, thread_group_cputime() takes the same
lock. With the current implementation recursive read_seqbegin_or_lock()
is fine, thread_group_cputime() can't enter the slow mode if the caller
holds stats_lock, yet this looks more safe and better performance-wise.
Link: https://lkml.kernel.org/r/20240122155023.GA26169@redhat.com
Link: https://lkml.kernel.org/r/20240122155050.GA26205@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dylan Hatch <dylanbhatch@google.com>
Tested-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
'config BPF' exists in both init/Kconfig and kernel/bpf/Kconfig.
Commit b24abcff91 ("bpf, kconfig: Add consolidated menu entry for bpf
with core options") added the second one to kernel/bpf/Kconfig instead
of moving the existing one.
Merge them together.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240204075634.32969-1-masahiroy@kernel.org
dump_stack() is called in panic(). If for some reason another CPU
is holding the printk_cpu_sync and is unable to release it, the
panic CPU will be unable to continue and print the stacktrace.
Since non-panic CPUs are not allowed to store new printk messages
anyway, there is no need to synchronize the stacktrace output in
a panic situation.
For the panic CPU, do not get the printk_cpu_sync because it is
not needed and avoids a potential deadlock scenario in panic().
Link: https://lore.kernel.org/lkml/ZcIGKU8sxti38Kok@alley
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-15-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
If the kernel crashes in a context where printk() calls always
defer printing (such as in NMI or inside a printk_safe section)
then the final panic messages will be deferred to irq_work. But
if irq_work is not available, the messages will not get printed
unless explicitly flushed. The result is that the final
"end Kernel panic" banner does not get printed.
Add one final flush after the last printk() call to make sure
the final panic messages make it out as well.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-14-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Commit 13fb0f74d7 ("printk: Avoid livelock with heavy printk
during panic") introduced a mechanism to silence non-panic CPUs
if too many messages are being dropped. Aside from trying to
workaround the livelock bugs of legacy consoles, it was also
intended to avoid losing panic messages. However, if non-panic
CPUs are writing to the ringbuffer, then reacting to dropped
messages is too late.
Another motivation is that non-finalized messages already might
be skipped in panic(). In other words, random messages from
non-panic CPUs might already get lost. It is better to ignore
all to avoid confusion.
To avoid losing panic CPU messages, silence non-panic CPUs
immediately on panic.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-13-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
The commit d51507098f ("printk: disable optimistic spin
during panic") added checks to avoid becoming a console waiter
if a panic is in progress.
However, the transition to panic can occur while there is
already a waiter. The current owner should not pass the lock to
the waiter because it might get stopped or blocked anytime.
Also the panic context might pass the console lock owner to an
already stopped waiter by mistake. It might happen when
console_flush_on_panic() ignores the current lock owner, for
example:
CPU0 CPU1
---- ----
console_lock_spinning_enable()
console_trylock_spinning()
[CPU1 now console waiter]
NMI: panic()
panic_other_cpus_shutdown()
[stopped as console waiter]
console_flush_on_panic()
console_lock_spinning_enable()
[print 1 record]
console_lock_spinning_disable_and_check()
[handover to stopped CPU1]
This results in panic() not flushing the panic messages.
Fix these problems by disabling all spinning operations
completely during panic().
Another advantage is that it prevents possible deadlocks caused
by "console_owner_lock". The panic() context does not need to
take it any longer. The lockless checks are safe because the
functions become NOPs when they see the panic in progress. All
operations manipulating the state are still synchronized by the
lock even when non-panic CPUs would notice the panic
synchronously.
The current owner might stay spinning. But non-panic() CPUs
would get stopped anyway and the panic context will never start
spinning.
Fixes: dbdda842fe ("printk: Add console owner and waiter logic to load balance console writes")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/20240207134103.1357162-12-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Normally a reader will stop once reaching a non-finalized
record. However, when a panic happens, writers from other CPUs
(or an interrupted context on the panic CPU) may have been
writing a record and were unable to finalize it. The panic CPU
will reserve/commit/finalize its panic records, but these will
be located after the non-finalized records. This results in
panic() not flushing the panic messages.
Extend _prb_read_valid() to skip over non-finalized records if
on the panic CPU.
Fixes: 896fbe20b4 ("printk: use the lockless ringbuffer")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-11-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Currently pr_flush() will only wait for records that were
available to readers at the time of the call (using
prb_next_seq()). But there may be more records (non-finalized)
that have following finalized records. pr_flush() should wait
for these to print as well. Particularly because any trailing
finalized records may be the messages that the calling context
wants to ensure are printed.
Add a new ringbuffer function prb_next_reserve_seq() to return
the sequence number following the most recently reserved record.
This guarantees that pr_flush() will wait until all current
printk() messages (completed or in progress) have been printed.
Fixes: 3b604ca812 ("printk: add pr_flush()")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-10-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
With the lockless ringbuffer, it is allowed that multiple
CPUs/contexts write simultaneously into the buffer. This creates
an ambiguity as some writers will finalize sooner.
The documentation for the prb_read functions is not clear as it
refers to "not yet written" and "no data available". Clarify the
return values and language to be in terms of the reader: records
available for reading.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-9-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
There is already panic_in_progress() and other_cpu_in_panic(),
but checking if the current CPU is the panic CPU must still be
open coded.
Add this_cpu_in_panic() to complete the set.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-8-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Currently @suppress_panic_printk is checked along with
non-matching @panic_cpu and current CPU. This works
because @suppress_panic_printk is only set when
panic_in_progress() is true.
Rather than relying on the @suppress_panic_printk semantics,
use the concise helper function other_cpu_in_progress(). The
helper function exists to avoid open coding such tests.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-7-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
For empty line records, no data blocks are created. Instead,
these valid records are identified by special logical position
values (in fields of @prb_desc.text_blk_lpos).
Currently the macro NO_LPOS is used for empty line records.
This name is confusing because it does not imply _why_ there is
no data block.
Rename NO_LPOS to EMPTY_LINE_LPOS so that it is clear why there
is no data block.
Also add comments explaining the use of EMPTY_LINE_LPOS as well
as clarification to the values used to represent data-less
blocks.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-6-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Commit f244b4dc53 ("printk: ringbuffer: Improve
prb_next_seq() performance") introduced an optimization for
prb_next_seq() by using best-effort to track recently finalized
records. However, the order of finalization does not
necessarily match the order of the records. The optimization
changed prb_next_seq() to return inconsistent results, possibly
yielding sequence numbers that are not available to readers
because they are preceded by non-finalized records or they are
not yet visible to the reader CPU.
Rather than simply best-effort tracking recently finalized
records, force the committing writer to read records and
increment the last "contiguous block" of finalized records. In
order to do this, the sequence number instead of ID must be
stored because ID's cannot be directly compared.
A new memory barrier pair is introduced to guarantee that a
reader can always read the records up until the sequence number
returned by prb_next_seq() (unless the records have since
been overwritten in the ringbuffer).
This restores the original functionality of prb_next_seq()
while also keeping the optimization.
For 32bit systems, only the lower 32 bits of the sequence
number are stored. When reading the value, it is expanded to
the full 64bit sequence number using the 32bit seq macros,
which fold in the value returned by prb_first_seq().
Fixes: f244b4dc53 ("printk: ringbuffer: Improve prb_next_seq() performance")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-5-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Note: This change only applies to 32bit architectures. On 64bit
architectures the macros are NOPs.
Currently prb_next_seq() is used as the base for the 32bit seq
macros __u64seq_to_ulseq() and __ulseq_to_u64seq(). However, in
a follow-up commit, prb_next_seq() will need to make use of the
32bit seq macros.
Use prb_first_seq() as the base for the 32bit seq macros instead
because it is guaranteed to return 64bit sequence numbers without
relying on any 32bit seq macros.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-4-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Note: This change only applies to 32bit architectures. On 64bit
architectures the macros are NOPs.
__ulseq_to_u64seq() computes the upper 32 bits of the passed
argument value (@ulseq). The upper bits are derived from a base
value (@rb_next_seq) in a way that assumes @ulseq represents a
64bit number that is less than or equal to @rb_next_seq.
Until now this mapping has been correct for all call sites. However,
in a follow-up commit, values of @ulseq will be passed in that are
higher than the base value. This requires a change to how the 32bit
value is mapped to a 64bit sequence number.
Rather than mapping @ulseq such that the base value is the end of a
32bit block, map @ulseq such that the base value is in the middle of
a 32bit block. This allows supporting 31 bits before and after the
base value, which is deemed acceptable for the console sequence
number during runtime.
Here is an example to illustrate the previous and new mappings.
For a base value (@rb_next_seq) of 2 2000 0000...
Before this change the range of possible return values was:
1 2000 0001 to 2 2000 0000
__ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
__ulseq_to_u64seq(2000 0000) => 2 2000 0000
__ulseq_to_u64seq(2000 0001) => 1 2000 0001
__ulseq_to_u64seq(9fff ffff) => 1 9fff ffff
__ulseq_to_u64seq(a000 0000) => 1 a000 0000
__ulseq_to_u64seq(a000 0001) => 1 a000 0001
After this change the range of possible return values are:
1 a000 0001 to 2 a000 0000
__ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
__ulseq_to_u64seq(2000 0000) => 2 2000 0000
__ulseq_to_u64seq(2000 0001) => 2 2000 0001
__ulseq_to_u64seq(9fff ffff) => 2 9fff ffff
__ulseq_to_u64seq(a000 0000) => 2 a000 0000
__ulseq_to_u64seq(a000 0001) => 1 a000 0001
[ john.ogness: Rewrite commit message. ]
Reported-by: Francesco Dolcini <francesco@dolcini.it>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-3-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
The macros __seq_to_nbcon_seq() and __nbcon_seq_to_seq() are
used to provide support for atomic handling of sequence numbers
on 32bit systems. Until now this was only used by nbcon.c,
which is why they were located in nbcon.c and include nbcon in
the name.
In a follow-up commit this functionality is also needed by
printk_ringbuffer. Rather than duplicating the functionality,
relocate the macros to printk_ringbuffer.h.
Also, since the macros will be no longer nbcon-specific, rename
them to __u64seq_to_ulseq() and __ulseq_to_u64seq().
This does not result in any functional change.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-2-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Clocksource pointers can be problematic to obtain for drivers which are not
clocksource drivers themselves. In particular, the RFC virtio_rtc driver
[1] would require a new helper function to obtain a pointer to the ARM
Generic Timer clocksource. The ptp_kvm driver also required a similar
workaround.
Address this by evaluating the clocksource ID, rather than the clocksource
pointer, of struct system_counterval_t. By this, setting the clocksource
pointer becomes unneeded, and get_device_system_crosststamp() callers will
no longer need to supply clocksource pointers.
All relevant clocksource drivers provide the ID, so this change is not
changing the behaviour.
[1] https://lore.kernel.org/lkml/20231218073849.35294-1-peter.hilber@opensynergy.com/
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240201010453.2212371-7-peter.hilber@opensynergy.com
Now that the driver core can properly handle constant struct bus_type,
move the clockevents_subsys variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20240204-bus_cleanup-time-v1-2-207ec18e24b8@marliere.net
Now that the driver core can properly handle constant struct bus_type,
move the clocksource_subsys variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20240204-bus_cleanup-time-v1-1-207ec18e24b8@marliere.net
We can get EBADF from pidfd_getfd() if a task is currently exiting,
which might be confusing. Let's check PF_EXITING, and just report ESRCH
if so.
I chose PF_EXITING, because it is set in exit_signals(), which is called
before exit_files(). Since ->exit_status is mostly set after
exit_files() in exit_notify(), using that still leaves a window open for
the race.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tycho Andersen <tandersen@netflix.com>
Link: https://lore.kernel.org/r/20240206192357.81942-1-tycho@tycho.pizza
Signed-off-by: Christian Brauner <brauner@kernel.org>
copy_process() just needs to pass PIDFD_THREAD to __pidfd_prepare()
if clone_flags & CLONE_THREAD.
We can also add another CLONE_ flag (or perhaps reuse CLONE_DETACHED)
to enforce PIDFD_THREAD without CLONE_THREAD.
Originally-from: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240205145532.GA28823@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
It was used by pidfd_poll() but now it has no callers.
If it finally finds a modular user we can revert this change, but note
that the comment above this helper and the changelog in 38fd525a4c
("exit: Factor thread_group_exited out of pidfd_poll") are not accurate,
thread_group_exited() won't return true if all other threads have passed
exit_notify() and are zombies, it returns true only when all other threads
are completely gone. Not to mention that it can only work if the task
identified by @pid is a thread-group leader.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240205174347.GA31461@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
rather than wake_up_all(). This way do_notify_pidfd() won't wakeup the
POLLHUP-only waiters which wait for pid_task() == NULL.
TODO:
- as Christian pointed out, this asks for the new wake_up_all_poll()
helper, it can already have other users.
- we can probably discriminate the PIDFD_THREAD and non-PIDFD_THREAD
waiters, but this needs more work. See
https://lore.kernel.org/all/20240205140848.GA15853@redhat.com/
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240205141348.GA16539@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The hrtimers migration on CPU-down hotplug process has been moved
earlier, before the CPU actually goes to die. This leaves a small window
of opportunity to queue an hrtimer in a blind spot, leaving it ignored.
For example a practical case has been reported with RCU waking up a
SCHED_FIFO task right before the CPUHP_AP_IDLE_DEAD stage, queuing that
way a sched/rt timer to the local offline CPU.
Make sure such situations never go unnoticed and warn when that happens.
Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240129235646.3171983-4-boqun.feng@gmail.com
Allow transferring an imbalanced RCU lock state between subprog calls
during verification. This allows patterns where a subprog call returns
with an RCU lock held, or a subprog call releases an RCU lock held by
the caller. Currently, the verifier would end up complaining if the RCU
lock is not released when processing an exit from a subprog, which is
non-ideal if its execution is supposed to be enclosed in an RCU read
section of the caller.
Instead, simply only check whether we are processing exit for frame#0
and do not complain on an active RCU lock otherwise. We only need to
update the check when processing BPF_EXIT insn, as copy_verifier_state
is already set up to do the right thing.
Suggested-by: David Vernet <void@manifault.com>
Tested-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20240205055646.1112186-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, calling any helpers, kfuncs, or subprogs except the graph
data structure (lists, rbtrees) API kfuncs while holding a bpf_spin_lock
is not allowed. One of the original motivations of this decision was to
force the BPF programmer's hand into keeping the bpf_spin_lock critical
section small, and to ensure the execution time of the program does not
increase due to lock waiting times. In addition to this, some of the
helpers and kfuncs may be unsafe to call while holding a bpf_spin_lock.
However, when it comes to subprog calls, atleast for static subprogs,
the verifier is able to explore their instructions during verification.
Therefore, it is similar in effect to having the same code inlined into
the critical section. Hence, not allowing static subprog calls in the
bpf_spin_lock critical section is mostly an annoyance that needs to be
worked around, without providing any tangible benefit.
Unlike static subprog calls, global subprog calls are not safe to permit
within the critical section, as the verifier does not explore them
during verification, therefore whether the same lock will be taken
again, or unlocked, cannot be ascertained.
Therefore, allow calling static subprogs within a bpf_spin_lock critical
section, and only reject it in case the subprog linkage is global.
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240204222349.938118-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The for-6.8-fixes commit ae9cc8956944 ("Revert "workqueue: Override implicit
ordered attribute in workqueue_apply_unbound_cpumask()") also fixes build for
Signed-off-by: Tejun Heo <tj@kernel.org>
This reverts commit ca10d851b9.
The commit allowed workqueue_apply_unbound_cpumask() to clear __WQ_ORDERED
on now removed implicitly ordered workqueues. This was incorrect in that
system-wide config change shouldn't break ordering properties of all
workqueues. The reason why apply_workqueue_attrs() path was allowed to do so
was because it was targeting the specific workqueue - either the workqueue
had WQ_SYSFS set or the workqueue user specifically tried to change
max_active, both of which indicate that the workqueue doesn't need to be
ordered.
The implicitly ordered workqueue promotion was removed by the previous
commit 3bc1e711c2 ("workqueue: Don't implicitly make UNBOUND workqueues w/
@max_active==1 ordered"). However, it didn't update this path and broke
build. Let's revert the commit which was incorrect in the first place which
also fixes build.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3bc1e711c2 ("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered")
Fixes: ca10d851b9 ("workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()")
Cc: stable@vger.kernel.org # v6.6+
Signed-off-by: Tejun Heo <tj@kernel.org>
5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
automoatically promoted UNBOUND workqueues w/ @max_active==1 to ordered
workqueues because UNBOUND workqueues w/ @max_active==1 used to be the way
to create ordered workqueues and the new NUMA support broke it. These
problems can be subtle and the fact that they can only trigger on NUMA
machines made them even more difficult to debug.
However, overloading the UNBOUND allocation interface this way creates other
issues. It's difficult to tell whether a given workqueue actually needs to
be ordered and users that legitimately want a min concurrency level wq
unexpectedly gets an ordered one instead. With planned UNBOUND workqueue
udpates to improve execution locality and more prevalence of chiplet designs
which can benefit from such improvements, this isn't a state we wanna be in
forever.
There aren't that many UNBOUND w/ @max_active==1 users in the tree and the
preceding patches audited all and converted them to
alloc_ordered_workqueue() as appropriate. This patch removes the implicit
promotion of UNBOUND w/ @max_active==1 workqueues to ordered ones.
v2: v1 patch incorrectly dropped !list_empty(&wq->pwqs) condition in
apply_workqueue_attrs_locked() which spuriously triggers WARNING and
fails workqueue creation. Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202304251050.45a5df1f-oliver.sang@intel.com
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.
This patch converts backtracetest from tasklet to BH workqueue.
- Replace "irq" with "bh" in names and message to better reflect what's
happening.
- Replace completion usage with a flush_work() call.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
The "i" here is always equal to "btf_type_vlen(t)" since
the "for_each_member()" loop never breaks.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240203055119.2235598-1-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Skip updating workqueues with __WQ_DESTROYING bit set when updating
global unbound cpumask to avoid unnecessary work and other complications.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This reverts commit d412ace111. This leads to
build failures as it depends on a driver-core commit 32f78abe59 ("driver
core: bus: constantify subsys_register() calls"). Let's drop it from wq tree
and route it through driver-core tree.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202402051505.kM9Rr3CJ-lkp@intel.com/
Mark the task as having a cached timestamp when set assign it, so we
can efficiently check if it needs updating post being scheduled back in.
This covers both the actual schedule out case, which would've flushed
the plug, and the preemption case which doesn't touch the plugged
requests (for many reasons, one of them being then we'd need to have
preemption disabled around plug state manipulation).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Extend the support for LZ4 compression to be used with hibernation.
The main idea is that different compression algorithms
have different characteristics and hibernation may benefit when it uses
any of these algorithms: a default algorithm, having higher
compression rate but is slower(compression/decompression) and a
secondary algorithm, that is faster(compression/decompression) but has
lower compression rate.
LZ4 algorithm has better decompression speeds over LZO. This reduces
the hibernation image restore time.
As per test results:
LZO LZ4
Size before Compression(bytes) 682696704 682393600
Size after Compression(bytes) 146502402 155993547
Decompression Rate 335.02 MB/s 501.05 MB/s
Restore time 4.4s 3.8s
LZO is the default compression algorithm used for hibernation. Enable
CONFIG_HIBERNATION_COMP_LZ4 to set the default compressor as LZ4.
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently for hibernation, LZO is the only compression algorithm
available and uses the existing LZO library calls. However, there
is no flexibility to switch to other algorithms which provides better
results. The main idea is that different compression algorithms have
different characteristics and hibernation may benefit when it uses
alternate algorithms.
By moving to crypto based APIs, it lays a foundation to use other
compression algorithms for hibernation. There are no functional changes
introduced by this approach.
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Renaming lzo* to generic names, except for lzo_xxx() APIs. This is
used in the next patch where we move to crypto based APIs for
compression. There are no functional changes introduced by this
approach.
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Because dpm_save_failed_dev() may be called simultaneously by multiple
failing device PM functions, the state of the suspend_stats fields
updated by it may become inconsistent.
Prevent that from happening by using a lock in dpm_save_failed_dev().
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
It is not necessary to define struct suspend_stats in a header file and the
suspend_stats variable in the core device system-wide PM code. They both
can be defined in kernel/power/main.c, next to the sysfs and debugfs code
accessing suspend_stats, which can be static.
Modify the code in question in accordance with the above observation and
replace the static inline functions manipulating suspend_stats with
regular ones defined in kernel/power/main.c.
While at it, move the enum suspend_stat_step to the end of suspend.h which
is a more suitable place for it.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Change the type of the "success" and "fail" fields in struct
suspend_stats to unsigned int, because they cannot be negative.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Instead of using a set of individual struct suspend_stats fields
representing suspend step failure counters, use an array of counters
indexed by enum suspend_stat_step for this purpose, which allows
dpm_save_failed_step() to increment the appropriate counter
automatically, so that its callers don't need to do that directly.
It also allows suspend_stats_show() to carry out a loop over the
counters array to print their values.
Because the counters cannot become negative, use unsigned int for
representing them.
The only user-observable impact of this change is a different
ordering of entries in the suspend_stats debugfs file which is not
expected to matter.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Replace suspend_step_name() in the suspend statistics code with an array
of suspend step names which has fewer lines of code and less overhead.
While at it, remove two unnecessary line breaks in suspend_stats_show()
and adjust some white space in there to the kernel coding style for a
more consistent code layout.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws such as
the execution code accessing the tasklet item after the execution is
complete which can lead to subtle use-after-free in certain usage scenarios
and less-developed flush and cancel mechanisms.
This patch implements BH workqueues which share the same semantics and
features of regular workqueues but execute their work items in the softirq
context. As there is always only one BH execution context per CPU, none of
the concurrency management mechanisms applies and a BH workqueue can be
thought of as a convenience wrapper around softirq.
Except for the inability to sleep while executing and lack of max_active
adjustments, BH workqueues and work items should behave the same as regular
workqueues and work items.
Currently, the execution is hooked to tasklet[_hi]. However, the goal is to
convert all tasklet users over to BH workqueues. Once the conversion is
complete, tasklet can be removed and BH workqueues can directly take over
the tasklet softirqs.
system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in
tasklet, all existing tasklet users should be able to use the system BH
workqueues without creating their own workqueues.
v3: - Add missing interrupt.h include.
v2: - Instead of using tasklets, hook directly into its softirq action
functions - tasklet[_hi]_action(). This is slightly cheaper and closer
to the eventual code structure we want to arrive at. Suggested by Lai.
- Lai also pointed out several places which need NULL worker->task
handling or can use clarification. Updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
Tested-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Factor out init_cpu_worker_pool() from workqueue_init_early(). This is pure
reorganization in preparation of BH workqueue support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Allen Pais <allen.lkml@gmail.com>
These changes are in preparation of BH workqueue which will execute work
items from BH context.
- Update lock and RCU depth checks in process_one_work() so that it
remembers and checks against the starting depths and prints out the depth
changes.
- Factor out lockdep annotations in the flush paths into
touch_{wq|work}_lockdep_map(). The work->lockdep_map touching is moved
from __flush_work() to its callee - start_flush_work(). This brings it
closer to the wq counterpart and will allow testing the associated wq's
flags which will be needed to support BH workqueues. This is not expected
to cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Allen Pais <allen.lkml@gmail.com>
Now that the driver core can properly handle constant struct bus_type,
move the wq_subsys variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-and-reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
dd6c3c5441 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work
item handling") relocated pwq_dec_nr_in_flight() after
set_work_pool_and_keep_pending(). However, the latter destroys information
contained in work->data that's needed by pwq_dec_nr_in_flight() including
the flush color. With flush color destroyed, flush_workqueue() can stall
easily when mixed with cancel_work*() usages.
This is easily triggered by running xfstests generic/001 test on xfs:
INFO: task umount:6305 blocked for more than 122 seconds.
...
task:umount state:D stack:13008 pid:6305 tgid:6305 ppid:6301 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2f6/0xa20
schedule+0x36/0xb0
schedule_timeout+0x20b/0x280
wait_for_completion+0x8a/0x140
__flush_workqueue+0x11a/0x3b0
xfs_inodegc_flush+0x24/0xf0
xfs_unmountfs+0x14/0x180
xfs_fs_put_super+0x3d/0x90
generic_shutdown_super+0x7c/0x160
kill_block_super+0x1b/0x40
xfs_kill_sb+0x12/0x30
deactivate_locked_super+0x35/0x90
deactivate_super+0x42/0x50
cleanup_mnt+0x109/0x170
__cleanup_mnt+0x12/0x20
task_work_run+0x60/0x90
syscall_exit_to_user_mode+0x146/0x150
do_syscall_64+0x5d/0x110
entry_SYSCALL_64_after_hwframe+0x6c/0x74
Fix it by stashing work_data before calling set_work_pool_and_keep_pending()
and using the stashed value for pwq_dec_nr_in_flight().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Chandan Babu R <chandanbabu@kernel.org>
Link: http://lkml.kernel.org/r/87o7cxeehy.fsf@debian-BULLSEYE-live-builder-AMD64
Fixes: dd6c3c5441 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling")
When btf_prepare_func_args() was generalized to handle both static and
global subprogs, a few warnings/errors that are meant only for global
subprog cases started to be emitted for static subprogs, where they are
sort of expected and irrelavant.
Stop polutting verifier logs with irrelevant scary-looking messages.
Fixes: e26080d0da ("bpf: prepare btf_prepare_func_args() for handling static subprogs")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240202190529.2374377-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add PTR_TRUSTED | PTR_MAYBE_NULL modifiers for PTR_TO_BTF_ID to
check_reg_type() to support passing trusted nullable PTR_TO_BTF_ID
registers into global functions accepting `__arg_trusted __arg_nullable`
arguments. This hasn't been caught earlier because tests were either
passing known non-NULL PTR_TO_BTF_ID registers or known NULL (SCALAR)
registers.
When utilizing this functionality in complicated real-world BPF
application that passes around PTR_TO_BTF_ID_OR_NULL, it became apparent
that verifier rejects valid case because check_reg_type() doesn't handle
this case explicitly. Existing check_reg_type() logic is already
anticipating this combination, so we just need to explicitly list this
combo in the switch statement.
Fixes: e2b3c4ff5d ("bpf: add __arg_trusted global func arg tag")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240202190529.2374377-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Fix the return code for ring_buffer_poll_wait()
It was returing a -EINVAL instead of EPOLLERR.
- Zero out the tracefs_inode so that all fields are initialized.
The ti->private could have had stale data, but instead of
just initializing it to NULL, clear out the entire structure
when it is allocated.
- Fix a crash in timerlat
The hrtimer was initialized at read and not open, but is
canceled at close. If the file was opened and never read
the close will pass a NULL pointer to hrtime_cancel().
- Rewrite of eventfs.
Linus wrote a patch series to remove the dentry references in the
eventfs_inode and to use ref counting and more of proper VFS
interfaces to make it work.
- Add warning to put_ei() if ei is not set to free. That means
something is about to free it when it shouldn't.
- Restructure the eventfs_inode to make it more compact, and remove
the unused llist field.
- Remove the fsnotify*() funtions for when the inodes were being created
in the lookup code. It doesn't make sense to notify about creation
just because something is being looked up.
- The inode hard link count was not accurate. It was being updated
when a file was looked up. The inodes of directories were updating
their parent inode hard link count every time the inode was created.
That means if memory reclaim cleaned a stale directory inode and
the inode was lookup up again, it would increment the parent inode
again as well. Al Viro said to just have all eventfs directories
have a hard link count of 1. That tells user space not to trust it.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZb1l/RQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qk6jAQDmecDOnx+j/Rm5krbX/meVPYXFj2CU
1wO7w1HBzopsBwEA5AjTKm9IGrl/eVG/+jViS165b+sJfwEcblHEFPWcIwo=
=uUzb
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing and eventfs fixes from Steven Rostedt:
- Fix the return code for ring_buffer_poll_wait()
It was returing a -EINVAL instead of EPOLLERR.
- Zero out the tracefs_inode so that all fields are initialized.
The ti->private could have had stale data, but instead of just
initializing it to NULL, clear out the entire structure when it is
allocated.
- Fix a crash in timerlat
The hrtimer was initialized at read and not open, but is canceled at
close. If the file was opened and never read the close will pass a
NULL pointer to hrtime_cancel().
- Rewrite of eventfs.
Linus wrote a patch series to remove the dentry references in the
eventfs_inode and to use ref counting and more of proper VFS
interfaces to make it work.
- Add warning to put_ei() if ei is not set to free. That means
something is about to free it when it shouldn't.
- Restructure the eventfs_inode to make it more compact, and remove the
unused llist field.
- Remove the fsnotify*() funtions for when the inodes were being
created in the lookup code. It doesn't make sense to notify about
creation just because something is being looked up.
- The inode hard link count was not accurate.
It was being updated when a file was looked up. The inodes of
directories were updating their parent inode hard link count every
time the inode was created. That means if memory reclaim cleaned a
stale directory inode and the inode was lookup up again, it would
increment the parent inode again as well. Al Viro said to just have
all eventfs directories have a hard link count of 1. That tells user
space not to trust it.
* tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Keep all directory links at 1
eventfs: Remove fsnotify*() functions from lookup()
eventfs: Restructure eventfs_inode structure to be more condensed
eventfs: Warn if an eventfs_inode is freed without is_freed being set
tracing/timerlat: Move hrtimer_init to timerlat_fd open()
eventfs: Get rid of dentry pointers without refcounts
eventfs: Clean up dentry ops and add revalidate function
eventfs: Remove unused d_parent pointer field
tracefs: dentry lookup crapectomy
tracefs: Avoid using the ei->dentry pointer unnecessarily
eventfs: Initialize the tracefs inode properly
tracefs: Zero out the tracefs_inode when allocating it
ring-buffer: Clean ring_buffer_poll_wait() error return
When check_stack_read_fixed_off() reads value from an spi
all stack slots of which are set to STACK_{MISC,INVALID},
the destination register is set to unbound SCALAR_VALUE.
Exploit this fact by allowing stacksafe() to use a fake
unbound scalar register to compare 'mmmm mmmm' stack value
in old state vs spilled 64-bit scalar in current state
and vice versa.
Veristat results after this patch show some gains:
./veristat -C -e file,prog,states -f 'states_pct>10' not-opt after
File Program States (DIFF)
----------------------- --------------------- ---------------
bpf_overlay.o tail_rev_nodeport_lb4 -45 (-15.85%)
bpf_xdp.o tail_lb_ipv4 -541 (-19.57%)
pyperf100.bpf.o on_event -680 (-10.42%)
pyperf180.bpf.o on_event -2164 (-19.62%)
pyperf600.bpf.o on_event -9799 (-24.84%)
strobemeta.bpf.o on_event -9157 (-65.28%)
xdp_synproxy_kern.bpf.o syncookie_tc -54 (-19.29%)
xdp_synproxy_kern.bpf.o syncookie_xdp -74 (-24.50%)
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240127175237.526726-6-maxtram95@gmail.com
When the width of a fill is smaller than the width of the preceding
spill, the information about scalar boundaries can still be preserved,
as long as it's coerced to the right width (done by coerce_reg_to_size).
Even further, if the actual value fits into the fill width, the ID can
be preserved as well for further tracking of equal scalars.
Implement the above improvements, which makes narrowing fills behave the
same as narrowing spills and MOVs between registers.
Two tests are adjusted to accommodate for endianness differences and to
take into account that it's now allowed to do a narrowing fill from the
least significant bits.
reg_bounds_sync is added to coerce_reg_to_size to correctly adjust
umin/umax boundaries after the var_off truncation, for example, a 64-bit
value 0xXXXXXXXX00000000, when read as a 32-bit, gets umin = 0, umax =
0xFFFFFFFF, var_off = (0x0; 0xffffffff00000000), which needs to be
synced down to umax = 0, otherwise reg_bounds_sanity_check doesn't pass.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240127175237.526726-4-maxtram95@gmail.com
Support the pattern where an unbounded scalar is spilled to the stack,
then boundary checks are performed on the src register, after which the
stack frame slot is refilled into a register.
Before this commit, the verifier didn't treat the src register and the
stack slot as related if the src register was an unbounded scalar. The
register state wasn't copied, the id wasn't preserved, and the stack
slot was marked as STACK_MISC. Subsequent boundary checks on the src
register wouldn't result in updating the boundaries of the spilled
variable on the stack.
After this commit, the verifier will preserve the bond between src and
dst even if src is unbounded, which permits to do boundary checks on src
and refill dst later, still remembering its boundaries. Such a pattern
is sometimes generated by clang when compiling complex long functions.
One test is adjusted to reflect that now unbounded scalars are tracked.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240127175237.526726-2-maxtram95@gmail.com
Now that rodata_enabled is declared at all time, the #ifdef
CONFIG_STRICT_MODULE_RWX can be removed.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
transfer_pid() must be never called with pid == PIDTYPE_PID,
new_leader->thread_pid should be changed by exchange_tids().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240202131255.GA26025@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add another wake_up_all(wait_pidfd) into __change_pid() and change
pidfd_poll() to include EPOLLHUP if task == NULL.
This allows to wait until the target process/thread is reaped.
TODO: change do_notify_pidfd() to use the keyed wakeups.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240202131226.GA26018@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
With this flag:
- pidfd_open() doesn't require that the target task must be
a thread-group leader
- pidfd_poll() succeeds when the task exits and becomes a
zombie (iow, passes exit_notify()), even if it is a leader
and thread-group is not empty.
This means that the behaviour of pidfd_poll(PIDFD_THREAD,
pid-of-group-leader) is not well defined if it races with
exec() from its sub-thread; pidfd_poll() can succeed or not
depending on whether pidfd_task_exited() is called before
or after exchange_tids().
Perhaps we can improve this behaviour later, pidfd_poll()
can probably take sig->group_exec_task into account. But
this doesn't really differ from the case when the leader
exits before other threads (so pidfd_poll() succeeds) and
then another thread execs and pidfd_poll() will block again.
thread_group_exited() is no longer used, perhaps it can die.
Co-developed-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240131132602.GA23641@redhat.com
Tested-by: Tycho Andersen <tandersen@netflix.com>
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
do_notify_pidfd() makes no sense until the whole thread group exits, change
do_notify_parent() to check thread_group_empty().
This avoids the unnecessary do_notify_pidfd() when tsk is not a leader, or
it exits before other threads, or it has a ptraced EXIT_ZOMBIE sub-thread.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240127132407.GA29136@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
- make pidfd_create() static.
- Don't pass O_RDWR | O_CLOEXEC to __pidfd_prepare() in copy_process(),
__pidfd_prepare() adds these flags unconditionally.
- Kill the flags check in __pidfd_prepare(). sys_pidfd_open() checks the
flags itself, all other users of pidfd_prepare() pass flags = 0.
If we need a sanity check for those other in kernel users then
WARN_ON_ONCE(flags & ~PIDFD_NONBLOCK) makes more sense.
- Don't pass O_RDWR to get_unused_fd_flags(), it ignores everything except
O_CLOEXEC.
- Don't pass O_CLOEXEC to anon_inode_getfile(), it ignores everything except
O_ACCMODE | O_NONBLOCK.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240125161734.GA778@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
In the current implementation of clone(), there is a line that
initializes `u64 clone_flags = args->flags` at the top.
This means that there is no longer a need to use args->flags
for the legacy clone check.
Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Link: https://lore.kernel.org/r/202401311054+0800-wangjinchao@xfusion.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
CAP_SYSLOG was separated from CAP_SYS_ADMIN and introduced in Linux
2.6.37 (2010-11). For a long time, certain syslog actions required
CAP_SYS_ADMIN or CAP_SYSLOG. Maybe it’s time to officially remove
CAP_SYS_ADMIN for more fine-grained control.
CAP_SYS_ADMIN was once removed but added back for backwards
compatibility reasons. In commit 38ef4c2e43 ("syslog: check cap_syslog
when dmesg_restrict") (2010-12), CAP_SYS_ADMIN was no longer needed. And
in commit ee24aebffb ("cap_syslog: accept CAP_SYS_ADMIN for now")
(2011-02), it was accepted again. Since then, CAP_SYS_ADMIN has been
preserved.
Now that almost 13 years have passed, the legacy application may have
had enough time to be updated.
Signed-off-by: Jingzi Meng <mengjingzi@iie.ac.cn>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20240105062007.26965-1-mengjingzi@iie.ac.cn
Signed-off-by: Kees Cook <keescook@chromium.org>
There's already one main CONFIG_SECURITY_NETWORK ifdef block within
the sleepable_lsm_hooks BTF set. Consolidate this duplicated ifdef
block as there's no need for it and all things guarded by it should
remain in one place in this specific context.
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/Zbt1smz43GDMbVU3@google.com
This commit marks kfuncs as such inside the .BTF_ids section. The upshot
of these annotations is that we'll be able to automatically generate
kfunc prototypes for downstream users. The process is as follows:
1. In source, use BTF_KFUNCS_START/END macro pair to mark kfuncs
2. During build, pahole injects into BTF a "bpf_kfunc" BTF_DECL_TAG for
each function inside BTF_KFUNCS sets
3. At runtime, vmlinux or module BTF is made available in sysfs
4. At runtime, bpftool (or similar) can look at provided BTF and
generate appropriate prototypes for functions with "bpf_kfunc" tag
To ensure future kfunc are similarly tagged, we now also return error
inside kfunc registration for untagged kfuncs. For vmlinux kfuncs,
we also WARN(), as initcall machinery does not handle errors.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/e55150ceecbf0a5d961e608941165c0bee7bc943.1706491398.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The return type for ring_buffer_poll_wait() is __poll_t. This is behind
the scenes an unsigned where we can set event bits. In case of a
non-allocated CPU, we do return instead -EINVAL (0xffffffea). Lucky us,
this ends up setting few error bits (EPOLLERR | EPOLLHUP | EPOLLNVAL), so
user-space at least is aware something went wrong.
Nonetheless, this is an incorrect code. Replace that -EINVAL with a
proper EPOLLERR to clean that output. As this doesn't change the
behaviour, there's no need to treat this change as a bug fix.
Link: https://lore.kernel.org/linux-trace-kernel/20240131140955.3322792-1-vdonnefort@google.com
Cc: stable@vger.kernel.org
Fixes: 6721cb6002 ("ring-buffer: Do not poll non allocated cpu buffers")
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
System workqueues are allocated early during boot from
workqueue_init_early(). While allocating unbound workqueues,
wq_update_node_max_active() is invoked from apply_workqueue_attrs() and
accesses NUMA topology to initialize wq->node_nr_active[].max.
However, topology information may not be set up at this point.
wq_update_node_max_active() is explicitly invoked from
workqueue_init_topology() later when topology information is known to be
available.
This doesn't seem to crash anything but it's doing useless work with dubious
data. Let's skip the premature and duplicate node_max_active updates by
initializing the field to WQ_DFL_MIN_ACTIVE on allocation and making
wq_update_node_max_active() noop until workqueue_init_topology().
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/workqueue.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9221a4c57ae1..a65081ec6780 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -386,6 +386,8 @@ static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
[WQ_AFFN_SYSTEM] = "system",
};
+static bool wq_topo_initialized = false;
+
/*
* Per-cpu work items which run for longer than the following threshold are
* automatically considered CPU intensive and excluded from concurrency
@@ -1510,6 +1512,9 @@ static void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu)
lockdep_assert_held(&wq->mutex);
+ if (!wq_topo_initialized)
+ return;
+
if (!cpumask_test_cpu(off_cpu, effective))
off_cpu = -1;
@@ -4356,6 +4361,7 @@ static void free_node_nr_active(struct wq_node_nr_active **nna_ar)
static void init_node_nr_active(struct wq_node_nr_active *nna)
{
+ nna->max = WQ_DFL_MIN_ACTIVE;
atomic_set(&nna->nr, 0);
raw_spin_lock_init(&nna->lock);
INIT_LIST_HEAD(&nna->pending_pwqs);
@@ -7400,6 +7406,8 @@ void __init workqueue_init_topology(void)
init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
+ wq_topo_initialized = true;
+
mutex_lock(&wq_pool_mutex);
/*
For wq_update_node_max_active(), @off_cpu of -1 indicates that no CPU is
going down. The function was incorrectly calling cpumask_test_cpu() with -1
CPU leading to oopses like the following on some archs:
Unable to handle kernel paging request at virtual address ffff0002100296e0
..
pc : wq_update_node_max_active+0x50/0x1fc
lr : wq_update_node_max_active+0x1f0/0x1fc
...
Call trace:
wq_update_node_max_active+0x50/0x1fc
apply_wqattrs_commit+0xf0/0x114
apply_workqueue_attrs_locked+0x58/0xa0
alloc_workqueue+0x5ac/0x774
workqueue_init_early+0x460/0x540
start_kernel+0x258/0x684
__primary_switched+0xb8/0xc0
Code: 9100a273 35000d01 53067f00 d0016dc1 (f8607a60)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: http://lkml.kernel.org/r/91eacde0-df99-4d5c-a980-91046f66e612@samsung.com
Fixes: 5797b1c189 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Add ability to mark arg:trusted arguments with optional arg:nullable
tag to mark it as PTR_TO_BTF_ID_OR_NULL variant, which will allow
callers to pass NULL, and subsequently will force global subprog's code
to do NULL check. This allows to have "optional" PTR_TO_BTF_ID values
passed into global subprogs.
For now arg:nullable cannot be combined with anything else.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240130000648.2144827-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add support for passing PTR_TO_BTF_ID registers to global subprogs.
Currently only PTR_TRUSTED flavor of PTR_TO_BTF_ID is supported.
Non-NULL semantics is assumed, so caller will be forced to prove
PTR_TO_BTF_ID can't be NULL.
Note, we disallow global subprogs to destroy passed in PTR_TO_BTF_ID
arguments, even the trusted one. We achieve that by not setting
ref_obj_id when validating subprog code. This basically enforces (in
Rust terms) borrowing semantics vs move semantics. Borrowing semantics
seems to be a better fit for isolated global subprog validation
approach.
Implementation-wise, we utilize existing logic for matching
user-provided BTF type to kernel-side BTF type, used by BPF CO-RE logic
and following same matching rules. We enforce a unique match for types.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240130000648.2144827-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Fix register_snapshot_trigger() on allocation error
If the snapshot fails to allocate, the register_snapshot_trigger() can
still return success. If the call to tracing_alloc_snapshot_instance()
returned anything but 0, it returned 0, but it should have been returning
the error code from that allocation function.
- Remove leftover code from tracefs doing a dentry walk on remount.
The update_gid() function was called by the tracefs code on remount
to update the gid of eventfs, but that is no longer the case, but that
code wasn't deleted. Nothing calls it. Remove it.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZbcbZBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qqEsAQCWTIQ/yAktP2doLvuccZbZSc5NFHZt
hZNRHep/SMQqKwD6AlesgiXlXM39HiRTiV17jPzrL65brcTEYMJmqCaWugk=
=8nia
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
"Two small fixes for tracefs and eventfs:
- Fix register_snapshot_trigger() on allocation error
If the snapshot fails to allocate, the register_snapshot_trigger()
can still return success. If the call to
tracing_alloc_snapshot_instance() returned anything but 0, it
returned 0, but it should have been returning the error code from
that allocation function.
- Remove leftover code from tracefs doing a dentry walk on remount.
The update_gid() function was called by the tracefs code on remount
to update the gid of eventfs, but that is no longer the case, but
that code wasn't deleted. Nothing calls it. Remove it"
* tag 'trace-v6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracefs: remove stale 'update_gid' code
tracing/trigger: Fix to return error if failed to alloc snapshot
When __queue_delayed_work() is called, it chooses a cpu for handling the
timer interrupt. As of today, it will pick either the cpu passed as
parameter or the last cpu used for this.
This is not good if a system does use CPU isolation, because it can take
away some valuable cpu time to:
1 - deal with the timer interrupt,
2 - schedule-out the desired task,
3 - queue work on a random workqueue, and
4 - schedule the desired task back to the cpu.
So to fix this, during __queue_delayed_work(), if cpu isolation is in
place, pick a random non-isolated cpu to handle the timer interrupt.
As an optimization, if the current cpu is not isolated, use it instead
of looking for another candidate.
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
or aren't considered appropriate for backporting.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZbdSnwAKCRDdBJ7gKXxA
jv49AQCY8eLOgE0L+25HZm99HleBwapbKJozcmsXgMPlgeFZHgEA8saExeL+Nzae
6ktxmGXoVw2t3FJ67Zr66VE3EyHVKAY=
=HWuo
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-01-28-23-21' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"22 hotfixes. 11 are cc:stable and the remainder address post-6.7
issues or aren't considered appropriate for backporting"
* tag 'mm-hotfixes-stable-2024-01-28-23-21' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
mm: thp_get_unmapped_area must honour topdown preference
mm: huge_memory: don't force huge page alignment on 32 bit
userfaultfd: fix mmap_changing checking in mfill_atomic_hugetlb
selftests/mm: ksm_tests should only MADV_HUGEPAGE valid memory
scs: add CONFIG_MMU dependency for vfree_atomic()
mm/memory: fix folio_set_dirty() vs. folio_mark_dirty() in zap_pte_range()
mm/huge_memory: fix folio_set_dirty() vs. folio_mark_dirty()
selftests/mm: Update va_high_addr_switch.sh to check CPU for la57 flag
selftests: mm: fix map_hugetlb failure on 64K page size systems
MAINTAINERS: supplement of zswap maintainers update
stackdepot: make fast paths lock-less again
stackdepot: add stats counters exported via debugfs
mm, kmsan: fix infinite recursion due to RCU critical section
mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again
selftests/mm: switch to bash from sh
MAINTAINERS: add man-pages git trees
mm: memcontrol: don't throttle dying tasks on memory.high
mm: mmap: map MAP_STACK to VM_NOHUGEPAGE
uprobes: use pagesize-aligned virtual address when replacing pages
selftests/mm: mremap_test: fix build warning
...
Remove the duplicate check on type and unify result.
Signed-off-by: Florian Lehner <dev@der-flo.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20240120150920.3370-1-dev@der-flo.net
Now that bpf and bpf-next trees converged and we don't run the risk of
merge conflicts, move btf_validate_prog_ctx_type() into its most logical
place inside the main logic loop.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240125205510.3642094-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It's a bit puzzling to see a call to module_enable_nx() followed by a
call to module_enable_x(). This is because one applies on text while
the other applies on data.
Change name to make that more clear.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
A couple of architectures seem concerned about calling set_memory_ro()
and set_memory_x() too frequently and have implemented a version of
set_memory_rox(), see commit 60463628c9 ("x86/mm: Implement native
set_memory_rox()") and commit 22e99fa564 ("s390/mm: implement
set_memory_rox()")
Use set_memory_rox() in modules when STRICT_MODULES_RWX is set.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulates concurrency level which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. ie. The acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- One max_active enforcement decouples from pool boundaires, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice more
online effective CPUs will get twice higher portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_weight than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Currently, for both percpu and unbound workqueues, max_active applies
per-cpu, which is a recent change for unbound workqueues. The change for
unbound workqueues was a significant departure from the previous behavior of
per-node application. It made some use cases create undesirable number of
concurrent work items and left no good way of fixing them. To address the
problem, workqueue is implementing a NUMA node segmented global nr_active
mechanism, which will be explained further in the next patch.
As a preparation, this patch introduces struct wq_node_nr_active. It's a
data structured allocated for each workqueue and NUMA node pair and
currently only tracks the workqueue's number of active work items on the
node. This is split out from the next patch to make it easier to understand
and review.
Note that there is an extra wq_node_nr_active allocated for the invalid node
nr_node_ids which is used to track nr_active for pools which don't have NUMA
node associated such as the default fallback system-wide pool.
This doesn't cause any behavior changes visible to userland yet. The next
patch will expand to implement the control mechanism on top.
v4: - Fixed out-of-bound access when freeing per-cpu workqueues.
v3: - Use flexible array for wq->node_nr_active as suggested by Lai.
v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
- Lai pointed out that pwq_tryinc_nr_active() incorrectly dropped
pwq->max_active check. Restored. As the next patch replaces the
max_active enforcement mechanism, this doesn't change the end result.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
The planned shared nr_active handling for unbound workqueues will make
pwq_dec_nr_active() sometimes drop the pool lock temporarily to acquire
other pool locks, which is necessary as retirement of an nr_active count
from one pool may need kick off an inactive work item in another pool.
This patch moves pwq_dec_nr_in_flight() call in try_to_grab_pending() to the
end of work item handling so that work item state changes stay atomic.
process_one_work() which is the other user of pwq_dec_nr_in_flight() already
calls it at the end of work item handling. Comments are added to both call
sites and pwq_dec_nr_in_flight().
This shouldn't cause any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
wq->cpu_pwq is RCU protected but wq->dfl_pwq isn't. This is okay because
currently wq->dfl_pwq is used only accessed to install it into wq->cpu_pwq
which doesn't require RCU access. However, we want to be able to access
wq->dfl_pwq under RCU in the future to access its __pod_cpumask and the code
can be made easier to read by making the two pwq fields behave in the same
way.
- Make wq->dfl_pwq RCU protected.
- Add unbound_pwq_slot() and unbound_pwq() which can access both ->dfl_pwq
and ->cpu_pwq. The former returns the double pointer that can be used
access and update the pwqs. The latter performs locking check and
dereferences the double pointer.
- pwq accesses and updates are converted to use unbound_pwq[_slot]().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
wq_adjust_max_active() needs to activate work items after max_active is
increased. Previously, it did that by visiting each pwq once activating all
that could be activated. While this makes sense with per-pwq nr_active,
nr_active will be shared across multiple pwqs for unbound wqs. Then, we'd
want to round-robin through pwqs to be fairer.
In preparation, this patch makes wq_adjust_max_active() round-robin pwqs
while activating. While the activation ordering changes, this shouldn't
cause user-noticeable behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
__queue_work(), pwq_dec_nr_in_flight() and wq_adjust_max_active() were
open-coding nr_active handling, which is fine given that the operations are
trivial. However, the planned unbound nr_active update will make them more
complicated, so let's move them into helpers.
- pwq_tryinc_nr_active() is added. It increments nr_active if under
max_active limit and return a boolean indicating whether inc was
successful. Note that the function is structured to accommodate future
changes. __queue_work() is updated to use the new helper.
- pwq_activate_first_inactive() is updated to use pwq_tryinc_nr_active() and
thus no longer assumes that nr_active is under max_active and returns a
boolean to indicate whether a work item has been activated.
- wq_adjust_max_active() no longer tests directly whether a work item can be
activated. Instead, it's updated to use the return value of
pwq_activate_first_inactive() to tell whether a work item has been
activated.
- nr_active decrement and activating the first inactive work item is
factored into pwq_dec_nr_active().
v3: - WARN_ON_ONCE(!WORK_STRUCT_INACTIVE) added to __pwq_activate_work() as
now we're calling the function unconditionally from
pwq_activate_first_inactive().
v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
To prepare for unbound nr_active handling improvements, move work activation
part of pwq_activate_inactive_work() into __pwq_activate_work() and add
pwq_activate_work() which tests WORK_STRUCT_INACTIVE and updates nr_active.
pwq_activate_first_inactive() and try_to_grab_pending() are updated to use
pwq_activate_work(). The latter conversion is functionally identical. For
the former, this conversion adds an unnecessary WORK_STRUCT_INACTIVE
testing. This is temporary and will be removed by the next patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
"!pwq->nr_active && list_empty(&pwq->inactive_works)" test is repeated
multiple times. Let's factor it out into pwq_is_empty().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
max_active is a workqueue-wide setting and the configured value is stored in
wq->saved_max_active; however, the effective value was stored in
pwq->max_active. While this is harmless, it makes max_active update process
more complicated and gets in the way of the planned max_active semantic
updates for unbound workqueues.
This patches moves pwq->max_active to wq->max_active. This simplifies the
code and makes freezing and noop max_active updates cheaper too. No
user-visible behavior change is intended.
As wq->max_active is updated while holding wq mutex but read without any
locking, it now uses WRITE/READ_ONCE(). A new locking locking rule WO is
added for it.
v2: wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Use the new __free() mechanism to remove all gotos and simplify the error
paths.
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240122124243.44002-5-brgl@bgdev.pl
exposure
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmW2LAgACgkQEsHwGGHe
VUqowBAAiW9aPQmp401DSXLX+bX0oS5IQVEZnAEE3hQTWxdvDoIdmX+SBReSutXy
PDm8mZgVtIiUg3V5bu7/9Dgpu7ovRuChJPjjkYFUDcEmzmsMI11W6u8+8eyt8yRd
X9LuGUeXPJSI1kadYudhFUhl6X6KcXj4Y+XUqNcyp8yClSEcLriYeiumNApSEzj6
BneO5VBbXTpJq1b7GOlC4MNhNXhx+WlUdJUb3VPLlxy/akxrNs9x0ASdOuqslCq8
X9SJPnKeRh0mpezmWDgU72eQ/3vpvWQzwyXvp2pQGbjArCx7IwwD765NDu0P6651
C/+4ruXmcd+Jp3wuobdHG8/J2NlZQy8tZQm284YkS5vyBQDi4s17hycXw/aeUFpu
/3LR1Hppl//u7hkaHszE8vE5l6in4a2XAbk9EozChVj/aHRJqIaLn8TGQRquK4Tg
uRjIC3O2ubJCsIlNIczysjCobSCO+cELwUuFVHh7cdmQAgUwF3efDab0+pJ7MHFb
ZEcqQbIt4FGea4BGzvRYCYj6W9bkhzttnH+68ef+mDA3BcdGoYnHcQ143M8duNhe
0inWCibQXMFC9EGPjC8Sz8WvzF/L5KL9bPQmO1sitIzH6kbU3o7PBk2Fe4V6+KP9
THK865SJ/9QirjXrGmp9Sle6dqJRUylmt1ts8reOWACZ98LKeWU=
=ibuM
-----END PGP SIGNATURE-----
Merge tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:
- Prevent an inconsistent futex operation leading to stale state
exposure
* tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Prevent the reuse of stale pi_state
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmW2KrEACgkQEsHwGGHe
VUr1WA//Qsi2JkxO1lyUQgnyuXqs0+oVZJxFyH2dFYzWkfSaxgsyPZ0H+wsweDfP
OgoNzwwDf1IaNbVz2voV6lSM/30ujJMx4aAucT5WTEXa12cJsvipxRiNd8WU8GqQ
buBz+vnS9IJ2WfM7UxhIVevYFU8H/ERcSO9WCII0YjlcVxmlwMK3B7kFdpBPdT5Q
m5hvBorZzIa9wD3TI7e+VvEVbCx0WjsYYEpXDXM/yCf1Juc9952pjjunzx3YmJES
5JG2WpnEvmNdWwIPO0NAjs7Shw/MNViXTy5Ls5jcbswiAcBoUxNHQlUsvNVDaVyv
8eMCkPzuSipY8HSoetQSTJl+mr3LyYRvevKahuTgwbS8K+kxgClqHFoLZVqolWnk
2IDo63R6Ex6lb1Xpb/Rpg/4j4NqUVWcvPHf6Z2CmMRq/XbSk2DIFl1Wxjgy/Cjnu
+nNLw2FYayEBrKF3VlYgERGoCfBrEsksxzljjeHFn5XWr+G2x1ykF37xaWjQ2+oV
sFl6UYwIsdqPCjHmpT6R1lwCdeEC3o3Zc2Kf5uEVj+pXacKJkxZU0L6ZneO8UiEc
rtc0gTgm9ZNd8oDsjsaBU1A3KxH9lOfVz82ZV0tipz94dcN4zrB9Qag0Yw+64YOC
cQ9cRKiFiCVCeD1ksDLZe1IUX++T2Y9O8MZv06ZDrbaFC56+lU8=
=IgON
-----END PGP SIGNATURE-----
Merge tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Borislav Petkov:
- Initialize the resend node of each IRQ descriptor, not only the first
one
* tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Initialize resend_node hlist for all interrupt descriptors
events in order to be able to compute correct averages
- Limit the duration of the clocksource watchdog checking interval as
too long intervals lead to wrongly marking the TSC as unstable
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmW2KYEACgkQEsHwGGHe
VUoX7w//Ulls1tp3m1oiejTBtUmkewSmnhNAfkHJv3MlKNe+ttG6LvQVh9g2bf1Z
FOu2M2se0ge8G5xf3+I5E4rpqlJZSuhPmNmIET+aj+2a61UJq/6zE1Zw6mxjrJOK
emjOYKTxZ/HxvKJJGO6NiH8Iv5Aj3nQR3Y6oyb/FyP5TLJ6MCT21iEaqyqU7P+Ix
AHIS3cL97M5R/tFtP2CY3PV2M6hJ0lqapSi9t75hT8DfJN1TNQ5SvFkKgmOIrGFw
2WxPTSTEZAnXlvI4cC3Nru9i64QQRw9S05FFelX2pwxE/7wVzBvfh8cjuGZJBve/
KQhNnQ4/fzv6E/hUcavKuOyk1lx5XonfCuG4RFoLl67LjLbLh+Q55RBdXflBPF4T
Ow9BSyQNFu391C2Bl5gJUYVd2JMv+IVpi2wUiwrXJ/Mxj+A2J7Fj0jz7hMbNCmsU
EaA+QyfkAGsoa99xP3UDhPzxoCr2s5YTAxH+IUeSWeI25PMq9f+6fifXBwG+GaVa
FS6Ei1VI0GCNmcYFYawHJbdM2ui5h7lZ96aEpOBSVcAv/2yBNgqxuYZ+icO/wI6N
JM0DSEEOrWcytfxftl7LmglJauhXSKZH4UTG4RCz0IkDgR72Wn0QF4cqm4wWQ5yh
n5/xO+SbkzE57bltsnAkvpu0a110fdK5ec+vkFIy4PrxyO83XhY=
=9RXx
-----END PGP SIGNATURE-----
Merge tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Borislav Petkov:
- Preserve the number of idle calls and sleep entries across CPU
hotplug events in order to be able to compute correct averages
- Limit the duration of the clocksource watchdog checking interval as
too long intervals lead to wrongly marking the TSC as unstable
* tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/sched: Preserve number of idle sleeps across CPU hotplug events
clocksource: Skip watchdog check for large watchdog intervals
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZbQV+gAKCRDbK58LschI
g2OeAP0VvhZS9SPiS+/AMAFuw2W1BkMrFNbfBTc3nzRnyJSmNAD+NG4CLLJvsKI9
olu7VC20B8pLTGLUGIUSwqnjOC+Kkgc=
=wVMl
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-01-26
We've added 107 non-merge commits during the last 4 day(s) which contain
a total of 101 files changed, 6009 insertions(+), 1260 deletions(-).
The main changes are:
1) Add BPF token support to delegate a subset of BPF subsystem
functionality from privileged system-wide daemons such as systemd
through special mount options for userns-bound BPF fs to a trusted
& unprivileged application. With addressed changes from Christian
and Linus' reviews, from Andrii Nakryiko.
2) Support registration of struct_ops types from modules which helps
projects like fuse-bpf that seeks to implement a new struct_ops type,
from Kui-Feng Lee.
3) Add support for retrieval of cookies for perf/kprobe multi links,
from Jiri Olsa.
4) Bigger batch of prep-work for the BPF verifier to eventually support
preserving boundaries and tracking scalars on narrowing fills,
from Maxim Mikityanskiy.
5) Extend the tc BPF flavor to support arbitrary TCP SYN cookies to help
with the scenario of SYN floods, from Kuniyuki Iwashima.
6) Add code generation to inline the bpf_kptr_xchg() helper which
improves performance when stashing/popping the allocated BPF objects,
from Hou Tao.
7) Extend BPF verifier to track aligned ST stores as imprecise spilled
registers, from Yonghong Song.
8) Several fixes to BPF selftests around inline asm constraints and
unsupported VLA code generation, from Jose E. Marchesi.
9) Various updates to the BPF IETF instruction set draft document such
as the introduction of conformance groups for instructions,
from Dave Thaler.
10) Fix BPF verifier to make infinite loop detection in is_state_visited()
exact to catch some too lax spill/fill corner cases,
from Eduard Zingerman.
11) Refactor the BPF verifier pointer ALU check to allow ALU explicitly
instead of implicitly for various register types, from Hao Sun.
12) Fix the flaky tc_redirect_dtime BPF selftest due to slowness
in neighbor advertisement at setup time, from Martin KaFai Lau.
13) Change BPF selftests to skip callback tests for the case when the
JIT is disabled, from Tiezhu Yang.
14) Add a small extension to libbpf which allows to auto create
a map-in-map's inner map, from Andrey Grafin.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (107 commits)
selftests/bpf: Add missing line break in test_verifier
bpf, docs: Clarify definitions of various instructions
bpf: Fix error checks against bpf_get_btf_vmlinux().
bpf: One more maintainer for libbpf and BPF selftests
selftests/bpf: Incorporate LSM policy to token-based tests
selftests/bpf: Add tests for LIBBPF_BPF_TOKEN_PATH envvar
libbpf: Support BPF token path setting through LIBBPF_BPF_TOKEN_PATH envvar
selftests/bpf: Add tests for BPF object load with implicit token
selftests/bpf: Add BPF object loading tests with explicit token passing
libbpf: Wire up BPF token support at BPF object level
libbpf: Wire up token_fd into feature probing logic
libbpf: Move feature detection code into its own file
libbpf: Further decouple feature checking logic from bpf_object
libbpf: Split feature detectors definitions from cached results
selftests/bpf: Utilize string values for delegate_xxx mount options
bpf: Support symbolic BPF FS delegation mount options
bpf: Fail BPF_TOKEN_CREATE if no delegation option was set on BPF FS
bpf,selinux: Allocate bpf_security_struct per BPF token
selftests/bpf: Add BPF token-enabled tests
libbpf: Add BPF token support to bpf_prog_load() API
...
====================
Link: https://lore.kernel.org/r/20240126215710.19855-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
workqueue is collecting different sorts of enums into a single unnamed enum
type which can increase confusion around enum width. Also, unnamed enums
can't be accessed from BPF. Let's break up enum definitions according to
their purposes and give them type names.
Signed-off-by: Tejun Heo <tj@kernel.org>
After creating a new worker, create_worker() is calling kick_pool() to wake
up the new worker task. However, as kick_pool() doesn't do anything if there
is no work pending, it also calls wake_up_process() explicitly. There's no
reason to call kick_pool() at all. wake_up_process() is enough by itself.
Drop the unnecessary kick_pool() call.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix register_snapshot_trigger() to return error code if it failed to
allocate a snapshot instead of 0 (success). Unless that, it will register
snapshot trigger without an error.
Link: https://lore.kernel.org/linux-trace-kernel/170622977792.270660.2789298642759362200.stgit@devnote2
Fixes: 0bbe7f7199 ("tracing: Fix the race between registering 'snapshot' event trigger and triggering 'snapshot' operation")
Cc: stable@vger.kernel.org
Cc: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Per filesystems/sysfs.rst, show() should only use sysfs_emit()
or sysfs_emit_at() when formatting the value to be returned to user space.
coccinelle complains that there are still a couple of functions that use
snprintf(). Convert them to sysfs_emit().
No functional change intended.
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240116045151.3940401-40-lizhijian@fujitsu.com
struct cpuhp_cpu_state has an extraneous kernel-doc comment for @cpu.
There is no struct member by that name, so remove the comment to
prevent the kernel-doc warning:
kernel/cpu.c:85: warning: Excess struct member 'cpu' description in 'cpuhp_cpu_state'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240114030615.30441-1-rdunlap@infradead.org
For better readability and maintenance keep headers in alphabetical
order.
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240122124243.44002-4-brgl@bgdev.pl
uprobes passes an unaligned page mapping address to
folio_add_new_anon_rmap(), which ends up triggering a VM_BUG_ON() we
recently extended in commit 372cbd4d5a ("mm: non-pmd-mappable, large
folios for folio_add_new_anon_rmap()").
Arguably, this is uprobes code doing something wrong; however, for the
time being it would have likely worked in rmap code because
__folio_set_anon() would set folio->index to the same value.
Looking at __replace_page(), we'd also pass slightly wrong values to
mmu_notifier_range_init(), page_vma_mapped_walk(), flush_cache_page(),
ptep_clear_flush() and set_pte_at_notify(). I suspect most of them are
fine, but let's just mark the introducing commit as the one needed fixing.
I don't think CC stable is warranted.
We'll add more sanity checks in rmap code separately, to make sure that we
always get properly aligned addresses.
Link: https://lkml.kernel.org/r/20240115100731.91007-1-david@redhat.com
Fixes: c517ee744b ("uprobes: __replace_page() should not use page_address_in_vma()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Olsa <jolsa@kernel.org>
Closes: https://lkml.kernel.org/r/ZaMR2EWN-HvlCfUl@krava
Tested-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In bpf_struct_ops_map_alloc, it needs to check for NULL in the returned
pointer of bpf_get_btf_vmlinux() when CONFIG_DEBUG_INFO_BTF is not set.
ENOTSUPP is used to preserve the same behavior before the
struct_ops kmod support.
In the function check_struct_ops_btf_id(), instead of redoing the
bpf_get_btf_vmlinux() that has already been done in syscall.c, the fix
here is to check for prog->aux->attach_btf_id.
BPF_PROG_TYPE_STRUCT_OPS must require attach_btf_id and syscall.c
guarantees a valid attach_btf as long as attach_btf_id is set.
When attach_btf_id is not set, this patch returns -ENOTSUPP
because it is what the selftest in test_libbpf_probe_prog_types()
and libbpf_probes.c are expecting for feature probing purpose.
Changes from v1:
- Remove an unnecessary NULL check in check_struct_ops_btf_id()
Reported-by: syzbot+88f0aafe5f950d7489d7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/00000000000040d68a060fc8db8c@google.com/
Reported-by: syzbot+1336f3d4b10bcda75b89@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/00000000000026353b060fc21c07@google.com/
Fixes: fcc2c1fb06 ("bpf: pass attached BTF to the bpf_struct_ops subsystem")
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240126023113.1379504-1-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Since we have set the WQ_NAME_LEN to 32, decrease the name of
events_freezable_power_efficient so that it does not trip the name length
warning when the workqueue is created.
Signed-off-by: Audra Mitchell <audra@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This commit fixes RCU grace period stalls, which are observed when
an outgoing CPU's quiescent state reporting results in wakeup of
one of the grace period kthreads, to complete the grace period. If
those kthreads have SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwith timer to the local offline CPU. Earlier migration
of the hrtimers from the CPU introduced in commit 5c0930ccaa
("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
results in this timer getting ignored. If the RCU grace period
kthreads are waiting for RT bandwidth to be available, they may
never be actually scheduled, resulting in RCU stall warnings.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSi2tPIQIc2VEtjarIAHS7/6Z0wpQUCZbFOUgAKCRAAHS7/6Z0w
pQjcAQCg/tJYRjwGUPebKLUgkmXlR+IIANzEgvES/RgWTOld5gEAklZVTjf3J0qt
QeU9WC3My2cVPKvv6kqnuQ9rrqMQ3g0=
=U5an
-----END PGP SIGNATURE-----
Merge tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux
Pull RCU fix from Neeraj Upadhyay:
"This fixes RCU grace period stalls, which are observed when an
outgoing CPU's quiescent state reporting results in wakeup of one of
the grace period kthreads, to complete the grace period.
If those kthreads have SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwith timer to the local offline CPU.
Earlier migration of the hrtimers from the CPU introduced in commit
5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU
earlier") results in this timer getting ignored.
If the RCU grace period kthreads are waiting for RT bandwidth to be
available, they may never be actually scheduled, resulting in RCU
stall warnings"
* tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux:
rcu: Defer RCU kthreads wakeup when CPU is dying
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
[PM: subject line tweaks]
Signed-off-by: Paul Moore <paul@paul-moore.com>
The ret variable is assigned when it does not need to be defined, as it
has already been assigned before use.
Signed-off-by: Li zeming <zeming@nfschina.com>
[PM: rewrite subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Commit 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs
CPU hotplug") preserved total idle sleep time and iowait sleeptime across
CPU hotplug events.
Similar reasoning applies to the number of idle calls and idle sleeps to
get the proper average of sleep time per idle invocation.
Preserve those fields too.
Fixes: 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug")
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122233534.3094238-1-tim.c.chen@linux.intel.com
There have been reports of the watchdog marking clocksources unstable on
machines with 8 NUMA nodes:
clocksource: timekeeping watchdog on CPU373:
Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource: 'hpet' wd_nsec: 14523447520
clocksource: 'tsc' cs_nsec: 14524115132
The measured clocksource skew - the absolute difference between cs_nsec
and wd_nsec - was 668 microseconds:
cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
The kernel used 200 microseconds for the uncertainty_margin of both the
clocksource and watchdog, resulting in a threshold of 400 microseconds (the
md variable). Both the cs_nsec and the wd_nsec value indicate that the
readout interval was circa 14.5 seconds. The observed behaviour is that
watchdog checks failed for large readout intervals on 8 NUMA node
machines. This indicates that the size of the skew was directly proportinal
to the length of the readout interval on those machines. The measured
clocksource skew, 668 microseconds, was evaluated against a threshold (the
md variable) that is suited for readout intervals of roughly
WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
The intention of 2e27e793e2 ("clocksource: Reduce clocksource-skew
threshold") was to tighten the threshold for evaluating skew and set the
lower bound for the uncertainty_margin of clocksources to twice
WATCHDOG_MAX_SKEW. Later in c37e85c135 ("clocksource: Loosen clocksource
watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to
125 microseconds to fit the limit of NTP, which is able to use a
clocksource that suffers from up to 500 microseconds of skew per second.
Both the TSC and the HPET use default uncertainty_margin. When the
readout interval gets stretched the default uncertainty_margin is no
longer a suitable lower bound for evaluating skew - it imposes a limit
that is far stricter than the skew with which NTP can deal.
The root causes of the skew being directly proportinal to the length of
the readout interval are:
* the inaccuracy of the shift/mult pairs of clocksources and the watchdog
* the conversion to nanoseconds is imprecise for large readout intervals
Prevent this by skipping the current watchdog check if the readout
interval exceeds 2 * WATCHDOG_INTERVAL. Considering the maximum readout
interval of 2 * WATCHDOG_INTERVAL, the current default uncertainty margin
(of the TSC and HPET) corresponds to a limit on clocksource skew of 250
ppm (microseconds of skew per second). To keep the limit imposed by NTP
(500 microseconds of skew per second) for all possible readout intervals,
the margins would have to be scaled so that the threshold value is
proportional to the length of the actual readout interval.
As for why the readout interval may get stretched: Since the watchdog is
executed in softirq context the expiration of the watchdog timer can get
severely delayed on account of a ksoftirqd thread not getting to run in a
timely manner. Surely, a system with such belated softirq execution is not
working well and the scheduling issue should be looked into but the
clocksource watchdog should be able to deal with it accordingly.
Fixes: 2e27e793e2 ("clocksource: Reduce clocksource-skew threshold")
Suggested-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122172350.GA740@incl
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.
Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.
There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.
Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:
12/14 14:39:07.604
vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
$ mount | rg token
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
$ mount | grep token
bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org
It's quite confusing in practice when it's possible to successfully
create a BPF token from BPF FS that didn't have any of delegate_xxx
mount options set up. While it's not wrong, it's actually more
meaningful to reject BPF_TOKEN_CREATE with specific error code (-ENOENT)
to let user-space know that no token delegation is setup up.
So, instead of creating empty BPF token that will be always ignored
because it doesn't have any of the allow_xxx bits set, reject it with
-ENOENT. If we ever need empty BPF token to be possible, we can support
that with extra flag passed into BPF_TOKEN_CREATE.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-19-andrii@kernel.org
Wire up bpf_token_create and bpf_token_free LSM hooks, which allow to
allocate LSM security blob (we add `void *security` field to struct
bpf_token for that), but also control who can instantiate BPF token.
This follows existing pattern for BPF map and BPF prog.
Also add security_bpf_token_allow_cmd() and security_bpf_token_capable()
LSM hooks that allow LSM implementation to control and negate (if
necessary) BPF token's delegation of a specific bpf_cmd and capability,
respectively.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-12-andrii@kernel.org
Similarly to bpf_prog_alloc LSM hook, rename and extend bpf_map_alloc
hook into bpf_map_create, taking not just struct bpf_map, but also
bpf_attr and bpf_token, to give a fuller context to LSMs.
Unlike bpf_prog_alloc, there is no need to move the hook around, as it
currently is firing right before allocating BPF map ID and FD, which
seems to be a sweet spot.
But like bpf_prog_alloc/bpf_prog_free combo, make sure that bpf_map_free
LSM hook is called even if bpf_map_create hook returned error, as if few
LSMs are combined together it could be that one LSM successfully
allocated security blob for its needs, while subsequent LSM rejected BPF
map creation. The former LSM would still need to free up LSM blob, so we
need to ensure security_bpf_map_free() is called regardless of the
outcome.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-11-andrii@kernel.org
Based on upstream discussion ([0]), rework existing
bpf_prog_alloc_security LSM hook. Rename it to bpf_prog_load and instead
of passing bpf_prog_aux, pass proper bpf_prog pointer for a full BPF
program struct. Also, we pass bpf_attr union with all the user-provided
arguments for BPF_PROG_LOAD command. This will give LSMs as much
information as we can basically provide.
The hook is also BPF token-aware now, and optional bpf_token struct is
passed as a third argument. bpf_prog_load LSM hook is called after
a bunch of sanity checks were performed, bpf_prog and bpf_prog_aux were
allocated and filled out, but right before performing full-fledged BPF
verification step.
bpf_prog_free LSM hook is now accepting struct bpf_prog argument, for
consistency. SELinux code is adjusted to all new names, types, and
signatures.
Note, given that bpf_prog_load (previously bpf_prog_alloc) hook can be
used by some LSMs to allocate extra security blob, but also by other
LSMs to reject BPF program loading, we need to make sure that
bpf_prog_free LSM hook is called after bpf_prog_load/bpf_prog_alloc one
*even* if the hook itself returned error. If we don't do that, we run
the risk of leaking memory. This seems to be possible today when
combining SELinux and BPF LSM, as one example, depending on their
relative ordering.
Also, for BPF LSM setup, add bpf_prog_load and bpf_prog_free to
sleepable LSM hooks list, as they are both executed in sleepable
context. Also drop bpf_prog_load hook from untrusted, as there is no
issue with refcount or anything else anymore, that originally forced us
to add it to untrusted list in c0c852dd18 ("bpf: Do not mark certain LSM
hook arguments as trusted"). We now trigger this hook much later and it
should not be an issue anymore.
[0] https://lore.kernel.org/bpf/9fe88aef7deabbe87d3fc38c4aea3c69.paul@paul-moore.com/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-10-andrii@kernel.org
Remove remaining direct queries to perfmon_capable() and bpf_capable()
in BPF verifier logic and instead use BPF token (if available) to make
decisions about privileges.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-9-andrii@kernel.org
Instead of performing unconditional system-wide bpf_capable() and
perfmon_capable() calls inside bpf_base_func_proto() function (and other
similar ones) to determine eligibility of a given BPF helper for a given
program, use previously recorded BPF token during BPF_PROG_LOAD command
handling to inform the decision.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-8-andrii@kernel.org
Add basic support of BPF token to BPF_PROG_LOAD. BPF_F_TOKEN_FD flag
should be set in prog_flags field when providing prog_token_fd.
Wire through a set of allowed BPF program types and attach types,
derived from BPF FS at BPF token creation time. Then make sure we
perform bpf_token_capable() checks everywhere where it's relevant.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-7-andrii@kernel.org
Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading
through delegated BPF token. BPF_F_TOKEN_FD flag has to be specified
when passing BPF token FD. Given BPF_BTF_LOAD command didn't have flags
field before, we also add btf_flags field.
BTF loading is a pretty straightforward operation, so as long as BPF
token is created with allow_cmds granting BPF_BTF_LOAD command, kernel
proceeds to parsing BTF data and creating BTF object.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-6-andrii@kernel.org
Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
BPF map creation from unprivileged process through delegated BPF token.
New BPF_F_TOKEN_FD flag is added to specify together with BPF token FD
for BPF_MAP_CREATE command.
Wire through a set of allowed BPF map types to BPF token, derived from
BPF FS at BPF token creation time. This, in combination with allowed_cmds
allows to create a narrowly-focused BPF token (controlled by privileged
agent) with a restrictive set of BPF maps that application can attempt
to create.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-5-andrii@kernel.org
Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while having a good amount of control over which
privileged operations could be performed using provided BPF token.
This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).
BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
FS FD, which can be attained through open() API by opening BPF FS mount
point. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having an BPF token as a separate object with its own FD, we can allow
to further restrict BPF token's allowable set of things either at the
creation time or after the fact, allowing the process to guard itself
further from unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.
When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.
Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict on which parts of bpf() syscalls are treated as namespaced).
Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
within the BPF FS owning user namespace, rounding up the ns_capable()
story of BPF token. Also creating BPF token in init user namespace is
currently not supported, given BPF token doesn't have any effect in init
user namespace anyways.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-4-andrii@kernel.org
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
takes over and grants whatever part of BPF syscall is required.
Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
BPF_PROG_LOAD, if the type of BPF program is "network-related" either
CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.
But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
is set:
- when creating DEVMAP/XDKMAP/CPU_MAP maps;
- when attaching CGROUP_SKB programs;
- when handling BPF_PROG_QUERY command.
This patch is changing the latter three cases to follow BPF_PROG_LOAD
model, that is allowing to proceed under either CAP_NET_ADMIN or
CAP_SYS_ADMIN.
This also makes it cleaner in subsequent BPF token patches to switch
wholesomely to a generic bpf_token_capable(int cap) check, that always
falls back to CAP_SYS_ADMIN if requested capability is missing.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-2-andrii@kernel.org
Just to help distinguish the fs->in_exec flag from the current->in_execve
flag, add comments in check_unsafe_exec() and copy_fs() for more
context. Also note that in_execve is only used by TOMOYO now.
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
When the CPU goes idle for the last time during the CPU down hotplug
process, RCU reports a final quiescent state for the current CPU. If
this quiescent state propagates up to the top, some tasks may then be
woken up to complete the grace period: the main grace period kthread
and/or the expedited main workqueue (or kworker).
If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwith timer to the local offline CPU. Since this happens
after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
timer gets ignored. Therefore if the RCU kthreads are waiting for RT
bandwidth to be available, they may never be actually scheduled.
This triggers TREE03 rcutorture hangs:
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
rcu: (t=21035 jiffies g=938281 q=40787 ncpus=6)
rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2eb/0xa80
schedule+0x1f/0x90
schedule_timeout+0x163/0x270
? __pfx_process_timeout+0x10/0x10
rcu_gp_fqs_loop+0x37c/0x5b0
? __pfx_rcu_gp_kthread+0x10/0x10
rcu_gp_kthread+0x17c/0x200
kthread+0xde/0x110
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2b/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
The situation can't be solved with just unpinning the timer. The hrtimer
infrastructure and the nohz heuristics involved in finding the best
remote target for an unpinned timer would then also need to handle
enqueues from an offline CPU in the most horrendous way.
So fix this on the RCU side instead and defer the wake up to an online
CPU if it's too late for the local one.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
alloc_desc() and early_irq_init() contain duplicated code to initialize
interrupt descriptors.
Replace that with a helper function.
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240122085716.2999875-6-dawei.li@shingroup.cn
For a CONFIG_SPARSE_IRQ=n kernel, early_irq_init() is supposed to
initialize all interrupt descriptors.
It does except for irq_desc::resend_node, which ia only initialized for the
first descriptor.
Use the indexed decriptor and not the base pointer to address that.
Fixes: bc06a9e087 ("genirq: Use hlist for managing resend handlers")
Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122085716.2999875-5-dawei.li@shingroup.cn
The module requires the use of btf_ctx_access() to invoke
bpf_tracing_btf_ctx_access() from a module. This function is valuable for
implementing validation functions that ensure proper access to ctx.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-14-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Replace the static list of struct_ops types with per-btf struct_ops_tab to
enable dynamic registration.
Both bpf_dummy_ops and bpf_tcp_ca now utilize the registration function
instead of being listed in bpf_struct_ops_types.h.
Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-12-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
A value_type should consist of three components: refcnt, state, and data.
refcnt and state has been move to struct bpf_struct_ops_common_value to
make it easier to check the value type.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-11-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
To ensure that a module remains accessible whenever a struct_ops object of
a struct_ops type provided by the module is still in use.
struct bpf_struct_ops_map doesn't hold a refcnt to btf anymore since a
module will hold a refcnt to it's btf already. But, struct_ops programs are
different. They hold their associated btf, not the module since they need
only btf to assure their types (signatures).
However, verifier holds the refcnt of the associated module of a struct_ops
type temporarily when verify a struct_ops prog. Verifier needs the help
from the verifier operators (struct bpf_verifier_ops) provided by the owner
module to verify data access of a prog, provide information, and generate
code.
This patch also add a count of links (links_cnt) to bpf_struct_ops_map. It
avoids bpf_struct_ops_map_put_progs() from accessing btf after calling
module_put() in bpf_struct_ops_map_free().
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-10-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Pass the fd of a btf from the userspace to the bpf() syscall, and then
convert the fd into a btf. The btf is generated from the module that
defines the target BPF struct_ops type.
In order to inform the kernel about the module that defines the target
struct_ops type, the userspace program needs to provide a btf fd for the
respective module's btf. This btf contains essential information on the
types defined within the module, including the target struct_ops type.
A btf fd must be provided to the kernel for struct_ops maps and for the bpf
programs attached to those maps.
In the case of the bpf programs, the attach_btf_obj_fd parameter is passed
as part of the bpf_attr and is converted into a btf. This btf is then
stored in the prog->aux->attach_btf field. Here, it just let the verifier
access attach_btf directly.
In the case of struct_ops maps, a btf fd is passed as value_type_btf_obj_fd
of bpf_attr. The bpf_struct_ops_map_alloc() function converts the fd to a
btf and stores it as st_map->btf. A flag BPF_F_VTYPE_BTF_OBJ_FD is added
for map_flags to indicate that the value of value_type_btf_obj_fd is set.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-9-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This is a preparation for searching for struct_ops types from a specified
module. BTF is always btf_vmlinux now. This patch passes a pointer of BTF
to bpf_struct_ops_find_value() and bpf_struct_ops_find(). Once the new
registration API of struct_ops types is used, other BTFs besides
btf_vmlinux can also be passed to them.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-8-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Include btf object id (btf_obj_id) in bpf_map_info so that tools (ex:
bpftools struct_ops dump) know the correct btf from the kernel to look up
type information of struct_ops types.
Since struct_ops types can be defined and registered in a module. The
type information of a struct_ops type are defined in the btf of the
module defining it. The userspace tools need to know which btf is for
the module defining a struct_ops type.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-7-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Once new struct_ops can be registered from modules, btf_vmlinux is no
longer the only btf that struct_ops_map would face. st_map should remember
what btf it should use to get type information.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-6-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Maintain a registry of registered struct_ops types in the per-btf (module)
struct_ops_tab. This registry allows for easy lookup of struct_ops types
that are registered by a specific module.
It is a preparation work for supporting kernel module struct_ops in a
latter patch. Each struct_ops will be registered under its own kernel
module btf and will be stored in the newly added btf->struct_ops_tab. The
bpf verifier and bpf syscall (e.g. prog and map cmd) can find the
struct_ops and its btf type/size/id... information from
btf->struct_ops_tab.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-5-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Move some of members of bpf_struct_ops to bpf_struct_ops_desc. type_id is
unavailabe in bpf_struct_ops anymore. Modules should get it from the btf
received by kmod's init function.
Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-4-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Get ready to remove bpf_struct_ops_init() in the future. By using
BTF_ID_LIST, it is possible to gather type information while building
instead of runtime.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-3-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Move the majority of the code to bpf_struct_ops_init_one(), which can then
be utilized for the initialization of newly registered dynamically
allocated struct_ops types in the following patches.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-2-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Storing cookies in kprobe_multi bpf_link_info data. The cookies
field is optional and if provided it needs to be an array of
__u64 with kprobe_multi.count length.
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240119110505.400573-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
At the moment we don't store cookie for perf_event probes,
while we do that for the rest of the probes.
Adding cookie fields to struct bpf_link_info perf event
probe records:
perf_event.uprobe
perf_event.kprobe
perf_event.tracepoint
perf_event.perf_event
And the code to store that in bpf_link_info struct.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20240119110505.400573-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Current checking rules are structured to disallow alu on particular ptr
types explicitly, so default cases are allowed implicitly. This may lead
to newly added ptr types being allowed unexpectedly. So restruture it to
allow alu explicitly. The tradeoff is mainly a bit more cases added in
the switch. The following table from Eduard summarizes the rules:
| Pointer type | Arithmetics allowed |
|---------------------+---------------------|
| PTR_TO_CTX | yes |
| CONST_PTR_TO_MAP | conditionally |
| PTR_TO_MAP_VALUE | yes |
| PTR_TO_MAP_KEY | yes |
| PTR_TO_STACK | yes |
| PTR_TO_PACKET_META | yes |
| PTR_TO_PACKET | yes |
| PTR_TO_PACKET_END | no |
| PTR_TO_FLOW_KEYS | conditionally |
| PTR_TO_SOCKET | no |
| PTR_TO_SOCK_COMMON | no |
| PTR_TO_TCP_SOCK | no |
| PTR_TO_TP_BUFFER | yes |
| PTR_TO_XDP_SOCK | no |
| PTR_TO_BTF_ID | yes |
| PTR_TO_MEM | yes |
| PTR_TO_BUF | yes |
| PTR_TO_FUNC | yes |
| CONST_PTR_TO_DYNPTR | yes |
The refactored rules are equivalent to the original one. Note that
PTR_TO_FUNC and CONST_PTR_TO_DYNPTR are not reject here because: (1)
check_mem_access() rejects load/store on those ptrs, and those ptrs
with offset passing to calls are rejected check_func_arg_reg_off();
(2) someone may rely on the verifier not rejecting programs earily.
Signed-off-by: Hao Sun <sunhao.th@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240117094012.36798-1-sunhao.th@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
With patch set [1], precision backtracing supports register spill/fill
to/from the stack. The patch [2] allows initial imprecise register spill
with content 0. This is a common case for cpuv3 and lower for
initializing the stack variables with pattern
r1 = 0
*(u64 *)(r10 - 8) = r1
and the [2] has demonstrated good verification improvement.
For cpuv4, the initialization could be
*(u64 *)(r10 - 8) = 0
The current verifier marks the r10-8 contents with STACK_ZERO.
Similar to [2], let us permit the above insn to behave like
imprecise register spill which can reduce number of verified states.
The change is in function check_stack_write_fixed_off().
Before this patch, spilled zero will be marked as STACK_ZERO
which can provide precise values. In check_stack_write_var_off(),
STACK_ZERO will be maintained if writing a const zero
so later it can provide precise values if needed.
The above handling of '*(u64 *)(r10 - 8) = 0' as a spill
will have issues in check_stack_write_var_off() as the spill
will be converted to STACK_MISC and the precise value 0
is lost. To fix this issue, if the spill slots with const
zero and the BPF_ST write also with const zero, the spill slots
are preserved, which can later provide precise values
if needed. Without the change in check_stack_write_var_off(),
the test_verifier subtest 'BPF_ST_MEM stack imm zero, variable offset'
will fail.
I checked cpuv3 and cpuv4 with and without this patch with veristat.
There is no state change for cpuv3 since '*(u64 *)(r10 - 8) = 0'
is only generated with cpuv4.
For cpuv4:
$ ../veristat -C old.cpuv4.csv new.cpuv4.csv -e file,prog,insns,states -f 'insns_diff!=0'
File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF)
------------------------------------------ ------------------- --------- --------- --------------- ---------- ---------- -------------
local_storage_bench.bpf.linked3.o get_local 228 168 -60 (-26.32%) 17 14 -3 (-17.65%)
pyperf600_bpf_loop.bpf.linked3.o on_event 6066 4889 -1177 (-19.40%) 403 321 -82 (-20.35%)
test_cls_redirect.bpf.linked3.o cls_redirect 35483 35387 -96 (-0.27%) 2179 2177 -2 (-0.09%)
test_l4lb_noinline.bpf.linked3.o balancer_ingress 4494 4522 +28 (+0.62%) 217 219 +2 (+0.92%)
test_l4lb_noinline_dynptr.bpf.linked3.o balancer_ingress 1432 1455 +23 (+1.61%) 92 94 +2 (+2.17%)
test_xdp_noinline.bpf.linked3.o balancer_ingress_v6 3462 3458 -4 (-0.12%) 216 216 +0 (+0.00%)
verifier_iterating_callbacks.bpf.linked3.o widening 52 41 -11 (-21.15%) 4 3 -1 (-25.00%)
xdp_synproxy_kern.bpf.linked3.o syncookie_tc 12412 11719 -693 (-5.58%) 345 330 -15 (-4.35%)
xdp_synproxy_kern.bpf.linked3.o syncookie_xdp 12478 11794 -684 (-5.48%) 346 331 -15 (-4.34%)
test_l4lb_noinline and test_l4lb_noinline_dynptr has minor regression, but
pyperf600_bpf_loop and local_storage_bench gets pretty good improvement.
[1] https://lore.kernel.org/all/20231205184248.1502704-1-andrii@kernel.org/
[2] https://lore.kernel.org/all/20231205184248.1502704-9-andrii@kernel.org/
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240110051348.2737007-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, when a scalar bounded register is spilled to the stack, its
ID is preserved, but only if was already assigned, i.e. if this register
was MOVed before.
Assign an ID on spill if none is set, so that equal scalars could be
tracked if a register is spilled to the stack and filled into another
register.
One test is adjusted to reflect the change in register IDs.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-9-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Put calculation of the register value width into a dedicated function.
This function will also be used in a following commit.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Link: https://lore.kernel.org/r/20240108205209.838365-8-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Extract the common code that generates a register ID for src_reg before
MOV if needed into a new function. This function will also be used in
a following commit.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-7-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Current infinite loops detection mechanism is speculative:
- first, states_maybe_looping() check is done which simply does memcmp
for R1-R10 in current frame;
- second, states_equal(..., exact=false) is called. With exact=false
states_equal() would compare scalars for equality only if in old
state scalar has precision mark.
Such logic might be problematic if compiler makes some unlucky stack
spill/fill decisions. An artificial example of a false positive looks
as follows:
r0 = ... unknown scalar ...
r0 &= 0xff;
*(u64 *)(r10 - 8) = r0;
r0 = 0;
loop:
r0 = *(u64 *)(r10 - 8);
if r0 > 10 goto exit_;
r0 += 1;
*(u64 *)(r10 - 8) = r0;
r0 = 0;
goto loop;
This commit updates call to states_equal to use exact=true, forcing
all scalar comparisons to be exact.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-3-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add ability to iterate multiple decl_tag types pointed to the same
function argument. Use this to support multiple __arg_xxx tags per
global subprog argument.
We leave btf_find_decl_tag_value() intact, but change its implementation
to use a new btf_find_next_decl_tag() which can be straightforwardly
used to find next BTF type ID of a matching btf_decl_tag type.
btf_prepare_func_args() is switched from btf_find_decl_tag_value() to
btf_find_next_decl_tag() to gain multiple tags per argument support.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add btf_arg_tag flags enum to be able to record multiple tags per
argument. Also streamline pointer argument processing some more.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Move scalar arg processing in btf_prepare_func_args() after all pointer
arg processing is done. This makes it easier to do validation. One
example of unintended behavior right now is ability to specify
__arg_nonnull for integer/enum arguments. This patch fixes this.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The motivation of inlining bpf_kptr_xchg() comes from the performance
profiling of bpf memory allocator benchmark. The benchmark uses
bpf_kptr_xchg() to stash the allocated objects and to pop the stashed
objects for free. After inling bpf_kptr_xchg(), the performance for
object free on 8-CPUs VM increases about 2%~10%. The inline also has
downside: both the kasan and kcsan checks on the pointer will be
unavailable.
bpf_kptr_xchg() can be inlined by converting the calling of
bpf_kptr_xchg() into an atomic_xchg() instruction. But the conversion
depends on two conditions:
1) JIT backend supports atomic_xchg() on pointer-sized word
2) For the specific arch, the implementation of xchg is the same as
atomic_xchg() on pointer-sized words.
It seems most 64-bit JIT backends satisfies these two conditions. But
as a precaution, defining a weak function bpf_jit_supports_ptr_xchg()
to state whether such conversion is safe and only supporting inline for
64-bit host.
For x86-64, it supports BPF_XCHG atomic operation and both xchg() and
atomic_xchg() use arch_xchg() to implement the exchange, so enabling the
inline of bpf_kptr_xchg() on x86-64 first.
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20240105104819.3916743-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Running the following two commands in parallel on a multi-processor
AArch64 machine can sporadically produce an unexpected warning about
duplicate histogram entries:
$ while true; do
echo hist:key=id.syscall:val=hitcount > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
sleep 0.001
done
$ stress-ng --sysbadaddr $(nproc)
The warning looks as follows:
[ 2911.172474] ------------[ cut here ]------------
[ 2911.173111] Duplicates detected: 1
[ 2911.173574] WARNING: CPU: 2 PID: 12247 at kernel/trace/tracing_map.c:983 tracing_map_sort_entries+0x3e0/0x408
[ 2911.174702] Modules linked in: iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) af_packet(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) ena(E) tiny_power_button(E) qemu_fw_cfg(E) button(E) fuse(E) efi_pstore(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) polyval_ce(E) polyval_generic(E) ghash_ce(E) gf128mul(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) sm3(E) sha3_ce(E) sha512_ce(E) sha512_arm64(E) sha2_ce(E) sha256_arm64(E) nvme(E) sha1_ce(E) nvme_core(E) nvme_auth(E) t10_pi(E) sg(E) scsi_mod(E) scsi_common(E) efivarfs(E)
[ 2911.174738] Unloaded tainted modules: cppc_cpufreq(E):1
[ 2911.180985] CPU: 2 PID: 12247 Comm: cat Kdump: loaded Tainted: G E 6.7.0-default #2 1b58bbb22c97e4399dc09f92d309344f69c44a01
[ 2911.182398] Hardware name: Amazon EC2 c7g.8xlarge/, BIOS 1.0 11/1/2018
[ 2911.183208] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 2911.184038] pc : tracing_map_sort_entries+0x3e0/0x408
[ 2911.184667] lr : tracing_map_sort_entries+0x3e0/0x408
[ 2911.185310] sp : ffff8000a1513900
[ 2911.185750] x29: ffff8000a1513900 x28: ffff0003f272fe80 x27: 0000000000000001
[ 2911.186600] x26: ffff0003f272fe80 x25: 0000000000000030 x24: 0000000000000008
[ 2911.187458] x23: ffff0003c5788000 x22: ffff0003c16710c8 x21: ffff80008017f180
[ 2911.188310] x20: ffff80008017f000 x19: ffff80008017f180 x18: ffffffffffffffff
[ 2911.189160] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000a15134b8
[ 2911.190015] x14: 0000000000000000 x13: 205d373432323154 x12: 5b5d313131333731
[ 2911.190844] x11: 00000000fffeffff x10: 00000000fffeffff x9 : ffffd1b78274a13c
[ 2911.191716] x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 000000000057ffa8
[ 2911.192554] x5 : ffff0012f6c24ec0 x4 : 0000000000000000 x3 : ffff2e5b72b5d000
[ 2911.193404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0003ff254480
[ 2911.194259] Call trace:
[ 2911.194626] tracing_map_sort_entries+0x3e0/0x408
[ 2911.195220] hist_show+0x124/0x800
[ 2911.195692] seq_read_iter+0x1d4/0x4e8
[ 2911.196193] seq_read+0xe8/0x138
[ 2911.196638] vfs_read+0xc8/0x300
[ 2911.197078] ksys_read+0x70/0x108
[ 2911.197534] __arm64_sys_read+0x24/0x38
[ 2911.198046] invoke_syscall+0x78/0x108
[ 2911.198553] el0_svc_common.constprop.0+0xd0/0xf8
[ 2911.199157] do_el0_svc+0x28/0x40
[ 2911.199613] el0_svc+0x40/0x178
[ 2911.200048] el0t_64_sync_handler+0x13c/0x158
[ 2911.200621] el0t_64_sync+0x1a8/0x1b0
[ 2911.201115] ---[ end trace 0000000000000000 ]---
The problem appears to be caused by CPU reordering of writes issued from
__tracing_map_insert().
The check for the presence of an element with a given key in this
function is:
val = READ_ONCE(entry->val);
if (val && keys_match(key, val->key, map->key_size)) ...
The write of a new entry is:
elt = get_free_elt(map);
memcpy(elt->key, key, map->key_size);
entry->val = elt;
The "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;"
stores may become visible in the reversed order on another CPU. This
second CPU might then incorrectly determine that a new key doesn't match
an already present val->key and subsequently insert a new element,
resulting in a duplicate.
Fix the problem by adding a write barrier between
"memcpy(elt->key, key, map->key_size);" and "entry->val = elt;", and for
good measure, also use WRITE_ONCE(entry->val, elt) for publishing the
element. The sequence pairs with the mentioned "READ_ONCE(entry->val);"
and the "val->key" check which has an address dependency.
The barrier is placed on a path executed when adding an element for
a new key. Subsequent updates targeting the same key remain unaffected.
From the user's perspective, the issue was introduced by commit
c193707dde ("tracing: Remove code which merges duplicates"), which
followed commit cbf4100efb ("tracing: Add support to detect and avoid
duplicates"). The previous code operated differently; it inherently
expected potential races which result in duplicates but merged them
later when they occurred.
Link: https://lore.kernel.org/linux-trace-kernel/20240122150928.27725-1-petr.pavlu@suse.com
Fixes: c193707dde ("tracing: Remove code which merges duplicates")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- A fix for the idle and iowait time accounting vs. CPU hotplug.
The time is reset on CPU hotplug which makes the accumulated
systemwide time jump backwards.
- Assorted fixes and improvements for clocksource/event drivers
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmWtTLgTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUXiD/4uN4Ntps8TwxSdg1X11M6++rizg9q9
EmIfwWcfQQJDM5Ss5FE88ye55NxIOwJ1brYo08+yTAXjnnZ/yNP1BBegHbMNiGil
NCHye7tYKZle25+hErdgfBB9n6brPz7dPOvV04/wRRWW+9p2ejt/5nEvojkyco9Y
S9KgBCxkvUqScMbdKKFW1UsThWh2euxwQXRGiWhTPPkbKcVynPvQJjvVyRxn01NS
eEhTn8YUNcAPT+1YApouGXrSCxo/IzBJ36CxOoCoUfaXcJ6FG1LLeAjNxKZ26Dfs
Ah0e3Hhyv6KOsBvBNwwabXDwryd6L8rZd8yL2KakI1vIC51uS2wneFy8GCieDVGh
xmy3U/tfkS0L7pmN+dQW2l4k9PHRNrwvbISKhs0UAHSOgGIMHZcjE6aFbYKru5i4
1W+dEjiktlceZ94mrEHbLpKmxWH2z5P8m0BzUs4kt3nkaOf6CTUKqa/qdAiU5dv+
lovKT26L8HBrMXf48I70UpgW/bYzOUGk55sR6hiLTXAelz1z02D1uYHFkshc0NCO
/O4wvHcgvMM46CtWVbim42AlRcyyWCr+FrY+jvfiG2icOcHPLqc81iHL8EKj7pJl
IxLgyPHVckgnE5gx+GQ8aDkg/qwCZnj4rFWgub8QMYtjI+pO+9T9kPAYPCxFhP7J
gmcJxZAB2RnKXA==
=RD6E
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Updates for time and clocksources:
- A fix for the idle and iowait time accounting vs CPU hotplug.
The time is reset on CPU hotplug which makes the accumulated
systemwide time jump backwards.
- Assorted fixes and improvements for clocksource/event drivers"
* tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
clocksource/drivers/ep93xx: Fix error handling during probe
clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings
clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings
clocksource/timer-riscv: Add riscv_clock_shutdown callback
dt-bindings: timer: Add StarFive JH8100 clint
dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs
When offlining and onlining CPUs the overall reported idle and iowait
times as reported by /proc/stat jump backward and forward:
cpu 132 0 176 225249 47 6 6 21 0 0
cpu0 80 0 115 112575 33 3 4 18 0 0
cpu1 52 0 60 112673 13 3 1 2 0 0
cpu 133 0 177 226681 47 6 6 21 0 0
cpu0 80 0 116 113387 33 3 4 18 0 0
cpu 133 0 178 114431 33 6 6 21 0 0 <---- jump backward
cpu0 80 0 116 114247 33 3 4 18 0 0
cpu1 52 0 61 183 0 3 1 2 0 0 <---- idle + iowait start with 0
cpu 133 0 178 228956 47 6 6 21 0 0 <---- jump forward
cpu0 81 0 117 114929 33 3 4 18 0 0
Reason for this is that get_idle_time() in fs/proc/stat.c has different
sources for both values depending on if a CPU is online or offline:
- if a CPU is online the values may be taken from its per cpu
tick_cpu_sched structure
- if a CPU is offline the values are taken from its per cpu cpustat
structure
The problem is that the per cpu tick_cpu_sched structure is set to zero on
CPU offline. See tick_cancel_sched_timer() in kernel/time/tick-sched.c.
Therefore when a CPU is brought offline and online afterwards both its idle
and iowait sleeptime will be zero, causing a jump backward in total system
idle and iowait sleeptime. In a similar way if a CPU is then brought
offline again the total idle and iowait sleeptimes will jump forward.
It looks like this behavior was introduced with commit 4b0c0f294f
("tick: Cleanup NOHZ per cpu data on cpu down").
This was only noticed now on s390, since we switched to generic idle time
reporting with commit be76ea6144 ("s390/idle: remove arch_cpu_idle_time()
and corresponding code").
Fix this by preserving the values of idle_sleeptime and iowait_sleeptime
members of the per-cpu tick_sched structure on CPU hotplug.
Fixes: 4b0c0f294f ("tick: Cleanup NOHZ per cpu data on cpu down")
Reported-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240115163555.1004144-1-hca@linux.ibm.com
Jiri Slaby reported a futex state inconsistency resulting in -EINVAL during
a lock operation for a PI futex. It requires that the a lock process is
interrupted by a timeout or signal:
T1 Owns the futex in user space.
T2 Tries to acquire the futex in kernel (futex_lock_pi()). Allocates a
pi_state and attaches itself to it.
T2 Times out and removes its rt_waiter from the rt_mutex. Drops the
rtmutex lock and tries to acquire the hash bucket lock to remove
the futex_q. The lock is contended and T2 schedules out.
T1 Unlocks the futex (futex_unlock_pi()). Finds a futex_q but no
rt_waiter. Unlocks the futex (do_uncontended) and makes it available
to user space.
T3 Acquires the futex in user space.
T4 Tries to acquire the futex in kernel (futex_lock_pi()). Finds the
existing futex_q of T2 and tries to attach itself to the existing
pi_state. This (attach_to_pi_state()) fails with -EINVAL because uval
contains the TID of T3 but pi_state points to T1.
It's incorrect to unlock the futex and make it available for user space to
acquire as long as there is still an existing state attached to it in the
kernel.
T1 cannot hand over the futex to T2 because T2 already gave up and started
to clean up and is blocked on the hash bucket lock, so T2's futex_q with
the pi_state pointing to T1 is still queued.
T2 observes the futex_q, but ignores it as there is no waiter on the
corresponding rt_mutex and takes the uncontended path which allows the
subsequent caller of futex_lock_pi() (T4) to observe that stale state.
To prevent this the unlock path must dequeue all futex_q entries which
point to the same pi_state when there is no waiter on the rt mutex. This
requires obviously to make the dequeue conditional in the locking path to
prevent a double dequeue. With that it's guaranteed that user space cannot
observe an uncontended futex which has kernel state attached.
Fixes: fbeb558b0d ("futex/pi: Fix recursive rt_mutex waiter state")
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20240118115451.0TkD_ZhB@linutronix.de
Closes: https://lore.kernel.org/all/4611bcf2-44d0-4c34-9b84-17406f881003@kernel.org
The entire changeset for kgdb this cycle is a single two-line change to
remove some deadcode that, had it not been dead, would have called
strncpy() in an unsafe manner.
To be fair there were other modest clean ups were discussed this cycle
but they are not finalized and will have to wait until next time.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAmWpMp8ACgkQfOMlXTn3
iKE2YQ/9HGgROWCSQPHXmly/VFbcvp/0J77+XA4yNLqTKMfyV1ZHQ5xWoiDp2Q0Q
o689iYQ6V6YWl1/KNR5c7Xct82zgKaSksypTvIZBvk7vPVCElkQM9tpvE9VahMr7
/YB59GFr83Rks7a0tfQ6SMUFWyFFtaU3YdV1374CfGTmlFiqMBbfCz40izjgT7so
yLbbDCZxIaNBlRnRore1kPTt/3KAih6udai74H+OmlifWR0saS57Hwywlfd2H3yv
Q3bFK4NkYLkptJZVmAC8B68YK5c4W24Y+lQ9fouKiFMajym5ONHLFsmdbZe6MrPk
CUU2cGgDvei8nyAXAucocr5ndvJMi8R/JNunBNbnI6oYnT7K1MW3u+v1R+SRYm22
Ixza37hcFkR+7wWnXsPB/POQo74C5ylZt5Of20ogzBEm8qjY7zAGoVZhNh3Ny4t2
NGHwm6MR9613dIh/cVuD4waZdWS2TWYkaTA1E8PwIIqSnKfmMo0pJQU7ddy5V8Kb
F6un3/+9IA6mBQLOqPuFpHZNzIMkDdgE4GvZIrF/DIzlHeWxnk6hDOO2ftbGgsIY
BTelXfE1L8UoHlhixnjnFrTxRNjsZE8CGLY/Zd++b1mhY6tCwSGUS+cInUiIf/3k
X9bN+rwzxV+MVx2V5j/g2i2sEP2YELgCGDKtb7WbXDv43T+Q6/I=
=dLyK
-----END PGP SIGNATURE-----
Merge tag 'kgdb-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb update from Daniel Thompson:
"The entire changeset for kgdb this cycle is a single two-line change
to remove some deadcode that, had it not been dead, would have called
strncpy() in an unsafe manner.
To be fair there were other modest clean ups were discussed this cycle
but they are not finalized and will have to wait until next time"
* tag 'kgdb-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kdb: Fix a potential buffer overflow in kdb_local()
Previous releases - regressions:
- Revert "net: rtnetlink: Enslave device before bringing it up",
breaks the case inverse to the one it was trying to fix
- net: dsa: fix oob access in DSA's netdevice event handler
dereference netdev_priv() before check its a DSA port
- sched: track device in tcf_block_get/put_ext() only for clsact
binder types
- net: tls, fix WARNING in __sk_msg_free when record becomes full
during splice and MORE hint set
- sfp-bus: fix SFP mode detect from bitrate
- drv: stmmac: prevent DSA tags from breaking COE
Previous releases - always broken:
- bpf: fix no forward progress in in bpf_iter_udp if output
buffer is too small
- bpf: reject variable offset alu on registers with a type
of PTR_TO_FLOW_KEYS to prevent oob access
- netfilter: tighten input validation
- net: add more sanity check in virtio_net_hdr_to_skb()
- rxrpc: fix use of Don't Fragment flag on RESPONSE packets,
avoid infinite loop
- amt: do not use the portion of skb->cb area which may get clobbered
- mptcp: improve validation of the MPTCPOPT_MP_JOIN MCTCP option
Misc:
- spring cleanup of inactive maintainers
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmWpnvoACgkQMUZtbf5S
Irvskg/+Or5tETxOmpQXxnj6ECZyrSp0Jcyd7+TIcos/7JfPdn3Kebl004SG4h/s
bwKDOIIP1iSjQ+0NFsPjyYIVd6wFuCElSB7npV5uQAT6ptXx7A4Ym68/rVxodI8T
6hiYV/mlPuZF8JjRhtp/VJL8sY1qnG7RIUB4oH3y9HQNfwZX0lIWChuUilHuWfbq
zQ2Iu97tMkoIBjXrkIT3Qaj0aFxYbjCOrg9zy+FZ69a7Rmrswr//7amlCH6saNTx
Ku7Wl8FXhe7O23OiM6GSl7AechSM1aJ5kOS3orseej0+aSp9eH3ekYGmbsQr6sjz
ix/eZ7V7SUkJK3bEH5haeymk4TDV3lHE8SziMbosK4wVbHOyPwEmqCxppADYJLZs
WycHZKcTBluFBOxknAofH7m5Hh0ToXkeTfpptSSGtRB4WncAOMsMapr3yS4WXg/q
AnOo/tzCBgMrnSJtD/kjqgUiCk8vYoLc8lBR9K74l0zqI1sf13OfuTHvEgqIS6z1
Ir/ewlAV6fCH8gQbyzjKUVlyjZS4+vFv19xg/2GgLf+LdyzcCOxUZkND3/DE6+OA
Dgf9gtABYU4hGXMUfTfml3KCBTF65QmY8dIh17zraNylYUHEJ2lI4D+sdiqWUrXb
mXPBJh4nOPwIV5t2gT80skNwF3aWPr6l4ieY2codSbP04rO74S8=
=YhQQ
-----END PGP SIGNATURE-----
Merge tag 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf and netfilter.
Previous releases - regressions:
- Revert "net: rtnetlink: Enslave device before bringing it up",
breaks the case inverse to the one it was trying to fix
- net: dsa: fix oob access in DSA's netdevice event handler
dereference netdev_priv() before check its a DSA port
- sched: track device in tcf_block_get/put_ext() only for clsact
binder types
- net: tls, fix WARNING in __sk_msg_free when record becomes full
during splice and MORE hint set
- sfp-bus: fix SFP mode detect from bitrate
- drv: stmmac: prevent DSA tags from breaking COE
Previous releases - always broken:
- bpf: fix no forward progress in in bpf_iter_udp if output buffer is
too small
- bpf: reject variable offset alu on registers with a type of
PTR_TO_FLOW_KEYS to prevent oob access
- netfilter: tighten input validation
- net: add more sanity check in virtio_net_hdr_to_skb()
- rxrpc: fix use of Don't Fragment flag on RESPONSE packets, avoid
infinite loop
- amt: do not use the portion of skb->cb area which may get clobbered
- mptcp: improve validation of the MPTCPOPT_MP_JOIN MCTCP option
Misc:
- spring cleanup of inactive maintainers"
* tag 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
i40e: Include types.h to some headers
ipv6: mcast: fix data-race in ipv6_mc_down / mld_ifc_work
selftests: mlxsw: qos_pfc: Adjust the test to support 8 lanes
selftests: mlxsw: qos_pfc: Remove wrong description
mlxsw: spectrum_router: Register netdevice notifier before nexthop
mlxsw: spectrum_acl_tcam: Fix stack corruption
mlxsw: spectrum_acl_tcam: Fix NULL pointer dereference in error path
mlxsw: spectrum_acl_erp: Fix error flow of pool allocation failure
ethtool: netlink: Add missing ethnl_ops_begin/complete
selftests: bonding: Add more missing config options
selftests: netdevsim: add a config file
libbpf: warn on unexpected __arg_ctx type when rewriting BTF
selftests/bpf: add tests confirming type logic in kernel for __arg_ctx
bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
bpf: extract bpf_ctx_convert_map logic and make it more reusable
libbpf: feature-detect arg:ctx tag support in kernel
ipvs: avoid stat macros calls from preemptible context
netfilter: nf_tables: reject NFT_SET_CONCAT with not field length description
netfilter: nf_tables: skip dead set elements in netlink dump
netfilter: nf_tables: do not allow mismatch field size and set key length
...
Including:
- Core changes:
- Fix race conditions in device probe path
- Retire IOMMU bus_ops
- Support for passing custom allocators to page table drivers
- Clean up Kconfig around IOMMU_SVA
- Support for sharing SVA domains with all devices bound to
a mm
- Firmware data parsing cleanup
- Tracing improvements for iommu-dma code
- Some smaller fixes and cleanups
- ARM-SMMU drivers:
- Device-tree binding updates:
- Add additional compatible strings for Qualcomm SoCs
- Document Adreno clocks for Qualcomm's SM8350 SoC
- SMMUv2:
- Implement support for the ->domain_alloc_paging() callback
- Ensure Secure context is restored following suspend of Qualcomm SMMU
implementation
- SMMUv3:
- Disable stalling mode for the "quiet" context descriptor
- Minor refactoring and driver cleanups
- Intel VT-d driver:
- Cleanup and refactoring
- AMD IOMMU driver:
- Improve IO TLB invalidation logic
- Small cleanups and improvements
- Rockchip IOMMU driver:
- DT binding update to add Rockchip RK3588
- Apple DART driver:
- Apple M1 USB4/Thunderbolt DART support
- Cleanups
- Virtio IOMMU driver:
- Add support for iotlb_sync_map
- Enable deferred IO TLB flushes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmWecQoACgkQK/BELZcB
GuN5ZxAAzC5QUKAzANx0puk7QhPpKKlbSvj6Q7iRgCLk00KJO1+VQh9v4ouCmXqF
kn3Ko8gddjhtrgwN0OQ54F39cLUrp1SBemy71K5YOR+vu8VKtwtmawZGeeRZ+k+B
Eohw58oaXTiR1maYvoLixLYczLrjklqyJOQ1vZ0GxFGxDqrFByAryHDgG/3OCpJx
C9e6PsLbbfhfqA8Kv97iKcBqniGbXxAMuodqSUG0buQ3oZgfpIP6Bt3EgUzFGPGk
3BTlYxowS/gkjUWd3fgjQFIFLTA01u9FhpA2Jb0a4v67pUCR64YxHN7rBQ6ZChtG
kB9laQfU9re79RsHhqQzr0JT9x/eyq7pzGzjp5TV5TPW6IW+sqjMIPhzd9P08Ef7
BclkCVobx0jSAHOhnnG4QJiKANr2Y2oM3HfsAJccMMY45RRhUKmVqM7jxMPfGn3A
i+inlee73xTjZXJse1EWG1fmKKMLvX9LDEp4DyOfn9CqVT+7hpZvzPjfbGr937Rm
JlwXhF3rQXEpOCagEsbt1vOf+V0e9QiCLf1Y2KpkIkDbE5wwSD/2qLm3tFhJG3oF
fkW+J14Cid0pj+hY0afGe0kOUOIYlimu0nFmSf0pzMH+UktZdKogSfyb1gSDsy+S
rsZRGPFhMJ832ExqhlDfxqBebqh+jsfKynlskui6Td5C9ZULaHA=
=q751
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
"Core changes:
- Fix race conditions in device probe path
- Retire IOMMU bus_ops
- Support for passing custom allocators to page table drivers
- Clean up Kconfig around IOMMU_SVA
- Support for sharing SVA domains with all devices bound to a mm
- Firmware data parsing cleanup
- Tracing improvements for iommu-dma code
- Some smaller fixes and cleanups
ARM-SMMU drivers:
- Device-tree binding updates:
- Add additional compatible strings for Qualcomm SoCs
- Document Adreno clocks for Qualcomm's SM8350 SoC
- SMMUv2:
- Implement support for the ->domain_alloc_paging() callback
- Ensure Secure context is restored following suspend of Qualcomm
SMMU implementation
- SMMUv3:
- Disable stalling mode for the "quiet" context descriptor
- Minor refactoring and driver cleanups
Intel VT-d driver:
- Cleanup and refactoring
AMD IOMMU driver:
- Improve IO TLB invalidation logic
- Small cleanups and improvements
Rockchip IOMMU driver:
- DT binding update to add Rockchip RK3588
Apple DART driver:
- Apple M1 USB4/Thunderbolt DART support
- Cleanups
Virtio IOMMU driver:
- Add support for iotlb_sync_map
- Enable deferred IO TLB flushes"
* tag 'iommu-updates-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (66 commits)
iommu: Don't reserve 0-length IOVA region
iommu/vt-d: Move inline helpers to header files
iommu/vt-d: Remove unused vcmd interfaces
iommu/vt-d: Remove unused parameter of intel_pasid_setup_pass_through()
iommu/vt-d: Refactor device_to_iommu() to retrieve iommu directly
iommu/sva: Fix memory leak in iommu_sva_bind_device()
dt-bindings: iommu: rockchip: Add Rockchip RK3588
iommu/dma: Trace bounce buffer usage when mapping buffers
iommu/arm-smmu: Convert to domain_alloc_paging()
iommu/arm-smmu: Pass arm_smmu_domain to internal functions
iommu/arm-smmu: Implement IOMMU_DOMAIN_BLOCKED
iommu/arm-smmu: Convert to a global static identity domain
iommu/arm-smmu: Reorganize arm_smmu_domain_add_master()
iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Master cannot be NULL in arm_smmu_write_strtab_ent()
iommu/arm-smmu-v3: Add a type for the STE
iommu/arm-smmu-v3: disable stall for quiet_cd
iommu/qcom: restore IOMMU state if needed
iommu/arm-smmu-qcom: Add QCM2290 MDSS compatible
iommu/arm-smmu-qcom: Add missing GMU entry to match table
...
- Allow kernel trace instance creation to specify what events are created
Inside the kernel, a subsystem may create a tracing instance that it can
use to send events to user space. This sub-system may not care about the
thousands of events that exist in eventfs. Allow the sub-system to specify
what sub-systems of events it cares about, and only those events are exposed
to this instance.
- Allow the ring buffer to be broken up into bigger sub-buffers than just the
architecture page size. A new tracefs file called "buffer_subbuf_size_kb"
is created. The user can now specify a minimum size the sub-buffer may be
in kilobytes. Note, that the implementation currently make the sub-buffer
size a power of 2 pages (1, 2, 4, 8, 16, ...) but the user only writes in
kilobyte size, and the sub-buffer will be updated to the next size that
it will can accommodate it. If the user writes in 10, it will change the
size to be 4 pages on x86 (16K), as that is the next available size that
can hold 10K pages.
- Update the debug output when a corrupt time is detected in the ring buffer.
If the ring buffer detects inconsistent timestamps, there's a debug config
options that will dump the contents of the meta data of the sub-buffer that
is used for debugging. Add some more information to this dump that helps
with debugging.
- Add more timestamp debugging checks (only triggers when the config is enabled)
- Increase the trace_seq iterator to 2 page sizes.
- Allow strings written into tracefs_marker to be larger. Up to just under
2 page sizes (based on what trace_seq can hold).
- Increase the trace_maker_raw write to be as big as a sub-buffer can hold.
- Remove 32 bit time stamp logic, now that the rb_time_cmpxchg() has been
removed.
- More selftests were added.
- Some code clean ups as well.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZZ8p3BQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6ql2GAQDZg/zlFEiJHyTfWbCIE8pA3T5xbzKo
26TNxIZAxJJZpQEAvGFU5Smy14pG6soEoVMp8B6ZOANbqU8VVamhOL+r+Qw=
=0OYG
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Allow kernel trace instance creation to specify what events are
created
Inside the kernel, a subsystem may create a tracing instance that it
can use to send events to user space. This sub-system may not care
about the thousands of events that exist in eventfs. Allow the
sub-system to specify what sub-systems of events it cares about, and
only those events are exposed to this instance.
- Allow the ring buffer to be broken up into bigger sub-buffers than
just the architecture page size.
A new tracefs file called "buffer_subbuf_size_kb" is created. The
user can now specify a minimum size the sub-buffer may be in
kilobytes. Note, that the implementation currently make the
sub-buffer size a power of 2 pages (1, 2, 4, 8, 16, ...) but the user
only writes in kilobyte size, and the sub-buffer will be updated to
the next size that it will can accommodate it. If the user writes in
10, it will change the size to be 4 pages on x86 (16K), as that is
the next available size that can hold 10K pages.
- Update the debug output when a corrupt time is detected in the ring
buffer. If the ring buffer detects inconsistent timestamps, there's a
debug config options that will dump the contents of the meta data of
the sub-buffer that is used for debugging. Add some more information
to this dump that helps with debugging.
- Add more timestamp debugging checks (only triggers when the config is
enabled)
- Increase the trace_seq iterator to 2 page sizes.
- Allow strings written into tracefs_marker to be larger. Up to just
under 2 page sizes (based on what trace_seq can hold).
- Increase the trace_maker_raw write to be as big as a sub-buffer can
hold.
- Remove 32 bit time stamp logic, now that the rb_time_cmpxchg() has
been removed.
- More selftests were added.
- Some code clean ups as well.
* tag 'trace-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (29 commits)
ring-buffer: Remove stale comment from ring_buffer_size()
tracing histograms: Simplify parse_actions() function
tracing/selftests: Remove exec permissions from trace_marker.tc test
ring-buffer: Use subbuf_order for buffer page masking
tracing: Update subbuffer with kilobytes not page order
ringbuffer/selftest: Add basic selftest to test changing subbuf order
ring-buffer: Add documentation on the buffer_subbuf_order file
ring-buffer: Just update the subbuffers when changing their allocation order
ring-buffer: Keep the same size when updating the order
tracing: Stop the tracing while changing the ring buffer subbuf size
tracing: Update snapshot order along with main buffer order
ring-buffer: Make sure the spare sub buffer used for reads has same size
ring-buffer: Do no swap cpu buffers if order is different
ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failure
ring-buffer: Read and write to ring buffers with custom sub buffer size
ring-buffer: Set new size of the ring buffer sub page
ring-buffer: Add interface for configuring trace sub buffer size
ring-buffer: Page size per ring buffer
ring-buffer: Have ring_buffer_print_page_header() be able to access ring_buffer_iter
ring-buffer: Check if absolute timestamp goes backwards
...
- Kprobes trace event to show the actual function name in notrace-symbol
warning. Instead of using user specified symbol name, use "%ps" printk
format to show the actual symbol at the probe address. Since kprobe
event accepts the offset from symbol which is bigger than the symbol
size, user specified symbol may not be the actual probed symbol.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmWdZB0bHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b9AcH/R8mNbgAbKlxSXUm0NAG
xrUcN9vyb9yaLgvoIEvW+XF6EMaCM6G2kG+wSaJB6xFiPlJgf9FhILjDjHAtV2x1
wXL8r3eLyKvkU3HXfS7RphUTPecgblI16FHZ12x2TkQ41KoRzQf2c7cSQs4B8SHP
W5LPqvxxqjbV84iqZPScez99S0ZS0Of3ubmepVEm2LDshfhUVMIUH1vfvEn3vQI7
k5PoNiVRem+rjduERM3I7Zd51K7Lz/5hN56q6ok2vY8hVoRdp0j83Ly36h21ClS9
CtvlzPX0YjaogVd8Gyc3z+vqy61YiNA1q0fRqIhagmfIy/26s1ORaq/0S2gywxXn
piA=
=mfU0
-----END PGP SIGNATURE-----
Merge tag 'probes-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes update from Masami Hiramatsu:
- Update the Kprobes trace event to show the actual function name in
notrace-symbol warning.
Instead of using the user specified symbol name, use "%ps" printk
format to show the actual symbol at the probe address. Since kprobe
event accepts the offset from symbol which is bigger than the symbol
size, the user specified symbol may not be the actual probed symbol.
* tag 'probes-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
trace/kprobe: Display the actual notrace function when rejecting a probe
where the CPU would remain at the lowest frequency, degrading
performance substantially.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWpM0sRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1giEg/+Mn9hdLqgE7xPPvCa8UWoJzFGTIYgTT3O
gma5Ras/kqB6cJTb1zn/HocAIj1Y2gZAsRU/U3IpOfPzklwIKQLBID1PE+d0izAc
NC9N0LuPau+XbMY5U+G0YNQZzDW+Zioe/9I6uDRKRTtLTdZAk8Plk9yh+tRtpSG8
aEswyoDOJfvkLbl7kJGymHgxDiDtmXEcz6j2pNlFtcEdHFjiSHo2Jq09DMia9sHr
W563FSvO7DVBMOosKH8sq7sSPdCBi0zshaWDiyz2M7Ry2uBsqJvx+9qxDnloafTp
Yqp5rkSVzOxtQwxjtYD+WWy+AgwQqo+O5FHsm0JmoiGVkmpB95bdhQxk2gtshSCo
IwUt2Gqsndd0JM4v5gOn4G/qCPxFUA/Tx1OMWM89nQUVp3OmIwm8z99f5gFxoSYa
DFn2P2Ku/A/fiKfWcNDOCyMgYcJNmqRKSjWEh+mfFeexiuWR3jPrQ4GKbSl9Gusw
vLmBM9pMSyGvivptu+ALXERDDm95wEVVkULgxlcUgpuT8jjpmovbtFj2xYcnzvc4
EKOgJ0FmXCM/B6QFnnbzgMzu2IThoQpL8Ud3JlMeGDRLGDvZip9AA+0RsnirURwX
+EuE7fHcDzfAA+Fv9sGosaFmxD1dUh1EJL41XrFZSYfMsZzzzlj+k9PWf9ABCE4R
6gEHuRza+rU=
=c7Ib
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2024-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Fix a cpufreq related performance regression on certain systems, where
the CPU would remain at the lowest frequency, degrading performance
substantially"
* tag 'sched-urgent-2024-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix frequency selection for non-invariant case
Here are the set of driver core and kernfs changes for 6.8-rc1. Nothing
major in here this release cycle, just lots of small cleanups and some
tweaks on kernfs that in the very end, got reverted and will come back
in a safer way next release cycle.
Included in here are:
- more driver core 'const' cleanups and fixes
- fw_devlink=rpm is now the default behavior
- kernfs tiny changes to remove some string functions
- cpu handling in the driver core is updated to work better on many
systems that add topologies and cpus after booting
- other minor changes and cleanups
All of the cpu handling patches have been acked by the respective
maintainers and are coming in here in one series. Everything has been
in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZaeOrg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymtcwCffzvKKkSY9qAp6+0v2WQNkZm1JWoAoJCPYUwF
If6wEoPLWvRfKx4gIoq9
=D96r
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here are the set of driver core and kernfs changes for 6.8-rc1.
Nothing major in here this release cycle, just lots of small cleanups
and some tweaks on kernfs that in the very end, got reverted and will
come back in a safer way next release cycle.
Included in here are:
- more driver core 'const' cleanups and fixes
- fw_devlink=rpm is now the default behavior
- kernfs tiny changes to remove some string functions
- cpu handling in the driver core is updated to work better on many
systems that add topologies and cpus after booting
- other minor changes and cleanups
All of the cpu handling patches have been acked by the respective
maintainers and are coming in here in one series. Everything has been
in linux-next for a while with no reported issues"
* tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (51 commits)
Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock"
kernfs: convert kernfs_idr_lock to an irq safe raw spinlock
class: fix use-after-free in class_register()
PM: clk: make pm_clk_add_notifier() take a const pointer
EDAC: constantify the struct bus_type usage
kernfs: fix reference to renamed function
driver core: device.h: fix Excess kernel-doc description warning
driver core: class: fix Excess kernel-doc description warning
driver core: mark remaining local bus_type variables as const
driver core: container: make container_subsys const
driver core: bus: constantify subsys_register() calls
driver core: bus: make bus_sort_breadthfirst() take a const pointer
kernfs: d_obtain_alias(NULL) will do the right thing...
driver core: Better advertise dev_err_probe()
kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()
kernfs: Convert kernfs_name_locked() from strlcpy() to strscpy()
kernfs: Convert kernfs_walk_ns() from strlcpy() to strscpy()
initramfs: Expose retained initrd as sysfs file
fs/kernfs/dir: obey S_ISGID
kernel/cgroup: use kernfs_create_dir_ns()
...
Add enforcement of expected types for context arguments tagged with
arg:ctx (__arg_ctx) tag.
First, any program type will accept generic `void *` context type when
combined with __arg_ctx tag.
Besides accepting "canonical" struct names and `void *`, for a bunch of
program types for which program context is actually a named struct, we
allows a bunch of pragmatic exceptions to match real-world and expected
usage:
- for both kprobes and perf_event we allow `bpf_user_pt_regs_t *` as
canonical context argument type, where `bpf_user_pt_regs_t` is a
*typedef*, not a struct;
- for kprobes, we also always accept `struct pt_regs *`, as that's what
actually is passed as a context to any kprobe program;
- for perf_event, we resolve typedefs (unless it's `bpf_user_pt_regs_t`)
down to actual struct type and accept `struct pt_regs *`, or
`struct user_pt_regs *`, or `struct user_regs_struct *`, depending
on the actual struct type kernel architecture points `bpf_user_pt_regs_t`
typedef to; otherwise, canonical `struct bpf_perf_event_data *` is
expected;
- for raw_tp/raw_tp.w programs, `u64/long *` are accepted, as that's
what's expected with BPF_PROG() usage; otherwise, canonical
`struct bpf_raw_tracepoint_args *` is expected;
- tp_btf supports both `struct bpf_raw_tracepoint_args *` and `u64 *`
formats, both are coded as expections as tp_btf is actually a TRACING
program type, which has no canonical context type;
- iterator programs accept `struct bpf_iter__xxx *` structs, currently
with no further iterator-type specific enforcement;
- fentry/fexit/fmod_ret/lsm/struct_ops all accept `u64 *`;
- classic tracepoint programs, as well as syscall and freplace
programs allow any user-provided type.
In all other cases kernel will enforce exact match of struct name to
expected canonical type. And if user-provided type doesn't match that
expectation, verifier will emit helpful message with expected type name.
Note a bit unnatural way the check is done after processing all the
arguments. This is done to avoid conflict between bpf and bpf-next
trees. Once trees converge, a small follow up patch will place a simple
btf_validate_prog_ctx_type() check into a proper ARG_PTR_TO_CTX branch
(which bpf-next tree patch refactored already), removing duplicated
arg:ctx detection logic.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240118033143.3384355-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Refactor btf_get_prog_ctx_type() a bit to allow reuse of
bpf_ctx_convert_map logic in more than one places. Simplify interface by
returning btf_type instead of btf_member (field reference in BTF).
To do the above we need to touch and start untangling
btf_translate_to_vmlinux() implementation. We do the bare minimum to
not regress anything for btf_translate_to_vmlinux(), but its
implementation is very questionable for what it claims to be doing.
Mapping kfunc argument types to kernel corresponding types conceptually
is quite different from recognizing program context types. Fixing this
is out of scope for this change though.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240118033143.3384355-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Here is the big set of char/misc and other driver subsystem changes for
6.8-rc1. Lots of stuff in here, but first off, you will get a merge
conflict in drivers/android/binder_alloc.c when merging this tree due to
changing coming in through the -mm tree.
The resolution of the merge issue can be found here:
https://lore.kernel.org/r/20231207134213.25631ae9@canb.auug.org.au
or in a simpler patch form in that thread:
https://lore.kernel.org/r/ZXHzooF07LfQQYiE@google.com
If there are issues with the merge of this file, please let me know.
Other than lots of binder driver changes (as you can see by the merge
conflicts) included in here are:
- lots of iio driver updates and additions
- spmi driver updates
- eeprom driver updates
- firmware driver updates
- ocxl driver updates
- mhi driver updates
- w1 driver updates
- nvmem driver updates
- coresight driver updates
- platform driver remove callback api changes
- tags.sh script updates
- bus_type constant marking cleanups
- lots of other small driver updates
All of these have been in linux-next for a while with no reported issues
(other than the binder merge conflict.)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZaeMMQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynWNgCfQ/Yz7QO6EMLDwHO5LRsb3YMhjL4AoNVdanjP
YoI7f1I4GBcC0GKNfK6s
=+Kyv
-----END PGP SIGNATURE-----
Merge tag 'char-misc-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc and other driver updates from Greg KH:
"Here is the big set of char/misc and other driver subsystem changes
for 6.8-rc1.
Other than lots of binder driver changes (as you can see by the merge
conflicts) included in here are:
- lots of iio driver updates and additions
- spmi driver updates
- eeprom driver updates
- firmware driver updates
- ocxl driver updates
- mhi driver updates
- w1 driver updates
- nvmem driver updates
- coresight driver updates
- platform driver remove callback api changes
- tags.sh script updates
- bus_type constant marking cleanups
- lots of other small driver updates
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (341 commits)
android: removed duplicate linux/errno
uio: Fix use-after-free in uio_open
drivers: soc: xilinx: add check for platform
firmware: xilinx: Export function to use in other module
scripts/tags.sh: remove find_sources
scripts/tags.sh: use -n to test archinclude
scripts/tags.sh: add local annotation
scripts/tags.sh: use more portable -path instead of -wholename
scripts/tags.sh: Update comment (addition of gtags)
firmware: zynqmp: Convert to platform remove callback returning void
firmware: turris-mox-rwtm: Convert to platform remove callback returning void
firmware: stratix10-svc: Convert to platform remove callback returning void
firmware: stratix10-rsu: Convert to platform remove callback returning void
firmware: raspberrypi: Convert to platform remove callback returning void
firmware: qemu_fw_cfg: Convert to platform remove callback returning void
firmware: mtk-adsp-ipc: Convert to platform remove callback returning void
firmware: imx-dsp: Convert to platform remove callback returning void
firmware: coreboot_table: Convert to platform remove callback returning void
firmware: arm_scpi: Convert to platform remove callback returning void
firmware: arm_scmi: Convert to platform remove callback returning void
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZaHe5gAKCRDdBJ7gKXxA
jrAiAQCYZQuwsNVyGJUuPD/GGQzqVUZNpWcuYwMXXAi6dO5rSAD+LDeFviun2K52
uHCz4iRq5EwNLA+MbdHtAnQzr+e5CQ8=
=Jjkw
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-01-12-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"For once not mostly MM-related.
17 hotfixes. 10 address post-6.7 issues and the other 7 are cc:stable"
* tag 'mm-hotfixes-stable-2024-01-12-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
userfaultfd: avoid huge_zero_page in UFFDIO_MOVE
MAINTAINERS: add entry for shrinker
selftests: mm: hugepage-vmemmap fails on 64K page size systems
mm/memory_hotplug: fix memmap_on_memory sysfs value retrieval
mailmap: switch email for Tanzir Hasan
mailmap: add old address mappings for Randy
kernel/crash_core.c: make __crash_hotplug_lock static
efi: disable mirror feature during crashkernel
kexec: do syscore_shutdown() in kernel_kexec
mailmap: update entry for Manivannan Sadhasivam
fs/proc/task_mmu: move mmu notification mechanism inside mm lock
mm: zswap: switch maintainers to recently active developers and reviewers
scripts/decode_stacktrace.sh: optionally use LLVM utilities
kasan: avoid resetting aux_lock
lib/Kconfig.debug: disable CONFIG_DEBUG_INFO_BTF for Hexagon
MAINTAINERS: update LTP maintainers
kdump: defer the insertion of crashkernel resources
When appending "[defcmd]" to 'kdb_prompt_str', the size of the string
already in the buffer should be taken into account.
An option could be to switch from strncat() to strlcat() which does the
correct test to avoid such an overflow.
However, this actually looks as dead code, because 'defcmd_in_progress'
can't be true here.
See a more detailed explanation at [1].
[1]: https://lore.kernel.org/all/CAD=FV=WSh7wKN7Yp-3wWiDgX4E3isQ8uh0LCzTmd1v9Cg9j+nQ@mail.gmail.com/
Fixes: 5d5314d679 ("kdb: core for kgdb back end (1 of 2)")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Currently the workqueue just checks the atomic and locking states after work
execution ends. However, sometimes, a work item may not unlock rcu after
acquiring rcu_read_lock(). And as a result, it would cause rcu stall, but
the rcu stall warning can not dump the work func, because the work has
finished.
In order to quickly discover those works that do not call rcu_read_unlock()
after rcu_read_lock(), add the rcu lock check.
Use rcu_preempt_depth() to check the work's rcu status. Normally, this value
is 0. If this value is bigger than 0, it means the work are still holding
rcu lock. If so, print err info and the work func.
tj: Reworded the description for clarity. Minor formatting tweak.
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
At the time they are created unbound workqueues rescuers currently use
cpu_possible_mask as their affinity, but this can be too wide in case a
workqueue unbound mask has been set as a subset of cpu_possible_mask.
Make new rescuers use their associated workqueue unbound cpumask from
the start.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently we limit the size of the workqueue name to 24 characters due to
commit ecf6881ff3 ("workqueue: make workqueue->name[] fixed len")
Increase the size to 32 characters and print a warning in the event
the requested name is larger than the limit of 32 characters.
Signed-off-by: Audra Mitchell <audra@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Linus reported a ~50% performance regression on single-threaded
workloads on his AMD Ryzen system, and bisected it to:
9c0b4bb7f6 ("sched/cpufreq: Rework schedutil governor performance estimation")
When frequency invariance is not enabled, get_capacity_ref_freq(policy)
is supposed to return the current frequency and the performance margin
applied by map_util_perf(), enabling the utilization to go above the
maximum compute capacity and to select a higher frequency than the current one.
After the changes in 9c0b4bb7f6, the performance margin was applied
earlier in the path to take into account utilization clampings and
we couldn't get a utilization higher than the maximum compute capacity,
and the CPU remained 'stuck' at lower frequencies.
To fix this, we must use a frequency above the current frequency to
get a chance to select a higher OPP when the current one becomes fully used.
Apply the same margin and return a frequency 25% higher than the current
one in order to switch to the next OPP before we fully use the CPU
at the current one.
[ mingo: Clarified the changelog. ]
Fixes: 9c0b4bb7f6 ("sched/cpufreq: Rework schedutil governor performance estimation")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Wyes Karny <wkarny@gmail.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Wyes Karny <wkarny@gmail.com>
Link: https://lore.kernel.org/r/20240114183600.135316-1-vincent.guittot@linaro.org
Update the kernel-doc comments to catch up with the code changes and
fix the kernel-doc warnings:
debug.c:83: warning: Excess struct member 'stacktrace' description in 'dma_debug_entry'
debug.c:83: warning: Function parameter or struct member 'stack_len' not described in 'dma_debug_entry'
debug.c:83: warning: Function parameter or struct member 'stack_entries' not described in 'dma_debug_entry'
Fixes: 746017ed8d ("dma/debug: Simplify stracktrace retrieval")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: iommu@lists.linux.dev
Signed-off-by: Christoph Hellwig <hch@lst.de>
This pull request contains the following branches:
doc.2023.12.13a: Documentation and comment updates.
torture.2023.11.23a: RCU torture, locktorture updates that include
cleanups; nolibc init build support for mips, ppc and rv64;
testing of mid stall duration scenario and fixing fqs task
creation conditions.
fixes.2023.12.13a: Misc fixes, most notably restricting usage of
RCU CPU stall notifiers, to confine their usage primarily
to debug kernels.
rcu-tasks.2023.12.12b: RCU tasks minor fixes.
srcu.2023.12.13a: lockdep annotation fix for NMI-safe accesses,
callback advancing/acceleration cleanup and documentation
improvements.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSi2tPIQIc2VEtjarIAHS7/6Z0wpQUCZYUS0AAKCRAAHS7/6Z0w
pRXgAQD+k8oqjvKL6la61ppWm5Y7NLjdj/IbV+cOd42jKnM6PAEAyavNhX0n7zGx
o9cDlvIDxJfHnFrOTc5WLH9yEs3IiQQ=
=8rdu
-----END PGP SIGNATURE-----
Merge tag 'rcu.release.v6.8' of https://github.com/neeraju/linux
Pull RCU updates from Neeraj Upadhyay:
- Documentation and comment updates
- RCU torture, locktorture updates that include cleanups; nolibc init
build support for mips, ppc and rv64; testing of mid stall duration
scenario and fixing fqs task creation conditions
- Misc fixes, most notably restricting usage of RCU CPU stall
notifiers, to confine their usage primarily to debug kernels
- RCU tasks minor fixes
- lockdep annotation fix for NMI-safe accesses, callback
advancing/acceleration cleanup and documentation improvements
* tag 'rcu.release.v6.8' of https://github.com/neeraju/linux:
rcu: Force quiescent states only for ongoing grace period
doc: Clarify historical disclaimers in memory-barriers.txt
doc: Mention address and data dependencies in rcu_dereference.rst
doc: Clarify RCU Tasks reader/updater checklist
rculist.h: docs: Fix wrong function summary
Documentation: RCU: Remove repeated word in comments
srcu: Use try-lock lockdep annotation for NMI-safe access.
srcu: Explain why callbacks invocations can't run concurrently
srcu: No need to advance/accelerate if no callback enqueued
srcu: Remove superfluous callbacks advancing from srcu_gp_start()
rcu: Remove unused macros from rcupdate.h
rcu: Restrict access to RCU CPU stall notifiers
rcu-tasks: Mark RCU Tasks accesses to current->rcu_tasks_idle_cpu
rcutorture: Add fqs_holdoff check before fqs_task is created
rcutorture: Add mid-sized stall to TREE07
rcutorture: add nolibc init support for mips, ppc and rv64
locktorture: Increase Hamming distance between call_rcu_chain and rcu_call_chains
sparse warnings:
kernel/crash_core.c:749:1: sparse: sparse: symbol '__crash_hotplug_lock' was not declared. Should it be static?
Fixes: e2a8f20dd8 ("Crash: add lock to serialize crash hotplug handling")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202401080654.IjjU5oK7-lkp@intel.com/
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
syscore_shutdown() runs driver and module callbacks to get the system into
a state where it can be correctly shut down. In commit 6f389a8f1d ("PM
/ reboot: call syscore_shutdown() after disable_nonboot_cpus()")
syscore_shutdown() was removed from kernel_restart_prepare() and hence got
(incorrectly?) removed from the kexec flow. This was innocuous until
commit 6735150b69 ("KVM: Use syscore_ops instead of reboot_notifier to
hook restart/shutdown") changed the way that KVM registered its shutdown
callbacks, switching from reboot notifiers to syscore_ops.shutdown. As
syscore_shutdown() is missing from kexec, KVM's shutdown hook is not run
and virtualisation is left enabled on the boot CPU which results in triple
faults when switching to the new kernel on Intel x86 VT-x with VMXE
enabled.
Fix this by adding syscore_shutdown() to the kexec sequence. In terms of
where to add it, it is being added after migrating the kexec task to the
boot CPU, but before APs are shut down. It is not totally clear if this
is the best place: in commit 6f389a8f1d ("PM / reboot: call
syscore_shutdown() after disable_nonboot_cpus()") it is stated that
"syscore_ops operations should be carried with one CPU on-line and
interrupts disabled." APs are only offlined later in machine_shutdown(),
so this syscore_shutdown() is being run while APs are still online. This
seems to be the correct place as it matches where syscore_shutdown() is
run in the reboot and halt flows - they also run it before APs are shut
down. The assumption is that the commit message in commit 6f389a8f1d
("PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()") is
no longer valid.
KVM has been discussed here as it is what broke loudly by not having
syscore_shutdown() in kexec, but this change impacts more than just KVM;
all drivers/modules which register a syscore_ops.shutdown callback will
now be invoked in the kexec flow. Looking at some of them like x86 MCE it
is probably more correct to also shut these down during kexec.
Maintainers of all drivers which use syscore_ops.shutdown are added on CC
for visibility. They are:
arch/powerpc/platforms/cell/spu_base.c .shutdown = spu_shutdown,
arch/x86/kernel/cpu/mce/core.c .shutdown = mce_syscore_shutdown,
arch/x86/kernel/i8259.c .shutdown = i8259A_shutdown,
drivers/irqchip/irq-i8259.c .shutdown = i8259A_shutdown,
drivers/irqchip/irq-sun6i-r.c .shutdown = sun6i_r_intc_shutdown,
drivers/leds/trigger/ledtrig-cpu.c .shutdown = ledtrig_cpu_syscore_shutdown,
drivers/power/reset/sc27xx-poweroff.c .shutdown = sc27xx_poweroff_shutdown,
kernel/irq/generic-chip.c .shutdown = irq_gc_shutdown,
virt/kvm/kvm_main.c .shutdown = kvm_shutdown,
This has been tested by doing a kexec on x86_64 and aarch64.
Link: https://lkml.kernel.org/r/20231213064004.2419447-1-jgowans@amazon.com
Fixes: 6735150b69 ("KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown")
Signed-off-by: James Gowans <jgowans@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Jernej Skrabec <jernej.skrabec@gmail.com>
Cc: Samuel Holland <samuel@sholland.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Orson Zhai <orsonzhai@gmail.com>
Cc: Alexander Graf <graf@amazon.de>
Cc: Jan H. Schoenherr <jschoenh@amazon.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Core & protocols
----------------
- Analyze and reorganize core networking structs (socks, netdev,
netns, mibs) to optimize cacheline consumption and set up
build time warnings to safeguard against future header changes.
This improves TCP performances with many concurrent connections
up to 40%.
- Add page-pool netlink-based introspection, exposing the
memory usage and recycling stats. This helps indentify
bad PP users and possible leaks.
- Refine TCP/DCCP source port selection to no longer favor even
source port at connect() time when IP_LOCAL_PORT_RANGE is set.
This lowers the time taken by connect() for hosts having
many active connections to the same destination.
- Refactor the TCP bind conflict code, shrinking related socket
structs.
- Refactor TCP SYN-Cookie handling, as a preparation step to
allow arbitrary SYN-Cookie processing via eBPF.
- Tune optmem_max for 0-copy usage, increasing the default value
to 128KB and namespecifying it.
- Allow coalescing for cloned skbs coming from page pools, improving
RX performances with some common configurations.
- Reduce extension header parsing overhead at GRO time.
- Add bridge MDB bulk deletion support, allowing user-space to
request the deletion of matching entries.
- Reorder nftables struct members, to keep data accessed by the
datapath first.
- Introduce TC block ports tracking and use. This allows supporting
multicast-like behavior at the TC layer.
- Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and
classifiers (RSVP and tcindex).
- More data-race annotations.
- Extend the diag interface to dump TCP bound-only sockets.
- Conditional notification of events for TC qdisc class and actions.
- Support for WPAN dynamic associations with nearby devices, to form
a sub-network using a specific PAN ID.
- Implement SMCv2.1 virtual ISM device support.
- Add support for Batman-avd mulicast packet type.
BPF
---
- Tons of verifier improvements:
- BPF register bounds logic and range support along with a large
test suite
- log improvements
- complete precision tracking support for register spills
- track aligned STACK_ZERO cases as imprecise spilled registers. It
improves the verifier "instructions processed" metric from single
digit to 50-60% for some programs
- support for user's global BPF subprogram arguments with few
commonly requested annotations for a better developer experience
- support tracking of BPF_JNE which helps cases when the compiler
transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the
like
- several fixes
- Add initial TX metadata implementation for AF_XDP with support in
mlx5 and stmmac drivers. Two types of offloads are supported right
now, that is, TX timestamp and TX checksum offload.
- Fix kCFI bugs in BPF all forms of indirect calls from BPF into
kernel and from kernel into BPF work with CFI enabled. This allows
BPF to work with CONFIG_FINEIBT=y.
- Change BPF verifier logic to validate global subprograms lazily
instead of unconditionally before the main program, so they can be
guarded using BPF CO-RE techniques.
- Support uid/gid options when mounting bpffs.
- Add a new kfunc which acquires the associated cgroup of a task
within a specific cgroup v1 hierarchy where the latter is identified
by its id.
- Extend verifier to allow bpf_refcount_acquire() of a map value field
obtained via direct load which is a use-case needed in sched_ext.
- Add BPF link_info support for uprobe multi link along with bpftool
integration for the latter.
- Support for VLAN tag in XDP hints.
- Remove deprecated bpfilter kernel leftovers given the project
is developed in user-space (https://github.com/facebook/bpfilter).
Misc
----
- Support for parellel TC self-tests execution.
- Increase MPTCP self-tests coverage.
- Updated the bridge documentation, including several so-far
undocumented features.
- Convert all the net self-tests to run in unique netns, to
avoid random failures due to conflict and allow concurrent
runs.
- Add TCP-AO self-tests.
- Add kunit tests for both cfg80211 and mac80211.
- Autogenerate Netlink families documentation from YAML spec.
- Add yml-gen support for fixed headers and recursive nests, the
tool can now generate user-space code for all genetlink families
for which we have specs.
- A bunch of additional module descriptions fixes.
- Catch incorrect freeing of pages belonging to a page pool.
Driver API
----------
- Rust abstractions for network PHY drivers; do not cover yet the
full C API, but already allow implementing functional PHY drivers
in rust.
- Introduce queue and NAPI support in the netdev Netlink interface,
allowing complete access to the device <> NAPIs <> queues
relationship.
- Introduce notifications filtering for devlink to allow control
application scale to thousands of instances.
- Improve PHY validation, requesting rate matching information for
each ethtool link mode supported by both the PHY and host.
- Add support for ethtool symmetric-xor RSS hash.
- ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD
platform.
- Expose pin fractional frequency offset value over new DPLL generic
netlink attribute.
- Convert older drivers to platform remove callback returning void.
- Add support for PHY package MMD read/write.
New hardware / drivers
----------------------
- Ethernet:
- Octeon CN10K devices
- Broadcom 5760X P7
- Qualcomm SM8550 SoC
- Texas Instrument DP83TG720S PHY
- Bluetooth:
- IMC Networks Bluetooth radio
Removed
-------
- WiFi:
- libertas 16-bit PCMCIA support
- Atmel at76c50x drivers
- HostAP ISA/PCMCIA style 802.11b driver
- zd1201 802.11b USB dongles
- Orinoco ISA/PCMCIA 802.11b driver
- Aviator/Raytheon driver
- Planet WL3501 driver
- RNDIS USB 802.11b driver
Drivers
-------
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- allow one by one port representors creation and removal
- add temperature and clock information reporting
- add get/set for ethtool's header split ringparam
- add again FW logging
- adds support switchdev hardware packet mirroring
- iavf: implement symmetric-xor RSS hash
- igc: add support for concurrent physical and free-running timers
- i40e: increase the allowable descriptors
- nVidia/Mellanox:
- Preparation for Socket-Direct multi-dev netdev. That will allow
in future releases combining multiple PFs devices attached to
different NUMA nodes under the same netdev
- Broadcom (bnxt):
- TX completion handling improvements
- add basic ntuple filter support
- reduce MSIX vectors usage for MQPRIO offload
- add VXLAN support, USO offload and TX coalesce completion for P7
- Marvell Octeon EP:
- xmit-more support
- add PF-VF mailbox support and use it for FW notifications for VFs
- Wangxun (ngbe/txgbe):
- implement ethtool functions to operate pause param, ring param,
coalesce channel number and msglevel
- Netronome/Corigine (nfp):
- add flow-steering support
- support UDP segmentation offload
- Ethernet NICs embedded, slower, virtual:
- Xilinx AXI: remove duplicate DMA code adopting the dma engine driver
- stmmac: add support for HW-accelerated VLAN stripping
- TI AM654x sw: add mqprio, frame preemption & coalescing
- gve: add support for non-4k page sizes.
- virtio-net: support dynamic coalescing moderation
- nVidia/Mellanox Ethernet datacenter switches:
- allow firmware upgrade without a reboot
- more flexible support for bridge flooding via the compressed
FID flooding mode
- Ethernet embedded switches:
- Microchip:
- fine-tune flow control and speed configurations in KSZ8xxx
- KSZ88X3: enable setting rmii reference
- Renesas:
- add jumbo frames support
- Marvell:
- 88E6xxx: add "eth-mac" and "rmon" stats support
- Ethernet PHYs:
- aquantia: add firmware load support
- at803x: refactor the driver to simplify adding support for more
chip variants
- NXP C45 TJA11xx: Add MACsec offload support
- Wifi:
- MediaTek (mt76):
- NVMEM EEPROM improvements
- mt7996 Extremely High Throughput (EHT) improvements
- mt7996 Wireless Ethernet Dispatcher (WED) support
- mt7996 36-bit DMA support
- Qualcomm (ath12k):
- support for a single MSI vector
- WCN7850: support AP mode
- Intel (iwlwifi):
- new debugfs file fw_dbg_clear
- allow concurrent P2P operation on DFS channels
- Bluetooth:
- QCA2066: support HFP offload
- ISO: more broadcast-related improvements
- NXP: better recovery in case receiver/transmitter get out of sync
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmWdamsSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkGC4P/2xjLzdw22ckSssuE9ORbGko9SNjnqHk
PQh1E+26BHiCg5KB8VvzMsL78E79MRNXEattSW+1g7dhCvln3oi+Vd0WkdRkgt35
98Iv18zLbbwFAJeyKvmLAPAkQkMLtVj19QILBBRrugF+egEZgVSE3JBcTAiKv2ZQ
HzkabA171Ri6LpCcEEtY5XuaKvimGnGzF8YMFf8rX0wtqd2p5kbY9aMe47WAGxvU
Vf9548XvH+A5yVH2/4/gujtUOpA/RHuhuCMb+oo0cZ+VCC1x9MGzoXzj6r87OTkf
k2W1whNzcGoin92f+9Lk1JYMuiGKBH4QVaDdNXJnYFSJWPTE7RvRsPzYTSD4/GzK
yEZbzSJXpy/2vDQm16NoAxl7evRs8Sorzkw4LQRviZHI/5SAkK2ZQiCK5CO8QSYy
C1LELcV5kn6Foe24xWnrWLjAGug9oJnYoGPMU5gvPmFJMvUMXqm5rmbBgUWL5Rxw
q1M6gVzabCyWUy6z2G2vaqW2ZntNVvCkdsLtIX0XZkcTzNoP0MA+TuhyGz4wbiuo
PeyQp/mbGnDgCYggqKIA0YWrTVxkhFrKN520cbO8qXBQytV9oFbM/0/+C0/r/5WX
pL1JVzLrh6l5ME7EIQfha8UOF9j8q4ueSwb40P3AR2NaZiDABM0zfUZ6+sx+91WF
ucqPEcZB5cRE
=1bW6
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"The most interesting thing is probably the networking structs
reorganization and a significant amount of changes is around
self-tests.
Core & protocols:
- Analyze and reorganize core networking structs (socks, netdev,
netns, mibs) to optimize cacheline consumption and set up build
time warnings to safeguard against future header changes
This improves TCP performances with many concurrent connections up
to 40%
- Add page-pool netlink-based introspection, exposing the memory
usage and recycling stats. This helps indentify bad PP users and
possible leaks
- Refine TCP/DCCP source port selection to no longer favor even
source port at connect() time when IP_LOCAL_PORT_RANGE is set. This
lowers the time taken by connect() for hosts having many active
connections to the same destination
- Refactor the TCP bind conflict code, shrinking related socket
structs
- Refactor TCP SYN-Cookie handling, as a preparation step to allow
arbitrary SYN-Cookie processing via eBPF
- Tune optmem_max for 0-copy usage, increasing the default value to
128KB and namespecifying it
- Allow coalescing for cloned skbs coming from page pools, improving
RX performances with some common configurations
- Reduce extension header parsing overhead at GRO time
- Add bridge MDB bulk deletion support, allowing user-space to
request the deletion of matching entries
- Reorder nftables struct members, to keep data accessed by the
datapath first
- Introduce TC block ports tracking and use. This allows supporting
multicast-like behavior at the TC layer
- Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and
classifiers (RSVP and tcindex)
- More data-race annotations
- Extend the diag interface to dump TCP bound-only sockets
- Conditional notification of events for TC qdisc class and actions
- Support for WPAN dynamic associations with nearby devices, to form
a sub-network using a specific PAN ID
- Implement SMCv2.1 virtual ISM device support
- Add support for Batman-avd mulicast packet type
BPF:
- Tons of verifier improvements:
- BPF register bounds logic and range support along with a large
test suite
- log improvements
- complete precision tracking support for register spills
- track aligned STACK_ZERO cases as imprecise spilled registers.
This improves the verifier "instructions processed" metric from
single digit to 50-60% for some programs
- support for user's global BPF subprogram arguments with few
commonly requested annotations for a better developer
experience
- support tracking of BPF_JNE which helps cases when the compiler
transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the
like
- several fixes
- Add initial TX metadata implementation for AF_XDP with support in
mlx5 and stmmac drivers. Two types of offloads are supported right
now, that is, TX timestamp and TX checksum offload
- Fix kCFI bugs in BPF all forms of indirect calls from BPF into
kernel and from kernel into BPF work with CFI enabled. This allows
BPF to work with CONFIG_FINEIBT=y
- Change BPF verifier logic to validate global subprograms lazily
instead of unconditionally before the main program, so they can be
guarded using BPF CO-RE techniques
- Support uid/gid options when mounting bpffs
- Add a new kfunc which acquires the associated cgroup of a task
within a specific cgroup v1 hierarchy where the latter is
identified by its id
- Extend verifier to allow bpf_refcount_acquire() of a map value
field obtained via direct load which is a use-case needed in
sched_ext
- Add BPF link_info support for uprobe multi link along with bpftool
integration for the latter
- Support for VLAN tag in XDP hints
- Remove deprecated bpfilter kernel leftovers given the project is
developed in user-space (https://github.com/facebook/bpfilter)
Misc:
- Support for parellel TC self-tests execution
- Increase MPTCP self-tests coverage
- Updated the bridge documentation, including several so-far
undocumented features
- Convert all the net self-tests to run in unique netns, to avoid
random failures due to conflict and allow concurrent runs
- Add TCP-AO self-tests
- Add kunit tests for both cfg80211 and mac80211
- Autogenerate Netlink families documentation from YAML spec
- Add yml-gen support for fixed headers and recursive nests, the tool
can now generate user-space code for all genetlink families for
which we have specs
- A bunch of additional module descriptions fixes
- Catch incorrect freeing of pages belonging to a page pool
Driver API:
- Rust abstractions for network PHY drivers; do not cover yet the
full C API, but already allow implementing functional PHY drivers
in rust
- Introduce queue and NAPI support in the netdev Netlink interface,
allowing complete access to the device <> NAPIs <> queues
relationship
- Introduce notifications filtering for devlink to allow control
application scale to thousands of instances
- Improve PHY validation, requesting rate matching information for
each ethtool link mode supported by both the PHY and host
- Add support for ethtool symmetric-xor RSS hash
- ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD
platform
- Expose pin fractional frequency offset value over new DPLL generic
netlink attribute
- Convert older drivers to platform remove callback returning void
- Add support for PHY package MMD read/write
New hardware / drivers:
- Ethernet:
- Octeon CN10K devices
- Broadcom 5760X P7
- Qualcomm SM8550 SoC
- Texas Instrument DP83TG720S PHY
- Bluetooth:
- IMC Networks Bluetooth radio
Removed:
- WiFi:
- libertas 16-bit PCMCIA support
- Atmel at76c50x drivers
- HostAP ISA/PCMCIA style 802.11b driver
- zd1201 802.11b USB dongles
- Orinoco ISA/PCMCIA 802.11b driver
- Aviator/Raytheon driver
- Planet WL3501 driver
- RNDIS USB 802.11b driver
Driver updates:
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- allow one by one port representors creation and removal
- add temperature and clock information reporting
- add get/set for ethtool's header split ringparam
- add again FW logging
- adds support switchdev hardware packet mirroring
- iavf: implement symmetric-xor RSS hash
- igc: add support for concurrent physical and free-running
timers
- i40e: increase the allowable descriptors
- nVidia/Mellanox:
- Preparation for Socket-Direct multi-dev netdev. That will
allow in future releases combining multiple PFs devices
attached to different NUMA nodes under the same netdev
- Broadcom (bnxt):
- TX completion handling improvements
- add basic ntuple filter support
- reduce MSIX vectors usage for MQPRIO offload
- add VXLAN support, USO offload and TX coalesce completion
for P7
- Marvell Octeon EP:
- xmit-more support
- add PF-VF mailbox support and use it for FW notifications
for VFs
- Wangxun (ngbe/txgbe):
- implement ethtool functions to operate pause param, ring
param, coalesce channel number and msglevel
- Netronome/Corigine (nfp):
- add flow-steering support
- support UDP segmentation offload
- Ethernet NICs embedded, slower, virtual:
- Xilinx AXI: remove duplicate DMA code adopting the dma engine
driver
- stmmac: add support for HW-accelerated VLAN stripping
- TI AM654x sw: add mqprio, frame preemption & coalescing
- gve: add support for non-4k page sizes.
- virtio-net: support dynamic coalescing moderation
- nVidia/Mellanox Ethernet datacenter switches:
- allow firmware upgrade without a reboot
- more flexible support for bridge flooding via the compressed
FID flooding mode
- Ethernet embedded switches:
- Microchip:
- fine-tune flow control and speed configurations in KSZ8xxx
- KSZ88X3: enable setting rmii reference
- Renesas:
- add jumbo frames support
- Marvell:
- 88E6xxx: add "eth-mac" and "rmon" stats support
- Ethernet PHYs:
- aquantia: add firmware load support
- at803x: refactor the driver to simplify adding support for more
chip variants
- NXP C45 TJA11xx: Add MACsec offload support
- Wifi:
- MediaTek (mt76):
- NVMEM EEPROM improvements
- mt7996 Extremely High Throughput (EHT) improvements
- mt7996 Wireless Ethernet Dispatcher (WED) support
- mt7996 36-bit DMA support
- Qualcomm (ath12k):
- support for a single MSI vector
- WCN7850: support AP mode
- Intel (iwlwifi):
- new debugfs file fw_dbg_clear
- allow concurrent P2P operation on DFS channels
- Bluetooth:
- QCA2066: support HFP offload
- ISO: more broadcast-related improvements
- NXP: better recovery in case receiver/transmitter get out of sync"
* tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1714 commits)
lan78xx: remove redundant statement in lan78xx_get_eee
lan743x: remove redundant statement in lan743x_ethtool_get_eee
bnxt_en: Fix RCU locking for ntuple filters in bnxt_rx_flow_steer()
bnxt_en: Fix RCU locking for ntuple filters in bnxt_srxclsrldel()
bnxt_en: Remove unneeded variable in bnxt_hwrm_clear_vnic_filter()
tcp: Revert no longer abort SYN_SENT when receiving some ICMP
Revert "mlx5 updates 2023-12-20"
Revert "net: stmmac: Enable Per DMA Channel interrupt"
ipvlan: Remove usage of the deprecated ida_simple_xx() API
ipvlan: Fix a typo in a comment
net/sched: Remove ipt action tests
net: stmmac: Use interrupt mode INTM=1 for per channel irq
net: stmmac: Add support for TX/RX channel interrupt
net: stmmac: Make MSI interrupt routine generic
dt-bindings: net: snps,dwmac: per channel irq
net: phy: at803x: make read_status more generic
net: phy: at803x: add support for cdt cross short test for qca808x
net: phy: at803x: refactor qca808x cable test get status function
net: phy: at803x: generalize cdt fault length function
net: ethernet: cortina: Drop TSO support
...
Just one cleanup and one documentation improvement change. No functional
changes. However, this has been tested on linux-next for over 1 month.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmWdW0USHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinL5gQAMUtU4nv+HahLwJK3ha5fRmcHm2kV9Q4
g9X7JUscEe+mNYLNy2Kjpl/BWatEHWn8jQD2nsMQJOcEkX2Mf5tfqBR3561wrfSZ
dnMmNDG7Ym6Y9kOSDz6cpCxsu8Xm5Dj9MLKJ51qPfyXgeobD8IiKBe2oCDqcKzGY
ZpDnmpUaYCOIloNhJNK9ybrNLsVDQwdPiC8vVQXULC4ePBw3i+mnh8c1wr442wEO
G6xfi0wNXIoB9S8ynzakW2lJPD1XeMQYu/SJR0nz61KhJxAjs/LVAt9k7itcnUgG
Zbc1Fn944oWoe7/ywwDIxstR56NYVpcXoTJexeHxrLe6PiEmRbh6phrBMUCOpUsh
0DrHJE8z4dsHovo6w6m1zvMF6FphLHhUU6L/opBwrUJ5CGrYfegG95v8d3jRSphe
GSoMo9iGHvr0PgY7OcG77m5NJjrFwwbPO88Fe3IAXmOXIrKYuzBoUnt3Te1dw8vX
6vYPUUZ3HLDOACWtKf2Tjhr7pM6b1C72vrg7uYc2560BH6ERqjAtuOWKrx8QOeeo
SUT4ACs4qa3DrPz7zpNhwwnfwpFoVuJd+fooMV/WhOU9KsCVCokC7zZub+J8UD7m
1j1nhWkKaF08c0BKW4zRMmUVSV6Dh/AO0YybetBq9b4o7NPEcij2HlIk+jJa0VMJ
XSx5UcuxRyf7
=hvRa
-----END PGP SIGNATURE-----
Merge tag 'modules-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull module updates from Luis Chamberlain:
"Just one cleanup and one documentation improvement change. No
functional changes"
* tag 'modules-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
kernel/module: improve documentation for try_module_get()
module: Remove redundant TASK_UNINTERRUPTIBLE
The goal is to get sched.h down to a type only header, so the main thing
happening in this patchset is splitting out various _types.h headers and
dependency fixups, as well as moving some things out of sched.h to
better locations.
This is prep work for the memory allocation profiling patchset which
adds new sched.h interdepencencies.
Testing - it's been in -next, and fixes from pretty much all
architectures have percolated in - nothing major.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmWfBwwACgkQE6szbY3K
bnZPwBAAmuRojXaeWxi01IPIOehSGDe68vw44PR9glEMZvxdnZuPOdvE4/+245/L
bRKU2WBCjBUokUbV9msIShwRkFTZAmEMPNfPAAsFMA+VXeDYHKB+ZRdwTggNAQ+I
SG6fZgh5m0HsewCDxU8oqVHkjVq4fXn0cy+aL6xLEd9gu67GoBzX2pDieS2Kvy6j
jnyoKTxFwb+LTQgph0P4EIpq5I2umAsdLwdSR8EJ+8e9NiNvMo1pI00Lx/ntAnFZ
JftWUJcMy3TQ5u1GkyfQN9y/yThX1bZK5GvmHS9SJ2Dkacaus5d+xaKCHtRuFS1I
7C6b8PsNgRczUMumBXus44HdlNfNs1yU3lvVxFvBIPE1qC9pYRHrkWIXXIocXLLC
oxTEJ6B2G3BQZVQgLIA4fOaxMVhmvKffi/aEZLi9vN9VVosd1a6XNKI6KbyRnXFp
GSs9qDqszhn5I3GYNlDNQTc/8UsRlhPFgS6nS0By6QnvxtGi9QkU2tBRBsXvqwCy
cLoCYIhc2tvugHvld70dz26umiJ4rnmxGlobStNoigDvIKAIUt1UmIdr1so8P8eH
xehnL9ZcOX6xnANDL0AqMFFHV6I58CJynhFdUoXfVQf/DWLGX48mpi9LVNsYBzsI
CAwVOAQ0UjGrpdWmJ9ueY/ABYqg9vRjzaDEXQ+MhAYO55CLaVsg=
=3tyT
-----END PGP SIGNATURE-----
Merge tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs
Pull header cleanups from Kent Overstreet:
"The goal is to get sched.h down to a type only header, so the main
thing happening in this patchset is splitting out various _types.h
headers and dependency fixups, as well as moving some things out of
sched.h to better locations.
This is prep work for the memory allocation profiling patchset which
adds new sched.h interdepencencies"
* tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (51 commits)
Kill sched.h dependency on rcupdate.h
kill unnecessary thread_info.h include
Kill unnecessary kernel.h include
preempt.h: Kill dependency on list.h
rseq: Split out rseq.h from sched.h
LoongArch: signal.c: add header file to fix build error
restart_block: Trim includes
lockdep: move held_lock to lockdep_types.h
sem: Split out sem_types.h
uidgid: Split out uidgid_types.h
seccomp: Split out seccomp_types.h
refcount: Split out refcount_types.h
uapi/linux/resource.h: fix include
x86/signal: kill dependency on time.h
syscall_user_dispatch.h: split out *_types.h
mm_types_task.h: Trim dependencies
Split out irqflags_types.h
ipc: Kill bogus dependency on spinlock.h
shm: Slim down dependencies
workqueue: Split out workqueue_types.h
...
- Introduce the param_unknown_fn type and other clean ups (Andy Shevchenko)
- Various __counted_by annotations (Christophe JAILLET, Gustavo A. R. Silva,
Kees Cook)
- Add KFENCE test to LKDTM (Stephen Boyd)
- Various strncpy() refactorings (Justin Stitt)
- Fix qnx4 to avoid writing into the smaller of two overlapping buffers
- Various strlcpy() refactorings
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmWcOsQWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJoiDD/9gNhalNG+6MNF5TDwSvO9X7pvL
bQ6D3clByRxYjnJ4dMQ7p3s+rJ937uQt9PezIWHgRoldjQy3x7AJ5BxkhjeMlD2B
YLbfdVYPy09X0Ewk1Efvfm/ta6tJpBGYF7Bc7LIneZrdQ6gemBpLW1PNZAFYzcWX
oDjV+M1NytxaiF0aebxPZvZ1W+NGQ105Sxvj5MheDoezyO/j0CTe+ZYtCzFguFY0
8SPpR5FG4AFidb8GHd5Ndv0trVWjF1jat0FUFgEFOCE0fJNWLVR0Bbr2MtXiG7wL
LF7IZ/Mn+mi+O3BmcD6JiaYf9EPlMUXCyqc8NvsnoWGqhWhWmQPCInZVrpplMUNK
V/UHVMkmjDs4f/lAHBJoJHDK6fmOD+cAFaNMOltfErcjV4s+lEo6vHoiKl8hfPnH
EzpQaK3funGroVYwTc35e07NrJJHCzqIUhZ0FJO7ByuOE2tIomiVo9Xy9gy54iCT
qzC7zkrZ0MKqui4qiUY9FWayRRYLX4qNxELm4yie6Pzmk8943hNOaDofcyKWuZFC
eqvhIkvqb4LasLrzCBk+ehA2KWSRmTrR6E9IygwbBXUTsvn2yj2RRYeAlGQNBTBZ
adgSXQpRBmtKYqyihWLhP4QcunknEiQdDS3lS2qJmPH33Iv3jGH4yS6BNIBufMGL
PoC2UxSfGd+YT079fw==
=1Wxx
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
- Introduce the param_unknown_fn type and other clean ups (Andy
Shevchenko)
- Various __counted_by annotations (Christophe JAILLET, Gustavo A. R.
Silva, Kees Cook)
- Add KFENCE test to LKDTM (Stephen Boyd)
- Various strncpy() refactorings (Justin Stitt)
- Fix qnx4 to avoid writing into the smaller of two overlapping buffers
- Various strlcpy() refactorings
* tag 'hardening-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
qnx4: Use get_directory_fname() in qnx4_match()
qnx4: Extract dir entry filename processing into helper
atags_proc: Add __counted_by for struct buffer and use struct_size()
tracing/uprobe: Replace strlcpy() with strscpy()
params: Fix multi-line comment style
params: Sort headers
params: Use size_add() for kmalloc()
params: Do not go over the limit when getting the string length
params: Introduce the param_unknown_fn type
lkdtm: Add kfence read after free crash type
nvme-fc: replace deprecated strncpy with strscpy
nvdimm/btt: replace deprecated strncpy with strscpy
nvme-fabrics: replace deprecated strncpy with strscpy
drm/modes: replace deprecated strncpy with strscpy_pad
afs: Add __counted_by for struct afs_acl and use struct_size()
VMCI: Annotate struct vmci_handle_arr with __counted_by
i40e: Annotate struct i40e_qvlist_info with __counted_by
HID: uhid: replace deprecated strncpy with strscpy
samples: Replace strlcpy() with strscpy()
SUNRPC: Replace strlcpy() with strscpy()
Step 5/10 of the namespace unification of CPU mitigations related Kconfig options.
[ mingo: Converted a few more uses in comments/messages as well. ]
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ariel Miculas <amiculas@cisco.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org
It's been 11 years since the ring_buffer_size() function was updated to
use the nr_pages from the buffer->buffers[cpu] structure instead of using
the buffer->nr_pages that no longer exists.
The comment in the code is more of what a change log should have and is
pretty much useless for development. It's saying how things worked back in
2012 that bares no purpose on today's code. Remove it.
Link: https://lore.kernel.org/linux-trace-kernel/84d3b41a72bd43dbb9d44921ef535c92@AcuMS.aculab.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231220081028.7cd7e8e2@gandalf.local.home
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reported-by: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
This KUnit update for Linux 6.8-rc1 consists of:
- a new feature that adds APIs for managing devices introducing
a set of helper functions which allow devices (internally a
struct kunit_device) to be created and managed by KUnit.
These devices will be automatically unregistered on
test exit. These helpers can either use a user-provided
struct device_driver, or have one automatically created and
managed by KUnit. In both cases, the device lives on a new
kunit_bus.
- changes to switch drm/tests to use kunit devices
- several fixes and enhancements to attribute feature
- changes to reorganize deferred action function introducing
KUNIT_DEFINE_ACTION_WRAPPER
- new feature adds ability to run tests after boot using debugfs
- fixes and enhancements to string-stream-test:
- parse ERR_PTR in string_stream_destroy()
- unchecked dereference in bug fix in debugfs_print_results()
- handling errors from alloc_string_stream()
- NULL-dereference bug fix in kunit_init_suite()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmWdiHIACgkQCwJExA0N
QxxzCxAAmhn+rkKV4DfuGXxUAJbO109H7LSP1Y7FKMYCVp83msWKASziujb2IQR9
87jnmgeJMbmQaPcc9m//NHuFhZmJwQZwAdGZryoDiz7XK+1MwLxYeUj92HI7FPaD
o5Jz6tlqFdehx5jCOymgwbvhI5kJMkQCTTtnEaiHCByfaA02UqmTtt3bXK5OeJkZ
UG0HqdvI/6Xo01i+BnerRBZFcQV49GMhl4acw1k+dJnPLkqusL6txftRBoKtxuVd
mXQHKS1SmNgiNA+nqs4d/8qERoMJWuwj6wV4pldVBXhgZwOHXbBxBf67i7hTakE/
TkEURCkOb5X0QrT6akJj6phJ2xqXsF7xwzBJh9G4jF2Pdwwo8GGuAXW+ol0TRrm8
ZEQ4eMBGIK07Lb9FeBMLO2bZ0Ox+oiN+YNGY/bs1d6Ibf4PnBUfy7IlmMjKL9h/V
A/EpYdaq5q72IZZQ2pu1rYkBRPbnP7vHmjLXVYIq7Pq8bLA9/ycKO/0jnGHdo1oz
rBK/6t7yB+ATi4KeKQpjpmUTX/vdEenUQI/QDn9ngXIEwYQfNrEUZitEvBXR1Kw+
T8iKDIPFkvb/yEZgjWgNpxETooDx3yBkeeC29gKMj4QoN38wEdfy0Xltp8eqq9cS
6lijRoipUypHRAuXeSJMW2dflLnFIt4mtC25hBNF+DmyNVT+IF4=
=79+u
-----END PGP SIGNATURE-----
Merge tag 'linux_kselftest-kunit-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull KUnit updates from Shuah Khan:
- a new feature that adds APIs for managing devices introducing a set
of helper functions which allow devices (internally a struct
kunit_device) to be created and managed by KUnit.
These devices will be automatically unregistered on test exit. These
helpers can either use a user-provided struct device_driver, or have
one automatically created and managed by KUnit. In both cases, the
device lives on a new kunit_bus.
- changes to switch drm/tests to use kunit devices
- several fixes and enhancements to attribute feature
- changes to reorganize deferred action function introducing
KUNIT_DEFINE_ACTION_WRAPPER
- new feature adds ability to run tests after boot using debugfs
- fixes and enhancements to string-stream-test:
- parse ERR_PTR in string_stream_destroy()
- unchecked dereference in bug fix in debugfs_print_results()
- handling errors from alloc_string_stream()
- NULL-dereference bug fix in kunit_init_suite()
* tag 'linux_kselftest-kunit-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (27 commits)
kunit: Fix some comments which were mistakenly kerneldoc
kunit: Protect string comparisons against NULL
kunit: Add example of kunit_activate_static_stub() with pointer-to-function
kunit: Allow passing function pointer to kunit_activate_static_stub()
kunit: Fix NULL-dereference in kunit_init_suite() if suite->log is NULL
kunit: Reset test->priv after each param iteration
kunit: Add example for using test->priv
drm/tests: Switch to kunit devices
ASoC: topology: Replace fake root_device with kunit_device in tests
overflow: Replace fake root_device with kunit_device
fortify: test: Use kunit_device
kunit: Add APIs for managing devices
Documentation: Add debugfs docs with run after boot
kunit: add ability to run tests after boot using debugfs
kunit: add is_init test attribute
kunit: add example suite to test init suites
kunit: add KUNIT_INIT_TABLE to init linker section
kunit: move KUNIT_TABLE out of INIT_DATA
kunit: tool: add test for parsing attributes
kunit: tool: fix parsing of test attributes
...
- Add support for the Sierra Forest, Grand Ridge and Meteorlake SoCs to
the intel_idle cpuidle driver (Artem Bityutskiy, Zhang Rui).
- Do not enable interrupts when entering idle in the haltpoll cpuidle
driver (Borislav Petkov).
- Add Emerald Rapids support in no-HWP mode to the intel_pstate cpufreq
driver (Zhenguo Yao).
- Use EPP values programmed by the platform firmware as balanced
performance ones by default in intel_pstate (Srinivas Pandruvada).
- Add a missing function return value check to the SCMI cpufreq driver
to avoid unexpected behavior (Alexandra Diupina).
- Fix parameter type warning in the armada-8k cpufreq driver (Gregory
CLEMENT).
- Rework trans_stat_show() in the devfreq core code to avoid buffer
overflows (Christian Marangi).
- Synchronize devfreq_monitor_[start/stop] so as to prevent a timer
list corruption from occurring when devfreq governors are switched
frequently (Mukesh Ojha).
- Fix possible deadlocks in the core system-wide PM code that occur if
device-handling functions cannot be executed asynchronously during
resume from system-wide suspend (Rafael J. Wysocki).
- Clean up unnecessary local variable initializations in multiple
places in the hibernation code (Wang chaodong, Li zeming).
- Adjust core hibernation code to avoid missing wakeup events that
occur after saving an image to persistent storage (Chris Feng).
- Update hibernation code to enforce correct ordering during image
compression and decompression (Hongchen Zhang).
- Use kmap_local_page() instead of kmap_atomic() in copy_data_page()
during hibernation and restore (Chen Haonan).
- Adjust documentation and code comments to reflect recent tasks freezer
changes (Kevin Hao).
- Repair excess function parameter description warning in the
hibernation image-saving code (Randy Dunlap).
- Fix _set_required_opps when opp is NULL (Bryan O'Donoghue).
- Use device_get_match_data() in the OPP code for TI (Rob Herring).
- Clean up OPP level and other parts and call dev_pm_opp_set_opp()
recursively for required OPPs (Viresh Kumar).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmWb8o8SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxbOQP/2D+YGyd9R1awYqxQoIQSGbpIr7a9oTR
BOIcOn2PvWLXH8w9wVbIUJsSL9Nx90D9T4S5QcyHrS2qHmR+1Gb0gX3D4QAmEBcG
+wFCLt2//5PwqShtPcJEUcGdL274aVEpmnEAmzKnk20MkLQM3twxe6FKSkwWMLYb
u9OKgdN8Vah0iSBUCpyT52O0x4d65MD/tka0QaGjLg64TtyqhTSKi+XgWtZkSZ7H
lRgn9qMoMXq/h1aeK4MKp5UtJKRxBWRdMijIFFXAfgO8dwbDbyXTo0d2LMR6DiEM
VsvRIjEePoRcGf7bwAbrUeSoNb5Ec32RW3v9GSNn2sWutW+vhD//frZq48zAR6lm
i8Xlf2Ar63Z+qNcFpCZjlNwAbfEuZ1vIr0Pu3oDd0GkOXjxiVMgAwtatTp1nSW7/
wWFuMA5G+wdzU/Z5KcV1p7S8CP1gC8S05LHGwtKKGm9pLbzhauF8GK6Xpa4711T8
oI3uDFIgxaxW8B/ymsM5cNa2QbfYUuQbOFTwXvBcy4gizrbZwwXRSpfaKoDIYAXZ
2kfwmFbu3IbrRypboY58lG3SzbnN94oEMANtsVYuxMimGz2x3ZmHBAFm2l9YPYRz
dBq/RUM7sMIvM1SwqR4tG8rt206L7KpPyW99pUa2AhEdof4iV2bpyujHFdkm83MK
nJ0OF/xcc98Z
=zh4c
-----END PGP SIGNATURE-----
Merge tag 'pm-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These add support for new processors (Sierra Forest, Grand Ridge and
Meteor Lake) to the intel_idle driver, make intel_pstate run on
Emerald Rapids without HWP support and adjust it to utilize EPP values
supplied by the platform firmware, fix issues, clean up code and
improve documentation.
The most significant fix addresses deadlocks in the core system-wide
resume code that occur if async_schedule_dev() attempts to run its
argument function synchronously (for example, due to a memory
allocation failure). It rearranges the code in question which may
increase the system resume time in some cases, but this basically is a
removal of a premature optimization. That optimization will be added
back later, but properly this time.
Specifics:
- Add support for the Sierra Forest, Grand Ridge and Meteorlake SoCs
to the intel_idle cpuidle driver (Artem Bityutskiy, Zhang Rui)
- Do not enable interrupts when entering idle in the haltpoll cpuidle
driver (Borislav Petkov)
- Add Emerald Rapids support in no-HWP mode to the intel_pstate
cpufreq driver (Zhenguo Yao)
- Use EPP values programmed by the platform firmware as balanced
performance ones by default in intel_pstate (Srinivas Pandruvada)
- Add a missing function return value check to the SCMI cpufreq
driver to avoid unexpected behavior (Alexandra Diupina)
- Fix parameter type warning in the armada-8k cpufreq driver (Gregory
CLEMENT)
- Rework trans_stat_show() in the devfreq core code to avoid buffer
overflows (Christian Marangi)
- Synchronize devfreq_monitor_[start/stop] so as to prevent a timer
list corruption from occurring when devfreq governors are switched
frequently (Mukesh Ojha)
- Fix possible deadlocks in the core system-wide PM code that occur
if device-handling functions cannot be executed asynchronously
during resume from system-wide suspend (Rafael J. Wysocki)
- Clean up unnecessary local variable initializations in multiple
places in the hibernation code (Wang chaodong, Li zeming)
- Adjust core hibernation code to avoid missing wakeup events that
occur after saving an image to persistent storage (Chris Feng)
- Update hibernation code to enforce correct ordering during image
compression and decompression (Hongchen Zhang)
- Use kmap_local_page() instead of kmap_atomic() in copy_data_page()
during hibernation and restore (Chen Haonan)
- Adjust documentation and code comments to reflect recent tasks
freezer changes (Kevin Hao)
- Repair excess function parameter description warning in the
hibernation image-saving code (Randy Dunlap)
- Fix _set_required_opps when opp is NULL (Bryan O'Donoghue)
- Use device_get_match_data() in the OPP code for TI (Rob Herring)
- Clean up OPP level and other parts and call dev_pm_opp_set_opp()
recursively for required OPPs (Viresh Kumar)"
* tag 'pm-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (35 commits)
OPP: Rename 'rate_clk_single'
OPP: Pass rounded rate to _set_opp()
OPP: Relocate dev_pm_opp_sync_regulators()
PM: sleep: Fix possible deadlocks in core system-wide PM code
OPP: Move dev_pm_opp_icc_bw to internal opp.h
async: Introduce async_schedule_dev_nocall()
async: Split async_schedule_node_domain()
cpuidle: haltpoll: Do not enable interrupts when entering idle
OPP: Fix _set_required_opps when opp is NULL
OPP: The level field is always of unsigned int type
PM: hibernate: Repair excess function parameter description warning
PM: sleep: Remove obsolete comment from unlock_system_sleep()
cpufreq: intel_pstate: Add Emerald Rapids support in no-HWP mode
Documentation: PM: Adjust freezing-of-tasks.rst to the freezer changes
PM: hibernate: Use kmap_local_page() in copy_data_page()
intel_idle: add Sierra Forest SoC support
intel_idle: add Grand Ridge SoC support
PM / devfreq: Synchronize devfreq_monitor_[start/stop]
cpufreq: armada-8k: Fix parameter type warning
PM: hibernate: Enforce ordering during image compression/decompression
...
- Add dynamic thresholds for trip point crossing detection to prevent
trip point crossing notifications from being sent at incorrect times
or not at all in some cases (Rafael J. Wysocki).
- Fix synchronization issues related to the resume of thermal zones
during a system-wide resume and allow thermal zones to be resumed
concurrently (Rafael J. Wysocki).
- Modify the thermal zone unregistration to wait for the given zone to
go away completely before returning to the caller and rework the
sysfs interface for trip points on top of that (Rafael J. Wysocki).
- Fix a possible NULL pointer dereference in thermal zone registration
error path (Rafael J. Wysocki).
- Clean up the IPA thermal governor and modify it (with the help of a
new governor callback) to avoid allocating and freeing memory every
time its throttling callback is invoked (Lukasz Luba).
- Make the IPA thermal governor handle thermal instance weight changes
via sysfs correctly (Lukasz Luba).
- Update the thermal netlink code to avoid sending messages if there
are no recipients (Stanislaw Gruszka).
- Convert Mediatek Thermal to the json-schema (Rafał Miłecki).
- Fix thermal DT bindings issue on Loongson (Binbin Zhou).
- Fix returning NULL instead of -ENODEV during thermal probe on
Loogsoon (Binbin Zhou).
- Add thermal DT binding for tsens on the SM8650 platform (Neil
Armstrong).
- Add reboot on the critical trip point crossing option feature (Fabio
Estevam).
- Use DEFINE_SIMPLE_DEV_PM_OPS do define PM functions for thermal
suspend/resume on AmLogic (Uwe Kleine-König)
- Add D1/T113s THS controller support to the Sun8i thermal control
driver (Maxim Kiselev)
- Fix example in the thermal DT binding for QCom SPMI (Johan Hovold).
- Fix compilation warning in the tmon utility (Florian Eckert).
- Add support for interrupt-based thermal configuration on Exynos along
with a set of related cleanups (Mateusz Majewski).
- Make the Intel HFI thermal driver enable an HFI instance (eg. processor
package) from its first online CPU and disable it when the last CPU in
it goes offline (Ricardo Neri).
- Fix a kernel-doc warning and a spello in the cpuidle_cooling thermal
driver (Randy Dunlap).
- Move the .get_temp() thermal zone callback presence check to the
thermal zone registration code (Daniel Lezcano).
- Use the for_each_trip() macro for trip points table walks in a few
places in the thermal core (Rafael J. Wysocki).
- Make all trip point updates (via sysfs as well as from the platform
firmware) trigger trip change notifications (Rafael J. Wysocki).
- Drop redundant code from the thermal core and make one function in
it take a const pointer argument (Rafael J. Wysocki).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmWb8iUSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxKLkP/iDsuDwmhZAjbAu2iftk/8ad8Trm2VoK
+9eZ5Eqa8lKEcJLb0RxueTnFT4ppvT/hY99HOG4FM+mCnWeH/Z32N697DhqiUg4v
GZUpeOPzxYgsfOOTeuL5XgfrVMgBjJrJunTXmzgAd8lIhTmRbAMVmFVJ18CJO11O
RHgqvYznYFi5cywA9/NkG2xkhFB0VDoiTuIiuMMV+pMjqF0d5ooBMkhmjvPQ5Rp9
FjNJ7hqiTamAsDPdULAFqhIGGhKZWWFbh4+S+JPCwBW8nqvxyJpemsm20vrwctJR
bSXWQkgkDpWEeg9yrEAOO/Uk9yGd3jiLfkvPBKbK0x/YxGZ4hOYHcbF3cOUvmPYP
5K3ZJ61DNrzB/5S3LY54VYrWmTVRdK6Lk3HYNvfAUYFJZMZ5oMYZLCUmo4SswUdy
UUEIY27H7L18eLhP9zCcKo4njdaVG+vXQn/rJIFOpG0k9OElzPs1X8Dp/m9pKQDR
rDUsMXqB34NUVrIEhjAgqvwF5xHooW8gykpuJgxwBetA9w8Pls2A/mzLsDY3wgdQ
htiANGpKTDqBQSn+HrjzYckv9/R+1tDyTJmEDNZwllA1DJfrOlpCRD2VHRpgTZEA
Ldnq0bhyq6RQnousqxhgpYkIAoGaebs9XasRH0YtBG5gIumeWfqeVzmTcM5xdsNB
yf6RdQy8QunS
=QVyh
-----END PGP SIGNATURE-----
Merge tag 'thermal-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control updates from Rafael Wysocki:
"These add support for the D1/T113s THS controller to the sun8i driver
and a DT-based mechanism for platforms to indicate a preference to
reboot (instead of shutting down) on crossing a critical trip point,
fix issues, make other improvements (in the IPA governor, the Intel
HFI driver, the exynos driver and the thermal netlink interface among
other places) and clean up code.
One long-standing issue addressed here is that trip point crossing
notifications sent to user space might be unreliable due to the
incorrect handling of trip point hysteresis in the thermal core:
multiple notifications might be sent for the same event or there might
be events without any notification at all.
Specifics:
- Add dynamic thresholds for trip point crossing detection to prevent
trip point crossing notifications from being sent at incorrect
times or not at all in some cases (Rafael J. Wysocki)
- Fix synchronization issues related to the resume of thermal zones
during a system-wide resume and allow thermal zones to be resumed
concurrently (Rafael J. Wysocki)
- Modify the thermal zone unregistration to wait for the given zone
to go away completely before returning to the caller and rework the
sysfs interface for trip points on top of that (Rafael J. Wysocki)
- Fix a possible NULL pointer dereference in thermal zone
registration error path (Rafael J. Wysocki)
- Clean up the IPA thermal governor and modify it (with the help of a
new governor callback) to avoid allocating and freeing memory every
time its throttling callback is invoked (Lukasz Luba)
- Make the IPA thermal governor handle thermal instance weight
changes via sysfs correctly (Lukasz Luba)
- Update the thermal netlink code to avoid sending messages if there
are no recipients (Stanislaw Gruszka)
- Convert Mediatek Thermal to the json-schema (Rafał Miłecki)
- Fix thermal DT bindings issue on Loongson (Binbin Zhou)
- Fix returning NULL instead of -ENODEV during thermal probe on
Loogsoon (Binbin Zhou)
- Add thermal DT binding for tsens on the SM8650 platform (Neil
Armstrong)
- Add reboot on the critical trip point crossing option feature
(Fabio Estevam)
- Use DEFINE_SIMPLE_DEV_PM_OPS do define PM functions for thermal
suspend/resume on AmLogic (Uwe Kleine-König)
- Add D1/T113s THS controller support to the Sun8i thermal control
driver (Maxim Kiselev)
- Fix example in the thermal DT binding for QCom SPMI (Johan Hovold)
- Fix compilation warning in the tmon utility (Florian Eckert)
- Add support for interrupt-based thermal configuration on Exynos
along with a set of related cleanups (Mateusz Majewski)
- Make the Intel HFI thermal driver enable an HFI instance (eg.
processor package) from its first online CPU and disable it when
the last CPU in it goes offline (Ricardo Neri)
- Fix a kernel-doc warning and a spello in the cpuidle_cooling
thermal driver (Randy Dunlap)
- Move the .get_temp() thermal zone callback presence check to the
thermal zone registration code (Daniel Lezcano)
- Use the for_each_trip() macro for trip points table walks in a few
places in the thermal core (Rafael J. Wysocki)
- Make all trip point updates (via sysfs as well as from the platform
firmware) trigger trip change notifications (Rafael J. Wysocki)
- Drop redundant code from the thermal core and make one function in
it take a const pointer argument (Rafael J. Wysocki)"
* tag 'thermal-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (64 commits)
thermal: trip: Constify thermal zone argument of thermal_zone_trip_id()
thermal: intel: hfi: Disable an HFI instance when all its CPUs go offline
thermal: intel: hfi: Enable an HFI instance from its first online CPU
thermal: intel: hfi: Refactor enabling code into helper functions
thermal/drivers/exynos: Use set_trips ops
thermal/drivers/exynos: Use BIT wherever possible
thermal/drivers/exynos: Split initialization of TMU and the thermal zone
thermal/drivers/exynos: Stop using the threshold mechanism on Exynos 4210
thermal/drivers/exynos: Simplify regulator (de)initialization
thermal/drivers/exynos: Handle devm_regulator_get_optional return value correctly
thermal/drivers/exynos: Wwitch from workqueue-driven interrupt handling to threaded interrupts
thermal/drivers/exynos: Drop id field
thermal/drivers/exynos: Remove an unnecessary field description
tools/thermal/tmon: Fix compilation warning for wrong format
dt-bindings: thermal: qcom-spmi-adc-tm5/hc: Clean up examples
dt-bindings: thermal: qcom-spmi-adc-tm5/hc: Fix example node names
thermal/drivers/sun8i: Add D1/T113s THS controller support
dt-bindings: thermal: sun8i: Add binding for D1/T113s THS controller
thermal: amlogic: Use DEFINE_SIMPLE_DEV_PM_OPS for PM functions
thermal: amlogic: Make amlogic_thermal_disable() return void
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmWYKUIUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNyHw/+IKnqL1MZ5QS+/HtSzi4jCL47N9yZ
OHLol6XswyEGHH9myKPPGnT5lVA93v98v4ty2mws7EJUSGZQQUntYBPbU9Gi40+B
XDzYSRocoj96sdlKeOJMgaWo3NBRD9HYSoGPDNWZixy6m+bLPk/Dqhn3FabKf1lo
2qQSmstvChFRmVNkmgaQnBCAtWVqla4EJEL0EKX6cspHbuzRNTeJdTPn6Q/zOUVL
O2znOZuEtSVpYS7yg3uJT0hHD8H0GnIciAcDAhyPSBL5Uk5l6gwJiACcdRfLRbgp
QM5Z4qUFdKljV5XBCzYnfhhrx1df08h1SG84El8UK8HgTTfOZfYmawByJRWNJSQE
TdCmtyyvEbfb61CKBFVwD7Tzb9/y8WgcY5N3Un8uCQqRzFIO+6cghHri5NrVhifp
nPFlP4klxLHh3d7ZVekLmCMHbpaacRyJKwLy+f/nwbBEID47jpPkvZFIpbalat+r
QaKRBNWdTeV+GZ+Yu0uWsI029aQnpcO1kAnGg09fl6b/dsmxeKOVWebir25AzQ++
a702S8HRmj80X+VnXHU9a64XeGtBH7Nq0vu0lGHQPgwhSx/9P6/qICEPwsIriRjR
I9OulWt4OBPDtlsonHFgDs+lbnd0Z0GJUwYT8e9pjRDMxijVO9lhAXyglVRmuNR8
to2ByKP5BO+Vh8Y=
=Py+n
-----END PGP SIGNATURE-----
Merge tag 'lsm-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull security module updates from Paul Moore:
- Add three new syscalls: lsm_list_modules(), lsm_get_self_attr(), and
lsm_set_self_attr().
The first syscall simply lists the LSMs enabled, while the second and
third get and set the current process' LSM attributes. Yes, these
syscalls may provide similar functionality to what can be found under
/proc or /sys, but they were designed to support multiple,
simultaneaous (stacked) LSMs from the start as opposed to the current
/proc based solutions which were created at a time when only one LSM
was allowed to be active at a given time.
We have spent considerable time discussing ways to extend the
existing /proc interfaces to support multiple, simultaneaous LSMs and
even our best ideas have been far too ugly to support as a kernel
API; after +20 years in the kernel, I felt the LSM layer had
established itself enough to justify a handful of syscalls.
Support amongst the individual LSM developers has been nearly
unanimous, with a single objection coming from Tetsuo (TOMOYO) as he
is worried that the LSM_ID_XXX token concept will make it more
difficult for out-of-tree LSMs to survive. Several members of the LSM
community have demonstrated the ability for out-of-tree LSMs to
continue to exist by picking high/unused LSM_ID values as well as
pointing out that many kernel APIs rely on integer identifiers, e.g.
syscalls (!), but unfortunately Tetsuo's objections remain.
My personal opinion is that while I have no interest in penalizing
out-of-tree LSMs, I'm not going to penalize in-tree development to
support out-of-tree development, and I view this as a necessary step
forward to support the push for expanded LSM stacking and reduce our
reliance on /proc and /sys which has occassionally been problematic
for some container users. Finally, we have included the linux-api
folks on (all?) recent revisions of the patchset and addressed all of
their concerns.
- Add a new security_file_ioctl_compat() LSM hook to handle the 32-bit
ioctls on 64-bit systems problem.
This patch includes support for all of the existing LSMs which
provide ioctl hooks, although it turns out only SELinux actually
cares about the individual ioctls. It is worth noting that while
Casey (Smack) and Tetsuo (TOMOYO) did not give explicit ACKs to this
patch, they did both indicate they are okay with the changes.
- Fix a potential memory leak in the CALIPSO code when IPv6 is disabled
at boot.
While it's good that we are fixing this, I doubt this is something
users are seeing in the wild as you need to both disable IPv6 and
then attempt to configure IPv6 labeled networking via
NetLabel/CALIPSO; that just doesn't make much sense.
Normally this would go through netdev, but Jakub asked me to take
this patch and of all the trees I maintain, the LSM tree seemed like
the best fit.
- Update the LSM MAINTAINERS entry with additional information about
our process docs, patchwork, bug reporting, etc.
I also noticed that the Lockdown LSM is missing a dedicated
MAINTAINERS entry so I've added that to the pull request. I've been
working with one of the major Lockdown authors/contributors to see if
they are willing to step up and assume a Lockdown maintainer role;
hopefully that will happen soon, but in the meantime I'll continue to
look after it.
- Add a handful of mailmap entries for Serge Hallyn and myself.
* tag 'lsm-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (27 commits)
lsm: new security_file_ioctl_compat() hook
lsm: Add a __counted_by() annotation to lsm_ctx.ctx
calipso: fix memory leak in netlbl_calipso_add_pass()
selftests: remove the LSM_ID_IMA check in lsm/lsm_list_modules_test
MAINTAINERS: add an entry for the lockdown LSM
MAINTAINERS: update the LSM entry
mailmap: add entries for Serge Hallyn's dead accounts
mailmap: update/replace my old email addresses
lsm: mark the lsm_id variables are marked as static
lsm: convert security_setselfattr() to use memdup_user()
lsm: align based on pointer length in lsm_fill_user_ctx()
lsm: consolidate buffer size handling into lsm_fill_user_ctx()
lsm: correct error codes in security_getselfattr()
lsm: cleanup the size counters in security_getselfattr()
lsm: don't yet account for IMA in LSM_CONFIG_COUNT calculation
lsm: drop LSM_ID_IMA
LSM: selftests for Linux Security Module syscalls
SELinux: Add selfattr hooks
AppArmor: Add selfattr hooks
Smack: implement setselfattr and getselfattr hooks
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmWYKJAUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXML3hAAyP/6RwJjUMM9Gsi9ZJNRz79X8uIp
/MYONbzy1xKq/d78jZhjsJm9yIlLk3muVdd3oRcXdmahA5zs3jOKRaM+OfNLOrt6
nuNwS+yaMUYKsKNh/A8TLoxcmBuNAN6lubCKwbccR6hvugqrZuZFkAqCIkiWUDeb
N64u1rL1q/tLI+jI76GIiK4SMMQihF3MMVVTmBWYDiIdrfPhFIHxipLgZaEBUqZM
43+2Y/blV75jcqPTZRgT9tk0LVLkiFtO97qUp9j+pYZbeoJ7CAaDH5A8NVm38yIX
tyzYiTV2lGS3qf/HdLc3OpJQlBVkhbq6cRiLGvyiKQp60xiqYffoL7iFP4/DJMoT
JKzoqXCixINRqdHWYbVY9hHBGg6R5c+1QqZzsnEy2MnBF++iLwJQAMz5JO9Qdh8F
tD6fD82QzvfoNPuP0lBA67preqN3wiH1Zsv6cstoI/6QKCAMeTMZt/ywniBTKhX6
WMmhdmMQKTwGrnCosydAOonYesieiYPhxz6oSeRIqoHRHtNow8rjnFh7DR7yi8uc
nv1x5bDqEI+QTrDys0cAq6fvdUZT2B9joqSovzXUGllRRS7w17WNf1Cu16jMTrHH
FeZ2P1BvKE7YIFkqxcE/RY5NHX3ylxA4unFM8UgIheYiWbWLm5+xrwZdNL30KQJ4
4Hvvy3Buq6kb4HE=
=908g
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"The audit updates are fairly minor with only two patches:
- Send an audit ACK to userspace immediately upon receiving an auditd
registration event as opposed to waiting until the registration has
been fully processed and the audit backlog starts filling the
netlink buffers.
Sending the ACK earlier, as done here, is still safe as the
operation should not fail at the point when the ACK is done, and
doing so helps avoid the ACK being dropped in extreme situations.
- Update the audit MAINTAINERS entry with additional information.
There isn't anything in this update that should be new to regular
contributors or list subscribers, but I'm pushing to start
documenting our processes, conventions, etc. and this seems like an
important part of that"
* tag 'audit-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
MAINTAINERS: update the audit entry
audit: Send netlink ACK before setting connection in auditd_set
many places. The notable patch series are:
- nilfs2 folio conversion from Matthew Wilcox in "nilfs2: Folio
conversions for file paths".
- Additional nilfs2 folio conversion from Ryusuke Konishi in "nilfs2:
Folio conversions for directory paths".
- IA64 remnant removal in Heiko Carstens's "Remove unused code after
IA-64 removal".
- Arnd Bergmann has enabled the -Wmissing-prototypes warning everywhere
in "Treewide: enable -Wmissing-prototypes". This had some followup
fixes:
- Nathan Chancellor has cleaned up the hexagon build in the series
"hexagon: Fix up instances of -Wmissing-prototypes".
- Nathan also addressed some s390 warnings in "s390: A couple of
fixes for -Wmissing-prototypes".
- Arnd Bergmann addresses the same warnings for MIPS in his series
"mips: address -Wmissing-prototypes warnings".
- Baoquan He has made kexec_file operate in a top-down-fitting manner
similar to kexec_load in the series "kexec_file: Load kernel at top of
system RAM if required"
- Baoquan He has also added the self-explanatory "kexec_file: print out
debugging message if required".
- Some checkstack maintenance work from Tiezhu Yang in the series
"Modify some code about checkstack".
- Douglas Anderson has disentangled the watchdog code's logging when
multiple reports are occurring simultaneously. The series is "watchdog:
Better handling of concurrent lockups".
- Yuntao Wang has contributed some maintenance work on the crash code in
"crash: Some cleanups and fixes".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZ2R6AAKCRDdBJ7gKXxA
juCVAP4t76qUISDOSKugB/Dn5E4Nt9wvPY9PcufnmD+xoPsgkQD+JVl4+jd9+gAV
vl6wkJDiJO5JZ3FVtBtC3DFA/xHtVgk=
=kQw+
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2024-01-09-10-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Quite a lot of kexec work this time around. Many singleton patches in
many places. The notable patch series are:
- nilfs2 folio conversion from Matthew Wilcox in 'nilfs2: Folio
conversions for file paths'.
- Additional nilfs2 folio conversion from Ryusuke Konishi in 'nilfs2:
Folio conversions for directory paths'.
- IA64 remnant removal in Heiko Carstens's 'Remove unused code after
IA-64 removal'.
- Arnd Bergmann has enabled the -Wmissing-prototypes warning
everywhere in 'Treewide: enable -Wmissing-prototypes'. This had
some followup fixes:
- Nathan Chancellor has cleaned up the hexagon build in the series
'hexagon: Fix up instances of -Wmissing-prototypes'.
- Nathan also addressed some s390 warnings in 's390: A couple of
fixes for -Wmissing-prototypes'.
- Arnd Bergmann addresses the same warnings for MIPS in his series
'mips: address -Wmissing-prototypes warnings'.
- Baoquan He has made kexec_file operate in a top-down-fitting manner
similar to kexec_load in the series 'kexec_file: Load kernel at top
of system RAM if required'
- Baoquan He has also added the self-explanatory 'kexec_file: print
out debugging message if required'.
- Some checkstack maintenance work from Tiezhu Yang in the series
'Modify some code about checkstack'.
- Douglas Anderson has disentangled the watchdog code's logging when
multiple reports are occurring simultaneously. The series is
'watchdog: Better handling of concurrent lockups'.
- Yuntao Wang has contributed some maintenance work on the crash code
in 'crash: Some cleanups and fixes'"
* tag 'mm-nonmm-stable-2024-01-09-10-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (157 commits)
crash_core: fix and simplify the logic of crash_exclude_mem_range()
x86/crash: use SZ_1M macro instead of hardcoded value
x86/crash: remove the unused image parameter from prepare_elf_headers()
kdump: remove redundant DEFAULT_CRASH_KERNEL_LOW_SIZE
scripts/decode_stacktrace.sh: strip unexpected CR from lines
watchdog: if panicking and we dumped everything, don't re-enable dumping
watchdog/hardlockup: use printk_cpu_sync_get_irqsave() to serialize reporting
watchdog/softlockup: use printk_cpu_sync_get_irqsave() to serialize reporting
watchdog/hardlockup: adopt softlockup logic avoiding double-dumps
kexec_core: fix the assignment to kimage->control_page
x86/kexec: fix incorrect end address passed to kernel_ident_mapping_init()
lib/trace_readwrite.c:: replace asm-generic/io with linux/io
nilfs2: cpfile: fix some kernel-doc warnings
stacktrace: fix kernel-doc typo
scripts/checkstack.pl: fix no space expression between sp and offset
x86/kexec: fix incorrect argument passed to kexec_dprintk()
x86/kexec: use pr_err() instead of kexec_dprintk() when an error occurs
nilfs2: add missing set_freezable() for freezable kthread
kernel: relay: remove relay_file_splice_read dead code, doesn't work
docs: submit-checklist: remove all of "make namespacecheck"
...
are included in this merge do the following:
- Peng Zhang has done some mapletree maintainance work in the
series
"maple_tree: add mt_free_one() and mt_attr() helpers"
"Some cleanups of maple tree"
- In the series "mm: use memmap_on_memory semantics for dax/kmem"
Vishal Verma has altered the interworking between memory-hotplug
and dax/kmem so that newly added 'device memory' can more easily
have its memmap placed within that newly added memory.
- Matthew Wilcox continues folio-related work (including a few
fixes) in the patch series
"Add folio_zero_tail() and folio_fill_tail()"
"Make folio_start_writeback return void"
"Fix fault handler's handling of poisoned tail pages"
"Convert aops->error_remove_page to ->error_remove_folio"
"Finish two folio conversions"
"More swap folio conversions"
- Kefeng Wang has also contributed folio-related work in the series
"mm: cleanup and use more folio in page fault"
- Jim Cromie has improved the kmemleak reporting output in the
series "tweak kmemleak report format".
- In the series "stackdepot: allow evicting stack traces" Andrey
Konovalov to permits clients (in this case KASAN) to cause
eviction of no longer needed stack traces.
- Charan Teja Kalla has fixed some accounting issues in the page
allocator's atomic reserve calculations in the series "mm:
page_alloc: fixes for high atomic reserve caluculations".
- Dmitry Rokosov has added to the samples/ dorectory some sample
code for a userspace memcg event listener application. See the
series "samples: introduce cgroup events listeners".
- Some mapletree maintanance work from Liam Howlett in the series
"maple_tree: iterator state changes".
- Nhat Pham has improved zswap's approach to writeback in the
series "workload-specific and memory pressure-driven zswap
writeback".
- DAMON/DAMOS feature and maintenance work from SeongJae Park in
the series
"mm/damon: let users feed and tame/auto-tune DAMOS"
"selftests/damon: add Python-written DAMON functionality tests"
"mm/damon: misc updates for 6.8"
- Yosry Ahmed has improved memcg's stats flushing in the series
"mm: memcg: subtree stats flushing and thresholds".
- In the series "Multi-size THP for anonymous memory" Ryan Roberts
has added a runtime opt-in feature to transparent hugepages which
improves performance by allocating larger chunks of memory during
anonymous page faults.
- Matthew Wilcox has also contributed some cleanup and maintenance
work against eh buffer_head code int he series "More buffer_head
cleanups".
- Suren Baghdasaryan has done work on Andrea Arcangeli's series
"userfaultfd move option". UFFDIO_MOVE permits userspace heap
compaction algorithms to move userspace's pages around rather than
UFFDIO_COPY'a alloc/copy/free.
- Stefan Roesch has developed a "KSM Advisor", in the series
"mm/ksm: Add ksm advisor". This is a governor which tunes KSM's
scanning aggressiveness in response to userspace's current needs.
- Chengming Zhou has optimized zswap's temporary working memory
use in the series "mm/zswap: dstmem reuse optimizations and
cleanups".
- Matthew Wilcox has performed some maintenance work on the
writeback code, both code and within filesystems. The series is
"Clean up the writeback paths".
- Andrey Konovalov has optimized KASAN's handling of alloc and
free stack traces for secondary-level allocators, in the series
"kasan: save mempool stack traces".
- Andrey also performed some KASAN maintenance work in the series
"kasan: assorted clean-ups".
- David Hildenbrand has gone to town on the rmap code. Cleanups,
more pte batching, folio conversions and more. See the series
"mm/rmap: interface overhaul".
- Kinsey Ho has contributed some maintenance work on the MGLRU
code in the series "mm/mglru: Kconfig cleanup".
- Matthew Wilcox has contributed lruvec page accounting code
cleanups in the series "Remove some lruvec page accounting
functions".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
=0NHs
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Peng Zhang has done some mapletree maintainance work in the series
'maple_tree: add mt_free_one() and mt_attr() helpers'
'Some cleanups of maple tree'
- In the series 'mm: use memmap_on_memory semantics for dax/kmem'
Vishal Verma has altered the interworking between memory-hotplug
and dax/kmem so that newly added 'device memory' can more easily
have its memmap placed within that newly added memory.
- Matthew Wilcox continues folio-related work (including a few fixes)
in the patch series
'Add folio_zero_tail() and folio_fill_tail()'
'Make folio_start_writeback return void'
'Fix fault handler's handling of poisoned tail pages'
'Convert aops->error_remove_page to ->error_remove_folio'
'Finish two folio conversions'
'More swap folio conversions'
- Kefeng Wang has also contributed folio-related work in the series
'mm: cleanup and use more folio in page fault'
- Jim Cromie has improved the kmemleak reporting output in the series
'tweak kmemleak report format'.
- In the series 'stackdepot: allow evicting stack traces' Andrey
Konovalov to permits clients (in this case KASAN) to cause eviction
of no longer needed stack traces.
- Charan Teja Kalla has fixed some accounting issues in the page
allocator's atomic reserve calculations in the series 'mm:
page_alloc: fixes for high atomic reserve caluculations'.
- Dmitry Rokosov has added to the samples/ dorectory some sample code
for a userspace memcg event listener application. See the series
'samples: introduce cgroup events listeners'.
- Some mapletree maintanance work from Liam Howlett in the series
'maple_tree: iterator state changes'.
- Nhat Pham has improved zswap's approach to writeback in the series
'workload-specific and memory pressure-driven zswap writeback'.
- DAMON/DAMOS feature and maintenance work from SeongJae Park in the
series
'mm/damon: let users feed and tame/auto-tune DAMOS'
'selftests/damon: add Python-written DAMON functionality tests'
'mm/damon: misc updates for 6.8'
- Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
memcg: subtree stats flushing and thresholds'.
- In the series 'Multi-size THP for anonymous memory' Ryan Roberts
has added a runtime opt-in feature to transparent hugepages which
improves performance by allocating larger chunks of memory during
anonymous page faults.
- Matthew Wilcox has also contributed some cleanup and maintenance
work against eh buffer_head code int he series 'More buffer_head
cleanups'.
- Suren Baghdasaryan has done work on Andrea Arcangeli's series
'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
compaction algorithms to move userspace's pages around rather than
UFFDIO_COPY'a alloc/copy/free.
- Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
Add ksm advisor'. This is a governor which tunes KSM's scanning
aggressiveness in response to userspace's current needs.
- Chengming Zhou has optimized zswap's temporary working memory use
in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.
- Matthew Wilcox has performed some maintenance work on the writeback
code, both code and within filesystems. The series is 'Clean up the
writeback paths'.
- Andrey Konovalov has optimized KASAN's handling of alloc and free
stack traces for secondary-level allocators, in the series 'kasan:
save mempool stack traces'.
- Andrey also performed some KASAN maintenance work in the series
'kasan: assorted clean-ups'.
- David Hildenbrand has gone to town on the rmap code. Cleanups, more
pte batching, folio conversions and more. See the series 'mm/rmap:
interface overhaul'.
- Kinsey Ho has contributed some maintenance work on the MGLRU code
in the series 'mm/mglru: Kconfig cleanup'.
- Matthew Wilcox has contributed lruvec page accounting code cleanups
in the series 'Remove some lruvec page accounting functions'"
* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
mm, treewide: introduce NR_PAGE_ORDERS
selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
selftests/mm: skip test if application doesn't has root privileges
selftests/mm: conform test to TAP format output
selftests: mm: hugepage-mmap: conform to TAP format output
selftests/mm: gup_test: conform test to TAP format output
mm/selftests: hugepage-mremap: conform test to TAP format output
mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
mm/memcontrol: remove __mod_lruvec_page_state()
mm/khugepaged: use a folio more in collapse_file()
slub: use a folio in __kmalloc_large_node
slub: use folio APIs in free_large_kmalloc()
slub: use alloc_pages_node() in alloc_slab_page()
mm: remove inc/dec lruvec page state functions
mm: ratelimit stat flush from workingset shrinker
kasan: stop leaking stack trace handles
mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
mm/mglru: add dummy pmd_dirty()
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmWWu9EACgkQu+CwddJF
iJpXvQf/aGL7uEY57VpTm0t4gPwoZ9r2P89HxI/nQs9XgVzDcBmVp/cC0LDvSdcm
t91kJO538KeGjMgvlhLMTEuoShH5FlPs6cOwrGAYUoAGa4NwiOpGvliGky+nNHqY
w887ZgSzVLq0UOuSvn86N6enumMvewt4V+872+OWo6O1HWOJhC0SgHTIa8QPQtwb
yZ9BghO5IqMRXiZEsSIwyO+tQHcaU6l2G5huFXzgMFUhkQqAB9KTFc3h6rYI+i80
L4ppNXo2KNPGTDRb9dA8LNMWgvmfjhCb7chs8o1zSY2PwZlkzOix7EUBLCAIbc/2
EIaFC8AsZjfT47D1t72r8QpHB+C14Q==
=J+E7
-----END PGP SIGNATURE-----
Merge tag 'slab-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- SLUB: delayed freezing of CPU partial slabs (Chengming Zhou)
Freezing is an operation involving double_cmpxchg() that makes a slab
exclusive for a particular CPU. Chengming noticed that we use it also
in situations where we are not yet installing the slab as the CPU
slab, because freezing also indicates that the slab is not on the
shared list. This results in redundant freeze/unfreeze operation and
can be avoided by marking separately the shared list presence by
reusing the PG_workingset flag.
This approach neatly avoids the issues described in 9b1ea29bc0
("Revert "mm, slub: consider rest of partial list if acquire_slab()
fails"") as we can now grab a slab from the shared list in a quick
and guaranteed way without the cmpxchg_double() operation that
amplifies the lock contention and can fail.
As a result, lkp has reported 34.2% improvement of
stress-ng.rawudp.ops_per_sec
- SLAB removal and SLUB cleanups (Vlastimil Babka)
The SLAB allocator has been deprecated since 6.5 and nobody has
objected so far. We agreed at LSF/MM to wait until the next LTS,
which is 6.6, so we should be good to go now.
This doesn't yet erase all traces of SLAB outside of mm/ so some dead
code, comments or documentation remain, and will be cleaned up
gradually (some series are already in the works).
Removing the choice of allocators has already allowed to simplify and
optimize the code wiring up the kmalloc APIs to the SLUB
implementation.
* tag 'slab-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (34 commits)
mm/slub: free KFENCE objects in slab_free_hook()
mm/slub: handle bulk and single object freeing separately
mm/slub: introduce __kmem_cache_free_bulk() without free hooks
mm/slub: fix bulk alloc and free stats
mm/slub: optimize free fast path code layout
mm/slub: optimize alloc fastpath code layout
mm/slub: remove slab_alloc() and __kmem_cache_alloc_lru() wrappers
mm/slab: move kmalloc() functions from slab_common.c to slub.c
mm/slab: move kmalloc_slab() to mm/slab.h
mm/slab: move kfree() from slab_common.c to slub.c
mm/slab: move struct kmem_cache_node from slab.h to slub.c
mm/slab: move memcg related functions from slab.h to slub.c
mm/slab: move pre/post-alloc hooks from slab.h to slub.c
mm/slab: consolidate includes in the internal mm/slab.h
mm/slab: move the rest of slub_def.h to mm/slab.h
mm/slab: move struct kmem_cache_cpu declaration to slub.c
mm/slab: remove mm/slab.c and slab_def.h
mm/mempool/dmapool: remove CONFIG_DEBUG_SLAB ifdefs
mm/slab: remove CONFIG_SLAB code from slab common code
cpu/hotplug: remove CPUHP_SLAB_PREPARE hooks
...
The allocation request for swiotlb contiguous memory greater than
128*2KB cannot be fulfilled because it exceeds the maximum contiguous
memory limit. If the swiotlb memory we allocate is larger than 128*2KB,
swiotlb_find_slots() will still schedule the allocation of a new memory
pool, which will increase memory overhead.
Fix it by adding a check with alloc_size no more than 128*2KB before
scheduling the allocation of a new memory pool in swiotlb_find_slots().
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
- Yafang Shao added task_get_cgroup1() helper to enable a similar BPF helper
so that BPF progs can be more useful on cgroup1 hierarchies. While cgroup1
is mostly in maintenance mode, this addition is very small while having an
outsized usefulness for users who are still on cgroup1. Yafang also
optimized root cgroup list access by making it RCU protected in the
process.
- Waiman Long optimized rstat operation leading to substantially lower and
more consistent lock hold time while flushing the hierarchical statistics.
As the lock can be acquired briefly in various hot paths, this reduction
has cascading benefits.
- Waiman also improved the quality of isolation for cpuset's isolated
partitions. CPUs which are allocated to isolated partitions are now
excluded from running unbound work items and cpu_is_isolated() test which
is used by vmstat and memcg to reduce interference now includes cpuset
isolated CPUs. While it isn't there yet, the hope is eventually reaching
parity with the isolation level provided by the `isolcpus` boot param but
in a dynamic manner.
This involved a couple workqueue patches which were applied directly to
cgroup/for-6.8 rather than ping-ponged through the wq tree. This was
because the wq code change was small and the area is usually very static
and unlikely to cause conflicts. However, luck had it that there was a wq
bug fix in the area during the 6.7 cycle which caused a conflict. The
conflict is contextual but can be a bit confusing to resolve, so there is
one merge from wq/for-6.7-fixes.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZYnuJg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGQ5kAP9nMMWqi+R1HeG7+hWROTVjQZ0OM9KRcpZ1TmjF
FNbkJgEAzt+sPnoWwYDTSI7pkNeZ/IM7x1qkkKGvENNtUXrz0Ac=
=PyYN
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- Yafang Shao added task_get_cgroup1() helper to enable a similar BPF
helper so that BPF progs can be more useful on cgroup1 hierarchies.
While cgroup1 is mostly in maintenance mode, this addition is very
small while having an outsized usefulness for users who are still on
cgroup1. Yafang also optimized root cgroup list access by making it
RCU protected in the process.
- Waiman Long optimized rstat operation leading to substantially lower
and more consistent lock hold time while flushing the hierarchical
statistics. As the lock can be acquired briefly in various hot paths,
this reduction has cascading benefits.
- Waiman also improved the quality of isolation for cpuset's isolated
partitions. CPUs which are allocated to isolated partitions are now
excluded from running unbound work items and cpu_is_isolated() test
which is used by vmstat and memcg to reduce interference now includes
cpuset isolated CPUs. While it isn't there yet, the hope is
eventually reaching parity with the isolation level provided by the
`isolcpus` boot param but in a dynamic manner.
* tag 'cgroup-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Move rcu_head up near the top of cgroup_root
cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check
cgroup: Avoid false cacheline sharing of read mostly rstat_cpu
cgroup/rstat: Optimize cgroup_rstat_updated_list()
cgroup: Fix documentation for cpu.idle
cgroup/cpuset: Expose cpuset.cpus.isolated
workqueue: Move workqueue_set_unbound_cpumask() and its helpers inside CONFIG_SYSFS
cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask
cgroup/cpuset: Keep track of CPUs in isolated partitions
selftests/cgroup: Minor code cleanup and reorganization of test_cpuset_prs.sh
workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
selftests: cgroup: Fixes a typo in a comment
cgroup: Add a new helper for cgroup1 hierarchy
cgroup: Add annotation for holding namespace_sem in current_cgns_cgroup_from_root()
cgroup: Eliminate the need for cgroup_mutex in proc_cgroup_show()
cgroup: Make operations on the cgroup root_list RCU safe
cgroup: Remove unnecessary list_empty()
- Energy scheduling:
- Consolidate how the max compute capacity is
used in the scheduler and how we calculate
the frequency for a level of utilization.
- Rework interface between the scheduler and
the schedutil governor
- Simplify the util_est logic
- Deadline scheduler:
- Work more towards reducing SCHED_DEADLINE
starvation of low priority tasks (e.g., SCHED_OTHER)
tasks when higher priority tasks monopolize CPU
cycles, via the introduction of 'deadline servers'
(nested/2-level scheduling).
"Fair servers" to make use of this facility are
not introduced yet.
- EEVDF:
- Introduce O(1) fastpath for EEVDF task selection
- NUMA balancing:
- Tune the NUMA-balancing vma scanning logic some more,
to better distribute the probability
of a particular vma getting scanned.
- Plus misc fixes, cleanups and updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWcASMRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jLbg/+NOwF18M6klF1/3jUaV1PU09vRzYnnA7w
oF7Tru7JLV+/vZK+rwI1zxzj5Nj3sVBQPIyp1embEHx7Z/QH8MIaIVpcSFsDDCYY
Q8n6ZVRB+lKWEo5+Ti6JEJftDAWuLHXwFWDa57oWPuR0Tc736+zYHUfj7jdKk0RI
nT/lnOT6hXU8q26O4QFrBrrhvCCxc4byo7buKPQfqie0bDA70ppIWkFQoQME6mvQ
US9jvOyUipOiPV06DPwFvPDJUQBGq2VdJNk+5zCEtcqEfLREuo/Xq1Ww1x1BWaZI
761532EuDo73iMK4IFZrvVmj1ioz957qbje11MSSkDdKj692xxjXyvnY0NBvZuho
Ueog/jQ4D4I2qu7pPSCF8UfnI/Hw4Q+KJ89j3pcywRm4hmCTf9k3MGpAaVLVxH7G
e5REZ5MSsFZi4Cs+zF87Of5KCKLhTr1qSetNtShinKahg06WZ+MZ8tW4jb52qy0j
F8PMlvfBI3f7SOtA8s2P26mDGQ21YQehN2d5P+Fbwj/U3fjIlSTOyx6NwLpFwYaS
Vf+fctchGFV1Sh7c2JjCh+ecYfXx3ghT/pvyPOImJtxtCKSRUQ8c26ApC1OsWfOE
FdHv4f2dPqcyswCZzIv/2fyDXc9eaS2E05EMDNqVuMCGnzidzSs81n7hBioNMrnH
ZgHK90TmEbw=
=wTVh
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Energy scheduling:
- Consolidate how the max compute capacity is used in the scheduler
and how we calculate the frequency for a level of utilization.
- Rework interface between the scheduler and the schedutil governor
- Simplify the util_est logic
Deadline scheduler:
- Work more towards reducing SCHED_DEADLINE starvation of low
priority tasks (e.g., SCHED_OTHER) tasks when higher priority tasks
monopolize CPU cycles, via the introduction of 'deadline servers'
(nested/2-level scheduling).
"Fair servers" to make use of this facility are not introduced yet.
EEVDF:
- Introduce O(1) fastpath for EEVDF task selection
NUMA balancing:
- Tune the NUMA-balancing vma scanning logic some more, to better
distribute the probability of a particular vma getting scanned.
Plus misc fixes, cleanups and updates"
* tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
sched/fair: Fix tg->load when offlining a CPU
sched/fair: Remove unused 'next_buddy_marked' local variable in check_preempt_wakeup_fair()
sched/fair: Use all little CPUs for CPU-bound workloads
sched/fair: Simplify util_est
sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)
arm64/amu: Use capacity_ref_freq() to set AMU ratio
cpufreq/cppc: Set the frequency used for computing the capacity
cpufreq/cppc: Move and rename cppc_cpufreq_{perf_to_khz|khz_to_perf}()
energy_model: Use a fixed reference frequency
cpufreq/schedutil: Use a fixed reference frequency
cpufreq: Use the fixed and coherent frequency for scaling capacity
sched/topology: Add a new arch_scale_freq_ref() method
freezer,sched: Clean saved_state when restoring it during thaw
sched/fair: Update min_vruntime for reweight_entity() correctly
sched/doc: Update documentation after renames and synchronize Chinese version
sched/cpufreq: Rework iowait boost
sched/cpufreq: Rework schedutil governor performance estimation
sched/pelt: Avoid underestimation of task utilization
sched/timers: Explain why idle task schedules out on remote timer enqueue
sched/cpuidle: Comment about timers requirements VS idle handler
...
- Add branch stack counters ABI extension to better capture
the growing amount of information the PMU exposes via
branch stack sampling. There's matching tooling support.
- Fix race when creating the nr_addr_filters sysfs file
- Add Intel Sierra Forest and Grand Ridge intel/cstate
PMU support.
- Add Intel Granite Rapids, Sierra Forest and Grand Ridge
uncore PMU support.
- Misc cleanups & fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWb4lURHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jlnQ/+NSzrPQ9hEiS5a1iMMxdwC6IoXCmeFVsv
s5NsGaVC7FEgjm3oCfvQlP63HolMO9R7TNLZsgINzOda5IHtE7WUcgBK7gbZr+NT
WabdTyFrdmUr+Br0rLrEe0bxDSQU7r41ptqKE5HZRM9/3SbLhWgaXSJbfFAG2JV0
xboZ/2qzb7Puch6VTWv1YhuIpr1Pi817As4SOo7JR4V8jBB2bh2eZ7XBN1z23aw2
xuglbYml5gs4dOaFTqkRLWyn2PmrZ9wYKcdp63FVUscZ4LxvSw749BxEcNpTbxLp
PT6uXIKw9PnStNfscfrsk6fDocVJzqrOK71blgiOKbmhWTE0UimEpFf1Hd3ooewg
hFp3hmkE5Bc2MTUnwivkBxj96fz5rXH+3+Cue/5NsvDNlhlkswIIxzDw8M1G4rOI
KQMDUYFOhQPa3Hi1lSp2SgHI5AcYHudepr/Z3QMxD3iLs+Wo2cmDcp8d2VrMLfb7
GHSITG592iYcZPYsJosxby8CSFaUPxIl9l3AODQwWuEjd4PcOYa6iB2HbEa/mC3R
wXcs8mFIMAaH/HRYUlqUDA5pOqN5chb13iDtS4JqJqBKyWgdrDLCVxoZSQvB64+I
bldyy1e5oQSVVwJ42WLkUK3Eld2x75ki1JLZFwMgYuOgQv3jfu2VNenUWJ5ig0La
dPpHP8PwOoc=
=2O/5
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
- Add branch stack counters ABI extension to better capture the growing
amount of information the PMU exposes via branch stack sampling.
There's matching tooling support.
- Fix race when creating the nr_addr_filters sysfs file
- Add Intel Sierra Forest and Grand Ridge intel/cstate PMU support
- Add Intel Granite Rapids, Sierra Forest and Grand Ridge uncore PMU
support
- Misc cleanups & fixes
* tag 'perf-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Factor out topology_gidnid_map()
perf/x86/intel/uncore: Fix NULL pointer dereference issue in upi_fill_topology()
perf/x86/amd: Reject branch stack for IBS events
perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge
perf/x86/intel/uncore: Support IIO free-running counters on GNR
perf/x86/intel/uncore: Support Granite Rapids
perf/x86/uncore: Use u64 to replace unsigned for the uncore offsets array
perf/x86/intel/uncore: Generic uncore_get_uncores and MMIO format of SPR
perf: Fix the nr_addr_filters fix
perf/x86/intel/cstate: Add Grand Ridge support
perf/x86/intel/cstate: Add Sierra Forest support
x86/smp: Export symbol cpu_clustergroup_mask()
perf/x86/intel/cstate: Cleanup duplicate attr_groups
perf/core: Fix narrow startup race when creating the perf nr_addr_filters sysfs file
perf/x86/intel: Support branch counters logging
perf/x86/intel: Reorganize attrs and is_visible
perf: Add branch_sample_call_stack
perf/x86: Add PERF_X86_EVENT_NEEDS_BRANCH_STACK flag
perf: Add branch stack counters
- Various preparatory cleanups & enhancements of the timer-wheel code,
in preparation for the WIP 'pull timers at expiry' timer migration model
series (which will replace the current 'push timers at enqueue' migration
model), by Anna-Maria Behnsen:
- Update comments and clean up confusing variable names
- Add debug check to warn about time travel
- Improve/expand timer-wheel tracepoints
- Optimize away unnecessary IPIs for deferrable timers
- Restructure & clean up next_expiry_recalc()
- Clean up forward_timer_base()
- Introduce __forward_timer_base() and use it to simplify
and micro-optimize get_next_timer_interrupt()
- Restructure the get_next_timer_interrupt()'s idle logic
for better readability and to enable a minor optimization.
- Fix the nextevt calculation when no timers are pending
- Fix the sysfs_get_uname() prototype declaration
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWb0XIRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1h9kg/9FpjbiogIKrDXb/pJHyhYkK6jzN4aNrQo
wsOz4FDKyvioqLfr5ndpFE++DwsyzUibPfHJzfwD5IilTyolm2jW44VSCBzNdm72
lI6NGIcIxmIeCuO4bLmJj/fuQAugQ+ajmA2pyC/aBSO4Q2jtnxjYMGiV9zMWmOsa
E816CK5zp6IVx+w0GWwK5yW5YR5dscSQCD+mBYVAdTWYoRNTy6xonsMTRuNi0ePx
donetpu0eWG9NGwUdax/65oKVLZMR/rKAI/3pInhkOS9BsL2o8Rt4o2Y+9aBFi2t
2m+ZbFg5hngJwhP8Mfc7A+I3qiWgCOMGNGrebyzlwb+0PnNBPzrwnNPveW3R9QRx
LMxSU3aH66bXeX+YCF4y2tjWSmYooAnztPstUGrs+sq36+NF0wyY6ip/36S6MRGk
zjedqWnrHQeeZlzOLiKNzB+FIBnOt6bhZEh1Wk1/zwi9UWxw+7+I6tR0b57NqRxZ
VHq38fp+O2OEAj5JvwJ6FomOd+onqQ2wwveG5OMCa+hwM2ZCuVXQRYgM2ohMfwl3
BMSd3KMZsBiHT0zyun3k/uJ7CaIjArPh016baSS10ArSl9sE64aJj7ELtuSLqtaD
idJFXu3tv6VgDT2rMhLWNHvzQoK+gb8+/qnms4Ea+wY2f7nubi0aH20qHfugkgis
4KOkw9cQw0U=
=n40J
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer subsystem updates from Ingo Molnar:
- Various preparatory cleanups & enhancements of the timer-wheel code,
in preparation for the WIP 'pull timers at expiry' timer migration
model series (which will replace the current 'push timers at enqueue'
migration model), by Anna-Maria Behnsen:
- Update comments and clean up confusing variable names
- Add debug check to warn about time travel
- Improve/expand timer-wheel tracepoints
- Optimize away unnecessary IPIs for deferrable timers
- Restructure & clean up next_expiry_recalc()
- Clean up forward_timer_base()
- Introduce __forward_timer_base() and use it to simplify and
micro-optimize get_next_timer_interrupt()
- Restructure the get_next_timer_interrupt()'s idle logic for better
readability and to enable a minor optimization.
- Fix the nextevt calculation when no timers are pending
- Fix the sysfs_get_uname() prototype declaration
* tag 'timers-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix nextevt calculation when no timers are pending
timers: Rework idle logic
timers: Use already existing function for forwarding timer base
timers: Split out forward timer base functionality
timers: Clarify check in forward_timer_base()
timers: Move store of next event into __next_timer_interrupt()
timers: Do not IPI for deferrable timers
tracing/timers: Add tracepoint for tracking timer base is_idle flag
tracing/timers: Enhance timer_start tracepoint
tick-sched: Warn when next tick seems to be in the past
tick/sched: Cleanup confusing variables
tick-sched: Fix function names in comments
time: Make sysfs_get_uname() function visible in header
and always-inline them, to improve syscall entry performance on s390 by ~11%.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWbxQ4RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1h5Cw//TEIlWPCLpIeiDsOCKb5g2e4U+AatNIGt
ysmCvTWsKOiBEItDbZpwdpcdv/Ed41UXkS7Zmwetw81P50rz/i+kIJZW4gdl9GiV
qhjj0gbhGQ43myQkGdYIcmdVaHl9fuyDGZSai6c17zgdOoL5CvCGGiL5Dn4Cn36x
skm8P66r9DuM9cLTnhqQHMKp7cf4HQAX+awhFeppCquhzh3M2I8GsUVrT7tZV+Jw
zOMLVjsI8Va4JyGsl07DoqWlyFWcoYvJ5ayzvDCaBxgeFIK9uZgwkKV0HT9q5tvg
RhsHQK4zbxgkaMMCgEt/WdT14YesO2+5+ml91Zkjp2NMud0O0gmd2YXZju1aOQQw
XCL3pm6DB4oN+IkW9lo6k3rqo9PEip9rt/FAfkNLeb50elHfSZSvE1ZxXSQwx5N4
pHDNMcK6SMsJhEdJInNotViKrpXX0Rjr7x1pY/2DA9bMP/jX/9+J3ODuGCDZrvjp
eq4JM15VSq6tVmg+LMcszThWz+9gIaLFAqQwFt3G082ANDkOvg0mK7T65gccDuyA
Gl6f/p3tAYHYxOI9KOBN6Daq3QAqlMT+M4YgNbbv8fanWYIzRd3U/Y+YrUCnryu8
Db+8FHlUkbJb/clUofJ5nj0Ene7xReJX/8m8XxA95Cc/UGYU8w0cachXDoPyKZUP
xtFW3xn8K4M=
=VHW9
-----END PGP SIGNATURE-----
Merge tag 'core-entry-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull generic syscall updates from Ingo Molnar:
"Move various entry functions from kernel/entry/common.c to a header
file, and always-inline them, to improve syscall entry performance
on s390 by ~11%"
* tag 'core-entry-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Move syscall_enter_from_user_mode() to header file
entry: Move enter_from_user_mode() to header file
entry: Move exit to usermode functions to header file
- lock guards:
- Use lock guards in the ptrace code
- Introduce conditional guards to extend to conditional lock
primitives like mutex_trylock()/mutex_lock_interruptible()/etc.
- lockdep:
- Optimize 'struct lock_class' to be smaller
- Update file patterns in MAINTAINERS
- mutexes: Document mutex lifetime rules a bit more
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWbur0RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1i19xAAtAZs8Xi5X3IuLoygfMg81k91GvavLpv/
sVYGbKLm6+r0dOr4w3w76jVyWXNrVtTUFJysfK6nTNInEO+P0stL6aDmQVDYIaCw
jaQDO/UgPfxUxnBgMy7dUvsnmxBw4GO3zcQpx8GuAuuuEtNcQcnMoP2t/RvBpBEI
K2xCCkIT5LQPKbu9LkVZ/fAhZHtMypipuIMtEpfVYEKCMEwDmXoHuj3SNo4LGt04
0wZ5hHVhTcOQDm1/tjSXKsmxwQRVhI6OCcjXJ8hxDiPXg9vWO0+CzOibujl/jfhs
Dw45D7wwiCHFUsmKHKz335Jtk8wBpgWUtlxZ+GB/TVfAQkLv1xBdoE1F8yLcBEfx
yBNxh+0wecPWSrIsRLZEotRRu7obmBsIW04qUVP3oLWIu3tzhcud9gR43CeeplkT
RW1lkdLt5SzN//MLx1cWPqKjfi6wiolaveD1RrqIbVXRhhrFH3o7+EDEOqGNwOvu
0D9yx4Su7SYlAYXgGnGLnnmcmDt2cEoOXkY3K608sYW45dZOpNIu56+EfHiVH1fI
Q0lZNHXNDkX85Zzxoam6Y0SYo74lXYBmtL+RSYvvaKjRYgrJGxGcOciK+/c8kmY2
+Eazx5sxoR5nKDMP8MrZCZ5CQtqdPB9IYk7hUvQ31BYL1LSKhA5Mi/xjsFwMSNwv
pEBazHxty58=
=fOTd
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molar:
"Lock guards:
- Use lock guards in the ptrace code
- Introduce conditional guards to extend to conditional lock
primitives like mutex_trylock()/mutex_lock_interruptible()/etc.
lockdep:
- Optimize 'struct lock_class' to be smaller
- Update file patterns in MAINTAINERS
mutexes:
- Document mutex lifetime rules a bit more"
* tag 'locking-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/mutex: Clarify that mutex_unlock(), and most other sleeping locks, can still use the lock object after it's unlocked
locking/mutex: Document that mutex_unlock() is non-atomic
ptrace: Convert ptrace_attach() to use lock guards
locking/lockdep: Slightly reorder 'struct lock_class' to save some memory
MAINTAINERS: Add include/linux/lockdep*.h
cleanup: Add conditional guard support
commit 23baf831a3 ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive. This has caused
issues with code that was not yet upstream and depended on the previous
definition.
To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.
Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
NR_PAGE_ORDERS defines the number of page orders supported by the page
allocator, ranging from 0 to MAX_ORDER, MAX_ORDER + 1 in total.
NR_PAGE_ORDERS assists in defining arrays of page orders and allows for
more natural iteration over them.
[kirill.shutemov@linux.intel.com: fixup for kerneldoc warning]
Link: https://lkml.kernel.org/r/20240101111512.7empzyifq7kxtzk3@box
Link: https://lkml.kernel.org/r/20231228144704.14033-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUzBQAKCRCRxhvAZXjc
ot+3AQCZw1PBD4azVxFMWH76qwlAGoVIFug4+ogKU/iUa4VLygEA2FJh1vLJw5iI
LpgBEIUTPVkwtzinAW94iJJo1Vr7NAI=
=p6PB
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs iov_iter cleanups from Christian Brauner:
"This contains a minor cleanup. The patches drop an unused argument
from import_single_range() allowing to replace import_single_range()
with import_ubuf() and dropping import_single_range() completely"
* tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iov_iter: replace import_single_range() with import_ubuf()
iov_iter: remove unused 'iov' argument from import_single_range()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUxRQAKCRCRxhvAZXjc
ov/QAQDzvge3oQ9MEymmOiyzzcF+HhAXBr+9oEsYJjFc1p0TsgEA61gXjZo7F1jY
KBqd6znOZCR+Waj0kIVJRAo/ISRBqQc=
=0bRl
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Add Jan Kara as VFS reviewer
- Show correct device and inode numbers in proc/<pid>/maps for vma
files on stacked filesystems. This is now easily doable thanks to
the backing file work from the last cycles. This comes with
selftests
Cleanups:
- Remove a redundant might_sleep() from wait_on_inode()
- Initialize pointer with NULL, not 0
- Clarify comment on access_override_creds()
- Rework and simplify eventfd_signal() and eventfd_signal_mask()
helpers
- Process aio completions in batches to avoid needless wakeups
- Completely decouple struct mnt_idmap from namespaces. We now only
keep the actual idmapping around and don't stash references to
namespaces
- Reformat maintainer entries to indicate that a given subsystem
belongs to fs/
- Simplify fput() for files that were never opened
- Get rid of various pointless file helpers
- Rename various file helpers
- Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from
last cycle
- Make relatime_need_update() return bool
- Use GFP_KERNEL instead of GFP_USER when allocating superblocks
- Replace deprecated ida_simple_*() calls with their current ida_*()
counterparts
Fixes:
- Fix comments on user namespace id mapping helpers. They aren't
kernel doc comments so they shouldn't be using /**
- s/Retuns/Returns/g in various places
- Add missing parameter documentation on can_move_mount_beneath()
- Rename i_mapping->private_data to i_mapping->i_private_data
- Fix a false-positive lockdep warning in pipe_write() for watch
queues
- Improve __fget_files_rcu() code generation to improve performance
- Only notify writer that pipe resizing has finished after setting
pipe->max_usage otherwise writers are never notified that the pipe
has been resized and hang
- Fix some kernel docs in hfsplus
- s/passs/pass/g in various places
- Fix kernel docs in ntfs
- Fix kcalloc() arguments order reported by gcc 14
- Fix uninitialized value in reiserfs"
* tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
reiserfs: fix uninit-value in comp_keys
watch_queue: fix kcalloc() arguments order
ntfs: dir.c: fix kernel-doc function parameter warnings
fs: fix doc comment typo fs tree wide
selftests/overlayfs: verify device and inode numbers in /proc/pid/maps
fs/proc: show correct device and inode numbers in /proc/pid/maps
eventfd: Remove usage of the deprecated ida_simple_xx() API
fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation
fs/hfsplus: wrapper.c: fix kernel-doc warnings
fs: add Jan Kara as reviewer
fs/inode: Make relatime_need_update return bool
pipe: wakeup wr_wait after setting max_usage
file: remove __receive_fd()
file: stop exposing receive_fd_user()
fs: replace f_rcuhead with f_task_work
file: remove pointless wrapper
file: s/close_fd_get_file()/file_close_fd()/g
Improve __fget_files_rcu() code generation (and thus __fget_light())
file: massage cleanup of files that failed to open
fs/pipe: Fix lockdep false-positive in watchqueue pipe_write()
...
The parse_actions() function uses 'len = str_has_prefix()' to test which
action is in the string being parsed. But then it goes and repeats the
logic for each different action. This logic can be simplified and
duplicate code can be removed as 'len' contains the length of the found
prefix which should be used for all actions.
Link: https://lore.kernel.org/all/20240107112044.6702cb66@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20240107203258.37e26d2b@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Merge system-wide power management updates for 6.8-rc1:
- Fix possible deadlocks in the core system-wide PM code that occur if
device-handling functions cannot be executed asynchronously during
resune from system-wide suspend (Rafael J. Wysocki).
- Clean up unnecessary local variable initializations in multiple
places in the hibernation code (Wang chaodong, Li zeming).
- Adjust core hibernation code to avoid missing wakeup events that
occur after saving an image to persistent storage (Chris Feng).
- Update hibernation code to enforce correct ordering during image
compression and decompression (Hongchen Zhang).
- Use kmap_local_page() instead of kmap_atomic() in copy_data_page()
during hibernation and restore (Chen Haonan).
- Adjust documentation and code comments to reflect recent task freezer
changes (Kevin Hao).
- Repair excess function parameter description warning in the
hibernation image-saving code (Randy Dunlap).
* pm-sleep:
PM: sleep: Fix possible deadlocks in core system-wide PM code
async: Introduce async_schedule_dev_nocall()
async: Split async_schedule_node_domain()
PM: hibernate: Repair excess function parameter description warning
PM: sleep: Remove obsolete comment from unlock_system_sleep()
Documentation: PM: Adjust freezing-of-tasks.rst to the freezer changes
PM: hibernate: Use kmap_local_page() in copy_data_page()
PM: hibernate: Enforce ordering during image compression/decompression
PM: hibernate: Avoid missing wakeup events during hibernation
PM: hibernate: Do not initialize error in snapshot_write_next()
PM: hibernate: Do not initialize error in swap_write_page()
PM: hibernate: Drop unnecessary local variable initialization
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZZgrfgAKCRDbK58LschI
g87JAQDu+oUG3aWnRJi+lJTK8vGnKRuBwUxgnI5Ze99N0tuPmAEAz1gpXLYP+fKE
eqRhZGGhujdHC9if3Le+nG6nvf8Gvw0=
=KPkZ
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-01-05
We've added 40 non-merge commits during the last 2 day(s) which contain
a total of 73 files changed, 1526 insertions(+), 951 deletions(-).
The main changes are:
1) Fix a memory leak when streaming AF_UNIX sockets were inserted
into multiple sockmap slots/maps, from John Fastabend.
2) Fix gotol in s390 BPF JIT with large offsets, from Ilya Leoshkevich.
3) Fix reattachment branch in bpf_tracing_prog_attach() and reject
the request if there is no valid attach_btf, from Jiri Olsa.
4) Remove deprecated bpfilter kernel leftovers given the project
is developed in user space (https://github.com/facebook/bpfilter),
from Quentin Deslandes.
5) Relax tracing BPF program recursive attach rules given right now
it is not possible to create tracing program call cycles,
from Dmitrii Dolgov.
6) Fix excessive memory consumption for the bpf_global_percpu_ma
for systems with a large number of CPUs, from Yonghong Song.
7) Small x86 BPF JIT cleanup to reuse emit_nops instead of open-coding
memcpy of x86_nops, from Leon Hwang.
8) Follow-up for libbpf to support __arg_ctx global function argument tag
semantics to complement the merged kernel side, from Andrii Nakryiko.
9) Introduce "volatile compare" macros for BPF selftests in order
to make the latter more robust against compiler optimization,
from Alexei Starovoitov.
10) Small simplification in verifier's size checking of helper accesses
along with additional selftests, from Andrei Matei.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (40 commits)
selftests/bpf: Test re-attachment fix for bpf_tracing_prog_attach
bpf: Fix re-attachment branch in bpf_tracing_prog_attach
selftests/bpf: Add test for recursive attachment of tracing progs
bpf: Relax tracing prog recursive attach rules
bpf, x86: Use emit_nops to replace memcpy x86_nops
selftests/bpf: Test gotol with large offsets
selftests/bpf: Double the size of test_loader log
s390/bpf: Fix gotol with large offsets
bpfilter: remove bpfilter
bpf: Remove unnecessary cpu == 0 check in memalloc
selftests/bpf: add __arg_ctx BTF rewrite test
selftests/bpf: add arg:ctx cases to test_global_funcs tests
libbpf: implement __arg_ctx fallback logic
libbpf: move BTF loading step after relocation step
libbpf: move exception callbacks assignment logic into relocation step
libbpf: use stable map placeholder FDs
libbpf: don't rely on map->fd as an indicator of map being created
libbpf: use explicit map reuse flag to skip map creation steps
libbpf: make uniform use of btf__fd() accessor inside libbpf
selftests/bpf: Add a selftest with > 512-byte percpu allocation size
...
====================
Link: https://lore.kernel.org/r/20240105170105.21070-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The purpose of crash_exclude_mem_range() is to remove all memory ranges
that overlap with [mstart-mend]. However, the current logic only removes
the first overlapping memory range.
Commit a2e9a95d21 ("kexec: Improve & fix crash_exclude_mem_range() to
handle overlapping ranges") attempted to address this issue, but it did
not fix all error cases.
Let's fix and simplify the logic of crash_exclude_mem_range().
Link: https://lkml.kernel.org/r/20240102144905.110047-4-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add CONFIG_LRU_GEN_WALKS_MMU such that if disabled, the code that
walks page tables to promote pages into the youngest generation will
not be built.
Also improves code readability by adding two helper functions
get_mm_state() and get_next_mm().
Link: https://lkml.kernel.org/r/20231227141205.2200125-3-kinseyho@google.com
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Co-developed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Donet Tom <donettom@linux.vnet.ibm.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The following case can cause a crash due to missing attach_btf:
1) load rawtp program
2) load fentry program with rawtp as target_fd
3) create tracing link for fentry program with target_fd = 0
4) repeat 3
In the end we have:
- prog->aux->dst_trampoline == NULL
- tgt_prog == NULL (because we did not provide target_fd to link_create)
- prog->aux->attach_btf == NULL (the program was loaded with attach_prog_fd=X)
- the program was loaded for tgt_prog but we have no way to find out which one
BUG: kernel NULL pointer dereference, address: 0000000000000058
Call Trace:
<TASK>
? __die+0x20/0x70
? page_fault_oops+0x15b/0x430
? fixup_exception+0x22/0x330
? exc_page_fault+0x6f/0x170
? asm_exc_page_fault+0x22/0x30
? bpf_tracing_prog_attach+0x279/0x560
? btf_obj_id+0x5/0x10
bpf_tracing_prog_attach+0x439/0x560
__sys_bpf+0x1cf4/0x2de0
__x64_sys_bpf+0x1c/0x30
do_syscall_64+0x41/0xf0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
Return -EINVAL in this situation.
Fixes: f3a9507554 ("bpf: Allow trampoline re-attach for tracing and lsm programs")
Cc: stable@vger.kernel.org
Signed-off-by: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Link: https://lore.kernel.org/r/20240103190559.14750-4-9erthalion6@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, it's not allowed to attach an fentry/fexit prog to another
one fentry/fexit. At the same time it's not uncommon to see a tracing
program with lots of logic in use, and the attachment limitation
prevents usage of fentry/fexit for performance analysis (e.g. with
"bpftool prog profile" command) in this case. An example could be
falcosecurity libs project that uses tp_btf tracing programs.
Following the corresponding discussion [1], the reason for that is to
avoid tracing progs call cycles without introducing more complex
solutions. But currently it seems impossible to load and attach tracing
programs in a way that will form such a cycle. The limitation is coming
from the fact that attach_prog_fd is specified at the prog load (thus
making it impossible to attach to a program loaded after it in this
way), as well as tracing progs not implementing link_detach.
Replace "no same type" requirement with verification that no more than
one level of attachment nesting is allowed. In this way only one
fentry/fexit program could be attached to another fentry/fexit to cover
profiling use case, and still no cycle could be formed. To implement,
add a new field into bpf_prog_aux to track nested attachment for tracing
programs.
[1]: https://lore.kernel.org/bpf/20191108064039.2041889-16-ast@kernel.org/
Acked-by: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Link: https://lore.kernel.org/r/20240103190559.14750-2-9erthalion6@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The driver core now can handle a const struct bus_type pointer, and the
dma_debug_add_bus() call just passes on the pointer give to it to the
driver core, so make this pointer const as well to allow everyone to use
read-only struct bus_type pointers going forward.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: <iommu@lists.linux.dev>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/2023121941-dejected-nugget-681e@gregkh
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For percpu data structure allocation with bpf_global_percpu_ma,
the maximum data size is 4K. But for a system with large
number of cpus, bigger data size (e.g., 2K, 4K) might consume
a lot of memory. For example, the percpu memory consumption
with unit size 2K and 1024 cpus will be 2K * 1K * 1k = 2GB
memory.
We should discourage such usage. Let us limit the maximum data
size to be 512 for bpf_global_percpu_ma allocation.
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031801.1290841-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, refill low/high marks are set with the assumption
of normal non-percpu memory allocation. For example, for
an allocation size 256, for non-percpu memory allocation,
low mark is 32 and high mark is 96, resulting in the
batch allocation of 48 elements and the allocated memory
will be 48 * 256 = 12KB for this particular cpu.
Assuming an 128-cpu system, the total memory consumption
across all cpus will be 12K * 128 = 1.5MB memory.
This might be okay for non-percpu allocation, but may not be
good for percpu allocation, which will consume 1.5MB * 128 = 192MB
memory in the worst case if every cpu has a chance of memory
allocation.
In practice, percpu allocation is very rare compared to
non-percpu allocation. So let us have smaller low/high marks
which can avoid unnecessary memory consumption.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231222031755.1289671-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Typically for percpu map element or data structure, once allocated,
most operations are lookup or in-place update. Deletion are really
rare. Currently, for percpu data strcture, 4 elements will be
refilled if the size is <= 256. Let us just do with one element
for percpu data. For example, for size 256 and 128 cpus, the
potential saving will be 3 * 256 * 128 * 128 = 12MB.
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031750.1289290-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 41a5db8d81 ("Add support for non-fix-size percpu mem allocation")
added support for non-fix-size percpu memory allocation.
Such allocation will allocate percpu memory for all buckets on all
cpus and the memory consumption is in the order to quadratic.
For example, let us say, 4 cpus, unit size 16 bytes, so each
cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes.
Then let us say, 8 cpus with the same unit size, each cpu
has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes.
So if the number of cpus doubles, the number of memory consumption
will be 4 times. So for a system with large number of cpus, the
memory consumption goes up quickly with quadratic order.
For example, for 4KB percpu allocation, 128 cpus. The total memory
consumption will 4KB * 128 * 128 = 64MB. Things will become
worse if the number of cpus is bigger (e.g., 512, 1024, etc.)
In Commit 41a5db8d81, the non-fix-size percpu memory allocation is
done in boot time, so for system with large number of cpus, the initial
percpu memory consumption is very visible. For example, for 128 cpu
system, the total percpu memory allocation will be at least
(16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
* 128 * 128 = ~138MB.
which is pretty big. It will be even bigger for larger number of cpus.
Note that the current prefill also allocates 4 entries if the unit size
is less than 256. So on top of 138MB memory consumption, this will
add more consumption with
3 * (16 + 32 + 64 + 96 + 128 + 196 + 256) * 128 * 128 = ~38MB.
Next patch will try to reduce this memory consumption.
Later on, Commit 1fda5bb66a ("bpf: Do not allocate percpu memory
at init stage") moved the non-fix-size percpu memory allocation
to bpf verificaiton stage. Once a particular bpf_percpu_obj_new()
is called by bpf program, the memory allocator will try to fill in
the cache with all sizes, causing the same amount of percpu memory
consumption as in the boot stage.
To reduce the initial percpu memory consumption for non-fix-size
percpu memory allocation, instead of filling the cache with all
supported allocation sizes, this patch intends to fill the cache
only for the requested size. As typically users will not use large
percpu data structure, this can save memory significantly.
For example, the allocation size is 64 bytes with 128 cpus.
Then total percpu memory amount will be 64 * 128 * 128 = 1MB,
much less than previous 138MB.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231222031745.1289082-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The objcg is a bpf_mem_alloc level property since all bpf_mem_cache's
are with the same objcg. This patch made such a property explicit.
The next patch will use this property to save and restore objcg
for percpu unit allocator.
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031739.1288590-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, for percpu memory allocation, say if the user
requests allocation size to be 32 bytes, the actually
calculated size will be 40 bytes and it further rounds
to 64 bytes, and eventually 64 bytes are allocated,
wasting 32-byte memory.
Change bpf_mem_alloc() to calculate the cache index
based on the user-provided allocation size so unnecessary
extra memory can be avoided.
Suggested-by: Hou Tao <houtao1@huawei.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031734.1288400-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch simplifies the verification of size arguments associated to
pointer arguments to helpers and kfuncs. Many helpers take a pointer
argument followed by the size of the memory access performed to be
performed through that pointer. Before this patch, the handling of the
size argument in check_mem_size_reg() was confusing and wasteful: if the
size register's lower bound was 0, then the verification was done twice:
once considering the size of the access to be the lower-bound of the
respective argument, and once considering the upper bound (even if the
two are the same). The upper bound checking is a super-set of the
lower-bound checking(*), except: the only point of the lower-bound check
is to handle the case where zero-sized-accesses are explicitly not
allowed and the lower-bound is zero. This static condition is now
checked explicitly, replacing a much more complex, expensive and
confusing verification call to check_helper_mem_access().
Error messages change in this patch. Before, messages about illegal
zero-size accesses depended on the type of the pointer and on other
conditions, and sometimes the message was plain wrong: in some tests
that changed you'll see that the old message was something like "R1 min
value is outside of the allowed memory range", where R1 is the pointer
register; the error was wrongly claiming that the pointer was bad
instead of the size being bad. Other times the information that the size
came for a register with a possible range of values was wrong, and the
error presented the size as a fixed zero. Now the errors refer to the
right register. However, the old error messages did contain useful
information about the pointer register which is now lost; recovering
this information was deemed not important enough.
(*) Besides standing to reason that the checks for a bigger size access
are a super-set of the checks for a smaller size access, I have also
mechanically verified this by reading the code for all types of
pointers. I could convince myself that it's true for all but
PTR_TO_BTF_ID (check_ptr_to_btf_access). There, simply looking
line-by-line does not immediately prove what we want. If anyone has any
qualms, let me know.
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231221232225.568730-2-andreimatei1@gmail.com
In preparation for subsequent changes, introduce a specialized variant
of async_schedule_dev() that will not invoke the argument function
synchronously when it cannot be scheduled for asynchronous execution.
The new function, async_schedule_dev_nocall(), will be used for fixing
possible deadlocks in the system-wide power management core code.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> for the series.
Tested-by: Youngmin Nam <youngmin.nam@samsung.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
In preparation for subsequent changes, split async_schedule_node_domain()
in two pieces so as to allow the bottom part of it to be called from a
somewhat different code path.
No functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Tested-by: Youngmin Nam <youngmin.nam@samsung.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Introduce thermal_zone_device_critical_reboot() to trigger an
emergency reboot.
It is a counterpart of thermal_zone_device_critical() with the
difference that it will force a reboot instead of shutdown.
The motivation for doing this is to allow the thermal subystem
to trigger a reboot when the temperature reaches the critical
temperature.
Signed-off-by: Fabio Estevam <festevam@denx.de>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20231129124330.519423-3-festevam@gmail.com
Add some helper functions to make it easier introducing the support
for thermal reboot.
No functional change.
Signed-off-by: Fabio Estevam <festevam@denx.de>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20231129124330.519423-2-festevam@gmail.com
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZYVEqQAKCRDbK58LschI
gzH6AP9hVXLpHFTWMT0+2GK2lx69VX8zW1C0SmN7WHaxUbPN9QEAwzGnELfKk00P
0IKRHSl5abhVMX7JOM3sSOhCILeKjQg=
=wRLJ
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
bpf-next-for-netdev
The following pull-request contains BPF updates for your *net-next* tree.
We've added 22 non-merge commits during the last 3 day(s) which contain
a total of 23 files changed, 652 insertions(+), 431 deletions(-).
The main changes are:
1) Add verifier support for annotating user's global BPF subprogram arguments
with few commonly requested annotations for a better developer experience,
from Andrii Nakryiko.
These tags are:
- Ability to annotate a special PTR_TO_CTX argument
- Ability to annotate a generic PTR_TO_MEM as non-NULL
2) Support BPF verifier tracking of BPF_JNE which helps cases when the compiler
transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the like, from
Menglong Dong.
3) Fix a warning in bpf_mem_cache's check_obj_size() as reported by LKP, from Hou Tao.
4) Re-support uid/gid options when mounting bpffs which had to be reverted with
the prior token series revert to avoid conflicts, from Daniel Borkmann.
5) Fix a libbpf NULL pointer dereference in bpf_object__collect_prog_relos() found
from fuzzing the library with malformed ELF files, from Mingyi Zhang.
6) Skip DWARF sections in libbpf's linker sanity check given compiler options to
generate compressed debug sections can trigger a rejection due to misalignment,
from Alyssa Ross.
7) Fix an unnecessary use of the comma operator in BPF verifier, from Simon Horman.
8) Fix format specifier for unsigned long values in cpustat sample, from Colin Ian King.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- Fix readers that are blocked on the ring buffer when buffer_percent is
100%. They are supposed to wake up when the buffer is full, but
because the sub-buffer that the writer is on is never considered
"dirty" in the calculation, dirty pages will never equal nr_pages.
Add +1 to the dirty count in order to count for the sub-buffer that
the writer is on.
- When a reader is blocked on the "snapshot_raw" file, it is to be
woken up when a snapshot is done and be able to read the snapshot
buffer. But because the snapshot swaps the buffers (the main one
with the snapshot one), and the snapshot reader is waiting on the
old snapshot buffer, it was not woken up (because it is now on
the main buffer after the swap). Worse yet, when it reads the buffer
after a snapshot, it's not reading the snapshot buffer, it's reading
the live active main buffer.
Fix this by forcing a wakeup of all readers on the snapshot buffer when
a new snapshot happens, and then update the buffer that the reader
is reading to be back on the snapshot buffer.
- Fix the modification of the direct_function hash. There was a race
when new functions were added to the direct_function hash as when
it moved function entries from the old hash to the new one, a direct
function trace could be hit and not see its entry.
This is fixed by allocating the new hash, copy all the old entries
onto it as well as the new entries, and then use rcu_assign_pointer()
to update the new direct_function hash with it.
This also fixes a memory leak in that code.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZZAzTxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qs9IAP9e6wZ74aEjMED9nsbC49EpyCNTqa72
y0uDS/p9ppv52gD7Be+l+kJQzYNh6bZU0+B19hNC2QVn38jb7sOadfO/1Q8=
=NDkf
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix readers that are blocked on the ring buffer when buffer_percent
is 100%. They are supposed to wake up when the buffer is full, but
because the sub-buffer that the writer is on is never considered
"dirty" in the calculation, dirty pages will never equal nr_pages.
Add +1 to the dirty count in order to count for the sub-buffer that
the writer is on.
- When a reader is blocked on the "snapshot_raw" file, it is to be
woken up when a snapshot is done and be able to read the snapshot
buffer. But because the snapshot swaps the buffers (the main one with
the snapshot one), and the snapshot reader is waiting on the old
snapshot buffer, it was not woken up (because it is now on the main
buffer after the swap). Worse yet, when it reads the buffer after a
snapshot, it's not reading the snapshot buffer, it's reading the live
active main buffer.
Fix this by forcing a wakeup of all readers on the snapshot buffer
when a new snapshot happens, and then update the buffer that the
reader is reading to be back on the snapshot buffer.
- Fix the modification of the direct_function hash. There was a race
when new functions were added to the direct_function hash as when it
moved function entries from the old hash to the new one, a direct
function trace could be hit and not see its entry.
This is fixed by allocating the new hash, copy all the old entries
onto it as well as the new entries, and then use rcu_assign_pointer()
to update the new direct_function hash with it.
This also fixes a memory leak in that code.
- Fix eventfs ownership
* tag 'trace-v6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Fix modification of direct_function hash while in use
tracing: Fix blocked reader of snapshot buffer
ring-buffer: Fix wake ups when buffer_percent is set to 100
eventfs: Fix file and directory uid and gid ownership
Directly return NULL or 'next' instead of breaking out of the loop.
Signed-off-by: David Laight <david.laight@aculab.com>
[ Split original patch into two independent parts - Linus ]
Link: https://lore.kernel.org/lkml/7c8828aec72e42eeb841ca0ee3397e9a@AcuMS.aculab.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
osq_wait_next() is passed 'prev' from osq_lock() and NULL from
osq_unlock() but only needs the 'cpu' value to write to lock->tail.
Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
Should have no effect on the generated code since gcc manages to assume
that 'prev != NULL' due to an earlier dereference.
Signed-off-by: David Laight <david.laight@aculab.com>
[ Changed 'old' to 'old_cpu' by request from Waiman Long - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct optimistic_spin_node is private to the implementation.
Move it into the C file to ensure nothing is accessing it.
Signed-off-by: David Laight <david.laight@aculab.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Masami Hiramatsu reported a memory leak in register_ftrace_direct() where
if the number of new entries are added is large enough to cause two
allocations in the loop:
for (i = 0; i < size; i++) {
hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
new = ftrace_add_rec_direct(entry->ip, addr, &free_hash);
if (!new)
goto out_remove;
entry->direct = addr;
}
}
Where ftrace_add_rec_direct() has:
if (ftrace_hash_empty(direct_functions) ||
direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
struct ftrace_hash *new_hash;
int size = ftrace_hash_empty(direct_functions) ? 0 :
direct_functions->count + 1;
if (size < 32)
size = 32;
new_hash = dup_hash(direct_functions, size);
if (!new_hash)
return NULL;
*free_hash = direct_functions;
direct_functions = new_hash;
}
The "*free_hash = direct_functions;" can happen twice, losing the previous
allocation of direct_functions.
But this also exposed a more serious bug.
The modification of direct_functions above is not safe. As
direct_functions can be referenced at any time to find what direct caller
it should call, the time between:
new_hash = dup_hash(direct_functions, size);
and
direct_functions = new_hash;
can have a race with another CPU (or even this one if it gets interrupted),
and the entries being moved to the new hash are not referenced.
That's because the "dup_hash()" is really misnamed and is really a
"move_hash()". It moves the entries from the old hash to the new one.
Now even if that was changed, this code is not proper as direct_functions
should not be updated until the end. That is the best way to handle
function reference changes, and is the way other parts of ftrace handles
this.
The following is done:
1. Change add_hash_entry() to return the entry it created and inserted
into the hash, and not just return success or not.
2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove
the former.
3. Allocate a "new_hash" at the start that is made for holding both the
new hash entries as well as the existing entries in direct_functions.
4. Copy (not move) the direct_function entries over to the new_hash.
5. Copy the entries of the added hash to the new_hash.
6. If everything succeeds, then use rcu_pointer_assign() to update the
direct_functions with the new_hash.
This simplifies the code and fixes both the memory leak as well as the
race condition mentioned above.
Link: https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/
Link: https://lore.kernel.org/linux-trace-kernel/20231229115134.08dd5174@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: 763e34e74b ("ftrace: Add register_ftrace_direct()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If, as part of handling a hardlockup or softlockup, we've already dumped
all CPUs and we're just about to panic, don't reenable dumping and give
some other CPU a chance to hop in there and add some confusing logs right
as the panic is happening.
Link: https://lkml.kernel.org/r/20231220131534.4.Id3a9c7ec2d7d83e4080da6f8662ba2226b40543f@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If two CPUs end up reporting a hardlockup at the same time then their logs
could get interleaved which is hard to read.
The interleaving problem was especially bad with the "perf" hardlockup
detector where the locked up CPU is always the same as the running CPU and
we end up in show_regs(). show_regs() has no inherent serialization so we
could mix together two crawls if two hardlockups happened at the same time
(and if we didn't have `sysctl_hardlockup_all_cpu_backtrace` set). With
this change we'll fully serialize hardlockups when using the "perf"
hardlockup detector.
The interleaving problem was less bad with the "buddy" hardlockup
detector. With "buddy" we always end up calling
`trigger_single_cpu_backtrace(cpu)` on some CPU other than the running
one. trigger_single_cpu_backtrace() always at least serializes the
individual stack crawls because it eventually uses
printk_cpu_sync_get_irqsave(). Unfortunately the fact that
trigger_single_cpu_backtrace() eventually calls
printk_cpu_sync_get_irqsave() (on a different CPU) means that we have to
drop the "lock" before calling it and we can't fully serialize all
printouts associated with a given hardlockup. However, we still do get
the advantage of serializing the output of print_modules() and
print_irqtrace_events().
Aside from serializing hardlockups from each other, this change also has
the advantage of serializing hardlockups and softlockups from each other
if they happen to happen at the same time since they are both using the
same "lock".
Even though nobody is expected to hang while holding the lock associated
with printk_cpu_sync_get_irqsave(), out of an abundance of caution, we
don't call printk_cpu_sync_get_irqsave() until after we print out about
the hardlockup. This makes extra sure that, even if
printk_cpu_sync_get_irqsave() somehow never runs we at least print that we
saw the hardlockup. This is different than the choice made for softlockup
because hardlockup is really our last resort.
Link: https://lkml.kernel.org/r/20231220131534.3.I6ff691b3b40f0379bc860f80c6e729a0485b5247@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Instead of introducing a spinlock, use printk_cpu_sync_get_irqsave() and
printk_cpu_sync_put_irqrestore() to serialize softlockup reporting. Alone
this doesn't have any real advantage over the spinlock, but this will
allow us to use the same function in a future change to also serialize
hardlockup crawls.
NOTE: for the most part this serialization is important because we often
end up in the show_regs() path and that has no built-in serialization if
there are multiple callers at once. However, even in the case where we
end up in the dump_stack() path this still has some advantages because the
stack will be guaranteed to be together in the logs with the lockup
message with no interleaving.
NOTE: the fact that printk_cpu_sync_get_irqsave() is allowed to be called
multiple times on the same CPU is important here. Specifically we hold
the "lock" while calling dump_stack() which also gets the same "lock".
This is explicitly documented to be OK and means we don't need to
introduce a variant of dump_stack() that doesn't grab the lock.
Link: https://lkml.kernel.org/r/20231220131534.2.Ia5906525d440d8e8383cde31b7c61c2aadc8f907@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Li Zhe <lizhe.67@bytedance.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "watchdog: Better handling of concurrent lockups".
When we get multiple lockups at roughly the same time, the output in the
kernel logs can be very confusing since the reports about the lockups end
up interleaved in the logs. There is some code in the kernel to try to
handle this but it wasn't that complete.
Li Zhe recently made this a bit better for softlockups (specifically for
the case where `kernel.softlockup_all_cpu_backtrace` is not set) in commit
9d02330abd ("softlockup: serialized softlockup's log"), but that only
handled softlockup reports. Hardlockup reports still had similar issues.
This series also has a small fix to avoid dumping all stacks a second time
in the case of a panic. This is a bit unrelated to the interleaving fixes
but it does also improve the clarity of lockup reports.
This patch (of 4):
The hardlockup detector and softlockup detector both have the ability to
dump the stack of all CPUs (`kernel.hardlockup_all_cpu_backtrace` and
`kernel.softlockup_all_cpu_backtrace`). Both detectors also have some
logic to attempt to avoid interleaving printouts if two CPUs were trying
to do dumps of all CPUs at the same time. However:
- The hardlockup detector's logic still allowed interleaving some
information. Specifically another CPU could print modules and dump
the stack of the locked CPU at the same time we were dumping all
CPUs.
- In the case where `kernel.hardlockup_panic` was set in addition to
`kernel.hardlockup_all_cpu_backtrace`, when two CPUs both detected
hardlockups at the same time the second CPU could call panic() while
the first was still dumping stacks. This was especially bad if the
locked up CPU wasn't responding to the request for a backtrace since
the function nmi_trigger_cpumask_backtrace() can wait up to 10
seconds.
Let's resolve this by adopting the softlockup logic in the hardlockup
handler.
NOTES:
- As part of this, one might think that we should make a helper
function that both the hard and softlockup detectors call. This
turns out not to be super trivial since it would have to be
parameterized quite a bit since there are separate global variables
controlling each lockup detector and they print log messages that
are just different enough that it would be a pain. We probably don't
want to change the messages that are printed without good reason to
avoid throwing log parsers for a loop.
- One might also think that it would be a good idea to have the
hardlockup and softlockup detector use the same global variable to
prevent interleaving. This would make sure that softlockups and
hardlockups can't interleave each other. That _almost_ works but has
a dangerous flaw if `kernel.hardlockup_panic` is not the same as
`kernel.softlockup_panic` because we might skip a call to panic() if
one type of lockup was detected at the same time as another.
Link: https://lkml.kernel.org/r/20231220211640.2023645-1-dianders@chromium.org
Link: https://lkml.kernel.org/r/20231220131534.1.I4f35a69fbb124b5f0c71f75c631e11fabbe188ff@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
image->control_page represents the starting address for allocating the
next control page, while hole_end represents the address of the last valid
byte of the currently allocated control page.
This bug actually does not affect the correctness of allocating control
pages, because image->control_page is currently only used in
kimage_alloc_crash_control_pages(), and this function, when allocating
control pages, will first align image->control_page up to the nearest
`(1 << order) << PAGE_SHIFT` boundary, then use this value as the
starting address of the next control page. This ensures that the newly
allocated control page will use the correct starting address and not
overlap with previously allocated control pages.
Although it does not affect the correctness of the final result, it is
better for us to set image->control_page to the correct value, in case
it might be used elsewhere in the future, potentially causing errors.
Therefore, after successfully allocating a control page,
image->control_page should be updated to `hole_end + 1`, rather than
hole_end.
Link: https://lkml.kernel.org/r/20231221042308.11076-1-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Change @task to @tsk to prevent kernel-doc warnings:
kernel/stacktrace.c:138: warning: Excess function parameter 'task' description in 'stack_trace_save_tsk'
kernel/stacktrace.c:138: warning: Function parameter or member 'tsk' not described in 'stack_trace_save_tsk'
Link: https://lkml.kernel.org/r/20231220054945.17663-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Documentation/filesystems/relay.rst says to use
return debugfs_create_file(filename, mode, parent, buf,
&relay_file_operations);
and this is the only way relay_file_operations is used.
Thus: debugfs_create_file(&relay_file_operations)
-> __debugfs_create_file(&debugfs_full_proxy_file_operations,
&relay_file_operations)
-> dentry{inode: {i_fop: &debugfs_full_proxy_file_operations},
d_fsdata: &relay_file_operations
| DEBUGFS_FSDATA_IS_REAL_FOPS_BIT}
debugfs_full_proxy_file_operations.open is full_proxy_open, which extracts
the &relay_file_operations from the dentry, and allocates via
__full_proxy_fops_init() new fops, with trivial wrappers around release,
llseek, read, write, poll, and unlocked_ioctl, then replaces the fops on
the opened file therewith.
Naturally, all thusly-created debugfs files have .splice_read = NULL.
This was introduced in commit 49d200deaa ("debugfs: prevent access to
removed files' private data") from 2016-03-22.
AFAICT, relay_file_operations is the only struct file_operations used for
debugfs which defines a .splice_read callback. Hooking it up with
> diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
> index 5063434be0fc..952fcf5b2afa 100644
> --- a/fs/debugfs/file.c
> +++ b/fs/debugfs/file.c
> @@ -328,6 +328,11 @@ FULL_PROXY_FUNC(write, ssize_t, filp,
> loff_t *ppos),
> ARGS(filp, buf, size, ppos));
>
> +FULL_PROXY_FUNC(splice_read, long, in,
> + PROTO(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe,
> + size_t len, unsigned int flags),
> + ARGS(in, ppos, pipe, len, flags));
> +
> FULL_PROXY_FUNC(unlocked_ioctl, long, filp,
> PROTO(struct file *filp, unsigned int cmd, unsigned long arg),
> ARGS(filp, cmd, arg));
> @@ -382,6 +387,8 @@ static void __full_proxy_fops_init(struct file_operations *proxy_fops,
> proxy_fops->write = full_proxy_write;
> if (real_fops->poll)
> proxy_fops->poll = full_proxy_poll;
> + if (real_fops->splice_read)
> + proxy_fops->splice_read = full_proxy_splice_read;
> if (real_fops->unlocked_ioctl)
> proxy_fops->unlocked_ioctl = full_proxy_unlocked_ioctl;
> }
shows it just doesn't work, and splicing always instantly returns empty
(subsequent reads actually return the contents).
No-one noticed it became dead code in 2016, who knows if it worked back
then. Clearly no-one cares; just delete it.
Link: https://lkml.kernel.org/r/dtexwpw6zcdx7dkx3xj5gyjp5syxmyretdcbcdtvrnukd4vvuh@tarta.nabijaczleweli.xyz
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Li kunyu <kunyu@nfschina.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zhang Zhengming <zhang.zhengming@h3c.com>
Cc: Zhao Lei <zhao_lei1@hoperun.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
temp_end represents the address of the last available byte. Therefore,
the starting address of the memory segment with temp_end as its last
available byte and a size of `kbuf->memsz`, that is, the value of
temp_start, should be `temp_end - kbuf->memsz + 1` instead of `temp_end -
kbuf->memsz`.
Additionally, use the ALIGN_DOWN macro instead of open-coding it directly
in locate_mem_hole_top_down() to improve code readability.
Link: https://lkml.kernel.org/r/20231217033528.303333-3-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The end parameter received by kimage_is_destination_range() should be the
last valid byte address of the target memory segment plus 1. However, in
the locate_mem_hole_bottom_up() and locate_mem_hole_top_down() functions,
the corresponding value passed to kimage_is_destination_range() is the
last valid byte address of the target memory segment, which is 1 less.
There are two ways to fix this bug. We can either correct the logic of
the locate_mem_hole_bottom_up() and locate_mem_hole_top_down() functions,
or we can fix kimage_is_destination_range() by making the end parameter
represent the last valid byte address of the target memory segment. Here,
we choose the second approach.
Due to the modification to kimage_is_destination_range(), we also need to
adjust its callers, such as kimage_alloc_normal_control_pages() and
kimage_alloc_page().
Link: https://lkml.kernel.org/r/20231217033528.303333-2-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We already have the folio in these functions, we just need to use it.
folio_add_new_anon_rmap() didn't exist at the time they were converted to
folios.
Link: https://lkml.kernel.org/r/20231211162214.2146080-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If an application blocks on the snapshot or snapshot_raw files, expecting
to be woken up when a snapshot occurs, it will not happen. Or it may
happen with an unexpected result.
That result is that the application will be reading the main buffer
instead of the snapshot buffer. That is because when the snapshot occurs,
the main and snapshot buffers are swapped. But the reader has a descriptor
still pointing to the buffer that it originally connected to.
This is fine for the main buffer readers, as they may be blocked waiting
for a watermark to be hit, and when a snapshot occurs, the data that the
main readers want is now on the snapshot buffer.
But for waiters of the snapshot buffer, they are waiting for an event to
occur that will trigger the snapshot and they can then consume it quickly
to save the snapshot before the next snapshot occurs. But to do this, they
need to read the new snapshot buffer, not the old one that is now
receiving new data.
Also, it does not make sense to have a watermark "buffer_percent" on the
snapshot buffer, as the snapshot buffer is static and does not receive new
data except all at once.
Link: https://lore.kernel.org/linux-trace-kernel/20231228095149.77f5b45d@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: debdd57f51 ("tracing: Make a snapshot feature available from userspace")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The tracefs file "buffer_percent" is to allow user space to set a
water-mark on how much of the tracing ring buffer needs to be filled in
order to wake up a blocked reader.
0 - is to wait until any data is in the buffer
1 - is to wait for 1% of the sub buffers to be filled
50 - would be half of the sub buffers are filled with data
100 - is not to wake the waiter until the ring buffer is completely full
Unfortunately the test for being full was:
dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
return (dirty * 100) > (full * nr_pages);
Where "full" is the value for "buffer_percent".
There is two issues with the above when full == 100.
1. dirty * 100 > 100 * nr_pages will never be true
That is, the above is basically saying that if the user sets
buffer_percent to 100, more pages need to be dirty than exist in the
ring buffer!
2. The page that the writer is on is never considered dirty, as dirty
pages are only those that are full. When the writer goes to a new
sub-buffer, it clears the contents of that sub-buffer.
That is, even if the check was ">=" it would still not be equal as the
most pages that can be considered "dirty" is nr_pages - 1.
To fix this, add one to dirty and use ">=" in the compare.
Link: https://lore.kernel.org/linux-trace-kernel/20231226125902.4a057f1d@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: 03329f9939 ("tracing: Add tracefs file buffer_percentage")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When a CPU is taken offline, the contribution of its cfs_rqs to task_groups'
load may remain and will negatively impact the calculation of the share of
the online CPUs.
To fix this bug, clear the contribution of an offlining CPU to task groups'
load and skip its contribution while it is inactive.
Here's the reproducer of the anomaly, by Imran Khan:
"So far I have encountered only one rather lengthy way of reproducing this issue,
which is as follows:
1. Take a KVM guest (booted with 4 CPUs and can be scaled up to 124 CPUs) and
create 2 custom cgroups: /sys/fs/cgroup/cpu/test_group_1 and /sys/fs/cgroup/
cpu/test_group_2
2. Assign a CPU intensive workload to each of these cgroups and start the
workload.
For my tests I am using following app:
int main(int argc, char *argv[])
{
unsigned long count, i, val;
if (argc != 2) {
printf("usage: ./a.out <number of random nums to generate> \n");
return 0;
}
count = strtoul(argv[1], NULL, 10);
printf("Generating %lu random numbers \n", count);
for (i = 0; i < count; i++) {
val = rand();
val = val % 2;
//usleep(1);
}
printf("Generated %lu random numbers \n", count);
return 0;
}
Also since the system is booted with 4 CPUs, in order to completely load the
system I am also launching 4 instances of same test app under:
/sys/fs/cgroup/cpu/
3. We can see that both of the cgroups get similar CPU time:
# systemd-cgtop --depth 1
Path Tasks %CPU Memory Input/s Output/s
/ 659 - 5.5G - -
/system.slice - - 5.7G - -
/test_group_1 4 - - - -
/test_group_2 3 - - - -
/user.slice 31 - 56.5M - -
Path Tasks %CPU Memory Input/s Output/s
/ 659 394.6 5.5G - -
/test_group_2 3 65.7 - - -
/user.slice 29 55.1 48.0M - -
/test_group_1 4 47.3 - - -
/system.slice - 2.2 5.7G - -
Path Tasks %CPU Memory Input/s Output/s
/ 659 394.8 5.5G - -
/test_group_1 4 62.9 - - -
/user.slice 28 44.9 54.2M - -
/test_group_2 3 44.7 - - -
/system.slice - 0.9 5.7G - -
Path Tasks %CPU Memory Input/s Output/s
/ 659 394.4 5.5G - -
/test_group_2 3 58.8 - - -
/test_group_1 4 51.9 - - -
/user.slice 30 39.3 59.6M - -
/system.slice - 1.9 5.7G - -
Path Tasks %CPU Memory Input/s Output/s
/ 659 394.7 5.5G - -
/test_group_1 4 60.9 - - -
/test_group_2 3 57.9 - - -
/user.slice 28 43.5 36.9M - -
/system.slice - 3.0 5.7G - -
Path Tasks %CPU Memory Input/s Output/s
/ 659 395.0 5.5G - -
/test_group_1 4 66.8 - - -
/test_group_2 3 56.3 - - -
/user.slice 29 43.1 51.8M - -
/system.slice - 0.7 5.7G - -
4. Now move systemd-udevd to one of these test groups, say test_group_1, and
perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
host side.
5. Run the same workload i.e 4 instances of CPU hogger under /sys/fs/cgroup/cpu
and one instance of CPU hogger each in /sys/fs/cgroup/cpu/test_group_1 and
/sys/fs/cgroup/test_group_2.
It can be seen that test_group_1 (the one where systemd-udevd was moved) is getting
much less CPU time than the test_group_2, even though at this point of time both of
these groups have only CPU hogger running:
# systemd-cgtop --depth 1
Path Tasks %CPU Memory Input/s Output/s
/ 1219 - 5.4G - -
/system.slice - - 5.6G - -
/test_group_1 4 - - - -
/test_group_2 3 - - - -
/user.slice 26 - 91.3M - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 394.3 5.4G - -
/test_group_2 3 82.7 - - -
/test_group_1 4 14.3 - - -
/system.slice - 0.8 5.6G - -
/user.slice 26 0.4 91.2M - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 394.6 5.4G - -
/test_group_2 3 67.4 - - -
/system.slice - 24.6 5.6G - -
/test_group_1 4 12.5 - - -
/user.slice 26 0.4 91.2M - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.2 5.4G - -
/test_group_2 3 60.9 - - -
/system.slice - 27.9 5.6G - -
/test_group_1 4 12.2 - - -
/user.slice 26 0.4 91.2M - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.2 5.4G - -
/test_group_2 3 69.4 - - -
/test_group_1 4 13.9 - - -
/user.slice 28 1.6 92.0M - -
/system.slice - 1.0 5.6G - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.6 5.4G - -
/test_group_2 3 59.3 - - -
/test_group_1 4 14.1 - - -
/user.slice 28 1.3 92.2M - -
/system.slice - 0.7 5.6G - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.5 5.4G - -
/test_group_2 3 67.2 - - -
/test_group_1 4 11.5 - - -
/user.slice 28 1.3 92.5M - -
/system.slice - 0.6 5.6G - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.1 5.4G - -
/test_group_2 3 76.8 - - -
/test_group_1 4 12.9 - - -
/user.slice 28 1.3 92.8M - -
/system.slice - 1.2 5.6G - -
From sched_debug data it can be seen that in bad case the load.weight of per-CPU
sched entities corresponding to test_group_1 has reduced significantly and
also load_avg of test_group_1 remains much higher than that of test_group_2,
even though systemd-udevd stopped running long time back and at this point of
time both cgroups just have the CPU hogger app as running entity."
[ mingo: Added details from the original discussion, plus minor edits to the patch. ]
Reported-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Imran Khan <imran.f.khan@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lore.kernel.org/r/20231223111545.62135-1-vincent.guittot@linaro.org
are not considered backporting material.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZYys4AAKCRDdBJ7gKXxA
jtmaAQC+o04Ia7IfB8MIqp1p7dNZQo64x/EnGA8YjUnQ8N6IwQD+ImU7dHl9g9Oo
ROiiAbtMRBUfeJRsExX/Yzc1DV9E9QM=
=ZGcs
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-12-27-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"11 hotfixes. 7 are cc:stable and the other 4 address post-6.6 issues
or are not considered backporting material"
* tag 'mm-hotfixes-stable-2023-12-27-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mailmap: add an old address for Naoya Horiguchi
mm/memory-failure: cast index to loff_t before shifting it
mm/memory-failure: check the mapcount of the precise page
mm/memory-failure: pass the folio and the page to collect_procs()
selftests: secretmem: floor the memory size to the multiple of page_size
mm: migrate high-order folios in swap cache correctly
maple_tree: do not preallocate nodes for slot stores
mm/filemap: avoid buffered read/write race to read inconsistent data
kunit: kasan_test: disable fortify string checker on kmalloc_oob_memset
kexec: select CRYPTO from KEXEC_FILE instead of depending on it
kexec: fix KEXEC_FILE dependencies
by moving cond_resched_rcu() to rcupdate_wait.h, we can kill another big
sched.h dependency.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>