Commit Graph

44789 Commits

Author SHA1 Message Date
72bffbf57c sched/fair: Fix initial util_avg calculation
Change se->load.weight to se_weight(se) in the calculation for the
initial util_avg to avoid unnecessarily inflating the util_avg by 1024
times.

The reason is that se->load.weight has the unit/scale as the scaled-up
load, while cfs_rg->avg.load_avg has the unit/scale as the true task
weight (as mapped directly from the task's nice/priority value). With
CONFIG_32BIT, the scaled-up load is equal to the true task weight. With
CONFIG_64BIT, the scaled-up load is 1024 times the true task weight.
Thus, the current code may inflate the util_avg by 1024 times. The
follow-up capping will not allow the util_avg value to go wild. But the
calculation should have the correct logic.

Signed-off-by: Dawei Li <daweilics@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com>
Link: https://lore.kernel.org/r/20240315015916.21545-1-daweilics@gmail.com
2024-05-17 09:49:44 +02:00
287372fa39 sched/debug: Dump domains' level
Knowing domain's level exactly can be useful when setting
relax_domain_level or cpuset.sched_relax_domain_level

Usage:

  cat /debug/sched/domains/cpu0/domain1/level

to dump cpu0 domain1's level.

SDM macro is not used because sd->level is 'int' and
it would hide the type mismatch between 'int' and 'u32'.

Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/9489b6475f6dd6fbc67c617752d4216fa094da53.1714488502.git.vitaly@bursov.com
2024-05-17 09:48:25 +02:00
a1fd0b9d75 sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level
Change relax_domain_level checks so that it would be possible
to include or exclude all domains from newidle balancing.

This matches the behavior described in the documentation:

  -1   no request. use system default or follow request of others.
   0   no search.
   1   search siblings (hyperthreads in a core).

"2" enables levels 0 and 1, level_max excludes the last (level_max)
level, and level_max+1 includes all levels.

Fixes: 1d3504fcf5 ("sched, cpuset: customize sched domains, core")
Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com
2024-05-17 09:48:24 +02:00
3c999d1ae3 Merge tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:

 - Work items can now be disabled and enabled, and cancel_work_sync()
   and disable_work() can be called form atomic contexts for BH work
   items.

   This closes feature gap with tasklet and should allow converting all
   existing tasklet users to BH workqueues.

 - Improve pool sharing for unbound workqueues with strict affinity.

 - Misc changes including doc updates, improved debug annotations and
   cleanups.

* tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Use "@..." in function comment to describe variable length argument
  workqueue: Add destroy_work_on_stack() in workqueue_softirq_dead()
  workqueue: remove unnecessary import and function in wq_monitor.py
  workqueue: Introduce enable_and_queue_work() convenience function
  workqueue: add function in event of workqueue_activate_work
  workqueue: Cleanup subsys attribute registration
  workqueue: Use list_last_entry() to get the last idle worker
  workqueue: Move attrs->cpumask out of worker_pool's properties when attrs->affn_strict
  workqueue: Use INIT_WORK_ONSTACK in workqueue_softirq_dead()
  workqueue: Allow cancel_work_sync() and disable_work() from atomic contexts on BH work items
  workqueue: Remember whether a work item was on a BH workqueue
  workqueue: Remove WORK_OFFQ_CANCELING
  workqueue: Implement disable/enable for (delayed) work items
  workqueue: Preserve OFFQ bits in cancel[_sync] paths
2024-05-15 17:32:08 -07:00
de6fef50ea Merge tag 'cgroup-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - The locking around cpuset hotplug processing has always been a bit of
   mess which was worked around by making hotplug processing
   asynchronous. The asynchronity isn't great and led to other issues.

   We tried to make the behavior synchronous a while ago but that led to
   lockdep splats. Waiman took another stab at cleaning up and making it
   synchronous. The patch has been in -next for well over a month and
   there haven't been any complaints, so fingers crossed.

 - Tracepoints added to help understanding rstat lock contentions.

 - A bunch of minor changes - doc updates, code cleanups and selftests.

* tag 'cgroup-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits)
  cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints
  selftests/cgroup: Drop define _GNU_SOURCE
  docs: cgroup-v1: Update page cache removal functions
  selftests/cgroup: fix uninitialized variables in test_zswap.c
  selftests/cgroup: cpu_hogger init: use {} instead of {NULL}
  selftests/cgroup: fix clang warnings: uninitialized fd variable
  selftests/cgroup: fix clang build failures for abs() calls
  cgroup/cpuset: Remove outdated comment in sched_partition_write()
  cgroup/cpuset: Fix incorrect top_cpuset flags
  cgroup/cpuset: Avoid clearing CS_SCHED_LOAD_BALANCE twice
  cgroup/cpuset: Statically initialize more members of top_cpuset
  cgroup: Avoid unnecessary looping in cgroup_no_v1()
  cgroup, legacy_freezer: update comment for freezer_css_offline()
  docs, cgroup: add entries for pids to cgroup-v2.rst
  cgroup: don't call cgroup1_pidlist_destroy_all() for v2
  cgroup_freezer: update comment for freezer_css_online()
  cgroup/rstat: desc member cgrp in cgroup_rstat_flush_release
  cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints
  cgroup/pids: Remove superfluous zeroing
  docs: cgroup-v1: Fix description for css_online
  ...
2024-05-15 17:06:08 -07:00
1a7d0890dd kprobe/ftrace: bail out if ftrace was killed
If an error happens in ftrace, ftrace_kill() will prevent disarming
kprobes. Eventually, the ftrace_ops associated with the kprobes will be
freed, yet the kprobes will still be active, and when triggered, they
will use the freed memory, likely resulting in a page fault and panic.

This behavior can be reproduced quite easily, by creating a kprobe and
then triggering a ftrace_kill(). For simplicity, we can simulate an
ftrace error with a kernel module like [1]:

[1]: https://github.com/brenns10/kernel_stuff/tree/master/ftrace_killer

  sudo perf probe --add commit_creds
  sudo perf trace -e probe:commit_creds
  # In another terminal
  make
  sudo insmod ftrace_killer.ko  # calls ftrace_kill(), simulating bug
  # Back to perf terminal
  # ctrl-c
  sudo perf probe --del commit_creds

After a short period, a page fault and panic would occur as the kprobe
continues to execute and uses the freed ftrace_ops. While ftrace_kill()
is supposed to be used only in extreme circumstances, it is invoked in
FTRACE_WARN_ON() and so there are many places where an unexpected bug
could be triggered, yet the system may continue operating, possibly
without the administrator noticing. If ftrace_kill() does not panic the
system, then we should do everything we can to continue operating,
rather than leave a ticking time bomb.

Link: https://lore.kernel.org/all/20240501162956.229427-1-stephen.s.brennan@oracle.com/

Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-16 07:23:30 +09:00
f4b0c4b508 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
 "ARM:

   - Move a lot of state that was previously stored on a per vcpu basis
     into a per-CPU area, because it is only pertinent to the host while
     the vcpu is loaded. This results in better state tracking, and a
     smaller vcpu structure.

   - Add full handling of the ERET/ERETAA/ERETAB instructions in nested
     virtualisation. The last two instructions also require emulating
     part of the pointer authentication extension. As a result, the trap
     handling of pointer authentication has been greatly simplified.

   - Turn the global (and not very scalable) LPI translation cache into
     a per-ITS, scalable cache, making non directly injected LPIs much
     cheaper to make visible to the vcpu.

   - A batch of pKVM patches, mostly fixes and cleanups, as the
     upstreaming process seems to be resuming. Fingers crossed!

   - Allocate PPIs and SGIs outside of the vcpu structure, allowing for
     smaller EL2 mapping and some flexibility in implementing more or
     less than 32 private IRQs.

   - Purge stale mpidr_data if a vcpu is created after the MPIDR map has
     been created.

   - Preserve vcpu-specific ID registers across a vcpu reset.

   - Various minor cleanups and improvements.

  LoongArch:

   - Add ParaVirt IPI support

   - Add software breakpoint support

   - Add mmio trace events support

  RISC-V:

   - Support guest breakpoints using ebreak

   - Introduce per-VCPU mp_state_lock and reset_cntx_lock

   - Virtualize SBI PMU snapshot and counter overflow interrupts

   - New selftests for SBI PMU and Guest ebreak

   - Some preparatory work for both TDX and SNP page fault handling.

     This also cleans up the page fault path, so that the priorities of
     various kinds of fauls (private page, no memory, write to read-only
     slot, etc.) are easier to follow.

  x86:

   - Minimize amount of time that shadow PTEs remain in the special
     REMOVED_SPTE state.

     This is a state where the mmu_lock is held for reading but
     concurrent accesses to the PTE have to spin; shortening its use
     allows other vCPUs to repopulate the zapped region while the zapper
     finishes tearing down the old, defunct page tables.

   - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID
     field, which is defined by hardware but left for software use.

     This lets KVM communicate its inability to map GPAs that set bits
     51:48 on hosts without 5-level nested page tables. Guest firmware
     is expected to use the information when mapping BARs; this avoids
     that they end up at a legal, but unmappable, GPA.

   - Fixed a bug where KVM would not reject accesses to MSR that aren't
     supposed to exist given the vCPU model and/or KVM configuration.

   - As usual, a bunch of code cleanups.

  x86 (AMD):

   - Implement a new and improved API to initialize SEV and SEV-ES VMs,
     which will also be extendable to SEV-SNP.

     The new API specifies the desired encryption in KVM_CREATE_VM and
     then separately initializes the VM. The new API also allows
     customizing the desired set of VMSA features; the features affect
     the measurement of the VM's initial state, and therefore enabling
     them cannot be done tout court by the hypervisor.

     While at it, the new API includes two bugfixes that couldn't be
     applied to the old one without a flag day in userspace or without
     affecting the initial measurement. When a SEV-ES VM is created with
     the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are rejected
     once the VMSA has been encrypted. Also, the FPU and AVX state will
     be synchronized and encrypted too.

   - Support for GHCB version 2 as applicable to SEV-ES guests.

     This, once more, is only accessible when using the new
     KVM_SEV_INIT2 flow for initialization of SEV-ES VMs.

  x86 (Intel):

   - An initial bunch of prerequisite patches for Intel TDX were merged.

     They generally don't do anything interesting. The only somewhat
     user visible change is a new debugging mode that checks that KVM's
     MMU never triggers a #VE virtualization exception in the guest.

   - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig
     VM-Exit to L1, as per the SDM.

  Generic:

   - Use vfree() instead of kvfree() for allocations that always use
     vcalloc() or __vcalloc().

   - Remove .change_pte() MMU notifier - the changes to non-KVM code are
     small and Andrew Morton asked that I also take those through the
     KVM tree.

     The callback was only ever implemented by KVM (which was also the
     original user of MMU notifiers) but it had been nonfunctional ever
     since calls to set_pte_at_notify were wrapped with
     invalidate_range_start and invalidate_range_end... in 2012.

  Selftests:

   - Enhance the demand paging test to allow for better reporting and
     stressing of UFFD performance.

   - Convert the steal time test to generate TAP-friendly output.

   - Fix a flaky false positive in the xen_shinfo_test due to comparing
     elapsed time across two different clock domains.

   - Skip the MONITOR/MWAIT test if the host doesn't actually support
     MWAIT.

   - Avoid unnecessary use of "sudo" in the NX hugepage test wrapper
     shell script, to play nice with running in a minimal userspace
     environment.

   - Allow skipping the RSEQ test's sanity check that the vCPU was able
     to complete a reasonable number of KVM_RUNs, as the assert can fail
     on a completely valid setup.

     If the test is run on a large-ish system that is otherwise idle,
     and the test isn't affined to a low-ish number of CPUs, the vCPU
     task can be repeatedly migrated to CPUs that are in deep sleep
     states, which results in the vCPU having very little net runtime
     before the next migration due to high wakeup latencies.

   - Define _GNU_SOURCE for all selftests to fix a warning that was
     introduced by a change to kselftest_harness.h late in the 6.9
     cycle, and because forcing every test to #define _GNU_SOURCE is
     painful.

   - Provide a global pseudo-RNG instance for all tests, so that library
     code can generate random, but determinstic numbers.

   - Use the global pRNG to randomly force emulation of select writes
     from guest code on x86, e.g. to help validate KVM's emulation of
     locked accesses.

   - Allocate and initialize x86's GDT, IDT, TSS, segments, and default
     exception handlers at VM creation, instead of forcing tests to
     manually trigger the related setup.

  Documentation:

   - Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (225 commits)
  selftests/kvm: remove dead file
  KVM: selftests: arm64: Test vCPU-scoped feature ID registers
  KVM: selftests: arm64: Test that feature ID regs survive a reset
  KVM: selftests: arm64: Store expected register value in set_id_regs
  KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scope
  KVM: arm64: Only reset vCPU-scoped feature ID regs once
  KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs()
  KVM: arm64: Rename is_id_reg() to imply VM scope
  KVM: arm64: Destroy mpidr_data for 'late' vCPU creation
  KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support
  KVM: arm64: Fix hvhe/nvhe early alias parsing
  KVM: SEV: Allow per-guest configuration of GHCB protocol version
  KVM: SEV: Add GHCB handling for termination requests
  KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests
  KVM: SEV: Add support to handle AP reset MSR protocol
  KVM: x86: Explicitly zero kvm_caps during vendor module load
  KVM: x86: Fully re-initialize supported_mce_cap on vendor module load
  KVM: x86: Fully re-initialize supported_vm_types on vendor module load
  KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns
  KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values
  ...
2024-05-15 14:46:43 -07:00
a2a58909cf Merge branch 'for-6.10' into test-merge-for-6.10 2024-05-15 11:40:33 -10:00
a49468240e Merge tag 'modules-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull modules updates from Luis Chamberlain:
 "Finally something fun. Mike Rapoport does some cleanup to allow us to
  take out module_alloc() out of modules into a new paint shedded
  execmem_alloc() and execmem_free() so to make emphasis these helpers
  are actually used outside of modules.

  It starts with a non-functional changes API rename / placeholders to
  then allow architectures to define their requirements into a new shiny
  struct execmem_info with ranges, and requirements for those ranges.

  Archs now can intitialize this execmem_info as the last part of
  mm_core_init() if they have to diverge from the norm. Each range is a
  known type clearly articulated and spelled out in enum execmem_type.

  Although a lot of this is major cleanup and prep work for future
  enhancements an immediate clear gain is we get to enable KPROBES
  without MODULES now. That is ultimately what motiviated to pick this
  work up again, now with smaller goal as concrete stepping stone"

* tag 'modules-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of
  kprobes: remove dependency on CONFIG_MODULES
  powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate
  x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
  arch: make execmem setup available regardless of CONFIG_MODULES
  powerpc: extend execmem_params for kprobes allocations
  arm64: extend execmem_info for generated code allocations
  riscv: extend execmem_params for generated code allocations
  mm/execmem, arch: convert remaining overrides of module_alloc to execmem
  mm/execmem, arch: convert simple overrides of module_alloc to execmem
  mm: introduce execmem_alloc() and execmem_free()
  module: make module_memory_{alloc,free} more self-contained
  sparc: simplify module_alloc()
  nios2: define virtual address space for modules
  mips: module: rename MODULE_START to MODULES_VADDR
  arm64: module: remove unneeded call to kasan_alloc_module_shadow()
  kallsyms: replace deprecated strncpy with strscpy
  module: allow UNUSED_KSYMS_WHITELIST to be relative against objtree.
2024-05-15 14:05:08 -07:00
8c06da67d0 Merge tag 'livepatching-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching update from Petr Mladek:

 - Use more informative names for the livepatch transition states

* tag 'livepatching-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  livepatch: Rename KLP_* to KLP_TRANSITION_*
2024-05-15 13:07:49 -07:00
a19264d086 Merge tag 'printk-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:

 - Use no_printk() instead of "if (0) printk()" constructs to avoid
   generating printk index for messages disabled at compile time

 - Remove deprecated strncpy/strcpy from printk.c

 - Remove redundant CONFIG_BASE_FULL in favor of CONFIG_BASE_SMALL

* tag 'printk-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: cleanup deprecated uses of strncpy/strcpy
  printk: Remove redundant CONFIG_BASE_FULL
  printk: Change type of CONFIG_BASE_SMALL to bool
  printk: Fix LOG_CPU_MAX_BUF_SHIFT when BASE_SMALL is enabled
  ceph: Use no_printk() helper
  dyndbg: Use *no_printk() helpers
  dev_printk: Add and use dev_no_printk()
  printk: Let no_printk() use _printk()
2024-05-15 12:34:46 -07:00
9ee9822908 bpf: save extended inner map info for percpu array maps as well
ARRAY_OF_MAPS and HASH_OF_MAPS map types have special logic to save
a few extra fields required for correct operations of ARRAY maps, when
they are used as inner maps. PERCPU_ARRAY maps have similar
requirements as they now support generating inline element lookup
logic. So make sure that both classes of maps are handled correctly.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Fixes: db69718b8e ("bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240515062440.846086-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-15 09:34:54 -07:00
dafc2d0f1b Merge branch 'for-6.10-base-small' into for-linus 2024-05-15 11:58:26 +02:00
b9c6820f02 ring-buffer: Add cast to unsigned long addr passed to virt_to_page()
The sub-buffer pages are held in an unsigned long array, and when it is
passed to virt_to_page() a cast is needed.

Link: https://lore.kernel.org/all/20240515124808.06279d04@canb.auug.org.au/
Link: https://lore.kernel.org/linux-trace-kernel/20240515010558.4abaefdd@rorschach.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 117c39200d ("ring-buffer: Introducing ring-buffer mapping functions")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-15 01:28:42 -04:00
1b294a1f35 Merge tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Complete rework of garbage collection of AF_UNIX sockets.

     AF_UNIX is prone to forming reference count cycles due to fd
     passing functionality. New method based on Tarjan's Strongly
     Connected Components algorithm should be both faster and remove a
     lot of workarounds we accumulated over the years.

   - Add TCP fraglist GRO support, allowing chaining multiple TCP
     packets and forwarding them together. Useful for small switches /
     routers which lack basic checksum offload in some scenarios (e.g.
     PPPoE).

   - Support using SMP threads for handling packet backlog i.e. packet
     processing from software interfaces and old drivers which don't use
     NAPI. This helps move the processing out of the softirq jumble.

   - Continue work of converting from rtnl lock to RCU protection.

     Don't require rtnl lock when reading: IPv6 routing FIB, IPv6
     address labels, netdev threaded NAPI sysfs files, bonding driver's
     sysfs files, MPLS devconf, IPv4 FIB rules, netns IDs, tcp metrics,
     TC Qdiscs, neighbor entries, ARP entries via ioctl(SIOCGARP), a lot
     of the link information available via rtnetlink.

   - Small optimizations from Eric to UDP wake up handling, memory
     accounting, RPS/RFS implementation, TCP packet sizing etc.

   - Allow direct page recycling in the bulk API used by XDP, for +2%
     PPS.

   - Support peek with an offset on TCP sockets.

   - Add MPTCP APIs for querying last time packets were received/sent/acked
     and whether MPTCP "upgrade" succeeded on a TCP socket.

   - Add intra-node communication shortcut to improve SMC performance.

   - Add IPv6 (and IPv{4,6}-over-IPv{4,6}) support to the GTP protocol
     driver.

   - Add HSR-SAN (RedBOX) mode of operation to the HSR protocol driver.

   - Add reset reasons for tracing what caused a TCP reset to be sent.

   - Introduce direction attribute for xfrm (IPSec) states. State can be
     used either for input or output packet processing.

  Things we sprinkled into general kernel code:

   - Add bitmap_{read,write}(), bitmap_size(), expose BYTES_TO_BITS().

     This required touch-ups and renaming of a few existing users.

   - Add Endian-dependent __counted_by_{le,be} annotations.

   - Make building selftests "quieter" by printing summaries like
     "CC object.o" rather than full commands with all the arguments.

  Netfilter:

   - Use GFP_KERNEL to clone elements, to deal better with OOM
     situations and avoid failures in the .commit step.

  BPF:

   - Add eBPF JIT for ARCv2 CPUs.

   - Support attaching kprobe BPF programs through kprobe_multi link in
     a session mode, meaning, a BPF program is attached to both function
     entry and return, the entry program can decide if the return
     program gets executed and the entry program can share u64 cookie
     value with return program. "Session mode" is a common use-case for
     tetragon and bpftrace.

   - Add the ability to specify and retrieve BPF cookie for raw
     tracepoint programs in order to ease migration from classic to raw
     tracepoints.

   - Add an internal-only BPF per-CPU instruction for resolving per-CPU
     memory addresses and implement support in x86, ARM64 and RISC-V
     JITs. This allows inlining functions which need to access per-CPU
     state.

   - Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
     atomics in bpf_arena which can be JITed as a single x86
     instruction. Support BPF arena on ARM64.

   - Add a new bpf_wq API for deferring events and refactor
     process-context bpf_timer code to keep common code where possible.

   - Harden the BPF verifier's and/or/xor value tracking.

   - Introduce crypto kfuncs to let BPF programs call kernel crypto
     APIs.

   - Support bpf_tail_call_static() helper for BPF programs with GCC 13.

   - Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
     program to have code sections where preemption is disabled.

  Driver API:

   - Skip software TC processing completely if all installed rules are
     marked as HW-only, instead of checking the HW-only flag rule by
     rule.

   - Add support for configuring PoE (Power over Ethernet), similar to
     the already existing support for PoDL (Power over Data Line)
     config.

   - Initial bits of a queue control API, for now allowing a single
     queue to be reset without disturbing packet flow to other queues.

   - Common (ethtool) statistics for hardware timestamping.

  Tests and tooling:

   - Remove the need to create a config file to run the net forwarding
     tests so that a naive "make run_tests" can exercise them.

   - Define a method of writing tests which require an external endpoint
     to communicate with (to send/receive data towards the test
     machine). Add a few such tests.

   - Create a shared code library for writing Python tests. Expose the
     YAML Netlink library from tools/ to the tests for easy Netlink
     access.

   - Move netfilter tests under net/, extend them, separate performance
     tests from correctness tests, and iron out issues found by running
     them "on every commit".

   - Refactor BPF selftests to use common network helpers.

   - Further work filling in YAML definitions of Netlink messages for:
     nftables, team driver, bonding interfaces, vlan interfaces, VF
     info, TC u32 mark, TC police action.

   - Teach Python YAML Netlink to decode attribute policies.

   - Extend the definition of the "indexed array" construct in the specs
     to cover arrays of scalars rather than just nests.

   - Add hyperlinks between definitions in generated Netlink docs.

  Drivers:

   - Make sure unsupported flower control flags are rejected by drivers,
     and make more drivers report errors directly to the application
     rather than dmesg (large number of driver changes from Asbjørn
     Sloth Tønnesen).

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - support multiple RSS contexts and steering traffic to them
         - support XDP metadata
         - make page pool allocations more NUMA aware
      - Intel (100G, ice, idpf):
         - extract datapath code common among Intel drivers into a library
         - use fewer resources in switchdev by sharing queues with the PF
         - add PFCP filter support
         - add Ethernet filter support
         - use a spinlock instead of HW lock in PTP clock ops
         - support 5 layer Tx scheduler topology
      - nVidia/Mellanox:
         - 800G link modes and 100G SerDes speeds
         - per-queue IRQ coalescing configuration
      - Marvell Octeon:
         - support offloading TC packet mark action

   - Ethernet NICs consumer, embedded and virtual:
      - stop lying about skb->truesize in USB Ethernet drivers, it
        messes up TCP memory calculations
      - Google cloud vNIC:
         - support changing ring size via ethtool
         - support ring reset using the queue control API
      - VirtIO net:
         - expose flow hash from RSS to XDP
         - per-queue statistics
         - add selftests
      - Synopsys (stmmac):
         - support controllers which require an RX clock signal from the
           MII bus to perform their hardware initialization
      - TI:
         - icssg_prueth: support ICSSG-based Ethernet on AM65x SR1.0 devices
         - icssg_prueth: add SW TX / RX Coalescing based on hrtimers
         - cpsw: minimal XDP support
      - Renesas (ravb):
         - support describing the MDIO bus
      - Realtek (r8169):
         - add support for RTL8168M
      - Microchip Sparx5:
         - matchall and flower actions mirred and redirect

   - Ethernet switches:
      - nVidia/Mellanox:
         - improve events processing performance
      - Marvell:
         - add support for MV88E6250 family internal PHYs
      - Microchip:
         - add DCB and DSCP mapping support for KSZ switches
         - vsc73xx: convert to PHYLINK
      - Realtek:
         - rtl8226b/rtl8221b: add C45 instances and SerDes switching

   - Many driver changes related to PHYLIB and PHYLINK deprecated API
     cleanup

   - Ethernet PHYs:
      - Add a new driver for Airoha EN8811H 2.5 Gigabit PHY.
      - micrel: lan8814: add support for PPS out and external timestamp trigger

   - WiFi:
      - Disable Wireless Extensions (WEXT) in all Wi-Fi 7 devices
        drivers. Modern devices can only be configured using nl80211.
      - mac80211/cfg80211
         - handle color change per link for WiFi 7 Multi-Link Operation
      - Intel (iwlwifi):
         - don't support puncturing in 5 GHz
         - support monitor mode on passive channels
         - BZ-W device support
         - P2P with HE/EHT support
         - re-add support for firmware API 90
         - provide channel survey information for Automatic Channel Selection
      - MediaTek (mt76):
         - mt7921 LED control
         - mt7925 EHT radiotap support
         - mt7920e PCI support
      - Qualcomm (ath11k):
         - P2P support for QCA6390, WCN6855 and QCA2066
         - support hibernation
         - ieee80211-freq-limit Device Tree property support
      - Qualcomm (ath12k):
         - refactoring in preparation of multi-link support
         - suspend and hibernation support
         - ACPI support
         - debugfs support, including dfs_simulate_radar support
      - RealTek:
         - rtw88: RTL8723CS SDIO device support
         - rtw89: RTL8922AE Wi-Fi 7 PCI device support
         - rtw89: complete features of new WiFi 7 chip 8922AE including
           BT-coexistence and Wake-on-WLAN
         - rtw89: use BIOS ACPI settings to set TX power and channels
         - rtl8xxxu: enable Management Frame Protection (MFP) support

   - Bluetooth:
      - support for Intel BlazarI and Filmore Peak2 (BE201)
      - support for MediaTek MT7921S SDIO
      - initial support for Intel PCIe BT driver
      - remove HCI_AMP support"

* tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1827 commits)
  selftests: netfilter: fix packetdrill conntrack testcase
  net: gro: fix napi_gro_cb zeroed alignment
  Bluetooth: btintel_pcie: Refactor and code cleanup
  Bluetooth: btintel_pcie: Fix warning reported by sparse
  Bluetooth: hci_core: Fix not handling hdev->le_num_of_adv_sets=1
  Bluetooth: btintel: Fix compiler warning for multi_v7_defconfig config
  Bluetooth: btintel_pcie: Fix compiler warnings
  Bluetooth: btintel_pcie: Add *setup* function to download firmware
  Bluetooth: btintel_pcie: Add support for PCIe transport
  Bluetooth: btintel: Export few static functions
  Bluetooth: HCI: Remove HCI_AMP support
  Bluetooth: L2CAP: Fix div-by-zero in l2cap_le_flowctl_init()
  Bluetooth: qca: Fix error code in qca_read_fw_build_info()
  Bluetooth: hci_conn: Use __counted_by() and avoid -Wfamnae warning
  Bluetooth: btintel: Add support for Filmore Peak2 (BE201)
  Bluetooth: btintel: Add support for BlazarI
  LE Create Connection command timeout increased to 20 secs
  dt-bindings: net: bluetooth: Add MediaTek MT7921S SDIO Bluetooth
  Bluetooth: compute LE flow credits based on recvbuf space
  Bluetooth: hci_sync: Use cmd->num_cis instead of magic number
  ...
2024-05-14 19:42:24 -07:00
0c181b1d97 Merge tag 'pm-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "These are mostly cpufreq updates, including a significant intel-pstate
  driver update and several amd-pstate improvements plus some updates of
  ARM cpufreq drivers, general fixes and cleanups.

  Also included are changes related to system sleep, power capping
  updates adding support for a new platform and a new hardware feature
  (among other things), a Samsung exynos-asv driver update allowing it
  to change its Energy Model after adjusting voltage, minor cpuidle and
  devfreq updates and a small documentation cleanup.

  Specifics:

   - Rework the handling of disabled turbo in the intel_pstate driver
     and make it update the maximum CPU frequency consistently
     regardless of the reason on top of a number of cleanups (Rafael
     Wysocki)

   - Add missing checks for NULL .exit() cpufreq driver callback to the
     cpufreq core (Viresh Kumar)

   - Prevent pulicy->max from going above the frequency QoS maximum
     value when cpufreq_frequency_table_verify() is used (Xuewen Yan)

   - Prevent a negative CPU number or frequency value from being printed
     if they are really large (Joshua Yeong)

   - Update MAINTAINERS entry for amd-pstate to add two new
     submaintainers and a designated reviewer (Huang Rui)

   - Clean up the amd-pstate driver and update its documentation
     (Gautham Shenoy)

   - Fix the highest frequency issue in the amd-pstate driver which
     limits performance (Perry Yuan)

   - Enable CPPC v2 for certain processors in the family 17H, as
     requested by TR40 processor users who expect improved performance
     and lower system temperature (Perry Yuan)

   - Change latency and delay values to be read from platform firmware
     firstly for more accurate timing (Perry Yuan)

   - A new quirk is introduced for supporting amd-pstate on legacy
     processors which either lack CPPC capability, or only only have
     CPPC v2 capability (Perry Yuan)

   - Sun50i cpufreq: Add support for opp_supported_hw, H616 platform and
     general cleanups (Andre Przywara, Martin Botka, Brandon Cheo Fusi,
     Dan Carpenter, Viresh Kumar)

   - CPPC cpufreq: Fix possible null pointer dereference (Aleksandr
     Mishin)

   - Eliminate uses of of_node_put() from cpufreq (Javier Carrasco,
     Shivani Gupta)

   - brcmstb-avs: ISO C90 forbids mixed declarations (Portia Stephens)

   - mediatek cpufreq: Add support for MT7988A (Sam Shih)

   - cpufreq-qcom-hw: Add SM4450 compatibles in DT bindings (Tengfei
     Fan)

   - Fix struct cpudata::epp_cached kernel-doc in the intel_pstate
     cpufreq driver (Jeff Johnson)

   - Fix kerneldoc description of ladder_do_selection() (Jeff Johnson)

   - Convert the cpuidle kirkwood driver to platform remove callback
     returning void (Yangtao Li)

   - Replace deprecated strncpy() with strscpy() in the hibernation core
     code (Justin Stitt)

   - Use %ps to simplify debug output in the core system-wide suspend
     and resume code (Len Brown)

   - Remove unnecessary else from device_init_wakeup() and make
     device_wakeup_disable() return void (Dhruva Gole)

   - Enable PMU support in the Intel TPMI RAPL driver (Zhang Rui)

   - Add support for ArrowLake-H platform to the Intel RAPL driver
     (Zhang Rui)

   - Avoid explicit cpumask allocation on stack in DTPM (Dawei Li)

   - Make the Samsung exynos-asv driver update the Energy Model after
     adjusting voltage on top of some preliminary changes of the OPP and
     Enery Model generic code (Lukasz Luba)

   - Remove a reference to a function that has been dropped from the
     power management documentation (Bjorn Helgaas)

   - Convert the platfrom remove callback to .remove_new for the
     exyno-nocp, exynos-ppmu, mtk-cci-devfreq, sun8i-a33-mbus, and
     rk3399_dmc devfreq drivers (Uwe Kleine-König)

   - Use DEFINE_SIMPLE_PM_OPS for exyno-bus.c driver (Anand Moon)"

* tag 'pm-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (68 commits)
  PM / devfreq: exynos: Use DEFINE_SIMPLE_DEV_PM_OPS for PM functions
  PM / devfreq: rk3399_dmc: Convert to platform remove callback returning void
  PM / devfreq: sun8i-a33-mbus: Convert to platform remove callback returning void
  PM / devfreq: mtk-cci: Convert to platform remove callback returning void
  PM / devfreq: exynos-ppmu: Convert to platform remove callback returning void
  PM / devfreq: exynos-nocp: Convert to platform remove callback returning void
  cpufreq: amd-pstate: fix the highest frequency issue which limits performance
  cpufreq: intel_pstate: fix struct cpudata::epp_cached kernel-doc
  cpuidle: ladder: fix ladder_do_selection() kernel-doc
  powercap: intel_rapl_tpmi: Enable PMU support
  powercap: intel_rapl: Introduce APIs for PMU support
  PM: hibernate: replace deprecated strncpy() with strscpy()
  cpufreq: Fix up printing large CPU numbers and frequency values
  MAINTAINERS: cpufreq: amd-pstate: Add co-maintainers and reviewer
  cpufreq: amd-pstate: remove unused variable lowest_nonlinear_freq
  cpufreq: amd-pstate: fix code format problems
  cpufreq: amd-pstate: Add quirk for the pstate CPPC capabilities missing
  cppc_acpi: print error message if CPPC is unsupported
  cpufreq: amd-pstate: get transition delay and latency value from ACPI tables
  cpufreq: amd-pstate: Bail out if min/max/nominal_freq is 0
  ...
2024-05-14 13:19:15 -07:00
21c38a3bd4 cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints
This closely resembles helpers added for the global cgroup_rstat_lock in
commit fc29e04ae1 ("cgroup/rstat: add cgroup_rstat_lock helpers and
tracepoints"). This is for the per CPU lock cgroup_rstat_cpu_lock.

Based on production workloads, we observe the fast-path "update" function
cgroup_rstat_updated() is invoked around 3 million times per sec, while the
"flush" function cgroup_rstat_flush_locked(), walking each possible CPU,
can see periodic spikes of 700 invocations/sec.

For this reason, the tracepoints are split into normal and fastpath
versions for this per-CPU lock. Making it feasible for production to
continuously monitor the non-fastpath tracepoint to detect lock contention
issues. The reason for monitoring is that lock disables IRQs which can
disturb e.g. softirq processing on the local CPUs involved. When the
global cgroup_rstat_lock stops disabling IRQs (e.g converted to a mutex),
this per CPU lock becomes the next bottleneck that can introduce latency
variations.

A practical bpftrace script for monitoring contention latency:

 bpftrace -e '
   tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
     @start[tid]=nsecs; @cnt[probe]=count()}
   tracepoint:cgroup:cgroup_rstat_cpu_locked {
     if (args->contended) {
       @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}
     @cnt[probe]=count()}
   interval:s:1 {time("%H:%M:%S "); print(@wait_ns); print(@cnt); clear(@cnt);}'

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-05-14 09:43:17 -10:00
896d3fce84 Merge tag 'linux_kselftest-kunit-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kunit updates from Shuah Khan:

 - fix race condition in try-catch completion

 - change __kunit_test_suites_init() to exit early if there is
   nothing to test

 - change string-stream-test to use KUNIT_DEFINE_ACTION_WRAPPER

 - move fault tests behind KUNIT_FAULT_TEST Kconfig option

 - kthread test fixes and improvements

 - iov_iter test fixes

* tag 'linux_kselftest-kunit-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: bail out early in __kunit_test_suites_init() if there are no suites to test
  kunit: string-stream-test: use KUNIT_DEFINE_ACTION_WRAPPER
  kunit: test: Move fault tests behind KUNIT_FAULT_TEST Kconfig option
  kunit: unregister the device on error
  kunit: Fix race condition in try-catch completion
  kunit: Add tests for fault
  kunit: Print last test location on fault
  kunit: Fix KUNIT_SUCCESS() calls in iov_iter tests
  kunit: Handle test faults
  kunit: Fix timeout message
  kunit: Fix kthread reference
  kunit: Handle thread creation error
2024-05-14 11:32:52 -07:00
654de42f3f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in late fixes to prepare for the 6.10 net-next PR.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-14 10:53:19 -07:00
6bfd2d442a Merge tag 'irq-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull interrupt subsystem updates from Thomas Gleixner:
 "Core code:

   - Interrupt storm detection for the lockup watchdog:

     Lockups which are caused by interrupt storms are not easy to debug
     because there is no information about the events which make the
     lockup detector trigger.

     To make this more user friendly, provide an extenstion to interrupt
     statistics which allows to take snapshots and an interface to
     retrieve the delta to the snapshot. Use this new mechanism in the
     watchdog code to do a two stage lockup analysis by taking the
     snapshot and printing the deltas for the topmost active interrupts
     on the second trigger.

     Note: This contains both the interrupt and the watchdog changes as
     the latter depend on the former obviously.

   - Avoid summation loops in the /proc/interrupts output and use the
     global counter when possible

   - Skip suspended interrupts on CPU hotplug operations to ensure that
     they are not delivered before the system resumes the device drivers
     when coming out of suspend.

   - On CPU hot-unplug interrupts which are affine to the outgoing CPU
     are migrated to a different CPU in the affinity mask. This can fail
     when the CPUs have no vectors left. Instead of giving up try to
     migrate it to any online CPU and thereby breaking the affinity
     setting in order to prevent a stale device interrupt which targets
     an offline CPU

   - The usual small cleanups

  Driver code:

   - Support for the RISCV AIA MSI controller

   - Make the interrupt allocation for the Loongson PCH controller more
     flexible to prevent vector exhaustion

   - The usual set of cleanups and fixes all over the place"

* tag 'irq-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  irqchip/gic-v3-its: Remove BUG_ON in its_vpe_irq_domain_alloc
  cpuidle: Avoid explicit cpumask allocation on stack
  irqchip/sifive-plic: Avoid explicit cpumask allocation on stack
  irqchip/riscv-aplic-direct: Avoid explicit cpumask allocation on stack
  irqchip/loongson-eiointc: Avoid explicit cpumask allocation on stack
  irqchip/gic-v3-its: Avoid explicit cpumask allocation on stack
  irqchip/irq-bcm6345-l1: Avoid explicit cpumask allocation on stack
  cpumask: Introduce cpumask_first_and_and()
  irqchip/irq-brcmstb-l2: Avoid saving mask on shutdown
  genirq: Reuse irq_is_nmi()
  genirq/cpuhotplug: Retry with cpu_online_mask when migration fails
  genirq/cpuhotplug: Skip suspended interrupts when restoring affinity
  arm64: dts: st: Add interrupt parent to pinctrl on stm32mp251
  arm64: dts: st: Add exti1 and exti2 nodes on stm32mp251
  ARM: dts: stm32: List exti parent interrupts on stm32mp131
  ARM: dts: stm32: List exti parent interrupts on stm32mp151
  arm64: Kconfig.platforms: Enable STM32_EXTI for ARCH_STM32
  irqchip/stm32-exti: Mark events reserved with RIF configuration check
  irqchip/stm32-exti: Skip secure events
  irqchip/stm32-exti: Convert driver to standard PM
  ...
2024-05-14 09:47:14 -07:00
2d9db778dd Merge tag 'timers-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timers and timekeeping updates from Thomas Gleixner:
 "Core code:

   - Make timekeeping and VDSO time readouts resilent against math
     overflow:

     In guest context the kernel is prone to math overflow when the host
     defers the timer interrupt due to overload, malfunction or malice.

     This can be mitigated by checking the clocksource delta for the
     maximum deferrement which is readily available. If that value is
     exceeded then the code uses a slowpath function which can handle
     the multiplication overflow.

     This functionality is enabled unconditionally in the kernel, but
     made conditional in the VDSO code. The latter is conditional
     because it allows architectures to optimize the check so it is not
     causing performance regressions.

     On X86 this is achieved by reworking the existing check for
     negative TSC deltas as a negative delta obviously exceeds the
     maximum deferrement when it is evaluated as an unsigned value. That
     avoids two conditionals in the hotpath and allows to hide both the
     negative delta and the large delta handling in the same slow path.

   - Add an initial minimal ktime_t abstraction for Rust

   - The usual boring cleanups and enhancements

  Drivers:

   - Boring updates to device trees and trivial enhancements in various
     drivers"

* tag 'timers-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  clocksource/drivers/arm_arch_timer: Mark hisi_161010101_oem_info const
  clocksource/drivers/timer-ti-dm: Remove an unused field in struct dmtimer
  clocksource/drivers/renesas-ostm: Avoid reprobe after successful early probe
  clocksource/drivers/renesas-ostm: Allow OSTM driver to reprobe for RZ/V2H(P) SoC
  dt-bindings: timer: renesas: ostm: Document Renesas RZ/V2H(P) SoC
  rust: time: doc: Add missing C header links
  clocksource: Make the int help prompt unit readable in ncurses
  hrtimer: Rename __hrtimer_hres_active() to hrtimer_hres_active()
  timerqueue: Remove never used function timerqueue_node_expires()
  rust: time: Add Ktime
  vdso: Fix powerpc build U64_MAX undeclared error
  clockevents: Convert s[n]printf() to sysfs_emit()
  clocksource: Convert s[n]printf() to sysfs_emit()
  clocksource: Make watchdog and suspend-timing multiplication overflow safe
  timekeeping: Let timekeeping_cycles_to_ns() handle both under and overflow
  timekeeping: Make delta calculation overflow safe
  timekeeping: Prepare timekeeping_cycles_to_ns() for overflow safety
  timekeeping: Fold in timekeeping_delta_to_ns()
  timekeeping: Consolidate timekeeping helpers
  timekeeping: Refactor timekeeping helpers
  ...
2024-05-14 09:27:40 -07:00
e60b613df8 ftrace: Fix possible use-after-free issue in ftrace_location()
KASAN reports a bug:

  BUG: KASAN: use-after-free in ftrace_location+0x90/0x120
  Read of size 8 at addr ffff888141d40010 by task insmod/424
  CPU: 8 PID: 424 Comm: insmod Tainted: G        W          6.9.0-rc2+
  [...]
  Call Trace:
   <TASK>
   dump_stack_lvl+0x68/0xa0
   print_report+0xcf/0x610
   kasan_report+0xb5/0xe0
   ftrace_location+0x90/0x120
   register_kprobe+0x14b/0xa40
   kprobe_init+0x2d/0xff0 [kprobe_example]
   do_one_initcall+0x8f/0x2d0
   do_init_module+0x13a/0x3c0
   load_module+0x3082/0x33d0
   init_module_from_file+0xd2/0x130
   __x64_sys_finit_module+0x306/0x440
   do_syscall_64+0x68/0x140
   entry_SYSCALL_64_after_hwframe+0x71/0x79

The root cause is that, in lookup_rec(), ftrace record of some address
is being searched in ftrace pages of some module, but those ftrace pages
at the same time is being freed in ftrace_release_mod() as the
corresponding module is being deleted:

           CPU1                       |      CPU2
  register_kprobes() {                | delete_module() {
    check_kprobe_address_safe() {     |
      arch_check_ftrace_location() {  |
        ftrace_location() {           |
          lookup_rec() // USE!        |   ftrace_release_mod() // Free!

To fix this issue:
  1. Hold rcu lock as accessing ftrace pages in ftrace_location_range();
  2. Use ftrace_location_range() instead of lookup_rec() in
     ftrace_location();
  3. Call synchronize_rcu() before freeing any ftrace pages both in
     ftrace_process_locs()/ftrace_release_mod()/ftrace_free_mem().

Link: https://lore.kernel.org/linux-trace-kernel/20240509192859.1273558-1-zhengyejian1@huawei.com

Cc: stable@vger.kernel.org
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: <mathieu.desnoyers@efficios.com>
Fixes: ae6aa16fdc ("kprobes: introduce ftrace based optimization")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-14 11:13:27 -04:00
2c9e5d4a00 bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of
BPF just-in-time compiler depended on CONFIG_MODULES because it used
module_alloc() to allocate memory for the generated code.

Since code allocations are now implemented with execmem, drop dependency of
CONFIG_BPF_JIT on CONFIG_MODULES and make it select CONFIG_EXECMEM.

Suggested-by: Björn Töpel <bjorn@kernel.org>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:36:29 -07:00
7582b7be16 kprobes: remove dependency on CONFIG_MODULES
kprobes depended on CONFIG_MODULES because it has to allocate memory for
code.

Since code allocations are now implemented with execmem, kprobes can be
enabled in non-modular kernels.

Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
dependency of CONFIG_KPROBES on CONFIG_MODULES.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
[mcgrof: rebase in light of NEED_TASKS_RCU ]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:35:06 -07:00
223b5e57d0 mm/execmem, arch: convert remaining overrides of module_alloc to execmem
Extend execmem parameters to accommodate more complex overrides of
module_alloc() by architectures.

This includes specification of a fallback range required by arm, arm64
and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for
allocation of KASAN shadow required by s390 and x86 and support for
late initialization of execmem required by arm64.

The core implementation of execmem_alloc() takes care of suppressing
warnings when the initial allocation fails but there is a fallback range
defined.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Tested-by: Liviu Dudau <liviu@dudau.co.uk>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:43 -07:00
12af2b83d0 mm: introduce execmem_alloc() and execmem_free()
module_alloc() is used everywhere as a mean to allocate memory for code.

Beside being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF to modules and
puts the burden of code allocation to the modules code.

Several architectures override module_alloc() because of various
constraints where the executable memory can be located and this causes
additional obstacles for improvements of code allocation.

Start splitting code allocation from modules by introducing execmem_alloc()
and execmem_free() APIs.

Initially, execmem_alloc() is a wrapper for module_alloc() and
execmem_free() is a replacement of module_memfree() to allow updating all
call sites to use the new APIs.

Since architectures define different restrictions on placement,
permissions, alignment and other parameters for memory that can be used by
different subsystems that allocate executable memory, execmem_alloc() takes
a type argument, that will be used to identify the calling subsystem and to
allow architectures define parameters for ranges suitable for that
subsystem.

No functional changes.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:43 -07:00
bc6b94d3ea module: make module_memory_{alloc,free} more self-contained
Move the logic related to the memory allocation and freeing into
module_memory_alloc() and module_memory_free().

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:43 -07:00
086437d94a kallsyms: replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces. The goal is to remove its use completely [2].

namebuf is eventually cleaned of any trailing llvm suffixes using
strstr(). This hints that namebuf should be NUL-terminated.

static void cleanup_symbol_name(char *s)
{
	char *res;
	...
	res = strstr(s, ".llvm.");
	...
}

Due to this, use strscpy() over strncpy() as it guarantees
NUL-termination on the destination buffer. Drop the -1 from the length
calculation as it is no longer needed to ensure NUL-termination.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html
Link: https://github.com/KSPP/linux/issues/90 [2]
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:43 -07:00
8d0b728840 module: allow UNUSED_KSYMS_WHITELIST to be relative against objtree.
If UNUSED_KSYMS_WHITELIST is a file generated
before Kbuild runs, and the source tree is in
a read-only filesystem, the developer must put
the file somewhere and specify an absolute
path to UNUSED_KSYMS_WHITELIST. This worked,
but if IKCONFIG=y, an absolute path is embedded
into .config and eventually into vmlinux, causing
the build to be less reproducible when building
on a different machine.

This patch makes the handling of
UNUSED_KSYMS_WHITELIST to be similar to
MODULE_SIG_KEY.

First, check if UNUSED_KSYMS_WHITELIST is an
absolute path, just as before this patch. If so,
use the path as is.

If it is a relative path, use wildcard to check
the existence of the file below objtree first.
If it does not exist, fall back to the original
behavior of adding $(srctree)/ before the value.

After this patch, the developer can put the generated
file in objtree, then use a relative path against
objtree in .config, eradicating any absolute paths
that may be evaluated differently on different machines.

Signed-off-by: Yifan Hong <elsk@google.com>
Reviewed-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:43 -07:00
d2cc859cc8 ftrace: Remove unused global 'ftrace_direct_func_count'
Commit 8788ca164e ("ftrace: Remove the legacy _ftrace_direct API")
stopped setting the 'ftrace_direct_func_count' variable, but left
it around.  Clean it up.

Link: https://lore.kernel.org/linux-trace-kernel/20240506233305.215735-1-linux@treblig.org

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-14 02:09:40 -04:00
c9d5b7b826 ftrace: Remove unused list 'ftrace_direct_funcs'
Commit 8788ca164e ("ftrace: Remove the legacy _ftrace_direct API")
stopped using 'ftrace_direct_funcs' (and the associated
struct ftrace_direct_func).  Remove them.

Build tested only (on x86-64 with FTRACE and DYNAMIC_FTRACE
enabled)

Link: https://lore.kernel.org/linux-trace-kernel/20240504132303.67538-1-linux@treblig.org

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-14 02:09:00 -04:00
6e5a0c30b6 Merge tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Add cpufreq pressure feedback for the scheduler

 - Rework misfit load-balancing wrt affinity restrictions

 - Clean up and simplify the code around ::overutilized and
   ::overload access.

 - Simplify sched_balance_newidle()

 - Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES
   handling that changed the output.

 - Rework & clean up <asm/vtime.h> interactions wrt arch_vtime_task_switch()

 - Reorganize, clean up and unify most of the higher level
   scheduler balancing function names around the sched_balance_*()
   prefix

 - Simplify the balancing flag code (sched_balance_running)

 - Miscellaneous cleanups & fixes

* tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  sched/pelt: Remove shift of thermal clock
  sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()
  thermal/cpufreq: Remove arch_update_thermal_pressure()
  sched/cpufreq: Take cpufreq feedback into account
  cpufreq: Add a cpufreq pressure feedback for the scheduler
  sched/fair: Fix update of rd->sg_overutilized
  sched/vtime: Do not include <asm/vtime.h> header
  s390/irq,nmi: Include <asm/vtime.h> header directly
  s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
  sched/vtime: Get rid of generic vtime_task_switch() implementation
  sched/vtime: Remove confusing arch_vtime_task_switch() declaration
  sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
  sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()
  sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED
  sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()
  sched/fair: Rename root_domain::overload to ::overloaded
  sched/fair: Use helper functions to access root_domain::overload
  sched/fair: Check root_domain::overload value before update
  sched/fair: Combine EAS check with root_domain::overutilized access
  sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
  ...
2024-05-13 17:18:51 -07:00
17ca7fc22f Merge tag 'perf-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:

 - Combine perf and BPF for fast evalution of HW breakpoint
   conditions

 - Add LBR capture support outside of hardware events

 - Trigger IO signals for watermark_wakeup

 - Add RAPL support for Intel Arrow Lake and Lunar Lake

 - Optimize frequency-throttling

 - Miscellaneous cleanups & fixes

* tag 'perf-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  perf/bpf: Mark perf_event_set_bpf_handler() and perf_event_free_bpf_handler() as inline too
  selftests/perf_events: Test FASYNC with watermark wakeups
  perf/ring_buffer: Trigger IO signals for watermark_wakeup
  perf: Move perf_event_fasync() to perf_event.h
  perf/bpf: Change the !CONFIG_BPF_SYSCALL stubs to static inlines
  selftest/bpf: Test a perf BPF program that suppresses side effects
  perf/bpf: Allow a BPF program to suppress all sample side effects
  perf/bpf: Remove unneeded uses_default_overflow_handler()
  perf/bpf: Call BPF handler directly, not through overflow machinery
  perf/bpf: Remove #ifdef CONFIG_BPF_SYSCALL from struct perf_event members
  perf/bpf: Create bpf_overflow_handler() stub for !CONFIG_BPF_SYSCALL
  perf/bpf: Reorder bpf_overflow_handler() ahead of __perf_event_overflow()
  perf/x86/rapl: Add support for Intel Lunar Lake
  perf/x86/rapl: Add support for Intel Arrow Lake
  perf/core: Reduce PMU access to adjust sample freq
  perf/core: Optimize perf_adjust_freq_unthr_context()
  perf/x86/amd: Don't reject non-sampling events with configured LBR
  perf/x86/amd: Support capturing LBR from software events
  perf/x86/amd: Avoid taking branches before disabling LBR
  perf/x86/amd: Ensure amd_pmu_core_disable_all() is always inlined
  ...
2024-05-13 17:13:47 -07:00
48fc82c40b Merge tag 'locking-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:

 - Over a dozen code generation micro-optimizations for the atomic
   and spinlock code

 - Add more __ro_after_init attributes

 - Robustify the lockdevent_*() macros

* tag 'locking-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/pvqspinlock/x86: Use _Q_LOCKED_VAL in PV_UNLOCK_ASM macro
  locking/qspinlock/x86: Micro-optimize virt_spin_lock()
  locking/atomic/x86: Merge __arch{,_try}_cmpxchg64_emu_local() with __arch{,_try}_cmpxchg64_emu()
  locking/atomic/x86: Introduce arch_try_cmpxchg64_local()
  locking/pvqspinlock/x86: Remove redundant CMP after CMPXCHG in __raw_callee_save___pv_queued_spin_unlock()
  locking/pvqspinlock: Use try_cmpxchg() in qspinlock_paravirt.h
  locking/pvqspinlock: Use try_cmpxchg_acquire() in trylock_clear_pending()
  locking/qspinlock: Use atomic_try_cmpxchg_relaxed() in xchg_tail()
  locking/atomic/x86: Define arch_atomic_sub() family using arch_atomic_add() functions
  locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions
  locking/atomic/x86: Introduce arch_atomic64_read_nonatomic() to x86_32
  locking/atomic/x86: Introduce arch_atomic64_try_cmpxchg() to x86_32
  locking/atomic/x86: Introduce arch_try_cmpxchg64() for !CONFIG_X86_CMPXCHG64
  locking/atomic/x86: Modernize x86_32 arch_{,try_}_cmpxchg64{,_local}()
  locking/atomic/x86: Correct the definition of __arch_try_cmpxchg128()
  x86/tsc: Make __use_tsc __ro_after_init
  x86/kvm: Make kvm_async_pf_enabled __ro_after_init
  context_tracking: Make context_tracking_key __ro_after_init
  jump_label,module: Don't alloc static_key_mod for __ro_after_init keys
  locking/qspinlock: Always evaluate lockevent* non-event parameter once
2024-05-13 17:01:28 -07:00
347bd7f072 tracing: Improve benchmark test performance by using do_div()
Partially revert commit d6cb38e108 ("tracing: Use div64_u64() instead
of do_div()") and use do_div() again to utilize its faster 64-by-32
division compared to the 64-by-64 division done by div64_u64().

Explicitly cast the divisor bm_cnt to u32 to prevent a Coccinelle
warning reported by do_div.cocci. The warning was removed with commit
d6cb38e108 ("tracing: Use div64_u64() instead of do_div()").

Using the faster 64-by-32 division and casting bm_cnt to u32 is safe
because we return early from trace_do_benchmark() if bm_cnt > UINT_MAX.
This approach is already used twice in trace_do_benchmark() when
calculating the standard deviation:

	do_div(stddev, (u32)bm_cnt);
	do_div(stddev, (u32)bm_cnt - 1);

Link: https://lore.kernel.org/linux-trace-kernel/20240329160229.4874-2-thorsten.blum@toblux.com

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13 20:00:57 -04:00
fe832be05a ring-buffer: Have mmapped ring buffer keep track of missed events
While testing libtracefs on the mmapped ring buffer, the test that checks
if missed events are accounted for failed when using the mapped buffer.
This is because the mapped page does not update the missed events that
were dropped because the writer filled up the ring buffer before the
reader could catch it.

Add the missed events to the reader page/sub-buffer when the IOCTL is done
and a new reader page is acquired.

Note that all accesses to the reader_page via rb_page_commit() had to be
switched to rb_page_size(), and rb_page_size() which was just a copy of
rb_page_commit() but now it masks out the RB_MISSED bits. This is needed
as the mapped reader page is still active in the ring buffer code and
where it reads the commit field of the bpage for the size, it now must
mask it otherwise the missed bits that are now set will corrupt the size
returned.

Link: https://lore.kernel.org/linux-trace-kernel/20240312175405.12fb6726@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13 19:55:22 -04:00
6e62702feb Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2024-05-13

We've added 119 non-merge commits during the last 14 day(s) which contain
a total of 134 files changed, 9462 insertions(+), 4742 deletions(-).

The main changes are:

1) Add BPF JIT support for 32-bit ARCv2 processors, from Shahab Vahedi.

2) Add BPF range computation improvements to the verifier in particular
   around XOR and OR operators, refactoring of checks for range computation
   and relaxing MUL range computation so that src_reg can also be an unknown
   scalar, from Cupertino Miranda.

3) Add support to attach kprobe BPF programs through kprobe_multi link in
   a session mode, meaning, a BPF program is attached to both function entry
   and return, the entry program can decide if the return program gets
   executed and the entry program can share u64 cookie value with return
   program. Session mode is a common use-case for tetragon and bpftrace,
   from Jiri Olsa.

4) Fix a potential overflow in libbpf's ring__consume_n() and improve libbpf
   as well as BPF selftest's struct_ops handling, from Andrii Nakryiko.

5) Improvements to BPF selftests in context of BPF gcc backend,
   from Jose E. Marchesi & David Faust.

6) Migrate remaining BPF selftest tests from test_sock_addr.c to prog_test-
   -style in order to retire the old test, run it in BPF CI and additionally
   expand test coverage, from Jordan Rife.

7) Big batch for BPF selftest refactoring in order to remove duplicate code
   around common network helpers, from Geliang Tang.

8) Another batch of improvements to BPF selftests to retire obsolete
   bpf_tcp_helpers.h as everything is available vmlinux.h,
   from Martin KaFai Lau.

9) Fix BPF map tear-down to not walk the map twice on free when both timer
   and wq is used, from Benjamin Tissoires.

10) Fix BPF verifier assumptions about socket->sk that it can be non-NULL,
    from Alexei Starovoitov.

11) Change BTF build scripts to using --btf_features for pahole v1.26+,
    from Alan Maguire.

12) Small improvements to BPF reusing struct_size() and krealloc_array(),
    from Andy Shevchenko.

13) Fix s390 JIT to emit a barrier for BPF_FETCH instructions,
    from Ilya Leoshkevich.

14) Extend TCP ->cong_control() callback in order to feed in ack and
    flag parameters and allow write-access to tp->snd_cwnd_stamp
    from BPF program, from Miao Xu.

15) Add support for internal-only per-CPU instructions to inline
    bpf_get_smp_processor_id() helper call for arm64 and riscv64 BPF JITs,
    from Puranjay Mohan.

16) Follow-up to remove the redundant ethtool.h from tooling infrastructure,
    from Tushar Vyavahare.

17) Extend libbpf to support "module:<function>" syntax for tracing
    programs, from Viktor Malik.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
  bpf: make list_for_each_entry portable
  bpf: ignore expected GCC warning in test_global_func10.c
  bpf: disable strict aliasing in test_global_func9.c
  selftests/bpf: Free strdup memory in xdp_hw_metadata
  selftests/bpf: Fix a few tests for GCC related warnings.
  bpf: avoid gcc overflow warning in test_xdp_vlan.c
  tools: remove redundant ethtool.h from tooling infra
  selftests/bpf: Expand ATTACH_REJECT tests
  selftests/bpf: Expand getsockname and getpeername tests
  sefltests/bpf: Expand sockaddr hook deny tests
  selftests/bpf: Expand sockaddr program return value tests
  selftests/bpf: Retire test_sock_addr.(c|sh)
  selftests/bpf: Remove redundant sendmsg test cases
  selftests/bpf: Migrate ATTACH_REJECT test cases
  selftests/bpf: Migrate expected_attach_type tests
  selftests/bpf: Migrate wildcard destination rewrite test
  selftests/bpf: Migrate sendmsg6 v4 mapped address tests
  selftests/bpf: Migrate sendmsg deny test cases
  selftests/bpf: Migrate WILDCARD_IP test
  selftests/bpf: Handle SYSCALL_EPERM and SYSCALL_ENOTSUPP test cases
  ...
====================

Link: https://lore.kernel.org/r/20240513134114.17575-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13 16:41:10 -07:00
33f137143e ftrace: Use asynchronous grace period for register_ftrace_direct()
When running heavy test workloads with KASAN enabled, RCU Tasks grace
periods can extend for many tens of seconds, significantly slowing
trace registration.  Therefore, make the registration-side RCU Tasks
grace period be asynchronous via call_rcu_tasks().

Link: https://lore.kernel.org/linux-trace-kernel/ac05be77-2972-475b-9b57-56bef15aa00a@paulmck-laptop

Reported-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: Alexei Starovoitov <ast@kernel.org>
Reported-by: Chris Mason <clm@fb.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13 19:36:40 -04:00
c5963a0990 ftrace: Replaces simple_strtoul in ftrace
The function simple_strtoul performs no error checking in scenarios
where the input value overflows the intended output variable.
This results in this function successfully returning, even when the
output does not match the input string (aka the function returns
successfully even when the result is wrong).

Or as it was mentioned [1], "...simple_strtol(), simple_strtoll(),
simple_strtoul(), and simple_strtoull() functions explicitly ignore
overflows, which may lead to unexpected results in callers."
Hence, the use of those functions is discouraged.

This patch replaces all uses of the simple_strtoul with the safer
alternatives kstrtoul and kstruint.

Callers affected:
- add_rec_by_index
- set_graph_max_depth_function

Side effects of this patch:
- Since `fgraph_max_depth` is an `unsigned int`, this patch uses
  kstrtouint instead of kstrtoul to avoid any compiler warnings
  that could originate from calling the latter.
- This patch ensures that the callers of kstrtou* return accordingly
  when kstrtoul and kstruint fail for some reason.
  In this case, both callers this patch is addressing return 0 on error.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#simple-strtol-simple-strtoll-simple-strtoul-simple-strtoull

Link: https://lore.kernel.org/linux-trace-kernel/GV1PR10MB656333529A8D7B8AFB28D238E8B4A@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM

Signed-off-by: Yuran Pereira <yuran.pereira@hotmail.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13 19:36:40 -04:00
cf9f0f7c4c tracing: Allow user-space mapping of the ring-buffer
Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is imposible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
     ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.

Mapping will prevent snapshot and buffer size modifications.

Link: https://lore.kernel.org/linux-trace-kernel/20240510140435.3550353-4-vdonnefort@google.com

CC: <linux-mm@kvack.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13 18:09:56 -04:00
117c39200d ring-buffer: Introducing ring-buffer mapping functions
In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, currently they are
  only identified through their in-kernel VA.

  A meta-page, where are stored ring-buffer statistics and a
  description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

Link: https://lore.kernel.org/linux-trace-kernel/20240510140435.3550353-3-vdonnefort@google.com

CC: <linux-mm@kvack.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13 18:09:56 -04:00
c09d4167b5 ring-buffer: Allocate sub-buffers with __GFP_COMP
In preparation for the ring-buffer memory mapping, allocate compound
pages for the ring-buffer sub-buffers to enable us to map them to
user-space with vm_insert_pages().

Link: https://lore.kernel.org/linux-trace-kernel/20240510140435.3550353-2-vdonnefort@google.com

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13 18:09:55 -04:00
84c7d76b5a Merge tag 'v6.10-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Remove crypto stats interface

  Algorithms:
   - Add faster AES-XTS on modern x86_64 CPUs
   - Forbid curves with order less than 224 bits in ecc (FIPS 186-5)
   - Add ECDSA NIST P521

  Drivers:
   - Expose otp zone in atmel
   - Add dh fallback for primes > 4K in qat
   - Add interface for live migration in qat
   - Use dma for aes requests in starfive
   - Add full DMA support for stm32mpx in stm32
   - Add Tegra Security Engine driver

  Others:
   - Introduce scope-based x509_certificate allocation"

* tag 'v6.10-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (123 commits)
  crypto: atmel-sha204a - provide the otp content
  crypto: atmel-sha204a - add reading from otp zone
  crypto: atmel-i2c - rename read function
  crypto: atmel-i2c - add missing arg description
  crypto: iaa - Use kmemdup() instead of kzalloc() and memcpy()
  crypto: sahara - use 'time_left' variable with wait_for_completion_timeout()
  crypto: api - use 'time_left' variable with wait_for_completion_killable_timeout()
  crypto: caam - i.MX8ULP donot have CAAM page0 access
  crypto: caam - init-clk based on caam-page0-access
  crypto: starfive - Use fallback for unaligned dma access
  crypto: starfive - Do not free stack buffer
  crypto: starfive - Skip unneeded fallback allocation
  crypto: starfive - Skip dma setup for zeroed message
  crypto: hisilicon/sec2 - fix for register offset
  crypto: hisilicon/debugfs - mask the unnecessary info from the dump
  crypto: qat - specify firmware files for 402xx
  crypto: x86/aes-gcm - simplify GCM hash subkey derivation
  crypto: x86/aes-gcm - delete unused GCM assembly code
  crypto: x86/aes-xts - simplify loop in xts_crypt_slowpath()
  hwrng: stm32 - repair clock handling
  ...
2024-05-13 14:53:05 -07:00
87caef4220 Merge tag 'hardening-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
 "The bulk of the changes here are related to refactoring and expanding
  the KUnit tests for string helper and fortify behavior.

  Some trivial strncpy replacements in fs/ were carried in my tree. Also
  some fixes to SCSI string handling were carried in my tree since the
  helper for those was introduce here. Beyond that, just little fixes
  all around: objtool getting confused about LKDTM+KCFI, preparing for
  future refactors (constification of sysctl tables, additional
  __counted_by annotations), a Clang UBSAN+i386 crash fix, and adding
  more options in the hardening.config Kconfig fragment.

  Summary:

   - selftests: Add str*cmp tests (Ivan Orlov)

   - __counted_by: provide UAPI for _le/_be variants (Erick Archer)

   - Various strncpy deprecation refactors (Justin Stitt)

   - stackleak: Use a copy of soon-to-be-const sysctl table (Thomas
     Weißschuh)

   - UBSAN: Work around i386 -regparm=3 bug with Clang prior to
     version 19

   - Provide helper to deal with non-NUL-terminated string copying

   - SCSI: Fix older string copying bugs (with new helper)

   - selftests: Consolidate string helper behavioral tests

   - selftests: add memcpy() fortify tests

   - string: Add additional __realloc_size() annotations for "dup"
     helpers

   - LKDTM: Fix KCFI+rodata+objtool confusion

   - hardening.config: Enable KCFI"

* tag 'hardening-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (29 commits)
  uapi: stddef.h: Provide UAPI macros for __counted_by_{le, be}
  stackleak: Use a copy of the ctl_table argument
  string: Add additional __realloc_size() annotations for "dup" helpers
  kunit/fortify: Fix replaced failure path to unbreak __alloc_size
  hardening: Enable KCFI and some other options
  lkdtm: Disable CFI checking for perms functions
  kunit/fortify: Add memcpy() tests
  kunit/fortify: Do not spam logs with fortify WARNs
  kunit/fortify: Rename tests to use recommended conventions
  init: replace deprecated strncpy with strscpy_pad
  kunit/fortify: Fix mismatched kvalloc()/vfree() usage
  scsi: qla2xxx: Avoid possible run-time warning with long model_num
  scsi: mpi3mr: Avoid possible run-time warning with long manufacturer strings
  scsi: mptfusion: Avoid possible run-time warning with long manufacturer strings
  fs: ecryptfs: replace deprecated strncpy with strscpy
  hfsplus: refactor copy_name to not use strncpy
  reiserfs: replace deprecated strncpy with scnprintf
  virt: acrn: replace deprecated strncpy with strscpy
  ubsan: Avoid i386 UBSAN handler crashes with Clang
  ubsan: Remove 1-element array usage in debug reporting
  ...
2024-05-13 14:14:05 -07:00
1ba58f1ae9 Merge tag 'seccomp-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp update from Kees Cook:

 - Prepare for sysctl table constification

* tag 'seccomp-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  seccomp: Constify sysctl subhelpers
2024-05-13 13:56:36 -07:00
c9f9df3f63 Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2024-05-13

We've added 3 non-merge commits during the last 2 day(s) which contain
a total of 2 files changed, 62 insertions(+), 8 deletions(-).

The main changes are:

1) Fix a case where syzkaller found that it's unexpectedly possible
   to attach a cgroup_skb program to the sockopt hooks. The fix adds
   missing attach_type enforcement for the link_create case along
   with selftests, from Stanislav Fomichev.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Add sockopt case to verify prog_type
  selftests/bpf: Extend sockopt tests to use BPF_LINK_CREATE
  bpf: Add BPF_PROG_TYPE_CGROUP_SKB attach type enforcement in BPF_LINK_CREATE
====================

Link: https://lore.kernel.org/r/20240513041845.31040-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13 13:10:48 -07:00
de1c2722e0 Merge branches 'pm-em' and 'pm-docs'
Merge Enery Model update and a power management documentation update for
6.10:

 - Make the Samsung exynos-asv driver update the Energy Model after
   adjusting voltage on top of some preliminary changes of the OPP and
   Enery Model generic code (Lukasz Luba).

 - Remove a reference to a function that has been dropped from the power
   management documentation (Bjorn Helgaas).

* pm-em:
  soc: samsung: exynos-asv: Update Energy Model after adjusting voltage
  PM: EM: Add em_dev_update_chip_binning()
  PM: EM: Refactor em_adjust_new_capacity()
  OPP: OF: Export dev_opp_pm_calc_power() for usage from EM

* pm-docs:
  Documentation: PM: Update platform_pci_wakeup_init() reference
2024-05-13 20:19:58 +02:00
440f9d47df Merge branches 'pm-cpuidle', 'pm-sleep' and 'pm-powercap'
Merge cpuidle updates, changes related to system sleep and power capping
updates for 6.10:

 - Fix kerneldoc description of ladder_do_selection() (Jeff Johnson).

 - Convert the cpuidle kirkwood driver to platform remove callback
   returning void (Yangtao Li).

 - Replace deprecated strncpy() with strscpy() in the hibernation core
   code (Justin Stitt).

 - Use %ps to simplify debug output in the core system-wide suspend and
   resume code (Len Brown).

 - Remove unnecessary else from device_init_wakeup() and make
   device_wakeup_disable() return void (Dhruva Gole).

 - Enable PMU support in the Intel TPMI RAPL driver (Zhang Rui).

 - Add support for ArrowLake-H platform to the Intel RAPL driver (Zhang
   Rui).

 - Avoid explicit cpumask allocation on stack in DTPM (Dawei Li).

* pm-cpuidle:
  cpuidle: ladder: fix ladder_do_selection() kernel-doc
  cpuidle: kirkwood: Convert to platform remove callback returning void

* pm-sleep:
  PM: hibernate: replace deprecated strncpy() with strscpy()
  PM: sleep: Take advantage of %ps to simplify debug output
  PM: wakeup: Remove unnecessary else from device_init_wakeup()
  PM: wakeup: make device_wakeup_disable() return void

* pm-powercap:
  powercap: intel_rapl_tpmi: Enable PMU support
  powercap: intel_rapl: Introduce APIs for PMU support
  powercap: intel_rapl: Sort header files
  powercap: intel_rapl: Add support for ArrowLake-H platform
  powercap: DTPM: Avoid explicit cpumask allocation on stack
2024-05-13 20:14:10 +02:00
c07ea940a0 Merge tag 'kcsan.2024.05.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull kcsan update from Paul McKenney:
 "Introduce __data_racy type qualifier

  This adds a __data_racy type qualifier that enables kernel developers
  to inform KCSAN that a given variable is a shared variable without
  needing to mark each and every access.

  This allows pre-KCSAN code to be correctly (if approximately)
  instrumented withh very little effort, and also provides people
  reading the code a clear indication that the variable is in fact
  shared.

  In addition, it permits incremental transition to per-access KCSAN
  marking, so that (for example) a given subsystem can be transitioned
  one variable at a time, while avoiding large numbers of KCSAN warnings
  during this transition"

* tag 'kcsan.2024.05.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  kcsan, compiler_types: Introduce __data_racy type qualifier
2024-05-13 10:13:39 -07:00
c0b9620bc3 Merge tag 'rcu.next.v6.10' of https://github.com/urezki/linux
Pull RCU updates from Uladzislau Rezki:

 - Fix a lockdep complain for lazy-preemptible kernel, remove redundant
   BH disable for TINY_RCU, remove redundant READ_ONCE() in tree.c, fix
   false positives KCSAN splat and fix buffer overflow in the
   print_cpu_stall_info().

 - Misc updates related to bpf, tracing and update the MAINTAINERS file.

 - An improvement of a normal synchronize_rcu() call in terms of
   latency. It maintains a separate track for sync. users only. This
   approach bypasses per-cpu nocb-lists thus sync-users do not depend on
   nocb-list length and how fast regular callbacks are processed.

 - RCU tasks: switch tasks RCU grace periods to sleep at TASK_IDLE
   priority, fix some comments, add some diagnostic warning to the
   exit_tasks_rcu_start() and fix a buffer overflow in the
   show_rcu_tasks_trace_gp_kthread().

 - RCU torture: Increase memory to guest OS, fix a Tasks Rude RCU
   testing, some updates for TREE09, dump mode information to debug GP
   kthread state, remove redundant READ_ONCE(), fix some comments about
   RCU_TORTURE_PIPE_LEN and pipe_count, remove some redundant pointer
   initialization, fix a hung splat task by when the rcutorture tests
   start to exit, fix invalid context warning, add '--do-kvfree'
   parameter to torture test and use slow register unregister callbacks
   only for rcutype test.

* tag 'rcu.next.v6.10' of https://github.com/urezki/linux: (48 commits)
  rcutorture: Use rcu_gp_slow_register/unregister() only for rcutype test
  torture: Scale --do-kvfree test time
  rcutorture: Fix invalid context warning when enable srcu barrier testing
  rcutorture: Make stall-tasks directly exit when rcutorture tests end
  rcutorture: Removing redundant function pointer initialization
  rcutorture: Make rcutorture support print rcu-tasks gp state
  rcutorture: Use the gp_kthread_dbg operation specified by cur_ops
  rcutorture: Re-use value stored to ->rtort_pipe_count instead of re-reading
  rcutorture: Fix rcu_torture_one_read() pipe_count overflow comment
  rcutorture: Remove extraneous rcu_torture_pipe_update_one() READ_ONCE()
  rcu: Allocate WQ with WQ_MEM_RECLAIM bit set
  rcu: Support direct wake-up of synchronize_rcu() users
  rcu: Add a trace event for synchronize_rcu_normal()
  rcu: Reduce synchronize_rcu() latency
  rcu: Fix buffer overflow in print_cpu_stall_info()
  rcu: Mollify sparse with RCU guard
  rcu-tasks: Fix show_rcu_tasks_trace_gp_kthread buffer overflow
  rcu-tasks: Fix the comments for tasks_rcu_exit_srcu_stall_timer
  rcu-tasks: Replace exit_tasks_rcu_start() initialization with WARN_ON_ONCE()
  rcu: Remove redundant CONFIG_PROVE_RCU #if condition
  ...
2024-05-13 09:49:06 -07:00