Commit Graph

41275 Commits

Stefan Roesch
d7597f59d1 mm: add new api to enable ksm per process
Patch series "mm: process/cgroup ksm support", v9.

So far KSM can only be enabled by calling madvise for memory regions.  To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.

Use case 1:
  The madvise call is not available in the programming language.  Examples
  of this are programs with forked workloads using a garbage collected
  language without pointers.  In such a language madvise cannot be made
  available.

  In addition the addresses of objects get moved around as they are
  garbage collected.  KSM sharing needs to be enabled "from the outside"
  for these types of workloads.

Use case 2:
  The same interpreter can also be used for workloads where KSM brings
  no benefit or even has overhead.  We'd like to be able to enable KSM on
  a workload by workload basis.

Use case 3:
  With the madvise call, sharing opportunities are only enabled for the
  current process: it is a workload-local decision.  A considerable number
  of sharing opportunities may exist across multiple workloads or jobs (if
  they are part of the same security domain).  Only a higher-level entity
  like a job scheduler or container can know for certain if it's running
  one or more instances of a job.  That job scheduler, however, doesn't have
  the necessary internal workload knowledge to make targeted madvise
  calls.

Security concerns:

  In previous discussions security concerns have been brought up.  The
  problem is that an individual workload does not have the knowledge about
  what else is running on a machine.  Therefore it has to be very
  conservative about which memory areas can be shared.  However, if the
  system is dedicated to running multiple jobs within the same security
  domain, it's the job scheduler that has the knowledge that sharing can
  be safely enabled and is even desirable.

Performance:

  Experiments with using UKSM have shown a capacity increase of around 20%.

  Here are the metrics from an Instagram workload (taken from a machine
  with 64GB main memory):

   full_scans: 445
   general_profit: 20158298048
   max_page_sharing: 256
   merge_across_nodes: 1
   pages_shared: 129547
   pages_sharing: 5119146
   pages_to_scan: 4000
   pages_unshared: 1760924
   pages_volatile: 10761341
   run: 1
   sleep_millisecs: 20
   stable_node_chains: 167
   stable_node_chains_prune_millisecs: 2000
   stable_node_dups: 2751
   use_zero_pages: 0
   zero_pages_sharing: 0

After the service has been running for 30 minutes to an hour, 4 to 5
million shared pages are common for this workload when using KSM.


Detailed changes:

1. New options for the prctl system call
   This patch series adds two new options to the prctl system call.
   The first one allows KSM to be enabled at the process level and the
   second one queries the setting.

The setting will be inherited by child processes.

With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing
   When KSM is enabled at the process level, the KSM code will iterate
   over all the VMAs and enable KSM for the eligible VMAs.

   When forking a process that has KSM enabled, the setting will be
   inherited by the new child process.

3. Add general_profit metric
   The general_profit metric of KSM is specified in the documentation,
   but not calculated.  This adds the general profit metric to
   /sys/kernel/debug/mm/ksm.

4. Add more metrics to ksm_stat
   This adds the process profit metric to /proc/<pid>/ksm_stat.

5. Add more tests to ksm_tests and ksm_functional_tests
   This adds an option to specify the merge type to the ksm_tests.
   This allows testing both the madvise and prctl variants of KSM.

   It also adds two new tests to ksm_functional_tests: one to test
   the new prctl options and the other a fork test to verify that
   the KSM process setting is inherited by child processes.


This patch (of 3):

So far KSM can only be enabled by calling madvise for memory regions.  To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.

1. New options for the prctl system call

   This patch series adds two new options to the prctl system call.
   The first one allows KSM to be enabled at the process level and the
   second one queries the setting.

   The setting will be inherited by child processes.

   With the above setting, KSM can be enabled for the seed process of a
   cgroup and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing

   When KSM is enabled at the process level, the KSM code will iterate
   over all the VMAs and enable KSM for the eligible VMAs.

   When forking a process that has KSM enabled, the setting will be
   inherited by the new child process.

  1) Introduce new MMF_VM_MERGE_ANY flag

     This introduces the new MMF_VM_MERGE_ANY flag.  When this flag is
     set, kernel samepage merging (KSM) gets enabled for all VMAs of a
     process.

  2) Setting VM_MERGEABLE on VMA creation

     When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
     VM_MERGEABLE flag will be set for this VMA.

  3) support disabling of ksm for a process

     This adds the ability to disable ksm for a process if ksm has been
     enabled for the process with prctl.

  4) add new prctl option to get and set ksm for a process

     This adds two new options to the prctl system call:
     - enable KSM for all VMAs of a process (if the VMAs support it),
     - query whether KSM has been enabled for a process.

3. Disabling MMF_VM_MERGE_ANY for storage keys in s390

   On the s390 architecture, when storage keys are used, the
   MMF_VM_MERGE_ANY flag will be disabled.
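
A minimal userspace sketch of the resulting interface (the
PR_SET_MEMORY_MERGE/PR_GET_MEMORY_MERGE names and values are the ones
added by this series; the fallback defines are only needed with older
headers):

   #include <stdio.h>
   #include <sys/prctl.h>

   #ifndef PR_SET_MEMORY_MERGE
   #define PR_SET_MEMORY_MERGE 67
   #define PR_GET_MEMORY_MERGE 68
   #endif

   int main(void)
   {
           /* Enable KSM for all eligible VMAs of this process.  The
            * setting is inherited across fork(), so a job scheduler
            * can set it on the seed process of a cgroup.
            */
           if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
                   perror("PR_SET_MEMORY_MERGE");

           /* Query the setting. */
           printf("ksm enabled: %ld\n",
                  prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));
           return 0;
   }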

Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21 14:52:03 -07:00
Pavankumar Kondeti
1f6ab566cb printk: export console trace point for kcsan/kasan/kfence/kmsan
The console tracepoint is used by the kcsan/kasan/kfence/kmsan test
modules.  Since this tracepoint is not exported, these modules iterate
over all available tracepoints to find it.  Export the tracepoint so
that it can be used directly.
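
The export itself presumably amounts to a one-liner next to the
tracepoint definition in kernel/printk/printk.c, along the lines of:

   EXPORT_TRACEPOINT_SYMBOL_GPL(console);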

Link: https://lkml.kernel.org/r/20230413100859.1492323-1-quic_pkondeti@quicinc.com
Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Marco Elver <elver@google.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:30:11 -07:00
Josh Triplett
ddc65971bb prctl: add PR_GET_AUXV to copy auxv to userspace
If a library wants to get information from auxv (for instance,
AT_HWCAP/AT_HWCAP2), it has a few options, none of them perfectly reliable
or ideal:

- Be main or the pre-main startup code, and grub through the stack above
  main. Doesn't work for a library.
- Call libc getauxval. Not ideal for libraries that are trying to be
  libc-independent and/or don't otherwise require anything from other
  libraries.
- Open and read /proc/self/auxv. Doesn't work for libraries that may run
  in arbitrarily constrained environments that may not have /proc
  mounted (e.g. libraries that might be used by an init program or a
  container setup tool).
- Assume you're on the main thread and still on the original stack, and
  try to walk the stack upwards, hoping to find auxv. Extremely bad
  idea.
- Ask the caller to pass auxv in for you. Not ideal for a user-friendly
  library, and then your caller may have the same problem.

Add a prctl that copies current->mm->saved_auxv to a userspace buffer.
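
A userspace sketch (the PR_GET_AUXV value and return semantics are taken
from this patch; treat the details as illustrative):

   #include <elf.h>
   #include <stdio.h>
   #include <sys/auxv.h>
   #include <sys/prctl.h>

   #ifndef PR_GET_AUXV
   #define PR_GET_AUXV 0x41555856
   #endif

   int main(void)
   {
           Elf64_auxv_t auxv[64];
           /* arg4/arg5 must be 0; returns the size of saved_auxv */
           long ret = prctl(PR_GET_AUXV, auxv, sizeof(auxv), 0, 0);

           if (ret < 0)
                   return 1;
           for (size_t i = 0; i < sizeof(auxv) / sizeof(auxv[0]); i++) {
                   if (auxv[i].a_type == AT_NULL)
                           break;
                   if (auxv[i].a_type == AT_HWCAP)
                           printf("AT_HWCAP: %#lx\n",
                                  (unsigned long)auxv[i].a_un.a_val);
           }
           return 0;
   }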

Link: https://lkml.kernel.org/r/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.1680611394.git.josh@joshtriplett.org
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:29:53 -07:00
Yosry Ahmed
8bff9a04ca cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"
Patch series "memcg: avoid flushing stats atomically where possible", v3.

rstat flushing is an expensive operation that scales with the number of
cpus and the number of cgroups in the system.  The purpose of this series
is to minimize the contexts where we flush stats atomically.

Patches 1 and 2 are cleanups requested during reviews of prior versions of
this series.

Patch 3 makes sure we never try to flush from within an irq context.

Patches 4 to 7 introduce separate variants of mem_cgroup_flush_stats() for
atomic and non-atomic flushing, and make sure we only flush the stats
atomically when necessary.

Patch 8 is a slightly tangential optimization that limits the work done by
rstat flushing in some scenarios.


This patch (of 8):

cgroup_rstat_flush_irqsafe() can be a confusing name.  It may read as
"irqs are disabled throughout", which is what the current implementation
does (currently under discussion [1]), but is not the intention.  The
intention is that this function is safe to call from atomic contexts. 
Name it as such.

Link: https://lkml.kernel.org/r/20230330191801.1967435-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230330191801.1967435-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasily Averin <vasily.averin@linux.dev>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:29:49 -07:00
Andrew Morton
f8f238ffe5 sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-18 14:53:49 -07:00
Mathieu Desnoyers
b20b0368c6 mm: fix memory leak on mm_init error handling
Commit f1a7941243 ("mm: convert mm's rss stats into percpu_counter")
introduced a memory leak by missing a call to destroy_context() when a
percpu_counter fails to allocate.

Before introducing the per-cpu counter allocations, init_new_context() was
the last call that could fail in mm_init(), and thus there was no need to
ever invoke destroy_context() in the error paths.  Adding the following
percpu counter allocations adds error paths after init_new_context(),
which means its associated destroy_context() needs to be called when
percpu counters fail to allocate.
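
A simplified sketch of the resulting unwind order in mm_init() (labels
and helpers abridged):

   if (init_new_context(p, mm))
           goto fail_nocontext;

   for (i = 0; i < NR_MM_COUNTERS; i++)
           if (percpu_counter_init(&mm->rss_stat[i], 0,
                                   GFP_KERNEL_ACCOUNT))
                   goto fail_pcpu;
   ...
   fail_pcpu:
           while (i > 0)
                   percpu_counter_destroy(&mm->rss_stat[--i]);
           destroy_context(mm);            /* the previously missing step */
   fail_nocontext:
           mm_free_pgd(mm);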

Link: https://lkml.kernel.org/r/20230330133822.66271-1-mathieu.desnoyers@efficios.com
Fixes: f1a7941243 ("mm: convert mm's rss stats into percpu_counter")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 14:22:12 -07:00
Ondrej Mosnacek
659c0ce1cb kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()
Linux Security Modules (LSMs) that implement the "capable" hook will
usually emit an access denial message to the audit log whenever they
"block" the current task from using the given capability based on their
security policy.

The occurrence of a denial is used as an indication that the given task
has attempted an operation that requires the given access permission, so
the callers of functions that perform LSM permission checks must take care
to avoid calling them too early (before it is decided if the permission is
actually needed to perform the requested operation).

The __sys_setres[ug]id() functions violate this convention by first
calling ns_capable_setid() and only then checking if the operation
requires the capability or not.  It means that any caller that has the
capability granted by DAC (task's capability set) but not by MAC (LSMs)
will generate a "denied" audit record, even if is doing an operation for
which the capability is not required.

Fix this by reordering the checks such that ns_capable_setid() is checked
last and -EPERM is returned immediately if it returns false.

While there, also do two small optimizations:
* move the capability check before prepare_creds() and
* bail out early in case of a no-op.
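
The reordered flow then looks roughly like this (abridged sketch, not
the exact kernel code):

   /* bail out early in case of a no-op */
   if ((ruid == (uid_t) -1 || uid_eq(kruid, old->uid)) &&
       (euid == (uid_t) -1 || uid_eq(keuid, old->euid)) &&
       (suid == (uid_t) -1 || uid_eq(ksuid, old->suid)))
           return 0;

   /* a credential is "new" only if it matches none of the current ones */
   ruid_new = ruid != (uid_t) -1 && !uid_eq(kruid, old->uid) &&
              !uid_eq(kruid, old->euid) && !uid_eq(kruid, old->suid);
   /* euid_new/suid_new are computed the same way ... */

   /* ... and the LSM-visible capability check runs last, only when
    * the operation actually requires it
    */
   if ((ruid_new || euid_new || suid_new) &&
       !ns_capable_setid(old->user_ns, CAP_SETUID))
           return -EPERM;

   new = prepare_creds();  /* allocated only after the checks passed */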

Link: https://lkml.kernel.org/r/20230217162154.837549-1-omosnace@redhat.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 14:22:12 -07:00
Andrew Morton
e492cd61b9 sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-16 12:31:58 -07:00
Raghavendra K T
d46031f40e sched/numa: use hash_32 to mix up PIDs accessing VMA
Before: the last 6 bits of the PID are used as an index to store
information about the tasks accessing a VMA.

After: hash_32 is used to take care of cases where tasks are created
over a period of time, and thus reduce the probability of collisions.
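
Roughly (sketch; names as in the series):

   unsigned int pid_bit;

   /* hash the PID so tasks created over time don't collide in the
    * low bits
    */
   pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));
   __set_bit(pid_bit, &vma->numab_state->access_pids[1]);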

Result:
The patch series overall improves autonuma cost.

Kernbench showed more than 5% improvement, and system time in mmtest
autonuma showed more than 80% improvement.

Link: https://lkml.kernel.org/r/d5a9f75513300caed74e5c8570bba9317b963c2b.1677672277.git.raghavendra.kt@amd.com
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Disha Talreja <dishaa.talreja@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:03:03 -07:00
Raghavendra K T
20f586486b sched/numa: implement access PID reset logic
This helps to ensure that only tasks which recently accessed a VMA scan it.

Current implementation: (idea supported by PeterZ)

 1. Access PID information is maintained in two windows, with
    access_pids[1] being the newest.

 2. Old access PID info, i.e. access_pids[0], is reset every
    (4 * sysctl_numa_balancing_scan_delay) interval after the initial
    scan delay period expires.

The above interval seemed to be experimentally optimal, since it avoids
frequent resets of access info while still clearing the old access info
regularly.  The reset logic is implemented in the scan path.
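
A sketch of the reset in the scan path (field names illustrative):

   /* rotate the two windows every 4 * sysctl_numa_balancing_scan_delay */
   if (time_after(jiffies, vma->numab_state->next_pid_reset)) {
           vma->numab_state->next_pid_reset = jiffies +
                   msecs_to_jiffies(4 * sysctl_numa_balancing_scan_delay);
           vma->numab_state->access_pids[0] =
                   READ_ONCE(vma->numab_state->access_pids[1]);
           vma->numab_state->access_pids[1] = 0;
   }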

Link: https://lkml.kernel.org/r/f7a675f66d1442d048b4216b2baf94515012c405.1677672277.git.raghavendra.kt@amd.com
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Bharata B Rao <bharata@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Disha Talreja <dishaa.talreja@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:03:03 -07:00
Raghavendra K T
fc137c0dda sched/numa: enhance vma scanning logic
During NUMA scanning, make sure only relevant VMAs of the tasks are
scanned.

Before:
 All the tasks of a process participate in scanning a VMA even if they
 never access it in their lifespan.

Now:
 Except for the first few unconditional scans, if a task does not touch
 a VMA (excluding false-positive cases of PID collisions), it no longer
 scans that VMA.

Logic used:

1) 6 bits of the PID are used to mark an active bit in the vma numab
   status during fault, to remember the PIDs accessing the vma.  (Thanks
   Mel)

2) Subsequently in the scan path, vma scanning is skipped if the
   current PID had not accessed the vma.

3) The first two scans are allowed unconditionally, to preserve the
   earlier scanning behaviour.  (See the sketch below.)
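
The combined check, as it ends up after the whole series (sketch):

   static bool vma_is_accessed(struct vm_area_struct *vma)
   {
           unsigned long pids;

           /* allow the first two full scans unconditionally, so all
            * pages get prot_none faults regardless of accesses
            */
           if (READ_ONCE(current->mm->numa_scan_seq) < 2)
                   return true;

           pids = vma->numab_state->access_pids[0] |
                  vma->numab_state->access_pids[1];
           return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)),
                           &pids);
   }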

Acknowledgements to Bharata B Rao <bharata@amd.com> for the initial
patch to store PID information, and to Peter Zijlstra
<peterz@infradead.org> (usage of test and set bit).

Link: https://lkml.kernel.org/r/092f03105c7c1d3450f4636b1ea350407f07640e.1677672277.git.raghavendra.kt@amd.com
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Disha Talreja <dishaa.talreja@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:03:03 -07:00
Mel Gorman
ef6a22b70f sched/numa: apply the scan delay to every new vma
Patch series "sched/numa: Enhance vma scanning", v3.

The patchset proposes one of the enhancements to numa vma scanning
suggested by Mel.  This is continuation of [3].

Reposting the rebased patchset to akpm mm-unstable tree (March 1) 

The existing scan period mechanism derives the scan period from
per-thread stats.  Process Adaptive autoNUMA [1] proposed gathering NUMA
fault stats at the per-process level to capture application behaviour
better.

During the course of that discussion, Mel proposed several ideas to
enhance current NUMA balancing.  One of the suggestions was the
following:

Track what threads access a VMA.  The suggestion was to use an unsigned
long pid_mask and use the lower bits to tag approximately what threads
access a VMA.  Skip VMAs that did not trap a fault.  This would be
approximate because of PID collisions but would reduce scanning of areas
the thread is not interested in.  The above suggestion intends not to
penalize threads that have no interest in the VMA, thus reducing
scanning overhead.

V3 changes are mostly based on PeterZ's comments (details below in the changelog)

Summary of patchset:

Current patchset implements:

1. Delay the vma scanning logic for newly created VMAs so that the
   additional overhead of scanning is not incurred for short-lived tasks
   (implementation by Mel)

2. Store the information of tasks accessing a VMA in 2 windows.  It is
   regularly cleared at a (4*sysctl_numa_balancing_scan_delay) interval.
   The above time was derived from experimenting (suggested by PeterZ)
   to balance between frequent clearing vs obsolete access data

3. hash_32 is used to encode the task index when storing VMA access
   information

4. A VMA's access information is used to skip scanning for the tasks
   which had not accessed the VMA

Changes since V2:
Patch 1:
 - Rename structure, change macro to function
 - Add explanation of heuristics
 - Add more details from results (PeterZ)
Patch 2:
 - Usage of test and set bit (PeterZ)
 - Move storing of access PID info to numa_migrate_prep()
 - Add a note on fairness among tasks allowed to scan (PeterZ)
Patch 3:
 - Maintain two windows of access PID information
   (PeterZ supported the implementation and gave the idea to extend
   to N windows if needed)
Patch 4:
 - Apply hash_32 function to track VMA-accessing PIDs (PeterZ)

Changes since RFC V1:
 - Include Mel's vma scan delay patch
 - Change the accessing pid store logic (Thanks Mel)
 - Fencing structure / code to NUMA_BALANCING (David, Mel)
 - Adding clearing access PID logic (Mel)
 - Descriptive change log (Mike Rapoport)

Things to ponder over:
==========================================

- Improvement to the logic for clearing accessed PIDs (discussed in
  detail in patch 3 itself; done in this patchset by implementing a
  2-window history)

- Current scan period is not changed in the patchset, so we do see
  frequent tries to scan.  Relaxing scan period dynamically could improve
  results further.

[1] sched/numa: Process Adaptive autoNUMA 
 Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/

[2] RFC V1 Link: 
  https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/

[3] V2 Link:
  https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@amd.com/


Results:
Summary: A huge autonuma cost reduction is seen in mmtest.  The kernbench
improvement is more than 5%, and there is a huge system time improvement
(80%+) in mmtest autonuma.  (dbench had too large a standard deviation
to post.)

kernbench
===========
                      6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Amean     user-256    22002.51 (   0.00%)    22649.95 *  -2.94%*
Amean     syst-256    10162.78 (   0.00%)     8214.13 *  19.17%*
Amean     elsp-256      160.74 (   0.00%)      156.92 *   2.38%*

Duration User       66017.43    67959.84
Duration System     30503.15    24657.03
Duration Elapsed      504.61      493.12

                      6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Ops NUMA alloc hit                1738835089.00  1738780310.00
Ops NUMA alloc local              1738834448.00  1738779711.00
Ops NUMA base-page range updates      477310.00      392566.00
Ops NUMA PTE updates                  477310.00      392566.00
Ops NUMA hint faults                   96817.00       87555.00
Ops NUMA hint local faults %           10150.00        2192.00
Ops NUMA hint local percent               10.48           2.50
Ops NUMA pages migrated                86660.00       85363.00
Ops AutoNUMA cost                        489.07         442.14

autonumabench
===============
                      6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Amean     syst-NUMA01                  399.50 (   0.00%)       52.05 *  86.97%*
Amean     syst-NUMA01_THREADLOCAL        0.21 (   0.00%)        0.22 *  -5.41%*
Amean     syst-NUMA02                    0.80 (   0.00%)        0.78 *   2.68%*
Amean     syst-NUMA02_SMT                0.65 (   0.00%)        0.68 *  -3.95%*
Amean     elsp-NUMA01                  313.26 (   0.00%)      313.11 *   0.05%*
Amean     elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.08 *  -1.76%*
Amean     elsp-NUMA02                    3.19 (   0.00%)        3.24 *  -1.52%*
Amean     elsp-NUMA02_SMT                3.72 (   0.00%)        3.61 *   2.92%*

Duration User      396433.47   324835.96
Duration System      2808.70      376.66
Duration Elapsed     2258.61     2258.12

                      6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Ops NUMA alloc hit                  59921806.00    49623489.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local                59920880.00    49622594.00
Ops NUMA base-page range updates   152259275.00       50075.00
Ops NUMA PTE updates               152259275.00       50075.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults               154660352.00       39014.00
Ops NUMA hint local faults %       138550501.00       23139.00
Ops NUMA hint local percent               89.58          59.31
Ops NUMA pages migrated              8179067.00       14147.00
Ops AutoNUMA cost                     774522.98         195.69


This patch (of 4):

Currently, whenever a new task is created, we wait for
sysctl_numa_balancing_scan_delay to avoid unnecessary scanning overhead.
Extend the same logic to new or very short-lived VMAs.
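
A sketch of the delay as applied in the scan path (field names
illustrative):

   /* stamp a new VMA on first sight ... */
   if (!vma->numab_state) {
           vma->numab_state = kzalloc(sizeof(*vma->numab_state),
                                      GFP_KERNEL);
           if (!vma->numab_state)
                   continue;
           vma->numab_state->next_scan = now +
                   msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
   }

   /* ... and skip it until the scan delay has expired */
   if (mm->numa_scan_seq && time_before(jiffies,
                                        vma->numab_state->next_scan))
           continue;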

[raghavendra.kt@amd.com: add initialization in vm_area_dup()]
Link: https://lkml.kernel.org/r/cover.1677672277.git.raghavendra.kt@amd.com
Link: https://lkml.kernel.org/r/7a6fbba87c8b51e67efd3e74285bb4cb311a16ca.1677672277.git.raghavendra.kt@amd.com
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Disha Talreja <dishaa.talreja@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:03:03 -07:00
Suren Baghdasaryan
c7f8f31c00 mm: separate vma->lock from vm_area_struct
vma->lock being part of the vm_area_struct causes performance regression
during page faults because, during contention, its count and owner fields
are constantly updated, and having other parts of vm_area_struct used
during page fault handling next to them causes constant cache line
bouncing.  Fix that by moving the lock outside of the vm_area_struct.

All attempts to keep vma->lock inside vm_area_struct in a separate cache
line still produce performance regressions, especially on NUMA machines.
The smallest regression was achieved when the lock was placed in the
fourth cache line, but that bloats vm_area_struct to 256 bytes.

Considering performance and memory impact, separate lock looks like the
best option.  It increases memory footprint of each VMA but that can be
optimized later if the new size causes issues.  Note that after this
change vma_init() does not allocate or initialize vma->lock anymore.  A
number of drivers allocate a pseudo VMA on the stack but they never use
the VMA's lock, therefore it does not need to be allocated.  The future
drivers which might need the VMA lock should use
vm_area_alloc()/vm_area_free() to allocate the VMA.

Link: https://lkml.kernel.org/r/20230227173632.3292573-34-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:03:02 -07:00
Suren Baghdasaryan
0d2ebf9c3f mm/mmap: free vm_area_struct without call_rcu in exit_mmap
call_rcu() can take a long time when callback offloading is enabled.  Its
use in the vm_area_free can cause regressions in the exit path when
multiple VMAs are being freed.

Because exit_mmap() is called only after the last mm user drops its
refcount, the page fault handlers can't be racing with it.  Any other
possible user like oom-reaper or process_mrelease are already synchronized
using mmap_lock.  Therefore exit_mmap() can free VMAs directly, without
the use of call_rcu().

Expose __vm_area_free() and use it from exit_mmap() to avoid possible
call_rcu() floods and performance regressions caused by it.

Link: https://lkml.kernel.org/r/20230227173632.3292573-33-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:03:02 -07:00
Suren Baghdasaryan
f2e13784c1 kernel/fork: assert no VMA readers during its destruction
Assert there are no holders of VMA lock for reading when it is about to be
destroyed.

Link: https://lkml.kernel.org/r/20230227173632.3292573-21-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:02:59 -07:00
Suren Baghdasaryan
5e31275cc9 mm: add per-VMA lock and helper functions to control it
Introduce per-VMA locking.  The lock implementation relies on per-vma
and per-mm sequence counters to note exclusive locking:

  - read lock - (implemented by vma_start_read) requires the vma
    (vm_lock_seq) and mm (mm_lock_seq) sequence counters to differ.
    If they match then there must be a vma exclusive lock held somewhere.
  - read unlock - (implemented by vma_end_read) is a trivial vma->lock
    unlock.
  - write lock - (vma_start_write) requires the mmap_lock to be held
    exclusively and the current mm counter is assigned to the vma counter.
    This will allow multiple vmas to be locked under a single mmap_lock
    write lock (e.g. during vma merging). The vma counter is modified
    under exclusive vma lock.
  - write unlock - (vma_end_write_all) is a batch release of all vma
    locks held. It doesn't pair with a specific vma_start_write! It is
    done before exclusive mmap_lock is released by incrementing mm
    sequence counter (mm_lock_seq).
  - write downgrade - if the mmap_lock is downgraded to the read lock, all
    vma write locks are released as well (effectively the same as write
    unlock).
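
A condensed sketch of the read-lock side (close to the actual
vma_start_read(), details abridged):

   static inline bool vma_start_read(struct vm_area_struct *vma)
   {
           /* matching counters mean a writer holds the vma lock */
           if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
                   return false;

           if (unlikely(!down_read_trylock(&vma->lock)))
                   return false;

           /* recheck: a writer may have locked the vma meanwhile */
           if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
                   up_read(&vma->lock);
                   return false;
           }
           return true;
   }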

Link: https://lkml.kernel.org/r/20230227173632.3292573-13-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:02:57 -07:00
Michel Lespinasse
20cce633f4 mm: rcu safe VMA freeing
This prepares for page faults handling under VMA lock, looking up VMAs
under protection of an rcu read lock, instead of the usual mmap read lock.

Link: https://lkml.kernel.org/r/20230227173632.3292573-11-surenb@google.com
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:02:57 -07:00
Kirill A. Shutemov
23baf831a3 mm, treewide: redefine MAX_ORDER sanely
MAX_ORDER is currently defined as the number of orders the page allocator
supports: the user can ask the buddy allocator for page orders between 0
and MAX_ORDER-1.

This definition is counter-intuitive and has led to a number of bugs all
over the kernel.

Change the definition of MAX_ORDER to be inclusive: the range of orders
the user can ask from the buddy allocator is now 0..MAX_ORDER.
[kirill@shutemov.name: fix min() warning]
  Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box
[akpm@linux-foundation.org: fix another min_t warning]
[kirill@shutemov.name: fixups per Zi Yan]
  Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name
[akpm@linux-foundation.org: fix underlining in docs]
  Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/
Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 19:42:46 -07:00
Kirill A. Shutemov
934487e98f perf/core: fix MAX_ORDER usage in rb_alloc_aux_page()
MAX_ORDER is not inclusive: the maximum allocation order the buddy
allocator can deliver is MAX_ORDER-1.

Fix MAX_ORDER usage in rb_alloc_aux_page().

Link: https://lkml.kernel.org/r/20230315113133.11326-7-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 19:42:45 -07:00
Liam R. Howlett
3dd4432549 mm: enable maple tree RCU mode by default
Use the maple tree in RCU mode for VMA tracking.

The maple tree tracks the stack and is able to update the pivot
(lower/upper boundary) in-place to allow the page fault handler to write
to the tree while holding just the mmap read lock.  This is safe as the
writes to the stack have a guard VMA which ensures there will always be a
NULL in the direction of the growth and thus will only update a pivot.

It is possible, but not recommended, to have VMAs that grow up/down
without guard VMAs.  syzbot has constructed a testcase which sets up a VMA
to grow and consume the empty space.  Overwriting the entire NULL entry
causes the tree to be altered in a way that is not safe for concurrent
readers; the readers may see a node being rewritten or one that does not
match the maple state they are using.

Enabling RCU mode allows the concurrent readers to see a stable node and
will return the expected result.

[Liam.Howlett@Oracle.com: we don't need to free the nodes with RCU]
Link: https://lore.kernel.org/linux-mm/000000000000b0a65805f663ace6@google.com/
Link: https://lkml.kernel.org/r/20230227173632.3292573-9-surenb@google.com
Fixes: d4af56c5c7 ("mm: start tracking VMAs with maple tree")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: syzbot+8d95422d3537159ca390@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 18:06:22 -07:00
Nicholas Piggin
2655421ae6 lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
On big systems, the mm refcount can become highly contended when doing a
lot of context switching with threaded applications.  The user<->idle
switch is one of the important cases.  Abandoning lazy tlb entirely slows
this switching down quite a bit in the common uncontended case, so that
is not viable.

Implement a scheme where lazy tlb mm references do not contribute to the
refcount, instead they get explicitly removed when the refcount reaches
zero.

The final mmdrop() sends IPIs to all CPUs in the mm_cpumask and they
switch away from this mm to init_mm if it was being used as the lazy tlb
mm.  Enabling the shoot lazies option therefore requires that the arch
ensures that mm_cpumask contains all CPUs that could possibly be using mm.
A DEBUG_VM option IPIs every CPU in the system after this to ensure there
are no references remaining before the mm is freed.
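
A sketch of the shootdown (helper names assumed; the behavior matches
the description above):

   static void do_shoot_lazy_tlb(void *arg)
   {
           struct mm_struct *mm = arg;

           if (current->active_mm == mm) {
                   WARN_ON_ONCE(current->mm);
                   current->active_mm = &init_mm;
                   switch_mm(mm, &init_mm, current);
           }
   }

   static void cleanup_lazy_tlbs(struct mm_struct *mm)
   {
           /* IPI only the CPUs that could be using mm lazily */
           on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb,
                            (void *)mm, 1);
   }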

Shootdown IPIs cost could be an issue, but they have not been observed to
be a serious problem with this scheme, because short-lived processes tend
not to migrate CPUs much, therefore they don't get much chance to leave
lazy tlb mm references on remote CPUs.  There are a lot of options to
reduce them if necessary, described in comments.

The near-worst-case can be benchmarked with will-it-scale:

  context_switch1_threads -t $(($(nproc) / 2))

This will create nproc threads (nproc / 2 switching pairs) all sharing
the same mm, spread over all CPUs so that each CPU does
thread->idle->thread switching.

[ Rik came up with basically the same idea a few years ago, so credit
  to him for that. ]

Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/
Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/
Link: https://lkml.kernel.org/r/20230203071837.1136453-5-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28 16:20:08 -07:00
Nicholas Piggin
aa464ba9a1 lazy tlb: introduce lazy tlb mm refcount helper functions
Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. 
This makes the lazy tlb mm references more obvious, and allows the
refcounting scheme to be modified in later changes.  There is no
functional change with this patch.

Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28 16:20:08 -07:00
Nicholas Piggin
6cad87b0d2 kthread: simplify kthread_use_mm refcounting
Patch series "shoot lazy tlbs (lazy tlb refcount scalability
improvement)", v7.

This series improves scalability of context switching between user and
kernel threads on large systems with a threaded process spread across a
lot of CPUs.

Discussion of v6 here:
https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/


This patch (of 5):

Remove the special case avoiding refcounting when the mm to be used is the
same as the kernel thread's active (lazy tlb) mm.  kthread_use_mm() should
not be such a performance critical path that this matters much.  This
simplifies a later change to lazy tlb mm refcounting.

Link: https://lkml.kernel.org/r/20230203071837.1136453-1-npiggin@gmail.com
Link: https://lkml.kernel.org/r/20230203071837.1136453-2-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28 16:20:07 -07:00
Linus Torvalds
18940c888c - Fix a corner case where vruntime of a task is not being sanitized
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmQgQeEACgkQEsHwGGHe
 VUpJ+w//c01JpzBXQvGGlNSTTuUzSxLAQz0n8lQmixpYHOUgVL3CQlOF+OfnkYPp
 mz8m3nDh1FB9o70Sd2J4/OjY2Gh1qENrBLmj509LTnZE9hADRI0T5Mn93Jo7m7HY
 QoXrYWEwJYwsLzJ66mR8A0xts5jWgkJsAWKXF9gfxf/ieeycdJ1GdOdzC1tp4Nfe
 /4SEjSbUhx/bsBbAdJ38Z/iQGT0HuyQLOBGBuBcFE0JnP/aYEanAQsxxP2LObeVw
 Za7ATxdJ9I1TErVfRsG0GDSiVKCYzSG2GME5TXibgPJ2g1+m0I7gZgpGO9Q8Wzo4
 7y0X+vqsykY/Me3xEDBVaeCiHmFTambkxOR2xVJ2TISN8b390yePy4vgY1QQDidd
 eNh9S2x1dsKp8i4NdYyeW7xwaTfIDp9Yp4XNP8cw02VzW0FSCnsmCzwGHIXsF4K/
 Sib+bhKUo+Qmck5nJlV6R5Xr9cvGgyPpBvD8/XqqwF5lHJ7xg4qkPwPKjoKL1HRj
 YT1t+l0kzcg/onyDAuPe1mIRFf7Q8x5G8zkUGMG401h2tazv19rjK4+V1UemhBqA
 h5Cf1BBy6+6kn4DDb/zD+0IgpDFKJDaClxNwfPzaplAoMC/8+0nxxnu7f32eT59k
 /JtMisERhr6lG0Q+TfURhB3yyCiBjBR2EKcKzHz+KARtxs7a4T0=
 =Z8oJ
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Borislav Petkov:

 - Fix a corner case where vruntime of a task is not being sanitized

* tag 'sched_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Sanitize vruntime of entity being migrated
2023-03-26 09:18:30 -07:00
Linus Torvalds
f6cdaeb08b - Do the delayed RCU wakeup for kthreads in the proper order so that
the former doesn't get ignored
 
 - A noinstr warning fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmQgPhsACgkQEsHwGGHe
 VUqecA//VN/9pvCpJbf0S92lDXjbjuAna4+rYak6mjxke2lYHeXNsCjPpBtdfMnK
 Mvr1HykKVBGIwlIPsKvwzt1FYN08rhdPnb/XvwM1ZliFlXP/HhN7K3AWgld559sZ
 hY9gKfijof5D7VhqjRS7ZA2qo2by72ekXnuVQT0cmwZiHPQLx8M4ySHuMUZD3Lmz
 GQyvITuDyBX3BiIGVtqpOhpugpEjAoEBE0MPwXrMEe9Glk4i1z86Xd1Cr4ksCFZL
 gIdDSnlUuLuXn1OfLA35fXzoaStB07aTMEbRl+iD1KUopdRzpqj9j0rqYINkW0Ar
 W/BzyMNw2itEbGrF+kjjCpolwmJvMcJUuvEZKO/gNTv2qYZW5BQqQrHYILdmezyz
 HvGndyT996D3uoUIMIGqJf+41cwqPEjGyMLs3GfYnZMdVnZZnG/KflWQCiGUR9RR
 LfxaryNZfT8MfQP6uslxOWubMupfsen7Hk8oljUpT2GzUAsWxTjqYWkFx6bMNMkV
 Kx304at7R9jD81qC3Rdkqu0F5Z17YZWubd1oJEhi8HeMq8uxFxPb983SkXLY0w7Y
 4Ss/MhJwt30e9ltGCMwgF83uOnndXwzFJG4TR9TqsO0TcdUE6XJPbPj/K8Wsi/u7
 fvnCbBLsDaEJDEAikWJsSOaMjUwnajyaYomy+9VCR536/DlIq5M=
 =c5tH
 -----END PGP SIGNATURE-----

Merge tag 'core_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core fixes from Borislav Petkov:

 - Do the delayed RCU wakeup for kthreads in the proper order so that
   the former doesn't get ignored

 - A noinstr warning fix

* tag 'core_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry/rcu: Check TIF_RESCHED _after_ delayed RCU wake-up
  entry: Fix noinstr warning in __enter_from_user_mode()
2023-03-26 09:06:20 -07:00
Linus Torvalds
f768b35a23 Fixes for 6.3-rc3:
* Fix a race in the percpu counters summation code where the summation
    failed to add in the values for any CPUs that were dying but not yet
    dead.  This fixes some minor discrepancies and incorrect assertions
    when running generic/650.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZBdAbgAKCRBKO3ySh0YR
 pkltAQCs4QO5LjYReqjUxd4cSsLtNnNon09qswRsl2GuRyI36AEAxI9QMq4Q6D9V
 ZasNbiTCkV3KPKfmp6gf1mQNLk1lGQ0=
 =Bz3q
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.3-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs percpu counter fixes from Darrick Wong:
 "We discovered a filesystem summary counter corruption problem that was
  traced to cpu hot-remove racing with the call to percpu_counter_sum
  that sets the free block count in the superblock when writing it to
  disk. The root cause is that percpu_counter_sum doesn't cull from
  dying cpus and hence misses those counter values if the cpu shutdown
  hooks have not yet run to merge the values.

  I'm hoping this is a fairly painless fix to the problem, since the
  dying cpu mask should generally be empty. It's been in for-next for a
  week without any complaints from the bots.

   - Fix a race in the percpu counters summation code where the
     summation failed to add in the values for any CPUs that were dying
     but not yet dead. This fixes some minor discrepancies and incorrect
     assertions when running generic/650"

* tag 'xfs-6.3-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  pcpcntr: remove percpu_counter_sum_all()
  fork: remove use of percpu_counter_sum_all
  pcpcntrs: fix dying cpu summation race
  cpumask: introduce for_each_cpu_or
2023-03-25 12:57:34 -07:00
Linus Torvalds
65aca32efd 21 hotfixes, 8 of which are cc:stable. 11 are for MM, the remainder are
for other subsystems.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZB48xAAKCRDdBJ7gKXxA
 js2rAP4zvcMn90vBJhWNElsA7pBgDYD66QCK6JBDHGe3J1qdeQEA8D606pjMBWkL
 ly7NifwCjOtFhfDRgEHOXu8g8g1k1QM=
 =Cswg
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-03-24-17-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "21 hotfixes, 8 of which are cc:stable. 11 are for MM, the remainder
  are for other subsystems"

* tag 'mm-hotfixes-stable-2023-03-24-17-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
  mm: mmap: remove newline at the end of the trace
  mailmap: add entries for Richard Leitner
  kcsan: avoid passing -g for test
  kfence: avoid passing -g for test
  mm: kfence: fix using kfence_metadata without initialization in show_object()
  lib: dhry: fix unstable smp_processor_id(_) usage
  mailmap: add entry for Enric Balletbo i Serra
  mailmap: map Sai Prakash Ranjan's old address to his current one
  mailmap: map Rajendra Nayak's old address to his current one
  Revert "kasan: drop skip_kasan_poison variable in free_pages_prepare"
  mailmap: add entry for Tobias Klauser
  kasan, powerpc: don't rename memintrinsics if compiler adds prefixes
  mm/ksm: fix race with VMA iteration and mm_struct teardown
  kselftest: vm: fix unused variable warning
  mm: fix error handling for map_deny_write_exec
  mm: deduplicate error handling for map_deny_write_exec
  checksyscalls: ignore fstat to silence build warning on LoongArch
  nilfs2: fix kernel-infoleak in nilfs_ioctl_wrap_copy()
  test_maple_tree: add more testing for mas_empty_area()
  maple_tree: fix mas_skip_node() end slot detection
  ...
2023-03-24 18:06:11 -07:00
Linus Torvalds
608f1b1366 Including fixes from bpf, wifi and bluetooth.
Current release - regressions:
 
  - wifi: mt76: mt7915: add back 160MHz channel width support for MT7915
 
  - libbpf: revert poisoning of strlcpy, it broke uClibc-ng
 
 Current release - new code bugs:
 
  - bpf: improve the coverage of the "allow reads from uninit stack"
    feature to fix verification complexity problems
 
  - eth: am65-cpts: reset PPS genf adj settings on enable
 
 Previous releases - regressions:
 
  - wifi: mac80211: serialize ieee80211_handle_wake_tx_queue()
 
  - wifi: mt76: do not run mt76_unregister_device() on unregistered hw,
    fix null-deref
 
  - Bluetooth: btqcomsmd: fix command timeout after setting BD address
 
  - eth: igb: revert rtnl_lock() that causes a deadlock
 
  - dsa: mscc: ocelot: fix device specific statistics
 
 Previous releases - always broken:
 
  - xsk: add missing overflow check in xdp_umem_reg()
 
  - wifi: mac80211:
    - fix QoS on mesh interfaces
    - fix mesh path discovery based on unicast packets
 
  - Bluetooth:
    - ISO: fix timestamped HCI ISO data packet parsing
    - remove "Power-on" check from Mesh feature
 
  - usbnet: more fixes to drivers trusting packet length
 
  - wifi: iwlwifi: mvm: fix mvmtxq->stopped handling
 
  - Bluetooth: btintel: iterate only bluetooth device ACPI entries
 
  - eth: iavf: fix inverted Rx hash condition leading to disabled hash
 
  - eth: igc: fix the validation logic for taprio's gate list
 
  - dsa: tag_brcm: legacy: fix daisy-chained switches
 
 Misc:
 
  - bpf: adjust insufficient default bpf_jit_limit to account for
    growth of BPF use over the last 5 years
 
  - xdp: bpf_xdp_metadata() use EOPNOTSUPP as unique errno indicating
    no driver support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmQc4vkACgkQMUZtbf5S
 IruG/w//XixBtdFMHE0/fcGv77jTovlJNiDYeaa+KtyjvIseieYwOKW5F31r3xvl
 Mf/YhNEjAc++V8Zna/1UM5i/WOj1PJdHgSC+wMUGUXjMF+MfzL57nM83CllOpUB5
 Z9YtUqGfolf2Vtx03wnV14qawmVnJWYKHn3AU11cueE5dUu6KNyBTCefQ7uzgcJN
 zMtHAxw96MRQIDxSfKvZsePk4FnQ4qoSOLkslji5iikcMnKePaqZaxQla2oTcEIR
 zue9V+ILmi62Y8mPcdT4ePpZQsjB39bpemh+9EL6l03/cjsjqmuiCw/d1+6g9kuy
 ZD5LgZzUOb6xalhSseiwJL+vj8x2gQhshEfoHQvgp7fzr6agta6sisRX611wtmJl
 hv4k2PMRqFrMv2S+8m8XC177bXIaGbiWh4vBFOWjf4u0lG55cGlzclbXWWQ80njy
 C5cE4V7qPRk8Cl/+uT10CLNQx6JmaX8kcddtFrYpu0PZHKx1WfUYKIpgkiiMPRKT
 njLkDQbFRa8Y3p7UX0wU1TbeuMzzLz+aTBrFEN864IJmbnUnWimeluQzD60WbkSx
 6dciqq11LtvYDsR1HZ1pb7IoHYuDsDrO2Rx4zuqsB/SyfrGdRKJoKOnYvsk+AdCL
 N/e4wivie8s6b+G3yL6p+IdlpEaVo2ZiLINp7JSW8jhW1hRcZUI=
 =XBLi
 -----END PGP SIGNATURE-----

Merge tag 'net-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, wifi and bluetooth.

  Current release - regressions:

   - wifi: mt76: mt7915: add back 160MHz channel width support for
     MT7915

   - libbpf: revert poisoning of strlcpy, it broke uClibc-ng

  Current release - new code bugs:

   - bpf: improve the coverage of the "allow reads from uninit stack"
     feature to fix verification complexity problems

   - eth: am65-cpts: reset PPS genf adj settings on enable

  Previous releases - regressions:

   - wifi: mac80211: serialize ieee80211_handle_wake_tx_queue()

   - wifi: mt76: do not run mt76_unregister_device() on unregistered hw,
     fix null-deref

   - Bluetooth: btqcomsmd: fix command timeout after setting BD address

   - eth: igb: revert rtnl_lock() that causes a deadlock

   - dsa: mscc: ocelot: fix device specific statistics

  Previous releases - always broken:

   - xsk: add missing overflow check in xdp_umem_reg()

   - wifi: mac80211:
      - fix QoS on mesh interfaces
      - fix mesh path discovery based on unicast packets

   - Bluetooth:
      - ISO: fix timestamped HCI ISO data packet parsing
      - remove "Power-on" check from Mesh feature

   - usbnet: more fixes to drivers trusting packet length

   - wifi: iwlwifi: mvm: fix mvmtxq->stopped handling

   - Bluetooth: btintel: iterate only bluetooth device ACPI entries

   - eth: iavf: fix inverted Rx hash condition leading to disabled hash

   - eth: igc: fix the validation logic for taprio's gate list

   - dsa: tag_brcm: legacy: fix daisy-chained switches

  Misc:

   - bpf: adjust insufficient default bpf_jit_limit to account for
     growth of BPF use over the last 5 years

   - xdp: bpf_xdp_metadata() use EOPNOTSUPP as unique errno indicating
     no driver support"

* tag 'net-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
  Bluetooth: HCI: Fix global-out-of-bounds
  Bluetooth: mgmt: Fix MGMT add advmon with RSSI command
  Bluetooth: btsdio: fix use after free bug in btsdio_remove due to unfinished work
  Bluetooth: L2CAP: Fix responding with wrong PDU type
  Bluetooth: btqcomsmd: Fix command timeout after setting BD address
  Bluetooth: btinel: Check ACPI handle for NULL before accessing
  net: mdio: thunder: Add missing fwnode_handle_put()
  net: dsa: mt7530: move setting ssc_delta to PHY_INTERFACE_MODE_TRGMII case
  net: dsa: mt7530: move lowering TRGMII driving to mt7530_setup()
  net: dsa: mt7530: move enabling disabling core clock to mt7530_pll_setup()
  net: asix: fix modprobe "sysfs: cannot create duplicate filename"
  gve: Cache link_speed value from device
  tools: ynl: Fix genlmsg header encoding formats
  net: enetc: fix aggregate RMON counters not showing the ranges
  Bluetooth: Remove "Power-on" check from Mesh feature
  Bluetooth: Fix race condition in hci_cmd_sync_clear
  Bluetooth: btintel: Iterate only bluetooth device ACPI entries
  Bluetooth: ISO: fix timestamped HCI ISO data packet parsing
  Bluetooth: btusb: Remove detection of ISO packets over bulk
  Bluetooth: hci_core: Detect if an ACL packet is in fact an ISO packet
  ...
2023-03-24 08:48:12 -07:00
Marco Elver
5eb39cde1e kcsan: avoid passing -g for test
Nathan reported that when building with GNU as and a version of clang that
defaults to DWARF5, the assembler will complain with:

  Error: non-constant .uleb128 is not supported

This is because `-g` defaults to the compiler debug info default. If the
assembler does not support some of the directives used, the above errors
occur. To fix, remove the explicit passing of `-g`.

All the test wants is that stack traces print valid function names, and
debug info is not required for that. (I currently cannot recall why I
added the explicit `-g`.)

Link: https://lkml.kernel.org/r/20230316224705.709984-2-elver@google.com
Fixes: 1fe84fd4a4 ("kcsan: Add test suite")
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23 17:18:35 -07:00
Jakub Kicinski
1b4ae19e43 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZBzSGQAKCRDbK58LschI
 g+dhAP95enbrlwaQ+9aoqrU+GqCq+uo4SkaqnUtq6GSvRNiVBQD8C6iZxrAjyXnm
 1wRr3JN/HszPBzgjl3HvDc9y69I/PAI=
 =8JwR
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-03-23

We've added 8 non-merge commits during the last 13 day(s) which contain
a total of 21 files changed, 238 insertions(+), 161 deletions(-).

The main changes are:

1) Fix verification issues in some BPF programs due to their stack usage
   patterns, from Eduard Zingerman.

2) Fix to add missing overflow checks in xdp_umem_reg and return an error
   in such case, from Kal Conley.

3) Fix and undo poisoning of strlcpy in libbpf given it broke builds for
   libcs which provided the former like uClibc-ng, from Jesus Sanchez-Palencia.

4) Fix insufficient bpf_jit_limit default to avoid users running into hard
   to debug seccomp BPF errors, from Daniel Borkmann.

5) Fix driver return code when they don't support a bpf_xdp_metadata kfunc
   to make it unambiguous from other errors, from Jesper Dangaard Brouer.

6) Two BPF selftest fixes to address compilation errors from recent changes
   in kernel structures, from Alexei Starovoitov.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  xdp: bpf_xdp_metadata use EOPNOTSUPP for no driver support
  bpf: Adjust insufficient default bpf_jit_limit
  xsk: Add missing overflow check in xdp_umem_reg
  selftests/bpf: Fix progs/test_deny_namespace.c issues.
  selftests/bpf: Fix progs/find_vma_fail1.c build error.
  libbpf: Revert poisoning of strlcpy
  selftests/bpf: Tests for uninitialized stack reads
  bpf: Allow reads from uninit stack
====================

Link: https://lore.kernel.org/r/20230323225221.6082-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-23 16:03:33 -07:00
Daniel Borkmann
10ec8ca8ec bpf: Adjust insufficient default bpf_jit_limit
We've seen recent AWS EKS (Kubernetes) user reports like the following:

  After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS
  clusters after a few days a number of the nodes have containers stuck
  in ContainerCreating state or liveness/readiness probes reporting the
  following error:

    Readiness probe errored: rpc error: code = Unknown desc = failed to
    exec in container: failed to start exec "4a11039f730203ffc003b7[...]":
    OCI runtime exec failed: exec failed: unable to start container process:
    unable to init seccomp: error loading seccomp filter into kernel:
    error loading seccomp filter: errno 524: unknown

  However, we had not been seeing this issue on previous AMIs and it only
  started to occur on v20230217 (following the upgrade from kernel 5.4 to
  5.10) with no other changes to the underlying cluster or workloads.

  We tried the suggestions from that issue (sysctl net.core.bpf_jit_limit=452534528)
  which helped to immediately allow containers to be created and probes to
  execute but after approximately a day the issue returned and the value
  returned by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}'
  was steadily increasing.

I tested the bpf tree to observe bpf_jit_charge_modmem and
bpf_jit_uncharge_modmem, the sizes passed in, as well as bpf_jit_current
under a tcpdump BPF filter, seccomp BPF and native (e)BPF programs, and
the behavior all looks sane and expected; that is, nothing is "leaking"
from an upstream perspective.

The bpf_jit_limit knob was originally added in order to avoid a situation
where unprivileged applications loading BPF programs (e.g. seccomp BPF
policies) could consume all the module memory space via the BPF JIT, such
that loading of kernel modules would be prevented. The default limit was
defined back in 2018 and, while good enough back then, we are generally
seeing far more BPF consumers today.

Adjust the limit for the BPF JIT pool from originally 1/4 to now 1/2 of the
module memory space to better reflect today's needs and avoid more users
running into potentially hard to debug issues.
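
The change presumably boils down to the limit derivation in
bpf_jit_charge_init(), along the lines of:

   /* was: bpf_jit_limit_max >> 2, i.e. 1/4 of module space */
   bpf_jit_limit = min_t(u64, round_up(bpf_jit_limit_max >> 1,
                                       PAGE_SIZE), LONG_MAX);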

Fixes: fdadd04931 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
Reported-by: Stephen Haynes <sh@synk.net>
Reported-by: Lefteris Alexakis <lefteris.alexakis@kpn.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://github.com/awslabs/amazon-eks-ami/issues/1179
Link: https://github.com/awslabs/amazon-eks-ami/issues/1219
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230320143725.8394-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-21 12:43:05 -07:00
Frederic Weisbecker
b416514054 entry/rcu: Check TIF_RESCHED _after_ delayed RCU wake-up
RCU sometimes needs to perform a delayed wake up for specific kthreads
handling offloaded callbacks (RCU_NOCB).  These wakeups are performed
by timers and upon entry to idle (also to guest and to user on nohz_full).

However, the delayed wake-up on kernel exit is actually performed after
the thread flags are fetched for the fast-path check for work to do on
exit to user.  As a result, if there is no other pending work to do upon
that kernel exit, the current task will resume to userspace with
TIF_RESCHED set and the pending wake-up ignored.

Fix this by fetching the thread flags _after_ the delayed RCU-nocb
kthread wake-up.
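
Sketched against exit_to_user_mode_prepare() (abridged):

   /* flush the pending rcuog wakeup first; it may set TIF_NEED_RESCHED */
   tick_nohz_user_enter_prepare();

   /* only now fetch the flags for the final work check */
   ti_work = read_thread_flags();
   if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
           ti_work = exit_to_user_mode_loop(regs, ti_work);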

Fixes: 47b8ff194c ("entry: Explicitly flush pending rcuog wakeup before last rescheduling point")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230315194349.10798-3-joel@joelfernandes.org
2023-03-21 15:13:15 +01:00
Vincent Guittot
a53ce18cac sched/fair: Sanitize vruntime of entity being migrated
Commit 829c1651e9 ("sched/fair: sanitize vruntime of entity being placed")
fixes an overflow bug, but ignores the case where se->exec_start is reset
after a migration.

To fix this case, delay the reset of se->exec_start until after placing
the entity, which uses se->exec_start to detect long-sleeping tasks.

In order to take into account a possible divergence between the clock_task
of two runqueues, increase the threshold to around 104 days.
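
For reference, that number is consistent with a bound of 2^63 ns scaled
down by the NICE_0_LOAD weight of 1024 (an assumption about the exact
constant, which the message does not spell out):

  #include <stdio.h>

  int main(void)
  {
      /* 2^63 ns divided by a weight of 1024 is 2^53 ns */
      unsigned long long ns = (1ULL << 63) / 1024;
      double days = (double)ns / 1e9 / 3600.0 / 24.0;

      printf("%llu ns ~= %.1f days\n", ns, days);   /* ~104.2 days */
      return 0;
  }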

Fixes: 829c1651e9 ("sched/fair: sanitize vruntime of entity being placed")
Originally-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
Link: https://lore.kernel.org/r/20230317160810.107988-1-vincent.guittot@linaro.org
2023-03-21 14:43:04 +01:00
Josh Poimboeuf
f87d28673b entry: Fix noinstr warning in __enter_from_user_mode()
__enter_from_user_mode() is triggering noinstr warnings with
CONFIG_DEBUG_PREEMPT due to its call of preempt_count_add() via
ct_state().

The preemption disable isn't needed as interrupts are already disabled.
And the context_tracking_enabled() check in ct_state() also isn't needed
as that's already being done by the CT_WARN_ON().

Just use __ct_state() instead.
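
As a loose userspace analogy of the checked vs. double-underscore raw
accessor convention (made-up names, not the kernel implementation):

  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static int state;

  /* Checked accessor: establishes its own preconditions, cf. ct_state()
   * disabling preemption and checking context_tracking_enabled(). */
  static int state_get(void)
  {
      pthread_mutex_lock(&lock);
      int s = state;
      pthread_mutex_unlock(&lock);
      return s;
  }

  /* Raw accessor: for callers whose preconditions already hold,
   * cf. __ct_state() with interrupts disabled and CT_WARN_ON() having
   * already validated the state. */
  static int __state_get(void)
  {
      return state;
  }

  int main(void)
  {
      pthread_mutex_lock(&lock);        /* precondition already holds */
      int s = __state_get();            /* no redundant locking/check */
      pthread_mutex_unlock(&lock);
      printf("%d %d\n", s, state_get());
      return 0;
  }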

Fixes the following warnings:

  vmlinux.o: warning: objtool: enter_from_user_mode+0xba: call to preempt_count_add() leaves .noinstr.text section
  vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0xf9: call to preempt_count_add() leaves .noinstr.text section
  vmlinux.o: warning: objtool: syscall_enter_from_user_mode_prepare+0xc7: call to preempt_count_add() leaves .noinstr.text section
  vmlinux.o: warning: objtool: irqentry_enter_from_user_mode+0xba: call to preempt_count_add() leaves .noinstr.text section

Fixes: 171476775d ("context_tracking: Convert state to atomic_t")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/d8955fa6d68dc955dda19baf13ae014ae27926f5.1677369694.git.jpoimboe@kernel.org
2023-03-21 11:53:16 +01:00
Linus Torvalds
eaba52d63b Tracing fixes for 6.3:
- Fix setting affinity of hwlat threads in containers
   Using sched_setaffinity() has unwanted side effects when called
   within a container. Use set_cpus_allowed_ptr() instead.
 
 - Fix per cpu thread management of the hwlat tracer
   * Do not start per_cpu threads if one is already running for the CPU.
   * When starting per_cpu threads, do not clear the kthread variable
     as it may already point to running per-cpu threads
 
 - Fix return value for test_gen_kprobe_cmd()
   On error the return value was overwritten by being set to
    the result of the call to kprobe_event_delete(), which would
    likely succeed, and thus have the function return success.
 
 - Fix splice() reads from the trace file that was broken by
   36e2c7421f ("fs: don't allow splice read/write without explicit ops")
 
 - Remove obsolete and confusing comment in ring_buffer.c
   The original design of the ring buffer used struct page flags
   for tricks to optimize, which was shortly removed due to them
   being tricks. But a comment for those tricks remained.
 
 - Set local functions and variables to static
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZBdIkBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qti2AP49s1GM8teFgDF/CO3oa45BoYq1lMJO
 1Z+x+mS/bdUAZgEAw+3KGo8oZDuvWu/nr04JPeoy0GL1/JnbQ6JNCCjzhQc=
 =Yk66
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix setting affinity of hwlat threads in containers

   Using sched_setaffinity() has unwanted side effects when called
   within a container. Use set_cpus_allowed_ptr() instead

 - Fix per cpu thread management of the hwlat tracer:
    - Do not start per_cpu threads if one is already running for the CPU
    - When starting per_cpu threads, do not clear the kthread variable
      as it may already point to running per-cpu threads

 - Fix return value for test_gen_kprobe_cmd()

   On error the return value was overwritten by being set to the result
   of the call to kprobe_event_delete(), which would likely succeed,
   and thus have the function return success

 - Fix splice() reads from the trace file that was broken by commit
   36e2c7421f ("fs: don't allow splice read/write without explicit
   ops")

 - Remove obsolete and confusing comment in ring_buffer.c

   The original design of the ring buffer used struct page flags for
   tricks to optimize, which was shortly removed due to them being
   tricks. But a comment for those tricks remained

 - Set local functions and variables to static

* tag 'trace-v6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/hwlat: Replace sched_setaffinity with set_cpus_allowed_ptr
  ring-buffer: remove obsolete comment for free_buffer_page()
  tracing: Make splice_read available again
  ftrace: Set direct_ops storage-class-specifier to static
  trace/hwlat: Do not start per-cpu thread if it is already running
  trace/hwlat: Do not wipe the contents of per-cpu thread data
  tracing/osnoise: set several trace_osnoise.c variables storage-class-specifier to static
  tracing: Fix wrong return in kprobe_event_gen_test.c
2023-03-19 10:46:02 -07:00
Costa Shulyupin
71c7a30442 tracing/hwlat: Replace sched_setaffinity with set_cpus_allowed_ptr
There is a problem with the behavior of hwlat in a container,
resulting in incorrect output. A warning message is generated:
"cpumask changed while in round-robin mode, switching to mode none",
and the tracing_cpumask is ignored. This issue arises because
the kernel thread, hwlatd, is not a part of the container, and
the function sched_setaffinity is unable to locate it using its PID.
However, the task_struct of hwlatd is already known, and the function
set_cpus_allowed_ptr() achieves the same outcome as sched_setaffinity()
but takes a task_struct instead of a PID.

Test case:

  # cd /sys/kernel/tracing
  # echo 0 > tracing_on
  # echo round-robin > hwlat_detector/mode
  # echo hwlat > current_tracer
  # unshare --fork --pid bash -c 'echo 1 > tracing_on'
  # dmesg -c

Actual behavior:

[573502.809060] hwlat_detector: cpumask changed while in round-robin mode, switching to mode none

Link: https://lore.kernel.org/linux-trace-kernel/20230316144535.1004952-1-costa.shul@redhat.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 0330f7aa8e ("tracing: Have hwlat trace migrate across tracing_cpumask CPUs")
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 13:23:22 -04:00
Vlastimil Babka
a98151ad53 ring-buffer: remove obsolete comment for free_buffer_page()
The comment refers to mm/slob.c which is being removed. It comes from
commit ed56829cb3 ("ring_buffer: reset buffer page when freeing") and
according to Steven the borrowed code was a page mapcount and mapping
reset, which was later removed by commit e4c2ce82ca ("ring_buffer:
allocate buffer page pointer"). Thus the comment is not accurate anyway,
remove it.

Link: https://lore.kernel.org/linux-trace-kernel/20230315142446.27040-1-vbabka@suse.cz

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Reported-by: Mike Rapoport <mike.rapoport@gmail.com>
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Fixes: e4c2ce82ca ("ring_buffer: allocate buffer page pointer")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 13:23:22 -04:00
Sung-hun Kim
e400be674a tracing: Make splice_read available again
Since commit 36e2c7421f ("fs: don't allow splice read/write
without explicit ops") was applied to the kernel, splice() and
sendfile() calls on the trace file (/sys/kernel/debug/tracing/trace)
return EINVAL.

This patch restores these system calls by initializing splice_read
in the file_operations of the trace file. It only enables this
functionality for the read case.
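
What this restores can be checked from userspace with a short sketch
(not part of the patch; needs root and a tracefs mount):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("/sys/kernel/debug/tracing/trace", O_RDONLY);
      int p[2];
      ssize_t n;

      if (fd < 0 || pipe(p) < 0) {
          perror("setup");
          return 1;
      }
      /* EINVAL on kernels affected by 36e2c7421f, data once fixed */
      n = splice(fd, NULL, p[1], NULL, 4096, 0);
      if (n < 0)
          perror("splice");
      else
          printf("spliced %zd bytes\n", n);
      return 0;
  }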

Link: https://lore.kernel.org/linux-trace-kernel/20230314013707.28814-1-sfoon.kim@samsung.com

Cc: stable@vger.kernel.org
Fixes: 36e2c7421f ("fs: don't allow splice read/write without explicit ops")
Signed-off-by: Sung-hun Kim <sfoon.kim@samsung.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 13:16:21 -04:00
Dave Chinner
7ba85fba47 fork: remove use of percpu_counter_sum_all
This effectively reverts the change made in commit f689054aac
("percpu_counter: add percpu_counter_sum_all interface") as the
race condition percpu_counter_sum_all() was invented to avoid is
now handled directly in percpu_counter_sum() and nobody needs to
care about summing racing with cpu unplug anymore.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-03-19 10:02:04 -07:00
Linus Torvalds
80102f2ee7 - Check whether sibling events have been deactivated before adding them
to groups
 
 - Update the proper event time tracking variable depending on the
   event type
 
 - Fix a memory overwrite issue due to using the wrong function argument
   when outputting perf events
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmQXBiQACgkQEsHwGGHe
 VUrTHg//dTd+7Oo1vao32lZCkVW0bBvx3KzoyAcvEPh7Yu3Cw+tZkfI8lWY8dgU3
 lgecN1pUa9IiZNULFbulXqRumcq3HIRKCp/RejEmR3W30KCwxnEAwGdakkCPEcMS
 2vXl4SZn7rU8avKJBd/ZUUS5lDWz6YHPNbLX3iamFH+7oN56Vf2LFJuuO0WtZSgr
 Sqv25oaV1ZZjeEAI5KrPM04hcldj4BXGPoIPvP30/c3z6aPnwt6v5H2gw5IUJNHl
 Qm0ycgW7An6iNR4bFpm/NP9cSAI7FgmAB3VuhQAExl6NroUxE8R+HEO04PsJAOgd
 BpNB1eXIjrTgw8r+TsDq4Z9VY8fgMtHbwutAp348XwF6//5F/WOzojwSGAAX7LCx
 5+fK4eA4gLkQJmuyI5/3fNCC0EnUTsK/YCC5eFyas8KFERTmmq8hafCBQO4W9VPr
 8A7TD2X6fWNWvS/UF1RqVv/gCa0ub7ifz+eOx42/vK2PXn4vsjMv0JtwnARj8PzZ
 Ymo/3ArmBWhuP+94FdVP4AduBuEdHvWO7EZG5buRdOo//sb2MwX05zufowLvR9YQ
 E+VkZxf0RFPaDQdPnoS/SwFE206ii2Z5MtgsQX7XpoBo06R5AzZQlhlMl/StFvN0
 65ut1dWwqrVG0uSN0GOzTIZl5x5YZASm5I49ItZsnZyoMPTwDOc=
 =l7PF
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.3_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Check whether sibling events have been deactivated before adding them
   to groups

 - Update the proper event time tracking variable depending on the event
   type

 - Fix a memory overwrite issue due to using the wrong function argument
   when outputting perf events

* tag 'perf_urgent_for_v6.3_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix check before add_event_to_groups() in perf_group_detach()
  perf: fix perf_event_context->time
  perf/core: Fix perf_output_begin parameter is incorrectly invoked in perf_event_bpf_output
2023-03-19 09:47:55 -07:00
Tom Rix
8732565549 ftrace: Set direct_ops storage-class-specifier to static
smatch reports this warning
kernel/trace/ftrace.c:2594:19: warning:
  symbol 'direct_ops' was not declared. Should it be static?

The variable direct_ops is only used in ftrace.c, so it should be static

Link: https://lore.kernel.org/linux-trace-kernel/20230311135113.711824-1-trix@redhat.com

Signed-off-by: Tom Rix <trix@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 12:20:49 -04:00
Tero Kristo
08697bca9b trace/hwlat: Do not start per-cpu thread if it is already running
The hwlat tracer will end up starting multiple per-cpu hwlatd threads
with the following script:

    #!/bin/sh
    cd /sys/kernel/debug/tracing
    echo 0 > tracing_on
    echo hwlat > current_tracer
    echo per-cpu > hwlat_detector/mode
    echo 100000 > hwlat_detector/width
    echo 200000 > hwlat_detector/window
    echo 1 > tracing_on

To fix the issue, check if the hwlatd thread for the cpu is already
running before starting a new one. Along with the previous patch, this
avoids running multiple instances of the same CPU thread on the system.

Link: https://lore.kernel.org/all/20230302113654.2984709-1-tero.kristo@linux.intel.com/
Link: https://lkml.kernel.org/r/20230310100451.3948583-3-tero.kristo@linux.intel.com

Cc: stable@vger.kernel.org
Fixes: f46b16520a ("trace/hwlat: Implement the per-cpu mode")
Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 12:20:49 -04:00
Tero Kristo
4c42f5f0d1 trace/hwlat: Do not wipe the contents of per-cpu thread data
Do not wipe the contents of the per-cpu kthread data when starting the
tracer, as doing so loses track of already running instances and can
later lead to additional per-cpu threads being started.

Link: https://lore.kernel.org/all/20230302113654.2984709-1-tero.kristo@linux.intel.com/
Link: https://lkml.kernel.org/r/20230310100451.3948583-2-tero.kristo@linux.intel.com

Cc: stable@vger.kernel.org
Fixes: f46b16520a ("trace/hwlat: Implement the per-cpu mode")
Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 12:20:48 -04:00
Tom Rix
7a025e066e tracing/osnoise: set several trace_osnoise.c variables storage-class-specifier to static
smatch reports several similar warnings
kernel/trace/trace_osnoise.c:220:1: warning:
  symbol '__pcpu_scope_per_cpu_osnoise_var' was not declared. Should it be static?
kernel/trace/trace_osnoise.c:243:1: warning:
  symbol '__pcpu_scope_per_cpu_timerlat_var' was not declared. Should it be static?
kernel/trace/trace_osnoise.c:335:14: warning:
  symbol 'interface_lock' was not declared. Should it be static?
kernel/trace/trace_osnoise.c:2242:5: warning:
  symbol 'timerlat_min_period' was not declared. Should it be static?
kernel/trace/trace_osnoise.c:2243:5: warning:
  symbol 'timerlat_max_period' was not declared. Should it be static?

These variables are only used in trace_osnoise.c, so they should be static

Link: https://lore.kernel.org/linux-trace-kernel/20230309150414.4036764-1-trix@redhat.com

Signed-off-by: Tom Rix <trix@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 12:20:48 -04:00
Anton Gusev
bc4f359b3b tracing: Fix wrong return in kprobe_event_gen_test.c
Overwriting the error code with the deletion result may cause the
function to return 0 despite encountering an error. Commit b111545d26
("tracing: Remove the useless value assignment in
test_create_synth_event()") solves a similar issue by
returning the original error code, so this patch does the same.
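
The underlying pattern, as a small sketch with hypothetical helper names:

  #include <stdio.h>

  /* Hypothetical stand-ins for the create/delete steps of the test. */
  static int create_kprobe_event(void) { return -1; /* pretend it fails */ }
  static int delete_kprobe_event(void) { return 0;  /* cleanup succeeds */ }

  static int run_test(void)
  {
      int ret = create_kprobe_event();
      int del = delete_kprobe_event();

      /* Buggy version: "ret = delete_kprobe_event();" overwrites the
       * original error with the (likely successful) cleanup result.
       * Instead, only let the cleanup result through if no earlier
       * error occurred: */
      if (!ret)
          ret = del;
      return ret;
  }

  int main(void)
  {
      printf("run_test() = %d\n", run_test());   /* -1, not 0 */
      return 0;
  }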

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Link: https://lore.kernel.org/linux-trace-kernel/20230131075818.5322-1-aagusev@ispras.ru

Signed-off-by: Anton Gusev <aagusev@ispras.ru>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 12:20:48 -04:00
Budimir Markovic
fd0815f632 perf: Fix check before add_event_to_groups() in perf_group_detach()
Events should only be added to a groups rb tree if they have not been
removed from their context by list_del_event(). Since remove_on_exec
made it possible to call list_del_event() on individual events before
they are detached from their group, perf_group_detach() should check each
sibling's attach_state before calling add_event_to_groups() on it.

Fixes: 2e498d0a74 ("perf: Add support for event removal on exec")
Signed-off-by: Budimir Markovic <markovicbudimir@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/ZBFzvQV9tEqoHEtH@gentoo
2023-03-15 21:49:47 +01:00
Song Liu
baf1b12a67 perf: fix perf_event_context->time
Time readers rely on perf_event_context->[time|timestamp|timeoffset] to get
accurate time_enabled and time_running for an event. The difference between
ctx->timestamp and ctx->time is the amount of time during which the context
is not enabled. __update_context_time(ctx, false) is used to advance
timestamp, but not time. Therefore, it should only be called in
ctx_sched_in() when EVENT_TIME was not enabled.

Fixes: 09f5e7dc7a ("perf: Fix perf_event_read_local() time")
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lkml.kernel.org/r/20230313171608.298734-1-song@kernel.org
2023-03-15 21:49:46 +01:00
Yang Jihong
eb81a2ed4f perf/core: Fix perf_output_begin parameter is incorrectly invoked in perf_event_bpf_output
syzkaller reports a KASAN stack-out-of-bounds issue.
The call trace is as follows:
  dump_stack+0x9c/0xd3
  print_address_description.constprop.0+0x19/0x170
  __kasan_report.cold+0x6c/0x84
  kasan_report+0x3a/0x50
  __perf_event_header__init_id+0x34/0x290
  perf_event_header__init_id+0x48/0x60
  perf_output_begin+0x4a4/0x560
  perf_event_bpf_output+0x161/0x1e0
  perf_iterate_sb_cpu+0x29e/0x340
  perf_iterate_sb+0x4c/0xc0
  perf_event_bpf_event+0x194/0x2c0
  __bpf_prog_put.constprop.0+0x55/0xf0
  __cls_bpf_delete_prog+0xea/0x120 [cls_bpf]
  cls_bpf_delete_prog_work+0x1c/0x30 [cls_bpf]
  process_one_work+0x3c2/0x730
  worker_thread+0x93/0x650
  kthread+0x1b8/0x210
  ret_from_fork+0x1f/0x30

Commit 267fb27352 ("perf: Reduce stack usage of perf_output_begin()")
made perf_output_begin() use the on-stack struct perf_sample_data of
the caller function.

However, perf_event_bpf_output() passes an incorrect parameter, treating
small-sized data (struct perf_bpf_event) as large-sized data (struct
perf_sample_data), which causes memory overwriting to occur in
__perf_event_header__init_id().
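
Reduced to a userspace sketch (struct names are illustrative stand-ins,
not the real perf types):

  #include <stdio.h>
  #include <string.h>

  struct bpf_event_like   { int type; };         /* small: what was passed */
  struct sample_data_like { char blob[256]; };   /* large: what is expected */

  /* Stand-in for __perf_event_header__init_id(): writes the full
   * sizeof(struct sample_data_like) through the pointer it is given. */
  static void init_id(struct sample_data_like *d)
  {
      memset(d, 0, sizeof(*d));
  }

  int main(void)
  {
      struct sample_data_like data;

      init_id(&data);   /* correct: the object is large enough */
      /*
       * The bug was equivalent to:
       *     struct bpf_event_like ev;
       *     init_id((struct sample_data_like *)&ev);
       * i.e. 256 bytes written into a 4-byte stack object -- the
       * stack-out-of-bounds access KASAN reported above.
       */
      printf("small=%zu large=%zu bytes\n",
             sizeof(struct bpf_event_like), sizeof(struct sample_data_like));
      return 0;
  }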

Fixes: 267fb27352 ("perf: Reduce stack usage of perf_output_begin()")
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230314044735.56551-1-yangjihong1@huawei.com
2023-03-15 21:49:46 +01:00
Linus Torvalds
6015b1aca1 sched_getaffinity: don't assume 'cpumask_size()' is fully initialized
The getaffinity() system call uses 'cpumask_size()' to decide how big
the CPU mask is - so far so good.  It is indeed the allocation size of a
cpumask.

But the code also assumes that the whole allocation is initialized
without actually doing so itself.  That's wrong, because we might have
fixed-size allocations (making copying and clearing more efficient), but
not all of it is then necessarily used if 'nr_cpu_ids' is smaller.

Having checked other users of 'cpumask_size()', they all seem to be ok,
either using it purely for the allocation size, or explicitly zeroing
the cpumask before using the size in bytes to copy it.

See for example the ublk_ctrl_get_queue_affinity() function that uses
the proper 'zalloc_cpumask_var()' to make sure that the whole mask is
cleared, whether the storage is on the stack or if it was an external
allocation.

Fix this by just zeroing the allocation before using it.  Do the same
for the compat version of sched_getaffinity(), which had the same logic.

Also, for consistency, make sched_getaffinity() use 'cpumask_bits()' to
access the bits.  For a cpumask_var_t, it ends up being a pointer to the
same data either way, but it's just a good idea to treat it like you
would a 'cpumask_t'.  The compat case already did that.
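
The userspace-visible contract can be poked at with the raw syscall
(a sketch, not part of the patch):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int main(void)
  {
      unsigned long mask[16];

      memset(mask, 0xaa, sizeof(mask));   /* poison: copied bytes stand out */

      /* The raw syscall returns the number of bytes it copied out, i.e.
       * the kernel's cpumask_size(), not 0 on success (glibc hides this). */
      long n = syscall(SYS_sched_getaffinity, 0, sizeof(mask), mask);

      if (n < 0) {
          perror("sched_getaffinity");
          return 1;
      }
      printf("kernel copied %ld bytes; first word 0x%lx\n", n, mask[0]);
      /* Before this fix, bytes past nr_cpu_ids within those n bytes could
       * carry uninitialized kernel data; now the allocation is zeroed. */
      return 0;
  }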

Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/lkml/7d026744-6bd6-6827-0471-b5e8eae0be3f@arm.com/
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-14 19:32:38 -07:00
Linus Torvalds
29db00c252 Tracing fixes for v6.3
- Do not allow histogram values to have modifiers.
   Can cause a NULL pointer dereference if they do.
 
 - Warn if hist_field_name() is passed a NULL.
   Prevent the NULL pointer dereference mentioned above.
 
 - Fix invalid address look up race in lookup_rec()
 
 - Define ftrace_stub_graph conditionally to prevent linker errors
 
 - Always check if RCU is watching at all tracepoint locations
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZBDuTBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qsboAP4yfrFYvIIKM5EkzkEiPI+V2hdlA12x
 bt839jO5AWCmhAEAiY8FmKatpBJQKsiGqSOab8aHOMnhGFZwltCHAPa9PAI=
 =vtA2
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Do not allow histogram values to have modifiers. They can cause a NULL
   pointer dereference if they do.

 - Warn if hist_field_name() is passed a NULL. Prevent the NULL pointer
   dereference mentioned above.

 - Fix invalid address look up race in lookup_rec()

 - Define ftrace_stub_graph conditionally to prevent linker errors

 - Always check if RCU is watching at all tracepoint locations

* tag 'trace-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Make tracepoint lockdep check actually test something
  ftrace,kcfi: Define ftrace_stub_graph conditionally
  ftrace: Fix invalid address access in lookup_rec() when index is 0
  tracing: Check field value in hist_field_name()
  tracing: Do not let histogram values have some modifiers
2023-03-14 17:07:54 -07:00