IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Patch series "mm: process/cgroup ksm support", v9.
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
Use case 1:
The madvise call is not available in the programming language. An
example for this are programs with forked workloads using a garbage
collected language without pointers. In such a language madvise cannot
be made available.
In addition the addresses of objects get moved around as they are
garbage collected. KSM sharing needs to be enabled "from the outside"
for these type of workloads.
Use case 2:
The same interpreter can also be used for workloads where KSM brings
no benefit or even has overhead. We'd like to be able to enable KSM on
a workload by workload basis.
Use case 3:
With the madvise call sharing opportunities are only enabled for the
current process: it is a workload-local decision. A considerable number
of sharing opportunities may exist across multiple workloads or jobs (if
they are part of the same security domain). Only a higler level entity
like a job scheduler or container can know for certain if its running
one or more instances of a job. That job scheduler however doesn't have
the necessary internal workload knowledge to make targeted madvise
calls.
Security concerns:
In previous discussions security concerns have been brought up. The
problem is that an individual workload does not have the knowledge about
what else is running on a machine. Therefore it has to be very
conservative in what memory areas can be shared or not. However, if the
system is dedicated to running multiple jobs within the same security
domain, its the job scheduler that has the knowledge that sharing can be
safely enabled and is even desirable.
Performance:
Experiments with using UKSM have shown a capacity increase of around 20%.
Here are the metrics from an instagram workload (taken from a machine
with 64GB main memory):
full_scans: 445
general_profit: 20158298048
max_page_sharing: 256
merge_across_nodes: 1
pages_shared: 129547
pages_sharing: 5119146
pages_to_scan: 4000
pages_unshared: 1760924
pages_volatile: 10761341
run: 1
sleep_millisecs: 20
stable_node_chains: 167
stable_node_chains_prune_millisecs: 2000
stable_node_dups: 2751
use_zero_pages: 0
zero_pages_sharing: 0
After the service is running for 30 minutes to an hour, 4 to 5 million
shared pages are common for this workload when using KSM.
Detailed changes:
1. New options for prctl system command
This patch series adds two new options to the prctl system call.
The first one allows to enable KSM at the process level and the second
one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate
over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be
inherited by the new child process.
3. Add general_profit metric
The general_profit metric of KSM is specified in the documentation,
but not calculated. This adds the general profit metric to
/sys/kernel/debug/mm/ksm.
4. Add more metrics to ksm_stat
This adds the process profit metric to /proc/<pid>/ksm_stat.
5. Add more tests to ksm_tests and ksm_functional_tests
This adds an option to specify the merge type to the ksm_tests.
This allows to test madvise and prctl KSM.
It also adds a two new tests to ksm_functional_tests: one to test
the new prctl options and the other one is a fork test to verify that
the KSM process setting is inherited by client processes.
This patch (of 3):
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
1. New options for prctl system command
This patch series adds two new options to the prctl system call.
The first one allows to enable KSM at the process level and the second
one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a
cgroup and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate
over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be
inherited by the new child process.
1) Introduce new MMF_VM_MERGE_ANY flag
This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag
is set, kernel samepage merging (ksm) gets enabled for all vma's of a
process.
2) Setting VM_MERGEABLE on VMA creation
When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
VM_MERGEABLE flag will be set for this VMA.
3) support disabling of ksm for a process
This adds the ability to disable ksm for a process if ksm has been
enabled for the process with prctl.
4) add new prctl option to get and set ksm for a process
This adds two new options to the prctl system call
- enable ksm for all vmas of a process (if the vmas support it).
- query if ksm has been enabled for a process.
3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
In the s390 architecture when storage keys are used, the
MMF_VM_MERGE_ANY will be disabled.
Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The console tracepoint is used by kcsan/kasan/kfence/kmsan test modules.
Since this tracepoint is not exported, these modules iterate over all
available tracepoints to find the console trace point. Export the trace
point so that it can be directly used.
Link: https://lkml.kernel.org/r/20230413100859.1492323-1-quic_pkondeti@quicinc.com
Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Marco Elver <elver@google.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If a library wants to get information from auxv (for instance,
AT_HWCAP/AT_HWCAP2), it has a few options, none of them perfectly reliable
or ideal:
- Be main or the pre-main startup code, and grub through the stack above
main. Doesn't work for a library.
- Call libc getauxval. Not ideal for libraries that are trying to be
libc-independent and/or don't otherwise require anything from other
libraries.
- Open and read /proc/self/auxv. Doesn't work for libraries that may run
in arbitrarily constrained environments that may not have /proc
mounted (e.g. libraries that might be used by an init program or a
container setup tool).
- Assume you're on the main thread and still on the original stack, and
try to walk the stack upwards, hoping to find auxv. Extremely bad
idea.
- Ask the caller to pass auxv in for you. Not ideal for a user-friendly
library, and then your caller may have the same problem.
Add a prctl that copies current->mm->saved_auxv to a userspace buffer.
Link: https://lkml.kernel.org/r/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.1680611394.git.josh@joshtriplett.org
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "memcg: avoid flushing stats atomically where possible", v3.
rstat flushing is an expensive operation that scales with the number of
cpus and the number of cgroups in the system. The purpose of this series
is to minimize the contexts where we flush stats atomically.
Patches 1 and 2 are cleanups requested during reviews of prior versions of
this series.
Patch 3 makes sure we never try to flush from within an irq context.
Patches 4 to 7 introduce separate variants of mem_cgroup_flush_stats() for
atomic and non-atomic flushing, and make sure we only flush the stats
atomically when necessary.
Patch 8 is a slightly tangential optimization that limits the work done by
rstat flushing in some scenarios.
This patch (of 8):
cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as
"irqs are disabled throughout", which is what the current implementation
does (currently under discussion [1]), but is not the intention. The
intention is that this function is safe to call from atomic contexts.
Name it as such.
Link: https://lkml.kernel.org/r/20230330191801.1967435-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230330191801.1967435-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasily Averin <vasily.averin@linux.dev>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
commit f1a7941243 ("mm: convert mm's rss stats into percpu_counter")
introduces a memory leak by missing a call to destroy_context() when a
percpu_counter fails to allocate.
Before introducing the per-cpu counter allocations, init_new_context() was
the last call that could fail in mm_init(), and thus there was no need to
ever invoke destroy_context() in the error paths. Adding the following
percpu counter allocations adds error paths after init_new_context(),
which means its associated destroy_context() needs to be called when
percpu counters fail to allocate.
Link: https://lkml.kernel.org/r/20230330133822.66271-1-mathieu.desnoyers@efficios.com
Fixes: f1a7941243 ("mm: convert mm's rss stats into percpu_counter")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linux Security Modules (LSMs) that implement the "capable" hook will
usually emit an access denial message to the audit log whenever they
"block" the current task from using the given capability based on their
security policy.
The occurrence of a denial is used as an indication that the given task
has attempted an operation that requires the given access permission, so
the callers of functions that perform LSM permission checks must take care
to avoid calling them too early (before it is decided if the permission is
actually needed to perform the requested operation).
The __sys_setres[ug]id() functions violate this convention by first
calling ns_capable_setid() and only then checking if the operation
requires the capability or not. It means that any caller that has the
capability granted by DAC (task's capability set) but not by MAC (LSMs)
will generate a "denied" audit record, even if is doing an operation for
which the capability is not required.
Fix this by reordering the checks such that ns_capable_setid() is checked
last and -EPERM is returned immediately if it returns false.
While there, also do two small optimizations:
* move the capability check before prepare_creds() and
* bail out early in case of a no-op.
Link: https://lkml.kernel.org/r/20230217162154.837549-1-omosnace@redhat.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
before: last 6 bits of PID is used as index to store information about
tasks accessing VMA's.
after: hash_32 is used to take of cases where tasks are created over a
period of time, and thus improve collision probability.
Result:
The patch series overall improves autonuma cost.
Kernbench around more than 5% improvement and system time in mmtest
autonuma showed more than 80% improvement
Link: https://lkml.kernel.org/r/d5a9f75513300caed74e5c8570bba9317b963c2b.1677672277.git.raghavendra.kt@amd.com
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Disha Talreja <dishaa.talreja@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This helps to ensure that only recently accessed PIDs scan the VMAs.
Current implementation: (idea supported by PeterZ)
1. Accessing PID information is maintained in two windows.
access_pids[1] being newest.
2. Reset old access PID info i.e. access_pid[0] every (4 *
sysctl_numa_balancing_scan_delay) interval after initial scan delay
period expires.
The above interval seemed to be experimentally optimum since it avoids
frequent reset of access info as well as helps clearing the old access
info regularly. The reset logic is implemented in scan path.
Link: https://lkml.kernel.org/r/f7a675f66d1442d048b4216b2baf94515012c405.1677672277.git.raghavendra.kt@amd.com
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Bharata B Rao <bharata@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Disha Talreja <dishaa.talreja@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During Numa scanning make sure only relevant vmas of the tasks are
scanned.
Before:
All the tasks of a process participate in scanning the vma even if they
do not access vma in it's lifespan.
Now:
Except cases of first few unconditional scans, if a process do
not touch vma (exluding false positive cases of PID collisions)
tasks no longer scan all vma
Logic used:
1) 6 bits of PID used to mark active bit in vma numab status during
fault to remember PIDs accessing vma. (Thanks Mel)
2) Subsequently in scan path, vma scanning is skipped if current PID
had not accessed vma.
3) First two times we do allow unconditional scan to preserve earlier
behaviour of scanning.
Acknowledgement to Bharata B Rao <bharata@amd.com> for initial patch to
store pid information and Peter Zijlstra <peterz@infradead.org> (Usage of
test and set bit)
Link: https://lkml.kernel.org/r/092f03105c7c1d3450f4636b1ea350407f07640e.1677672277.git.raghavendra.kt@amd.com
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Disha Talreja <dishaa.talreja@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pach series "sched/numa: Enhance vma scanning", v3.
The patchset proposes one of the enhancements to numa vma scanning
suggested by Mel. This is continuation of [3].
Reposting the rebased patchset to akpm mm-unstable tree (March 1)
Existing mechanism of scan period involves, scan period derived from
per-thread stats. Process Adaptive autoNUMA [1] proposed to gather NUMA
fault stats at per-process level to capture aplication behaviour better.
During that course of discussion, Mel proposed several ideas to enhance
current numa balancing. One of the suggestion was below
Track what threads access a VMA. The suggestion was to use an unsigned
long pid_mask and use the lower bits to tag approximately what threads
access a VMA. Skip VMAs that did not trap a fault. This would be
approximate because of PID collisions but would reduce scanning of areas
the thread is not interested in. The above suggestion intends not to
penalize threads that has no interest in the vma, thus reduce scanning
overhead.
V3 changes are mostly based on PeterZ comments (details below in changes)
Summary of patchset:
Current patchset implements:
1. Delay the vma scanning logic for newly created VMA's so that
additional overhead of scanning is not incurred for short lived tasks
(implementation by Mel)
2. Store the information of tasks accessing VMA in 2 windows. It is
regularly cleared in (4*sysctl_numa_balancing_scan_delay) interval.
The above time is derived from experimenting (Suggested by PeterZ) to
balance between frequent clearing vs obsolete access data
3. hash_32 used to encode task index accessing VMA information
4. VMA's acess information is used to skip scanning for the tasks
which had not accessed VMA
Changes since V2:
patch1:
- Renaming of structure, macro to function,
- Add explanation to heuristics
- Adding more details from result (PeterZ)
Patch2:
- Usage of test and set bit (PeterZ)
- Move storing access PID info to numa_migrate_prep()
- Add a note on fainess among tasks allowed to scan
(PeterZ)
Patch3:
- Maintain two windows of access PID information
(PeterZ supported implementation and Gave idea to extend
to N if needed)
Patch4:
- Apply hash_32 function to track VMA accessing PIDs (PeterZ)
Changes since RFC V1:
- Include Mel's vma scan delay patch
- Change the accessing pid store logic (Thanks Mel)
- Fencing structure / code to NUMA_BALANCING (David, Mel)
- Adding clearing access PID logic (Mel)
- Descriptive change log ( Mike Rapoport)
Things to ponder over:
==========================================
- Improvement to clearing accessing PIDs logic (discussed in-detail in
patch3 itself (Done in this patchset by implementing 2 window history)
- Current scan period is not changed in the patchset, so we do see
frequent tries to scan. Relaxing scan period dynamically could improve
results further.
[1] sched/numa: Process Adaptive autoNUMA
Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/
[2] RFC V1 Link:
https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/
[3] V2 Link:
https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@amd.com/
Results:
Summary: Huge autonuma cost reduction seen in mmtest. Kernbench improvement
is more than 5% and huge system time (80%+) improvement from mmtest autonuma.
(dbench had huge std deviation to post)
kernbench
===========
6.2.0-mmunstable-base 6.2.0-mmunstable-patched
Amean user-256 22002.51 ( 0.00%) 22649.95 * -2.94%*
Amean syst-256 10162.78 ( 0.00%) 8214.13 * 19.17%*
Amean elsp-256 160.74 ( 0.00%) 156.92 * 2.38%*
Duration User 66017.43 67959.84
Duration System 30503.15 24657.03
Duration Elapsed 504.61 493.12
6.2.0-mmunstable-base 6.2.0-mmunstable-patched
Ops NUMA alloc hit 1738835089.00 1738780310.00
Ops NUMA alloc local 1738834448.00 1738779711.00
Ops NUMA base-page range updates 477310.00 392566.00
Ops NUMA PTE updates 477310.00 392566.00
Ops NUMA hint faults 96817.00 87555.00
Ops NUMA hint local faults % 10150.00 2192.00
Ops NUMA hint local percent 10.48 2.50
Ops NUMA pages migrated 86660.00 85363.00
Ops AutoNUMA cost 489.07 442.14
autonumabench
===============
6.2.0-mmunstable-base 6.2.0-mmunstable-patched
Amean syst-NUMA01 399.50 ( 0.00%) 52.05 * 86.97%*
Amean syst-NUMA01_THREADLOCAL 0.21 ( 0.00%) 0.22 * -5.41%*
Amean syst-NUMA02 0.80 ( 0.00%) 0.78 * 2.68%*
Amean syst-NUMA02_SMT 0.65 ( 0.00%) 0.68 * -3.95%*
Amean elsp-NUMA01 313.26 ( 0.00%) 313.11 * 0.05%*
Amean elsp-NUMA01_THREADLOCAL 1.06 ( 0.00%) 1.08 * -1.76%*
Amean elsp-NUMA02 3.19 ( 0.00%) 3.24 * -1.52%*
Amean elsp-NUMA02_SMT 3.72 ( 0.00%) 3.61 * 2.92%*
Duration User 396433.47 324835.96
Duration System 2808.70 376.66
Duration Elapsed 2258.61 2258.12
6.2.0-mmunstable-base 6.2.0-mmunstable-patched
Ops NUMA alloc hit 59921806.00 49623489.00
Ops NUMA alloc miss 0.00 0.00
Ops NUMA interleave hit 0.00 0.00
Ops NUMA alloc local 59920880.00 49622594.00
Ops NUMA base-page range updates 152259275.00 50075.00
Ops NUMA PTE updates 152259275.00 50075.00
Ops NUMA PMD updates 0.00 0.00
Ops NUMA hint faults 154660352.00 39014.00
Ops NUMA hint local faults % 138550501.00 23139.00
Ops NUMA hint local percent 89.58 59.31
Ops NUMA pages migrated 8179067.00 14147.00
Ops AutoNUMA cost 774522.98 195.69
This patch (of 4):
Currently whenever a new task is created we wait for
sysctl_numa_balancing_scan_delay to avoid unnessary scanning overhead.
Extend the same logic to new or very short-lived VMAs.
[raghavendra.kt@amd.com: add initialization in vm_area_dup())]
Link: https://lkml.kernel.org/r/cover.1677672277.git.raghavendra.kt@amd.com
Link: https://lkml.kernel.org/r/7a6fbba87c8b51e67efd3e74285bb4cb311a16ca.1677672277.git.raghavendra.kt@amd.com
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Disha Talreja <dishaa.talreja@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
vma->lock being part of the vm_area_struct causes performance regression
during page faults because during contention its count and owner fields
are constantly updated and having other parts of vm_area_struct used
during page fault handling next to them causes constant cache line
bouncing. Fix that by moving the lock outside of the vm_area_struct.
All attempts to keep vma->lock inside vm_area_struct in a separate cache
line still produce performance regression especially on NUMA machines.
Smallest regression was achieved when lock is placed in the fourth cache
line but that bloats vm_area_struct to 256 bytes.
Considering performance and memory impact, separate lock looks like the
best option. It increases memory footprint of each VMA but that can be
optimized later if the new size causes issues. Note that after this
change vma_init() does not allocate or initialize vma->lock anymore. A
number of drivers allocate a pseudo VMA on the stack but they never use
the VMA's lock, therefore it does not need to be allocated. The future
drivers which might need the VMA lock should use
vm_area_alloc()/vm_area_free() to allocate the VMA.
Link: https://lkml.kernel.org/r/20230227173632.3292573-34-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
call_rcu() can take a long time when callback offloading is enabled. Its
use in the vm_area_free can cause regressions in the exit path when
multiple VMAs are being freed.
Because exit_mmap() is called only after the last mm user drops its
refcount, the page fault handlers can't be racing with it. Any other
possible user like oom-reaper or process_mrelease are already synchronized
using mmap_lock. Therefore exit_mmap() can free VMAs directly, without
the use of call_rcu().
Expose __vm_area_free() and use it from exit_mmap() to avoid possible
call_rcu() floods and performance regressions caused by it.
Link: https://lkml.kernel.org/r/20230227173632.3292573-33-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce per-VMA locking. The lock implementation relies on a per-vma
and per-mm sequence counters to note exclusive locking:
- read lock - (implemented by vma_start_read) requires the vma
(vm_lock_seq) and mm (mm_lock_seq) sequence counters to differ.
If they match then there must be a vma exclusive lock held somewhere.
- read unlock - (implemented by vma_end_read) is a trivial vma->lock
unlock.
- write lock - (vma_start_write) requires the mmap_lock to be held
exclusively and the current mm counter is assigned to the vma counter.
This will allow multiple vmas to be locked under a single mmap_lock
write lock (e.g. during vma merging). The vma counter is modified
under exclusive vma lock.
- write unlock - (vma_end_write_all) is a batch release of all vma
locks held. It doesn't pair with a specific vma_start_write! It is
done before exclusive mmap_lock is released by incrementing mm
sequence counter (mm_lock_seq).
- write downgrade - if the mmap_lock is downgraded to the read lock, all
vma write locks are released as well (effectivelly same as write
unlock).
Link: https://lkml.kernel.org/r/20230227173632.3292573-13-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This prepares for page faults handling under VMA lock, looking up VMAs
under protection of an rcu read lock, instead of the usual mmap read lock.
Link: https://lkml.kernel.org/r/20230227173632.3292573-11-surenb@google.com
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
MAX_ORDER is not inclusive: the maximum allocation order buddy allocator
can deliver is MAX_ORDER-1.
Fix MAX_ORDER usage in rb_alloc_aux_page().
Link: https://lkml.kernel.org/r/20230315113133.11326-7-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use the maple tree in RCU mode for VMA tracking.
The maple tree tracks the stack and is able to update the pivot
(lower/upper boundary) in-place to allow the page fault handler to write
to the tree while holding just the mmap read lock. This is safe as the
writes to the stack have a guard VMA which ensures there will always be a
NULL in the direction of the growth and thus will only update a pivot.
It is possible, but not recommended, to have VMAs that grow up/down
without guard VMAs. syzbot has constructed a testcase which sets up a VMA
to grow and consume the empty space. Overwriting the entire NULL entry
causes the tree to be altered in a way that is not safe for concurrent
readers; the readers may see a node being rewritten or one that does not
match the maple state they are using.
Enabling RCU mode allows the concurrent readers to see a stable node and
will return the expected result.
[Liam.Howlett@Oracle.com: we don't need to free the nodes with RCU[
Link: https://lore.kernel.org/linux-mm/000000000000b0a65805f663ace6@google.com/
Link: https://lkml.kernel.org/r/20230227173632.3292573-9-surenb@google.com
Fixes: d4af56c5c7 ("mm: start tracking VMAs with maple tree")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: syzbot+8d95422d3537159ca390@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On big systems, the mm refcount can become highly contented when doing a
lot of context switching with threaded applications. user<->idle switch
is one of the important cases. Abandoning lazy tlb entirely slows this
switching down quite a bit in the common uncontended case, so that is not
viable.
Implement a scheme where lazy tlb mm references do not contribute to the
refcount, instead they get explicitly removed when the refcount reaches
zero.
The final mmdrop() sends IPIs to all CPUs in the mm_cpumask and they
switch away from this mm to init_mm if it was being used as the lazy tlb
mm. Enabling the shoot lazies option therefore requires that the arch
ensures that mm_cpumask contains all CPUs that could possibly be using mm.
A DEBUG_VM option IPIs every CPU in the system after this to ensure there
are no references remaining before the mm is freed.
Shootdown IPIs cost could be an issue, but they have not been observed to
be a serious problem with this scheme, because short-lived processes tend
not to migrate CPUs much, therefore they don't get much chance to leave
lazy tlb mm references on remote CPUs. There are a lot of options to
reduce them if necessary, described in comments.
The near-worst-case can be benchmarked with will-it-scale:
context_switch1_threads -t $(($(nproc) / 2))
This will create nproc threads (nproc / 2 switching pairs) all sharing the
same mm that spread over all CPUs so each CPU does thread->idle->thread
switching.
[ Rik came up with basically the same idea a few years ago, so credit
to him for that. ]
Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/
Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/
Link: https://lkml.kernel.org/r/20230203071837.1136453-5-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting.
This makes the lazy tlb mm references more obvious, and allows the
refcounting scheme to be modified in later changes. There is no
functional change with this patch.
Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "shoot lazy tlbs (lazy tlb refcount scalability
improvement)", v7.
This series improves scalability of context switching between user and
kernel threads on large systems with a threaded process spread across a
lot of CPUs.
Discussion of v6 here:
https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/
This patch (of 5):
Remove the special case avoiding refcounting when the mm to be used is the
same as the kernel thread's active (lazy tlb) mm. kthread_use_mm() should
not be such a performance critical path that this matters much. This
simplifies a later change to lazy tlb mm refcounting.
Link: https://lkml.kernel.org/r/20230203071837.1136453-1-npiggin@gmail.com
Link: https://lkml.kernel.org/r/20230203071837.1136453-2-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmQgQeEACgkQEsHwGGHe
VUpJ+w//c01JpzBXQvGGlNSTTuUzSxLAQz0n8lQmixpYHOUgVL3CQlOF+OfnkYPp
mz8m3nDh1FB9o70Sd2J4/OjY2Gh1qENrBLmj509LTnZE9hADRI0T5Mn93Jo7m7HY
QoXrYWEwJYwsLzJ66mR8A0xts5jWgkJsAWKXF9gfxf/ieeycdJ1GdOdzC1tp4Nfe
/4SEjSbUhx/bsBbAdJ38Z/iQGT0HuyQLOBGBuBcFE0JnP/aYEanAQsxxP2LObeVw
Za7ATxdJ9I1TErVfRsG0GDSiVKCYzSG2GME5TXibgPJ2g1+m0I7gZgpGO9Q8Wzo4
7y0X+vqsykY/Me3xEDBVaeCiHmFTambkxOR2xVJ2TISN8b390yePy4vgY1QQDidd
eNh9S2x1dsKp8i4NdYyeW7xwaTfIDp9Yp4XNP8cw02VzW0FSCnsmCzwGHIXsF4K/
Sib+bhKUo+Qmck5nJlV6R5Xr9cvGgyPpBvD8/XqqwF5lHJ7xg4qkPwPKjoKL1HRj
YT1t+l0kzcg/onyDAuPe1mIRFf7Q8x5G8zkUGMG401h2tazv19rjK4+V1UemhBqA
h5Cf1BBy6+6kn4DDb/zD+0IgpDFKJDaClxNwfPzaplAoMC/8+0nxxnu7f32eT59k
/JtMisERhr6lG0Q+TfURhB3yyCiBjBR2EKcKzHz+KARtxs7a4T0=
=Z8oJ
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
- Fix a corner case where vruntime of a task is not being sanitized
* tag 'sched_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Sanitize vruntime of entity being migrated
former doesn't get ignored
- A noinstr warning fix
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmQgPhsACgkQEsHwGGHe
VUqecA//VN/9pvCpJbf0S92lDXjbjuAna4+rYak6mjxke2lYHeXNsCjPpBtdfMnK
Mvr1HykKVBGIwlIPsKvwzt1FYN08rhdPnb/XvwM1ZliFlXP/HhN7K3AWgld559sZ
hY9gKfijof5D7VhqjRS7ZA2qo2by72ekXnuVQT0cmwZiHPQLx8M4ySHuMUZD3Lmz
GQyvITuDyBX3BiIGVtqpOhpugpEjAoEBE0MPwXrMEe9Glk4i1z86Xd1Cr4ksCFZL
gIdDSnlUuLuXn1OfLA35fXzoaStB07aTMEbRl+iD1KUopdRzpqj9j0rqYINkW0Ar
W/BzyMNw2itEbGrF+kjjCpolwmJvMcJUuvEZKO/gNTv2qYZW5BQqQrHYILdmezyz
HvGndyT996D3uoUIMIGqJf+41cwqPEjGyMLs3GfYnZMdVnZZnG/KflWQCiGUR9RR
LfxaryNZfT8MfQP6uslxOWubMupfsen7Hk8oljUpT2GzUAsWxTjqYWkFx6bMNMkV
Kx304at7R9jD81qC3Rdkqu0F5Z17YZWubd1oJEhi8HeMq8uxFxPb983SkXLY0w7Y
4Ss/MhJwt30e9ltGCMwgF83uOnndXwzFJG4TR9TqsO0TcdUE6XJPbPj/K8Wsi/u7
fvnCbBLsDaEJDEAikWJsSOaMjUwnajyaYomy+9VCR536/DlIq5M=
=c5tH
-----END PGP SIGNATURE-----
Merge tag 'core_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fixes from Borislav Petkov:
- Do the delayed RCU wakeup for kthreads in the proper order so that
former doesn't get ignored
- A noinstr warning fix
* tag 'core_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry/rcu: Check TIF_RESCHED _after_ delayed RCU wake-up
entry: Fix noinstr warning in __enter_from_user_mode()
* Fix a race in the percpu counters summation code where the summation
failed to add in the values for any CPUs that were dying but not yet
dead. This fixes some minor discrepancies and incorrect assertions
when running generic/650.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZBdAbgAKCRBKO3ySh0YR
pkltAQCs4QO5LjYReqjUxd4cSsLtNnNon09qswRsl2GuRyI36AEAxI9QMq4Q6D9V
ZasNbiTCkV3KPKfmp6gf1mQNLk1lGQ0=
=Bz3q
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.3-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs percpu counter fixes from Darrick Wong:
"We discovered a filesystem summary counter corruption problem that was
traced to cpu hot-remove racing with the call to percpu_counter_sum
that sets the free block count in the superblock when writing it to
disk. The root cause is that percpu_counter_sum doesn't cull from
dying cpus and hence misses those counter values if the cpu shutdown
hooks have not yet run to merge the values.
I'm hoping this is a fairly painless fix to the problem, since the
dying cpu mask should generally be empty. It's been in for-next for a
week without any complaints from the bots.
- Fix a race in the percpu counters summation code where the
summation failed to add in the values for any CPUs that were dying
but not yet dead. This fixes some minor discrepancies and incorrect
assertions when running generic/650"
* tag 'xfs-6.3-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
pcpcntr: remove percpu_counter_sum_all()
fork: remove use of percpu_counter_sum_all
pcpcntrs: fix dying cpu summation race
cpumask: introduce for_each_cpu_or
for other subsystems.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZB48xAAKCRDdBJ7gKXxA
js2rAP4zvcMn90vBJhWNElsA7pBgDYD66QCK6JBDHGe3J1qdeQEA8D606pjMBWkL
ly7NifwCjOtFhfDRgEHOXu8g8g1k1QM=
=Cswg
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-03-24-17-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"21 hotfixes, 8 of which are cc:stable. 11 are for MM, the remainder
are for other subsystems"
* tag 'mm-hotfixes-stable-2023-03-24-17-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
mm: mmap: remove newline at the end of the trace
mailmap: add entries for Richard Leitner
kcsan: avoid passing -g for test
kfence: avoid passing -g for test
mm: kfence: fix using kfence_metadata without initialization in show_object()
lib: dhry: fix unstable smp_processor_id(_) usage
mailmap: add entry for Enric Balletbo i Serra
mailmap: map Sai Prakash Ranjan's old address to his current one
mailmap: map Rajendra Nayak's old address to his current one
Revert "kasan: drop skip_kasan_poison variable in free_pages_prepare"
mailmap: add entry for Tobias Klauser
kasan, powerpc: don't rename memintrinsics if compiler adds prefixes
mm/ksm: fix race with VMA iteration and mm_struct teardown
kselftest: vm: fix unused variable warning
mm: fix error handling for map_deny_write_exec
mm: deduplicate error handling for map_deny_write_exec
checksyscalls: ignore fstat to silence build warning on LoongArch
nilfs2: fix kernel-infoleak in nilfs_ioctl_wrap_copy()
test_maple_tree: add more testing for mas_empty_area()
maple_tree: fix mas_skip_node() end slot detection
...
Current release - regressions:
- wifi: mt76: mt7915: add back 160MHz channel width support for MT7915
- libbpf: revert poisoning of strlcpy, it broke uClibc-ng
Current release - new code bugs:
- bpf: improve the coverage of the "allow reads from uninit stack"
feature to fix verification complexity problems
- eth: am65-cpts: reset PPS genf adj settings on enable
Previous releases - regressions:
- wifi: mac80211: serialize ieee80211_handle_wake_tx_queue()
- wifi: mt76: do not run mt76_unregister_device() on unregistered hw,
fix null-deref
- Bluetooth: btqcomsmd: fix command timeout after setting BD address
- eth: igb: revert rtnl_lock() that causes a deadlock
- dsa: mscc: ocelot: fix device specific statistics
Previous releases - always broken:
- xsk: add missing overflow check in xdp_umem_reg()
- wifi: mac80211:
- fix QoS on mesh interfaces
- fix mesh path discovery based on unicast packets
- Bluetooth:
- ISO: fix timestamped HCI ISO data packet parsing
- remove "Power-on" check from Mesh feature
- usbnet: more fixes to drivers trusting packet length
- wifi: iwlwifi: mvm: fix mvmtxq->stopped handling
- Bluetooth: btintel: iterate only bluetooth device ACPI entries
- eth: iavf: fix inverted Rx hash condition leading to disabled hash
- eth: igc: fix the validation logic for taprio's gate list
- dsa: tag_brcm: legacy: fix daisy-chained switches
Misc:
- bpf: adjust insufficient default bpf_jit_limit to account for
growth of BPF use over the last 5 years
- xdp: bpf_xdp_metadata() use EOPNOTSUPP as unique errno indicating
no driver support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmQc4vkACgkQMUZtbf5S
IruG/w//XixBtdFMHE0/fcGv77jTovlJNiDYeaa+KtyjvIseieYwOKW5F31r3xvl
Mf/YhNEjAc++V8Zna/1UM5i/WOj1PJdHgSC+wMUGUXjMF+MfzL57nM83CllOpUB5
Z9YtUqGfolf2Vtx03wnV14qawmVnJWYKHn3AU11cueE5dUu6KNyBTCefQ7uzgcJN
zMtHAxw96MRQIDxSfKvZsePk4FnQ4qoSOLkslji5iikcMnKePaqZaxQla2oTcEIR
zue9V+ILmi62Y8mPcdT4ePpZQsjB39bpemh+9EL6l03/cjsjqmuiCw/d1+6g9kuy
ZD5LgZzUOb6xalhSseiwJL+vj8x2gQhshEfoHQvgp7fzr6agta6sisRX611wtmJl
hv4k2PMRqFrMv2S+8m8XC177bXIaGbiWh4vBFOWjf4u0lG55cGlzclbXWWQ80njy
C5cE4V7qPRk8Cl/+uT10CLNQx6JmaX8kcddtFrYpu0PZHKx1WfUYKIpgkiiMPRKT
njLkDQbFRa8Y3p7UX0wU1TbeuMzzLz+aTBrFEN864IJmbnUnWimeluQzD60WbkSx
6dciqq11LtvYDsR1HZ1pb7IoHYuDsDrO2Rx4zuqsB/SyfrGdRKJoKOnYvsk+AdCL
N/e4wivie8s6b+G3yL6p+IdlpEaVo2ZiLINp7JSW8jhW1hRcZUI=
=XBLi
-----END PGP SIGNATURE-----
Merge tag 'net-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, wifi and bluetooth.
Current release - regressions:
- wifi: mt76: mt7915: add back 160MHz channel width support for
MT7915
- libbpf: revert poisoning of strlcpy, it broke uClibc-ng
Current release - new code bugs:
- bpf: improve the coverage of the "allow reads from uninit stack"
feature to fix verification complexity problems
- eth: am65-cpts: reset PPS genf adj settings on enable
Previous releases - regressions:
- wifi: mac80211: serialize ieee80211_handle_wake_tx_queue()
- wifi: mt76: do not run mt76_unregister_device() on unregistered hw,
fix null-deref
- Bluetooth: btqcomsmd: fix command timeout after setting BD address
- eth: igb: revert rtnl_lock() that causes a deadlock
- dsa: mscc: ocelot: fix device specific statistics
Previous releases - always broken:
- xsk: add missing overflow check in xdp_umem_reg()
- wifi: mac80211:
- fix QoS on mesh interfaces
- fix mesh path discovery based on unicast packets
- Bluetooth:
- ISO: fix timestamped HCI ISO data packet parsing
- remove "Power-on" check from Mesh feature
- usbnet: more fixes to drivers trusting packet length
- wifi: iwlwifi: mvm: fix mvmtxq->stopped handling
- Bluetooth: btintel: iterate only bluetooth device ACPI entries
- eth: iavf: fix inverted Rx hash condition leading to disabled hash
- eth: igc: fix the validation logic for taprio's gate list
- dsa: tag_brcm: legacy: fix daisy-chained switches
Misc:
- bpf: adjust insufficient default bpf_jit_limit to account for
growth of BPF use over the last 5 years
- xdp: bpf_xdp_metadata() use EOPNOTSUPP as unique errno indicating
no driver support"
* tag 'net-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
Bluetooth: HCI: Fix global-out-of-bounds
Bluetooth: mgmt: Fix MGMT add advmon with RSSI command
Bluetooth: btsdio: fix use after free bug in btsdio_remove due to unfinished work
Bluetooth: L2CAP: Fix responding with wrong PDU type
Bluetooth: btqcomsmd: Fix command timeout after setting BD address
Bluetooth: btinel: Check ACPI handle for NULL before accessing
net: mdio: thunder: Add missing fwnode_handle_put()
net: dsa: mt7530: move setting ssc_delta to PHY_INTERFACE_MODE_TRGMII case
net: dsa: mt7530: move lowering TRGMII driving to mt7530_setup()
net: dsa: mt7530: move enabling disabling core clock to mt7530_pll_setup()
net: asix: fix modprobe "sysfs: cannot create duplicate filename"
gve: Cache link_speed value from device
tools: ynl: Fix genlmsg header encoding formats
net: enetc: fix aggregate RMON counters not showing the ranges
Bluetooth: Remove "Power-on" check from Mesh feature
Bluetooth: Fix race condition in hci_cmd_sync_clear
Bluetooth: btintel: Iterate only bluetooth device ACPI entries
Bluetooth: ISO: fix timestamped HCI ISO data packet parsing
Bluetooth: btusb: Remove detection of ISO packets over bulk
Bluetooth: hci_core: Detect if an ACL packet is in fact an ISO packet
...
Nathan reported that when building with GNU as and a version of clang that
defaults to DWARF5, the assembler will complain with:
Error: non-constant .uleb128 is not supported
This is because `-g` defaults to the compiler debug info default. If the
assembler does not support some of the directives used, the above errors
occur. To fix, remove the explicit passing of `-g`.
All the test wants is that stack traces print valid function names, and
debug info is not required for that. (I currently cannot recall why I
added the explicit `-g`.)
Link: https://lkml.kernel.org/r/20230316224705.709984-2-elver@google.com
Fixes: 1fe84fd4a4 ("kcsan: Add test suite")
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZBzSGQAKCRDbK58LschI
g+dhAP95enbrlwaQ+9aoqrU+GqCq+uo4SkaqnUtq6GSvRNiVBQD8C6iZxrAjyXnm
1wRr3JN/HszPBzgjl3HvDc9y69I/PAI=
=8JwR
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-03-23
We've added 8 non-merge commits during the last 13 day(s) which contain
a total of 21 files changed, 238 insertions(+), 161 deletions(-).
The main changes are:
1) Fix verification issues in some BPF programs due to their stack usage
patterns, from Eduard Zingerman.
2) Fix to add missing overflow checks in xdp_umem_reg and return an error
in such case, from Kal Conley.
3) Fix and undo poisoning of strlcpy in libbpf given it broke builds for
libcs which provided the former like uClibc-ng, from Jesus Sanchez-Palencia.
4) Fix insufficient bpf_jit_limit default to avoid users running into hard
to debug seccomp BPF errors, from Daniel Borkmann.
5) Fix driver return code when they don't support a bpf_xdp_metadata kfunc
to make it unambiguous from other errors, from Jesper Dangaard Brouer.
6) Two BPF selftest fixes to address compilation errors from recent changes
in kernel structures, from Alexei Starovoitov.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
xdp: bpf_xdp_metadata use EOPNOTSUPP for no driver support
bpf: Adjust insufficient default bpf_jit_limit
xsk: Add missing overflow check in xdp_umem_reg
selftests/bpf: Fix progs/test_deny_namespace.c issues.
selftests/bpf: Fix progs/find_vma_fail1.c build error.
libbpf: Revert poisoning of strlcpy
selftests/bpf: Tests for uninitialized stack reads
bpf: Allow reads from uninit stack
====================
Link: https://lore.kernel.org/r/20230323225221.6082-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We've seen recent AWS EKS (Kubernetes) user reports like the following:
After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS
clusters after a few days a number of the nodes have containers stuck
in ContainerCreating state or liveness/readiness probes reporting the
following error:
Readiness probe errored: rpc error: code = Unknown desc = failed to
exec in container: failed to start exec "4a11039f730203ffc003b7[...]":
OCI runtime exec failed: exec failed: unable to start container process:
unable to init seccomp: error loading seccomp filter into kernel:
error loading seccomp filter: errno 524: unknown
However, we had not been seeing this issue on previous AMIs and it only
started to occur on v20230217 (following the upgrade from kernel 5.4 to
5.10) with no other changes to the underlying cluster or workloads.
We tried the suggestions from that issue (sysctl net.core.bpf_jit_limit=452534528)
which helped to immediately allow containers to be created and probes to
execute but after approximately a day the issue returned and the value
returned by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}'
was steadily increasing.
I tested bpf tree to observe bpf_jit_charge_modmem, bpf_jit_uncharge_modmem
their sizes passed in as well as bpf_jit_current under tcpdump BPF filter,
seccomp BPF and native (e)BPF programs, and the behavior all looks sane
and expected, that is nothing "leaking" from an upstream perspective.
The bpf_jit_limit knob was originally added in order to avoid a situation
where unprivileged applications loading BPF programs (e.g. seccomp BPF
policies) consuming all the module memory space via BPF JIT such that loading
of kernel modules would be prevented. The default limit was defined back in
2018 and while good enough back then, we are generally seeing far more BPF
consumers today.
Adjust the limit for the BPF JIT pool from originally 1/4 to now 1/2 of the
module memory space to better reflect today's needs and avoid more users
running into potentially hard to debug issues.
Fixes: fdadd04931 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
Reported-by: Stephen Haynes <sh@synk.net>
Reported-by: Lefteris Alexakis <lefteris.alexakis@kpn.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://github.com/awslabs/amazon-eks-ami/issues/1179
Link: https://github.com/awslabs/amazon-eks-ami/issues/1219
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230320143725.8394-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
RCU sometimes needs to perform a delayed wake up for specific kthreads
handling offloaded callbacks (RCU_NOCB). These wakeups are performed
by timers and upon entry to idle (also to guest and to user on nohz_full).
However the delayed wake-up on kernel exit is actually performed after
the thread flags are fetched towards the fast path check for work to
do on exit to user. As a result, and if there is no other pending work
to do upon that kernel exit, the current task will resume to userspace
with TIF_RESCHED set and the pending wake up ignored.
Fix this with fetching the thread flags _after_ the delayed RCU-nocb
kthread wake-up.
Fixes: 47b8ff194c ("entry: Explicitly flush pending rcuog wakeup before last rescheduling point")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230315194349.10798-3-joel@joelfernandes.org
Commit 829c1651e9 ("sched/fair: sanitize vruntime of entity being placed")
fixes an overflowing bug, but ignore a case that se->exec_start is reset
after a migration.
For fixing this case, we delay the reset of se->exec_start after
placing the entity which se->exec_start to detect long sleeping task.
In order to take into account a possible divergence between the clock_task
of 2 rqs, we increase the threshold to around 104 days.
Fixes: 829c1651e9 ("sched/fair: sanitize vruntime of entity being placed")
Originally-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
Link: https://lore.kernel.org/r/20230317160810.107988-1-vincent.guittot@linaro.org
__enter_from_user_mode() is triggering noinstr warnings with
CONFIG_DEBUG_PREEMPT due to its call of preempt_count_add() via
ct_state().
The preemption disable isn't needed as interrupts are already disabled.
And the context_tracking_enabled() check in ct_state() also isn't needed
as that's already being done by the CT_WARN_ON().
Just use __ct_state() instead.
Fixes the following warnings:
vmlinux.o: warning: objtool: enter_from_user_mode+0xba: call to preempt_count_add() leaves .noinstr.text section
vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0xf9: call to preempt_count_add() leaves .noinstr.text section
vmlinux.o: warning: objtool: syscall_enter_from_user_mode_prepare+0xc7: call to preempt_count_add() leaves .noinstr.text section
vmlinux.o: warning: objtool: irqentry_enter_from_user_mode+0xba: call to preempt_count_add() leaves .noinstr.text section
Fixes: 171476775d ("context_tracking: Convert state to atomic_t")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/d8955fa6d68dc955dda19baf13ae014ae27926f5.1677369694.git.jpoimboe@kernel.org
- Fix setting affinity of hwlat threads in containers
Using sched_set_affinity() has unwanted side effects when being
called within a container. Use set_cpus_allowed_ptr() instead.
- Fix per cpu thread management of the hwlat tracer
* Do not start per_cpu threads if one is already running for the CPU.
* When starting per_cpu threads, do not clear the kthread variable
as it may already be set to running per cpu threads
- Fix return value for test_gen_kprobe_cmd()
On error the return value was overwritten by being set to
the result of the call from kprobe_event_delete(), which would
likely succeed, and thus have the function return success.
- Fix splice() reads from the trace file that was broken by
36e2c7421f ("fs: don't allow splice read/write without explicit ops")
- Remove obsolete and confusing comment in ring_buffer.c
The original design of the ring buffer used struct page flags
for tricks to optimize, which was shortly removed due to them
being tricks. But a comment for those tricks remained.
- Set local functions and variables to static
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZBdIkBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qti2AP49s1GM8teFgDF/CO3oa45BoYq1lMJO
1Z+x+mS/bdUAZgEAw+3KGo8oZDuvWu/nr04JPeoy0GL1/JnbQ6JNCCjzhQc=
=Yk66
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix setting affinity of hwlat threads in containers
Using sched_set_affinity() has unwanted side effects when being
called within a container. Use set_cpus_allowed_ptr() instead
- Fix per cpu thread management of the hwlat tracer:
- Do not start per_cpu threads if one is already running for the CPU
- When starting per_cpu threads, do not clear the kthread variable
as it may already be set to running per cpu threads
- Fix return value for test_gen_kprobe_cmd()
On error the return value was overwritten by being set to the result
of the call from kprobe_event_delete(), which would likely succeed,
and thus have the function return success
- Fix splice() reads from the trace file that was broken by commit
36e2c7421f ("fs: don't allow splice read/write without explicit
ops")
- Remove obsolete and confusing comment in ring_buffer.c
The original design of the ring buffer used struct page flags for
tricks to optimize, which was shortly removed due to them being
tricks. But a comment for those tricks remained
- Set local functions and variables to static
* tag 'trace-v6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/hwlat: Replace sched_setaffinity with set_cpus_allowed_ptr
ring-buffer: remove obsolete comment for free_buffer_page()
tracing: Make splice_read available again
ftrace: Set direct_ops storage-class-specifier to static
trace/hwlat: Do not start per-cpu thread if it is already running
trace/hwlat: Do not wipe the contents of per-cpu thread data
tracing/osnoise: set several trace_osnoise.c variables storage-class-specifier to static
tracing: Fix wrong return in kprobe_event_gen_test.c
There is a problem with the behavior of hwlat in a container,
resulting in incorrect output. A warning message is generated:
"cpumask changed while in round-robin mode, switching to mode none",
and the tracing_cpumask is ignored. This issue arises because
the kernel thread, hwlatd, is not a part of the container, and
the function sched_setaffinity is unable to locate it using its PID.
Additionally, the task_struct of hwlatd is already known.
Ultimately, the function set_cpus_allowed_ptr achieves
the same outcome as sched_setaffinity, but employs task_struct
instead of PID.
Test case:
# cd /sys/kernel/tracing
# echo 0 > tracing_on
# echo round-robin > hwlat_detector/mode
# echo hwlat > current_tracer
# unshare --fork --pid bash -c 'echo 1 > tracing_on'
# dmesg -c
Actual behavior:
[573502.809060] hwlat_detector: cpumask changed while in round-robin mode, switching to mode none
Link: https://lore.kernel.org/linux-trace-kernel/20230316144535.1004952-1-costa.shul@redhat.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 0330f7aa8e ("tracing: Have hwlat trace migrate across tracing_cpumask CPUs")
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The comment refers to mm/slob.c which is being removed. It comes from
commit ed56829cb3 ("ring_buffer: reset buffer page when freeing") and
according to Steven the borrowed code was a page mapcount and mapping
reset, which was later removed by commit e4c2ce82ca ("ring_buffer:
allocate buffer page pointer"). Thus the comment is not accurate anyway,
remove it.
Link: https://lore.kernel.org/linux-trace-kernel/20230315142446.27040-1-vbabka@suse.cz
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Reported-by: Mike Rapoport <mike.rapoport@gmail.com>
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Fixes: e4c2ce82ca ("ring_buffer: allocate buffer page pointer")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Since the commit 36e2c7421f ("fs: don't allow splice read/write
without explicit ops") is applied to the kernel, splice() and
sendfile() calls on the trace file (/sys/kernel/debug/tracing
/trace) return EINVAL.
This patch restores these system calls by initializing splice_read
in file_operations of the trace file. This patch only enables such
functionalities for the read case.
Link: https://lore.kernel.org/linux-trace-kernel/20230314013707.28814-1-sfoon.kim@samsung.com
Cc: stable@vger.kernel.org
Fixes: 36e2c7421f ("fs: don't allow splice read/write without explicit ops")
Signed-off-by: Sung-hun Kim <sfoon.kim@samsung.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This effectively reverts the change made in commit f689054aac
("percpu_counter: add percpu_counter_sum_all interface") as the
race condition percpu_counter_sum_all() was invented to avoid is
now handled directly in percpu_counter_sum() and nobody needs to
care about summing racing with cpu unplug anymore.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
to groups
- Update the proper event time tracking variable depending on the
event type
- Fix a memory overwrite issue due to using the wrong function argument
when outputting perf events
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmQXBiQACgkQEsHwGGHe
VUrTHg//dTd+7Oo1vao32lZCkVW0bBvx3KzoyAcvEPh7Yu3Cw+tZkfI8lWY8dgU3
lgecN1pUa9IiZNULFbulXqRumcq3HIRKCp/RejEmR3W30KCwxnEAwGdakkCPEcMS
2vXl4SZn7rU8avKJBd/ZUUS5lDWz6YHPNbLX3iamFH+7oN56Vf2LFJuuO0WtZSgr
Sqv25oaV1ZZjeEAI5KrPM04hcldj4BXGPoIPvP30/c3z6aPnwt6v5H2gw5IUJNHl
Qm0ycgW7An6iNR4bFpm/NP9cSAI7FgmAB3VuhQAExl6NroUxE8R+HEO04PsJAOgd
BpNB1eXIjrTgw8r+TsDq4Z9VY8fgMtHbwutAp348XwF6//5F/WOzojwSGAAX7LCx
5+fK4eA4gLkQJmuyI5/3fNCC0EnUTsK/YCC5eFyas8KFERTmmq8hafCBQO4W9VPr
8A7TD2X6fWNWvS/UF1RqVv/gCa0ub7ifz+eOx42/vK2PXn4vsjMv0JtwnARj8PzZ
Ymo/3ArmBWhuP+94FdVP4AduBuEdHvWO7EZG5buRdOo//sb2MwX05zufowLvR9YQ
E+VkZxf0RFPaDQdPnoS/SwFE206ii2Z5MtgsQX7XpoBo06R5AzZQlhlMl/StFvN0
65ut1dWwqrVG0uSN0GOzTIZl5x5YZASm5I49ItZsnZyoMPTwDOc=
=l7PF
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v6.3_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Check whether sibling events have been deactivated before adding them
to groups
- Update the proper event time tracking variable depending on the event
type
- Fix a memory overwrite issue due to using the wrong function argument
when outputting perf events
* tag 'perf_urgent_for_v6.3_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix check before add_event_to_groups() in perf_group_detach()
perf: fix perf_event_context->time
perf/core: Fix perf_output_begin parameter is incorrectly invoked in perf_event_bpf_output
smatch reports this warning
kernel/trace/ftrace.c:2594:19: warning:
symbol 'direct_ops' was not declared. Should it be static?
The variable direct_ops is only used in ftrace.c, so it should be static
Link: https://lore.kernel.org/linux-trace-kernel/20230311135113.711824-1-trix@redhat.com
Signed-off-by: Tom Rix <trix@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The hwlatd tracer will end up starting multiple per-cpu threads with
the following script:
#!/bin/sh
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo hwlat > current_tracer
echo per-cpu > hwlat_detector/mode
echo 100000 > hwlat_detector/width
echo 200000 > hwlat_detector/window
echo 1 > tracing_on
To fix the issue, check if the hwlatd thread for the cpu is already
running, before starting a new one. Along with the previous patch, this
avoids running multiple instances of the same CPU thread on the system.
Link: https://lore.kernel.org/all/20230302113654.2984709-1-tero.kristo@linux.intel.com/
Link: https://lkml.kernel.org/r/20230310100451.3948583-3-tero.kristo@linux.intel.com
Cc: stable@vger.kernel.org
Fixes: f46b16520a ("trace/hwlat: Implement the per-cpu mode")
Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
smatch reports several similar warnings
kernel/trace/trace_osnoise.c:220:1: warning:
symbol '__pcpu_scope_per_cpu_osnoise_var' was not declared. Should it be static?
kernel/trace/trace_osnoise.c:243:1: warning:
symbol '__pcpu_scope_per_cpu_timerlat_var' was not declared. Should it be static?
kernel/trace/trace_osnoise.c:335:14: warning:
symbol 'interface_lock' was not declared. Should it be static?
kernel/trace/trace_osnoise.c:2242:5: warning:
symbol 'timerlat_min_period' was not declared. Should it be static?
kernel/trace/trace_osnoise.c:2243:5: warning:
symbol 'timerlat_max_period' was not declared. Should it be static?
These variables are only used in trace_osnoise.c, so it should be static
Link: https://lore.kernel.org/linux-trace-kernel/20230309150414.4036764-1-trix@redhat.com
Signed-off-by: Tom Rix <trix@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Overwriting the error code with the deletion result may cause the
function to return 0 despite encountering an error. Commit b111545d26
("tracing: Remove the useless value assignment in
test_create_synth_event()") solves a similar issue by
returning the original error code, so this patch does the same.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Link: https://lore.kernel.org/linux-trace-kernel/20230131075818.5322-1-aagusev@ispras.ru
Signed-off-by: Anton Gusev <aagusev@ispras.ru>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Events should only be added to a groups rb tree if they have not been
removed from their context by list_del_event(). Since remove_on_exec
made it possible to call list_del_event() on individual events before
they are detached from their group, perf_group_detach() should check each
sibling's attach_state before calling add_event_to_groups() on it.
Fixes: 2e498d0a74 ("perf: Add support for event removal on exec")
Signed-off-by: Budimir Markovic <markovicbudimir@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/ZBFzvQV9tEqoHEtH@gentoo
Time readers rely on perf_event_context->[time|timestamp|timeoffset] to get
accurate time_enabled and time_running for an event. The difference between
ctx->timestamp and ctx->time is the among of time when the context is not
enabled. __update_context_time(ctx, false) is used to increase timestamp,
but not time. Therefore, it should only be called in ctx_sched_in() when
EVENT_TIME was not enabled.
Fixes: 09f5e7dc7a ("perf: Fix perf_event_read_local() time")
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lkml.kernel.org/r/20230313171608.298734-1-song@kernel.org
syzkaller reportes a KASAN issue with stack-out-of-bounds.
The call trace is as follows:
dump_stack+0x9c/0xd3
print_address_description.constprop.0+0x19/0x170
__kasan_report.cold+0x6c/0x84
kasan_report+0x3a/0x50
__perf_event_header__init_id+0x34/0x290
perf_event_header__init_id+0x48/0x60
perf_output_begin+0x4a4/0x560
perf_event_bpf_output+0x161/0x1e0
perf_iterate_sb_cpu+0x29e/0x340
perf_iterate_sb+0x4c/0xc0
perf_event_bpf_event+0x194/0x2c0
__bpf_prog_put.constprop.0+0x55/0xf0
__cls_bpf_delete_prog+0xea/0x120 [cls_bpf]
cls_bpf_delete_prog_work+0x1c/0x30 [cls_bpf]
process_one_work+0x3c2/0x730
worker_thread+0x93/0x650
kthread+0x1b8/0x210
ret_from_fork+0x1f/0x30
commit 267fb27352 ("perf: Reduce stack usage of perf_output_begin()")
use on-stack struct perf_sample_data of the caller function.
However, perf_event_bpf_output uses incorrect parameter to convert
small-sized data (struct perf_bpf_event) into large-sized data
(struct perf_sample_data), which causes memory overwriting occurs in
__perf_event_header__init_id.
Fixes: 267fb27352 ("perf: Reduce stack usage of perf_output_begin()")
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230314044735.56551-1-yangjihong1@huawei.com
The getaffinity() system call uses 'cpumask_size()' to decide how big
the CPU mask is - so far so good. It is indeed the allocation size of a
cpumask.
But the code also assumes that the whole allocation is initialized
without actually doing so itself. That's wrong, because we might have
fixed-size allocations (making copying and clearing more efficient), but
not all of it is then necessarily used if 'nr_cpu_ids' is smaller.
Having checked other users of 'cpumask_size()', they all seem to be ok,
either using it purely for the allocation size, or explicitly zeroing
the cpumask before using the size in bytes to copy it.
See for example the ublk_ctrl_get_queue_affinity() function that uses
the proper 'zalloc_cpumask_var()' to make sure that the whole mask is
cleared, whether the storage is on the stack or if it was an external
allocation.
Fix this by just zeroing the allocation before using it. Do the same
for the compat version of sched_getaffinity(), which had the same logic.
Also, for consistency, make sched_getaffinity() use 'cpumask_bits()' to
access the bits. For a cpumask_var_t, it ends up being a pointer to the
same data either way, but it's just a good idea to treat it like you
would a 'cpumask_t'. The compat case already did that.
Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/lkml/7d026744-6bd6-6827-0471-b5e8eae0be3f@arm.com/
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Do not allow histogram values to have modifies.
Can cause a NULL pointer dereference if they do.
- Warn if hist_field_name() is passed a NULL.
Prevent the NULL pointer dereference mentioned above.
- Fix invalid address look up race in lookup_rec()
- Define ftrace_stub_graph conditionally to prevent linker errors
- Always check if RCU is watching at all tracepoint locations
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZBDuTBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qsboAP4yfrFYvIIKM5EkzkEiPI+V2hdlA12x
bt839jO5AWCmhAEAiY8FmKatpBJQKsiGqSOab8aHOMnhGFZwltCHAPa9PAI=
=vtA2
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Do not allow histogram values to have modifies. They can cause a NULL
pointer dereference if they do.
- Warn if hist_field_name() is passed a NULL. Prevent the NULL pointer
dereference mentioned above.
- Fix invalid address look up race in lookup_rec()
- Define ftrace_stub_graph conditionally to prevent linker errors
- Always check if RCU is watching at all tracepoint locations
* tag 'trace-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Make tracepoint lockdep check actually test something
ftrace,kcfi: Define ftrace_stub_graph conditionally
ftrace: Fix invalid address access in lookup_rec() when index is 0
tracing: Check field value in hist_field_name()
tracing: Do not let histogram values have some modifiers