IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
There are a few functions which were inlined but are somewhat too large to
inline, so remove the inline key word.
There are also several very small functions which are used in critical
code sections which gcc was not inlining, so make this more strict and use
__always_line for these functions.
Link: https://lkml.kernel.org/r/20231101171629.3612299-8-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The node end is set during the walk, so use the resulting end instead of
re-fetching it.
Link: https://lkml.kernel.org/r/20231101171629.3612299-7-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When looking for the next entry, don't recalculate the node end as it is
now tracked in the maple state.
Link: https://lkml.kernel.org/r/20231101171629.3612299-6-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Analysis of the mas_for_each() iteration showed that there is a
significant time spent finding the end of a node. This time can be
greatly reduced if the end of the node is cached in the maple state. Care
must be taken to update & invalidate as necessary.
Link: https://lkml.kernel.org/r/20231101171629.3612299-5-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mas_erase() may not deal correctly with all maple states. Make the
function more robust by ensuring the state is in one of the two acceptable
states.
Link: https://lkml.kernel.org/r/20231101171629.3612299-3-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "maple_tree: iterator state changes".
These patches have some general cleanup and a change to separate the maple
state status tracking from the maple state node.
The maple state status change allows for walks to continue from previous
places when the status needs to be recorded to make logical sense for the
next call to the maple state. For instance, it allows for prev/next to
function in a way that better resembles the linked list. It also allows
switch statements to be used to detect missed states during compile, and
the addition of fast-path "active" state is cleaner as an enum.
While making the status change, perf showed some very small (one line)
functions that were not inlined even with the inline key word. Making
these small functions __always_inline is less expensive according to perf.
As part of that change, some inlines have been dropped from larger
functions.
Perf also showed that the commonly used mas_for_each() iterator was
spending a lot of time finding the end of the node. This series
introduces caching of the end of the node in the maple state (and updating
it during writes). This caching along with the inline changes yielded at
23.25% improvement on the BENCH_MAS_FOR_EACH maple tree test framework
benchmark.
I've also included a change to mtree_range_walk and mtree_lookup_walk to
take advantage of Peng's change [1] to the initial pivot setup.
mmtests did not produce any significant gains.
[1] https://lore.kernel.org/all/20230711035444.526-1-zhangpeng.00@bytedance.com/T/#u
This patch (of 12):
Removing the default types from the switch statements will cause compile
warnings on missing cases.
Link: https://lkml.kernel.org/r/20231101171629.3612299-2-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Get rid of MACHINE_HAS_VX and replace it with cpu_has_vx() which is a
short readable wrapper for "test_facility(129)".
Facility bit 129 is set if the vector facility is present. test_facility()
returns also true for all bits which are set in the architecture level set
of the cpu that the kernel is compiled for. This means that
test_facility(129) is a compile time constant which returns true for z13
and later, since the vector facility bit is part of the z13 kernel ALS.
In result the compiled code will have less runtime checks, and less code.
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Patch series "Treewide: enable -Wmissing-prototypes", v3.
At this point, there are five architectures with a number of known
regressions: alpha, nios2, mips, sh and sparc. In the previous version of
this patch, I had turned off the missing prototype warnings for the 15
architectures that still had issues, but since there are only five left, I
think we can leave the rest to the maintainers (Cc'd here) as well.
The series is also likely to cause occasional build regressions on
linux-next as developers add new code that misses prototypes. Hopefully
this should be resolved by the time the patches make it into a release and
everyone gets the warnings right away.
This patch (of 6):
There is no global declaration for ida_dump() and no other callers, so
make it static to avoid this warning:
lib/test_ida.c:16:6: error: no previous prototype for 'ida_dump'
Link: https://lkml.kernel.org/r/20231123110506.707903-1-arnd@kernel.org
Link: https://lkml.kernel.org/r/20231123110506.707903-2-arnd@kernel.org
Fixes: 8ab8ba38d488 ("ida: Start new test_ida module")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tudor Ambarus <tudor.ambarus@linaro.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Kees Cook <keescook@chromium.org>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Record and report more information to help us find the cause of the bug
and to help us correlate the error with other system events.
This patch adds recording and showing CPU number and timestamp at
allocation and free (controlled by CONFIG_KASAN_EXTRA_INFO). The
timestamps in the report use the same format and source as printk.
Error occurrence timestamp is already implicit in the printk log, and CPU
number is already shown by dump_stack_lvl, so there is no need to add it.
In order to record CPU number and timestamp at allocation and free,
corresponding members need to be added to the relevant data structures,
which will lead to increased memory consumption.
In Generic KASAN, members are added to struct kasan_track. Since in most
cases, alloc meta is stored in the redzone and free meta is stored in the
object or the redzone, memory consumption will not increase much.
In SW_TAGS KASAN and HW_TAGS KASAN, members are added to struct
kasan_stack_ring_entry. Memory consumption increases as the size of
struct kasan_stack_ring_entry increases (this part of the memory is
allocated by memblock), but since this is configurable, it is up to the
user to choose.
Link: https://lkml.kernel.org/r/VI1P193MB0752BD991325D10E4AB1913599BDA@VI1P193MB0752.EURP193.PROD.OUTLOOK.COM
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
KMSAN is frequently used in fuzzing scenarios and thus saves a lot of
stack traces. As KMSAN does not support evicting stack traces from the
stack depot, the stack depot capacity might be reached quickly with large
stack records.
Adjust the maximum number of stack depot pools for this case.
The average size of a stack trace saved into the stack depot is ~16
frames. Thus, adjust the maximum pools number accordingly to keep the
maximum number of stack traces that can be saved into the stack depot
similar to the one that was allowed before the stack trace eviction
changes.
Link: https://lkml.kernel.org/r/301a115cf7ce8ddb42ef6de9151c2bb76ba728fc.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add stack_depot_put, a function that decrements the reference counter on a
stack record and removes it from the stack depot once the counter reaches
0.
Internally, when removing a stack record, the function unlinks it from the
hash table bucket and returns to the freelist.
With this change, the users of stack depot can call stack_depot_put when
keeping a stack trace in the stack depot is not needed anymore. This
allows avoiding polluting the stack depot with irrelevant stack traces and
thus have more space to store the relevant ones before the stack depot
reaches its capacity.
Link: https://lkml.kernel.org/r/1d1ad5692ee43d4fc2b3fd9d221331d30b36123f.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a reference counter for how many times a stack records has been
added to stack depot.
Add a new STACK_DEPOT_FLAG_GET flag to stack_depot_save_flags that
instructs the stack depot to increment the refcount.
Do not yet decrement the refcount; this is implemented in one of the
following patches.
Do not yet enable any users to use the flag to avoid overflowing the
refcount.
This is preparatory patch for implementing the eviction of stack records
from the stack depot.
Link: https://lkml.kernel.org/r/a3fc14a2359d019d2a008d4ff8b46a665371ffee.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Change the bool can_alloc argument of __stack_depot_save to a u32
argument that accepts a set of flags.
The following patch will add another flag to stack_depot_save_flags
besides the existing STACK_DEPOT_FLAG_CAN_ALLOC.
Also rename the function to stack_depot_save_flags, as
__stack_depot_save is a cryptic name,
Link: https://lkml.kernel.org/r/645fa15239621eebbd3a10331e5864b718839512.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Switch stack_record to use list_head for links in the hash table and in
the freelist.
This will allow removing entries from the hash table buckets.
This is preparatory patch for implementing the eviction of stack records
from the stack depot.
Link: https://lkml.kernel.org/r/4787d9a584cd33433d9ee1846b17fa3d3e1987ad.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, stack depot uses the following locking scheme:
1. Lock-free accesses when looking up a stack record, which allows to
have multiple users to look up records in parallel;
2. Spinlock for protecting the stack depot pools and the hash table
when adding a new record.
For implementing the eviction of stack traces from stack depot, the
lock-free approach is not going to work anymore, as we will need to be
able to also remove records from the hash table.
Convert the spinlock into a read/write lock, and drop the atomic
accesses, as they are no longer required.
Looking up stack traces is now protected by the read lock and adding new
records - by the write lock. One of the following patches will add a
new function for evicting stack records, which will be protected by the
write lock as well.
With this change, multiple users can still look up records in parallel.
This is preparatory patch for implementing the eviction of stack records
from the stack depot.
Link: https://lkml.kernel.org/r/9f81ffcc4bb422ebb6326a65a770bf1918634cbb.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Instead of using the global pool_offset variable to find a free slot when
storing a new stack record, mainlain a freelist of free slots within the
allocated stack pools.
A global next_stack variable is used as the head of the freelist, and the
next field in the stack_record struct is reused as freelist link (when the
record is not in the freelist, this field is used as a link in the hash
table).
This is preparatory patch for implementing the eviction of stack records
from the stack depot.
Link: https://lkml.kernel.org/r/b9e4c79955c2121b69301778643b203d3fb09ccc.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Instead of using the last pointer in stack_pools for storing the pointer
to a new pool (which does not yet store any stack records), use a new
new_pool variable.
This a purely code readability change: it seems more logical to store the
pointer to a pool with a special meaning in a dedicated variable.
Link: https://lkml.kernel.org/r/448bc18296c16bef95cb3167697be6583dcc8ce3.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rename next_pool_required to new_pool_required.
This a purely code readability change: the following patch will change
stack depot to store the pointer to the new pool in a separate variable,
and "new" seems like a more logical name.
Link: https://lkml.kernel.org/r/fd7cd6c6eb250c13ec5d2009d75bb4ddd1470db9.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Split code in depot_alloc_stack and depot_init_pool into 3 functions:
1. depot_keep_next_pool that keeps preallocated memory for the next pool
if required.
2. depot_update_pools that moves on to the next pool if there's no space
left in the current pool, uses preallocated memory for the new current
pool if required, and calls depot_keep_next_pool otherwise.
3. depot_alloc_stack that calls depot_update_pools and then allocates
a stack record as before.
This makes it somewhat easier to follow the logic of depot_alloc_stack and
also serves as a preparation for implementing the eviction of stack
records from the stack depot.
Link: https://lkml.kernel.org/r/71fb144d42b701fcb46708d7f4be6801a4a8270e.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Drop smp_load_acquire from next_pool_required in depot_init_pool, as both
depot_init_pool and the all smp_store_release's to this variable are
executed under the stack depot lock.
Also simplify and clean up comments accompanying the use of atomic
accesses in the stack depot code.
Link: https://lkml.kernel.org/r/c118ef044d8db80248d9e1f14592c72e8429e9d9.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Instead of storing stack records in stack depot pools one right after
another, use fixed-sized slots.
Add a new Kconfig option STACKDEPOT_MAX_FRAMES that allows to select the
size of the slot in frames. Use 64 as the default value, which is the
maximum stack trace size both KASAN and KMSAN use right now.
Also add descriptions for other stack depot Kconfig options.
This is preparatory patch for implementing the eviction of stack records
from the stack depot.
Link: https://lkml.kernel.org/r/dce7d030a99ff61022509665187fac45b0827298.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a helper depot_fetch_stack function that fetches the pointer to a
stack record.
With this change, all static depot_* functions now operate on stack pools
and the exported stack_depot_* functions operate on the hash table.
Link: https://lkml.kernel.org/r/170d8c202f29dc8e3d5491ee074d1e9e029a46db.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stack depot doesn't use the valid bit in handles in any way, so drop it.
Link: https://lkml.kernel.org/r/34969bba2ca6e012c6ad071767197dee64dc5723.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The retval local variable in __stack_depot_save has the union type
handle_parts, but the function never uses anything but the union's handle
field.
Define retval simply as depot_stack_handle_t to simplify the code.
Link: https://lkml.kernel.org/r/3b0763c8057a1cf2f200ff250a5f9580ee36a28c.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Do not try fetching a stack trace from the stack depot if the
stack_depot_disabled flag is enabled.
Link: https://lkml.kernel.org/r/c3bfa3b7ab00b2e48ab75a3fbb9c67555777cb08.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "stackdepot: allow evicting stack traces", v4.
Currently, the stack depot grows indefinitely until it reaches its
capacity. Once that happens, the stack depot stops saving new stack
traces.
This creates a problem for using the stack depot for in-field testing and
in production.
For such uses, an ideal stack trace storage should:
1. Allow saving fresh stack traces on systems with a large uptime while
limiting the amount of memory used to store the traces;
2. Have a low performance impact.
Implementing #1 in the stack depot is impossible with the current
keep-forever approach. This series targets to address that. Issue #2 is
left to be addressed in a future series.
This series changes the stack depot implementation to allow evicting
unneeded stack traces from the stack depot. The users of the stack depot
can do that via new stack_depot_save_flags(STACK_DEPOT_FLAG_GET) and
stack_depot_put APIs.
Internal changes to the stack depot code include:
1. Storing stack traces in fixed-frame-sized slots (vs precisely-sized
slots in the current implementation); the slot size is controlled via
CONFIG_STACKDEPOT_MAX_FRAMES (default: 64 frames);
2. Keeping available slots in a freelist (vs keeping an offset to the next
free slot);
3. Using a read/write lock for synchronization (vs a lock-free approach
combined with a spinlock).
This series also integrates the eviction functionality into KASAN: the
tag-based modes evict stack traces when the corresponding entry leaves the
stack ring, and Generic KASAN evicts stack traces for objects once those
leave the quarantine.
With KASAN, despite wasting some space on rounding up the size of each
stack record, the total memory consumed by stack depot gets saturated due
to the eviction of irrelevant stack traces from the stack depot.
With the tag-based KASAN modes, the average total amount of memory used
for stack traces becomes ~0.5 MB (with the current default stack ring size
of 32k entries and the default CONFIG_STACKDEPOT_MAX_FRAMES of 64). With
Generic KASAN, the stack traces take up ~1 MB per 1 GB of RAM (as the
quarantine's size depends on the amount of RAM).
However, with KMSAN, the stack depot ends up using ~4x more memory per a
stack trace than before. Thus, for KMSAN, the stack depot capacity is
increased accordingly. KMSAN uses a lot of RAM for shadow memory anyway,
so the increased stack depot memory usage will not make a significant
difference.
Other users of the stack depot do not save stack traces as often as KASAN
and KMSAN. Thus, the increased memory usage is taken as an acceptable
trade-off. In the future, these other users can take advantage of the
eviction API to limit the memory waste.
There is no measurable boot time performance impact of these changes for
KASAN on x86-64. I haven't done any tests for arm64 modes (the stack
depot without performance optimizations is not suitable for intended use
of those anyway), but I expect a similar result. Obtaining and copying
stack trace frames when saving them into stack depot is what takes the
most time.
This series does not yet provide a way to configure the maximum size of
the stack depot externally (e.g. via a command-line parameter). This
will be added in a separate series, possibly together with the performance
improvement changes.
This patch (of 22):
Currently, if stack_depot_disable=off is passed to the kernel command-line
after stack_depot_disable=on, stack depot prints a message that it is
disabled, while it is actually enabled.
Fix this by moving printing the disabled message to
stack_depot_early_init. Place it before the
__stack_depot_early_init_requested check, so that the message is printed
even if early stack depot init has not been requested.
Also drop the stack_table = NULL assignment from disable_stack_depot, as
stack_table is NULL by default.
Link: https://lkml.kernel.org/r/cover.1700502145.git.andreyknvl@google.com
Link: https://lkml.kernel.org/r/73a25c5fff29f3357cd7a9330e85e09bc8da2cbe.1700502145.git.andreyknvl@google.com
Fixes: e1fdc403349c ("lib: stackdepot: add support to disable stack depot")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
KASan inline instrumentation can yield up to a 2x performance gain at the
cost of a larger binary.
Make inline instrumentation the default, as suggested in the bug report
below.
When an architecture does not support inline instrumentation, it should
set ARCH_DISABLE_KASAN_INLINE, as done by PowerPC, for instance.
Link: https://lkml.kernel.org/r/20231109155101.186028-1-paul.heidekrueger@tum.de
Signed-off-by: Paul Heidekrüger <paul.heidekrueger@tum.de>
Reported-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Marco Elver <elver@google.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=203495
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When destroying maple tree, preserve its attributes and then turn it into
an empty tree. This allows it to be reused without needing to be
reinitialized.
Link: https://lkml.kernel.org/r/20231027033845.90608-10-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Updated check_forking() and bench_forking() to use __mt_dup() to duplicate
maple tree.
Link: https://lkml.kernel.org/r/20231027033845.90608-9-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Skip other tests when BENCH is enabled so that performance can be measured
in user space.
Link: https://lkml.kernel.org/r/20231027033845.90608-8-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce interfaces __mt_dup() and mtree_dup(), which are used to
duplicate a maple tree. They duplicate a maple tree in Depth-First Search
(DFS) pre-order traversal. It uses memcopy() to copy nodes in the source
tree and allocate new child nodes in non-leaf nodes. The new node is
exactly the same as the source node except for all the addresses stored in
it. It will be faster than traversing all elements in the source tree and
inserting them one by one into the new tree. The time complexity of these
two functions is O(n).
The difference between __mt_dup() and mtree_dup() is that mtree_dup()
handles locks internally.
Analysis of the average time complexity of this algorithm:
For simplicity, let's assume that the maximum branching factor of all
non-leaf nodes is 16 (in allocation mode, it is 10), and the tree is a
full tree.
Under the given conditions, if there is a maple tree with n elements, the
number of its leaves is n/16. From bottom to top, the number of nodes in
each level is 1/16 of the number of nodes in the level below. So the
total number of nodes in the entire tree is given by the sum of n/16 +
n/16^2 + n/16^3 + ... + 1. This is a geometric series, and it has log(n)
terms with base 16. According to the formula for the sum of a geometric
series, the sum of this series can be calculated as (n-1)/15. Each node
has only one parent node pointer, which can be considered as an edge. In
total, there are (n-1)/15-1 edges.
This algorithm consists of two operations:
1. Traversing all nodes in DFS order.
2. For each node, making a copy and performing necessary modifications
to create a new node.
For the first part, DFS traversal will visit each edge twice. Let
T(ascend) represent the cost of taking one step downwards, and T(descend)
represent the cost of taking one step upwards. And both of them are
constants (although mas_ascend() may not be, as it contains a loop, but
here we ignore it and treat it as a constant). So the time spent on the
first part can be represented as ((n-1)/15-1) * (T(ascend) + T(descend)).
For the second part, each node will be copied, and the cost of copying a
node is denoted as T(copy_node). For each non-leaf node, it is necessary
to reallocate all child nodes, and the cost of this operation is denoted
as T(dup_alloc). The behavior behind memory allocation is complex and not
specific to the maple tree operation. Here, we assume that the time
required for a single allocation is constant. Since the size of a node is
fixed, both of these symbols are also constants. We can calculate that
the time spent on the second part is ((n-1)/15) * T(copy_node) + ((n-1)/15
- n/16) * T(dup_alloc).
Adding both parts together, the total time spent by the algorithm can be
represented as:
((n-1)/15) * (T(ascend) + T(descend) + T(copy_node) + T(dup_alloc)) -
n/16 * T(dup_alloc) - (T(ascend) + T(descend))
Let C1 = T(ascend) + T(descend) + T(copy_node) + T(dup_alloc)
Let C2 = T(dup_alloc)
Let C3 = T(ascend) + T(descend)
Finally, the expression can be simplified as:
((16 * C1 - 15 * C2) / (15 * 16)) * n - (C1 / 15 + C3).
This is a linear function, so the average time complexity is O(n).
Link: https://lkml.kernel.org/r/20231027033845.90608-4-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Introduce __mt_dup() to improve the performance of fork()", v7.
This series introduces __mt_dup() to improve the performance of fork().
During the duplication process of mmap, all VMAs are traversed and
inserted one by one into the new maple tree, causing the maple tree to be
rebalanced multiple times. Balancing the maple tree is a costly
operation. To duplicate VMAs more efficiently, mtree_dup() and __mt_dup()
are introduced for the maple tree. They can efficiently duplicate a maple
tree.
Here are some algorithmic details about {mtree,__mt}_dup(). We perform a
DFS pre-order traversal of all nodes in the source maple tree. During
this process, we fully copy the nodes from the source tree to the new
tree. This involves memory allocation, and when encountering a new node,
if it is a non-leaf node, all its child nodes are allocated at once.
This idea was originally from Liam R. Howlett's Maple Tree Work email,
and I added some of my own ideas to implement it. Some previous
discussions can be found in [1]. For a more detailed analysis of the
algorithm, please refer to the logs for patch [3/10] and patch [10/10].
There is a "spawn" in byte-unixbench[2], which can be used to test the
performance of fork(). I modified it slightly to make it work with
different number of VMAs.
Below are the test results. The first row shows the number of VMAs. The
second and third rows show the number of fork() calls per ten seconds,
corresponding to next-20231006 and the this patchset, respectively. The
test results were obtained with CPU binding to avoid scheduler load
balancing that could cause unstable results. There are still some
fluctuations in the test results, but at least they are better than the
original performance.
21 121 221 421 821 1621 3221 6421 12821 25621 51221
112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393
114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599
2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%
Thanks to Liam and Matthew for the review.
This patch (of 10):
Add two helpers:
1. mt_free_one(), used to free a maple node.
2. mt_attr(), used to obtain the attributes of maple tree.
Link: https://lkml.kernel.org/r/20231027033845.90608-1-zhangpeng.00@bytedance.com
Link: https://lkml.kernel.org/r/20231027033845.90608-2-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, there are two test cases with same name
"ALU64_SMOD_X: -7 % 2 = -1", the first one is right,
the second one should be ALU64_SMOD_K because its
code is BPF_ALU64 | BPF_MOD | BPF_K.
Before:
test_bpf: #170 ALU64_SMOD_X: -7 % 2 = -1 jited:1 4 PASS
test_bpf: #171 ALU64_SMOD_X: -7 % 2 = -1 jited:1 4 PASS
After:
test_bpf: #170 ALU64_SMOD_X: -7 % 2 = -1 jited:1 4 PASS
test_bpf: #171 ALU64_SMOD_K: -7 % 2 = -1 jited:1 4 PASS
Fixes: daabb2b098e0 ("bpf/tests: add tests for cpuv4 instructions")
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231207040851.19730-1-yangtiezhu@loongson.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The remainder address post-6.6 issues or aren't considered serious enough
to justify backporting.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZXKEfwAKCRDdBJ7gKXxA
jlRpAQCiAp1nSqIz/fOKTzoQRaTDXU/m+C+6ZAXdKLDfvQBhpwEAnxxjZ8IgF+8Z
Klz/GirHX5w5o7jE2wb8iObo1nR75Qo=
=omRq
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-12-07-18-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"31 hotfixes. Ten of these address pre-6.6 issues and are marked
cc:stable. The remainder address post-6.6 issues or aren't considered
serious enough to justify backporting"
* tag 'mm-hotfixes-stable-2023-12-07-18-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (31 commits)
mm/madvise: add cond_resched() in madvise_cold_or_pageout_pte_range()
nilfs2: prevent WARNING in nilfs_sufile_set_segment_usage()
mm/hugetlb: have CONFIG_HUGETLB_PAGE select CONFIG_XARRAY_MULTI
scripts/gdb: fix lx-device-list-bus and lx-device-list-class
MAINTAINERS: drop Antti Palosaari
highmem: fix a memory copy problem in memcpy_from_folio
nilfs2: fix missing error check for sb_set_blocksize call
kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP
units: add missing header
drivers/base/cpu: crash data showing should depends on KEXEC_CORE
mm/damon/sysfs-schemes: add timeout for update_schemes_tried_regions
scripts/gdb/tasks: fix lx-ps command error
mm/Kconfig: make userfaultfd a menuconfig
selftests/mm: prevent duplicate runs caused by TEST_GEN_PROGS
mm/damon/core: copy nr_accesses when splitting region
lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly
checkstack: fix printed address
mm/memory_hotplug: fix error handling in add_memory_resource()
mm/memory_hotplug: add missing mem_hotplug_lock
.mailmap: add a new address mapping for Chester Lin
...
group_cpus_evenly() could be part of storage driver's error handler, such
as nvme driver, when may happen during CPU hotplug, in which storage queue
has to drain its pending IOs because all CPUs associated with the queue
are offline and the queue is becoming inactive. And handling IO needs
error handler to provide forward progress.
Then deadlock is caused:
1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
handler is waiting for inflight IO
2) error handler is waiting for CPU hotplug lock
3) inflight IO can't be completed in blk-mq's CPU hotplug handler
because error handling can't provide forward progress.
Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(),
in which two stage spreads are taken: 1) the 1st stage is over all present
CPUs; 2) the end stage is over all other CPUs.
Turns out the two stage spread just needs consistent 'cpu_present_mask',
and remove the CPU hotplug lock by storing it into one local cache. This
way doesn't change correctness, because all CPUs are still covered.
Link: https://lkml.kernel.org/r/20231120083559.285174-1-ming.lei@redhat.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Tested-by: Guangwu Zhang <guazhang@redhat.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <kbusch@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The original intention of acpi_parse_entries_array() is to return the
number of all matching entries on success. This number may be greater than
the value of the max_entries parameter. When this happens, the function
will output a warning message, indicating that `count - max_entries`
matching entries remain unprocessed and have been ignored.
However, commit 4ceacd02f5a1 ("ACPI / table: Always count matched and
successfully parsed entries") changed this logic to return the number of
entries successfully processed by the handler. In this case, when the
max_entries parameter is not zero, the number of entries successfully
processed can never be greater than the value of max_entries. In other
words, the expression `count > max_entries` will always evaluate to false.
This means that the logic in the final if statement will never be executed.
Commit 99b0efd7c886 ("ACPI / tables: do not report the number of entries
ignored by acpi_parse_entries()") mentioned this issue, but it tried to fix
it by removing part of the warning message. This is meaningless because the
pr_warn statement will never be executed in the first place.
Commit 8726d4f44150 ("ACPI / tables: fix acpi_parse_entries_array() so it
traverses all subtables") introduced an errs variable, which is intended to
make acpi_parse_entries_array() always traverse all of the subtables,
calling as many of the callbacks as possible. However, it seems that the
commit does not achieve this goal. For example, when a handler returns an
error, none of the handlers will be called again in the subsequent
iterations. This result appears to be no different from before the change.
This patch corrects and cleans up the logic of acpi_parse_entries_array(),
making it return the number of all matching entries, rather than the number
of entries successfully processed by handlers. Additionally, if an error
occurs when executing a handler, the function will return -EINVAL immediately.
This patch should not affect existing users of acpi_parse_entries_array().
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
A refcount issue can appeared in __fwnode_link_del() due to the
pr_debug() call:
WARNING: CPU: 0 PID: 901 at lib/refcount.c:25 refcount_warn_saturate+0xe5/0x110
Call Trace:
<TASK>
...
of_node_get+0x1e/0x30
of_fwnode_get+0x28/0x40
fwnode_full_name_string+0x34/0x90
fwnode_string+0xdb/0x140
...
vsnprintf+0x17b/0x630
...
__fwnode_link_del+0x25/0xa0
fwnode_links_purge+0x39/0xb0
of_node_release+0xd9/0x180
...
Indeed, an fwnode (of_node) is being destroyed and so, of_node_release()
is called because the of_node refcount reached 0.
From of_node_release() several function calls are done and lead to
a pr_debug() calls with %pfwf to print the fwnode full name.
The issue is not present if we change %pfwf to %pfwP.
To print the full name, %pfwf iterates over the current node and its
parents and obtain/drop a reference to all nodes involved.
In order to allow to print the full name (%pfwf) of a node while it is
being destroyed, do not obtain/drop a reference to this current node.
Fixes: a92eb7621b9f ("lib/vsprintf: Make use of fwnode API to obtain node names and separators")
Cc: stable@vger.kernel.org
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Reviewed-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20231114152655.409331-1-herve.codina@bootlin.com
With the removal of the 'iov' argument to import_single_range(), the two
functions are now fully identical. Convert the import_single_range()
callers to import_ubuf(), and remove the former fully.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20231204174827.1258875-3-axboe@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
Remove CONFIG_SLAB, CONFIG_DEBUG_SLAB, CONFIG_SLAB_DEPRECATED and
everything in Kconfig files and mm/Makefile that depends on those. Since
SLUB is the only remaining allocator, remove the allocator choice, make
CONFIG_SLUB a "def_bool y" for now and remove all explicit dependencies
on SLUB or SLAB as it's now always enabled. Make every option's verbose
name and description refer to "the slab allocator" without refering to
the specific implementation. Do not rename the CONFIG_ option names yet.
Everything under #ifdef CONFIG_SLAB, and mm/slab.c is now dead code, all
code under #ifdef CONFIG_SLUB is now always compiled.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
- objpool: Fix objpool overrun case on memory/cache access delay especially
on the big.LITTLE SoC. The objpool uses a copy of object slot index
internal loop, but the slot index can be changed on another processor
in parallel. In that case, the difference of 'head' local copy and the
'slot->last' index will be bigger than local slot size. In that case,
we need to re-read the slot::head to update it.
- kretprobe: Fix to use appropriate rcu API for kretprobe holder. Since
kretprobe_holder::rp is RCU managed, it should use rcu_assign_pointer()
and rcu_dereference_check() correctly. Also adding __rcu tag for
finding wrong usage by sparse.
- rethook: Fix to use appropriate rcu API for rethook::handler. The same
as kretprobe, rethook::handler is RCU managed and it should use
rcu_assign_pointer() and rcu_dereference_check(). This also adds __rcu
tag for finding wrong usage by sparse.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmVpfBobHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bNyMIAJSLICKQNuFiBJEn/rty
ACWJ9QMOnwi0DoVaepG/m9QJh6AIUUFW4//9helmSm0GIVzxQ2+f8UeKU+sYiVtH
ro9atea4W4+FMTvtEB1cU8oG5CDVT4WQdUXbjMktqYe3+WB8Zt8+fIP0mnbTFAVr
yStpliGPecmlupJVRYqrJGyDdbkUxXxVlPsP/eDrHFgbBWv8Incw0f+MLGSi6LSE
sZ1MaKCdi2tlHbtD/fiowfLoBMZwQAKY4hq/XguVsWh+BGaGUgwtif+8ESwPeu22
KEZLyWDQ1N8XBHyOBotV7vsBEwh6LKtLGVXIBsO4KxVyGw6msxWBis0dt/tkn+kk
LEg=
=B9WK
-----END PGP SIGNATURE-----
Merge tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- objpool: Fix objpool overrun case on memory/cache access delay
especially on the big.LITTLE SoC. The objpool uses a copy of object
slot index internal loop, but the slot index can be changed on
another processor in parallel. In that case, the difference of 'head'
local copy and the 'slot->last' index will be bigger than local slot
size. In that case, we need to re-read the slot::head to update it.
- kretprobe: Fix to use appropriate rcu API for kretprobe holder. Since
kretprobe_holder::rp is RCU managed, it should use
rcu_assign_pointer() and rcu_dereference_check() correctly. Also
adding __rcu tag for finding wrong usage by sparse.
- rethook: Fix to use appropriate rcu API for rethook::handler. The
same as kretprobe, rethook::handler is RCU managed and it should use
rcu_assign_pointer() and rcu_dereference_check(). This also adds
__rcu tag for finding wrong usage by sparse.
* tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rethook: Use __rcu pointer for rethook::handler
kprobes: consistent rcu api usage for kretprobe holder
lib: objpool: fix head overrun on RK3588 SBC
- Fix a recently introduced build issue on ARM32 platforms caused by an
inadvertent header file breakage (Dave Jiang).
- Eliminate questionable usage of acpi_driver_data() in the ACPI
backlight cooling device code that leads to NULL pointer dereferences
after recent ACPI core changes (Hans de Goede).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmVqSGMSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxKBsP/RqwUb8oNYxreUmKgIVZ0H8SnPRfluVP
4St+HeTxsmdN+obglVUWnhZMViewGsionLuq0y/FrYoWLI2F08UQAU8i248h20aZ
nGXalM+n5H517dOidTJzvGxKHMOA2TrzyVna/IcQAYLnbXmp25j8EdHmuhrI3KfK
3yxXLe+6J22776U/MMyutR+rwTVE0tgfQOWM4YuxT67uUPQVkKX+3/uYt/EfkKsX
Dz2ce/5vF28JDjv/yTxaoctMpmjjem97av0J7Y1EM/5K9kTZ070U8OZhYEgBHw5o
pRalhQlNz5VI/KQy3Mc8n4DmrwrPoktCBI0pQPr8FV56dmCFmTaFY1C4/SxebTr4
O2U3r5GkmcxLCKZXAUUAc8J+M6BBRHNQtlpBN3iNRG4ID2x72idPn5fr02UeGk4g
lxysNzcIwxcOuogeKTD2IERzLZA7Ub3qm8gguFOMqnxV1nmPx6f1nl1MuxLsJY4e
ZQwrwZCDJr5CZUrvJz6Mo9HUibLQOELgtsWinKV9OYUgTOfQ69t+XuwsPYIYtQiH
UV7qfE+ZLiG/jZPgP6tfN6InbtB9I4xXNMhKyoZrGTNtbKUVLdDHJH62/vIyHBXQ
VuLr9jH+CuyRjJSRCnbWA1BWkLhWfHQnFaq9UVWAWkVd8MisGVolnCSPGjUyQur7
ZT7i3c13nxJ2
=ANS/
-----END PGP SIGNATURE-----
Merge tag 'acpi-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"This fixes a recently introduced build issue on ARM32 and a NULL
pointer dereference in the ACPI backlight driver due to a design issue
exposed by a recent change in the ACPI bus type code.
Specifics:
- Fix a recently introduced build issue on ARM32 platforms caused by
an inadvertent header file breakage (Dave Jiang)
- Eliminate questionable usage of acpi_driver_data() in the ACPI
backlight cooling device code that leads to NULL pointer
dereferences after recent ACPI core changes (Hans de Goede)"
* tag 'acpi-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: video: Use acpi_video_device for cooling-dev driver data
ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes
Bigger/user visible fixes:
- bcache & bcachefs were broken with CFI enabled; patch for closures to
fix type punning
- mark erasure coding as extra-experimental; there are incompatible
disk space accounting changes coming for erasure coding, and I'm
still seeing checksum errors in some tests
- several fixes for durability-related issues (durability is a device
specific setting where we can tell bcachefs that data on a given
device should be counted as replicated x times )
- a fix for a rare livelock when a btree node merge then updates a
parent node that is almost full
- fix a race in the device removal path, where dropping a pointer in a
btree node to a device would be clobbered by an in flight btree write
updating the btree node key on completion
- fix one SRCU lock hold time warning in the btree gc code - ther's
still a bunch more of these to fix
- fix a rare race where we'd start copygc before initializing the "are
we rw" percpu refcount; copygc would think we were already ro and die
immediately
https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs-for-upstream
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmVnoHoACgkQE6szbY3K
bnbzLBAApVEg3kB3XDCHYw+8AxLbzkuKbuV8FR/w+ULYAmRKbnM5e4pM4UJzwVJ9
vzBS9KUT4mVNpA5zl7FWmqh5AiJkhbPgb/BijtQiS+gz1ofZ8uCW/DjzWZpaTaT9
0zz9auiKwzJbBmLXC2lWC28MUPjFNXxlP2pfQPqhpKqlGKBC893hKeJ0Veb6dM1R
DqkctoWtSQzsNpEaXiQpKBNoNUIlYcFX1XXHn+XpPpWNe80SpMfVNCs2qPkMByu/
V/QULE9cHI7RTu7oyFY80+9xQDeXDDYZgvtpD7hqNPcyyoix+r/DVz1mZe41XF2B
bvaJhfcdWePctmiuEXJVXT4HSkwwzC6EKHwi7fejGY56hOvsrEAxNzTEIPRNw5st
ZkZlxASwFqkiJ3ehy+KRngLX2GZSbJsU4aM5ViQJKtz4rBzGyyf0LmMucdxAoDH5
zLzsAYaA6FkIZ5e5ZNdTDj7/TMnKWXlU9vTttqIpb8s7qSy+3ejk5NuGitJihZ4R
LAaCTs1JIsItLP47Ko0ZvmKV6CHlmt+Ht8OBqu73BWJ8vsBTQ8JMK4mGt60bwHvm
LdEMtp3C3FmXFc06zhKoGgjrletZYO6G4mFBPnQqh1brfFXM1prVg3ftDTqBWkMI
iAz2chiVc8k0qxoSAqylCYFaGzgiBKzw6YMtqPRmZgfLcq/sJ34=
=vN+y
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs
Pull more bcachefs bugfixes from Kent Overstreet:
- bcache & bcachefs were broken with CFI enabled; patch for closures to
fix type punning
- mark erasure coding as extra-experimental; there are incompatible
disk space accounting changes coming for erasure coding, and I'm
still seeing checksum errors in some tests
- several fixes for durability-related issues (durability is a device
specific setting where we can tell bcachefs that data on a given
device should be counted as replicated x times)
- a fix for a rare livelock when a btree node merge then updates a
parent node that is almost full
- fix a race in the device removal path, where dropping a pointer in a
btree node to a device would be clobbered by an in flight btree write
updating the btree node key on completion
- fix one SRCU lock hold time warning in the btree gc code - ther's
still a bunch more of these to fix
- fix a rare race where we'd start copygc before initializing the "are
we rw" percpu refcount; copygc would think we were already ro and die
immediately
* tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs: (23 commits)
bcachefs: Extra kthread_should_stop() calls for copygc
bcachefs: Convert gc_alloc_start() to for_each_btree_key2()
bcachefs: Fix race between btree writes and metadata drop
bcachefs: move journal seq assertion
bcachefs: -EROFS doesn't count as move_extent_start_fail
bcachefs: trace_move_extent_start_fail() now includes errcode
bcachefs: Fix split_race livelock
bcachefs: Fix bucket data type for stripe buckets
bcachefs: Add missing validation for jset_entry_data_usage
bcachefs: Fix zstd compress workspace size
bcachefs: bpos is misaligned on big endian
bcachefs: Fix ec + durability calculation
bcachefs: Data update path won't accidentaly grow replicas
bcachefs: deallocate_extra_replicas()
bcachefs: Proper refcounting for journal_keys
bcachefs: preserve device path as device name
bcachefs: Fix an endianness conversion
bcachefs: Start gc, copygc, rebalance threads after initing writes ref
bcachefs: Don't stop copygc thread on device resize
bcachefs: Make sure bch2_move_ratelimit() also waits for move_ops
...
Merge a fix for a recently introduced build issue on ARM32 platforms
caused by an inadvertent header file breakage (Dave Jiang).
* acpi-tables:
ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes
objpool overrun stress with test_objpool on OrangePi5+ SBC triggered the
following kernel warnings:
WARNING: CPU: 6 PID: 3115 at lib/objpool.c:168 objpool_push+0xc0/0x100
This message is from objpool.c:168:
WARN_ON_ONCE(tail - head > pool->nr_objs);
The overrun test case is to validate the case that pre-allocated objects
are insufficient: 8 objects are pre-allocated for each node and consumer
thread per node tries to grab 16 objects in a row. The testing system is
OrangePI 5+, with RK3588, a big.LITTLE SOC with 4x A76 and 4x A55. When
disabling either all 4 big or 4 little cores, the overrun tests run well,
and once with big and little cores mixed together, the overrun test would
always cause an overrun loop. It's likely the memory timing differences
of big and little cores cause this trouble. Here are the debugging data
of objpool_try_get_slot after try_cmpxchg_release:
objpool_pop: cpu: 4/0 0:0 head: 278/279 tail:278 last:276/278
The local copies of 'head' and 'last' were 278 and 276, and reloading of
'slot->head' and 'slot->last' got 279 and 278. After try_cmpxchg_release
'slot->head' became 'head + 1', which is correct. But what's wrong here
is the stale value of 'last', and that stale value of 'last' finally led
the overrun of 'head'.
Memory updating of 'last' and 'head' are performed in push() and pop()
independently, which could be the culprit leading this out of order
visibility of 'last' and 'head'. So for objpool_try_get_slot(), it's
not enough only checking the condition of 'head != slot', the implicit
condition 'last - head <= nr_objs' must also be explicitly asserted to
guarantee 'last' is always behind 'head' before the object retrieving.
This patch will check and try reloading of 'head' and 'last' to ensure
'last' is behind 'head' at the time of object retrieving. Performance
testings show the average impact is about 0.1% for X86_64 and 1.12% for
ARM64. Here are the results:
OS: Debian 10 X86_64, Linux 6.6rc
HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
1T 2T 4T 8T 16T
native: 49543304 99277826 199017659 399070324 795185848
objpool: 29909085 59865637 119692073 239750369 478005250
objpool+: 29879313 59230743 119609856 239067773 478509029
32T 48T 64T 96T 128T
native: 1596927073 2390099988 2929397330 3183875848 3257546602
objpool: 957553042 1435814086 1680872925 2043126796 2165424198
objpool+: 956476281 1434491297 1666055740 2041556569 2157415622
OS: Debian 11 AARCH64, Linux 6.6rc
HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s
1T 2T 4T 8T 16T
native: 30890508 60399915 123111980 242257008 494002946
objpool: 14742531 28883047 57739948 115886644 232455421
objpool+: 14107220 29032998 57286084 113730493 232232850
24T 32T 48T 64T 96T
native: 746406039 1000174750 1493236240 1998318364 2942911180
objpool: 349164852 467284332 702296756 934459713 1387898285
objpool+: 348388180 462750976 696606096 927865887 1368402195
Link: https://lore.kernel.org/all/20231114115148.298821-1-wuqiang.matt@bytedance.com/
Fixes: b4edb8d2d464 ("lib: objpool added: ring-array based lockless MPMC")
Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
This KUnit fixes update for Linux 6.7-rc4 consists of three fixes to
warnings and run-time test behavior. With these fixes, test suite
counter will be reset correctly before running tests, kunit will warn
if tests are too slow, and eliminate warning when kfree() as an action.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmVosvIACgkQCwJExA0N
QxyzqxAAtXffplLUm7saZW7bXW3IB5Bugzu5qjp8Zw0YR7WNhgXMimCVHotWhuHX
CrQiRIf1ZyXt6+qnUYXMIsEpMri03qhun2E8lmsA6Ws7xAHgQh5nORPEZOgBEV+3
HnyPJTT2RBPBoOZuRTipYeu6MZE1YSnmpGdftYWHWENThqG+qb5WJ/Rs33JSZAnx
O/XnCHZigqc7aCxZHnm8HcFWfikn4pFg2f76XU7j6uvH+ZIuQJEZvCdukzjWvGeC
AP47AaZ04rqbkRZAA5nOew7SIgb0fK3dYdxsWfv3ai1gJMReUwqwQQ/44tnxtPs2
s0d2FzH8+OL3MIQ0PFMH/E16bmAwcGXFlRmJvJFdbMoWa+zji8mhUSw7qIqJaFTI
6oMHd4XdHGwHYHrLgQgLdIzuggTRSusnaMBubGJe0xo2xCgFI6CjNUzLNk1zsY2L
OeKNlTUH2l3oOnlia4CsOPzx5HOe204ZCZwQR1fkSQVmfkl5cfz17Mr/lKshxH+Q
X+t56bLy5rhyUI6LxrAj7AC6E8MVQ/fiial3v1lDmJKD6Bh5YL0n6wAl5eWv3Hcq
Com0maVlGCaPIErzjaSM9RqBCYT7Le89J/p3lJ7rXF3zUY8B1Miv7IVCeHDOxzpl
AwMI9PavdHOvh2NeaDku/fQwzID7ytdT7dAktQIviUReqWnZRGU=
=g2ql
-----END PGP SIGNATURE-----
Merge tag 'linux_kselftest-kunit-fixes-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull KUnit fixes from Shuah Khan:
"Three fixes to warnings and run-time test behavior. With these fixes,
test suite counter will be reset correctly before running tests, kunit
will warn if tests are too slow, and eliminate warning when kfree() as
an action"
* tag 'linux_kselftest-kunit-fixes-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: test: Avoid cast warning when adding kfree() as an action
kunit: Reset suite counter right before running tests
kunit: Warn if tests are slow
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZWiCPAAKCRDbK58LschI
g4djAQC1FdqCRIFkhbiIRNHTgHjnfQShELQbd9ofJqzylLqmmgD+JI1E7D9SXagm
pIXQ26EGmq8/VcCT3VLncA8EsC76Gg4=
=Xowm
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-11-30
We've added 30 non-merge commits during the last 7 day(s) which contain
a total of 58 files changed, 1598 insertions(+), 154 deletions(-).
The main changes are:
1) Add initial TX metadata implementation for AF_XDP with support in mlx5
and stmmac drivers. Two types of offloads are supported right now, that
is, TX timestamp and TX checksum offload, from Stanislav Fomichev with
stmmac implementation from Song Yoong Siang.
2) Change BPF verifier logic to validate global subprograms lazily instead
of unconditionally before the main program, so they can be guarded using
BPF CO-RE techniques, from Andrii Nakryiko.
3) Add BPF link_info support for uprobe multi link along with bpftool
integration for the latter, from Jiri Olsa.
4) Use pkg-config in BPF selftests to determine ld flags which is
in particular needed for linking statically, from Akihiko Odaki.
5) Fix a few BPF selftest failures to adapt to the upcoming LLVM18,
from Yonghong Song.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (30 commits)
bpf/tests: Remove duplicate JSGT tests
selftests/bpf: Add TX side to xdp_hw_metadata
selftests/bpf: Convert xdp_hw_metadata to XDP_USE_NEED_WAKEUP
selftests/bpf: Add TX side to xdp_metadata
selftests/bpf: Add csum helpers
selftests/xsk: Support tx_metadata_len
xsk: Add option to calculate TX checksum in SW
xsk: Validate xsk_tx_metadata flags
xsk: Document tx_metadata_len layout
net: stmmac: Add Tx HWTS support to XDP ZC
net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
tools: ynl: Print xsk-features from the sample
xsk: Add TX timestamp and TX checksum offload support
xsk: Support tx_metadata_len
selftests/bpf: Use pkg-config for libelf
selftests/bpf: Override PKG_CONFIG for static builds
selftests/bpf: Choose pkg-config for the target
bpftool: Add support to display uprobe_multi links
selftests/bpf: Add link_info test for uprobe_multi link
selftests/bpf: Use bpf_link__destroy in fill_link_info tests
...
====================
Conflicts:
Documentation/netlink/specs/netdev.yaml:
839ff60df3ab ("net: page_pool: add nlspec for basic access to page pools")
48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support")
https://lore.kernel.org/all/20231201094705.1ee3cab8@canb.auug.org.au/
While at it also regen, tree is dirty after:
48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support")
looks like code wasn't re-rendered after "render-max" was removed.
Link: https://lore.kernel.org/r/20231130145708.32573-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>