IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
This includes a wholesale reversion of the post-6.4 series "make slab shrink
lockless". After input from Dave Chinner it has been decided that we
should go a different way. Thread starts at
https://lkml.kernel.org/r/ZH6K0McWBeCjaf16@dread.disaster.area.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJH/qAAKCRDdBJ7gKXxA
jq7uAP9AtDGHfvOuW5jlHdYfpUBnbfuQDKjiik71UuIxyhtwQQEAqpOBv7UDuhHj
NbNIGTIi/xM5vkpjV6CBo9ymR7qTKwo=
=uGuc
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-06-20-12-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"19 hotfixes. 8 of these are cc:stable.
This includes a wholesale reversion of the post-6.4 series 'make slab
shrink lockless'. After input from Dave Chinner it has been decided
that we should go a different way [1]"
Link: https://lkml.kernel.org/r/ZH6K0McWBeCjaf16@dread.disaster.area [1]
* tag 'mm-hotfixes-stable-2023-06-20-12-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
selftests/mm: fix cross compilation with LLVM
mailmap: add entries for Ben Dooks
nilfs2: prevent general protection fault in nilfs_clear_dirty_page()
Revert "mm: vmscan: make global slab shrink lockless"
Revert "mm: vmscan: make memcg slab shrink lockless"
Revert "mm: vmscan: add shrinker_srcu_generation"
Revert "mm: shrinkers: make count and scan in shrinker debugfs lockless"
Revert "mm: vmscan: hold write lock to reparent shrinker nr_deferred"
Revert "mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()"
Revert "mm: shrinkers: convert shrinker_rwsem to mutex"
nilfs2: fix buffer corruption due to concurrent device reads
scripts/gdb: fix SB_* constants parsing
scripts: fix the gfp flags header path in gfp-translate
udmabuf: revert 'Add support for mapping hugepages (v4)'
mm/khugepaged: fix iteration in collapse_file
memfd: check for non-NULL file_seals in memfd_create() syscall
mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
mm/mprotect: fix do_mprotect_pkey() limit check
writeback: fix dereferencing NULL mapping->host on writeback_page_template
This reverts commit f95bdb700b.
Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700b ("mm: vmscan: make
global slab shrink lockless"). The root cause is that SRCU has to be
careful to not frequently check for SRCU read-side critical section exits.
Therefore, even if no one is currently in the SRCU read-side critical
section, synchronize_srcu() cannot return quickly. That's why
unregister_shrinker() has become slower.
After discussion, we will try to use the refcount+RCU method [2] proposed
by Dave Chinner to continue to re-implement the lockless slab shrink. So
revert the shrinker_srcu related changes first.
[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Link: https://lkml.kernel.org/r/20230609081518.3039120-8-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit caa05325c9.
Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700b ("mm: vmscan: make
global slab shrink lockless"). The root cause is that SRCU has to be
careful to not frequently check for SRCU read-side critical section exits.
Therefore, even if no one is currently in the SRCU read-side critical
section, synchronize_srcu() cannot return quickly. That's why
unregister_shrinker() has become slower.
After discussion, we will try to use the refcount+RCU method [2] proposed
by Dave Chinner to continue to re-implement the lockless slab shrink. So
revert the shrinker_srcu related changes first.
[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Link: https://lkml.kernel.org/r/20230609081518.3039120-7-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit 475733dda5.
Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700b ("mm: vmscan: make
global slab shrink lockless"). The root cause is that SRCU has to be
careful to not frequently check for SRCU read-side critical section exits.
Therefore, even if no one is currently in the SRCU read-side critical
section, synchronize_srcu() cannot return quickly. That's why
unregister_shrinker() has become slower.
We will try to use the refcount+RCU method [2] proposed by Dave Chinner to
continue to re-implement the lockless slab shrink. So revert the
shrinker_srcu related changes first.
[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Link: https://lkml.kernel.org/r/20230609081518.3039120-6-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit 20cd1892fc.
Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700b ("mm: vmscan: make
global slab shrink lockless"). The root cause is that SRCU has to be
careful to not frequently check for SRCU read-side critical section exits.
Therefore, even if no one is currently in the SRCU read-side critical
section, synchronize_srcu() cannot return quickly. That's why
unregister_shrinker() has become slower.
We will try to use the refcount+RCU method [2] proposed by Dave Chinner to
continue to re-implement the lockless slab shrink. So revert the
shrinker_srcu related changes first.
[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Link: https://lkml.kernel.org/r/20230609081518.3039120-5-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit b3cabea3c9.
Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700b ("mm: vmscan: make
global slab shrink lockless"). The root cause is that SRCU has to be careful
to not frequently check for SRCU read-side critical section exits. Therefore,
even if no one is currently in the SRCU read-side critical section,
synchronize_srcu() cannot return quickly. That's why unregister_shrinker()
has become slower.
We will try to use the refcount+RCU method [2] proposed by Dave Chinner
to continue to re-implement the lockless slab shrink. Because there will
be other readers after reverting the shrinker_srcu related changes, so
it is better to restore to hold read lock to reparent shrinker nr_deferred.
[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Link: https://lkml.kernel.org/r/20230609081518.3039120-4-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit 1643db98d9.
Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700b ("mm: vmscan: make
global slab shrink lockless"). The root cause is that SRCU has to be
careful to not frequently check for SRCU read-side critical section exits.
Therefore, even if no one is currently in the SRCU read-side critical
section, synchronize_srcu() cannot return quickly. That's why
unregister_shrinker() has become slower.
We will try to use the refcount+RCU method [2] proposed by Dave Chinner to
continue to re-implement the lockless slab shrink. So we still need
shrinker_rwsem in synchronize_shrinkers() after reverting the
shrinker_srcu related changes.
[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Link: https://lkml.kernel.org/r/20230609081518.3039120-3-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "revert shrinker_srcu related changes".
This patch (of 7):
This reverts commit cf2e309ebc.
Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700b ("mm: vmscan: make
global slab shrink lockless"). The root cause is that SRCU has to be
careful to not frequently check for SRCU read-side critical section exits.
Therefore, even if no one is currently in the SRCU read-side critical
section, synchronize_srcu() cannot return quickly. That's why
unregister_shrinker() has become slower.
After discussion, we will try to use the refcount+RCU method [2] proposed
by Dave Chinner to continue to re-implement the lockless slab shrink. So
revert the shrinker_mutex back to shrinker_rwsem first.
[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Link: https://lkml.kernel.org/r/20230609081518.3039120-1-qi.zheng@linux.dev
Link: https://lkml.kernel.org/r/20230609081518.3039120-2-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yujie Liu <yujie.liu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove an unnecessary call to xas_set(index) when iterating over the
target range in collapse_file. The extra call to xas_set reset the xas
cursor to the top of the tree, causing the xas_next call on the next
iteration to walk the tree to index instead of advancing to index+1. This
returned the same page again, which would cause collapse_file to fail
because the page is already locked.
This bug was hidden when CONFIG_DEBUG_VM was set. When that config was
used, the xas_load in a subsequent VM_BUG_ON assert would walk xas from
the top of the tree to index, causing the xas_next call on the next loop
iteration to advance the cursor as expected.
Link: https://lkml.kernel.org/r/20230607053135.2087354-1-stevensd@google.com
Fixes: a2e17cc2ef ("mm/khugepaged: maintain page cache uptodate flag")
Signed-off-by: David Stevens <stevensd@chromium.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jiaqi Yan <jiaqiyan@google.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ensure that file_seals is non-NULL before using it in the memfd_create()
syscall. One situation in which memfd_file_seals_ptr() could return a
NULL pointer when CONFIG_SHMEM=n, oopsing the kernel.
Link: https://lkml.kernel.org/r/20230607132427.2867435-1-roberto.sassu@huaweicloud.com
Fixes: 47b9012ecd ("shmem: add sealing support to hugetlb-backed memfd")
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Cc: Marc-Andr Lureau <marcandre.lureau@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In __vmalloc_area_node() we always warn_alloc() when an allocation
performed by vm_area_alloc_pages() fails unless it was due to a pending
fatal signal.
However, huge page allocations instigated either by vmalloc_huge() or
__vmalloc_node_range() (or a caller that invokes this like kvmalloc() or
kvmalloc_node()) always falls back to order-0 allocations if the huge page
allocation fails.
This renders the warning useless and noisy, especially as all callers
appear to be aware that this may fallback. This has already resulted in
at least one bug report from a user who was confused by this (see link).
Therefore, simply update the code to only output this warning for order-0
pages when no fatal signal is pending.
Link: https://bugzilla.suse.com/show_bug.cgi?id=1211410
Link: https://lkml.kernel.org/r/20230605201107.83298-1-lstoakes@gmail.com
Fixes: 80b1d8fdfa ("mm: vmalloc: correct use of __GFP_NOWARN mask in __vmalloc_area_node()")
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The return of do_mprotect_pkey() can still be incorrectly returned as
success if there is a gap that spans to or beyond the end address passed
in. Update the check to ensure that the end address has indeed been seen.
Link: https://lore.kernel.org/all/CABi2SkXjN+5iFoBhxk71t3cmunTk-s=rB4T7qo0UQRh17s49PQ@mail.gmail.com/
Link: https://lkml.kernel.org/r/20230606182912.586576-1-Liam.Howlett@oracle.com
Fixes: 82f951340f ("mm/mprotect: fix do_mprotect_pkey() return on error")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The error unrolling was leaving the VMAs detached in many cases and
leaving the locked_vm statistic altered, and skipping the unrolling
entirely in the case of the vma tree write failing.
Fix the error path by re-attaching the detached VMAs and adding the
necessary goto for the failed vma tree write, and fix the locked_vm
statistic by only updating after the vma tree write succeeds.
Fixes: 763ecb0350 ("mm: remove the vma linked list")
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Per-VMA locking allows us to lock a struct vm_area_struct without
taking the process-wide mmap lock in read mode.
Consider a process workload where the mmap lock is taken constantly in
write mode. In this scenario, all zerocopy receives are periodically
blocked during that period of time - though in principle, the memory
ranges being used by TCP are not touched by the operations that need
the mmap write lock. This results in performance degradation.
Now consider another workload where the mmap lock is never taken in
write mode, but there are many TCP connections using receive zerocopy
that are concurrently receiving. These connections all take the mmap
lock in read mode, but this does induce a lot of contention and atomic
ops for this process-wide lock. This results in additional CPU
overhead caused by contending on the cache line for this lock.
However, with per-vma locking, both of these problems can be avoided.
As a test, I ran an RPC-style request/response workload with 4KB
payloads and receive zerocopy enabled, with 100 simultaneous TCP
connections. I measured perf cycles within the
find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
without per-vma locking enabled.
When using process-wide mmap semaphore read locking, about 1% of
measured perf cycles were within this path. With per-VMA locking, this
value dropped to about 0.45%.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
introduced during this -rc cycle or which were considered inappropriate
for a backport.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZIdw7QAKCRDdBJ7gKXxA
jki4AQCygi1UoqVPq4N/NzJbv2GaNDXNmcJIoLvPpp3MYFhucAEAtQNzAYO9z6CT
iLDMosnuh+1KLTaKNGL5iak3NAxnxQw=
=mTdI
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"19 hotfixes. 14 are cc:stable and the remainder address issues which
were introduced during this development cycle or which were considered
inappropriate for a backport"
* tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
zswap: do not shrink if cgroup may not zswap
page cache: fix page_cache_next/prev_miss off by one
ocfs2: check new file size on fallocate call
mailmap: add entry for John Keeping
mm/damon/core: fix divide error in damon_nr_accesses_to_accesses_bp()
epoll: ep_autoremove_wake_function should use list_del_init_careful
mm/gup_test: fix ioctl fail for compat task
nilfs2: reject devices with insufficient block count
ocfs2: fix use-after-free when unmounting read-only filesystem
lib/test_vmalloc.c: avoid garbage in page array
nilfs2: fix possible out-of-bounds segment allocation in resize ioctl
riscv/purgatory: remove PGO flags
powerpc/purgatory: remove PGO flags
x86/purgatory: remove PGO flags
kexec: support purgatories with .text.hot sections
mm/uffd: allow vma to merge as much as possible
mm/uffd: fix vma operation where start addr cuts part of vma
radix-tree: move declarations to header
nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
Before storing a page, zswap first checks if the number of stored pages
exceeds the limit specified by memory.zswap.max, for each cgroup in the
hierarchy. If this limit is reached or exceeded, then zswap shrinking is
triggered and short-circuits the store attempt.
However, since the zswap's LRU is not memcg-aware, this can create the
following pathological behavior: the cgroup whose zswap limit is 0 will
evict pages from other cgroups continually, without lowering its own zswap
usage. This means the shrinking will continue until the need for swap
ceases or the pool becomes empty.
As a result of this, we observe a disproportionate amount of zswap
writeback and a perpetually small zswap pool in our experiments, even
though the pool limit is never hit.
More generally, a cgroup might unnecessarily evict pages from other
cgroups before we drive the memcg back below its limit.
This patch fixes the issue by rejecting zswap store attempt without
shrinking the pool when obj_cgroup_may_zswap() returns false.
[akpm@linux-foundation.org: fix return of unintialized value]
[akpm@linux-foundation.org: s/ENOSPC/ENOMEM/]
Link: https://lkml.kernel.org/r/20230530222440.2777700-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230530232435.3097106-1-nphamcs@gmail.com
Fixes: f4840ccfca ("zswap: memcg accounting")
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ackerley Tng reported an issue with hugetlbfs fallocate here[1]. The
issue showed up after the conversion of hugetlb page cache lookup code to
use page_cache_next_miss. Code in hugetlb fallocate, userfaultfd and GUP
is now using page_cache_next_miss to determine if a page is present the
page cache. The following statement is used.
present = page_cache_next_miss(mapping, index, 1) != index;
There are two issues with page_cache_next_miss when used in this way.
1) If the passed value for index is equal to the 'wrap-around' value,
the same index will always be returned. This wrap-around value is 0,
so 0 will be returned even if page is present at index 0.
2) If there is no gap in the range passed, the last index in the range
will be returned. When passed a range of 1 as above, the passed
index value will be returned even if the page is present.
The end result is the statement above will NEVER indicate a page is
present in the cache, even if it is.
As noted by Ackerley in [1], users can see this by hugetlb fallocate
incorrectly returning EEXIST if pages are already present in the file. In
addition, hugetlb pages will not be included in core dumps if they need to
be brought in via GUP. userfaultfd UFFDIO_COPY also uses this code and
will not notice pages already present in the cache. It may try to
allocate a new page and potentially return ENOMEM as opposed to EEXIST.
Both page_cache_next_miss and page_cache_prev_miss have similar issues.
Fix by:
- Check for index equal to 'wrap-around' value and do not exit early.
- If no gap is found in range, return index outside range.
- Update function description to say 'wrap-around' value could be
returned if passed as index.
[1] https://lore.kernel.org/linux-mm/cover.1683069252.git.ackerleytng@google.com/
Link: https://lkml.kernel.org/r/20230602225747.103865-2-mike.kravetz@oracle.com
Fixes: d0ce0e47b3 ("mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
Cc: Erdem Aktas <erdemaktas@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When tools/testing/selftests/mm/gup_test.c is compiled as 32bit, then run
on arm64 kernel, it reports "ioctl: Inappropriate ioctl for device".
Fix it by filling compat_ioctl in gup_test_fops
Link: https://lkml.kernel.org/r/20230526022125.175728-1-haibo.li@mediatek.com
Signed-off-by: Haibo Li <haibo.li@mediatek.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The current uses of PageAnon in page table check functions can lead to
type confusion bugs between struct page and slab [1], if slab pages are
accidentally mapped into the user space. This is because slab reuses the
bits in struct page to store its internal states, which renders PageAnon
ineffective on slab pages.
Since slab pages are not expected to be mapped into the user space, this
patch adds BUG_ON(PageSlab(page)) checks to make sure that slab pages
are not inadvertently mapped. Otherwise, there must be some bugs in the
kernel.
Reported-by: syzbot+fcf1a817ceb50935ce99@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/000000000000258e5e05fae79fc1@google.com/ [1]
Fixes: df4e817b71 ("mm: page table check")
Cc: <stable@vger.kernel.org> # 5.17
Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
Acked-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/r/20230515130958.32471-5-lrh2000@pku.edu.cn
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Without EXCLUSIVE_SYSTEM_RAM, users are allowed to map arbitrary
physical memory regions into the userspace via /dev/mem. At the same
time, pages may change their properties (e.g., from anonymous pages to
named pages) while they are still being mapped in the userspace, leading
to "corruption" detected by the page table check.
To avoid these false positives, this patch makes PAGE_TABLE_CHECK
depends on EXCLUSIVE_SYSTEM_RAM. This dependency is understandable
because PAGE_TABLE_CHECK is a hardening technique but /dev/mem without
STRICT_DEVMEM (i.e., !EXCLUSIVE_SYSTEM_RAM) is itself a security
problem.
Even with EXCLUSIVE_SYSTEM_RAM, I/O pages may be still allowed to be
mapped via /dev/mem. However, these pages are always considered as named
pages, so they won't break the logic used in the page table check.
Cc: <stable@vger.kernel.org> # 5.17
Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/r/20230515130958.32471-4-lrh2000@pku.edu.cn
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The zswap writeback mechanism can cause a race condition resulting in
memory corruption, where a swapped out page gets swapped in with data that
was written to a different page.
The race unfolds like this:
1. a page with data A and swap offset X is stored in zswap
2. page A is removed off the LRU by zpool driver for writeback in
zswap-shrink work, data for A is mapped by zpool driver
3. user space program faults and invalidates page entry A, offset X is
considered free
4. kswapd stores page B at offset X in zswap (zswap could also be
full, if so, page B would then be IOed to X, then skip step 5.)
5. entry A is replaced by B in tree->rbroot, this doesn't affect the
local reference held by zswap-shrink work
6. zswap-shrink work writes back A at X, and frees zswap entry A
7. swapin of slot X brings A in memory instead of B
The fix:
Once the swap page cache has been allocated (case ZSWAP_SWAPCACHE_NEW),
zswap-shrink work just checks that the local zswap_entry reference is
still the same as the one in the tree. If it's not the same it means that
it's either been invalidated or replaced, in both cases the writeback is
aborted because the local entry contains stale data.
Reproducer:
I originally found this by running `stress` overnight to validate my work
on the zswap writeback mechanism, it manifested after hours on my test
machine. The key to make it happen is having zswap writebacks, so
whatever setup pumps /sys/kernel/debug/zswap/written_back_pages should do
the trick.
In order to reproduce this faster on a vm, I setup a system with ~100M of
available memory and a 500M swap file, then running `stress --vm 1
--vm-bytes 300000000 --vm-stride 4000` makes it happen in matter of tens
of minutes. One can speed things up even more by swinging
/sys/module/zswap/parameters/max_pool_percent up and down between, say, 20
and 1; this makes it reproduce in tens of seconds. It's crucial to set
`--vm-stride` to something other than 4096 otherwise `stress` won't
realize that memory has been corrupted because all pages would have the
same data.
Link: https://lkml.kernel.org/r/20230503151200.19707-1-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chris Li (Google) <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit 1ba3cbf3ec ("mm: kfence: improve the performance of
__kfence_alloc() and __kfence_free()"), kfence reports failures in random
places at boot on big endian machines.
The problem is that the new KFENCE_CANARY_PATTERN_U64 encodes the address
of each byte in its value, so it needs to be byte swapped on big endian
machines.
The compiler is smart enough to do the le64_to_cpu() at compile time, so
there is no runtime overhead.
Link: https://lkml.kernel.org/r/20230505035127.195387-1-mpe@ellerman.id.au
Fixes: 1ba3cbf3ec ("mm: kfence: improve the performance of __kfence_alloc() and __kfence_free()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Under memory pressure, we sometimes observe the following crash:
[ 5694.832838] ------------[ cut here ]------------
[ 5694.842093] list_del corruption, ffff888014b6a448->next is LIST_POISON1 (dead000000000100)
[ 5694.858677] WARNING: CPU: 33 PID: 418824 at lib/list_debug.c:47 __list_del_entry_valid+0x42/0x80
[ 5694.961820] CPU: 33 PID: 418824 Comm: fuse_counters.s Kdump: loaded Tainted: G S 5.19.0-0_fbk3_rc3_hoangnhatpzsdynshrv41_10870_g85a9558a25de #1
[ 5694.990194] Hardware name: Wiwynn Twin Lakes MP/Twin Lakes Passive MP, BIOS YMM16 05/24/2021
[ 5695.007072] RIP: 0010:__list_del_entry_valid+0x42/0x80
[ 5695.017351] Code: 08 48 83 c2 22 48 39 d0 74 24 48 8b 10 48 39 f2 75 2c 48 8b 51 08 b0 01 48 39 f2 75 34 c3 48 c7 c7 55 d7 78 82 e8 4e 45 3b 00 <0f> 0b eb 31 48 c7 c7 27 a8 70 82 e8 3e 45 3b 00 0f 0b eb 21 48 c7
[ 5695.054919] RSP: 0018:ffffc90027aef4f0 EFLAGS: 00010246
[ 5695.065366] RAX: 41fe484987275300 RBX: ffff888008988180 RCX: 0000000000000000
[ 5695.079636] RDX: ffff88886006c280 RSI: ffff888860060480 RDI: ffff888860060480
[ 5695.093904] RBP: 0000000000000002 R08: 0000000000000000 R09: ffffc90027aef370
[ 5695.108175] R10: 0000000000000000 R11: ffffffff82fdf1c0 R12: 0000000010000002
[ 5695.122447] R13: ffff888014b6a448 R14: ffff888014b6a420 R15: 00000000138dc240
[ 5695.136717] FS: 00007f23a7d3f740(0000) GS:ffff888860040000(0000) knlGS:0000000000000000
[ 5695.152899] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5695.164388] CR2: 0000560ceaab6ac0 CR3: 000000001c06c001 CR4: 00000000007706e0
[ 5695.178659] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5695.192927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5695.207197] PKRU: 55555554
[ 5695.212602] Call Trace:
[ 5695.217486] <TASK>
[ 5695.221674] zs_map_object+0x91/0x270
[ 5695.229000] zswap_frontswap_store+0x33d/0x870
[ 5695.237885] ? do_raw_spin_lock+0x5d/0xa0
[ 5695.245899] __frontswap_store+0x51/0xb0
[ 5695.253742] swap_writepage+0x3c/0x60
[ 5695.261063] shrink_page_list+0x738/0x1230
[ 5695.269255] shrink_lruvec+0x5ec/0xcd0
[ 5695.276749] ? shrink_slab+0x187/0x5f0
[ 5695.284240] ? mem_cgroup_iter+0x6e/0x120
[ 5695.292255] shrink_node+0x293/0x7b0
[ 5695.299402] do_try_to_free_pages+0xea/0x550
[ 5695.307940] try_to_free_pages+0x19a/0x490
[ 5695.316126] __folio_alloc+0x19ff/0x3e40
[ 5695.323971] ? __filemap_get_folio+0x8a/0x4e0
[ 5695.332681] ? walk_component+0x2a8/0xb50
[ 5695.340697] ? generic_permission+0xda/0x2a0
[ 5695.349231] ? __filemap_get_folio+0x8a/0x4e0
[ 5695.357940] ? walk_component+0x2a8/0xb50
[ 5695.365955] vma_alloc_folio+0x10e/0x570
[ 5695.373796] ? walk_component+0x52/0xb50
[ 5695.381634] wp_page_copy+0x38c/0xc10
[ 5695.388953] ? filename_lookup+0x378/0xbc0
[ 5695.397140] handle_mm_fault+0x87f/0x1800
[ 5695.405157] do_user_addr_fault+0x1bd/0x570
[ 5695.413520] exc_page_fault+0x5d/0x110
[ 5695.421017] asm_exc_page_fault+0x22/0x30
After some investigation, I have found the following issue: unlike other
zswap backends, zsmalloc performs the LRU list update at the object
mapping time, rather than when the slot for the object is allocated.
This deviation was discussed and agreed upon during the review process
of the zsmalloc writeback patch series:
https://lore.kernel.org/lkml/Y3flcAXNxxrvy3ZH@cmpxchg.org/
Unfortunately, this introduces a subtle bug that occurs when there is a
concurrent store and reclaim, which interleave as follows:
zswap_frontswap_store() shrink_worker()
zs_malloc() zs_zpool_shrink()
spin_lock(&pool->lock) zs_reclaim_page()
zspage = find_get_zspage()
spin_unlock(&pool->lock)
spin_lock(&pool->lock)
zspage = list_first_entry(&pool->lru)
list_del(&zspage->lru)
zspage->lru.next = LIST_POISON1
zspage->lru.prev = LIST_POISON2
spin_unlock(&pool->lock)
zs_map_object()
spin_lock(&pool->lock)
if (!list_empty(&zspage->lru))
list_del(&zspage->lru)
CHECK_DATA_CORRUPTION(next == LIST_POISON1) /* BOOM */
With the current upstream code, this issue rarely happens. zswap only
triggers writeback when the pool is already full, at which point all
further store attempts are short-circuited. This creates an implicit
pseudo-serialization between reclaim and store. I am working on a new
zswap shrinking mechanism, which makes interleaving reclaim and store
more likely, exposing this bug.
zbud and z3fold do not have this problem, because they perform the LRU
list update in the alloc function, while still holding the pool's lock.
This patch fixes the aforementioned bug by moving the LRU update back to
zs_malloc(), analogous to zbud and z3fold.
Link: https://lkml.kernel.org/r/20230505185054.2417128-1-nphamcs@gmail.com
Fixes: 64f768c6b3 ("zsmalloc: add a LRU to zs_pool to keep track of zspages in LRU order")
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When something registers and unregisters many shrinkers, such as:
for x in $(seq 10000); do unshare -Ui true; done
Sometimes the following error is printed to the kernel log:
debugfs: Directory '...' with parent 'shrinker' already present!
This occurs since commit badc28d492 ("mm: shrinkers: fix deadlock in
shrinker debugfs") / v6.2: Since the call to `debugfs_remove_recursive`
was moved outside the `shrinker_rwsem`/`shrinker_mutex` lock, but the call
to `ida_free` stayed inside, a newly registered shrinker can be
re-assigned that ID and attempt to create the debugfs directory before the
directory from the previous shrinker has been removed.
The locking changes in commit f95bdb700b ("mm: vmscan: make global slab
shrink lockless") made the race condition more likely, though it existed
before then.
Commit badc28d492 ("mm: shrinkers: fix deadlock in shrinker debugfs")
could be reverted since the issue is addressed should no longer occur
since the count and scan operations are lockless since commit 20cd1892fc
("mm: shrinkers: make count and scan in shrinker debugfs lockless").
However, since this is a contended lock, prefer instead moving `ida_free`
outside the lock to avoid the race.
Link: https://lkml.kernel.org/r/20230503013232.299211-1-joanbrugueram@gmail.com
Fixes: badc28d492 ("mm: shrinkers: fix deadlock in shrinker debugfs")
Signed-off-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2d55c16c0c ("dmapool: create/destroy cleanup").
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZFaTPAAKCRDdBJ7gKXxA
ji1zAP4s/mWPQOJkoTR8cn9wRdGUB4FKgDiJwwYeOdcHRuylvwEAlgOD8qlizXYu
z4tGx6J8p3X+IdEApqH6Up56wjW+Wgk=
=X19M
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-05-06-10-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull dmapool updates - again - from Andrew Morton:
"Reinstate the dmapool changes which were accidentally removed by a
mishap on the last commit in the previous attempt at the series"
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup").
[ The whole old series: def8574308ed..2d55c16c0c54 results in an empty
diff because that last commit ended up being just a revert of all that
came everything before it. - Linus ]
* tag 'mm-stable-2023-05-06-10-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
dmapool: link blocks across pages
dmapool: don't memset on free twice
dmapool: simplify freeing
dmapool: consolidate page initialization
dmapool: rearrange page alloc failure handling
dmapool: move debug code to own functions
dmapool: speedup DMAPOOL_DEBUG with init_on_alloc
dmapool: cleanup integer types
dmapool: use sysfs_emit() instead of scnprintf()
dmapool: remove checks for dev == NULL
changes.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZFaSnAAKCRDdBJ7gKXxA
jskTAPwIkL6xVMIEdtWfQNIy751o0onzfOQAIWgM3dSL83cMJQEA5OXxXt4LjrJw
4q0zf4GLYXjMjhnfYr8zXF39y05iyQA=
=nIhV
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-05-06-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"Five hotfixes.
Three are cc:stable, two pertain to merge window changes"
* tag 'mm-hotfixes-stable-2023-05-06-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
afs: fix the afs_dir_get_folio return value
nilfs2: do not write dirty data after degenerating to read-only
mm: do not reclaim private data from pinned page
nilfs2: fix infinite loop in nilfs_mdt_get_block()
mm/mmap/vma_merge: always check invariants
The allocated dmapool pages are never freed for the lifetime of the pool.
There is no need for the two level list+stack lookup for finding a free
block since nothing is ever removed from the list. Just use a simple
stack, reducing time complexity to constant.
The implementation inserts the stack linking elements and the dma handle
of the block within itself when freed. This means the smallest possible
dmapool block is increased to at most 16 bytes to accommodate these
fields, but there are no exisiting users requesting a dma pool smaller
than that anyway.
Removing the list has a significant change in performance. Using the
kernel's micro-benchmarking self test:
Before:
# modprobe dmapool_test
dmapool test: size:16 blocks:8192 time:57282
dmapool test: size:64 blocks:8192 time:172562
dmapool test: size:256 blocks:8192 time:789247
dmapool test: size:1024 blocks:2048 time:371823
dmapool test: size:4096 blocks:1024 time:362237
After:
# modprobe dmapool_test
dmapool test: size:16 blocks:8192 time:24997
dmapool test: size:64 blocks:8192 time:26584
dmapool test: size:256 blocks:8192 time:33542
dmapool test: size:1024 blocks:2048 time:9022
dmapool test: size:4096 blocks:1024 time:6045
The module test allocates quite a few blocks that may not accurately
represent how these pools are used in real life. For a more marco level
benchmark, running fio high-depth + high-batched on nvme, this patch shows
submission and completion latency reduced by ~100usec each, 1% IOPs
improvement, and perf record's time spent in dma_pool_alloc/free were
reduced by half.
[kbusch@kernel.org: push new blocks in ascending order]
Link: https://lkml.kernel.org/r/20230221165400.1595247-1-kbusch@meta.com
Link: https://lkml.kernel.org/r/20230126215125.4069751-12-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Bryan O'Donoghue <bryan.odonoghue@linaro.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If debug is enabled, dmapool will poison the range, so no need to clear it
to 0 immediately before writing over it.
Link: https://lkml.kernel.org/r/20230126215125.4069751-11-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The actions for busy and not busy are mostly the same, so combine these
and remove the unnecessary function. Also, the pool is about to be freed
so there's no need to poison the page data since we only check for poison
on alloc, which can't be done on a freed pool.
Link: https://lkml.kernel.org/r/20230126215125.4069751-10-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Various fields of the dma pool are set in different places. Move it all
to one function.
Link: https://lkml.kernel.org/r/20230126215125.4069751-9-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Handle the error in a condition so the good path can be in the normal
flow.
Link: https://lkml.kernel.org/r/20230126215125.4069751-8-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Clean up the normal path by moving the debug code outside it.
Link: https://lkml.kernel.org/r/20230126215125.4069751-7-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Avoid double-memset of the same allocated memory in dma_pool_alloc() when
both DMAPOOL_DEBUG is enabled and init_on_alloc=1.
Link: https://lkml.kernel.org/r/20230126215125.4069751-6-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
To represent the size of a single allocation, dmapool currently uses
'unsigned int' in some places and 'size_t' in other places. Standardize
on 'unsigned int' to reduce overhead, but use 'size_t' when counting all
the blocks in the entire pool.
Link: https://lkml.kernel.org/r/20230126215125.4069751-5-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use sysfs_emit instead of scnprintf, snprintf or sprintf.
Link: https://lkml.kernel.org/r/20230126215125.4069751-4-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
dmapool originally tried to support pools without a device because
dma_alloc_coherent() supports allocations without a device. But nobody
ended up using dma pools without a device, and trying to do so will result
in an oops. So remove the checks for pool->dev == NULL since they are
unneeded bloat.
[kbusch@kernel.org: add check for null dev on create]
Link: https://lkml.kernel.org/r/20230126215125.4069751-3-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If the page is pinned, there's no point in trying to reclaim it.
Furthermore if the page is from the page cache we don't want to reclaim
fs-private data from the page because the pinning process may be writing
to the page at any time and reclaiming fs private info on a dirty page can
upset the filesystem (see link below).
Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
Link: https://lkml.kernel.org/r/20230428124140.30166-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We may still have inconsistent input parameters even if we choose not to
merge and the vma_merge() invariant checks are useful for checking this
with no production runtime cost (these are only relevant when
CONFIG_DEBUG_VM is specified).
Therefore, perform these checks regardless of whether we merge.
This is relevant, as a recent issue (addressed in commit "mm/mempolicy:
Correctly update prev when policy is equal on mbind") in the mbind logic
was only picked up in the 6.2.y stable branch where these assertions are
performed prior to determining mergeability.
Had this remained the same in mainline this issue may have been picked up
faster, so moving forward let's always check them.
Link: https://lkml.kernel.org/r/df548a6ae3fa135eec3b446eb3dae8eb4227da97.1682885809.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Smatch reports that filemap_fault() was missed in the conversion of
__filemap_get_folio() error returns from NULL to ERR_PTR.
Fixes: 66dabbb65d ("mm: return an ERR_PTR from __filemap_get_folio")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-by: syzbot+48011b86c8ea329af1b9@syzkaller.appspotmail.com
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge my x86 uaccess updates branch.
The LAM ("Linear Address Masking") updates in this release made me
unhappy about how "access_ok()" was done, and it actually turned out to
have a couple of small bugs in it too. This is my cleanup of the code:
- use the sign bit of the __user pointer rather than masking the
address and checking it against the TASK_SIZE range.
We already did this part for the get/put_user() side, but
'access_ok()' did the naïve "mask and range check" thing, which not
only generates nasty code, but also ended up meaning that __access_ok
itself didn't do a good job, and so copy_from_user_nmi() didn't get
the check right.
- move all the code that is 64-bit only into the 64-bit version of the
header file, so that we don't unnecessarily pollute the shared x86
code and make it look like LAM might work in 32-bit too.
- fix a bug in the address masking (that doesn't end up mattering: in
this case the fix was to just remove the buggy code entirely).
- a couple of trivial cleanups and added commentary about the
access_ok() rules.
* x86-uaccess-cleanup:
x86-64: mm: clarify the 'positive addresses' user address rules
x86: mm: remove 'sign' games from LAM untagged_addr*() macros
x86: uaccess: move 32-bit and 64-bit parts into proper <asm/uaccess_N.h> header
x86: mm: remove architecture-specific 'access_ok()' define
x86-64: make access_ok() independent of LAM
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZFLuDAAKCRDdBJ7gKXxA
jk4KAP9ceSzcPrMejKeeWrkj0PoQzy8FMp3VhG9yaXkWPSNHUgD9EUG8J/lQftsH
t39eKmn6FDuY2cLpFS8HCrlain9JcAE=
=pn8p
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-05-03-16-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hitfixes from Andrew Morton:
"Five hotfixes. Three are cc:stable, two for this -rc cycle"
* tag 'mm-hotfixes-stable-2023-05-03-16-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: change per-VMA lock statistics to be disabled by default
MAINTAINERS: update Michal Simek's email
mm/mempolicy: correctly update prev when policy is equal on mbind
relayfs: fix out-of-bounds access in relay_file_read
kasan: hw_tags: avoid invalid virt_to_page()
- Some KSM work from David Hildenbrand, to make the PR_SET_MEMORY_MERGE
ioctl's behavior more similar to KSM's behavior.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZFLsxAAKCRDdBJ7gKXxA
jl8yAQCqjstPsOULf9QN0z4bGAUhY+Wj4ERz1jbKSIuhFCJWiQEAgQvgRXObKjmi
OtUB0Ek4CMDCQzbyIQ1Bhp3kxi6+Jgs=
=AbyC
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-05-03-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- Some DAMON cleanups from Kefeng Wang
- Some KSM work from David Hildenbrand, to make the PR_SET_MEMORY_MERGE
ioctl's behavior more similar to KSM's behavior.
[ Andrew called these "final", but I suspect we'll have a series fixing
up the fact that the last commit in the dmapools series in the
previous pull seems to have unintentionally just reverted all the
other commits in the same series.. - Linus ]
* tag 'mm-stable-2023-05-03-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: hwpoison: coredump: support recovery from dump_user_range()
mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page()
mm/ksm: move disabling KSM from s390/gmap code to KSM code
selftests/ksm: ksm_functional_tests: add prctl unmerge test
mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0
mm/damon/paddr: fix missing folio_sz update in damon_pa_young()
mm/damon/paddr: minor refactor of damon_pa_mark_accessed_or_deactivate()
mm/damon/paddr: minor refactor of damon_pa_pageout()
The linear address masking (LAM) code made access_ok() more complicated,
in that it now needs to untag the address in order to verify the access
range. See commit 74c228d20a ("x86/uaccess: Provide untagged_addr()
and remove tags before address check").
We were able to avoid that overhead in the get_user/put_user code paths
by simply using the sign bit for the address check, and depending on the
GP fault if the address was non-canonical, which made it all independent
of LAM.
And we can do the same thing for access_ok(): simply check that the user
pointer range has the high bit clear. No need to bother with any
address bit masking.
In fact, we can go a bit further, and just check the starting address
for known small accesses ranges: any accesses that overflow will still
be in the non-canonical area and will still GP fault.
To still make syzkaller catch any potentially unchecked user addresses,
we'll continue to warn about GP faults that are caused by accesses in
the non-canonical range. But we'll limit that to purely "high bit set
and past the one-page 'slop' area".
We could probably just do that "check only starting address" for any
arbitrary range size: realistically all kernel accesses to user space
will be done starting at the low address. But let's leave that kind of
optimization for later. As it is, this already allows us to generate
simpler code and not worry about any tag bits in the address.
The one thing to look out for is the GUP address check: instead of
actually copying data in the virtual address range (and thus bad
addresses being caught by the GP fault), GUP will look up the page
tables manually. As a result, the page table limits need to be checked,
and that was previously implicitly done by the access_ok().
With the relaxed access_ok() check, we need to just do an explicit check
for TASK_SIZE_MAX in the GUP code instead. The GUP code already needs
to do the tag bit unmasking anyway, so there this is all very
straightforward, and there are no LAM issues.
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change CONFIG_PER_VMA_LOCK_STATS to be disabled by default, as most users
don't need it. Add configuration help to clarify its usage.
Link: https://lkml.kernel.org/r/20230428173533.18158-1-surenb@google.com
Fixes: 52f238653e ("mm: introduce per-VMA lock statistics")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The refactoring in commit f4e9e0e694 ("mm/mempolicy: fix use-after-free
of VMA iterator") introduces a subtle bug which arises when attempting to
apply a new NUMA policy across a range of VMAs in mbind_range().
The refactoring passes a **prev pointer to keep track of the previous VMA
in order to reduce duplication, and in all but one case it keeps this
correctly updated.
The bug arises when a VMA within the specified range has an equivalent
policy as determined by mpol_equal() - which unlike other cases, does not
update prev.
This can result in a situation where, later in the iteration, a VMA is
found whose policy does need to change. At this point, vma_merge() is
invoked with prev pointing to a VMA which is before the previous VMA.
Since vma_merge() discovers the curr VMA by looking for the one
immediately after prev, it will now be in a situation where this VMA is
incorrect and the merge will not proceed correctly.
This is checked in the VM_WARN_ON() invariant case with end >
curr->vm_end, which, if a merge is possible, results in a warning (if
CONFIG_DEBUG_VM is specified).
I note that vma_merge() performs these invariant checks only after
merge_prev/merge_next are checked, which is debatable as it hides this
issue if no merge is possible even though a buggy situation has arisen.
The solution is simply to update the prev pointer even when policies are
equal.
This caused a bug to arise in the 6.2.y stable tree, and this patch
resolves this bug.
Link: https://lkml.kernel.org/r/83f1d612acb519d777bebf7f3359317c4e7f4265.1682866629.git.lstoakes@gmail.com
Fixes: f4e9e0e694 ("mm/mempolicy: fix use-after-free of VMA iterator")
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202304292203.44ddeff6-oliver.sang@intel.com
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now the __pageblock_pfn_to_page() is used by set_zone_contiguous(), which
checks whether the given zone contains holes, and uses
pfn_to_online_page() to validate if the start pfn is online and valid, as
well as using pfn_valid() to validate the end pfn.
However, the __pageblock_pfn_to_page() function may return non-NULL even
if the end pfn of a pageblock is in a memory hole in some situations. For
example, if the pageblock order is MAX_ORDER, which will fall into 2
sub-sections, and the end pfn of the pageblock may be hole even though the
start pfn is online and valid.
See below memory layout as an example and suppose the pageblock order is
MAX_ORDER.
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff]
[ 0.000000] DMA32 empty
[ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000040000000-0x0000001fa3c7ffff]
[ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff]
[ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff]
[ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff]
[ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff]
[ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff]
[ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff]
[ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff]
[ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7dfffff]
Focus on the last memory range, and there is a hole for the range [mem
0x0000001fa7590000-0x0000001fa7dfffff]. That means the last pageblock
will contain the range from 0x1fa7c00000 to 0x1fa7ffffff, since the
pageblock must be 4M aligned. And in this pageblock, these pfns will fall
into 2 sub-section (the sub-section size is 2M aligned).
So, the 1st sub-section (indicates pfn range: 0x1fa7c00000 - 0x1fa7dfffff
) in this pageblock is valid by calling subsection_map_init() in
free_area_init(), but the 2nd sub-section (indicates pfn range:
0x1fa7e00000 - 0x1fa7ffffff ) in this pageblock is not valid.
This did not break anything until now, but the zone continuous is fragile
in this possible scenario. So as previous discussion[1], it is better to
add some comments to explain this possible issue in case there are some
future pfn walkers that rely on this.
[1] https://lore.kernel.org/all/87r0sdsmr6.fsf@yhuang6-desk2.ccr.corp.intel.com/
Link: https://lkml.kernel.org/r/5c26368865e79c743a453dea48d30670b19d2e4f.1682425534.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/5c26368865e79c743a453dea48d30670b19d2e4f.1682425534.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's factor out actual disabling of KSM. The existing "mm->def_flags &=
~VM_MERGEABLE;" was essentially a NOP and can be dropped, because
def_flags should never include VM_MERGEABLE. Note that we don't currently
prevent re-enabling KSM.
This should now be faster in case KSM was never enabled, because we only
conditionally iterate all VMAs. Further, it certainly looks cleaner.
Link: https://lkml.kernel.org/r/20230422210156.33630-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Stefan Roesch <shr@devkernel.io>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>