IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The DAMOS action applying function, 'damon_do_apply_schemes()', is still
long and not easy to read. Split out the code for applying a single
action to a single region into a new function for better readability.
Link: https://lkml.kernel.org/r/20221026225943.100429-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: cleanup and refactoring code", v2.
This patchset cleans up and refactors a range of DAMON code including the
core, DAMON sysfs interface, and DAMON modules, for better readability and
convenient future feature implementations.
In detail, this patchset splits unnecessarily long and complex functions
in core into smaller functions (patches 1-4). Then, it cleans up the
DAMON sysfs interface by using more type-safe code (patch 5) and removing
unnecessary function parameters (patch 6). Further, it refactor the code
by distributing the code into multiple files (patches 7-10). Last two
patches (patches 11 and 12) deduplicates and remove unnecessary header
inclusion in DAMON modules (reclaim and lru_sort).
This patch (of 12):
The DAMOS action applying function, 'damon_do_apply_schemes()', is quite
long and not so simple. Split out the already quota-charged region skip
code, which is not a small amount of simple code, into a new function with
some comments for better readability.
Link: https://lkml.kernel.org/r/20221026225943.100429-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20221026225943.100429-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Any codepath that zaps page table entries must invoke MMU notifiers to
ensure that secondary MMUs (like KVM) don't keep accessing pages which
aren't mapped anymore. Secondary MMUs don't hold their own references to
pages that are mirrored over, so failing to notify them can lead to page
use-after-free.
I'm marking this as addressing an issue introduced in commit f3f0e1d2150b
("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of
the security impact of this only came in commit 27e1f8273113 ("khugepaged:
enable collapse pmd for pte-mapped THP"), which actually omitted flushes
for the removal of present PTEs, not just for the removal of empty page
tables.
Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit 70cbc3cc78a99 ("mm: gup: fix the fast GUP race against THP
collapse"), the lockless_pages_from_mm() fastpath rechecks the pmd_t to
ensure that the page table was not removed by khugepaged in between.
However, lockless_pages_from_mm() still requires that the page table is
not concurrently freed. Fix it by sending IPIs (if the architecture uses
semi-RCU-style page table freeing) before freeing/reusing page tables.
Link: https://lkml.kernel.org/r/20221129154730.2274278-2-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-2-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-2-jannh@google.com
Fixes: ba76149f47d8 ("thp: khugepaged")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
pagetable walks on address ranges mapped by VMAs can be done under the
mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the
VMA's address_space. Only one of these needs to be held, and it does not
need to be held in exclusive mode.
Under those circumstances, the rules for concurrent access to page table
entries are:
- Terminal page table entries (entries that don't point to another page
table) can be arbitrarily changed under the page table lock, with the
exception that they always need to be consistent for
hardware page table walks and lockless_pages_from_mm().
This includes that they can be changed into non-terminal entries.
- Non-terminal page table entries (which point to another page table)
can not be modified; readers are allowed to READ_ONCE() an entry, verify
that it is non-terminal, and then assume that its value will stay as-is.
Retracting a page table involves modifying a non-terminal entry, so
page-table-level locks are insufficient to protect against concurrent page
table traversal; it requires taking all the higher-level locks under which
it is possible to start a page walk in the relevant range in exclusive
mode.
The collapse_huge_page() path for anonymous THP already follows this rule,
but the shmem/file THP path was getting it wrong, making it possible for
concurrent rmap-based operations to cause corruption.
Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com
Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The issue is reported when removing memory through virtio_mem device. The
transparent huge page, experienced copy-on-write fault, is wrongly
regarded as pinned. The transparent huge page is escaped from being
isolated in isolate_migratepages_block(). The transparent huge page can't
be migrated and the corresponding memory block can't be put into offline
state.
Fix it by replacing page_mapcount() with total_mapcount(). With this, the
transparent huge page can be isolated and migrated, and the memory block
can be put into offline state. Besides, The page's refcount is increased
a bit earlier to avoid the page is released when the check is executed.
Link: https://lkml.kernel.org/r/20221124095523.31061-1-gshan@redhat.com
Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
Tested-by: Zhenyu Zhang <zhenyzha@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org> [5.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When running as a Xen PV guests commit eed9a328aa1a ("mm: x86: add
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") can cause a protection violation in
pmdp_test_and_clear_young():
BUG: unable to handle page fault for address: ffff8880083374d0
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
PGD 3026067 P4D 3026067 PUD 3027067 PMD 7fee5067 PTE 8010000008337065
Oops: 0003 [#1] PREEMPT SMP NOPTI
CPU: 7 PID: 158 Comm: kswapd0 Not tainted 6.1.0-rc5-20221118-doflr+ #1
RIP: e030:pmdp_test_and_clear_young+0x25/0x40
This happens because the Xen hypervisor can't emulate direct writes to
page table entries other than PTEs.
This can easily be fixed by introducing arch_has_hw_nonleaf_pmd_young()
similar to arch_has_hw_pte_young() and test that instead of
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG.
Link: https://lkml.kernel.org/r/20221123064510.16225-1-jgross@suse.com
Fixes: eed9a328aa1a ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Acked-by: Yu Zhao <yuzhao@google.com>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Acked-by: David Hildenbrand <david@redhat.com> [core changes]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit da87878010e5 ("mm/damon/sysfs: support online inputs update") made
'damon_sysfs_set_schemes()' to be called for running DAMON context, which
could have schemes. In the case, DAMON sysfs interface is supposed to
update, remove, or add schemes to reflect the sysfs files. However, the
code is assuming the DAMON context wouldn't have schemes at all, and
therefore creates and adds new schemes. As a result, the code doesn't
work as intended for online schemes tuning and could have more than
expected memory footprint. The schemes are all in the DAMON context, so
it doesn't leak the memory, though.
Remove the wrong asssumption (the DAMON context wouldn't have schemes) in
'damon_sysfs_set_schemes()' to fix the bug.
Link: https://lkml.kernel.org/r/20221122194831.3472-1-sj@kernel.org
Fixes: da87878010e5 ("mm/damon/sysfs: support online inputs update")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [5.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
tables associated with the address range. For hugetlb vmas,
zap_page_range will call __unmap_hugepage_range_final. However,
__unmap_hugepage_range_final assumes the passed vma is about to be removed
and deletes the vma_lock to prevent pmd sharing as the vma is on the way
out. In the case of madvise(MADV_DONTNEED) the vma remains, but the
missing vma_lock prevents pmd sharing and could potentially lead to issues
with truncation/fault races.
This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap. Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
prevent pmd sharing. Subsequent faults on this vma were confused as
VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was
not set in new pages added to the page table. This resulted in pages that
appeared anonymous in a VM_SHARED vma and triggered the BUG.
Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap
call from unmap_vmas(). This is used to indicate the 'final' unmapping of
a hugetlb vma. When called via MADV_DONTNEED, this flag is not set and
the vm_lock is not deleted.
[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com
Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Wei Chen <harperchen1110@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This series addresses the issue first reported in [1], and fully described
in patch 2. Patches 1 and 2 address the user visible issue and are tagged
for stable backports.
While exploring solutions to this issue, related problems with mmu
notification calls were discovered. This is addressed in the patch
"hugetlb: remove duplicate mmu notifications:". Since there are no user
visible effects, this third is not tagged for stable backports.
Previous discussions suggested further cleanup by removing the
routine zap_page_range. This is possible because zap_page_range_single
is now exported, and all callers of zap_page_range pass ranges entirely
within a single vma. This work will be done in a later patch so as not
to distract from this bug fix.
[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
This patch (of 2):
Expose the routine zap_page_range_single to zap a range within a single
vma. The madvise routine madvise_dontneed_single_vma can use this routine
as it explicitly operates on a single vma. Also, update the mmu
notification range in zap_page_range_single to take hugetlb pmd sharing
into account. This is required as MADV_DONTNEED supports hugetlb vmas.
Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com
Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Wei Chen <harperchen1110@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As with PG_arch_2, this flag is only allowed on 64-bit architectures due
to the shortage of bits available. It will be used by the arm64 MTE code
in subsequent patches.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Steven Price <steven.price@arm.com>
[catalin.marinas@arm.com: added flag preserving in __split_huge_page_tail()]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221104011041.290951-5-pcc@google.com
Commit 4beba9486abd ("mm: Add PG_arch_2 page flag") introduced a new
page flag for all 64-bit architectures. However, even if an architecture
is 64-bit, it may still have limited spare bits in the 'flags' member of
'struct page'. This may happen if an architecture enables SPARSEMEM
without SPARSEMEM_VMEMMAP as is the case with the newly added loongarch.
This architecture port needs 19 more bits for the sparsemem section
information and, while it is currently fine with PG_arch_2, adding any
more PG_arch_* flags will trigger build-time warnings.
Add a new CONFIG_ARCH_USES_PG_ARCH_X option which can be selected by
architectures that need more PG_arch_* flags beyond PG_arch_1. Select it
on arm64.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
[pcc@google.com: fix build with CONFIG_ARM64_MTE disabled]
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Price <steven.price@arm.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221104011041.290951-2-pcc@google.com
Current code allows the page_reporting_order parameter to be changed
via sysfs to any integer value. The new value is used immediately
in page reporting code with no validation, which could cause incorrect
behavior. Fix this by adding validation of the new value.
Export this parameter for use in the driver that is calling the
page_reporting_register().
This is needed by drivers like hv_balloon to know the order of the
pages reported. Traditionally the values provided in the kernel boot
line or subsequently changed via sysfs take priority therefore, if
page_reporting_order parameter's value is set, it takes precedence
over the value passed while registering with the driver.
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/1664517699-1085-2-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Since commit c7323a5ad078 ("mm/slub: restrict sysfs validation to debug
caches and make it safe"), caches with debugging enabled use the
free_debug_processing() function to do both freeing checks and actual
freeing to partial list under list_lock, bypassing the fast paths.
We will want to use the same path for CONFIG_SLUB_TINY, but without the
debugging checks, so refactor the code so that free_debug_processing()
does only the checks, while the freeing is handled by a new function
free_to_partial_list().
For consistency, change return parameter alloc_debug_processing() from
int to bool and correct the !SLUB_DEBUG variant to return true and not
false. This didn't matter until now, but will in the following changes.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Distinguishing kmalloc(__GFP_RECLAIMABLE) can help against fragmentation
by grouping pages by mobility, but on tiny systems the extra memory
overhead of separate set of kmalloc-rcl caches will probably be worse,
and mobility grouping likely disabled anyway.
Thus with CONFIG_SLUB_TINY, don't create kmalloc-rcl caches and use the
regular ones.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
With CONFIG_SLUB_TINY we want to minimize memory overhead. By lowering
the default slub_max_order we can make slab allocations use smaller
pages. However depending on object sizes, order-0 might not be the best
due to increased fragmentation. When testing on a 8MB RAM k210 system by
Damien Le Moal [1], slub_max_order=1 had the best results, so use that
as the default for CONFIG_SLUB_TINY.
[1] https://lore.kernel.org/all/6a1883c4-4c3f-545a-90e8-2cd805bcf4ae@opensource.wdc.com/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
SLUB will leave a number of slabs on the partial list even if they are
empty, to avoid some slab freeing and reallocation. The goal of
CONFIG_SLUB_TINY is to minimize memory overhead, so set the limits to 0
for immediate slab page freeing.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Currently SLUB enables its sysfs support depending unconditionally on
the general CONFIG_SYSFS setting. To reduce the configuration
combination space, make CONFIG_SLUB_TINY disable SLUB's sysfs support by
reusing the existing SLAB_SUPPORTS_SYSFS define. It is unlikely that
real tiny systems would combine CONFIG_SLUB_TINY with CONFIG_SYSFS, but
a randconfig might.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
For tiny systems that have used SLOB until now, SLUB might be
impractical due to its higher memory usage. To help with that, introduce
an option CONFIG_SLUB_TINY that modifies SLUB to use less memory.
This is done by sacrificing scalability, security and debugging
features, therefore not recommended for any system with more than 16MB
RAM.
This commit introduces the option and uses it to set other related
options in a way that reduces memory usage.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
With CONFIG_HARDENED_USERCOPY not enabled, there are no
__check_heap_object() checks happening that would use the struct
kmem_cache useroffset and usersize fields. Yet the fields are still
initialized, preventing merging of otherwise compatible caches.
Also the fields contribute to struct kmem_cache size unnecessarily when
unused. Thus #ifdef them out completely when CONFIG_HARDENED_USERCOPY is
disabled. In kmem_dump_obj() print object_size instead of usersize, as
that's actually the intention.
In a quick virtme boot test, this has reduced the number of caches in
/proc/slabinfo from 131 to 111.
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
There have been a lot of hotfixes this cycle, and this is quite a large
batch given how far we are into the -rc cycle. Presumably a reflection of
the unusually large amount of MM material which went into 6.1-rc1.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY4Bd6gAKCRDdBJ7gKXxA
jvX6AQCsG1ld24kMpdD+70XXUyC29g/6/jribgtZApHyDYjxSwD/WmLNpPlUPRax
WB071Y5w65vjSTUKvwU0OLGbHwyxgAw=
=swD5
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2022-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"24 MM and non-MM hotfixes. 8 marked cc:stable and 16 for post-6.0
issues.
There have been a lot of hotfixes this cycle, and this is quite a
large batch given how far we are into the -rc cycle. Presumably a
reflection of the unusually large amount of MM material which went
into 6.1-rc1"
* tag 'mm-hotfixes-stable-2022-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits)
test_kprobes: fix implicit declaration error of test_kprobes
nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirty
mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1
mm: fix unexpected changes to {failslab|fail_page_alloc}.attr
swapfile: fix soft lockup in scan_swap_map_slots
hugetlb: fix __prep_compound_gigantic_page page flag setting
kfence: fix stack trace pruning
proc/meminfo: fix spacing in SecPageTables
mm: multi-gen LRU: retry folios written back while isolated
mailmap: update email address for Satya Priya
mm/migrate_device: return number of migrating pages in args->cpages
kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible
MAINTAINERS: update Alex Hung's email address
mailmap: update Alex Hung's email address
mm: mmap: fix documentation for vma_mas_szero
mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed
mm/memory: return vm_fault_t result from migrate_to_ram() callback
mm: correctly charge compressed memory to its memcg
ipc/shm: call underlying open/close vm_ops
gcov: clang: fix the buffer overflow issue
...
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.
Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
balance_dirty_pages doesn't do the required dirty throttling on cgroupv1.
See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
on traditional hierarchies"). Instead, the kernel depends on writeback
throttling in shrink_folio_list to achieve the same goal. With large
memory systems, the flusher may not be able to writeback quickly enough
such that we will start finding pages in the shrink_folio_list already in
writeback. Hence for cgroupv1 let's do a reclaim throttle after waking up
the flusher.
The below test which used to fail on a 256GB system completes till the the
file system is full with this change.
root@lp2:/sys/fs/cgroup/memory# mkdir test
root@lp2:/sys/fs/cgroup/memory# cd test/
root@lp2:/sys/fs/cgroup/memory/test# echo 120M > memory.limit_in_bytes
root@lp2:/sys/fs/cgroup/memory/test# echo $$ > tasks
root@lp2:/sys/fs/cgroup/memory/test# dd if=/dev/zero of=/home/kvaneesh/test bs=1M
Killed
Link: https://lkml.kernel.org/r/20221118070603.84081-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: zefan li <lizefan.x@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When we specify __GFP_NOWARN, we only expect that no warnings will be
issued for current caller. But in the __should_failslab() and
__should_fail_alloc_page(), the local GFP flags alter the global
{failslab|fail_page_alloc}.attr, which is persistent and shared by all
tasks. This is not what we expected, let's fix it.
[akpm@linux-foundation.org: unexport should_fail_ex()]
Link: https://lkml.kernel.org/r/20221118100011.2634-1-zhengqi.arch@bytedance.com
Fixes: 3f913fc5f974 ("mm: fix missing handler for __GFP_NOWARN")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A softlockup occurs in scan free swap slot under huge memory pressure.
The test scenario is: 64 CPU cores, 64GB memory, and 28 zram devices, the
disksize of each zram device is 50MB.
LATENCY_LIMIT is used to prevent softlockups in scan_swap_map_slots(), but
the real loop number would more than LATENCY_LIMIT because of "goto checks
and goto scan" repeatly without decreasing latency limit.
In order to fix it, decrease latency_ration in advance.
There is also a suspicious place that will cause softlockups in
get_swap_pages(). In this function, the "goto start_over" may result in
continuous scanning of the swap partition. If there is no cond_sched in
scan_swap_map_slots(), it would cause a softlockup (I am not sure about
this).
WARN: soft lockup - CPU#11 stuck for 11s! [kswapd0:466]
CPU: 11 PID: 466 Comm: kswapd@ Kdump: loaded Tainted: G
dump backtrace+0x0/0x1le4
show stack+0x20/@x2c
dump_stack+0xd8/0x140
watchdog print_info+0x48/0x54
watchdog_process_before_softlockup+0x98/0xa0
watchdog_timer_fn+0xlac/0x2d0
hrtimer_rum_queues+0xb0/0x130
hrtimer_interrupt+0x13c/0x3c0
arch_timer_handler_virt+0x3c/0x50
handLe_percpu_devid_irq+0x90/0x1f4
handle domain irq+0x84/0x100
gic_handle_irq+0x88/0x2b0
e11 ira+0xhB/Bx140
scan_swap_map_slots+0x678/0x890
get_swap_pages+0x29c/0x440
get_swap_page+0x120/0x2e0
add_to_swap+UX2U/0XyC
shrink_page_list+0x5d0/0x152c
shrink_inactive_list+0xl6c/Bx500
shrink_lruvec+0x270/0x304
WARN: soft lockup - CPU#32 stuck for 11s! [stress-ng:309915]
watchdog_timer_fn+0x1ac/0x2d0
__run_hrtimer+0x98/0x2a0
__hrtimer_run_queues+0xb0/0x130
hrtimer_interrupt+0x13c/0x3c0
arch_timer_handler_virt+0x3c/0x50
handle_percpu_devid_irq+0x90/0x1f4
__handle_domain_irq+0x84/0x100
gic_handle_irq+0x88/0x2b0
el1_irq+0xb8/0x140
get_swap_pages+0x1e8/0x440
get_swap_page+0x1c8/0x2e0
add_to_swap+0x20/0x9c
shrink_page_list+0x5d0/0x152c
reclaim_pages+0x160/0x310
madvise_cold_or_pageout_pte_range+0x7bc/0xe3c
walk_pmd_range.isra.0+0xac/0x22c
walk_pud_range+0xfc/0x1c0
walk_pgd_range+0x158/0x1b0
__walk_page_range+0x64/0x100
walk_page_range+0x104/0x150
Link: https://lkml.kernel.org/r/20221118133850.3360369-1-chenwandun@huawei.com
Fixes: 048c27fd7281 ("[PATCH] swap: scan_swap_map latency breaks")
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: <xialonglong1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 2b21624fc232 ("hugetlb: freeze allocated pages before creating
hugetlb pages") changed the order page flags were cleared and set in the
head page. It moved the __ClearPageReserved after __SetPageHead.
However, there is a check to make sure __ClearPageReserved is never done
on a head page. If CONFIG_DEBUG_VM_PGFLAGS is enabled, the following BUG
will be hit when creating a hugetlb gigantic page:
page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
------------[ cut here ]------------
kernel BUG at include/linux/page-flags.h:500!
Call Trace will differ depending on whether hugetlb page is created
at boot time or run time.
Make sure to __ClearPageReserved BEFORE __SetPageHead.
Link: https://lkml.kernel.org/r/20221118195249.178319-1-mike.kravetz@oracle.com
Fixes: 2b21624fc232 ("hugetlb: freeze allocated pages before creating hugetlb pages")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Tested-by: Tarun Sahu <tsahu@linux.ibm.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit b14051352465 ("mm/sl[au]b: generalize kmalloc subsystem")
refactored large parts of the kmalloc subsystem, resulting in the stack
trace pruning logic done by KFENCE to no longer work.
While b14051352465 attempted to fix the situation by including
'__kmem_cache_free' in the list of functions KFENCE should skip through,
this only works when the compiler actually optimized the tail call from
kfree() to __kmem_cache_free() into a jump (and thus kfree() _not_
appearing in the full stack trace to begin with).
In some configurations, the compiler no longer optimizes the tail call
into a jump, and __kmem_cache_free() appears in the stack trace. This
means that the pruned stack trace shown by KFENCE would include kfree()
which is not intended - for example:
| BUG: KFENCE: invalid free in kfree+0x7c/0x120
|
| Invalid free of 0xffff8883ed8fefe0 (in kfence-#126):
| kfree+0x7c/0x120
| test_double_free+0x116/0x1a9
| kunit_try_run_case+0x90/0xd0
| [...]
Fix it by moving __kmem_cache_free() to the list of functions that may be
tail called by an allocator entry function, making the pruning logic work
in both the optimized and unoptimized tail call cases.
Link: https://lkml.kernel.org/r/20221118152216.3914899-1-elver@google.com
Fixes: b14051352465 ("mm/sl[au]b: generalize kmalloc subsystem")
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The page reclaim isolates a batch of folios from the tail of one of the
LRU lists and works on those folios one by one. For a suitable
swap-backed folio, if the swap device is async, it queues that folio for
writeback. After the page reclaim finishes an entire batch, it puts back
the folios it queued for writeback to the head of the original LRU list.
In the meantime, the page writeback flushes the queued folios also by
batches. Its batching logic is independent from that of the page reclaim.
For each of the folios it writes back, the page writeback calls
folio_rotate_reclaimable() which tries to rotate a folio to the tail.
folio_rotate_reclaimable() only works for a folio after the page reclaim
has put it back. If an async swap device is fast enough, the page
writeback can finish with that folio while the page reclaim is still
working on the rest of the batch containing it. In this case, that folio
will remain at the head and the page reclaim will not retry it before
reaching there.
This patch adds a retry to evict_folios(). After evict_folios() has
finished an entire batch and before it puts back folios it cannot free
immediately, it retries those that may have missed the rotation.
Before this patch, ~60% of folios swapped to an Intel Optane missed
folio_rotate_reclaimable(). After this patch, ~99% of missed folios were
reclaimed upon retry.
This problem affects relatively slow async swap devices like Samsung 980
Pro much less and does not affect sync swap devices like zram or zswap at
all.
Link: https://lkml.kernel.org/r/20221116013808.3995280-1-yuzhao@google.com
Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: "Yin, Fengwei" <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
migrate_vma->cpages originally contained a count of the number of pages
migrating including non-present pages which can be populated directly on
the target.
Commit 241f68859656 ("mm/migrate_device.c: refactor migrate_vma and
migrate_device_coherent_page()") inadvertantly changed this to contain
just the number of pages that were unmapped. Usage of migrate_vma->cpages
isn't documented, but most drivers use it to see if all the requested
addresses can be migrated so restore the original behaviour.
Link: https://lkml.kernel.org/r/20221111005135.1344004-1-apopple@nvidia.com
Fixes: 241f68859656 ("mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When the struct_mm input, mm, was changed to a struct ma_state, mas, the
documentation for the function was never updated. This updates that
documentation reference.
Link: https://lkml.kernel.org/r/20221114003349.41235-1-ian@linux.cowan.aero
Signed-off-by: Ian Cowan <ian@linux.cowan.aero>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A DAMON sysfs interface user can start DAMON with a scheme, remove the
sysfs directory for the scheme, and then ask update of the scheme's stats.
Because the schemes stats update logic isn't aware of the situation, it
results in an invalid memory access. Fix the bug by checking if the
scheme sysfs directory exists.
Link: https://lkml.kernel.org/r/20221114175552.1951-1-sj@kernel.org
Fixes: 0ac32b8affb5 ("mm/damon/sysfs: support DAMOS stats")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [v5.18]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The migrate_to_ram() callback should always succeed, but in rare cases can
fail usually returning VM_FAULT_SIGBUS. Commit 16ce101db85d
("mm/memory.c: fix race when faulting a device private page") incorrectly
stopped passing the return code up the stack. Fix this by setting the ret
variable, restoring the previous behaviour on migrate_to_ram() failure.
Link: https://lkml.kernel.org/r/20221114115537.727371-1-apopple@nvidia.com
Fixes: 16ce101db85d ("mm/memory.c: fix race when faulting a device private page")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kswapd will reclaim memory when memory pressure is high, the annonymous
memory will be compressed and stored in the zpool if zswap is enabled.
The memcg_kmem_bypass() in get_obj_cgroup_from_page() will bypass the
kernel thread and cause the compressed memory not be charged to its memory
cgroup.
Remove the memcg_kmem_bypass() call and properly charge compressed memory
to its corresponding memory cgroup.
Link: https://lore.kernel.org/linux-mm/CALvZod4nnn8BHYqAM4xtcR0Ddo2-Wr8uKm9h_CHWUaXw7g_DCg@mail.gmail.com/
Link: https://lkml.kernel.org/r/20221114194828.100822-1-hannes@cmpxchg.org
Fixes: f4840ccfca25 ("zswap: memcg accounting")
Signed-off-by: Li Liguang <liliguang@baidu.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: <stable@vger.kernel.org> [5.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Refactor the mm_khugepaged_scan_file tracepoint to move filename
dereference to the tracepoint definition, to maintain consistency with
other tracepoints[1].
[1]:lore.kernel.org/lkml/20221024111621.3ba17e2c@gandalf.local.home/
Link: https://lkml.kernel.org/r/20221026044524.54793-1-gautammenghani201@gmail.com
Fixes: d41fd2016ed07 ("mm/khugepaged: add tracepoint to hpage_collapse_scan_file()")
Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fix the below compiler warnings reported with 'make W=1 mm/'.
mm/page_ext.c:178: warning: Function parameter or member 'page_ext' not
described in 'page_ext_put'.
[quic_pkondeti@quicinc.com: better patch title]
Link: https://lkml.kernel.org/r/1667884582-2465-1-git-send-email-quic_charante@quicinc.com
Fixes: b1d5488a252dc9 ("mm: fix use-after free of page_ext after race with memory-offline")
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Syzbot reported the below splat:
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Modules linked in:
CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
do_madvise mm/madvise.c:1432 [inline]
__do_sys_madvise mm/madvise.c:1432 [inline]
__se_sys_madvise mm/madvise.c:1430 [inline]
__x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f6b48a4eef9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
</TASK>
The khugepaged code would pick up the node with the most hit as the preferred
node, and also tries to do some balance if several nodes have the same
hit record. Basically it does conceptually:
* If the target_node <= last_target_node, then iterate from
last_target_node + 1 to MAX_NUMNODES (1024 on default config)
* If the max_value == node_load[nid], then target_node = nid
But there is a corner case, paritucularly for MADV_COLLAPSE, that the
non-existing node may be returned as preferred node.
Assuming the system has 2 nodes, the target_node is 0 and the
last_target_node is 1, if MADV_COLLAPSE path is hit, the max_value may
be 0, then it may return 2 for target_node, but it is actually not
existing (offline), so the warn is triggered.
The node balance was introduced by commit 9f1b868a13ac ("mm: thp:
khugepaged: add policy for finding target node") to satisfy
"numactl --interleave=all". But interleaving is a mere hint rather than
something that has hard requirements.
So use nodemask to record the nodes which have the same hit record, the
hugepage allocation could fallback to those nodes. And remove
__GFP_THISNODE since it does disallow fallback. And if the nodemask
just has one node set, it means there is one single node has the most
hit record, the nodemask approach actually behaves like __GFP_THISNODE.
Link: https://lkml.kernel.org/r/20221108184357.55614-2-shy828301@gmail.com
Fixes: 7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
Signed-off-by: Yang Shi <shy828301@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During proactive reclaim, we sometimes observe severe overreclaim, with
several thousand times more pages reclaimed than requested.
This trace was obtained from shrink_lruvec() during such an instance:
prio:0 anon_cost:1141521 file_cost:7767
nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
nr=[7161123 345 578 1111]
While he reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
by swapping. These requests take over a minute, during which the write()
to memory.reclaim is unkillably stuck inside the kernel.
Digging into the source, this is caused by the proportional reclaim
bailout logic. This code tries to resolve a fundamental conflict: to
reclaim roughly what was requested, while also aging all LRUs fairly and
in accordance to their size, swappiness, refault rates etc. The way it
attempts fairness is that once the reclaim goal has been reached, it stops
scanning the LRUs with the smaller remaining scan targets, and adjusts the
remainder of the bigger LRUs according to how much of the smaller LRUs was
scanned. It then finishes scanning that remainder regardless of the
reclaim goal.
This works fine if priority levels are low and the LRU lists are
comparable in size. However, in this instance, the cgroup that is
targeted by proactive reclaim has almost no files left - they've already
been squeezed out by proactive reclaim earlier - and the remaining anon
pages are hot. Anon rotations cause the priority level to drop to 0,
which results in reclaim targeting all of anon (a lot) and all of file
(almost nothing). By the time reclaim decides to bail, it has scanned
most or all of the file target, and therefor must also scan most or all of
the enormous anon target. This target is thousands of times larger than
the reclaim goal, thus causing the overreclaim.
The bailout code hasn't changed in years, why is this failing now? The
most likely explanations are two other recent changes in anon reclaim:
1. Before the series starting with commit 5df741963d52 ("mm: fix LRU
balancing effect of new transparent huge pages"), the VM was
overall relatively reluctant to swap at all, even if swap was
configured. This means the LRU balancing code didn't come into play
as often as it does now, and mostly in high pressure situations
where pronounced swap activity wouldn't be as surprising.
2. For historic reasons, shrink_lruvec() loops on the scan targets of
all LRU lists except the active anon one, meaning it would bail if
the only remaining pages to scan were active anon - even if there
were a lot of them.
Before the series starting with commit ccc5dc67340c ("mm/vmscan:
make active/inactive ratio as 1:1 for anon lru"), most anon pages
would live on the active LRU; the inactive one would contain only a
handful of preselected reclaim candidates. After the series, anon
gets aged similarly to file, and the inactive list is the default
for new anon pages as well, making it often the much bigger list.
As a result, the VM is now more likely to actually finish large
anon targets than before.
Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
larger LRU lists is made before bailing out on a met reclaim goal.
This fixes the extreme overreclaim problem.
Fairness is more subtle and harder to evaluate. No obvious misbehavior
was observed on the test workload, in any case. Conceptually, fairness
should primarily be a cumulative effect from regular, lower priority
scans. Once the VM is in trouble and needs to escalate scan targets to
make forward progress, fairness needs to take a backseat. This is also
acknowledged by the myriad exceptions in get_scan_count(). This patch
makes fairness decrease gradually, as it keeps fairness work static over
increasing priority levels with growing scan targets. This should make
more sense - although we may have to re-visit the exact values.
Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kmalloc() redzone improvements by Feng Tang
From cover letter [1]:
kmalloc's API family is critical for mm, and one of its nature is that
it will round up the request size to a fixed one (mostly power of 2).
When user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes
could be allocated, so there is an extra space than what is originally
requested.
This patchset tries to extend the redzone sanity check to the extra
kmalloced buffer than requested, to better detect un-legitimate access
to it. (depends on SLAB_STORE_USER & SLAB_RED_ZONE)
[1] https://lore.kernel.org/all/20221021032405.1825078-1-feng.tang@intel.com/
A series by myself to reorder fields in struct slab to allow the
embedded rcu_head to grow (for debugging purposes). Requires changes to
isolate_movable_page() to skip slab pages which can otherwise become
false-positive __PageMovable due to its use of low bits in
page->mapping.
- Two patches for SLUB's sysfs by Rasmus Villemoes to remove dead code
and optimize boot time with late initialization.
- Allow SLUB's sysfs 'failslab' parameter to be runtime-controllable
again as it can be both useful and safe, by Alexander Atanasov.
A patch from Jiri Kosina that makes SLAB's list_lock a raw_spinlock_t.
While there are no plans to make SLAB actually compatible with
PREEMPT_RT or any other future, it makes !PREEMPT_RT lockdep happy.
- Removal of dead code from deactivate_slab() by Hyeonggon Yoo.
- Fix of BUILD_BUG_ON() for sufficient early percpu size by Baoquan He.
- Make kmem_cache_alloc() kernel-doc less misleading, by myself.
Joel reports [1] that increasing the rcu_head size for debugging
purposes used to work before struct slab was split from struct page, but
now runs into the various SLAB_MATCH() sanity checks of the layout.
This is because the rcu_head in struct page is in union with large
sub-structures and has space to grow without exceeding their size, while
in struct slab (for SLAB and SLUB) it's in union only with a list_head.
On closer inspection (and after the previous patch) we can put all
fields except slab_cache to a union with rcu_head, as slab_cache is
sufficient for the rcu freeing callbacks to work and the rest can be
overwritten by rcu_head without causing issues.
This is only somewhat complicated by the need to keep SLUB's
freelist+counters aligned for cmpxchg_double. As a result the fields
need to be reordered so that slab_cache is first (after page flags) and
the union with rcu_head follows. For consistency, do that for SLAB as
well, although not necessary there.
As a result, the rcu_head field in struct page and struct slab is no
longer at the same offset, but that doesn't matter as there is no
casting that would rely on that in the slab freeing callbacks, so we can
just drop the respective SLAB_MATCH() check.
Also we need to update the SLAB_MATCH() for compound_head to reflect the
new ordering.
While at it, also add a static_assert to check the alignment needed for
cmpxchg_double so mistakes are found sooner than a runtime GPF.
[1] https://lore.kernel.org/all/85afd876-d8bb-0804-b2c5-48ed3055e702@joelfernandes.org/
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
In the next commit we want to rearrange struct slab fields to allow a larger
rcu_head. Afterwards, the page->mapping field will overlap with SLUB's "struct
list_head slab_list", where the value of prev pointer can become LIST_POISON2,
which is 0x122 + POISON_POINTER_DELTA. Unfortunately the bit 1 being set can
confuse PageMovable() to be a false positive and cause a GPF as reported by lkp
[1].
To fix this, make isolate_movable_page() skip pages with the PageSlab flag set.
This is a bit tricky as we need to add memory barriers to SLAB and SLUB's page
allocation and freeing, and their counterparts to isolate_movable_page().
Based on my RFC from [2]. Added a comment update from Matthew's variant in [3]
and, as done there, moved the PageSlab checks to happen before trying to take
the page lock.
[1] https://lore.kernel.org/all/208c1757-5edd-fd42-67d4-1940cc43b50f@intel.com/
[2] https://lore.kernel.org/all/aec59f53-0e53-1736-5932-25407125d4d4@suse.cz/
[3] https://lore.kernel.org/all/YzsVM8eToHUeTP75@casper.infradead.org/
Reported-by: kernel test robot <yujie.liu@intel.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Alexander reports an issue with the kmem_cache_alloc() comment in
mm/slab.c:
> The current comment mentioned that the flags only matters if the
> cache has no available objects. It's different for the __GFP_ZERO
> flag which will ensure that the returned object is always zeroed
> in any case.
> I have the feeling I run into this question already two times if
> the user need to zero the object or not, but the user does not need
> to zero the object afterwards. However another use of __GFP_ZERO
> and only zero the object if the cache has no available objects would
> also make no sense.
and suggests thus mentioning __GFP_ZERO as the exception. But on closer
inspection, the part about flags being only relevant if cache has no
available objects is misleading. The slab user has no reliable way to
determine if there are available objects, and e.g. the might_sleep()
debug check can be performed even if objects are available, so passing
correct flags given the allocation context always matters.
Thus remove that sentence completely, and while at it, move the comment
to from SLAB-specific mm/slab.c to the common include/linux/slab.h
The comment otherwise refers flags description for kmalloc(), so add
__GFP_ZERO comment there and remove a very misleading GFP_HIGHUSER
(not applicable to slab) description from there. Mention kzalloc() and
kmem_cache_zalloc() shortcuts.
Reported-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/all/20221011145413.8025-1-aahringo@redhat.com/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
SLUB allocator relies on percpu allocator to initialize its ->cpu_slab
during early boot. For that, the dynamic chunk of percpu which serves
the early allocation need be large enough to satisfy the kmalloc
creation.
However, the current BUILD_BUG_ON() in alloc_kmem_cache_cpus() doesn't
consider the kmalloc array with NR_KMALLOC_TYPES length. Fix that
with correct calculation.
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
These cases were done with this Coccinelle:
@@
expression H;
expression L;
@@
- (get_random_u32_below(H) + L)
+ get_random_u32_inclusive(L, H + L - 1)
@@
expression H;
expression L;
expression E;
@@
get_random_u32_inclusive(L,
H
- + E
- - E
)
@@
expression H;
expression L;
expression E;
@@
get_random_u32_inclusive(L,
H
- - E
- + E
)
@@
expression H;
expression L;
expression E;
expression F;
@@
get_random_u32_inclusive(L,
H
- - E
+ F
- + E
)
@@
expression H;
expression L;
expression E;
expression F;
@@
get_random_u32_inclusive(L,
H
- + E
+ F
- - E
)
And then subsequently cleaned up by hand, with several automatic cases
rejected if it didn't make sense contextually.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
This is a simple mechanical transformation done by:
@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Current release - regressions:
- tls: fix memory leak in tls_enc_skb() and tls_sw_fallback_init()
Previous releases - regressions:
- bridge: fix memory leaks when changing VLAN protocol
- dsa: make dsa_master_ioctl() see through port_hwtstamp_get() shims
- dsa: don't leak tagger-owned storage on switch driver unbind
- eth: mlxsw: avoid warnings when not offloaded FDB entry with IPv6 is removed
- eth: stmmac: ensure tx function is not running in stmmac_xdp_release()
- eth: hns3: fix return value check bug of rx copybreak
Previous releases - always broken:
- kcm: close race conditions on sk_receive_queue
- bpf: fix alignment problem in bpf_prog_test_run_skb()
- bpf: fix writing offset in case of fault in strncpy_from_kernel_nofault
- eth: macvlan: use built-in RCU list checking
- eth: marvell: add sleep time after enabling the loopback bit
- eth: octeon_ep: fix potential memory leak in octep_device_setup()
Misc:
- tcp: configurable source port perturb table size
- bpf: Convert BPF_DISPATCHER to use static_call() (not ftrace)
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmN2FlMSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkWAwQAJcV7XEB7bEssgabFkEmC4uvS/sFlyHC
uSwFRn5ojaB2c56T1CnNYmitg9Wr4arC6Vca28iai6BgqB6t4qLRI/WWTsZiEPhi
mt/pjNN2u9JMyaafHFHYfXnbSDWRF7kPMpNw4l3uL0vkGyjSI7LGAOP4Qh8C1h/d
tNVSDZnj4Laj/3JtDf7Rk6ydCqPYnNdWxFfoZ/SQkjYZKD3Ze9tml7WJykAzCTLp
yUiPC6TvHOnWIZYbB04sVVOQD4V+95TSOgEhB6wzs/CXB7iBEY+N+oCedjP9Xrfw
n3ea4anBoTleDnJXJI57LhdJBkyoXncfbpbYLwXljyIgosr7XVTALvOG8XUhg/DW
FzN5DWQ54jzTsx2eXFJzjQQcDIgyxazk9EdoHdqF8byCasP+fofq1JvzyqtvNSyh
h8Ps6jdMZrWpXuFDVApXUhP32A/+9q+dFSYHJO681m6mf4CIaUXdm4aB1dkxDAvg
PSlk797U94RQCzJgqxhrgsq1PGQPBb+qadZrAiD3aQi26g0NWCTg7uFpCeCEK2ZF
fLwc2XxrwLQm1q7xQVoEg4UxPIIf0mUesvOD9sLDYop0rFIw8x0v7jdYM4kyhN3o
6FWAXKxBe3LJ9jTTsVTbZbfHYpTnS8Q2KSclBN+/dZNHwwsUPHjy17Z2Ct3o3Jlm
lNbiiD30BgsD
=vVJk
-----END PGP SIGNATURE-----
Merge tag 'net-6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from bpf.
Current release - regressions:
- tls: fix memory leak in tls_enc_skb() and tls_sw_fallback_init()
Previous releases - regressions:
- bridge: fix memory leaks when changing VLAN protocol
- dsa: make dsa_master_ioctl() see through port_hwtstamp_get() shims
- dsa: don't leak tagger-owned storage on switch driver unbind
- eth: mlxsw: avoid warnings when not offloaded FDB entry with IPv6
is removed
- eth: stmmac: ensure tx function is not running in
stmmac_xdp_release()
- eth: hns3: fix return value check bug of rx copybreak
Previous releases - always broken:
- kcm: close race conditions on sk_receive_queue
- bpf: fix alignment problem in bpf_prog_test_run_skb()
- bpf: fix writing offset in case of fault in
strncpy_from_kernel_nofault
- eth: macvlan: use built-in RCU list checking
- eth: marvell: add sleep time after enabling the loopback bit
- eth: octeon_ep: fix potential memory leak in octep_device_setup()
Misc:
- tcp: configurable source port perturb table size
- bpf: Convert BPF_DISPATCHER to use static_call() (not ftrace)"
* tag 'net-6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits)
net: use struct_group to copy ip/ipv6 header addresses
net: usb: smsc95xx: fix external PHY reset
net: usb: qmi_wwan: add Telit 0x103a composition
netdevsim: Fix memory leak of nsim_dev->fa_cookie
tcp: configurable source port perturb table size
l2tp: Serialize access to sk_user_data with sk_callback_lock
net: thunderbolt: Fix error handling in tbnet_init()
net: microchip: sparx5: Fix potential null-ptr-deref in sparx_stats_init() and sparx5_start()
net: lan966x: Fix potential null-ptr-deref in lan966x_stats_init()
net: dsa: don't leak tagger-owned storage on switch driver unbind
net/x25: Fix skb leak in x25_lapb_receive_frame()
net: ag71xx: call phylink_disconnect_phy if ag71xx_hw_enable() fail in ag71xx_open()
bridge: switchdev: Fix memory leaks when changing VLAN protocol
net: hns3: fix setting incorrect phy link ksettings for firmware in resetting process
net: hns3: fix return value check bug of rx copybreak
net: hns3: fix incorrect hw rss hash type of rx packet
net: phy: marvell: add sleep time after enabling the loopback bit
net: ena: Fix error handling in ena_init()
kcm: close race conditions on sk_receive_queue
net: ionic: Fix error handling in ionic_init_module()
...