20007 Commits

Author SHA1 Message Date
Christophe Leroy
55d77bae73 kasan: fix Oops due to missing calls to kasan_arch_is_ready()
On powerpc64, you can build a kernel with KASAN as soon as you build it
with RADIX MMU support.  However if the CPU doesn't have RADIX MMU, KASAN
isn't enabled at init and the following Oops is encountered.

  [    0.000000][    T0] KASAN not enabled as it requires radix!

  [    4.484295][   T26] BUG: Unable to handle kernel data access at 0xc00e000000804a04
  [    4.485270][   T26] Faulting instruction address: 0xc00000000062ec6c
  [    4.485748][   T26] Oops: Kernel access of bad area, sig: 11 [#1]
  [    4.485920][   T26] BE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
  [    4.486259][   T26] Modules linked in:
  [    4.486637][   T26] CPU: 0 PID: 26 Comm: kworker/u2:2 Not tainted 6.2.0-rc3-02590-gf8a023b0a805 #249
  [    4.486907][   T26] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1200 0xf000005 of:SLOF,HEAD pSeries
  [    4.487445][   T26] Workqueue: eval_map_wq .tracer_init_tracefs_work_func
  [    4.488744][   T26] NIP:  c00000000062ec6c LR: c00000000062bb84 CTR: c0000000002ebcd0
  [    4.488867][   T26] REGS: c0000000049175c0 TRAP: 0380   Not tainted  (6.2.0-rc3-02590-gf8a023b0a805)
  [    4.489028][   T26] MSR:  8000000002009032 <SF,VEC,EE,ME,IR,DR,RI>  CR: 44002808  XER: 00000000
  [    4.489584][   T26] CFAR: c00000000062bb80 IRQMASK: 0
  [    4.489584][   T26] GPR00: c0000000005624d4 c000000004917860 c000000001cfc000 1800000000804a04
  [    4.489584][   T26] GPR04: c0000000003a2650 0000000000000cc0 c00000000000d3d8 c00000000000d3d8
  [    4.489584][   T26] GPR08: c0000000049175b0 a80e000000000000 0000000000000000 0000000017d78400
  [    4.489584][   T26] GPR12: 0000000044002204 c000000003790000 c00000000435003c c0000000043f1c40
  [    4.489584][   T26] GPR16: c0000000043f1c68 c0000000043501a0 c000000002106138 c0000000043f1c08
  [    4.489584][   T26] GPR20: c0000000043f1c10 c0000000043f1c20 c000000004146c40 c000000002fdb7f8
  [    4.489584][   T26] GPR24: c000000002fdb834 c000000003685e00 c000000004025030 c000000003522e90
  [    4.489584][   T26] GPR28: 0000000000000cc0 c0000000003a2650 c000000004025020 c000000004025020
  [    4.491201][   T26] NIP [c00000000062ec6c] .kasan_byte_accessible+0xc/0x20
  [    4.491430][   T26] LR [c00000000062bb84] .__kasan_check_byte+0x24/0x90
  [    4.491767][   T26] Call Trace:
  [    4.491941][   T26] [c000000004917860] [c00000000062ae70] .__kasan_kmalloc+0xc0/0x110 (unreliable)
  [    4.492270][   T26] [c0000000049178f0] [c0000000005624d4] .krealloc+0x54/0x1c0
  [    4.492453][   T26] [c000000004917990] [c0000000003a2650] .create_trace_option_files+0x280/0x530
  [    4.492613][   T26] [c000000004917a90] [c000000002050d90] .tracer_init_tracefs_work_func+0x274/0x2c0
  [    4.492771][   T26] [c000000004917b40] [c0000000001f9948] .process_one_work+0x578/0x9f0
  [    4.492927][   T26] [c000000004917c30] [c0000000001f9ebc] .worker_thread+0xfc/0x950
  [    4.493084][   T26] [c000000004917d60] [c00000000020be84] .kthread+0x1a4/0x1b0
  [    4.493232][   T26] [c000000004917e10] [c00000000000d3d8] .ret_from_kernel_thread+0x58/0x60
  [    4.495642][   T26] Code: 60000000 7cc802a6 38a00000 4bfffc78 60000000 7cc802a6 38a00001 4bfffc68 60000000 3d20a80e 7863e8c2 792907c6 <7c6348ae> 20630007 78630fe0 68630001
  [    4.496704][   T26] ---[ end trace 0000000000000000 ]---

The Oops is due to kasan_byte_accessible() not checking the readiness of
KASAN.  Add missing call to kasan_arch_is_ready() and bail out when not
ready.  The same problem is observed with ____kasan_kfree_large() so fix
it the same.

Also, as KASAN is not available and no shadow area is allocated for linear
memory mapping, there is no point in allocating shadow mem for vmalloc
memory as shown below in /sys/kernel/debug/kernel_page_tables

  ---[ kasan shadow mem start ]---
  0xc00f000000000000-0xc00f00000006ffff  0x00000000040f0000       448K         r  w       pte  valid  present        dirty  accessed
  0xc00f000000860000-0xc00f00000086ffff  0x000000000ac10000        64K         r  w       pte  valid  present        dirty  accessed
  0xc00f3ffffffe0000-0xc00f3fffffffffff  0x0000000004d10000       128K         r  w       pte  valid  present        dirty  accessed
  ---[ kasan shadow mem end ]---

So, also verify KASAN readiness before allocating and poisoning
shadow mem for VMAs.

Link: https://lkml.kernel.org/r/150768c55722311699fdcf8f5379e8256749f47d.1674716617.git.christophe.leroy@csgroup.eu
Fixes: 41b7a347bf14 ("powerpc: Book3S 64-bit outline-only KASAN support")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reported-by: Nathan Lynch <nathanl@linux.ibm.com>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: <stable@vger.kernel.org>	[5.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09 15:56:50 -08:00
Thomas Gleixner
f5451547b8 mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early
The memory allocators are available during early boot even in the phase
where interrupts are disabled and scheduling is not yet possible.

The setup is so that GFP_KERNEL allocations work in this phase without
causing might_alloc() splats to be emitted because the system state is
SYSTEM_BOOTING at that point which prevents the warnings to trigger.

Most allocation/free functions use local_irq_save()/restore() or a lock
variant of that. But kmem_cache_alloc_bulk() and kmem_cache_free_bulk() use
local_[lock]_irq_disable()/enable(), which leads to a lockdep warning when
interrupts are enabled during the early boot phase.

This went unnoticed so far as there are no early users of these
interfaces. The upcoming conversion of the interrupt descriptor store from
radix_tree to maple_tree triggered this warning as maple_tree uses the bulk
interface.

Cure this by moving the kmem_cache_alloc/free() bulk variants of SLUB and
SLAB to local[_lock]_irq_save()/restore().

There is obviously no reclaim possible and required at this point so there
is no need to expand this coverage further.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-02-08 17:50:04 +01:00
Aaron Thompson
647037adca Revert "mm: Always release pages to the buddy allocator in memblock_free_late()."
This reverts commit 115d9d77bb0f9152c60b6e8646369fa7f6167593.

The pages being freed by memblock_free_late() have already been
initialized, but if they are in the deferred init range,
__free_one_page() might access nearby uninitialized pages when trying to
coalesce buddies. This can, for example, trigger this BUG:

  BUG: unable to handle page fault for address: ffffe964c02580c8
  RIP: 0010:__list_del_entry_valid+0x3f/0x70
   <TASK>
   __free_one_page+0x139/0x410
   __free_pages_ok+0x21d/0x450
   memblock_free_late+0x8c/0xb9
   efi_free_boot_services+0x16b/0x25c
   efi_enter_virtual_mode+0x403/0x446
   start_kernel+0x678/0x714
   secondary_startup_64_no_verify+0xd2/0xdb
   </TASK>

A proper fix will be more involved so revert this change for the time
being.

Fixes: 115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().")
Signed-off-by: Aaron Thompson <dev@aaront.org>
Link: https://lore.kernel.org/r/20230207082151.1303-1-dev@aaront.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2023-02-07 13:07:37 +02:00
Greg Kroah-Hartman
aa4a86055b mm/slub: fix memory leak with using debugfs_lookup()
When calling debugfs_lookup() the result must have dput() called on it,
otherwise the memory will leak over time.  To make things simpler, just
call debugfs_lookup_and_remove() instead which handles all of the logic
at once.

Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-02-06 16:35:09 +01:00
Kuan-Ying Lee
aa1e6a932c mm/gup: add folio to list when folio_isolate_lru() succeed
If we call folio_isolate_lru() successfully, we will get return value 0. 
We need to add this folio to the movable_pages_list.

Link: https://lkml.kernel.org/r/20230131063206.28820-1-Kuan-Ying.Lee@mediatek.com
Fixes: 67e139b02d99 ("mm/gup.c: refactor check_and_migrate_movable_pages()")
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Andrew Yang <andrew.yang@mediatek.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-03 17:52:24 -08:00
Linus Torvalds
0c272a1d33 25 hotfixes, mainly for MM. 13 are cc:stable.
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY9x+swAKCRDdBJ7gKXxA
 joPwAP95XqB7gzy2l1Mc++Ta7Ih0fS34Pj1vTAxwsRQnqzr6rwD/QOt3YU9KgXpy
 D7Fp8NnaQZq6m5o8cvV5+fBqA3uarAM=
 =IIB8
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-02-02-19-24-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "25 hotfixes, mainly for MM.  13 are cc:stable"

* tag 'mm-hotfixes-stable-2023-02-02-19-24-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (26 commits)
  mm: memcg: fix NULL pointer in mem_cgroup_track_foreign_dirty_slowpath()
  Kconfig.debug: fix the help description in SCHED_DEBUG
  mm/swapfile: add cond_resched() in get_swap_pages()
  mm: use stack_depot_early_init for kmemleak
  Squashfs: fix handling and sanity checking of xattr_ids count
  sh: define RUNTIME_DISCARD_EXIT
  highmem: round down the address passed to kunmap_flush_on_unmap()
  migrate: hugetlb: check for hugetlb shared PMD in node migration
  mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
  mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
  Revert "mm: kmemleak: alloc gray object for reserved region with direct map"
  freevxfs: Kconfig: fix spelling
  maple_tree: should get pivots boundary by type
  .mailmap: update e-mail address for Eugen Hristev
  mm, mremap: fix mremap() expanding for vma's with vm_ops->close()
  squashfs: harden sanity check in squashfs_read_xattr_id_table
  ia64: fix build error due to switch case label appearing next to declaration
  mm: multi-gen LRU: fix crash during cgroup migration
  Revert "mm: add nodes= arg to memory.reclaim"
  zsmalloc: fix a race with deferred_handles storing
  ...
2023-02-03 10:01:57 -08:00
Christoph Hellwig
8976fa6d79 swap: use bvec_set_page to initialize bvecs
Use the bvec_set_page helper to initialize bvecs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230203150634.3199647-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 10:17:42 -07:00
Christoph Hellwig
f05837ed73 blk-cgroup: store a gendisk to throttle in struct task_struct
Switch from a request_queue pointer and reference to a gendisk once
for the throttle information in struct task_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Link: https://lore.kernel.org/r/20230203150400.3199230-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Matthew Wilcox (Oracle)
d585bdbeb7 fs: convert writepage_t callback to pass a folio
Patch series "Convert writepage_t to use a folio".

More folioisation.  I split out the mpage work from everything else
because it completely dominated the patch, but some implementations I just
converted outright.


This patch (of 2):

We always write back an entire folio, but that's currently passed as the
head page.  Convert all filesystems that use write_cache_pages() to expect
a folio instead of a page.

Link: https://lkml.kernel.org/r/20230126201255.1681189-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230126201255.1681189-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:34 -08:00
Christoph Hellwig
3222d8c2a7 block: remove ->rw_page
The ->rw_page method is a special purpose bypass of the usual bio handling
path that is limited to single-page reads and writes and synchronous which
causes a lot of extra code in the drivers, callers and the block layer.

The only remaining user is the MM swap code.  Switch that swap code to
simply submit a single-vec on-stack bio an synchronously wait on it based
on a newly added QUEUE_FLAG_SYNCHRONOUS flag set by the drivers that
currently implement ->rw_page instead.  While this touches one extra cache
line and executes extra code, it simplifies the block layer and drivers
and ensures that all feastures are properly supported by all drivers, e.g.
right now ->rw_page bypassed cgroup writeback entirely.

[akpm@linux-foundation.org: fix comment typo, per Dan]
Link: https://lkml.kernel.org/r/20230125133436.447864-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:34 -08:00
Christoph Hellwig
05cda97ecb mm: factor out a swap_writepage_bdev helper
Split the block device case from swap_readpage into a separate helper,
following the abstraction for file based swap.

Link: https://lkml.kernel.org/r/20230125133436.447864-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:33 -08:00
Christoph Hellwig
e3e2762bd3 mm: remove the __swap_writepage return value
__swap_writepage always returns 0.

Link: https://lkml.kernel.org/r/20230125133436.447864-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:33 -08:00
Christoph Hellwig
9b4e30bd73 mm: use an on-stack bio for synchronous swapin
Optimize the synchronous swap in case by using an on-stack bio instead of
allocating one using bio_alloc.

Link: https://lkml.kernel.org/r/20230125133436.447864-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:33 -08:00
Christoph Hellwig
14bd75f574 mm: factor out a swap_readpage_bdev helper
Split the block device case from swap_readpage into a separate helper,
following the abstraction for file based swap and frontswap.

Link: https://lkml.kernel.org/r/20230125133436.447864-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:33 -08:00
Christoph Hellwig
a8c1408f87 mm: remove the swap_readpage return value
swap_readpage always returns 0, and no caller checks the return value.

[akpm@linux-foundation.org: fix void-returning swap_readpage() stub, per Keith]
Link: https://lkml.kernel.org/r/20230125133436.447864-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:33 -08:00
Christoph Hellwig
9e5fa0ae52 mm: refactor va_remove_mappings
Move the VM_FLUSH_RESET_PERMS to the caller and rename the function to
better describe what it is doing.

Link: https://lkml.kernel.org/r/20230121071051.1143058-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:32 -08:00
Christoph Hellwig
79311c1fe0 mm: split __vunmap
vunmap only needs to find and free the vmap_area and vm_strut, so open
code that there and merge the rest of the code into vfree.

Link: https://lkml.kernel.org/r/20230121071051.1143058-10-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:32 -08:00
Christoph Hellwig
17d3ef432d mm: move debug checks from __vunmap to remove_vm_area
All these checks apply to the free_vm_area interface as well, so move them
to the common routine.

Link: https://lkml.kernel.org/r/20230121071051.1143058-9-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:32 -08:00
Christoph Hellwig
75c59ce74e mm: use remove_vm_area in __vunmap
Use the common helper to find and remove a vmap_area instead of open
coding it.

Link: https://lkml.kernel.org/r/20230121071051.1143058-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:32 -08:00
Christoph Hellwig
39e65b7f63 mm: move __remove_vm_area out of va_remove_mappings
__remove_vm_area is the only part of va_remove_mappings that requires a
vmap_area.  Move the call out to the caller and only pass the vm_struct to
va_remove_mappings.

Link: https://lkml.kernel.org/r/20230121071051.1143058-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:31 -08:00
Christoph Hellwig
5d3d31d6fb mm: call vfree instead of __vunmap from delayed_vfree_work
This adds an extra, never taken, in_interrupt() branch, but will allow to
cut down the maze of vfree helpers.

Link: https://lkml.kernel.org/r/20230121071051.1143058-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:31 -08:00
Christoph Hellwig
208162f42f mm: move vmalloc_init and free_work down in vmalloc.c
Move these two functions around a bit to avoid forward declarations.

Link: https://lkml.kernel.org/r/20230121071051.1143058-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:31 -08:00
Christoph Hellwig
01e2e8394a mm: remove __vfree_deferred
Fold __vfree_deferred into vfree_atomic, and call vfree_atomic early on
from vfree if called from interrupt context so that the extra low-level
helper can be avoided.

Link: https://lkml.kernel.org/r/20230121071051.1143058-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:31 -08:00
Christoph Hellwig
f41f036b80 mm: remove __vfree
__vfree is a subset of vfree that just skips a few checks, and which is
only used by vfree and an error cleanup path.  Fold __vfree into vfree and
switch the only other caller to call vfree() instead.

Link: https://lkml.kernel.org/r/20230121071051.1143058-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:30 -08:00
Christoph Hellwig
37f3605e5e mm: reject vmap with VM_FLUSH_RESET_PERMS
Patch series "cleanup vfree and vunmap".

This little series untangles the vfree and vunmap code path a bit.


This patch (of 10):

VM_FLUSH_RESET_PERMS is just for use with vmalloc as it is tied to freeing
the underlying pages.

Link: https://lkml.kernel.org/r/20230121071051.1143058-1-hch@lst.de
Link: https://lkml.kernel.org/r/20230121071051.1143058-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:30 -08:00
Mel Gorman
cfccd2e63e mm, compaction: finish pageblocks on complete migration failure
Commit 7efc3b726103 ("mm/compaction: fix set skip in
fast_find_migrateblock") address an issue where a pageblock selected by
fast_find_migrateblock() was ignored.  Unfortunately, the same fix
resulted in numerous reports of khugepaged or kcompactd stalling for long
periods of time or consuming 100% of CPU.

Tracing showed that there was a lot of rescanning between a small subset
of pageblocks because the conditions for marking the block skip are not
met.  The scan is not reaching the end of the pageblock because enough
pages were isolated but none were migrated successfully.  Eventually it
circles back to the same block.

Pageblock skip tracking tries to minimise both latency and excessive
scanning but tracking exactly when a block is fully scanned requires an
excessive amount of state.  This patch forcibly rescans a pageblock when
all isolated pages fail to migrate even though it could be for transient
reasons such as page writeback or page dirty.  This will sometimes migrate
too many pages but pageblocks will be marked skip and forward progress
will be made.

"Usemen" from the mmtests configuration
workload-usemem-stress-numa-compact was used to stress compaction.  The
compaction trace events were recorded using a 6.2-rc5 kernel that includes
commit 7efc3b726103 and count of unique ranges were measured.  The top 5
ranges were

   3076 range=(0x10ca00-0x10cc00)
   3076 range=(0x110a00-0x110c00)
   3098 range=(0x13b600-0x13b800)
   3104 range=(0x141c00-0x141e00)
  11424 range=(0x11b600-0x11b800)

While this workload is very different than what the bugs reported, the
pattern of the same subset of blocks being repeatedly scanned is observed.
At one point, *only* the range range=(0x11b600 ~ 0x11b800) was scanned
for 2 seconds.  14 seconds passed between the first migration-related
event and the last.

With the series applied including this patch, the top 5 ranges were

      1 range=(0x11607e-0x116200)
      1 range=(0x116200-0x116278)
      1 range=(0x116278-0x116400)
      1 range=(0x116400-0x116424)
      1 range=(0x116424-0x116600)

Only unique ranges were scanned and the time between the first
migration-related event was 0.11 milliseconds.

Link: https://lkml.kernel.org/r/20230125134434.18017-5-mgorman@techsingularity.net
Fixes: 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Chuyi Zhou <zhouchuyi@bytedance.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:30 -08:00
Mel Gorman
f9d7fc1ae3 mm, compaction: finish scanning the current pageblock if requested
cc->finish_pageblock is set when the current pageblock should be rescanned
but fast_find_migrateblock can select an alternative block.  Disable
fast_find_migrateblock when the current pageblock scan should be
completed.

Link: https://lkml.kernel.org/r/20230125134434.18017-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Chuyi Zhou <zhouchuyi@bytedance.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:30 -08:00
Mel Gorman
16b3be4034 mm, compaction: check if a page has been captured before draining PCP pages
If a page has been captured then draining is unnecssary so check first for
a captured page.

Link: https://lkml.kernel.org/r/20230125134434.18017-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Chuyi Zhou <zhouchuyi@bytedance.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:30 -08:00
Mel Gorman
48731c8436 mm, compaction: rename compact_control->rescan to finish_pageblock
Patch series "Fix excessive CPU usage during compaction".

Commit 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
fixed a problem where pageblocks found by fast_find_migrateblock() were
ignored. Unfortunately there were numerous bug reports complaining about high
CPU usage and massive stalls once 6.1 was released. Due to the severity,
the patch was reverted by Vlastimil as a short-term fix[1] to -stable.		

The underlying problem for each of the bugs is suspected to be the
repeated scanning of the same pageblocks.  This series should guarantee
forward progress even with commit 7efc3b726103.  More information is in
the changelog for patch 4.

[1] http://lore.kernel.org/r/20230113173345.9692-1-vbabka@suse.cz


This patch (of 4):

The rescan field was not well named albeit accurate at the time.  Rename
the field to finish_pageblock to indicate that the remainder of the
pageblock should be scanned regardless of COMPACT_CLUSTER_MAX.  The intent
is that pageblocks with transient failures get marked for skipping to
avoid revisiting the same pageblock.

Link: https://lkml.kernel.org/r/20230125134434.18017-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Chuyi Zhou <zhouchuyi@bytedance.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:29 -08:00
Jongwoo Han
c5acf1f6f0 mm/gup.c: fix typo in comments
Link: https://lkml.kernel.org/r/20230125180847.4542-1-jongwooo.han@gmail.com
Signed-off-by: Jongwoo Han <jongwooo.han@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:29 -08:00
Andrey Konovalov
420ef683b5 kasan: reset page tags properly with sampling
The implementation of page_alloc poisoning sampling assumed that
tag_clear_highpage resets page tags for __GFP_ZEROTAGS allocations. 
However, this is no longer the case since commit 70c248aca9e7 ("mm: kasan:
Skip unpoisoning of user pages").

This leads to kernel crashes when MTE-enabled userspace mappings are used
with Hardware Tag-Based KASAN enabled.

Reset page tags for __GFP_ZEROTAGS allocations in post_alloc_hook().

Also clarify and fix related comments.

[andreyknvl@google.com: update comment]
 Link: https://lkml.kernel.org/r/5dbd866714b4839069e2d8469ac45b60953db290.1674592780.git.andreyknvl@google.com
Link: https://lkml.kernel.org/r/24ea20c1b19c2b4b56cf9f5b354915f8dbccfc77.1674592496.git.andreyknvl@google.com
Fixes: 44383cef54c0 ("kasan: allow sampling page_alloc allocations for HW_TAGS")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Peter Collingbourne <pcc@google.com>
Tested-by: Peter Collingbourne <pcc@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:29 -08:00
Mike Rapoport
2e126aa290 mm/sparse: fix "unused function 'pgdat_to_phys'" warning
W=1 build with clangs complains:

mm/sparse.c:347:27: warning: unused function 'pgdat_to_phys' [-Wunused-function]
static inline phys_addr_t pgdat_to_phys(struct pglist_data *pgdat)
                             ^
1 warning generated.

pgdat_to_phys() is only used by functions defined when
CONFIG_MEMORY_HOTREMOVE=y.

Move pgdat_to_phys() under #ifdef CONFIG_MEMORY_HOTREMOVE
to make clang happy.

Link: https://lkml.kernel.org/r/20230121101151.1703292-1-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
  Link: https://lore.kernel.org/all/202301210155.1E5zABb5-lkp@intel.com
Cc: Miles Chen <miles.chen@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:29 -08:00
Hyeonggon Yoo
05a4219955 mm/page_owner: record single timestamp value for high order allocations
When allocating a high-order page, separate allocation timestamp is
recorded for each sub-page resulting in different timestamp values between
them.

This behavior is not consistent with the behavior when recording free
timestamp and caused confusion when analyzing memory dumps.  Record single
timestamp for the entire allocation, aligning with the behavior for free
timestamps.

Link: https://lkml.kernel.org/r/20230121165054.520507-1-42.hyeyoo@gmail.com
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:28 -08:00
Jiaqi Yan
18f41fa616 mm: memory-failure: bump memory failure stats to pglist_data
Right before memory_failure finishes its handling, accumulate poisoned
page's resolution counters to pglist_data's memory_failure_stats, so as to
update the corresponding sysfs entries.

Tested:
1) Start an application to allocate memory buffer chunks
2) Convert random memory buffer addresses to physical addresses
3) Inject memory errors using EINJ at chosen physical addresses
4) Access poisoned memory buffer and recover from SIGBUS
5) Check counter values under
   /sys/devices/system/node/node*/memory_failure/*

Link: https://lkml.kernel.org/r/20230120034622.2698268-3-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:28 -08:00
Jiaqi Yan
44b8f8bf24 mm: memory-failure: add memory failure stats to sysfs
Patch series "Introduce per NUMA node memory error statistics", v2.

Background
==========

In the RFC for Kernel Support of Memory Error Detection [1], one advantage
of software-based scanning over hardware patrol scrubber is the ability to
make statistics visible to system administrators.  The statistics include
2 categories:

* Memory error statistics, for example, how many memory error are
  encountered, how many of them are recovered by the kernel.  Note these
  memory errors are non-fatal to kernel: during the machine check
  exception (MCE) handling kernel already classified MCE's severity to be
  unnecessary to panic (but either action required or optional).

* Scanner statistics, for example how many times the scanner have fully
  scanned a NUMA node, how many errors are first detected by the scanner.

The memory error statistics are useful to userspace and actually not
specific to scanner detected memory errors, and are the focus of this
patchset.

Motivation
==========

Memory error stats are important to userspace but insufficient in kernel
today.  Datacenter administrators can better monitor a machine's memory
health with the visible stats.  For example, while memory errors are
inevitable on servers with 10+ TB memory, starting server maintenance when
there are only 1~2 recovered memory errors could be overreacting; in cloud
production environment maintenance usually means live migrate all the
workload running on the server and this usually causes nontrivial
disruption to the customer.  Providing insight into the scope of memory
errors on a system helps to determine the appropriate follow-up action. 
In addition, the kernel's existing memory error stats need to be
standardized so that userspace can reliably count on their usefulness.

Today kernel provides following memory error info to userspace, but they
are not sufficient or have disadvantages:
* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
  not per NUMA node stats though
* ras:memory_failure_event: only available after explicitly enabled
* /dev/mcelog provides many useful info about the MCEs, but doesn't
  capture how memory_failure recovered memory MCEs
* kernel logs: userspace needs to process log text

Exposing memory error stats is also a good start for the in-kernel memory
error detector.  Today the data source of memory error stats are either
direct memory error consumption, or hardware patrol scrubber detection
(either signaled as UCNA or SRAO).  Once in-kernel memory scanner is
implemented, it will be the main source as it is usually configured to
scan memory DIMMs constantly and faster than hardware patrol scrubber.

How Implemented
===============

As Naoya pointed out [2], exposing memory error statistics to userspace is
useful independent of software or hardware scanner.  Therefore we
implement the memory error statistics independent of the in-kernel memory
error detector.  It exposes the following per NUMA node memory error
counters:

  /sys/devices/system/node/node${X}/memory_failure/total
  /sys/devices/system/node/node${X}/memory_failure/recovered
  /sys/devices/system/node/node${X}/memory_failure/ignored
  /sys/devices/system/node/node${X}/memory_failure/failed
  /sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and after the
attempted recoveries by the kernel, their resolutions: how many are
recovered, ignored, failed, or delayed respectively.  This approach can be
easier to extend for future use cases than /proc/meminfo, trace event, and
log.  The following math holds for the statistics:

* total = recovered + ignored + failed + delayed

These memory error stats are reset during machine boot.

The 1st commit introduces these sysfs entries.  The 2nd commit populates
memory error stats every time memory_failure attempts memory error
recovery.  The 3rd commit adds documentations for introduced stats.

[1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#mc22959244f5388891c523882e61163c6e4d703af
[2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6


This patch (of 3):

Today kernel provides following memory error info to userspace, but each
has its own disadvantage

* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
  not per NUMA node stats though

* ras:memory_failure_event: only available after explicitly enabled

* /dev/mcelog provides many useful info about the MCEs, but
  doesn't capture how memory_failure recovered memory MCEs

* kernel logs: userspace needs to process log text

Exposes per NUMA node memory error stats as sysfs entries:

  /sys/devices/system/node/node${X}/memory_failure/total
  /sys/devices/system/node/node${X}/memory_failure/recovered
  /sys/devices/system/node/node${X}/memory_failure/ignored
  /sys/devices/system/node/node${X}/memory_failure/failed
  /sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and after the
attempted recoveries by the kernel, their resolutions: how many are
recovered, ignored, failed, or delayed respectively.  The following math
holds for the statistics:

* total = recovered + ignored + failed + delayed

Link: https://lkml.kernel.org/r/20230120034622.2698268-1-jiaqiyan@google.com
Link: https://lkml.kernel.org/r/20230120034622.2698268-2-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:28 -08:00
T.J. Alumbaugh
abf086721a mm: multi-gen LRU: simplify lru_gen_look_around()
Update the folio generation in place with or without
current->reclaim_state->mm_walk.  The LRU lock is held for longer, if
mm_walk is NULL and the number of folios to update is more than
PAGEVEC_SIZE.

This causes a measurable regression from the LRU lock contention during a
microbencmark.  But a tiny regression is not worth the complexity.

Link: https://lkml.kernel.org/r/20230118001827.1040870-8-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:28 -08:00
T.J. Alumbaugh
b5ff413361 mm: multi-gen LRU: improve walk_pmd_range()
Improve readability of walk_pmd_range() and walk_pmd_range_locked().

Link: https://lkml.kernel.org/r/20230118001827.1040870-7-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:27 -08:00
T.J. Alumbaugh
37cc99979d mm: multi-gen LRU: improve lru_gen_exit_memcg()
Add warnings and poison ->next.

Link: https://lkml.kernel.org/r/20230118001827.1040870-6-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:27 -08:00
T.J. Alumbaugh
36c7b4db7c mm: multi-gen LRU: section for memcg LRU
Move memcg LRU code into a dedicated section.  Improve the design doc to
outline its architecture.

Link: https://lkml.kernel.org/r/20230118001827.1040870-5-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:27 -08:00
T.J. Alumbaugh
ccbbbb8594 mm: multi-gen LRU: section for Bloom filters
Move Bloom filters code into a dedicated section.  Improve the design doc
to explain Bloom filter usage and connection between aging and eviction in
their use.

Link: https://lkml.kernel.org/r/20230118001827.1040870-4-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:27 -08:00
T.J. Alumbaugh
db19a43d9b mm: multi-gen LRU: section for rmap/PT walk feedback
Add a section for lru_gen_look_around() in the code and the design doc.

Link: https://lkml.kernel.org/r/20230118001827.1040870-3-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:26 -08:00
T.J. Alumbaugh
7b8144e63d mm: multi-gen LRU: section for working set protection
Patch series "mm: multi-gen LRU: improve".

This patch series improves a few MGLRU functions, collects related
functions, and adds additional documentation.


This patch (of 7):

Add a section for working set protection in the code and the design doc. 
The admin doc already contains its usage.

Link: https://lkml.kernel.org/r/20230118001827.1040870-1-talumbau@google.com
Link: https://lkml.kernel.org/r/20230118001827.1040870-2-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:26 -08:00
Zhaoyang Huang
b2db9ef2c0 mm: move KMEMLEAK's Kconfig items from lib to mm
Have the kmemleak's source code and Kconfig items be in the same directory.

Link: https://lkml.kernel.org/r/1674091345-14799-1-git-send-email-zhaoyang.huang@unisoc.com
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: ke.wang <ke.wang@unisoc.com>
Cc: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:26 -08:00
SeongJae Park
f4c978b659 mm/damon/core-test: add a test for damon_update_monitoring_results()
Add a simple unit test for damon_update_monitoring_results() function.

Link: https://lkml.kernel.org/r/20230119013831.1911-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:26 -08:00
SeongJae Park
2f5bef5a59 mm/damon/core: update monitoring results for new monitoring attributes
region->nr_accesses is the number of sampling intervals in the last
aggregation interval that access to the region has found, and region->age
is the number of aggregation intervals that its access pattern has
maintained.  Hence, the real meaning of the two fields' values is
depending on current sampling and aggregation intervals.

This means the values need to be updated for every sampling and/or
aggregation intervals updates.  As DAMON core doesn't, it is a duty of
in-kernel DAMON framework applications like DAMON sysfs interface, or the
userspace users.

Handling it in userspace or in-kernel DAMON application is complicated,
inefficient, and repetitive compared to doing the update in DAMON core. 
Do the update in DAMON core.

Link: https://lkml.kernel.org/r/20230119013831.1911-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:26 -08:00
Waiman Long
782e417953 mm/kmemleak: fix UAF bug in kmemleak_scan()
Commit 6edda04ccc7c ("mm/kmemleak: prevent soft lockup in first object
iteration loop of kmemleak_scan()") fixes soft lockup problem in
kmemleak_scan() by periodically doing a cond_resched().  It does take a
reference of the current object before doing it.  Unfortunately, if the
object has been deleted from the object_list, the next object pointed to
by its next pointer may no longer be valid after coming back from
cond_resched().  This can result in use-after-free and other nasty
problem.

Fix this problem by adding a del_state flag into kmemleak_object structure
to synchronize the object deletion process between kmemleak_cond_resched()
and __remove_object() to make sure that the object remained in the
object_list in the duration of the cond_resched() call.

Link: https://lkml.kernel.org/r/20230119040111.350923-3-longman@redhat.com
Fixes: 6edda04ccc7c ("mm/kmemleak: prevent soft lockup in first object iteration loop of kmemleak_scan()")
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:25 -08:00
Waiman Long
6061e74082 mm/kmemleak: simplify kmemleak_cond_resched() usage
Patch series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF", v2.

It was found that a KASAN use-after-free error was reported in the
kmemleak_scan() function.  After further examination, it is believe that
even though a reference is taken from the current object, it does not
prevent the object pointed to by the next pointer from going away after a
cond_resched().

To fix that, additional flags are added to make sure that the current
object won't be removed from the object_list during the duration of the
cond_resched() to ensure the validity of the next pointer.

While making the change, I also simplify the current usage of
kmemleak_cond_resched() to make it easier to understand.


This patch (of 2):

The presence of a pinned argument and the 64k loop count make
kmemleak_cond_resched() a bit more complex to read.  The pinned argument
is used only by first kmemleak_scan() loop.

Simplify the usage of kmemleak_cond_resched() by removing the pinned
argument and always do a get_object()/put_object() sequence.  In addition,
the 64k loop is removed by using need_resched() to decide if
kmemleak_cond_resched() should be called.

Link: https://lkml.kernel.org/r/20230119040111.350923-1-longman@redhat.com
Link: https://lkml.kernel.org/r/20230119040111.350923-2-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:25 -08:00
Joey Gouly
b507808ebc mm: implement memory-deny-write-execute as a prctl
Patch series "mm: In-kernel support for memory-deny-write-execute (MDWE)",
v2.

The background to this is that systemd has a configuration option called
MemoryDenyWriteExecute [2], implemented as a SECCOMP BPF filter.  Its aim
is to prevent a user task from inadvertently creating an executable
mapping that is (or was) writeable.  Since such BPF filter is stateless,
it cannot detect mappings that were previously writeable but subsequently
changed to read-only.  Therefore the filter simply rejects any
mprotect(PROT_EXEC).  The side-effect is that on arm64 with BTI support
(Branch Target Identification), the dynamic loader cannot change an ELF
section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().  For
libraries, it can resort to unmapping and re-mapping but for the main
executable it does not have a file descriptor.  The original bug report in
the Red Hat bugzilla - [3] - and subsequent glibc workaround for libraries
- [4].

This series adds in-kernel support for this feature as a prctl
PR_SET_MDWE, that is inherited on fork().  The prctl denies PROT_WRITE |
PROT_EXEC mappings.  Like the systemd BPF filter it also denies adding
PROT_EXEC to mappings.  However unlike the BPF filter it only denies it if
the mapping didn't previous have PROT_EXEC.  This allows to PROT_EXEC ->
PROT_EXEC | PROT_BTI with mprotect(), which is a problem with the BPF
filter.


This patch (of 2):

The aim of such policy is to prevent a user task from creating an
executable mapping that is also writeable.

An example of mmap() returning -EACCESS if the policy is enabled:

	mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);

Similarly, mprotect() would return -EACCESS below:

	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
	mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);

The BPF filter that systemd MDWE uses is stateless, and disallows
mprotect() with PROT_EXEC completely. This new prctl allows PROT_EXEC to
be enabled if it was already PROT_EXEC, which allows the following case:

	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
	mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);

where PROT_BTI enables branch tracking identification on arm64.

Link: https://lkml.kernel.org/r/20230119160344.54358-1-joey.gouly@arm.com
Link: https://lkml.kernel.org/r/20230119160344.54358-2-joey.gouly@arm.com
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Co-developed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeremy Linton <jeremy.linton@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: nd <nd@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Topi Miettinen <toiwoton@gmail.com>
Cc: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:24 -08:00
Levi Yun
148aa87e4f mm/cma: fix potential memory loss on cma_declare_contiguous_nid
Suppose memblock_alloc_range_nid() with highmem_start succeeds when
cma_declare_contiguous_nid is called with !fixed on a 32-bit system with
PHYS_ADDR_T_64BIT enabled with memblock.bottom_up == false.

But the next trial to memblock_alloc_range_nid() to allocate in [SIZE_4G,
limits) nullifies former successfully allocated addr and it retries
memblock_alloc_ragne_nid().

In this situation, the first successfully allocated address area is lost.

Change the order of allocation (SIZE_4G, high_memory and base) and check
whether the allocated succeeded to prevent potential memory loss.

Link: https://lkml.kernel.org/r/20230118080523.44522-1-ppbuk5246@gmail.com
Signed-off-by: Levi Yun <ppbuk5246@gmail.com>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:24 -08:00
Yang Yang
5649d113ff swap_state: update shadow_nodes for anonymous page
Shadow_nodes is for shadow nodes reclaiming of workingset handling, it is
updated when page cache add or delete since long time ago workingset only
supported page cache.  But when workingset supports anonymous page
detection, we missied updating shadow nodes for it.  This caused that
shadow nodes of anonymous page will never be reclaimd by
scan_shadow_nodes() even they use much memory and system memory is tense.

So update shadow_nodes of anonymous page when swap cache is add or delete
by calling xas_set_update(..workingset_update_node).

Link: https://lkml.kernel.org/r/202301182013032211005@zte.com.cn
Fixes: aae466b0052e ("mm/swap: implement workingset detection for anonymous LRU")
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:24 -08:00