linux/mm
David Hildenbrand d74943a2f3 mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
Unfortunately commit 474098edac ("mm/gup: replace FOLL_NUMA by
gup_can_follow_protnone()") missed that follow_page() and
follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really
don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting
or due to inaccessible (PROT_NONE) VMAs.

As spelled out in commit 0b9d705297 ("mm: numa: Support NUMA hinting
page faults from gup/gup_fast"): "Other follow_page callers like KSM
should not use FOLL_NUMA, or they would fail to get the pages if they use
follow_page instead of get_user_pages."

liubo reported [1] that smaps_rollup results are imprecise, because they
miss accounting of pages that are mapped PROT_NONE.  Further, it's easy to
reproduce that KSM no longer works on inaccessible VMAs on x86-64, because
pte_protnone()/pmd_protnone() also indictaes "true" in inaccessible VMAs,
and follow_page() refuses to return such pages right now.

As KVM really depends on these NUMA hinting faults, removing the
pte_protnone()/pmd_protnone() handling in GUP code completely is not
really an option.

To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
to restore the original behavior for now and add better comments.

Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in
is_valid_gup_args(), to add that flag for all external GUP users.

Note that there are three GUP-internal __get_user_pages() users that don't
end up calling is_valid_gup_args() and consequently won't get
FOLL_HONOR_NUMA_FAULT set.

1) get_dump_page(): we really don't want to handle NUMA hinting
   faults. It specifies FOLL_FORCE and wouldn't have honored NUMA
   hinting faults already.
2) populate_vma_page_range(): we really don't want to handle NUMA hinting
   faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have
   honored NUMA hinting faults already.
3) faultin_vma_page_range(): we similarly don't want to handle NUMA
   hinting faults.

To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in
inaccessible VMAs properly, we have to perform VMA accessibility checks in
gup_can_follow_protnone().

As GUP-fast should reject such pages either way in
pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and
arm64 that both implement pte_protnone() -- let's just always fallback to
ordinary GUP when stumbling over pte_protnone()/pmd_protnone().

As Linus notes [2], honoring NUMA faults might only make sense for
selected GUP users.

So we should really see if we can instead let relevant GUP callers specify
it manually, and not trigger NUMA hinting faults from GUP as default. 
Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag and
adding appropriate documenation.

While at it, remove a stale comment from follow_trans_huge_pmd(): That
comment for pmd_protnone() was added in commit 2b4847e730 ("mm: numa:
serialise parallel get_user_page against THP migration"), which noted:

	THP does not unmap pages due to a lack of support for migration
	entries at a PMD level.  This allows races with get_user_pages

Nowadays, we do have PMD migration entries, so the comment no longer
applies.  Let's drop it.

[1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
[2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com

Link: https://lkml.kernel.org/r/20230803143208.383663-2-david@redhat.com
Fixes: 474098edac ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: liubo <liubo254@huawei.com>
Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
Reported-by: Peter Xu <peterx@redhat.com>
Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 13:07:20 -07:00
..
damon mm/damon/core: initialize damo_filter->list from damos_new_filter() 2023-08-04 13:03:43 -07:00
kasan kasan, slub: fix HW_TAGS zeroing with slub_debug 2023-07-08 09:29:32 -07:00
kfence mm/slab: introduce kmem_cache flag SLAB_NO_MERGE 2023-06-02 10:24:33 +02:00
kmsan kasan,kmsan: remove __GFP_KSWAPD_RECLAIM usage from kasan/kmsan 2023-06-23 16:59:26 -07:00
backing-dev.c mm: backing-dev: make bdi_class a static const structure 2023-06-23 16:59:27 -07:00
balloon_compaction.c
bootmem_info.c
cma_debug.c
cma_sysfs.c mm: cma: make kobj_type structure constant 2023-03-28 16:20:06 -07:00
cma.c mm/page_owner/cma: show pfn in cma/page_owner with hex format 2023-06-19 16:19:32 -07:00
cma.h
compaction.c mm: compaction: fix endless looping over same migrate block 2023-08-04 13:03:42 -07:00
debug_page_alloc.c mm: page_alloc: split out DEBUG_PAGEALLOC 2023-06-09 16:25:23 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable,page_table_check: warn pte map fails 2023-06-19 16:19:15 -07:00
debug.c mm: update validate_mm() to use vma iterator 2023-06-09 16:25:31 -07:00
dmapool_test.c dmapool: add alloc/free performance test 2023-04-05 19:42:38 -07:00
dmapool.c dmapool: create/destroy cleanup 2023-06-09 16:25:17 -07:00
early_ioremap.c mm/early_ioremap.c: improve the execution efficiency of early_ioremap_setup() 2023-06-09 16:25:56 -07:00
fadvise.c mm: remove unnecessary pagevec includes 2023-06-23 16:59:31 -07:00
fail_page_alloc.c mm: page_alloc: split out FAIL_PAGE_ALLOC 2023-06-09 16:25:23 -07:00
failslab.c
filemap.c - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
folio-compat.c - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of 2023-04-27 19:42:02 -07:00
frontswap.c mm: zswap: support exclusive loads 2023-06-19 16:19:05 -07:00
gup_test.c Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes. 2023-06-23 16:58:19 -07:00
gup_test.h
gup.c mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT 2023-08-21 13:07:20 -07:00
highmem.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
hmm.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
huge_memory.c mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT 2023-08-21 13:07:20 -07:00
hugetlb_cgroup.c mm/hugetlb: increase use of folios in alloc_huge_page() 2023-02-13 15:54:27 -08:00
hugetlb_vmemmap.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
hugetlb_vmemmap.h
hugetlb.c hugetlb: do not clear hugetlb dtor until allocating vmemmap 2023-08-04 13:03:41 -07:00
hwpoison-inject.c
init-mm.c IOMMU Updates for Linux 6.4 2023-04-30 13:00:38 -07:00
internal.h - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
interval_tree.c
io-mapping.c
ioremap.c
Kconfig mm: disable CONFIG_PER_VMA_LOCK until its fixed 2023-07-08 09:29:29 -07:00
Kconfig.debug mm: page_table_check: Make it dependent on EXCLUSIVE_SYSTEM_RAM 2023-05-29 16:14:28 +01:00
khugepaged.c mm/khugepaged: fix regression in collapse_file() 2023-06-29 09:41:45 -07:00
kmemleak.c lib/stackdepot, mm: rename stack_depot_want_early_init 2023-02-16 20:43:49 -08:00
ksm.c mm/swapfile: fix wrong swap entry type for hwpoisoned swapcache page 2023-08-04 13:03:40 -07:00
list_lru.c
maccess.c mm: Fix copy_from_user_nofault(). 2023-04-12 17:36:23 -07:00
madvise.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
Makefile mm: page_alloc: split out DEBUG_PAGEALLOC 2023-06-09 16:25:23 -07:00
mapping_dirty_helpers.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
memblock.c Revert "mm,memblock: reset memblock.reserved to system init state to prevent UAF" 2023-07-28 09:47:06 -07:00
memcontrol.c mm/memcontrol: do not tweak node in mem_cgroup_init() 2023-06-23 16:59:26 -07:00
memfd.c memfd: check for non-NULL file_seals in memfd_create() syscall 2023-06-19 13:19:31 -07:00
memory_hotplug.c mm: remove unnecessary pagevec includes 2023-06-23 16:59:31 -07:00
memory-failure.c mm: memory-failure: avoid false hwpoison page mapped error info 2023-08-04 13:03:41 -07:00
memory-tiers.c memory tier: remove unneeded !IS_ENABLED(CONFIG_MIGRATION) check 2023-06-19 16:19:29 -07:00
memory.c mm: lock_vma_under_rcu() must check vma->anon_vma under vma lock 2023-07-27 11:13:22 -07:00
mempolicy.c mm/mempolicy: Take VMA lock before replacing policy 2023-07-28 09:44:06 -07:00
mempool.c
memremap.c mm/memremap.c: fix outdated comment in devm_memremap_pages 2023-02-09 16:51:46 -08:00
memtest.c mm/memtest: add results of early memtest to /proc/meminfo 2023-04-05 19:42:55 -07:00
migrate_device.c mm: remove references to pagevec 2023-06-23 16:59:30 -07:00
migrate.c mm: remove unnecessary pagevec includes 2023-06-23 16:59:31 -07:00
mincore.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
mlock.c mm/mlock: fix vma iterator conversion of apply_vma_lock_flags() 2023-07-17 12:53:21 -07:00
mm_init.c - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
mm_slot.h
mmap_lock.c
mmap.c mm: lock VMA in dup_anon_vma() before setting ->anon_vma 2023-07-27 13:07:04 -07:00
mmu_gather.c mm: prefer xxx_page() alloc/free functions for order-0 pages 2023-03-28 16:20:16 -07:00
mmu_notifier.c mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export 2023-02-02 22:32:54 -08:00
mmzone.c
mprotect.c Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes. 2023-06-23 16:58:19 -07:00
mremap.c mm: Update do_vmi_align_munmap() return semantics 2023-07-01 08:10:56 -07:00
msync.c
nommu.c xtensa: fix lock_mm_and_find_vma in case VMA not found 2023-07-01 08:00:05 -07:00
oom_kill.c mm, oom: do not check 0 mask in out_of_memory() 2023-06-09 16:25:20 -07:00
page_alloc.c - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
page_counter.c
page_ext.c mm/page_ext: init page_ext early if there are no deferred struct pages 2023-02-02 22:33:22 -08:00
page_idle.c
page_io.c swap: use __bio_add_page to add page to bio 2023-05-31 09:50:02 -06:00
page_isolation.c mm: page_isolation: write proper kerneldoc 2023-06-19 16:18:59 -07:00
page_owner.c mm/page_owner/cma: show pfn in cma/page_owner with hex format 2023-06-19 16:19:32 -07:00
page_poison.c
page_reporting.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
page_reporting.h
page_table_check.c - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
page_vma_mapped.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
page-writeback.c writeback: account the number of pages written back 2023-07-08 09:29:30 -07:00
pagewalk.c mm/pagewalk: fix EFI_PGT_DUMP of espfix area 2023-07-27 13:07:04 -07:00
percpu-internal.h percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing 2023-06-19 16:19:29 -07:00
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c mm: memcontrol: rename memcg_kmem_enabled() 2023-02-16 20:43:56 -08:00
pgalloc-track.h
pgtable-generic.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
process_vm_access.c mm/gup: remove unused vmas parameter from pin_user_pages_remote() 2023-06-09 16:25:25 -07:00
ptdump.c mm: ptdump should use ptep_get_lockless() 2023-06-19 16:19:24 -07:00
readahead.c mm: remove unnecessary pagevec includes 2023-06-23 16:59:31 -07:00
rmap.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
rodata_test.c
secretmem.c mm/mlock: rename mlock_future_check() to mlock_future_ok() 2023-06-09 16:25:38 -07:00
shmem.c shmem: minor fixes to splice-read implementation 2023-07-27 13:07:03 -07:00
show_mem.c mm: page_alloc: collect mem statistic into show_mem.c 2023-06-09 16:25:22 -07:00
shrinker_debug.c Revert "mm: shrinkers: make count and scan in shrinker debugfs lockless" 2023-06-19 13:19:34 -07:00
shuffle.c
shuffle.h mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
slab_common.c slab updates for 6.5 2023-06-29 16:34:12 -07:00
slab.c slab updates for 6.5 2023-06-29 16:34:12 -07:00
slab.h kasan, slub: fix HW_TAGS zeroing with slub_debug 2023-07-08 09:29:32 -07:00
slub.c slab updates for 6.5 2023-06-29 16:34:12 -07:00
sparse-vmemmap.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
sparse.c - Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in 2023-06-28 10:59:38 -07:00
swap_cgroup.c
swap_slots.c
swap_state.c mm: remove unnecessary pagevec includes 2023-06-23 16:59:31 -07:00
swap.c mm: remove references to pagevec 2023-06-23 16:59:30 -07:00
swap.h mm: remove the __swap_writepage return value 2023-02-02 22:33:33 -08:00
swapfile.c mm/swapfile: fix wrong swap entry type for hwpoisoned swapcache page 2023-08-04 13:03:40 -07:00
truncate.c mm: remove references to pagevec 2023-06-23 16:59:30 -07:00
usercopy.c mm: Fix copy_from_user_nofault(). 2023-04-12 17:36:23 -07:00
userfaultfd.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
util.c mm: uninline kstrdup() 2023-04-08 13:45:37 -07:00
vmalloc.c Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes. 2023-06-23 16:58:19 -07:00
vmpressure.c
vmscan.c mm/vmscan: fix root proactive reclaim unthrottling unbalanced node 2023-06-23 16:59:32 -07:00
vmstat.c - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
workingset.c Multi-gen LRU: fix workingset accounting 2023-06-09 16:25:46 -07:00
z3fold.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zbud.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zpool.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zsmalloc.c zsmalloc: fix races between modifications of fullness and isolated 2023-08-04 13:03:40 -07:00
zswap.c mm: zswap: fix double invalidate with exclusive loads 2023-06-23 16:59:31 -07:00