linux/mm
David Hildenbrand fce831c920 mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()
For now we only get the (small) zeropage mapped to user space in four
cases (excluding VM_PFNMAP mappings, such as /dev/mem):

(1) Read page faults in anonymous VMAs (MAP_PRIVATE|MAP_ANON):
    do_anonymous_page() will not refcount it and map it pte_mkspecial().
(2) UFFDIO_ZEROPAGE on anonymous VMA or COW mapping of shmem
    (MAP_PRIVATE). mfill_atomic_pte_zeropage() will not refcount it and
    map it pte_mkspecial().
(3) KSM in mergeable VMA (anonymous VMA or COW mapping).
    cmp_and_merge_page() will not refcount it and map it
    pte_mkspecial().
(4) FSDAX as an optimization for holes.
    vmf_insert_mixed()->__vm_insert_mixed() might end up calling
    insert_page() without CONFIG_ARCH_HAS_PTE_SPECIAL, refcounting the
    zeropage and not mapping it pte_mkspecial(). With
    CONFIG_ARCH_HAS_PTE_SPECIAL, we'll call insert_pfn() where we will
    not refcount it and map it pte_mkspecial().

In case (4), we might not have VM_MIXEDMAP set: while fs/fuse/dax.c sets
VM_MIXEDMAP, we removed it for both ext4 fsdax and XFS in commit
e1fb4a0864 ("dax: remove VM_MIXEDMAP for fsdax and device dax").

Without CONFIG_ARCH_HAS_PTE_SPECIAL and with VM_MIXEDMAP, vm_normal_page()
would currently return the zeropage.  We'll refcount the zeropage when
mapping and when unmapping.

Without CONFIG_ARCH_HAS_PTE_SPECIAL and without VM_MIXEDMAP,
vm_normal_page() would currently refuse to return the zeropage.  So we'd
refcount it when mapping but not when unmapping it.  Do we have fsdax
without CONFIG_ARCH_HAS_PTE_SPECIAL in practice?  Hard to tell.
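
The accounting described in the two paragraphs above can be illustrated
with a small userspace model.  This is only a sketch, not kernel code:
model_leaked_refs() and its parameters are hypothetical names, and the two
booleans stand in for conditions that are more involved in the real
insert_page(), insert_pfn() and vm_normal_page() paths.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model of the pre-patch accounting.
 *
 * has_pte_special: stands in for CONFIG_ARCH_HAS_PTE_SPECIAL
 * vma_mixedmap:    stands in for VM_MIXEDMAP being set on the VMA
 *
 * Returns how many zeropage references are left behind after
 * "cycles" rounds of mapping and unmapping the zeropage.
 */
static long model_leaked_refs(bool has_pte_special, bool vma_mixedmap,
			      int cycles)
{
	long refcount = 1;	/* baseline reference of the zeropage */

	for (int i = 0; i < cycles; i++) {
		/* Map: only the insert_page()-like path takes a reference. */
		if (!has_pte_special)
			refcount++;

		/*
		 * Unmap: a reference is dropped only if the
		 * vm_normal_page()-like lookup returns the zeropage.
		 */
		if (!has_pte_special && vma_mixedmap)
			refcount--;
	}

	return refcount - 1;
}
```

In this model, only the combination has_pte_special == false and
vma_mixedmap == false leaves references behind (one per cycle); the other
combinations stay balanced.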

Independent of that, we should never refcount the zeropage when we might
be holding that reference for a long time, because even without an
accounting imbalance we might overflow the refcount.  As there is interest
in using the zeropage also in other VM_MIXEDMAP mappings, let's add clean
support for that in the cases where it makes sense:

(A) Never refcount the zeropage when mapping it:

In insert_page(), special-case the zeropage, do not refcount it, and use
pte_mkspecial().  Don't involve insert_pfn(): adjusting insert_page()
looks cleaner than branching off to insert_pfn().

(B) Never refcount the zeropage when unmapping it:

In vm_normal_page(), also don't return the zeropage in a VM_MIXEDMAP
mapping without CONFIG_ARCH_HAS_PTE_SPECIAL.  Add a VM_WARN_ON_ONCE()
sanity check if we'd ever return the zeropage, which could happen if
someone forgets to set pte_mkspecial() when mapping the zeropage. 
Document that.
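
Taken together, (A) and (B) mean the zeropage refcount is never touched by
mapping or unmapping.  A minimal userspace sketch of that rule follows.
All names (model_page, model_pte, model_insert_page(), model_normal_page(),
model_unmap()) are hypothetical and only mirror the idea, not the actual
kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_page {
	long refcount;
};

struct model_pte {
	struct model_page *page;
	bool special;		/* stands in for pte_special() */
};

/* Hypothetical stand-in for the shared zeropage. */
static struct model_page model_zeropage = { .refcount = 1 };

static bool model_is_zeropage(const struct model_page *page)
{
	return page == &model_zeropage;
}

/* (A) Mapping: the zeropage is marked special and not refcounted. */
static struct model_pte model_insert_page(struct model_page *page)
{
	struct model_pte pte = { .page = page, .special = false };

	if (model_is_zeropage(page))
		pte.special = true;
	else
		page->refcount++;
	return pte;
}

/* (B) Lookup used when unmapping: never hand out the zeropage. */
static struct model_page *model_normal_page(const struct model_pte *pte)
{
	if (pte->special || model_is_zeropage(pte->page))
		return NULL;
	return pte->page;
}

static void model_unmap(struct model_pte *pte)
{
	struct model_page *page = model_normal_page(pte);

	if (page)
		page->refcount--;
	pte->page = NULL;
}
```

The explicit zeropage check in model_normal_page() is what keeps the
accounting balanced even when the special bit is unavailable, which is the
situation without CONFIG_ARCH_HAS_PTE_SPECIAL.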

(C) Allow the zeropage only where reasonable

s390x never wants the zeropage in some processes running legacy KVM guests
that make use of storage keys.  So disallow that.

Further, using the zeropage in COW mappings is unproblematic (just what we
do for other COW mappings), because FAULT_FLAG_UNSHARE can just unshare it
and GUP with FOLL_LONGTERM would work as expected.

Similarly, mappings that can never have writable PTEs (implying no write
faults) are also not problematic, because nothing could end up mapping the
PTE writable by mistake later.  But in case we could have writable PTEs,
we'll only allow the zeropage in FSDAX VMAs, that are incompatible with
GUP and are blocked there completely.
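
The policy in (C) can be written down as a single predicate.  The sketch
below is a userspace model under stated assumptions: the MODEL_VM_* flag
values are illustrative and are not the kernel's, struct model_vma and
model_zeropage_allowed() are hypothetical names, and a COW mapping is taken
to mean a private mapping that may become writable.

```c
#include <stdbool.h>

/* Illustrative flag bits; the values are NOT the kernel's. */
enum {
	MODEL_VM_WRITE    = 1 << 0,
	MODEL_VM_MAYWRITE = 1 << 1,
	MODEL_VM_SHARED   = 1 << 2,
};

struct model_vma {
	unsigned int flags;
	bool is_fsdax;			/* VMA maps an FSDAX file */
	bool mm_forbids_zeropage;	/* e.g. s390x with storage keys */
};

/* COW mapping: private (not shared) and may become writable. */
static bool model_is_cow_mapping(unsigned int flags)
{
	return (flags & (MODEL_VM_SHARED | MODEL_VM_MAYWRITE)) ==
	       MODEL_VM_MAYWRITE;
}

static bool model_zeropage_allowed(const struct model_vma *vma)
{
	/* Some processes must never see the zeropage. */
	if (vma->mm_forbids_zeropage)
		return false;

	/* COW mappings: unsharing handles the zeropage. */
	if (model_is_cow_mapping(vma->flags))
		return true;

	/* No writable PTEs possible, so no accidental write upgrade. */
	if (!(vma->flags & (MODEL_VM_WRITE | MODEL_VM_MAYWRITE)))
		return true;

	/* Writable PTEs possible: only allowed for FSDAX VMAs. */
	return vma->is_fsdax;
}
```

The order of the checks matters in this sketch: the per-process veto comes
first, so that a COW or read-only mapping in such a process is still
rejected.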

We'll always require the zeropage to be mapped with pte_special().
GUP-fast will reject the zeropage that way, but GUP-slow will allow it.
(Note that GUP does not refcount the zeropage with FOLL_PIN, because there
were issues with overflowing the refcount in the past.)

Add sanity checks to can_change_pte_writable() and wp_page_reuse(), to
catch early during testing if we'd ever find a zeropage unexpectedly in
code that wants to upgrade write permissions.

Convert the BUG_ON in vm_mixed_ok() to an ordinary check and simply fail
with VM_FAULT_SIGBUS, like we do for other sanity checks.  Drop the stale
comment regarding reserved pages from insert_page().

Note that:
* we won't mess with VM_PFNMAP mappings for now. remap_pfn_range() and
  vmf_insert_pfn() would allow the zeropage in some cases and
  not refcount it.
* vmf_insert_pfn*() will reject the zeropage in VM_MIXEDMAP
  mappings and we'll leave that alone for now. People can simply use
  one of the other interfaces.
* we won't bother with the huge zeropage for now. It's never
  PTE-mapped and also GUP does not special-case it yet.

Link: https://lkml.kernel.org/r/20240522125713.775114-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03 19:29:56 -07:00
..
damon mm/damon/core: fix return value from damos_wmark_metric_value 2024-05-11 15:41:36 -07:00
kasan kasan: fix bad call to unpoison_slab_object 2024-06-24 20:52:09 -07:00
kfence mm: introduce slabobj_ext to support slab object extensions 2024-04-25 20:55:51 -07:00
kmsan kmsan: do not wipe out origin when doing partial unpoisoning 2024-06-05 19:19:25 -07:00
backing-dev.c writeback: support retrieving per group debug writeback stats of bdi 2024-05-05 17:53:51 -07:00
balloon_compaction.c
bootmem_info.c bootmem: use kmemleak_free_part_phys in put_page_bootmem 2023-10-25 16:47:13 -07:00
cma_debug.c
cma_sysfs.c mm/cma: add sysfs file 'release_pages_success' 2024-02-22 10:24:57 -08:00
cma.c mm/cma: drop incorrect alignment check in cma_init_reserved_mem 2024-04-25 20:56:42 -07:00
cma.h mm/cma: add sysfs file 'release_pages_success' 2024-02-22 10:24:57 -08:00
compaction.c mm: handle profiling for fake memory allocations during compaction 2024-06-24 20:52:09 -07:00
debug_page_alloc.c mm: page_alloc: consolidate free page accounting 2024-04-25 20:56:04 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: drop RANDOM_ORVALUE trick 2024-06-15 10:43:08 -07:00
debug.c mm/debug: print only page mapcount (excluding folio entire mapcount) in __dump_folio() 2024-05-05 17:53:31 -07:00
dmapool_test.c
dmapool.c mm/mempool/dmapool: remove CONFIG_DEBUG_SLAB ifdefs 2023-12-05 11:17:58 +01:00
early_ioremap.c
execmem.c mm/execmem, arch: convert remaining overrides of module_alloc to execmem 2024-05-14 00:31:43 -07:00
fadvise.c
fail_page_alloc.c
failslab.c
filemap.c mm: fix xyz_noprof functions calling profiled functions 2024-06-05 19:19:26 -07:00
folio-compat.c mm: remove __set_page_dirty_nobuffers() 2024-04-25 20:56:25 -07:00
gup_test.c
gup_test.h
gup.c mm/gup: fix hugepd handling in hugetlb rework 2024-05-07 10:37:01 -07:00
highmem.c x86/kexec: use pr_err() instead of kexec_dprintk() when an error occurs 2023-12-29 12:22:28 -08:00
hmm.c mm/treewide: replace pXd_huge() with pXd_leaf() 2024-04-25 20:55:46 -07:00
huge_memory.c mm/swap: reduce swap cache search space 2024-07-03 19:29:56 -07:00
hugetlb_cgroup.c mm/hugetlb: assert hugetlb_lock in __hugetlb_cgroup_commit_charge 2024-05-05 17:53:41 -07:00
hugetlb_vmemmap.c memory: remove the now superfluous sentinel element from ctl_table array 2024-04-25 20:56:32 -07:00
hugetlb_vmemmap.h mm: hugetlb_vmemmap: fix reference to nonexistent file 2023-10-25 16:47:14 -07:00
hugetlb.c mm/hugetlb: drop node_alloc_noretry from alloc_fresh_hugetlb_folio 2024-07-03 19:29:52 -07:00
hwpoison-inject.c mm/memory-failure: convert shake_page() to shake_folio() 2024-05-05 17:53:45 -07:00
init-mm.c mm: Deprecate pasid field 2023-12-12 10:11:32 +01:00
internal.h /proc/pid/smaps: add mseal info for vma 2024-06-24 20:52:09 -07:00
interval_tree.c
io-mapping.c
ioremap.c
Kconfig The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
Kconfig.debug mm/slub: unify all sl[au]b parameters with "slab_$param" 2024-01-22 10:31:08 +01:00
khugepaged.c mm: simplify thp_vma_allowable_order 2024-05-05 17:53:53 -07:00
kmemleak.c mm: lift gfp_kmemleak_mask() to gfp.h 2024-05-19 14:40:44 -07:00
ksm.c mm/ksm: fix ksm_zero_pages accounting 2024-06-05 19:19:26 -07:00
list_lru.c mm/zswap: stop lru list shrinking when encounter warm region 2024-02-22 10:24:54 -08:00
maccess.c
madvise.c mseal: add mseal syscall 2024-05-23 19:40:26 -07:00
Makefile mseal: add mseal syscall 2024-05-23 19:40:26 -07:00
mapping_dirty_helpers.c mm: fix clean_record_shared_mapping_range kernel-doc 2023-08-24 16:20:30 -07:00
memblock.c memblock: use numa_valid_node() helper to check for invalid node ID 2024-06-16 10:17:57 +03:00
memcontrol.c mm/swap: reduce swap cache search space 2024-07-03 19:29:56 -07:00
memfd.c mm/memfd: refactor memfd_tag_pins() and memfd_wait_for_pins() 2024-03-04 17:01:21 -08:00
memory_hotplug.c mm/hugetlb: rename dissolve_free_huge_pages() to dissolve_free_hugetlb_folios() 2024-05-05 17:53:35 -07:00
memory-failure.c mm/memory-failure: fix handling of dissolved but not taken off from buddy pages 2024-05-24 11:55:08 -07:00
memory-tiers.c memory tier: create CPUless memory tiers after obtaining HMAT info 2024-05-05 17:53:26 -07:00
memory.c mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed() 2024-07-03 19:29:56 -07:00
mempolicy.c mm: mempolicy: use folio_alloc_mpol() in alloc_migration_target_by_mpol() 2024-07-03 19:29:53 -07:00
mempool.c mm: fix xyz_noprof functions calling profiled functions 2024-06-05 19:19:26 -07:00
memremap.c mm: convert put_devmap_managed_page_refs() to put_devmap_managed_folio_refs() 2024-05-05 17:53:49 -07:00
memtest.c memtest: use {READ,WRITE}_ONCE in memory scanning 2024-03-13 12:12:21 -07:00
migrate_device.c The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
migrate.c mm/migrate: make migrate_pages_batch() stats consistent 2024-06-24 20:52:10 -07:00
mincore.c mm/swap: reduce swap cache search space 2024-07-03 19:29:56 -07:00
mlock.c mm: add pmd_folio() 2024-04-25 20:56:19 -07:00
mm_init.c Revert "mm: init_mlocked_on_free_v3" 2024-06-15 10:43:05 -07:00
mm_slot.h
mmap_lock.c
mmap.c mseal: add mseal syscall 2024-05-23 19:40:26 -07:00
mmu_gather.c mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing 2024-02-22 15:27:17 -08:00
mmu_notifier.c mmu_notifier: remove the .change_pte() callback 2024-04-11 13:18:36 -04:00
mmzone.c zswap: shrink zswap pool based on memory pressure 2023-12-12 10:57:02 -08:00
mprotect.c mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed() 2024-07-03 19:29:56 -07:00
mremap.c mseal: add mseal syscall 2024-05-23 19:40:26 -07:00
mseal.c mseal: add mseal syscall 2024-05-23 19:40:26 -07:00
msync.c
nommu.c The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
oom_kill.c memory: remove the now superfluous sentinel element from ctl_table array 2024-04-25 20:56:32 -07:00
page_alloc.c mm/page_alloc: Separate THP PCP into movable and non-movable categories 2024-06-24 20:52:11 -07:00
page_counter.c
page_ext.c mm: make page_ext_get() take a const argument 2024-04-25 20:56:14 -07:00
page_idle.c
page_io.c mm/swap: get the swap device offset directly 2024-07-03 19:29:55 -07:00
page_isolation.c mm: page_isolation: prepare for hygienic freelists 2024-04-25 20:56:04 -07:00
page_owner.c mm/page-owner: use gfp_nested_mask() instead of open coded masking 2024-05-19 14:40:44 -07:00
page_poison.c mm/page_poison: replace kmap_atomic() with kmap_local_page() 2023-12-10 16:51:50 -08:00
page_reporting.c mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER 2024-01-08 15:27:15 -08:00
page_reporting.h
page_table_check.c mm/page_table_check: fix crash on ZONE_DEVICE 2024-06-15 10:43:04 -07:00
page_vma_mapped.c mm: make page_mapped_in_vma conditional on CONFIG_MEMORY_FAILURE 2024-05-05 17:53:45 -07:00
page-writeback.c writeback: factor out balance_wb_limits to remove repeated code 2024-07-03 19:29:54 -07:00
pagewalk.c mm: pagewalk: assert write mmap lock only for walking the user page tables 2023-12-10 16:51:53 -08:00
percpu-internal.h mm: percpu: add codetag reference into pcpuobj_ext 2024-04-25 20:55:56 -07:00
percpu-km.c
percpu-stats.c
percpu-vm.c percpu: clean up all mappings when pcpu_map_pages() fails 2024-04-25 20:55:49 -07:00
percpu.c mm: percpu: enable per-cpu allocation tagging 2024-04-25 20:55:56 -07:00
pgalloc-track.h
pgtable-generic.c mm: fix race between __split_huge_pmd_locked() and GUP-fast 2024-05-07 10:37:00 -07:00
process_vm_access.c mm: fix process_vm_rw page counts 2023-12-10 16:51:39 -08:00
ptdump.c mm: ptdump: add check_wx_pages debugfs attribute 2024-02-22 10:24:47 -08:00
readahead.c The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
rmap.c mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED 2024-05-11 15:41:35 -07:00
rodata_test.c
secretmem.c
shmem_quota.c tmpfs: fix race on handling dquot rbtree 2024-03-26 11:07:23 -07:00
shmem.c mm/swap: reduce swap cache search space 2024-07-03 19:29:56 -07:00
show_mem.c lib: add memory allocations report in show_mem() 2024-04-25 20:55:57 -07:00
shrinker_debug.c mm: shrinker: convert shrinker_rwsem to mutex 2023-10-04 10:32:26 -07:00
shrinker.c mm: shrinker: use kvzalloc_node() from expand_one_shrinker_info() 2024-01-05 09:58:32 -08:00
shuffle.c
shuffle.h mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER 2024-01-08 15:27:15 -08:00
slab_common.c The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
slab.h The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
slub.c mm/slab: fix 'variable obj_exts set but not used' warning 2024-06-24 20:52:09 -07:00
sparse-vmemmap.c
sparse.c mm/sparse: guard the size of mem_section is power of 2 2024-05-05 17:53:40 -07:00
swap_cgroup.c
swap_slots.c mm: swap: update get_swap_pages() to take folio order 2024-04-25 20:56:37 -07:00
swap_state.c mm/swap: reduce swap cache search space 2024-07-03 19:29:56 -07:00
swap.c mm: add kernel-doc for folio_mark_accessed() 2024-05-05 17:53:50 -07:00
swap.h mm/swap: reduce swap cache search space 2024-07-03 19:29:56 -07:00
swapfile.c mm/swap: reduce swap cache search space 2024-07-03 19:29:56 -07:00
truncate.c mm/vmscan: update stale references to shrink_page_list 2024-07-03 19:29:52 -07:00
usercopy.c
userfaultfd.c The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
util.c hardening fixes for v6.10-rc5 2024-06-17 12:00:22 -07:00
vmalloc.c mm: fix incorrect vbq reference in purge_fragmented_block 2024-06-24 20:52:08 -07:00
vmpressure.c eventfd: simplify eventfd_signal() 2023-11-28 14:08:38 +01:00
vmscan.c mm: vmscan: reset sc->priority on retry 2024-07-03 19:29:53 -07:00
vmstat.c iommu: observability of the IOMMU allocations 2024-04-15 14:31:47 +02:00
workingset.c mm: cleanup WORKINGSET_NODES in workingset 2024-05-07 10:36:59 -07:00
z3fold.c mm: zpool: return pool size in pages 2024-04-25 20:55:48 -07:00
zbud.c mm: zpool: return pool size in pages 2024-04-25 20:55:48 -07:00
zpool.c mm: zpool: return pool size in pages 2024-04-25 20:55:48 -07:00
zsmalloc.c mm: zpool: return pool size in pages 2024-04-25 20:55:48 -07:00
zswap.c mm: zswap: remove same_filled module params 2024-05-05 17:53:38 -07:00