linux

Linus Torvalds 5df397dec7 mm: delay page_remove_rmap() until after the TLB has been flushed

When we remove a page table entry, we are very careful to only free the
page after we have flushed the TLB, because other CPUs could still be
using the page through stale TLB entries until after the flush.

However, we have removed the rmap entry for that page early, which means
that functions like folio_mkclean() would end up not serializing with the
page table lock because the page had already been made invisible to rmap.

And that is a problem, because while the TLB entry exists, we could end up
with the following situation:

 (a) one CPU could come in and clean it, never seeing our mapping of the
     page

 (b) another CPU could continue to use the stale and dirty TLB entry and
     continue to write to said page

resulting in a page that has been dirtied, but then marked clean again,
all while another CPU might have dirtied it some more.

End result: possibly lost dirty data.

This extends our current TLB gather infrastructure to optionally track a
"should I do a delayed page_remove_rmap() for this page after flushing the
TLB" flag.  It uses the newly introduced 'encoded page pointer' to do that
without having to keep separate data around.
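
As a rough illustration of the 'encoded page pointer' idea, the sketch below
packs a flag into the low bits of an aligned pointer, so no side table is
needed. It is a user-space approximation, not the kernel code: the struct
page stand-in, the ENCODE_PAGE_MASK and DELAY_RMAP_FLAG names, and the
helper names are all assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's struct page (assumption). The alignment
 * guarantees that the low bits of any pointer to it are zero. */
struct page { _Alignas(8) unsigned long flags; };

#define ENCODE_PAGE_MASK 3ul  /* low pointer bits usable for flags (assumed) */
#define DELAY_RMAP_FLAG  1ul  /* hypothetical "delay the rmap removal" flag */

typedef uintptr_t encoded_page_t;

/* Pack a page pointer and its flag bits into a single word. */
static inline encoded_page_t encode_page(struct page *page, unsigned long flags)
{
	assert((flags & ~ENCODE_PAGE_MASK) == 0);
	return (uintptr_t)page | flags;
}

/* Recover only the flag bits. */
static inline unsigned long encoded_page_flags(encoded_page_t e)
{
	return e & ENCODE_PAGE_MASK;
}

/* Recover the original pointer by masking the flag bits off again. */
static inline struct page *encoded_page_ptr(encoded_page_t e)
{
	return (struct page *)(e & ~(uintptr_t)ENCODE_PAGE_MASK);
}
```

A caller would store encode_page(page, DELAY_RMAP_FLAG) in the gather array
and later test encoded_page_flags() to decide whether the rmap removal is
still pending for that entry.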

Note, this is complicated by a couple of issues:

 - we want to delay the rmap removal, but not past the page table lock,
   because that simplifies the memcg accounting

 - only SMP configurations want to delay TLB flushing, since on UP
   there are obviously no remote TLBs to worry about, and the page
   table lock means there are no preemption issues either

 - s390 has its own mmu_gather model that doesn't delay TLB flushing,
   and as a result also does not want the delayed rmap. As such, we can
   treat s390 like the UP case and use a common fallback for the "no
   delays" case.

 - we can track an enormous number of pages in our mmu_gather structure,
   with MAX_GATHER_BATCH_COUNT batches of MAX_TABLE_BATCH pages each,
   all set up to be approximately 10k pending pages.

   We do not want to have a huge number of batched pages that we then
   need to check for delayed rmap handling inside the page table lock.

Particularly that last point results in a noteworthy detail, where the
normal page batch gathering is limited once we have delayed rmaps pending,
in such a way that only the last batch (the so-called "active batch") in
the mmu_gather structure can have any delayed entries.
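
The batching limit described above can be sketched as a small user-space
model. This is a simplified illustration under stated assumptions, not the
kernel implementation: the struct layout, the tiny BATCH_SIZE and
MAX_BATCHES values, and the gather_add() and gather_flush_rmaps() names are
all hypothetical. The model refuses to open a new batch while delayed rmaps
are pending, which is stricter than necessary but keeps every delayed entry
in the active batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_SIZE  4   /* hypothetical; tiny so the limit is easy to hit */
#define MAX_BATCHES 3   /* hypothetical cap on the number of batches */
#define DELAY_RMAP  1u  /* hypothetical per-entry flag */

struct entry { int page_id; unsigned flags; };
struct batch { size_t nr; struct entry e[BATCH_SIZE]; };

struct gather {
	struct batch batches[MAX_BATCHES];
	size_t active;      /* index of the active (last) batch */
	bool delayed_rmap;  /* delayed entries pending in the active batch? */
	int rmaps_removed;  /* bookkeeping for this demo only */
};

/* Queue a page. Returns false when the caller must flush first. */
static bool gather_add(struct gather *g, int page_id, bool delay_rmap)
{
	struct batch *b = &g->batches[g->active];

	if (b->nr == BATCH_SIZE) {
		/* Do not start a new batch while delayed rmaps are pending,
		 * so delayed entries never end up outside the active batch. */
		if (g->delayed_rmap || g->active + 1 == MAX_BATCHES)
			return false;
		b = &g->batches[++g->active];
	}
	b->e[b->nr].page_id = page_id;
	b->e[b->nr].flags = delay_rmap ? DELAY_RMAP : 0;
	b->nr++;
	if (delay_rmap)
		g->delayed_rmap = true;
	return true;
}

/* Meant to run after the TLB flush, while the page table lock is still
 * held: only the active batch has to be scanned. */
static void gather_flush_rmaps(struct gather *g)
{
	struct batch *b = &g->batches[g->active];

	if (!g->delayed_rmap)
		return;
	for (size_t i = 0; i < b->nr; i++) {
		if (b->e[i].flags & DELAY_RMAP) {
			b->e[i].flags &= ~DELAY_RMAP;
			g->rmaps_removed++;  /* stands in for the rmap removal */
		}
	}
	g->delayed_rmap = false;
}
```

In this model a caller that gets false from gather_add() would flush the
TLB, call gather_flush_rmaps(), and then retry, so the work done under the
lock is bounded by a single batch rather than by every queued page.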

NOTE!  While the "possibly lost dirty data" sounds catastrophic, for this
all to happen you need to have a user thread doing either madvise() with
MADV_DONTNEED or a full re-mmap() of the area concurrently with another
thread continuing to use said mapping.

So arguably this is about user space doing crazy things, but from a VM
consistency standpoint it's better if we track the dirty bit properly even
when user space goes off the rails.

[akpm@linux-foundation.org: fix UP build, per Linus]
Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
Link: https://lkml.kernel.org/r/20221109203051.1835763-4-torvalds@linux-foundation.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2022-11-30 15:58:50 -08:00

damon

mm/damon: use kstrtobool() instead of strtobool()

2022-11-30 15:58:45 -08:00

kasan

memory: move hotplug memory notifier priority to same file for easy sorting

2022-11-08 17:37:17 -08:00

kfence

kfence: fix stack trace pruning

2022-11-22 18:50:44 -08:00

kmsan

kmsan: core: kmsan_in_runtime() should return true in NMI context

2022-11-08 15:57:24 -08:00

backing-dev.c

mm: backing-dev: Remove the unneeded result variable

2022-09-11 20:26:02 -07:00

balloon_compaction.c

mm: Convert all PageMovable users to movable_operations

2022-08-02 12:34:03 -04:00

bootmem_info.c

bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem

2022-08-28 14:02:45 -07:00

cma_debug.c

mm/cma_debug: show complete cma name in debugfs directories

2022-09-11 20:25:50 -07:00

cma_sysfs.c

…

cma.c

Revert "mm/cma.c: remove redundant cma_mutex lock"

2022-05-13 15:11:26 -07:00

cma.h

mm/cma: provide option to opt out from exposing pages on activation failure

2022-03-22 15:57:09 -07:00

compaction.c

mm: migrate: fix THP's mapcount on isolation

2022-11-30 14:49:41 -08:00

debug_page_ref.c

…

debug_vm_pgtable.c

mm: remove unused savedwrite infrastructure

2022-11-30 15:58:49 -08:00

debug.c

mm,thp,rmap: simplify compound page mapcount handling

2022-11-30 15:58:46 -08:00

dmapool.c

mm/dmapool.c: revert "make dma pool to use kmalloc_node"

2022-01-15 16:30:28 +02:00

early_ioremap.c

mm/early_ioremap: declare early_memremap_pgprot_adjust()

2022-03-22 15:57:11 -07:00

fadvise.c

riscv: compat: syscall: Add compat_sys_call_table implementation

2022-04-26 13:36:25 -07:00

failslab.c

mm: fix unexpected changes to {failslab|fail_page_alloc}.attr

2022-11-22 18:50:44 -08:00

filemap.c

filemap: find_get_entries() now updates start offset

2022-11-08 17:37:12 -08:00

folio-compat.c

mm,thp,rmap: simplify compound page mapcount handling

2022-11-30 15:58:46 -08:00

frontswap.c

frontswap: don't call ->init if no ops are registered

2022-09-26 12:14:34 -07:00

gup_test.c

mm/gup_test: start/stop/read functionality for PIN LONGTERM test

2022-11-08 17:37:15 -08:00

gup_test.h

mm/gup_test: start/stop/read functionality for PIN LONGTERM test

2022-11-08 17:37:15 -08:00

gup.c

hugetlb: simplify hugetlb handling in follow_page_mask

2022-11-08 17:37:10 -08:00

highmem.c

highmem: fix kmap_to_page() for kmap_local_page() addresses

2022-10-12 18:51:51 -07:00

hmm.c

mm/swap: add swp_offset_pfn() to fetch PFN from swap entry

2022-09-26 19:46:05 -07:00

huge_memory.c

mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite

2022-11-30 15:58:49 -08:00

hugetlb_cgroup.c

mm/hugeltb_cgroup: convert hugetlb_cgroup_commit_charge*() to folios

2022-11-30 15:58:43 -08:00

hugetlb_vmemmap.c

mm/hugetlb_vmemmap: remap head page to newly allocated page

2022-11-30 15:58:47 -08:00

hugetlb_vmemmap.h

mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability

2022-08-08 18:06:43 -07:00

hugetlb.c

mm,thp,rmap: simplify compound page mapcount handling

2022-11-30 15:58:46 -08:00

hwpoison-inject.c

mm/hwpoison: add __init/__exit annotations to module init/exit funcs

2022-10-03 14:03:05 -07:00

init-mm.c

mm: remove rb tree.

2022-09-26 19:46:16 -07:00

internal.h

mm/hwpoison: introduce per-memory_block hwpoison counter

2022-11-08 17:37:22 -08:00

interval_tree.c

…

io-mapping.c

…

ioremap.c

mm: ioremap: Add ioremap/iounmap_allowed()

2022-06-27 12:22:31 +01:00

Kconfig

mm,hugetlb: use folio fields in second tail page

2022-11-30 15:58:46 -08:00

Kconfig.debug

Two followon fixes for the post-5.19 series "Use pageblock_order for cma

2022-05-27 11:40:49 -07:00

khugepaged.c

mm,thp,rmap: simplify compound page mapcount handling

2022-11-30 15:58:46 -08:00

kmemleak.c

mm/kmemleak: prevent soft lockup in kmemleak_scan()'s object iteration loops

2022-10-28 13:37:22 -07:00

ksm.c

mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite

2022-11-30 15:58:49 -08:00

list_lru.c

mm: kmem: make mem_cgroup_from_obj() vmalloc()-safe

2022-06-16 19:48:31 -07:00

maccess.c

asm-generic updates for 5.18

2022-03-23 18:03:08 -07:00

madvise.c

madvise: use zap_page_range_single for madvise dontneed

2022-11-30 14:49:40 -08:00

Makefile

mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol

2022-10-03 14:03:36 -07:00

mapping_dirty_helpers.c

mm: move tlb_flush_pending inline helpers to mm_inline.h

2022-01-15 16:30:27 +02:00

memblock.c

mm: add pageblock_align() macro

2022-10-03 14:03:04 -07:00

memcontrol.c

mm: vmscan: split khugepaged stats from direct reclaim stats

2022-11-30 15:58:41 -08:00

memfd.c

memfd: fix F_SEAL_WRITE after shmem huge page allocated

2022-03-05 11:08:32 -08:00

memory_hotplug.c

mm: add pageblock_aligned() macro

2022-10-03 14:03:04 -07:00

memory-failure.c

mm,hugetlb: use folio fields in second tail page

2022-11-30 15:58:46 -08:00

memory-tiers.c

memory: move hotplug memory notifier priority to same file for easy sorting

2022-11-08 17:37:17 -08:00

memory.c

mm: delay page_remove_rmap() until after the TLB has been flushed

2022-11-30 15:58:50 -08:00

mempolicy.c

mm/mempolicy: fix mbind_range() arguments to vma_merge()

2022-10-20 21:27:21 -07:00

mempool.c

mempool: do not use ksize() for poisoning

2022-11-30 15:58:41 -08:00

memremap.c

mm/memremap.c: map FS_DAX device memory as decrypted

2022-11-08 15:57:23 -08:00

memtest.c

…

migrate_device.c

mm/migrate_device: return number of migrating pages in args->cpages

2022-11-22 18:50:43 -08:00

migrate.c

mm/hugetlb: convert move_hugetlb_state() to folios

2022-11-30 15:58:43 -08:00

mincore.c

mm: convert find_get_incore_page() to filemap_get_incore_folio()

2022-11-08 17:37:18 -08:00

mlock.c

mm/mlock: drop dead code in count_mm_mlocked_page_nr()

2022-09-26 19:46:27 -07:00

mm_init.c

memory: move hotplug memory notifier priority to same file for easy sorting

2022-11-08 17:37:17 -08:00

mm_slot.h

mm: introduce common struct mm_slot

2022-10-03 14:02:43 -07:00

mmap_lock.c

…

mmap.c

Merge branch 'mm-hotfixes-stable' into mm-stable

2022-11-30 14:58:42 -08:00

mmu_gather.c

mm: delay page_remove_rmap() until after the TLB has been flushed

2022-11-30 15:58:50 -08:00

mmu_notifier.c

mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()

2022-04-21 20:01:10 -07:00

mmzone.c

mm: multi-gen LRU: groundwork

2022-09-26 19:46:09 -07:00

mprotect.c

mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite

2022-11-30 15:58:49 -08:00

mremap.c

mm: add merging after mremap resize

2022-09-26 19:46:28 -07:00

msync.c

mm/msync: use vma_find() instead of vma linked list

2022-09-26 19:46:25 -07:00

nommu.c

mm: remove the vma linked list

2022-09-26 19:46:26 -07:00

oom_kill.c

mm: reduce noise in show_mem for lowmem allocations

2022-09-26 19:46:29 -07:00

page_alloc.c

mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped

2022-11-30 15:58:48 -08:00

page_counter.c

mm: page_counter: remove unneeded atomic ops for low/min

2022-09-11 20:26:01 -07:00

page_ext.c

Merge branch 'mm-hotfixes-stable' into mm-stable

2022-11-30 14:58:42 -08:00

page_idle.c

mm: don't be stuck to rmap lock on reclaim path

2022-05-19 14:08:54 -07:00

page_io.c

swap: convert swap_writepage() to use a folio

2022-10-03 14:02:52 -07:00

page_isolation.c

mm/page_isolation: fix clang deadcode warning

2022-10-28 13:37:22 -07:00

page_owner.c

mm: reuse pageblock_start/end_pfn() macro

2022-10-03 14:03:03 -07:00

page_poison.c

…

page_reporting.c

…

page_reporting.h

…

page_table_check.c

mm: use kstrtobool() instead of strtobool()

2022-11-30 15:58:45 -08:00

page_vma_mapped.c

mm/swap: add swp_offset_pfn() to fetch PFN from swap entry

2022-09-26 19:46:05 -07:00

page-writeback.c

mm: export balance_dirty_pages_ratelimited_flags()

2022-09-26 12:28:07 +02:00

pagewalk.c

- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in

2022-10-10 17:53:04 -07:00

percpu-internal.h

percpu: improve percpu_alloc_percpu event trace

2022-05-13 07:20:18 -07:00

percpu-km.c

…

percpu-stats.c

mm: use vmalloc_array and vcalloc for array allocations

2022-03-08 09:30:46 -05:00

percpu-vm.c

…

percpu.c

mm: percpu: use kmemleak_ignore_phys() instead of kmemleak_free()

2022-07-17 17:14:47 -07:00

pgalloc-track.h

…

pgtable-generic.c

mm: avoid unnecessary flush on change_huge_pmd()

2022-05-13 07:20:05 -07:00

process_vm_access.c

…

ptdump.c

mm: pagewalk: Fix race between unmap and page walker

2022-09-03 10:13:13 -07:00

readahead.c

mm: add PSI accounting around ->read_folio and ->readahead calls

2022-09-20 08:24:38 -06:00

rmap.c

mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped

2022-11-30 15:58:48 -08:00

rodata_test.c

mm/rodata_test: use PAGE_ALIGNED() helper

2022-10-03 14:03:05 -07:00

secretmem.c

mm/secretmem: remove reduntant return value

2022-10-03 14:03:36 -07:00

shmem.c

mm: use pte markers for swap errors

2022-11-30 15:58:46 -08:00

shrinker_debug.c

mm: shrinkers: fix double kfree on shrinker name

2022-07-29 18:07:13 -07:00

shuffle.c

mm/shuffle: convert module_param_call to module_param_cb

2022-10-03 14:03:07 -07:00

shuffle.h

…

slab_common.c

- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in

2022-10-10 17:53:04 -07:00

slab.c

Random number generator fixes for Linux 6.1-rc1.

2022-10-16 15:27:07 -07:00

slab.h

- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in

2022-10-10 17:53:04 -07:00

slob.c

Merge branch 'slab/for-6.1/kmalloc_size_roundup' into slab/for-next

2022-09-29 11:30:55 +02:00

slub.c

mm/slub.c: use hotplug_memory_notifier() directly

2022-11-08 17:37:16 -08:00

sparse-vmemmap.c

mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c

2022-08-08 18:06:42 -07:00

sparse.c

mm/hwpoison: introduce per-memory_block hwpoison counter

2022-11-08 17:37:22 -08:00

swap_cgroup.c

mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled

2022-10-03 14:03:36 -07:00

swap_slots.c

mm/swap: convert put_swap_page() to put_swap_folio()

2022-10-03 14:02:46 -07:00

swap_state.c

mm: mmu_gather: prepare to gather encoded page pointers with flags

2022-11-30 15:58:50 -08:00

swap.c

mm: teach release_pages() to take an array of encoded page pointers too

2022-11-30 15:58:50 -08:00

swap.h

mm: convert find_get_incore_page() to filemap_get_incore_folio()

2022-11-08 17:37:18 -08:00

swapfile.c

mm: use pte markers for swap errors

2022-11-30 15:58:46 -08:00

truncate.c

filemap: find_get_entries() now updates start offset

2022-11-08 17:37:12 -08:00

usercopy.c

mm: use kstrtobool() instead of strtobool()

2022-11-30 15:58:45 -08:00

userfaultfd.c

mm/shmem: use page_mapping() to detect page cache for uffd continue

2022-11-08 15:57:23 -08:00

util.c

mm,thp,rmap: simplify compound page mapcount handling

2022-11-30 15:58:46 -08:00

vmalloc.c

mm: vmalloc: use trace_free_vmap_area_noflush event

2022-11-08 17:37:17 -08:00

vmpressure.c

mm/vmpressure: fix data-race with memcg->socket_pressure

2021-11-06 13:30:40 -07:00

vmscan.c

mm: vmscan: split khugepaged stats from direct reclaim stats

2022-11-30 15:58:41 -08:00

vmstat.c

mm: vmscan: split khugepaged stats from direct reclaim stats

2022-11-30 15:58:41 -08:00

workingset.c

mm: vmscan: make rotations a secondary factor in balancing anon vs file

2022-11-08 17:37:11 -08:00

z3fold.c

mm: Convert all PageMovable users to movable_operations

2022-08-02 12:34:03 -04:00

zbud.c

…

zpool.c

zpool: remove the list of pools_head

2022-01-15 16:30:31 +02:00

zsmalloc.c

zsmalloc: replace IS_ERR() with IS_ERR_VALUE()

2022-11-30 15:58:46 -08:00

zswap.c

mm/swap: remove the end_write_func argument to __swap_writepage

2022-09-11 20:25:50 -07:00