linux

iv/linux


Linus Torvalds 5df397dec7 mm: delay page_remove_rmap() until after the TLB has been flushed

When we remove a page table entry, we are very careful to only free the page after we have flushed the TLB, because other CPUs could still be using the page through stale TLB entries until after the flush.

However, we have removed the rmap entry for that page early, which means that functions like folio_mkclean() would end up not serializing with the page table lock because the page had already been made invisible to rmap.

And that is a problem, because while the TLB entry exists, we could end up with the following situation:

 (a) one CPU could come in and clean it, never seeing our mapping of the page

 (b) another CPU could continue to use the stale and dirty TLB entry and continue to write to said page

resulting in a page that has been dirtied, but then marked clean again, all while another CPU might have dirtied it some more.

End result: possibly lost dirty data.

This extends our current TLB gather infrastructure to optionally track a "should I do a delayed page_remove_rmap() for this page after flushing the TLB". It uses the newly introduced 'encoded page pointer' to do that without having to keep separate data around.

Note, this is complicated by a couple of issues:

 - we want to delay the rmap removal, but not past the page table lock, because that simplifies the memcg accounting

 - only SMP configurations want to delay TLB flushing, since on UP there are obviously no remote TLBs to worry about, and the page table lock means there are no preemption issues either

 - s390 has its own mmu_gather model that doesn't delay TLB flushing, and as a result also does not want the delayed rmap. As such, we can treat S390 like the UP case and use a common fallback for the "no delays" case.

 - we can track an enormous number of pages in our mmu_gather structure, with MAX_GATHER_BATCH_COUNT batches of MAX_TABLE_BATCH pages each, all set up to be approximately 10k pending pages.

   We do not want to have a huge number of batched pages that we then need to check for delayed rmap handling inside the page table lock.

Particularly that last point results in a noteworthy detail, where the normal page batch gathering is limited once we have delayed rmaps pending, in such a way that only the last batch (the so-called "active batch") in the mmu_gather structure can have any delayed entries.

NOTE! While the "possibly lost dirty data" sounds catastrophic, for this all to happen you need to have a user thread doing either madvise() with MADV_DONTNEED or a full re-mmap() of the area concurrently with another thread continuing to use said mapping.

So arguably this is about user space doing crazy things, but from a VM consistency standpoint it's better if we track the dirty bit properly even when user space goes off the rails.

[akpm@linux-foundation.org: fix UP build, per Linus]
Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
Link: https://lkml.kernel.org/r/20221109203051.1835763-4-torvalds@linux-foundation.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2022-11-30 15:58:50 -08:00
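The ordering the commit message describes (flush the TLB, then remove the delayed rmap entries, then free the pages) and the idea of carrying a per-page flag in the low bits of the page pointer can be illustrated with a small user-space sketch. This is not the kernel's code: every name below (`demo_page`, `demo_gather`, `demo_encode`, `DEMO_DELAY_RMAP`, and so on) is hypothetical, the batch size is arbitrary, the "TLB flush" is just a boolean, and locking is left out entirely. It only assumes that page structures are aligned so their low pointer bits are free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a page descriptor. The 8-byte alignment
 * guarantees that the low bits of its address are zero. */
struct demo_page {
    _Alignas(8) int mapcount;  /* simulated number of rmap entries */
    bool freed;                /* set once the page has been "freed" */
};

/* Low pointer bits used to carry flags (assumed layout for this sketch). */
#define DEMO_FLAG_BITS  ((uintptr_t)3)
#define DEMO_DELAY_RMAP ((uintptr_t)1)

typedef uintptr_t demo_encoded_page;

/* Pack a page pointer and a small flag value into one word. */
static inline demo_encoded_page demo_encode(struct demo_page *p, uintptr_t flags)
{
    assert(((uintptr_t)p & DEMO_FLAG_BITS) == 0);
    assert((flags & ~DEMO_FLAG_BITS) == 0);
    return (uintptr_t)p | flags;
}

static inline struct demo_page *demo_ptr(demo_encoded_page e)
{
    return (struct demo_page *)(e & ~DEMO_FLAG_BITS);
}

static inline uintptr_t demo_flags(demo_encoded_page e)
{
    return e & DEMO_FLAG_BITS;
}

#define DEMO_BATCH_MAX 8  /* arbitrary size for the sketch */

/* A single batch of pending pages, loosely modelled on a gather structure. */
struct demo_gather {
    demo_encoded_page pages[DEMO_BATCH_MAX];
    size_t nr;
    bool tlb_flushed;
};

/* Queue a page for freeing. Without the delay flag the rmap entry is
 * dropped right away; with it, removal waits for demo_gather_finish().
 * Returns false when the batch is full. */
static bool demo_gather_add(struct demo_gather *g, struct demo_page *p,
                            bool delay_rmap)
{
    if (g->nr == DEMO_BATCH_MAX)
        return false;
    if (!delay_rmap)
        p->mapcount--;
    g->pages[g->nr++] = demo_encode(p, delay_rmap ? DEMO_DELAY_RMAP : 0);
    return true;
}

/* Finish the batch in the order described in the commit message:
 * 1. flush the TLB, 2. remove delayed rmap entries, 3. free the pages. */
static void demo_gather_finish(struct demo_gather *g)
{
    g->tlb_flushed = true;  /* stands in for the real TLB flush */

    for (size_t i = 0; i < g->nr; i++)
        if (demo_flags(g->pages[i]) & DEMO_DELAY_RMAP)
            demo_ptr(g->pages[i])->mapcount--;

    for (size_t i = 0; i < g->nr; i++)
        demo_ptr(g->pages[i])->freed = true;

    g->nr = 0;
}
```

In this sketch a page queued with the delay flag keeps its simulated rmap entry until `demo_gather_finish()` has set `tlb_flushed`, so nothing that looks up the mapping count would see the page as unmapped while a stale TLB entry could still exist. The flag travels inside the pointer word itself, which mirrors the commit's point about not keeping separate data around.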
..
alpha	mm: remove kern_addr_valid() completely	2022-11-08 17:37:18 -08:00
arc	mm: remove kern_addr_valid() completely	2022-11-08 17:37:18 -08:00
arm	mm: remove kern_addr_valid() completely	2022-11-08 17:37:18 -08:00
arm64	mm: remove kern_addr_valid() completely	2022-11-08 17:37:18 -08:00
csky	mm: remove kern_addr_valid() completely	2022-11-08 17:37:18 -08:00
hexagon	mm: remove kern_addr_valid() completely	2022-11-08 17:37:18 -08:00
ia64	mm: remove kern_addr_valid() completely	2022-11-08 17:37:18 -08:00
loongarch	Merge branch 'mm-hotfixes-stable' into mm-stable	2022-11-30 14:58:42 -08:00
m68k	mm: remove kern_addr_valid() completely	2022-11-08 17:37:18 -08:00
microblaze	mm: remove kern_addr_valid() completely	2022-11-08 17:37:18 -08:00
mips	Merge branch 'mm-hotfixes-stable' into mm-stable	2022-11-30 14:58:42 -08:00
nios2	nios2: remove unused INIT_MMAP	2022-11-08 17:37:19 -08:00
openrisc	mm: remove kern_addr_valid() completely	2022-11-08 17:37:18 -08:00
parisc	mm/hwpoison: pass pfn to num_poisoned_pages_*()	2022-11-08 17:37:22 -08:00
powerpc	mm: remove unused savedwrite infrastructure	2022-11-30 15:58:49 -08:00
riscv	Merge branch 'mm-hotfixes-stable' into mm-stable	2022-11-30 14:58:42 -08:00
s390	mm: delay page_remove_rmap() until after the TLB has been flushed	2022-11-30 15:58:50 -08:00
sh	mm: remove kern_addr_valid() completely	2022-11-08 17:37:18 -08:00
sparc	Merge branch 'mm-hotfixes-stable' into mm-stable	2022-11-30 14:58:42 -08:00
um	mm: remove kern_addr_valid() completely	2022-11-08 17:37:18 -08:00
x86	Merge branch 'mm-hotfixes-stable' into mm-stable	2022-11-30 14:58:42 -08:00
xtensa	mm: remove kern_addr_valid() completely	2022-11-08 17:37:18 -08:00
.gitignore
Kconfig	- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in	2022-10-10 17:53:04 -07:00