linux

iv/linux

Aaron Lu 5d1904204c mremap: fix race between mremap() and page cleanning

Prior to 3.15, there was a race between zap_pte_range() and
page_mkclean() where writes to a page could be lost.  Dave Hansen
discovered by inspection that there is a similar race between
move_ptes() and page_mkclean().

We've been able to reproduce the issue by enlarging the race window with
a msleep(), but have not been able to hit it without modifying the code.
So we think it is a real issue, but one that is difficult or impossible
to hit in practice.

The zap_pte_range() issue was fixed by commit 1cf35d47712d ("mm: split
'tlb_flush_mmu()' into tlb flushing and memory freeing parts").  This
patch fixes the race between page_mkclean() and mremap().

Here is one possible way to hit the race: suppose a process has mmapped
a file with PROT_READ | PROT_WRITE and MAP_SHARED, and has two threads
bound to two different CPUs, e.g. CPU1 and CPU2.  mmap returned X, then
thread 1 wrote to addr X so that CPU1 now has a writable TLB entry for
addr X.  Thread 2 starts mremapping from addr X to Y while thread 1
cleans the page and then writes to the old addr X again.  The second
write from thread 1 could succeed, but the value will be lost.

        thread 1                           thread 2
     (bound to CPU1)                    (bound to CPU2)

  1: write 1 to addr X to get a
     writeable TLB on this CPU

                                        2: mremap starts

                                        3: move_ptes emptied PTE for addr X
                                           and set up new PTE for addr Y and
                                           then dropped PTL for X and Y

  4: page laundering for N by doing
     fadvise FADV_DONTNEED. When done,
     pageframe N is deemed clean.

  5: *write 2 to addr X

                                        6: tlb flush for addr X

  7: munmap (Y, pagesize) to make the
     page unmapped

  8: fadvise with FADV_DONTNEED again
     to kick the page off the pagecache

  9: pread the page from file to verify
     the value. If 1 is there, it means
     we have lost the written 2.

  * The write may or may not cause a segmentation fault, depending on
  whether the TLB entry is still on the CPU.

Please note that this is only one specific way the race could occur; it
does not mean that the race can only occur in exactly the above
configuration, e.g. more than 2 threads could be involved, fadvise()
could be done in another thread, etc.
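
The syscall sequence in the table above can be sketched in userspace as
follows.  This is only a single-threaded skeleton of steps 1-4 and 7-9:
it leaves out the racing write of step 5, so it does not reproduce the
bug.  The function name and the scratch file path are hypothetical, and
using MREMAP_FIXED to force the move from X to Y is an assumption about
how a reproducer might be written.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Runs steps 1-4 and 7-9 of the scenario in one thread and returns the
 * byte read back from the file, or -1 if the setup fails.
 */
int mremap_sequence_readback(void)
{
	char path[] = "/tmp/mremap_sketch_XXXXXX"; /* hypothetical scratch file */
	long ps = sysconf(_SC_PAGESIZE);
	unsigned char *x, *y, *moved;
	unsigned char val = 0;
	int ret = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (ps <= 0 || ftruncate(fd, ps) != 0)
		goto out;

	x = mmap(NULL, ps, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	/* Reserve a destination Y so that mremap() really moves the PTE. */
	y = mmap(NULL, ps, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (x == MAP_FAILED || y == MAP_FAILED)
		goto out;

	x[0] = 1;				/* step 1: dirty the page via X */

	/* steps 2-3: move the mapping from X to Y */
	moved = mremap(x, ps, ps, MREMAP_MAYMOVE | MREMAP_FIXED, y);
	if (moved == MAP_FAILED)
		goto out;
	assert(moved == y && moved != x);

	/* step 4: ask the kernel to launder the page (advisory only) */
	(void)posix_fadvise(fd, 0, ps, POSIX_FADV_DONTNEED);

	/* step 5 would be a write through the stale X from another thread */

	munmap(moved, ps);			/* step 7 */
	(void)posix_fadvise(fd, 0, ps, POSIX_FADV_DONTNEED); /* step 8 */

	if (pread(fd, &val, 1, 0) == 1)		/* step 9 */
		ret = val;
out:
	close(fd);
	return ret;
}
```

Without the racing write, the value read back should simply be the 1
written in step 1.  A real reproducer would add the second thread and,
as noted above, a widened race window in the kernel.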

Anonymous pages could race between mremap() and page reclaim.  For THP:
a huge PMD is moved by mremap to a new huge PMD, then the new huge PMD
gets unmapped/split/paged out before the TLB flush happens for the old
huge PMD in move_page_tables(), so we could still write data to it.
Normal anonymous pages have a similar situation.

To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and,
if there is one, do the flush before dropping the PTL.  If we did the
flush for every move_ptes()/move_huge_pmd() call then we do not need to
do the flush in move_page_tables() for the whole range.  But if we
didn't, we still need to do the whole range flush.
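
The flush decision described above can be illustrated with a toy
userspace model.  This is not the kernel code: struct toy_pte and both
toy_* functions are hypothetical stand-ins, the PTL is represented only
by comments, and a TLB flush is modelled as a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_pte {
	bool present;
	bool dirty;
};

/*
 * Moves n entries.  Returns the number of flushes done "under the
 * lock" (0 or 1).  If no flush was done here, *need_flush is set so
 * that the caller knows a whole-range flush is still required.
 */
static size_t toy_move_ptes(struct toy_pte *old_pte, struct toy_pte *new_pte,
			    size_t n, bool *need_flush)
{
	bool force_flush = false;
	size_t i;

	/* PTL would be taken here */
	for (i = 0; i < n; i++) {
		if (!old_pte[i].present)
			continue;
		if (old_pte[i].dirty)
			force_flush = true;	/* a stale writable TLB entry is dangerous */
		new_pte[i] = old_pte[i];
		old_pte[i] = (struct toy_pte){ false, false };
	}
	if (!force_flush)
		*need_flush = true;		/* defer to the caller */
	/* when force_flush is set, the flush happens before the PTL is dropped */
	return force_flush ? 1 : 0;
}

/* Moves the table chunk by chunk and returns the total flush count. */
size_t toy_move_page_tables(struct toy_pte *old_tbl, struct toy_pte *new_tbl,
			    size_t n, size_t chunk)
{
	bool need_flush = false;
	size_t flushes = 0;
	size_t off;

	assert(chunk > 0);
	for (off = 0; off < n; off += chunk) {
		size_t len = n - off < chunk ? n - off : chunk;

		flushes += toy_move_ptes(old_tbl + off, new_tbl + off, len,
					 &need_flush);
	}
	if (need_flush)
		flushes++;			/* single whole-range flush */
	return flushes;
}
```

In this model a fully clean range costs one whole-range flush, and a
range in which every chunk contains a dirty entry is flushed chunk by
chunk with no whole-range flush at the end.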

Alternatively, we can track which parts of the range were flushed in
move_ptes()/move_huge_pmd() and which were not, to avoid flushing the
whole range in move_page_tables().  But that would require multiple TLB
flushes for the different sub-ranges and should be less efficient than
the single whole range flush.

KBuild test on my Sandybridge desktop doesn't show any noticeable change.
v4.9-rc4:
  real    5m14.048s
  user    32m19.800s
  sys     4m50.320s

With this commit:
  real    5m13.888s
  user    32m19.330s
  sys     4m51.200s

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-11-17 09:46:56 -08:00

kasan

kprobes: Unpoison stack in jprobe_return() for KASAN

2016-10-16 11:02:31 +02:00

backing-dev.c

block: fix bdi vs gendisk lifetime mismatch

2016-08-04 14:19:16 -06:00

balloon_compaction.c

mm: balloon: use general non-lru movable page feature

2016-07-26 16:19:19 -07:00

bootmem.c

mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping

2016-10-11 15:06:33 -07:00

cleancache.c

cleancache: constify cleancache_ops structure

2016-01-27 09:09:57 -05:00

cma_debug.c

mm/cma_debug: correct size input to bitmap function

2015-07-17 16:39:54 -07:00

cma.c

mm/cma.c: check the max limit for cma allocation

2016-11-11 08:12:37 -08:00

cma.h

mm: cma: mark cma_bitmap_maxno() inline in header

2015-08-14 15:56:32 -07:00

compaction.c

mm, compaction: restrict fragindex to costly orders

2016-10-07 18:46:29 -07:00

debug_page_ref.c

mm/page_ref: add tracepoint to track down page reference manipulation

2016-03-17 15:09:34 -07:00

debug.c

mm: clarify why we avoid page_mapcount() for slab pages in dump_page()

2016-10-07 18:46:29 -07:00

dmapool.c

mm: convert printk(KERN_<LEVEL> to pr_<level>

2016-03-17 15:09:34 -07:00

early_ioremap.c

mm/early_ioremap: use offset_in_page macro

2015-11-05 19:34:48 -08:00

fadvise.c

mm/fadvise.c: do not discard partial pages with POSIX_FADV_DONTNEED

2016-06-09 14:23:11 -07:00

failslab.c

mm: fault-inject take over bootstrap kmem_cache check

2016-03-15 16:55:16 -07:00

filemap.c

mm/filemap: don't allow partially uptodate page for pipes

2016-11-11 08:12:37 -08:00

frame_vector.c

mm: replace get_vaddr_frames() write/force parameters with gup_flags

2016-10-19 08:11:24 -07:00

frontswap.c

mm, frontswap: convert frontswap_enabled to static key

2016-07-26 16:19:19 -07:00

gup.c

mm: unexport __get_user_pages()

2016-10-24 19:13:20 -07:00

highmem.c

mm/highmem: make nr_free_highpages() handles all highmem zones by itself

2016-05-19 19:12:14 -07:00

huge_memory.c

mremap: fix race between mremap() and page cleanning

2016-11-17 09:46:56 -08:00

hugetlb_cgroup.c

mm, hugetlb_cgroup: round limit_in_bytes down to hugepage size

2016-05-20 17:58:30 -07:00

hugetlb.c

mm/hugetlb: fix huge page reservation leak in private mapping error paths

2016-11-11 08:12:37 -08:00

hwpoison-inject.c

hwpoison: use page_cgroup_ino for filtering by memcg

2015-09-10 13:29:01 -07:00

init-mm.c

…

internal.h

mm, compaction: make full priority ignore pageblock suitability

2016-10-07 18:46:29 -07:00

interval_tree.c

mm: replace vma->sharead.linear with vma->shared

2015-02-10 14:30:31 -08:00

Kconfig

Allow KASAN and HOTPLUG_MEMORY to co-exist when doing build testing

2016-10-27 16:23:01 -07:00

Kconfig.debug

PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO

2016-09-13 02:35:27 +02:00

khugepaged.c

mm, thp: fix leaking mapped pte in __collapse_huge_page_swapin()

2016-09-19 15:36:16 -07:00

kmemcheck.c

mm: convert printk(KERN_<LEVEL> to pr_<level>

2016-03-17 15:09:34 -07:00

kmemleak-test.c

mm: convert printk(KERN_<LEVEL> to pr_<level>

2016-03-17 15:09:34 -07:00

kmemleak.c

mm: kmemleak: scan .data.ro_after_init

2016-11-11 08:12:37 -08:00

ksm.c

mm,ksm: add __GFP_HIGH to the allocation in alloc_stable_node()

2016-10-07 18:46:29 -07:00

list_lru.c

mm/list_lru.c: avoid error-path NULL pointer deref

2016-10-27 18:43:42 -07:00

maccess.c

x86: remove more uaccess_32.h complexity

2016-05-22 17:21:27 -07:00

madvise.c

mm: make mmap_sem for write waits killable for mm syscalls

2016-05-23 17:04:14 -07:00

Makefile

Disable the __builtin_return_address() warning globally after all

2016-10-12 10:23:41 -07:00

memblock.c

mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping

2016-10-11 15:06:33 -07:00

memcontrol.c

mm: memcontrol: do not recurse in direct reclaim

2016-10-27 18:43:43 -07:00

memory_hotplug.c

mm: remove unused variable in memory hotplug

2016-10-27 15:49:12 -07:00

memory-failure.c

mm: hwpoison: fix thp split handling in memory_failure()

2016-11-11 08:12:37 -08:00

memory.c

mm: replace access_process_vm() write parameter with gup_flags

2016-10-19 08:31:25 -07:00

mempolicy.c

mm: replace get_user_pages() write/force parameters with gup_flags

2016-10-19 08:11:43 -07:00

mempool.c

Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"

2016-07-28 16:07:41 -07:00

memtest.c

memtest: remove unused header files

2015-09-08 15:35:28 -07:00

migrate.c

mm: vm_page_prot: update with WRITE_ONCE/READ_ONCE

2016-10-07 18:46:29 -07:00

mincore.c

mm, swap: use offset of swap entry as key of swap cache

2016-10-07 18:46:28 -07:00

mlock.c

mm: mlock: avoid increase mm->locked_vm on mlock() when already mlock2(,MLOCK_ONFAULT)

2016-10-07 18:46:28 -07:00

mm_init.c

mm: convert printk(KERN_<LEVEL> to pr_<level>

2016-03-17 15:09:34 -07:00

mmap.c

mm: vma_merge: correct false positive from __vma_unlink->validate_mm_rb

2016-10-07 18:46:29 -07:00

mmu_context.c

mm/mmu_context, sched/core: Fix mmu_context.h assumption

2016-04-28 11:44:19 +02:00

mmu_notifier.c

fix Christoph's email addresses

2016-03-17 15:09:34 -07:00

mmzone.c

mm, page_alloc: inline the fast path of the zonelist iterator

2016-05-19 19:12:14 -07:00

mprotect.c

mm/numa: Remove duplicated include from mprotect.c

2016-10-19 17:28:48 +02:00

mremap.c

mremap: fix race between mremap() and page cleanning

2016-11-17 09:46:56 -08:00

msync.c

mm/msync: use offset_in_page macro

2015-11-05 19:34:48 -08:00

nobootmem.c

mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping

2016-10-11 15:06:33 -07:00

nommu.c

mm: unexport __get_user_pages()

2016-10-24 19:13:20 -07:00

oom_kill.c

oom: print nodemask in the oom report

2016-10-07 18:46:29 -07:00

page_alloc.c

mm: remove extra newline from allocation stall warning

2016-11-11 08:12:37 -08:00

page_counter.c

mm: page_counter: let page_counter_try_charge() return bool

2015-11-05 19:34:48 -08:00

page_ext.c

mm/page_ext: support extra space allocation by page_ext user

2016-10-07 18:46:27 -07:00

page_idle.c

mm, vmscan: move lru_lock to the node

2016-07-28 16:07:41 -07:00

page_io.c

mm/page_io.c: replace some BUG_ON()s with VM_BUG_ON_PAGE()

2016-10-07 18:46:29 -07:00

page_isolation.c

mm/page_isolation: fix typo: "paes" -> "pages"

2016-10-07 18:46:29 -07:00

page_owner.c

mm/page_owner: don't define fields on struct page_ext by hard-coding

2016-10-07 18:46:27 -07:00

page_poison.c

mm: check the return value of lookup_page_ext for all call sites

2016-06-03 15:06:22 -07:00

page-writeback.c

mm: don't use radix tree writeback tags for pages in swap cache

2016-10-07 18:46:28 -07:00

pagewalk.c

thp: rename split_huge_page_pmd() to split_huge_pmd()

2016-01-15 17:56:32 -08:00

percpu-km.c

mm: percpu: use pr_fmt to prefix output

2016-03-17 15:09:34 -07:00

percpu-vm.c

…

percpu.c

mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()

2016-10-05 11:52:55 -04:00

pgtable-generic.c

mm/thp/migration: switch from flush_tlb_range to flush_pmd_tlb_range

2016-03-17 15:09:34 -07:00

process_vm_access.c

mm: remove write/force parameters from __get_user_pages_unlocked()

2016-10-18 14:13:37 -07:00

quicklist.c

fix Christoph's email addresses

2016-03-17 15:09:34 -07:00

readahead.c

mm: silently skip readahead for DAX inodes

2016-08-26 17:39:35 -07:00

rmap.c

rmap: fix compound check logic in page_remove_file_rmap

2016-08-10 16:40:56 -07:00

shmem.c

shmem: fix pageflags after swapping DMA32 object

2016-11-11 08:12:37 -08:00

slab_common.c

memcg: prevent memcg caches to be both OFF_SLAB & OBJFREELIST_SLAB

2016-11-11 08:12:37 -08:00

slab.c

mm/slab: improve performance of gathering slabinfo stats

2016-10-27 18:43:43 -07:00

slab.h

mm/slab: improve performance of gathering slabinfo stats

2016-10-27 18:43:43 -07:00

slob.c

mm: slab: free kmem_cache_node after destroy sysfs file

2016-02-18 16:23:24 -08:00

slub.c

slub: Convert to hotplug state machine

2016-09-06 18:30:20 +02:00

sparse-vmemmap.c

treewide: replace obsolete _refok by __ref

2016-08-02 17:31:41 -04:00

sparse.c

treewide: replace obsolete _refok by __ref

2016-08-02 17:31:41 -04:00

swap_cgroup.c

mm: convert printk(KERN_<LEVEL> to pr_<level>

2016-03-17 15:09:34 -07:00

swap_state.c

mm, swap: use offset of swap entry as key of swap cache

2016-10-07 18:46:28 -07:00

swap.c

thp: reduce usage of huge zero page's atomic counter

2016-10-07 18:46:28 -07:00

swapfile.c

swapfile: fix memory corruption via malformed swapfile

2016-11-11 08:12:37 -08:00

truncate.c

truncate: handle file thp

2016-07-26 16:19:19 -07:00

usercopy.c

mm: usercopy: Check for module addresses

2016-09-20 16:07:39 -07:00

userfaultfd.c

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

2016-04-04 10:41:08 -07:00

util.c

Merge branch 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2016-10-22 09:39:10 -07:00

vmacache.c

mm: unrig VMA cache hit ratio

2016-10-07 18:46:27 -07:00

vmalloc.c

mm: consolidate warn_alloc_failed users

2016-10-07 18:46:29 -07:00

vmpressure.c

mm/vmpressure.c: fix subtree pressure detection

2016-02-03 08:28:43 -08:00

vmscan.c

mm: memcontrol: do not recurse in direct reclaim

2016-10-27 18:43:43 -07:00

vmstat.c

seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char

2016-10-07 18:46:30 -07:00

workingset.c

mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()

2016-09-30 15:26:52 -07:00

z3fold.c

mm/z3fold.c: avoid modifying HEADLESS page and minor cleanup

2016-06-03 16:02:55 -07:00

zbud.c

mm/zbud.c: use list_last_entry() instead of list_tail_entry()

2016-01-15 11:40:52 -08:00

zpool.c

mm: zsmalloc: constify struct zs_pool name

2015-11-06 17:50:42 -08:00

zsmalloc.c

zsmalloc: Delete an unnecessary check before the function call "iput"

2016-07-28 16:07:41 -07:00

zswap.c

mm/zswap: use workqueue to destroy pool

2016-05-20 17:58:30 -07:00