linux/mm
Vineeth Remanan Pillai b56a2d8af9 mm: rid swapoff of quadratic complexity
This patch was initially posted by Kelley Nielsen.  Reposting the patch
with all review comments addressed and with minor modifications and
optimizations.  Also, folding in the fixes offered by Hugh Dickins and
Huang Ying.  Tests were rerun and commit message updated with new
results.

try_to_unuse() is of quadratic complexity, with a lot of wasted effort.
It unuses swap entries one by one, potentially iterating over all the
page tables for all the processes in the system for each one.

This new proposed implementation of try_to_unuse simplifies its
complexity to linear.  It iterates over the system's mms once, unusing
all the affected entries as it walks each set of page tables.  It also
makes similar changes to shmem_unuse.

Improvement

swapoff was called on a swap partition containing about 6G of data, in a
VM(8cpu, 16G RAM), and calls to unuse_pte_range() were counted.

Present implementation....about 1200M calls(8min, avg 80% cpu util).
Prototype.................about  9.0K calls(3min, avg 5% cpu util).

Details

In shmem_unuse(), iterate over the shmem_swaplist and, for each
shmem_inode_info that contains a swap entry, pass it to
shmem_unuse_inode(), along with the swap type.  In shmem_unuse_inode(),
iterate over its associated xarray, and store the index and value of
each swap entry in an array for passing to shmem_swapin_page() outside
of the RCU critical section.

In try_to_unuse(), instead of iterating over the entries in the type and
unusing them one by one, perhaps walking all the page tables for all the
processes for each one, iterate over the mmlist, making one pass.  Pass
each mm to unuse_mm() to begin its page table walk, and during the walk,
unuse all the ptes that have backing store in the swap type received by
try_to_unuse().  After the walk, check the type for orphaned swap
entries with find_next_to_unuse(), and remove them from the swap cache.
If find_next_to_unuse() starts over at the beginning of the type, repeat
the check of the shmem_swaplist and the walk a maximum of three times.

Change unuse_mm() and the intervening walk functions down to
unuse_pte_range() to take the type as a parameter, and to iterate over
their entire range, calling the next function down on every iteration.
In unuse_pte_range(), make a swap entry from each pte in the range using
the passed in type.  If it has backing store in the type, call
swapin_readahead() to retrieve the page and pass it to unuse_pte().

Pass the count of pages_to_unuse down the page table walks in
try_to_unuse(), and return from the walk when the desired number of
pages has been swapped back in.

Link: http://lkml.kernel.org/r/20190114153129.4852-2-vpillai@digitalocean.com
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:18 -08:00
..
kasan kasan: fix coccinelle warnings in kasan_p*_table 2019-03-05 21:07:13 -08:00
backing-dev.c writeback: synchronize sync(2) against cgroup writeback membership switches 2019-01-22 14:39:38 -07:00
balloon_compaction.c
cleancache.c mm: use octal not symbolic permissions 2018-06-15 07:55:25 +09:00
cma_debug.c mm: no need to check return value of debugfs_create functions 2019-03-05 21:07:17 -08:00
cma.c kasan, mm, arm64: tag non slab memory allocated via pagealloc 2018-12-28 12:11:44 -08:00
cma.h
compaction.c mm, compaction: capture a page under direct compaction 2019-03-05 21:07:17 -08:00
debug_page_ref.c
debug.c mm/debug.c: fix __dump_page() for poisoned pages 2019-02-21 09:01:00 -08:00
dmapool.c mm: use octal not symbolic permissions 2018-06-15 07:55:25 +09:00
early_ioremap.c
fadvise.c vfs: implement readahead(2) using POSIX_FADV_WILLNEED 2018-08-30 20:01:32 +02:00
failslab.c mm: no need to check return value of debugfs_create functions 2019-03-05 21:07:17 -08:00
filemap.c mm/filemap: pass inclusive 'end_byte' parameter to filemap_range_has_page 2019-03-05 21:07:16 -08:00
frame_vector.c
frontswap.c mm: use octal not symbolic permissions 2018-06-15 07:55:25 +09:00
gup_benchmark.c mm: no need to check return value of debugfs_create functions 2019-03-05 21:07:17 -08:00
gup.c mm/gup: fix gup_pmd_range() for dax 2019-02-12 16:33:18 -08:00
highmem.c mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
hmm.c mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm 2018-12-28 12:11:52 -08:00
huge_memory.c mm: no need to check return value of debugfs_create functions 2019-03-05 21:07:17 -08:00
hugetlb_cgroup.c mm: rename page_counter's count/limit into usage/max 2018-06-07 17:34:35 -07:00
hugetlb.c mm/hugetlb: add prot_modify_start/commit sequence for hugetlb update 2019-03-05 21:07:18 -08:00
hwpoison-inject.c
init-mm.c mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids 2018-07-17 09:35:30 +02:00
internal.h mm, compaction: capture a page under direct compaction 2019-03-05 21:07:17 -08:00
interval_tree.c
Kconfig ksm: replace jhash2 with xxhash 2018-12-28 12:11:46 -08:00
Kconfig.debug mm/page_owner: move config option to mm/Kconfig.debug 2019-03-05 21:07:18 -08:00
khugepaged.c mm/mmu_notifier: use structure for invalidate_range_start/end calls v2 2018-12-28 12:11:50 -08:00
kmemleak-test.c
kmemleak.c kmemleak: account for tagged pointers when calculating pointer range 2019-02-21 09:01:00 -08:00
ksm.c mm: reuse only-pte-mapped KSM page in do_wp_page() 2019-03-05 21:07:15 -08:00
list_lru.c mm/list_lru: introduce list_lru_shrink_walk_irq() 2018-08-17 16:20:32 -07:00
maccess.c Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses" 2019-02-25 09:10:51 -08:00
madvise.c mm/mmu_notifier: use structure for invalidate_range_start/end calls v2 2018-12-28 12:11:50 -08:00
Makefile mm: remove nobootmem 2018-10-31 08:54:16 -07:00
memblock.c mm: no need to check return value of debugfs_create functions 2019-03-05 21:07:17 -08:00
memcontrol.c memcg: killed threads should not invoke memcg OOM killer 2019-03-05 21:07:18 -08:00
memfd.c memfd: Convert memfd_tag_pins to XArray 2018-10-21 10:46:41 -04:00
memory_hotplug.c mm: remove extra drain pages on pcp list 2019-03-05 21:07:15 -08:00
memory-failure.c mm: hwpoison: fix thp split handing in soft_offline_in_use_page() 2019-03-05 21:07:13 -08:00
memory.c mm: update ptep_modify_prot_commit to take old pte value as arg 2019-03-05 21:07:18 -08:00
mempolicy.c mm, mempolicy: fix uninit memory access 2019-03-05 21:07:18 -08:00
mempool.c mm/mempool.c: add missing parameter description 2018-08-22 10:52:44 -07:00
memtest.c
migrate.c mm: fix some typos in mm directory 2019-03-05 21:07:18 -08:00
mincore.c Revert "Change mincore() to count "mapped" pages rather than "cached" pages" 2019-01-24 09:04:37 +13:00
mlock.c dax: remove VM_MIXEDMAP for fsdax and device dax 2018-08-17 16:20:27 -07:00
mm_init.c mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
mmap.c mm: fix some typos in mm directory 2019-03-05 21:07:18 -08:00
mmu_context.c
mmu_gather.c mm: Replace call_rcu_sched() with call_rcu() 2018-11-27 09:21:46 -08:00
mmu_notifier.c mm/mmu_notifier: use structure for invalidate_range_start/end calls v2 2018-12-28 12:11:50 -08:00
mmzone.c
mprotect.c mm: update ptep_modify_prot_commit to take old pte value as arg 2019-03-05 21:07:18 -08:00
mremap.c mm: speed up mremap by 20x on large regions 2019-01-04 13:13:48 -08:00
msync.c
nommu.c mm/gup: cache dev_pagemap while pinning pages 2018-10-26 16:38:15 -07:00
oom_kill.c mm, oom: remove 'prefer children over parent' heuristic 2019-03-05 21:07:17 -08:00
page_alloc.c mm/page_alloc.c: check return value of memblock_alloc_node_nopanic() 2019-03-05 21:07:18 -08:00
page_counter.c memcg: introduce memory.min 2018-06-07 17:34:36 -07:00
page_ext.c mm: replace all open encodings for NUMA_NO_NODE 2019-03-05 21:07:14 -08:00
page_idle.c mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
page_io.c mm/page_io.c: fix polled swap page in 2019-01-04 13:13:48 -08:00
page_isolation.c mm: only report isolation failures when offlining memory 2018-12-28 12:11:46 -08:00
page_owner.c mm: no need to check return value of debugfs_create functions 2019-03-05 21:07:17 -08:00
page_poison.c page_poison: play nicely with KASAN 2019-03-05 21:07:13 -08:00
page_vma_mapped.c mm/rmap: map_pte() was not handling private ZONE_DEVICE page properly 2018-10-31 08:54:11 -07:00
page-writeback.c mm/page-writeback.c: don't break integrity writeback on ->writepage() error 2018-12-28 12:11:49 -08:00
pagewalk.c mm: kernel-doc: add missing parameter descriptions 2018-04-05 21:36:27 -07:00
percpu-internal.h
percpu-km.c percpu: convert spin_lock_irq to spin_lock_irqsave. 2018-12-18 09:04:08 -08:00
percpu-stats.c treewide: Use array_size() in vmalloc() 2018-06-12 16:19:22 -07:00
percpu-vm.c percpu: allow select gfp to be passed to underlying allocators 2018-02-18 05:33:01 -08:00
percpu.c Merge branch 'for-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu 2018-11-01 09:27:57 -07:00
pgtable-generic.c x86/mm: Page size aware flush_tlb_mm_range() 2018-10-09 16:51:11 +02:00
process_vm_access.c mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors 2018-02-06 18:32:48 -08:00
quicklist.c
readahead.c mm/readahead.c: simplify get_next_ra_size() 2018-12-28 12:11:46 -08:00
rmap.c mm/mmu_notifier: mm/rmap.c: Fix a mmu_notifier range bug in try_to_unmap_one 2019-01-10 02:58:21 -08:00
rodata_test.c
shmem.c mm: rid swapoff of quadratic complexity 2019-03-05 21:07:18 -08:00
slab_common.c mm, memcg: create mem_cgroup_from_seq 2019-03-05 21:07:17 -08:00
slab.c mm/slab.c: kmemleak no scan alien caches 2019-03-05 21:07:14 -08:00
slab.h memcg: localize memcg_kmem_enabled() check 2019-03-05 21:07:15 -08:00
slob.c slab: __GFP_ZERO is incompatible with a constructor 2018-06-07 17:34:34 -07:00
slub.c mm: fix some typos in mm directory 2019-03-05 21:07:18 -08:00
sparse-vmemmap.c mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
sparse.c mm, sparse: pass nid instead of pgdat to sparse_add_one_section() 2018-12-28 12:11:49 -08:00
swap_cgroup.c
swap_slots.c mm, swap, get_swap_pages: use entry_size instead of cluster in parameter 2018-08-22 10:52:44 -07:00
swap_state.c mm: swap: add comment for swap_vma_readahead 2019-03-05 21:07:16 -08:00
swap.c mm: handle lru_add_drain_all for UP properly 2019-02-21 09:01:00 -08:00
swapfile.c mm: rid swapoff of quadratic complexity 2019-03-05 21:07:18 -08:00
truncate.c mm: cleancache: fix corruption on missed inode invalidation 2018-11-30 14:56:14 -08:00
usercopy.c mm/usercopy.c: no check page span for stack objects 2019-01-08 17:15:11 -08:00
userfaultfd.c hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization" 2019-01-08 17:15:11 -08:00
util.c mm: don't let userspace spam allocations warnings 2019-02-21 09:01:01 -08:00
vmacache.c mm: get rid of vmacache_flush_all() entirely 2018-09-13 15:18:04 -10:00
vmalloc.c mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512! 2019-03-05 21:07:17 -08:00
vmpressure.c mm/vmpressure.c: convert to use match_string() helper 2018-06-07 17:34:36 -07:00
vmscan.c mm/vmscan.c: remove 7th argument of isolate_lru_pages() 2019-03-05 21:07:18 -08:00
vmstat.c mm: no need to check return value of debugfs_create functions 2019-03-05 21:07:17 -08:00
workingset.c mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
z3fold.c z3fold: fix possible reclaim races 2018-11-18 10:15:09 -08:00
zbud.c mm: docs: fix parameter names mismatch 2018-02-06 18:32:48 -08:00
zpool.c mm/zpool.c: zpool_evictable: fix mismatch in parameter name and kernel-doc 2018-02-21 15:35:43 -08:00
zsmalloc.c mm/zsmalloc.c: fix fall-through annotation 2018-10-26 16:26:35 -07:00
zswap.c mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00