// SPDX-License-Identifier: GPL-2.0-only
/*
 *	linux/mm/filemap.c
 *
 * Copyright (C) 1994-1999  Linus Torvalds
 */

/*
 * This file handles the generic file mmap semantics used by
 * most "normal" filesystems (but you don't /have/ to use this:
 * the NFS filesystem used to do this differently, for example)
 */
#include <linux/export.h>
#include <linux/compiler.h>
#include <linux/dax.h>
#include <linux/fs.h>
#include <linux/sched/signal.h>
#include <linux/uaccess.h>
#include <linux/capability.h>
#include <linux/kernel_stat.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/syscalls.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
#include <linux/file.h>
#include <linux/uio.h>
#include <linux/error-injection.h>
#include <linux/hash.h>
#include <linux/writeback.h>
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/security.h>
#include <linux/cpuset.h>
#include <linux/hugetlb.h>
#include <linux/memcontrol.h>
#include <linux/shmem_fs.h>
#include <linux/rmap.h>
#include <linux/delayacct.h>
#include <linux/psi.h>
#include <linux/ramfs.h>
#include <linux/page_idle.h>
#include <linux/migrate.h>
#include <linux/pipe_fs_i.h>
#include <linux/splice.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
#include <trace/events/filemap.h>
/*
 * FIXME: remove all knowledge of the buffer layer from the core VM
 */
#include <linux/buffer_head.h> /* for try_to_free_buffers */

#include <asm/mman.h>
#include "swap.h"

/*
 * Shared mappings implemented 30.11.1994. It's not fully working yet,
 * though.
 *
 * Shared mappings now work. 15.8.1995  Bruno.
 *
 * finished 'unifying' the page and buffer cache and SMP-threaded the
 * page-cache, 21.05.1999, Ingo Molnar <mingo@redhat.com>
 *
 * SMP-threaded pagemap-LRU 1999, Andrea Arcangeli <andrea@suse.de>
 */
/*
 * Lock ordering:
 *
 *  ->i_mmap_rwsem		(truncate_pagecache)
 *    ->private_lock		(__free_pte->block_dirty_folio)
 *      ->swap_lock		(exclusive_swap_page, others)
 *        ->i_pages lock
 *
 *  ->i_rwsem
 *    ->invalidate_lock		(acquired by fs in truncate path)
 *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)
 *
 *  ->mmap_lock
 *    ->i_mmap_rwsem
 *      ->page_table_lock or pte_lock	(various, mainly in memory.c)
 *        ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
 *
 *  ->mmap_lock
 *    ->invalidate_lock		(filemap_fault)
 *      ->lock_page		(filemap_fault, access_process_vm)
 *
 *  ->i_rwsem			(generic_perform_write)
 *    ->mmap_lock		(fault_in_readable->do_page_fault)
 *
 *  bdi->wb.list_lock
 *    sb_lock			(fs/fs-writeback.c)
 *    ->i_pages lock		(__sync_single_inode)
 *
 *  ->i_mmap_rwsem
 *    ->anon_vma.lock		(vma_merge)
 *
 *  ->anon_vma.lock
 *    ->page_table_lock or pte_lock	(anon_vma_prepare and various)
 *
 *  ->page_table_lock or pte_lock
 *    ->swap_lock		(try_to_unmap_one)
 *    ->private_lock		(try_to_unmap_one)
 *    ->i_pages lock		(try_to_unmap_one)
 *    ->lruvec->lru_lock	(follow_page->mark_page_accessed)
 *    ->lruvec->lru_lock	(check_pte_range->isolate_lru_page)
 *    ->private_lock		(page_remove_rmap->set_page_dirty)
 *    ->i_pages lock		(page_remove_rmap->set_page_dirty)
 *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
 *    ->inode->i_lock		(page_remove_rmap->set_page_dirty)
 *    ->memcg->move_lock	(page_remove_rmap->folio_memcg_lock)
 *    bdi.wb->list_lock		(zap_pte_range->set_page_dirty)
 *    ->inode->i_lock		(zap_pte_range->set_page_dirty)
 *    ->private_lock		(zap_pte_range->block_dirty_folio)
 *
 * ->i_mmap_rwsem
 *   ->tasklist_lock		(memory_failure, collect_procs_ao)
 */
static void page_cache_delete(struct address_space *mapping,
				   struct folio *folio, void *shadow)
{
	XA_STATE(xas, &mapping->i_pages, folio->index);
	long nr = 1;

	mapping_set_update(&xas, mapping);

	/* hugetlb pages are represented by a single entry in the xarray */
	if (!folio_test_hugetlb(folio)) {
		xas_set_order(&xas, folio->index, folio_order(folio));
		nr = folio_nr_pages(folio);
	}

	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
	xas_store(&xas, shadow);
	xas_init_marks(&xas);
mm: filemap: don't plant shadow entries without radix tree node
When the underflow checks were added to workingset_node_shadow_dec(),
they triggered immediately:
kernel BUG at ./include/linux/swap.h:276!
invalid opcode: 0000 [#1] SMP
Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
RIP: page_cache_tree_insert+0xf1/0x100
Call Trace:
__add_to_page_cache_locked+0x12e/0x270
add_to_page_cache_lru+0x4e/0xe0
mpage_readpages+0x112/0x1d0
blkdev_readpages+0x1d/0x20
__do_page_cache_readahead+0x1ad/0x290
force_page_cache_readahead+0xaa/0x100
page_cache_sync_readahead+0x3f/0x50
generic_file_read_iter+0x5af/0x740
blkdev_read_iter+0x35/0x40
__vfs_read+0xe1/0x130
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x13/0x8f
Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00
RIP page_cache_tree_insert+0xf1/0x100
This is a long-standing bug in the way shadow entries are accounted in
the radix tree nodes. The shrinker needs to know when radix tree nodes
contain only shadow entries, no pages, so node->count is split in half
to count shadows in the upper bits and pages in the lower bits.
Unfortunately, the radix tree implementation doesn't know of this and
assumes all entries are in node->count. When there is a shadow entry
directly in root->rnode and the tree is later extended, the radix tree
implementation will copy that entry into the new node and bump its
node->count, i.e. increase the page count bits. Once the shadow gets
removed and we subtract from the upper counter, node->count underflows
and triggers the warning. Afterwards, without node->count reaching 0
again, the radix tree node is leaked.
Limit shadow entries to when we have actual radix tree nodes and can
count them properly. That means we lose the ability to detect refaults
from files that had only the first page faulted in at eviction time.
Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-04 23:02:08 +03:00
2021-05-08 07:35:49 +03:00
folio->mapping = NULL;
2017-11-16 04:37:26 +03:00
/* Leave page->index set: truncation lookup relies upon it */
2016-10-04 23:02:08 +03:00
mapping->nrpages -= nr;
2014-04-04 01:47:49 +04:00
}
2021-05-09 03:04:05 +03:00
static void filemap_unaccount_folio(struct address_space *mapping,
		struct folio *folio)
2005-04-17 02:20:36 +04:00
{
2021-05-09 03:04:05 +03:00
long nr;
2005-04-17 02:20:36 +04:00
2021-05-09 03:04:05 +03:00
VM_BUG_ON_FOLIO(folio_mapped(folio), folio);
if (!IS_ENABLED(CONFIG_DEBUG_VM) && unlikely(folio_mapped(folio))) {
mm: __delete_from_page_cache show Bad page if mapped
Commit e1534ae95004 ("mm: differentiate page_mapped() from
page_mapcount() for compound pages") changed the famous
BUG_ON(page_mapped(page)) in __delete_from_page_cache() to
VM_BUG_ON_PAGE(page_mapped(page)): which gives us more info when
CONFIG_DEBUG_VM=y, but nothing at all when not.
Although it has not usually been very helpful, being hit long after the
error in question, we do need to know if it actually happens on users'
systems; but reinstating a crash there is likely to be opposed :)
In the non-debug case, pr_alert("BUG: Bad page cache") plus dump_page(),
dump_stack(), add_taint() - I don't really believe LOCKDEP_NOW_UNRELIABLE,
but that seems to be the standard procedure now. Move that, or the
VM_BUG_ON_PAGE(), up before the deletion from tree: so that the
unNULLified page->mapping gives a little more information.
If the inode is being evicted (rather than truncated), it won't have any
vmas left, so it's safe(ish) to assume that the raised mapcount is
erroneous, and we can discount it from page_count to avoid leaking the
page (I'm less worried by leaking the occasional 4kB, than losing a
potential 2MB page with each 4kB page leaked).
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-10 01:08:07 +03:00
pr_alert("BUG: Bad page cache in process %s  pfn:%05lx\n",
2021-05-09 03:04:05 +03:00
	 current->comm, folio_pfn(folio));
dump_page(&folio->page, "still mapped when deleted");
2016-03-10 01:08:07 +03:00
dump_stack();
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
2022-03-25 04:09:52 +03:00
if (mapping_exiting(mapping) && !folio_test_large(folio)) {
	int mapcount = page_mapcount(&folio->page);
	if (folio_ref_count(folio) >= mapcount + 2) {
		/*
		 * All vmas have already been torn down, so it's
		 * a good bet that actually the page is unmapped
		 * and we'd rather not leak it: if we're wrong,
		 * another bad page check should catch it later.
		 */
		page_mapcount_reset(&folio->page);
		folio_ref_sub(folio, mapcount);
	}
2016-03-10 01:08:07 +03:00
}
}
2021-05-09 03:04:05 +03:00
/* hugetlb folios do not participate in page cache accounting. */
if (folio_test_hugetlb(folio))
2017-11-16 04:37:29 +03:00
	return;
2017-07-11 01:47:35 +03:00
2021-05-09 03:04:05 +03:00
nr = folio_nr_pages(folio);
2017-11-16 04:37:29 +03:00
2021-05-09 03:04:05 +03:00
__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
if (folio_test_swapbacked(folio)) {
	__lruvec_stat_mod_folio(folio, NR_SHMEM, -nr);
	if (folio_test_pmd_mappable(folio))
		__lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr);
} else if (folio_test_pmd_mappable(folio)) {
	__lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr);
2019-09-24 01:38:03 +03:00
	filemap_nr_thps_dec(mapping);
2016-07-27 01:26:18 +03:00
}
2017-11-16 04:37:29 +03:00
/*
2021-05-09 03:04:05 +03:00
 * At this point folio must be either written or cleaned by
 * truncate. Dirty folio here signals a bug and loss of
2022-03-25 04:13:59 +03:00
 * unwritten data - on ordinary filesystems.
2017-11-16 04:37:29 +03:00
 *
2022-03-25 04:13:59 +03:00
 * But it's harmless on in-memory filesystems like tmpfs; and can
 * occur when a driver which did get_user_pages() sets page dirty
 * before putting it, while the inode is being finally evicted.
 *
 * Below fixes dirty accounting after removing the folio entirely
2021-05-09 03:04:05 +03:00
 * but leaves the dirty flag set: it has no effect for truncated
 * folio and anyway will be cleared before returning folio to
2017-11-16 04:37:29 +03:00
 * buddy allocator.
 */
2022-03-25 04:13:59 +03:00
if (WARN_ON_ONCE(folio_test_dirty(folio) &&
		 mapping_can_writeback(mapping)))
	folio_account_cleaned(folio, inode_to_wb(mapping->host));
2017-11-16 04:37:29 +03:00
}
/*
* Delete a page from the page cache and free it. Caller has to make
* sure the page is locked and that nobody else uses it - or that usage
2018-04-11 02:36:56 +03:00
* is safe. The caller must hold the i_pages lock.
2017-11-16 04:37:29 +03:00
*/
2021-05-09 16:33:42 +03:00
void __filemap_remove_folio(struct folio *folio, void *shadow)
2017-11-16 04:37:29 +03:00
{
2021-05-09 16:33:42 +03:00
struct address_space *mapping = folio->mapping;
2017-11-16 04:37:29 +03:00
2021-07-23 16:29:46 +03:00
trace_mm_filemap_delete_from_page_cache(folio);
2021-05-09 03:04:05 +03:00
filemap_unaccount_folio(mapping, folio);
2021-05-08 07:35:49 +03:00
page_cache_delete(mapping, folio, shadow);
2005-04-17 02:20:36 +04:00
}
2021-07-28 22:52:34 +03:00
void filemap_free_folio(struct address_space *mapping, struct folio *folio)
2017-11-16 04:37:18 +03:00
{
2022-05-01 14:35:31 +03:00
void (*free_folio)(struct folio *);
2022-01-07 21:03:48 +03:00
int refs = 1;
2017-11-16 04:37:18 +03:00
2022-05-01 14:35:31 +03:00
free_folio = mapping->a_ops->free_folio;
if (free_folio)
	free_folio(folio);
2017-11-16 04:37:18 +03:00
2022-01-07 21:03:48 +03:00
if (folio_test_large(folio) && !folio_test_hugetlb(folio))
	refs = folio_nr_pages(folio);
folio_put_refs(folio, refs);
2017-11-16 04:37:18 +03:00
}
2011-03-23 02:32:43 +03:00
/**
2021-05-09 16:33:42 +03:00
* filemap_remove_folio - Remove folio from page cache.
* @folio: The folio.
2011-03-23 02:32:43 +03:00
*
2021-05-09 16:33:42 +03:00
* This must be called only on folios that are locked and have been
* verified to be in the page cache. It will never put the folio into
* the free list because the caller has a reference on the page.
2011-03-23 02:32:43 +03:00
*/
2021-05-09 16:33:42 +03:00
void filemap_remove_folio(struct folio *folio)
2005-04-17 02:20:36 +04:00
{
2021-05-09 16:33:42 +03:00
struct address_space *mapping = folio->mapping;
2005-04-17 02:20:36 +04:00
2021-05-09 16:33:42 +03:00
BUG_ON(!folio_test_locked(folio));
vfs: keep inodes with page cache off the inode shrinker LRU
Historically (pre-2.5), the inode shrinker used to reclaim only empty
inodes and skip over those that still contained page cache. This caused
problems on highmem hosts: struct inodes could fill lowmem zones
before the cache was getting reclaimed in the highmem zones.
To address this, the inode shrinker started to strip page cache to
facilitate reclaiming lowmem. However, this comes with its own set of
problems: the shrinkers may drop actively used page cache just because
the inodes are not currently open or dirty - think working with a large
git tree. It further doesn't respect cgroup memory protection settings
and can cause priority inversions between containers.
Nowadays, the page cache also holds non-resident info for evicted cache
pages in order to detect refaults. We've come to rely heavily on this
data inside reclaim for protecting the cache workingset and driving swap
behavior. We also use it to quantify and report workload health through
psi. The latter in turn is used for fleet health monitoring, as well as
driving automated memory sizing of workloads and containers, proactive
reclaim and memory offloading schemes.
The consequences of dropping page cache prematurely is that we're seeing
subtle and not-so-subtle failures in all of the above-mentioned
scenarios, with the workload generally entering unexpected thrashing
states while losing the ability to reliably detect it.
To fix this on non-highmem systems at least, going back to rotating
inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7
("mm: don't reclaim inodes with many attached pages")) and failed
(commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many
attached pages"")).
The issue is mostly that shrinker pools attract pressure based on their
size, and when objects get skipped the shrinkers remember this as
deferred reclaim work. This accumulates excessive pressure on the
remaining inodes, and we can quickly eat into heavily used ones, or
dirty ones that require IO to reclaim, when there potentially is plenty
of cold, clean cache around still.
Instead, this patch keeps populated inodes off the inode LRU in the
first place - just like an open file or dirty state would. An otherwise
clean and unused inode then gets queued when the last cache entry
disappears. This solves the problem without reintroducing the reclaim
issues, and generally is a bit more scalable than having to wade through
potentially hundreds of thousands of busy inodes.
Locking is a bit tricky because the locks protecting the inode state
(i_lock) and the inode LRU (lru_list.lock) don't nest inside the
irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
serialized through i_lock, taken before the i_pages lock, to make sure
depopulated inodes are queued reliably. Additions may race with
deletions, but we'll check again in the shrinker. If additions race
with the shrinker itself, we're protected by the i_lock: if find_inode()
or iput() win, the shrinker will bail on the elevated i_count or
I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
will set I_FREEING and inhibit further igets(), which will cause the
other side to create a new instance of the inode instead.
Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09 05:31:24 +03:00
spin_lock(&mapping->host->i_lock);
2021-09-03 00:53:18 +03:00
xa_lock_irq(&mapping->i_pages);
2021-05-09 16:33:42 +03:00
__filemap_remove_folio(folio, NULL);
2021-09-03 00:53:18 +03:00
xa_unlock_irq(&mapping->i_pages);
2021-11-09 05:31:24 +03:00
if (mapping_shrinkable(mapping))
	inode_add_lru(mapping->host);
spin_unlock(&mapping->host->i_lock);
2010-12-01 21:35:19 +03:00
2021-05-09 16:33:42 +03:00
filemap_free_folio(mapping, folio);
2011-03-23 02:30:53 +03:00
}
2017-11-16 04:37:33 +03:00
/*
2021-12-07 22:15:07 +03:00
* page_cache_delete_batch - delete several folios from page cache
* @mapping: the mapping to which folios belong
* @fbatch: batch of folios to delete
2017-11-16 04:37:33 +03:00
*
2021-12-07 22:15:07 +03:00
* The function walks over mapping->i_pages and removes folios passed in
* @fbatch from the mapping. The function expects @fbatch to be sorted
* by page index and is optimised for it to be dense.
* It tolerates holes in @fbatch (mapping entries at those indices are not
* modified).
2017-11-16 04:37:33 +03:00
*
2018-04-11 02:36:56 +03:00
* The function expects the i_pages lock to be held.
2017-11-16 04:37:33 +03:00
*/
2017-12-04 11:59:45 +03:00
static void page_cache_delete_batch(struct address_space *mapping,
2021-12-07 22:15:07 +03:00
		struct folio_batch *fbatch)
2017-11-16 04:37:33 +03:00
{
2021-12-07 22:15:07 +03:00
XA_STATE(xas, &mapping->i_pages, fbatch->folios[0]->index);
2020-06-28 05:19:08 +03:00
long total_pages = 0;
2019-09-24 01:34:52 +03:00
int i = 0;
2021-03-13 07:13:46 +03:00
struct folio *folio;
2017-11-16 04:37:33 +03:00
2017-12-04 11:59:45 +03:00
mapping_set_update(&xas, mapping);
2021-03-13 07:13:46 +03:00
xas_for_each(&xas, folio, ULONG_MAX) {
2021-12-07 22:15:07 +03:00
	if (i >= folio_batch_count(fbatch))
2017-11-16 04:37:33 +03:00
		break;
2019-09-24 01:34:52 +03:00
	/* A swap/dax/shadow entry got inserted? Skip it. */
2021-03-13 07:13:46 +03:00
	if (xa_is_value(folio))
2017-11-16 04:37:33 +03:00
		continue;
2019-09-24 01:34:52 +03:00
	/*
	 * A page got inserted in our range? Skip it. We have our
	 * pages locked so they are protected from being removed.
	 * If we see a page whose index is higher than ours, it
	 * means our page has been removed, which shouldn't be
	 * possible because we're holding the PageLock.
	 */
2021-12-07 22:15:07 +03:00
	if (folio != fbatch->folios[i]) {
2021-03-13 07:13:46 +03:00
		VM_BUG_ON_FOLIO(folio->index >
2021-12-07 22:15:07 +03:00
				fbatch->folios[i]->index, folio);
2019-09-24 01:34:52 +03:00
		continue;
	}
2021-03-13 07:13:46 +03:00
	WARN_ON_ONCE(!folio_test_locked(folio));
2019-09-24 01:34:52 +03:00
2020-06-28 05:19:08 +03:00
	folio->mapping = NULL;
2021-12-07 22:15:07 +03:00
	/* Leave folio->index set: truncation lookup relies on it */
2019-09-24 01:34:52 +03:00
2020-06-28 05:19:08 +03:00
	i++;
2017-12-04 11:59:45 +03:00
	xas_store(&xas, NULL);
2020-06-28 05:19:08 +03:00
	total_pages += folio_nr_pages(folio);
2017-11-16 04:37:33 +03:00
}
mapping->nrpages -= total_pages;
}
void delete_from_page_cache_batch(struct address_space *mapping,
2021-12-07 22:15:07 +03:00
		struct folio_batch *fbatch)
2017-11-16 04:37:33 +03:00
{
int i;
2021-12-07 22:15:07 +03:00
if (!folio_batch_count(fbatch))
2017-11-16 04:37:33 +03:00
	return;
2021-11-09 05:31:24 +03:00
spin_lock(&mapping->host->i_lock);
2021-09-03 00:53:18 +03:00
xa_lock_irq(&mapping->i_pages);
2021-12-07 22:15:07 +03:00
for (i = 0; i < folio_batch_count(fbatch); i++) {
	struct folio *folio = fbatch->folios[i];
2017-11-16 04:37:33 +03:00
2021-07-23 16:29:46 +03:00
	trace_mm_filemap_delete_from_page_cache(folio);
	filemap_unaccount_folio(mapping, folio);
2017-11-16 04:37:33 +03:00
}
2021-12-07 22:15:07 +03:00
page_cache_delete_batch(mapping, fbatch);
2021-09-03 00:53:18 +03:00
xa_unlock_irq(&mapping->i_pages);
2021-11-09 05:31:24 +03:00
	if (mapping_shrinkable(mapping))
		inode_add_lru(mapping->host);
	spin_unlock(&mapping->host->i_lock);

	for (i = 0; i < folio_batch_count(fbatch); i++)
		filemap_free_folio(mapping, fbatch->folios[i]);
}

int filemap_check_errors(struct address_space *mapping)
{
	int ret = 0;

	/* Check for outstanding write errors */
	if (test_bit(AS_ENOSPC, &mapping->flags) &&
	    test_and_clear_bit(AS_ENOSPC, &mapping->flags))
		ret = -ENOSPC;
	if (test_bit(AS_EIO, &mapping->flags) &&
	    test_and_clear_bit(AS_EIO, &mapping->flags))
		ret = -EIO;
	return ret;
}
EXPORT_SYMBOL(filemap_check_errors);

static int filemap_check_and_keep_errors(struct address_space *mapping)
{
	/* Check for outstanding write errors */
	if (test_bit(AS_EIO, &mapping->flags))
		return -EIO;
	if (test_bit(AS_ENOSPC, &mapping->flags))
		return -ENOSPC;
	return 0;
}

/**
 * filemap_fdatawrite_wbc - start writeback on mapping dirty pages in range
 * @mapping: address space structure to write
 * @wbc: the writeback_control controlling the writeout
 *
 * Call writepages on the mapping using the provided wbc to control the
 * writeout.
 *
 * Return: %0 on success, negative error code otherwise.
 */
int filemap_fdatawrite_wbc(struct address_space *mapping,
			   struct writeback_control *wbc)
{
	int ret;

	if (!mapping_can_writeback(mapping) ||
	    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
		return 0;

	wbc_attach_fdatawrite_inode(wbc, mapping->host);
	ret = do_writepages(mapping, wbc);
	wbc_detach_inode(wbc);
	return ret;
}
EXPORT_SYMBOL(filemap_fdatawrite_wbc);

/**
 * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
 * @mapping: address space structure to write
 * @start: offset in bytes where the range starts
 * @end: offset in bytes where the range ends (inclusive)
 * @sync_mode: enable synchronous operation
 *
 * Start writeback against all of a mapping's dirty pages that lie
 * within the byte offsets <start, end> inclusive.
 *
 * If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as
 * opposed to a regular memory cleansing writeback.  The difference between
 * these two operations is that if a dirty page/buffer is encountered, it must
 * be waited upon, and not just skipped over.
 *
 * Return: %0 on success, negative error code otherwise.
 */
int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
			       loff_t end, int sync_mode)
{
	struct writeback_control wbc = {
		.sync_mode = sync_mode,
		.nr_to_write = LONG_MAX,
		.range_start = start,
		.range_end = end,
	};

	return filemap_fdatawrite_wbc(mapping, &wbc);
}

static inline int __filemap_fdatawrite(struct address_space *mapping,
				       int sync_mode)
{
	return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
}

int filemap_fdatawrite(struct address_space *mapping)
{
	return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
}
EXPORT_SYMBOL(filemap_fdatawrite);

int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
			     loff_t end)
{
	return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
}
EXPORT_SYMBOL(filemap_fdatawrite_range);

/**
 * filemap_flush - mostly a non-blocking flush
 * @mapping: target address_space
 *
 * This is a mostly non-blocking flush.  Not suitable for data-integrity
 * purposes - I/O may not be started against all dirty pages.
 *
 * Return: %0 on success, negative error code otherwise.
 */
int filemap_flush(struct address_space *mapping)
{
	return __filemap_fdatawrite(mapping, WB_SYNC_NONE);
}
EXPORT_SYMBOL(filemap_flush);

/**
 * filemap_range_has_page - check if a page exists in range.
 * @mapping: address space within which to check
 * @start_byte: offset in bytes where the range starts
 * @end_byte: offset in bytes where the range ends (inclusive)
 *
 * Find at least one page in the range supplied, usually used to check if
 * direct writing in this range will trigger a writeback.
 *
 * Return: %true if at least one page exists in the specified range,
 * %false otherwise.
 */
bool filemap_range_has_page(struct address_space *mapping,
			    loff_t start_byte, loff_t end_byte)
{
	struct folio *folio;
	XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
	pgoff_t max = end_byte >> PAGE_SHIFT;

	if (end_byte < start_byte)
		return false;

	rcu_read_lock();
	for (;;) {
		folio = xas_find(&xas, max);
		if (xas_retry(&xas, folio))
			continue;
		/* Shadow entries don't count */
		if (xa_is_value(folio))
			continue;
		/*
		 * We don't need to try to pin this page; we're about to
		 * release the RCU lock anyway.  It is enough to know that
		 * there was a page here recently.
		 */
		break;
	}
	rcu_read_unlock();

	return folio != NULL;
}
EXPORT_SYMBOL(filemap_range_has_page);

static void __filemap_fdatawait_range(struct address_space *mapping,
				      loff_t start_byte, loff_t end_byte)
{
	pgoff_t index = start_byte >> PAGE_SHIFT;
	pgoff_t end = end_byte >> PAGE_SHIFT;
	struct folio_batch fbatch;
	unsigned nr_folios;

	folio_batch_init(&fbatch);

	while (index <= end) {
		unsigned i;

		nr_folios = filemap_get_folios_tag(mapping, &index, end,
				PAGECACHE_TAG_WRITEBACK, &fbatch);
		if (!nr_folios)
			break;

		for (i = 0; i < nr_folios; i++) {
			struct folio *folio = fbatch.folios[i];

			folio_wait_writeback(folio);
			folio_clear_error(folio);
		}
		folio_batch_release(&fbatch);
		cond_resched();
	}
}

/**
 * filemap_fdatawait_range - wait for writeback to complete
 * @mapping: address space structure to wait for
 * @start_byte: offset in bytes where the range starts
 * @end_byte: offset in bytes where the range ends (inclusive)
 *
 * Walk the list of under-writeback pages of the given address space
 * in the given range and wait for all of them.  Check error status of
 * the address space and return it.
 *
 * Since the error status of the address space is cleared by this function,
 * callers are responsible for checking the return value and handling and/or
 * reporting the error.
 *
 * Return: error status of the address space.
 */
int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
			    loff_t end_byte)
{
	__filemap_fdatawait_range(mapping, start_byte, end_byte);
	return filemap_check_errors(mapping);
}
EXPORT_SYMBOL(filemap_fdatawait_range);

/**
 * filemap_fdatawait_range_keep_errors - wait for writeback to complete
 * @mapping: address space structure to wait for
 * @start_byte: offset in bytes where the range starts
 * @end_byte: offset in bytes where the range ends (inclusive)
 *
 * Walk the list of under-writeback pages of the given address space in the
 * given range and wait for all of them.  Unlike filemap_fdatawait_range(),
 * this function does not clear error status of the address space.
 *
 * Use this function if callers don't handle errors themselves.  Expected
 * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
 * fsfreeze(8)
 */
int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
		loff_t start_byte, loff_t end_byte)
{
	__filemap_fdatawait_range(mapping, start_byte, end_byte);
	return filemap_check_and_keep_errors(mapping);
}
EXPORT_SYMBOL(filemap_fdatawait_range_keep_errors);

/**
 * file_fdatawait_range - wait for writeback to complete
 * @file: file pointing to address space structure to wait for
 * @start_byte: offset in bytes where the range starts
 * @end_byte: offset in bytes where the range ends (inclusive)
 *
 * Walk the list of under-writeback pages of the address space that file
 * refers to, in the given range and wait for all of them.  Check error
 * status of the address space vs. the file->f_wb_err cursor and return it.
 *
 * Since the error status of the file is advanced by this function,
 * callers are responsible for checking the return value and handling and/or
 * reporting the error.
 *
 * Return: error status of the address space vs. the file->f_wb_err cursor.
 */
int file_fdatawait_range(struct file *file, loff_t start_byte, loff_t end_byte)
{
	struct address_space *mapping = file->f_mapping;

	__filemap_fdatawait_range(mapping, start_byte, end_byte);
	return file_check_and_advance_wb_err(file);
}
EXPORT_SYMBOL(file_fdatawait_range);

/**
 * filemap_fdatawait_keep_errors - wait for writeback without clearing errors
 * @mapping: address space structure to wait for
 *
 * Walk the list of under-writeback pages of the given address space
 * and wait for all of them.  Unlike filemap_fdatawait(), this function
 * does not clear error status of the address space.
 *
 * Use this function if callers don't handle errors themselves.  Expected
 * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
 * fsfreeze(8)
 *
 * Return: error status of the address space.
 */
int filemap_fdatawait_keep_errors(struct address_space *mapping)
{
	__filemap_fdatawait_range(mapping, 0, LLONG_MAX);
	return filemap_check_and_keep_errors(mapping);
}
EXPORT_SYMBOL(filemap_fdatawait_keep_errors);

/* Returns true if writeback might be needed or already in progress. */
static bool mapping_needs_writeback(struct address_space *mapping)
{
	return mapping->nrpages;
}

bool filemap_range_has_writeback(struct address_space *mapping,
				 loff_t start_byte, loff_t end_byte)
{
	XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
	pgoff_t max = end_byte >> PAGE_SHIFT;
	struct folio *folio;

	if (end_byte < start_byte)
		return false;

	rcu_read_lock();
	xas_for_each(&xas, folio, max) {
		if (xas_retry(&xas, folio))
			continue;
		if (xa_is_value(folio))
			continue;
		if (folio_test_dirty(folio) || folio_test_locked(folio) ||
		    folio_test_writeback(folio))
			break;
	}
	rcu_read_unlock();
	return folio != NULL;
}
EXPORT_SYMBOL_GPL(filemap_range_has_writeback);
/**
 * filemap_write_and_wait_range - write out & wait on a file range
 * @mapping: the address_space for the pages
 * @lstart: offset in bytes where the range starts
 * @lend: offset in bytes where the range ends (inclusive)
 *
 * Write out and wait upon file offsets lstart->lend, inclusive.
 *
 * Note that @lend is inclusive (describes the last byte to be written) so
 * that this function can be used to write to the very end-of-file (end = -1).
 *
 * Return: error status of the address space.
 */
int filemap_write_and_wait_range(struct address_space *mapping,
				 loff_t lstart, loff_t lend)
{
	int err = 0, err2;

	if (lend < lstart)
		return 0;

	if (mapping_needs_writeback(mapping)) {
[PATCH] Fix and add EXPORT_SYMBOL(filemap_write_and_wait)
This patch adds EXPORT_SYMBOL(filemap_write_and_wait) and uses it.
See mm/filemap.c:
It also changes filemap_write_and_wait() and filemap_write_and_wait_range().
Currently, filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
returns an error. However, even if filemap_fdatawrite() returned an
error, it may have submitted some of the data pages to the device.
(e.g. in the case of -ENOSPC)
<quotation>
Andrew Morton writes,
If filemap_fdatawrite() returns an error, this might be due to some
I/O problem: dead disk, unplugged cable, etc. Given the generally
crappy quality of the kernel's handling of such exceptions, there's a
good chance that the filemap_fdatawait() will get stuck in D state
forever.
</quotation>
So, with this patch we don't wait if filemap_fdatawrite() returns -EIO.
Trond, could you please review the nfs part? Especially I'm not sure
whether nfs must use "filemap_fdatawrite(inode->i_mapping) == 0" or not.
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 12:02:14 +03:00
		err = __filemap_fdatawrite_range(mapping, lstart, lend,
						 WB_SYNC_ALL);
		/*
		 * Even if the above returned an error, the pages may be
		 * partially written (e.g. -ENOSPC), so we wait for them.
		 * But -EIO is a special case; it may indicate the worst
		 * thing (e.g. a bug) happened, so we avoid waiting for it.
		 */
		if (err != -EIO)
			__filemap_fdatawait_range(mapping, lstart, lend);
	}
	err2 = filemap_check_errors(mapping);
	if (!err)
		err = err2;
	return err;
}
EXPORT_SYMBOL(filemap_write_and_wait_range);
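The error-combining rules of the helper above can be modeled as a small pure function (a toy sketch; toy_write_and_wait and the hard-coded errno values are illustrative, not kernel API):

```c
#include <assert.h>

#define TOY_EIO    (-5)  /* value of -EIO on Linux; illustrative */
#define TOY_ENOSPC (-28) /* value of -ENOSPC on Linux; illustrative */

/* Toy model of filemap_write_and_wait_range()'s control flow: submit,
 * wait unless submission failed with -EIO (device may be dead), then
 * fold in any error found by the post-wait check, preferring the
 * earlier write-side error. */
static int toy_write_and_wait(int write_err, int check_err, int *waited)
{
	int err = write_err;

	*waited = (err != TOY_EIO); /* -EIO: skip the wait entirely */
	if (!err)
		err = check_err;
	return err;
}
```

Note that -ENOSPC still waits: partially submitted pages must be waited on even though the write reported an error, which is the whole point of the comment in the real function.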
fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.
The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.
If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.
This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.
In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.
One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.
This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.
This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).
Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.
The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 14:02:25 +03:00
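The errseq-style "since" semantics described above can be modeled with a toy counter (a minimal sketch; the toy_* names are hypothetical, and this deliberately ignores the real errseq_t bit layout, "seen" flag, and atomicity):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: the mapping keeps a counter bumped on every writeback
 * error; each open file keeps a "since" cursor of the last value it
 * reported. This mirrors the mapping->wb_err / file->f_wb_err split. */
struct toy_mapping { unsigned long wb_err; };
struct toy_file { unsigned long f_wb_err; };

static void toy_open(struct toy_file *f, const struct toy_mapping *m)
{
	f->f_wb_err = m->wb_err; /* sample the cursor at open time */
}

static void toy_writeback_error(struct toy_mapping *m)
{
	m->wb_err++;
}

/* fsync-time check: report iff an error happened since our cursor,
 * then advance the cursor so the same error is reported only once
 * per file description. */
static bool toy_check_and_advance(struct toy_mapping *m, struct toy_file *f)
{
	bool seen = (f->f_wb_err != m->wb_err);

	f->f_wb_err = m->wb_err;
	return seen;
}
```

Unlike the old AS_EIO flag, two independently opened files both observe the same error: each has its own cursor, so the first fsync caller no longer "steals" the error from the second.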
void __filemap_set_wb_err(struct address_space *mapping, int err)
{
	errseq_t eseq = errseq_set(&mapping->wb_err, err);
	trace_filemap_set_wb_err(mapping, eseq);
}
EXPORT_SYMBOL(__filemap_set_wb_err);

/**
 * file_check_and_advance_wb_err - report wb error (if any) that was previously
 *				   recorded and advance wb_err to the current one
 * @file: struct file on which the error is being reported
 *
 * When userland calls fsync (or something like nfsd does the equivalent), we
 * want to report any writeback errors that occurred since the last fsync (or
 * since the file was opened if there haven't been any).
 *
 * Grab the wb_err from the mapping. If it matches what we have in the file,
 * then just quickly return 0. The file is all caught up.
 *
 * If it doesn't match, then take the mapping value, set the "seen" flag in
 * it and try to swap it into place. If it works, or another task beat us
 * to it with the new value, then update the f_wb_err and return the error
 * portion. The error at this point must be reported via proper channels
 * (a'la fsync, or NFS COMMIT operation, etc.).
 *
 * While we handle mapping->wb_err with atomic operations, the f_wb_err
 * value is protected by the f_lock since we must ensure that it reflects
 * the latest value swapped in for this file descriptor.
 *
 * Return: %0 on success, negative error code otherwise.
 */
int file_check_and_advance_wb_err(struct file *file)
{
	int err = 0;
	errseq_t old = READ_ONCE(file->f_wb_err);
	struct address_space *mapping = file->f_mapping;

	/* Locklessly handle the common case where nothing has changed */
	if (errseq_check(&mapping->wb_err, old)) {
		/* Something changed, must use slow path */
		spin_lock(&file->f_lock);
		old = file->f_wb_err;
		err = errseq_check_and_advance(&mapping->wb_err,
						&file->f_wb_err);
		trace_file_check_and_advance_wb_err(file, old);
		spin_unlock(&file->f_lock);
	}
	/*
	 * We're mostly using this function as a drop-in replacement for
	 * filemap_check_errors. Clear AS_EIO/AS_ENOSPC to emulate the effect
	 * that the legacy code would have had on these flags.
	 */
	clear_bit(AS_EIO, &mapping->flags);
	clear_bit(AS_ENOSPC, &mapping->flags);
	return err;
}
EXPORT_SYMBOL(file_check_and_advance_wb_err);

/**
 * file_write_and_wait_range - write out & wait on a file range
 * @file:	file pointing to address_space with pages
 * @lstart:	offset in bytes where the range starts
 * @lend:	offset in bytes where the range ends (inclusive)
 *
 * Write out and wait upon file offsets lstart->lend, inclusive.
 *
 * Note that @lend is inclusive (describes the last byte to be written) so
 * that this function can be used to write to the very end-of-file (end = -1).
 *
 * After writing out and waiting on the data, we check and advance the
 * f_wb_err cursor to the latest value, and return any errors detected there.
 *
 * Return: %0 on success, negative error code otherwise.
 */
int file_write_and_wait_range(struct file *file, loff_t lstart, loff_t lend)
{
	int err = 0, err2;
	struct address_space *mapping = file->f_mapping;
	if (lend < lstart)
		return 0;

	if (mapping_needs_writeback(mapping)) {
		err = __filemap_fdatawrite_range(mapping, lstart, lend,
						 WB_SYNC_ALL);
		/* See comment of filemap_write_and_wait() */
		if (err != -EIO)
			__filemap_fdatawait_range(mapping, lstart, lend);
	}
	err2 = file_check_and_advance_wb_err(file);
	if (!err)
		err = err2;
	return err;
}
EXPORT_SYMBOL(file_write_and_wait_range);

/**
 * replace_page_cache_folio - replace a pagecache folio with a new one
 * @old: folio to be replaced
 * @new: folio to replace with
 *
 * This function replaces a folio in the pagecache with a new one.  On
 * success it acquires the pagecache reference for the new folio and
 * drops it for the old folio.  Both the old and new folios must be
 * locked.  This function does not add the new folio to the LRU, the
 * caller must do that.
 *
 * The remove + add is atomic.  This function cannot fail.
 */
2022-11-01 20:53:22 +03:00
void replace_page_cache_folio ( struct folio * old , struct folio * new )
2011-03-23 02:30:52 +03:00
{
2017-11-17 18:01:45 +03:00
struct address_space * mapping = old - > mapping ;
2022-05-01 14:35:31 +03:00
void ( * free_folio ) ( struct folio * ) = mapping - > a_ops - > free_folio ;
2017-11-17 18:01:45 +03:00
pgoff_t offset = old - > index ;
XA_STATE ( xas , & mapping - > i_pages , offset ) ;
2011-03-23 02:30:52 +03:00
2022-11-01 20:53:22 +03:00
VM_BUG_ON_FOLIO ( ! folio_test_locked ( old ) , old ) ;
VM_BUG_ON_FOLIO ( ! folio_test_locked ( new ) , new ) ;
VM_BUG_ON_FOLIO ( new - > mapping , new ) ;
2011-03-23 02:30:52 +03:00
2022-11-01 20:53:22 +03:00
folio_get ( new ) ;
2017-11-17 18:01:45 +03:00
new - > mapping = mapping ;
new - > index = offset ;
2011-03-23 02:30:52 +03:00
2022-11-01 20:53:22 +03:00
mem_cgroup_migrate ( old , new ) ;
2020-06-04 02:01:54 +03:00
2021-09-03 00:53:18 +03:00
xas_lock_irq ( & xas ) ;
2017-11-17 18:01:45 +03:00
xas_store ( & xas , new ) ;
2015-06-25 02:57:24 +03:00
2017-11-17 18:01:45 +03:00
old - > mapping = NULL ;
/* hugetlb pages do not participate in page cache accounting. */
2022-11-01 20:53:22 +03:00
if ( ! folio_test_hugetlb ( old ) )
__lruvec_stat_sub_folio ( old , NR_FILE_PAGES ) ;
if ( ! folio_test_hugetlb ( new ) )
__lruvec_stat_add_folio ( new , NR_FILE_PAGES ) ;
if ( folio_test_swapbacked ( old ) )
__lruvec_stat_sub_folio ( old , NR_SHMEM ) ;
if ( folio_test_swapbacked ( new ) )
__lruvec_stat_add_folio ( new , NR_SHMEM ) ;
2021-09-03 00:53:18 +03:00
xas_unlock_irq ( & xas ) ;
2022-05-01 14:35:31 +03:00
if ( free_folio )
2022-11-01 20:53:22 +03:00
free_folio ( old ) ;
folio_put ( old ) ;
2011-03-23 02:30:52 +03:00
}
2022-11-01 20:53:22 +03:00
EXPORT_SYMBOL_GPL ( replace_page_cache_folio ) ;
noinline int __filemap_add_folio(struct address_space *mapping,
		struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
{
	XA_STATE(xas, &mapping->i_pages, index);
	int huge = folio_test_hugetlb(folio);
	bool charged = false;
	long nr = 1;

	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
	mapping_set_update(&xas, mapping);

	if (!huge) {
		int error = mem_cgroup_charge(folio, NULL, gfp);

		VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
		if (error)
			return error;
		charged = true;
		xas_set_order(&xas, index, folio_order(folio));
		nr = folio_nr_pages(folio);
	}

	gfp &= GFP_RECLAIM_MASK;
	folio_ref_add(folio, nr);
	folio->mapping = mapping;
	folio->index = xas.xa_index;

	do {
		unsigned int order = xa_get_order(xas.xa, xas.xa_index);
		void *entry, *old = NULL;

		if (order > folio_order(folio))
			xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
					order, gfp);
		xas_lock_irq(&xas);
		xas_for_each_conflict(&xas, entry) {
			old = entry;
			if (!xa_is_value(entry)) {
				xas_set_err(&xas, -EEXIST);
				goto unlock;
			}
		}

		if (old) {
			if (shadowp)
				*shadowp = old;
			/* entry may have been split before we acquired lock */
			order = xa_get_order(xas.xa, xas.xa_index);
			if (order > folio_order(folio)) {
				/* How to handle large swap entries? */
				BUG_ON(shmem_mapping(mapping));
				xas_split(&xas, old, order);
				xas_reset(&xas);
			}
		}

		xas_store(&xas, folio);
		if (xas_error(&xas))
			goto unlock;

		mapping->nrpages += nr;

		/* hugetlb pages do not participate in page cache accounting */
		if (!huge) {
			__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr);
			if (folio_test_pmd_mappable(folio))
				__lruvec_stat_mod_folio(folio,
						NR_FILE_THPS, nr);
		}
unlock:
		xas_unlock_irq(&xas);
	} while (xas_nomem(&xas, gfp));

	if (xas_error(&xas))
		goto error;

	trace_mm_filemap_add_to_page_cache(folio);
	return 0;
error:
	if (charged)
		mem_cgroup_uncharge(folio);
	folio->mapping = NULL;
	/* Leave page->index set: truncation relies upon it */
	folio_put_refs(folio, nr);
	return xas_error(&xas);
}
ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO);
int filemap_add_folio(struct address_space *mapping, struct folio *folio,
				pgoff_t index, gfp_t gfp)
{
	void *shadow = NULL;
	int ret;

	__folio_set_locked(folio);
	ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
	if (unlikely(ret))
		__folio_clear_locked(folio);
	else {
		/*
		 * The folio might have been evicted from cache only
		 * recently, in which case it should be activated like
		 * any other repeatedly accessed folio.
		 * The exception is folios getting rewritten; evicting other
		 * data from the working set, only to cache data that will
		 * get overwritten with something else, is a waste of memory.
		 */
		WARN_ON_ONCE(folio_test_active(folio));
		if (!(gfp & __GFP_WRITE) && shadow)
			workingset_refault(folio, shadow);
		folio_add_lru(folio);
	}
	return ret;
}
EXPORT_SYMBOL_GPL(filemap_add_folio);
#ifdef CONFIG_NUMA
struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
{
	int n;
	struct folio *folio;

	if (cpuset_do_page_mem_spread()) {
		unsigned int cpuset_mems_cookie;

		do {
			cpuset_mems_cookie = read_mems_allowed_begin();
			n = cpuset_mem_spread_node();
			folio = __folio_alloc_node(gfp, order, n);
		} while (!folio && read_mems_allowed_retry(cpuset_mems_cookie));

		return folio;
	}
	return folio_alloc(gfp, order);
}
EXPORT_SYMBOL(filemap_alloc_folio);
#endif
/*
 * filemap_invalidate_lock_two - lock invalidate_lock for two mappings
 *
 * Lock exclusively invalidate_lock of any passed mapping that is not NULL.
 *
 * @mapping1: the first mapping to lock
 * @mapping2: the second mapping to lock
 */
void filemap_invalidate_lock_two(struct address_space *mapping1,
				 struct address_space *mapping2)
{
	if (mapping1 > mapping2)
		swap(mapping1, mapping2);
	if (mapping1)
		down_write(&mapping1->invalidate_lock);
	if (mapping2 && mapping1 != mapping2)
		down_write_nested(&mapping2->invalidate_lock, 1);
}
EXPORT_SYMBOL(filemap_invalidate_lock_two);

/*
 * filemap_invalidate_unlock_two - unlock invalidate_lock for two mappings
 *
 * Unlock exclusive invalidate_lock of any passed mapping that is not NULL.
 *
 * @mapping1: the first mapping to unlock
 * @mapping2: the second mapping to unlock
 */
void filemap_invalidate_unlock_two(struct address_space *mapping1,
				   struct address_space *mapping2)
{
	if (mapping1)
		up_write(&mapping1->invalidate_lock);
	if (mapping2 && mapping1 != mapping2)
		up_write(&mapping2->invalidate_lock);
}
EXPORT_SYMBOL(filemap_invalidate_unlock_two);
/*
 * In order to wait for pages to become available there must be
 * waitqueues associated with pages. By using a hash table of
 * waitqueues where the bucket discipline is to maintain all
 * waiters on the same queue and wake all when any of the pages
 * become available, and for the woken contexts to check to be
 * sure the appropriate page became available, this saves space
 * at a cost of "thundering herd" phenomena during rare hash
 * collisions.
 */
#define PAGE_WAIT_TABLE_BITS 8
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

static wait_queue_head_t *folio_waitqueue(struct folio *folio)
{
	return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
}

void __init pagecache_init(void)
{
	int i;

	for (i = 0; i < PAGE_WAIT_TABLE_SIZE; i++)
		init_waitqueue_head(&folio_wait_table[i]);

	page_writeback_init();
}
/*
* The page wait code treats the " wait->flags " somewhat unusually , because
2020-09-20 20:38:47 +03:00
* we have multiple different kinds of waits , not just the usual " exclusive "
mm: allow a controlled amount of unfairness in the page lock
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.
That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.
It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.
Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to. But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.
There is also a hint from Matthieu Baerts that the fair page coloring
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.
The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.
This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain locks. And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but for things like just verifying that the page file
mapping is stable while faulting in the page into a page table.
Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-14 00:05:35 +03:00
 * one.
 *
 * We have:
 *
 * (a) no special bits set:
 *
 *	We're just waiting for the bit to be released, and when a waker
 *	calls the wakeup function, we set WQ_FLAG_WOKEN and wake it up,
 *	and remove it from the wait queue.
 *
 *	Simple and straightforward.
 *
 * (b) WQ_FLAG_EXCLUSIVE:
 *
 *	The waiter is waiting to get the lock, and only one waiter should
 *	be woken up to avoid any thundering herd behavior. We'll set the
 *	WQ_FLAG_WOKEN bit, wake it up, and remove it from the wait queue.
 *
 *	This is the traditional exclusive wait.
 *
 * (c) WQ_FLAG_EXCLUSIVE | WQ_FLAG_CUSTOM:
 *
 *	The waiter is waiting to get the bit, and additionally wants the
 *	lock to be transferred to it for fair lock behavior. If the lock
 *	cannot be taken, we stop walking the wait queue without waking
 *	the waiter.
 *
 *	This is the "fair lock handoff" case, and in addition to setting
 *	WQ_FLAG_WOKEN, we set WQ_FLAG_DONE to let the waiter easily see
 *	that it now has the lock.
 */
static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
{
	unsigned int flags;
	struct wait_page_key *key = arg;
	struct wait_page_queue *wait_page
		= container_of(wait, struct wait_page_queue, wait);
	if (!wake_page_match(wait_page, key))
		return 0;
Minor page waitqueue cleanups
Tim Chen and Kan Liang have been battling a customer load that shows
extremely long page wakeup lists. The cause seems to be constant NUMA
migration of a hot page that is shared across a lot of threads, but the
actual root cause for the exact behavior has not been found.
Tim has a patch that batches the wait list traversal at wakeup time, so
that we at least don't get long uninterruptible cases where we traverse
and wake up thousands of processes and get nasty latency spikes. That
is likely 4.14 material, but we're still discussing the page waitqueue
specific parts of it.
In the meantime, I've tried to look at making the page wait queues less
expensive, and failing miserably. If you have thousands of threads
waiting for the same page, it will be painful. We'll need to try to
figure out the NUMA balancing issue some day, in addition to avoiding
the excessive spinlock hold times.
That said, having tried to rewrite the page wait queues, I can at least
fix up some of the braindamage in the current situation. In particular:
(a) we don't want to continue walking the page wait list if the bit
we're waiting for already got set again (which seems to be one of
the patterns of the bad load). That makes no progress and just
causes pointless cache pollution chasing the pointers.
(b) we don't want to put the non-locking waiters always on the front of
the queue, and the locking waiters always on the back. Not only is
that unfair, it means that we wake up thousands of reading threads
that will just end up being blocked by the writer later anyway.
Also add a comment about the layout of 'struct wait_page_key' - there is
an external user of it in the cachefiles code that means that it has to
match the layout of 'struct wait_bit_key' in the two first members. It
so happens to match, because 'struct page *' and 'unsigned long *' end
up having the same values simply because the page flags are the first
member in struct page.
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christopher Lameter <cl@linux.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-27 23:55:12 +03:00
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 11:36:14 +03:00
/*
	 * If it's a lock handoff wait, we get the bit for it, and
	 * stop walking (and do not wake it up) if we can't.
*/
	flags = wait->flags;
	if (flags & WQ_FLAG_EXCLUSIVE) {
		if (test_bit(key->bit_nr, &key->folio->flags))
			return -1;
		if (flags & WQ_FLAG_CUSTOM) {
			if (test_and_set_bit(key->bit_nr, &key->folio->flags))
				return -1;
			flags |= WQ_FLAG_DONE;
		}
	}
	/*
	 * We are holding the wait-queue lock, but the waiter that
	 * is waiting for this will be checking the flags without
	 * any locking.
	 *
	 * So update the flags atomically, and wake up the waiter
	 * afterwards to avoid any races. This store-release pairs
	 * with the load-acquire in folio_wait_bit_common().
	 */
	smp_store_release(&wait->flags, flags | WQ_FLAG_WOKEN);
	wake_up_state(wait->private, mode);
	/*
	 * Ok, we have successfully done what we're waiting for,
	 * and we can unconditionally remove the wait entry.
	 *
	 * Note that this pairs with the "finish_wait()" in the
	 * waiter, and has to be the absolute last thing we do.
	 * After this list_del_init(&wait->entry) the wait entry
	 * might be de-allocated and the process might even have
	 * exited.
	 */
	list_del_init_careful(&wait->entry);
	return (flags & WQ_FLAG_EXCLUSIVE) != 0;
}
static void folio_wake_bit(struct folio *folio, int bit_nr)
{
	wait_queue_head_t *q = folio_waitqueue(folio);
	struct wait_page_key key;
	unsigned long flags;
	wait_queue_entry_t bookmark;

	key.folio = folio;
	key.bit_nr = bit_nr;
	key.page_match = 0;

	bookmark.flags = 0;
	bookmark.private = NULL;
	bookmark.func = NULL;
	INIT_LIST_HEAD(&bookmark.entry);

	spin_lock_irqsave(&q->lock, flags);
	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
	while (bookmark.flags & WQ_FLAG_BOOKMARK) {
		/*
		 * Take a breather from holding the lock,
		 * allow pages that finish wake up asynchronously
		 * to acquire the lock and remove themselves
		 * from wait queue
		 */
		spin_unlock_irqrestore(&q->lock, flags);
		cpu_relax();
		spin_lock_irqsave(&q->lock, flags);
		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
	}

	/*
	 * It's possible to miss clearing waiters here, when we woke our page
	 * waiters, but the hashed waitqueue has waiters for other pages on it.
	 * That's okay, it's a rare case. The next waker will clear it.
	 *
	 * Note that, depending on the page pool (buddy, hugetlb, ZONE_DEVICE,
	 * other), the flag may be cleared in the course of freeing the page;
	 * but that is not required for correctness.
	 */
	if (!waitqueue_active(q) || !key.page_match)
		folio_clear_waiters(folio);

	spin_unlock_irqrestore(&q->lock, flags);
}

static void folio_wake(struct folio *folio, int bit)
{
	if (!folio_test_waiters(folio))
		return;
	folio_wake_bit(folio, bit);
}
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 11:36:14 +03:00
/*
 * A choice of three behaviors for folio_wait_bit_common():
 */
enum behavior {
	EXCLUSIVE,	/* Hold ref to page and take the bit when woken, like
			 * __folio_lock() waiting on then setting PG_locked.
			 */
	SHARED,		/* Hold ref to page and check the bit when woken, like
			 * folio_wait_writeback() waiting on PG_writeback.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 11:36:14 +03:00
*/
	DROP,		/* Drop ref to page before wait, no check when woken,
			 * like folio_put_wait_locked() on PG_locked.
*/
};
/*
 * Attempt to check (or get) the folio flag, and mark us done
mm: allow a controlled amount of unfairness in the page lock
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.
That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.
It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.
Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to. But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.
There is also a hint from Matthieu Baerts that the fair page coloring
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.
The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.
This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain locks. And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but for things like just verifying that the page file
mapping is stable while faulting in the page into a page table.
Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-14 00:05:35 +03:00
 * if successful.
 */
static inline bool folio_trylock_flag(struct folio *folio, int bit_nr,
					struct wait_queue_entry *wait)
{
	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
		if (test_and_set_bit(bit_nr, &folio->flags))
			return false;
	} else if (test_bit(bit_nr, &folio->flags))
		return false;

	wait->flags |= WQ_FLAG_WOKEN | WQ_FLAG_DONE;
	return true;
}
/* How many times do we accept lock stealing from under a waiter? */
int sysctl_page_lock_unfairness = 5;

static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
		int state, enum behavior behavior)
{
	wait_queue_head_t *q = folio_waitqueue(folio);
	int unfairness = sysctl_page_lock_unfairness;
	struct wait_page_queue wait_page;
	wait_queue_entry_t *wait = &wait_page.wait;
	bool thrashing = false;
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:06:27 +03:00
	unsigned long pflags;
	bool in_thrashing;

	if (bit_nr == PG_locked &&
	    !folio_test_uptodate(folio) && folio_test_workingset(folio)) {
		delayacct_thrashing_start(&in_thrashing);
		psi_memstall_enter(&pflags);
		thrashing = true;
	}

	init_wait(wait);
	wait->func = wake_page_function;
	wait_page.folio = folio;
	wait_page.bit_nr = bit_nr;
repeat:
	wait->flags = 0;
	if (behavior == EXCLUSIVE) {
		wait->flags = WQ_FLAG_EXCLUSIVE;
		if (--unfairness < 0)
			wait->flags |= WQ_FLAG_CUSTOM;
	}
	/*
	 * Do one last check whether we can get the
	 * page bit synchronously.
	 *
	 * Do the folio_set_waiters() marking before that
	 * to let any waker we _just_ missed know they
	 * need to wake us up (otherwise they'll never
	 * even go to the slow case that looks at the
	 * page queue), and add ourselves to the wait
	 * queue if we need to sleep.
	 *
	 * This part needs to be done under the queue
	 * lock to avoid races.
	 */
	spin_lock_irq(&q->lock);
	folio_set_waiters(folio);
	if (!folio_trylock_flag(folio, bit_nr, wait))
		__add_wait_queue_entry_tail(q, wait);
	spin_unlock_irq(&q->lock);
	/*
	 * From now on, all the logic will be based on
	 * the WQ_FLAG_WOKEN and WQ_FLAG_DONE flag, to
	 * see whether the page bit testing has already
	 * been done by the wake function.
	 *
	 * We can drop our reference to the folio.
	 */
	if (behavior == DROP)
		folio_put(folio);
	/*
	 * Note that until the "finish_wait()", or until
	 * we see the WQ_FLAG_WOKEN flag, we need to
	 * be very careful with the 'wait->flags', because
	 * we may race with a waker that sets them.
	 */
	for (;;) {
		unsigned int flags;

		set_current_state(state);
		/* Loop until we've been woken or interrupted */
		flags = smp_load_acquire(&wait->flags);
		if (!(flags & WQ_FLAG_WOKEN)) {
			if (signal_pending_state(state, current))
				break;

			io_schedule();
			continue;
		}

		/* If we were non-exclusive, we're done */
		if (behavior != EXCLUSIVE)
			break;
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup bookmark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 11:36:14 +03:00
		/* If the waker got the lock for us, we're done */
		if (flags & WQ_FLAG_DONE)
			break;
		/*
		 * Otherwise, if we're getting the lock, we need to
		 * try to get it ourselves.
		 *
		 * And if that fails, we'll have to retry this all.
		 */
		if (unlikely(test_and_set_bit(bit_nr, folio_flags(folio, 0))))
			goto repeat;
		wait->flags |= WQ_FLAG_DONE;
		break;
	}
	/*
	 * If a signal happened, this 'finish_wait()' may remove the last
	 * waiter from the wait-queues, but the folio waiters bit will remain
	 * set. That's ok. The next wakeup will take care of it, and trying
	 * to do it here would be difficult and prone to races.
	 */
	finish_wait(q, wait);
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:06:27 +03:00
	if (thrashing) {
		delayacct_thrashing_end(&in_thrashing);
		psi_memstall_leave(&pflags);
	}

	/*
	 * NOTE! The wait->flags weren't stable until we've done the
	 * 'finish_wait()', and we could have exited the loop above due
	 * to a signal, and had a wakeup event happen after the signal
	 * test but before the 'finish_wait()'.
	 *
	 * So only after the finish_wait() can we reliably determine
	 * if we got woken up or not, so we can now figure out the final
	 * return value based on that state without races.
	 *
	 * Also note that WQ_FLAG_WOKEN is sufficient for a non-exclusive
	 * waiter, but an exclusive one requires WQ_FLAG_DONE.
	 */
	if (behavior == EXCLUSIVE)
		return wait->flags & WQ_FLAG_DONE ? 0 : -EINTR;

	return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
}
#ifdef CONFIG_MIGRATION
/**
 * migration_entry_wait_on_locked - Wait for a migration entry to be removed
 * @entry: migration swap entry.
 * @ptl: already locked ptl. This function will drop the lock.
 *
 * Wait for a migration entry referencing the given page to be removed. This is
 * equivalent to put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE) except
 * this can be called without taking a reference on the page. Instead this
 * should be called while holding the ptl for the migration entry referencing
 * the page.
 *
 * Returns after unlocking the ptl.
 *
 * This follows the same logic as folio_wait_bit_common() so see the comments
 * there.
 */
void migration_entry_wait_on_locked(swp_entry_t entry, spinlock_t *ptl)
	__releases(ptl)
{
	struct wait_page_queue wait_page;
	wait_queue_entry_t *wait = &wait_page.wait;
	bool thrashing = false;
	unsigned long pflags;
	bool in_thrashing;
	wait_queue_head_t *q;
	struct folio *folio = page_folio(pfn_swap_entry_to_page(entry));

	q = folio_waitqueue(folio);
	if (!folio_test_uptodate(folio) && folio_test_workingset(folio)) {
		delayacct_thrashing_start(&in_thrashing);
		psi_memstall_enter(&pflags);
		thrashing = true;
	}

	init_wait(wait);
	wait->func = wake_page_function;
	wait_page.folio = folio;
	wait_page.bit_nr = PG_locked;
	wait->flags = 0;

	spin_lock_irq(&q->lock);
	folio_set_waiters(folio);
	if (!folio_trylock_flag(folio, PG_locked, wait))
		__add_wait_queue_entry_tail(q, wait);
	spin_unlock_irq(&q->lock);

	/*
	 * If a migration entry exists for the page the migration path must hold
	 * a valid reference to the page, and it must take the ptl to remove the
	 * migration entry. So the page is valid until the ptl is dropped.
	 */
	spin_unlock(ptl);

	for (;;) {
		unsigned int flags;

		set_current_state(TASK_UNINTERRUPTIBLE);

		/* Loop until we've been woken or interrupted */
		flags = smp_load_acquire(&wait->flags);
		if (!(flags & WQ_FLAG_WOKEN)) {
			if (signal_pending_state(TASK_UNINTERRUPTIBLE, current))
				break;

			io_schedule();
			continue;
		}
		break;
	}

	finish_wait(q, wait);

	if (thrashing) {
		delayacct_thrashing_end(&in_thrashing);
		psi_memstall_leave(&pflags);
	}
}
#endif
void folio_wait_bit(struct folio *folio, int bit_nr)
{
	folio_wait_bit_common(folio, bit_nr, TASK_UNINTERRUPTIBLE, SHARED);
}
EXPORT_SYMBOL(folio_wait_bit);

int folio_wait_bit_killable(struct folio *folio, int bit_nr)
{
	return folio_wait_bit_common(folio, bit_nr, TASK_KILLABLE, SHARED);
}
EXPORT_SYMBOL(folio_wait_bit_killable);
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 11:36:14 +03:00
/**
 * folio_put_wait_locked - Drop a reference and wait for it to be unlocked
 * @folio: The folio to wait for.
 * @state: The sleep state (TASK_KILLABLE, TASK_UNINTERRUPTIBLE, etc).
*
2021-08-17 06:36:31 +03:00
* The caller should hold a reference on @folio.  They expect the page to
2018-12-28 11:36:14 +03:00
* become unlocked relatively soon, but do not wish to hold up migration
2021-08-17 06:36:31 +03:00
* (for example) by holding the reference while waiting for the folio to
2018-12-28 11:36:14 +03:00
* come unlocked.  After this function returns, the caller should not
2021-08-17 06:36:31 +03:00
* dereference @folio.
2021-02-24 23:02:02 +03:00
*
2021-08-17 06:36:31 +03:00
* Return: 0 if the folio was unlocked or -EINTR if interrupted by a signal.
2018-12-28 11:36:14 +03:00
*/
2022-09-14 05:17:38 +03:00
static int folio_put_wait_locked(struct folio *folio, int state)
2018-12-28 11:36:14 +03:00
{
2021-08-17 06:36:31 +03:00
return folio_wait_bit_common(folio, PG_locked, state, DROP);
2018-12-28 11:36:14 +03:00
}
2009-04-03 19:42:39 +04:00
/**
2021-01-16 19:22:14 +03:00
* folio_add_wait_queue - Add an arbitrary waiter to a folio's wait queue
* @folio: Folio defining the wait queue of interest
2009-04-14 01:39:54 +04:00
* @waiter: Waiter to add to the queue
2009-04-03 19:42:39 +04:00
*
2021-01-16 19:22:14 +03:00
* Add an arbitrary @waiter to the wait queue for the nominated @folio.
2009-04-03 19:42:39 +04:00
*/
2021-01-16 19:22:14 +03:00
void folio_add_wait_queue(struct folio *folio, wait_queue_entry_t *waiter)
2009-04-03 19:42:39 +04:00
{
2021-01-16 19:22:14 +03:00
wait_queue_head_t *q = folio_waitqueue(folio);
2009-04-03 19:42:39 +04:00
unsigned long flags;

spin_lock_irqsave(&q->lock, flags);
2017-08-29 02:45:40 +03:00
__add_wait_queue_entry_tail(q, waiter);
2021-01-16 19:22:14 +03:00
folio_set_waiters(folio);
2009-04-03 19:42:39 +04:00
spin_unlock_irqrestore(&q->lock, flags);
}
2021-01-16 19:22:14 +03:00
EXPORT_SYMBOL_GPL(folio_add_wait_queue);
2009-04-03 19:42:39 +04:00
mm: optimize PageWaiters bit use for unlock_page()
In commit 62906027091f ("mm: add PageWaiters indicating tasks are
waiting for a page bit") Nick Piggin made our page locking no longer
unconditionally touch the hashed page waitqueue, which not only helps
performance in general, but is particularly helpful on NUMA machines
where the hashed wait queues can bounce around a lot.
However, the "clear lock bit atomically and then test the waiters bit"
sequence turns out to be much more expensive than it needs to be,
because you get a nasty stall when trying to access the same word that
just got updated atomically.
On architectures where locking is done with LL/SC, this would be trivial
to fix with a new primitive that clears one bit and tests another
atomically, but that ends up not working on x86, where the only atomic
operations that return the result end up being cmpxchg and xadd. The
atomic bit operations return the old value of the same bit we changed,
not the value of an unrelated bit.
On x86, we could put the lock bit in the high bit of the byte, and use
"xadd" with that bit (where the overflow ends up not touching other
bits), and look at the other bits of the result. However, an even
simpler model is to just use a regular atomic "and" to clear the lock
bit, and then the sign bit in eflags will indicate the resulting state
of the unrelated bit #7.
So by moving the PageWaiters bit up to bit #7, we can atomically clear
the lock bit and test the waiters bit on x86 too. And architectures
with LL/SC (which is all the usual RISC suspects), the particular bit
doesn't matter, so they are fine with this approach too.
This avoids the extra access to the same atomic word, and thus avoids
the costly stall at page unlock time.
The only downside is that the interface ends up being a bit odd and
specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
love the resulting name of the new primitive, but I'd rather make the
name be descriptive and very clear about the limitation imposed by
trying to work across all relevant architectures than make it be some
generic thing that doesn't make the odd semantics explicit.
So this introduces the new architecture primitive
clear_bit_unlock_is_negative_byte();
and adds the trivial implementation for x86. We have a generic
non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
combination) which can be overridden by any architecture that can do
better. According to Nick, Power has the same hickup x86 has, for
example, but some other architectures may not even care.
All these optimizations mean that my page locking stress-test (which is
just executing a lot of small short-lived shell scripts: "make test" in
the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, just the unlock_page() costs were just
over 3% of all CPU overhead on "make test". After this, it's down to
0.66%, so just a quarter of the cost it used to be.
(The difference on NUMA is bigger, but there this micro-optimization is
likely less noticeable, since the big issue on NUMA was not the accesses
to 'struct page', but the waitqueue accesses that were already removed
by Nick's earlier commit).
Acked-by: Nick Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-27 22:40:38 +03:00
#ifndef clear_bit_unlock_is_negative_byte
/*
 * PG_waiters is the high bit in the same byte as PG_locked.
 *
 * On x86 (and on many other architectures), we can clear PG_lock and
 * test the sign bit at the same time. But if the architecture does
 * not support that special operation, we just do this all by hand
 * instead.
 *
 * The read of PG_waiters has to be after (or concurrently with) PG_locked
2020-06-05 02:49:22 +03:00
 * being cleared, but a memory barrier should be unnecessary since it is
2016-12-27 22:40:38 +03:00
 * in the same byte as PG_locked.
 */
static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem)
{
	clear_bit_unlock(nr, mem);
	/* smp_mb__after_atomic(); */
2016-12-30 01:16:07 +03:00
	return test_bit(PG_waiters, mem);
2016-12-27 22:40:38 +03:00
}
#endif
2005-04-17 02:20:36 +04:00
/**
2020-12-07 23:44:35 +03:00
 * folio_unlock - Unlock a locked folio.
 * @folio: The folio.
 *
 * Unlocks the folio and wakes up any thread sleeping on the page lock.
 *
 * Context: May be called from interrupt or process context.  May not be
 * called from NMI context.
2005-04-17 02:20:36 +04:00
*/
2020-12-07 23:44:35 +03:00
void folio_unlock(struct folio *folio)
2005-04-17 02:20:36 +04:00
{
2020-12-07 23:44:35 +03:00
	/* Bit 7 allows x86 to check the byte's sign bit */
2016-12-27 22:40:38 +03:00
	BUILD_BUG_ON(PG_waiters != 7);
2020-12-07 23:44:35 +03:00
	BUILD_BUG_ON(PG_locked > 7);
	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
	if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0)))
2021-01-16 01:14:48 +03:00
		folio_wake_bit(folio, PG_locked);
2005-04-17 02:20:36 +04:00
}
2020-12-07 23:44:35 +03:00
EXPORT_SYMBOL(folio_unlock);
2005-04-17 02:20:36 +04:00
2020-02-10 13:00:21 +03:00
/**
2021-04-23 05:58:32 +03:00
 * folio_end_private_2 - Clear PG_private_2 and wake any waiters.
 * @folio: The folio.
2020-02-10 13:00:21 +03:00
 *
2021-04-23 05:58:32 +03:00
 * Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for
 * it.  The folio reference held for PG_private_2 being set is released.
2020-02-10 13:00:21 +03:00
 *
2021-04-23 05:58:32 +03:00
 * This is, for example, used when a netfs folio is being written to a local
 * disk cache, thereby allowing writes to the cache for the same folio to be
2020-02-10 13:00:21 +03:00
 * serialised.
 */
2021-04-23 05:58:32 +03:00
void folio_end_private_2(struct folio *folio)
2020-02-10 13:00:21 +03:00
{
2021-01-16 01:14:48 +03:00
	VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
	clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
	folio_wake_bit(folio, PG_private_2);
	folio_put(folio);
2020-02-10 13:00:21 +03:00
}
2021-04-23 05:58:32 +03:00
EXPORT_SYMBOL(folio_end_private_2);
2020-02-10 13:00:21 +03:00
/**
2021-04-23 05:58:32 +03:00
 * folio_wait_private_2 - Wait for PG_private_2 to be cleared on a folio.
 * @folio: The folio to wait on.
2020-02-10 13:00:21 +03:00
 *
2021-04-23 05:58:32 +03:00
 * Wait for PG_private_2 (aka PG_fscache) to be cleared on a folio.
2020-02-10 13:00:21 +03:00
 */
2021-04-23 05:58:32 +03:00
void folio_wait_private_2(struct folio *folio)
2020-02-10 13:00:21 +03:00
{
2021-03-04 20:02:54 +03:00
	while (folio_test_private_2(folio))
		folio_wait_bit(folio, PG_private_2);
2020-02-10 13:00:21 +03:00
}
2021-04-23 05:58:32 +03:00
EXPORT_SYMBOL(folio_wait_private_2);
2020-02-10 13:00:21 +03:00
/**
2021-04-23 05:58:32 +03:00
 * folio_wait_private_2_killable - Wait for PG_private_2 to be cleared on a folio.
 * @folio: The folio to wait on.
2020-02-10 13:00:21 +03:00
 *
2021-04-23 05:58:32 +03:00
 * Wait for PG_private_2 (aka PG_fscache) to be cleared on a folio or until a
2020-02-10 13:00:21 +03:00
 * fatal signal is received by the calling task.
 *
 * Return:
 * - 0 if successful.
 * - -EINTR if a fatal signal was encountered.
 */
2021-04-23 05:58:32 +03:00
int folio_wait_private_2_killable(struct folio *folio)
2020-02-10 13:00:21 +03:00
{
	int ret = 0;
2021-03-04 20:02:54 +03:00
	while (folio_test_private_2(folio)) {
		ret = folio_wait_bit_killable(folio, PG_private_2);
2020-02-10 13:00:21 +03:00
		if (ret < 0)
			break;
	}

	return ret;
}
2021-04-23 05:58:32 +03:00
EXPORT_SYMBOL(folio_wait_private_2_killable);
2020-02-10 13:00:21 +03:00
2006-06-23 13:03:49 +04:00
/**
2021-03-03 23:21:55 +03:00
 * folio_end_writeback - End writeback against a folio.
 * @folio: The folio.
2005-04-17 02:20:36 +04:00
 */
2021-03-03 23:21:55 +03:00
void folio_end_writeback(struct folio *folio)
2005-04-17 02:20:36 +04:00
{
2014-06-05 03:10:34 +04:00
	/*
2021-03-03 23:21:55 +03:00
	 * folio_test_clear_reclaim() could be used here but it is an
	 * atomic operation and overkill in this particular case. Failing
	 * to shuffle a folio marked for immediate reclaim is too mild
	 * a gain to justify taking an atomic operation penalty at the
	 * end of every folio writeback.
2014-06-05 03:10:34 +04:00
	 */
2021-03-03 23:21:55 +03:00
	if (folio_test_reclaim(folio)) {
		folio_clear_reclaim(folio);
2020-12-08 09:25:39 +03:00
		folio_rotate_reclaimable(folio);
2014-06-05 03:10:34 +04:00
	}
2008-04-28 13:12:38 +04:00
mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)
Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
no longer an ext4 page at all.
The problem is that PageWriteback is not accompanied by a page reference
(as the NOTE at the end of test_clear_page_writeback() acknowledges): as
soon as TestClearPageWriteback has been done, that page could be removed
from page cache, freed, and reused for something else by the time that
wake_up_page() is reached.
https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
check; but I'm paranoid about even looking at an unreferenced struct page,
lest its memory might itself have already been reused or hotremoved (and
wake_up_page_bit() may modify that memory with its ClearPageWaiters()).
Then on crashing a second time, realized there's a stronger reason against
that approach. If my testing just occasionally crashes on that check,
when the page is reused for part of a compound page, wouldn't it be much
more common for the page to get reused as an order-0 page before reaching
wake_up_page()? And on rare occasions, might that reused page already be
marked PageWriteback by its new user, and already be waited upon? What
would that look like?
It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
in write_cache_pages() (though I have never seen that crash myself).
Matthew Wilcox explaining this to himself:
"page is allocated, added to page cache, dirtied, writeback starts,
--- thread A ---
filesystem calls end_page_writeback()
test_clear_page_writeback()
--- context switch to thread B ---
truncate_inode_pages_range() finds the page, it doesn't have writeback set,
we delete it from the page cache. Page gets reallocated, dirtied, writeback
starts again. Then we call write_cache_pages(), see
PageWriteback() set, call wait_on_page_writeback()
--- context switch back to thread A ---
wake_up_page(page, PG_writeback);
... thread B is woken, but because the wakeup was for the old use of
the page, PageWriteback is still set.
Devious"
And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
this would have been much less likely: before that, wake_page_function()'s
non-exclusive case would stop walking and not wake if it found Writeback
already set again; whereas now the non-exclusive case proceeds to wake.
I have not thought of a fix that does not add a little overhead: the
simplest fix is for end_page_writeback() to get_page() before calling
test_clear_page_writeback(), then put_page() after wake_up_page().
Was there a chance of missed wakeups before, since a page freed before
reaching wake_up_page() would have PageWaiters cleared? I think not,
because each waiter does hold a reference on the page. This bug comes
when the old use of the page, the one we do TestClearPageWriteback on,
had *no* waiters, so no additional page reference beyond the page cache
(and whoever racily freed it). The reuse of the page has a waiter
holding a reference, and its own PageWriteback set; but the belated
wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).
Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
Reported-by: Qian Cai <cai@lca.pw>
Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-24 19:46:43 +03:00
	/*
2021-03-03 23:21:55 +03:00
	 * Writeback does not hold a folio reference of its own, relying
2020-11-24 19:46:43 +03:00
	 * on truncation to wait for the clearing of PG_writeback.
2021-03-03 23:21:55 +03:00
	 * But here we must make sure that the folio is not freed and
	 * reused before the folio_wake().
2020-11-24 19:46:43 +03:00
	 */
2021-03-03 23:21:55 +03:00
	folio_get(folio);
2021-01-16 07:34:16 +03:00
	if (!__folio_end_writeback(folio))
2008-04-28 13:12:38 +04:00
		BUG();
2014-03-17 21:06:10 +04:00
	smp_mb__after_atomic();
2021-03-03 23:21:55 +03:00
	folio_wake(folio, PG_writeback);
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
"257 patches.
Subsystems affected by this patch series: scripts, ocfs2, vfs, and
mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
cleanups, kfence, and damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
mm/damon: remove return value from before_terminate callback
mm/damon: fix a few spelling mistakes in comments and a pr_debug message
mm/damon: simplify stop mechanism
Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
Docs/admin-guide/mm/damon/start: simplify the content
Docs/admin-guide/mm/damon/start: fix a wrong link
Docs/admin-guide/mm/damon/start: fix wrong example commands
mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
mm/damon: remove unnecessary variable initialization
Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
selftests/damon: support watermarks
mm/damon/dbgfs: support watermarks
mm/damon/schemes: activate schemes based on a watermarks mechanism
tools/selftests/damon: update for regions prioritization of schemes
mm/damon/dbgfs: support prioritization weights
mm/damon/vaddr,paddr: support pageout prioritization
mm/damon/schemes: prioritize regions within the quotas
mm/damon/selftests: support schemes quotas
mm/damon/dbgfs: support quotas of schemes
...
2021-11-07 00:08:17 +03:00
	acct_reclaim_writeback(folio);
2021-03-03 23:21:55 +03:00
	folio_put(folio);
2005-04-17 02:20:36 +04:00
}
2021-03-03 23:21:55 +03:00
EXPORT_SYMBOL(folio_end_writeback);
2005-04-17 02:20:36 +04:00
2006-06-23 13:03:49 +04:00
/**
2021-03-02 03:38:25 +03:00
 * __folio_lock - Get a lock on the folio, assuming we need to sleep to get it.
 * @folio: The folio to lock
2005-04-17 02:20:36 +04:00
 */
2021-03-02 03:38:25 +03:00
void __folio_lock(struct folio *folio)
2005-04-17 02:20:36 +04:00
{
2021-03-04 20:02:54 +03:00
	folio_wait_bit_common(folio, PG_locked, TASK_UNINTERRUPTIBLE,
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 11:36:14 +03:00
EXCLUSIVE);
2005-04-17 02:20:36 +04:00
}
2021-03-02 03:38:25 +03:00
EXPORT_SYMBOL(__folio_lock);
2005-04-17 02:20:36 +04:00
2020-12-08 08:07:31 +03:00
int __folio_lock_killable(struct folio *folio)
2007-12-06 19:18:49 +03:00
{
2021-03-04 20:02:54 +03:00
return folio_wait_bit_common(folio, PG_locked, TASK_KILLABLE,
2018-12-28 11:36:14 +03:00
EXCLUSIVE);
2007-12-06 19:18:49 +03:00
}
2020-12-08 08:07:31 +03:00
EXPORT_SYMBOL_GPL(__folio_lock_killable);
2007-12-06 19:18:49 +03:00
2020-12-31 01:58:40 +03:00
static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
2020-05-22 18:12:09 +03:00
{
2021-01-16 19:22:14 +03:00
struct wait_queue_head *q = folio_waitqueue(folio);
2021-02-24 23:02:09 +03:00
int ret = 0;
2021-01-16 19:22:14 +03:00
wait->folio = folio;
2021-02-24 23:02:09 +03:00
wait->bit_nr = PG_locked;
spin_lock_irq(&q->lock);
__add_wait_queue_entry_tail(q, &wait->wait);
2020-12-31 01:58:40 +03:00
folio_set_waiters(folio);
ret = !folio_trylock(folio);
2021-02-24 23:02:09 +03:00
/*
* If we were successful now, we know we're still on the
* waitqueue as we're still under the lock. This means it's
* safe to remove and return success, we know the callback
* isn't going to trigger.
*/
if (!ret)
__remove_wait_queue(q, &wait->wait);
else
ret = -EIOCBQUEUED;
spin_unlock_irq(&q->lock);
return ret;
2020-05-22 18:12:09 +03:00
}
2014-08-07 03:07:24 +04:00
/*
* Return values:
2021-03-19 04:39:45 +03:00
* true - folio is locked; mmap_lock is still held.
* false - folio is not locked.
2020-06-09 07:33:51 +03:00
* mmap_lock has been released (mmap_read_unlock(), unless flags had both
2014-08-07 03:07:24 +04:00
* FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
2020-06-09 07:33:54 +03:00
* which case mmap_lock is still held.
2014-08-07 03:07:24 +04:00
*
2021-03-19 04:39:45 +03:00
* If neither ALLOW_RETRY nor KILLABLE are set, will always return true
* with the folio locked and the mmap_lock unperturbed.
2014-08-07 03:07:24 +04:00
*/
2021-03-19 04:39:45 +03:00
bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
2010-10-27 01:21:57 +04:00
unsigned int flags)
{
2020-04-02 07:08:45 +03:00
if (fault_flag_allow_retry_first(flags)) {
2011-05-25 04:11:30 +04:00
/*
2020-06-09 07:33:54 +03:00
* CAUTION! In this case, mmap_lock is not released
2011-05-25 04:11:30 +04:00
* even though we return false.
*/
if (flags & FAULT_FLAG_RETRY_NOWAIT)
2021-03-19 04:39:45 +03:00
return false;
2011-05-25 04:11:30 +04:00
2020-06-09 07:33:25 +03:00
mmap_read_unlock(mm);
2011-05-25 04:11:30 +04:00
if (flags & FAULT_FLAG_KILLABLE)
2021-03-04 18:21:02 +03:00
folio_wait_locked_killable(folio);
2011-05-25 04:11:30 +04:00
else
2021-03-04 18:21:02 +03:00
folio_wait_locked(folio);
2021-03-19 04:39:45 +03:00
return false;
2020-12-15 06:05:02 +03:00
}
if (flags & FAULT_FLAG_KILLABLE) {
2021-03-19 04:39:45 +03:00
bool ret;
2011-05-25 04:11:30 +04:00
2020-12-08 08:07:31 +03:00
ret = __folio_lock_killable(folio);
2020-12-15 06:05:02 +03:00
if (ret) {
mmap_read_unlock(mm);
2021-03-19 04:39:45 +03:00
return false;
2020-12-15 06:05:02 +03:00
}
} else {
2020-12-08 08:07:31 +03:00
__folio_lock(folio);
2010-10-27 01:21:57 +04:00
}
2020-12-15 06:05:02 +03:00
2021-03-19 04:39:45 +03:00
return true;
2010-10-27 01:21:57 +04:00
}
2014-04-04 01:47:44 +04:00
/**
2017-11-21 22:07:06 +03:00
* page_cache_next_miss() - Find the next gap in the page cache.
* @mapping: Mapping.
* @index: Index.
* @max_scan: Maximum range to search.
2014-04-04 01:47:44 +04:00
*
2017-11-21 22:07:06 +03:00
* Search the range [index, min(index + max_scan - 1, ULONG_MAX)] for the
* gap with the lowest index.
2014-04-04 01:47:44 +04:00
*
2017-11-21 22:07:06 +03:00
* This function may be called under the rcu_read_lock. However, this will
* not atomically search a snapshot of the cache at a single point in time.
* For example, if a gap is created at index 5, then subsequently a gap is
* created at index 10, page_cache_next_miss() covering both indices may
* return 10 if called under the rcu_read_lock.
2014-04-04 01:47:44 +04:00
*
2017-11-21 22:07:06 +03:00
* Return: The index of the gap if found, otherwise an index outside the
* range specified (in which case 'return - index >= max_scan' will be true).
2023-06-22 00:24:02 +03:00
* In the rare case of index wrap-around, 0 will be returned.
2014-04-04 01:47:44 +04:00
*/
2017-11-21 22:07:06 +03:00
pgoff_t page_cache_next_miss(struct address_space *mapping,
2014-04-04 01:47:44 +04:00
pgoff_t index, unsigned long max_scan)
{
2017-11-21 22:07:06 +03:00
XA_STATE(xas, &mapping->i_pages, index);
2014-04-04 01:47:44 +04:00
2017-11-21 22:07:06 +03:00
while (max_scan--) {
void *entry = xas_next(&xas);
if (!entry || xa_is_value(entry))
2023-06-22 00:24:02 +03:00
break;
if (xas.xa_index == 0)
break;
2014-04-04 01:47:44 +04:00
}
2023-06-22 00:24:02 +03:00
return xas.xa_index;
2014-04-04 01:47:44 +04:00
}
2017-11-21 22:07:06 +03:00
EXPORT_SYMBOL(page_cache_next_miss);
2014-04-04 01:47:44 +04:00
/**
2019-05-14 03:21:29 +03:00
* page_cache_prev_miss() - Find the previous gap in the page cache.
2017-11-21 22:07:06 +03:00
* @mapping: Mapping.
* @index: Index.
* @max_scan: Maximum range to search.
2014-04-04 01:47:44 +04:00
*
2017-11-21 22:07:06 +03:00
* Search the range [max(index - max_scan + 1, 0), index] for the
* gap with the highest index.
2014-04-04 01:47:44 +04:00
*
2017-11-21 22:07:06 +03:00
* This function may be called under the rcu_read_lock. However, this will
* not atomically search a snapshot of the cache at a single point in time.
* For example, if a gap is created at index 10, then subsequently a gap is
* created at index 5, page_cache_prev_miss() covering both indices may
* return 5 if called under the rcu_read_lock.
2014-04-04 01:47:44 +04:00
*
2017-11-21 22:07:06 +03:00
* Return: The index of the gap if found, otherwise an index outside the
* range specified (in which case 'index - return >= max_scan' will be true).
2023-06-22 00:24:02 +03:00
* In the rare case of wrap-around, ULONG_MAX will be returned.
2014-04-04 01:47:44 +04:00
*/
2017-11-21 22:07:06 +03:00
pgoff_t page_cache_prev_miss(struct address_space *mapping,
2014-04-04 01:47:44 +04:00
pgoff_t index, unsigned long max_scan)
{
2017-11-21 22:07:06 +03:00
XA_STATE(xas, &mapping->i_pages, index);
2014-04-04 01:47:44 +04:00
2017-11-21 22:07:06 +03:00
while (max_scan--) {
void *entry = xas_prev(&xas);
if (!entry || xa_is_value(entry))
2023-06-22 00:24:02 +03:00
break;
if (xas.xa_index == ULONG_MAX)
break;
2014-04-04 01:47:44 +04:00
}
2023-06-22 00:24:02 +03:00
return xas.xa_index;
2014-04-04 01:47:44 +04:00
}
2017-11-21 22:07:06 +03:00
EXPORT_SYMBOL(page_cache_prev_miss);
2014-04-04 01:47:44 +04:00
2021-05-10 23:33:22 +03:00
/*
* Lockless page cache protocol:
* On the lookup side:
* 1. Load the folio from i_pages
* 2. Increment the refcount if it's not zero
* 3. If the folio is not found by xas_reload(), put the refcount and retry
*
* On the removal side:
* A. Freeze the page (by zeroing the refcount if nobody else has a reference)
* B. Remove the page from i_pages
* C. Return the page to the page allocator
*
* This means that any page may have its reference count temporarily
* increased by a speculative page cache (or fast GUP) lookup as it can
* be allocated by another user before the RCU grace period expires.
* Because the refcount temporarily acquired here may end up being the
* last refcount on the page, any page allocation must be freeable by
* folio_put().
*/
2021-02-26 04:15:36 +03:00
/*
2023-03-07 17:34:05 +03:00
* filemap_get_entry - Get a page cache entry.
2006-06-23 13:03:49 +04:00
* @mapping: the address_space to search
2020-10-14 02:51:34 +03:00
* @index: The page cache index.
2014-04-04 01:47:46 +04:00
*
2020-12-16 07:22:38 +03:00
* Looks up the page cache entry at @mapping & @index. If it is a folio,
* it is returned with an increased refcount. If it is a shadow entry
* of a previously evicted folio, or a swap entry from shmem/tmpfs,
* it is returned without further action.
2006-06-23 13:03:49 +04:00
*
2020-12-16 07:22:38 +03:00
* Return: The folio, swap or shadow entry, %NULL if nothing is found.
2005-04-17 02:20:36 +04:00
*/
2023-03-07 17:34:05 +03:00
void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
2005-04-17 02:20:36 +04:00
{
2020-10-14 02:51:34 +03:00
XA_STATE(xas, &mapping->i_pages, index);
2020-12-16 07:22:38 +03:00
struct folio *folio;
2005-04-17 02:20:36 +04:00
2008-07-26 06:45:31 +04:00
rcu_read_lock();
repeat:
2018-05-16 23:12:50 +03:00
xas_reset(&xas);
2020-12-16 07:22:38 +03:00
folio = xas_load(&xas);
if (xas_retry(&xas, folio))
2018-05-16 23:12:50 +03:00
goto repeat;
/*
* A shadow entry of a recently evicted page, or a swap entry from
* shmem/tmpfs. Return it without attempting to raise page count.
*/
2020-12-16 07:22:38 +03:00
if (!folio || xa_is_value(folio))
2018-05-16 23:12:50 +03:00
goto out;
2016-07-27 01:26:04 +03:00
2020-12-16 07:22:38 +03:00
if (!folio_try_get_rcu(folio))
2018-05-16 23:12:50 +03:00
goto repeat;
2016-07-27 01:26:04 +03:00
2020-12-16 07:22:38 +03:00
if (unlikely(folio != xas_reload(&xas))) {
folio_put(folio);
2018-05-16 23:12:50 +03:00
goto repeat;
2008-07-26 06:45:31 +04:00
}
2010-11-12 01:05:19 +03:00
out:
2008-07-26 06:45:31 +04:00
rcu_read_unlock();
2020-12-16 07:22:38 +03:00
return folio;
2005-04-17 02:20:36 +04:00
}
2014-04-04 01:47:46 +04:00
/**
2021-03-08 19:45:35 +03:00
* __filemap_get_folio - Find and get a reference to a folio.
2020-04-02 07:05:07 +03:00
* @mapping: The address_space to search.
* @index: The page index.
2021-03-08 19:45:35 +03:00
* @fgp_flags: %FGP flags modify how the folio is returned.
* @gfp: Memory allocation flags to use if %FGP_CREAT is specified.
2005-04-17 02:20:36 +04:00
*
2020-04-02 07:05:07 +03:00
* Looks up the page cache entry at @mapping & @index.
2014-04-04 01:47:46 +04:00
*
2020-04-02 07:05:07 +03:00
* @fgp_flags can be zero or more of these flags:
2017-03-30 23:11:36 +03:00
*
2021-03-08 19:45:35 +03:00
* * %FGP_ACCESSED - The folio will be marked accessed.
* * %FGP_LOCK - The folio is returned locked.
2020-04-02 07:05:07 +03:00
* * %FGP_CREAT - If no page is present then a new page is allocated using
2021-03-08 19:45:35 +03:00
* @gfp and added to the page cache and the VM's LRU list.
2020-04-02 07:05:07 +03:00
* The page is returned locked and with an increased refcount.
* * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
* page is already in cache. If the page was allocated, unlock it before
* returning so the caller can do the same dance.
2021-03-08 19:45:35 +03:00
* * %FGP_WRITE - The page will be written to by the caller.
* * %FGP_NOFS - __GFP_FS will get cleared in gfp.
* * %FGP_NOWAIT - Don't get blocked by page lock.
2020-12-24 20:55:56 +03:00
* * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
2005-04-17 02:20:36 +04:00
*
2020-04-02 07:05:07 +03:00
* If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
* if the %GFP flags specified for %FGP_CREAT are atomic.
2005-04-17 02:20:36 +04:00
*
2014-06-05 03:10:31 +04:00
* If there is a page cache page, it is returned with an increased refcount.
2019-03-06 02:48:42 +03:00
*
2023-03-07 17:34:10 +03:00
* Return: The found folio or an ERR_PTR() otherwise.
2005-04-17 02:20:36 +04:00
*/
2021-03-08 19:45:35 +03:00
struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
int fgp_flags, gfp_t gfp)
2005-04-17 02:20:36 +04:00
{
2021-03-08 19:45:35 +03:00
struct folio *folio;
2014-06-05 03:10:31 +04:00
2005-04-17 02:20:36 +04:00
repeat:
2023-03-07 17:34:05 +03:00
folio = filemap_get_entry(mapping, index);
2023-03-07 17:34:09 +03:00
if (xa_is_value(folio))
2021-03-08 19:45:35 +03:00
folio = NULL;
if (!folio)
2014-06-05 03:10:31 +04:00
goto no_page;
if (fgp_flags & FGP_LOCK) {
if (fgp_flags & FGP_NOWAIT) {
2021-03-08 19:45:35 +03:00
if (!folio_trylock(folio)) {
folio_put(folio);
2023-03-07 17:34:10 +03:00
return ERR_PTR(-EAGAIN);
2014-06-05 03:10:31 +04:00
}
} else {
2021-03-08 19:45:35 +03:00
folio_lock(folio);
2014-06-05 03:10:31 +04:00
}
/* Has the page been truncated? */
2021-03-08 19:45:35 +03:00
if (unlikely(folio->mapping != mapping)) {
folio_unlock(folio);
folio_put(folio);
2014-06-05 03:10:31 +04:00
goto repeat;
}
2021-03-08 19:45:35 +03:00
VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
2014-06-05 03:10:31 +04:00
}
2018-12-28 11:37:35 +03:00
if (fgp_flags & FGP_ACCESSED)
2021-03-08 19:45:35 +03:00
folio_mark_accessed(folio);
2020-08-07 09:19:55 +03:00
else if (fgp_flags & FGP_WRITE) {
/* Clear idle flag for buffer write */
2021-03-08 19:45:35 +03:00
if (folio_test_idle(folio))
folio_clear_idle(folio);
2020-08-07 09:19:55 +03:00
}
2014-06-05 03:10:31 +04:00
2020-12-24 20:55:56 +03:00
if (fgp_flags & FGP_STABLE)
folio_wait_stable(folio);
2014-06-05 03:10:31 +04:00
no_page:
2021-03-08 19:45:35 +03:00
if (!folio && (fgp_flags & FGP_CREAT)) {
2014-06-05 03:10:31 +04:00
int err;
2020-09-24 09:51:40 +03:00
if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
2021-03-08 19:45:35 +03:00
gfp |= __GFP_WRITE;
2014-12-29 22:30:35 +03:00
if (fgp_flags & FGP_NOFS)
2021-03-08 19:45:35 +03:00
gfp &= ~__GFP_FS;
2022-07-01 23:04:43 +03:00
if (fgp_flags & FGP_NOWAIT) {
gfp &= ~GFP_KERNEL;
gfp |= GFP_NOWAIT | __GFP_NOWARN;
}
2014-06-05 03:10:31 +04:00
2021-03-08 19:45:35 +03:00
folio = filemap_alloc_folio(gfp, 0);
if (!folio)
2023-03-07 17:34:10 +03:00
return ERR_PTR(-ENOMEM);
2014-06-05 03:10:31 +04:00
filemap: kill page_cache_read usage in filemap_fault
Patch series "drop the mmap_sem when doing IO in the fault path", v6.
Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions. Most are all
gone now, but this one is sort of weird since it's not necessarily a
priority inversion that happens within the kernel, but rather because of
something userspace does.
We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such. This involves running
'ps' or other such utilities. These utilities will often walk
/proc/<pid>/whatever, and these files can sometimes need to
down_read(&task->mmap_sem). Not usually a big deal, but we noticed when
we are stress testing that sometimes our protected application has latency
spikes trying to get the mmap_sem for tasks that are in lower priority
cgroups.
This is because any down_write() on a semaphore essentially turns it into
a mutex, so even if we currently have it held for reading, any new readers
will not be allowed on to keep from starving the writer. This is fine,
except a lower priority task could be stuck doing IO because it has been
throttled to the point that its IO is taking much longer than normal. But
because a higher priority group depends on this completing it is now stuck
behind lower priority work.
In order to avoid this particular priority inversion we want to use the
existing retry mechanism to stop from holding the mmap_sem at all if we
are going to do IO. This already exists in the read case sort of, but
needed to be extended for more than just grabbing the page lock. With
io.latency we throttle at submit_bio() time, so the readahead stuff can
block and even page_cache_read can block, so all these paths need to have
the mmap_sem dropped.
The other big thing is ->page_mkwrite. btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation. We use the same retry method as the read path, and
simply cache the page and verify the page is still setup properly the next
pass through ->page_mkwrite().
I've tested these patches with xfstests and there are no regressions.
This patch (of 3):
If we do not have a page at filemap_fault time we'll do this weird forced
page_cache_read thing to populate the page, and then drop it again and
loop around and find it. This makes for 2 ways we can read a page in
filemap_fault, and it's not really needed. Instead add a FGP_FOR_MMAP
flag so that pagecache_get_page() will return a unlocked page that's in
pagecache. Then use the normal page locking and readpage logic already in
filemap_fault. This simplifies the no page in page cache case
significantly.
[akpm@linux-foundation.org: fix comment text]
[josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-13 21:44:14 +03:00
if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
2014-06-05 03:10:31 +04:00
fgp_flags |= FGP_LOCK;
2014-08-07 03:06:43 +04:00
/* Init accessed so avoid atomic mark_page_accessed later */
2014-06-05 03:10:31 +04:00
if (fgp_flags & FGP_ACCESSED)
2021-03-08 19:45:35 +03:00
__folio_set_referenced(folio);
2014-06-05 03:10:31 +04:00
2021-03-08 19:45:35 +03:00
err = filemap_add_folio(mapping, folio, index, gfp);
2007-10-16 12:24:57 +04:00
if (unlikely(err)) {
2021-03-08 19:45:35 +03:00
folio_put(folio);
folio = NULL;
2007-10-16 12:24:57 +04:00
if (err == -EEXIST)
goto repeat;
2005-04-17 02:20:36 +04:00
}
2019-03-13 21:44:14 +03:00
/*
2021-03-08 19:45:35 +03:00
* filemap_add_folio locks the page, and for mmap
* we expect an unlocked page.
filemap: kill page_cache_read usage in filemap_fault
Patch series "drop the mmap_sem when doing IO in the fault path", v6.
Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions. Most are all
gone now, but this one is sort of weird since it's not necessarily a
priority inversion that happens within the kernel, but rather because of
something userspace does.
We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such. This involves running
'ps' or other such utilities. These utilities will often walk
/proc/<pid>/whatever, and these files can sometimes need to
down_read(&task->mmap_sem). Not usually a big deal, but we noticed when
we are stress testing that sometimes our protected application has latency
spikes trying to get the mmap_sem for tasks that are in lower priority
cgroups.
This is because any down_write() on a semaphore essentially turns it into
a mutex, so even if we currently have it held for reading, any new readers
will not be allowed on to keep from starving the writer. This is fine,
except a lower priority task could be stuck doing IO because it has been
throttled to the point that its IO is taking much longer than normal. But
because a higher priority group depends on this completing it is now stuck
behind lower priority work.
In order to avoid this particular priority inversion we want to use the
existing retry mechanism to stop from holding the mmap_sem at all if we
are going to do IO. This already exists in the read case sort of, but
needed to be extended for more than just grabbing the page lock. With
io.latency we throttle at submit_bio() time, so the readahead stuff can
block and even page_cache_read can block, so all these paths need to have
the mmap_sem dropped.
The other big thing is ->page_mkwrite. btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation. We use the same retry method as the read path, and
simply cache the page and verify the page is still setup properly the next
pass through ->page_mkwrite().
I've tested these patches with xfstests and there are no regressions.
This patch (of 3):
If we do not have a page at filemap_fault time we'll do this weird forced
page_cache_read thing to populate the page, and then drop it again and
loop around and find it. This makes for 2 ways we can read a page in
filemap_fault, and it's not really needed. Instead add a FGP_FOR_MMAP
flag so that pagecache_get_page() will return an unlocked page that's in
pagecache. Then use the normal page locking and readpage logic already in
filemap_fault. This simplifies the no page in page cache case
significantly.
[akpm@linux-foundation.org: fix comment text]
[josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-13 21:44:14 +03:00
*/
2021-03-08 19:45:35 +03:00
	if (folio && (fgp_flags & FGP_FOR_MMAP))
		folio_unlock(folio);
2005-04-17 02:20:36 +04:00
}
2014-06-05 03:10:31 +04:00
2023-03-07 17:34:10 +03:00
	if (!folio)
		return ERR_PTR(-ENOENT);
2021-03-08 19:45:35 +03:00
	return folio;
2005-04-17 02:20:36 +04:00
}
2021-03-08 19:45:35 +03:00
EXPORT_SYMBOL(__filemap_get_folio);
2005-04-17 02:20:36 +04:00
2020-12-17 08:12:26 +03:00
static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
2021-02-26 04:15:44 +03:00
		xa_mark_t mark)
{
2020-12-17 08:12:26 +03:00
	struct folio *folio;
2021-02-26 04:15:44 +03:00
retry:
	if (mark == XA_PRESENT)
2020-12-17 08:12:26 +03:00
		folio = xas_find(xas, max);
2021-02-26 04:15:44 +03:00
	else
2020-12-17 08:12:26 +03:00
		folio = xas_find_marked(xas, max, mark);
2021-02-26 04:15:44 +03:00
2020-12-17 08:12:26 +03:00
	if (xas_retry(xas, folio))
2021-02-26 04:15:44 +03:00
		goto retry;
	/*
	 * A shadow entry of a recently evicted page, a swap
	 * entry from shmem/tmpfs or a DAX entry.  Return it
	 * without attempting to raise page count.
	 */
2020-12-17 08:12:26 +03:00
	if (!folio || xa_is_value(folio))
		return folio;
2021-02-26 04:15:44 +03:00
2020-12-17 08:12:26 +03:00
	if (!folio_try_get_rcu(folio))
2021-02-26 04:15:44 +03:00
		goto reset;
2020-12-17 08:12:26 +03:00
	if (unlikely(folio != xas_reload(xas))) {
		folio_put(folio);
2021-02-26 04:15:44 +03:00
		goto reset;
	}
2020-12-17 08:12:26 +03:00
	return folio;
2021-02-26 04:15:44 +03:00
reset:
	xas_reset(xas);
	goto retry;
}
2014-04-04 01:47:46 +04:00
/**
 * find_get_entries - gang pagecache lookup
 * @mapping: The address_space to search
 * @start: The starting page cache index
2021-02-26 04:16:00 +03:00
 * @end: The final page index (inclusive).
2020-09-02 06:17:50 +03:00
 * @fbatch: Where the resulting entries are placed.
2014-04-04 01:47:46 +04:00
 * @indices: The cache indices corresponding to the entries in @fbatch
 *
2021-02-26 04:16:11 +03:00
 * find_get_entries() will search for and return a batch of entries in
2020-09-02 06:17:50 +03:00
 * the mapping.  The entries are placed in @fbatch.  find_get_entries()
 * takes a reference on any actual folios it returns.
2014-04-04 01:47:46 +04:00
 *
2020-09-02 06:17:50 +03:00
 * The entries have ascending indexes.  The indices may not be consecutive
 * due to not-present entries or large folios.
2014-04-04 01:47:46 +04:00
 *
2020-09-02 06:17:50 +03:00
 * Any shadow entries of evicted folios, or swap entries from
2014-05-06 23:50:05 +04:00
 * shmem/tmpfs, are included in the returned array.
2014-04-04 01:47:46 +04:00
 *
2020-09-02 06:17:50 +03:00
 * Return: The number of entries which were found.
2014-04-04 01:47:46 +04:00
 */
2022-10-17 19:18:00 +03:00
unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
2020-09-02 06:17:50 +03:00
		pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices)
2014-04-04 01:47:46 +04:00
{
2022-10-17 19:18:00 +03:00
	XA_STATE(xas, &mapping->i_pages, *start);
2020-12-17 08:12:26 +03:00
	struct folio *folio;
2014-04-04 01:47:46 +04:00
	rcu_read_lock();
2020-12-17 08:12:26 +03:00
	while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) {
2020-09-02 06:17:50 +03:00
		indices[fbatch->nr] = xas.xa_index;
		if (!folio_batch_add(fbatch, folio))
2014-04-04 01:47:46 +04:00
			break;
	}
	rcu_read_unlock();
2021-02-26 04:16:11 +03:00
2022-10-17 19:18:00 +03:00
	if (folio_batch_count(fbatch)) {
		unsigned long nr = 1;
		int idx = folio_batch_count(fbatch) - 1;

		folio = fbatch->folios[idx];
		if (!xa_is_value(folio) && !folio_test_hugetlb(folio))
			nr = folio_nr_pages(folio);
		*start = indices[idx] + nr;
	}
2020-09-02 06:17:50 +03:00
	return folio_batch_count(fbatch);
2014-04-04 01:47:46 +04:00
}
2021-02-26 04:15:56 +03:00
/**
 * find_lock_entries - Find a batch of pagecache entries.
 * @mapping: The address_space to search.
 * @start: The starting page cache index.
 * @end: The final page index (inclusive).
2021-12-07 22:15:07 +03:00
 * @fbatch: Where the resulting entries are placed.
 * @indices: The cache indices of the entries in @fbatch.
2021-02-26 04:15:56 +03:00
 *
 * find_lock_entries() will return a batch of entries from @mapping.
2020-12-17 08:12:26 +03:00
 * Swap, shadow and DAX entries are included.  Folios are returned
 * locked and with an incremented refcount.  Folios which are locked
 * by somebody else or under writeback are skipped.  Folios which are
 * partially outside the range are not returned.
2021-02-26 04:15:56 +03:00
 *
 * The entries have ascending indexes.  The indices may not be consecutive
2020-12-17 08:12:26 +03:00
 * due to not-present entries, large folios, folios which could not be
 * locked or folios under writeback.
2021-02-26 04:15:56 +03:00
 *
 * Return: The number of entries which were found.
 */
2022-10-17 19:17:59 +03:00
unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
2021-12-07 22:15:07 +03:00
		pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices)
2021-02-26 04:15:56 +03:00
{
2022-10-17 19:17:59 +03:00
	XA_STATE(xas, &mapping->i_pages, *start);
2020-12-17 08:12:26 +03:00
	struct folio *folio;
2021-02-26 04:15:56 +03:00
	rcu_read_lock();
2020-12-17 08:12:26 +03:00
	while ((folio = find_get_entry(&xas, end, XA_PRESENT))) {
		if (!xa_is_value(folio)) {
2022-10-17 19:17:59 +03:00
			if (folio->index < *start)
2021-02-26 04:15:56 +03:00
				goto put;
2020-12-17 08:12:26 +03:00
			if (folio->index + folio_nr_pages(folio) - 1 > end)
2021-02-26 04:15:56 +03:00
				goto put;
2020-12-17 08:12:26 +03:00
			if (!folio_trylock(folio))
2021-02-26 04:15:56 +03:00
				goto put;
2020-12-17 08:12:26 +03:00
			if (folio->mapping != mapping ||
			    folio_test_writeback(folio))
2021-02-26 04:15:56 +03:00
				goto unlock;
2020-12-17 08:12:26 +03:00
			VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index),
					folio);
2021-02-26 04:15:56 +03:00
		}
2021-12-07 22:15:07 +03:00
		indices[fbatch->nr] = xas.xa_index;
		if (!folio_batch_add(fbatch, folio))
2021-02-26 04:15:56 +03:00
			break;
2020-06-28 05:19:08 +03:00
		continue;
2021-02-26 04:15:56 +03:00
unlock:
2020-12-17 08:12:26 +03:00
		folio_unlock(folio);
2021-02-26 04:15:56 +03:00
put:
2020-12-17 08:12:26 +03:00
		folio_put(folio);
2021-02-26 04:15:56 +03:00
	}
	rcu_read_unlock();
2022-10-17 19:17:59 +03:00
	if (folio_batch_count(fbatch)) {
		unsigned long nr = 1;
		int idx = folio_batch_count(fbatch) - 1;

		folio = fbatch->folios[idx];
		if (!xa_is_value(folio) && !folio_test_hugetlb(folio))
			nr = folio_nr_pages(folio);
		*start = indices[idx] + nr;
	}
2021-12-07 22:15:07 +03:00
	return folio_batch_count(fbatch);
2021-02-26 04:15:56 +03:00
}
2005-04-17 02:20:36 +04:00
/**
2022-06-03 22:30:25 +03:00
 * filemap_get_folios - Get a batch of folios
2005-04-17 02:20:36 +04:00
 * @mapping: The address_space to search
 * @start: The starting page index
2017-09-07 02:21:21 +03:00
 * @end: The final page index (inclusive)
2022-06-03 22:30:25 +03:00
 * @fbatch: The batch to fill.
2005-04-17 02:20:36 +04:00
 *
2022-06-03 22:30:25 +03:00
 * Search for and return a batch of folios in the mapping starting at
 * index @start and up to index @end (inclusive).  The folios are returned
 * in @fbatch with an elevated reference count.
2005-04-17 02:20:36 +04:00
 *
2022-06-03 22:30:25 +03:00
 * The first folio may start before @start; if it does, it will contain
 * @start.  The final folio may extend beyond @end; if it does, it will
 * contain @end.  The folios have ascending indices.  There may be gaps
 * between the folios if there are indices which have no folio in the
 * page cache.  If folios are added to or removed from the page cache
 * while this is running, they may or may not be found by this call.
2005-04-17 02:20:36 +04:00
 *
2022-06-03 22:30:25 +03:00
 * Return: The number of folios which were found.
 * We also update @start to index the next folio for the traversal.
2005-04-17 02:20:36 +04:00
 */
2022-06-03 22:30:25 +03:00
unsigned filemap_get_folios(struct address_space *mapping, pgoff_t *start,
		pgoff_t end, struct folio_batch *fbatch)
2005-04-17 02:20:36 +04:00
{
2018-05-17 00:38:56 +03:00
	XA_STATE(xas, &mapping->i_pages, *start);
2020-12-17 08:12:26 +03:00
	struct folio *folio;
2008-07-26 06:45:31 +04:00
	rcu_read_lock();
2022-06-03 22:30:25 +03:00
	while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) {
2018-05-17 00:38:56 +03:00
		/* Skip over shadow, swap and DAX entries */
2020-12-17 08:12:26 +03:00
		if (xa_is_value(folio))
2011-08-04 03:21:28 +04:00
			continue;
2022-06-03 22:30:25 +03:00
		if (!folio_batch_add(fbatch, folio)) {
			unsigned long nr = folio_nr_pages(folio);
2008-07-26 06:45:31 +04:00
2022-06-03 22:30:25 +03:00
			if (folio_test_hugetlb(folio))
				nr = 1;
			*start = folio->index + nr;
2017-09-07 02:21:21 +03:00
			goto out;
		}
2008-07-26 06:45:31 +04:00
	}
2011-03-23 02:33:07 +03:00
2017-09-07 02:21:21 +03:00
	/*
	 * We come here when there is no page beyond @end.  We take care to not
	 * overflow the index @start as it confuses some of the callers.  This
2018-05-17 00:38:56 +03:00
	 * breaks the iteration when there is a page at index -1 but that is
2017-09-07 02:21:21 +03:00
	 * already broken anyway.
	 */
	if (end == (pgoff_t)-1)
		*start = (pgoff_t)-1;
	else
		*start = end + 1;
out:
2008-07-26 06:45:31 +04:00
	rcu_read_unlock();
2017-09-07 02:21:18 +03:00
2022-06-03 22:30:25 +03:00
	return folio_batch_count(fbatch);
}
EXPORT_SYMBOL(filemap_get_folios);
2020-06-28 05:19:08 +03:00
static inline
bool folio_more_pages(struct folio *folio, pgoff_t index, pgoff_t max)
{
	if (!folio_test_large(folio) || folio_test_hugetlb(folio))
		return false;
	if (index >= max)
		return false;
	return index < folio->index + folio_nr_pages(folio) - 1;
2005-04-17 02:20:36 +04:00
}
2006-04-27 10:46:01 +04:00
/**
2022-08-24 03:40:17 +03:00
 * filemap_get_folios_contig - Get a batch of contiguous folios
2006-04-27 10:46:01 +04:00
 * @mapping: The address_space to search
2022-08-24 03:40:17 +03:00
 * @start: The starting page index
 * @end: The final page index (inclusive)
 * @fbatch: The batch to fill
2006-04-27 10:46:01 +04:00
 *
2022-08-24 03:40:17 +03:00
 * filemap_get_folios_contig() works exactly like filemap_get_folios(),
 * except the returned folios are guaranteed to be contiguous.  This may
 * not return all contiguous folios if the batch gets filled up.
2006-04-27 10:46:01 +04:00
 *
2022-08-24 03:40:17 +03:00
 * Return: The number of folios found.
 * Also update @start to be positioned for traversal of the next folio.
2006-04-27 10:46:01 +04:00
 */
2022-08-24 03:40:17 +03:00
unsigned filemap_get_folios_contig(struct address_space *mapping,
		pgoff_t *start, pgoff_t end, struct folio_batch *fbatch)
2006-04-27 10:46:01 +04:00
{
2022-08-24 03:40:17 +03:00
	XA_STATE(xas, &mapping->i_pages, *start);
	unsigned long nr;
2021-03-07 00:38:38 +03:00
	struct folio *folio;
2008-07-26 06:45:31 +04:00
	rcu_read_lock();
2022-08-24 03:40:17 +03:00
	for (folio = xas_load(&xas); folio && xas.xa_index <= end;
			folio = xas_next(&xas)) {
2021-03-07 00:38:38 +03:00
		if (xas_retry(&xas, folio))
2018-05-17 01:00:33 +03:00
			continue;
		/*
		 * If the entry has been swapped out, we can stop looking.
		 * No current caller is looking for DAX entries.
		 */
2021-03-07 00:38:38 +03:00
		if (xa_is_value(folio))
2022-08-24 03:40:17 +03:00
			goto update_start;
2006-04-27 10:46:01 +04:00
2021-03-07 00:38:38 +03:00
		if (!folio_try_get_rcu(folio))
2018-05-17 01:00:33 +03:00
			goto retry;
2016-07-27 01:26:04 +03:00
2021-03-07 00:38:38 +03:00
		if (unlikely(folio != xas_reload(&xas)))
2022-08-24 03:40:17 +03:00
			goto put_folio;
2008-07-26 06:45:31 +04:00
2022-08-24 03:40:17 +03:00
		if (!folio_batch_add(fbatch, folio)) {
			nr = folio_nr_pages(folio);
			if (folio_test_hugetlb(folio))
				nr = 1;
			*start = folio->index + nr;
			goto out;
2020-06-28 05:19:08 +03:00
		}
2018-05-17 01:00:33 +03:00
		continue;
2022-08-24 03:40:17 +03:00
put_folio:
2021-03-07 00:38:38 +03:00
		folio_put(folio);
2022-08-24 03:40:17 +03:00
2018-05-17 01:00:33 +03:00
retry:
		xas_reset(&xas);
2006-04-27 10:46:01 +04:00
	}
2022-08-24 03:40:17 +03:00
update_start:
	nr = folio_batch_count(fbatch);
	if (nr) {
		folio = fbatch->folios[nr - 1];
		if (folio_test_hugetlb(folio))
			*start = folio->index + 1;
		else
			*start = folio->index + folio_nr_pages(folio);
	}
out:
2008-07-26 06:45:31 +04:00
	rcu_read_unlock();
2022-08-24 03:40:17 +03:00
	return folio_batch_count(fbatch);
2006-04-27 10:46:01 +04:00
}
2022-08-24 03:40:17 +03:00
EXPORT_SYMBOL(filemap_get_folios_contig);
2006-04-27 10:46:01 +04:00
2006-06-23 13:03:49 +04:00
/**
2023-01-05 00:14:27 +03:00
 * filemap_get_folios_tag - Get a batch of folios matching @tag
 * @mapping: The address_space to search
 * @start: The starting page index
 * @end: The final page index (inclusive)
 * @tag: The tag index
 * @fbatch: The batch to fill
2006-06-23 13:03:49 +04:00
 *
2023-01-05 00:14:27 +03:00
 * Same as filemap_get_folios(), but only returning folios tagged with @tag.
2019-03-06 02:48:42 +03:00
 *
2023-01-05 00:14:27 +03:00
 * Return: The number of folios found.
 * Also update @start to index the next folio for traversal.
2005-04-17 02:20:36 +04:00
 */
2023-01-05 00:14:27 +03:00
unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
		pgoff_t end, xa_mark_t tag, struct folio_batch *fbatch)
2005-04-17 02:20:36 +04:00
{
2023-01-05 00:14:27 +03:00
	XA_STATE(xas, &mapping->i_pages, *start);
2020-12-17 08:12:26 +03:00
	struct folio *folio;
2008-07-26 06:45:31 +04:00
	rcu_read_lock();
2023-01-05 00:14:27 +03:00
	while ((folio = find_get_entry(&xas, end, tag)) != NULL) {
2018-05-17 01:12:54 +03:00
		/*
		 * Shadow entries should never be tagged, but this iteration
		 * is lockless so there is a window for page reclaim to evict
2023-01-05 00:14:27 +03:00
		 * a page we saw tagged.  Skip over it.
2018-05-17 01:12:54 +03:00
		 */
2020-12-17 08:12:26 +03:00
		if (xa_is_value(folio))
2014-05-06 23:50:05 +04:00
			continue;
2023-01-05 00:14:27 +03:00
		if (!folio_batch_add(fbatch, folio)) {
			unsigned long nr = folio_nr_pages(folio);
2008-07-26 06:45:31 +04:00
2023-01-05 00:14:27 +03:00
			if (folio_test_hugetlb(folio))
				nr = 1;
			*start = folio->index + nr;
2017-11-16 04:34:33 +03:00
			goto out;
		}
2008-07-26 06:45:31 +04:00
	}
2017-11-16 04:34:33 +03:00
	/*
2023-01-05 00:14:27 +03:00
	 * We come here when there is no page beyond @end.  We take care to not
	 * overflow the index @start as it confuses some of the callers.  This
	 * breaks the iteration when there is a page at index -1 but that is
	 * already broken anyway.
2017-11-16 04:34:33 +03:00
	 */
	if (end == (pgoff_t)-1)
2023-01-05 00:14:27 +03:00
		*start = (pgoff_t)-1;
2017-11-16 04:34:33 +03:00
	else
2023-01-05 00:14:27 +03:00
		*start = end + 1;
2017-11-16 04:34:33 +03:00
out:
2008-07-26 06:45:31 +04:00
	rcu_read_unlock();
2005-04-17 02:20:36 +04:00
2023-01-05 00:14:27 +03:00
	return folio_batch_count(fbatch);
2005-04-17 02:20:36 +04:00
}
2023-01-05 00:14:27 +03:00
EXPORT_SYMBOL(filemap_get_folios_tag);
2005-04-17 02:20:36 +04:00
[PATCH] readahead: backoff on I/O error
Backoff readahead size exponentially on I/O error.
Michael Tokarev <mjt@tls.msk.ru> described the problem as:
[QUOTE]
Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
In order to "fix" it, one have to read it and write to another CD-rom,
or something.. or just ignore the error (if it's just a skip in a video
stream). Let's assume the unreadable block is number U.
But current behavior is just insane. An application requests block
number N, which is before U. Kernel tries to read-ahead blocks N..U.
Cdrom drive tries to read it, re-read it.. for some time. Finally,
when all the N..U-1 blocks are read, kernel returns block number N
(as requested) to an application, successfully.
Now an app requests block number N+1, and kernel tries to read
blocks N+1..U+1. Retrying again as in previous step.
And so on, up to when an app requests block number U-1. And when,
finally, it requests block U, it receives read error.
So, the kernel currently tries to re-read the same failing block as
many times as the current readahead value (256 (times?) by default).
This whole process already killed my cdrom drive (I posted about it
to LKML several months ago) - literally, the drive has fried, and
does not work anymore. Of course that problem was a bug in firmware
(or whatever) of the drive *too*, but.. main problem with that is
current readahead logic as described above.
[/QUOTE]
Which was confirmed by Jens Axboe <axboe@suse.de>:
[QUOTE]
For ide-cd, it tends do only end the first part of the request on a
medium error. So you may see a lot of repeats :/
[/QUOTE]
With this patch, retries are expected to be reduced from, say, 256, to 5.
[akpm@osdl.org: cleanups]
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 16:48:43 +04:00
/*
 * CD/DVDs are error prone. When a medium error occurs, the driver may fail
 * a _large_ part of the i/o request. Imagine the worst scenario:
 *
 *      ---R__________________________________________B__________
 *         ^ reading here                             ^ bad block (assume 4k)
 *
 * read(R) => miss => readahead(R...B) => media error => frustrating retries
 * => failing the whole request => read(R) => read(R+1) =>
 * readahead(R+1...B+1) => bang => read(R+2) => read(R+3) =>
 * readahead(R+3...B+2) => bang => read(R+3) => read(R+4) =>
 * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ......
 *
 * It is going insane. Fix it by quickly scaling down the readahead size.
 */
2020-04-02 07:04:50 +03:00
static void shrink_readahead_size_eio(struct file_ra_state *ra)
2006-06-25 16:48:43 +04:00
{
	ra->ra_pages /= 4;
}
2021-02-24 23:01:59 +03:00
/*
2021-12-06 23:25:33 +03:00
 * filemap_get_read_batch - Get a batch of folios for read
2021-02-24 23:01:59 +03:00
 *
2021-12-06 23:25:33 +03:00
 * Get a batch of folios which represent a contiguous range of bytes in
 * the file.  No exceptional entries will be returned.  If @index is in
 * the middle of a folio, the entire folio will be returned.  The last
 * folio in the batch may have the readahead flag set or the uptodate flag
 * clear so that the caller can take the appropriate action.
2021-02-24 23:01:59 +03:00
 */
static void filemap_get_read_batch(struct address_space *mapping,
2021-12-06 23:25:33 +03:00
		pgoff_t index, pgoff_t max, struct folio_batch *fbatch)
2021-02-24 23:01:59 +03:00
{
	XA_STATE(xas, &mapping->i_pages, index);
2021-03-05 18:29:41 +03:00
	struct folio *folio;
2021-02-24 23:01:59 +03:00
	rcu_read_lock();
2021-03-05 18:29:41 +03:00
	for (folio = xas_load(&xas); folio; folio = xas_next(&xas)) {
		if (xas_retry(&xas, folio))
2021-02-24 23:01:59 +03:00
			continue;
2021-03-05 18:29:41 +03:00
		if (xas.xa_index > max || xa_is_value(folio))
2021-02-24 23:01:59 +03:00
			break;
2022-06-18 03:00:17 +03:00
		if (xa_is_sibling(folio))
			break;
2021-03-05 18:29:41 +03:00
		if (!folio_try_get_rcu(folio))
2021-02-24 23:01:59 +03:00
			goto retry;
2021-03-05 18:29:41 +03:00
		if (unlikely(folio != xas_reload(&xas)))
2021-12-06 23:25:33 +03:00
			goto put_folio;
2021-02-24 23:01:59 +03:00
2021-12-06 23:25:33 +03:00
		if (!folio_batch_add(fbatch, folio))
2021-02-24 23:01:59 +03:00
			break;
2021-03-05 18:29:41 +03:00
		if (!folio_test_uptodate(folio))
2021-02-24 23:01:59 +03:00
			break;
2021-03-05 18:29:41 +03:00
		if (folio_test_readahead(folio))
2021-02-24 23:01:59 +03:00
			break;
2020-06-28 05:19:08 +03:00
		xas_advance(&xas, folio->index + folio_nr_pages(folio) - 1);
2021-02-24 23:01:59 +03:00
		continue;
2021-12-06 23:25:33 +03:00
put_folio:
2021-03-05 18:29:41 +03:00
		folio_put(folio);
2021-02-24 23:01:59 +03:00
retry:
		xas_reset(&xas);
	}
	rcu_read_unlock();
}
2022-05-13 00:37:01 +03:00
static int filemap_read_folio(struct file *file, filler_t filler,
2021-03-10 18:19:30 +03:00
		struct folio *folio)
2020-12-15 06:04:52 +03:00
{
2022-09-15 12:41:56 +03:00
	bool workingset = folio_test_workingset(folio);
	unsigned long pflags;
2020-12-15 06:04:52 +03:00
	int error;
	/*
2021-02-24 23:02:15 +03:00
	 * A previous I/O error may have been due to temporary failures,
2022-04-29 18:53:28 +03:00
	 * eg. multipath errors.  PG_error will be set again if read_folio
2021-02-24 23:02:15 +03:00
	 * fails.
2020-12-15 06:04:52 +03:00
	 */
2021-03-10 18:19:30 +03:00
	folio_clear_error(folio);
2022-09-15 12:41:56 +03:00
2020-12-15 06:04:52 +03:00
	/* Start the actual read. The read will unlock the page. */
2022-09-15 12:41:56 +03:00
	if (unlikely(workingset))
		psi_memstall_enter(&pflags);
2022-05-13 00:37:01 +03:00
	error = filler(file, folio);
2022-09-15 12:41:56 +03:00
	if (unlikely(workingset))
		psi_memstall_leave(&pflags);
2021-02-24 23:02:15 +03:00
	if (error)
		return error;
2020-12-15 06:04:52 +03:00
2021-03-10 18:19:30 +03:00
	error = folio_wait_locked_killable(folio);
2021-02-24 23:02:15 +03:00
	if (error)
		return error;
2021-03-10 18:19:30 +03:00
	if (folio_test_uptodate(folio))
2021-02-24 23:02:38 +03:00
		return 0;
2022-05-13 00:37:01 +03:00
	if (file)
		shrink_readahead_size_eio(&file->f_ra);
2021-02-24 23:02:38 +03:00
	return -EIO;
2020-12-15 06:04:52 +03:00
}
2021-02-24 23:02:28 +03:00
static bool filemap_range_uptodate(struct address_space *mapping,
2023-02-08 21:18:17 +03:00
		loff_t pos, size_t count, struct folio *folio,
		bool need_uptodate)
2021-02-24 23:02:28 +03:00
{
2021-03-10 19:04:19 +03:00
	if (folio_test_uptodate(folio))
2021-02-24 23:02:28 +03:00
		return true;
	/* pipes can't handle partially uptodate pages */
2023-02-08 21:18:17 +03:00
	if (need_uptodate)
2021-02-24 23:02:28 +03:00
		return false;
	if (!mapping->a_ops->is_partially_uptodate)
		return false;
2021-03-10 19:04:19 +03:00
	if (mapping->host->i_blkbits >= folio_shift(folio))
2021-02-24 23:02:28 +03:00
		return false;
2021-03-10 19:04:19 +03:00
	if (folio_pos(folio) > pos) {
		count -= folio_pos(folio) - pos;
2021-02-24 23:02:28 +03:00
		pos = 0;
	} else {
2021-03-10 19:04:19 +03:00
		pos -= folio_pos(folio);
2021-02-24 23:02:28 +03:00
	}
2022-02-09 23:21:27 +03:00
	return mapping->a_ops->is_partially_uptodate(folio, pos, count);
2021-02-24 23:02:28 +03:00
}
2021-02-24 23:02:22 +03:00
static int filemap_update_page(struct kiocb *iocb,
2023-02-08 21:18:17 +03:00
		struct address_space *mapping, size_t count,
		struct folio *folio, bool need_uptodate)
2020-12-15 06:04:52 +03:00
{
	int error;
2021-01-28 21:19:45 +03:00
	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!filemap_invalidate_trylock_shared(mapping))
			return -EAGAIN;
	} else {
		filemap_invalidate_lock_shared(mapping);
	}
2020-12-31 01:58:40 +03:00
	if (!folio_trylock(folio)) {
2021-01-28 21:19:45 +03:00
		error = -EAGAIN;
2021-02-24 23:02:25 +03:00
		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
2021-01-28 21:19:45 +03:00
			goto unlock_mapping;
2021-02-24 23:02:25 +03:00
		if (!(iocb->ki_flags & IOCB_WAITQ)) {
2021-01-28 21:19:45 +03:00
			filemap_invalidate_unlock_shared(mapping);
2021-08-17 06:36:31 +03:00
			/*
			 * This is where we usually end up waiting for a
			 * previously submitted readahead to finish.
			 */
			folio_put_wait_locked(folio, TASK_KILLABLE);
2021-02-24 23:02:22 +03:00
			return AOP_TRUNCATED_PAGE;
2021-02-24 23:02:05 +03:00
		}
2020-12-31 01:58:40 +03:00
		error = __folio_lock_async(folio, iocb->ki_waitq);
2021-02-24 23:02:25 +03:00
		if (error)
2021-01-28 21:19:45 +03:00
			goto unlock_mapping;
2020-12-15 06:04:52 +03:00
	}
2021-01-28 21:19:45 +03:00
	error = AOP_TRUNCATED_PAGE;
2020-12-31 01:58:40 +03:00
	if (!folio->mapping)
2021-01-28 21:19:45 +03:00
		goto unlock;
2020-12-15 06:04:52 +03:00
2021-02-24 23:02:28 +03:00
	error = 0;
2023-02-08 21:18:17 +03:00
	if (filemap_range_uptodate(mapping, iocb->ki_pos, count, folio,
				   need_uptodate))
2021-02-24 23:02:28 +03:00
		goto unlock;
	error = -EAGAIN;
	if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT | IOCB_WAITQ))
		goto unlock;
2022-05-13 00:37:01 +03:00
	error = filemap_read_folio(iocb->ki_filp, mapping->a_ops->read_folio,
			folio);
2021-01-28 21:19:45 +03:00
	goto unlock_mapping;
2021-02-24 23:02:28 +03:00
unlock:
2020-12-31 01:58:40 +03:00
	folio_unlock(folio);
2021-01-28 21:19:45 +03:00
unlock_mapping:
	filemap_invalidate_unlock_shared(mapping);
	if (error == AOP_TRUNCATED_PAGE)
2020-12-31 01:58:40 +03:00
		folio_put(folio);
2021-02-24 23:02:28 +03:00
	return error;
2020-12-15 06:04:52 +03:00
}
2021-03-10 18:34:00 +03:00
static int filemap_create_folio(struct file *file,
2021-02-24 23:02:18 +03:00
		struct address_space *mapping, pgoff_t index,
2021-12-06 23:25:33 +03:00
		struct folio_batch *fbatch)
2020-12-15 06:04:52 +03:00
{
2021-03-10 18:34:00 +03:00
	struct folio *folio;
2020-12-15 06:04:52 +03:00
	int error;
2021-03-10 18:34:00 +03:00
	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
	if (!folio)
2021-02-24 23:02:18 +03:00
		return -ENOMEM;
2020-12-15 06:04:52 +03:00
2021-01-28 21:19:45 +03:00
	/*
2021-03-10 18:34:00 +03:00
	 * Protect against truncate / hole punch.  Grabbing invalidate_lock
	 * here assures we cannot instantiate and bring uptodate new
	 * pagecache folios after evicting page cache during truncate
	 * and before actually freeing blocks.  Note that we could
	 * release invalidate_lock after inserting the folio into
	 * the page cache as the locked folio would then be enough to
	 * synchronize with hole punching.  But there are code paths
	 * such as filemap_update_page() filling in partially uptodate
2022-03-24 04:29:04 +03:00
	 * pages or ->readahead() that need to hold invalidate_lock
2021-03-10 18:34:00 +03:00
	 * while mapping blocks for IO so let's hold the lock here as
	 * well to keep locking rules simple.
2021-01-28 21:19:45 +03:00
	 */
	filemap_invalidate_lock_shared(mapping);
2021-03-10 18:34:00 +03:00
	error = filemap_add_folio(mapping, folio, index,
2021-02-24 23:02:18 +03:00
			mapping_gfp_constraint(mapping, GFP_KERNEL));
	if (error == -EEXIST)
		error = AOP_TRUNCATED_PAGE;
	if (error)
		goto error;
2022-05-13 00:37:01 +03:00
	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
2021-02-24 23:02:18 +03:00
	if (error)
		goto error;
2021-01-28 21:19:45 +03:00
	filemap_invalidate_unlock_shared(mapping);
2021-12-06 23:25:33 +03:00
	folio_batch_add(fbatch, folio);
2021-02-24 23:02:18 +03:00
	return 0;
error:
2021-01-28 21:19:45 +03:00
	filemap_invalidate_unlock_shared(mapping);
2021-03-10 18:34:00 +03:00
	folio_put(folio);
2021-02-24 23:02:18 +03:00
	return error;
2020-12-15 06:04:52 +03:00
}
2021-02-24 23:02:32 +03:00
static int filemap_readahead(struct kiocb *iocb, struct file *file,
2021-03-10 22:01:22 +03:00
		struct address_space *mapping, struct folio *folio,
2021-02-24 23:02:32 +03:00
		pgoff_t last_index)
{
2021-03-10 22:01:22 +03:00
	DEFINE_READAHEAD(ractl, file, &file->f_ra, mapping, folio->index);
2021-02-24 23:02:32 +03:00
	if (iocb->ki_flags & IOCB_NOIO)
		return -EAGAIN;
2021-03-10 22:01:22 +03:00
	page_cache_async_ra(&ractl, folio, last_index - folio->index);
2021-02-24 23:02:32 +03:00
	return 0;
}

static int filemap_get_pages(struct kiocb *iocb, size_t count,
		struct folio_batch *fbatch, bool need_uptodate)
{
	struct file *filp = iocb->ki_filp;
	struct address_space *mapping = filp->f_mapping;
	struct file_ra_state *ra = &filp->f_ra;
	pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
	pgoff_t last_index;
	struct folio *folio;
	int err = 0;

	/* "last_index" is the index of the page beyond the end of the read */
	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
retry:
	if (fatal_signal_pending(current))
		return -EINTR;

	filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
	if (!folio_batch_count(fbatch)) {
		if (iocb->ki_flags & IOCB_NOIO)
			return -EAGAIN;
		page_cache_sync_readahead(mapping, ra, filp, index,
				last_index - index);
		filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
	}
	if (!folio_batch_count(fbatch)) {
		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
			return -EAGAIN;
		err = filemap_create_folio(filp, mapping,
				iocb->ki_pos >> PAGE_SHIFT, fbatch);
		if (err == AOP_TRUNCATED_PAGE)
			goto retry;
		return err;
	}

	folio = fbatch->folios[folio_batch_count(fbatch) - 1];
	if (folio_test_readahead(folio)) {
		err = filemap_readahead(iocb, filp, mapping, folio, last_index);
		if (err)
			goto err;
	}
	if (!folio_test_uptodate(folio)) {
		if ((iocb->ki_flags & IOCB_WAITQ) &&
		    folio_batch_count(fbatch) > 1)
			iocb->ki_flags |= IOCB_NOWAIT;
		err = filemap_update_page(iocb, mapping, count, folio,
					  need_uptodate);
		if (err)
			goto err;
	}

	return 0;
err:
	if (err < 0)
		folio_put(folio);
	if (likely(--fbatch->nr))
		return 0;
	if (err == AOP_TRUNCATED_PAGE)
		goto retry;
	return err;
}

static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio)
{
	unsigned int shift = folio_shift(folio);

	return (pos1 >> shift == pos2 >> shift);
}

/**
 * filemap_read - Read data from the page cache.
 * @iocb: The iocb to read.
 * @iter: Destination for the data.
 * @already_read: Number of bytes already read by the caller.
 *
 * Copies data from the page cache.  If the data is not currently present,
 * uses the readahead and read_folio address_space operations to fetch it.
 *
 * Return: Total number of bytes copied, including those already read by
 * the caller.  If an error happens before any bytes are copied, returns
 * a negative error number.
 */
ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
		ssize_t already_read)
{
	struct file *filp = iocb->ki_filp;
	struct file_ra_state *ra = &filp->f_ra;
	struct address_space *mapping = filp->f_mapping;
	struct inode *inode = mapping->host;
	struct folio_batch fbatch;
	int i, error = 0;
	bool writably_mapped;
	loff_t isize, end_offset;

	if (unlikely(iocb->ki_pos >= inode->i_sb->s_maxbytes))
		return 0;
	if (unlikely(!iov_iter_count(iter)))
		return 0;

	iov_iter_truncate(iter, inode->i_sb->s_maxbytes);
	folio_batch_init(&fbatch);
	do {
		cond_resched();

		/*
		 * If we've already successfully copied some data, then we
		 * can no longer safely return -EIOCBQUEUED. Hence mark
		 * an async read NOWAIT at that point.
		 */
		if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
			iocb->ki_flags |= IOCB_NOWAIT;

		if (unlikely(iocb->ki_pos >= i_size_read(inode)))
			break;

		error = filemap_get_pages(iocb, iter->count, &fbatch, false);
		if (error < 0)
			break;

		/*
		 * i_size must be checked after we know the pages are Uptodate.
		 *
		 * Checking i_size after the check allows us to calculate
		 * the correct value for "nr", which means the zero-filled
		 * part of the page is not copied back to userspace (unless
		 * another truncate extends the file - this is desired though).
		 */
		isize = i_size_read(inode);
		if (unlikely(iocb->ki_pos >= isize))
			goto put_folios;
		end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);

		/*
		 * Once we start copying data, we don't want to be touching any
		 * cachelines that might be contended:
		 */
		writably_mapped = mapping_writably_mapped(mapping);

		/*
		 * When a read accesses the same folio several times, only
		 * mark it as accessed the first time.
		 */
		if (!pos_same_folio(iocb->ki_pos, ra->prev_pos - 1,
				    fbatch.folios[0]))
			folio_mark_accessed(fbatch.folios[0]);

		for (i = 0; i < folio_batch_count(&fbatch); i++) {
			struct folio *folio = fbatch.folios[i];
			size_t fsize = folio_size(folio);
			size_t offset = iocb->ki_pos & (fsize - 1);
			size_t bytes = min_t(loff_t, end_offset - iocb->ki_pos,
					     fsize - offset);
			size_t copied;

			if (end_offset < folio_pos(folio))
				break;
			if (i > 0)
				folio_mark_accessed(folio);
			/*
			 * If users can be writing to this folio using arbitrary
			 * virtual addresses, take care of potential aliasing
			 * before reading the folio on the kernel side.
			 */
			if (writably_mapped)
				flush_dcache_folio(folio);

			copied = copy_folio_to_iter(folio, offset, bytes, iter);

			already_read += copied;
			iocb->ki_pos += copied;
			ra->prev_pos = iocb->ki_pos;

			if (copied < bytes) {
				error = -EFAULT;
				break;
			}
		}
put_folios:
		for (i = 0; i < folio_batch_count(&fbatch); i++)
			folio_put(fbatch.folios[i]);
		folio_batch_init(&fbatch);
	} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);

	file_accessed(filp);

	return already_read ? already_read : error;
}
EXPORT_SYMBOL_GPL(filemap_read);

int kiocb_write_and_wait(struct kiocb *iocb, size_t count)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	loff_t pos = iocb->ki_pos;
	loff_t end = pos + count - 1;

	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (filemap_range_needs_writeback(mapping, pos, end))
			return -EAGAIN;
		return 0;
	}

	return filemap_write_and_wait_range(mapping, pos, end);
}

int kiocb_invalidate_pages(struct kiocb *iocb, size_t count)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	loff_t pos = iocb->ki_pos;
	loff_t end = pos + count - 1;
	int ret;

	if (iocb->ki_flags & IOCB_NOWAIT) {
		/* we could block if there are any pages in the range */
		if (filemap_range_has_page(mapping, pos, end))
			return -EAGAIN;
	} else {
		ret = filemap_write_and_wait_range(mapping, pos, end);
		if (ret)
			return ret;
	}

	/*
	 * After a write we want buffered reads to be sure to go to disk to get
	 * the new data.  We invalidate clean cached page from the region we're
	 * about to write.  We do this *before* the write so that we can return
	 * without clobbering -EIOCBQUEUED from ->direct_IO().
	 */
	return invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
					     end >> PAGE_SHIFT);
}

/**
 * generic_file_read_iter - generic filesystem read routine
 * @iocb:	kernel I/O control block
 * @iter:	destination for the data read
 *
 * This is the "read_iter()" routine for all filesystems
 * that can use the page cache directly.
 *
 * The IOCB_NOWAIT flag in iocb->ki_flags indicates that -EAGAIN shall
 * be returned when no data can be read without waiting for I/O requests
 * to complete; it doesn't prevent readahead.
 *
 * The IOCB_NOIO flag in iocb->ki_flags indicates that no new I/O
 * requests shall be made for the read or for readahead.  When no data
 * can be read, -EAGAIN shall be returned.  When readahead would be
 * triggered, a partial, possibly empty read shall be returned.
 *
 * Return:
 * * number of bytes copied, even for partial reads
 * * negative error code (or 0 if IOCB_NOIO) if nothing was read
 */
ssize_t
generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
	size_t count = iov_iter_count(iter);
	ssize_t retval = 0;

	if (!count)
		return 0; /* skip atime */

	if (iocb->ki_flags & IOCB_DIRECT) {
		struct file *file = iocb->ki_filp;
		struct address_space *mapping = file->f_mapping;
		struct inode *inode = mapping->host;

		retval = kiocb_write_and_wait(iocb, count);
		if (retval < 0)
			return retval;
		file_accessed(file);

		retval = mapping->a_ops->direct_IO(iocb, iter);
		if (retval >= 0) {
			iocb->ki_pos += retval;
			count -= retval;
		}
		if (retval != -EIOCBQUEUED)
			iov_iter_revert(iter, count - iov_iter_count(iter));

		/*
		 * Btrfs can have a short DIO read if we encounter
		 * compressed extents, so if there was an error, or if
		 * we've already read everything we wanted to, or if
		 * there was a short read because we hit EOF, go ahead
		 * and return.  Otherwise fallthrough to buffered io for
		 * the rest of the read.  Buffered reads will not work for
		 * DAX files, so don't bother trying.
		 */
		if (retval < 0 || !count || IS_DAX(inode))
			return retval;
		if (iocb->ki_pos >= i_size_read(inode))
			return retval;
	}

	return filemap_read(iocb, iter, retval);
}
EXPORT_SYMBOL(generic_file_read_iter);

/*
 * Splice subpages from a folio into a pipe.
 */
size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
			      struct folio *folio, loff_t fpos, size_t size)
{
	struct page *page;
	size_t spliced = 0, offset = offset_in_folio(folio, fpos);

	page = folio_page(folio, offset / PAGE_SIZE);
	size = min(size, folio_size(folio) - offset);
	offset %= PAGE_SIZE;

	while (spliced < size &&
	       !pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
		struct pipe_buffer *buf = pipe_head_buf(pipe);
		size_t part = min_t(size_t, PAGE_SIZE - offset, size - spliced);

		*buf = (struct pipe_buffer) {
			.ops	= &page_cache_pipe_buf_ops,
			.page	= page,
			.offset	= offset,
			.len	= part,
		};
		folio_get(folio);
		pipe->head++;
		page++;
		spliced += part;
		offset = 0;
	}

	return spliced;
}
2023-05-22 16:50:18 +03:00
/**
* filemap_splice_read - Splice data from a file ' s pagecache into a pipe
* @ in : The file to read from
* @ ppos : Pointer to the file position to read from
* @ pipe : The pipe to splice into
* @ len : The amount to splice
* @ flags : The SPLICE_F_ * flags
*
* This function gets folios from a file ' s pagecache and splices them into the
* pipe . Readahead will be called as necessary to fill more folios . This may
* be used for blockdevs also .
*
* Return : On success , the number of bytes read will be returned and * @ ppos
 * will be updated if appropriate; 0 will be returned if there is no more data
 * to be read; -EAGAIN will be returned if the pipe had no space, and some
 * other negative error code will be returned on error.  A short read may occur
 * if the pipe has insufficient space, we reach the end of the data or we hit a
 * hole.
 */
ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
			    struct pipe_inode_info *pipe,
			    size_t len, unsigned int flags)
{
	struct folio_batch fbatch;
	struct kiocb iocb;
	size_t total_spliced = 0, used, npages;
	loff_t isize, end_offset;
	bool writably_mapped;
	int i, error = 0;

	if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes))
		return 0;

	init_sync_kiocb(&iocb, in);
	iocb.ki_pos = *ppos;

	/* Work out how much data we can actually add into the pipe */
	used = pipe_occupancy(pipe->head, pipe->tail);
	npages = max_t(ssize_t, pipe->max_usage - used, 0);
	len = min_t(size_t, len, npages * PAGE_SIZE);

	folio_batch_init(&fbatch);

	do {
		cond_resched();

		if (*ppos >= i_size_read(in->f_mapping->host))
			break;

		iocb.ki_pos = *ppos;
		error = filemap_get_pages(&iocb, len, &fbatch, true);
		if (error < 0)
			break;

		/*
		 * i_size must be checked after we know the pages are Uptodate.
		 *
		 * Checking i_size after the check allows us to calculate
		 * the correct value for "nr", which means the zero-filled
		 * part of the page is not copied back to userspace (unless
		 * another truncate extends the file - this is desired though).
		 */
		isize = i_size_read(in->f_mapping->host);
		if (unlikely(*ppos >= isize))
			break;
		end_offset = min_t(loff_t, isize, *ppos + len);

		/*
		 * Once we start copying data, we don't want to be touching any
		 * cachelines that might be contended:
		 */
		writably_mapped = mapping_writably_mapped(in->f_mapping);

		for (i = 0; i < folio_batch_count(&fbatch); i++) {
			struct folio *folio = fbatch.folios[i];
			size_t n;

			if (folio_pos(folio) >= end_offset)
				goto out;

			folio_mark_accessed(folio);

			/*
			 * If users can be writing to this folio using arbitrary
			 * virtual addresses, take care of potential aliasing
			 * before reading the folio on the kernel side.
			 */
			if (writably_mapped)
				flush_dcache_folio(folio);

			n = min_t(loff_t, len, isize - *ppos);
			n = splice_folio_into_pipe(pipe, folio, *ppos, n);
			if (!n)
				goto out;
			len -= n;
			total_spliced += n;
			*ppos += n;
			in->f_ra.prev_pos = *ppos;

			if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
				goto out;
		}

		folio_batch_release(&fbatch);
	} while (len);

out:
	folio_batch_release(&fbatch);
	file_accessed(in);

	return total_spliced ? total_spliced : error;
}
EXPORT_SYMBOL(filemap_splice_read);
static inline loff_t folio_seek_hole_data(struct xa_state *xas,
		struct address_space *mapping, struct folio *folio,
		loff_t start, loff_t end, bool seek_data)
{
	const struct address_space_operations *ops = mapping->a_ops;
	size_t offset, bsz = i_blocksize(mapping->host);

	if (xa_is_value(folio) || folio_test_uptodate(folio))
		return seek_data ? start : end;
	if (!ops->is_partially_uptodate)
		return seek_data ? end : start;

	xas_pause(xas);
	rcu_read_unlock();
	folio_lock(folio);
	if (unlikely(folio->mapping != mapping))
		goto unlock;

	offset = offset_in_folio(folio, start) & ~(bsz - 1);

	do {
		if (ops->is_partially_uptodate(folio, offset, bsz) ==
							seek_data)
			break;
		start = (start + bsz) & ~(bsz - 1);
		offset += bsz;
	} while (offset < folio_size(folio));
unlock:
	folio_unlock(folio);
	rcu_read_lock();
	return start;
}
static inline size_t seek_folio_size(struct xa_state *xas, struct folio *folio)
{
	if (xa_is_value(folio))
		return PAGE_SIZE << xa_get_order(xas->xa, xas->xa_index);
	return folio_size(folio);
}
/**
 * mapping_seek_hole_data - Seek for SEEK_DATA / SEEK_HOLE in the page cache.
 * @mapping: Address space to search.
 * @start: First byte to consider.
 * @end: Limit of search (exclusive).
 * @whence: Either SEEK_HOLE or SEEK_DATA.
 *
 * If the page cache knows which blocks contain holes and which blocks
 * contain data, your filesystem can use this function to implement
 * SEEK_HOLE and SEEK_DATA.  This is useful for filesystems which are
 * entirely memory-based such as tmpfs, and filesystems which support
 * unwritten extents.
 *
 * Return: The requested offset on success, or -ENXIO if @whence specifies
 * SEEK_DATA and there is no data after @start.  There is an implicit hole
 * after @end - 1, so SEEK_HOLE returns @end if all the bytes between @start
 * and @end contain data.
 */
loff_t mapping_seek_hole_data(struct address_space *mapping, loff_t start,
		loff_t end, int whence)
{
	XA_STATE(xas, &mapping->i_pages, start >> PAGE_SHIFT);
	pgoff_t max = (end - 1) >> PAGE_SHIFT;
	bool seek_data = (whence == SEEK_DATA);
	struct folio *folio;

	if (end <= start)
		return -ENXIO;

	rcu_read_lock();
	while ((folio = find_get_entry(&xas, max, XA_PRESENT))) {
		loff_t pos = (u64)xas.xa_index << PAGE_SHIFT;
		size_t seek_size;

		if (start < pos) {
			if (!seek_data)
				goto unlock;
			start = pos;
		}

		seek_size = seek_folio_size(&xas, folio);
		pos = round_up((u64)pos + 1, seek_size);
		start = folio_seek_hole_data(&xas, mapping, folio, start, pos,
				seek_data);
		if (start < pos)
			goto unlock;
		if (start >= end)
			break;
		if (seek_size > PAGE_SIZE)
			xas_set(&xas, pos >> PAGE_SHIFT);
		if (!xa_is_value(folio))
			folio_put(folio);
	}
	if (seek_data)
		start = -ENXIO;
unlock:
	rcu_read_unlock();
	if (folio && !xa_is_value(folio))
		folio_put(folio);
	if (start > end)
		return end;
	return start;
}
#ifdef CONFIG_MMU

#define MMAP_LOTSAMISS  (100)
/*
2021-03-10 18:46:41 +03:00
* lock_folio_maybe_drop_mmap - lock the page , possibly dropping the mmap_lock
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-13 21:44:22 +03:00
* @ vmf - the vm_fault for this fault .
2021-03-10 18:46:41 +03:00
* @ folio - the folio to lock .
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-13 21:44:22 +03:00
* @ fpin - the pointer to the file we may pin ( or is already pinned ) .
*
2021-03-10 18:46:41 +03:00
* This works similar to lock_folio_or_retry in that it can drop the
* mmap_lock . It differs in that it actually returns the folio locked
* if it returns 1 and 0 if it couldn ' t lock the folio . If we did have
* to drop the mmap_lock then fpin will point to the pinned file and
* needs to be fput ( ) ' ed at a later point .
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-13 21:44:22 +03:00
*/
2021-03-10 18:46:41 +03:00
static int lock_folio_maybe_drop_mmap ( struct vm_fault * vmf , struct folio * folio ,
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-13 21:44:22 +03:00
struct file * * fpin )
{
2021-03-02 03:38:25 +03:00
if ( folio_trylock ( folio ) )
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
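The drop-the-lock-then-retry protocol this commit message describes can be sketched in plain userspace C. Everything here (`fake_mm`, `fault_once`, `handle_fault`, the pthread mutex standing in for mmap_sem) is a hypothetical stand-in, not a kernel API: one pass of the handler releases the outer lock before doing anything that blocks, reports RETRY, and the caller comes back around to find the now-cached page.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model of the fault-retry protocol.
 * None of these names are kernel APIs. */
enum fault_result { FAULT_DONE, FAULT_RETRY };

struct fake_mm {
	pthread_mutex_t mmap_lock;	/* stands in for mmap_sem */
	bool page_uptodate;		/* is the "page" cached yet? */
};

/* One pass of the fault handler.  If the page is not ready, drop the
 * mmap lock, do the slow "IO" without holding it, and ask the caller
 * to retry -- the analogue of returning VM_FAULT_RETRY. */
static enum fault_result fault_once(struct fake_mm *mm)
{
	pthread_mutex_lock(&mm->mmap_lock);
	if (mm->page_uptodate) {
		pthread_mutex_unlock(&mm->mmap_lock);
		return FAULT_DONE;
	}
	/* Drop the lock before blocking, as the patch does around
	 * readahead, the page lock, and ->readpage. */
	pthread_mutex_unlock(&mm->mmap_lock);
	mm->page_uptodate = true;	/* the blocking "IO" happens here */
	return FAULT_RETRY;
}

/* The arch fault path simply loops until the handler reports done. */
static int handle_fault(struct fake_mm *mm)
{
	int retries = 0;

	while (fault_once(mm) == FAULT_RETRY)
		retries++;
	return retries;		/* number of RETRY round trips */
}
```

Under this toy model, a cold "page" costs exactly one retry round trip, and a second fault on the same page completes on the fast path without ever blocking under the lock.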
		return 1;

	/*
	 * NOTE! This will make us return with VM_FAULT_RETRY, but with
	 * the mmap_lock still held. That's how FAULT_FLAG_RETRY_NOWAIT
	 * is supposed to work. We have way too many special cases..
	 */
	if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
		return 0;

	*fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
	if (vmf->flags & FAULT_FLAG_KILLABLE) {
		if (__folio_lock_killable(folio)) {
			/*
			 * We didn't have the right flags to drop the mmap_lock,
			 * but all fault_handlers only check for fatal signals
			 * if we return VM_FAULT_RETRY, so we need to drop the
			 * mmap_lock here and return 0 if we don't have a fpin.
			 */
			if (*fpin == NULL)
				mmap_read_unlock(vmf->vma->vm_mm);
			return 0;
		}
	} else
		__folio_lock(folio);
	return 1;
}
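The helper above takes the page lock opportunistically (trylock) and only gives up the mmap lock when it would actually have to block. A minimal userspace analogue of that shape, assuming POSIX threads — the function name and the `outer_held` flag are illustrative, not kernel APIs:

```c
#include <pthread.h>

/* Hypothetical analogue of lock_folio_maybe_drop_mmap(): try the
 * inner (page) lock first; only on failure release the outer (mmap)
 * lock before taking the inner lock for real, so we never block on
 * the inner lock while holding the outer one. */
static int lock_inner_maybe_drop_outer(pthread_mutex_t *outer,
				       pthread_mutex_t *inner,
				       int *outer_held)
{
	if (pthread_mutex_trylock(inner) == 0)
		return 1;		/* fast path: no blocking needed */

	pthread_mutex_unlock(outer);	/* drop outer before we may sleep */
	*outer_held = 0;
	pthread_mutex_lock(inner);	/* may block; outer is not held */
	return 1;
}
```

The design point mirrors the kernel helper: the expensive wait happens only after the outer lock has been released, so an unrelated reader of the outer lock is never stuck behind this thread's IO.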
/*
 * Synchronous readahead happens when we don't even find a page in the page
 * cache at all.  We don't want to perform IO under the mmap sem, so if we have
 * to drop the mmap sem we return the file that was pinned in order for us to do
 * that.  If we didn't pin a file then we return NULL.  The file that is
 * returned needs to be fput()'ed when we're done with it.
 */
static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
{
	struct file *file = vmf->vma->vm_file;
	struct file_ra_state *ra = &file->f_ra;
	struct address_space *mapping = file->f_mapping;
	DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
struct file * fpin = NULL ;
2022-05-25 21:23:45 +03:00
unsigned long vm_flags = vmf - > vma - > vm_flags ;
2020-08-15 03:31:27 +03:00
unsigned int mmap_miss ;
2009-06-17 02:31:25 +04:00
2021-07-25 06:37:13 +03:00
# ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* Use the readahead code, even if readahead is disabled */
2022-05-25 21:23:45 +03:00
if ( vm_flags & VM_HUGEPAGE ) {
2021-07-25 06:37:13 +03:00
		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
		ra->size = HPAGE_PMD_NR;
		/*
		 * Fetch two PMD folios, so we get the chance to actually
		 * readahead, unless we've been told not to.
		 */
2022-05-25 21:23:45 +03:00
		if (!(vm_flags & VM_RAND_READ))
2021-07-25 06:37:13 +03:00
			ra->size *= 2;
		ra->async_size = HPAGE_PMD_NR;
		page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
		return fpin;
	}
#endif
2009-06-17 02:31:25 +04:00
	/* If we don't want any read-ahead, don't bother */
2022-05-25 21:23:45 +03:00
	if (vm_flags & VM_RAND_READ)
2019-03-13 21:44:22 +03:00
		return fpin;
2011-05-25 04:12:28 +04:00
	if (!ra->ra_pages)
2019-03-13 21:44:22 +03:00
		return fpin;
2009-06-17 02:31:25 +04:00
2022-05-25 21:23:45 +03:00
	if (vm_flags & VM_SEQ_READ) {
2019-03-13 21:44:22 +03:00
		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
2021-04-07 23:18:55 +03:00
		page_cache_sync_ra(&ractl, ra->ra_pages);
2019-03-13 21:44:22 +03:00
		return fpin;
2009-06-17 02:31:25 +04:00
	}
2011-05-25 04:12:29 +04:00
	/* Avoid banging the cache line if not needed */
2020-08-15 03:31:27 +03:00
	mmap_miss = READ_ONCE(ra->mmap_miss);
	if (mmap_miss < MMAP_LOTSAMISS * 10)
		WRITE_ONCE(ra->mmap_miss, ++mmap_miss);
2009-06-17 02:31:25 +04:00
	/*
	 * Do we miss much more than hit in this file? If so,
	 * stop bothering with read-ahead. It will only hurt.
	 */
2020-08-15 03:31:27 +03:00
	if (mmap_miss > MMAP_LOTSAMISS)
2019-03-13 21:44:22 +03:00
		return fpin;
2009-06-17 02:31:25 +04:00
2009-06-17 02:31:30 +04:00
	/*
	 * mmap read-around
	 */
2019-03-13 21:44:22 +03:00
	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
2020-10-16 06:06:31 +03:00
	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
2015-11-06 05:47:08 +03:00
	ra->size = ra->ra_pages;
	ra->async_size = ra->ra_pages / 4;
2020-10-16 06:06:31 +03:00
	ractl._index = ra->start;
2021-07-25 06:26:14 +03:00
	page_cache_ra_order(&ractl, ra, 0);
2019-03-13 21:44:22 +03:00
	return fpin;
2009-06-17 02:31:25 +04:00
}
/*
 * Asynchronous readahead happens when we find the page and PG_readahead,
2019-03-13 21:44:22 +03:00
 * so we want to possibly extend the readahead further. We return the file that
2020-06-09 07:33:54 +03:00
 * was pinned if we have to drop the mmap_lock in order to do IO.
2009-06-17 02:31:25 +04:00
 */
2019-03-13 21:44:22 +03:00
static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
2021-07-29 21:57:01 +03:00
					    struct folio *folio)
2009-06-17 02:31:25 +04:00
{
2019-03-13 21:44:18 +03:00
	struct file *file = vmf->vma->vm_file;
	struct file_ra_state *ra = &file->f_ra;
2021-07-29 21:57:01 +03:00
	DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff);
2019-03-13 21:44:22 +03:00
	struct file *fpin = NULL;
2020-08-15 03:31:27 +03:00
	unsigned int mmap_miss;
2009-06-17 02:31:25 +04:00
	/* If we don't want any read-ahead, don't bother */
2020-04-02 07:04:40 +03:00
	if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
2019-03-13 21:44:22 +03:00
		return fpin;
2021-07-29 21:57:01 +03:00
2020-08-15 03:31:27 +03:00
	mmap_miss = READ_ONCE(ra->mmap_miss);
	if (mmap_miss)
		WRITE_ONCE(ra->mmap_miss, --mmap_miss);
2021-07-29 21:57:01 +03:00
	if (folio_test_readahead(folio)) {
2019-03-13 21:44:22 +03:00
		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
2021-07-29 21:57:01 +03:00
		page_cache_async_ra(&ractl, folio, ra->ra_pages);
2019-03-13 21:44:22 +03:00
	}
	return fpin;
2009-06-17 02:31:25 +04:00
}
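The `mmap_miss` handling in do_async_mmap_readahead() above is a deliberately racy counter: it is read and written with READ_ONCE()/WRITE_ONCE() because an occasionally lost decrement is harmless. A hypothetical userspace sketch of the same bookkeeping using C11 relaxed atomics (the `demo_*` names are illustrative, not kernel API):

```c
/* Hypothetical userspace sketch of the ra->mmap_miss bookkeeping seen in
 * do_async_mmap_readahead(): a racy-but-harmless counter accessed with
 * READ_ONCE()/WRITE_ONCE()-style relaxed atomics.  Cache hits decay the
 * miss count toward zero; misses bump it.  Names are illustrative. */
#include <stdatomic.h>

struct demo_ra_state {
	atomic_uint mmap_miss;
};

/* Called on a cache hit: decay the miss counter, never below zero. */
static void demo_note_hit(struct demo_ra_state *ra)
{
	unsigned int mmap_miss =
		atomic_load_explicit(&ra->mmap_miss, memory_order_relaxed);

	if (mmap_miss)		/* guard against underflow */
		atomic_store_explicit(&ra->mmap_miss, mmap_miss - 1,
				      memory_order_relaxed);
}

/* Called on a cache miss: bump the counter. */
static void demo_note_miss(struct demo_ra_state *ra)
{
	unsigned int mmap_miss =
		atomic_load_explicit(&ra->mmap_miss, memory_order_relaxed);

	atomic_store_explicit(&ra->mmap_miss, mmap_miss + 1,
			      memory_order_relaxed);
}
```

The load-test-store sequence is not atomic as a whole; as in the kernel code, the counter is a heuristic, so concurrent updates may lose a step without affecting correctness.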
2006-06-23 13:03:49 +04:00
/**
2007-07-19 12:46:59 +04:00
 * filemap_fault - read in file data for page fault handling
2007-07-19 12:47:03 +04:00
 * @vmf: struct vm_fault containing details of the fault
2006-06-23 13:03:49 +04:00
 *
2007-07-19 12:46:59 +04:00
 * filemap_fault() is invoked via the vma operations vector for a
2005-04-17 02:20:36 +04:00
 * mapped memory region to read in file data during a page fault.
 *
 * The goto's are kind of ugly, but this streamlines the normal case of having
 * it in the page cache, and handles the special cases reasonably without
 * having a lot of duplicated code.
2014-08-07 03:07:24 +04:00
 *
2020-06-09 07:33:54 +03:00
 * vma->vm_mm->mmap_lock must be held on entry.
2014-08-07 03:07:24 +04:00
 *
2020-06-09 07:33:54 +03:00
 * If our return value has VM_FAULT_RETRY set, it's because the mmap_lock
2021-03-10 18:46:41 +03:00
 * may be dropped before doing I/O or by lock_folio_maybe_drop_mmap().
2014-08-07 03:07:24 +04:00
 *
2020-06-09 07:33:54 +03:00
 * If our return value does not have VM_FAULT_RETRY set, the mmap_lock
2014-08-07 03:07:24 +04:00
 * has not been released.
 *
 * We never return with VM_FAULT_RETRY and a bit from VM_FAULT_ERROR set.
2019-03-06 02:48:42 +03:00
 *
 * Return: bitwise-OR of %VM_FAULT_ codes.
2005-04-17 02:20:36 +04:00
 */
2018-06-08 03:08:00 +03:00
vm_fault_t filemap_fault(struct vm_fault *vmf)
2005-04-17 02:20:36 +04:00
{
	int error;
2017-02-25 01:56:41 +03:00
	struct file *file = vmf->vma->vm_file;
2019-03-13 21:44:22 +03:00
	struct file *fpin = NULL;
2005-04-17 02:20:36 +04:00
	struct address_space *mapping = file->f_mapping;
	struct inode *inode = mapping->host;
2021-03-10 18:46:41 +03:00
	pgoff_t max_idx, index = vmf->pgoff;
	struct folio *folio;
2018-06-08 03:08:00 +03:00
	vm_fault_t ret = 0;
2021-01-28 21:19:45 +03:00
	bool mapping_locked = false;
2005-04-17 02:20:36 +04:00
2021-03-10 18:46:41 +03:00
	max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
	if (unlikely(index >= max_idx))
2007-10-31 19:19:46 +03:00
		return VM_FAULT_SIGBUS;
2005-04-17 02:20:36 +04:00
	/*
2013-10-17 00:46:59 +04:00
	 * Do we have something in the page cache already?
2005-04-17 02:20:36 +04:00
	 */
2021-03-10 18:46:41 +03:00
	folio = filemap_get_folio(mapping, index);
2023-03-07 17:34:10 +03:00
	if (likely(!IS_ERR(folio))) {
2005-04-17 02:20:36 +04:00
		/*
2021-01-28 21:19:45 +03:00
		 * We found the page, so try async readahead before waiting for
		 * the lock.
2005-04-17 02:20:36 +04:00
		 */
2021-01-28 21:19:45 +03:00
		if (!(vmf->flags & FAULT_FLAG_TRIED))
2021-07-29 21:57:01 +03:00
			fpin = do_async_mmap_readahead(vmf, folio);
2021-03-10 18:46:41 +03:00
		if (unlikely(!folio_test_uptodate(folio))) {
2021-01-28 21:19:45 +03:00
			filemap_invalidate_lock_shared(mapping);
			mapping_locked = true;
		}
	} else {
2009-06-17 02:31:25 +04:00
		/* No page in the page cache at all */
		count_vm_event(PGMAJFAULT);
2017-07-07 01:40:25 +03:00
		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
2009-06-17 02:31:25 +04:00
		ret = VM_FAULT_MAJOR;
2019-03-13 21:44:22 +03:00
		fpin = do_sync_mmap_readahead(vmf);
2009-06-17 02:31:25 +04:00
retry_find:
2021-01-28 21:19:45 +03:00
		/*
2021-03-10 18:46:41 +03:00
		 * See comment in filemap_create_folio() why we need
2021-01-28 21:19:45 +03:00
		 * invalidate_lock
		 */
		if (!mapping_locked) {
			filemap_invalidate_lock_shared(mapping);
			mapping_locked = true;
		}
2021-03-10 18:46:41 +03:00
		folio = __filemap_get_folio(mapping, index,
filemap: kill page_cache_read usage in filemap_fault
Patch series "drop the mmap_sem when doing IO in the fault path", v6.
Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions. Most are all
gone now, but this one is sort of weird since it's not necessarily a
priority inversion that happens within the kernel, but rather because of
something userspace does.
We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such. This involves running
'ps' or other such utilities. These utilities will often walk
/proc/<pid>/whatever, and these files can sometimes need to
down_read(&task->mmap_sem). Not usually a big deal, but we noticed when
we are stress testing that sometimes our protected application has latency
spikes trying to get the mmap_sem for tasks that are in lower priority
cgroups.
This is because any down_write() on a semaphore essentially turns it into
a mutex, so even if we currently have it held for reading, any new readers
will not be allowed on to keep from starving the writer. This is fine,
except a lower priority task could be stuck doing IO because it has been
throttled to the point that its IO is taking much longer than normal. But
because a higher priority group depends on this completing it is now stuck
behind lower priority work.
In order to avoid this particular priority inversion we want to use the
existing retry mechanism to stop from holding the mmap_sem at all if we
are going to do IO. This already exists in the read case sort of, but
needed to be extended for more than just grabbing the page lock. With
io.latency we throttle at submit_bio() time, so the readahead stuff can
block and even page_cache_read can block, so all these paths need to have
the mmap_sem dropped.
The other big thing is ->page_mkwrite. btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation. We use the same retry method as the read path, and
simply cache the page and verify the page is still setup properly the next
pass through ->page_mkwrite().
I've tested these patches with xfstests and there are no regressions.
This patch (of 3):
If we do not have a page at filemap_fault time we'll do this weird forced
page_cache_read thing to populate the page, and then drop it again and
loop around and find it. This makes for 2 ways we can read a page in
filemap_fault, and it's not really needed. Instead add a FGP_FOR_MMAP
flag so that pagecache_get_page() will return a unlocked page that's in
pagecache. Then use the normal page locking and readpage logic already in
filemap_fault. This simplifies the no page in page cache case
significantly.
[akpm@linux-foundation.org: fix comment text]
[josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-13 21:44:14 +03:00
					FGP_CREAT | FGP_FOR_MMAP,
					vmf->gfp_mask);
2023-03-07 17:34:10 +03:00
		if (IS_ERR(folio)) {
2019-03-13 21:44:22 +03:00
			if (fpin)
				goto out_retry;
2021-01-28 21:19:45 +03:00
			filemap_invalidate_unlock_shared(mapping);
2020-04-02 07:04:53 +03:00
			return VM_FAULT_OOM;
2019-03-13 21:44:22 +03:00
}
2005-04-17 02:20:36 +04:00
}
2021-03-10 18:46:41 +03:00
if (!lock_folio_maybe_drop_mmap(vmf, folio, &fpin))
goto out_retry;
2010-10-27 01:21:56 +04:00
/* Did it get truncated? */
2021-03-10 18:46:41 +03:00
if (unlikely(folio->mapping != mapping)) {
folio_unlock(folio);
folio_put(folio);
2010-10-27 01:21:56 +04:00
goto retry_find ;
}
2021-03-10 18:46:41 +03:00
VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
2010-10-27 01:21:56 +04:00
2005-04-17 02:20:36 +04:00
/*
mm: fix fault vs invalidate race for linear mappings
Fix the race between invalidate_inode_pages and do_no_page.
Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.
The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache. Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.
The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.
Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size. do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size). The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data. Valid mappings to the same
place will see a different page.
Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit. He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight. However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit. Scalability is not an issue.
This patch implements this latter approach. ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so). do_no_page only unlocks the page after setting up the mapping
completely. invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).
This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 12:46:57 +04:00
* We have a locked page in the page cache, now we need to check
* that it's up-to-date. If not, it is going to be due to an error.
2005-04-17 02:20:36 +04:00
*/
2021-03-10 18:46:41 +03:00
if (unlikely(!folio_test_uptodate(folio))) {
2021-01-28 21:19:45 +03:00
/*
* The page was in cache and uptodate and now it is not.
* Strange but possible since we didn't hold the page lock all
* the time. Let's drop everything, get the invalidate lock, and
* try again.
*/
if (!mapping_locked) {
2021-03-10 18:46:41 +03:00
folio_unlock(folio);
folio_put(folio);
2021-01-28 21:19:45 +03:00
goto retry_find ;
}
2005-04-17 02:20:36 +04:00
goto page_not_uptodate;
2021-01-28 21:19:45 +03:00
}
2005-04-17 02:20:36 +04:00
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-13 21:44:22 +03:00
/*
2020-06-09 07:33:54 +03:00
* We've made it this far and we had to drop our mmap_lock, now is the
* time to return to the upper layer and have it re-find the vma and
* redo the fault.
*/
if (fpin) {
2021-03-10 18:46:41 +03:00
folio_unlock(folio);
goto out_retry;
}
2021-01-28 21:19:45 +03:00
if (mapping_locked)
filemap_invalidate_unlock_shared(mapping);
2009-06-17 02:31:25 +04:00
/*
* Found the page and have a reference on it.
* We must recheck i_size under page lock.
*/
2021-03-10 18:46:41 +03:00
max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
if (unlikely(index >= max_idx)) {
folio_unlock(folio);
folio_put(folio);
2007-10-31 19:19:46 +03:00
return VM_FAULT_SIGBUS;
}
2021-03-10 18:46:41 +03:00
vmf->page = folio_file_page(folio, index);
2007-07-19 12:47:05 +04:00
return ret | VM_FAULT_LOCKED;
2005-04-17 02:20:36 +04:00
page_not_uptodate:
/*
* Umm, take care of errors if the page isn't up-to-date.
* Try to re-read it _once_. We do this synchronously,
* because there really aren't any performance issues here
* and we need to check for errors.
*/
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
2022-05-13 00:37:01 +03:00
error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-13 21:44:22 +03:00
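The pattern this commit message describes — drop the lock before anything that can block, do the slow work unlocked, then return VM_FAULT_RETRY so the caller retakes the lock and finds the now-cached result — can be sketched in plain pthreads C. This is a hypothetical userspace illustration, not kernel code; `handle_fault`, `slow_read_page`, and the `FAULT_*` constants are invented for the sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define FAULT_RETRY 1
#define FAULT_DONE  0

static pthread_mutex_t mmap_lock = PTHREAD_MUTEX_INITIALIZER;
static bool page_uptodate;		/* stands in for page cache state */

static void slow_read_page(void)
{
	/* the blocking "IO" happens here, with mmap_lock NOT held */
	page_uptodate = true;
}

/* Called with mmap_lock held; may drop it and ask the caller to retry. */
static int fault(void)
{
	if (page_uptodate)
		return FAULT_DONE;		/* fast path: already cached */

	pthread_mutex_unlock(&mmap_lock);	/* drop lock before blocking */
	slow_read_page();
	return FAULT_RETRY;			/* caller retakes lock, retries */
}

/* Returns how many times the fault had to be retried. */
int handle_fault(void)
{
	int ret, retries = 0;

	for (;;) {
		pthread_mutex_lock(&mmap_lock);
		ret = fault();
		if (ret != FAULT_RETRY) {
			pthread_mutex_unlock(&mmap_lock);
			return retries;
		}
		retries++;
	}
}
```

The key property is that the second pass through `fault()` finds the cached result and completes without ever blocking under the lock.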
	if (fpin)
		goto out_retry;
2021-03-10 18:46:41 +03:00
	folio_put(folio);
mm: fix fault vs invalidate race for linear mappings
Fix the race between invalidate_inode_pages and do_no_page.
Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.
The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache. Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.
The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.
Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size. do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size). The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data. Valid mappings to the same
place will see a different page.
Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit. He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight. However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit. Scalability is not an issue.
This patch implements this latter approach. ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so). do_no_page only unlocks the page after setting up the mapping
completely. invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).
This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 12:46:57 +04:00
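The "sort of open-coded seqlock" between i_size and truncate_count that this message describes can be modelled in a few lines of standalone C: the fault path samples truncate_count, does its lookup, then revalidates (as if under the page table lock) and backs out and retries if a truncation intervened. Everything here is a simplified model — `struct inode_state`, `do_no_page`'s signature, and `shrink_to_50` are invented for illustration.

```c
#include <assert.h>

struct inode_state {
	long i_size;
	unsigned truncate_count;
};

static void truncate(struct inode_state *inode, long new_size)
{
	inode->i_size = new_size;
	inode->truncate_count++;	/* published before unmapping ptes */
}

static void shrink_to_50(struct inode_state *inode)
{
	truncate(inode, 50);
}

/*
 * Returns -1 for "outside i_size" (SIGBUS), otherwise the number of
 * retries (0 or 1 here). 'racing' simulates a concurrent truncation
 * running between the lookup and the recheck.
 */
static int do_no_page(struct inode_state *inode, long offset,
		      void (*racing)(struct inode_state *))
{
	int retried = 0;
	unsigned seq;

retry:
	seq = inode->truncate_count;
	if (offset >= inode->i_size)
		return -1;		/* outside the file */
	if (racing) {			/* truncation races in right here */
		racing(inode);
		racing = 0;
	}
	/* recheck "under ptl": back out and retry if it changed */
	if (seq != inode->truncate_count) {
		retried = 1;
		goto retry;
	}
	return retried;
}
```

As the message notes, this protocol only covers truncation (which moves i_size); invalidation of pages inside i_size slips through, which is why the patch switches to holding the page lock instead.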
	if (!error || error == AOP_TRUNCATED_PAGE)
2005-12-16 01:28:17 +03:00
		goto retry_find;
2021-01-28 21:19:45 +03:00
	filemap_invalidate_unlock_shared(mapping);
2005-04-17 02:20:36 +04:00
2007-07-19 12:47:03 +04:00
	return VM_FAULT_SIGBUS;
filemap: drop the mmap_sem for all blocking operations
2019-03-13 21:44:22 +03:00
out_retry:
	/*
2020-06-09 07:33:54 +03:00
	 * We dropped the mmap_lock, we need to return to the fault handler to
filemap: drop the mmap_sem for all blocking operations
2019-03-13 21:44:22 +03:00
	 * re-find the vma and come back and find our hopefully still populated
	 * page.
	 */
2023-05-06 19:04:14 +03:00
	if (!IS_ERR(folio))
2021-03-10 18:46:41 +03:00
		folio_put(folio);
2021-01-28 21:19:45 +03:00
	if (mapping_locked)
		filemap_invalidate_unlock_shared(mapping);
filemap: drop the mmap_sem for all blocking operations
2019-03-13 21:44:22 +03:00
	if (fpin)
		fput(fpin);
	return ret | VM_FAULT_RETRY;
2007-07-19 12:46:59 +04:00
}
EXPORT_SYMBOL(filemap_fault);
2023-01-16 22:39:39 +03:00
static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
			    pgoff_t start)
2014-04-08 02:37:19 +04:00
{
2020-12-19 15:19:23 +03:00
	struct mm_struct *mm = vmf->vma->vm_mm;
	/* Huge page is mapped? No need to proceed. */
	if (pmd_trans_huge(*vmf->pmd)) {
2023-01-16 22:39:39 +03:00
		folio_unlock(folio);
		folio_put(folio);
2020-12-19 15:19:23 +03:00
		return true;
	}
2023-01-16 22:39:39 +03:00
	if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) {
		struct page *page = folio_file_page(folio, start);
mm: filemap: coding style cleanup for filemap_map_pmd()
Patch series "Solve silent data loss caused by poisoned page cache (shmem/tmpfs)", v5.
When discussing the patch that splits page cache THP in order to offline
the poisoned page, Noaya mentioned there is a bigger problem [1] that
prevents this from working since the page cache page will be truncated
if uncorrectable errors happen. By looking this deeper it turns out
this approach (truncating poisoned page) may incur silent data loss for
all non-readonly filesystems if the page is dirty. It may be worse for
in-memory filesystem, e.g. shmem/tmpfs since the data blocks are
actually gone.
To solve this problem we could keep the poisoned dirty page in page
cache then notify the users on any later access, e.g. page fault,
read/write, etc. The clean page could be truncated as is since they can
be reread from disk later on.
The consequence is the filesystems may find poisoned page and manipulate
it as healthy page since all the filesystems actually don't check if the
page is poisoned or not in all the relevant paths except page fault. In
general, we need to make the filesystems aware of poisoned pages before
we could keep the poisoned page in page cache in order to solve the data
loss problem.
To make filesystems be aware of poisoned page we should consider:
- The page should be not written back: clearing dirty flag could
prevent from writeback.
- The page should not be dropped (it shows as a clean page) by drop
caches or other callers: the refcount pin from hwpoison could prevent
from invalidating (called by cache drop, inode cache shrinking, etc),
but it doesn't avoid invalidation in DIO path.
- The page should be able to get truncated/hole punched/unlinked: it
works as it is.
- Notify users when the page is accessed, e.g. read/write, page fault
and other paths (compression, encryption, etc).
The scope of the last one is huge since almost all filesystems need do
it once a page is returned from page cache lookup. There are a couple
of options to do it:
1. Check hwpoison flag for every path, the most straightforward way.
2. Return NULL for poisoned page from page cache lookup, the most
callsites check if NULL is returned, this should have least work I
think. But the error handling in filesystems just return -ENOMEM,
the error code will incur confusion to the users obviously.
3. To improve #2, we could return error pointer, e.g. ERR_PTR(-EIO),
but this will involve significant amount of code change as well
since all the paths need check if the pointer is ERR or not just
like option #1.
I did prototypes for both #1 and #3, but it seems #3 may require more
changes than #1. For #3 ERR_PTR will be returned so all the callers
need to check the return value otherwise invalid pointer may be
dereferenced, but not all callers really care about the content of the
page, for example, partial truncate which just sets the truncated range
in one page to 0. So for such paths it needs additional modification if
ERR_PTR is returned. And if the callers have their own way to handle
the problematic pages we need to add a new FGP flag to tell FGP
functions to return the pointer to the page.
It may happen very rarely, but once it happens the consequence (data
corruption) could be very bad and it is very hard to debug. It seems
this problem had been slightly discussed before, but seems no action was
taken at that time. [2]
As the aforementioned investigation, it needs huge amount of work to
solve the potential data loss for all filesystems. But it is much
easier for in-memory filesystems and such filesystems actually suffer
more than others since even the data blocks are gone due to truncating.
So this patchset starts from shmem/tmpfs by taking option #1.
TODO:
* The unpoison has been broken since commit 0ed950d1f281 ("mm,hwpoison: make
get_hwpoison_page() call get_any_page()"), and this patch series make
refcount check for unpoisoning shmem page fail.
* Expand to other filesystems. But I haven't heard feedback from filesystem
developers yet.
Patch breakdown:
Patch #1: cleanup, depended by patch #2
Patch #2: fix THP with hwpoisoned subpage(s) PMD map bug
Patch #3: coding style cleanup
Patch #4: refactor and preparation.
Patch #5: keep the poisoned page in page cache and handle such case for all
the paths.
Patch #6: the previous patches unblock page cache THP split, so this patch
add page cache THP split support.
This patch (of 4):
A minor cleanup to the indent.
Link: https://lkml.kernel.org/r/20211020210755.23964-1-shy828301@gmail.com
Link: https://lkml.kernel.org/r/20211020210755.23964-4-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-05 23:41:04 +03:00
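Option #3 in the message — returning an error pointer such as ERR_PTR(-EIO) from the page cache lookup so callers can distinguish "not cached" (NULL) from "cached but poisoned" — follows the kernel's usual ERR_PTR/IS_ERR/PTR_ERR convention. A standalone sketch of that convention is below; `MY_EIO`, `lookup_page`, and the simplified `struct page` are invented for the example, and the helpers are cut-down models of the real kernel ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MY_EIO 5	/* stand-in for the errno value */

/* Simplified models of the kernel's error-pointer helpers: errno values
 * -1..-4095 are encoded into the top of the address space, which no
 * valid pointer can occupy. */
static inline void *ERR_PTR(long error)    { return (void *)error; }
static inline long  PTR_ERR(const void *p) { return (long)p; }
static inline int   IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-4095;
}

struct page { int hwpoison; };

/* A page-cache-style lookup: NULL means "not cached", an error pointer
 * means "cached but poisoned", anything else is a usable page. */
static void *lookup_page(struct page *page)
{
	if (!page)
		return NULL;			/* not in the cache */
	if (page->hwpoison)
		return ERR_PTR(-MY_EIO);	/* poisoned: report, don't hand out */
	return page;
}
```

This is exactly why the message notes the callers all need auditing: any caller that only checks for NULL would happily dereference the error pointer.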
		vm_fault_t ret = do_set_pmd(vmf, page);
		if (!ret) {
			/* The page is mapped successfully, reference consumed. */
2023-01-16 22:39:39 +03:00
			folio_unlock(folio);
mm: filemap: coding style cleanup for filemap_map_pmd()
2021-11-05 23:41:04 +03:00
			return true;
2020-12-19 15:19:23 +03:00
		}
	}
2021-11-05 23:38:38 +03:00
	if (pmd_none(*vmf->pmd))
		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
2020-12-19 15:19:23 +03:00
	return false;
}
2021-03-13 07:46:45 +03:00
static struct folio *next_uptodate_page(struct folio *folio,
2020-12-19 15:19:23 +03:00
					struct address_space *mapping,
					struct xa_state *xas, pgoff_t end_pgoff)
{
	unsigned long max_idx;
	do {
2021-03-13 07:33:43 +03:00
		if (!folio)
2020-12-19 15:19:23 +03:00
			return NULL;
2021-03-13 07:33:43 +03:00
		if (xas_retry(xas, folio))
2020-12-19 15:19:23 +03:00
			continue;
2021-03-13 07:33:43 +03:00
		if (xa_is_value(folio))
2020-12-19 15:19:23 +03:00
			continue;
2021-03-13 07:33:43 +03:00
		if (folio_test_locked(folio))
2020-12-19 15:19:23 +03:00
			continue;
2021-03-13 07:33:43 +03:00
		if (!folio_try_get_rcu(folio))
2020-12-19 15:19:23 +03:00
			continue;
		/* Has the page moved or been split? */
2021-03-13 07:33:43 +03:00
		if (unlikely(folio != xas_reload(xas)))
2020-12-19 15:19:23 +03:00
			goto skip;
2021-03-13 07:33:43 +03:00
		if (!folio_test_uptodate(folio) || folio_test_readahead(folio))
2020-12-19 15:19:23 +03:00
			goto skip;
2021-03-13 07:33:43 +03:00
		if (!folio_trylock(folio))
2020-12-19 15:19:23 +03:00
			goto skip;
2021-03-13 07:33:43 +03:00
		if (folio->mapping != mapping)
2020-12-19 15:19:23 +03:00
			goto unlock;
2021-03-13 07:33:43 +03:00
		if (!folio_test_uptodate(folio))
2020-12-19 15:19:23 +03:00
			goto unlock;
		max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
		if (xas->xa_index >= max_idx)
			goto unlock;
2021-03-13 07:46:45 +03:00
		return folio;
2020-12-19 15:19:23 +03:00
unlock:
2021-03-13 07:33:43 +03:00
		folio_unlock(folio);
2020-12-19 15:19:23 +03:00
skip:
2021-03-13 07:33:43 +03:00
		folio_put(folio);
	} while ((folio = xas_next_entry(xas, end_pgoff)) != NULL);
2020-12-19 15:19:23 +03:00
	return NULL;
}
2021-03-13 07:46:45 +03:00
static inline struct folio *first_map_page(struct address_space *mapping,
2020-12-19 15:19:23 +03:00
					   struct xa_state *xas,
					   pgoff_t end_pgoff)
{
	return next_uptodate_page(xas_find(xas, end_pgoff),
				  mapping, xas, end_pgoff);
}
2021-03-13 07:46:45 +03:00
static inline struct folio *next_map_page(struct address_space *mapping,
2020-12-19 15:19:23 +03:00
					  struct xa_state *xas,
					  pgoff_t end_pgoff)
{
	return next_uptodate_page(xas_next_entry(xas, end_pgoff),
				  mapping, xas, end_pgoff);
}
vm_fault_t filemap_map_pages(struct vm_fault *vmf,
			     pgoff_t start_pgoff, pgoff_t end_pgoff)
{
	struct vm_area_struct *vma = vmf->vma;
	struct file *file = vma->vm_file;
2014-04-08 02:37:19 +04:00
	struct address_space *mapping = file->f_mapping;
2016-07-27 01:25:20 +03:00
	pgoff_t last_pgoff = start_pgoff;
2021-01-14 18:24:19 +03:00
	unsigned long addr;
2018-05-17 07:08:30 +03:00
	XA_STATE(xas, &mapping->i_pages, start_pgoff);
2021-03-13 07:46:45 +03:00
	struct folio *folio;
	struct page *page;
2020-08-15 03:31:27 +03:00
	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
2020-12-19 15:19:23 +03:00
	vm_fault_t ret = 0;
2014-04-08 02:37:19 +04:00
	rcu_read_lock();
2021-03-13 07:46:45 +03:00
	folio = first_map_page(mapping, &xas, end_pgoff);
	if (!folio)
2020-12-19 15:19:23 +03:00
		goto out;
2014-04-08 02:37:19 +04:00
2023-01-16 22:39:39 +03:00
	if (filemap_map_pmd(vmf, folio, start_pgoff)) {
2020-12-19 15:19:23 +03:00
		ret = VM_FAULT_NOPAGE;
		goto out;
	}
2014-04-08 02:37:19 +04:00
2021-01-14 18:24:19 +03:00
	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
2023-06-09 04:11:29 +03:00
	if (!vmf->pte) {
		folio_unlock(folio);
		folio_put(folio);
		goto out;
	}
2020-12-19 15:19:23 +03:00
	do {
2020-06-28 05:19:08 +03:00
again:
2021-03-13 07:46:45 +03:00
		page = folio_file_page(folio, xas.xa_index);
2020-12-19 15:19:23 +03:00
		if (PageHWPoison(page))
2014-04-08 02:37:19 +04:00
			goto unlock;
2020-08-15 03:31:27 +03:00
		if (mmap_miss > 0)
			mmap_miss--;
2016-07-27 01:25:23 +03:00
2021-01-14 18:24:19 +03:00
		addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
2020-12-19 15:19:23 +03:00
		vmf->pte += xas.xa_index - last_pgoff;
2018-05-17 07:08:30 +03:00
		last_pgoff = xas.xa_index;
2020-12-19 15:19:23 +03:00
2022-05-13 06:22:52 +03:00
		/*
		 * NOTE: If there're PTE markers, we'll leave them to be
		 * handled in the specific fault path, and it'll prohibit the
		 * fault-around logic.
		 */
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.
But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.
Conversion was done using Coccinelle:
----
// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch
virtual patch
@ depends on patch @
pte_t *v;
@@
- *v
+ ptep_get(v)
----
Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.
Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
		if (!pte_none(ptep_get(vmf->pte)))
			goto unlock;

		/* We're about to handle the fault */
		if (vmf->address == addr)
			ret = VM_FAULT_NOPAGE;

		do_set_pte(vmf, page, addr);
		/* no need to invalidate: a not-present page won't be cached */
		update_mmu_cache(vma, addr, vmf->pte);
		if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
			xas.xa_index++;
			folio_ref_inc(folio);
			goto again;
		}
		folio_unlock(folio);
		continue;
unlock:
		if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
			xas.xa_index++;
			goto again;
		}
		folio_unlock(folio);
		folio_put(folio);
	} while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL);
	pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
	rcu_read_unlock();
	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
	return ret;
}
EXPORT_SYMBOL(filemap_map_pages);
vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf)
{
	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
	struct folio *folio = page_folio(vmf->page);
	vm_fault_t ret = VM_FAULT_LOCKED;

	sb_start_pagefault(mapping->host->i_sb);
	file_update_time(vmf->vma->vm_file);
	folio_lock(folio);
	if (folio->mapping != mapping) {
		folio_unlock(folio);
		ret = VM_FAULT_NOPAGE;
		goto out;
	}
	/*
	 * We mark the folio dirty already here so that when freeze is in
	 * progress, we are guaranteed that writeback during freezing will
	 * see the dirty folio and writeprotect it again.
	 */
	folio_mark_dirty(folio);
	folio_wait_stable(folio);
out:
	sb_end_pagefault(mapping->host->i_sb);
	return ret;
}

const struct vm_operations_struct generic_file_vm_ops = {
	.fault		= filemap_fault,
	.map_pages	= filemap_map_pages,
	.page_mkwrite	= filemap_page_mkwrite,
};
/* This is used for a general mmap of a disk file */
int generic_file_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct address_space *mapping = file->f_mapping;

	if (!mapping->a_ops->read_folio)
		return -ENOEXEC;
	file_accessed(file);
	vma->vm_ops = &generic_file_vm_ops;
	return 0;
}

/*
 * This is for filesystems which do not implement ->writepage.
 */
int generic_file_readonly_mmap(struct file *file, struct vm_area_struct *vma)
{
	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
		return -EINVAL;
	return generic_file_mmap(file, vma);
}
#else
vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf)
{
	return VM_FAULT_SIGBUS;
}

int generic_file_mmap(struct file *file, struct vm_area_struct *vma)
{
	return -ENOSYS;
}

int generic_file_readonly_mmap(struct file *file, struct vm_area_struct *vma)
{
	return -ENOSYS;
}
#endif /* CONFIG_MMU */

EXPORT_SYMBOL(filemap_page_mkwrite);
EXPORT_SYMBOL(generic_file_mmap);
EXPORT_SYMBOL(generic_file_readonly_mmap);
static struct folio *do_read_cache_folio(struct address_space *mapping,
		pgoff_t index, filler_t filler, struct file *file, gfp_t gfp)
{
	struct folio *folio;
	int err;

	if (!filler)
		filler = mapping->a_ops->read_folio;
repeat:
	folio = filemap_get_folio(mapping, index);
	if (IS_ERR(folio)) {
		folio = filemap_alloc_folio(gfp, 0);
		if (!folio)
			return ERR_PTR(-ENOMEM);
		err = filemap_add_folio(mapping, folio, index, gfp);
		if (unlikely(err)) {
			folio_put(folio);
			if (err == -EEXIST)
				goto repeat;
			/* Presumably ENOMEM for xarray node */
			return ERR_PTR(err);
		}

		goto filler;
	}
	if (folio_test_uptodate(folio))
		goto out;

	if (!folio_trylock(folio)) {
		folio_put_wait_locked(folio, TASK_UNINTERRUPTIBLE);
		goto repeat;
	}

	/* Folio was truncated from mapping */
	if (!folio->mapping) {
		folio_unlock(folio);
		folio_put(folio);
		goto repeat;
	}

	/* Someone else locked and filled the page in a very small window */
	if (folio_test_uptodate(folio)) {
		folio_unlock(folio);
		goto out;
	}
mm/filemap.c: clear page error before actual read
A mount failure issue happens under the following scenario: an application
forked dozens of threads to mount the same number of cramfs images
separately in docker, but several mounts failed with high probability.
Mount failed because the page (read from the superblock of the loop dev)
is not uptodate after wait_on_page_locked(page) returns in cramfs_read:

	wait_on_page_locked(page);
	if (!PageUptodate(page)) {
		...
	}

The reason the page is not uptodate: systemd-udevd read the loopX dev
before mount.  Because the status of loopX was Lo_unbound at that time,
loop_make_request directly triggered the io_end handler
end_buffer_async_read, which called SetPageError(page).  As a result the
page cannot be set uptodate in end_buffer_async_read:

	if (page_uptodate && !PageError(page)) {
		SetPageUptodate(page);
	}

Then the mount operation is performed.  It uses the same page that was
just accessed by systemd-udevd above.  Because this page is not uptodate,
it launches an actual read via submit_bh and then waits on the page by
calling wait_on_page_locked(page).  When the I/O of the page is done, the
io_end handler end_buffer_async_read is called.  Because no one cleared
the page error (during the whole read path of mount) caused by the
systemd-udevd read, this page is still in "PageError" status and cannot
be set uptodate in end_buffer_async_read, so the mount fails.

But sometimes the mount succeeds even though systemd-udevd read the loopX
dev just before.  The reason is that systemd-udevd launched another loopX
read between steps 3.1 and 3.2 below:

1. loopX dev default status is Lo_unbound;
2. systemd-udevd reads loopX dev (page is set to PageError);
3. mount operation
   1) sets loopX status to Lo_bound;
   ==> systemd-udevd reads loopX dev <==
   2) reads loopX dev (page has no error)
   3) mount succeeds

As the loopX dev status is set to Lo_bound after step 3.1, the other loopX
read issued by systemd-udevd goes through the whole I/O stack; part of the
call trace:

	SYS_read
	 vfs_read
	  do_sync_read
	   blkdev_aio_read
	    generic_file_aio_read
	     do_generic_file_read:
		ClearPageError(page);
		mapping->a_ops->readpage(filp, page);

Here, mapping->a_ops->readpage() is blkdev_readpage.  In the latest
kernel some function names changed; the call trace is:

	blkdev_read_iter
	 generic_file_read_iter
	  generic_file_buffered_read:
		/*
		 * A previous I/O error may have been due to temporary
		 * failures, eg. multipath errors.
		 * PG_error will be set again if readpage fails.
		 */
		ClearPageError(page);
		/* Start the actual read. The read will unlock the page. */
		error = mapping->a_ops->readpage(filp, page);

We can see that ClearPageError(page) is called before the actual read, so
the read in step 3.2 succeeds.

This patch adds a call to ClearPageError just before the actual read in
the read path of cramfs mount.  Without the patch, the call trace when
performing cramfs mount is:

	do_mount
	 cramfs_read
	  cramfs_blkdev_read
	   read_cache_page
	    do_read_cache_page:
		filler(data, page);
		or
		mapping->a_ops->readpage(data, page);

With the patch, the call trace when performing mount is:

	do_mount
	 cramfs_read
	  cramfs_blkdev_read
	   read_cache_page:
	    do_read_cache_page:
		ClearPageError(page);	<== new add
		filler(data, page);
		or
		mapping->a_ops->readpage(data, page);

With the patch, the mount operation triggers the call to
ClearPageError(page) before the actual read, so the page has no error if
no additional page error happens when the I/O is done.
Signed-off-by: Xianting Tian <xianting_tian@126.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: <yubin@h3c.com>
Link: http://lkml.kernel.org/r/1583318844-22971-1-git-send-email-xianting_tian@126.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
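The failure mode above can be sketched with a toy flag model. This is an assumption-laden simplification (the `toy_page`, `toy_end_read` and `toy_read_page` names are invented for illustration, not kernel APIs): a completion handler refuses to mark a page uptodate while a stale error bit is set, so the error must be cleared before re-issuing the read.

```c
#include <assert.h>

/* Toy flag model (assumption: a much-simplified PageError/PageUptodate
 * pair) showing why the stale error bit must be cleared before the
 * actual read: the completion handler only sets Uptodate when no error
 * is recorded on the page. */
struct toy_page {
	int error;	/* stands in for PageError() */
	int uptodate;	/* stands in for PageUptodate() */
};

/* Mirrors the end_buffer_async_read() logic quoted above: a successful
 * I/O still refuses to mark the page uptodate if an old error remains. */
static void toy_end_read(struct toy_page *pg, int io_ok)
{
	if (!io_ok)
		pg->error = 1;
	if (io_ok && !pg->error)
		pg->uptodate = 1;
}

/* Returns 1 if the page ends up uptodate after the read. */
static int toy_read_page(struct toy_page *pg, int io_ok, int clear_first)
{
	if (clear_first)
		pg->error = 0;	/* the ClearPageError() this patch adds */
	toy_end_read(pg, io_ok);
	return pg->uptodate;
}
```

With `clear_first` unset, a perfectly good read still leaves the page not uptodate, which is exactly the mount failure described above.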
filler:
	err = filemap_read_folio(file, filler, folio);
	if (err) {
		folio_put(folio);
		if (err == AOP_TRUNCATED_PAGE)
			goto repeat;
		return ERR_PTR(err);
	}

out:
	folio_mark_accessed(folio);
	return folio;
}
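The lookup / allocate / insert / retry-on-EEXIST shape of do_read_cache_folio() can be sketched in userspace. This is a deliberately tiny model under stated assumptions: a single-slot "cache" stands in for the xarray, `toy_*` names are invented, and the filler step is reduced to setting a flag.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy single-slot "page cache" (assumption: drastically simplified from
 * the xarray-backed real thing) showing the lookup / allocate / insert /
 * retry-on-EEXIST shape of do_read_cache_folio(). */
#define TOY_EEXIST 17

struct toy_entry {
	int index;
	int uptodate;
};

static struct toy_entry *toy_slot;	/* holds at most one cached entry */

static struct toy_entry *toy_lookup(int index)
{
	return (toy_slot && toy_slot->index == index) ? toy_slot : NULL;
}

/* Returns 0 on success, -TOY_EEXIST if another insert won the race. */
static int toy_add(struct toy_entry *e)
{
	if (toy_slot)
		return -TOY_EEXIST;
	toy_slot = e;
	return 0;
}

/* Lookup-or-create with the same repeat: loop as the kernel function.
 * Only meaningful for one index at a time, given the single slot. */
static struct toy_entry *toy_read_cache_entry(int index)
{
	struct toy_entry *e;

repeat:
	e = toy_lookup(index);
	if (!e) {
		e = malloc(sizeof(*e));
		if (!e)
			return NULL;
		e->index = index;
		e->uptodate = 0;
		if (toy_add(e)) {	/* lost the race: drop ours, retry */
			free(e);
			goto repeat;
		}
	}
	if (!e->uptodate)
		e->uptodate = 1;	/* stand-in for filemap_read_folio() */
	return e;
}
```

The point of the `goto repeat` is the same as in the kernel code: losing the insert race is not an error, it just means someone else's entry should be looked up and used.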
/**
 * read_cache_folio - Read into page cache, fill it if needed.
 * @mapping: The address_space to read from.
 * @index: The index to read.
 * @filler: Function to perform the read, or NULL to use aops->read_folio().
 * @file: Passed to filler function, may be NULL if not required.
 *
 * Read one page into the page cache.  If it succeeds, the folio returned
 * will contain @index, but it may not be the first page of the folio.
 *
 * If the filler function returns an error, it will be returned to the
 * caller.
 *
 * Context: May sleep.  Expects mapping->invalidate_lock to be held.
 * Return: An uptodate folio on success, ERR_PTR() on failure.
 */
struct folio *read_cache_folio(struct address_space *mapping, pgoff_t index,
		filler_t filler, struct file *file)
{
	return do_read_cache_folio(mapping, index, filler, file,
			mapping_gfp_mask(mapping));
}
EXPORT_SYMBOL(read_cache_folio);
/**
 * mapping_read_folio_gfp - Read into page cache, using specified allocation flags.
 * @mapping: The address_space for the folio.
 * @index: The index that the allocated folio will contain.
 * @gfp: The page allocator flags to use if allocating.
 *
 * This is the same as "read_cache_folio(mapping, index, NULL, NULL)", but with
 * any new memory allocations done using the specified allocation flags.
 *
 * The most likely error from this function is EIO, but ENOMEM is
 * possible and so is EINTR.  If ->read_folio returns another error,
 * that will be returned to the caller.
 *
 * The function expects mapping->invalidate_lock to be already held.
 *
 * Return: Uptodate folio on success, ERR_PTR() on failure.
 */
struct folio *mapping_read_folio_gfp(struct address_space *mapping,
		pgoff_t index, gfp_t gfp)
{
	return do_read_cache_folio(mapping, index, NULL, NULL, gfp);
}
EXPORT_SYMBOL(mapping_read_folio_gfp);
static struct page *do_read_cache_page(struct address_space *mapping,
		pgoff_t index, filler_t *filler, struct file *file, gfp_t gfp)
{
	struct folio *folio;

	folio = do_read_cache_folio(mapping, index, filler, file, gfp);
	if (IS_ERR(folio))
		return &folio->page;
	return folio_file_page(folio, index);
}
struct page *read_cache_page(struct address_space *mapping,
		pgoff_t index, filler_t *filler, struct file *file)
{
	return do_read_cache_page(mapping, index, filler, file,
			mapping_gfp_mask(mapping));
}
EXPORT_SYMBOL(read_cache_page);

/**
 * read_cache_page_gfp - read into page cache, using specified page allocation flags.
 * @mapping: the page's address_space
 * @index: the page index
 * @gfp: the page allocator flags to use if allocating
 *
 * This is the same as "read_mapping_page(mapping, index, NULL)", but with
 * any new page allocations done using the specified allocation flags.
 *
 * If the page does not get brought uptodate, return -EIO.
 *
 * The function expects mapping->invalidate_lock to be already held.
 *
 * Return: up to date page on success, ERR_PTR() on failure.
 */
struct page *read_cache_page_gfp(struct address_space *mapping,
				pgoff_t index,
				gfp_t gfp)
{
	return do_read_cache_page(mapping, index, NULL, NULL, gfp);
}
EXPORT_SYMBOL(read_cache_page_gfp);
/*
 * Warn about a page cache invalidation failure during a direct I/O write.
 */
static void dio_warn_stale_pagecache(struct file *filp)
{
	static DEFINE_RATELIMIT_STATE(_rs, 86400 * HZ, DEFAULT_RATELIMIT_BURST);
	char pathname[128];
	char *path;

	errseq_set(&filp->f_mapping->wb_err, -EIO);
	if (__ratelimit(&_rs)) {
		path = file_path(filp, pathname, sizeof(pathname));
		if (IS_ERR(path))
			path = "(unknown)";
		pr_crit("Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!\n");
		pr_crit("File: %s PID: %d Comm: %.20s\n", path, current->pid,
			current->comm);
	}
}
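The `DEFINE_RATELIMIT_STATE(_rs, 86400 * HZ, DEFAULT_RATELIMIT_BURST)` line above throttles the warning to a burst per day. A minimal userspace sketch of that interval/burst scheme (the `toy_ratelimit` names are invented; the kernel's ___ratelimit() has more state, such as a suppressed-event counter):

```c
#include <assert.h>

/* Toy interval ratelimiter (assumption: simplified from the kernel's
 * ___ratelimit()): allow at most `burst` events per `interval` ticks,
 * which is what DEFINE_RATELIMIT_STATE(_rs, 86400 * HZ, ...) above sets
 * up with a one-day interval. */
struct toy_ratelimit {
	long interval;	/* window length, in ticks */
	int burst;	/* events allowed per window */
	long begin;	/* start of the current window */
	int printed;	/* events already allowed in this window */
};

/* Returns 1 if the event may proceed, 0 if it is suppressed. */
static int toy_ratelimit(struct toy_ratelimit *rs, long now)
{
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;	/* open a fresh window */
		rs->printed = 0;
	}
	if (rs->printed >= rs->burst)
		return 0;
	rs->printed++;
	return 1;
}
```

The design choice mirrors the kernel's: the window only resets lazily, when an event arrives after the interval has elapsed, so no timer is needed.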
void kiocb_invalidate_post_direct_write(struct kiocb *iocb, size_t count)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;

	if (mapping->nrpages &&
	    invalidate_inode_pages2_range(mapping,
			iocb->ki_pos >> PAGE_SHIFT,
			(iocb->ki_pos + count - 1) >> PAGE_SHIFT))
		dio_warn_stale_pagecache(iocb->ki_filp);
}

ssize_t
generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	size_t write_len = iov_iter_count(from);
	ssize_t written;
fs: fix data invalidation in the cleancache during direct IO
Patch series "Properly invalidate data in the cleancache", v2.
We've noticed that after a direct IO write, a buffered read sometimes gets
stale data coming from the cleancache.  The reason for this is that some
direct write hooks call invalidate_inode_pages2[_range]() conditionally,
only if mapping->nrpages is not zero, so we may not invalidate data in the
cleancache.
Another odd thing is that we check only for ->nrpages and don't check for
->nrexceptional, but invalidate_inode_pages2[_range]() invalidates
exceptional entries as well.  So we invalidate exceptional entries only if
->nrpages != 0?  This doesn't feel right.
- Patch 1 fixes direct IO writes by removing the ->nrpages check.
- Patch 2 fixes a similar case in invalidate_bdev().
Note: I only fixed the conditional cleancache_invalidate_inode() here.
Do we also need to add an ->nrexceptional check into invalidate_bdev()?
- Patches 3-4: some optimizations.
This patch (of 4):
Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
conditionally, only if mapping->nrpages is not zero.  This can't be right,
because invalidate_inode_pages2[_range]() also invalidates data in the
cleancache via a cleancache_invalidate_inode() call.  So if the page cache
is empty but there is some data in the cleancache, a buffered read after a
direct IO write would get stale data from the cleancache.
Also it doesn't feel right to check only for ->nrpages, because
invalidate_inode_pages2[_range]() invalidates exceptional entries as well.
Fix this by calling invalidate_inode_pages2[_range]() regardless of the
nrpages state.
Note: nfs, cifs and 9p don't need a similar fix because they never call
cleancache_get_page() (neither directly nor via mpage_readpage[s]()), so
they are not affected by this bug.
Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
	/*
	 * If a page can not be invalidated, return 0 to fall back
	 * to buffered write.
	 */
	written = kiocb_invalidate_pages(iocb, write_len);
	if (written) {
		if (written == -EBUSY)
			return 0;
		return written;
	}

	written = mapping->a_ops->direct_IO(iocb, from);

	/*
	 * Finally, try again to invalidate clean pages which might have been
	 * cached by non-direct readahead, or faulted in by get_user_pages()
	 * if the source of the write was an mmap'ed region of the file
	 * we're writing.  Either one is a pretty crazy thing to do,
	 * so we don't support it 100%.  If this invalidation
	 * fails, tough, the write still worked...
	 *
	 * Most of the time we do not need this since dio_complete() will do
	 * the invalidation for us.  However there are some file systems that
	 * do not end up with dio_complete() being called, so let's not break
	 * them by removing it completely.
	 *
	 * Noticeable example is a blkdev_direct_IO().
	 *
	 * Skip invalidation for async writes or if mapping has no pages.
	 */
	if (written > 0) {
		struct inode *inode = mapping->host;
		loff_t pos = iocb->ki_pos;

		kiocb_invalidate_post_direct_write(iocb, written);
		pos += written;
		write_len -= written;
		if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
			i_size_write(inode, pos);
			mark_inode_dirty(inode);
		}
		iocb->ki_pos = pos;
	}
	if (written != -EIOCBQUEUED)
		iov_iter_revert(from, write_len - iov_iter_count(from));
	return written;
}
EXPORT_SYMBOL(generic_file_direct_write);
ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
{
	struct file *file = iocb->ki_filp;
	loff_t pos = iocb->ki_pos;
	struct address_space *mapping = file->f_mapping;
	const struct address_space_operations *a_ops = mapping->a_ops;
	long status = 0;
	ssize_t written = 0;

	do {
		struct page *page;
		unsigned long offset;	/* Offset into pagecache page */
		unsigned long bytes;	/* Bytes to write to page */
		size_t copied;		/* Bytes copied from user */
		void *fsdata = NULL;
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
time ago with the promise that one day it would be possible to implement
the page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and likely never will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE, and it's a constant source of confusion whether the
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in the page cache are special.  They are
not.
The changes are pretty straightforward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using the
script below.  For some reason, coccinelle doesn't patch header files;
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation will
also be addressed in a separate patch.

virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
		offset = (pos & (PAGE_SIZE - 1));
		bytes = min_t(unsigned long, PAGE_SIZE - offset,
						iov_iter_count(i));

again:
		/*
		 * Bring in the user page that we will copy from _first_.
		 * Otherwise there's a nasty deadlock on copying from the
		 * same page as we're writing to, without it being marked
		 * up-to-date.
		 */
		if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
			status = -EFAULT;
			break;
		}
mm: make sendfile(2) killable
Currently the simple program below issues a sendfile(2) system call which
takes about 62 days to complete in my test KVM instance.

	int fd;
	off_t off = 0;

	fd = open("file", O_RDWR | O_TRUNC | O_SYNC | O_CREAT, 0644);
	ftruncate(fd, 2);
	lseek(fd, 0, SEEK_END);
	sendfile(fd, fd, &off, 0xfffffff);

Now you should not ask the kernel to do stupid things like copying 256MB
in 2-byte chunks and calling fsync(2) after each chunk, but if you do, a
sysadmin should have a way to stop you.
We actually do have a check for fatal_signal_pending() in
generic_perform_write() which triggers in this path; however, because we
always succeed in writing something before the check is done, we return a
value > 0 from generic_perform_write() and thus the information about the
signal gets lost.
Fix the problem by doing the signal check before writing anything.  That
way generic_perform_write() returns -EINTR, the error gets propagated up
and the sendfile loop terminates early.
Signed-off-by: Jan Kara <jack@suse.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
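Why the check ordering matters can be sketched in a few lines. This is a toy model under stated assumptions (the `toy_write_loop` name and parameters are invented): once any bytes are written, the positive return value necessarily hides the signal, so only a check placed before the copy can surface -EINTR.

```c
#include <assert.h>

/* Toy write loop (assumption: simplified from generic_perform_write())
 * showing why the fatal-signal check must run before anything is
 * copied: once written > 0, the positive return value hides -EINTR. */
#define TOY_EINTR 4

static long toy_write_loop(long total, long chunk, int signal_pending,
			   int check_before)
{
	long written = 0;

	while (written < total) {
		if (check_before && signal_pending)
			break;
		written += chunk;	/* one chunk always lands */
		if (!check_before && signal_pending)
			break;
	}
	if (written)
		return written;		/* signal info lost if > 0 */
	return signal_pending ? -TOY_EINTR : 0;
}
```

With the check after the copy, a pending signal still yields a positive return and the caller's loop keeps going; checking first is what lets the error propagate.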
		if (fatal_signal_pending(current)) {
			status = -EINTR;
			break;
		}

		status = a_ops->write_begin(file, mapping, pos, bytes,
						&page, &fsdata);
		if (unlikely(status < 0))
			break;
mm: flush dcache before writing into page to avoid alias
A cache alias problem will happen if the changes to a user shared mapping
are not flushed before copying; the user and kernel mappings may then map
to two different cache lines, and it is impossible to guarantee coherence
after iov_iter_copy_from_user_atomic.  So the right steps should be:

	flush_dcache_page(page);
	kmap_atomic(page);
	write to page;
	kunmap_atomic(page);
	flush_dcache_page(page);

More precisely, we might create two new APIs, flush_dcache_user_page and
flush_dcache_kern_page, to replace the two flush_dcache_page calls
accordingly.
Here is a snippet tested on omap2430 with a VIPT cache, and I think it is
not ARM-specific:

	int val = 0x11111111;
	fd = open("abc", O_RDWR);
	addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	*(addr+0) = 0x44444444;
	tmp = *(addr+0);
	*(addr+1) = 0x77777777;
	write(fd, &val, sizeof(int));
	close(fd);

The results are not always 0x11111111 0x77777777 at the beginning as
expected.  Sometimes we see 0x44444444 0x77777777.
Signed-off-by: Anfei <anfei.zhou@gmail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <linux-arch@vger.kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
if ( mapping_writably_mapped ( mapping ) )
flush_dcache_page ( page ) ;
2015-10-07 10:32:38 +03:00
2021-04-30 17:26:41 +03:00
copied = copy_page_from_iter_atomic ( page , offset , bytes , i ) ;
2007-10-16 12:25:01 +04:00
flush_dcache_page ( page ) ;
status = a_ops - > write_end ( file , mapping , pos , bytes , copied ,
page , fsdata ) ;
2021-04-30 17:26:41 +03:00
if ( unlikely ( status ! = copied ) ) {
iov_iter_revert ( i , copied - max ( status , 0L ) ) ;
if ( unlikely ( status < 0 ) )
break ;
}
2007-10-16 12:25:01 +04:00
cond_resched ( ) ;
2021-05-31 07:32:44 +03:00
if ( unlikely ( status = = 0 ) ) {
2007-10-16 12:25:01 +04:00
/*
2021-05-31 07:32:44 +03:00
* A short copy made - > write_end ( ) reject the
* thing entirely . Might be memory poisoning
* halfway through , might be a race with munmap ,
* might be severe memory pressure .
2007-10-16 12:25:01 +04:00
*/
2021-05-31 07:32:44 +03:00
			if (copied)
				bytes = copied;
2007-10-16 12:25:01 +04:00
			goto again;
		}
2021-04-30 17:26:41 +03:00
		pos += status;
		written += status;
2007-10-16 12:25:01 +04:00
		balance_dirty_pages_ratelimited(mapping);
	} while (iov_iter_count(i));
2023-06-01 17:58:55 +03:00
	if (!written)
		return status;
	iocb->ki_pos += written;
	return written;
2007-10-16 12:25:01 +04:00
}
2014-02-12 06:34:08 +04:00
EXPORT_SYMBOL(generic_perform_write);
2005-04-17 02:20:36 +04:00
2009-08-17 20:10:06 +04:00
/**
2014-04-03 11:17:43 +04:00
* __generic_file_write_iter - write data to a file
2009-08-17 20:10:06 +04:00
 * @iocb:	IO state structure (file, offset, etc.)
2014-04-03 11:17:43 +04:00
 * @from:	iov_iter with data to write
2009-08-17 20:10:06 +04:00
*
 * This function does all the work needed for actually writing data to a
 * file. It does all basic checks, removes SUID from the file, updates
 * modification times and calls proper subroutines depending on whether we
 * do direct IO or a standard buffered write.
*
2021-04-12 16:50:21 +03:00
 * It expects i_rwsem to be grabbed unless we work on a block device or similar
2009-08-17 20:10:06 +04:00
 * object which does not need locking at all.
*
 * This function does *not* take care of syncing data in case of O_SYNC write.
 * A caller has to handle it. This is mainly due to the fact that we want to
2021-04-12 16:50:21 +03:00
 * avoid syncing under i_rwsem.
2019-03-06 02:48:42 +03:00
*
 * Return:
 * * number of bytes written, even for truncated writes
 * * negative error code if no data has been written at all
2009-08-17 20:10:06 +04:00
*/
2014-04-03 11:17:43 +04:00
ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
2005-04-17 02:20:36 +04:00
{
	struct file *file = iocb->ki_filp;
2021-05-05 04:40:12 +03:00
	struct address_space *mapping = file->f_mapping;
2023-06-01 17:59:01 +03:00
	struct inode *inode = mapping->host;
	ssize_t ret;
2005-04-17 02:20:36 +04:00
2023-06-01 17:59:01 +03:00
	ret = file_remove_privs(file);
	if (ret)
		return ret;
2005-04-17 02:20:36 +04:00
2023-06-01 17:59:01 +03:00
	ret = file_update_time(file);
	if (ret)
		return ret;
2006-10-20 10:28:13 +04:00
2015-04-09 20:52:01 +03:00
	if (iocb->ki_flags & IOCB_DIRECT) {
2023-06-01 17:59:01 +03:00
		ret = generic_file_direct_write(iocb, from);
2005-04-17 02:20:36 +04:00
/*
2015-02-17 02:58:53 +03:00
		 * If the write stopped short of completing, fall back to
		 * buffered writes.  Some filesystems do this for writes to
		 * holes, for example.  For DAX files, a buffered write will
		 * not succeed (even if it did, DAX does not handle dirty
		 * page-cache pages correctly).
2005-04-17 02:20:36 +04:00
*/
2023-06-01 17:59:01 +03:00
		if (ret < 0 || !iov_iter_count(from) || IS_DAX(inode))
			return ret;
		return direct_write_fallback(iocb, from, ret,
				generic_perform_write(iocb, from));
2006-10-20 10:28:13 +04:00
}
2023-06-01 17:59:01 +03:00
	return generic_perform_write(iocb, from);
2005-04-17 02:20:36 +04:00
}
2014-04-03 11:17:43 +04:00
EXPORT_SYMBOL(__generic_file_write_iter);
2009-08-17 20:10:06 +04:00
/**
2014-04-03 11:17:43 +04:00
* generic_file_write_iter - write data to a file
2009-08-17 20:10:06 +04:00
 * @iocb:	IO state structure
2014-04-03 11:17:43 +04:00
 * @from:	iov_iter with data to write
2009-08-17 20:10:06 +04:00
*
2014-04-03 11:17:43 +04:00
 * This is a wrapper around __generic_file_write_iter() to be used by most
2009-08-17 20:10:06 +04:00
 * filesystems. It takes care of syncing the file in case of O_SYNC file
2021-04-12 16:50:21 +03:00
 * and acquires i_rwsem as needed.
2019-03-06 02:48:42 +03:00
 * Return:
 * * negative error code if no data has been written at all or
 *   vfs_fsync_range() failed for a synchronous write
 * * number of bytes written, even for truncated writes
2009-08-17 20:10:06 +04:00
*/
2014-04-03 11:17:43 +04:00
ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
2005-04-17 02:20:36 +04:00
{
	struct file *file = iocb->ki_filp;
2009-08-17 21:52:36 +04:00
	struct inode *inode = file->f_mapping->host;
2005-04-17 02:20:36 +04:00
	ssize_t ret;
2016-01-22 23:40:57 +03:00
	inode_lock(inode);
2015-04-09 19:55:47 +03:00
	ret = generic_write_checks(iocb, from);
	if (ret > 0)
2015-04-07 18:28:12 +03:00
		ret = __generic_file_write_iter(iocb, from);
2016-01-22 23:40:57 +03:00
	inode_unlock(inode);
2005-04-17 02:20:36 +04:00
2016-04-07 18:52:01 +03:00
	if (ret > 0)
		ret = generic_write_sync(iocb, ret);
2005-04-17 02:20:36 +04:00
	return ret;
}
2014-04-03 11:17:43 +04:00
EXPORT_SYMBOL(generic_file_write_iter);
2005-04-17 02:20:36 +04:00
2006-08-29 22:05:54 +04:00
/**
2021-07-28 22:14:48 +03:00
 * filemap_release_folio() - Release fs-specific metadata on a folio.
 * @folio: The folio which the kernel is trying to free.
 * @gfp: Memory allocation flags (and I/O mode).
2006-08-29 22:05:54 +04:00
*
2021-07-28 22:14:48 +03:00
 * The address_space is trying to release any data attached to a folio
 * (presumably at folio->private).
2006-08-29 22:05:54 +04:00
*
2021-07-28 22:14:48 +03:00
 * This will also be called if the private_2 flag is set on a page,
 * indicating that the folio has other metadata associated with it.
2009-04-03 19:42:36 +04:00
*
2021-07-28 22:14:48 +03:00
 * The @gfp argument specifies whether I/O may be performed to release
 * this page (__GFP_IO), and whether the call may block
 * (__GFP_RECLAIM & __GFP_FS).
2006-08-29 22:05:54 +04:00
*
2021-07-28 22:14:48 +03:00
 * Return: %true if the release was successful, otherwise %false.
2006-08-29 22:05:54 +04:00
*/
2021-07-28 22:14:48 +03:00
bool filemap_release_folio(struct folio *folio, gfp_t gfp)
2006-08-29 22:05:54 +04:00
{
2021-07-28 22:14:48 +03:00
	struct address_space * const mapping = folio->mapping;
2006-08-29 22:05:54 +04:00
2021-07-28 22:14:48 +03:00
	BUG_ON(!folio_test_locked(folio));
	if (folio_test_writeback(folio))
		return false;
2006-08-29 22:05:54 +04:00
2022-04-30 00:00:05 +03:00
	if (mapping && mapping->a_ops->release_folio)
		return mapping->a_ops->release_folio(folio, gfp);
2022-05-01 08:08:08 +03:00
	return try_to_free_buffers(folio);
2006-08-29 22:05:54 +04:00
}
2021-07-28 22:14:48 +03:00
EXPORT_SYMBOL(filemap_release_folio);
cachestat: implement cachestat syscall
There is currently no good way to query the page cache state of large file
sets and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing a database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the
index.
* Visibility into the writeback algorithm, for diagnosing
performance issues.
* Workload-aware writeback pacing: estimating IO fulfilled by page
cache (and IO to be done) within a range of a file, allowing for
more frequent syncing when and where there is IO capacity, and
batching when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases could be found in the following
thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This patch implements a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in a
given range. Currently, the syscall is only wired in for x86
architecture.
NAME
cachestat - query the page cache statistics of a file.
SYNOPSIS
#include <sys/mman.h>
struct cachestat_range {
__u64 off;
__u64 len;
};
struct cachestat {
__u64 nr_cache;
__u64 nr_dirty;
__u64 nr_writeback;
__u64 nr_evicted;
__u64 nr_recently_evicted;
};
int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
struct cachestat *cstat, unsigned int flags);
DESCRIPTION
cachestat() queries the number of cached pages, number of dirty
pages, number of pages marked for writeback, number of evicted
pages, number of recently evicted pages, in the bytes range given by
`off` and `len`.
An evicted page is a page that was previously in the page cache but
has been evicted since. A page is recently evicted if its last
eviction was recent enough that its reentry to the cache would
indicate that it is actively being used by the system, and that
there is memory pressure on the system.
These values are returned in a cachestat struct, whose address is
given by the `cstat` argument.
The `off` and `len` arguments must be non-negative integers. If
`len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
0, we will query in the range from `off` to the end of the file.
The `flags` argument is unused for now, but is included for future
extensibility. Users should pass 0 (i.e. no flags specified).
Currently, hugetlbfs is not supported.
Because the status of a page can change after cachestat() checks it
but before it returns to the application, the returned values may
contain stale information.
RETURN VALUE
On success, cachestat returns 0. On error, -1 is returned, and errno
is set to indicate the error.
ERRORS
EFAULT cstat or cstat_args points to an invalid address.
EINVAL invalid flags.
EBADF invalid file descriptor.
EOPNOTSUPP file descriptor is of a hugetlbfs file
[nphamcs@gmail.com: replace rounddown logic with the existing helper]
Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 04:36:07 +03:00
#ifdef CONFIG_CACHESTAT_SYSCALL
/**
 * filemap_cachestat() - compute the page cache statistics of a mapping
 * @mapping:	The mapping to compute the statistics for.
 * @first_index:	The starting page cache index.
 * @last_index:	The final page index (inclusive).
 * @cs:	the cachestat struct to write the result to.
 *
 * This will query the page cache statistics of a mapping in the
 * page range of [first_index, last_index] (inclusive). The statistics
 * queried include: number of dirty pages, number of pages marked for
 * writeback, and the number of (recently) evicted pages.
*/
static void filemap_cachestat(struct address_space *mapping,
		pgoff_t first_index, pgoff_t last_index, struct cachestat *cs)
{
	XA_STATE(xas, &mapping->i_pages, first_index);
	struct folio *folio;

	rcu_read_lock();
	xas_for_each(&xas, folio, last_index) {
		unsigned long nr_pages;
		pgoff_t folio_first_index, folio_last_index;

		if (xas_retry(&xas, folio))
			continue;

		if (xa_is_value(folio)) {
			/* page is evicted */
			void *shadow = (void *)folio;
			bool workingset; /* not used */
			int order = xa_get_order(xas.xa, xas.xa_index);

			nr_pages = 1 << order;
			folio_first_index = round_down(xas.xa_index, 1 << order);
			folio_last_index = folio_first_index + nr_pages - 1;

			/* Folios might straddle the range boundaries, only count covered pages */
			if (folio_first_index < first_index)
				nr_pages -= first_index - folio_first_index;

			if (folio_last_index > last_index)
				nr_pages -= folio_last_index - last_index;

			cs->nr_evicted += nr_pages;

#ifdef CONFIG_SWAP /* implies CONFIG_MMU */
			if (shmem_mapping(mapping)) {
				/* shmem file - in swap cache */
				swp_entry_t swp = radix_to_swp_entry(folio);

				shadow = get_shadow_from_swap_cache(swp);
			}
#endif
			if (workingset_test_recent(shadow, true, &workingset))
				cs->nr_recently_evicted += nr_pages;

			goto resched;
		}

		nr_pages = folio_nr_pages(folio);
		folio_first_index = folio_pgoff(folio);
		folio_last_index = folio_first_index + nr_pages - 1;

		/* Folios might straddle the range boundaries, only count covered pages */
		if (folio_first_index < first_index)
			nr_pages -= first_index - folio_first_index;

		if (folio_last_index > last_index)
			nr_pages -= folio_last_index - last_index;

		/* page is in cache */
		cs->nr_cache += nr_pages;

		if (folio_test_dirty(folio))
			cs->nr_dirty += nr_pages;

		if (folio_test_writeback(folio))
			cs->nr_writeback += nr_pages;

resched:
		if (need_resched()) {
			xas_pause(&xas);
			cond_resched_rcu();
		}
	}
	rcu_read_unlock();
}
/*
 * The cachestat(2) system call.
 *
 * cachestat() returns the page cache statistics of a file in the
 * bytes range specified by `off` and `len`: number of cached pages,
 * number of dirty pages, number of pages marked for writeback,
 * number of evicted pages, and number of recently evicted pages.
 *
 * An evicted page is a page that was previously in the page cache
 * but has been evicted since. A page is recently evicted if its last
 * eviction was recent enough that its reentry to the cache would
 * indicate that it is actively being used by the system, and that
 * there is memory pressure on the system.
 *
 * `off` and `len` must be non-negative integers. If `len` > 0,
 * the queried range is [`off`, `off` + `len`]. If `len` == 0,
 * we will query in the range from `off` to the end of the file.
 *
 * The `flags` argument is unused for now, but is included for future
 * extensibility. Users should pass 0 (i.e. no flags specified).
 *
 * Currently, hugetlbfs is not supported.
 *
 * Because the status of a page can change after cachestat() checks it
 * but before it returns to the application, the returned values may
 * contain stale information.
 *
 * return values:
 *  zero		- success
 *  -EFAULT	- cstat or cstat_range points to an illegal address
 *  -EINVAL	- invalid flags
 *  -EBADF	- invalid file descriptor
 *  -EOPNOTSUPP	- file descriptor is of a hugetlbfs file
 */
SYSCALL_DEFINE4(cachestat, unsigned int, fd,
		struct cachestat_range __user *, cstat_range,
		struct cachestat __user *, cstat, unsigned int, flags)
{
	struct fd f = fdget(fd);
	struct address_space *mapping;
	struct cachestat_range csr;
	struct cachestat cs;
	pgoff_t first_index, last_index;

	if (!f.file)
		return -EBADF;

	if (copy_from_user(&csr, cstat_range,
			sizeof(struct cachestat_range))) {
		fdput(f);
		return -EFAULT;
	}

	/* hugetlbfs is not supported */
	if (is_file_hugepages(f.file)) {
		fdput(f);
		return -EOPNOTSUPP;
	}

	if (flags != 0) {
		fdput(f);
		return -EINVAL;
	}

	first_index = csr.off >> PAGE_SHIFT;
	last_index =
		csr.len == 0 ? ULONG_MAX : (csr.off + csr.len - 1) >> PAGE_SHIFT;
	memset(&cs, 0, sizeof(struct cachestat));
	mapping = f.file->f_mapping;
	filemap_cachestat(mapping, first_index, last_index, &cs);
	fdput(f);

	if (copy_to_user(cstat, &cs, sizeof(struct cachestat)))
		return -EFAULT;

	return 0;
}
#endif /* CONFIG_CACHESTAT_SYSCALL */