2019-05-19 13:08:55 +01:00
// SPDX-License-Identifier: GPL-2.0-only
2005-04-16 15:20:36 -07:00
/*
 * fs/fs-writeback.c
 *
 * Copyright (C) 2002, Linus Torvalds.
 *
 * Contains all the functions related to writing back and waiting
 * upon dirty inodes against superblocks, and writing back dirty
 * pages against inodes.  ie: data writeback.  Writeout of the
 * inode itself is not handled here.
 *
2008-10-15 22:01:59 -07:00
 * 10Apr2002	Andrew Morton
2005-04-16 15:20:36 -07:00
 *		Split out of fs/inode.c
 *		Additions for address_space-based writeback
 */
#include <linux/kernel.h>
2011-11-16 23:57:37 -05:00
#include <linux/export.h>
2005-04-16 15:20:36 -07:00
#include <linux/spinlock.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
#include <linux/slab.h>
2005-04-16 15:20:36 -07:00
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/mm.h>
2012-01-07 20:41:55 -06:00
#include <linux/pagemap.h>
2009-09-09 09:08:54 +02:00
#include <linux/kthread.h>
2005-04-16 15:20:36 -07:00
#include <linux/writeback.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
2010-07-07 13:24:06 +10:00
#include <linux/tracepoint.h>
2013-09-29 11:24:49 -04:00
#include <linux/device.h>
2015-05-28 14:50:49 -04:00
#include <linux/memcontrol.h>
2006-09-30 20:52:18 +02:00
#include "internal.h"
2005-04-16 15:20:36 -07:00
2012-01-07 20:41:55 -06:00
/*
 * 4MB minimal write chunk size
 */
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
#define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_SHIFT - 10))
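The arithmetic behind MIN_WRITEBACK_PAGES can be checked by hand: 4096UL is the chunk size in KiB, and shifting right by (PAGE_SHIFT - 10) divides it by the page size in KiB. The following is a minimal userspace sketch of that calculation. The helpers `demo_min_writeback_pages()` and `demo_pages_to_bytes()` are hypothetical and take the page shift as a parameter; they are not kernel functions.

```c
#include <assert.h>

/*
 * Hypothetical helper: same expression as MIN_WRITEBACK_PAGES, but with
 * the page shift passed in so different page sizes can be compared.
 */
static unsigned long demo_min_writeback_pages(unsigned int page_shift)
{
	assert(page_shift >= 10);	/* pages are assumed to be at least 1 KiB */
	return 4096UL >> (page_shift - 10);
}

/* Hypothetical helper: convert a page count back to bytes. */
static unsigned long demo_pages_to_bytes(unsigned long pages,
					 unsigned int page_shift)
{
	return pages << page_shift;
}
```

With 4 KiB pages (shift 12) the helper yields 1024 pages, and with 64 KiB pages (shift 16) it yields 64 pages. Both correspond to the same 4 MiB chunk.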
2012-01-07 20:41:55 -06:00
2009-09-16 15:18:25 +02:00
/*
 * Passed into wb_writeback(), essentially a subset of writeback_control
 */
2010-07-06 08:59:53 +02:00
struct wb_writeback_work {
2009-09-16 15:18:25 +02:00
	long nr_pages;
	struct super_block *sb;
	enum writeback_sync_modes sync_mode;
2010-06-06 10:38:15 -06:00
	unsigned int tagged_writepages:1;
2010-04-01 20:36:30 -05:00
	unsigned int for_kupdate:1;
	unsigned int range_cyclic:1;
	unsigned int for_background:1;
2013-07-02 22:38:35 +10:00
	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
2015-05-22 17:13:57 -04:00
	unsigned int auto_free:1;	/* free on completion */
2011-10-07 21:54:10 -06:00
	enum wb_reason reason;		/* why was writeback initiated? */
2009-09-16 15:18:25 +02:00
2009-09-15 20:04:57 +02:00
	struct list_head list;		/* pending work list */
2015-05-22 17:13:58 -04:00
	struct wb_completion *done;	/* set if the caller waits */
2009-09-09 09:08:54 +02:00
};
2015-03-17 12:23:19 -04:00
/*
 * If an inode is constantly having its pages dirtied, but then the
 * updates stop dirtytime_expire_interval seconds in the past, it's
 * possible for the worst case time between when an inode has its
 * timestamps updated and when they finally get written out to be two
 * dirtytime_expire_intervals.  We set the default to 12 hours (in
 * seconds), which means most of the time inodes will have their
 * timestamps written to disk after 12 hours, but in the worst case a
 * few inodes might not have their timestamps updated for 24 hours.
 */
unsigned int dirtytime_expire_interval = 12 * 60 * 60;
2010-10-21 11:49:30 +11:00
static inline struct inode *wb_inode(struct list_head *head)
{
2015-03-04 14:07:22 -05:00
	return list_entry(head, struct inode, i_io_list);
2010-10-21 11:49:30 +11:00
}
2012-01-17 11:18:56 -06:00
/*
 * Include the creation of the trace points after defining the
 * wb_writeback_work structure and inline functions so that the definition
 * remains local to this file.
 */
#define CREATE_TRACE_POINTS
#include <trace/events/writeback.h>
2014-02-06 15:47:47 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage);
2015-05-22 17:13:45 -04:00
static bool wb_io_lists_populated(struct bdi_writeback *wb)
{
	if (wb_has_dirty_io(wb)) {
		return false;
	} else {
		set_bit(WB_has_dirty_io, &wb->state);
2015-05-22 17:13:47 -04:00
		WARN_ON_ONCE(!wb->avg_write_bandwidth);
2015-05-22 17:13:46 -04:00
		atomic_long_add(wb->avg_write_bandwidth,
				&wb->bdi->tot_write_bandwidth);
2015-05-22 17:13:45 -04:00
		return true;
	}
}

static void wb_io_lists_depopulated(struct bdi_writeback *wb)
{
	if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
2015-05-22 17:13:46 -04:00
	    list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
2015-05-22 17:13:45 -04:00
		clear_bit(WB_has_dirty_io, &wb->state);
2015-05-22 17:13:47 -04:00
		WARN_ON_ONCE(atomic_long_sub_return(wb->avg_write_bandwidth,
					&wb->bdi->tot_write_bandwidth) < 0);
2015-05-22 17:13:46 -04:00
	}
2015-05-22 17:13:45 -04:00
}
/**
2015-03-04 14:07:22 -05:00
 * inode_io_list_move_locked - move an inode onto a bdi_writeback IO list
2015-05-22 17:13:45 -04:00
 * @inode: inode to be moved
 * @wb: target bdi_writeback
2017-12-05 07:23:19 -05:00
 * @head: one of @wb->b_{dirty|io|more_io|dirty_time}
2015-05-22 17:13:45 -04:00
 *
2015-03-04 14:07:22 -05:00
 * Move @inode->i_io_list to @head of @wb and set %WB_has_dirty_io.
2015-05-22 17:13:45 -04:00
 * Returns %true if @inode is the first occupant of the !dirty_time IO
 * lists; otherwise, %false.
 */
2015-03-04 14:07:22 -05:00
static bool inode_io_list_move_locked(struct inode *inode,
2015-05-22 17:13:45 -04:00
				      struct bdi_writeback *wb,
				      struct list_head *head)
{
	assert_spin_locked(&wb->list_lock);
2022-05-24 08:05:40 -07:00
	assert_spin_locked(&inode->i_lock);
2022-12-12 12:36:33 +01:00
	WARN_ON_ONCE(inode->i_state & I_FREEING);
2015-05-22 17:13:45 -04:00
2015-03-04 14:07:22 -05:00
	list_move(&inode->i_io_list, head);
2015-05-22 17:13:45 -04:00
	/* dirty_time doesn't count as dirty_io until expiration */
	if (head != &wb->b_dirty_time)
		return wb_io_lists_populated(wb);

	wb_io_lists_depopulated(wb);
	return false;
}
writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.
This patch moves bdi->wb_lock and ->worklist into wb.
* The lock protects bdi->worklist and bdi->wb.dwork scheduling. While
moving, rename it to wb->work_lock as wb->wb_lock is confusing.
Also, move wb->dwork downwards so that it's colocated with the new
->work_lock and ->work_list fields.
* bdi_writeback_workfn() -> wb_workfn()
bdi_wakeup_thread_delayed(bdi) -> wb_wakeup_delayed(wb)
bdi_wakeup_thread(bdi) -> wb_wakeup(wb)
bdi_queue_work(bdi, ...) -> wb_queue_work(wb, ...)
__bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...)
get_next_work_item(bdi) -> get_next_work_item(wb)
* bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
The function contained parts which belong to the containing bdi
rather than the wb itself - testing cap_writeback_dirty and
bdi_remove_from_list() invocation. Those are moved to
bdi_unregister().
* bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
Initializations of the moved bdi->wb_lock and ->work_list are
relocated from bdi_init() to wb_init().
* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->state are mechanically replaced with bdi->wb.state
introducing no behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 17:13:30 -04:00
static void wb_wakeup(struct bdi_writeback *wb)
2014-04-03 14:46:23 -07:00
{
2022-08-01 08:50:34 -07:00
	spin_lock_irq(&wb->work_lock);
2015-05-22 17:13:30 -04:00
	if (test_bit(WB_registered, &wb->state))
		mod_delayed_work(bdi_wq, &wb->dwork, 0);
2022-08-01 08:50:34 -07:00
	spin_unlock_irq(&wb->work_lock);
2014-04-03 14:46:23 -07:00
}
2017-03-10 12:09:49 -08:00
static void finish_writeback_work(struct bdi_writeback *wb,
				  struct wb_writeback_work *work)
{
	struct wb_completion *done = work->done;

	if (work->auto_free)
		kfree(work);
2019-10-06 17:58:09 -07:00
	if (done) {
		wait_queue_head_t *waitq = done->waitq;

		/* @done can't be accessed after the following dec */
		if (atomic_dec_and_test(&done->cnt))
			wake_up_all(waitq);
	}
2017-03-10 12:09:49 -08:00
}
2015-05-22 17:13:30 -04:00
static void wb_queue_work(struct bdi_writeback *wb,
			  struct wb_writeback_work *work)
2011-01-13 15:45:44 -08:00
{
2015-08-18 14:54:56 -07:00
	trace_writeback_queue(wb, work);
2011-01-13 15:45:44 -08:00
2015-05-22 17:13:58 -04:00
	if (work->done)
		atomic_inc(&work->done->cnt);
2017-03-10 12:09:49 -08:00
2022-08-01 08:50:34 -07:00
	spin_lock_irq(&wb->work_lock);
2017-03-10 12:09:49 -08:00
	if (test_bit(WB_registered, &wb->state)) {
		list_add_tail(&work->list, &wb->work_list);
		mod_delayed_work(bdi_wq, &wb->dwork, 0);
	} else
		finish_writeback_work(wb, work);
2022-08-01 08:50:34 -07:00
	spin_unlock_irq(&wb->work_lock);
2005-04-16 15:20:36 -07:00
}
2015-05-22 17:13:58 -04:00
/**
 * wb_wait_for_completion - wait for completion of bdi_writeback_works
 * @done: target wb_completion
 *
 * Wait for one or more work items issued to @bdi with their ->done field
2019-08-26 09:06:52 -07:00
 * set to @done, which should have been initialized with
 * DEFINE_WB_COMPLETION().  This function returns after all such work items
 * are completed.  Work items which are waited upon aren't freed
2015-05-22 17:13:58 -04:00
 * automatically on completion.
 */
2019-08-26 09:06:52 -07:00
void wb_wait_for_completion(struct wb_completion *done)
2015-05-22 17:13:58 -04:00
{
	atomic_dec(&done->cnt);		/* put down the initial count */
2019-08-26 09:06:52 -07:00
	wait_event(*done->waitq, !atomic_read(&done->cnt));
2015-05-22 17:13:58 -04:00
}
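The counting protocol shared by wb_queue_work(), finish_writeback_work() and wb_wait_for_completion() can be sketched in userspace with C11 atomics. This is an illustration only: `struct demo_completion` and the `demo_*` functions are hypothetical names, the initial count is assumed to be 1 (as suggested by the "put down the initial count" comment), and the wait queue is replaced by a boolean return value that says whether a wakeup would be issued or whether the waiter could return immediately.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct wb_completion, without the wait queue. */
struct demo_completion {
	atomic_int cnt;
};

/* Assumption: the count starts at 1, held on behalf of the future waiter. */
static void demo_completion_init(struct demo_completion *done)
{
	atomic_init(&done->cnt, 1);
}

/* Queueing a work item takes one count, as wb_queue_work() does. */
static void demo_queue(struct demo_completion *done)
{
	atomic_fetch_add(&done->cnt, 1);
}

/* Finishing a work item drops one count; true means "wake the waiter". */
static bool demo_finish(struct demo_completion *done)
{
	return atomic_fetch_sub(&done->cnt, 1) == 1;
}

/* The waiter drops the initial count; true means nothing is pending. */
static bool demo_wait_begin(struct demo_completion *done)
{
	return atomic_fetch_sub(&done->cnt, 1) == 1;
}

/* Walk through one ordering: two items queued, one finishes early. */
static int demo_completion_run(void)
{
	struct demo_completion done;

	demo_completion_init(&done);
	demo_queue(&done);
	demo_queue(&done);

	if (demo_finish(&done))		/* initial count still held: no wakeup */
		return 0;
	if (demo_wait_begin(&done))	/* one item still pending: must wait */
		return 0;
	return demo_finish(&done) ? 1 : 0;	/* last item wakes the waiter */
}
```

Holding an initial count until the waiter arrives means that work items finishing early can never drive the counter to zero before the waiter has started waiting.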
2015-05-22 17:13:44 -04:00
#ifdef CONFIG_CGROUP_WRITEBACK
2019-08-15 12:25:28 -07:00
/*
 * Parameters for foreign inode detection, see wbc_detach_inode() to see
 * how they're used.
 *
 * These parameters are inherently heuristical as the detection target
 * itself is fuzzy.  All we want to do is detach an inode from the
 * current owner if it's being written to by some other cgroups too much.
 *
 * The current cgroup writeback is built on the assumption that multiple
 * cgroups writing to the same inode concurrently is very rare and a mode
 * of operation which isn't well supported.  As such, the goal is not
 * taking too long when a different cgroup takes over an inode while
 * avoiding too aggressive flip-flops from occasional foreign writes.
 *
 * We record, very roughly, 2s worth of IO time history and if more than
 * half of that is foreign, trigger the switch.  The recording is quantized
 * to 16 slots.  To avoid tiny writes from swinging the decision too much,
 * writes smaller than 1/8 of avg size are ignored.
 */
2015-05-28 14:50:51 -04:00
#define WB_FRN_TIME_SHIFT	13	/* 1s = 2^13, up to 8 secs w/ 16bit */
#define WB_FRN_TIME_AVG_SHIFT	3	/* avg = avg * 7/8 + new * 1/8 */
2019-08-15 12:25:28 -07:00
#define WB_FRN_TIME_CUT_DIV	8	/* ignore rounds < avg / 8 */
2015-05-28 14:50:51 -04:00
#define WB_FRN_TIME_PERIOD	(2 * (1 << WB_FRN_TIME_SHIFT))	/* 2s */
#define WB_FRN_HIST_SLOTS	16	/* inode->i_wb_frn_history is 16bit */
#define WB_FRN_HIST_UNIT	(WB_FRN_TIME_PERIOD / WB_FRN_HIST_SLOTS)
					/* each slot's duration is 2s / 16 */
#define WB_FRN_HIST_THR_SLOTS	(WB_FRN_HIST_SLOTS / 2)
					/* if foreign slots >= 8, switch */
#define WB_FRN_HIST_MAX_SLOTS	(WB_FRN_HIST_THR_SLOTS / 2 + 1)
					/* one round can affect up to 5 slots */
2019-08-02 12:08:13 -07:00
#define WB_FRN_MAX_IN_FLIGHT	1024	/* don't queue too many concurrently */
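The values that follow from the WB_FRN_* definitions can be worked out explicitly. The sketch below mirrors them under `DEMO_*` names (hypothetical, for illustration). `demo_ms_to_frn_units()` and `demo_slots_for()` are also hypothetical: the latter assumes a round's duration is rounded up to whole history slots and capped at the per-round maximum, which is one plausible reading of the "one round can affect up to 5 slots" comment rather than a quote of the kernel's code.

```c
#include <assert.h>

/* DEMO_* mirror the WB_FRN_* definitions; the names are illustrative. */
enum {
	DEMO_TIME_SHIFT     = 13,				/* 1s = 8192 units */
	DEMO_TIME_PERIOD    = 2 * (1 << DEMO_TIME_SHIFT),	/* 16384 units = 2s */
	DEMO_HIST_SLOTS     = 16,
	DEMO_HIST_UNIT      = DEMO_TIME_PERIOD / DEMO_HIST_SLOTS, /* 1024 = 1/8 s */
	DEMO_HIST_THR_SLOTS = DEMO_HIST_SLOTS / 2,		/* 8 */
	DEMO_HIST_MAX_SLOTS = DEMO_HIST_THR_SLOTS / 2 + 1,	/* 5 */
	DEMO_MAX_SECS_16BIT = (1 << 16) >> DEMO_TIME_SHIFT,	/* 8 */
};

/* Convert milliseconds to the 2^13-per-second time unit. */
static unsigned long demo_ms_to_frn_units(unsigned long ms)
{
	return (ms << DEMO_TIME_SHIFT) / 1000;
}

/*
 * Assumed usage: round the duration up to whole slots, then cap it so a
 * single round cannot dominate the 16-slot history.
 */
static unsigned int demo_slots_for(unsigned long units)
{
	unsigned long slots = (units + DEMO_HIST_UNIT - 1) / DEMO_HIST_UNIT;

	if (slots > DEMO_HIST_MAX_SLOTS)
		slots = DEMO_HIST_MAX_SLOTS;
	return (unsigned int)slots;
}
```

Each history slot therefore covers 1/8 of a second, and a 16-bit field in these units can represent just under 8 seconds, which matches the "up to 8 secs w/ 16bit" note.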
2015-05-28 14:50:51 -04:00
2021-06-28 19:36:03 -07:00
/*
 * Maximum inodes per isw.  A specific value has been chosen to make
 * struct inode_switch_wbs_context fit into 1024 bytes kmalloc.
 */
#define WB_MAX_INODES_PER_ISW  ((1024UL - sizeof(struct inode_switch_wbs_context)) \
				/ sizeof(struct inode *))
2016-02-29 18:28:53 -05:00
static atomic_t isw_nr_in_flight = ATOMIC_INIT(0);
static struct workqueue_struct *isw_wq;
2023-01-16 19:25:06 +00:00
void __inode_attach_wb(struct inode *inode, struct folio *folio)
2015-05-28 14:50:49 -04:00
{
	struct backing_dev_info *bdi = inode_to_bdi(inode);
	struct bdi_writeback *wb = NULL;

	if (inode_cgwb_enabled(inode)) {
		struct cgroup_subsys_state *memcg_css;

2023-01-16 19:25:06 +00:00
		if (folio) {
2023-01-16 19:25:07 +00:00
			memcg_css = mem_cgroup_css_from_folio(folio);
2015-05-28 14:50:49 -04:00
			wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
		} else {
			/* must pin memcg_css, see wb_get_create() */
			memcg_css = task_get_css(current, memory_cgrp_id);
			wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
			css_put(memcg_css);
		}
	}

	if (!wb)
		wb = &bdi->wb;

	/*
	 * There may be multiple instances of this function racing to
	 * update the same inode.  Use cmpxchg() to tell the winner.
	 */
	if (unlikely(cmpxchg(&inode->i_wb, NULL, wb)))
		wb_put(wb);
}
2019-06-27 13:39:48 -07:00
EXPORT_SYMBOL_GPL(__inode_attach_wb);
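The "use cmpxchg() to tell the winner" step at the end of __inode_attach_wb() can be illustrated in userspace with a C11 compare-and-exchange. This is a minimal sketch, not the kernel implementation: `struct demo_wb`, `demo_attach()` and the plain integer reference count are hypothetical stand-ins for bdi_writeback, the attach path and wb_put().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bdi_writeback with a toy refcount. */
struct demo_wb {
	int refcnt;
};

/*
 * Try to install @wb into an empty slot.  Only the caller that observes
 * NULL wins; a loser drops the reference it took for the attachment.
 */
static bool demo_attach(_Atomic(struct demo_wb *) *slot, struct demo_wb *wb)
{
	struct demo_wb *expected = NULL;

	if (atomic_compare_exchange_strong(slot, &expected, wb))
		return true;	/* winner: the slot now owns the reference */

	wb->refcnt--;		/* loser: mirrors the wb_put() in the original */
	return false;
}

/* Two attach attempts on the same slot; only the first one sticks. */
static int demo_attach_run(void)
{
	struct demo_wb a = { 1 }, b = { 1 };
	_Atomic(struct demo_wb *) i_wb = NULL;

	if (!demo_attach(&i_wb, &a))
		return 0;
	if (demo_attach(&i_wb, &b))
		return 0;
	return atomic_load(&i_wb) == &a && a.refcnt == 1 && b.refcnt == 0;
}
```

The design choice here is that racing callers do not need a lock: each prepares its candidate, and the compare-and-exchange decides atomically which candidate is kept.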
2015-05-28 14:50:49 -04:00
2021-06-28 19:35:53 -07:00
/**
 * inode_cgwb_move_to_attached - put the inode onto wb->b_attached list
 * @inode: inode of interest with i_lock held
 * @wb: target bdi_writeback
 *
 * Remove the inode from wb's io lists and, if necessary, put it onto the
 * b_attached list.  Only inodes attached to cgwb's are kept on this list.
 */
static void inode_cgwb_move_to_attached(struct inode *inode,
					struct bdi_writeback *wb)
{
	assert_spin_locked(&wb->list_lock);
	assert_spin_locked(&inode->i_lock);
2022-12-12 12:36:33 +01:00
	WARN_ON_ONCE(inode->i_state & I_FREEING);
2021-06-28 19:35:53 -07:00
	inode->i_state &= ~I_SYNC_QUEUED;
	if (wb != &wb->bdi->wb)
		list_move(&inode->i_io_list, &wb->b_attached);
	else
		list_del_init(&inode->i_io_list);
	wb_io_lists_depopulated(wb);
}
2015-05-28 14:50:52 -04:00
/**
 * locked_inode_to_wb_and_lock_list - determine a locked inode's wb and lock it
 * @inode: inode of interest with i_lock held
 *
 * Returns @inode's wb with its list_lock held.  @inode->i_lock must be
 * held on entry and is released on return.  The returned wb is guaranteed
 * to stay @inode's associated wb until its list_lock is released.
 */
static struct bdi_writeback *
locked_inode_to_wb_and_lock_list(struct inode *inode)
	__releases(&inode->i_lock)
	__acquires(&wb->list_lock)
{
	while (true) {
		struct bdi_writeback *wb = inode_to_wb(inode);

		/*
		 * inode_to_wb() association is protected by both
		 * @inode->i_lock and @wb->list_lock but list_lock nests
		 * outside i_lock.  Drop i_lock and verify that the
		 * association hasn't changed after acquiring list_lock.
		 */
		wb_get(wb);
		spin_unlock(&inode->i_lock);
		spin_lock(&wb->list_lock);
2015-05-28 14:50:55 -04:00
		/* i_wb may have changed in between, can't use inode_to_wb() */
2016-03-18 13:50:03 -04:00
		if (likely(wb == inode->i_wb)) {
			wb_put(wb);	/* @inode already has ref */
			return wb;
		}
2015-05-28 14:50:52 -04:00
		spin_unlock(&wb->list_lock);
2016-03-18 13:50:03 -04:00
		wb_put(wb);
2015-05-28 14:50:52 -04:00
		cpu_relax();
		spin_lock(&inode->i_lock);
	}
}
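The "drop the inner lock, take the outer lock, then re-check" loop in locked_inode_to_wb_and_lock_list() can be sketched in userspace. This is an illustration under stated assumptions: `struct demo_wb`, `struct demo_inode` and every `demo_*` function are hypothetical, spinlocks are modelled with C11 `atomic_flag`, and the reference count is a plain atomic counter rather than the kernel's wb_get()/wb_put().

```c
#include <assert.h>
#include <stdatomic.h>

struct demo_wb {
	atomic_flag list_lock;		/* outer lock */
	atomic_int refcnt;
};

struct demo_inode {
	atomic_flag i_lock;		/* inner lock */
	struct demo_wb *_Atomic i_wb;	/* current association */
};

static void demo_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;			/* spin until the flag was clear */
}

static void demo_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Called with i_lock held; returns with only wb->list_lock held. */
static struct demo_wb *demo_inode_to_wb_and_lock(struct demo_inode *inode)
{
	for (;;) {
		struct demo_wb *wb = atomic_load(&inode->i_wb);

		atomic_fetch_add(&wb->refcnt, 1);	/* pin wb across the gap */
		demo_unlock(&inode->i_lock);
		demo_lock(&wb->list_lock);

		if (wb == atomic_load(&inode->i_wb)) {
			atomic_fetch_sub(&wb->refcnt, 1);	/* inode holds its own ref */
			return wb;
		}

		/* The association changed while unlocked: undo and retry. */
		demo_unlock(&wb->list_lock);
		atomic_fetch_sub(&wb->refcnt, 1);
		demo_lock(&inode->i_lock);
	}
}

/* Single-threaded walk through the uncontended path. */
static int demo_lock_run(void)
{
	struct demo_wb wb;
	struct demo_inode inode;
	struct demo_wb *got;

	atomic_flag_clear(&wb.list_lock);
	atomic_init(&wb.refcnt, 1);
	atomic_flag_clear(&inode.i_lock);
	atomic_init(&inode.i_wb, &wb);

	demo_lock(&inode.i_lock);
	got = demo_inode_to_wb_and_lock(&inode);

	return got == &wb &&
	       atomic_load(&wb.refcnt) == 1 &&
	       atomic_flag_test_and_set(&wb.list_lock) &&	/* still held */
	       !atomic_flag_test_and_set(&inode.i_lock);	/* was released */
}
```

The temporary reference matters because, once the inner lock is dropped, nothing else in this sketch keeps the wb alive while the caller waits for the outer lock. The single-threaded run only exercises the fast path; the retry branch needs a concurrent change of the association to trigger.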
/**
 * inode_to_wb_and_lock_list - determine an inode's wb and lock it
 * @inode: inode of interest
 *
 * Same as locked_inode_to_wb_and_lock_list() but @inode->i_lock isn't held
 * on entry.
 */
static struct bdi_writeback *inode_to_wb_and_lock_list(struct inode *inode)
	__acquires(&wb->list_lock)
{
	spin_lock(&inode->i_lock);
	return locked_inode_to_wb_and_lock_list(inode);
}
writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
The mechanism for detecting whether an inode should switch its wb
(bdi_writeback) association is now in place. This patch build the
framework for the actual switching.
This patch adds a new inode flag I_WB_SWITCHING, which has two
functions. First, the easy one, it ensures that there's only one
switching in progress for a give inode. Second, it's used as a
mechanism to synchronize wb stat updates.
The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
but track the current number of dirty pages and pages under writeback
respectively. As such, when an inode is moved from one wb to another,
the inode's portion of those stats have to be transferred together;
unfortunately, this is a bit tricky as those stat updates are percpu
operations which are performed without holding any lock in some
places.
This patch solves the problem in a similar way as memcg. Each such
lockless stat updates are wrapped in transaction surrounded by
unlocked_inode_to_wb_begin/end(). During normal operation, they map
to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
mapping->tree_lock is grabbed across the transaction.
In turn, the switching path sets I_WB_SWITCHING and waits for a RCU
grace period to pass before actually starting to switch, which
guarantees that all stat update paths are synchronizing against
mapping->tree_lock.
This patch still doesn't implement the actual switching.
v3: Updated on top of the recent cancel_dirty_page() updates.
unlocked_inode_to_wb_begin() now nests inside
mem_cgroup_begin_page_stat() to match the locking order.
v2: The i_wb access transaction will be used for !stat accesses too.
Function names and comments updated accordingly.
s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
s/switch_wb/switch_wbs/
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-28 14:50:53 -04:00
struct inode_switch_wbs_context {
2021-06-28 19:35:50 -07:00
	struct rcu_work work;
2021-06-28 19:35:59 -07:00
	/*
	 * Multiple inodes can be switched at once.  The switching procedure
	 * consists of two parts, separated by an RCU grace period.  To make
	 * sure that the second part is executed for each inode gone through
	 * the first part, all inode pointers are placed into a NULL-terminated
	 * array embedded into struct inode_switch_wbs_context.  Otherwise
	 * an inode could be left in an inconsistent state.
	 */
	struct bdi_writeback *new_wb;
	struct inode *inodes[];
2015-05-28 14:50:53 -04:00
};
2017-12-12 08:38:30 -08:00
static void bdi_down_write_wb_switch_rwsem(struct backing_dev_info *bdi)
{
	down_write(&bdi->wb_switch_rwsem);
}

static void bdi_up_write_wb_switch_rwsem(struct backing_dev_info *bdi)
{
	up_write(&bdi->wb_switch_rwsem);
}
2021-06-28 19:35:59 -07:00
static bool inode_do_switch_wbs(struct inode *inode,
				struct bdi_writeback *old_wb,
2021-06-28 19:35:56 -07:00
				struct bdi_writeback *new_wb)
2015-05-28 14:50:53 -04:00
{
2015-05-28 14:50:56 -04:00
	struct address_space *mapping = inode->i_mapping;
2017-12-04 10:46:23 -05:00
	XA_STATE(xas, &mapping->i_pages, 0);
2021-03-19 08:58:36 -04:00
	struct folio *folio;
2015-05-28 14:50:56 -04:00
	bool switched = false;
2015-05-28 14:50:53 -04:00
	spin_lock(&inode->i_lock);
2018-04-10 16:36:56 -07:00
	xa_lock_irq(&mapping->i_pages);
2015-05-28 14:50:56 -04:00
	/*
writeback, cgroup: do not switch inodes with I_WILL_FREE flag
Patch series "cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups", v9.
When an inode is getting dirty for the first time it's associated with a
wb structure (see __inode_attach_wb()). It can later be switched to
another wb (if e.g. some other cgroup is writing a lot of data to the
same inode), but otherwise stays attached to the original wb until being
reclaimed.
The problem is that the wb structure holds a reference to the original
memory and blkcg cgroups. So if an inode has been dirty once and later is
actively used in read-only mode, it has a good chance to pin down the
original memory and blkcg cgroups forever. This is often the case with
services bringing data for other services, e.g. updating some rpm
packages.
In real life it becomes a problem due to the large size of the memcg
structure, which can easily be 1000x larger than an inode. Also a really
large number of dying cgroups can raise different scalability issues, e.g.
making the memory reclaim costly and less effective.
To solve the problem inodes should be eventually detached from the
corresponding writeback structure. It's inefficient to do it after every
writeback completion. Instead it can be done whenever the original memory
cgroup is offlined and writeback structure is getting killed. Scanning
over a (potentially long) list of inodes and detaching them from the
writeback structure can take quite some time. To avoid scanning all
inodes, attached inodes are kept on a new list (b_attached). To make it
less noticeable to a user, the scanning and switching is performed from a
work context.
Big thanks to Jan Kara, Dennis Zhou, Hillf Danton and Tejun Heo for their
ideas and contribution to this patchset.
This patch (of 8):
If an inode's state has I_WILL_FREE flag set, the inode will be freed
soon, so there is no point in trying to switch the inode to a different
cgwb.
I_WILL_FREE was ignored since the introduction of the inode switching, so
it looks like it doesn't lead to any noticeable issues for a user. This
is why the patch is not intended for a stable backport.
Link: https://lkml.kernel.org/r/20210608230225.2078447-1-guro@fb.com
Link: https://lkml.kernel.org/r/20210608230225.2078447-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-28 19:35:41 -07:00
	 * Once I_FREEING or I_WILL_FREE are visible under i_lock, the eviction
	 * path owns the inode and we shouldn't modify ->i_io_list.
2015-05-28 14:50:56 -04:00
	 */
2021-06-28 19:35:41 -07:00
	if (unlikely(inode->i_state & (I_FREEING | I_WILL_FREE)))
2015-05-28 14:50:56 -04:00
		goto skip_switch;
2019-08-29 15:47:19 -07:00
	trace_inode_switch_wbs(inode, old_wb, new_wb);
2015-05-28 14:50:56 -04:00
	/*
	 * Count and transfer stats. Note that PAGECACHE_TAG_DIRTY points
2021-03-19 08:58:36 -04:00
	 * to possibly dirty folios while PAGECACHE_TAG_WRITEBACK points to
	 * folios actually under writeback.
2015-05-28 14:50:56 -04:00
	 */
2021-03-19 08:58:36 -04:00
	xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
		if (folio_test_dirty(folio)) {
			long nr = folio_nr_pages(folio);
			wb_stat_mod(old_wb, WB_RECLAIMABLE, -nr);
			wb_stat_mod(new_wb, WB_RECLAIMABLE, nr);
2015-05-28 14:50:56 -04:00
		}
	}
2017-12-04 10:46:23 -05:00
	xas_set(&xas, 0);
2021-03-19 08:58:36 -04:00
	xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_WRITEBACK) {
		long nr = folio_nr_pages(folio);
		WARN_ON_ONCE(!folio_test_writeback(folio));
		wb_stat_mod(old_wb, WB_WRITEBACK, -nr);
		wb_stat_mod(new_wb, WB_WRITEBACK, nr);
2015-05-28 14:50:56 -04:00
	}
2021-09-02 14:53:04 -07:00
	if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
		atomic_dec(&old_wb->writeback_inodes);
		atomic_inc(&new_wb->writeback_inodes);
	}
2015-05-28 14:50:56 -04:00
	wb_get(new_wb);
	/*
2021-06-28 19:35:53 -07:00
	 * Transfer to @new_wb's IO list if necessary. If the @inode is dirty,
	 * the specific list @inode was on is ignored and the @inode is put on
	 * ->b_dirty which is always correct including from ->b_dirty_time.
	 * The transfer preserves @inode->dirtied_when ordering. If the @inode
	 * was clean, it means it was on the b_attached list, so move it onto
	 * the b_attached list of @new_wb.
2015-05-28 14:50:56 -04:00
	 */
2015-03-04 14:07:22 -05:00
	if (!list_empty(&inode->i_io_list)) {
2015-05-28 14:50:56 -04:00
		inode->i_wb = new_wb;
2021-06-28 19:35:53 -07:00
		if (inode->i_state & I_DIRTY_ALL) {
			struct inode *pos;
			list_for_each_entry(pos, &new_wb->b_dirty, i_io_list)
				if (time_after_eq(inode->dirtied_when,
						  pos->dirtied_when))
					break;
			inode_io_list_move_locked(inode, new_wb,
						  pos->i_io_list.prev);
		} else {
			inode_cgwb_move_to_attached(inode, new_wb);
		}
2015-05-28 14:50:56 -04:00
	} else {
		inode->i_wb = new_wb;
	}
2015-05-28 14:50:53 -04:00
2015-05-28 14:50:56 -04:00
	/* ->i_wb_frn updates may race wbc_detach_inode() but doesn't matter */
2015-05-28 14:50:53 -04:00
	inode->i_wb_frn_winner = 0;
	inode->i_wb_frn_avg_time = 0;
	inode->i_wb_frn_history = 0;
2015-05-28 14:50:56 -04:00
	switched = true;
skip_switch:
2015-05-28 14:50:53 -04:00
	/*
	 * Paired with load_acquire in unlocked_inode_to_wb_begin() and
	 * ensures that the new wb is visible if they see !I_WB_SWITCH.
	 */
	smp_store_release(&inode->i_state, inode->i_state & ~I_WB_SWITCH);
2018-04-10 16:36:56 -07:00
	xa_unlock_irq(&mapping->i_pages);
2015-05-28 14:50:53 -04:00
	spin_unlock(&inode->i_lock);
2017-12-12 08:38:30 -08:00
2021-06-28 19:35:59 -07:00
	return switched;
2021-06-28 19:35:56 -07:00
}
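The IO-list transfer in inode_do_switch_wbs() walks ->b_dirty and re-inserts the inode so that dirtied_when ordering is preserved. The following user-space sketch illustrates that ordered insertion only. Every demo_* name is hypothetical, the list is a hand-rolled circular list rather than the kernel's list_head, and it assumes the list is kept newest-first from the head, as the loop above implies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct inode: only what the sketch needs. */
struct demo_inode {
	unsigned long dirtied_when;	/* jiffies-like timestamp, may wrap */
	struct demo_inode *prev, *next;	/* circular doubly linked list */
};

/*
 * Wraparound-tolerant "a is at or after b", the same idea as
 * time_after_eq(). Like the kernel idiom, this relies on the usual
 * two's-complement conversion from unsigned long to long.
 */
static int demo_time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* An empty list is a head that points at itself. */
static void demo_list_init(struct demo_inode *head)
{
	head->prev = head->next = head;
}

/* Link @node immediately after @prev. */
static void demo_list_add_after(struct demo_inode *node,
				struct demo_inode *prev)
{
	assert(node != NULL && prev != NULL);
	node->next = prev->next;
	node->prev = prev;
	prev->next->prev = node;
	prev->next = node;
}

/*
 * Insert @node into the newest-first list at @head. Stop at the first
 * entry that is not newer than @node and link @node in front of it.
 * If no such entry exists, the walk ends at @head and @node lands at
 * the tail.
 */
static void demo_insert_by_dirtied_when(struct demo_inode *node,
					struct demo_inode *head)
{
	struct demo_inode *pos;

	for (pos = head->next; pos != head; pos = pos->next)
		if (demo_time_after_eq(node->dirtied_when, pos->dirtied_when))
			break;
	demo_list_add_after(node, pos->prev);
}
```

Inserting timestamps 10, 30 and 20 in that order leaves the list as 30, 20, 10 from the head, so a consumer taking entries from the tail still sees the oldest inode first.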
2015-05-28 14:50:56 -04:00
2021-06-28 19:35:56 -07:00
static void inode_switch_wbs_work_fn(struct work_struct *work)
{
	struct inode_switch_wbs_context *isw =
		container_of(to_rcu_work(work), struct inode_switch_wbs_context, work);
2021-06-28 19:35:59 -07:00
	struct backing_dev_info *bdi = inode_to_bdi(isw->inodes[0]);
	struct bdi_writeback *old_wb = isw->inodes[0]->i_wb;
	struct bdi_writeback *new_wb = isw->new_wb;
	unsigned long nr_switched = 0;
	struct inode **inodep;
	/*
	 * If @inode switches cgwb membership while sync_inodes_sb() is
	 * being issued, sync_inodes_sb() might miss it. Synchronize.
	 */
	down_read(&bdi->wb_switch_rwsem);
	/*
	 * By the time control reaches here, RCU grace period has passed
	 * since I_WB_SWITCH assertion and all wb stat update transactions
	 * between unlocked_inode_to_wb_begin/end() are guaranteed to be
	 * synchronizing against the i_pages lock.
	 *
	 * Grabbing old_wb->list_lock, inode->i_lock and the i_pages lock
	 * gives us exclusion against all wb related operations on @inode
	 * including IO list manipulations and stat updates.
	 */
	if (old_wb < new_wb) {
		spin_lock(&old_wb->list_lock);
		spin_lock_nested(&new_wb->list_lock, SINGLE_DEPTH_NESTING);
	} else {
		spin_lock(&new_wb->list_lock);
		spin_lock_nested(&old_wb->list_lock, SINGLE_DEPTH_NESTING);
	}
	for (inodep = isw->inodes; *inodep; inodep++) {
		WARN_ON_ONCE((*inodep)->i_wb != old_wb);
		if (inode_do_switch_wbs(*inodep, old_wb, new_wb))
			nr_switched++;
	}
	spin_unlock(&new_wb->list_lock);
	spin_unlock(&old_wb->list_lock);
	up_read(&bdi->wb_switch_rwsem);
	if (nr_switched) {
		wb_wakeup(new_wb);
		wb_put_many(old_wb, nr_switched);
	}
2016-02-29 18:28:53 -05:00
2021-06-28 19:35:59 -07:00
	for (inodep = isw->inodes; *inodep; inodep++)
		iput(*inodep);
	wb_put(new_wb);
2021-06-28 19:35:56 -07:00
	kfree(isw);
2016-02-29 18:28:53 -05:00
	atomic_dec(&isw_nr_in_flight);
2015-05-28 14:50:53 -04:00
}
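inode_switch_wbs_work_fn() takes the two list_locks in address order, so two concurrent switches between the same pair of wbs cannot deadlock by locking them in opposite orders. Below is a minimal user-space sketch of that ordering rule. It assumes pthread mutexes as stand-ins for the kernel spinlocks, and all demo_* names are hypothetical rather than kernel API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for struct bdi_writeback. */
struct demo_wb {
	pthread_mutex_t list_lock;
	long nr_inodes;
};

/*
 * Lock both objects, always taking the lower address first. Every
 * caller then agrees on one global order, whichever object it treats
 * as "old" and whichever as "new".
 */
static void demo_lock_pair(struct demo_wb *a, struct demo_wb *b)
{
	assert(a != b);
	if ((uintptr_t)a < (uintptr_t)b) {
		pthread_mutex_lock(&a->list_lock);
		pthread_mutex_lock(&b->list_lock);
	} else {
		pthread_mutex_lock(&b->list_lock);
		pthread_mutex_lock(&a->list_lock);
	}
}

/* Release order does not matter for deadlock avoidance. */
static void demo_unlock_pair(struct demo_wb *a, struct demo_wb *b)
{
	pthread_mutex_unlock(&a->list_lock);
	pthread_mutex_unlock(&b->list_lock);
}

/* Move @nr inodes' worth of accounting from @from to @to. */
static void demo_move_inodes(struct demo_wb *from, struct demo_wb *to,
			     long nr)
{
	demo_lock_pair(from, to);
	from->nr_inodes -= nr;
	to->nr_inodes += nr;
	demo_unlock_pair(from, to);
}
```

Calling demo_move_inodes(&x, &y, 3) and demo_move_inodes(&y, &x, 1) from different threads is safe in this sketch because both calls acquire the locks in the same address order.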
2021-06-28 19:36:03 -07:00
static bool inode_prepare_wbs_switch(struct inode *inode,
				     struct bdi_writeback *new_wb)
{
	/*
	 * Paired with smp_mb() in cgroup_writeback_umount().
	 * isw_nr_in_flight must be increased before checking SB_ACTIVE and
	 * grabbing an inode, otherwise isw_nr_in_flight can be observed as 0
	 * in cgroup_writeback_umount() and the isw_wq will not be flushed.
	 */
	smp_mb();
2021-07-23 15:50:32 -07:00
	if (IS_DAX(inode))
		return false;
2021-06-28 19:36:03 -07:00
	/* while holding I_WB_SWITCH, no one else can update the association */
	spin_lock(&inode->i_lock);
	if (!(inode->i_sb->s_flags & SB_ACTIVE) ||
	    inode->i_state & (I_WB_SWITCH | I_FREEING | I_WILL_FREE) ||
	    inode_to_wb(inode) == new_wb) {
		spin_unlock(&inode->i_lock);
		return false;
	}
	inode->i_state |= I_WB_SWITCH;
	__iget(inode);
	spin_unlock(&inode->i_lock);
	return true;
}
2015-05-28 14:50:53 -04:00
/**
 * inode_switch_wbs - change the wb association of an inode
 * @inode: target inode
 * @new_wb_id: ID of the new wb
 *
 * Switch @inode's wb association to the wb identified by @new_wb_id. The
 * switching is performed asynchronously and may fail silently.
 */
static void inode_switch_wbs(struct inode *inode, int new_wb_id)
{
	struct backing_dev_info *bdi = inode_to_bdi(inode);
	struct cgroup_subsys_state *memcg_css;
	struct inode_switch_wbs_context *isw;
	/* noop if seems to be already in progress */
	if (inode->i_state & I_WB_SWITCH)
		return;
2019-08-02 12:08:13 -07:00
	/* avoid queueing a new switch if too many are already in flight */
	if (atomic_read(&isw_nr_in_flight) > WB_FRN_MAX_IN_FLIGHT)
2017-12-12 08:38:30 -08:00
		return;
2021-09-25 13:43:08 +02:00
	isw = kzalloc(struct_size(isw, inodes, 2), GFP_ATOMIC);
2015-05-28 14:50:53 -04:00
	if (!isw)
2019-08-02 12:08:13 -07:00
		return;
2015-05-28 14:50:53 -04:00
2021-06-28 19:35:47 -07:00
atomic_inc ( & isw_nr_in_flight ) ;
2015-05-28 14:50:53 -04:00
/* find and pin the new wb */
rcu_read_lock ( ) ;
memcg_css = css_from_id ( new_wb_id , & memory_cgrp_subsys ) ;
2021-04-02 17:11:45 +08:00
if ( memcg_css & & ! css_tryget ( memcg_css ) )
memcg_css = NULL ;
2015-05-28 14:50:53 -04:00
rcu_read_unlock ( ) ;
2021-04-02 17:11:45 +08:00
if ( ! memcg_css )
goto out_free ;
isw - > new_wb = wb_get_create ( bdi , memcg_css , GFP_ATOMIC ) ;
css_put ( memcg_css ) ;
2015-05-28 14:50:53 -04:00
if ( ! isw - > new_wb )
goto out_free ;
2021-06-28 19:36:03 -07:00
if ( ! inode_prepare_wbs_switch ( inode , isw - > new_wb ) )
2016-02-29 18:28:53 -05:00
goto out_free ;
2015-05-28 14:50:53 -04:00
2021-06-28 19:35:59 -07:00
isw - > inodes [ 0 ] = inode ;
2015-05-28 14:50:53 -04:00
/*
* In addition to synchronizing among switchers , I_WB_SWITCH tells
2018-04-10 16:36:56 -07:00
* the RCU protected stat update paths to grab the i_page
* lock so that stat transfer can synchronize against them .
2015-05-28 14:50:53 -04:00
* Let ' s continue after I_WB_SWITCH is guaranteed to be visible .
*/
2021-06-28 19:35:50 -07:00
INIT_RCU_WORK ( & isw - > work , inode_switch_wbs_work_fn ) ;
queue_rcu_work ( isw_wq , & isw - > work ) ;
2019-08-02 12:08:13 -07:00
return ;
2015-05-28 14:50:53 -04:00
out_free :
2021-06-28 19:35:47 -07:00
atomic_dec ( & isw_nr_in_flight ) ;
2015-05-28 14:50:53 -04:00
if ( isw - > new_wb )
wb_put ( isw - > new_wb ) ;
kfree ( isw ) ;
}
2023-10-14 20:55:11 +08:00
static bool isw_prepare_wbs_switch ( struct inode_switch_wbs_context * isw ,
struct list_head * list , int * nr )
{
struct inode * inode ;
list_for_each_entry ( inode , list , i_io_list ) {
if ( ! inode_prepare_wbs_switch ( inode , isw - > new_wb ) )
continue ;
isw - > inodes [ * nr ] = inode ;
( * nr ) + + ;
if ( * nr > = WB_MAX_INODES_PER_ISW - 1 )
return true ;
}
return false ;
}
2021-06-28 19:36:03 -07:00
/**
* cleanup_offline_cgwb - detach associated inodes
* @ wb : target wb
*
* Switch all inodes attached to @ wb to the nearest living ancestor ' s wb in order
* to eventually release the dying @ wb . Returns % true if not all inodes were
* switched and the function has to be restarted .
*/
bool cleanup_offline_cgwb ( struct bdi_writeback * wb )
{
struct cgroup_subsys_state * memcg_css ;
struct inode_switch_wbs_context * isw ;
int nr ;
bool restart = false ;
2021-09-25 13:43:08 +02:00
isw = kzalloc ( struct_size ( isw , inodes , WB_MAX_INODES_PER_ISW ) ,
GFP_KERNEL ) ;
2021-06-28 19:36:03 -07:00
if ( ! isw )
return restart ;
atomic_inc ( & isw_nr_in_flight ) ;
for ( memcg_css = wb - > memcg_css - > parent ; memcg_css ;
memcg_css = memcg_css - > parent ) {
isw - > new_wb = wb_get_create ( wb - > bdi , memcg_css , GFP_KERNEL ) ;
if ( isw - > new_wb )
break ;
}
if ( unlikely ( ! isw - > new_wb ) )
isw - > new_wb = & wb - > bdi - > wb ; /* wb_get() is noop for bdi's wb */
nr = 0 ;
spin_lock ( & wb - > list_lock ) ;
2023-10-14 20:55:11 +08:00
/*
* In addition to the inodes that have completed writeback , also switch
* cgwbs for those inodes only with dirty timestamps . Otherwise , those
* inodes won ' t be written back for a long time when lazytime is
* enabled , and thus pinning the dying cgwbs . It won ' t break the
* bandwidth restrictions , as writeback of inode metadata is not
* accounted for .
*/
restart = isw_prepare_wbs_switch ( isw , & wb - > b_attached , & nr ) ;
if ( ! restart )
restart = isw_prepare_wbs_switch ( isw , & wb - > b_dirty_time , & nr ) ;
2021-06-28 19:36:03 -07:00
spin_unlock ( & wb - > list_lock ) ;
/* no attached inodes? bail out */
if ( nr = = 0 ) {
atomic_dec ( & isw_nr_in_flight ) ;
wb_put ( isw - > new_wb ) ;
kfree ( isw ) ;
return restart ;
}
/*
* In addition to synchronizing among switchers , I_WB_SWITCH tells
* the RCU protected stat update paths to grab the i_page
* lock so that stat transfer can synchronize against them .
* Let ' s continue after I_WB_SWITCH is guaranteed to be visible .
*/
INIT_RCU_WORK ( & isw - > work , inode_switch_wbs_work_fn ) ;
queue_rcu_work ( isw_wq , & isw - > work ) ;
return restart ;
}
2015-06-02 08:39:48 -06:00
/**
* wbc_attach_and_unlock_inode - associate wbc with target inode and unlock it
* @ wbc : writeback_control of interest
* @ inode : target inode
*
* @ inode is locked and about to be written back under the control of @ wbc .
* Record @ inode ' s writeback context into @ wbc and unlock the i_lock . On
* writeback completion , wbc_detach_inode ( ) should be called . This is used
* to track the cgroup writeback context .
*/
void wbc_attach_and_unlock_inode ( struct writeback_control * wbc ,
struct inode * inode )
{
2015-06-16 18:48:30 -04:00
if ( ! inode_cgwb_enabled ( inode ) ) {
spin_unlock ( & inode - > i_lock ) ;
return ;
}
2015-06-02 08:39:48 -06:00
wbc - > wb = inode_to_wb ( inode ) ;
2015-05-28 14:50:51 -04:00
wbc - > inode = inode ;
wbc - > wb_id = wbc - > wb - > memcg_css - > id ;
wbc - > wb_lcand_id = inode - > i_wb_frn_winner ;
wbc - > wb_tcand_id = 0 ;
wbc - > wb_bytes = 0 ;
wbc - > wb_lcand_bytes = 0 ;
wbc - > wb_tcand_bytes = 0 ;
2015-06-02 08:39:48 -06:00
wb_get ( wbc - > wb ) ;
spin_unlock ( & inode - > i_lock ) ;
2015-05-28 14:50:57 -04:00
/*
2019-11-08 12:18:29 -08:00
* A dying wb indicates that either the blkcg associated with the
* memcg changed or the associated memcg is dying . In the first
* case , a replacement wb should already be available and we should
* refresh the wb immediately . In the second case , trying to
* refresh will keep failing .
2015-05-28 14:50:57 -04:00
*/
2019-11-08 12:18:29 -08:00
if ( unlikely ( wb_dying ( wbc - > wb ) & & ! css_is_dying ( wbc - > wb - > memcg_css ) ) )
2015-05-28 14:50:57 -04:00
inode_switch_wbs ( inode , wbc - > wb_id ) ;
2015-06-02 08:39:48 -06:00
}
2019-06-27 13:39:48 -07:00
EXPORT_SYMBOL_GPL ( wbc_attach_and_unlock_inode ) ;
2015-06-02 08:39:48 -06:00
/**
2015-05-28 14:50:51 -04:00
* wbc_detach_inode - disassociate wbc from inode and perform foreign detection
* @ wbc : writeback_control of the just finished writeback
2015-06-02 08:39:48 -06:00
*
* To be called after a writeback attempt of an inode finishes and undoes
* wbc_attach_and_unlock_inode ( ) . Can be called under any context .
2015-05-28 14:50:51 -04:00
*
* As concurrent write sharing of an inode is expected to be very rare and
* memcg only tracks page ownership on first - use basis severely confining
* the usefulness of such sharing , cgroup writeback tracks ownership
* per - inode . While the support for concurrent write sharing of an inode
* is deemed unnecessary , an inode being written to by different cgroups at
* different points in time is a lot more common , and , more importantly ,
* charging only by first - use can too readily lead to grossly incorrect
* behaviors ( single foreign page can lead to gigabytes of writeback to be
* incorrectly attributed ) .
*
* To resolve this issue , cgroup writeback detects the majority dirtier of
2022-05-21 13:10:42 +02:00
* an inode and transfers the ownership to it . To avoid unnecessary
2015-05-28 14:50:51 -04:00
* oscillation , the detection mechanism keeps track of history and gives
* out the switch verdict only if the foreign usage pattern is stable over
* a certain amount of time and / or writeback attempts .
*
* On each writeback attempt , @ wbc tries to detect the majority writer
* using Boyer - Moore majority vote algorithm . In addition to the byte
* count from the majority voting , it also counts the bytes written for the
* current wb and the last round ' s winner wb ( max of last round ' s current
* wb , the winner from two rounds ago , and the last round ' s majority
* candidate ) . Keeping track of the historical winner helps the algorithm
* to semi - reliably detect the most active writer even when it ' s not the
* absolute majority .
*
* Once the winner of the round is determined , whether the winner is
* foreign or not and how much IO time the round consumed is recorded in
* inode - > i_wb_frn_history . If the amount of recorded foreign IO time is
* over a certain threshold , the switch verdict is given .
2015-06-02 08:39:48 -06:00
*/
void wbc_detach_inode ( struct writeback_control * wbc )
{
2015-05-28 14:50:51 -04:00
struct bdi_writeback * wb = wbc - > wb ;
struct inode * inode = wbc - > inode ;
2015-06-16 18:48:30 -04:00
unsigned long avg_time , max_bytes , max_time ;
u16 history ;
2015-05-28 14:50:51 -04:00
int max_id ;
2015-06-16 18:48:30 -04:00
if ( ! wb )
return ;
history = inode - > i_wb_frn_history ;
avg_time = inode - > i_wb_frn_avg_time ;
2015-05-28 14:50:51 -04:00
/* pick the winner of this round */
if ( wbc - > wb_bytes > = wbc - > wb_lcand_bytes & &
wbc - > wb_bytes > = wbc - > wb_tcand_bytes ) {
max_id = wbc - > wb_id ;
max_bytes = wbc - > wb_bytes ;
} else if ( wbc - > wb_lcand_bytes > = wbc - > wb_tcand_bytes ) {
max_id = wbc - > wb_lcand_id ;
max_bytes = wbc - > wb_lcand_bytes ;
} else {
max_id = wbc - > wb_tcand_id ;
max_bytes = wbc - > wb_tcand_bytes ;
}
/*
* Calculate the amount of IO time the winner consumed and fold it
* into the running average kept per inode . If the consumed IO
* time is lower than avg_time / WB_FRN_TIME_CUT_DIV , ignore it for
* deciding whether to switch or not . This is to prevent one - off
* small dirtiers from skewing the verdict .
*/
max_time = DIV_ROUND_UP ( ( max_bytes > > PAGE_SHIFT ) < < WB_FRN_TIME_SHIFT ,
wb - > avg_write_bandwidth ) ;
if ( avg_time )
avg_time + = ( max_time > > WB_FRN_TIME_AVG_SHIFT ) -
( avg_time > > WB_FRN_TIME_AVG_SHIFT ) ;
else
avg_time = max_time ; /* immediate catch up on first run */
if ( max_time > = avg_time / WB_FRN_TIME_CUT_DIV ) {
int slots ;
/*
* The switch verdict is reached if foreign wb ' s consume
* more than a certain proportion of IO time in a
* WB_FRN_TIME_PERIOD . This is loosely tracked by 16 slot
* history mask where each bit represents one sixteenth of
* the period . Determine the number of slots to shift into
* history from @ max_time .
*/
slots = min ( DIV_ROUND_UP ( max_time , WB_FRN_HIST_UNIT ) ,
( unsigned long ) WB_FRN_HIST_MAX_SLOTS ) ;
history < < = slots ;
if ( wbc - > wb_id ! = max_id )
history | = ( 1U < < slots ) - 1 ;
2019-08-29 15:47:19 -07:00
if ( history )
trace_inode_foreign_history ( inode , wbc , history ) ;
2015-05-28 14:50:51 -04:00
/*
* Switch if the current wb isn ' t the consistent winner .
* If there are multiple closely competing dirtiers , the
* inode may switch across them repeatedly over time , which
* is okay . The main goal is avoiding keeping an inode on
* the wrong wb for an extended period of time .
*/
2023-01-19 13:44:43 +03:00
if ( hweight16 ( history ) > WB_FRN_HIST_THR_SLOTS )
2015-05-28 14:50:53 -04:00
inode_switch_wbs ( inode , max_id ) ;
2015-05-28 14:50:51 -04:00
}
/*
* Multiple instances of this function may race to update the
* following fields but we don ' t mind occasional inaccuracies .
*/
inode - > i_wb_frn_winner = max_id ;
inode - > i_wb_frn_avg_time = min ( avg_time , ( unsigned long ) U16_MAX ) ;
inode - > i_wb_frn_history = history ;
2015-06-02 08:39:48 -06:00
wb_put ( wbc - > wb ) ;
wbc - > wb = NULL ;
}
2019-06-27 13:39:48 -07:00
EXPORT_SYMBOL_GPL ( wbc_detach_inode ) ;
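The history mask handling in wbc_detach_inode() can be illustrated in isolation. The following is a minimal userspace sketch, not the kernel code: the helper names (popcount16, history_record, history_wants_switch) are hypothetical, and the two constants are illustrative stand-ins for WB_FRN_HIST_MAX_SLOTS and WB_FRN_HIST_THR_SLOTS, whose real definitions are outside this excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; the real constants are defined elsewhere. */
#define HIST_MAX_SLOTS	5	/* cap on slots recorded per round */
#define HIST_THR_SLOTS	8	/* switch once more than this many bits are set */

/* Portable population count for a 16-bit mask (stand-in for hweight16()). */
static unsigned int popcount16(uint16_t v)
{
	unsigned int n = 0;

	for (; v; v &= (uint16_t)(v - 1))
		n++;
	return n;
}

/* Shift @slots new bits into @history; set them if the round was foreign. */
static uint16_t history_record(uint16_t history, unsigned int slots,
			       bool foreign)
{
	if (slots > HIST_MAX_SLOTS)
		slots = HIST_MAX_SLOTS;
	history = (uint16_t)(history << slots);
	if (foreign)
		history |= (uint16_t)((1U << slots) - 1);
	return history;
}

/* The verdict: more than HIST_THR_SLOTS of the 16 slots were foreign. */
static bool history_wants_switch(uint16_t history)
{
	return popcount16(history) > HIST_THR_SLOTS;
}
```

Because each round shifts older bits out of the 16-bit mask, a burst of foreign writeback only triggers a switch if it persists across enough recent slots; a single long round cannot fill the mask on its own because of the per-round cap.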
2015-06-02 08:39:48 -06:00
2015-05-28 14:50:51 -04:00
/**
2019-06-27 13:39:49 -07:00
* wbc_account_cgroup_owner - account writeback to update inode cgroup ownership
2015-05-28 14:50:51 -04:00
* @ wbc : writeback_control of the writeback in progress
* @ page : page being written out
* @ bytes : number of bytes being written out
*
* @ bytes from @ page are about to be written out during the writeback
* controlled by @ wbc . Keep the book for foreign inode detection . See
* wbc_detach_inode ( ) .
*/
2019-06-27 13:39:49 -07:00
void wbc_account_cgroup_owner ( struct writeback_control * wbc , struct page * page ,
size_t bytes )
2015-05-28 14:50:51 -04:00
{
2023-01-16 19:25:07 +00:00
struct folio * folio ;
2019-06-13 15:30:41 -07:00
struct cgroup_subsys_state * css ;
2015-05-28 14:50:51 -04:00
int id ;
/*
* pageout ( ) path doesn ' t attach @ wbc to the inode being written
* out . This is intentional as we don ' t want the function to block
* behind a slow cgroup . Ultimately , we want pageout ( ) to kick off
* regular writeback instead of writing things out itself .
*/
2019-06-27 13:39:50 -07:00
if ( ! wbc - > wb | | wbc - > no_cgroup_owner )
2015-05-28 14:50:51 -04:00
return ;
2023-01-16 19:25:07 +00:00
folio = page_folio ( page ) ;
css = mem_cgroup_css_from_folio ( folio ) ;
2019-06-13 15:30:41 -07:00
/* dead cgroups shouldn't contribute to inode ownership arbitration */
if ( ! ( css - > flags & CSS_ONLINE ) )
return ;
id = css - > id ;
2015-05-28 14:50:51 -04:00
if ( id = = wbc - > wb_id ) {
wbc - > wb_bytes + = bytes ;
return ;
}
if ( id = = wbc - > wb_lcand_id )
wbc - > wb_lcand_bytes + = bytes ;
/* Boyer-Moore majority vote algorithm */
if ( ! wbc - > wb_tcand_bytes )
wbc - > wb_tcand_id = id ;
if ( id = = wbc - > wb_tcand_id )
wbc - > wb_tcand_bytes + = bytes ;
else
wbc - > wb_tcand_bytes - = min ( bytes , wbc - > wb_tcand_bytes ) ;
}
2019-06-27 13:39:49 -07:00
EXPORT_SYMBOL_GPL ( wbc_account_cgroup_owner ) ;
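/*
 * Illustration only, not part of the kernel source: a minimal userspace
 * sketch of the byte-weighted Boyer-Moore majority vote used at the end of
 * wbc_account_cgroup_owner(). The struct vote_state type and the helpers
 * min_size() and vote_account() are hypothetical stand-ins for the
 * wb_tcand_id / wb_tcand_bytes fields of struct writeback_control and the
 * kernel's min().
 */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for wbc->wb_tcand_id and wbc->wb_tcand_bytes. */
struct vote_state {
	int cand_id;		/* current majority candidate */
	size_t cand_bytes;	/* weight the candidate still holds */
};

static size_t min_size(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Account one sample of @bytes owned by @id.
 * A matching sample adds weight to the candidate; a differing sample
 * cancels weight. Once the weight reaches zero, the next sample
 * installs a new candidate.
 */
static void vote_account(struct vote_state *vs, int id, size_t bytes)
{
	assert(vs != NULL);

	if (!vs->cand_bytes)
		vs->cand_id = id;
	if (id == vs->cand_id)
		vs->cand_bytes += bytes;
	else
		vs->cand_bytes -= min_size(bytes, vs->cand_bytes);
}
```

/*
 * If one id owns more than half of all accounted bytes, it ends up as
 * cand_id. Without such a majority the surviving candidate is arbitrary,
 * which is presumably why the kernel tracks a separate last-round
 * candidate (wb_lcand_*) as well.
 */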
2015-05-28 14:50:51 -04:00
2015-05-22 17:13:55 -04:00
/**
* wb_split_bdi_pages - split nr_pages to write according to bandwidth
* @ wb : target bdi_writeback to split @ nr_pages to
* @ nr_pages : number of pages to write for the whole bdi
*
* Split @ wb ' s portion of @ nr_pages according to @ wb ' s write bandwidth in
* relation to the total write bandwidth of all wb ' s w / dirty inodes on
* @ wb - > bdi .
*/
static long wb_split_bdi_pages ( struct bdi_writeback * wb , long nr_pages )
{
unsigned long this_bw = wb - > avg_write_bandwidth ;
unsigned long tot_bw = atomic_long_read ( & wb - > bdi - > tot_write_bandwidth ) ;
if ( nr_pages = = LONG_MAX )
return LONG_MAX ;
/*
* This may be called on clean wb ' s and proportional distribution
* may not make sense , just use the original @ nr_pages in those
* cases . In general , we wanna err on the side of writing more .
*/
if ( ! tot_bw | | this_bw > = tot_bw )
return nr_pages ;
else
return DIV_ROUND_UP_ULL ( ( u64 ) nr_pages * this_bw , tot_bw ) ;
}
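/*
 * Illustration only, not part of the kernel source: a userspace sketch of
 * the proportional split performed by wb_split_bdi_pages() above. The name
 * split_pages() is hypothetical, and the round-up division is written out
 * by hand as an assumed equivalent of DIV_ROUND_UP_ULL().
 */

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Return this wb's share of @nr_pages, given its bandwidth @this_bw and
 * the total bandwidth @tot_bw of all wbs with dirty inodes.
 */
static long split_pages(long nr_pages, unsigned long this_bw,
			unsigned long tot_bw)
{
	assert(nr_pages >= 0);

	/* LONG_MAX means "write everything"; never scale it down. */
	if (nr_pages == LONG_MAX)
		return LONG_MAX;

	/* No usable bandwidth data: err on the side of writing more. */
	if (!tot_bw || this_bw >= tot_bw)
		return nr_pages;

	/* Round up so that a wb with any bandwidth gets at least one page. */
	return (long)(((uint64_t)nr_pages * this_bw + tot_bw - 1) / tot_bw);
}
```

/*
 * For example, a wb holding a quarter of the total bandwidth is asked to
 * write a quarter of the pages, and 10 pages split at a 1:3 ratio round up
 * to 4 rather than down to 3.
 */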
2015-05-22 17:14:01 -04:00
/**
* bdi_split_work_to_wbs - split a wb_writeback_work to all wb ' s of a bdi
* @ bdi : target backing_dev_info
* @ base_work : wb_writeback_work to issue
* @ skip_if_busy : skip wb ' s which already have writeback in progress
*
* Split and issue @ base_work to all wb ' s ( bdi_writeback ' s ) of @ bdi which
* have dirty inodes . If @ base_work - > nr_pages isn ' t % LONG_MAX , it ' s
* distributed to the busy wbs according to each wb ' s proportion in the
* total active write bandwidth of @ bdi .
*/
static void bdi_split_work_to_wbs ( struct backing_dev_info * bdi ,
struct wb_writeback_work * base_work ,
bool skip_if_busy )
{
2015-10-02 14:47:05 -04:00
struct bdi_writeback * last_wb = NULL ;
2015-10-27 14:19:39 +09:00
struct bdi_writeback * wb = list_entry ( & bdi - > wb_list ,
struct bdi_writeback , bdi_node ) ;
2015-05-22 17:14:01 -04:00
might_sleep ( ) ;
restart :
rcu_read_lock ( ) ;
2015-10-02 14:47:05 -04:00
list_for_each_entry_continue_rcu ( wb , & bdi - > wb_list , bdi_node ) {
2019-08-26 09:06:52 -07:00
DEFINE_WB_COMPLETION ( fallback_work_done , bdi ) ;
2015-08-18 14:54:53 -07:00
struct wb_writeback_work fallback_work ;
struct wb_writeback_work * work ;
long nr_pages ;
2015-10-02 14:47:05 -04:00
if ( last_wb ) {
wb_put ( last_wb ) ;
last_wb = NULL ;
}
2015-08-25 14:11:52 -04:00
/* SYNC_ALL writes out I_DIRTY_TIME too */
if ( ! wb_has_dirty_io ( wb ) & &
( base_work - > sync_mode = = WB_SYNC_NONE | |
list_empty ( & wb - > b_dirty_time ) ) )
continue ;
if ( skip_if_busy & & writeback_in_progress ( wb ) )
2015-05-22 17:14:01 -04:00
continue ;
2015-08-18 14:54:53 -07:00
nr_pages = wb_split_bdi_pages ( wb , base_work - > nr_pages ) ;
work = kmalloc ( sizeof ( * work ) , GFP_ATOMIC ) ;
if ( work ) {
* work = * base_work ;
work - > nr_pages = nr_pages ;
work - > auto_free = 1 ;
wb_queue_work ( wb , work ) ;
continue ;
2015-05-22 17:14:01 -04:00
}
2015-08-18 14:54:53 -07:00
writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
KASAN reports a null-ptr-deref:
==================================================================
BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
Write of size 8 at addr 0000000000000000 by task sync/943
CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
Call Trace:
<TASK>
dump_stack_lvl+0x7f/0xc0
print_report+0x2ba/0x340
kasan_report+0xc4/0x120
kasan_check_range+0x1b7/0x2e0
__kasan_check_write+0x24/0x40
bdi_split_work_to_wbs+0x5c5/0x7b0
sync_inodes_sb+0x195/0x630
sync_inodes_one_sb+0x3a/0x50
iterate_supers+0x106/0x1b0
ksys_sync+0x98/0x160
[...]
==================================================================
The race that causes the above issue is as follows:
cpu1 cpu2
-------------------------|-------------------------
inode_switch_wbs
INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
queue_rcu_work(isw_wq, &isw->work)
// queue_work async
inode_switch_wbs_work_fn
wb_put_many(old_wb, nr_switched)
percpu_ref_put_many
ref->data->release(ref)
cgwb_release
queue_work(cgwb_release_wq, &wb->release_work)
// queue_work async
&wb->release_work
cgwb_release_workfn
ksys_sync
iterate_supers
sync_inodes_one_sb
sync_inodes_sb
bdi_split_work_to_wbs
kmalloc(sizeof(*work), GFP_ATOMIC)
// alloc memory failed
percpu_ref_exit
ref->data = NULL
kfree(data)
wb_get(wb)
percpu_ref_get(&wb->refcnt)
percpu_ref_get_many(ref, 1)
atomic_long_add(nr, &ref->data->count)
atomic64_add(i, v)
// trigger null-ptr-deref
bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all
wbs. If the allocation of new work fails, the on-stack fallback will be
used and the reference count of the current wb is increased afterwards.
If cgroup writeback membership switches occur before getting the reference
count and the current wb is released as old_wb, then calling wb_get() or
wb_put() will trigger the null pointer dereference above.
This issue was introduced in v4.3-rc7 (see fix tag1). Both
sync_inodes_sb() and __writeback_inodes_sb_nr() calls to
bdi_split_work_to_wbs() can trigger this issue. For scenarios called via
sync_inodes_sb(), originally commit 7fc5854f8c6e ("writeback: synchronize
sync(2) against cgroup writeback membership switches") reduced the
possibility of the issue by adding wb_switch_rwsem, but a later change in v5.14-rc1 (see
fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from
inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io,
thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(),
and the issue becomes easily reproducible again.
To solve this problem, percpu_ref_exit() is called under RCU protection to
avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs().
Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
and skip the current wb if wb_tryget() fails because the wb has already
been shut down.
Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com
Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Hou Tao <houtao1@huawei.com>
Cc: yangerkun <yangerkun@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-10 21:08:26 +08:00
/*
* If wb_tryget ( ) fails , the wb has been shut down , skip it .
*
* Pin @ wb so that it stays on @ bdi - > wb_list . This allows
* continuing iteration from @ wb after dropping and
* regrabbing rcu read lock .
*/
if ( ! wb_tryget ( wb ) )
continue ;
2015-08-18 14:54:53 -07:00
/* alloc failed, execute synchronously using on-stack fallback */
work = & fallback_work ;
* work = * base_work ;
work - > nr_pages = nr_pages ;
work - > auto_free = 0 ;
work - > done = & fallback_work_done ;
wb_queue_work ( wb , work ) ;
2015-10-02 14:47:05 -04:00
last_wb = wb ;
2015-08-18 14:54:53 -07:00
rcu_read_unlock ( ) ;
2019-08-26 09:06:52 -07:00
wb_wait_for_completion ( & fallback_work_done ) ;
2015-08-18 14:54:53 -07:00
goto restart ;
2015-05-22 17:14:01 -04:00
}
rcu_read_unlock ( ) ;
2015-10-02 14:47:05 -04:00
if ( last_wb )
wb_put ( last_wb ) ;
2015-05-22 17:14:01 -04:00
}
2019-08-26 09:06:55 -07:00
/**
* cgroup_writeback_by_id - initiate cgroup writeback from bdi and memcg IDs
* @ bdi_id : target bdi id
* @ memcg_id : target memcg css id
* @ reason : reason why the writeback work was initiated
* @ done : target wb_completion
*
* Initiate flush of the bdi_writeback identified by @ bdi_id and @ memcg_id
* with the specified parameters .
*/
2021-09-02 14:53:27 -07:00
int cgroup_writeback_by_id ( u64 bdi_id , int memcg_id ,
2019-08-26 09:06:55 -07:00
enum wb_reason reason , struct wb_completion * done )
{
struct backing_dev_info * bdi ;
struct cgroup_subsys_state * memcg_css ;
struct bdi_writeback * wb ;
struct wb_writeback_work * work ;
2021-09-02 14:53:27 -07:00
unsigned long dirty ;
2019-08-26 09:06:55 -07:00
int ret ;
/* lookup bdi and memcg */
bdi = bdi_get_by_id ( bdi_id ) ;
if ( ! bdi )
return - ENOENT ;
rcu_read_lock ( ) ;
memcg_css = css_from_id ( memcg_id , & memory_cgrp_subsys ) ;
if ( memcg_css & & ! css_tryget ( memcg_css ) )
memcg_css = NULL ;
rcu_read_unlock ( ) ;
if ( ! memcg_css ) {
ret = - ENOENT ;
goto out_bdi_put ;
}
/*
* And find the associated wb . If the wb isn ' t there already
* there ' s nothing to flush , don ' t create one .
*/
wb = wb_get_lookup ( bdi , memcg_css ) ;
if ( ! wb ) {
ret = - ENOENT ;
goto out_css_put ;
}
/*
2021-09-02 14:53:27 -07:00
* The caller is attempting to write out most of
2019-08-26 09:06:55 -07:00
* the currently dirty pages . Let ' s take the current dirty page
* count and inflate it by 25 % which should be large enough to
* flush out most dirty pages while avoiding getting livelocked by
* concurrent dirtiers .
2021-09-02 14:53:27 -07:00
*
* BTW the memcg stats are flushed periodically and this is best - effort
* estimation , so some potential error is ok .
2019-08-26 09:06:55 -07:00
*/
2021-09-02 14:53:27 -07:00
dirty = memcg_page_state ( mem_cgroup_from_css ( memcg_css ) , NR_FILE_DIRTY ) ;
dirty = dirty * 10 / 8 ;
2019-08-26 09:06:55 -07:00
/* issue the writeback work */
work = kzalloc ( sizeof ( * work ) , GFP_NOWAIT | __GFP_NOWARN ) ;
if ( work ) {
2021-09-02 14:53:27 -07:00
work - > nr_pages = dirty ;
2019-08-26 09:06:55 -07:00
work - > sync_mode = WB_SYNC_NONE ;
work - > range_cyclic = 1 ;
work - > reason = reason ;
work - > done = done ;
work - > auto_free = 1 ;
wb_queue_work ( wb , work ) ;
ret = 0 ;
} else {
ret = - ENOMEM ;
}
wb_put ( wb ) ;
out_css_put :
css_put ( memcg_css ) ;
out_bdi_put :
bdi_put ( bdi ) ;
return ret ;
}
2016-02-29 18:28:53 -05:00
/**
* cgroup_writeback_umount - flush inode wb switches for umount
*
* This function is called when a super_block is about to be destroyed and
* flushes in - flight inode wb switches . An inode wb switch goes through
* RCU and then workqueue , so the two need to be flushed in order to ensure
* that all previously scheduled switches are finished . As wb switches are
* rare occurrences and synchronize_rcu ( ) can take a while , perform
* flushing iff wb switches are in flight .
*/
void cgroup_writeback_umount ( void )
{
2021-06-28 19:35:44 -07:00
/*
* SB_ACTIVE should be reliably cleared before checking
* isw_nr_in_flight , see generic_shutdown_super ( ) .
*/
smp_mb ( ) ;
2016-02-29 18:28:53 -05:00
if ( atomic_read ( & isw_nr_in_flight ) ) {
2019-05-17 14:31:44 -07:00
/*
* Use rcu_barrier ( ) to wait for all pending callbacks to
* ensure that all in - flight wb switches are in the workqueue .
*/
rcu_barrier ( ) ;
2016-02-29 18:28:53 -05:00
flush_workqueue ( isw_wq ) ;
}
}
static int __init cgroup_writeback_init ( void )
{
isw_wq = alloc_workqueue ( " inode_switch_wbs " , 0 , 0 ) ;
if ( ! isw_wq )
return - ENOMEM ;
return 0 ;
}
fs_initcall ( cgroup_writeback_init ) ;
2015-05-22 17:13:55 -04:00
# else /* CONFIG_CGROUP_WRITEBACK */
2017-12-12 08:38:30 -08:00
static void bdi_down_write_wb_switch_rwsem ( struct backing_dev_info * bdi ) { }
static void bdi_up_write_wb_switch_rwsem ( struct backing_dev_info * bdi ) { }
2021-06-28 19:35:53 -07:00
static void inode_cgwb_move_to_attached ( struct inode * inode ,
struct bdi_writeback * wb )
{
assert_spin_locked ( & wb - > list_lock ) ;
assert_spin_locked ( & inode - > i_lock ) ;
2022-12-12 12:36:33 +01:00
WARN_ON_ONCE ( inode - > i_state & I_FREEING ) ;
2021-06-28 19:35:53 -07:00
inode - > i_state & = ~ I_SYNC_QUEUED ;
list_del_init ( & inode - > i_io_list ) ;
wb_io_lists_depopulated ( wb ) ;
}
2015-05-28 14:50:52 -04:00
static struct bdi_writeback *
locked_inode_to_wb_and_lock_list ( struct inode * inode )
__releases ( & inode - > i_lock )
__acquires ( & wb - > list_lock )
{
struct bdi_writeback * wb = inode_to_wb ( inode ) ;
spin_unlock ( & inode - > i_lock ) ;
spin_lock ( & wb - > list_lock ) ;
return wb ;
}
static struct bdi_writeback * inode_to_wb_and_lock_list ( struct inode * inode )
__acquires ( & wb - > list_lock )
{
struct bdi_writeback * wb = inode_to_wb ( inode ) ;
spin_lock ( & wb - > list_lock ) ;
return wb ;
}
2015-05-22 17:13:55 -04:00
static long wb_split_bdi_pages ( struct bdi_writeback * wb , long nr_pages )
{
return nr_pages ;
}
2015-05-22 17:14:01 -04:00
static void bdi_split_work_to_wbs ( struct backing_dev_info * bdi ,
struct wb_writeback_work * base_work ,
bool skip_if_busy )
{
might_sleep ( ) ;
2015-08-25 14:11:52 -04:00
if ( ! skip_if_busy | | ! writeback_in_progress ( & bdi - > wb ) ) {
2015-05-22 17:14:01 -04:00
base_work - > auto_free = 0 ;
wb_queue_work ( & bdi - > wb , base_work ) ;
}
}
2015-05-22 17:13:44 -04:00
# endif /* CONFIG_CGROUP_WRITEBACK */
2017-09-28 11:31:22 -06:00
/*
* Add in the number of potentially dirty inodes , because each inode
* write can dirty pagecache in the underlying blockdev .
*/
static unsigned long get_nr_dirty_pages ( void )
{
return global_node_page_state ( NR_FILE_DIRTY ) +
get_nr_dirty_inodes ( ) ;
}
static void wb_start_writeback ( struct bdi_writeback * wb , enum wb_reason reason )
2009-09-16 15:13:54 +02:00
{
2015-05-22 17:13:51 -04:00
if ( ! wb_has_dirty_io ( wb ) )
return ;
2017-09-28 11:31:55 -06:00
/*
* All callers of this function want to start writeback of all
* dirty pages . Places like vmscan can call this at a very
* high frequency , causing pointless allocations of tons of
* work items and keeping the flusher threads busy retrieving
* that work . Ensure that we only allow one of them pending and
2017-09-30 02:09:06 -06:00
* inflight at a time .
2017-09-28 11:31:55 -06:00
*/
2017-09-30 02:09:06 -06:00
if ( test_bit ( WB_start_all , & wb - > state ) | |
test_and_set_bit ( WB_start_all , & wb - > state ) )
2017-09-28 11:31:55 -06:00
return ;
2017-09-30 02:09:06 -06:00
wb - > start_all_reason = reason ;
wb_wakeup ( wb ) ;
2010-06-08 18:15:15 +02:00
}
2009-09-23 20:33:40 +08:00
2010-06-08 18:15:15 +02:00
/**
2015-05-22 17:13:54 -04:00
* wb_start_background_writeback - start background writeback
* @ wb : bdi_writeback to write from
2010-06-08 18:15:15 +02:00
*
* Description :
2011-01-13 15:45:44 -08:00
* This makes sure WB_SYNC_NONE background writeback happens . When
2015-05-22 17:13:54 -04:00
* this function returns , it is only guaranteed that for given wb
2011-01-13 15:45:44 -08:00
* some IO is happening if we are over background dirty threshold .
* Caller need not hold sb s_umount semaphore .
2010-06-08 18:15:15 +02:00
*/
2015-05-22 17:13:54 -04:00
void wb_start_background_writeback ( struct bdi_writeback * wb )
2010-06-08 18:15:15 +02:00
{
2011-01-13 15:45:44 -08:00
/*
* We just wake up the flusher thread . It will perform background
* writeback as soon as there is no other work to do .
*/
2015-08-18 14:54:56 -07:00
trace_writeback_wake_background ( wb ) ;
2015-05-22 17:13:54 -04:00
wb_wakeup ( wb ) ;
2005-04-16 15:20:36 -07:00
}
2011-03-22 22:23:41 +11:00
/*
* Remove the inode from the writeback list it is on .
*/
2015-03-04 14:07:22 -05:00
void inode_io_list_del ( struct inode * inode )
2011-03-22 22:23:41 +11:00
{
2015-05-28 14:50:52 -04:00
struct bdi_writeback * wb ;
2011-04-21 18:19:44 -06:00
2015-05-28 14:50:52 -04:00
wb = inode_to_wb_and_lock_list ( inode ) ;
2020-06-10 17:36:03 +02:00
spin_lock ( & inode - > i_lock ) ;
2021-06-28 19:35:53 -07:00
inode - > i_state & = ~ I_SYNC_QUEUED ;
list_del_init ( & inode - > i_io_list ) ;
wb_io_lists_depopulated ( wb ) ;
2020-06-10 17:36:03 +02:00
spin_unlock ( & inode - > i_lock ) ;
2015-05-22 17:13:37 -04:00
spin_unlock ( & wb - > list_lock ) ;
2011-03-22 22:23:41 +11:00
}
2020-04-21 10:54:44 +02:00
EXPORT_SYMBOL ( inode_io_list_del ) ;
2011-03-22 22:23:41 +11:00
2016-07-26 15:21:50 -07:00
/*
* mark an inode as under writeback on the sb
*/
void sb_mark_inode_writeback ( struct inode * inode )
{
struct super_block * sb = inode - > i_sb ;
unsigned long flags ;
if ( list_empty ( & inode - > i_wb_list ) ) {
spin_lock_irqsave ( & sb - > s_inode_wblist_lock , flags ) ;
2016-07-26 15:21:53 -07:00
if ( list_empty ( & inode - > i_wb_list ) ) {
2016-07-26 15:21:50 -07:00
list_add_tail ( & inode - > i_wb_list , & sb - > s_inodes_wb ) ;
2016-07-26 15:21:53 -07:00
trace_sb_mark_inode_writeback ( inode ) ;
}
2016-07-26 15:21:50 -07:00
spin_unlock_irqrestore ( & sb - > s_inode_wblist_lock , flags ) ;
}
}
/*
* clear an inode as under writeback on the sb
*/
void sb_clear_inode_writeback ( struct inode * inode )
{
struct super_block * sb = inode - > i_sb ;
unsigned long flags ;
if ( ! list_empty ( & inode - > i_wb_list ) ) {
spin_lock_irqsave ( & sb - > s_inode_wblist_lock , flags ) ;
2016-07-26 15:21:53 -07:00
if ( ! list_empty ( & inode - > i_wb_list ) ) {
list_del_init ( & inode - > i_wb_list ) ;
trace_sb_clear_inode_writeback ( inode ) ;
}
2016-07-26 15:21:50 -07:00
spin_unlock_irqrestore ( & sb - > s_inode_wblist_lock , flags ) ;
}
}
2007-10-16 23:30:32 -07:00
/*
* Redirty an inode : set its when - it - was dirtied timestamp and move it to the
* furthest end of its superblock ' s dirty - inode list .
*
* Before stamping the inode ' s - > dirtied_when , we check to see whether it is
2009-09-02 09:19:46 +02:00
* already the most - recently - dirtied inode on the b_dirty list . If that is
2007-10-16 23:30:32 -07:00
* the case then the inode must have been redirtied while it was being written
* out and we don ' t reset its dirtied_when .
*/
2020-06-10 17:36:03 +02:00
static void redirty_tail_locked ( struct inode * inode , struct bdi_writeback * wb )
2007-10-16 23:30:32 -07:00
{
2020-06-10 17:36:03 +02:00
assert_spin_locked ( & inode - > i_lock ) ;
2022-12-12 12:36:33 +01:00
inode - > i_state & = ~ I_SYNC_QUEUED ;
/*
* When the inode is being freed just don ' t bother with dirty list
* tracking . Flush worker will ignore this inode anyway and it will
* trigger assertions in inode_io_list_move_locked ( ) .
*/
if ( inode - > i_state & I_FREEING ) {
list_del_init ( & inode - > i_io_list ) ;
wb_io_lists_depopulated ( wb ) ;
return ;
}
2009-09-09 09:08:54 +02:00
if ( ! list_empty ( & wb - > b_dirty ) ) {
2009-09-02 09:19:46 +02:00
struct inode * tail ;
2007-10-16 23:30:32 -07:00
2010-10-21 11:49:30 +11:00
tail = wb_inode ( wb - > b_dirty . next ) ;
2009-09-02 09:19:46 +02:00
if ( time_before ( inode - > dirtied_when , tail - > dirtied_when ) )
2007-10-16 23:30:32 -07:00
inode - > dirtied_when = jiffies ;
}
2015-03-04 14:07:22 -05:00
inode_io_list_move_locked ( inode , wb , & wb - > b_dirty ) ;
2007-10-16 23:30:32 -07:00
}
2020-06-10 17:36:03 +02:00
static void redirty_tail ( struct inode * inode , struct bdi_writeback * wb )
{
spin_lock ( & inode - > i_lock ) ;
redirty_tail_locked ( inode , wb ) ;
spin_unlock ( & inode - > i_lock ) ;
}
2007-10-16 23:30:34 -07:00
/*
2009-09-02 09:19:46 +02:00
* requeue inode for re - scanning after bdi - > b_io list is exhausted .
2007-10-16 23:30:34 -07:00
*/
2011-04-21 18:19:44 -06:00
static void requeue_io ( struct inode * inode , struct bdi_writeback * wb )
2007-10-16 23:30:34 -07:00
{
2015-03-04 14:07:22 -05:00
inode_io_list_move_locked ( inode , wb , & wb - > b_more_io ) ;
2007-10-16 23:30:34 -07:00
}
2007-10-16 23:30:44 -07:00
static void inode_sync_complete ( struct inode * inode )
{
2012-05-03 14:47:55 +02:00
inode - > i_state & = ~ I_SYNC ;
2012-11-26 16:29:51 -08:00
/* If inode is clean and unused, put it into LRU now... */
inode_add_lru ( inode ) ;
2012-05-03 14:47:55 +02:00
/* Waiters must see I_SYNC cleared before being woken up */
2007-10-16 23:30:44 -07:00
smp_mb ( ) ;
wake_up_bit ( & inode - > i_state , __I_SYNC ) ;
}
2009-04-02 16:56:37 -07:00
static bool inode_dirtied_after ( struct inode * inode , unsigned long t )
{
bool ret = time_after ( inode - > dirtied_when , t ) ;
# ifndef CONFIG_64BIT
/*
* For inodes being constantly redirtied , dirtied_when can get stuck .
* It _appears_ to be in the future , but is actually in distant past .
* This test is necessary to prevent such wrapped - around relative times
2009-09-23 19:37:09 +02:00
* from permanently stopping the whole bdi writeback .
2009-04-02 16:56:37 -07:00
*/
ret = ret & & time_before_eq ( inode - > dirtied_when , jiffies ) ;
# endif
return ret ;
}
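/*
 * Illustration only, not part of the kernel source: a userspace sketch of
 * why inode_dirtied_after() needs the extra time_before_eq() test when the
 * tick counter is 32 bits wide. tick_t, tick_after(), tick_before_eq() and
 * dirtied_after() are hypothetical stand-ins for jiffies, time_after(),
 * time_before_eq() and the function above. The sketch assumes the usual
 * two's-complement conversion from uint32_t to int32_t.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tick_t;	/* models a 32-bit jiffies counter */

static_assert(sizeof(tick_t) == 4, "tick_t must be 32 bits wide");

/* True if @a is later than @b, tolerating counter wraparound. */
static bool tick_after(tick_t a, tick_t b)
{
	return (int32_t)(uint32_t)(b - a) < 0;
}

/* True if @a is earlier than or equal to @b, tolerating wraparound. */
static bool tick_before_eq(tick_t a, tick_t b)
{
	return (int32_t)(uint32_t)(b - a) >= 0;
}

/*
 * A stamp more than 2^31 ticks old looks like it lies in the future.
 * Requiring it to be no later than @now rejects such stale stamps.
 */
static bool dirtied_after(tick_t dirtied_when, tick_t t, tick_t now)
{
	return tick_after(dirtied_when, t) &&
	       tick_before_eq(dirtied_when, now);
}
```

/*
 * With dirtied_when = 0 and now = 0x90000000, tick_after() alone reports
 * the inode as dirtied after the cutoff, while dirtied_after() correctly
 * treats it as old and therefore eligible for writeback.
 */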
2007-10-16 23:30:39 -07:00
/*
2020-05-29 16:08:58 +02:00
* Move expired ( dirtied before dirtied_before ) dirty inodes from
2012-03-09 07:26:22 -08:00
* @ delaying_queue to @ dispatch_queue .
2007-10-16 23:30:39 -07:00
*/
2011-04-23 12:27:27 -06:00
static int move_expired_inodes ( struct list_head * delaying_queue ,
2007-10-16 23:30:39 -07:00
struct list_head * dispatch_queue ,
2020-05-29 16:24:43 +02:00
unsigned long dirtied_before )
2007-10-16 23:30:39 -07:00
{
2009-09-24 14:42:33 +02:00
LIST_HEAD ( tmp ) ;
struct list_head * pos , * node ;
2009-09-24 15:12:57 +02:00
struct super_block * sb = NULL ;
2009-09-24 14:42:33 +02:00
struct inode * inode ;
2009-09-24 15:12:57 +02:00
int do_sb_sort = 0 ;
2011-04-23 12:27:27 -06:00
int moved = 0 ;
2009-09-24 14:42:33 +02:00
2007-10-16 23:30:39 -07:00
while ( ! list_empty ( delaying_queue ) ) {
2010-10-21 11:49:30 +11:00
inode = wb_inode ( delaying_queue - > prev ) ;
2020-05-29 16:08:58 +02:00
if ( inode_dirtied_after ( inode , dirtied_before ) )
2007-10-16 23:30:39 -07:00
break ;
2022-05-24 08:05:40 -07:00
spin_lock ( & inode - > i_lock ) ;
2015-03-04 14:07:22 -05:00
list_move ( & inode - > i_io_list , & tmp ) ;
2013-07-09 22:36:45 +08:00
moved + + ;
2020-05-29 15:05:22 +02:00
inode - > i_state | = I_SYNC_QUEUED ;
spin_unlock ( & inode - > i_lock ) ;
2013-07-09 22:36:45 +08:00
if ( sb_is_blkdev_sb ( inode - > i_sb ) )
continue ;
2009-09-24 15:12:57 +02:00
if ( sb & & sb ! = inode - > i_sb )
do_sb_sort = 1 ;
sb = inode - > i_sb ;
2009-09-24 14:42:33 +02:00
}
2009-09-24 15:12:57 +02:00
/* just one sb in list, splice to dispatch_queue and we're done */
if ( ! do_sb_sort ) {
list_splice ( & tmp , dispatch_queue ) ;
2011-04-23 12:27:27 -06:00
goto out ;
2009-09-24 15:12:57 +02:00
}
2022-05-24 08:05:40 -07:00
/*
* Although inode ' s i_io_list is moved from ' tmp ' to ' dispatch_queue ' ,
* we don ' t take inode - > i_lock here because it is just a pointless overhead .
* Inode is already marked as I_SYNC_QUEUED so writeback list handling is
* fully under our control .
*/
2009-09-24 14:42:33 +02:00
while ( ! list_empty ( & tmp ) ) {
2010-10-21 11:49:30 +11:00
sb = wb_inode ( tmp . prev ) - > i_sb ;
2009-09-24 14:42:33 +02:00
list_for_each_prev_safe ( pos , node , & tmp ) {
2010-10-21 11:49:30 +11:00
inode = wb_inode ( pos ) ;
2009-09-24 14:42:33 +02:00
if ( inode - > i_sb = = sb )
2015-03-04 14:07:22 -05:00
list_move ( & inode - > i_io_list , dispatch_queue ) ;
2009-09-24 14:42:33 +02:00
}
2007-10-16 23:30:39 -07:00
}
2011-04-23 12:27:27 -06:00
out :
return moved ;
2007-10-16 23:30:39 -07:00
}
/*
* Queue all expired dirty inodes for io , eldest first .
2010-08-11 14:17:42 -07:00
* Before
* newly dirtied b_dirty b_io b_more_io
* = = = = = = = = = = = = = > gf edc BA
* After
* newly dirtied b_dirty b_io b_more_io
* = = = = = = = = = = = = = > g fBAedc
* |
* + - - > dequeue for IO
2007-10-16 23:30:39 -07:00
*/
2020-05-29 16:08:58 +02:00
static void queue_io ( struct bdi_writeback * wb , struct wb_writeback_work * work ,
unsigned long dirtied_before )
2009-09-02 09:19:46 +02:00
{
2011-04-23 12:27:27 -06:00
int moved ;
2020-05-29 16:08:58 +02:00
unsigned long time_expire_jif = dirtied_before ;
2015-02-02 00:37:00 -05:00
2011-04-21 18:19:44 -06:00
assert_spin_locked ( & wb - > list_lock ) ;
2010-08-11 14:17:42 -07:00
list_splice_init ( & wb - > b_more_io , & wb - > b_io ) ;
2020-05-29 16:24:43 +02:00
moved = move_expired_inodes ( & wb - > b_dirty , & wb - > b_io , dirtied_before ) ;
2020-05-29 16:08:58 +02:00
if ( ! work - > for_sync )
time_expire_jif = jiffies - dirtytime_expire_interval * HZ ;
2015-02-02 00:37:00 -05:00
moved + = move_expired_inodes ( & wb - > b_dirty_time , & wb - > b_io ,
2020-05-29 16:24:43 +02:00
time_expire_jif ) ;
2015-05-22 17:13:45 -04:00
if ( moved )
wb_io_lists_populated ( wb ) ;
2020-05-29 16:08:58 +02:00
trace_writeback_queue_io ( wb , work , dirtied_before , moved ) ;
2009-09-02 09:19:46 +02:00
}
2010-03-05 09:21:37 +01:00
static int write_inode ( struct inode * inode , struct writeback_control * wbc )
2007-10-16 23:30:39 -07:00
{
2013-01-11 13:06:37 -08:00
int ret ;
if ( inode - > i_sb - > s_op - > write_inode & & ! is_bad_inode ( inode ) ) {
trace_writeback_write_inode_start ( inode , wbc ) ;
ret = inode - > i_sb - > s_op - > write_inode ( inode , wbc ) ;
trace_writeback_write_inode ( inode , wbc ) ;
return ret ;
}
2009-09-09 09:08:54 +02:00
return 0 ;
2007-10-16 23:30:39 -07:00
}
2005-04-16 15:20:36 -07:00
/*
2012-05-03 14:48:03 +02:00
* Wait for writeback on an inode to complete . Called with i_lock held .
* Caller must make sure inode cannot go away when we drop i_lock .
2009-06-08 13:35:40 +02:00
*/
2012-05-03 14:48:03 +02:00
static void __inode_wait_for_writeback ( struct inode * inode )
__releases ( inode - > i_lock )
__acquires ( inode - > i_lock )
2009-06-08 13:35:40 +02:00
{
DEFINE_WAIT_BIT ( wq , & inode - > i_state , __I_SYNC ) ;
wait_queue_head_t * wqh ;
wqh = bit_waitqueue ( & inode - > i_state , __I_SYNC ) ;
2011-03-22 22:23:36 +11:00
while ( inode - > i_state & I_SYNC ) {
spin_unlock ( & inode - > i_lock ) ;
sched: Remove proliferation of wait_on_bit() action functions
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().
So:
Rename wait_on_bit and wait_on_bit_lock to
wait_on_bit_action and wait_on_bit_lock_action
to make it explicit that they need an action function.
Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
which are *not* given an action function but implicitly use
a standard one.
The decision to error-out if a signal is pending is now made
based on the 'mode' argument rather than being encoded in the action
function.
All instances of the old wait_on_bit and wait_on_bit_lock which
can use the new version have been changed accordingly and their
action functions have been discarded.
wait_on_bit{_lock} does not return any specific error code in the
event of a signal so the caller must check for non-zero and
interpolate their own error code as appropriate.
The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"
The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.
A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack. So
the distinction will still be visible, only with different
function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).
Since first version of this patch (against 3.15) two new action
functions appeared, one in NFS and one in CIFS. CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-07 15:16:04 +10:00
__wait_on_bit ( wqh , & wq , bit_wait ,
TASK_UNINTERRUPTIBLE ) ;
2011-03-22 22:23:36 +11:00
spin_lock ( & inode - > i_lock ) ;
2010-05-24 14:32:38 -07:00
}
2009-06-08 13:35:40 +02:00
}
2012-05-03 14:48:03 +02:00
/*
* Wait for writeback on an inode to complete . Caller must have inode pinned .
*/
void inode_wait_for_writeback ( struct inode * inode )
{
spin_lock ( & inode - > i_lock ) ;
__inode_wait_for_writeback ( inode ) ;
spin_unlock ( & inode - > i_lock ) ;
}
/*
* Sleep until I_SYNC is cleared . This function must be called with i_lock
* held and drops it . It is intended for callers not holding any inode reference ,
* so once i_lock is dropped , the inode can go away .
*/
static void inode_sleep_on_writeback ( struct inode * inode )
__releases ( inode - > i_lock )
{
DEFINE_WAIT ( wait ) ;
wait_queue_head_t * wqh = bit_waitqueue ( & inode - > i_state , __I_SYNC ) ;
int sleep ;
prepare_to_wait ( wqh , & wait , TASK_UNINTERRUPTIBLE ) ;
sleep = inode - > i_state & I_SYNC ;
spin_unlock ( & inode - > i_lock ) ;
if ( sleep )
schedule ( ) ;
finish_wait ( wqh , & wait ) ;
}
2012-05-03 14:47:58 +02:00
/*
* Find proper writeback list for the inode depending on its current state and
* possibly also change of its state while we were doing writeback . Here we
* handle things such as livelock prevention or fairness of writeback among
* inodes . This function can be called only by flusher thread - no one else
* processes all inodes in writeback lists and requeueing inodes behind flusher
* thread ' s back can have unexpected consequences .
*/
static void requeue_inode ( struct inode * inode , struct bdi_writeback * wb ,
struct writeback_control * wbc )
{
if ( inode - > i_state & I_FREEING )
return ;
/*
* Sync livelock prevention . Each inode is tagged and synced in one
* shot . If still dirty , it will be redirty_tail ( ) ' ed below . Update
* the dirty time to prevent it from being enqueued and synced again .
*/
if ( ( inode - > i_state & I_DIRTY ) & &
( wbc - > sync_mode = = WB_SYNC_ALL | | wbc - > tagged_writepages ) )
inode - > dirtied_when = jiffies ;
2012-05-03 14:48:00 +02:00
if ( wbc - > pages_skipped ) {
/*
2023-09-15 22:51:31 -06:00
* Writeback is not making progress due to locked buffers .
* Skip this inode for now . Although having skipped pages
* is odd for clean inodes , it can happen for some
* filesystems so handle that gracefully .
2012-05-03 14:48:00 +02:00
*/
2023-09-15 22:51:31 -06:00
if ( inode - > i_state & I_DIRTY_ALL )
redirty_tail_locked ( inode , wb ) ;
else
inode_cgwb_move_to_attached ( inode , wb ) ;
2012-05-03 14:48:00 +02:00
return ;
}
2012-05-03 14:47:58 +02:00
if ( mapping_tagged ( inode - > i_mapping , PAGECACHE_TAG_DIRTY ) ) {
/*
* We didn ' t write back all the pages . nfs_writepages ( )
* sometimes bails out without doing anything .
*/
if ( wbc - > nr_to_write < = 0 ) {
/* Slice used up. Queue for next turn. */
requeue_io ( inode , wb ) ;
} else {
/*
* Writeback blocked by something other than
* congestion . Delay the inode for some time to
* avoid spinning on the CPU ( 100 % iowait )
* retrying writeback of the dirty page / inode
* that cannot be performed immediately .
*/
2020-06-10 17:36:03 +02:00
redirty_tail_locked ( inode , wb ) ;
2012-05-03 14:47:58 +02:00
}
} else if ( inode - > i_state & I_DIRTY ) {
/*
* Filesystems can dirty the inode during writeback operations ,
* such as delayed allocation during submission or metadata
* updates after data IO completion .
*/
2020-06-10 17:36:03 +02:00
redirty_tail_locked ( inode , wb ) ;
2015-02-02 00:37:00 -05:00
} else if ( inode - > i_state & I_DIRTY_TIME ) {
2015-03-17 12:23:19 -04:00
inode - > dirtied_when = jiffies ;
2015-03-04 14:07:22 -05:00
inode_io_list_move_locked ( inode , wb , & wb - > b_dirty_time ) ;
2020-05-29 15:05:22 +02:00
inode - > i_state & = ~ I_SYNC_QUEUED ;
2012-05-03 14:47:58 +02:00
} else {
/* The inode is clean. Remove from writeback lists. */
2021-06-28 19:35:53 -07:00
inode_cgwb_move_to_attached ( inode , wb ) ;
2012-05-03 14:47:58 +02:00
}
}
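
/*
 * Illustrative sketch only, not part of the upstream file: the order in
 * which requeue_inode() picks a destination list can be modelled as a pure
 * function over a few booleans.  All sketch_* names are hypothetical, and
 * the dirtied_when update and the I_SYNC_QUEUED handling are deliberately
 * left out, so this is a reading aid rather than a replacement.
 */
enum sketch_dest {
	SKETCH_NONE,		/* I_FREEING: the inode is left alone */
	SKETCH_REDIRTY_TAIL,	/* redirty_tail_locked() */
	SKETCH_REQUEUE_IO,	/* requeue_io() */
	SKETCH_DIRTY_TIME,	/* moved to b_dirty_time */
	SKETCH_ATTACHED,	/* inode_cgwb_move_to_attached() */
};

static enum sketch_dest sketch_requeue_dest(int freeing, int pages_skipped,
					     int mapping_dirty, long nr_to_write,
					     int dirty, int dirty_time)
{
	if (freeing)
		return SKETCH_NONE;
	/* Skipped pages: park the inode unless it turned out to be clean. */
	if (pages_skipped)
		return (dirty || dirty_time) ? SKETCH_REDIRTY_TAIL
					     : SKETCH_ATTACHED;
	/* Dirty pages remain: retry soon only if the slice was used up. */
	if (mapping_dirty)
		return nr_to_write <= 0 ? SKETCH_REQUEUE_IO
					: SKETCH_REDIRTY_TAIL;
	if (dirty)
		return SKETCH_REDIRTY_TAIL;
	if (dirty_time)
		return SKETCH_DIRTY_TIME;
	return SKETCH_ATTACHED;
}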

/*
 * Write out an inode and its dirty pages (or some of its dirty pages, depending
 * on @wbc->nr_to_write), and clear the relevant dirty flags from i_state.
 *
 * This doesn't remove the inode from the writeback list it is on, except
 * potentially to move it from b_dirty_time to b_dirty due to timestamp
 * expiration.  The caller is otherwise responsible for writeback list handling.
 *
 * The caller is also responsible for setting the I_SYNC flag beforehand and
 * calling inode_sync_complete() to clear it afterwards.
 */
static int
__writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
{
	struct address_space *mapping = inode->i_mapping;
	long nr_to_write = wbc->nr_to_write;
	unsigned dirty;
	int ret;

	WARN_ON(!(inode->i_state & I_SYNC));

	trace_writeback_single_inode_start(inode, wbc, nr_to_write);

	ret = do_writepages(mapping, wbc);

	/*
	 * Make sure to wait on the data before writing out the metadata.
	 * This is important for filesystems that modify metadata on data
	 * I/O completion. We don't do it for sync(2) writeback because it has a
	 * separate, external IO completion path and ->sync_fs for guaranteeing
	 * inode metadata is written back correctly.
	 */
	if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
		int err = filemap_fdatawait(mapping);
		if (ret == 0)
			ret = err;
	}

/*
fs: fix lazytime expiration handling in __writeback_single_inode()
When lazytime is enabled and an inode is being written due to its
in-memory updated timestamps having expired, either due to a sync() or
syncfs() system call or due to dirtytime_expire_interval having elapsed,
the VFS needs to inform the filesystem so that the filesystem can copy
the inode's timestamps out to the on-disk data structures.
This is done by __writeback_single_inode() calling
mark_inode_dirty_sync(), which then calls ->dirty_inode(I_DIRTY_SYNC).
However, this occurs after __writeback_single_inode() has already
cleared the dirty flags from ->i_state. This causes two bugs:
- mark_inode_dirty_sync() redirties the inode, causing it to remain
dirty. This wastefully causes the inode to be written twice. But
more importantly, it breaks cases where sync_filesystem() is expected
to clean dirty inodes. This includes the FS_IOC_REMOVE_ENCRYPTION_KEY
ioctl (as reported at
https://lore.kernel.org/r/20200306004555.GB225345@gmail.com), as well
as possibly filesystem freezing (freeze_super()).
- Since ->i_state doesn't contain I_DIRTY_TIME when ->dirty_inode() is
called from __writeback_single_inode() for lazytime expiration,
xfs_fs_dirty_inode() ignores the notification. (XFS only cares about
lazytime expirations, and it assumes that i_state will contain
I_DIRTY_TIME during those.) Therefore, lazy timestamps aren't
persisted by sync(), syncfs(), or dirtytime_expire_interval on XFS.
Fix this by moving the call to mark_inode_dirty_sync() to earlier in
__writeback_single_inode(), before the dirty flags are cleared from
i_state. This makes filesystems be properly notified of the timestamp
expiration, and it avoids incorrectly redirtying the inode.
This fixes xfstest generic/580 (which tests
FS_IOC_REMOVE_ENCRYPTION_KEY) when run on ext4 or f2fs with lazytime
enabled. It also fixes the new lazytime xfstest I've proposed, which
reproduces the above-mentioned XFS bug
(https://lore.kernel.org/r/20210105005818.92978-1-ebiggers@kernel.org).
Alternatively, we could call ->dirty_inode(I_DIRTY_SYNC) directly. But
due to the introduction of I_SYNC_QUEUED, mark_inode_dirty_sync() is the
right thing to do because mark_inode_dirty_sync() now knows not to move
the inode to a writeback list if it is currently queued for sync.
Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option")
Cc: stable@vger.kernel.org
Depends-on: 5afced3bf281 ("writeback: Avoid skipping inode writeback")
Link: https://lore.kernel.org/r/20210112190253.64307-2-ebiggers@kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-01-12 11:02:43 -08:00
	 * If the inode has dirty timestamps and we need to write them, call
	 * mark_inode_dirty_sync() to notify the filesystem about it and to
	 * change I_DIRTY_TIME into I_DIRTY_SYNC.
	 */
	if ((inode->i_state & I_DIRTY_TIME) &&
	    (wbc->sync_mode == WB_SYNC_ALL ||
	     time_after(jiffies, inode->dirtied_time_when +
			dirtytime_expire_interval * HZ))) {
		trace_writeback_lazytime(inode);
		mark_inode_dirty_sync(inode);
	}

	/*
	 * Get and clear the dirty flags from i_state.  This needs to be done
	 * after calling writepages because some filesystems may redirty the
	 * inode during writepages due to delalloc.  It also needs to be done
	 * after handling timestamp expiration, as that may dirty the inode too.
	 */
	spin_lock(&inode->i_lock);
	dirty = inode->i_state & I_DIRTY;
	inode->i_state &= ~dirty;

	/*
	 * Paired with smp_mb() in __mark_inode_dirty().  This allows
	 * __mark_inode_dirty() to test i_state without grabbing i_lock -
	 * either they see the I_DIRTY bits cleared or we see the dirtied
	 * inode.
	 *
	 * I_DIRTY_PAGES is always cleared together above even if @mapping
	 * still has dirty pages.  The flag is reinstated after smp_mb() if
	 * necessary.  This guarantees that either __mark_inode_dirty()
	 * sees clear I_DIRTY_PAGES or we see PAGECACHE_TAG_DIRTY.
	 */
	smp_mb();

	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
		inode->i_state |= I_DIRTY_PAGES;
	else if (unlikely(inode->i_state & I_PINNING_NETFS_WB)) {
vfs, fscache: Implement pinning of cache usage for writeback
Cachefiles has a problem in that it needs to keep the backing file for a
cookie open whilst there are local modifications pending that need to be
written to it. However, we don't want to keep the file open indefinitely,
as that causes EMFILE/ENFILE/ENOMEM problems.
Reopening the cache file, however, is a problem if this is being done due
to writeback triggered by exit(). Some filesystems will oops if we try to
open a file in that context because they want to access current->fs or
other resources that have already been dismantled.
To get around this, I added the following:
(1) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network filesystem
inode to indicate that we have a usage count on the cookie caching
that inode.
(2) A flag in struct writeback_control, unpinned_fscache_wb, that is set
when __writeback_single_inode() clears the last dirty page from
i_pages - at which point it clears I_PINNING_FSCACHE_WB and sets this
flag.
This has to be done here so that clearing I_PINNING_FSCACHE_WB can be
done atomically with the check of PAGECACHE_TAG_DIRTY that clears
I_DIRTY_PAGES.
(3) A function, fscache_set_page_dirty(), which if it is not set, sets
I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to pin the cache
resources.
(4) A function, fscache_unpin_writeback(), to be called by ->write_inode()
to unuse the cookie.
(5) A function, fscache_clear_inode_writeback(), to be called when the
inode is evicted, before clear_inode() is called. This cleans up any
lingering I_PINNING_FSCACHE_WB.
The network filesystem can then use these tools to make sure that
fscache_write_to_cache() can write locally modified data to the cache as
well as to the server.
For the future, I'm working on write helpers for netfs lib that should
allow this facility to be removed by keeping track of the dirty regions
separately - but that's incomplete at the moment and is also going to be
affected by folios, one way or another, since it deals with pages
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819615157.215744.17623791756928043114.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906917856.143852.8224898306177154573.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967124567.1823006.14188359004568060298.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021524705.640689.17824932021727663017.stgit@warthog.procyon.org.uk/ # v4
2021-10-20 23:50:01 +01:00
		if (!(inode->i_state & I_DIRTY_PAGES)) {
			inode->i_state &= ~I_PINNING_NETFS_WB;
			wbc->unpinned_netfs_wb = true;
			dirty |= I_PINNING_NETFS_WB; /* Cause write_inode */
		}
	}

	spin_unlock(&inode->i_lock);

	/* Don't write the inode if only I_DIRTY_PAGES was set */
	if (dirty & ~I_DIRTY_PAGES) {
		int err = write_inode(inode, wbc);
		if (ret == 0)
			ret = err;
	}
	wbc->unpinned_netfs_wb = false;
	trace_writeback_single_inode(inode, wbc, nr_to_write);
	return ret;
}
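
/*
 * Illustrative sketch only, not part of the upstream file: the lazytime
 * condition checked in __writeback_single_inode(), restated over plain
 * integers.  sketch_time_after() mirrors the wraparound-safe comparison that
 * time_after() performs on jiffies.  Every sketch_* name and parameter here
 * is hypothetical; hz stands in for HZ and expire_interval for
 * dirtytime_expire_interval.
 */
static int sketch_time_after(unsigned long a, unsigned long b)
{
	/* True if a is later than b, even when the counter has wrapped. */
	return (long)(b - a) < 0;
}

static int sketch_lazytime_due(int dirty_time, int sync_all,
			       unsigned long now,
			       unsigned long dirtied_time_when,
			       unsigned long expire_interval,
			       unsigned long hz)
{
	/* Timestamps are flushed on a sync, or once the interval has passed. */
	return dirty_time &&
	       (sync_all ||
		sketch_time_after(now, dirtied_time_when + expire_interval * hz));
}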

/*
 * Write out an inode's dirty data and metadata on-demand, i.e. separately from
 * the regular batched writeback done by the flusher threads in
 * writeback_sb_inodes().  @wbc controls various aspects of the write, such as
 * whether it is a data-integrity sync (%WB_SYNC_ALL) or not (%WB_SYNC_NONE).
 *
 * To prevent the inode from going away, either the caller must have a reference
 * to the inode, or the inode must have I_WILL_FREE or I_FREEING set.
 */
static int writeback_single_inode(struct inode *inode,
				  struct writeback_control *wbc)
{
	struct bdi_writeback *wb;
	int ret = 0;

	spin_lock(&inode->i_lock);
	if (!atomic_read(&inode->i_count))
		WARN_ON(!(inode->i_state & (I_WILL_FREE | I_FREEING)));
	else
		WARN_ON(inode->i_state & I_WILL_FREE);

	if (inode->i_state & I_SYNC) {
		/*
		 * Writeback is already running on the inode.  For WB_SYNC_NONE,
		 * that's enough and we can just return.  For WB_SYNC_ALL, we
		 * must wait for the existing writeback to complete, then do
		 * writeback again if there's anything left.
		 */
		if (wbc->sync_mode != WB_SYNC_ALL)
			goto out;
		__inode_wait_for_writeback(inode);
	}
	WARN_ON(inode->i_state & I_SYNC);
	/*
	 * If the inode is already fully clean, then there's nothing to do.
	 *
	 * For data-integrity syncs we also need to check whether any pages are
	 * still under writeback, e.g. due to prior WB_SYNC_NONE writeback.  If
	 * there are any such pages, we'll need to wait for them.
	 */
	if (!(inode->i_state & I_DIRTY_ALL) &&
	    (wbc->sync_mode != WB_SYNC_ALL ||
	     !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK)))
		goto out;
	inode->i_state |= I_SYNC;
	wbc_attach_and_unlock_inode(wbc, inode);

	ret = __writeback_single_inode(inode, wbc);

	wbc_detach_inode(wbc);

	wb = inode_to_wb_and_lock_list(inode);
	spin_lock(&inode->i_lock);
	/*
	 * If the inode is freeing, its i_io_list shouldn't be updated
	 * as it can be finally deleted at this moment.
	 */
	if (!(inode->i_state & I_FREEING)) {
		/*
		 * If the inode is now fully clean, then it can be safely
		 * removed from its writeback list (if any). Otherwise the
		 * flusher threads are responsible for the writeback lists.
		 */
		if (!(inode->i_state & I_DIRTY_ALL))
			inode_cgwb_move_to_attached(inode, wb);
		else if (!(inode->i_state & I_SYNC_QUEUED)) {
			if (inode->i_state & I_DIRTY)
				redirty_tail_locked(inode, wb);
			else if (inode->i_state & I_DIRTY_TIME) {
				inode->dirtied_when = jiffies;
				inode_io_list_move_locked(inode,
							  wb,
							  &wb->b_dirty_time);
			}
		}
	}

	spin_unlock(&wb->list_lock);
	inode_sync_complete(inode);
out:
	spin_unlock(&inode->i_lock);
	return ret;
}
writeback: move bandwidth related fields from backing_dev_info into bdi_writeback
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.
This patch moves bandwidth related fields from backing_dev_info into
bdi_writeback.
* The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
balanced_dirty_ratelimit, completions and dirty_exceeded.
* writeback_chunk_size() and over_bground_thresh() now take @wb
instead of @bdi.
* bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...)
bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...)
bdi_position_ration(bdi, ...) -> wb_position_ratio(wb, ...)
bdi_update_writebandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
[__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...)
bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...)
bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...)
* Init/exits of the relocated fields are moved to bdi_wb_init/exit()
respectively. Note that explicit zeroing is dropped in the process
as wb's are cleared in entirety anyway.
* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
introducing no behavior changes.
v2: Typo in description fixed as suggested by Jan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 17:13:28 -04:00
static long writeback_chunk_size(struct bdi_writeback *wb,
				 struct wb_writeback_work *work)
{
	long pages;

	/*
	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
	 * here avoids calling into writeback_inodes_wb() more than once.
	 *
	 * The intended call sequence for WB_SYNC_ALL writeback is:
	 *
	 *      wb_writeback()
	 *          writeback_sb_inodes()       <== called only once
	 *              write_cache_pages()     <== called once for each inode
	 *                   (quickly) tag currently dirty pages
	 *                   (maybe slowly) sync all tagged pages
	 */
	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
		pages = LONG_MAX;
	else {
		pages = min(wb->avg_write_bandwidth / 2,
			    global_wb_domain.dirty_limit / DIRTY_SCOPE);
		pages = min(pages, work->nr_pages);
		pages = round_down(pages + MIN_WRITEBACK_PAGES,
				   MIN_WRITEBACK_PAGES);
	}

	return pages;
}
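
/*
 * Illustrative sketch only, not part of the upstream file: the arithmetic of
 * the non-sync branch of writeback_chunk_size() on plain longs.  The
 * sketch_* names are hypothetical; dirty_scope stands in for DIRTY_SCOPE and
 * min_pages for MIN_WRITEBACK_PAGES, which is assumed here to be a power of
 * two, as round_down() requires.  With min_pages = 1024, an estimate of 100
 * pages yields a 1024-page chunk and an estimate of 3000 pages yields 3072,
 * so the result is never smaller than min_pages.
 */
static long sketch_min(long a, long b)
{
	return a < b ? a : b;
}

static long sketch_chunk_size(long avg_write_bandwidth, long dirty_limit,
			      long dirty_scope, long nr_pages, long min_pages)
{
	long pages = sketch_min(avg_write_bandwidth / 2,
				dirty_limit / dirty_scope);

	pages = sketch_min(pages, nr_pages);
	/* Same result as round_down(pages + min_pages, min_pages). */
	return (pages + min_pages) & ~(min_pages - 1);
}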

/*
 * Write a portion of b_io inodes which belong to @sb.
 *
 * Return the number of pages and/or inodes written.
Revert "writeback: plug writeback at a high level"
This reverts commit d353d7587d02116b9732d5c06615aed75a4d3a47.
Doing the block layer plug/unplug inside writeback_sb_inodes() is
broken, because that function is actually called with a spinlock held:
wb->list_lock, as pointed out by Chris Mason.
Chris suggested just dropping and re-taking the spinlock around the
blk_finish_plug() call (the plgging itself can happen under the
spinlock), and that would technically work, but is just disgusting.
We do something fairly similar - but not quite as disgusting because we
at least have a better reason for it - in writeback_single_inode(), so
it's not like the caller can depend on the lock being held over the
call, but in this case there just isn't any good reason for that
"release and re-take the lock" pattern.
[ In general, we should really strive to avoid the "release and retake"
pattern for locks, because in the general case it can easily cause
subtle bugs when the caller caches any state around the call that
might be invalidated by dropping the lock even just temporarily. ]
But in this case, the plugging should be easy to just move up to the
callers before the spinlock is taken, which should even improve the
effectiveness of the plug. So there is really no good reason to play
games with locking here.
I'll send off a test-patch so that Dave Chinner can verify that that
plug movement works. In the meantime this just reverts the problematic
commit and adds a comment to the function so that we hopefully don't
make this mistake again.
Reported-by: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-11 13:26:39 -07:00
*
* NOTE ! This is called with wb - > list_lock held , and will
* unlock and relock that for each inode it ends up doing
* IO for .
2010-03-11 14:09:47 -08:00
*/
2011-05-04 19:54:37 -06:00
static long writeback_sb_inodes ( struct super_block * sb ,
struct bdi_writeback * wb ,
struct wb_writeback_work * work )
2005-04-16 15:20:36 -07:00
{
2011-05-04 19:54:37 -06:00
struct writeback_control wbc = {
. sync_mode = work - > sync_mode ,
. tagged_writepages = work - > tagged_writepages ,
. for_kupdate = work - > for_kupdate ,
. for_background = work - > for_background ,
2013-07-02 22:38:35 +10:00
. for_sync = work - > for_sync ,
2011-05-04 19:54:37 -06:00
. range_cyclic = work - > range_cyclic ,
. range_start = 0 ,
. range_end = LLONG_MAX ,
} ;
unsigned long start_time = jiffies ;
long write_chunk ;
2022-05-10 21:38:05 +08:00
long total_wrote = 0 ; /* count both pages and inodes */
2011-05-04 19:54:37 -06:00
2009-09-09 09:08:54 +02:00
while ( ! list_empty ( & wb - > b_io ) ) {
2010-10-21 11:49:30 +11:00
struct inode * inode = wb_inode ( wb - > b_io . prev ) ;
2016-03-18 13:52:04 -04:00
struct bdi_writeback * tmp_wb ;
2022-05-10 21:38:05 +08:00
long wrote ;
2010-06-10 12:07:54 +02:00
if ( inode - > i_sb ! = sb ) {
2011-05-04 19:54:37 -06:00
if ( work - > sb ) {
2010-06-10 12:07:54 +02:00
/*
* We only want to write back data for this
* superblock , move all inodes not belonging
* to it back onto the dirty list .
*/
2011-04-21 18:19:44 -06:00
redirty_tail ( inode , wb ) ;
2010-06-10 12:07:54 +02:00
continue ;
}
/*
* The inode belongs to a different superblock .
* Bounce back to the caller to unpin this and
* pin the next superblock .
*/
2011-05-04 19:54:37 -06:00
break ;
2010-06-10 12:07:54 +02:00
}
2010-10-24 19:40:46 +02:00
/*
2012-06-09 11:10:55 +08:00
* Don ' t bother with new inodes or inodes being freed : the first
* kind does not need periodic writeout yet , and for the latter
2010-10-24 19:40:46 +02:00
* kind writeout is handled by the freer .
*/
2011-03-22 22:23:36 +11:00
spin_lock ( & inode - > i_lock ) ;
2010-10-24 19:40:46 +02:00
if ( inode - > i_state & ( I_NEW | I_FREEING | I_WILL_FREE ) ) {
2020-06-10 17:36:03 +02:00
redirty_tail_locked ( inode , wb ) ;
2011-03-22 22:23:36 +11:00
spin_unlock ( & inode - > i_lock ) ;
fs: new inode i_state corruption fix
There was a report of a data corruption
http://lkml.org/lkml/2008/11/14/121. There is a script included to
reproduce the problem.
During testing, I encountered a number of strange things with ext3, so I
tried ext2 to attempt to reduce complexity of the problem. I found that
fsstress would quickly hang in wait_on_inode, waiting for I_LOCK to be
cleared, even though instrumentation showed that unlock_new_inode had
already been called for that inode. This points to a memory scribble or a
synchronisation problem.
i_state of I_NEW inodes is not protected by inode_lock because other
processes are not supposed to touch them until I_LOCK (and I_NEW) is
cleared. Adding WARN_ON(inode->i_state & I_NEW) to sites where we modify
i_state revealed that generic_sync_sb_inodes is picking up new inodes from
the inode lists and passing them to __writeback_single_inode without
waiting for I_NEW. Subsequently modifying i_state causes corruption. In
my case it would look like this:
CPU0                            CPU1
unlock_new_inode()              __sync_single_inode()
 reg <- inode->i_state
 reg -> reg & ~(I_LOCK|I_NEW)   reg <- inode->i_state
 reg -> inode->i_state          reg -> reg | I_SYNC
                                reg -> inode->i_state
Non-atomic RMW on CPU1 overwrites CPU0 store and sets I_LOCK|I_NEW again.
Fix for this is rather than wait for I_NEW inodes, just skip over them:
inodes concurrently being created are not subject to data integrity
operations, and should not significantly contribute to dirty memory
either.
After this change, I'm unable to reproduce any of the added warnings or
hangs after ~1hour of running. Previously, the new warnings would start
immediately and hang would happen in under 5 minutes.
I'm also testing on ext3 now, and so far no problems there either. I
don't know whether this fixes the problem reported above, but it fixes a
real problem for me.
Cc: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
Reported-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-12 14:31:38 -07:00
continue ;
}
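The lost update described in the commit message for the I_NEW check above can be replayed deterministically in userspace. This is a minimal sketch, not kernel code: the DEMO_* flag values are made up for illustration and the two "CPUs" are just two local registers stepped in the order given in the message.

```c
#include <assert.h>

/* Illustrative flag values only, not the kernel's actual bit assignments. */
#define DEMO_I_NEW	0x1u
#define DEMO_I_LOCK	0x2u
#define DEMO_I_SYNC	0x4u

/*
 * Replay the interleaving: each "CPU" loads i_state into a private
 * register, modifies the register, and stores it back without any
 * common lock.
 */
static unsigned int demo_lost_update(unsigned int i_state)
{
	unsigned int reg0, reg1;

	assert(i_state & DEMO_I_NEW);		/* scenario starts with a new inode */

	reg0 = i_state;				/* CPU0: load */
	reg0 &= ~(DEMO_I_LOCK | DEMO_I_NEW);	/* CPU0: clear I_LOCK|I_NEW */
	reg1 = i_state;				/* CPU1: load the stale value */
	i_state = reg0;				/* CPU0: store */
	reg1 |= DEMO_I_SYNC;			/* CPU1: set I_SYNC */
	i_state = reg1;				/* CPU1: store, overwriting CPU0 */

	return i_state;
}

/* Simplified form of the skip test: only I_NEW is modelled here. */
static int demo_should_skip(unsigned int i_state)
{
	return (i_state & DEMO_I_NEW) != 0;
}
```

With the skip test in place the writeback side never performs its read-modify-write on a new inode, so the stale store cannot happen.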
2012-05-03 14:47:56 +02:00
if ( ( inode - > i_state & I_SYNC ) & & wbc . sync_mode ! = WB_SYNC_ALL ) {
/*
* If this inode is locked for writeback and we are not
* doing writeback - for - data - integrity , move it to
* b_more_io so that writeback can proceed with the
* other inodes on b_io .
*
* We ' ll have another go at writing back this inode
* when we completed a full scan of b_io .
*/
requeue_io ( inode , wb ) ;
2022-05-24 08:05:40 -07:00
spin_unlock ( & inode - > i_lock ) ;
2012-05-03 14:47:56 +02:00
trace_writeback_sb_inodes_requeue ( inode ) ;
continue ;
}
2012-05-03 14:47:59 +02:00
spin_unlock ( & wb - > list_lock ) ;
2012-05-03 14:48:00 +02:00
/*
* We already requeued the inode if it had I_SYNC set and we
* are doing WB_SYNC_NONE writeback . So this catches only the
* WB_SYNC_ALL case .
*/
2012-05-03 14:48:03 +02:00
if ( inode - > i_state & I_SYNC ) {
/* Wait for I_SYNC. This function drops i_lock... */
inode_sleep_on_writeback ( inode ) ;
/* Inode may be gone, start again */
2012-06-08 17:07:36 +02:00
spin_lock ( & wb - > list_lock ) ;
2012-05-03 14:48:03 +02:00
continue ;
}
2012-05-03 14:48:00 +02:00
inode - > i_state | = I_SYNC ;
2015-06-02 08:39:48 -06:00
wbc_attach_and_unlock_inode ( & wbc , inode ) ;
2012-05-03 14:48:03 +02:00
writeback: move bandwidth related fields from backing_dev_info into bdi_writeback
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.
This patch moves bandwidth related fields from backing_dev_info into
bdi_writeback.
* The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
balanced_dirty_ratelimit, completions and dirty_exceeded.
* writeback_chunk_size() and over_bground_thresh() now take @wb
instead of @bdi.
* bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...)
bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...)
bdi_position_ration(bdi, ...) -> wb_position_ratio(wb, ...)
bdi_update_writebandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
[__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...)
bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...)
bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...)
* Init/exits of the relocated fields are moved to bdi_wb_init/exit()
respectively. Note that explicit zeroing is dropped in the process
as wb's are cleared in entirety anyway.
* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
introducing no behavior changes.
v2: Typo in description fixed as suggested by Jan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 17:13:28 -04:00
write_chunk = writeback_chunk_size ( wb , work ) ;
2011-05-04 19:54:37 -06:00
wbc . nr_to_write = write_chunk ;
wbc . pages_skipped = 0 ;
2011-03-22 22:23:36 +11:00
2012-05-03 14:48:03 +02:00
/*
* We use I_SYNC to pin the inode in memory . While it is set
* evict_inode ( ) will wait so the inode cannot be freed .
*/
2012-10-08 16:33:45 -07:00
__writeback_single_inode ( inode , & wbc ) ;
2011-03-22 22:23:36 +11:00
2015-06-02 08:39:48 -06:00
wbc_detach_inode ( & wbc ) ;
2011-05-04 19:54:37 -06:00
work - > nr_pages - = write_chunk - wbc . nr_to_write ;
2022-05-10 21:38:05 +08:00
wrote = write_chunk - wbc . nr_to_write - wbc . pages_skipped ;
wrote = wrote < 0 ? 0 : wrote ;
total_wrote + = wrote ;
2015-09-18 13:35:08 -04:00
if ( need_resched ( ) ) {
/*
* We ' re trying to balance between building up a nice
* long list of IOs to improve our merge rate , and
* getting those IOs out quickly for anyone throttling
* in balance_dirty_pages ( ) . cond_resched ( ) doesn ' t
* unplug , so get our IOs out the door before we
* give up the CPU .
*/
2022-01-27 08:05:49 +01:00
blk_flush_plug ( current - > plug , false ) ;
2015-09-18 13:35:08 -04:00
cond_resched ( ) ;
}
2016-03-18 13:52:04 -04:00
/*
* Requeue @ inode if still dirty . Be careful as @ inode may
* have been switched to another wb in the meantime .
*/
tmp_wb = inode_to_wb_and_lock_list ( inode ) ;
2012-05-03 14:48:00 +02:00
spin_lock ( & inode - > i_lock ) ;
2015-02-02 00:37:00 -05:00
if ( ! ( inode - > i_state & I_DIRTY_ALL ) )
2022-05-10 21:38:05 +08:00
total_wrote + + ;
2016-03-18 13:52:04 -04:00
requeue_inode ( inode , tmp_wb , & wbc ) ;
2012-05-03 14:48:00 +02:00
inode_sync_complete ( inode ) ;
2011-03-22 22:23:43 +11:00
spin_unlock ( & inode - > i_lock ) ;
2015-09-18 13:35:08 -04:00
2016-03-18 13:52:04 -04:00
if ( unlikely ( tmp_wb ! = wb ) ) {
spin_unlock ( & tmp_wb - > list_lock ) ;
spin_lock ( & wb - > list_lock ) ;
}
2011-05-04 19:54:37 -06:00
/*
* bail out to wb_writeback ( ) often enough to check
* background threshold and other termination conditions .
*/
2022-05-10 21:38:05 +08:00
if ( total_wrote ) {
2011-05-04 19:54:37 -06:00
if ( time_is_before_jiffies ( start_time + HZ / 10UL ) )
break ;
if ( work - > nr_pages < = 0 )
break ;
writeback: speed up writeback of big dirty files
After dirtying a 100M file, the normal behavior is to start the
writeback for all data after a 30s delay. But sometimes the following
happens instead:
- after 30s: ~4M
- after 5s: ~4M
- after 5s: all remaining 92M
Some analysis shows that the internal IO dispatch queues go like this:
        s_io            s_more_io
        -------------------------
1)      100M,1K         0
2)      1K              96M
3)      0               96M
1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write <= 0, so write more
3) 1K written, nr_to_write > 0, no more writes (BUG)
nr_to_write > 0 in (3) fools the upper layer into thinking that all data has
been written out. The big dirty file is actually still sitting in
s_more_io. We cannot simply splice s_more_io back to s_io as soon as s_io
becomes empty, and let the loop in generic_sync_sb_inodes() continue: this
may starve newly expired inodes in s_dirty. It is also not an option to
draw inodes from both s_more_io and s_dirty, and let the loop go on: this
might lead to livelocks, and might also starve other superblocks at sync
time (well, kupdate may still starve some superblocks, that's another bug).
We have to return when a full scan of s_io completes. So nr_to_write > 0
does not necessarily mean that "all data are written". This patch
introduces a flag writeback_control.more_io to indicate that more io should
be done. With it the big dirty file no longer has to wait for the next
kupdate invocation 5s later.
In sync_sb_inodes() we only set more_io on super_blocks we actually
visited. This avoids the interaction between two pdflush daemons.
Also in __sync_single_inode() we don't blindly keep requeuing the io if the
filesystem cannot progress. Failing to do so may lead to 100% iowait.
Tested-by: Mike Snitzer <snitzer@gmail.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-04 22:29:36 -08:00
}
2005-04-16 15:20:36 -07:00
}
2022-05-10 21:38:05 +08:00
return total_wrote ;
2010-03-11 14:09:47 -08:00
}
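The per-inode accounting in writeback_sb_inodes() above separates pages consumed from the budget and pages actually written. A minimal userspace sketch of that arithmetic follows; struct demo_wbc and demo_account() are hypothetical names that model only the two fields used here.

```c
#include <assert.h>

/* Models only the two writeback_control fields the arithmetic needs. */
struct demo_wbc {
	long nr_to_write;	/* budget left after the inode was written */
	long pages_skipped;	/* pages the filesystem could not write */
};

/*
 * The remaining work budget is charged with everything consumed from the
 * chunk, while the progress count excludes skipped pages and is clamped
 * at zero.
 */
static long demo_account(long write_chunk, const struct demo_wbc *wbc,
			 long *work_nr_pages)
{
	long consumed = write_chunk - wbc->nr_to_write;
	long wrote = consumed - wbc->pages_skipped;

	assert(wbc && work_nr_pages);
	*work_nr_pages -= consumed;
	return wrote < 0 ? 0 : wrote;
}
```

The clamp matters when skipped pages outnumber the consumed budget: progress is then reported as zero rather than negative.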
2011-05-04 19:54:37 -06:00
static long __writeback_inodes_wb ( struct bdi_writeback * wb ,
struct wb_writeback_work * work )
2010-03-11 14:09:47 -08:00
{
2011-05-04 19:54:37 -06:00
unsigned long start_time = jiffies ;
long wrote = 0 ;
2009-01-06 14:40:25 -08:00
2010-03-11 14:09:47 -08:00
while ( ! list_empty ( & wb - > b_io ) ) {
2010-10-21 11:49:30 +11:00
struct inode * inode = wb_inode ( wb - > b_io . prev ) ;
2010-03-11 14:09:47 -08:00
struct super_block * sb = inode - > i_sb ;
2009-09-24 15:25:11 +02:00
2023-08-18 16:00:49 +02:00
if ( ! super_trylock_shared ( sb ) ) {
2011-07-29 22:14:35 -06:00
/*
2023-08-18 16:00:49 +02:00
* super_trylock_shared ( ) may fail consistently due to
2011-07-29 22:14:35 -06:00
* s_umount being grabbed by someone else . Don ' t use
* requeue_io ( ) to avoid busy retrying the inode / sb .
*/
redirty_tail ( inode , wb ) ;
2010-06-10 12:07:54 +02:00
continue ;
2010-03-11 14:09:47 -08:00
}
2011-05-04 19:54:37 -06:00
wrote + = writeback_sb_inodes ( sb , wb , work ) ;
2015-02-19 20:19:35 +03:00
up_read ( & sb - > s_umount ) ;
2010-03-11 14:09:47 -08:00
2011-05-04 19:54:37 -06:00
/* refer to the same tests at the end of writeback_sb_inodes */
if ( wrote ) {
if ( time_is_before_jiffies ( start_time + HZ / 10UL ) )
break ;
if ( work - > nr_pages < = 0 )
break ;
}
2010-03-11 14:09:47 -08:00
}
2009-09-02 09:19:46 +02:00
/* Leave any unwritten inodes on b_io */
2011-05-04 19:54:37 -06:00
return wrote ;
2009-09-02 09:19:46 +02:00
}
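Both loops above bail out once some progress has been made and either roughly a tenth of a second has passed or the page budget is used up. The sketch below models that test in userspace, including a wraparound-safe tick comparison. DEMO_HZ is an assumed tick rate, and the demo_* helpers are hypothetical stand-ins for the kernel's jiffies macros.

```c
#include <assert.h>

#define DEMO_HZ 250UL	/* assumed tick rate, for illustration only */

/*
 * Wraparound-safe "a is after b" on an unsigned tick counter: the
 * difference is interpreted as signed, so the result stays correct
 * across counter overflow as long as the two values are less than half
 * the counter range apart.
 */
static int demo_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Same shape as the tests at the end of the two loops. */
static int demo_should_bail(long wrote, unsigned long now,
			    unsigned long start_time, long nr_pages_left)
{
	assert(wrote >= 0);
	if (!wrote)
		return 0;	/* no progress yet: keep scanning b_io */
	if (demo_time_after(now, start_time + DEMO_HZ / 10UL))
		return 1;	/* ran for more than about 100 ms */
	if (nr_pages_left <= 0)
		return 1;	/* page budget exhausted */
	return 0;
}
```

Because both checks are skipped while nothing has been written, a run of inodes that only get requeued does not end the scan early.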
2013-09-11 14:22:40 -07:00
static long writeback_inodes_wb ( struct bdi_writeback * wb , long nr_pages ,
2011-10-07 21:54:10 -06:00
enum wb_reason reason )
2010-06-10 12:07:54 +02:00
{
2011-05-04 19:54:37 -06:00
struct wb_writeback_work work = {
. nr_pages = nr_pages ,
. sync_mode = WB_SYNC_NONE ,
. range_cyclic = 1 ,
2011-10-07 21:54:10 -06:00
. reason = reason ,
2011-05-04 19:54:37 -06:00
} ;
2015-09-11 13:37:19 -07:00
struct blk_plug plug ;
2010-06-10 12:07:54 +02:00
2015-09-11 13:37:19 -07:00
blk_start_plug ( & plug ) ;
2011-04-21 18:19:44 -06:00
spin_lock ( & wb - > list_lock ) ;
writeback: refill b_io iff empty
There is no point to carry different refill policies between for_kupdate
and other type of works. Use a consistent "refill b_io iff empty" policy
which can guarantee fairness in an easy to understand way.
A b_io refill will set up a _fixed_ work set with all currently eligible
inodes and start a new round of walk through b_io. The "fixed" work set
means no new inodes will be added to the work set during the walk.
Only when a complete walk over b_io is done, new inodes that are
eligible at the time will be enqueued and the walk be started over.
This procedure provides fairness among the inodes because it guarantees
each inode to be synced once and only once in each round. So all inodes
will be free from starvation.
This change relies on wb_writeback() to keep retrying as long as we made
some progress on cleaning some pages and/or inodes. Without that ability,
the old logic for background works relies on aggressively queuing all
eligible inodes into b_io every time. But that's not a guarantee.
The test script below now completes slightly faster:
                2.6.39-rc3      2.6.39-rc3-dyn-expire+
------------------------------------------------
all elapsed     256.043         252.367
stddev           24.381          12.530
tar elapsed      30.097          28.808
dd  elapsed      13.214          11.782
#!/bin/zsh
cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/
umount /dev/sda7
mkfs.xfs -f /dev/sda7
mount /dev/sda7 /fs
echo 3 > /proc/sys/vm/drop_caches
tic=$(cat /proc/uptime|cut -d' ' -f2)
cd /fs
time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &
wait
sync
tac=$(cat /proc/uptime|cut -d' ' -f2)
echo elapsed: $((tac - tic))
It maintains roughly the same small vs. large file writeout shares, and
offers large files better chances to be written in nice 4M chunks.
Detailed analysis from Dave Chinner:
Let's say we have lots of inodes with 100 dirty pages being created,
and one large writeback going on. We expire 8 new inodes for every
1024 pages we write back.
With the old code, we do:
b_more_io (large inode) -> b_io (1l)
8 newly expired inodes -> b_io (1l, 8s)
writeback large inode 1024 pages -> b_more_io
b_more_io (large inode) -> b_io (8s, 1l)
8 newly expired inodes -> b_io (8s, 1l, 8s)
writeback 8 small inodes 800 pages
1 large inode 224 pages -> b_more_io
b_more_io (large inode) -> b_io (8s, 1l)
8 newly expired inodes -> b_io (8s, 1l, 8s)
.....
Your new code:
b_more_io (large inode) -> b_io (1l)
8 newly expired inodes -> b_io (1l, 8s)
writeback large inode 1024 pages -> b_more_io
(b_io == 8s)
writeback 8 small inodes 800 pages
b_io empty: (1800 pages written)
b_more_io (large inode) -> b_io (1l)
14 newly expired inodes -> b_io (1l, 14s)
writeback large inode 1024 pages -> b_more_io
(b_io == 14s)
writeback 10 small inodes 1000 pages
1 small inode 24 pages -> b_more_io (1l, 1s(24))
writeback 5 small inodes 500 pages
b_io empty: (2548 pages written)
b_more_io (large inode) -> b_io (1l, 1s(24))
20 newly expired inodes -> b_io (1l, 1s(24), 20s)
......
Rough progression of pages written at b_io refill:
Old code:
        total   large file      % of writeback
        1024    224             21.9% (fixed)
New code:
        total   large file      % of writeback
        1800    1024            ~55%
        2550    1024            ~40%
        3050    1024            ~33%
        3500    1024            ~29%
        3950    1024            ~26%
        4250    1024            ~24%
        4500    1024            ~22.7%
        4700    1024            ~21.7%
        4800    1024            ~21.3%
        4800    1024            ~21.3%
        (pretty much steady state from here)
Ok, so the steady state is reached with a similar percentage of
writeback to the large file as the existing code. Ok, that's good,
but providing some evidence that it doesn't change the share of
writeback to the large file should be in the commit message ;)
The other advantage to this is that we always write 1024 page chunks
to the large file, rather than smaller "whatever remains" chunks.
CC: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2010-07-21 20:11:53 -06:00
if ( list_empty ( & wb - > b_io ) )
2020-05-29 16:08:58 +02:00
queue_io ( wb , & work , jiffies ) ;
2011-05-04 19:54:37 -06:00
__writeback_inodes_wb ( wb , & work ) ;
2011-04-21 18:19:44 -06:00
spin_unlock ( & wb - > list_lock ) ;
2015-09-11 13:37:19 -07:00
blk_finish_plug ( & plug ) ;
2010-06-10 12:07:54 +02:00
2011-05-04 19:54:37 -06:00
return nr_pages - work . nr_pages ;
}
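The "refill b_io iff empty" policy used by writeback_inodes_wb() above can be modelled with two counters. This is a toy sketch under strong simplifications: inodes are reduced to counts, expiry times are ignored, and struct demo_wb and its helpers are hypothetical names.

```c
#include <assert.h>

/* Toy model: only the number of inodes on each list is tracked. */
struct demo_wb {
	int b_dirty;	/* dirty inodes not yet queued for IO */
	int b_io;	/* inodes in the current fixed work set */
};

/* Refill only when the previous work set has been fully walked. */
static int demo_refill_if_empty(struct demo_wb *wb)
{
	int moved;

	assert(wb && wb->b_io >= 0 && wb->b_dirty >= 0);
	if (wb->b_io != 0)
		return 0;	/* round still in progress: set stays fixed */
	moved = wb->b_dirty;
	wb->b_io = moved;
	wb->b_dirty = 0;
	return moved;
}

/* Newly dirtied inodes always wait on b_dirty for the next round. */
static void demo_mark_dirty(struct demo_wb *wb, int n)
{
	wb->b_dirty += n;
}

/* Write back one inode from the current work set, if there is one. */
static int demo_write_one(struct demo_wb *wb)
{
	if (wb->b_io == 0)
		return 0;
	wb->b_io--;
	return 1;
}
```

Inodes dirtied during a round are picked up only after the current set drains, which is the fairness property the commit message describes: each queued inode is visited once per round.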
2009-09-09 09:08:54 +02:00
/*
* Explicit flushing or periodic writeback of " old " data .
2009-09-02 09:19:46 +02:00
*
2009-09-09 09:08:54 +02:00
* Define " old " : the first time one of an inode ' s pages is dirtied , we mark the
* dirtying - time in the inode ' s address_space . So this periodic writeback code
* just walks the superblock inode list , writing back any inodes which are
* older than a specific point in time .
2009-09-02 09:19:46 +02:00
*
2009-09-09 09:08:54 +02:00
* Try to run once per dirty_writeback_interval . But if a writeback event
* takes longer than dirty_writeback_interval , then leave a
* one - second gap .
2009-09-02 09:19:46 +02:00
*
2020-05-29 16:08:58 +02:00
* dirtied_before takes precedence over nr_to_write . So we ' ll only write back
2009-09-09 09:08:54 +02:00
* all dirty pages if they are all attached to " old " mappings .
2009-09-02 09:19:46 +02:00
*/
2009-09-16 15:18:25 +02:00
static long wb_writeback ( struct bdi_writeback * wb ,
2010-07-06 08:59:53 +02:00
struct wb_writeback_work * work )
2009-09-02 09:19:46 +02:00
{
2011-05-04 19:54:37 -06:00
long nr_pages = work - > nr_pages ;
2020-05-29 16:08:58 +02:00
unsigned long dirtied_before = jiffies ;
2009-09-16 19:22:48 +02:00
struct inode * inode ;
2011-05-04 19:54:37 -06:00
long progress ;
2015-09-11 13:37:19 -07:00
struct blk_plug plug ;
2009-09-02 09:19:46 +02:00
2015-09-11 13:37:19 -07:00
blk_start_plug ( & plug ) ;
2009-09-09 09:08:54 +02:00
for ( ; ; ) {
/*
2009-09-23 20:33:40 +08:00
* Stop writeback when nr_pages has been consumed
2009-09-09 09:08:54 +02:00
*/
2010-07-06 08:59:53 +02:00
if ( work - > nr_pages < = 0 )
2009-09-09 09:08:54 +02:00
break ;
2009-09-02 09:19:46 +02:00
2011-01-13 15:45:47 -08:00
/*
* Background writeout and kupdate - style writeback may
* run forever . Stop them if there is other work to do
* so that e . g . sync can proceed . They ' ll be restarted
* after the other works are all done .
*/
if ( ( work - > for_background | | work - > for_kupdate ) & &
writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.
This patch moves bdi->wb_lock and ->worklist into wb.
* The lock protects bdi->worklist and bdi->wb.dwork scheduling. While
moving, rename it to wb->work_lock as wb->wb_lock is confusing.
Also, move wb->dwork downwards so that it's colocated with the new
->work_lock and ->work_list fields.
* bdi_writeback_workfn() -> wb_workfn()
bdi_wakeup_thread_delayed(bdi) -> wb_wakeup_delayed(wb)
bdi_wakeup_thread(bdi) -> wb_wakeup(wb)
bdi_queue_work(bdi, ...) -> wb_queue_work(wb, ...)
__bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...)
get_next_work_item(bdi) -> get_next_work_item(wb)
* bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
The function contained parts which belong to the containing bdi
rather than the wb itself - testing cap_writeback_dirty and
bdi_remove_from_list() invocation. Those are moved to
bdi_unregister().
* bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
Initializations of the moved bdi->wb_lock and ->work_list are
relocated from bdi_init() to wb_init().
* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->state are mechanically replaced with bdi->wb.state
introducing no behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 17:13:30 -04:00
! list_empty ( & wb - > work_list ) )
2011-01-13 15:45:47 -08:00
break ;
2009-01-06 14:40:25 -08:00
/*
2009-09-23 20:33:40 +08:00
* For background writeout , stop when we are below the
* background dirty threshold
2009-01-06 14:40:25 -08:00
*/
2015-05-22 18:23:31 -04:00
if ( work - > for_background & & ! wb_over_bg_thresh ( wb ) )
2009-09-09 09:08:54 +02:00
break ;
2009-01-06 14:40:25 -08:00
writeback: move wb_over_bg_thresh() call outside lock section
Patch series "cgroup: eliminate atomic rstat flushing", v5.
A previous patch series [1] changed most atomic rstat flushing contexts to
become non-atomic. This was done to avoid an expensive operation that
scales with # cgroups and # cpus to happen with irqs disabled and
scheduling not permitted. There were two remaining atomic flushing
contexts after that series. This series tries to eliminate them as well,
eliminating atomic rstat flushing completely.
The two remaining atomic flushing contexts are:
(a) wb_over_bg_thresh()->mem_cgroup_wb_stats()
(b) mem_cgroup_threshold()->mem_cgroup_usage()
For (a), flushing needs to be atomic as wb_writeback() calls
wb_over_bg_thresh() with a spinlock held. However, it seems like the call
to wb_over_bg_thresh() doesn't need to be protected by that spinlock, so
this series proposes a refactoring that moves the call outside the lock
critical section and makes the stats flushing in mem_cgroup_wb_stats()
non-atomic.
For (b), flushing needs to be atomic as mem_cgroup_threshold() is called
with irqs disabled. We only flush the stats when calculating the root
usage, as it is approximated as the sum of some memcg stats (file, anon,
and optionally swap) instead of the conventional page counter. This
series proposes changing this calculation to use the global stats instead,
eliminating the need for a memcg stat flush.
After these 2 contexts are eliminated, we no longer need
mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic(). We can
remove them and simplify the code.
[1] https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/
This patch (of 5):
wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
flush, which can be expensive on large systems. Currently,
wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
have to do the rstat flush atomically. On systems with a lot of
cpus and/or cgroups, this can cause us to disable irqs for a long time,
potentially causing problems.
Move the call to wb_over_bg_thresh() outside the lock section in
preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
The list_empty(&wb->work_list) check should be okay outside the lock
section of wb->list_lock as it is protected by a separate lock
(wb->work_lock), and wb_over_bg_thresh() doesn't seem like it is
modifying any of wb->b_* lists the wb->list_lock is protecting.
Also, the loop seems to be already releasing and reacquiring the
lock, so this refactoring looks safe.
Link: https://lkml.kernel.org/r/20230421174020.2994750-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230421174020.2994750-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21 17:40:16 +00:00
spin_lock ( & wb - > list_lock ) ;
2011-10-19 11:44:41 +02:00
/*
* Kupdate and background works are special and we want to
* include all inodes that need writing . Livelock avoidance is
* handled by these works yielding to any other work so we are
* safe .
*/
2010-07-21 20:32:30 -06:00
if ( work - > for_kupdate ) {
2020-05-29 16:08:58 +02:00
dirtied_before = jiffies -
2010-07-21 20:32:30 -06:00
msecs_to_jiffies ( dirty_expire_interval * 10 ) ;
2011-10-19 11:44:41 +02:00
} else if ( work - > for_background )
2020-05-29 16:08:58 +02:00
dirtied_before = jiffies ;
2010-07-07 13:24:07 +10:00
2015-08-18 14:54:56 -07:00
trace_writeback_start ( wb , work ) ;
2011-04-21 12:06:32 -06:00
if ( list_empty ( & wb - > b_io ) )
2020-05-29 16:08:58 +02:00
queue_io ( wb , work , dirtied_before ) ;
2010-07-06 08:59:53 +02:00
if ( work - > sb )
2011-05-04 19:54:37 -06:00
progress = writeback_sb_inodes ( work - > sb , wb , work ) ;
2010-06-10 12:07:54 +02:00
else
2011-05-04 19:54:37 -06:00
progress = __writeback_inodes_wb ( wb , work ) ;
2015-08-18 14:54:56 -07:00
trace_writeback_written ( wb , work ) ;
2010-07-07 13:24:07 +10:00
2009-09-09 09:08:54 +02:00
/*
writeback: try more writeback as long as something was written
writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
they only populate possibly a subset of eligible inodes into b_io at
entrance time. When the queued set of inodes are all synced, they just
return, possibly with all queued inode pages written but still
wbc.nr_to_write > 0.
For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty when the current set of b_io inodes are completed. So
it is necessary to try another round of writeback as long as we made some
progress in this round. When there are no more eligible inodes, no more
inodes will be enqueued in queue_io(), hence nothing could/will be
synced and we may safely bail.
For example, imagine 100 inodes
i0, i1, i2, ..., i90, i91, i99
At queue_io() time, i90-i99 happen to be expired and moved to s_io for
IO. When finished successfully, if their total size is less than
MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
quit the background work (w/o this patch) while it's still over
background threshold. This will be a fairly normal/frequent case I guess.
Now that we do tagged sync and update inode->dirtied_when after the sync,
this change won't livelock sync(1). I actually tried to write 1 page
per 1ms with this command
write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
and do sync(1) at the same time. The sync completes quickly on ext4,
xfs, btrfs.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2010-07-22 10:23:44 -06:00
* Did we write something ? Try for more
*
* Dirty inodes are moved to b_io for writeback in batches .
* The completion of the current batch does not necessarily
* mean the overall work is done . So we keep looping as long
* as we made some progress on cleaning pages or inodes .
2009-09-09 09:08:54 +02:00
*/
2023-04-21 17:40:16 +00:00
if ( progress ) {
spin_unlock ( & wb - > list_lock ) ;
2009-09-23 19:32:26 +02:00
continue ;
writeback: move wb_over_bg_thresh() call outside lock section
Patch series "cgroup: eliminate atomic rstat flushing", v5.
A previous patch series [1] changed most atomic rstat flushing contexts to
become non-atomic. This was done to avoid an expensive operation that
scales with # cgroups and # cpus to happen with irqs disabled and
scheduling not permitted. There were two remaining atomic flushing
contexts after that series. This series tries to eliminate them as well,
eliminating atomic rstat flushing completely.
The two remaining atomic flushing contexts are:
(a) wb_over_bg_thresh()->mem_cgroup_wb_stats()
(b) mem_cgroup_threshold()->mem_cgroup_usage()
For (a), flushing needs to be atomic as wb_writeback() calls
wb_over_bg_thresh() with a spinlock held. However, it seems like the call
to wb_over_bg_thresh() doesn't need to be protected by that spinlock, so
this series proposes a refactoring that moves the call outside the lock
critical section and makes the stats flushing in mem_cgroup_wb_stats()
non-atomic.
For (b), flushing needs to be atomic as mem_cgroup_threshold() is called
with irqs disabled. We only flush the stats when calculating the root
usage, as it is approximated as the sum of some memcg stats (file, anon,
and optionally swap) instead of the conventional page counter. This
series proposes changing this calculation to use the global stats instead,
eliminating the need for a memcg stat flush.
After these 2 contexts are eliminated, we no longer need
mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic(). We can
remove them and simplify the code.
[1] https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/
This patch (of 5):
wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
flush, which can be expensive on large systems. Currently,
wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
have to do the rstat flush atomically. On systems with a lot of
cpus and/or cgroups, this can cause us to disable irqs for a long time,
potentially causing problems.
Move the call to wb_over_bg_thresh() outside the lock section in
preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
The list_empty(&wb->work_list) check should be okay outside the lock
section of wb->list_lock as it is protected by a separate lock
(wb->work_lock), and wb_over_bg_thresh() doesn't seem like it is
modifying any of wb->b_* lists the wb->list_lock is protecting.
Also, the loop seems to be already releasing and reacquiring the
lock, so this refactoring looks safe.
Link: https://lkml.kernel.org/r/20230421174020.2994750-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230421174020.2994750-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21 17:40:16 +00:00
}
2009-09-23 19:32:26 +02:00
/*
writeback: try more writeback as long as something was written
writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
they only populate possibly a subset of eligible inodes into b_io at
entrance time. When the queued set of inodes are all synced, they just
return, possibly with all queued inode pages written but still
wbc.nr_to_write > 0.
For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty when the current set of b_io inodes are completed. So
it is necessary to try another round of writeback as long as we made some
progress in this round. When there are no more eligible inodes, no more
inodes will be enqueued in queue_io(), hence nothing could/will be
synced and we may safely bail.
For example, imagine 100 inodes
i0, i1, i2, ..., i90, i91, ..., i99
At queue_io() time, i90-i99 happen to be expired and moved to s_io for
IO. When finished successfully, if their total size is less than
MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
quit the background work (w/o this patch) while it's still over
background threshold. This will be a fairly normal/frequent case I guess.
Now that we do tagged sync and update inode->dirtied_when after the sync,
this change won't livelock sync(1). I actually tried to write 1 page
per 1ms with this command
write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
and do sync(1) at the same time. The sync completes quickly on ext4,
xfs, btrfs.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2010-07-22 10:23:44 -06:00
* No more inodes for IO , bail
2009-09-23 19:32:26 +02:00
*/
2023-04-21 17:40:16 +00:00
if ( list_empty ( & wb - > b_more_io ) ) {
spin_unlock ( & wb - > list_lock ) ;
2009-09-09 09:08:54 +02:00
break ;
2023-04-21 17:40:16 +00:00
}
2009-09-23 19:32:26 +02:00
/*
* Nothing written . Wait for some inode to
* become available for writeback . Otherwise
* we ' ll just busyloop .
*/
2016-12-12 16:43:20 -08:00
trace_writeback_wait ( wb , work ) ;
inode = wb_inode ( wb - > b_more_io . prev ) ;
spin_lock ( & inode - > i_lock ) ;
spin_unlock ( & wb - > list_lock ) ;
/* This function drops i_lock... */
inode_sleep_on_writeback ( inode ) ;
2009-09-09 09:08:54 +02:00
}
2015-09-11 13:37:19 -07:00
blk_finish_plug ( & plug ) ;
2009-09-09 09:08:54 +02:00
2011-05-04 19:54:37 -06:00
return nr_pages - work - > nr_pages ;
2009-09-09 09:08:54 +02:00
}
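The loop that ends here combines two ideas from the commit messages quoted above: the background-threshold check runs before the list lock is taken, and a round that made progress triggers another round instead of bailing out. The following is a minimal userspace sketch of that control flow only. Every name in it (`demo_wb`, `demo_lock`, `demo_over_bg_thresh`, `demo_writeback`) is hypothetical, the lock is modelled by a plain flag, and none of it is the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a bdi_writeback; not a kernel structure. */
struct demo_wb {
	bool lock_held;       /* models wb->list_lock being held */
	long dirty;           /* pages still dirty */
	long bg_thresh;       /* background threshold, in pages */
	int  checks_unlocked; /* threshold checks performed without the lock */
};

static void demo_lock(struct demo_wb *wb)
{
	assert(!wb->lock_held);
	wb->lock_held = true;
}

static void demo_unlock(struct demo_wb *wb)
{
	assert(wb->lock_held);
	wb->lock_held = false;
}

/* Models an expensive check that must not run with the lock held. */
static bool demo_over_bg_thresh(struct demo_wb *wb)
{
	assert(!wb->lock_held);
	wb->checks_unlocked++;
	return wb->dirty > wb->bg_thresh;
}

/* Writes at most `chunk` pages per round; returns the total written. */
static long demo_writeback(struct demo_wb *wb, long chunk)
{
	long total = 0;

	for (;;) {
		long progress;

		/* The threshold check happens before the lock is taken. */
		if (!demo_over_bg_thresh(wb))
			break;

		demo_lock(wb);
		progress = wb->dirty < chunk ? wb->dirty : chunk;
		wb->dirty -= progress;
		total += progress;
		demo_unlock(wb);

		/* Something was written: try another round. */
		if (progress)
			continue;

		/* Nothing written; the real code waits on an inode here. */
		break;
	}
	return total;
}
```

With 10 dirty pages, a threshold of 3 and a chunk of 4, the sketch runs two write rounds and stops on the third threshold check, which is the "keep going while progress was made" behaviour the 2010 commit describes.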
/*
2010-07-06 08:59:53 +02:00
* Return the next wb_writeback_work struct that hasn ' t been processed yet .
2009-09-09 09:08:54 +02:00
*/
writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback
Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.
This patch moves bdi->wb_lock and ->worklist into wb.
* The lock protects bdi->worklist and bdi->wb.dwork scheduling. While
moving, rename it to wb->work_lock as wb->wb_lock is confusing.
Also, move wb->dwork downwards so that it's colocated with the new
->work_lock and ->work_list fields.
* bdi_writeback_workfn() -> wb_workfn()
bdi_wakeup_thread_delayed(bdi) -> wb_wakeup_delayed(wb)
bdi_wakeup_thread(bdi) -> wb_wakeup(wb)
bdi_queue_work(bdi, ...) -> wb_queue_work(wb, ...)
__bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...)
get_next_work_item(bdi) -> get_next_work_item(wb)
* bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
The function contained parts which belong to the containing bdi
rather than the wb itself - testing cap_writeback_dirty and
bdi_remove_from_list() invocation. Those are moved to
bdi_unregister().
* bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
Initializations of the moved bdi->wb_lock and ->work_list are
relocated from bdi_init() to wb_init().
* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->state are mechanically replaced with bdi->wb.state
introducing no behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 17:13:30 -04:00
static struct wb_writeback_work * get_next_work_item ( struct bdi_writeback * wb )
2009-09-09 09:08:54 +02:00
{
2010-07-06 08:59:53 +02:00
struct wb_writeback_work * work = NULL ;
2009-09-09 09:08:54 +02:00
2022-08-01 08:50:34 -07:00
spin_lock_irq ( & wb - > work_lock ) ;
2015-05-22 17:13:30 -04:00
if ( ! list_empty ( & wb - > work_list ) ) {
work = list_entry ( wb - > work_list . next ,
2010-07-06 08:59:53 +02:00
struct wb_writeback_work , list ) ;
list_del_init ( & work - > list ) ;
2009-09-09 09:08:54 +02:00
}
2022-08-01 08:50:34 -07:00
spin_unlock_irq ( & wb - > work_lock ) ;
2010-07-06 08:59:53 +02:00
return work ;
2009-09-09 09:08:54 +02:00
}
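get_next_work_item() detaches the first entry of an intrusive list and converts the embedded list node back into its containing structure. Below is a small userspace sketch of that pattern, assuming a hand-rolled circular list; the `demo_*` names are hypothetical and only mirror the shape of the kernel's `list_empty()`, `list_entry()` and `list_del_init()`. Locking is omitted, whereas the kernel function holds `wb->work_lock` with interrupts disabled around the same steps.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct demo_list {
	struct demo_list *next, *prev;
};

static void demo_list_init(struct demo_list *h)
{
	h->next = h;
	h->prev = h;
}

static int demo_list_empty(const struct demo_list *h)
{
	return h->next == h;
}

static void demo_list_add_tail(struct demo_list *n, struct demo_list *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Unlink the node and leave it pointing at itself, so it reads as empty. */
static void demo_list_del_init(struct demo_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	demo_list_init(n);
}

/* Hypothetical work item with an embedded list node. */
struct demo_work {
	long nr_pages;
	struct demo_list list;
};

/* Recover the containing structure from a pointer to its member. */
#define demo_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Returns the oldest queued item, or NULL if the queue is empty. */
static struct demo_work *demo_next_work(struct demo_list *work_list)
{
	struct demo_work *work = NULL;

	if (!demo_list_empty(work_list)) {
		work = demo_entry(work_list->next, struct demo_work, list);
		demo_list_del_init(&work->list);
	}
	return work;
}
```

Reinitialising the node on removal means a later emptiness test on the item's own node is well defined, which is presumably why the kernel code uses `list_del_init()` rather than a plain delete.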
2011-01-13 15:45:44 -08:00
static long wb_check_background_flush ( struct bdi_writeback * wb )
{
2015-05-22 18:23:31 -04:00
if ( wb_over_bg_thresh ( wb ) ) {
2011-01-13 15:45:44 -08:00
struct wb_writeback_work work = {
. nr_pages = LONG_MAX ,
. sync_mode = WB_SYNC_NONE ,
. for_background = 1 ,
. range_cyclic = 1 ,
2011-10-07 21:54:10 -06:00
. reason = WB_REASON_BACKGROUND ,
2011-01-13 15:45:44 -08:00
} ;
return wb_writeback ( wb , & work ) ;
}
return 0 ;
}
2009-09-09 09:08:54 +02:00
static long wb_check_old_data_flush ( struct bdi_writeback * wb )
{
unsigned long expired ;
long nr_pages ;
2010-05-17 12:51:03 +02:00
/*
* When set to zero , disable periodic writeback
*/
if ( ! dirty_writeback_interval )
return 0 ;
2009-09-09 09:08:54 +02:00
expired = wb - > last_old_flush +
msecs_to_jiffies ( dirty_writeback_interval * 10 ) ;
if ( time_before ( jiffies , expired ) )
return 0 ;
wb - > last_old_flush = jiffies ;
2010-10-30 08:55:52 -07:00
nr_pages = get_nr_dirty_pages ( ) ;
2009-09-09 09:08:54 +02:00
2009-09-16 15:18:25 +02:00
if ( nr_pages ) {
2010-07-06 08:59:53 +02:00
struct wb_writeback_work work = {
2009-09-16 15:18:25 +02:00
. nr_pages = nr_pages ,
. sync_mode = WB_SYNC_NONE ,
. for_kupdate = 1 ,
. range_cyclic = 1 ,
2011-10-07 21:54:10 -06:00
. reason = WB_REASON_PERIODIC ,
2009-09-16 15:18:25 +02:00
} ;
2010-07-06 08:59:53 +02:00
return wb_writeback ( wb , & work ) ;
2009-09-16 15:18:25 +02:00
}
2009-09-09 09:08:54 +02:00
return 0 ;
}
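The periodic check above multiplies `dirty_writeback_interval` by 10 before converting it to jiffies, which is consistent with the interval being kept in centiseconds, and it compares against `jiffies` with `time_before()` so that counter wraparound does not break the test. The sketch below reproduces that arithmetic in userspace. The tick rate `DEMO_HZ` and all `demo_*` names are assumptions made for illustration, and the signed-difference comparison relies on the usual two's-complement conversion, as the kernel macro does.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_HZ 100UL /* assumed tick rate, ticks per second */

/* True if tick count a is earlier than b, even across wraparound. */
static bool demo_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Convert milliseconds to ticks, rounding up. */
static unsigned long demo_msecs_to_ticks(unsigned long ms)
{
	return (ms * DEMO_HZ + 999) / 1000;
}

/* Hypothetical flusher state; models wb->last_old_flush. */
struct demo_flusher {
	unsigned long last_old_flush;
};

/*
 * Returns true when a periodic flush is due at tick `now`.
 * interval_cs is in centiseconds; zero disables periodic writeback.
 */
static bool demo_old_flush_due(struct demo_flusher *f, unsigned long now,
			       unsigned int interval_cs)
{
	unsigned long expired;

	if (!interval_cs)
		return false;

	expired = f->last_old_flush + demo_msecs_to_ticks(interval_cs * 10UL);
	if (demo_time_before(now, expired))
		return false;

	f->last_old_flush = now;
	return true;
}
```

With an interval of 500 centiseconds and the assumed 100 ticks per second, a flush becomes due 500 ticks after the previous one, including when the tick counter wraps in between.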
2017-09-30 02:09:06 -06:00
static long wb_check_start_all ( struct bdi_writeback * wb )
{
long nr_pages ;
if ( ! test_bit ( WB_start_all , & wb - > state ) )
return 0 ;
nr_pages = get_nr_dirty_pages ( ) ;
if ( nr_pages ) {
struct wb_writeback_work work = {
. nr_pages = wb_split_bdi_pages ( wb , nr_pages ) ,
. sync_mode = WB_SYNC_NONE ,
. range_cyclic = 1 ,
. reason = wb - > start_all_reason ,
} ;
nr_pages = wb_writeback ( wb , & work ) ;
}
clear_bit ( WB_start_all , & wb - > state ) ;
return nr_pages ;
}
2009-09-09 09:08:54 +02:00
/*
* Retrieve work items and do the writeback they describe
*/
2013-07-08 16:00:14 -07:00
static long wb_do_writeback ( struct bdi_writeback * wb )
2009-09-09 09:08:54 +02:00
{
2010-07-06 08:59:53 +02:00
struct wb_writeback_work * work ;
2009-09-16 15:18:25 +02:00
long wrote = 0 ;
2009-09-09 09:08:54 +02:00
2015-05-22 17:13:26 -04:00
set_bit ( WB_writeback_running , & wb - > state ) ;
2015-05-22 17:13:30 -04:00
while ( ( work = get_next_work_item ( wb ) ) ! = NULL ) {
2015-08-18 14:54:56 -07:00
trace_writeback_exec ( wb , work ) ;
2010-07-06 08:59:53 +02:00
wrote + = wb_writeback ( wb , work ) ;
2017-03-10 12:09:49 -08:00
finish_writeback_work ( wb , work ) ;
2009-09-09 09:08:54 +02:00
}
2017-09-30 02:09:06 -06:00
/*
* Check for a flush - everything request
*/
wrote + = wb_check_start_all ( wb ) ;
2009-09-09 09:08:54 +02:00
/*
* Check for periodic writeback , kupdated ( ) style
*/
wrote + = wb_check_old_data_flush ( wb ) ;
2011-01-13 15:45:44 -08:00
wrote + = wb_check_background_flush ( wb ) ;
2015-05-22 17:13:26 -04:00
clear_bit ( WB_writeback_running , & wb - > state ) ;
2009-09-09 09:08:54 +02:00
return wrote ;
}
/*
* Handle writeback of dirty data for the device backed by this bdi . Also
writeback: replace custom worker pool implementation with unbound workqueue
Writeback implements its own worker pool - each bdi can be associated
with a worker thread which is created and destroyed dynamically. The
worker thread for the default bdi is always present and serves as the
"forker" thread which forks off worker threads for other bdis.
There's no reason for writeback to implement its own worker pool when
using an unbound workqueue instead is much simpler and more efficient.
This patch replaces the custom worker pool implementation in writeback
with an unbound workqueue.
The conversion isn't too complicated, but the following are worth
mentioning.
* bdi_writeback->last_active, task and wakeup_timer are removed.
delayed_work ->dwork is added instead. Explicit timer handling is
no longer necessary. Everything works by either queueing / modding
/ flushing / canceling the delayed_work item.
* bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
bdi_writeback->dwork. On each execution, it processes
bdi->work_list and reschedules itself if there are more things to
do.
The function also handles low-mem condition, which used to be
handled by the forker thread. If the function is running off a
rescuer thread, it only writes out limited number of pages so that
the rescuer can serve other bdis too. This preserves the flusher
creation failure behavior of the forker thread.
* INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
bdi_writeback_workfn() about on-going bdi unregistration so that it
always drains work_list even if it's running off the rescuer. Note
that the original code was broken in this regard. Under memory
pressure, a bdi could finish unregistration with non-empty
work_list.
* The default bdi is no longer special. It now is treated the same as
any other bdi and bdi_cap_flush_forker() is removed.
* BDI_pending is no longer used. Removed.
* Some tracepoints become non-applicable. The following TPs are
removed - writeback_nothread, writeback_wake_thread,
writeback_wake_forker_thread, writeback_thread_start,
writeback_thread_stop.
Everything, including devices coming and going away and rescuer
operation under simulated memory pressure, seems to work fine in my
test setup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
2013-04-01 19:08:06 -07:00
* reschedules periodically and does kupdated style flushing .
2009-09-09 09:08:54 +02:00
*/
2015-05-22 17:13:30 -04:00
void wb_workfn ( struct work_struct * work )
2009-09-09 09:08:54 +02:00
{
2013-04-01 19:08:06 -07:00
struct bdi_writeback * wb = container_of ( to_delayed_work ( work ) ,
struct bdi_writeback , dwork ) ;
2009-09-09 09:08:54 +02:00
long pages_written ;
memcg: fix a crash in wb_workfn when a device disappears
Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures. In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shutdown
the bdi_writeback structure (or wb), and part of that writeback ensures
that no other work is queued against the wb, and that the wb is fully
drained.
With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi. There is a refcount which prevents
the bdi object from being released (and hence, unregistered). So in
theory, the bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put will drop the refcount, and when it is zero,
release_bdi gets called, which calls bdi_unregister).
Unfortunately, del_gendisk() in block/genhd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly. It does
this without informing the file system, or the memcg code, or anything
else. This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wb's are shut down. So when
one of these wb's is woken up to do delayed work, it tries to
dereference its wb->bdi->dev to fetch the device name, but
unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk(). As a result, *boom*.
Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi->dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.
The most common way of triggering this will be hotremoval of a device
while writeback with memcg enabled is going on. It was triggering
several times a day in a heavily loaded production environment.
Google Bug Id: 145475544
Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-30 22:11:04 -08:00
set_worker_desc ( " flush-%s " , bdi_dev_name ( wb - > bdi ) ) ;
2010-07-07 13:24:06 +10:00
writeback: replace custom worker pool implementation with unbound workqueue
Writeback implements its own worker pool - each bdi can be associated
with a worker thread which is created and destroyed dynamically. The
worker thread for the default bdi is always present and serves as the
"forker" thread which forks off worker threads for other bdis.
There's no reason for writeback to implement its own worker pool when
using an unbound workqueue instead is much simpler and more efficient.
This patch replaces the custom worker pool implementation in writeback
with an unbound workqueue.
The conversion isn't too complicated but the following points are worth
mentioning.
* bdi_writeback->last_active, task and wakeup_timer are removed.
delayed_work ->dwork is added instead. Explicit timer handling is
no longer necessary. Everything works by either queueing / modding
/ flushing / canceling the delayed_work item.
* bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
bdi_writeback->dwork. On each execution, it processes
bdi->work_list and reschedules itself if there are more things to
do.
The function also handles the low-memory condition, which used to be
handled by the forker thread. If the function is running off a
rescuer thread, it only writes out a limited number of pages so that
the rescuer can serve other bdis too. This preserves the flusher
creation failure behavior of the forker thread.
* INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
bdi_writeback_workfn() about on-going bdi unregistration so that it
always drains work_list even if it's running off the rescuer. Note
that the original code was broken in this regard. Under memory
pressure, a bdi could finish unregistration with non-empty
work_list.
* The default bdi is no longer special. It now is treated the same as
any other bdi and bdi_cap_flush_forker() is removed.
* BDI_pending is no longer used. Removed.
* Some tracepoints become non-applicable. The following TPs are
removed - writeback_nothread, writeback_wake_thread,
writeback_wake_forker_thread, writeback_thread_start,
writeback_thread_stop.
Everything, including devices coming and going away and rescuer
operation under simulated memory pressure, seems to work fine in my
test setup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
2013-04-01 19:08:06 -07:00
if ( likely ( ! current_is_workqueue_rescuer ( ) | |
2015-05-22 17:13:26 -04:00
! test_bit ( WB_registered , & wb - > state ) ) ) {
2010-07-25 14:29:22 +03:00
/*
writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback
Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.
This patch moves bdi->wb_lock and ->worklist into wb.
* The lock protects bdi->worklist and bdi->wb.dwork scheduling. While
moving, rename it to wb->work_lock as wb->wb_lock is confusing.
Also, move wb->dwork downwards so that it's colocated with the new
->work_lock and ->work_list fields.
* bdi_writeback_workfn() -> wb_workfn()
bdi_wakeup_thread_delayed(bdi) -> wb_wakeup_delayed(wb)
bdi_wakeup_thread(bdi) -> wb_wakeup(wb)
bdi_queue_work(bdi, ...) -> wb_queue_work(wb, ...)
__bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...)
get_next_work_item(bdi) -> get_next_work_item(wb)
* bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
The function contained parts which belong to the containing bdi
rather than the wb itself - testing cap_writeback_dirty and
bdi_remove_from_list() invocation. Those are moved to
bdi_unregister().
* bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
Initializations of the moved bdi->wb_lock and ->work_list are
relocated from bdi_init() to wb_init().
* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->state are mechanically replaced with bdi->wb.state
introducing no behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 17:13:30 -04:00
* The normal path . Keep writing back @ wb until its
2013-04-01 19:08:06 -07:00
* work_list is empty . Note that this path is also taken
2015-05-22 17:13:30 -04:00
* if @ wb is shutting down even when we ' re running off the
2013-04-01 19:08:06 -07:00
* rescuer as work_list needs to be drained .
2010-07-25 14:29:22 +03:00
*/
2013-04-01 19:08:06 -07:00
do {
2013-07-08 16:00:14 -07:00
pages_written = wb_do_writeback ( wb ) ;
2013-04-01 19:08:06 -07:00
trace_writeback_pages_written ( pages_written ) ;
2015-05-22 17:13:30 -04:00
} while ( ! list_empty ( & wb - > work_list ) ) ;
2013-04-01 19:08:06 -07:00
} else {
/*
* bdi_wq can ' t get enough workers and we ' re running off
* the emergency worker . Don ' t hog it . Hopefully , 1024 is
* enough for efficient IO .
*/
2015-05-22 17:13:30 -04:00
pages_written = writeback_inodes_wb ( wb , 1024 ,
2013-04-01 19:08:06 -07:00
WB_REASON_FORKER_THREAD ) ;
2010-07-07 13:24:06 +10:00
trace_writeback_pages_written ( pages_written ) ;
2009-09-09 09:08:54 +02:00
}
2015-05-22 17:13:30 -04:00
if ( ! list_empty ( & wb - > work_list ) )
2018-05-03 18:26:26 +02:00
wb_wakeup ( wb ) ;
2014-04-03 14:46:22 -07:00
else if ( wb_has_dirty_io ( wb ) & & dirty_writeback_interval )
2015-05-22 17:13:30 -04:00
wb_wakeup_delayed ( wb ) ;
2009-09-09 09:08:54 +02:00
}
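The control flow of the work function that ends here can be modelled in user space. The sketch below is a simplified illustration under stated assumptions, not the kernel code: `struct wb_model`, `wbm_do_writeback()` and `wbm_workfn()` are hypothetical names, "writing" just decrements counters, and the page count is accumulated across loop iterations purely for the sake of the example. Only the shape is taken from the listing: drain the work list on the normal or shutdown path, cap the work at 1024 pages on the rescuer, then decide how to reschedule.

```c
#include <stdbool.h>

#define RESCUER_PAGE_LIMIT 1024L	/* same cap as in the listing */

/* Hypothetical, much simplified stand-in for struct bdi_writeback. */
struct wb_model {
	int queued_items;	/* entries waiting on the work list */
	long dirty_pages;	/* pages still to be written */
	bool registered;	/* cleared once shutdown has started */
};

enum wbm_next { WBM_IDLE, WBM_WAKE_NOW, WBM_WAKE_DELAYED };

/* Toy writeback pass: consume one queued item, write everything dirty. */
static long wbm_do_writeback(struct wb_model *wb)
{
	long written = wb->dirty_pages;

	wb->dirty_pages = 0;
	if (wb->queued_items > 0)
		wb->queued_items--;
	return written;
}

/* One execution of the work function; returns the rescheduling decision. */
static enum wbm_next wbm_workfn(struct wb_model *wb, bool on_rescuer,
				unsigned int periodic_interval,
				long *pages_written)
{
	*pages_written = 0;

	if (!on_rescuer || !wb->registered) {
		/* Normal path, also taken on shutdown: drain the list. */
		do {
			*pages_written += wbm_do_writeback(wb);
		} while (wb->queued_items > 0);
	} else {
		/* Rescuer path: bounded work, queued items left alone. */
		long n = wb->dirty_pages < RESCUER_PAGE_LIMIT ?
			 wb->dirty_pages : RESCUER_PAGE_LIMIT;

		wb->dirty_pages -= n;
		*pages_written = n;
	}

	if (wb->queued_items > 0)
		return WBM_WAKE_NOW;		/* more queued work */
	if (wb->dirty_pages > 0 && periodic_interval)
		return WBM_WAKE_DELAYED;	/* periodic writeback */
	return WBM_IDLE;
}
```

In this model a registered wb running on the rescuer leaves its queue untouched and asks to be woken again, while an unregistered wb always drains its queue regardless of which worker runs it.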
2017-09-28 11:26:59 -06:00
/*
* Start writeback of all dirty pages on this bdi .
*/
static void __wakeup_flusher_threads_bdi ( struct backing_dev_info * bdi ,
2017-09-28 11:31:22 -06:00
enum wb_reason reason )
2017-09-28 11:26:59 -06:00
{
struct bdi_writeback * wb ;
if ( ! bdi_has_dirty_io ( bdi ) )
return ;
list_for_each_entry_rcu ( wb , & bdi - > wb_list , bdi_node )
2017-09-28 11:31:22 -06:00
wb_start_writeback ( wb , reason ) ;
2017-09-28 11:26:59 -06:00
}
void wakeup_flusher_threads_bdi ( struct backing_dev_info * bdi ,
enum wb_reason reason )
{
rcu_read_lock ( ) ;
2017-09-28 11:31:22 -06:00
__wakeup_flusher_threads_bdi ( bdi , reason ) ;
2017-09-28 11:26:59 -06:00
rcu_read_unlock ( ) ;
}
2009-09-09 09:08:54 +02:00
/*
2017-09-20 08:58:25 -06:00
* Wake up the flusher threads to start writeback of all currently dirty pages
2009-09-09 09:08:54 +02:00
*/
2017-09-20 08:58:25 -06:00
void wakeup_flusher_threads ( enum wb_reason reason )
2009-09-09 09:08:54 +02:00
{
2010-06-08 18:15:07 +02:00
struct backing_dev_info * bdi ;
2009-09-09 09:08:54 +02:00
2016-08-04 21:36:05 +03:00
/*
* If we are expecting writeback progress we must submit plugged IO .
*/
2022-01-27 08:05:49 +01:00
blk_flush_plug ( current - > plug , true ) ;
2016-08-04 21:36:05 +03:00
2010-06-08 18:15:07 +02:00
rcu_read_lock ( ) ;
2017-09-28 11:26:59 -06:00
list_for_each_entry_rcu ( bdi , & bdi_list , bdi_list )
2017-09-28 11:31:22 -06:00
__wakeup_flusher_threads_bdi ( bdi , reason ) ;
2009-09-14 13:12:40 +02:00
rcu_read_unlock ( ) ;
2005-04-16 15:20:36 -07:00
}
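The two-level walk performed by wakeup_flusher_threads() and its per-bdi helper can be pictured with plain arrays. This is only a sketch under stated assumptions: `struct flush_wb`, `struct flush_bdi`, `flush_wakeup_bdi()` and `flush_wakeup_all()` are hypothetical names, arrays replace the RCU-protected lists, and "starting writeback" is reduced to setting a flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one writeback context. */
struct flush_wb {
	bool kicked;		/* set when writeback was requested */
};

/* Hypothetical stand-in for one backing device and its wb's. */
struct flush_bdi {
	bool has_dirty_io;
	struct flush_wb *wbs;
	size_t nr_wbs;
};

/* Kick every wb of one bdi, unless the bdi has nothing dirty. */
static size_t flush_wakeup_bdi(struct flush_bdi *bdi)
{
	size_t i;

	if (!bdi->has_dirty_io)
		return 0;
	for (i = 0; i < bdi->nr_wbs; i++)
		bdi->wbs[i].kicked = true;
	return bdi->nr_wbs;
}

/* Walk all bdis; returns how many wb's were kicked in total. */
static size_t flush_wakeup_all(struct flush_bdi *bdis, size_t nr_bdis)
{
	size_t i, kicked = 0;

	for (i = 0; i < nr_bdis; i++)
		kicked += flush_wakeup_bdi(&bdis[i]);
	return kicked;
}
```

The early return for a clean bdi mirrors the bdi_has_dirty_io() check in the listing: devices with nothing to write are skipped without touching any of their wb's.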
2015-03-17 12:23:19 -04:00
/*
* Wake up bdi ' s periodically to make sure dirtytime inodes get
* written back periodically . We deliberately do * not * check the
* b_dirtytime list in wb_has_dirty_io ( ) , since this would cause the
* kernel to be constantly waking up once there are any dirtytime
* inodes on the system . So instead we define a separate delayed work
* function which gets called much more rarely . ( By default , only
* once every 12 hours . )
*
* If there is any other write activity going on in the file system ,
* this function won ' t be necessary . But if the only thing that has
* happened on the file system is a dirtytime inode caused by an atime
* update , we need this infrastructure below to make sure that inode
* eventually gets pushed out to disk .
*/
static void wakeup_dirtytime_writeback ( struct work_struct * w ) ;
static DECLARE_DELAYED_WORK ( dirtytime_work , wakeup_dirtytime_writeback ) ;
static void wakeup_dirtytime_writeback ( struct work_struct * w )
{
struct backing_dev_info * bdi ;
rcu_read_lock ( ) ;
list_for_each_entry_rcu ( bdi , & bdi_list , bdi_list ) {
2015-05-22 17:13:56 -04:00
struct bdi_writeback * wb ;
2015-10-02 14:47:05 -04:00
list_for_each_entry_rcu ( wb , & bdi - > wb_list , bdi_node )
2015-09-29 12:47:51 -04:00
if ( ! list_empty ( & wb - > b_dirty_time ) )
wb_wakeup ( wb ) ;
2015-03-17 12:23:19 -04:00
}
rcu_read_unlock ( ) ;
schedule_delayed_work ( & dirtytime_work , dirtytime_expire_interval * HZ ) ;
}
static int __init start_dirtytime_writeback ( void )
{
schedule_delayed_work ( & dirtytime_work , dirtytime_expire_interval * HZ ) ;
return 0 ;
}
__initcall ( start_dirtytime_writeback ) ;
2015-03-17 12:23:32 -04:00
int dirtytime_interval_handler ( struct ctl_table * table , int write ,
2020-09-18 21:20:39 -07:00
void * buffer , size_t * lenp , loff_t * ppos )
2015-03-17 12:23:32 -04:00
{
int ret ;
ret = proc_dointvec_minmax ( table , write , buffer , lenp , ppos ) ;
if ( ret = = 0 & & write )
mod_delayed_work ( system_wq , & dirtytime_work , 0 ) ;
return ret ;
}
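The self-rearming delayed work and the sysctl handler above can be modelled in userspace. The following is a minimal sketch under stated assumptions, not the kernel's workqueue implementation: every `demo_*` name is hypothetical, time is a simulated clock in plain seconds (the kernel multiplies the interval by HZ to get jiffies), and the work fires only when `demo_tick()` is called.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel delayed_work; times are plain seconds. */
struct demo_delayed_work {
	bool pending;
	unsigned long expires;
};

static unsigned long demo_now;                      /* simulated clock */
static unsigned long demo_interval = 12 * 60 * 60;  /* 12 hours, the default named in the comment */
static int demo_runs;                               /* how often the handler fired */

/* Like schedule_delayed_work(): does nothing if the work is already queued. */
static bool demo_schedule(struct demo_delayed_work *w, unsigned long delay)
{
	if (w->pending)
		return false;
	w->pending = true;
	w->expires = demo_now + delay;
	return true;
}

/* Like mod_delayed_work(): queues the work or replaces its expiry. */
static void demo_mod(struct demo_delayed_work *w, unsigned long delay)
{
	w->pending = true;
	w->expires = demo_now + delay;
}

/* The handler does its job, then re-arms itself for the next period. */
static void demo_handler(struct demo_delayed_work *w)
{
	demo_runs++;
	demo_schedule(w, demo_interval);
}

/* Advance the clock and fire the handler at most once if it is due. */
static bool demo_tick(struct demo_delayed_work *w, unsigned long now)
{
	demo_now = now;
	if (!w->pending || w->expires > demo_now)
		return false;
	w->pending = false;
	demo_handler(w);
	return true;
}

/* Models a sysctl write: store the new interval, then run the work now. */
static void demo_set_interval(struct demo_delayed_work *w, unsigned long secs)
{
	demo_interval = secs;
	demo_mod(w, 0);
}
```

Running the work immediately after an interval change means the next period is measured with the new value instead of waiting out the old, possibly much longer, timeout.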
2009-09-09 09:08:54 +02:00
/**
2021-01-12 11:02:49 -08:00
* __mark_inode_dirty - internal function to mark an inode dirty
2017-05-12 07:45:42 -03:00
*
* @ inode : inode to mark
2021-01-12 11:02:49 -08:00
* @ flags : what kind of dirty , e . g . I_DIRTY_SYNC . This can be a combination of
* multiple I_DIRTY_ * flags , except that I_DIRTY_TIME can ' t be combined
* with I_DIRTY_PAGES .
2017-05-12 07:45:42 -03:00
*
2021-01-12 11:02:49 -08:00
* Mark an inode as dirty . We notify the filesystem , then update the inode ' s
* dirty flags . Then , if needed we add the inode to the appropriate dirty list .
2005-04-16 15:20:36 -07:00
*
2021-01-12 11:02:49 -08:00
* Most callers should use mark_inode_dirty ( ) or mark_inode_dirty_sync ( )
* instead of calling this directly .
2009-09-09 09:08:54 +02:00
*
2021-01-12 11:02:49 -08:00
* CAREFUL ! We only add the inode to the dirty list if it is hashed or if it
* refers to a blockdev . Unhashed inodes will never be added to the dirty list
* even if they are later hashed , as they will have been marked dirty already .
2009-09-09 09:08:54 +02:00
*
2021-01-12 11:02:49 -08:00
* In short , ensure you hash any inodes _before_ you start marking them dirty .
2005-04-16 15:20:36 -07:00
*
2009-09-09 09:08:54 +02:00
* Note that for blockdevs , inode - > dirtied_when represents the dirtying time of
* the block - special inode ( / dev / hda1 ) itself . And the - > dirtied_when field of
* the kernel - internal blockdev inode represents the dirtying time of the
* blockdev ' s pages . This is why for I_DIRTY_PAGES we always use
* page - > mapping - > host , so the page - dirtying time is recorded in the internal
* blockdev inode .
2005-04-16 15:20:36 -07:00
*/
2009-09-09 09:08:54 +02:00
void __mark_inode_dirty ( struct inode * inode , int flags )
2005-04-16 15:20:36 -07:00
{
2009-09-09 09:08:54 +02:00
struct super_block * sb = inode - > i_sb ;
2021-01-12 11:02:49 -08:00
int dirtytime = 0 ;
2022-05-24 08:05:40 -07:00
struct bdi_writeback * wb = NULL ;
2015-02-02 00:37:00 -05:00
trace_writeback_mark_inode_dirty ( inode , flags ) ;
2005-04-16 15:20:36 -07:00
2021-01-12 11:02:47 -08:00
if ( flags & I_DIRTY_INODE ) {
2022-08-25 12:06:57 +02:00
/*
* Inode timestamp update will piggyback on this dirtying .
* We tell - > dirty_inode callback that timestamps need to
* be updated by setting I_DIRTY_TIME in flags .
*/
if ( inode - > i_state & I_DIRTY_TIME ) {
spin_lock ( & inode - > i_lock ) ;
if ( inode - > i_state & I_DIRTY_TIME ) {
inode - > i_state & = ~ I_DIRTY_TIME ;
flags | = I_DIRTY_TIME ;
}
spin_unlock ( & inode - > i_lock ) ;
}
2021-01-12 11:02:49 -08:00
/*
* Notify the filesystem about the inode being dirtied , so that
* ( if needed ) it can update on - disk fields and journal the
* inode . This is only needed when the inode itself is being
* dirtied now . I . e . it ' s only needed for I_DIRTY_INODE , not
* for just I_DIRTY_PAGES or I_DIRTY_TIME .
*/
2013-01-11 13:06:37 -08:00
trace_writeback_dirty_inode_start ( inode , flags ) ;
2009-09-09 09:08:54 +02:00
if ( sb - > s_op - > dirty_inode )
2022-08-25 12:06:57 +02:00
sb - > s_op - > dirty_inode ( inode ,
flags & ( I_DIRTY_INODE | I_DIRTY_TIME ) ) ;
2013-01-11 13:06:37 -08:00
trace_writeback_dirty_inode ( inode , flags ) ;
2021-01-12 11:02:47 -08:00
2021-01-12 11:02:49 -08:00
/* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
2015-02-02 00:37:00 -05:00
flags & = ~ I_DIRTY_TIME ;
2021-01-12 11:02:49 -08:00
} else {
/*
* Else it ' s either I_DIRTY_PAGES , I_DIRTY_TIME , or nothing .
* ( We don ' t support setting both I_DIRTY_PAGES and I_DIRTY_TIME
* in one call to __mark_inode_dirty ( ) . )
*/
dirtytime = flags & I_DIRTY_TIME ;
WARN_ON_ONCE ( dirtytime & & flags ! = I_DIRTY_TIME ) ;
2021-01-12 11:02:47 -08:00
}
2009-09-09 09:08:54 +02:00
/*
2014-10-24 15:38:21 -04:00
* Paired with smp_mb ( ) in __writeback_single_inode ( ) for the
* following lockless i_state test . See there for details .
2009-09-09 09:08:54 +02:00
*/
smp_mb ( ) ;
2022-08-25 12:06:57 +02:00
if ( ( inode - > i_state & flags ) = = flags )
2009-09-09 09:08:54 +02:00
return ;
2011-03-22 22:23:36 +11:00
spin_lock ( & inode - > i_lock ) ;
2009-09-09 09:08:54 +02:00
if ( ( inode - > i_state & flags ) ! = flags ) {
const int was_dirty = inode - > i_state & I_DIRTY ;
2015-05-22 17:13:37 -04:00
inode_attach_wb ( inode , NULL ) ;
2009-09-09 09:08:54 +02:00
inode - > i_state | = flags ;
2022-05-24 08:05:40 -07:00
/*
* Grab inode ' s wb early because it requires dropping i_lock and we
* need to make sure following checks happen atomically with dirty
* list handling so that we don ' t move inodes under flush worker ' s
* hands .
*/
if ( ! was_dirty ) {
wb = locked_inode_to_wb_and_lock_list ( inode ) ;
spin_lock ( & inode - > i_lock ) ;
}
2009-09-09 09:08:54 +02:00
/*
2020-05-29 15:05:22 +02:00
* If the inode is queued for writeback by flush worker , just
* update its dirty state . Once the flush worker is done with
* the inode it will place it on the appropriate superblock
* list , based upon its state .
2009-09-09 09:08:54 +02:00
*/
2020-05-29 15:05:22 +02:00
if ( inode - > i_state & I_SYNC_QUEUED )
2022-05-24 08:05:40 -07:00
goto out_unlock ;
2009-09-09 09:08:54 +02:00
/*
* Only add valid ( hashed ) inodes to the superblock ' s
* dirty list . Add blockdev inodes as well .
*/
if ( ! S_ISBLK ( inode - > i_mode ) ) {
2010-10-23 15:19:20 -04:00
if ( inode_unhashed ( inode ) )
2022-05-24 08:05:40 -07:00
goto out_unlock ;
2009-09-09 09:08:54 +02:00
}
2010-06-02 17:38:30 -04:00
if ( inode - > i_state & I_FREEING )
2022-05-24 08:05:40 -07:00
goto out_unlock ;
2009-09-09 09:08:54 +02:00
/*
* If the inode was already on b_dirty / b_io / b_more_io , don ' t
* reposition it ( that would break b_dirty time - ordering ) .
*/
if ( ! was_dirty ) {
2015-05-22 17:13:45 -04:00
struct list_head * dirty_list ;
2011-03-22 22:23:41 +11:00
bool wakeup_bdi = false ;
2010-07-25 14:29:21 +03:00
2009-09-09 09:08:54 +02:00
inode - > dirtied_when = jiffies ;
2015-03-17 12:23:19 -04:00
if ( dirtytime )
inode - > dirtied_time_when = jiffies ;
2015-05-22 17:13:45 -04:00
2018-02-21 07:54:49 -08:00
if ( inode - > i_state & I_DIRTY )
2015-05-22 17:14:02 -04:00
dirty_list = & wb - > b_dirty ;
2015-03-17 12:23:19 -04:00
else
2015-05-22 17:14:02 -04:00
dirty_list = & wb - > b_dirty_time ;
2015-05-22 17:13:45 -04:00
2015-03-04 14:07:22 -05:00
wakeup_bdi = inode_io_list_move_locked ( inode , wb ,
2015-05-22 17:13:45 -04:00
dirty_list ) ;
2015-05-22 17:14:02 -04:00
spin_unlock ( & wb - > list_lock ) ;
2022-05-24 08:05:40 -07:00
spin_unlock ( & inode - > i_lock ) ;
2015-02-02 00:37:00 -05:00
trace_writeback_dirty_inode_enqueue ( inode ) ;
2011-03-22 22:23:41 +11:00
2015-05-22 17:13:45 -04:00
/*
* If this is the first dirty inode for this bdi ,
* we have to wake - up the corresponding bdi thread
* to make sure background write - back happens
* later .
*/
2020-09-24 08:51:40 +02:00
if ( wakeup_bdi & &
( wb - > bdi - > capabilities & BDI_CAP_WRITEBACK ) )
2015-05-22 17:14:02 -04:00
wb_wakeup_delayed ( wb ) ;
2011-03-22 22:23:41 +11:00
return ;
2005-04-16 15:20:36 -07:00
}
}
2022-05-24 08:05:40 -07:00
out_unlock :
if ( wb )
spin_unlock ( & wb - > list_lock ) ;
2011-03-22 22:23:36 +11:00
spin_unlock ( & inode - > i_lock ) ;
2009-09-09 09:08:54 +02:00
}
EXPORT_SYMBOL ( __mark_inode_dirty ) ;
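The core locking idea in __mark_inode_dirty ( ) is a lockless test of the state bits followed by a recheck under the lock. Below is a hedged userspace sketch of that pattern only. The `demo_*` names and flag values are hypothetical, a C11 `atomic_flag` spinlock stands in for i_lock, and the writeback list, I_SYNC_QUEUED, hashing and I_FREEING checks are deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag values; the kernel's I_DIRTY_* bits differ. */
#define DEMO_DIRTY_SYNC  0x1u
#define DEMO_DIRTY_PAGES 0x2u

struct demo_inode {
	atomic_flag lock;     /* stands in for inode->i_lock */
	atomic_uint state;    /* stands in for inode->i_state */
	bool on_dirty_list;   /* stands in for membership of wb->b_dirty */
};

static void demo_inode_init(struct demo_inode *inode)
{
	atomic_flag_clear(&inode->lock);
	atomic_init(&inode->state, 0);
	inode->on_dirty_list = false;
}

static void demo_lock(struct demo_inode *inode)
{
	while (atomic_flag_test_and_set(&inode->lock))
		; /* spin until the holder clears the flag */
}

static void demo_unlock(struct demo_inode *inode)
{
	atomic_flag_clear(&inode->lock);
}

/* Returns true only when this call moved the inode from clean to dirty. */
static bool demo_mark_dirty(struct demo_inode *inode, unsigned int flags)
{
	bool newly_dirty = false;
	unsigned int old;

	/* Full barrier before the lockless test, mirroring smp_mb(). */
	atomic_thread_fence(memory_order_seq_cst);

	/* Fast path: every requested bit is already set, so skip the lock. */
	if ((atomic_load(&inode->state) & flags) == flags)
		return false;

	demo_lock(inode);
	/* Recheck: another thread may have set the bits before we locked. */
	old = atomic_load(&inode->state);
	if ((old & flags) != flags) {
		atomic_store(&inode->state, old | flags);
		if (old == 0) {
			/* Only a clean-to-dirty transition queues the inode. */
			inode->on_dirty_list = true;
			newly_dirty = true;
		}
	}
	demo_unlock(inode);
	return newly_dirty;
}
```

An inode that is already dirty keeps its list position when further bits are added, which matches the comment about not breaking b_dirty time ordering.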
2015-03-04 13:40:00 -05:00
/*
* The @ s_sync_lock is used to serialise concurrent sync operations
* to avoid lock contention problems with concurrent wait_sb_inodes ( ) calls .
* Concurrent callers will block on the s_sync_lock rather than doing contending
* walks . The queueing maintains sync ( 2 ) required behaviour as all the IO that
* has been issued up to the time this function is entered is guaranteed to be
* completed by the time we have gained the lock and waited for all IO that is
* in progress regardless of the order callers are granted the lock .
*/
2009-09-16 15:13:54 +02:00
static void wait_sb_inodes ( struct super_block * sb )
2009-09-09 09:08:54 +02:00
{
2016-07-26 15:21:50 -07:00
LIST_HEAD ( sync_list ) ;
2009-09-09 09:08:54 +02:00
/*
* We need to be protected against the filesystem going from
* r / o to r / w or vice versa .
*/
2009-09-16 15:13:54 +02:00
WARN_ON ( ! rwsem_is_locked ( & sb - > s_umount ) ) ;
2009-09-09 09:08:54 +02:00
2015-03-04 13:40:00 -05:00
mutex_lock ( & sb - > s_sync_lock ) ;
2009-09-09 09:08:54 +02:00
/*
2016-07-26 15:21:50 -07:00
* Splice the writeback list onto a temporary list to avoid waiting on
* inodes that have started writeback after this point .
*
* Use rcu_read_lock ( ) to keep the inodes around until we have a
* reference . s_inode_wblist_lock protects sb - > s_inodes_wb as well as
* the local list because inodes can be dropped from either by writeback
* completion .
*/
rcu_read_lock ( ) ;
spin_lock_irq ( & sb - > s_inode_wblist_lock ) ;
list_splice_init ( & sb - > s_inodes_wb , & sync_list ) ;
/*
* Data integrity sync . Must wait for all pages under writeback , because
* there may have been pages dirtied before our sync call , but which had
* writeout started before we write it out . In which case , the inode
* may not be on the dirty list , but we still have to wait for that
* writeout .
2009-09-09 09:08:54 +02:00
*/
2016-07-26 15:21:50 -07:00
while ( ! list_empty ( & sync_list ) ) {
struct inode * inode = list_first_entry ( & sync_list , struct inode ,
i_wb_list ) ;
2011-03-22 22:23:36 +11:00
struct address_space * mapping = inode - > i_mapping ;
2009-09-09 09:08:54 +02:00
2016-07-26 15:21:50 -07:00
/*
* Move each inode back to the wb list before we drop the lock
* to preserve consistency between i_wb_list and the mapping
* writeback tag . Writeback completion is responsible for removing
* the inode from either list once the writeback tag is cleared .
*/
list_move_tail ( & inode - > i_wb_list , & sb - > s_inodes_wb ) ;
/*
* The mapping can appear untagged while still on - list since we
* do not have the mapping lock . Skip it here , wb completion
* will remove it .
*/
if ( ! mapping_tagged ( mapping , PAGECACHE_TAG_WRITEBACK ) )
continue ;
spin_unlock_irq ( & sb - > s_inode_wblist_lock ) ;
2011-03-22 22:23:36 +11:00
spin_lock ( & inode - > i_lock ) ;
2016-07-26 15:21:50 -07:00
if ( inode - > i_state & ( I_FREEING | I_WILL_FREE | I_NEW ) ) {
2011-03-22 22:23:36 +11:00
spin_unlock ( & inode - > i_lock ) ;
2016-07-26 15:21:50 -07:00
spin_lock_irq ( & sb - > s_inode_wblist_lock ) ;
2009-09-09 09:08:54 +02:00
continue ;
2011-03-22 22:23:36 +11:00
}
2009-09-09 09:08:54 +02:00
__iget ( inode ) ;
2011-03-22 22:23:36 +11:00
spin_unlock ( & inode - > i_lock ) ;
2016-07-26 15:21:50 -07:00
rcu_read_unlock ( ) ;
2009-09-09 09:08:54 +02:00
2015-11-05 18:47:23 -08:00
/*
* We keep the error status of individual mapping so that
* applications can catch the writeback error using fsync ( 2 ) .
* See filemap_fdatawait_keep_errors ( ) for details .
*/
filemap_fdatawait_keep_errors ( mapping ) ;
2009-09-09 09:08:54 +02:00
cond_resched ( ) ;
2016-07-26 15:21:50 -07:00
iput ( inode ) ;
rcu_read_lock ( ) ;
spin_lock_irq ( & sb - > s_inode_wblist_lock ) ;
2009-09-09 09:08:54 +02:00
}
2016-07-26 15:21:50 -07:00
spin_unlock_irq ( & sb - > s_inode_wblist_lock ) ;
rcu_read_unlock ( ) ;
2015-03-04 13:40:00 -05:00
mutex_unlock ( & sb - > s_sync_lock ) ;
2005-04-16 15:20:36 -07:00
}
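wait_sb_inodes ( ) walks a snapshot of the writeback list: it splices the live list onto a local one, then moves each entry back before processing it, so entries queued after the splice are not waited on. The sketch below illustrates just that list technique in userspace. The `demo_*` list helpers are hypothetical re-implementations rather than the kernel's list.h, and locking, RCU and reference counting are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the spirit of struct list_head. */
struct demo_list {
	struct demo_list *prev, *next;
};

static void demo_list_init(struct demo_list *head)
{
	head->prev = head;
	head->next = head;
}

static int demo_list_empty(const struct demo_list *head)
{
	return head->next == head;
}

static void demo_list_del(struct demo_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static void demo_list_add_tail(struct demo_list *n, struct demo_list *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Append every entry of src to dst and leave src empty. */
static void demo_list_splice_init(struct demo_list *src, struct demo_list *dst)
{
	if (demo_list_empty(src))
		return;
	src->next->prev = dst->prev;
	dst->prev->next = src->next;
	src->prev->next = dst;
	dst->prev = src->prev;
	demo_list_init(src);
}

struct demo_item {
	struct demo_list link;
	int id;
	int visited;
};

/* Visit each item that was on `live` when the call started, exactly once. */
static int demo_drain(struct demo_list *live)
{
	struct demo_list snapshot;
	int count = 0;

	demo_list_init(&snapshot);
	demo_list_splice_init(live, &snapshot);

	while (!demo_list_empty(&snapshot)) {
		struct demo_list *n = snapshot.next;
		struct demo_item *item = (struct demo_item *)
			((char *)n - offsetof(struct demo_item, link));

		/* Return the entry to the live list before any slow work. */
		demo_list_del(n);
		demo_list_add_tail(n, live);

		item->visited++;
		count++;
	}
	return count;
}
```

Because each entry is moved back to the tail in order, the live list ends up with the same ordering it started with.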
2015-05-22 17:14:00 -04:00
static void __writeback_inodes_sb_nr ( struct super_block * sb , unsigned long nr ,
enum wb_reason reason , bool skip_if_busy )
2005-04-16 15:20:36 -07:00
{
2019-08-26 09:06:52 -07:00
struct backing_dev_info * bdi = sb - > s_bdi ;
DEFINE_WB_COMPLETION ( done , bdi ) ;
2010-07-06 08:59:53 +02:00
struct wb_writeback_work work = {
2010-06-06 10:38:15 -06:00
. sb = sb ,
. sync_mode = WB_SYNC_NONE ,
. tagged_writepages = 1 ,
. done = & done ,
. nr_pages = nr ,
2011-10-07 21:54:10 -06:00
. reason = reason ,
2010-06-08 18:14:43 +02:00
} ;
2009-09-02 12:34:32 +02:00
2015-05-22 17:13:48 -04:00
if ( ! bdi_has_dirty_io ( bdi ) | | bdi = = & noop_backing_dev_info )
2012-07-03 16:45:27 +02:00
return ;
2010-06-08 18:14:51 +02:00
WARN_ON ( ! rwsem_is_locked ( & sb - > s_umount ) ) ;
2015-05-22 17:14:00 -04:00
2015-05-22 17:14:01 -04:00
bdi_split_work_to_wbs ( sb - > s_bdi , & work , skip_if_busy ) ;
2019-08-26 09:06:52 -07:00
wb_wait_for_completion ( & done ) ;
2010-05-17 12:55:07 +02:00
}
2015-05-22 17:14:00 -04:00
/**
* writeback_inodes_sb_nr - writeback dirty inodes from given super_block
* @ sb : the superblock
* @ nr : the number of pages to write
* @ reason : reason why some writeback work was initiated
*
* Start writeback on some inodes on this super_block . No guarantees are made
* on how many ( if any ) will be written , and this function does not wait
* for IO completion of submitted IO .
*/
void writeback_inodes_sb_nr ( struct super_block * sb ,
unsigned long nr ,
enum wb_reason reason )
{
__writeback_inodes_sb_nr ( sb , nr , reason , false ) ;
}
2010-10-29 11:16:17 -04:00
EXPORT_SYMBOL ( writeback_inodes_sb_nr ) ;
/**
* writeback_inodes_sb - writeback dirty inodes from given super_block
* @ sb : the superblock
2011-11-23 20:56:45 +08:00
* @ reason : reason why some writeback work was initiated
2010-10-29 11:16:17 -04:00
*
* Start writeback on some inodes on this super_block . No guarantees are made
* on how many ( if any ) will be written , and this function does not wait
* for IO completion of submitted IO .
*/
2011-10-07 21:54:10 -06:00
void writeback_inodes_sb ( struct super_block * sb , enum wb_reason reason )
2010-10-29 11:16:17 -04:00
{
2011-10-07 21:54:10 -06:00
writeback_inodes_sb_nr ( sb , get_nr_dirty_pages ( ) , reason ) ;
2010-10-29 11:16:17 -04:00
}
2010-06-01 11:08:43 +02:00
EXPORT_SYMBOL ( writeback_inodes_sb ) ;
2010-05-17 12:55:07 +02:00
2009-12-23 07:57:07 -05:00
/**
2017-10-09 13:34:41 +03:00
* try_to_writeback_inodes_sb - try to start writeback if none underway
2009-12-23 07:57:07 -05:00
* @ sb : the superblock
2017-10-09 13:34:41 +03:00
* @ reason : reason why some writeback work was initiated
2009-12-23 07:57:07 -05:00
*
2017-10-09 13:34:41 +03:00
* Invoke __writeback_inodes_sb_nr if no writeback is currently underway .
2009-12-23 07:57:07 -05:00
*/
2017-10-09 13:34:41 +03:00
void try_to_writeback_inodes_sb ( struct super_block * sb , enum wb_reason reason )
2009-12-23 07:57:07 -05:00
{
2013-01-10 13:47:57 +08:00
if ( ! down_read_trylock ( & sb - > s_umount ) )
2017-10-09 13:34:41 +03:00
return ;
2013-01-10 13:47:57 +08:00
2017-10-09 13:34:41 +03:00
__writeback_inodes_sb_nr ( sb , get_nr_dirty_pages ( ) , reason , true ) ;
2013-01-10 13:47:57 +08:00
up_read ( & sb - > s_umount ) ;
2010-10-29 11:16:17 -04:00
}
2013-01-10 13:47:57 +08:00
EXPORT_SYMBOL ( try_to_writeback_inodes_sb ) ;
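try_to_writeback_inodes_sb ( ) is opportunistic: if s_umount cannot be taken for reading without blocking, it gives up instead of waiting. Here is a small userspace sketch of that trylock-or-skip shape. The `demo_rwsem` type is a hypothetical toy reader/writer lock built on C11 atomics, not the kernel rwsem, and the function returns a bool purely so the outcome can be observed, whereas the kernel function returns void.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy reader/writer lock: >0 means that many readers, 0 free, -1 a writer. */
struct demo_rwsem {
	atomic_int count;
};

static bool demo_down_read_trylock(struct demo_rwsem *sem)
{
	int cur = atomic_load(&sem->count);

	/* Retry while no writer holds the lock; cur is refreshed on failure. */
	while (cur >= 0) {
		if (atomic_compare_exchange_weak(&sem->count, &cur, cur + 1))
			return true;
	}
	return false;
}

static void demo_up_read(struct demo_rwsem *sem)
{
	atomic_fetch_sub(&sem->count, 1);
}

static bool demo_down_write_trylock(struct demo_rwsem *sem)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&sem->count, &expected, -1);
}

static void demo_up_write(struct demo_rwsem *sem)
{
	atomic_store(&sem->count, 0);
}

/*
 * Start the (simulated) writeback only if the read lock is free right now.
 * Returns false when the attempt was skipped, e.g. because an unmount
 * holds the lock for writing.
 */
static bool demo_try_writeback(struct demo_rwsem *umount, int *started)
{
	if (!demo_down_read_trylock(umount))
		return false;

	(*started)++; /* stands in for __writeback_inodes_sb_nr() */

	demo_up_read(umount);
	return true;
}
```

Skipping is acceptable here because the caller only wants to nudge writeback along; a caller that needs a guarantee would take the lock unconditionally.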
2010-10-29 11:16:17 -04:00
2009-09-02 12:34:32 +02:00
/**
* sync_inodes_sb - sync sb inode pages
2014-02-21 11:19:04 +01:00
* @ sb : the superblock
2009-09-02 12:34:32 +02:00
*
* This function writes and waits on any dirty inode belonging to this
2014-02-21 11:19:04 +01:00
* super_block .
2009-09-02 12:34:32 +02:00
*/
2014-02-21 11:19:04 +01:00
void sync_inodes_sb ( struct super_block * sb )
2009-09-02 12:34:32 +02:00
{
2019-08-26 09:06:52 -07:00
struct backing_dev_info * bdi = sb - > s_bdi ;
DEFINE_WB_COMPLETION ( done , bdi ) ;
2010-07-06 08:59:53 +02:00
struct wb_writeback_work work = {
2010-06-08 18:14:43 +02:00
. sb = sb ,
. sync_mode = WB_SYNC_ALL ,
. nr_pages = LONG_MAX ,
. range_cyclic = 0 ,
2010-07-06 08:59:53 +02:00
. done = & done ,
2011-10-07 21:54:10 -06:00
. reason = WB_REASON_SYNC ,
2013-07-02 22:38:35 +10:00
. for_sync = 1 ,
2010-06-08 18:14:43 +02:00
} ;
2015-08-25 14:11:52 -04:00
/*
* Can ' t skip on ! bdi_has_dirty_io ( ) because we should wait for ! dirty
* inodes under writeback and I_DIRTY_TIME inodes ignored by
* bdi_has_dirty_io ( ) need to be written out too .
*/
if ( bdi = = & noop_backing_dev_info )
2012-07-03 16:45:27 +02:00
return ;
2010-06-08 18:14:51 +02:00
WARN_ON ( ! rwsem_is_locked ( & sb - > s_umount ) ) ;
2017-12-12 08:38:30 -08:00
/* protect against inode wb switch, see inode_switch_wbs_work_fn() */
bdi_down_write_wb_switch_rwsem ( bdi ) ;
2015-05-22 17:14:01 -04:00
bdi_split_work_to_wbs ( bdi , & work , false ) ;
2019-08-26 09:06:52 -07:00
wb_wait_for_completion ( & done ) ;
2017-12-12 08:38:30 -08:00
bdi_up_write_wb_switch_rwsem ( bdi ) ;
2010-07-06 08:59:53 +02:00
2009-09-16 15:13:54 +02:00
wait_sb_inodes ( sb ) ;
2005-04-16 15:20:36 -07:00
}
2009-09-02 12:34:32 +02:00
EXPORT_SYMBOL ( sync_inodes_sb ) ;
2005-04-16 15:20:36 -07:00
/**
[PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set
list_move(&inode->i_list, &inode_in_use);
} else {
list_move(&inode->i_list, &inode_unused);
+ inodes_stat.nr_unused++;
}
}
wake_up_inode(inode);
Are you sure the above diff is correct? It was added somewhere between
2.6.5 and 2.6.8. I think it's wrong.
The only way I can imagine the i_count to be zero in the above path, is
that I_WILL_FREE is set. And if I_WILL_FREE is set, then we must not
increase nr_unused. So I believe the above change is buggy and it will
definitely overstate the number of unused inodes and it should be backed
out.
Note that __writeback_single_inode before calling __sync_single_inode, can
drop the spinlock and we can have both the dirty and locked bitflags clear
here:
spin_unlock(&inode_lock);
__wait_on_inode(inode);
iput(inode);
XXXXXXX
spin_lock(&inode_lock);
}
use inode again here
a construct like the above makes zero sense from a reference counting
standpoint.
Either we don't ever use the inode again after the iput, or the
inode_lock should be taken _before_ executing the iput (i.e. a __iput
would be required). Taking the inode_lock after iput means the iget was
useless if we keep using the inode after the iput.
So the only chance the 2.6 was safe to call __writeback_single_inode
with the i_count == 0, is that I_WILL_FREE is set (I_WILL_FREE will
prevent the VM to free the inode in XXXXX).
Potentially calling the above iput with I_WILL_FREE was also wrong
because it would recurse in iput_final (the second mainline bug).
The below (untested) patch fixes the nr_unused accounting, avoids recursing
in iput when I_WILL_FREE is set and makes sure (with the BUG_ON) that we
don't corrupt memory and that all holders that don't set I_WILL_FREE, keeps
a reference on the inode!
Signed-off-by: Andrea Arcangeli <andrea@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 15:03:05 -08:00
* write_inode_now - write an inode to disk
* @ inode : inode to write to disk
* @ sync : whether the write should be synchronous or not
*
* This function commits an inode to disk immediately if it is dirty . This is
* primarily needed by knfsd .
2005-04-16 15:20:36 -07:00
*
2005-10-30 15:03:05 -08:00
* The caller must either have a ref on the inode or must have set I_WILL_FREE .
2005-04-16 15:20:36 -07:00
*/
int write_inode_now ( struct inode * inode , int sync )
{
struct writeback_control wbc = {
. nr_to_write = LONG_MAX ,
2008-02-08 04:20:23 -08:00
. sync_mode = sync ? WB_SYNC_ALL : WB_SYNC_NONE ,
[PATCH] writeback: fix range handling
When a writeback_control's `start' and `end' fields are used to
indicate a one-byte-range starting at file offset zero, the required
values of .start=0,.end=0 mean that the ->writepages() implementation
has no way of telling that it is being asked to perform a range
request. Because we're currently overloading (start == 0 && end == 0)
to mean "this is not a write-a-range request".
To make all this sane, the patch changes range of writeback_control.
So caller does: If it is calling ->writepages() to write pages, it
sets range (range_start/end or range_cyclic) always.
And if range_cyclic is true, ->writepages() thinks the range is
cyclic, otherwise it just uses range_start and range_end.
This patch does,
- Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h
-1 is usually ok for range_end (type is long long). But, if someone did,
range_end += val; range_end is "val - 1"
u64val = range_end >> bits; u64val is "~(0ULL)"
or something, they are wrong. So, this adds LLONG_MAX to avoid nasty
things, and uses LLONG_MAX for range_end.
- All callers of ->writepages() sets range_start/end or range_cyclic.
- Fix updates of ->writeback_index. It seems already bit strange.
If it starts at 0 and ended by check of nr_to_write, this last
index may reduce chance to scan end of file. So, this updates
->writeback_index only if range_cyclic is true or whole-file is
scanned.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Steven French <sfrench@us.ibm.com>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 02:03:26 -07:00
. range_start = 0 ,
. range_end = LLONG_MAX ,
2005-04-16 15:20:36 -07:00
} ;
2020-09-24 08:51:40 +02:00
if ( ! mapping_can_writeback ( inode - > i_mapping ) )
2005-11-07 00:59:15 -08:00
wbc . nr_to_write = 0 ;
2005-04-16 15:20:36 -07:00
might_sleep ( ) ;
2016-03-18 13:52:04 -04:00
return writeback_single_inode ( inode , & wbc ) ;
2005-04-16 15:20:36 -07:00
}
EXPORT_SYMBOL ( write_inode_now ) ;
2010-10-06 10:48:20 +02:00
/**
2011-01-13 15:45:48 -08:00
* sync_inode_metadata - write an inode to disk
2010-10-06 10:48:20 +02:00
* @ inode : the inode to sync
* @ wait : wait for I / O to complete .
*
2011-01-13 15:45:48 -08:00
* Write an inode to disk and adjust its dirty state after completion .
2010-10-06 10:48:20 +02:00
*
* Note : only writes the actual inode , no associated data or other metadata .
*/
int sync_inode_metadata ( struct inode * inode , int wait )
{
struct writeback_control wbc = {
. sync_mode = wait ? WB_SYNC_ALL : WB_SYNC_NONE ,
. nr_to_write = 0 , /* metadata-only */
} ;
2021-07-14 14:47:25 -04:00
return writeback_single_inode ( inode , & wbc ) ;
2010-10-06 10:48:20 +02:00
}
EXPORT_SYMBOL ( sync_inode_metadata ) ;
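write_inode_now ( ) and sync_inode_metadata ( ) differ only in how they fill in the writeback control before calling writeback_single_inode ( ). The sketch below puts the two configurations side by side in userspace. `struct demo_wbc` is a hypothetical cut-down stand-in that carries only the fields set in the code above, and the `can_writeback` parameter models the result of mapping_can_writeback ( ).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

enum demo_sync_mode { DEMO_WB_SYNC_NONE, DEMO_WB_SYNC_ALL };

/* Hypothetical subset of struct writeback_control. */
struct demo_wbc {
	long nr_to_write;
	enum demo_sync_mode sync_mode;
	long long range_start;
	long long range_end;
};

/* Whole-file data plus inode, unless the mapping cannot do writeback. */
static struct demo_wbc demo_wbc_write_inode_now(int sync, bool can_writeback)
{
	struct demo_wbc wbc = {
		.nr_to_write = LONG_MAX,
		.sync_mode = sync ? DEMO_WB_SYNC_ALL : DEMO_WB_SYNC_NONE,
		.range_start = 0,
		.range_end = LLONG_MAX,
	};

	if (!can_writeback)
		wbc.nr_to_write = 0; /* no data pages, inode only */
	return wbc;
}

/* Inode only: a page budget of zero means no data pages are written. */
static struct demo_wbc demo_wbc_sync_metadata(int wait)
{
	struct demo_wbc wbc = {
		.sync_mode = wait ? DEMO_WB_SYNC_ALL : DEMO_WB_SYNC_NONE,
		.nr_to_write = 0,
	};

	return wbc;
}
```

In both cases the boolean argument selects between waiting for I/O (the SYNC_ALL mode) and merely starting it (the SYNC_NONE mode).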