2008-06-12 00:50:36 +04:00
/*
 * Copyright (C) 2007 Oracle.  All rights reserved.
2014-02-28 06:46:03 +04:00
 * Copyright (C) 2014 Fujitsu.  All rights reserved.
2008-06-12 00:50:36 +04:00
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public
 * License v2 as published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public
 * License along with this program; if not, write to the
 * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
 * Boston, MA 021110-1307, USA.
 */
#include <linux/kthread.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 11:04:11 +03:00
#include <linux/slab.h>
2008-06-12 00:50:36 +04:00
#include <linux/list.h>
#include <linux/spinlock.h>
2009-02-04 17:23:24 +03:00
#include <linux/freezer.h>
2008-06-12 00:50:36 +04:00
#include "async-thread.h"
2014-03-06 08:19:50 +04:00
#include "ctree.h"
2008-06-12 00:50:36 +04:00
2014-02-28 06:46:18 +04:00
#define WORK_DONE_BIT 0
#define WORK_ORDER_DONE_BIT 1
#define WORK_HIGH_PRIO_BIT 2
Btrfs: Add ordered async work queues
Btrfs uses kernel threads to create async work queues for cpu intensive
operations such as checksumming and decompression. These work well,
but they make it difficult to keep IO order intact.
A single writepages call from pdflush or fsync will turn into a number
of bios, and each bio is checksummed in parallel. Once the checksum is
computed, the bio is sent down to the disk, and since we don't control
the order in which the parallel operations happen, they might go down to
the disk in almost any order.
The code deals with this somewhat by having deep work queues for a single
kernel thread, making it very likely that a single thread will process all
the bios for a single inode.
This patch introduces an explicitly ordered work queue. As work structs
are placed into the queue they are put onto the tail of a list. They have
three callbacks:
->func (cpu intensive processing here)
->ordered_func (order sensitive processing here)
->ordered_free (free the work struct, all processing is done)
The func callback does the cpu intensive
work, and when it completes the work struct is marked as done.
Every time a work struct completes, the list is checked to see if the head
is marked as done. If so the ordered_func callback is used to do the
order sensitive processing and the ordered_free callback is used to do
any cleanup. Then we loop back and check the head of the list again.
This patch also changes the checksumming code to use the ordered workqueues.
On a 4 drive array, it increases streaming writes from 280MB/s to 350MB/s.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
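The ordering rule described in this commit message can be illustrated with a small userspace sketch. It is not the btrfs implementation: every demo_* name is hypothetical, the list is a fixed-size array, there is no locking, and the ordered_func/ordered_free callbacks are reduced to recording an id and advancing the head. It only shows that work may finish in any order while the order-sensitive step still runs in submission order.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_ITEMS 8

/* Hypothetical, simplified stand-in for a work struct. */
struct demo_work {
	int id;
	bool done;	/* set once the CPU-intensive step has finished */
};

/* Hypothetical ordered queue: items are kept in submission order. */
struct demo_queue {
	struct demo_work *items[DEMO_MAX_ITEMS];
	size_t head;
	size_t tail;
	int ordered_log[DEMO_MAX_ITEMS];	/* ids in the order the ordered step ran */
	size_t logged;
};

static void demo_queue_init(struct demo_queue *q)
{
	q->head = 0;
	q->tail = 0;
	q->logged = 0;
}

/* Put a work item on the tail of the list; fails when the demo array is full. */
static bool demo_enqueue(struct demo_queue *q, struct demo_work *w)
{
	if (q->tail == DEMO_MAX_ITEMS)
		return false;
	w->done = false;
	q->items[q->tail++] = w;
	return true;
}

/*
 * Mark one item as done, then repeatedly check the head of the list.
 * While the head is done, run the order-sensitive step for it (here:
 * record its id) and drop it from the list.
 */
static void demo_complete(struct demo_queue *q, struct demo_work *w)
{
	w->done = true;
	while (q->head < q->tail && q->items[q->head]->done) {
		q->ordered_log[q->logged++] = q->items[q->head]->id;
		q->head++;
	}
}
```

Completing the last item first leaves the log empty; the log only grows once the head item is done, so the recorded ids always follow submission order.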
2008-11-07 06:03:00 +03:00
2014-02-28 06:46:05 +04:00
#define NO_THRESHOLD (-1)
#define DFT_THRESHOLD (32)
2014-02-28 06:46:19 +04:00
struct __btrfs_workqueue {
2014-02-28 06:46:03 +04:00
	struct workqueue_struct *normal_wq;
	/* List head pointing to ordered work list */
	struct list_head ordered_list;
	/* Spinlock for ordered_list */
	spinlock_t list_lock;
2014-02-28 06:46:05 +04:00
	/* Thresholding-related variables */
	atomic_t pending;
2015-08-20 04:30:39 +03:00
	/* Upper limit on concurrent workers */
	int limit_active;
	/* Current number of concurrent workers */
	int current_active;
	/* Threshold to change current_active */
2014-02-28 06:46:05 +04:00
	int thresh;
	unsigned int count;
	spinlock_t thres_lock;
2014-02-28 06:46:03 +04:00
};
2014-02-28 06:46:19 +04:00
struct btrfs_workqueue {
	struct __btrfs_workqueue *normal;
	struct __btrfs_workqueue *high;
2014-02-28 06:46:04 +04:00
};
Btrfs: fix task hang under heavy compressed write
This has been reported and discussed for a long time, and this hang occurs in
both 3.15 and 3.16.
Btrfs has migrated to the kernel workqueue, which introduced this hang problem.
Btrfs has a kind of work that is queued in an ordered way, which means that its
ordered_func() must be processed in FIFO order, so it usually looks like --
normal_work_helper(arg)
work = container_of(arg, struct btrfs_work, normal_work);
work->func() <---- (we name it work X)
for ordered_work in wq->ordered_list
ordered_work->ordered_func()
ordered_work->ordered_free()
The hang is a rare case, first when we find free space, we get an uncached block
group, then we go to read its free space cache inode for free space information,
so it will
file a readahead request
btrfs_readpages()
for page that is not in page cache
__do_readpage()
submit_extent_page()
btrfs_submit_bio_hook()
btrfs_bio_wq_end_io()
submit_bio()
end_workqueue_bio() <--(ret by the 1st endio)
queue a work(named work Y) for the 2nd
also the real endio()
So the hang occurs when work Y's work_struct and work X's work_struct happen
to share the same address.
A bit more explanation,
A,B,C -- struct btrfs_work
arg -- struct work_struct
kthread:
worker_thread()
pick up a work_struct from @worklist
process_one_work(arg)
worker->current_work = arg; <-- arg is A->normal_work
worker->current_func(arg)
normal_work_helper(arg)
A = container_of(arg, struct btrfs_work, normal_work);
A->func()
A->ordered_func()
A->ordered_free() <-- A gets freed
B->ordered_func()
submit_compressed_extents()
find_free_extent()
load_free_space_inode()
... <-- (the above readahead stack)
end_workqueue_bio()
btrfs_queue_work(work C)
B->ordered_free()
If work A has a high priority in wq->ordered_list and there are more ordered
works queued after it, such as B->ordered_func(), its memory could have been
freed before normal_work_helper() returns, which means that the kernel workqueue
code worker_thread() still has worker->current_work pointing to work
A->normal_work, i.e. arg's address.
Meanwhile, work C is allocated after work A is freed, so work C->normal_work
and work A->normal_work are likely to share the same address (I confirmed this
with ftrace output, so I'm not just guessing, it's rare though).
When another kthread picks up work C->normal_work to process, and finds our
kthread is processing it (see find_worker_executing_work()), it'll treat
work C as a collision and skip it, which ends up with nobody processing work C.
So the situation is that our kthread is waiting forever on work C.
Besides, there are other cases that can lead to deadlock, but the real problem
is that all btrfs workqueues share one work->func -- normal_work_helper,
so this patch makes each workqueue have its own helper function, each only a
wrapper of normal_work_helper.
With this patch, I no longer hit the above hang.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
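The collision described above can be sketched in userspace C. This is an illustration under assumptions, not kernel code: the demo_* types and functions are hypothetical, and demo_is_executing() only mimics the idea that a worker is treated as already running a work item when both the item's address and its function match what the worker recorded. With one shared function, a reused address looks like the same item; with per-queue helpers, the function differs and the match fails.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_work_struct;
typedef void (*demo_work_fn)(struct demo_work_struct *);

/* Hypothetical stand-in for a queued work item. */
struct demo_work_struct {
	demo_work_fn func;
	int tag;
};

/* Hypothetical stand-in for what a worker records while it runs an item. */
struct demo_worker {
	struct demo_work_struct *current_work;
	demo_work_fn current_func;
};

/* True when the worker appears to be executing exactly this item. */
static bool demo_is_executing(const struct demo_worker *worker,
			      const struct demo_work_struct *work)
{
	return worker->current_work == work &&
	       worker->current_func == work->func;
}

/* One function shared by every queue (the situation before the fix). */
static void demo_shared_helper(struct demo_work_struct *w)
{
	w->tag = 0;
}

/* Distinct per-queue helpers (the situation after the fix). */
static void demo_endio_helper(struct demo_work_struct *w)
{
	w->tag = 1;
}

static void demo_delalloc_helper(struct demo_work_struct *w)
{
	w->tag = 2;
}
```

In this sketch, a worker that recorded (address, demo_shared_helper) reports a match for any later item placed at the same address with the same function, while an item using demo_endio_helper at that address is not mistaken for one that was started through demo_delalloc_helper.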
2014-08-15 19:36:53 +04:00
static void normal_work_helper(struct btrfs_work *work);
#define BTRFS_WORK_HELPER(name) \
void btrfs_##name(struct work_struct *arg) \
{ \
	struct btrfs_work *work = container_of(arg, struct btrfs_work, \
					       normal_work); \
	normal_work_helper(work); \
}
BTRFS_WORK_HELPER(worker_helper);
BTRFS_WORK_HELPER(delalloc_helper);
BTRFS_WORK_HELPER(flush_delalloc_helper);
BTRFS_WORK_HELPER(cache_helper);
BTRFS_WORK_HELPER(submit_helper);
BTRFS_WORK_HELPER(fixup_helper);
BTRFS_WORK_HELPER(endio_helper);
BTRFS_WORK_HELPER(endio_meta_helper);
BTRFS_WORK_HELPER(endio_meta_write_helper);
BTRFS_WORK_HELPER(endio_raid56_helper);
2014-09-12 14:44:03 +04:00
BTRFS_WORK_HELPER(endio_repair_helper);
2014-08-15 19:36:53 +04:00
BTRFS_WORK_HELPER(rmw_helper);
BTRFS_WORK_HELPER(endio_write_helper);
BTRFS_WORK_HELPER(freespace_write_helper);
BTRFS_WORK_HELPER(delayed_meta_helper);
BTRFS_WORK_HELPER(readahead_helper);
BTRFS_WORK_HELPER(qgroup_rescan_helper);
BTRFS_WORK_HELPER(extent_refs_helper);
BTRFS_WORK_HELPER(scrub_helper);
BTRFS_WORK_HELPER(scrubwrc_helper);
BTRFS_WORK_HELPER(scrubnc_helper);
2015-06-04 15:09:15 +03:00
BTRFS_WORK_HELPER(scrubparity_helper);
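The pattern used by BTRFS_WORK_HELPER, a macro that stamps out distinctly named wrappers which recover the enclosing struct from a pointer to an embedded member and then call one common function, can be tried out in userspace. The sketch below is an assumption-laden illustration: demo_container_of is a simplified offsetof-based version without the kernel's type checking, and all demo_* names are hypothetical.

```c
#include <stddef.h>

/* Simplified container_of: step back from a member to its enclosing struct. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical embedded member, playing the role of the generic work item. */
struct demo_inner {
	int unused;
};

/* Hypothetical enclosing struct, playing the role of the wrapped work. */
struct demo_outer {
	int id;
	struct demo_inner inner;
	int calls;
};

/* Common body shared by every generated wrapper. */
static void demo_common(struct demo_outer *outer)
{
	outer->calls++;
}

/* Generate one uniquely named wrapper per use, all calling demo_common(). */
#define DEMO_HELPER(name) \
static void demo_##name(struct demo_inner *arg) \
{ \
	struct demo_outer *outer = \
		demo_container_of(arg, struct demo_outer, inner); \
	demo_common(outer); \
}

DEMO_HELPER(endio_helper)
DEMO_HELPER(scrub_helper)
```

Each DEMO_HELPER() use yields a separate function with its own address, so two callers can share all of the logic while still being distinguishable by function pointer.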
2014-08-15 19:36:53 +04:00
static struct __btrfs_workqueue *
2015-08-20 04:30:39 +03:00
__btrfs_alloc_workqueue(const char *name, unsigned int flags, int limit_active,
2014-03-12 12:05:33 +04:00
			 int thresh)
2014-02-28 06:46:04 +04:00
{
2015-12-01 20:04:30 +03:00
	struct __btrfs_workqueue *ret = kzalloc(sizeof(*ret), GFP_KERNEL);
2014-02-28 06:46:04 +04:00
2014-09-29 21:20:37 +04:00
	if (!ret)
2014-02-28 06:46:04 +04:00
		return NULL;
2015-08-20 04:30:39 +03:00
	ret->limit_active = limit_active;
2014-02-28 06:46:05 +04:00
	atomic_set(&ret->pending, 0);
	if (thresh == 0)
		thresh = DFT_THRESHOLD;
	/* For low threshold, disabling threshold is a better choice */
	if (thresh < DFT_THRESHOLD) {
2015-08-20 04:30:39 +03:00
		ret->current_active = limit_active;
2014-02-28 06:46:05 +04:00
		ret->thresh = NO_THRESHOLD;
	} else {
2015-08-20 04:30:39 +03:00
		/*
		 * For threshold-able wq, let its concurrency grow on demand.
		 * Use minimal max_active at alloc time to reduce resource
		 * usage.
		 */
		ret->current_active = 1;
2014-02-28 06:46:05 +04:00
		ret->thresh = thresh;
	}
2014-02-28 06:46:04 +04:00
	if (flags & WQ_HIGHPRI)
		ret->normal_wq = alloc_workqueue("%s-%s-high", flags,
2015-08-20 04:30:39 +03:00
						 ret->current_active, "btrfs",
						 name);
2014-02-28 06:46:04 +04:00
	else
		ret->normal_wq = alloc_workqueue("%s-%s", flags,
2015-08-20 04:30:39 +03:00
						 ret->current_active, "btrfs",
2014-02-28 06:46:05 +04:00
						 name);
2014-09-29 21:20:37 +04:00
	if (!ret->normal_wq) {
2014-02-28 06:46:04 +04:00
		kfree(ret);
		return NULL;
	}
	INIT_LIST_HEAD(&ret->ordered_list);
	spin_lock_init(&ret->list_lock);
2014-02-28 06:46:05 +04:00
	spin_lock_init(&ret->thres_lock);
2014-03-12 12:05:33 +04:00
	trace_btrfs_workqueue_alloc(ret, name, flags & WQ_HIGHPRI);
2014-02-28 06:46:04 +04:00
	return ret;
}
static inline void
2014-02-28 06:46:19 +04:00
__btrfs_destroy_workqueue(struct __btrfs_workqueue *wq);
2014-02-28 06:46:04 +04:00
2014-03-12 12:05:33 +04:00
struct btrfs_workqueue *btrfs_alloc_workqueue(const char *name,
2015-02-16 20:34:01 +03:00
					      unsigned int flags,
2015-08-20 04:30:39 +03:00
					      int limit_active,
2014-02-28 06:46:19 +04:00
					      int thresh)
2014-02-28 06:46:03 +04:00
{
2015-12-01 20:04:30 +03:00
	struct btrfs_workqueue *ret = kzalloc(sizeof(*ret), GFP_KERNEL);
2014-02-28 06:46:03 +04:00
2014-09-29 21:20:37 +04:00
	if (!ret)
2014-02-28 06:46:03 +04:00
		return NULL;
2014-02-28 06:46:04 +04:00
	ret->normal = __btrfs_alloc_workqueue(name, flags & ~WQ_HIGHPRI,
2015-08-20 04:30:39 +03:00
					      limit_active, thresh);
2014-09-29 21:20:37 +04:00
	if (!ret->normal) {
2014-02-28 06:46:03 +04:00
		kfree(ret);
		return NULL;
	}
2014-02-28 06:46:04 +04:00
	if (flags & WQ_HIGHPRI) {
2015-08-20 04:30:39 +03:00
		ret->high = __btrfs_alloc_workqueue(name, flags, limit_active,
2014-02-28 06:46:05 +04:00
						    thresh);
2014-09-29 21:20:37 +04:00
		if (!ret->high) {
2014-02-28 06:46:04 +04:00
			__btrfs_destroy_workqueue(ret->normal);
			kfree(ret);
			return NULL;
		}
	}
2014-02-28 06:46:03 +04:00
	return ret;
}
2014-02-28 06:46:05 +04:00
/*
 * Hook for threshold which will be called in btrfs_queue_work.
 * This hook WILL be called in IRQ handler context,
 * so workqueue_set_max_active MUST NOT be called in this hook
 */
2014-02-28 06:46:19 +04:00
static inline void thresh_queue_hook(struct __btrfs_workqueue *wq)
2014-02-28 06:46:05 +04:00
{
	if (wq->thresh == NO_THRESHOLD)
		return;
	atomic_inc(&wq->pending);
}
/*
 * Hook for threshold which will be called before executing the work,
 * This hook is called in kthread context.
 * So workqueue_set_max_active is called here.
 */
2014-02-28 06:46:19 +04:00
static inline void thresh_exec_hook(struct __btrfs_workqueue *wq)
2014-02-28 06:46:05 +04:00
{
2015-08-20 04:30:39 +03:00
	int new_current_active;
2014-02-28 06:46:05 +04:00
	long pending;
	int need_change = 0;
	if (wq->thresh == NO_THRESHOLD)
		return;
	atomic_dec(&wq->pending);
	spin_lock(&wq->thres_lock);
	/*
	 * Use wq->count to limit the calling frequency of
	 * workqueue_set_max_active.
	 */
	wq->count++;
	wq->count %= (wq->thresh / 4);
	if (!wq->count)
		goto out;
2015-08-20 04:30:39 +03:00
	new_current_active = wq->current_active;
2014-02-28 06:46:05 +04:00
	/*
	 * pending may be changed later, but it's OK since we really
	 * don't need it so accurate to calculate new_max_active.
	 */
	pending = atomic_read(&wq->pending);
	if (pending > wq->thresh)
2015-08-20 04:30:39 +03:00
		new_current_active++;
2014-02-28 06:46:05 +04:00
	if (pending < wq->thresh / 2)
2015-08-20 04:30:39 +03:00
		new_current_active--;
	new_current_active = clamp_val(new_current_active, 1, wq->limit_active);
	if (new_current_active != wq->current_active) {
2014-02-28 06:46:05 +04:00
		need_change = 1;
2015-08-20 04:30:39 +03:00
		wq->current_active = new_current_active;
2014-02-28 06:46:05 +04:00
	}
out:
	spin_unlock(&wq->thres_lock);
	if (need_change) {
2015-08-20 04:30:39 +03:00
		workqueue_set_max_active(wq->normal_wq, wq->current_active);
2014-02-28 06:46:05 +04:00
	}
}
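The adjustment rule in thresh_exec_hook() can be isolated as a pure function for experimentation. This is a userspace sketch under assumptions: demo_next_active() and demo_clamp() are hypothetical names, the locking and the wq->count rate limiting are left out, and only the grow/shrink/clamp arithmetic from the code above is reproduced.

```c
/* Clamp v into the inclusive range [lo, hi]. */
static int demo_clamp(int v, int lo, int hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Compute the next concurrency level from the current one:
 * grow by one when more than thresh items are pending,
 * shrink by one when fewer than thresh / 2 are pending,
 * and always stay within [1, limit_active].
 */
static int demo_next_active(int current_active, long pending,
			    int thresh, int limit_active)
{
	int next = current_active;

	if (pending > thresh)
		next++;
	if (pending < thresh / 2)
		next--;
	return demo_clamp(next, 1, limit_active);
}
```

With thresh set to 32 and limit_active set to 8, a backlog of 100 pending items raises the level by one step per call until it reaches 8, and a backlog below 16 lowers it by one step per call until it reaches 1; anything in between leaves it unchanged.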
2014-02-28 06:46:19 +04:00
static void run_ordered_work(struct __btrfs_workqueue *wq)
2014-02-28 06:46:03 +04:00
{
	struct list_head *list = &wq->ordered_list;
2014-02-28 06:46:19 +04:00
	struct btrfs_work *work;
2014-02-28 06:46:03 +04:00
	spinlock_t *lock = &wq->list_lock;
	unsigned long flags;
	while (1) {
		spin_lock_irqsave(lock, flags);
		if (list_empty(list))
			break;
2014-02-28 06:46:19 +04:00
		work = list_entry(list->next, struct btrfs_work,
2014-02-28 06:46:03 +04:00
				  ordered_list);
		if (!test_bit(WORK_DONE_BIT, &work->flags))
			break;
		/*
		 * we are going to call the ordered done function, but
		 * we leave the work item on the list as a barrier so
		 * that later work items that are done don't have their
		 * functions called before this one returns
		 */
		if (test_and_set_bit(WORK_ORDER_DONE_BIT, &work->flags))
			break;
2014-03-06 08:19:50 +04:00
		trace_btrfs_ordered_sched(work);
2014-02-28 06:46:03 +04:00
		spin_unlock_irqrestore(lock, flags);
		work->ordered_func(work);
		/* now take the lock again and drop our item from the list */
		spin_lock_irqsave(lock, flags);
		list_del(&work->ordered_list);
		spin_unlock_irqrestore(lock, flags);
		/*
		 * we don't want to call the ordered free functions
		 * with the lock held though
		 */
		work->ordered_free(work);
2014-03-06 08:19:50 +04:00
		trace_btrfs_all_work_done(work);
2014-02-28 06:46:03 +04:00
	}
	spin_unlock_irqrestore(lock, flags);
}
2014-08-15 19:36:53 +04:00
static void normal_work_helper(struct btrfs_work *work)
2014-02-28 06:46:03 +04:00
{
2014-02-28 06:46:19 +04:00
	struct __btrfs_workqueue *wq;
2014-02-28 06:46:03 +04:00
	int need_order = 0;
	/*
	 * We should not touch things inside work in the following cases:
	 * 1) after work->func() if it has no ordered_free
	 *    Since the struct is freed in work->func().
	 * 2) after setting WORK_DONE_BIT
	 *    The work may be freed in other threads almost instantly.
	 * So we save the needed things here.
	 */
	if (work->ordered_func)
		need_order = 1;
	wq = work->wq;
2014-03-06 08:19:50 +04:00
	trace_btrfs_work_sched(work);
2014-02-28 06:46:05 +04:00
	thresh_exec_hook(wq);
2014-02-28 06:46:03 +04:00
	work->func(work);
	if (need_order) {
		set_bit(WORK_DONE_BIT, &work->flags);
		run_ordered_work(wq);
	}
2014-03-06 08:19:50 +04:00
	if (!need_order)
		trace_btrfs_all_work_done(work);
2014-02-28 06:46:03 +04:00
}
2014-08-15 19:36:53 +04:00
void btrfs_init_work(struct btrfs_work *work, btrfs_work_func_t uniq_func,
2014-03-06 08:19:50 +04:00
		     btrfs_func_t func,
		     btrfs_func_t ordered_func,
		     btrfs_func_t ordered_free)
2014-02-28 06:46:03 +04:00
{
	work->func = func;
	work->ordered_func = ordered_func;
	work->ordered_free = ordered_free;
2014-08-15 19:36:53 +04:00
	INIT_WORK(&work->normal_work, uniq_func);
2014-02-28 06:46:03 +04:00
	INIT_LIST_HEAD(&work->ordered_list);
	work->flags = 0;
}
2014-02-28 06:46:19 +04:00
static inline void __btrfs_queue_work(struct __btrfs_workqueue *wq,
				      struct btrfs_work *work)
2014-02-28 06:46:03 +04:00
{
	unsigned long flags;
	work->wq = wq;
2014-02-28 06:46:05 +04:00
	thresh_queue_hook(wq);
2014-02-28 06:46:03 +04:00
	if (work->ordered_func) {
		spin_lock_irqsave(&wq->list_lock, flags);
		list_add_tail(&work->ordered_list, &wq->ordered_list);
		spin_unlock_irqrestore(&wq->list_lock, flags);
	}
2014-03-06 08:19:50 +04:00
	trace_btrfs_work_queued(work);
2016-01-22 04:28:38 +03:00
	queue_work(wq->normal_wq, &work->normal_work);
2014-02-28 06:46:03 +04:00
}
2014-02-28 06:46:19 +04:00
void btrfs_queue_work(struct btrfs_workqueue *wq,
		      struct btrfs_work *work)
2014-02-28 06:46:04 +04:00
{
2014-02-28 06:46:19 +04:00
	struct __btrfs_workqueue *dest_wq;
2014-02-28 06:46:04 +04:00
	if (test_bit(WORK_HIGH_PRIO_BIT, &work->flags) && wq->high)
		dest_wq = wq->high;
	else
		dest_wq = wq->normal;
	__btrfs_queue_work(dest_wq, work);
}
static inline void
2014-02-28 06:46:19 +04:00
__btrfs_destroy_workqueue(struct __btrfs_workqueue *wq)
2014-02-28 06:46:03 +04:00
{
	destroy_workqueue(wq->normal_wq);
2014-03-12 12:05:33 +04:00
	trace_btrfs_workqueue_destroy(wq);
2014-02-28 06:46:03 +04:00
	kfree(wq);
}
2014-02-28 06:46:19 +04:00
void btrfs_destroy_workqueue(struct btrfs_workqueue *wq)
2014-02-28 06:46:04 +04:00
{
	if (!wq)
		return;
	if (wq->high)
		__btrfs_destroy_workqueue(wq->high);
	__btrfs_destroy_workqueue(wq->normal);
2014-03-11 18:31:44 +04:00
	kfree(wq);
2014-02-28 06:46:04 +04:00
}
2015-08-20 04:30:39 +03:00
void btrfs_workqueue_set_max(struct btrfs_workqueue *wq, int limit_active)
2014-02-28 06:46:03 +04:00
{
2014-04-07 11:55:46 +04:00
	if (!wq)
		return;
2015-08-20 04:30:39 +03:00
	wq->normal->limit_active = limit_active;
2014-02-28 06:46:04 +04:00
	if (wq->high)
2015-08-20 04:30:39 +03:00
		wq->high->limit_active = limit_active;
2014-02-28 06:46:04 +04:00
}
2014-02-28 06:46:19 +04:00
void btrfs_set_work_high_priority(struct btrfs_work *work)
2014-02-28 06:46:04 +04:00
{
	set_bit(WORK_HIGH_PRIO_BIT, &work->flags);
2014-02-28 06:46:03 +04:00
}
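The routing done by btrfs_queue_work() and btrfs_set_work_high_priority() above can be tried outside the kernel with a small model. This is a sketch under assumptions, not the kernel code: the names queue, workqueue, work, WORK_HIGH_PRIO, set_high_priority and pick_queue are hypothetical, and a plain non-atomic bit operation stands in for the kernel's set_bit()/test_bit().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag value; the kernel code uses a bit number instead. */
#define WORK_HIGH_PRIO (1UL << 0)

/* Hypothetical stand-ins for the normal/high pair of sub-queues. */
struct queue {
	const char *name;
};

struct workqueue {
	struct queue *normal;	/* always present */
	struct queue *high;	/* may be NULL if no high-priority queue exists */
};

struct work {
	unsigned long flags;
};

/* Non-atomic stand-in for marking a work item as high priority. */
void set_high_priority(struct work *w)
{
	w->flags |= WORK_HIGH_PRIO;
}

/*
 * Pick the destination queue: the high-priority queue is used only when
 * the work is flagged AND such a queue exists; otherwise fall back to
 * the normal queue.
 */
struct queue *pick_queue(const struct workqueue *wq, const struct work *w)
{
	if ((w->flags & WORK_HIGH_PRIO) && wq->high)
		return wq->high;
	return wq->normal;
}
```

The fallback to the normal queue is the point of interest: a flagged work item is still queued somewhere even when the workqueue has no high-priority half.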