// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 2007 Oracle.  All rights reserved.
 */

#include <linux/sched.h>
#include <linux/sched/mm.h>
#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/blkdev.h>
#include <linux/ratelimit.h>
#include <linux/kthread.h>
#include <linux/raid/pq.h>
#include <linux/semaphore.h>
#include <linux/uuid.h>
#include <linux/list_sort.h>
#include "misc.h"
#include "ctree.h"
#include "extent_map.h"
#include "disk-io.h"
#include "transaction.h"
#include "print-tree.h"
#include "volumes.h"
#include "raid56.h"
#include "async-thread.h"
#include "check-integrity.h"
#include "rcu-string.h"
#include "dev-replace.h"
#include "sysfs.h"
#include "tree-checker.h"
#include "space-info.h"
#include "block-group.h"
#include "discard.h"
#include "zoned.h"

const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
	[BTRFS_RAID_RAID10] = {
		.sub_stripes	= 2,
		.dev_stripes	= 1,
		.devs_max	= 0,	/* 0 == as many as possible */
		.devs_min	= 2,
		.tolerated_failures = 1,
		.devs_increment	= 2,
		.ncopies	= 2,
		.nparity	= 0,
		.raid_name	= "raid10",
		.bg_flag	= BTRFS_BLOCK_GROUP_RAID10,
		.mindev_error	= BTRFS_ERROR_DEV_RAID10_MIN_NOT_MET,
	},
	[BTRFS_RAID_RAID1] = {
		.sub_stripes	= 1,
		.dev_stripes	= 1,
		.devs_max	= 2,
		.devs_min	= 2,
		.tolerated_failures = 1,
		.devs_increment	= 2,
		.ncopies	= 2,
		.nparity	= 0,
		.raid_name	= "raid1",
		.bg_flag	= BTRFS_BLOCK_GROUP_RAID1,
		.mindev_error	= BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET,
	},
	[BTRFS_RAID_RAID1C3] = {
		.sub_stripes	= 1,
		.dev_stripes	= 1,
		.devs_max	= 3,
		.devs_min	= 3,
		.tolerated_failures = 2,
		.devs_increment	= 3,
		.ncopies	= 3,
		.nparity	= 0,
		.raid_name	= "raid1c3",
		.bg_flag	= BTRFS_BLOCK_GROUP_RAID1C3,
		.mindev_error	= BTRFS_ERROR_DEV_RAID1C3_MIN_NOT_MET,
	},
	[BTRFS_RAID_RAID1C4] = {
		.sub_stripes	= 1,
		.dev_stripes	= 1,
		.devs_max	= 4,
		.devs_min	= 4,
		.tolerated_failures = 3,
		.devs_increment	= 4,
		.ncopies	= 4,
		.nparity	= 0,
		.raid_name	= "raid1c4",
		.bg_flag	= BTRFS_BLOCK_GROUP_RAID1C4,
		.mindev_error	= BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
	},
	[BTRFS_RAID_DUP] = {
		.sub_stripes	= 1,
		.dev_stripes	= 2,
		.devs_max	= 1,
		.devs_min	= 1,
		.tolerated_failures = 0,
		.devs_increment	= 1,
		.ncopies	= 2,
		.nparity	= 0,
		.raid_name	= "dup",
		.bg_flag	= BTRFS_BLOCK_GROUP_DUP,
		.mindev_error	= 0,
	},
	[BTRFS_RAID_RAID0] = {
		.sub_stripes	= 1,
		.dev_stripes	= 1,
		.devs_max	= 0,
		.devs_min	= 1,
		.tolerated_failures = 0,
		.devs_increment	= 1,
		.ncopies	= 1,
		.nparity	= 0,
		.raid_name	= "raid0",
		.bg_flag	= BTRFS_BLOCK_GROUP_RAID0,
		.mindev_error	= 0,
	},
	[BTRFS_RAID_SINGLE] = {
		.sub_stripes	= 1,
		.dev_stripes	= 1,
		.devs_max	= 1,
		.devs_min	= 1,
		.tolerated_failures = 0,
		.devs_increment	= 1,
		.ncopies	= 1,
		.nparity	= 0,
		.raid_name	= "single",
		.bg_flag	= 0,
		.mindev_error	= 0,
	},
	[BTRFS_RAID_RAID5] = {
		.sub_stripes	= 1,
		.dev_stripes	= 1,
		.devs_max	= 0,
		.devs_min	= 2,
		.tolerated_failures = 1,
		.devs_increment	= 1,
		.ncopies	= 1,
		.nparity	= 1,
		.raid_name	= "raid5",
		.bg_flag	= BTRFS_BLOCK_GROUP_RAID5,
		.mindev_error	= BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET,
	},
	[BTRFS_RAID_RAID6] = {
		.sub_stripes	= 1,
		.dev_stripes	= 1,
		.devs_max	= 0,
		.devs_min	= 3,
		.tolerated_failures = 2,
		.devs_increment	= 1,
		.ncopies	= 1,
		.nparity	= 2,
		.raid_name	= "raid6",
		.bg_flag	= BTRFS_BLOCK_GROUP_RAID6,
		.mindev_error	= BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
	},
};
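
/*
 * Illustrative sketch, not part of this file: one plausible way to read the
 * ncopies and nparity columns of the table above. The struct and function
 * names (raid_attr_sketch, data_stripes_sketch) are hypothetical, and the
 * formula is an assumption about how the fields combine, namely that parity
 * stripes are subtracted first and the remainder is divided among the copies.
 * It is a standalone userspace example, not the allocator's actual code.
 */
```c
#include <assert.h>

/* Hypothetical, trimmed-down stand-in for struct btrfs_raid_attr. */
struct raid_attr_sketch {
	int ncopies;	/* how many copies of each data stripe are kept */
	int nparity;	/* how many stripes hold parity instead of data */
};

/*
 * Assumed relation: out of num_stripes stripes, nparity carry parity and
 * the rest is shared by ncopies copies of the data.
 */
static int data_stripes_sketch(const struct raid_attr_sketch *attr,
			       int num_stripes)
{
	assert(attr->ncopies > 0);
	assert(num_stripes > attr->nparity);

	return (num_stripes - attr->nparity) / attr->ncopies;
}
```
/*
 * Under this assumption, a raid10 chunk over 4 stripes has 2 data stripes and
 * a raid6 chunk over 5 stripes has 3.
 */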

/*
 * Convert block group flags (BTRFS_BLOCK_GROUP_*) to btrfs_raid_types, which
 * can be used as index to access btrfs_raid_array[].
 */
enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags)
{
	if (flags & BTRFS_BLOCK_GROUP_RAID10)
		return BTRFS_RAID_RAID10;
	else if (flags & BTRFS_BLOCK_GROUP_RAID1)
		return BTRFS_RAID_RAID1;
	else if (flags & BTRFS_BLOCK_GROUP_RAID1C3)
		return BTRFS_RAID_RAID1C3;
	else if (flags & BTRFS_BLOCK_GROUP_RAID1C4)
		return BTRFS_RAID_RAID1C4;
	else if (flags & BTRFS_BLOCK_GROUP_DUP)
		return BTRFS_RAID_DUP;
	else if (flags & BTRFS_BLOCK_GROUP_RAID0)
		return BTRFS_RAID_RAID0;
	else if (flags & BTRFS_BLOCK_GROUP_RAID5)
		return BTRFS_RAID_RAID5;
	else if (flags & BTRFS_BLOCK_GROUP_RAID6)
		return BTRFS_RAID_RAID6;

	return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
}
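
/*
 * Illustrative sketch, not part of this file: a standalone userspace version
 * of the flag-to-index lookup above, reduced to a few profiles. All SK_* bit
 * values and the raid_index_sketch enum are hypothetical placeholders chosen
 * for the example; they are not claimed to match the kernel's definitions.
 * The point it shows is that block group type bits (data, metadata) are
 * ignored and that the absence of any profile bit means "single".
 */
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
#define SK_DATA		(1ULL << 0)
#define SK_METADATA	(1ULL << 2)
#define SK_RAID0	(1ULL << 3)
#define SK_RAID1	(1ULL << 4)
#define SK_RAID10	(1ULL << 6)

enum raid_index_sketch {
	IDX_RAID10,
	IDX_RAID1,
	IDX_RAID0,
	IDX_SINGLE,
	IDX_NR_TYPES,
};

/* First matching profile bit wins; no profile bit falls through to single. */
static enum raid_index_sketch flags_to_index_sketch(uint64_t flags)
{
	if (flags & SK_RAID10)
		return IDX_RAID10;
	else if (flags & SK_RAID1)
		return IDX_RAID1;
	else if (flags & SK_RAID0)
		return IDX_RAID0;

	return IDX_SINGLE;
}
```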

const char *btrfs_bg_type_to_raid_name(u64 flags)
{
	const int index = btrfs_bg_flags_to_raid_index(flags);

	if (index >= BTRFS_NR_RAID_TYPES)
		return NULL;

	return btrfs_raid_array[index].raid_name;
}

/*
 * Fill @buf with textual description of @bg_flags, no more than @size_buf
 * bytes including terminating null byte.
 */
void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
{
	int i;
	int ret;
	char *bp = buf;
	u64 flags = bg_flags;
	u32 size_bp = size_buf;

	if (!flags) {
		strcpy(bp, "NONE");
		return;
	}

#define DESCRIBE_FLAG(flag, desc)						\
	do {								\
		if (flags & (flag)) {					\
			ret = snprintf(bp, size_bp, "%s|", (desc));	\
			if (ret < 0 || ret >= size_bp)			\
				goto out_overflow;			\
			size_bp -= ret;					\
			bp += ret;					\
			flags &= ~(flag);				\
		}							\
	} while (0)

	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");

	DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
		DESCRIBE_FLAG(btrfs_raid_array[i].bg_flag,
			      btrfs_raid_array[i].raid_name);
#undef DESCRIBE_FLAG

	if (flags) {
		ret = snprintf(bp, size_bp, "0x%llx|", flags);
		size_bp -= ret;
	}

	if (size_bp < size_buf)
		buf[size_buf - size_bp - 1] = '\0'; /* remove last | */

	/*
	 * The text is trimmed, it's up to the caller to provide sufficiently
	 * large buffer
	 */
out_overflow:;
}
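
/*
 * Illustrative sketch, not part of this file: the same "append name plus '|',
 * then drop the trailing separator" pattern as the function above, written as
 * a standalone userspace function. The names describe_flags_sketch and
 * flag_name_sketch, and the three bit values in the table, are hypothetical
 * and exist only for this example. Unlike the kernel function, the sketch
 * also bounds-checks the "NONE" and leftover-bits cases.
 */
```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct flag_name_sketch {
	uint64_t flag;
	const char *name;
};

static void describe_flags_sketch(uint64_t flags, char *buf, size_t size)
{
	/* Hypothetical flag table, for illustration only. */
	static const struct flag_name_sketch names[] = {
		{ 1ULL << 0, "data" },
		{ 1ULL << 1, "system" },
		{ 1ULL << 2, "metadata" },
	};
	char *bp = buf;
	size_t left = size;
	size_t i;
	int ret;

	assert(buf != NULL);
	if (size == 0)
		return;

	if (!flags) {
		snprintf(buf, size, "NONE");
		return;
	}

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (!(flags & names[i].flag))
			continue;
		ret = snprintf(bp, left, "%s|", names[i].name);
		if (ret < 0 || (size_t)ret >= left)
			return;	/* text is trimmed to what fits */
		bp += ret;
		left -= (size_t)ret;
		flags &= ~names[i].flag;
	}

	/* Any bits without a name are printed as one hex number. */
	if (flags) {
		ret = snprintf(bp, left, "0x%llx|", (unsigned long long)flags);
		if (ret < 0 || (size_t)ret >= left)
			return;
		bp += ret;
	}

	if (bp > buf)
		bp[-1] = '\0';	/* remove the last '|' */
}
```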

static int init_first_rw_device(struct btrfs_trans_handle *trans);
static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info);
static void btrfs_dev_stat_print_on_error(struct btrfs_device *dev);
static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
			     enum btrfs_map_op op,
			     u64 logical, u64 *length,
			     struct btrfs_io_context **bioc_ret,
			     int mirror_num, int need_raid_map);

/*
 * Device locking
 * ==============
 *
 * There are several mutexes that protect manipulation of devices and low-level
 * structures like chunks but not block groups, extents or files
 *
 * uuid_mutex (global lock)
 * ------------------------
 * protects the fs_uuids list that tracks all per-fs fs_devices, resulting from
 * the SCAN_DEV ioctl registration or from mount either implicitly (the first
 * device) or requested by the device= mount option
 *
 * the mutex can be very coarse and can cover long-running operations
 *
 * protects: updates to fs_devices counters like missing devices, rw devices,
 * seeding, structure cloning, opening/closing devices at mount/umount time
 *
 * global::fs_devs - add, remove, updates to the global list
 *
 * does not protect: manipulation of the fs_devices::devices list in general
 * but in mount context it could be used to exclude list modifications by eg.
 * scan ioctl
 *
 * btrfs_device::name - renames (write side), read is RCU
 *
 * fs_devices::device_list_mutex (per-fs, with RCU)
 * ------------------------------------------------
 * protects updates to fs_devices::devices, ie. adding and deleting
 *
 * simple list traversal with read-only actions can be done with RCU protection
 *
 * may be used to exclude some operations from running concurrently without any
 * modifications to the list (see write_all_supers)
 *
* Is not required at mount and close times , because our device list is
* protected by the uuid_mutex at that point .
*
2017-06-16 22:30:00 +02:00
* balance_mutex
* - - - - - - - - - - - - -
* protects balance structures ( status , state ) and context accessed from
* several places ( internally , ioctl )
*
* chunk_mutex
* - - - - - - - - - - -
* protects chunks , adding or removing during allocation , trim or when a new
2019-05-09 18:11:11 +03:00
* device is added / removed . Additionally it also protects post_commit_list of
* individual devices , since they can be added to the transaction ' s
* post_commit_list only with chunk_mutex held .
2017-06-16 22:30:00 +02:00
*
* cleaner_mutex
* - - - - - - - - - - - - -
* a big lock that is held by the cleaner thread and prevents running subvolume
* cleaning together with relocation or delayed iputs
*
*
* Lock nesting
* = = = = = = = = = = = =
*
* uuid_mutex
2020-05-14 01:42:45 +08:00
* device_list_mutex
* chunk_mutex
* balance_mutex
2018-04-18 14:59:25 +08:00
*
*
2020-08-25 10:02:32 -05:00
* Exclusive operations
* = = = = = = = = = = = = = = = = = = = =
2018-04-18 14:59:25 +08:00
*
* Maintains the exclusivity of the following operations that apply to the
* whole filesystem and cannot run in parallel .
*
* - Balance ( * )
* - Device add
* - Device remove
* - Device replace ( * )
* - Resize
*
* The device operations ( as above ) can be in one of the following states :
*
* - Running state
* - Paused state
* - Completed state
*
* Only device operations marked with ( * ) can go into the Paused state for the
* following reasons :
*
* - ioctl ( only Balance can be Paused through ioctl )
* - filesystem remounted as read - only
* - filesystem unmounted and mounted as read - only
* - system power - cycle and filesystem mounted as read - only
* - filesystem or device errors leading to forced read - only
*
2020-08-25 10:02:32 -05:00
* The status of exclusive operation is set and cleared atomically .
* During the course of Paused state , fs_info : : exclusive_operation remains set .
2018-04-18 14:59:25 +08:00
* A device operation in Paused or Running state can be canceled or resumed
* either by ioctl ( Balance only ) or when remounted as read - write .
2020-08-25 10:02:32 -05:00
* The exclusive status is cleared when the device operation is canceled or
2018-04-18 14:59:25 +08:00
* completed .
2017-06-16 22:30:00 +02:00
*/
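The "set and cleared atomically" rule for exclusive operations can be illustrated with a small user-space sketch. It is not the kernel implementation: `struct fs_state`, `enum exclop`, `exclop_start()` and `exclop_finish()` are hypothetical names, and C11 atomics stand in for whatever locking the kernel actually uses around fs_info::exclusive_operation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical operation codes; EXCLOP_NONE means "nothing is running". */
enum exclop {
	EXCLOP_NONE = 0,
	EXCLOP_BALANCE,
	EXCLOP_DEV_ADD,
	EXCLOP_DEV_REMOVE,
	EXCLOP_DEV_REPLACE,
	EXCLOP_RESIZE,
};

/* Stand-in for the per-filesystem state (an assumption, not btrfs_fs_info). */
struct fs_state {
	_Atomic int exclusive_operation;
};

/*
 * Claim the exclusive slot for @op. Succeeds only if no other operation
 * currently owns it; the check and the update happen as one atomic step.
 */
static bool exclop_start(struct fs_state *fs, enum exclop op)
{
	int expected = EXCLOP_NONE;

	assert(op != EXCLOP_NONE);
	return atomic_compare_exchange_strong(&fs->exclusive_operation,
					      &expected, (int)op);
}

/*
 * Release the slot when the operation is canceled or completed. A paused
 * operation would simply not call this, so the slot stays claimed.
 */
static void exclop_finish(struct fs_state *fs)
{
	atomic_store(&fs->exclusive_operation, EXCLOP_NONE);
}
```

A second caller that tries to start, say, a resize while a balance owns the slot gets `false` back and has to retry or report a busy error; that is the behaviour the comment above describes for operations that cannot run in parallel.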
2014-09-03 21:35:43 +08:00
DEFINE_MUTEX ( uuid_mutex ) ;
2008-03-24 15:02:07 -04:00
static LIST_HEAD ( fs_uuids ) ;
2019-10-01 19:57:37 +02:00
struct list_head * __attribute_const__ btrfs_get_fs_uuids ( void )
2015-03-10 06:38:30 +08:00
{
return & fs_uuids ;
}
2008-03-24 15:02:07 -04:00
2017-06-14 02:48:07 +02:00
/*
* alloc_fs_devices - allocate struct btrfs_fs_devices
2018-10-30 16:43:23 +02:00
* @ fsid : if not NULL , copy the UUID to fs_devices : : fsid
* @ metadata_fsid : if not NULL , copy the UUID to fs_devices : : metadata_fsid
2017-06-14 02:48:07 +02:00
*
* Return a pointer to a new struct btrfs_fs_devices on success , or ERR_PTR ( ) .
* The returned struct is not linked onto any lists and can be destroyed with
* kfree ( ) right away .
*/
2018-10-30 16:43:23 +02:00
static struct btrfs_fs_devices * alloc_fs_devices ( const u8 * fsid ,
const u8 * metadata_fsid )
2013-08-12 14:33:03 +03:00
{
struct btrfs_fs_devices * fs_devs ;
2016-02-11 14:25:38 +01:00
fs_devs = kzalloc ( sizeof ( * fs_devs ) , GFP_KERNEL ) ;
2013-08-12 14:33:03 +03:00
if ( ! fs_devs )
return ERR_PTR ( - ENOMEM ) ;
mutex_init ( & fs_devs - > device_list_mutex ) ;
INIT_LIST_HEAD ( & fs_devs - > devices ) ;
INIT_LIST_HEAD ( & fs_devs - > alloc_list ) ;
2018-04-12 10:29:25 +08:00
INIT_LIST_HEAD ( & fs_devs - > fs_list ) ;
2020-07-16 10:25:33 +03:00
INIT_LIST_HEAD ( & fs_devs - > seed_list ) ;
2013-08-12 14:33:03 +03:00
if ( fsid )
memcpy ( fs_devs - > fsid , fsid , BTRFS_FSID_SIZE ) ;
2018-10-30 16:43:23 +02:00
if ( metadata_fsid )
memcpy ( fs_devs - > metadata_uuid , metadata_fsid , BTRFS_FSID_SIZE ) ;
else if ( fsid )
memcpy ( fs_devs - > metadata_uuid , fsid , BTRFS_FSID_SIZE ) ;
2013-08-12 14:33:03 +03:00
return fs_devs ;
}
2018-03-20 15:47:33 +01:00
void btrfs_free_device ( struct btrfs_device * device )
2017-10-30 18:10:25 +01:00
{
2019-03-25 14:31:22 +02:00
WARN_ON ( ! list_empty ( & device - > post_commit_list ) ) ;
2017-10-30 18:10:25 +01:00
rcu_string_free ( device - > name ) ;
2019-03-27 14:24:12 +02:00
extent_io_tree_release ( & device - > alloc_state ) ;
2017-10-30 18:10:25 +01:00
bio_put ( device - > flush_bio ) ;
2020-11-10 20:26:07 +09:00
btrfs_destroy_dev_zone_info ( device ) ;
2017-10-30 18:10:25 +01:00
kfree ( device ) ;
}
2008-12-12 10:03:26 -05:00
static void free_fs_devices ( struct btrfs_fs_devices * fs_devices )
{
struct btrfs_device * device ;
WARN_ON ( fs_devices - > opened ) ;
while ( ! list_empty ( & fs_devices - > devices ) ) {
device = list_entry ( fs_devices - > devices . next ,
struct btrfs_device , dev_list ) ;
list_del ( & device - > dev_list ) ;
2018-03-20 15:47:33 +01:00
btrfs_free_device ( device ) ;
2008-12-12 10:03:26 -05:00
}
kfree ( fs_devices ) ;
}
2018-02-19 17:24:15 +01:00
void __exit btrfs_cleanup_fs_uuids ( void )
2008-03-24 15:02:07 -04:00
{
struct btrfs_fs_devices * fs_devices ;
2008-11-17 21:11:30 -05:00
while ( ! list_empty ( & fs_uuids ) ) {
fs_devices = list_entry ( fs_uuids . next ,
2018-04-12 10:29:25 +08:00
struct btrfs_fs_devices , fs_list ) ;
list_del ( & fs_devices - > fs_list ) ;
2008-12-12 10:03:26 -05:00
free_fs_devices ( fs_devices ) ;
2008-03-24 15:02:07 -04:00
}
}
2018-10-30 16:43:23 +02:00
static noinline struct btrfs_fs_devices * find_fsid (
const u8 * fsid , const u8 * metadata_fsid )
2008-03-24 15:02:07 -04:00
{
struct btrfs_fs_devices * fs_devices ;
2018-10-30 16:43:23 +02:00
ASSERT ( fsid ) ;
2018-10-30 16:43:27 +02:00
/* Handle non-split brain cases */
2018-04-12 10:29:25 +08:00
list_for_each_entry ( fs_devices , & fs_uuids , fs_list ) {
2018-10-30 16:43:23 +02:00
if ( metadata_fsid ) {
if ( memcmp ( fsid , fs_devices - > fsid , BTRFS_FSID_SIZE ) = = 0
& & memcmp ( metadata_fsid , fs_devices - > metadata_uuid ,
BTRFS_FSID_SIZE ) = = 0 )
return fs_devices ;
} else {
if ( memcmp ( fsid , fs_devices - > fsid , BTRFS_FSID_SIZE ) = = 0 )
return fs_devices ;
}
2008-03-24 15:02:07 -04:00
}
return NULL ;
}
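The matching rule in find_fsid() can be tried out in user space with a minimal sketch. The table-based `lookup_fsid()` below and its `struct fs_devs` are hypothetical stand-ins for the fs_uuids list walk; only the comparison logic is meant to mirror the function above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FSID_SIZE 16	/* same width as BTRFS_FSID_SIZE */

/* Hypothetical, trimmed-down stand-in for struct btrfs_fs_devices. */
struct fs_devs {
	unsigned char fsid[FSID_SIZE];
	unsigned char metadata_uuid[FSID_SIZE];
};

/*
 * Return the first entry whose fsid matches @fsid. If @metadata_fsid is
 * not NULL, the entry's metadata_uuid has to match it as well.
 */
static struct fs_devs *lookup_fsid(struct fs_devs *tbl, size_t n,
				   const unsigned char *fsid,
				   const unsigned char *metadata_fsid)
{
	assert(fsid);

	for (size_t i = 0; i < n; i++) {
		if (memcmp(fsid, tbl[i].fsid, FSID_SIZE) != 0)
			continue;
		if (metadata_fsid &&
		    memcmp(metadata_fsid, tbl[i].metadata_uuid, FSID_SIZE) != 0)
			continue;
		return &tbl[i];
	}
	return NULL;
}
```

Passing NULL for the metadata UUID relaxes the lookup to an fsid-only match, which is how callers that scan a device without the METADATA_UUID feature use find_fsid().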
2020-01-10 14:11:33 +02:00
static struct btrfs_fs_devices * find_fsid_with_metadata_uuid (
struct btrfs_super_block * disk_super )
{
struct btrfs_fs_devices * fs_devices ;
/*
* Handle scanned device having completed its fsid change but
* belonging to a fs_devices that was created by first scanning
* a device which didn ' t have its fsid / metadata_uuid changed
* at all and the CHANGING_FSID_V2 flag set .
*/
list_for_each_entry ( fs_devices , & fs_uuids , fs_list ) {
if ( fs_devices - > fsid_change & &
memcmp ( disk_super - > metadata_uuid , fs_devices - > fsid ,
BTRFS_FSID_SIZE ) = = 0 & &
memcmp ( fs_devices - > fsid , fs_devices - > metadata_uuid ,
BTRFS_FSID_SIZE ) = = 0 ) {
return fs_devices ;
}
}
/*
* Handle scanned device having completed its fsid change but
* belonging to a fs_devices that was created by a device that
* has an outdated pair of fsid / metadata_uuid and
* CHANGING_FSID_V2 flag set .
*/
list_for_each_entry ( fs_devices , & fs_uuids , fs_list ) {
if ( fs_devices - > fsid_change & &
memcmp ( fs_devices - > metadata_uuid ,
fs_devices - > fsid , BTRFS_FSID_SIZE ) ! = 0 & &
memcmp ( disk_super - > metadata_uuid , fs_devices - > metadata_uuid ,
BTRFS_FSID_SIZE ) = = 0 ) {
return fs_devices ;
}
}
return find_fsid ( disk_super - > fsid , disk_super - > metadata_uuid ) ;
}
2012-11-12 14:03:45 +01:00
static int
btrfs_get_bdev_and_sb ( const char * device_path , fmode_t flags , void * holder ,
int flush , struct block_device * * bdev ,
2020-02-14 00:24:32 +09:00
struct btrfs_super_block * * disk_super )
2012-11-12 14:03:45 +01:00
{
int ret ;
* bdev = blkdev_get_by_path ( device_path , flags , holder ) ;
if ( IS_ERR ( * bdev ) ) {
ret = PTR_ERR ( * bdev ) ;
goto error ;
}
if ( flush )
filemap_write_and_wait ( ( * bdev ) - > bd_inode - > i_mapping ) ;
2017-06-16 01:48:05 +02:00
ret = set_blocksize ( * bdev , BTRFS_BDEV_BLOCKSIZE ) ;
2012-11-12 14:03:45 +01:00
if ( ret ) {
blkdev_put ( * bdev , flags ) ;
goto error ;
}
invalidate_bdev ( * bdev ) ;
2020-02-14 00:24:32 +09:00
* disk_super = btrfs_read_dev_super ( * bdev ) ;
if ( IS_ERR ( * disk_super ) ) {
ret = PTR_ERR ( * disk_super ) ;
2012-11-12 14:03:45 +01:00
blkdev_put ( * bdev , flags ) ;
goto error ;
}
return 0 ;
error :
* bdev = NULL ;
return ret ;
}
2019-01-04 13:31:53 +08:00
static bool device_path_matched ( const char * path , struct btrfs_device * device )
{
int found ;
rcu_read_lock ( ) ;
found = strcmp ( rcu_str_deref ( device - > name ) , path ) ;
rcu_read_unlock ( ) ;
return found = = 0 ;
}
2018-01-18 22:00:37 +08:00
/*
* Search and remove all stale devices ( devices which are not mounted ) .
* When both inputs are NULL , it will search and release all stale devices .
* path : Optional . When provided , it will release only the unmounted devices
* matching this path .
* skip_dev : Optional . Will skip this device when searching for the stale
* devices .
2019-01-04 13:31:53 +08:00
* Return : 0 for success or if @ path is NULL .
* - EBUSY if @ path is a mounted device .
* - ENOENT if @ path does not match any device in the list .
2018-01-18 22:00:37 +08:00
*/
2019-01-04 13:31:53 +08:00
static int btrfs_free_stale_devices ( const char * path ,
2018-05-29 15:33:08 +08:00
struct btrfs_device * skip_device )
2015-06-17 21:10:48 +08:00
{
2018-05-29 15:33:08 +08:00
struct btrfs_fs_devices * fs_devices , * tmp_fs_devices ;
struct btrfs_device * device , * tmp_device ;
2019-01-04 13:31:53 +08:00
int ret = 0 ;
2021-08-31 09:21:28 +08:00
lockdep_assert_held ( & uuid_mutex ) ;
2019-01-04 13:31:53 +08:00
if ( path )
ret = - ENOENT ;
2015-06-17 21:10:48 +08:00
2018-05-29 15:33:08 +08:00
list_for_each_entry_safe ( fs_devices , tmp_fs_devices , & fs_uuids , fs_list ) {
2015-06-17 21:10:48 +08:00
2019-01-04 13:31:53 +08:00
mutex_lock ( & fs_devices - > device_list_mutex ) ;
2018-05-29 15:33:08 +08:00
list_for_each_entry_safe ( device , tmp_device ,
& fs_devices - > devices , dev_list ) {
if ( skip_device & & skip_device = = device )
2018-01-18 22:00:37 +08:00
continue ;
2018-05-29 15:33:08 +08:00
if ( path & & ! device - > name )
2015-06-17 21:10:48 +08:00
continue ;
2019-01-04 13:31:53 +08:00
if ( path & & ! device_path_matched ( path , device ) )
2018-01-18 22:00:34 +08:00
continue ;
2019-01-04 13:31:53 +08:00
if ( fs_devices - > opened ) {
/* for an already deleted device return 0 */
if ( path & & ret ! = 0 )
ret = - EBUSY ;
break ;
}
2015-06-17 21:10:48 +08:00
/* delete the stale device */
2018-05-29 17:23:20 +08:00
fs_devices - > num_devices - - ;
list_del ( & device - > dev_list ) ;
btrfs_free_device ( device ) ;
2019-01-04 13:31:53 +08:00
ret = 0 ;
2018-05-29 17:23:20 +08:00
}
mutex_unlock ( & fs_devices - > device_list_mutex ) ;
2019-01-04 13:31:53 +08:00
2018-05-29 17:23:20 +08:00
if ( fs_devices - > num_devices = = 0 ) {
btrfs_sysfs_remove_fsid ( fs_devices ) ;
list_del ( & fs_devices - > fs_list ) ;
free_fs_devices ( fs_devices ) ;
2015-06-17 21:10:48 +08:00
}
}
2019-01-04 13:31:53 +08:00
return ret ;
2015-06-17 21:10:48 +08:00
}
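The return-value contract of btrfs_free_stale_devices() (0, -EBUSY, -ENOENT) is easy to misread, so here is a simplified user-space model of it. Everything in it is an assumption made for illustration: `struct dev`, `free_stale()` and the flat array replace the nested fs_uuids/devices lists, "mounted" is tracked per device instead of per fs_devices, and freeing is only recorded in a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flattened view of one scanned device. */
struct dev {
	const char *path;	/* may be NULL, like device->name */
	bool mounted;		/* models fs_devices->opened */
	bool freed;		/* set instead of actually freeing */
};

/*
 * With @path == NULL: release every unmounted device and return 0.
 * With @path set: 0 if a matching unmounted device was released,
 * -EBUSY if the only match is mounted, -ENOENT if nothing matched.
 */
static int free_stale(struct dev *devs, size_t n, const char *path)
{
	int ret = path ? -ENOENT : 0;

	assert(devs || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (devs[i].freed)
			continue;
		if (path && (!devs[i].path || strcmp(devs[i].path, path) != 0))
			continue;
		if (devs[i].mounted) {
			/* Keep an earlier success; otherwise report busy. */
			if (path && ret != 0)
				ret = -EBUSY;
			continue;
		}
		devs[i].freed = true;
		ret = 0;
	}
	return ret;
}
```

The `ret != 0` check mirrors the "for an already deleted device return 0" comment in the kernel function: once one matching stale entry has been released, a later mounted match does not turn the result into -EBUSY.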
btrfs: open device without device_list_mutex
A lockdep splat has long existed because we open our bdevs under
the ->device_list_mutex at mount time, which acquires the bd_mutex.
Usually this goes unnoticed, but as soon as loopback devices are involved
the bd_mutex comes with a whole host of other dependencies,
which results in the splat when you mount a btrfs file system.
======================================================
WARNING: possible circular locking dependency detected
5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
------------------------------------------------------
systemd-journal/509 is trying to acquire lock:
ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
but task is already holding lock:
ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sb_pagefaults){.+.+}-{0:0}:
__sb_start_write+0x13e/0x220
btrfs_page_mkwrite+0x59/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
-> #5 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x60/0x80
_copy_from_user+0x20/0xb0
get_sg_io_hdr+0x9a/0xb0
scsi_cmd_ioctl+0x1ea/0x2f0
cdrom_ioctl+0x3c/0x12b4
sr_block_ioctl+0xa4/0xd0
block_ioctl+0x3f/0x50
ksys_ioctl+0x82/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&cd->lock){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
sr_block_open+0xa2/0x180
__blkdev_get+0xdd/0x550
blkdev_get+0x38/0x150
do_dentry_open+0x16b/0x3e0
path_openat+0x3c9/0xa00
do_filp_open+0x75/0x100
do_sys_openat2+0x8a/0x140
__x64_sys_openat+0x46/0x70
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
__blkdev_get+0x6a/0x550
blkdev_get+0x85/0x150
blkdev_get_by_path+0x2c/0x70
btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
open_fs_devices+0x88/0x240 [btrfs]
btrfs_open_devices+0x92/0xa0 [btrfs]
btrfs_mount_root+0x250/0x490 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x119/0x380 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x8c6/0xca0
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_run_dev_stats+0x36/0x420 [btrfs]
commit_cowonly_roots+0x91/0x2d0 [btrfs]
btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
__mutex_lock+0x7b/0x820
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
other info that might help us debug this:
Chain exists of:
&fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sb_pagefaults);
lock(&mm->mmap_lock#2);
lock(sb_pagefaults);
lock(&fs_info->reloc_mutex);
*** DEADLOCK ***
3 locks held by systemd-journal/509:
#0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
#1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
#2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
stack backtrace:
CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x92/0xc8
check_noncircular+0x134/0x150
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
__mutex_lock+0x7b/0x820
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0xc/0xb0
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
? sched_clock+0x5/0x10
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7fa3972fdbfe
Code: Bad RIP value.
Fix this by not holding the ->device_list_mutex at this point. The
device_list_mutex exists to protect us from modifying the device list
while the file system is running.
However it can also be modified by doing a scan on a device. But this
action is specifically protected by the uuid_mutex, which we are holding
here. We cannot race with opening at this point because we have the
->s_mount lock held during the mount. Not having the
->device_list_mutex here is perfectly safe as we're not going to change
the devices at this point.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-17 15:12:27 -04:00
/*
* This is only used on mount , and we are protected from competing things
* messing with our fs_devices by the uuid_mutex , thus we do not need the
* fs_devices - > device_list_mutex here .
*/
2017-11-09 23:45:24 +08:00
static int btrfs_open_one_device ( struct btrfs_fs_devices * fs_devices ,
struct btrfs_device * device , fmode_t flags ,
void * holder )
{
struct request_queue * q ;
struct block_device * bdev ;
struct btrfs_super_block * disk_super ;
u64 devid ;
int ret ;
if ( device - > bdev )
return - EINVAL ;
if ( ! device - > name )
return - EINVAL ;
ret = btrfs_get_bdev_and_sb ( device - > name - > str , flags , holder , 1 ,
2020-02-14 00:24:32 +09:00
& bdev , & disk_super ) ;
2017-11-09 23:45:24 +08:00
if ( ret )
return ret ;
devid = btrfs_stack_device_id ( & disk_super - > dev_item ) ;
if ( devid ! = device - > devid )
2020-02-14 00:24:32 +09:00
goto error_free_page ;
2017-11-09 23:45:24 +08:00
if ( memcmp ( device - > uuid , disk_super - > dev_item . uuid , BTRFS_UUID_SIZE ) )
2020-02-14 00:24:32 +09:00
goto error_free_page ;
2017-11-09 23:45:24 +08:00
device - > generation = btrfs_super_generation ( disk_super ) ;
if ( btrfs_super_flags ( disk_super ) & BTRFS_SUPER_FLAG_SEEDING ) {
2018-10-30 16:43:23 +02:00
if ( btrfs_super_incompat_flags ( disk_super ) &
BTRFS_FEATURE_INCOMPAT_METADATA_UUID ) {
pr_err (
" BTRFS: Invalid seeding and uuid-changed device detected \n " ) ;
2020-02-14 00:24:32 +09:00
goto error_free_page ;
2018-10-30 16:43:23 +02:00
}
2017-12-04 12:54:52 +08:00
clear_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) ;
2019-11-13 11:27:27 +01:00
fs_devices - > seeding = true ;
2017-11-09 23:45:24 +08:00
} else {
2017-12-04 12:54:52 +08:00
if ( bdev_read_only ( bdev ) )
clear_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) ;
else
set_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) ;
2017-11-09 23:45:24 +08:00
}
q = bdev_get_queue ( bdev ) ;
if ( ! blk_queue_nonrot ( q ) )
2019-11-13 11:27:28 +01:00
fs_devices - > rotating = true ;
2017-11-09 23:45:24 +08:00
device - > bdev = bdev ;
2017-12-04 12:54:53 +08:00
clear_bit ( BTRFS_DEV_STATE_IN_FS_METADATA , & device - > dev_state ) ;
2017-11-09 23:45:24 +08:00
device - > mode = flags ;
fs_devices - > open_devices + + ;
2017-12-04 12:54:52 +08:00
if ( test_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) & &
device - > devid ! = BTRFS_DEV_REPLACE_DEVID ) {
2017-11-09 23:45:24 +08:00
fs_devices - > rw_devices + + ;
2018-01-22 14:49:37 -08:00
list_add_tail ( & device - > dev_alloc_list , & fs_devices - > alloc_list ) ;
2017-11-09 23:45:24 +08:00
}
2020-02-14 00:24:32 +09:00
btrfs_release_disk_super ( disk_super ) ;
2017-11-09 23:45:24 +08:00
return 0 ;
2020-02-14 00:24:32 +09:00
error_free_page :
btrfs_release_disk_super ( disk_super ) ;
2017-11-09 23:45:24 +08:00
blkdev_put ( bdev , flags ) ;
return - EINVAL ;
}
2018-10-30 16:43:27 +02:00
/*
* Handle scanned device having its CHANGING_FSID_V2 flag set and the fs_devices
2020-01-10 14:11:32 +02:00
* being created with a disk that has already completed its fsid change . Such
* disk can belong to an fs which has its FSID changed or to one which doesn ' t .
* Handle both cases here .
2018-10-30 16:43:27 +02:00
*/
static struct btrfs_fs_devices * find_fsid_inprogress (
struct btrfs_super_block * disk_super )
{
struct btrfs_fs_devices * fs_devices ;
list_for_each_entry ( fs_devices , & fs_uuids , fs_list ) {
if ( memcmp ( fs_devices - > metadata_uuid , fs_devices - > fsid ,
BTRFS_FSID_SIZE ) ! = 0 & &
memcmp ( fs_devices - > metadata_uuid , disk_super - > fsid ,
BTRFS_FSID_SIZE ) = = 0 & & ! fs_devices - > fsid_change ) {
return fs_devices ;
}
}
2020-01-10 14:11:32 +02:00
return find_fsid ( disk_super - > fsid , NULL ) ;
2018-10-30 16:43:27 +02:00
}
2018-10-30 16:43:28 +02:00
static struct btrfs_fs_devices * find_fsid_changed (
struct btrfs_super_block * disk_super )
{
struct btrfs_fs_devices * fs_devices ;
/*
* Handles the case where scanned device is part of an fs that had
2021-05-21 17:42:23 +02:00
* multiple successful changes of FSID but currently device didn ' t
2020-01-10 14:11:34 +02:00
* observe it . Meaning our fsid will be different than theirs . We need
* to handle two subcases :
* 1 - The fs still continues to have different METADATA / FSID uuids .
* 2 - The fs is switched back to its original FSID ( METADATA / FSID
* are equal ) .
2018-10-30 16:43:28 +02:00
*/
list_for_each_entry ( fs_devices , & fs_uuids , fs_list ) {
2020-01-10 14:11:34 +02:00
/* Changed UUIDs */
2018-10-30 16:43:28 +02:00
if ( memcmp ( fs_devices - > metadata_uuid , fs_devices - > fsid ,
BTRFS_FSID_SIZE ) ! = 0 & &
memcmp ( fs_devices - > metadata_uuid , disk_super - > metadata_uuid ,
BTRFS_FSID_SIZE ) = = 0 & &
memcmp ( fs_devices - > fsid , disk_super - > fsid ,
2020-01-10 14:11:34 +02:00
BTRFS_FSID_SIZE ) ! = 0 )
return fs_devices ;
/* Unchanged UUIDs */
if ( memcmp ( fs_devices - > metadata_uuid , fs_devices - > fsid ,
BTRFS_FSID_SIZE ) = = 0 & &
memcmp ( fs_devices - > fsid , disk_super - > metadata_uuid ,
BTRFS_FSID_SIZE ) = = 0 )
2018-10-30 16:43:28 +02:00
return fs_devices ;
}
return NULL ;
}
2020-01-10 14:11:35 +02:00
static struct btrfs_fs_devices * find_fsid_reverted_metadata (
struct btrfs_super_block * disk_super )
{
struct btrfs_fs_devices * fs_devices ;
/*
* Handle the case where the scanned device is part of an fs whose last
* metadata UUID change reverted it to the original FSID . At the same
* time , fs_devices was first created by another constituent device
* which didn ' t fully observe the operation . This results in a
* btrfs_fs_devices created with metadata / fsid different AND
* btrfs_fs_devices : : fsid_change set AND the metadata_uuid of the
* fs_devices equal to the FSID of the disk .
*/
list_for_each_entry ( fs_devices , & fs_uuids , fs_list ) {
if ( memcmp ( fs_devices - > fsid , fs_devices - > metadata_uuid ,
BTRFS_FSID_SIZE ) ! = 0 & &
memcmp ( fs_devices - > metadata_uuid , disk_super - > fsid ,
BTRFS_FSID_SIZE ) = = 0 & &
fs_devices - > fsid_change )
return fs_devices ;
}
return NULL ;
}
2014-03-26 18:26:36 +01:00
/*
* Add new device to list of registered devices
*
* Returns :
2018-01-18 22:02:35 +08:00
* device pointer which was just added or updated when successful
* error pointer when failed
2014-03-26 18:26:36 +01:00
*/
2018-01-18 22:02:35 +08:00
static noinline struct btrfs_device * device_list_add ( const char * path ,
2018-05-29 12:28:37 +08:00
struct btrfs_super_block * disk_super ,
bool * new_device_added )
2008-03-24 15:02:07 -04:00
{
struct btrfs_device * device ;
2018-10-30 16:43:27 +02:00
struct btrfs_fs_devices * fs_devices = NULL ;
2012-06-04 14:03:51 -04:00
struct rcu_string * name ;
2008-03-24 15:02:07 -04:00
u64 found_transid = btrfs_super_generation ( disk_super ) ;
2018-01-18 22:02:36 +08:00
u64 devid = btrfs_stack_device_id ( & disk_super - > dev_item ) ;
2018-10-30 16:43:23 +02:00
bool has_metadata_uuid = ( btrfs_super_incompat_flags ( disk_super ) &
BTRFS_FEATURE_INCOMPAT_METADATA_UUID ) ;
2018-10-30 16:43:26 +02:00
bool fsid_change_in_progress = ( btrfs_super_flags ( disk_super ) &
BTRFS_SUPER_FLAG_CHANGING_FSID_V2 ) ;
2018-10-30 16:43:23 +02:00
2018-10-30 16:43:28 +02:00
if ( fsid_change_in_progress ) {
2020-01-10 14:11:32 +02:00
if ( ! has_metadata_uuid )
2018-10-30 16:43:28 +02:00
fs_devices = find_fsid_inprogress ( disk_super ) ;
2020-01-10 14:11:32 +02:00
else
2018-10-30 16:43:28 +02:00
fs_devices = find_fsid_changed ( disk_super ) ;
2018-10-30 16:43:27 +02:00
} else if ( has_metadata_uuid ) {
2020-01-10 14:11:33 +02:00
fs_devices = find_fsid_with_metadata_uuid ( disk_super ) ;
2018-10-30 16:43:27 +02:00
} else {
2020-01-10 14:11:35 +02:00
fs_devices = find_fsid_reverted_metadata ( disk_super ) ;
if ( ! fs_devices )
fs_devices = find_fsid ( disk_super - > fsid , NULL ) ;
2018-10-30 16:43:27 +02:00
}
2008-03-24 15:02:07 -04:00
if ( ! fs_devices ) {
2018-10-30 16:43:23 +02:00
if ( has_metadata_uuid )
fs_devices = alloc_fs_devices ( disk_super - > fsid ,
disk_super - > metadata_uuid ) ;
else
fs_devices = alloc_fs_devices ( disk_super - > fsid , NULL ) ;
2013-08-12 14:33:03 +03:00
if ( IS_ERR ( fs_devices ) )
2018-01-18 22:02:35 +08:00
return ERR_CAST ( fs_devices ) ;
2013-08-12 14:33:03 +03:00
2019-01-27 04:58:00 +00:00
fs_devices - > fsid_change = fsid_change_in_progress ;
2018-05-29 14:10:20 +08:00
mutex_lock ( & fs_devices - > device_list_mutex ) ;
2018-04-12 10:29:25 +08:00
list_add ( & fs_devices - > fs_list , & fs_uuids ) ;
2013-08-12 14:33:03 +03:00
2008-03-24 15:02:07 -04:00
device = NULL ;
} else {
2021-10-05 16:12:42 -04:00
struct btrfs_dev_lookup_args args = {
. devid = devid ,
. uuid = disk_super - > dev_item . uuid ,
} ;
2018-05-29 14:10:20 +08:00
mutex_lock ( & fs_devices - > device_list_mutex ) ;
2021-10-05 16:12:42 -04:00
device = btrfs_find_device ( fs_devices , & args ) ;
2018-10-30 16:43:27 +02:00
/*
* If this disk has been pulled into an fs devices created by
* a device which had the CHANGING_FSID_V2 flag then replace the
* metadata_uuid / fsid values of the fs_devices .
*/
2020-01-10 14:11:35 +02:00
if ( fs_devices - > fsid_change & &
2018-10-30 16:43:27 +02:00
found_transid > fs_devices - > latest_generation ) {
memcpy ( fs_devices - > fsid , disk_super - > fsid ,
BTRFS_FSID_SIZE ) ;
2020-01-10 14:11:35 +02:00
if ( has_metadata_uuid )
memcpy ( fs_devices - > metadata_uuid ,
disk_super - > metadata_uuid ,
BTRFS_FSID_SIZE ) ;
else
memcpy ( fs_devices - > metadata_uuid ,
disk_super - > fsid , BTRFS_FSID_SIZE ) ;
2018-10-30 16:43:27 +02:00
fs_devices - > fsid_change = false ;
}
2008-03-24 15:02:07 -04:00
}
2014-07-24 11:37:15 +08:00
2008-03-24 15:02:07 -04:00
if ( ! device ) {
2018-05-29 14:10:20 +08:00
if ( fs_devices - > opened ) {
mutex_unlock ( & fs_devices - > device_list_mutex ) ;
2018-01-18 22:02:35 +08:00
return ERR_PTR ( - EBUSY ) ;
2018-05-29 14:10:20 +08:00
}
2008-11-17 21:11:30 -05:00
2013-08-23 13:20:17 +03:00
device = btrfs_alloc_device ( NULL , & devid ,
disk_super - > dev_item . uuid ) ;
if ( IS_ERR ( device ) ) {
2018-05-29 14:10:20 +08:00
mutex_unlock ( & fs_devices - > device_list_mutex ) ;
2008-03-24 15:02:07 -04:00
/* we can safely leave the fs_devices entry around */
2018-01-18 22:02:35 +08:00
return device ;
2008-03-24 15:02:07 -04:00
}
2012-06-04 14:03:51 -04:00
name = rcu_string_strdup ( path , GFP_NOFS ) ;
if ( ! name ) {
2018-03-20 15:47:33 +01:00
btrfs_free_device ( device ) ;
2018-05-29 14:10:20 +08:00
mutex_unlock ( & fs_devices - > device_list_mutex ) ;
2018-01-18 22:02:35 +08:00
return ERR_PTR ( - ENOMEM ) ;
2008-03-24 15:02:07 -04:00
}
2012-06-04 14:03:51 -04:00
rcu_assign_pointer ( device - > name , name ) ;
2011-05-23 14:30:00 +02:00
2011-04-20 10:09:16 +00:00
list_add_rcu ( & device - > dev_list , & fs_devices - > devices ) ;
Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl
The handler for the ioctl BTRFS_IOC_FS_INFO was reading the
number of devices before acquiring the device list mutex.
This could lead to inconsistent results because the device list
and the number of devices counter (amongst other counters related
to the device list) are updated in volumes.c while holding the
device list mutex - except for 2 places, one
was volumes.c:btrfs_prepare_sprout() and the other was
volumes.c:device_list_add().
For example, if we have 2 devices, with IDs 1 and 2 and then add
a new device, with ID 3, and while adding the device is in progress
a BTRFS_IOC_FS_INFO ioctl arrives, it could return a number of
devices of 2 and a max dev id of 3. This would be incorrect.
Also, this ioctl handler was reading the fsid while it can be
updated concurrently. This can happen while a new device is
being added and the current filesystem is in seeding mode.
Example:
$ mkfs.btrfs -f /dev/sdb1
$ mkfs.btrfs -f /dev/sdb2
$ btrfstune -S 1 /dev/sdb1
$ mount /dev/sdb1 /mnt/test
$ btrfs device add /dev/sdb2 /mnt/test
If during the last step a BTRFS_IOC_FS_INFO ioctl was requested, it
could read an fsid that was never valid (some bits part of the old
fsid and others part of the new fsid). Also, it could read a number
of devices that doesn't match the number of devices in the list and
the max device id, as explained before.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-12 20:56:58 +01:00
fs_devices - > num_devices + + ;
2009-06-10 15:17:02 -04:00
2008-11-17 21:11:30 -05:00
device - > fs_devices = fs_devices ;
2018-05-29 12:28:37 +08:00
* new_device_added = true ;
2018-01-18 22:02:33 +08:00
if ( disk_super - > label [ 0 ] )
2019-10-02 18:30:48 +08:00
pr_info (
" BTRFS: device label %s devid %llu transid %llu %s scanned by %s (%d) \n " ,
disk_super - > label , devid , found_transid , path ,
current - > comm , task_pid_nr ( current ) ) ;
2018-01-18 22:02:33 +08:00
else
2019-10-02 18:30:48 +08:00
pr_info (
" BTRFS: device fsid %pU devid %llu transid %llu %s scanned by %s (%d) \n " ,
disk_super - > fsid , devid , found_transid , path ,
current - > comm , task_pid_nr ( current ) ) ;
2018-01-18 22:02:33 +08:00
2012-06-04 14:03:51 -04:00
} else if ( ! device - > name | | strcmp ( device - > name - > str , path ) ) {
2014-07-03 18:22:05 +08:00
/*
* When FS is already mounted .
* 1. If you are here and if the device - > name is NULL that
* means this device was missing at time of FS mount .
* 2. If you are here and if the device - > name is different
* from ' path ' that means either
* a . The same device disappeared and reappeared with
* different name . or
* b . The missing - disk - which - was - replaced , has
* reappeared now .
*
* We must allow 1 and 2 a above . But 2 b would be spurious
* and unintentional .
*
* Further in case of 1 and 2 a above , the disk at ' path '
* would have missed some transaction when it was away and
* in case of 2 a the stale bdev has to be updated as well .
* 2 b must not be allowed at any time .
*/
/*
2014-09-18 07:49:05 -07:00
* For now , we do allow update to btrfs_fs_device through the
* btrfs dev scan cli after FS has been mounted . We ' re still
* tracking a problem where systems fail mount by subvolume id
* when we reject replacement on a mounted FS .
2014-07-03 18:22:05 +08:00
*/
2014-09-18 07:49:05 -07:00
if ( ! fs_devices - > opened & & found_transid < device - > generation ) {
2014-07-03 18:22:06 +08:00
/*
* That is if the FS is _not_ mounted and if you
* are here , that means there is more than one
* disk with same uuid and devid . We keep the one
* with larger generation number or the last - in if
* generations are equal .
*/
2018-05-29 14:10:20 +08:00
mutex_unlock ( & fs_devices - > device_list_mutex ) ;
2018-01-18 22:02:35 +08:00
return ERR_PTR ( - EEXIST ) ;
2014-07-03 18:22:06 +08:00
}
2014-07-03 18:22:05 +08:00
2018-10-15 10:45:17 +08:00
/*
* We are going to replace the device path for a given devid ,
* make sure it ' s the same device if the device is mounted
*/
if ( device - > bdev ) {
2020-11-23 13:38:40 +01:00
int error ;
dev_t path_dev ;
2018-10-15 10:45:17 +08:00
2020-11-23 13:38:40 +01:00
error = lookup_bdev ( path , & path_dev ) ;
if ( error ) {
2018-10-15 10:45:17 +08:00
mutex_unlock ( & fs_devices - > device_list_mutex ) ;
2020-11-23 13:38:40 +01:00
return ERR_PTR ( error ) ;
2018-10-15 10:45:17 +08:00
}
2020-11-23 13:38:40 +01:00
if ( device - > bdev - > bd_dev ! = path_dev ) {
2018-10-15 10:45:17 +08:00
mutex_unlock ( & fs_devices - > device_list_mutex ) ;
2020-11-18 18:03:26 +09:00
/*
* device - > fs_info may not be reliable here , so
* pass in a NULL instead . This avoids a
* possible use - after - free when the fs_info and
* fs_info - > sb are already torn down .
*/
btrfs_warn_in_rcu ( NULL ,
2020-09-03 21:30:12 +08:00
" duplicate device %s devid %llu generation %llu scanned by %s (%d) " ,
path , devid , found_transid ,
current - > comm ,
task_pid_nr ( current ) ) ;
2018-10-15 10:45:17 +08:00
return ERR_PTR ( - EEXIST ) ;
}
btrfs_info_in_rcu ( device - > fs_info ,
2020-09-03 21:30:12 +08:00
" devid %llu device path %s changed to %s scanned by %s (%d) " ,
devid , rcu_str_deref ( device - > name ) ,
path , current - > comm ,
task_pid_nr ( current ) ) ;
2018-10-15 10:45:17 +08:00
}
2012-06-04 14:03:51 -04:00
name = rcu_string_strdup ( path , GFP_NOFS ) ;
2018-05-29 14:10:20 +08:00
if ( ! name ) {
mutex_unlock ( & fs_devices - > device_list_mutex ) ;
2018-01-18 22:02:35 +08:00
return ERR_PTR ( - ENOMEM ) ;
2018-05-29 14:10:20 +08:00
}
2012-06-04 14:03:51 -04:00
rcu_string_free ( device - > name ) ;
rcu_assign_pointer ( device - > name , name ) ;
2017-12-04 12:54:54 +08:00
if ( test_bit ( BTRFS_DEV_STATE_MISSING , & device - > dev_state ) ) {
2010-12-13 14:56:23 -05:00
fs_devices - > missing_devices - - ;
2017-12-04 12:54:54 +08:00
clear_bit ( BTRFS_DEV_STATE_MISSING , & device - > dev_state ) ;
2010-12-13 14:56:23 -05:00
}
2008-03-24 15:02:07 -04:00
}
2014-07-03 18:22:06 +08:00
/*
* Unmount does not free the btrfs_device struct but would zero
* generation along with most of the other members . So just update
* it back . We need it to pick the disk with largest generation
* ( as above ) .
*/
2018-10-30 16:43:26 +02:00
if ( ! fs_devices - > opened ) {
2014-07-03 18:22:06 +08:00
device - > generation = found_transid ;
2018-10-30 16:43:26 +02:00
fs_devices - > latest_generation = max_t ( u64 , found_transid ,
fs_devices - > latest_generation ) ;
}
2014-07-03 18:22:06 +08:00
2018-01-18 22:02:34 +08:00
fs_devices - > total_devices = btrfs_super_num_devices ( disk_super ) ;
2018-05-29 14:10:20 +08:00
mutex_unlock ( & fs_devices - > device_list_mutex ) ;
2018-01-18 22:02:35 +08:00
return device ;
2008-03-24 15:02:07 -04:00
}
2008-12-12 10:03:26 -05:00
static struct btrfs_fs_devices * clone_fs_devices ( struct btrfs_fs_devices * orig )
{
struct btrfs_fs_devices * fs_devices ;
struct btrfs_device * device ;
struct btrfs_device * orig_dev ;
2019-08-27 15:40:45 +08:00
int ret = 0 ;
2008-12-12 10:03:26 -05:00
2021-08-31 09:21:28 +08:00
lockdep_assert_held ( & uuid_mutex ) ;
2018-10-30 16:43:23 +02:00
fs_devices = alloc_fs_devices ( orig - > fsid , NULL ) ;
2013-08-12 14:33:03 +03:00
if ( IS_ERR ( fs_devices ) )
return fs_devices ;
2008-12-12 10:03:26 -05:00
2012-06-21 16:03:58 -04:00
fs_devices - > total_devices = orig - > total_devices ;
2008-12-12 10:03:26 -05:00
list_for_each_entry ( orig_dev , & orig - > devices , dev_list ) {
2012-06-04 14:03:51 -04:00
struct rcu_string * name ;
2013-08-23 13:20:17 +03:00
device = btrfs_alloc_device ( NULL , & orig_dev - > devid ,
orig_dev - > uuid ) ;
2019-08-27 15:40:45 +08:00
if ( IS_ERR ( device ) ) {
ret = PTR_ERR ( device ) ;
2008-12-12 10:03:26 -05:00
goto error ;
2019-08-27 15:40:45 +08:00
}
2008-12-12 10:03:26 -05:00
2012-06-04 14:03:51 -04:00
/*
* This is ok to do without rcu read locked because we hold the
* uuid mutex so nothing we touch in here is going to disappear .
*/
2014-06-30 17:12:47 +08:00
if ( orig_dev - > name ) {
2016-02-11 14:25:38 +01:00
name = rcu_string_strdup ( orig_dev - > name - > str ,
GFP_KERNEL ) ;
2014-06-30 17:12:47 +08:00
if ( ! name ) {
2018-03-20 15:47:33 +01:00
btrfs_free_device ( device ) ;
2019-08-27 15:40:45 +08:00
ret = - ENOMEM ;
2014-06-30 17:12:47 +08:00
goto error ;
}
rcu_assign_pointer ( device - > name , name ) ;
2009-09-29 13:51:04 -04:00
}
2008-12-12 10:03:26 -05:00
list_add ( & device - > dev_list , & fs_devices - > devices ) ;
device - > fs_devices = fs_devices ;
fs_devices - > num_devices + + ;
}
return fs_devices ;
error :
free_fs_devices ( fs_devices ) ;
2019-08-27 15:40:45 +08:00
return ERR_PTR ( ret ) ;
2008-12-12 10:03:26 -05:00
}
2020-07-16 10:17:04 +03:00
static void __btrfs_free_extra_devids ( struct btrfs_fs_devices * fs_devices ,
2020-11-06 16:06:33 +08:00
struct btrfs_device * * latest_dev )
2008-05-13 13:46:40 -04:00
{
2009-01-21 10:59:08 -05:00
struct btrfs_device * device , * next ;
2012-02-20 20:53:43 -05:00
2011-04-20 10:08:47 +00:00
/* This is the initialized path; it is safe to release the devices. */
2009-01-21 10:59:08 -05:00
list_for_each_entry_safe ( device , next , & fs_devices - > devices , dev_list ) {
2020-07-16 10:17:04 +03:00
if ( test_bit ( BTRFS_DEV_STATE_IN_FS_METADATA , & device - > dev_state ) ) {
2017-12-04 12:54:55 +08:00
if ( ! test_bit ( BTRFS_DEV_STATE_REPLACE_TGT ,
2020-07-16 10:17:04 +03:00
& device - > dev_state ) & &
2020-05-05 02:58:25 +08:00
! test_bit ( BTRFS_DEV_STATE_MISSING ,
& device - > dev_state ) & &
2020-07-16 10:17:04 +03:00
( ! * latest_dev | |
device - > generation > ( * latest_dev ) - > generation ) ) {
* latest_dev = device ;
2012-02-20 20:53:43 -05:00
}
2008-11-17 21:11:30 -05:00
continue ;
2012-02-20 20:53:43 -05:00
}
2008-11-17 21:11:30 -05:00
2020-10-30 06:53:56 +08:00
/*
* We have already validated the presence of BTRFS_DEV_REPLACE_DEVID ,
* in btrfs_init_dev_replace ( ) so just continue .
*/
if ( device - > devid = = BTRFS_DEV_REPLACE_DEVID )
continue ;
2008-11-17 21:11:30 -05:00
if ( device - > bdev ) {
2010-11-13 11:55:18 +01:00
blkdev_put ( device - > bdev , device - > mode ) ;
2008-11-17 21:11:30 -05:00
device - > bdev = NULL ;
fs_devices - > open_devices - - ;
}
2017-12-04 12:54:52 +08:00
if ( test_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) ) {
2008-11-17 21:11:30 -05:00
list_del_init ( & device - > dev_alloc_list ) ;
2017-12-04 12:54:52 +08:00
clear_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) ;
btrfs: fix rw device counting in __btrfs_free_extra_devids
When removing a writeable device in __btrfs_free_extra_devids, the rw
device count should be decremented.
This error was caught by Syzbot which reported a warning in
close_fs_devices:
WARNING: CPU: 1 PID: 9355 at fs/btrfs/volumes.c:1168 close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
Modules linked in:
CPU: 0 PID: 9355 Comm: syz-executor552 Not tainted 5.13.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
RSP: 0018:ffffc9000333f2f0 EFLAGS: 00010293
RAX: ffffffff8365f5c3 RBX: 0000000000000001 RCX: ffff888029afd4c0
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffff88802846f508 R08: ffffffff8365f525 R09: ffffed100337d128
R10: ffffed100337d128 R11: 0000000000000000 R12: dffffc0000000000
R13: ffff888019be8868 R14: 1ffff1100337d10d R15: 1ffff1100337d10a
FS: 00007f6f53828700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000047c410 CR3: 00000000302a6000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_close_devices+0xc9/0x450 fs/btrfs/volumes.c:1180
open_ctree+0x8e1/0x3968 fs/btrfs/disk-io.c:3693
btrfs_fill_super fs/btrfs/super.c:1382 [inline]
btrfs_mount_root+0xac5/0xc60 fs/btrfs/super.c:1749
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x86/0x270 fs/super.c:1498
fc_mount fs/namespace.c:993 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1023
btrfs_mount+0x3d3/0xb50 fs/btrfs/super.c:1809
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x86/0x270 fs/super.c:1498
do_new_mount fs/namespace.c:2905 [inline]
path_mount+0x196f/0x2be0 fs/namespace.c:3235
do_mount fs/namespace.c:3248 [inline]
__do_sys_mount fs/namespace.c:3456 [inline]
__se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433
do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
The warning fired because fs_devices->rw_devices was not 0 after
closing all devices. Here is the call trace that was observed:
  btrfs_mount_root():
    btrfs_scan_one_device():
      device_list_add();   <---------------- device added
    btrfs_open_devices():
      open_fs_devices():
        btrfs_open_one_device();   <-------- writable device opened,
                                             rw device count ++
    btrfs_fill_super():
      open_ctree():
        btrfs_free_extra_devids():
          __btrfs_free_extra_devids();  <--- writable device removed,
                                             rw device count not decremented
        fail_tree_roots:
          btrfs_close_devices():
            close_fs_devices();   <------- rw device count off by 1
As a note, prior to commit cf89af146b7e ("btrfs: dev-replace: fail
mount if we don't have replace item with target device"), rw_devices
was decremented on removing a writable device in
__btrfs_free_extra_devids only if the BTRFS_DEV_STATE_REPLACE_TGT bit
was not set for the device. However, this check does not need to be
reinstated as it is now redundant and incorrect.
In __btrfs_free_extra_devids, we skip removing the device if it is the
target for replacement. This is done by checking whether device->devid
== BTRFS_DEV_REPLACE_DEVID. Since BTRFS_DEV_STATE_REPLACE_TGT is set
only on the device with devid BTRFS_DEV_REPLACE_DEVID, no devices
should have the BTRFS_DEV_STATE_REPLACE_TGT bit set after the check,
and so it's redundant to test for that bit.
Additionally, following commit 82372bc816d7 ("Btrfs: make
the logic of source device removing more clear"), rw_devices is
incremented whenever a writeable device is added to the alloc
list (including the target device in btrfs_dev_replace_finishing), so
all removals of writable devices from the alloc list should also be
accompanied by a decrement to rw_devices.
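The accounting rule above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: struct model_dev, struct model_fs_devices, model_add_dev() and model_remove_dev() are hypothetical names, and the two booleans merely stand in for the BTRFS_DEV_STATE_WRITEABLE bit and for membership of the alloc list.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are not the kernel structures. */
struct model_dev {
	bool writeable;     /* stands in for BTRFS_DEV_STATE_WRITEABLE */
	bool on_alloc_list; /* stands in for membership of dev_alloc_list */
};

struct model_fs_devices {
	int num_devices;
	int rw_devices;
};

/* Adding a writeable device puts it on the alloc list and counts it. */
void model_add_dev(struct model_fs_devices *fs, struct model_dev *dev,
		   bool writeable)
{
	dev->writeable = writeable;
	dev->on_alloc_list = writeable;
	fs->num_devices++;
	if (writeable)
		fs->rw_devices++;
}

/* Removing a device must undo exactly what adding it did. */
void model_remove_dev(struct model_fs_devices *fs, struct model_dev *dev)
{
	if (dev->writeable) {
		assert(fs->rw_devices > 0);
		dev->on_alloc_list = false;
		dev->writeable = false;
		fs->rw_devices--; /* the decrement that was missing */
	}
	fs->num_devices--;
}
```

In this model, adding one writeable device and then removing it leaves rw_devices at 0, which is the condition the WARN_ON in close_fs_devices checks for.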
Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
CC: stable@vger.kernel.org # 5.10+
Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-27 15:13:03 +08:00
fs_devices - > rw_devices - - ;
2008-11-17 21:11:30 -05:00
}
2008-12-12 10:03:26 -05:00
list_del_init ( & device - > dev_list ) ;
fs_devices - > num_devices - - ;
2018-03-20 15:47:33 +01:00
btrfs_free_device ( device ) ;
2008-05-13 13:46:40 -04:00
}
2008-11-17 21:11:30 -05:00
2020-07-16 10:17:04 +03:00
}
/*
* After we have read the system tree and know devids belonging to this
* filesystem , remove the devices which do not belong there .
*/
2020-11-06 16:06:33 +08:00
void btrfs_free_extra_devids ( struct btrfs_fs_devices * fs_devices )
2020-07-16 10:17:04 +03:00
{
struct btrfs_device * latest_dev = NULL ;
2020-07-16 10:25:33 +03:00
struct btrfs_fs_devices * seed_dev ;
2020-07-16 10:17:04 +03:00
mutex_lock ( & uuid_mutex ) ;
2020-11-06 16:06:33 +08:00
__btrfs_free_extra_devids ( fs_devices , & latest_dev ) ;
2020-07-16 10:25:33 +03:00
list_for_each_entry ( seed_dev , & fs_devices - > seed_list , seed_list )
2020-11-06 16:06:33 +08:00
__btrfs_free_extra_devids ( seed_dev , & latest_dev ) ;
2008-11-17 21:11:30 -05:00
2021-08-24 13:05:19 +08:00
fs_devices - > latest_dev = latest_dev ;
2012-02-20 20:53:43 -05:00
2008-05-13 13:46:40 -04:00
mutex_unlock ( & uuid_mutex ) ;
}
2008-05-13 16:03:06 -04:00
2016-07-22 06:04:53 +08:00
static void btrfs_close_bdev ( struct btrfs_device * device )
{
2017-06-19 16:55:35 +02:00
if ( ! device - > bdev )
return ;
2017-12-04 12:54:52 +08:00
if ( test_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) ) {
2016-07-22 06:04:53 +08:00
sync_blockdev ( device - > bdev ) ;
invalidate_bdev ( device - > bdev ) ;
}
2017-06-19 16:55:35 +02:00
blkdev_put ( device - > bdev , device - > mode ) ;
2016-07-22 06:04:53 +08:00
}
2018-06-29 08:26:05 +03:00
static void btrfs_close_one_device ( struct btrfs_device * device )
2016-06-14 18:55:25 +08:00
{
struct btrfs_fs_devices * fs_devices = device - > fs_devices ;
2017-12-04 12:54:52 +08:00
if ( test_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) & &
2016-06-14 18:55:25 +08:00
device - > devid ! = BTRFS_DEV_REPLACE_DEVID ) {
list_del_init ( & device - > dev_alloc_list ) ;
fs_devices - > rw_devices - - ;
}
btrfs: reset replace target device to allocation state on close
This crash was observed with a failed assertion on device close:
BTRFS: Transaction aborted (error -28)
WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388
R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c
FS: 0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0
Call Trace:
flush_space+0x197/0x2f0 [btrfs]
btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs]
process_one_work+0x262/0x5e0
worker_thread+0x4c/0x320
? process_one_work+0x5e0/0x5e0
kthread+0x144/0x170
? set_kthread_struct+0x40/0x40
ret_from_fork+0x1f/0x30
irq event stamp: 19334989
hardirqs last enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400
hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400
softirqs last enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574
softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140
---[ end trace 45939e308e0dd3c7 ]---
BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left
BTRFS info (device vdd): forced readonly
BTRFS warning (device vdd): failed setting block group ro: -30
BTRFS info (device vdd): suspending dev_replace for unmount
assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3431!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 3982 Comm: umount Tainted: G W 5.14.0-rc5-default+ #1532
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246
RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18
R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003
FS: 00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0
Call Trace:
btrfs_close_one_device.cold+0x11/0x55 [btrfs]
close_fs_devices+0x44/0xb0 [btrfs]
btrfs_close_devices+0x48/0x160 [btrfs]
generic_shutdown_super+0x69/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x2c/0xa0
cleanup_mnt+0x144/0x1b0
task_work_run+0x59/0xa0
exit_to_user_mode_loop+0xe7/0xf0
exit_to_user_mode_prepare+0xaf/0xf0
syscall_exit_to_user_mode+0x19/0x50
do_syscall_64+0x4a/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
This happens when close_ctree is called while a dev_replace hasn't
completed. In close_ctree, we suspend the dev_replace, but keep the
replace target around so that we can resume the dev_replace procedure
when we mount the root again. This is the call trace:
  close_ctree():
    btrfs_dev_replace_suspend_for_unmount();
    btrfs_close_devices():
      close_fs_devices():
        btrfs_close_one_device():
          ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
                 &device->dev_state));
However, since the replace target sticks around, there is a device
with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the
assertion in btrfs_close_one_device.
To fix this, if we come across the replace target device when
closing, we should properly reset it back to allocation state. This
fix also ensures that if a non-target device has a corrupted state and
has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still
catch the error.
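A minimal userspace sketch of this behaviour follows. It is an illustration only: MODEL_REPLACE_DEVID, MODEL_STATE_REPLACE_TGT, struct replace_model_dev and replace_model_close() are hypothetical stand-ins, and the return value plays the role of the kernel's ASSERT() on the pristine state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's devid constant and state bit. */
#define MODEL_REPLACE_DEVID     0ULL
#define MODEL_STATE_REPLACE_TGT (1UL << 0)

struct replace_model_dev {
	uint64_t devid;
	unsigned long dev_state;
};

/*
 * Reset the replace target on close. Only the device carrying the
 * replace devid has its target bit cleared, so a stray bit on any
 * other device is still reported as a non-pristine state.
 * Returns true when the device ends up without the target bit.
 */
bool replace_model_close(struct replace_model_dev *dev)
{
	if (dev->devid == MODEL_REPLACE_DEVID)
		dev->dev_state &= ~MODEL_STATE_REPLACE_TGT;

	return !(dev->dev_state & MODEL_STATE_REPLACE_TGT);
}
```

Keying the reset on the devid rather than on the bit itself is what keeps the check useful: a device with any other devid that has the bit set still fails it.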
Reported-by: David Sterba <dsterba@suse.com>
Fixes: b2a616676839 ("btrfs: fix rw device counting in __btrfs_free_extra_devids")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-21 01:50:40 +08:00
if ( device - > devid = = BTRFS_DEV_REPLACE_DEVID )
clear_bit ( BTRFS_DEV_STATE_REPLACE_TGT , & device - > dev_state ) ;
2017-12-04 12:54:54 +08:00
if ( test_bit ( BTRFS_DEV_STATE_MISSING , & device - > dev_state ) )
2016-06-14 18:55:25 +08:00
fs_devices - > missing_devices - - ;
2018-06-29 08:26:05 +03:00
btrfs_close_bdev ( device ) ;
2019-12-04 14:36:39 +01:00
if ( device - > bdev ) {
2019-11-26 09:40:05 +01:00
fs_devices - > open_devices - - ;
2019-12-04 14:36:39 +01:00
device - > bdev = NULL ;
2016-06-14 18:55:25 +08:00
}
2019-12-04 14:36:39 +01:00
clear_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) ;
2020-11-10 20:26:07 +09:00
btrfs_destroy_dev_zone_info ( device ) ;
2016-06-14 18:55:25 +08:00
2019-12-04 14:36:39 +01:00
device - > fs_info = NULL ;
atomic_set ( & device - > dev_stats_ccnt , 0 ) ;
extent_io_tree_release ( & device - > alloc_state ) ;
2018-06-29 08:26:05 +03:00
btrfs: fix mount failure due to past and transient device flush error
When we get an error flushing one device, during a super block commit, we
record the error in the device structure, in the field 'last_flush_error'.
This is used to later check if we should error out the super block commit,
depending on whether the number of flush errors is greater than or equals
to the maximum tolerated device failures for a raid profile.
However if we get a transient device flush error, unmount the filesystem
and later try to mount it, we can fail the mount because we treat that
past error as critical and consider the device to be missing. Even if it's
very likely that the error will happen again, as it's probably due to a
hardware related problem, there may be cases where the error might not
happen again. One example is during testing, and a test case like the
new generic/648 from fstests always triggers this. The test cases
generic/019 and generic/475 also trigger this scenario, but very
sporadically.
When this happens we get an error like this:
$ mount /dev/sdc /mnt
mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.
$ dmesg
(...)
[12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
[12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
[12918.890853] BTRFS error (device sdc): open_ctree failed
The failure happens because when btrfs_check_rw_degradable() is called at
mount time, or at remount from RO to RW time, it sees a non-zero value in
a device's ->last_flush_error attribute, and therefore considers that the
device is 'missing'.
Fix this by setting a device's ->last_flush_error to zero when we close a
device, making sure the error is not seen on the next mount attempt. We
only need to track flush errors during the current mount, so that we never
commit a super block if such errors happened.
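The effect of the reset can be sketched with a simplified userspace model. The names below (struct flush_model_dev, flush_model_rw_degradable(), flush_model_close()) are hypothetical, and the check is reduced to the one idea from the text: a device with a recorded flush error is counted as missing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-device state relevant here. */
struct flush_model_dev {
	int last_flush_error; /* non-zero after a failed flush */
	bool present;         /* false if the device is really missing */
};

/*
 * Simplified stand-in for the degradable check: count devices that are
 * absent or carry a flush error, and compare against the tolerance.
 */
bool flush_model_rw_degradable(const struct flush_model_dev *devs, size_t n,
			       int max_tolerated)
{
	int missing = 0;

	for (size_t i = 0; i < n; i++) {
		if (!devs[i].present || devs[i].last_flush_error)
			missing++;
	}
	return missing <= max_tolerated;
}

/* Closing a device forgets flush errors recorded during this mount. */
void flush_model_close(struct flush_model_dev *dev)
{
	dev->last_flush_error = 0;
}
```

With a single device and a tolerance of 0, the check fails while the stale error is still recorded and passes once the close path has cleared it.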
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-08 19:05:44 +01:00
/*
* Reset the flush error record . We might have a transient flush error
* in this mount , and if so we aborted the current transaction and set
* the fs to an error state , guaranteeing no super blocks can be further
* committed . However that error might be transient and if we unmount the
* filesystem and mount it again , we should allow the mount to succeed
* ( btrfs_check_rw_degradable ( ) should not fail ) - if after mounting the
* filesystem again we still get flush errors , then we will again abort
* any transaction and set the error state , guaranteeing no commits of
* unsafe super blocks .
*/
device - > last_flush_error = 0 ;
2019-12-04 14:36:39 +01:00
/* Verify the device is back in a pristine state */
ASSERT ( ! test_bit ( BTRFS_DEV_STATE_FLUSH_SENT , & device - > dev_state ) ) ;
ASSERT ( ! test_bit ( BTRFS_DEV_STATE_REPLACE_TGT , & device - > dev_state ) ) ;
ASSERT ( list_empty ( & device - > dev_alloc_list ) ) ;
ASSERT ( list_empty ( & device - > post_commit_list ) ) ;
ASSERT ( atomic_read ( & device - > reada_in_flight ) = = 0 ) ;
2016-06-14 18:55:25 +08:00
}
2020-07-15 13:48:48 +03:00
static void close_fs_devices ( struct btrfs_fs_devices * fs_devices )
2008-03-24 15:02:07 -04:00
{
2015-05-12 19:31:37 -04:00
struct btrfs_device * device , * tmp ;
2008-12-12 10:03:26 -05:00
2020-08-10 11:42:28 -04:00
lockdep_assert_held ( & uuid_mutex ) ;
2008-11-17 21:11:30 -05:00
if ( - - fs_devices - > opened > 0 )
2020-07-15 13:48:48 +03:00
return ;
2008-03-24 15:02:07 -04:00
2020-08-10 11:42:28 -04:00
list_for_each_entry_safe ( device , tmp , & fs_devices - > devices , dev_list )
2018-06-29 08:26:05 +03:00
btrfs_close_one_device ( device ) ;
2011-04-20 10:07:30 +00:00
2008-12-12 10:03:26 -05:00
WARN_ON ( fs_devices - > open_devices ) ;
WARN_ON ( fs_devices - > rw_devices ) ;
2008-11-17 21:11:30 -05:00
fs_devices - > opened = 0 ;
2019-11-13 11:27:27 +01:00
fs_devices - > seeding = false ;
2020-07-15 13:48:49 +03:00
fs_devices - > fs_info = NULL ;
2008-03-24 15:02:07 -04:00
}
2020-07-15 13:48:48 +03:00
void btrfs_close_devices ( struct btrfs_fs_devices * fs_devices )
2008-11-17 21:11:30 -05:00
{
2020-07-16 10:25:33 +03:00
LIST_HEAD ( list ) ;
struct btrfs_fs_devices * tmp ;
2008-11-17 21:11:30 -05:00
mutex_lock ( & uuid_mutex ) ;
2020-07-15 13:48:48 +03:00
close_fs_devices ( fs_devices ) ;
2020-07-16 10:25:33 +03:00
if ( ! fs_devices - > opened )
list_splice_init ( & fs_devices - > seed_list , & list ) ;
2008-12-12 10:03:26 -05:00
2020-07-16 10:25:33 +03:00
list_for_each_entry_safe ( fs_devices , tmp , & list , seed_list ) {
2018-04-12 10:29:27 +08:00
close_fs_devices ( fs_devices ) ;
2020-07-16 10:25:33 +03:00
list_del ( & fs_devices - > seed_list ) ;
2008-12-12 10:03:26 -05:00
free_fs_devices ( fs_devices ) ;
}
2020-08-10 11:42:28 -04:00
mutex_unlock ( & uuid_mutex ) ;
2008-11-17 21:11:30 -05:00
}
2018-04-12 10:29:28 +08:00
static int open_fs_devices ( struct btrfs_fs_devices * fs_devices ,
2008-12-12 10:03:26 -05:00
fmode_t flags , void * holder )
2008-03-24 15:02:07 -04:00
{
struct btrfs_device * device ;
2014-07-24 11:37:15 +08:00
struct btrfs_device * latest_dev = NULL ;
2020-09-30 21:09:52 +08:00
struct btrfs_device * tmp_device ;
2008-03-24 15:02:07 -04:00
2010-11-13 11:55:18 +01:00
flags | = FMODE_EXCL ;
2020-09-30 21:09:52 +08:00
list_for_each_entry_safe ( device , tmp_device , & fs_devices - > devices ,
dev_list ) {
int ret ;
2008-05-13 16:03:06 -04:00
2020-09-30 21:09:52 +08:00
ret = btrfs_open_one_device ( fs_devices , device , flags , holder ) ;
if ( ret = = 0 & &
( ! latest_dev | | device - > generation > latest_dev - > generation ) ) {
2017-11-09 23:45:25 +08:00
latest_dev = device ;
2020-09-30 21:09:52 +08:00
} else if ( ret = = - ENODATA ) {
fs_devices - > num_devices - - ;
list_del ( & device - > dev_list ) ;
btrfs_free_device ( device ) ;
}
2008-03-24 15:02:07 -04:00
}
2020-04-28 23:22:25 +08:00
if ( fs_devices - > open_devices = = 0 )
return - EINVAL ;
2008-11-17 21:11:30 -05:00
fs_devices - > opened = 1 ;
2021-08-24 13:05:19 +08:00
fs_devices - > latest_dev = latest_dev ;
2008-11-17 21:11:30 -05:00
fs_devices - > total_rw_bytes = 0 ;
2020-02-25 12:56:08 +09:00
fs_devices - > chunk_alloc_policy = BTRFS_CHUNK_ALLOC_REGULAR ;
2020-10-28 21:14:46 +08:00
fs_devices - > read_policy = BTRFS_READ_POLICY_PID ;
2020-04-28 23:22:25 +08:00
return 0 ;
2008-11-17 21:11:30 -05:00
}
2021-04-08 11:28:34 -07:00
static int devid_cmp ( void * priv , const struct list_head * a ,
const struct list_head * b )
2018-01-22 14:49:36 -08:00
{
2021-07-26 14:15:26 +02:00
const struct btrfs_device * dev1 , * dev2 ;
2018-01-22 14:49:36 -08:00
dev1 = list_entry ( a , struct btrfs_device , dev_list ) ;
dev2 = list_entry ( b , struct btrfs_device , dev_list ) ;
if ( dev1 - > devid < dev2 - > devid )
return - 1 ;
else if ( dev1 - > devid > dev2 - > devid )
return 1 ;
return 0 ;
}
2008-11-17 21:11:30 -05:00
int btrfs_open_devices ( struct btrfs_fs_devices * fs_devices ,
2008-12-02 06:36:09 -05:00
fmode_t flags , void * holder )
2008-11-17 21:11:30 -05:00
{
int ret ;
2018-06-19 17:09:47 +02:00
lockdep_assert_held ( & uuid_mutex ) ;
btrfs: open device without device_list_mutex
A lockdep splat has long existed because we open our bdevs under
the ->device_list_mutex at mount time, which acquires the bd_mutex.
Usually this goes unnoticed, but if you use loopback devices at all,
the bd_mutex suddenly comes with a whole host of other dependencies,
which results in the splat when you mount a btrfs file system.
======================================================
WARNING: possible circular locking dependency detected
5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
------------------------------------------------------
systemd-journal/509 is trying to acquire lock:
ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
but task is already holding lock:
ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sb_pagefaults){.+.+}-{0:0}:
__sb_start_write+0x13e/0x220
btrfs_page_mkwrite+0x59/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
-> #5 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x60/0x80
_copy_from_user+0x20/0xb0
get_sg_io_hdr+0x9a/0xb0
scsi_cmd_ioctl+0x1ea/0x2f0
cdrom_ioctl+0x3c/0x12b4
sr_block_ioctl+0xa4/0xd0
block_ioctl+0x3f/0x50
ksys_ioctl+0x82/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&cd->lock){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
sr_block_open+0xa2/0x180
__blkdev_get+0xdd/0x550
blkdev_get+0x38/0x150
do_dentry_open+0x16b/0x3e0
path_openat+0x3c9/0xa00
do_filp_open+0x75/0x100
do_sys_openat2+0x8a/0x140
__x64_sys_openat+0x46/0x70
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
__blkdev_get+0x6a/0x550
blkdev_get+0x85/0x150
blkdev_get_by_path+0x2c/0x70
btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
open_fs_devices+0x88/0x240 [btrfs]
btrfs_open_devices+0x92/0xa0 [btrfs]
btrfs_mount_root+0x250/0x490 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x119/0x380 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x8c6/0xca0
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_run_dev_stats+0x36/0x420 [btrfs]
commit_cowonly_roots+0x91/0x2d0 [btrfs]
btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
__mutex_lock+0x7b/0x820
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
other info that might help us debug this:
Chain exists of:
&fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sb_pagefaults);
lock(&mm->mmap_lock#2);
lock(sb_pagefaults);
lock(&fs_info->reloc_mutex);
*** DEADLOCK ***
3 locks held by systemd-journal/509:
#0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
#1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
#2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
stack backtrace:
CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x92/0xc8
check_noncircular+0x134/0x150
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
__mutex_lock+0x7b/0x820
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0xc/0xb0
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
? sched_clock+0x5/0x10
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7fa3972fdbfe
Code: Bad RIP value.
Fix this by not holding the ->device_list_mutex at this point. The
device_list_mutex exists to protect us from modifying the device list
while the file system is running.
However it can also be modified by doing a scan on a device. But this
action is specifically protected by the uuid_mutex, which we are holding
here. We cannot race with opening at this point because we have the
->s_mount lock held during the mount. Not having the
->device_list_mutex here is perfectly safe as we're not going to change
the devices at this point.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-17 15:12:27 -04:00
/*
* The device_list_mutex cannot be taken here in case opening the
2021-05-25 08:12:56 +02:00
* underlying device takes further locks like open_mutex .
2020-07-17 15:12:27 -04:00
*
* We also don't need the lock here as this is called during mount and
* exclusion is provided by uuid_mutex.
*/
2018-06-19 17:09:47 +02:00
2008-11-17 21:11:30 -05:00
if ( fs_devices - > opened ) {
2008-12-12 10:03:26 -05:00
fs_devices - > opened + + ;
ret = 0 ;
2008-11-17 21:11:30 -05:00
} else {
2018-01-22 14:49:36 -08:00
list_sort ( NULL , & fs_devices - > devices , devid_cmp ) ;
2018-04-12 10:29:28 +08:00
ret = open_fs_devices ( fs_devices , flags , holder ) ;
2008-11-17 21:11:30 -05:00
}
2018-04-12 10:29:34 +08:00
2008-03-24 15:02:07 -04:00
return ret ;
}
2020-02-14 00:24:32 +09:00
void btrfs_release_disk_super ( struct btrfs_super_block * super )
2016-02-13 10:01:29 +08:00
{
2020-02-14 00:24:32 +09:00
struct page * page = virt_to_page ( super ) ;
2016-02-13 10:01:29 +08:00
put_page ( page ) ;
}
2020-04-15 15:53:46 +03:00
static struct btrfs_super_block * btrfs_read_disk_super ( struct block_device * bdev ,
btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones. However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today. So, the number of superblock and
copies is limited to be two. Second downside is that we cannot support
devices which have no conventional zones at all.
To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, it reset the first zone.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zones 1024 or zone at 256GB which is minimum, and
next to it
If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-10 20:26:14 +09:00
u64 bytenr , u64 bytenr_orig )
2016-02-13 10:01:29 +08:00
{
2020-04-15 15:53:46 +03:00
struct btrfs_super_block * disk_super ;
struct page * page ;
2016-02-13 10:01:29 +08:00
void * p ;
pgoff_t index ;
/* make sure our super fits in the device */
if ( bytenr + PAGE_SIZE > = i_size_read ( bdev - > bd_inode ) )
2020-04-15 15:53:46 +03:00
return ERR_PTR ( - EINVAL ) ;
2016-02-13 10:01:29 +08:00
/* make sure our super fits in the page */
2020-04-15 15:53:46 +03:00
if ( sizeof ( * disk_super ) > PAGE_SIZE )
return ERR_PTR ( - EINVAL ) ;
2016-02-13 10:01:29 +08:00
/* make sure our super doesn't straddle pages on disk */
index = bytenr > > PAGE_SHIFT ;
2020-04-15 15:53:46 +03:00
if ( ( bytenr + sizeof ( * disk_super ) - 1 ) > > PAGE_SHIFT ! = index )
return ERR_PTR ( - EINVAL ) ;
2016-02-13 10:01:29 +08:00
/* pull in the page with our super */
2020-04-15 15:53:46 +03:00
page = read_cache_page_gfp ( bdev - > bd_inode - > i_mapping , index , GFP_KERNEL ) ;
2016-02-13 10:01:29 +08:00
2020-04-15 15:53:46 +03:00
if ( IS_ERR ( page ) )
return ERR_CAST ( page ) ;
2016-02-13 10:01:29 +08:00
2020-04-15 15:53:46 +03:00
p = page_address ( page ) ;
2016-02-13 10:01:29 +08:00
/* align our pointer to the offset of the super block */
2020-04-15 15:53:46 +03:00
disk_super = p + offset_in_page ( bytenr ) ;
2016-02-13 10:01:29 +08:00
2020-11-10 20:26:14 +09:00
if ( btrfs_super_bytenr ( disk_super ) ! = bytenr_orig | |
2020-04-15 15:53:46 +03:00
btrfs_super_magic ( disk_super ) ! = BTRFS_MAGIC ) {
2020-02-14 00:24:32 +09:00
btrfs_release_disk_super ( p ) ;
2020-04-15 15:53:46 +03:00
return ERR_PTR ( - EINVAL ) ;
2016-02-13 10:01:29 +08:00
}
2020-04-15 15:53:46 +03:00
if ( disk_super - > label [ 0 ] & & disk_super - > label [ BTRFS_LABEL_SIZE - 1 ] )
disk_super - > label [ BTRFS_LABEL_SIZE - 1 ] = 0 ;
2016-02-13 10:01:29 +08:00
2020-04-15 15:53:46 +03:00
return disk_super ;
2016-02-13 10:01:29 +08:00
}
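The three placement checks in btrfs_read_disk_super() can be sketched in userspace C as follows. This is an illustration only, not kernel code: the `SKETCH_*` constants and both function names are hypothetical, a 4 KiB page is assumed, and the superblock size is assumed to be 4096 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration; the kernel takes these from its headers. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)
#define SKETCH_SUPER_SIZE UINT64_C(4096) /* assumed sizeof(struct btrfs_super_block) */

/* Returns true when a superblock at bytenr passes the three placement checks. */
static bool super_placement_ok(uint64_t bytenr, uint64_t dev_size)
{
	uint64_t index = bytenr >> SKETCH_PAGE_SHIFT;

	/* The page read for the super must lie inside the device. */
	if (bytenr + SKETCH_PAGE_SIZE >= dev_size)
		return false;
	/* The super must fit in a single page. */
	if (SKETCH_SUPER_SIZE > SKETCH_PAGE_SIZE)
		return false;
	/* The first and last byte of the super must share one page index. */
	if (((bytenr + SKETCH_SUPER_SIZE - 1) >> SKETCH_PAGE_SHIFT) != index)
		return false;
	return true;
}

/* Offset of the super within its page, mirroring offset_in_page(). */
static uint64_t super_offset_in_page(uint64_t bytenr)
{
	return bytenr & (SKETCH_PAGE_SIZE - 1);
}
```

With these assumptions, a super at offset 64 KiB on a 1 GiB device passes all three checks, while the same super shifted by 512 bytes is rejected because it would straddle two pages.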
2019-01-04 13:31:54 +08:00
int btrfs_forget_devices ( const char * path )
{
int ret ;
mutex_lock ( & uuid_mutex ) ;
ret = btrfs_free_stale_devices ( strlen ( path ) ? path : NULL , NULL ) ;
mutex_unlock ( & uuid_mutex ) ;
return ret ;
}
2013-02-15 11:31:02 -07:00
/*
* Look for a btrfs signature on a device . This may be called out of the mount path
* and we are not allowed to call set_blocksize during the scan . The superblock
* is read via pagecache
*/
2018-07-12 14:23:16 +08:00
struct btrfs_device * btrfs_scan_one_device ( const char * path , fmode_t flags ,
void * holder )
2008-03-24 15:02:07 -04:00
{
struct btrfs_super_block * disk_super ;
2018-05-29 12:28:37 +08:00
bool new_device_added = false ;
2018-07-12 14:23:16 +08:00
struct btrfs_device * device = NULL ;
2008-03-24 15:02:07 -04:00
struct block_device * bdev ;
2020-11-10 20:26:14 +09:00
u64 bytenr , bytenr_orig ;
int ret ;
2008-03-24 15:02:07 -04:00
2018-06-19 16:37:36 +02:00
lockdep_assert_held ( & uuid_mutex ) ;
2013-02-15 11:31:02 -07:00
/*
* we would like to check all the supers , but that would make
* a btrfs mount succeed after a mkfs from a different FS .
* So , we need to add a special mount option to scan for
* later supers , using BTRFS_SUPER_MIRROR_MAX instead
*/
2010-11-13 11:55:18 +01:00
flags | = FMODE_EXCL ;
2013-02-15 11:31:02 -07:00
bdev = blkdev_get_by_path ( path , flags , holder ) ;
2018-04-12 10:29:24 +08:00
if ( IS_ERR ( bdev ) )
2018-07-12 14:23:16 +08:00
return ERR_CAST ( bdev ) ;
2013-02-15 11:31:02 -07:00
2020-11-10 20:26:14 +09:00
bytenr_orig = btrfs_sb_offset ( 0 ) ;
ret = btrfs_sb_log_location_bdev ( bdev , 0 , READ , & bytenr ) ;
if ( ret ) {
/* Drop the bdev reference taken above instead of leaking it */
device = ERR_PTR ( ret ) ;
goto error_bdev_put ;
}
disk_super = btrfs_read_disk_super ( bdev , bytenr , bytenr_orig ) ;
2020-04-15 15:53:46 +03:00
if ( IS_ERR ( disk_super ) ) {
device = ERR_CAST ( disk_super ) ;
2013-02-15 11:31:02 -07:00
goto error_bdev_put ;
2017-12-15 15:40:16 +08:00
}
2013-02-15 11:31:02 -07:00
2018-05-29 12:28:37 +08:00
device = device_list_add ( path , disk_super , & new_device_added ) ;
2018-07-12 14:23:16 +08:00
if ( ! IS_ERR ( device ) ) {
2018-05-29 12:28:37 +08:00
if ( new_device_added )
btrfs_free_stale_devices ( path , device ) ;
}
2013-02-15 11:31:02 -07:00
2020-02-14 00:24:32 +09:00
btrfs_release_disk_super ( disk_super ) ;
2013-02-15 11:31:02 -07:00
error_bdev_put :
2010-11-13 11:55:18 +01:00
blkdev_put ( bdev , flags ) ;
2018-04-12 10:29:24 +08:00
2018-07-12 14:23:16 +08:00
return device ;
2008-03-24 15:02:07 -04:00
}
2008-03-24 15:01:56 -04:00
2019-03-27 14:24:12 +02:00
/*
* Try to find a chunk that intersects [ start , start + len ] range and when one
* such is found , record the end of it in * start
*/
static bool contains_pending_extent ( struct btrfs_device * device , u64 * start ,
u64 len )
2013-06-27 13:22:46 -04:00
{
2019-03-27 14:24:12 +02:00
u64 physical_start , physical_end ;
2013-06-27 13:22:46 -04:00
2019-03-27 14:24:12 +02:00
lockdep_assert_held ( & device - > fs_info - > chunk_mutex ) ;
2013-06-27 13:22:46 -04:00
2019-03-27 14:24:12 +02:00
if ( ! find_first_extent_bit ( & device - > alloc_state , * start ,
& physical_start , & physical_end ,
CHUNK_ALLOCATED , NULL ) ) {
2015-05-14 10:46:03 +01:00
2019-03-27 14:24:12 +02:00
if ( in_range ( physical_start , * start , len ) | |
in_range ( * start , physical_start ,
physical_end - physical_start ) ) {
* start = physical_end + 1 ;
return true ;
2013-06-27 13:22:46 -04:00
}
}
2019-03-27 14:24:12 +02:00
return false ;
2013-06-27 13:22:46 -04:00
}
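The intersection test in contains_pending_extent() can be sketched in userspace C as follows. This is an illustration, not the kernel implementation: `in_range_u64` and `skip_pending` are hypothetical names, and the allocated range is assumed to have an inclusive end, which is why the search position moves to `alloc_end + 1`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when b lies in [first, first + len). */
static bool in_range_u64(uint64_t b, uint64_t first, uint64_t len)
{
	return b >= first && b - first < len;
}

/*
 * Given one allocated range [alloc_start, alloc_end] (end inclusive, an
 * assumption) and a candidate range starting at *start with length len,
 * move *start past the allocated range when the two intersect.
 */
static bool skip_pending(uint64_t alloc_start, uint64_t alloc_end,
			 uint64_t *start, uint64_t len)
{
	if (in_range_u64(alloc_start, *start, len) ||
	    in_range_u64(*start, alloc_start, alloc_end - alloc_start)) {
		*start = alloc_end + 1;
		return true;
	}
	return false;
}
```

The two calls cover both orderings: either the allocated range begins inside the candidate, or the candidate begins inside the allocated range.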
2020-02-25 12:56:09 +09:00
static u64 dev_extent_search_start ( struct btrfs_device * device , u64 start )
{
switch ( device - > fs_devices - > chunk_alloc_policy ) {
case BTRFS_CHUNK_ALLOC_REGULAR :
/*
* We don't want to overwrite the superblock on the drive nor
* any area used by the boot loader (grub for example), so we
* make sure to start at an offset of at least 1MB.
*/
return max_t ( u64 , start , SZ_1M ) ;
2021-02-04 19:21:48 +09:00
case BTRFS_CHUNK_ALLOC_ZONED :
/*
* Unlike the regular allocator, we don't care about the starting
* region, because the first two zones are reserved for superblock
* logging anyway.
*/
return ALIGN ( start , device - > zone_info - > zone_size ) ;
2020-02-25 12:56:09 +09:00
default :
BUG ( ) ;
}
}
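The two start-offset policies in dev_extent_search_start() can be sketched in userspace C as follows. The enum, the `SKETCH_SZ_1M` constant and the function name are hypothetical, and the zone size is assumed to be a power of two so that the alignment reduces to a mask.

```c
#include <assert.h>
#include <stdint.h>

enum sketch_policy { SKETCH_REGULAR, SKETCH_ZONED };

#define SKETCH_SZ_1M (UINT64_C(1) << 20)

/* Returns the offset at which the search for a free dev extent may begin. */
static uint64_t search_start_for(enum sketch_policy policy, uint64_t start,
				 uint64_t zone_size)
{
	switch (policy) {
	case SKETCH_REGULAR:
		/* Never start below 1 MiB: superblock and boot loader area. */
		return start > SKETCH_SZ_1M ? start : SKETCH_SZ_1M;
	case SKETCH_ZONED:
		/* Round up to the next zone boundary (power-of-two size assumed). */
		assert(zone_size != 0 && (zone_size & (zone_size - 1)) == 0);
		return (start + zone_size - 1) & ~(zone_size - 1);
	}
	return start;
}
```

For example, a regular device asked to search from offset 0 starts at 1 MiB, while a zoned device with 256 MiB zones asked to search from offset 1 starts at 256 MiB.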
2021-02-04 19:21:48 +09:00
static bool dev_extent_hole_check_zoned ( struct btrfs_device * device ,
u64 * hole_start , u64 * hole_size ,
u64 num_bytes )
{
u64 zone_size = device - > zone_info - > zone_size ;
u64 pos ;
int ret ;
bool changed = false ;
ASSERT ( IS_ALIGNED ( * hole_start , zone_size ) ) ;
while ( * hole_size > 0 ) {
pos = btrfs_find_allocatable_zones ( device , * hole_start ,
* hole_start + * hole_size ,
num_bytes ) ;
if ( pos ! = * hole_start ) {
* hole_size = * hole_start + * hole_size - pos ;
* hole_start = pos ;
changed = true ;
if ( * hole_size < num_bytes )
break ;
}
ret = btrfs_ensure_empty_zones ( device , pos , num_bytes ) ;
/* Range is ensured to be empty */
if ( ! ret )
return changed ;
/* Given hole range was invalid (outside of device) */
if ( ret = = - ERANGE ) {
* hole_start + = * hole_size ;
2021-05-10 22:39:38 +09:00
* hole_size = 0 ;
2021-03-03 17:45:28 +08:00
return true ;
2021-02-04 19:21:48 +09:00
}
* hole_start + = zone_size ;
* hole_size - = zone_size ;
changed = true ;
}
return changed ;
}
2020-02-25 12:56:09 +09:00
/**
* dev_extent_hole_check - check if specified hole is suitable for allocation
* @ device : the device that contains the hole
* @ hole_start : starting position of the hole
* @ hole_size : the size of the hole
* @ num_bytes : the size of the free space that we need
*
2021-02-04 19:21:48 +09:00
* This function may modify @ hole_start and @ hole_size to reflect the suitable
2020-02-25 12:56:09 +09:00
* position for allocation . Returns true if the hole position is updated , false otherwise .
*/
static bool dev_extent_hole_check ( struct btrfs_device * device , u64 * hole_start ,
u64 * hole_size , u64 num_bytes )
{
bool changed = false ;
u64 hole_end = * hole_start + * hole_size ;
2021-02-04 19:21:48 +09:00
for ( ; ; ) {
/*
* Check before we set max_hole_start , otherwise we could end up
* sending back this offset anyway .
*/
if ( contains_pending_extent ( device , hole_start , * hole_size ) ) {
if ( hole_end > = * hole_start )
* hole_size = hole_end - * hole_start ;
else
* hole_size = 0 ;
changed = true ;
}
switch ( device - > fs_devices - > chunk_alloc_policy ) {
case BTRFS_CHUNK_ALLOC_REGULAR :
/* No extra check */
break ;
case BTRFS_CHUNK_ALLOC_ZONED :
if ( dev_extent_hole_check_zoned ( device , hole_start ,
hole_size , num_bytes ) ) {
changed = true ;
/*
* The changed hole can contain pending extent .
* Loop again to check that .
*/
continue ;
}
break ;
default :
BUG ( ) ;
}
2020-02-25 12:56:09 +09:00
break ;
}
return changed ;
}
2013-06-27 13:22:46 -04:00
2008-03-24 15:01:56 -04:00
/*
2015-06-15 09:41:17 -04:00
* find_free_dev_extent_start - find free space in the specified device
* @ device : the device which we search the free space in
* @ num_bytes : the size of the free space that we need
* @ search_start : the position from which to begin the search
* @ start : store the start of the free space .
* @ len : the size of the free space that we find , or the size
* of the max free space if we don't find suitable free space
2011-01-05 10:07:26 +00:00
*
2008-03-24 15:01:56 -04:00
* this uses a pretty simple search , the expectation is that it is
* called very infrequently and that a given device has a small number
* of extents
2011-01-05 10:07:26 +00:00
*
* @ start is used to store the start of the free space if we find one . But if
* we don't find suitable free space , it is used to store the start position
* of the max free space .
*
* @ len is used to store the size of the free space that we find .
* But if we don't find suitable free space , it is used to store the size of
* the max free space .
2019-07-19 14:51:42 +08:00
*
* NOTE : This function will search * commit * root of device tree , and does extra
* check to ensure dev extents are not double allocated .
* This makes the function safe to allocate dev extents but may not report
* correct usable device space , as device extent freed in current transaction
2021-05-21 17:42:23 +02:00
* is not reported as available .
2008-03-24 15:01:56 -04:00
*/
2019-07-19 14:51:41 +08:00
static int find_free_dev_extent_start ( struct btrfs_device * device ,
u64 num_bytes , u64 search_start , u64 * start ,
u64 * len )
2008-03-24 15:01:56 -04:00
{
2016-06-22 18:54:23 -04:00
struct btrfs_fs_info * fs_info = device - > fs_info ;
struct btrfs_root * root = fs_info - > dev_root ;
2008-03-24 15:01:56 -04:00
struct btrfs_key key ;
2011-01-05 10:07:26 +00:00
struct btrfs_dev_extent * dev_extent ;
2008-11-17 21:11:30 -05:00
struct btrfs_path * path ;
2011-01-05 10:07:26 +00:00
u64 hole_size ;
u64 max_hole_start ;
u64 max_hole_size ;
u64 extent_end ;
2008-03-24 15:01:56 -04:00
u64 search_end = device - > total_bytes ;
int ret ;
2011-01-05 10:07:26 +00:00
int slot ;
2008-03-24 15:01:56 -04:00
struct extent_buffer * l ;
Btrfs: fix fitrim discarding device area reserved for boot loader's use
As of the 4.3 kernel release, the fitrim ioctl can now discard any region
of a disk that is not allocated to any chunk/block group, including the
first megabyte which is used for our primary superblock and by the boot
loader (grub for example).
Fix this by not allowing to trim/discard any region in the device starting
with an offset not greater than min(alloc_start_mount_option, 1Mb), just
as it was not possible before 4.3.
A reproducer test case for xfstests follows.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
cd /
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
# Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are
# reserved for a boot loader to use (GRUB for example) and btrfs should never
# use them - neither for allocating metadata/data nor should trim/discard them.
# The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem.
$XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io
$XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io
# Now mount the filesystem and perform a fitrim against it.
_scratch_mount
_require_batched_discard $SCRATCH_MNT
$FSTRIM_PROG $SCRATCH_MNT
# Now unmount the filesystem and verify the content of the ranges was not
# modified (no trim/discard happened on them).
_scratch_unmount
echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:"
od -t x1 -N $((64 * 1024)) $SCRATCH_DEV
od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV
status=0
exit
Reported-by: Vincent Petry <PVince81@yahoo.fr>
Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
Fixes: 499f377f49f0 (btrfs: iterate over unused chunk space in FITRIM)
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-01-06 22:42:35 +00:00
2020-02-25 12:56:09 +09:00
search_start = dev_extent_search_start ( device , search_start ) ;
2008-03-24 15:01:56 -04:00
2021-02-04 19:21:48 +09:00
WARN_ON ( device - > zone_info & &
! IS_ALIGNED ( num_bytes , device - > zone_info - > zone_size ) ) ;
2013-06-27 13:22:46 -04:00
path = btrfs_alloc_path ( ) ;
if ( ! path )
return - ENOMEM ;
2015-02-16 18:52:17 +08:00
2011-01-05 10:07:26 +00:00
max_hole_start = search_start ;
max_hole_size = 0 ;
2015-02-16 18:52:17 +08:00
again :
2017-12-04 12:54:55 +08:00
if ( search_start > = search_end | |
test_bit ( BTRFS_DEV_STATE_REPLACE_TGT , & device - > dev_state ) ) {
2011-01-05 10:07:26 +00:00
ret = - ENOSPC ;
2013-06-27 13:22:46 -04:00
goto out ;
2011-01-05 10:07:26 +00:00
}
2015-11-27 16:31:35 +01:00
path - > reada = READA_FORWARD ;
2013-06-27 13:22:46 -04:00
path - > search_commit_root = 1 ;
path - > skip_locking = 1 ;
2011-01-05 10:07:26 +00:00
2008-03-24 15:01:56 -04:00
key . objectid = device - > devid ;
key . offset = search_start ;
key . type = BTRFS_DEV_EXTENT_KEY ;
2011-01-05 10:07:26 +00:00
2021-07-29 05:22:16 -03:00
ret = btrfs_search_backwards ( root , & key , path ) ;
2008-03-24 15:01:56 -04:00
if ( ret < 0 )
2011-01-05 10:07:26 +00:00
goto out ;
2008-03-24 15:01:56 -04:00
while ( 1 ) {
l = path - > nodes [ 0 ] ;
slot = path - > slots [ 0 ] ;
if ( slot > = btrfs_header_nritems ( l ) ) {
ret = btrfs_next_leaf ( root , path ) ;
if ( ret = = 0 )
continue ;
if ( ret < 0 )
2011-01-05 10:07:26 +00:00
goto out ;
break ;
2008-03-24 15:01:56 -04:00
}
btrfs_item_key_to_cpu ( l , & key , slot ) ;
if ( key . objectid < device - > devid )
goto next ;
if ( key . objectid > device - > devid )
2011-01-05 10:07:26 +00:00
break ;
2008-03-24 15:01:56 -04:00
2014-06-04 18:41:45 +02:00
if ( key . type ! = BTRFS_DEV_EXTENT_KEY )
2011-01-05 10:07:26 +00:00
goto next ;
2009-07-24 16:41:41 -04:00
2011-01-05 10:07:26 +00:00
if ( key . offset > search_start ) {
hole_size = key . offset - search_start ;
2020-02-25 12:56:09 +09:00
dev_extent_hole_check ( device , & search_start , & hole_size ,
num_bytes ) ;
2013-06-27 13:22:46 -04:00
2011-01-05 10:07:26 +00:00
if ( hole_size > max_hole_size ) {
max_hole_start = search_start ;
max_hole_size = hole_size ;
}
2009-07-24 16:41:41 -04:00
2011-01-05 10:07:26 +00:00
/*
* If this free space is at least as large as what we need ,
* it must be the max free space that we have found so far ,
* because every earlier hole was too small . So
* max_hole_start points to the start of this free space
* and its length is stored in max_hole_size . Thus , we
* return max_hole_start and max_hole_size to the caller .
if ( hole_size > = num_bytes ) {
ret = 0 ;
goto out ;
2008-03-24 15:01:56 -04:00
}
}
dev_extent = btrfs_item_ptr ( l , slot , struct btrfs_dev_extent ) ;
2011-01-05 10:07:26 +00:00
extent_end = key . offset + btrfs_dev_extent_length ( l ,
dev_extent ) ;
if ( extent_end > search_start )
search_start = extent_end ;
2008-03-24 15:01:56 -04:00
next :
path - > slots [ 0 ] + + ;
cond_resched ( ) ;
}
Btrfs: fix a bug of balance on full multi-disk partitions
When balancing, we'll first try to shrink devices for some space,
but if it is working on a full multi-disk partition with raid protection,
we may encounter a bug, that is, while shrinking, total_bytes may be less
than bytes_used, and btrfs may allocate a dev extent that accesses out of
device's bounds.
Then we will not be able to write or read the data which stores at the end
of the device, and get the followings:
device fsid 0939f071-7ea3-46c8-95df-f176d773bfb6 devid 1 transid 10 /dev/sdb5
Btrfs detected SSD devices, enabling SSD mode
btrfs: relocating block group 476315648 flags 9
btrfs: found 4 extents
attempt to access beyond end of device
sdb5: rw=145, want=546176, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546304, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546432, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546560, limit=546147
attempt to access beyond end of device
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-02 02:39:03 +00:00
/*
* At this point , search_start should be the end of
* allocated dev extents , and when shrinking the device ,
* search_end may be smaller than search_start .
*/
2015-02-16 18:52:17 +08:00
if ( search_end > search_start ) {
2011-08-02 02:39:03 +00:00
hole_size = search_end - search_start ;
2020-02-25 12:56:09 +09:00
if ( dev_extent_hole_check ( device , & search_start , & hole_size ,
num_bytes ) ) {
2015-02-16 18:52:17 +08:00
btrfs_release_path ( path ) ;
goto again ;
}
2008-03-24 15:01:56 -04:00
2015-02-16 18:52:17 +08:00
if ( hole_size > max_hole_size ) {
max_hole_start = search_start ;
max_hole_size = hole_size ;
}
2013-06-27 13:22:46 -04:00
}
2011-01-05 10:07:26 +00:00
/* See above. */
2015-02-16 18:52:17 +08:00
if ( max_hole_size < num_bytes )
2011-01-05 10:07:26 +00:00
ret = - ENOSPC ;
else
ret = 0 ;
out :
2008-11-17 21:11:30 -05:00
btrfs_free_path ( path ) ;
2011-01-05 10:07:26 +00:00
* start = max_hole_start ;
2011-01-05 10:07:28 +00:00
if ( len )
2011-01-05 10:07:26 +00:00
* len = max_hole_size ;
2008-03-24 15:01:56 -04:00
return ret ;
}
2019-03-27 14:24:14 +02:00
int find_free_dev_extent ( struct btrfs_device * device , u64 num_bytes ,
2015-06-15 09:41:17 -04:00
u64 * start , u64 * len )
{
/* FIXME use last free of some kind */
2019-03-27 14:24:14 +02:00
return find_free_dev_extent_start ( device , num_bytes , 0 , start , len ) ;
2015-06-15 09:41:17 -04:00
}
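The tail of the hole search above can be illustrated with a small userspace sketch. It is not the kernel implementation: `demo_extent` and `demo_find_free_extent` are hypothetical names, the allocated extents are assumed to be a sorted, non-overlapping in-memory array that lies inside the search range, and the early exit on the first hole that fits is an assumption about the part of the function not shown here. What it does mirror from the code above is tracking the largest hole, checking the final hole between `search_start` and `search_end`, returning `-ENOSPC` when nothing fits, and still reporting the largest hole through `*start` and `*len`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for one allocated dev extent. */
struct demo_extent {
	uint64_t start;
	uint64_t len;
};

/*
 * Scan the gaps between sorted extents in [search_start, search_end).
 * Returns 0 if a hole of at least num_bytes exists, -ENOSPC otherwise.
 * In both cases *start (and *len, if non-NULL) describe the largest hole
 * seen so far, as in the tail of the function above.
 */
static int demo_find_free_extent(const struct demo_extent *ext, size_t n,
				 uint64_t search_start, uint64_t search_end,
				 uint64_t num_bytes,
				 uint64_t *start, uint64_t *len)
{
	uint64_t max_hole_start = search_start;
	uint64_t max_hole_size = 0;
	size_t i;

	assert(start != NULL);

	for (i = 0; i < n; i++) {
		uint64_t ext_end = ext[i].start + ext[i].len;

		if (ext[i].start > search_start) {
			uint64_t hole_size = ext[i].start - search_start;

			if (hole_size > max_hole_size) {
				max_hole_start = search_start;
				max_hole_size = hole_size;
			}
			/* Assumed first-fit behaviour: stop at the first hole that fits. */
			if (hole_size >= num_bytes)
				goto out;
		}
		if (ext_end > search_start)
			search_start = ext_end;
	}

	/* Final hole between the last extent and the end of the range. */
	if (search_end > search_start) {
		uint64_t hole_size = search_end - search_start;

		if (hole_size > max_hole_size) {
			max_hole_start = search_start;
			max_hole_size = hole_size;
		}
	}
out:
	*start = max_hole_start;
	if (len)
		*len = max_hole_size;
	return max_hole_size < num_bytes ? -ENOSPC : 0;
}
```

A caller that gets `-ENOSPC` can still inspect `*len` to learn how large the biggest available hole is, which is why the outputs are filled in on both paths.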
2008-12-02 09:54:17 -05:00
static int btrfs_free_dev_extent ( struct btrfs_trans_handle * trans ,
2008-04-25 16:53:30 -04:00
struct btrfs_device * device ,
2014-09-03 21:35:41 +08:00
u64 start , u64 * dev_extent_len )
2008-04-25 16:53:30 -04:00
{
2016-06-22 18:54:23 -04:00
struct btrfs_fs_info * fs_info = device - > fs_info ;
struct btrfs_root * root = fs_info - > dev_root ;
2008-04-25 16:53:30 -04:00
int ret ;
struct btrfs_path * path ;
struct btrfs_key key ;
2008-05-07 11:43:44 -04:00
struct btrfs_key found_key ;
struct extent_buffer * leaf = NULL ;
struct btrfs_dev_extent * extent = NULL ;
2008-04-25 16:53:30 -04:00
path = btrfs_alloc_path ( ) ;
if ( ! path )
return - ENOMEM ;
key . objectid = device - > devid ;
key . offset = start ;
key . type = BTRFS_DEV_EXTENT_KEY ;
2011-11-10 20:45:04 -05:00
again :
2008-04-25 16:53:30 -04:00
ret = btrfs_search_slot ( trans , root , & key , path , - 1 , 1 ) ;
2008-05-07 11:43:44 -04:00
if ( ret > 0 ) {
ret = btrfs_previous_item ( root , path , key . objectid ,
BTRFS_DEV_EXTENT_KEY ) ;
2011-05-19 07:03:42 +00:00
if ( ret )
goto out ;
2008-05-07 11:43:44 -04:00
leaf = path - > nodes [ 0 ] ;
btrfs_item_key_to_cpu ( leaf , & found_key , path - > slots [ 0 ] ) ;
extent = btrfs_item_ptr ( leaf , path - > slots [ 0 ] ,
struct btrfs_dev_extent ) ;
BUG_ON ( found_key . offset > start | | found_key . offset +
btrfs_dev_extent_length ( leaf , extent ) < start ) ;
2011-11-10 20:45:04 -05:00
key = found_key ;
btrfs_release_path ( path ) ;
goto again ;
2008-05-07 11:43:44 -04:00
} else if ( ret = = 0 ) {
leaf = path - > nodes [ 0 ] ;
extent = btrfs_item_ptr ( leaf , path - > slots [ 0 ] ,
struct btrfs_dev_extent ) ;
2012-03-12 16:03:00 +01:00
} else {
goto out ;
2008-05-07 11:43:44 -04:00
}
2008-04-25 16:53:30 -04:00
2014-09-03 21:35:41 +08:00
* dev_extent_len = btrfs_dev_extent_length ( leaf , extent ) ;
2008-04-25 16:53:30 -04:00
ret = btrfs_del_item ( trans , root , path ) ;
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by and large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
if ( ret = = 0 )
2015-09-24 10:46:10 -04:00
set_bit ( BTRFS_TRANS_HAVE_FREE_BGS , & trans - > transaction - > flags ) ;
2011-05-19 07:03:42 +00:00
out :
2008-04-25 16:53:30 -04:00
btrfs_free_path ( path ) ;
return ret ;
}
2013-06-27 13:22:46 -04:00
static u64 find_next_chunk ( struct btrfs_fs_info * fs_info )
2008-03-24 15:01:56 -04:00
{
2013-06-27 13:22:46 -04:00
struct extent_map_tree * em_tree ;
struct extent_map * em ;
struct rb_node * n ;
u64 ret = 0 ;
2008-03-24 15:01:56 -04:00
2019-05-17 11:43:17 +02:00
em_tree = & fs_info - > mapping_tree ;
2013-06-27 13:22:46 -04:00
read_lock ( & em_tree - > lock ) ;
2018-08-23 03:51:52 +08:00
n = rb_last ( & em_tree - > map . rb_root ) ;
2013-06-27 13:22:46 -04:00
if ( n ) {
em = rb_entry ( n , struct extent_map , rb_node ) ;
ret = em - > start + em - > len ;
2008-03-24 15:01:56 -04:00
}
2013-06-27 13:22:46 -04:00
read_unlock ( & em_tree - > lock ) ;
2008-03-24 15:01:56 -04:00
return ret ;
}
2013-08-12 14:33:01 +03:00
static noinline int find_next_devid ( struct btrfs_fs_info * fs_info ,
u64 * devid_ret )
2008-03-24 15:01:56 -04:00
{
int ret ;
struct btrfs_key key ;
struct btrfs_key found_key ;
2008-11-17 21:11:30 -05:00
struct btrfs_path * path ;
path = btrfs_alloc_path ( ) ;
if ( ! path )
return - ENOMEM ;
2008-03-24 15:01:56 -04:00
key . objectid = BTRFS_DEV_ITEMS_OBJECTID ;
key . type = BTRFS_DEV_ITEM_KEY ;
key . offset = ( u64 ) - 1 ;
2013-08-12 14:33:01 +03:00
ret = btrfs_search_slot ( NULL , fs_info - > chunk_root , & key , path , 0 , 0 ) ;
2008-03-24 15:01:56 -04:00
if ( ret < 0 )
goto error ;
2019-08-27 15:40:44 +08:00
if ( ret = = 0 ) {
/* Corruption */
btrfs_err ( fs_info , " corrupted chunk tree devid -1 matched " ) ;
ret = - EUCLEAN ;
goto error ;
}
2008-03-24 15:01:56 -04:00
2013-08-12 14:33:01 +03:00
ret = btrfs_previous_item ( fs_info - > chunk_root , path ,
BTRFS_DEV_ITEMS_OBJECTID ,
2008-03-24 15:01:56 -04:00
BTRFS_DEV_ITEM_KEY ) ;
if ( ret ) {
2013-08-12 14:33:01 +03:00
* devid_ret = 1 ;
2008-03-24 15:01:56 -04:00
} else {
btrfs_item_key_to_cpu ( path - > nodes [ 0 ] , & found_key ,
path - > slots [ 0 ] ) ;
2013-08-12 14:33:01 +03:00
* devid_ret = found_key . offset + 1 ;
2008-03-24 15:01:56 -04:00
}
ret = 0 ;
error :
2008-11-17 21:11:30 -05:00
btrfs_free_path ( path ) ;
2008-03-24 15:01:56 -04:00
return ret ;
}
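The devid selection in find_next_devid() boils down to "take the highest existing device item offset and add one". The following userspace sketch models that rule on a sorted array instead of the chunk tree. The names `demo_find_next_devid` and `DEMO_ERR_CORRUPT` are hypothetical; the kernel function searches a btree and returns -EUCLEAN for the corruption case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error code standing in for the kernel's -EUCLEAN. */
#define DEMO_ERR_CORRUPT (-1)

/*
 * devids[] models the offsets of all existing device items, sorted in
 * ascending order. The result is stored in *devid_ret.
 */
static int demo_find_next_devid(const uint64_t *devids, size_t count,
				uint64_t *devid_ret)
{
	assert(devid_ret != NULL);

	if (count == 0) {
		/* No previous item found: the first device gets devid 1. */
		*devid_ret = 1;
		return 0;
	}

	/* An item whose offset is (u64)-1 is treated as corruption. */
	if (devids[count - 1] == UINT64_MAX)
		return DEMO_ERR_CORRUPT;

	/* Otherwise continue after the highest existing devid. */
	*devid_ret = devids[count - 1] + 1;
	return 0;
}
```

Note that gaps left by removed devices are not reused under this rule: with devids 1, 2 and 5 present, the next devid is 6.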
/*
* the device information is stored in the chunk root
* the btrfs_device struct should be fully filled in
*/
2017-11-06 16:36:15 +08:00
static int btrfs_add_dev_item ( struct btrfs_trans_handle * trans ,
2013-04-25 20:41:01 +00:00
struct btrfs_device * device )
2008-03-24 15:01:56 -04:00
{
int ret ;
struct btrfs_path * path ;
struct btrfs_dev_item * dev_item ;
struct extent_buffer * leaf ;
struct btrfs_key key ;
unsigned long ptr ;
path = btrfs_alloc_path ( ) ;
if ( ! path )
return - ENOMEM ;
key . objectid = BTRFS_DEV_ITEMS_OBJECTID ;
key . type = BTRFS_DEV_ITEM_KEY ;
2008-11-17 21:11:30 -05:00
key . offset = device - > devid ;
2008-03-24 15:01:56 -04:00
2018-07-20 19:37:47 +03:00
ret = btrfs_insert_empty_item ( trans , trans - > fs_info - > chunk_root , path ,
& key , sizeof ( * dev_item ) ) ;
2008-03-24 15:01:56 -04:00
if ( ret )
goto out ;
leaf = path - > nodes [ 0 ] ;
dev_item = btrfs_item_ptr ( leaf , path - > slots [ 0 ] , struct btrfs_dev_item ) ;
btrfs_set_device_id ( leaf , dev_item , device - > devid ) ;
2008-11-17 21:11:30 -05:00
btrfs_set_device_generation ( leaf , dev_item , 0 ) ;
2008-03-24 15:01:56 -04:00
btrfs_set_device_type ( leaf , dev_item , device - > type ) ;
btrfs_set_device_io_align ( leaf , dev_item , device - > io_align ) ;
btrfs_set_device_io_width ( leaf , dev_item , device - > io_width ) ;
btrfs_set_device_sector_size ( leaf , dev_item , device - > sector_size ) ;
2014-09-03 21:35:38 +08:00
btrfs_set_device_total_bytes ( leaf , dev_item ,
btrfs_device_get_disk_total_bytes ( device ) ) ;
btrfs_set_device_bytes_used ( leaf , dev_item ,
btrfs_device_get_bytes_used ( device ) ) ;
2008-04-15 15:41:47 -04:00
btrfs_set_device_group ( leaf , dev_item , 0 ) ;
btrfs_set_device_seek_speed ( leaf , dev_item , 0 ) ;
btrfs_set_device_bandwidth ( leaf , dev_item , 0 ) ;
2008-12-08 16:40:21 -05:00
btrfs_set_device_start_offset ( leaf , dev_item , 0 ) ;
2008-03-24 15:01:56 -04:00
2013-08-20 13:20:11 +02:00
ptr = btrfs_device_uuid ( dev_item ) ;
2008-04-15 15:41:47 -04:00
write_extent_buffer ( leaf , device - > uuid , ptr , BTRFS_UUID_SIZE ) ;
2013-08-20 13:20:12 +02:00
ptr = btrfs_device_fsid ( dev_item ) ;
2018-10-30 16:43:24 +02:00
write_extent_buffer ( leaf , trans - > fs_info - > fs_devices - > metadata_uuid ,
ptr , BTRFS_FSID_SIZE ) ;
2008-03-24 15:01:56 -04:00
btrfs_mark_buffer_dirty ( leaf ) ;
2008-11-17 21:11:30 -05:00
ret = 0 ;
2008-03-24 15:01:56 -04:00
out :
btrfs_free_path ( path ) ;
return ret ;
}
2008-04-25 16:53:30 -04:00
2014-04-16 17:02:32 +08:00
/*
* Function to update ctime / mtime for a given device path .
* Mainly used for ctime / mtime based probe like libblkid .
*/
2021-07-27 17:01:16 -04:00
static void update_dev_time ( struct block_device * bdev )
2014-04-16 17:02:32 +08:00
{
2021-07-27 17:01:16 -04:00
struct inode * inode = bdev - > bd_inode ;
struct timespec64 now ;
2014-04-16 17:02:32 +08:00
2021-07-27 17:01:16 -04:00
/* Shouldn't happen but just in case. */
if ( ! inode )
2014-04-16 17:02:32 +08:00
return ;
2021-07-27 17:01:16 -04:00
now = current_time ( inode ) ;
generic_update_time ( inode , & now , S_MTIME | S_CTIME ) ;
2014-04-16 17:02:32 +08:00
}
2019-03-20 16:31:53 +01:00
static int btrfs_rm_dev_item ( struct btrfs_device * device )
2008-05-07 11:43:44 -04:00
{
2019-03-20 16:31:53 +01:00
struct btrfs_root * root = device - > fs_info - > chunk_root ;
2008-05-07 11:43:44 -04:00
int ret ;
struct btrfs_path * path ;
struct btrfs_key key ;
struct btrfs_trans_handle * trans ;
path = btrfs_alloc_path ( ) ;
if ( ! path )
return - ENOMEM ;
2010-05-16 10:48:46 -04:00
trans = btrfs_start_transaction ( root , 0 ) ;
2011-01-20 06:19:37 +00:00
if ( IS_ERR ( trans ) ) {
btrfs_free_path ( path ) ;
return PTR_ERR ( trans ) ;
}
2008-05-07 11:43:44 -04:00
key . objectid = BTRFS_DEV_ITEMS_OBJECTID ;
key . type = BTRFS_DEV_ITEM_KEY ;
key . offset = device - > devid ;
ret = btrfs_search_slot ( trans , root , & key , path , - 1 , 1 ) ;
2017-10-23 09:58:46 +03:00
if ( ret ) {
if ( ret > 0 )
ret = - ENOENT ;
btrfs_abort_transaction ( trans , ret ) ;
btrfs_end_transaction ( trans ) ;
2008-05-07 11:43:44 -04:00
goto out ;
}
ret = btrfs_del_item ( trans , root , path ) ;
2017-10-23 09:58:46 +03:00
if ( ret ) {
btrfs_abort_transaction ( trans , ret ) ;
btrfs_end_transaction ( trans ) ;
}
2008-05-07 11:43:44 -04:00
out :
btrfs_free_path ( path ) ;
2017-10-23 09:58:46 +03:00
if ( ! ret )
ret = btrfs_commit_transaction ( trans ) ;
2008-05-07 11:43:44 -04:00
return ret ;
}
2016-02-15 16:00:26 +01:00
/*
* Verify that @ num_devices satisfies the RAID profile constraints in the whole
* filesystem . It ' s up to the caller to adjust that number regarding eg . device
* replace .
*/
static int btrfs_check_raid_min_devices ( struct btrfs_fs_info * fs_info ,
u64 num_devices )
2008-05-07 11:43:44 -04:00
{
u64 all_avail ;
2013-01-29 10:13:12 +00:00
unsigned seq ;
2016-02-15 16:28:14 +01:00
int i ;
2008-05-07 11:43:44 -04:00
2013-01-29 10:13:12 +00:00
do {
2016-02-13 10:01:34 +08:00
seq = read_seqbegin ( & fs_info - > profiles_lock ) ;
2013-01-29 10:13:12 +00:00
2016-02-13 10:01:34 +08:00
all_avail = fs_info - > avail_data_alloc_bits |
fs_info - > avail_system_alloc_bits |
fs_info - > avail_metadata_alloc_bits ;
} while ( read_seqretry ( & fs_info - > profiles_lock , seq ) ) ;
2008-05-07 11:43:44 -04:00
2016-02-15 16:28:14 +01:00
for ( i = 0 ; i < BTRFS_NR_RAID_TYPES ; i + + ) {
2018-04-25 19:01:43 +08:00
if ( ! ( all_avail & btrfs_raid_array [ i ] . bg_flag ) )
2016-02-15 16:28:14 +01:00
continue ;
2008-05-07 11:43:44 -04:00
2021-07-28 07:03:05 +08:00
if ( num_devices < btrfs_raid_array [ i ] . devs_min )
return btrfs_raid_array [ i ] . mindev_error ;
2013-01-29 18:40:14 -05:00
}
2016-02-13 10:01:34 +08:00
return 0 ;
2016-02-13 10:01:33 +08:00
}
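The check in btrfs_check_raid_min_devices() is a table walk: for every profile whose flag is set in the combined allocation bits, the device count must not drop below that profile's minimum. Below is a minimal userspace sketch of the same pattern. All `demo_` and `DEMO_` names are hypothetical, and the table values are illustrative only; they are not copied from btrfs_raid_array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative profile flags and error codes, not the btrfs values. */
#define DEMO_BG_MIRROR	(1ULL << 0)
#define DEMO_BG_TRIPLE	(1ULL << 1)

enum { DEMO_ERR_MIRROR = 1, DEMO_ERR_TRIPLE = 2 };

struct demo_raid_attr {
	uint64_t bg_flag;	/* profile bit */
	uint64_t devs_min;	/* minimum devices the profile needs */
	int mindev_error;	/* error reported when below the minimum */
};

static const struct demo_raid_attr demo_raid_array[] = {
	{ DEMO_BG_MIRROR, 2, DEMO_ERR_MIRROR },
	{ DEMO_BG_TRIPLE, 3, DEMO_ERR_TRIPLE },
};

/* Return 0 if num_devices satisfies every profile set in all_avail. */
static int demo_check_raid_min_devices(uint64_t all_avail, uint64_t num_devices)
{
	size_t i;

	for (i = 0; i < sizeof(demo_raid_array) / sizeof(demo_raid_array[0]); i++) {
		if (!(all_avail & demo_raid_array[i].bg_flag))
			continue;
		if (num_devices < demo_raid_array[i].devs_min)
			return demo_raid_array[i].mindev_error;
	}
	return 0;
}

/* Removal check: validate the count as it would be after removing one device. */
static int demo_can_remove_device(uint64_t all_avail, uint64_t num_devices)
{
	assert(num_devices > 0);
	return demo_check_raid_min_devices(all_avail, num_devices - 1) == 0;
}
```

The second helper reflects how the caller is expected to adjust the count before the check, for example by passing the number of devices that would remain after a removal.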
2017-08-22 23:46:04 -07:00
static struct btrfs_device * btrfs_find_next_active_device (
struct btrfs_fs_devices * fs_devs , struct btrfs_device * device )
2008-05-07 11:43:44 -04:00
{
2008-11-17 21:11:30 -05:00
struct btrfs_device * next_device ;
2016-05-03 17:44:43 +08:00
list_for_each_entry ( next_device , & fs_devs - > devices , dev_list ) {
if ( next_device ! = device & &
2017-12-04 12:54:54 +08:00
! test_bit ( BTRFS_DEV_STATE_MISSING , & next_device - > dev_state )
& & next_device - > bdev )
2016-05-03 17:44:43 +08:00
return next_device ;
}
return NULL ;
}
/*
2021-08-24 13:05:19 +08:00
* Helper function to check if the given device is part of s_bdev / latest_dev
2016-05-03 17:44:43 +08:00
* and replace it with the provided or the next active device . In the context
* where this function is called , there should always be another device ( or
* this_dev ) which is active .
*/
2019-10-01 19:57:35 +02:00
void __cold btrfs_assign_next_active_device ( struct btrfs_device * device ,
2020-09-05 01:34:34 +08:00
struct btrfs_device * next_device )
2016-05-03 17:44:43 +08:00
{
2018-07-20 19:37:50 +03:00
struct btrfs_fs_info * fs_info = device - > fs_info ;
2016-05-03 17:44:43 +08:00
2020-09-05 01:34:34 +08:00
if ( ! next_device )
2016-05-03 17:44:43 +08:00
next_device = btrfs_find_next_active_device ( fs_info - > fs_devices ,
2020-09-05 01:34:34 +08:00
device ) ;
2016-05-03 17:44:43 +08:00
ASSERT ( next_device ) ;
if ( fs_info - > sb - > s_bdev & &
( fs_info - > sb - > s_bdev = = device - > bdev ) )
fs_info - > sb - > s_bdev = next_device - > bdev ;
2021-08-24 13:05:19 +08:00
if ( fs_info - > fs_devices - > latest_dev - > bdev = = device - > bdev )
fs_info - > fs_devices - > latest_dev = next_device ;
2016-05-03 17:44:43 +08:00
}
2018-08-10 13:53:21 +08:00
/*
* Return btrfs_fs_devices : : num_devices excluding the device that ' s being
* currently replaced .
*/
static u64 btrfs_num_devices ( struct btrfs_fs_info * fs_info )
{
u64 num_devices = fs_info - > fs_devices - > num_devices ;
2018-09-07 16:11:23 +02:00
down_read ( & fs_info - > dev_replace . rwsem ) ;
2018-08-10 13:53:21 +08:00
if ( btrfs_dev_replace_is_ongoing ( & fs_info - > dev_replace ) ) {
ASSERT ( num_devices > 1 ) ;
num_devices - - ;
}
2018-09-07 16:11:23 +02:00
up_read ( & fs_info - > dev_replace . rwsem ) ;
2018-08-10 13:53:21 +08:00
return num_devices ;
}
2020-08-20 11:18:26 -04:00
void btrfs_scratch_superblocks ( struct btrfs_fs_info * fs_info ,
struct block_device * bdev ,
const char * device_path )
2020-02-14 00:24:31 +09:00
{
struct btrfs_super_block * disk_super ;
int copy_num ;
if ( ! bdev )
return ;
for ( copy_num = 0 ; copy_num < BTRFS_SUPER_MIRROR_MAX ; copy_num + + ) {
2020-02-14 00:24:32 +09:00
struct page * page ;
int ret ;
2020-02-14 00:24:31 +09:00
2020-02-14 00:24:32 +09:00
disk_super = btrfs_read_dev_one_super ( bdev , copy_num ) ;
if ( IS_ERR ( disk_super ) )
continue ;
2020-02-14 00:24:31 +09:00
btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones. However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today. So, the number of superblock and
copies is limited to be two. Second downside is that we cannot support
devices which have no conventional zones at all.
To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, it resets the first zone.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zone 1024 or the zone at 256GB, whichever is lower, and
the zone next to it
If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
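The two-zone circular log described in this commit message can be modelled in memory as follows. This is a simplified sketch, not the kernel code: `DEMO_ZONE_CAP`, `demo_sb_zone` and the `demo_sb_log_*` functions are hypothetical, each zone is an array of generation numbers, and the per-zone `written` counter plays the role of the write pointer. The sketch locates the newest superblock from the write pointers alone and compares the last entries only when both zones are full.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_ZONE_CAP 4	/* hypothetical number of superblock slots per zone */

struct demo_sb_zone {
	unsigned int written;		/* write pointer, counted in slots */
	uint64_t gen[DEMO_ZONE_CAP];	/* generation stored in each slot */
};

static int demo_zone_empty(const struct demo_sb_zone *z)
{
	return z->written == 0;
}

static int demo_zone_full(const struct demo_sb_zone *z)
{
	return z->written == DEMO_ZONE_CAP;
}

static uint64_t demo_zone_last_gen(const struct demo_sb_zone *z)
{
	assert(z->written > 0);
	return z->gen[z->written - 1];
}

/* Index of the zone holding the newest superblock, or -1 if none exists. */
static int demo_sb_log_latest_zone(const struct demo_sb_zone z[2])
{
	if (demo_zone_empty(&z[0]) && demo_zone_empty(&z[1]))
		return -1;
	/* Corner case: both full, so compare the last superblock of each zone. */
	if (demo_zone_full(&z[0]) && demo_zone_full(&z[1]))
		return demo_zone_last_gen(&z[0]) > demo_zone_last_gen(&z[1]) ? 0 : 1;
	/* A partially written zone always holds the newest copy. */
	if (!demo_zone_empty(&z[0]) && !demo_zone_full(&z[0]))
		return 0;
	if (!demo_zone_empty(&z[1]) && !demo_zone_full(&z[1]))
		return 1;
	/* One zone is full and the other is empty. */
	return demo_zone_full(&z[0]) ? 0 : 1;
}

/* Append a superblock with the given generation to the log. */
static void demo_sb_log_append(struct demo_sb_zone z[2], uint64_t gen)
{
	int latest = demo_sb_log_latest_zone(z);
	int target;

	if (latest < 0) {
		target = 0;
	} else if (!demo_zone_full(&z[latest])) {
		target = latest;
	} else {
		/* Switch zones; reset the older one first if it is full too. */
		target = 1 - latest;
		if (demo_zone_full(&z[target]))
			z[target].written = 0;
	}
	z[target].gen[z[target].written++] = gen;
}
```

Because the older zone is reset only right before it is written again, at least one zone always contains a complete superblock in this model.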
2020-11-10 20:26:14 +09:00
if ( bdev_is_zoned ( bdev ) ) {
btrfs_reset_sb_log_zones ( bdev , copy_num ) ;
continue ;
}
2020-02-14 00:24:31 +09:00
memset ( & disk_super - > magic , 0 , sizeof ( disk_super - > magic ) ) ;
2020-02-14 00:24:32 +09:00
page = virt_to_page ( disk_super ) ;
set_page_dirty ( page ) ;
lock_page ( page ) ;
/* write_one_page() unlocks the page */
ret = write_one_page ( page ) ;
if ( ret )
btrfs_warn ( fs_info ,
" error clearing superblock number %d (%d) " ,
copy_num , ret ) ;
btrfs_release_disk_super ( disk_super ) ;
2020-02-14 00:24:31 +09:00
}
/* Notify udev that device has changed */
btrfs_kobject_uevent ( bdev , KOBJ_CHANGE ) ;
/* Update ctime/mtime for device path for libblkid */
2021-07-27 17:01:16 -04:00
update_dev_time ( bdev ) ;
2020-02-14 00:24:31 +09:00
}
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
For device removal and replace we call btrfs_find_device_by_devspec,
which if we give it a device path and nothing else will call
btrfs_get_dev_args_from_path, which opens the block device and reads the
super block and then looks up our device based on that.
However at this point we're holding the sb write "lock", so reading the
block device pulls in the dependency of ->open_mutex, which produces the
following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #405 Not tainted
------------------------------------------------------
losetup/11576 is trying to acquire lock:
ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
but task is already holding lock:
ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_get_by_dev.part.0+0x56/0x3c0
blkdev_get_by_path+0x98/0xa0
btrfs_get_bdev_and_sb+0x1b/0xb0
btrfs_find_device_by_devspec+0x12b/0x1c0
btrfs_rm_device+0x127/0x610
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);
*** DEADLOCK ***
1 lock held by losetup/11576:
#0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
stack backtrace:
CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f31b02404cb
Instead what we want to do is populate our device lookup args before we
grab any locks, and then pass these args into btrfs_rm_device(). From
there we can find the device and do the appropriate removal.
Suggested-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-05 16:12:44 -04:00
int btrfs_rm_device ( struct btrfs_fs_info * fs_info ,
struct btrfs_dev_lookup_args * args ,
struct block_device * * bdev , fmode_t * mode )
2016-02-13 10:01:33 +08:00
{
struct btrfs_device * device ;
2011-04-20 10:09:16 +00:00
struct btrfs_fs_devices * cur_devices ;
2018-04-12 10:29:30 +08:00
struct btrfs_fs_devices * fs_devices = fs_info - > fs_devices ;
2008-11-17 21:11:30 -05:00
u64 num_devices ;
2008-05-07 11:43:44 -04:00
int ret = 0 ;
btrfs: do not take the uuid_mutex in btrfs_rm_device
We got the following lockdep splat while running fstests (specifically
btrfs/003 and btrfs/020 in a row) with the new rc. This was uncovered
by 87579e9b7d8d ("loop: use worker per cgroup instead of kworker") which
converted loop to using workqueues, which comes with lockdep
annotations that don't exist with kworkers. The lockdep splat is as
follows:
WARNING: possible circular locking dependency detected
5.14.0-rc2-custom+ #34 Not tainted
------------------------------------------------------
losetup/156417 is trying to acquire lock:
ffff9c7645b02d38 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x84/0x600
but task is already holding lock:
ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #5 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0xba/0x7c0
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x28/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x163/0x3a0
path_openat+0x74d/0xa40
do_filp_open+0x9c/0x140
do_sys_openat2+0xb1/0x170
__x64_sys_openat+0x54/0x90
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #4 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0xba/0x7c0
blkdev_get_by_dev.part.0+0xd1/0x3c0
blkdev_get_by_path+0xc0/0xd0
btrfs_scan_one_device+0x52/0x1f0 [btrfs]
btrfs_control_ioctl+0xac/0x170 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (uuid_mutex){+.+.}-{3:3}:
__mutex_lock+0xba/0x7c0
btrfs_rm_device+0x48/0x6a0 [btrfs]
btrfs_ioctl+0x2d1c/0x3110 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#11){.+.+}-{0:0}:
lo_write_bvec+0x112/0x290 [loop]
loop_process_work+0x25f/0xcb0 [loop]
process_one_work+0x28f/0x5d0
worker_thread+0x55/0x3c0
kthread+0x140/0x170
ret_from_fork+0x22/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x266/0x5d0
worker_thread+0x55/0x3c0
kthread+0x140/0x170
ret_from_fork+0x22/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x1130/0x1dc0
lock_acquire+0xf5/0x320
flush_workqueue+0xae/0x600
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x650 [loop]
lo_ioctl+0x29d/0x780 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);
*** DEADLOCK ***
1 lock held by losetup/156417:
#0: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]
stack backtrace:
CPU: 8 PID: 156417 Comm: losetup Not tainted 5.14.0-rc2-custom+ #34
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0x10a/0x120
__lock_acquire+0x1130/0x1dc0
lock_acquire+0xf5/0x320
? flush_workqueue+0x84/0x600
flush_workqueue+0xae/0x600
? flush_workqueue+0x84/0x600
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x650 [loop]
lo_ioctl+0x29d/0x780 [loop]
? __lock_acquire+0x3a0/0x1dc0
? update_dl_rq_load_avg+0x152/0x360
? lock_is_held_type+0xa5/0x120
? find_held_lock.constprop.0+0x2b/0x80
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f645884de6b
Usually the uuid_mutex exists to protect the fs_devices that map
together all of the devices that match a specific uuid. In rm_device
we're messing with the uuid of a device, so it makes sense to protect
that here.
However in doing that it pulls in a whole host of lockdep dependencies,
as we call mnt_may_write() on the sb before we grab the uuid_mutex, thus
we end up with the dependency chain under the uuid_mutex being added
under the normal sb write dependency chain, which causes problems with
loop devices.
We don't need the uuid mutex here however. If we call
btrfs_scan_one_device() before we scratch the super block we will find
the fs_devices and not find the device itself and return EBUSY because
the fs_devices is open. If we call it after the scratch happens it will
not appear to be a valid btrfs file system.
We do not need to worry about other fs_devices modifying operations here
because we're protected by the exclusive operations locking.
So drop the uuid_mutex here in order to fix the lockdep splat.
A more detailed explanation from the discussion:
We are worried about rm and scan racing with each other, before this
change we'll zero the device out under the UUID mutex so when scan does
run it'll make sure that it can go through the whole device scan thing
without rm messing with us.
We aren't worried if the scratch happens first, because the result is we
don't think this is a btrfs device and we bail out.
The only case we are concerned with is we scratch _after_ scan is able
to read the superblock and gets a seemingly valid super block, so let's
consider this case.
Scan will call device_list_add() with the device we're removing. We'll
call find_fsid_with_metadata_uuid() and get our fs_devices for this
UUID. At this point we lock the fs_devices->device_list_mutex. This is
what protects us in this case, but we have two cases here.
1. We haven't yet reached the device removal part of the RM. We found our device,
and device name matches our path, we go down and we set total_devices
to our super number of devices, which doesn't affect anything because
we haven't done the remove yet.
2. We are past the device removal part, which is protected by the
device_list_mutex. Scan doesn't find the device, it goes down and
does the
if (fs_devices->opened)
return -EBUSY;
check and we bail out.
Nothing about this situation is ideal, but the lockdep splat is real,
and the fix is safe, though admittedly a bit scary looking.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copy more from the discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-27 17:01:14 -04:00
/*
* The device list in fs_devices is accessed without locks ( neither
* uuid_mutex nor device_list_mutex ) as it won ' t change on a mounted
* filesystem and another device rm cannot run .
*/
2018-08-10 13:53:21 +08:00
num_devices = btrfs_num_devices ( fs_info ) ;
2012-11-06 13:15:27 +01:00
2016-06-22 18:54:23 -04:00
ret = btrfs_check_raid_min_devices ( fs_info , num_devices - 1 ) ;
2016-02-13 10:01:33 +08:00
if ( ret )
2008-05-07 11:43:44 -04:00
goto out ;
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
For device removal and replace we call btrfs_find_device_by_devspec,
which if we give it a device path and nothing else will call
btrfs_get_dev_args_from_path, which opens the block device and reads the
super block and then looks up our device based on that.
However at this point we're holding the sb write "lock", so reading the
block device pulls in the dependency of ->open_mutex, which produces the
following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #405 Not tainted
------------------------------------------------------
losetup/11576 is trying to acquire lock:
ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
but task is already holding lock:
ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_get_by_dev.part.0+0x56/0x3c0
blkdev_get_by_path+0x98/0xa0
btrfs_get_bdev_and_sb+0x1b/0xb0
btrfs_find_device_by_devspec+0x12b/0x1c0
btrfs_rm_device+0x127/0x610
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);
*** DEADLOCK ***
1 lock held by losetup/11576:
#0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
stack backtrace:
CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f31b02404cb
Instead what we want to do is populate our device lookup args before we
grab any locks, and then pass these args into btrfs_rm_device(). From
there we can find the device and do the appropriate removal.
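The ordering described above can be modelled in userspace. This is a minimal sketch, not kernel code: `sb_write_lock`, `get_dev_args_from_path()`, `rm_device()` and the rule that derives a devid from the path are all hypothetical stand-ins chosen for illustration. The only point it makes is that the step which opens the device runs before the lock is taken, and that only an in-memory lookup runs under the lock.

```python
import threading
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins; none of these names are the kernel API.
sb_write_lock = threading.Lock()
events = []  # records the order of operations so it can be inspected


@dataclass
class DevLookupArgs:
    devid: int = 0
    path: Optional[str] = None


def get_dev_args_from_path(path: str) -> DevLookupArgs:
    """Model of the step that opens the block device and reads its super block.

    It must run while the write lock is not held.
    """
    assert not sb_write_lock.locked(), "device opened while holding the lock"
    events.append("read_super")
    # Assumption for the sketch: the devid is simply the length of the path.
    return DevLookupArgs(devid=len(path), path=path)


def rm_device(devices: dict, args: DevLookupArgs) -> int:
    """Model of the removal step: only in-memory lookups happen under the lock."""
    with sb_write_lock:
        events.append("locked_lookup")
        if args.devid not in devices:
            return -2  # stands for -ENOENT
        del devices[args.devid]
        return 0


def remove_by_path(devices: dict, path: str) -> int:
    args = get_dev_args_from_path(path)  # 1. populate lookup args, no locks held
    return rm_device(devices, args)      # 2. take the lock, then find and remove
```

Calling `remove_by_path({8: "sdd"}, "/dev/sdd")` records `read_super` before `locked_lookup`, which is the order the fix relies on.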
Suggested-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-05 16:12:44 -04:00
device = btrfs_find_device(fs_info->fs_devices, args);
if (!device) {
if (args->missing)
2018-09-03 12:46:14 +03:00
ret = BTRFS_ERROR_DEV_MISSING_NOT_FOUND;
else
2021-10-05 16:12:44 -04:00
ret = -ENOENT;
2013-01-29 18:40:14 -05:00
goto out;
2018-09-03 12:46:14 +03:00
}
2008-05-13 13:46:40 -04:00
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
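The device check described above can be sketched as follows. This is an illustrative model only: a plain Python set stands in for the red-black tree, and `SwapfilePins` and `try_remove_device()` are hypothetical names, not kernel functions. It shows the intended behaviour that removal is refused with ETXTBSY while the device holds part of an active swapfile.

```python
import errno


class SwapfilePins:
    """Tracks ids of devices or block groups that hold part of an active swapfile.

    Assumption: a set replaces the red-black tree mentioned in the text.
    """

    def __init__(self):
        self._pinned = set()

    def pin(self, obj_id):
        self._pinned.add(obj_id)

    def unpin(self, obj_id):
        self._pinned.discard(obj_id)

    def is_pinned(self, obj_id) -> bool:
        return obj_id in self._pinned


def try_remove_device(pins: SwapfilePins, devid: int) -> int:
    """Refuse removal while the device is pinned by an active swapfile."""
    if pins.is_pinned(devid):
        return -errno.ETXTBSY
    return 0
```

Once the swapfile is deactivated and the device is unpinned, the same call is allowed to proceed.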
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-03 10:28:12 -07:00
if (btrfs_pinned_by_swapfile(fs_info, device)) {
btrfs_warn_in_rcu(fs_info,
"cannot remove device %s (devid %llu) due to active swapfile",
rcu_str_deref(device->name), device->devid);
ret = -ETXTBSY;
goto out;
}
2017-12-04 12:54:55 +08:00
if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
2013-05-17 10:52:45 +00:00
ret = BTRFS_ERROR_DEV_TGT_REPLACE;
2016-02-13 10:01:36 +08:00
goto out;
2012-11-05 18:29:28 +01:00
}
2017-12-04 12:54:52 +08:00
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
fs_info->fs_devices->rw_devices == 1) {
2013-05-17 10:52:45 +00:00
ret = BTRFS_ERROR_DEV_ONLY_WRITABLE;
2016-02-13 10:01:36 +08:00
goto out;
2008-11-17 21:11:30 -05:00
}
2017-12-04 12:54:52 +08:00
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
2016-10-04 19:34:27 +02:00
mutex_lock(&fs_info->chunk_mutex);
2008-11-17 21:11:30 -05:00
list_del_init(&device->dev_alloc_list);
2014-09-03 21:35:47 +08:00
device->fs_devices->rw_devices--;
2016-10-04 19:34:27 +02:00
mutex_unlock(&fs_info->chunk_mutex);
2008-05-13 13:46:40 -04:00
}
2008-05-07 11:43:44 -04:00
ret = btrfs_shrink_device(device, 0);
btrfs: fix readahead hang and use-after-free after removing a device
Very sporadically I had test case btrfs/069 from fstests hanging (for
years, it is not a recent regression), with the following traces in
dmesg/syslog:
[162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
[162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
[162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
[162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
[162513.514318] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.514747] task:btrfs-transacti state:D stack: 0 pid:1356167 ppid: 2 flags:0x00004000
[162513.514751] Call Trace:
[162513.514761] __schedule+0x5ce/0xd00
[162513.514765] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.514771] schedule+0x46/0xf0
[162513.514844] wait_current_trans+0xde/0x140 [btrfs]
[162513.514850] ? finish_wait+0x90/0x90
[162513.514864] start_transaction+0x37c/0x5f0 [btrfs]
[162513.514879] transaction_kthread+0xa4/0x170 [btrfs]
[162513.514891] ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
[162513.514894] kthread+0x153/0x170
[162513.514897] ? kthread_stop+0x2c0/0x2c0
[162513.514902] ret_from_fork+0x22/0x30
[162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
[162513.515192] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.515680] task:fsstress state:D stack: 0 pid:1356184 ppid:1356177 flags:0x00004000
[162513.515682] Call Trace:
[162513.515688] __schedule+0x5ce/0xd00
[162513.515691] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.515697] schedule+0x46/0xf0
[162513.515712] wait_current_trans+0xde/0x140 [btrfs]
[162513.515716] ? finish_wait+0x90/0x90
[162513.515729] start_transaction+0x37c/0x5f0 [btrfs]
[162513.515743] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
[162513.515753] btrfs_sync_fs+0x61/0x1c0 [btrfs]
[162513.515758] ? __ia32_sys_fdatasync+0x20/0x20
[162513.515761] iterate_supers+0x87/0xf0
[162513.515765] ksys_sync+0x60/0xb0
[162513.515768] __do_sys_sync+0xa/0x10
[162513.515771] do_syscall_64+0x33/0x80
[162513.515774] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.515781] RIP: 0033:0x7f5238f50bd7
[162513.515782] Code: Bad RIP value.
[162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
[162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
[162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
[162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
[162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
[162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
[162513.516064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.516617] task:fsstress state:D stack: 0 pid:1356185 ppid:1356177 flags:0x00000000
[162513.516620] Call Trace:
[162513.516625] __schedule+0x5ce/0xd00
[162513.516628] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.516634] schedule+0x46/0xf0
[162513.516647] wait_current_trans+0xde/0x140 [btrfs]
[162513.516650] ? finish_wait+0x90/0x90
[162513.516662] start_transaction+0x4d7/0x5f0 [btrfs]
[162513.516679] btrfs_setxattr_trans+0x3c/0x100 [btrfs]
[162513.516686] __vfs_setxattr+0x66/0x80
[162513.516691] __vfs_setxattr_noperm+0x70/0x200
[162513.516697] vfs_setxattr+0x6b/0x120
[162513.516703] setxattr+0x125/0x240
[162513.516709] ? lock_acquire+0xb1/0x480
[162513.516712] ? mnt_want_write+0x20/0x50
[162513.516721] ? rcu_read_lock_any_held+0x8e/0xb0
[162513.516723] ? preempt_count_add+0x49/0xa0
[162513.516725] ? __sb_start_write+0x19b/0x290
[162513.516727] ? preempt_count_add+0x49/0xa0
[162513.516732] path_setxattr+0xba/0xd0
[162513.516739] __x64_sys_setxattr+0x27/0x30
[162513.516741] do_syscall_64+0x33/0x80
[162513.516743] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.516745] RIP: 0033:0x7f5238f56d5a
[162513.516746] Code: Bad RIP value.
[162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
[162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
[162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
[162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
[162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
[162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
[162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
[162513.517064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.517763] task:fsstress state:D stack: 0 pid:1356196 ppid:1356177 flags:0x00004000
[162513.517780] Call Trace:
[162513.517786] __schedule+0x5ce/0xd00
[162513.517789] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.517796] schedule+0x46/0xf0
[162513.517810] wait_current_trans+0xde/0x140 [btrfs]
[162513.517814] ? finish_wait+0x90/0x90
[162513.517829] start_transaction+0x37c/0x5f0 [btrfs]
[162513.517845] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
[162513.517857] btrfs_sync_fs+0x61/0x1c0 [btrfs]
[162513.517862] ? __ia32_sys_fdatasync+0x20/0x20
[162513.517865] iterate_supers+0x87/0xf0
[162513.517869] ksys_sync+0x60/0xb0
[162513.517872] __do_sys_sync+0xa/0x10
[162513.517875] do_syscall_64+0x33/0x80
[162513.517878] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.517881] RIP: 0033:0x7f5238f50bd7
[162513.517883] Code: Bad RIP value.
[162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
[162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
[162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
[162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
[162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
[162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
[162513.518298] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.519157] task:fsstress state:D stack: 0 pid:1356197 ppid:1356177 flags:0x00000000
[162513.519160] Call Trace:
[162513.519165] __schedule+0x5ce/0xd00
[162513.519168] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.519174] schedule+0x46/0xf0
[162513.519190] wait_current_trans+0xde/0x140 [btrfs]
[162513.519193] ? finish_wait+0x90/0x90
[162513.519206] start_transaction+0x4d7/0x5f0 [btrfs]
[162513.519222] btrfs_create+0x57/0x200 [btrfs]
[162513.519230] lookup_open+0x522/0x650
[162513.519246] path_openat+0x2b8/0xa50
[162513.519270] do_filp_open+0x91/0x100
[162513.519275] ? find_held_lock+0x32/0x90
[162513.519280] ? lock_acquired+0x33b/0x470
[162513.519285] ? do_raw_spin_unlock+0x4b/0xc0
[162513.519287] ? _raw_spin_unlock+0x29/0x40
[162513.519295] do_sys_openat2+0x20d/0x2d0
[162513.519300] do_sys_open+0x44/0x80
[162513.519304] do_syscall_64+0x33/0x80
[162513.519307] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.519309] RIP: 0033:0x7f5238f4a903
[162513.519310] Code: Bad RIP value.
[162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
[162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
[162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
[162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
[162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
[162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
[162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
[162513.519727] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.520508] task:btrfs state:D stack: 0 pid:1356211 ppid:1356178 flags:0x00004002
[162513.520511] Call Trace:
[162513.520516] __schedule+0x5ce/0xd00
[162513.520519] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.520525] schedule+0x46/0xf0
[162513.520544] btrfs_scrub_pause+0x11f/0x180 [btrfs]
[162513.520548] ? finish_wait+0x90/0x90
[162513.520562] btrfs_commit_transaction+0x45a/0xc30 [btrfs]
[162513.520574] ? start_transaction+0xe0/0x5f0 [btrfs]
[162513.520596] btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
[162513.520619] btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
[162513.520639] btrfs_ioctl+0x2a25/0x36f0 [btrfs]
[162513.520643] ? do_sigaction+0xf3/0x240
[162513.520645] ? find_held_lock+0x32/0x90
[162513.520648] ? do_sigaction+0xf3/0x240
[162513.520651] ? lock_acquired+0x33b/0x470
[162513.520655] ? _raw_spin_unlock_irq+0x24/0x50
[162513.520657] ? lockdep_hardirqs_on+0x7d/0x100
[162513.520660] ? _raw_spin_unlock_irq+0x35/0x50
[162513.520662] ? do_sigaction+0xf3/0x240
[162513.520671] ? __x64_sys_ioctl+0x83/0xb0
[162513.520672] __x64_sys_ioctl+0x83/0xb0
[162513.520677] do_syscall_64+0x33/0x80
[162513.520679] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.520681] RIP: 0033:0x7fc3cd307d87
[162513.520682] Code: Bad RIP value.
[162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
[162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
[162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
[162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
[162513.520703]
Showing all locks held in the system:
[162513.520712] 1 lock held by khungtaskd/54:
[162513.520713] #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
[162513.520728] 1 lock held by in:imklog/596:
[162513.520729] #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
[162513.520782] 1 lock held by btrfs-transacti/1356167:
[162513.520784] #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
[162513.520798] 1 lock held by btrfs/1356190:
[162513.520800] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
[162513.520805] 1 lock held by fsstress/1356184:
[162513.520806] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
[162513.520811] 3 locks held by fsstress/1356185:
[162513.520812] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
[162513.520815] #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
[162513.520820] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
[162513.520833] 1 lock held by fsstress/1356196:
[162513.520834] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
[162513.520838] 3 locks held by fsstress/1356197:
[162513.520839] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
[162513.520843] #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
[162513.520846] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
[162513.520858] 2 locks held by btrfs/1356211:
[162513.520859] #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
[162513.520877] #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
This was weird because the stack traces show that a transaction commit,
triggered by a device replace operation, is blocking trying to pause any
running scrubs but there are no stack traces of blocked tasks doing a
scrub.
After poking around with drgn, I noticed there was a scrub task that was
constantly running and blocking for short periods of time:
>>> t = find_task(prog, 1356190)
>>> prog.stack_trace(t)
#0 __schedule+0x5ce/0xcfc
#1 schedule+0x46/0xe4
#2 schedule_timeout+0x1df/0x475
#3 btrfs_reada_wait+0xda/0x132
#4 scrub_stripe+0x2a8/0x112f
#5 scrub_chunk+0xcd/0x134
#6 scrub_enumerate_chunks+0x29e/0x5ee
#7 btrfs_scrub_dev+0x2d5/0x91b
#8 btrfs_ioctl+0x7f5/0x36e7
#9 __x64_sys_ioctl+0x83/0xb0
#10 do_syscall_64+0x33/0x77
#11 entry_SYSCALL_64+0x7c/0x156
Which corresponds to:
int btrfs_reada_wait(void *handle)
{
struct reada_control *rc = handle;
struct btrfs_fs_info *fs_info = rc->fs_info;
while (atomic_read(&rc->elems)) {
if (!atomic_read(&fs_info->reada_works_cnt))
reada_start_machine(fs_info);
wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
(HZ + 9) / 10);
}
(...)
So the counter "rc->elems" was set to 1 and never decreased to 0, causing
the scrub task to loop forever in that function. Then I used the following
script for drgn to check the readahead requests:
$ cat dump_reada.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
reinterpret, sizeof
from drgn.helpers.linux import *
mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
pass
if mnt is None:
sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
sys.exit(1)
fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
def dump_re(re):
nzones = re.nzones.value_()
print(f're at {hex(re.value_())}')
print(f'\t logical {re.logical.value_()}')
print(f'\t refcnt {re.refcnt.value_()}')
print(f'\t nzones {nzones}')
for i in range(nzones):
dev = re.zones[i].device
name = dev.name.str.string_()
print(f'\t\t dev id {dev.devid.value_()} name {name}')
print()
for _, e in radix_tree_for_each(fs_info.reada_tree):
re = cast('struct reada_extent *', e)
dump_re(re)
$ drgn dump_reada.py
re at 0xffff8f3da9d25ad8
logical 38928384
refcnt 1
nzones 1
dev id 0 name b'/dev/sdd'
$
So there was one readahead extent with a single zone corresponding to the
source device of that last device replace operation logged in dmesg/syslog.
Also the ID of that zone's device was 0 which is a special value set in
the source device of a device replace operation when the operation finishes
(constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
confirming again that device /dev/sdd was the source of a device replace
operation.
Normally there should be as many zones in the readahead extent as there are
devices, and I wasn't expecting the extent to be in a block group with a
'single' profile, so I went and confirmed with the following drgn script
that there weren't any single profile block groups:
$ cat dump_block_groups.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
reinterpret, sizeof
from drgn.helpers.linux import *
mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
pass
if mnt is None:
sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
sys.exit(1)
fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
BTRFS_BLOCK_GROUP_DATA = (1 << 0)
BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
BTRFS_BLOCK_GROUP_DUP = (1 << 5)
BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)
def bg_flags_string(bg):
flags = bg.flags.value_()
ret = ''
if flags & BTRFS_BLOCK_GROUP_DATA:
ret = 'data'
if flags & BTRFS_BLOCK_GROUP_METADATA:
if len(ret) > 0:
ret += '|'
ret += 'meta'
if flags & BTRFS_BLOCK_GROUP_SYSTEM:
if len(ret) > 0:
ret += '|'
ret += 'system'
if flags & BTRFS_BLOCK_GROUP_RAID0:
ret += ' raid0'
elif flags & BTRFS_BLOCK_GROUP_RAID1:
ret += ' raid1'
elif flags & BTRFS_BLOCK_GROUP_DUP:
ret += ' dup'
elif flags & BTRFS_BLOCK_GROUP_RAID10:
ret += ' raid10'
elif flags & BTRFS_BLOCK_GROUP_RAID5:
ret += ' raid5'
elif flags & BTRFS_BLOCK_GROUP_RAID6:
ret += ' raid6'
elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
ret += ' raid1c3'
elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
ret += ' raid1c4'
else:
ret += ' single'
return ret
def dump_bg(bg):
print()
print(f'block group at {hex(bg.value_())}')
print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')
bg_root = fs_info.block_group_cache_tree.address_of_()
for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
dump_bg(bg)
$ drgn dump_block_groups.py
block group at 0xffff8f3d673b0400
start 22020096 length 16777216
flags 258 - system raid6
block group at 0xffff8f3d53ddb400
start 38797312 length 536870912
flags 260 - meta raid6
block group at 0xffff8f3d5f4d9c00
start 575668224 length 2147483648
flags 257 - data raid6
block group at 0xffff8f3d08189000
start 2723151872 length 67108864
flags 258 - system raid6
block group at 0xffff8f3db70ff000
start 2790260736 length 1073741824
flags 260 - meta raid6
block group at 0xffff8f3d5f4dd800
start 3864002560 length 67108864
flags 258 - system raid6
block group at 0xffff8f3d67037000
start 3931111424 length 2147483648
flags 257 - data raid6
$
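As a cross-check of the output above, the numeric flags can be decoded by hand: 258 = 256 + 2 (RAID6 | SYSTEM), 260 = 256 + 4 (RAID6 | METADATA) and 257 = 256 + 1 (RAID6 | DATA). The short standalone sketch below does the same decoding without drgn; the bit values are copied from the script above, while `decode_flags()` is a name introduced here for illustration.

```python
# Bit values copied from the dump_block_groups.py script above.
TYPE_BITS = {"data": 1 << 0, "meta": 1 << 2, "system": 1 << 1}
PROFILE_BITS = {
    "raid0": 1 << 3,
    "raid1": 1 << 4,
    "dup": 1 << 5,
    "raid10": 1 << 6,
    "raid5": 1 << 7,
    "raid6": 1 << 8,
    "raid1c3": 1 << 9,
    "raid1c4": 1 << 10,
}


def decode_flags(flags: int) -> str:
    """Return 'type[|type] profile'; no profile bit set means 'single'."""
    types = [name for name, bit in TYPE_BITS.items() if flags & bit]
    profiles = [name for name, bit in PROFILE_BITS.items() if flags & bit]
    profile = profiles[0] if profiles else "single"
    return "|".join(types) + " " + profile


for value in (258, 260, 257):
    print(value, "-", decode_flags(value))
```

None of the three values decodes to `single`, which is consistent with the conclusion drawn from the dump.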
So there were only 2 reasons left for having a readahead extent with a
single zone: reada_find_zone(), called when creating a readahead extent,
returned NULL either because we failed to find the corresponding block
group or because a memory allocation failed. With some additional and
custom tracing I figured out that on every further occurrence of the
problem the block group had just been deleted when we were looping to
create the zones for the readahead extent (at reada_find_extent()), so we
ended up with only one zone in the readahead extent, corresponding to a
device that ends up getting replaced.
So after figuring that out it became obvious why the hang happens:
1) Task A starts a scrub on any device of the filesystem, except for
device /dev/sdd;
2) Task B starts a device replace with /dev/sdd as the source device;
3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
starting to scrub a stripe from block group X. This call to
btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
calls reada_add_block(), it passes the logical address of the extent
tree's root node as its 'logical' argument - a value of 38928384;
4) Task A then enters reada_find_extent(), called from reada_add_block().
It finds there isn't any existing readahead extent for the logical
address 38928384, so it proceeds to the path of creating a new one.
It calls btrfs_map_block() to find out which stripes exist for the block
group X. On the first iteration of the for loop that iterates over the
stripes, it finds the stripe for device /dev/sdd, so it creates one
zone for that device and adds it to the readahead extent. Before getting
into the second iteration of the loop, the cleanup kthread deletes block
group X because it was empty. So in the iterations for the remaining
stripes it does not add more zones to the readahead extent, because the
calls to reada_find_zone() returned NULL because they couldn't find
block group X anymore.
As a result the new readahead extent has a single zone, corresponding to
the device /dev/sdd;
5) Before task A returns to btrfs_reada_add() and queues the readahead job
for the readahead work queue, task B finishes the device replace and at
btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
device /dev/sdg;
6) Task A returns to reada_add_block(), which increments the counter
"->elems" of the reada_control structure allocated at btrfs_reada_add().
Then it returns back to btrfs_reada_add() and calls
reada_start_machine(). This queues a job in the readahead work queue to
run the function reada_start_machine_worker(), which calls
__reada_start_machine().
At __reada_start_machine() we take the device list mutex and for each
device found in the current device list, we call
reada_start_machine_dev() to start the readahead work. However at this
point the device /dev/sdd was already freed and is not in the device
list anymore.
This means the corresponding readahead for the extent at 38928384 is
never started, and therefore the "->elems" counter of the reada_control
structure allocated at btrfs_reada_add() never goes down to 0, causing
the call to btrfs_reada_wait(), done by the scrub task, to wait forever.
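The hang can be reproduced in a small deterministic model. This is a sketch under simplifying assumptions, not the kernel logic: `Extent`, `start_machine()` and `wait()` are hypothetical names, the worker is called inline instead of from a work queue, and the wait loop is bounded so that the model terminates. It shows that an extent whose only zone belongs to a device missing from the device list is never serviced, so the counter never reaches 0.

```python
class Extent:
    """A readahead extent with one zone per device it can be read from."""

    def __init__(self, logical, zone_devices):
        self.logical = logical
        self.zone_devices = list(zone_devices)


def start_machine(device_list, pending, control):
    """One pass of the worker: only devices in the current list are serviced."""
    for dev in device_list:
        for extent in [e for e in pending if dev in e.zone_devices]:
            pending.remove(extent)
            control["elems"] -= 1


def wait(device_list, pending, control, max_rounds=5):
    """Bounded model of the wait loop; returns True once elems drops to 0."""
    for _ in range(max_rounds):
        if control["elems"] == 0:
            return True
        start_machine(device_list, pending, control)
    return control["elems"] == 0


# The extent has a single zone on /dev/sdd, which was replaced by /dev/sdg.
control = {"elems": 1}
pending = [Extent(38928384, ["/dev/sdd"])]
print(wait(["/dev/sde", "/dev/sdf", "/dev/sdg"], pending, control))
```

In the model the loop gives up after `max_rounds`; the real wait loop has no such bound, which is why the scrub task never returns.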
Note that the readahead request can be made either after the device replace
started or before it started, however in practice it is very unlikely that a
device replace is able to start after a readahead request is made and is
able to complete before the readahead request completes - maybe only on a
very small and nearly empty filesystem.
This hang however is not the only problem we can have with readahead and
device removals. When the readahead extent has zones other than the
one corresponding to the device that is being removed (either by a device
replace or a device remove operation), we risk having a use-after-free on
the device when dropping the last reference of the readahead extent.
For example if we create a readahead extent with two zones, one for the
device /dev/sdd and one for the device /dev/sde:
1) Before the readahead worker starts, the device /dev/sdd is removed,
and the corresponding btrfs_device structure is freed. However the
readahead extent still has the zone pointing to the device structure;
2) When the readahead worker starts, it only finds device /dev/sde in the
current device list of the filesystem;
3) It starts the readahead work, at reada_start_machine_dev(), using the
device /dev/sde;
4) Then when it finishes reading the extent from device /dev/sde, it calls
__readahead_hook() which ends up dropping the last reference on the
readahead extent through the last call to reada_extent_put();
5) At reada_extent_put() it iterates over each zone of the readahead extent
and attempts to delete an element from the device's 'reada_extents'
radix tree, resulting in a use-after-free, as the device pointer of the
zone for /dev/sdd is now stale. We can also access the device after
dropping the last reference of a zone, through reada_zone_release(),
also called by reada_extent_put().
A device remove suffers from the same problem. However, since it shrinks the
device size down to zero before removing the device, it is very unlikely to
still have readahead requests not completed by the time we free the device;
the only possibility is if the device has very little space allocated.
While the hang problem is exclusive to scrub, since it is currently the
only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
problem affects any path that triggers readahead, which includes
btree_readahead_hook() and __readahead_hook() (a readahead worker can
trigger readahead for the children of a node) for example - any path that
ends up calling reada_add_block() can trigger the use-after-free after a
device is removed.
So fix this by waiting for any readahead requests for a device to complete
before removing a device, ensuring that while waiting for existing ones no
new ones can be made.
This problem has been around for a very long time - the readahead code was
added in 2011, device remove has existed since 2008 and device replace was
introduced in 2013, so it is hard to pick a specific commit for a git Fixes tag.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-12 11:55:24 +01:00
if ( ! ret )
btrfs_reada_remove_dev ( device ) ;
2008-05-07 11:43:44 -04:00
if ( ret )
2011-02-15 18:14:25 +00:00
goto error_undo ;
2008-05-07 11:43:44 -04:00
2012-11-05 18:29:28 +01:00
/*
* TODO : the superblock still includes this device in its num_devices
* counter although write_all_supers ( ) is not locked out . This
* could give a filesystem state which requires a degraded mount .
*/
2019-03-20 16:31:53 +01:00
ret = btrfs_rm_dev_item ( device ) ;
2008-05-07 11:43:44 -04:00
if ( ret )
2011-02-15 18:14:25 +00:00
goto error_undo ;
2008-05-07 11:43:44 -04:00
2017-12-04 12:54:53 +08:00
clear_bit ( BTRFS_DEV_STATE_IN_FS_METADATA , & device - > dev_state ) ;
2019-03-20 16:32:55 +01:00
btrfs_scrub_cancel_dev ( device ) ;
2009-06-10 15:17:02 -04:00
/*
* the device list mutex makes sure that we don ' t change
* the device list while someone else is writing out all
Btrfs: fix race between removing a dev and writing sbs
This change fixes an issue when removing a device and writing
all super blocks run simultaneously. Here are the steps necessary
for the issue to happen:
1) disk-io.c:write_all_supers() gets a number of N devices from the
super_copy, so it will not panic if it fails to write super blocks
for N - 1 devices;
2) Then it tries to acquire the device_list_mutex, but blocks because
volumes.c:btrfs_rm_device() got it first;
3) btrfs_rm_device() removes the device from the list, then unlocks the
mutex and after the unlock it updates the number of devices in
super_copy to N - 1.
4) write_all_supers() finally acquires the mutex, iterates over all the
devices in the list and gets N - 1 errors, that is, it failed to write
super blocks to all the devices;
5) Because write_all_supers() thinks there are a total of N devices, it
considers N - 1 errors to be ok, and therefore won't panic.
So this change just makes sure that write_all_supers() reads the number
of devices from super_copy after it acquires the device_list_mutex.
Conversely, it changes btrfs_rm_device() to update the number of devices
in super_copy before it releases the device list mutex.
The code path to add a new device (volumes.c:btrfs_init_new_device),
already has the right behaviour: it updates the number of devices in
super_copy while holding the device_list_mutex.
The only code path that doesn't lock the device list mutex
before updating the number of devices in the super copy is
disk-io.c:next_root_backup(), called by open_ctree() during
mount time where concurrency issues can't happen.
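The ordering problem in steps 1-5 can be replayed with a small user-space
model. This is only a sketch in Python, not kernel code: replay() and its
arguments are hypothetical names, the interleaving is replayed sequentially
instead of with real threads, and every super block write is assumed to fail.

```python
def replay(num_devices, fixed):
    """Replay steps 1-5 with all super block writes failing.

    Returns True when the writer treats the outcome as acceptable.
    'fixed' selects whether the device count is read after the
    device list mutex is taken (the behaviour after this change).
    """
    device_list = [f"dev{i}" for i in range(num_devices)]
    super_num_devices = num_devices  # stand-in for super_copy's counter

    if not fixed:
        # Step 1: the writer samples N before taking the mutex.
        max_errors = super_num_devices - 1

    # Steps 2-3: the remover holds the mutex, drops one device and
    # the counter ends up at N - 1 before the writer runs again.
    device_list.pop()
    super_num_devices -= 1

    if fixed:
        # The writer reads the counter only once it holds the mutex.
        max_errors = super_num_devices - 1

    # Step 4: the writer iterates the list and every write fails.
    errors = len(device_list)

    # Step 5: compare against the tolerated number of errors.
    return errors <= max_errors
```

With three devices, replay(3, fixed=False) accepts a run in which no super
block was written at all, while replay(3, fixed=True) rejects it.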
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 15:41:36 +01:00
* the device supers . Whoever is writing all supers , should
* lock the device list mutex before getting the number of
* devices in the super block ( super_copy ) . Conversely ,
* whoever updates the number of devices in the super block
* ( super_copy ) should hold the device list mutex .
2009-06-10 15:17:02 -04:00
*/
2011-04-20 10:09:16 +00:00
2018-04-12 10:29:31 +08:00
/*
* In normal cases the cur_devices = = fs_devices . But in case
* of deleting a seed device , the cur_devices should point to
2021-08-18 12:15:48 +08:00
* its own fs_devices listed under the fs_devices - > seed_list .
2018-04-12 10:29:31 +08:00
*/
2011-04-20 10:09:16 +00:00
cur_devices = device - > fs_devices ;
2018-04-12 10:29:30 +08:00
mutex_lock ( & fs_devices - > device_list_mutex ) ;
2011-04-20 10:09:16 +00:00
list_del_rcu ( & device - > dev_list ) ;
2009-06-10 15:17:02 -04:00
2018-04-12 10:29:31 +08:00
cur_devices - > num_devices - - ;
cur_devices - > total_devices - - ;
2018-07-03 17:07:23 +08:00
/* Update total_devices of the parent fs_devices if it's seed */
if ( cur_devices ! = fs_devices )
fs_devices - > total_devices - - ;
2008-11-17 21:11:30 -05:00
2017-12-04 12:54:54 +08:00
if ( test_bit ( BTRFS_DEV_STATE_MISSING , & device - > dev_state ) )
2018-04-12 10:29:31 +08:00
cur_devices - > missing_devices - - ;
2010-12-13 14:56:23 -05:00
2018-07-20 19:37:50 +03:00
btrfs_assign_next_active_device ( device , NULL ) ;
2008-11-17 21:11:30 -05:00
2014-07-07 12:34:49 -05:00
if ( device - > bdev ) {
2018-04-12 10:29:31 +08:00
cur_devices - > open_devices - - ;
2014-07-07 12:34:49 -05:00
/* remove sysfs entry */
2020-09-05 01:34:27 +08:00
btrfs_sysfs_remove_device ( device ) ;
2014-07-07 12:34:49 -05:00
}
2014-06-03 11:36:00 +08:00
2016-06-22 18:54:23 -04:00
num_devices = btrfs_super_num_devices ( fs_info - > super_copy ) - 1 ;
btrfs_set_super_num_devices ( fs_info - > super_copy , num_devices ) ;
2018-04-12 10:29:30 +08:00
mutex_unlock ( & fs_devices - > device_list_mutex ) ;
2008-11-17 21:11:30 -05:00
2016-09-20 08:50:21 -04:00
/*
2021-07-27 17:01:17 -04:00
* At this point , the device is zero sized and detached from the
* devices list . All that ' s left is to zero out the old supers and
* free the device .
*
* We cannot call btrfs_close_bdev ( ) here because we ' re holding the sb
* write lock , and blkdev_put ( ) will pull in the - > open_mutex on the
* block device and its dependencies . Instead just flush the device
* and let the caller do the final blkdev_put .
2016-09-20 08:50:21 -04:00
*/
2021-07-27 17:01:17 -04:00
if ( test_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) ) {
2020-02-14 00:24:32 +09:00
btrfs_scratch_superblocks ( fs_info , device - > bdev ,
device - > name - > str ) ;
2021-07-27 17:01:17 -04:00
if ( device - > bdev ) {
sync_blockdev ( device - > bdev ) ;
invalidate_bdev ( device - > bdev ) ;
}
}
2016-09-20 08:50:21 -04:00
2021-07-27 17:01:17 -04:00
* bdev = device - > bdev ;
* mode = device - > mode ;
2019-03-27 14:24:11 +02:00
synchronize_rcu ( ) ;
btrfs_free_device ( device ) ;
2016-09-20 08:50:21 -04:00
2021-10-05 16:12:41 -04:00
/*
* This can happen if cur_devices is the private seed devices list . We
* cannot call close_fs_devices ( ) here because it expects the uuid_mutex
* to be held , but in fact we don ' t need that for the private
* seed_devices , we can simply decrement cur_devices - > opened and then
* remove it from our list and free the fs_devices .
*/
2021-10-05 16:12:39 -04:00
if ( cur_devices - > num_devices = = 0 ) {
2020-07-16 10:25:33 +03:00
list_del_init ( & cur_devices - > seed_list ) ;
2021-10-05 16:12:41 -04:00
ASSERT ( cur_devices - > opened = = 1 ) ;
cur_devices - > opened - - ;
2011-04-20 10:09:16 +00:00
free_fs_devices ( cur_devices ) ;
2008-11-17 21:11:30 -05:00
}
2008-05-07 11:43:44 -04:00
out :
return ret ;
2016-02-13 10:01:36 +08:00
2011-02-15 18:14:25 +00:00
error_undo :
btrfs: fix readahead hang and use-after-free after removing a device
Very sporadically I had test case btrfs/069 from fstests hanging (for
years, it is not a recent regression), with the following traces in
dmesg/syslog:
[162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
[162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
[162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
[162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
[162513.514318] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.514747] task:btrfs-transacti state:D stack: 0 pid:1356167 ppid: 2 flags:0x00004000
[162513.514751] Call Trace:
[162513.514761] __schedule+0x5ce/0xd00
[162513.514765] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.514771] schedule+0x46/0xf0
[162513.514844] wait_current_trans+0xde/0x140 [btrfs]
[162513.514850] ? finish_wait+0x90/0x90
[162513.514864] start_transaction+0x37c/0x5f0 [btrfs]
[162513.514879] transaction_kthread+0xa4/0x170 [btrfs]
[162513.514891] ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
[162513.514894] kthread+0x153/0x170
[162513.514897] ? kthread_stop+0x2c0/0x2c0
[162513.514902] ret_from_fork+0x22/0x30
[162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
[162513.515192] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.515680] task:fsstress state:D stack: 0 pid:1356184 ppid:1356177 flags:0x00004000
[162513.515682] Call Trace:
[162513.515688] __schedule+0x5ce/0xd00
[162513.515691] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.515697] schedule+0x46/0xf0
[162513.515712] wait_current_trans+0xde/0x140 [btrfs]
[162513.515716] ? finish_wait+0x90/0x90
[162513.515729] start_transaction+0x37c/0x5f0 [btrfs]
[162513.515743] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
[162513.515753] btrfs_sync_fs+0x61/0x1c0 [btrfs]
[162513.515758] ? __ia32_sys_fdatasync+0x20/0x20
[162513.515761] iterate_supers+0x87/0xf0
[162513.515765] ksys_sync+0x60/0xb0
[162513.515768] __do_sys_sync+0xa/0x10
[162513.515771] do_syscall_64+0x33/0x80
[162513.515774] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.515781] RIP: 0033:0x7f5238f50bd7
[162513.515782] Code: Bad RIP value.
[162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
[162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
[162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
[162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
[162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
[162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
[162513.516064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.516617] task:fsstress state:D stack: 0 pid:1356185 ppid:1356177 flags:0x00000000
[162513.516620] Call Trace:
[162513.516625] __schedule+0x5ce/0xd00
[162513.516628] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.516634] schedule+0x46/0xf0
[162513.516647] wait_current_trans+0xde/0x140 [btrfs]
[162513.516650] ? finish_wait+0x90/0x90
[162513.516662] start_transaction+0x4d7/0x5f0 [btrfs]
[162513.516679] btrfs_setxattr_trans+0x3c/0x100 [btrfs]
[162513.516686] __vfs_setxattr+0x66/0x80
[162513.516691] __vfs_setxattr_noperm+0x70/0x200
[162513.516697] vfs_setxattr+0x6b/0x120
[162513.516703] setxattr+0x125/0x240
[162513.516709] ? lock_acquire+0xb1/0x480
[162513.516712] ? mnt_want_write+0x20/0x50
[162513.516721] ? rcu_read_lock_any_held+0x8e/0xb0
[162513.516723] ? preempt_count_add+0x49/0xa0
[162513.516725] ? __sb_start_write+0x19b/0x290
[162513.516727] ? preempt_count_add+0x49/0xa0
[162513.516732] path_setxattr+0xba/0xd0
[162513.516739] __x64_sys_setxattr+0x27/0x30
[162513.516741] do_syscall_64+0x33/0x80
[162513.516743] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.516745] RIP: 0033:0x7f5238f56d5a
[162513.516746] Code: Bad RIP value.
[162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
[162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
[162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
[162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
[162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
[162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
[162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
[162513.517064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.517763] task:fsstress state:D stack: 0 pid:1356196 ppid:1356177 flags:0x00004000
[162513.517780] Call Trace:
[162513.517786] __schedule+0x5ce/0xd00
[162513.517789] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.517796] schedule+0x46/0xf0
[162513.517810] wait_current_trans+0xde/0x140 [btrfs]
[162513.517814] ? finish_wait+0x90/0x90
[162513.517829] start_transaction+0x37c/0x5f0 [btrfs]
[162513.517845] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
[162513.517857] btrfs_sync_fs+0x61/0x1c0 [btrfs]
[162513.517862] ? __ia32_sys_fdatasync+0x20/0x20
[162513.517865] iterate_supers+0x87/0xf0
[162513.517869] ksys_sync+0x60/0xb0
[162513.517872] __do_sys_sync+0xa/0x10
[162513.517875] do_syscall_64+0x33/0x80
[162513.517878] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.517881] RIP: 0033:0x7f5238f50bd7
[162513.517883] Code: Bad RIP value.
[162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
[162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
[162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
[162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
[162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
[162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
[162513.518298] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.519157] task:fsstress state:D stack: 0 pid:1356197 ppid:1356177 flags:0x00000000
[162513.519160] Call Trace:
[162513.519165] __schedule+0x5ce/0xd00
[162513.519168] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.519174] schedule+0x46/0xf0
[162513.519190] wait_current_trans+0xde/0x140 [btrfs]
[162513.519193] ? finish_wait+0x90/0x90
[162513.519206] start_transaction+0x4d7/0x5f0 [btrfs]
[162513.519222] btrfs_create+0x57/0x200 [btrfs]
[162513.519230] lookup_open+0x522/0x650
[162513.519246] path_openat+0x2b8/0xa50
[162513.519270] do_filp_open+0x91/0x100
[162513.519275] ? find_held_lock+0x32/0x90
[162513.519280] ? lock_acquired+0x33b/0x470
[162513.519285] ? do_raw_spin_unlock+0x4b/0xc0
[162513.519287] ? _raw_spin_unlock+0x29/0x40
[162513.519295] do_sys_openat2+0x20d/0x2d0
[162513.519300] do_sys_open+0x44/0x80
[162513.519304] do_syscall_64+0x33/0x80
[162513.519307] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.519309] RIP: 0033:0x7f5238f4a903
[162513.519310] Code: Bad RIP value.
[162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
[162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
[162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
[162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
[162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
[162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
[162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
[162513.519727] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.520508] task:btrfs state:D stack: 0 pid:1356211 ppid:1356178 flags:0x00004002
[162513.520511] Call Trace:
[162513.520516] __schedule+0x5ce/0xd00
[162513.520519] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.520525] schedule+0x46/0xf0
[162513.520544] btrfs_scrub_pause+0x11f/0x180 [btrfs]
[162513.520548] ? finish_wait+0x90/0x90
[162513.520562] btrfs_commit_transaction+0x45a/0xc30 [btrfs]
[162513.520574] ? start_transaction+0xe0/0x5f0 [btrfs]
[162513.520596] btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
[162513.520619] btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
[162513.520639] btrfs_ioctl+0x2a25/0x36f0 [btrfs]
[162513.520643] ? do_sigaction+0xf3/0x240
[162513.520645] ? find_held_lock+0x32/0x90
[162513.520648] ? do_sigaction+0xf3/0x240
[162513.520651] ? lock_acquired+0x33b/0x470
[162513.520655] ? _raw_spin_unlock_irq+0x24/0x50
[162513.520657] ? lockdep_hardirqs_on+0x7d/0x100
[162513.520660] ? _raw_spin_unlock_irq+0x35/0x50
[162513.520662] ? do_sigaction+0xf3/0x240
[162513.520671] ? __x64_sys_ioctl+0x83/0xb0
[162513.520672] __x64_sys_ioctl+0x83/0xb0
[162513.520677] do_syscall_64+0x33/0x80
[162513.520679] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.520681] RIP: 0033:0x7fc3cd307d87
[162513.520682] Code: Bad RIP value.
[162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
[162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
[162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
[162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
[162513.520703]
Showing all locks held in the system:
[162513.520712] 1 lock held by khungtaskd/54:
[162513.520713] #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
[162513.520728] 1 lock held by in:imklog/596:
[162513.520729] #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
[162513.520782] 1 lock held by btrfs-transacti/1356167:
[162513.520784] #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
[162513.520798] 1 lock held by btrfs/1356190:
[162513.520800] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
[162513.520805] 1 lock held by fsstress/1356184:
[162513.520806] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
[162513.520811] 3 locks held by fsstress/1356185:
[162513.520812] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
[162513.520815] #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
[162513.520820] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
[162513.520833] 1 lock held by fsstress/1356196:
[162513.520834] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
[162513.520838] 3 locks held by fsstress/1356197:
[162513.520839] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
[162513.520843] #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
[162513.520846] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
[162513.520858] 2 locks held by btrfs/1356211:
[162513.520859] #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
[162513.520877] #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
This was weird because the stack traces show that a transaction commit,
triggered by a device replace operation, is blocking trying to pause any
running scrubs but there are no stack traces of blocked tasks doing a
scrub.
After poking around with drgn, I noticed there was a scrub task that was
constantly running and blocking for short periods of time:
>>> t = find_task(prog, 1356190)
>>> prog.stack_trace(t)
#0 __schedule+0x5ce/0xcfc
#1 schedule+0x46/0xe4
#2 schedule_timeout+0x1df/0x475
#3 btrfs_reada_wait+0xda/0x132
#4 scrub_stripe+0x2a8/0x112f
#5 scrub_chunk+0xcd/0x134
#6 scrub_enumerate_chunks+0x29e/0x5ee
#7 btrfs_scrub_dev+0x2d5/0x91b
#8 btrfs_ioctl+0x7f5/0x36e7
#9 __x64_sys_ioctl+0x83/0xb0
#10 do_syscall_64+0x33/0x77
#11 entry_SYSCALL_64+0x7c/0x156
Which corresponds to:
int btrfs_reada_wait(void *handle)
{
struct reada_control *rc = handle;
struct btrfs_fs_info *fs_info = rc->fs_info;
while (atomic_read(&rc->elems)) {
if (!atomic_read(&fs_info->reada_works_cnt))
reada_start_machine(fs_info);
wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
(HZ + 9) / 10);
}
(...)
So the counter "rc->elems" was set to 1 and never decreased to 0, causing
the scrub task to loop forever in that function. Then I used the following
script for drgn to check the readahead requests:
$ cat dump_reada.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
reinterpret, sizeof
from drgn.helpers.linux import *
mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
pass
if mnt is None:
sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
sys.exit(1)
fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
def dump_re(re):
nzones = re.nzones.value_()
print(f're at {hex(re.value_())}')
print(f'\t logical {re.logical.value_()}')
print(f'\t refcnt {re.refcnt.value_()}')
print(f'\t nzones {nzones}')
for i in range(nzones):
dev = re.zones[i].device
name = dev.name.str.string_()
print(f'\t\t dev id {dev.devid.value_()} name {name}')
print()
for _, e in radix_tree_for_each(fs_info.reada_tree):
re = cast('struct reada_extent *', e)
dump_re(re)
$ drgn dump_reada.py
re at 0xffff8f3da9d25ad8
logical 38928384
refcnt 1
nzones 1
dev id 0 name b'/dev/sdd'
$
So there was one readahead extent with a single zone corresponding to the
source device of that last device replace operation logged in dmesg/syslog.
Also the ID of that zone's device was 0 which is a special value set in
the source device of a device replace operation when the operation finishes
(constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
confirming again that device /dev/sdd was the source of a device replace
operation.
Normally there should be as many zones in the readahead extent as there are
devices, and I wasn't expecting the extent to be in a block group with a
'single' profile, so I went and confirmed with the following drgn script
that there weren't any single profile block groups:
$ cat dump_block_groups.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
reinterpret, sizeof
from drgn.helpers.linux import *
mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
pass
if mnt is None:
sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
sys.exit(1)
fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
BTRFS_BLOCK_GROUP_DATA = (1 << 0)
BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
BTRFS_BLOCK_GROUP_DUP = (1 << 5)
BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)
def bg_flags_string(bg):
flags = bg.flags.value_()
ret = ''
if flags & BTRFS_BLOCK_GROUP_DATA:
ret = 'data'
if flags & BTRFS_BLOCK_GROUP_METADATA:
if len(ret) > 0:
ret += '|'
ret += 'meta'
if flags & BTRFS_BLOCK_GROUP_SYSTEM:
if len(ret) > 0:
ret += '|'
ret += 'system'
if flags & BTRFS_BLOCK_GROUP_RAID0:
ret += ' raid0'
elif flags & BTRFS_BLOCK_GROUP_RAID1:
ret += ' raid1'
elif flags & BTRFS_BLOCK_GROUP_DUP:
ret += ' dup'
elif flags & BTRFS_BLOCK_GROUP_RAID10:
ret += ' raid10'
elif flags & BTRFS_BLOCK_GROUP_RAID5:
ret += ' raid5'
elif flags & BTRFS_BLOCK_GROUP_RAID6:
ret += ' raid6'
elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
ret += ' raid1c3'
elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
ret += ' raid1c4'
else:
ret += ' single'
return ret
def dump_bg(bg):
print()
print(f'block group at {hex(bg.value_())}')
print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')
bg_root = fs_info.block_group_cache_tree.address_of_()
for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
dump_bg(bg)
$ drgn dump_block_groups.py
block group at 0xffff8f3d673b0400
start 22020096 length 16777216
flags 258 - system raid6
block group at 0xffff8f3d53ddb400
start 38797312 length 536870912
flags 260 - meta raid6
block group at 0xffff8f3d5f4d9c00
start 575668224 length 2147483648
flags 257 - data raid6
block group at 0xffff8f3d08189000
start 2723151872 length 67108864
flags 258 - system raid6
block group at 0xffff8f3db70ff000
start 2790260736 length 1073741824
flags 260 - meta raid6
block group at 0xffff8f3d5f4dd800
start 3864002560 length 67108864
flags 258 - system raid6
block group at 0xffff8f3d67037000
start 3931111424 length 2147483648
flags 257 - data raid6
$
So there were only 2 reasons left for having a readahead extent with a
single zone: reada_find_zone(), called when creating a readahead extent,
returned NULL either because we failed to find the corresponding block
group or because a memory allocation failed. With some additional and
custom tracing I figured out that on every further occurrence of the
problem the block group had just been deleted when we were looping to
create the zones for the readahead extent (at reada_find_extent()), so we
ended up with only one zone in the readahead extent, corresponding to a
device that ends up getting replaced.
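How a readahead extent can end up with a single zone can be shown with a toy
model of the zone-creation loop. This is a sketch in Python under stated
assumptions: build_zones() and block_group_alive are hypothetical names, and
a failed block group lookup stands in for reada_find_zone() returning NULL.

```python
def build_zones(stripe_devices, block_group_alive):
    """Collect one zone per stripe while the block group can be found.

    stripe_devices: device names, one per stripe of the block group.
    block_group_alive: callable taking the stripe index and returning
    whether the block group lookup still succeeds at that iteration.
    """
    zones = []
    for index, device in enumerate(stripe_devices):
        if not block_group_alive(index):
            # Stand-in for the zone lookup returning NULL: no zone is
            # added for this stripe and the loop moves on.
            continue
        zones.append(device)
    return zones
```

If the block group disappears right after the first iteration, for example
build_zones(["/dev/sdd", "/dev/sde", "/dev/sdf"], lambda i: i < 1), the
result holds only the zone for /dev/sdd.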
So after figuring that out it became obvious why the hang happens:
1) Task A starts a scrub on any device of the filesystem, except for
device /dev/sdd;
2) Task B starts a device replace with /dev/sdd as the source device;
3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
starting to scrub a stripe from block group X. This call to
btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
calls reada_add_block(), it passes the logical address of the extent
tree's root node as its 'logical' argument - a value of 38928384;
4) Task A then enters reada_find_extent(), called from reada_add_block().
It finds there isn't any existing readahead extent for the logical
address 38928384, so it proceeds to the path of creating a new one.
It calls btrfs_map_block() to find out which stripes exist for the block
group X. On the first iteration of the for loop that iterates over the
stripes, it finds the stripe for device /dev/sdd, so it creates one
zone for that device and adds it to the readahead extent. Before getting
into the second iteration of the loop, the cleanup kthread deletes block
group X because it was empty. So in the iterations for the remaining
stripes it does not add more zones to the readahead extent, because the
calls to reada_find_zone() returned NULL because they couldn't find
block group X anymore.
As a result the new readahead extent has a single zone, corresponding to
the device /dev/sdd;
5) Before task A returns to btrfs_reada_add() and queues the readahead job
for the readahead work queue, task B finishes the device replace and at
btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
device /dev/sdg;
6) Task A returns to reada_add_block(), which increments the counter
"->elems" of the reada_control structure allocated at btrfs_reada_add().
Then it returns back to btrfs_reada_add() and calls
reada_start_machine(). This queues a job in the readahead work queue to
run the function reada_start_machine_worker(), which calls
__reada_start_machine().
At __reada_start_machine() we take the device list mutex and for each
device found in the current device list, we call
reada_start_machine_dev() to start the readahead work. However at this
point the device /dev/sdd was already freed and is not in the device
list anymore.
This means the corresponding readahead for the extent at 38928384 is
never started, and therefore the "->elems" counter of the reada_control
structure allocated at btrfs_reada_add() never goes down to 0, causing
the call to btrfs_reada_wait(), done by the scrub task, to wait forever.
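The hang can be reproduced in a few lines with a user-space model. This is a
sketch in Python, not the kernel logic: ReadaControl, start_machine() and
wait() are hypothetical stand-ins, and the wait loop is bounded by max_rounds
so the example terminates instead of spinning forever.

```python
class ReadaControl:
    """Holds the counter that the waiter polls."""

    def __init__(self):
        self.elems = 0


def start_machine(extents, device_list):
    """One worker pass: only devices in device_list are serviced.

    Returns the extents that could not be started on any device.
    """
    pending = []
    for extent in extents:
        if any(dev in device_list for dev in extent["zones"]):
            extent["rc"].elems -= 1  # readahead completed
        else:
            pending.append(extent)   # no device found, never started
    return pending


def wait(rc, extents, device_list, max_rounds=5):
    """Bounded version of the wait loop; True if elems reached 0."""
    for _ in range(max_rounds):
        if rc.elems == 0:
            return True
        extents = start_machine(extents, device_list)
    return rc.elems == 0
```

With an extent whose only zone is on /dev/sdd and a device list that no
longer contains /dev/sdd, wait() returns False on every attempt, which is
the bounded equivalent of the scrub task looping forever.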
Note that the readahead request can be made either after the device replace
started or before it started, however in practice it is very unlikely that a
device replace is able to start after a readahead request is made and is
able to complete before the readahead request completes - maybe only on a
very small and nearly empty filesystem.
This hang however is not the only problem we can have with readahead and
device removals. When the readahead extent has zones other than the
one corresponding to the device that is being removed (either by a device
replace or a device remove operation), we risk having a use-after-free on
the device when dropping the last reference of the readahead extent.
For example if we create a readahead extent with two zones, one for the
device /dev/sdd and one for the device /dev/sde:
1) Before the readahead worker starts, the device /dev/sdd is removed,
and the corresponding btrfs_device structure is freed. However the
readahead extent still has the zone pointing to the device structure;
2) When the readahead worker starts, it only finds device /dev/sde in the
current device list of the filesystem;
3) It starts the readahead work, at reada_start_machine_dev(), using the
device /dev/sde;
4) Then when it finishes reading the extent from device /dev/sde, it calls
__readahead_hook() which ends up dropping the last reference on the
readahead extent through the last call to reada_extent_put();
5) At reada_extent_put() it iterates over each zone of the readahead extent
and attempts to delete an element from the device's 'reada_extents'
radix tree, resulting in a use-after-free, as the device pointer of the
zone for /dev/sdd is now stale. We can also access the device after
dropping the last reference of a zone, through reada_zone_release(),
also called by reada_extent_put().
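The stale pointer in step 5 can be illustrated with a model in which a freed
device raises an error when touched. This is a sketch in Python: Python
cannot produce a real use-after-free, so the 'freed' flag is an assumption
standing in for freed memory, and all class and function names here are
hypothetical.

```python
class FreedDeviceError(Exception):
    """Raised when a device is touched after being marked freed."""


class Device:
    def __init__(self, name):
        self.name = name
        self.freed = False
        self._reada_extents = {}

    @property
    def reada_extents(self):
        # Any access after the device was freed is the bug we model.
        if self.freed:
            raise FreedDeviceError(self.name)
        return self._reada_extents


class ReadaExtent:
    def __init__(self, logical, devices):
        self.logical = logical
        self.refcnt = 1
        self.zones = list(devices)  # each zone keeps a device reference
        for dev in devices:
            dev.reada_extents[logical] = self


def extent_put(extent):
    """Drop a reference; on the last one, unlink from every zone's device."""
    extent.refcnt -= 1
    if extent.refcnt > 0:
        return
    for dev in extent.zones:
        dev.reada_extents.pop(extent.logical, None)
```

Creating an extent over /dev/sdd and /dev/sde, marking /dev/sdd as freed and
then calling extent_put() raises FreedDeviceError, because the final put
still walks the zone that points at the removed device.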
A device remove suffers from the same problem. However, since it shrinks the
device size down to zero before removing the device, it is very unlikely to
still have readahead requests not completed by the time we free the device;
the only possibility is if the device has very little space allocated.
While the hang problem is exclusive to scrub, since it is currently the
only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
problem affects any path that triggers readahead, which includes
btree_readahead_hook() and __readahead_hook() (a readahead worker can
trigger readahead for the children of a node) for example - any path that
ends up calling reada_add_block() can trigger the use-after-free after a
device is removed.
So fix this by waiting for any readahead requests for a device to complete
before removing a device, ensuring that while waiting for existing ones no
new ones can be made.
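The idea of the fix (refuse new requests, then wait for the existing ones)
can be sketched as a small gate object. This is a user-space illustration in
Python and an assumption about the general pattern, not a description of how
the kernel code implements it: DeviceReadaGate and its internals are
hypothetical, and only the remove/undo method names loosely mirror
btrfs_reada_remove_dev() and btrfs_reada_undo_remove_dev().

```python
import threading


class DeviceReadaGate:
    """Per-device gate: block new requests and drain in-flight ones."""

    def __init__(self):
        self._cond = threading.Condition()
        self._no_reada = False  # set while the device is being removed
        self._in_flight = 0

    def try_start(self):
        """Register a new request; refused once removal has begun."""
        with self._cond:
            if self._no_reada:
                return False
            self._in_flight += 1
            return True

    def finish(self):
        """Mark one request as completed and wake any waiter."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def remove_dev(self, timeout=None):
        """Stop new requests, then wait for in-flight ones to drain.

        Returns True once nothing is in flight, False on timeout.
        """
        with self._cond:
            self._no_reada = True
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout)

    def undo_remove_dev(self):
        """Re-enable requests, for example if the removal fails later."""
        with self._cond:
            self._no_reada = False
```

Setting the flag and waiting under the same lock is what guarantees that no
new request can slip in between the check and the wait.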
This problem has been around for a very long time - the readahead code was
added in 2011, device remove has existed since 2008 and device replace was
introduced in 2013, so it is hard to pick a specific commit for a git Fixes tag.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-12 11:55:24 +01:00
btrfs_reada_undo_remove_dev ( device ) ;
2017-12-04 12:54:52 +08:00
if ( test_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) ) {
2016-10-04 19:34:27 +02:00
mutex_lock ( & fs_info - > chunk_mutex ) ;
2011-02-15 18:14:25 +00:00
list_add ( & device - > dev_alloc_list ,
2018-04-12 10:29:30 +08:00
& fs_devices - > alloc_list ) ;
2014-09-03 21:35:47 +08:00
device - > fs_devices - > rw_devices + + ;
2016-10-04 19:34:27 +02:00
mutex_unlock ( & fs_info - > chunk_mutex ) ;
2011-02-15 18:14:25 +00:00
}
2016-02-13 10:01:36 +08:00
goto out ;
2008-05-07 11:43:44 -04:00
}
2018-07-20 19:37:48 +03:00
void btrfs_rm_dev_replace_remove_srcdev ( struct btrfs_device * srcdev )
2012-11-05 17:33:06 +01:00
{
2014-08-13 14:24:19 +08:00
struct btrfs_fs_devices * fs_devices ;
2018-07-20 19:37:48 +03:00
lockdep_assert_held ( & srcdev - > fs_info - > fs_devices - > device_list_mutex ) ;
2013-10-02 20:41:01 +03:00
2014-08-20 10:56:56 +08:00
/*
* In the case of an fs with no seed , srcdev - > fs_devices will point
* to the fs_devices of fs_info . However , when the dev being replaced is
* a seed dev , it will point to the seed ' s local fs_devices . In short ,
* srcdev will have its correct fs_devices in both cases .
*/
fs_devices = srcdev - > fs_devices ;
2014-08-13 14:24:19 +08:00
2012-11-05 17:33:06 +01:00
list_del_rcu ( & srcdev - > dev_list ) ;
2017-06-19 14:14:22 +02:00
list_del ( & srcdev - > dev_alloc_list ) ;
2014-08-13 14:24:19 +08:00
fs_devices - > num_devices - - ;
2017-12-04 12:54:54 +08:00
if ( test_bit ( BTRFS_DEV_STATE_MISSING , & srcdev - > dev_state ) )
2014-08-13 14:24:19 +08:00
fs_devices - > missing_devices - - ;
2012-11-05 17:33:06 +01:00
2017-12-04 12:54:52 +08:00
if ( test_bit ( BTRFS_DEV_STATE_WRITEABLE , & srcdev - > dev_state ) )
2014-09-03 21:35:44 +08:00
fs_devices - > rw_devices - - ;
2013-10-02 20:41:01 +03:00
2014-09-03 21:35:44 +08:00
if ( srcdev - > bdev )
2014-08-13 14:24:19 +08:00
fs_devices - > open_devices - - ;
2014-10-30 16:52:31 +08:00
}
2019-03-20 16:34:54 +01:00
void btrfs_rm_dev_replace_free_srcdev ( struct btrfs_device * srcdev )
2014-10-30 16:52:31 +08:00
{
struct btrfs_fs_devices * fs_devices = srcdev - > fs_devices ;
2012-11-05 17:33:06 +01:00
btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks
When closing and freeing the source device we could end up doing our
final blkdev_put() on the bdev, which will grab the bd_mutex. As such
we want to be holding as few locks as possible, so move this call
outside of the dev_replace->lock_finishing_cancel_unmount lock. Since
we're modifying the fs_devices we need to make sure we're holding the
uuid_mutex here, so take that as well.
There's a report from syzbot that probably hits one of the cases where
the bd_mutex and device_list_mutex are taken in the wrong order. However,
it does not involve device replace, which is the case this patch fixes. As
there's no reproducer available so far, we can't verify the fix.
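Why a "wrong order" acquisition is reported can be sketched with a tiny
dependency graph. This is a simplified illustration, not lockdep itself: each
edge records that one lock was acquired while another was held, and a new
acquisition is refused if it would close a cycle. The enum names and the
graph_*() helpers are hypothetical; the edges are the chain as read from the
syzbot report quoted in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Lock classes named after the ones in the report (illustrative only). */
enum {
	BD_MUTEX, LOOP_CTL_MUTEX, WB_WORK, SB_INTERNAL, DEVICE_LIST_MUTEX,
	NR_LOCKS
};

struct lock_graph {
	/* edge[a][b] is true if b was acquired while a was held */
	bool edge[NR_LOCKS][NR_LOCKS];
};

static void graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Depth-first search: can 'to' be reached from 'from' along recorded edges? */
static bool graph_reachable(const struct lock_graph *g, int from, int to,
			    bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_LOCKS; next++) {
		if (g->edge[from][next] && !seen[next] &&
		    graph_reachable(g, next, to, seen))
			return true;
	}
	return false;
}

/*
 * Record "acquired was taken while held was held". Returns false, without
 * recording anything, if that dependency would close a cycle.
 */
static bool graph_add_dependency(struct lock_graph *g, int held, int acquired)
{
	bool seen[NR_LOCKS] = { false };

	assert(held >= 0 && held < NR_LOCKS);
	assert(acquired >= 0 && acquired < NR_LOCKS);
	if (graph_reachable(g, acquired, held, seen))
		return false;
	g->edge[held][acquired] = true;
	return true;
}
```

Feeding in the existing chain (bd_mutex, then loop_ctl_mutex, then the
writeback work, then sb_internal, then device_list_mutex) succeeds, while the
final attempt to take bd_mutex with device_list_mutex held is refused.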
https://lore.kernel.org/lkml/000000000000fc04d105afcf86d7@google.com/
dashboard link: https://syzkaller.appspot.com/bug?extid=84a0634dc5d21d488419
WARNING: possible circular locking dependency detected
5.9.0-rc5-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.0/6878 is trying to acquire lock:
ffff88804c17d780 (&bdev->bd_mutex){+.+.}-{3:3}, at: blkdev_put+0x30/0x520 fs/block_dev.c:1804
but task is already holding lock:
ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
btrfs_finish_chunk_alloc+0x281/0xf90 fs/btrfs/volumes.c:5255
btrfs_create_pending_block_groups+0x2f3/0x700 fs/btrfs/block-group.c:2109
__btrfs_end_transaction+0xf5/0x690 fs/btrfs/transaction.c:916
find_free_extent_update_loop fs/btrfs/extent-tree.c:3807 [inline]
find_free_extent+0x23b7/0x2e60 fs/btrfs/extent-tree.c:4127
btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
__extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
do_writepages+0xec/0x290 mm/page-writeback.c:2352
__writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
wb_do_writeback fs/fs-writeback.c:2039 [inline]
wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
-> #3 (sb_internal#2){.+.+}-{0:0}:
percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
__sb_start_write+0x234/0x470 fs/super.c:1672
sb_start_intwrite include/linux/fs.h:1690 [inline]
start_transaction+0xbe7/0x1170 fs/btrfs/transaction.c:624
find_free_extent_update_loop fs/btrfs/extent-tree.c:3789 [inline]
find_free_extent+0x25e1/0x2e60 fs/btrfs/extent-tree.c:4127
btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
__extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
do_writepages+0xec/0x290 mm/page-writeback.c:2352
__writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
wb_do_writeback fs/fs-writeback.c:2039 [inline]
wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
-> #2 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}:
__flush_work+0x60e/0xac0 kernel/workqueue.c:3041
wb_shutdown+0x180/0x220 mm/backing-dev.c:355
bdi_unregister+0x174/0x590 mm/backing-dev.c:872
del_gendisk+0x820/0xa10 block/genhd.c:933
loop_remove drivers/block/loop.c:2192 [inline]
loop_control_ioctl drivers/block/loop.c:2291 [inline]
loop_control_ioctl+0x3b1/0x480 drivers/block/loop.c:2257
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (loop_ctl_mutex){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
lo_open+0x19/0xd0 drivers/block/loop.c:1893
__blkdev_get+0x759/0x1aa0 fs/block_dev.c:1507
blkdev_get fs/block_dev.c:1639 [inline]
blkdev_open+0x227/0x300 fs/block_dev.c:1753
do_dentry_open+0x4b9/0x11b0 fs/open.c:817
do_open fs/namei.c:3251 [inline]
path_openat+0x1b9a/0x2730 fs/namei.c:3368
do_filp_open+0x17e/0x3c0 fs/namei.c:3395
do_sys_openat2+0x16d/0x420 fs/open.c:1168
do_sys_open fs/open.c:1184 [inline]
__do_sys_open fs/open.c:1192 [inline]
__se_sys_open fs/open.c:1188 [inline]
__x64_sys_open+0x119/0x1c0 fs/open.c:1188
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&bdev->bd_mutex){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:2496 [inline]
check_prevs_add kernel/locking/lockdep.c:2601 [inline]
validate_chain kernel/locking/lockdep.c:3218 [inline]
__lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
blkdev_put+0x30/0x520 fs/block_dev.c:1804
btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
close_fs_devices fs/btrfs/volumes.c:1193 [inline]
btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
generic_shutdown_super+0x144/0x370 fs/super.c:464
kill_anon_super+0x36/0x60 fs/super.c:1108
btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
deactivate_locked_super+0x94/0x160 fs/super.c:335
deactivate_super+0xad/0xd0 fs/super.c:366
cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
task_work_run+0xdd/0x190 kernel/task_work.c:141
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
&bdev->bd_mutex --> sb_internal#2 --> &fs_devs->device_list_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_devs->device_list_mutex);
lock(sb_internal#2);
lock(&fs_devs->device_list_mutex);
lock(&bdev->bd_mutex);
*** DEADLOCK ***
3 locks held by syz-executor.0/6878:
#0: ffff88809070c0e0 (&type->s_umount_key#70){++++}-{3:3}, at: deactivate_super+0xa5/0xd0 fs/super.c:365
#1: ffffffff8a5b37a8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_close_devices+0x23/0x1f0 fs/btrfs/volumes.c:1178
#2: ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159
stack backtrace:
CPU: 0 PID: 6878 Comm: syz-executor.0 Not tainted 5.9.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
check_prev_add kernel/locking/lockdep.c:2496 [inline]
check_prevs_add kernel/locking/lockdep.c:2601 [inline]
validate_chain kernel/locking/lockdep.c:3218 [inline]
__lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
blkdev_put+0x30/0x520 fs/block_dev.c:1804
btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
close_fs_devices fs/btrfs/volumes.c:1193 [inline]
btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
generic_shutdown_super+0x144/0x370 fs/super.c:464
kill_anon_super+0x36/0x60 fs/super.c:1108
btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
deactivate_locked_super+0x94/0x160 fs/super.c:335
deactivate_super+0xad/0xd0 fs/super.c:366
cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
task_work_run+0xdd/0x190 kernel/task_work.c:141
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x460027
RSP: 002b:00007fff59216328 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000000000076035 RCX: 0000000000460027
RDX: 0000000000403188 RSI: 0000000000000002 RDI: 00007fff592163d0
RBP: 0000000000000333 R08: 0000000000000000 R09: 000000000000000b
R10: 0000000000000005 R11: 0000000000000246 R12: 00007fff59217460
R13: 0000000002df2a60 R14: 0000000000000000 R15: 00007fff59217460
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add syzbot reference ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-20 11:18:27 -04:00
mutex_lock ( & uuid_mutex ) ;
2016-07-22 06:04:53 +08:00
btrfs_close_bdev ( srcdev ) ;
2019-03-27 14:24:11 +02:00
synchronize_rcu ( ) ;
btrfs_free_device ( srcdev ) ;
2014-08-13 14:24:22 +08:00
/* if there are no devs left we would rather delete the fs_devices */
if ( ! fs_devices - > num_devices ) {
2017-10-17 06:53:50 +08:00
/*
* On a mounted FS , num_devices can ' t be zero unless it ' s a
* seed . In case of a seed device being replaced , the replace
* target is added to the sprout FS , so there will be no more
* devices left under the seed FS .
*/
ASSERT ( fs_devices - > seeding ) ;
2020-07-16 10:25:33 +03:00
list_del_init ( & fs_devices - > seed_list ) ;
2018-04-12 10:29:27 +08:00
close_fs_devices ( fs_devices ) ;
2014-08-13 14:24:23 +08:00
free_fs_devices ( fs_devices ) ;
2014-08-13 14:24:22 +08:00
}
2020-08-20 11:18:27 -04:00
mutex_unlock ( & uuid_mutex ) ;
2012-11-05 17:33:06 +01:00
}
2018-07-20 19:37:51 +03:00
void btrfs_destroy_dev_replace_tgtdev ( struct btrfs_device * tgtdev )
2012-11-05 17:33:06 +01:00
{
2018-07-20 19:37:51 +03:00
struct btrfs_fs_devices * fs_devices = tgtdev - > fs_info - > fs_devices ;
2018-04-12 10:29:38 +08:00
mutex_lock ( & fs_devices - > device_list_mutex ) ;
2015-03-10 06:38:42 +08:00
2020-09-05 01:34:27 +08:00
btrfs_sysfs_remove_device ( tgtdev ) ;
2015-03-10 06:38:42 +08:00
btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex
When the replace target fails, the target device is taken out of the
fs device list, scratched (scratch + update_dev_time) and freed. However,
we can do the scratch + update_dev_time and free part after the
device has been taken out of the device list, so that we don't have to
hold the device_list_mutex and uuid_mutex locks while doing it.
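The reordering described above (unlink under the lock, do the slow part after
dropping it) can be sketched in a small userspace model. Everything here is
hypothetical and for illustration only: the lock is modelled as a plain flag
so that the "must not be held" rule can be asserted, and the model_*() names
do not correspond to btrfs functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A lock modelled as a flag, so that holding it can be asserted. */
struct model_lock {
	bool held;
};

struct model_dev {
	struct model_dev *next;
	bool scratched;		/* set by the slow "scratch" step */
};

struct model_list {
	struct model_lock lock;	/* stands in for the device list lock */
	struct model_dev *head;
	int count;
};

static void model_lock_acquire(struct model_lock *lock)
{
	assert(!lock->held);
	lock->held = true;
}

static void model_lock_release(struct model_lock *lock)
{
	assert(lock->held);
	lock->held = false;
}

/* Slow step that may take other locks: must run with the list lock dropped. */
static void model_scratch(const struct model_list *list, struct model_dev *dev)
{
	assert(!list->lock.held);
	dev->scratched = true;
}

/* Unlink the device under the lock, then do the slow part unlocked. */
static void model_destroy_tgtdev(struct model_list *list, struct model_dev *dev)
{
	model_lock_acquire(&list->lock);
	for (struct model_dev **pp = &list->head; *pp; pp = &(*pp)->next) {
		if (*pp == dev) {
			*pp = dev->next;
			dev->next = NULL;
			list->count--;
			break;
		}
	}
	model_lock_release(&list->lock);

	/* The device is no longer reachable from the list at this point. */
	model_scratch(list, dev);
}
```

Because the device is already off the list when model_scratch() runs, nothing
that walks the list under the lock can still find it, which is what makes it
safe to do the slow step without the lock in this model.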
Reported issue:
[ 5375.718845] ======================================================
[ 5375.718846] [ INFO: possible circular locking dependency detected ]
[ 5375.718849] 4.4.5-scst31x-debug-11+ #40 Not tainted
[ 5375.718849] -------------------------------------------------------
[ 5375.718851] btrfs-health/4662 is trying to acquire lock:
[ 5375.718861] (sb_writers){.+.+.+}, at: [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.718862]
[ 5375.718862] but task is already holding lock:
[ 5375.718907] (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.718907]
[ 5375.718907] which lock already depends on the new lock.
[ 5375.718907]
[ 5375.718908]
[ 5375.718908] the existing dependency chain (in reverse order) is:
[ 5375.718911]
[ 5375.718911] -> #3 (&fs_devs->device_list_mutex){+.+.+.}:
[ 5375.718917] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718921] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.718940] [<ffffffffa0219bf6>] btrfs_show_devname+0x36/0x210 [btrfs]
[ 5375.718945] [<ffffffff81267079>] show_vfsmnt+0x49/0x150
[ 5375.718948] [<ffffffff81240b07>] m_show+0x17/0x20
[ 5375.718951] [<ffffffff81246868>] seq_read+0x2d8/0x3b0
[ 5375.718955] [<ffffffff8121df28>] __vfs_read+0x28/0xd0
[ 5375.718959] [<ffffffff8121e806>] vfs_read+0x86/0x130
[ 5375.718962] [<ffffffff8121f4c9>] SyS_read+0x49/0xa0
[ 5375.718966] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718968]
[ 5375.718968] -> #2 (namespace_sem){+++++.}:
[ 5375.718971] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718974] [<ffffffff81635199>] down_write+0x49/0x80
[ 5375.718977] [<ffffffff81243593>] lock_mount+0x43/0x1c0
[ 5375.718979] [<ffffffff81243c13>] do_add_mount+0x23/0xd0
[ 5375.718982] [<ffffffff81244afb>] do_mount+0x27b/0xe30
[ 5375.718985] [<ffffffff812459dc>] SyS_mount+0x8c/0xd0
[ 5375.718988] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718991]
[ 5375.718991] -> #1 (&sb->s_type->i_mutex_key#5){+.+.+.}:
[ 5375.718994] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718996] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.719001] [<ffffffff8122d608>] path_openat+0x468/0x1360
[ 5375.719004] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719007] [<ffffffff8121da7b>] do_sys_open+0x12b/0x210
[ 5375.719010] [<ffffffff8121db7e>] SyS_open+0x1e/0x20
[ 5375.719013] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.719015]
[ 5375.719015] -> #0 (sb_writers){.+.+.+}:
[ 5375.719018] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719021] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719026] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719028] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719031] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719035] [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719037] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719040] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719043] [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719073] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719099] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719123] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719150] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719175] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719199] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719222] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719225] [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719229] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719230]
[ 5375.719230] other info that might help us debug this:
[ 5375.719230]
[ 5375.719233] Chain exists of:
[ 5375.719233] sb_writers --> namespace_sem --> &fs_devs->device_list_mutex
[ 5375.719233]
[ 5375.719234] Possible unsafe locking scenario:
[ 5375.719234]
[ 5375.719234] CPU0 CPU1
[ 5375.719235] ---- ----
[ 5375.719236] lock(&fs_devs->device_list_mutex);
[ 5375.719238] lock(namespace_sem);
[ 5375.719239] lock(&fs_devs->device_list_mutex);
[ 5375.719241] lock(sb_writers);
[ 5375.719241]
[ 5375.719241] *** DEADLOCK ***
[ 5375.719241]
[ 5375.719243] 4 locks held by btrfs-health/4662:
[ 5375.719266] #0: (&fs_info->health_mutex){+.+.+.}, at: [<ffffffffa0246303>] health_kthread+0x63/0x490 [btrfs]
[ 5375.719293] #1: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.+.}, at: [<ffffffffa02c6611>] btrfs_dev_replace_finishing+0x41/0x990 [btrfs]
[ 5375.719319] #2: (uuid_mutex){+.+.+.}, at: [<ffffffffa0282620>] btrfs_destroy_dev_replace_tgtdev+0x20/0x150 [btrfs]
[ 5375.719343] #3: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.719343]
[ 5375.719343] stack backtrace:
[ 5375.719347] CPU: 2 PID: 4662 Comm: btrfs-health Not tainted 4.4.5-scst31x-debug-11+ #40
[ 5375.719348] Hardware name: Supermicro SYS-6018R-WTRT/X10DRW-iT, BIOS 1.0c 01/07/2015
[ 5375.719352] 0000000000000000 ffff880856f73880 ffffffff813529e3 ffffffff826182a0
[ 5375.719354] ffffffff8260c090 ffff880856f738c0 ffffffff810d667c ffff880856f73930
[ 5375.719357] ffff880861f32b40 ffff880861f32b68 0000000000000003 0000000000000004
[ 5375.719357] Call Trace:
[ 5375.719363] [<ffffffff813529e3>] dump_stack+0x85/0xc2
[ 5375.719366] [<ffffffff810d667c>] print_circular_bug+0x1ec/0x260
[ 5375.719369] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719373] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719376] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719378] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719383] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719385] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719387] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719389] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719393] [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719415] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719418] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719420] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719423] [<ffffffff810f615d>] ? rcu_read_lock_sched_held+0x6d/0x80
[ 5375.719426] [<ffffffff81201a9b>] ? kmem_cache_alloc+0x26b/0x5d0
[ 5375.719430] [<ffffffff8122e7d4>] ? getname_kernel+0x34/0x120
[ 5375.719433] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719436] [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719462] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719485] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719506] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719530] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719554] [<ffffffffa02c6b23>] ? btrfs_dev_replace_finishing+0x553/0x990 [btrfs]
[ 5375.719576] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719598] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719621] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719641] [<ffffffffa02463d8>] ? health_kthread+0x138/0x490 [btrfs]
[ 5375.719661] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719663] [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719666] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719669] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719672] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719697] ------------[ cut here ]------------
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-18 16:51:23 +08:00
if ( tgtdev - > bdev )
2018-04-12 10:29:38 +08:00
fs_devices - > open_devices - - ;
btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex
When the replace target fails, the target device is taken out of
the fs device list, its superblocks are scratched (which also calls
update_dev_time) and it is freed. However, the scratch,
update_dev_time and free steps can run after the device has been
taken out of the device list, so they do not need to hold the
device_list_mutex and uuid_mutex locks.
Reported issue:
[ 5375.718845] ======================================================
[ 5375.718846] [ INFO: possible circular locking dependency detected ]
[ 5375.718849] 4.4.5-scst31x-debug-11+ #40 Not tainted
[ 5375.718849] -------------------------------------------------------
[ 5375.718851] btrfs-health/4662 is trying to acquire lock:
[ 5375.718861] (sb_writers){.+.+.+}, at: [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.718862]
[ 5375.718862] but task is already holding lock:
[ 5375.718907] (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.718907]
[ 5375.718907] which lock already depends on the new lock.
[ 5375.718907]
[ 5375.718908]
[ 5375.718908] the existing dependency chain (in reverse order) is:
[ 5375.718911]
[ 5375.718911] -> #3 (&fs_devs->device_list_mutex){+.+.+.}:
[ 5375.718917] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718921] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.718940] [<ffffffffa0219bf6>] btrfs_show_devname+0x36/0x210 [btrfs]
[ 5375.718945] [<ffffffff81267079>] show_vfsmnt+0x49/0x150
[ 5375.718948] [<ffffffff81240b07>] m_show+0x17/0x20
[ 5375.718951] [<ffffffff81246868>] seq_read+0x2d8/0x3b0
[ 5375.718955] [<ffffffff8121df28>] __vfs_read+0x28/0xd0
[ 5375.718959] [<ffffffff8121e806>] vfs_read+0x86/0x130
[ 5375.718962] [<ffffffff8121f4c9>] SyS_read+0x49/0xa0
[ 5375.718966] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718968]
[ 5375.718968] -> #2 (namespace_sem){+++++.}:
[ 5375.718971] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718974] [<ffffffff81635199>] down_write+0x49/0x80
[ 5375.718977] [<ffffffff81243593>] lock_mount+0x43/0x1c0
[ 5375.718979] [<ffffffff81243c13>] do_add_mount+0x23/0xd0
[ 5375.718982] [<ffffffff81244afb>] do_mount+0x27b/0xe30
[ 5375.718985] [<ffffffff812459dc>] SyS_mount+0x8c/0xd0
[ 5375.718988] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718991]
[ 5375.718991] -> #1 (&sb->s_type->i_mutex_key#5){+.+.+.}:
[ 5375.718994] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718996] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.719001] [<ffffffff8122d608>] path_openat+0x468/0x1360
[ 5375.719004] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719007] [<ffffffff8121da7b>] do_sys_open+0x12b/0x210
[ 5375.719010] [<ffffffff8121db7e>] SyS_open+0x1e/0x20
[ 5375.719013] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.719015]
[ 5375.719015] -> #0 (sb_writers){.+.+.+}:
[ 5375.719018] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719021] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719026] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719028] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719031] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719035] [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719037] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719040] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719043] [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719073] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719099] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719123] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719150] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719175] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719199] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719222] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719225] [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719229] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719230]
[ 5375.719230] other info that might help us debug this:
[ 5375.719230]
[ 5375.719233] Chain exists of:
[ 5375.719233] sb_writers --> namespace_sem --> &fs_devs->device_list_mutex
[ 5375.719233]
[ 5375.719234] Possible unsafe locking scenario:
[ 5375.719234]
[ 5375.719234] CPU0 CPU1
[ 5375.719235] ---- ----
[ 5375.719236] lock(&fs_devs->device_list_mutex);
[ 5375.719238] lock(namespace_sem);
[ 5375.719239] lock(&fs_devs->device_list_mutex);
[ 5375.719241] lock(sb_writers);
[ 5375.719241]
[ 5375.719241] *** DEADLOCK ***
[ 5375.719241]
[ 5375.719243] 4 locks held by btrfs-health/4662:
[ 5375.719266] #0: (&fs_info->health_mutex){+.+.+.}, at: [<ffffffffa0246303>] health_kthread+0x63/0x490 [btrfs]
[ 5375.719293] #1: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.+.}, at: [<ffffffffa02c6611>] btrfs_dev_replace_finishing+0x41/0x990 [btrfs]
[ 5375.719319] #2: (uuid_mutex){+.+.+.}, at: [<ffffffffa0282620>] btrfs_destroy_dev_replace_tgtdev+0x20/0x150 [btrfs]
[ 5375.719343] #3: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.719343]
[ 5375.719343] stack backtrace:
[ 5375.719347] CPU: 2 PID: 4662 Comm: btrfs-health Not tainted 4.4.5-scst31x-debug-11+ #40
[ 5375.719348] Hardware name: Supermicro SYS-6018R-WTRT/X10DRW-iT, BIOS 1.0c 01/07/2015
[ 5375.719352] 0000000000000000 ffff880856f73880 ffffffff813529e3 ffffffff826182a0
[ 5375.719354] ffffffff8260c090 ffff880856f738c0 ffffffff810d667c ffff880856f73930
[ 5375.719357] ffff880861f32b40 ffff880861f32b68 0000000000000003 0000000000000004
[ 5375.719357] Call Trace:
[ 5375.719363] [<ffffffff813529e3>] dump_stack+0x85/0xc2
[ 5375.719366] [<ffffffff810d667c>] print_circular_bug+0x1ec/0x260
[ 5375.719369] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719373] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719376] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719378] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719383] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719385] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719387] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719389] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719393] [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719415] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719418] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719420] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719423] [<ffffffff810f615d>] ? rcu_read_lock_sched_held+0x6d/0x80
[ 5375.719426] [<ffffffff81201a9b>] ? kmem_cache_alloc+0x26b/0x5d0
[ 5375.719430] [<ffffffff8122e7d4>] ? getname_kernel+0x34/0x120
[ 5375.719433] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719436] [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719462] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719485] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719506] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719530] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719554] [<ffffffffa02c6b23>] ? btrfs_dev_replace_finishing+0x553/0x990 [btrfs]
[ 5375.719576] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719598] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719621] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719641] [<ffffffffa02463d8>] ? health_kthread+0x138/0x490 [btrfs]
[ 5375.719661] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719663] [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719666] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719669] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719672] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719697] ------------[ cut here ]------------
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Signed-off-by: David Sterba <dsterba@suse.com>
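The reordering described in this commit message can be sketched in userspace C with pthreads. This is only an illustration under stated assumptions: `struct dev`, `dev_list`, `list_lock`, `add_dev()`, `remove_dev()` and `slow_cleanup()` are hypothetical names, not btrfs symbols. The mutex plays the role of device_list_mutex, and the slow cleanup stands in for the scratch + update_dev_time step that may need other locks.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device kept on a mutex-protected list. */
struct dev {
	int id;
	struct dev *next;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct dev *dev_list;
static int cleanups_done;
static int cleanup_ran_unlocked;

/* Stand-in for work that may take other locks (e.g. opening a file). */
static void slow_cleanup(struct dev *d)
{
	/* Demo-only check: trylock succeeds only if nobody holds list_lock. */
	if (pthread_mutex_trylock(&list_lock) == 0) {
		cleanup_ran_unlocked = 1;
		pthread_mutex_unlock(&list_lock);
	}
	cleanups_done++;	/* plain counter, fine for a single-threaded demo */
	free(d);
}

static int add_dev(int id)
{
	struct dev *d = malloc(sizeof(*d));

	if (!d)
		return -1;
	d->id = id;
	pthread_mutex_lock(&list_lock);
	d->next = dev_list;
	dev_list = d;
	pthread_mutex_unlock(&list_lock);
	return 0;
}

/* Unlink under the lock, then do the slow part after dropping it. */
static int remove_dev(int id)
{
	struct dev **pp, *victim = NULL;

	pthread_mutex_lock(&list_lock);
	for (pp = &dev_list; *pp; pp = &(*pp)->next) {
		if ((*pp)->id == id) {
			victim = *pp;
			*pp = victim->next;
			break;
		}
	}
	pthread_mutex_unlock(&list_lock);

	if (!victim)
		return -1;
	slow_cleanup(victim);	/* list_lock is not held here */
	return 0;
}
```

Once the entry is unlinked, no other list walker can reach it, so the cleanup no longer needs the list lock and cannot add a new edge to the lock dependency chain.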
2016-04-18 16:51:23 +08:00
2018-04-12 10:29:38 +08:00
fs_devices - > num_devices - - ;
2012-11-05 17:33:06 +01:00
2018-07-20 19:37:50 +03:00
btrfs_assign_next_active_device ( tgtdev , NULL ) ;
2012-11-05 17:33:06 +01:00
list_del_rcu ( & tgtdev - > dev_list ) ;
2018-04-12 10:29:38 +08:00
mutex_unlock ( & fs_devices - > device_list_mutex ) ;
2016-04-18 16:51:23 +08:00
2020-02-14 00:24:32 +09:00
btrfs_scratch_superblocks ( tgtdev - > fs_info , tgtdev - > bdev ,
tgtdev - > name - > str ) ;
2016-07-22 06:04:53 +08:00
btrfs_close_bdev ( tgtdev ) ;
2019-03-27 14:24:11 +02:00
synchronize_rcu ( ) ;
btrfs_free_device ( tgtdev ) ;
2012-11-05 17:33:06 +01:00
}
2021-10-05 16:12:43 -04:00
/**
 * Populate args from device at path
 *
 * @fs_info:	the filesystem
 * @args:	the args to populate
 * @path:	the path to the device
 *
 * This will read the super block of the device at @path and populate @args with
 * the devid, fsid, and uuid.  This is meant to be used for ioctls that need to
 * lookup a device to operate on, but need to do it before we take any locks.
 * This properly handles the special case of "missing" that a user may pass in,
 * and does some basic sanity checks.  The caller must make sure that @path is
 * properly NUL terminated before calling in, and must call
 * btrfs_put_dev_args_from_path() in order to free up the temporary fsid and
 * uuid buffers.
 *
 * Return: 0 for success, -errno for failure
 */
int btrfs_get_dev_args_from_path ( struct btrfs_fs_info * fs_info ,
struct btrfs_dev_lookup_args * args ,
const char * path )
2012-11-05 14:42:30 +01:00
{
struct btrfs_super_block * disk_super ;
struct block_device * bdev ;
2021-10-05 16:12:43 -04:00
int ret ;
2012-11-05 14:42:30 +01:00
2021-10-05 16:12:43 -04:00
if ( ! path | | ! path [ 0 ] )
return - EINVAL ;
if ( ! strcmp ( path , " missing " ) ) {
args - > missing = true ;
return 0 ;
}
args - > uuid = kzalloc ( BTRFS_UUID_SIZE , GFP_KERNEL ) ;
args - > fsid = kzalloc ( BTRFS_FSID_SIZE , GFP_KERNEL ) ;
if ( ! args - > uuid | | ! args - > fsid ) {
btrfs_put_dev_args_from_path ( args ) ;
return - ENOMEM ;
}
2020-02-14 00:24:32 +09:00
2021-10-05 16:12:43 -04:00
ret = btrfs_get_bdev_and_sb ( path , FMODE_READ , fs_info - > bdev_holder , 0 ,
& bdev , & disk_super ) ;
	if ( ret ) {
		btrfs_put_dev_args_from_path ( args ) ;
		return ret ;
	}
args - > devid = btrfs_stack_device_id ( & disk_super - > dev_item ) ;
memcpy ( args - > uuid , disk_super - > dev_item . uuid , BTRFS_UUID_SIZE ) ;
2018-10-30 16:43:23 +02:00
if ( btrfs_fs_incompat ( fs_info , METADATA_UUID ) )
2021-10-05 16:12:43 -04:00
memcpy ( args - > fsid , disk_super - > metadata_uuid , BTRFS_FSID_SIZE ) ;
2018-10-30 16:43:23 +02:00
else
2021-10-05 16:12:43 -04:00
memcpy ( args - > fsid , disk_super - > fsid , BTRFS_FSID_SIZE ) ;
2020-02-14 00:24:32 +09:00
btrfs_release_disk_super ( disk_super ) ;
2012-11-05 14:42:30 +01:00
blkdev_put ( bdev , FMODE_READ ) ;
2021-10-05 16:12:43 -04:00
return 0 ;
2012-11-05 14:42:30 +01:00
}
2016-02-15 16:39:55 +01:00
/*
2021-10-05 16:12:43 -04:00
 * Only use this jointly with btrfs_get_dev_args_from_path() because we will
 * allocate our ->uuid and ->fsid pointers, everybody else uses local variables
 * that don't need to be freed.
2016-02-15 16:39:55 +01:00
*/
2021-10-05 16:12:43 -04:00
void btrfs_put_dev_args_from_path ( struct btrfs_dev_lookup_args * args )
{
kfree ( args - > uuid ) ;
kfree ( args - > fsid ) ;
args - > uuid = NULL ;
args - > fsid = NULL ;
}
2018-09-03 12:46:14 +03:00
struct btrfs_device * btrfs_find_device_by_devspec (
2019-01-17 23:32:29 +08:00
struct btrfs_fs_info * fs_info , u64 devid ,
const char * device_path )
2016-02-13 10:01:35 +08:00
{
2021-10-05 16:12:42 -04:00
BTRFS_DEV_LOOKUP_ARGS ( args ) ;
2018-09-03 12:46:14 +03:00
struct btrfs_device * device ;
2021-10-05 16:12:43 -04:00
int ret ;
2016-02-13 10:01:35 +08:00
2016-02-15 16:39:55 +01:00
if ( devid ) {
2021-10-05 16:12:42 -04:00
args . devid = devid ;
device = btrfs_find_device ( fs_info - > fs_devices , & args ) ;
2018-09-03 12:46:14 +03:00
if ( ! device )
return ERR_PTR ( - ENOENT ) ;
2019-01-17 23:32:29 +08:00
return device ;
}
2021-10-05 16:12:43 -04:00
ret = btrfs_get_dev_args_from_path ( fs_info , & args , device_path ) ;
if ( ret )
return ERR_PTR ( ret ) ;
device = btrfs_find_device ( fs_info - > fs_devices , & args ) ;
btrfs_put_dev_args_from_path ( & args ) ;
if ( ! device )
return ERR_PTR ( - ENOENT ) ;
return device ;
2016-02-13 10:01:35 +08:00
}
2008-11-17 21:11:30 -05:00
/*
 * Does all the dirty work required for changing the file system's UUID.
*/
2016-06-22 18:54:24 -04:00
static int btrfs_prepare_sprout ( struct btrfs_fs_info * fs_info )
2008-11-17 21:11:30 -05:00
{
2016-06-22 18:54:23 -04:00
struct btrfs_fs_devices * fs_devices = fs_info - > fs_devices ;
2008-11-17 21:11:30 -05:00
struct btrfs_fs_devices * old_devices ;
2008-12-12 10:03:26 -05:00
struct btrfs_fs_devices * seed_devices ;
2016-06-22 18:54:23 -04:00
struct btrfs_super_block * disk_super = fs_info - > super_copy ;
2008-11-17 21:11:30 -05:00
struct btrfs_device * device ;
u64 super_flags ;
2018-03-16 02:21:22 +01:00
lockdep_assert_held ( & uuid_mutex ) ;
2008-12-12 10:03:26 -05:00
if ( ! fs_devices - > seeding )
2008-11-17 21:11:30 -05:00
return - EINVAL ;
2020-08-12 17:04:36 +03:00
/*
 * Private copy of the seed devices, anchored at
 * fs_info->fs_devices->seed_list
*/
2018-10-30 16:43:23 +02:00
seed_devices = alloc_fs_devices ( NULL , NULL ) ;
2013-08-12 14:33:03 +03:00
if ( IS_ERR ( seed_devices ) )
return PTR_ERR ( seed_devices ) ;
2008-11-17 21:11:30 -05:00
2020-08-12 17:04:36 +03:00
/*
 * It's necessary to retain a copy of the original seed fs_devices in
 * fs_uuids so that filesystems which have been seeded can successfully
 * reference the seed device from open_seed_devices. This also supports
 * multiple fs seed.
*/
2008-12-12 10:03:26 -05:00
old_devices = clone_fs_devices ( fs_devices ) ;
if ( IS_ERR ( old_devices ) ) {
kfree ( seed_devices ) ;
return PTR_ERR ( old_devices ) ;
2008-11-17 21:11:30 -05:00
}
2008-12-12 10:03:26 -05:00
2018-04-12 10:29:25 +08:00
list_add ( & old_devices - > fs_list , & fs_uuids ) ;
2008-11-17 21:11:30 -05:00
2008-12-12 10:03:26 -05:00
memcpy ( seed_devices , fs_devices , sizeof ( * seed_devices ) ) ;
seed_devices - > opened = 1 ;
INIT_LIST_HEAD ( & seed_devices - > devices ) ;
INIT_LIST_HEAD ( & seed_devices - > alloc_list ) ;
2009-06-10 15:17:02 -04:00
mutex_init ( & seed_devices - > device_list_mutex ) ;
2011-04-20 10:07:30 +00:00
2018-07-16 22:58:09 +08:00
mutex_lock ( & fs_devices - > device_list_mutex ) ;
2011-04-20 10:09:16 +00:00
list_splice_init_rcu ( & fs_devices - > devices , & seed_devices - > devices ,
synchronize_rcu ) ;
2014-09-03 21:35:41 +08:00
list_for_each_entry ( device , & seed_devices - > devices , dev_list )
device - > fs_devices = seed_devices ;
2011-04-20 10:07:30 +00:00
2019-11-13 11:27:27 +01:00
fs_devices - > seeding = false ;
2008-11-17 21:11:30 -05:00
fs_devices - > num_devices = 0 ;
fs_devices - > open_devices = 0 ;
2014-07-03 18:22:12 +08:00
fs_devices - > missing_devices = 0 ;
2019-11-13 11:27:28 +01:00
fs_devices - > rotating = false ;
2020-07-16 10:25:33 +03:00
list_add ( & seed_devices - > seed_list , & fs_devices - > seed_list ) ;
2008-11-17 21:11:30 -05:00
generate_random_uuid ( fs_devices - > fsid ) ;
2018-10-30 16:43:23 +02:00
memcpy ( fs_devices - > metadata_uuid , fs_devices - > fsid , BTRFS_FSID_SIZE ) ;
2008-11-17 21:11:30 -05:00
memcpy ( disk_super - > fsid , fs_devices - > fsid , BTRFS_FSID_SIZE ) ;
2018-07-16 22:58:09 +08:00
mutex_unlock ( & fs_devices - > device_list_mutex ) ;
Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl
The handler for the ioctl BTRFS_IOC_FS_INFO was reading the
number of devices before acquiring the device list mutex.
This could lead to inconsistent results because the device list
and the number of devices counter (amongst other counters
related to the device list) are updated in volumes.c
while holding the device list mutex - except for 2 places, one
was volumes.c:btrfs_prepare_sprout() and the other was
volumes.c:device_list_add().
For example, if we have 2 devices, with IDs 1 and 2 and then add
a new device, with ID 3, and while adding the device is in progress
a BTRFS_IOC_FS_INFO ioctl arrives, it could return a number of
devices of 2 and a max dev id of 3. This would be incorrect.
Also, this ioctl handler was reading the fsid while it can be
updated concurrently. This can happen while a new device is
being added and the current filesystem is in seeding mode.
Example:
$ mkfs.btrfs -f /dev/sdb1
$ mkfs.btrfs -f /dev/sdb2
$ btrfstune -S 1 /dev/sdb1
$ mount /dev/sdb1 /mnt/test
$ btrfs device add /dev/sdb2 /mnt/test
If during the last step a BTRFS_IOC_FS_INFO ioctl was requested, it
could read an fsid that was never valid (some bits part of the old
fsid and others part of the new fsid). Also, it could read a number
of devices that doesn't match the number of devices in the list and
the max device id, as explained before.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
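The torn read described in this commit message, and the fix of reading everything under the same mutex as the writer, can be sketched in userspace C. This is a minimal illustration under stated assumptions: `struct fs_state`, `struct fs_snapshot`, `add_device()` and `snapshot()` are hypothetical names and do not correspond to the btrfs ioctl handler or its data structures.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define ID_SIZE 16

/* Hypothetical stand-in for the counters and fsid that change together. */
struct fs_state {
	pthread_mutex_t lock;
	unsigned int num_devices;
	unsigned int max_id;
	unsigned char fsid[ID_SIZE];
};

/* Private copy handed back to the reader. */
struct fs_snapshot {
	unsigned int num_devices;
	unsigned int max_id;
	unsigned char fsid[ID_SIZE];
};

/* Writer: update all related fields inside one critical section. */
static void add_device(struct fs_state *s, const unsigned char *new_fsid)
{
	pthread_mutex_lock(&s->lock);
	s->num_devices++;
	s->max_id++;
	if (new_fsid)
		memcpy(s->fsid, new_fsid, ID_SIZE);
	pthread_mutex_unlock(&s->lock);
}

/* Reader: copy every field under the same lock the writer uses. */
static void snapshot(struct fs_state *s, struct fs_snapshot *out)
{
	pthread_mutex_lock(&s->lock);
	out->num_devices = s->num_devices;
	out->max_id = s->max_id;
	memcpy(out->fsid, s->fsid, ID_SIZE);
	pthread_mutex_unlock(&s->lock);
}
```

Because the reader takes the lock before touching any field, it sees the state either entirely before or entirely after an update, never a device count from one state paired with a max id or fsid from another.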
2013-08-12 20:56:58 +01:00
2008-11-17 21:11:30 -05:00
super_flags = btrfs_super_flags ( disk_super ) &
~ BTRFS_SUPER_FLAG_SEEDING ;
btrfs_set_super_flags ( disk_super , super_flags ) ;
return 0 ;
}
/*
2016-05-19 21:18:45 -04:00
 * Store the expected generation for seed devices in device items.
2008-11-17 21:11:30 -05:00
*/
2019-03-20 16:36:39 +01:00
static int btrfs_finish_sprout ( struct btrfs_trans_handle * trans )
2008-11-17 21:11:30 -05:00
{
2021-10-05 16:12:42 -04:00
BTRFS_DEV_LOOKUP_ARGS ( args ) ;
2019-03-20 16:36:39 +01:00
struct btrfs_fs_info * fs_info = trans - > fs_info ;
2016-06-21 10:40:19 -04:00
struct btrfs_root * root = fs_info - > chunk_root ;
2008-11-17 21:11:30 -05:00
struct btrfs_path * path ;
struct extent_buffer * leaf ;
struct btrfs_dev_item * dev_item ;
struct btrfs_device * device ;
struct btrfs_key key ;
2017-07-29 17:50:09 +08:00
u8 fs_uuid [ BTRFS_FSID_SIZE ] ;
2008-11-17 21:11:30 -05:00
u8 dev_uuid [ BTRFS_UUID_SIZE ] ;
int ret ;
path = btrfs_alloc_path ( ) ;
if ( ! path )
return - ENOMEM ;
key . objectid = BTRFS_DEV_ITEMS_OBJECTID ;
key . offset = 0 ;
key . type = BTRFS_DEV_ITEM_KEY ;
while ( 1 ) {
ret = btrfs_search_slot ( trans , root , & key , path , 0 , 1 ) ;
if ( ret < 0 )
goto error ;
leaf = path - > nodes [ 0 ] ;
next_slot :
if ( path - > slots [ 0 ] > = btrfs_header_nritems ( leaf ) ) {
ret = btrfs_next_leaf ( root , path ) ;
if ( ret > 0 )
break ;
if ( ret < 0 )
goto error ;
leaf = path - > nodes [ 0 ] ;
btrfs_item_key_to_cpu ( leaf , & key , path - > slots [ 0 ] ) ;
2011-04-21 01:20:15 +02:00
btrfs_release_path ( path ) ;
2008-11-17 21:11:30 -05:00
continue ;
}
btrfs_item_key_to_cpu ( leaf , & key , path - > slots [ 0 ] ) ;
if ( key . objectid ! = BTRFS_DEV_ITEMS_OBJECTID | |
key . type ! = BTRFS_DEV_ITEM_KEY )
break ;
dev_item = btrfs_item_ptr ( leaf , path - > slots [ 0 ] ,
struct btrfs_dev_item ) ;
2021-10-05 16:12:42 -04:00
args . devid = btrfs_device_id ( leaf , dev_item ) ;
2013-08-20 13:20:11 +02:00
read_extent_buffer ( leaf , dev_uuid , btrfs_device_uuid ( dev_item ) ,
2008-11-17 21:11:30 -05:00
BTRFS_UUID_SIZE ) ;
2013-08-20 13:20:12 +02:00
read_extent_buffer ( leaf , fs_uuid , btrfs_device_fsid ( dev_item ) ,
2017-07-29 17:50:09 +08:00
BTRFS_FSID_SIZE ) ;
2021-10-05 16:12:42 -04:00
args . uuid = dev_uuid ;
args . fsid = fs_uuid ;
device = btrfs_find_device ( fs_info - > fs_devices , & args ) ;
2012-03-12 16:03:00 +01:00
BUG_ON ( ! device ) ; /* Logic error */
2008-11-17 21:11:30 -05:00
if ( device - > fs_devices - > seeding ) {
btrfs_set_device_generation ( leaf , dev_item ,
device - > generation ) ;
btrfs_mark_buffer_dirty ( leaf ) ;
}
path - > slots [ 0 ] + + ;
goto next_slot ;
}
ret = 0 ;
error :
btrfs_free_path ( path ) ;
return ret ;
}
2017-02-14 17:55:53 +01:00
int btrfs_init_new_device ( struct btrfs_fs_info * fs_info , const char * device_path )
2008-04-28 15:29:42 -04:00
{
2016-06-21 20:16:08 -04:00
struct btrfs_root * root = fs_info - > dev_root ;
2011-08-04 14:52:27 +00:00
struct request_queue * q ;
2008-04-28 15:29:42 -04:00
struct btrfs_trans_handle * trans ;
struct btrfs_device * device ;
struct block_device * bdev ;
2016-06-22 18:54:23 -04:00
struct super_block * sb = fs_info - > sb ;
2012-06-04 14:03:51 -04:00
struct rcu_string * name ;
2018-07-03 13:14:50 +08:00
struct btrfs_fs_devices * fs_devices = fs_info - > fs_devices ;
2018-07-27 09:04:55 +09:00
u64 orig_super_total_bytes ;
u64 orig_super_num_devices ;
2008-11-17 21:11:30 -05:00
int seeding_dev = 0 ;
2008-04-28 15:29:42 -04:00
int ret = 0 ;
2020-07-22 11:09:23 +03:00
bool locked = false ;
2008-04-28 15:29:42 -04:00
2018-07-03 13:14:50 +08:00
if ( sb_rdonly ( sb ) & & ! fs_devices - > seeding )
2012-05-10 18:10:38 +08:00
return - EROFS ;
2008-04-28 15:29:42 -04:00
2011-12-07 20:08:40 -05:00
bdev = blkdev_get_by_path ( device_path , FMODE_WRITE | FMODE_EXCL ,
2016-06-22 18:54:23 -04:00
fs_info - > bdev_holder ) ;
2010-01-27 02:09:00 +00:00
if ( IS_ERR ( bdev ) )
return PTR_ERR ( bdev ) ;
2008-06-25 16:01:30 -04:00
2020-11-10 20:26:08 +09:00
if ( ! btrfs_check_device_zone_type ( fs_info , bdev ) ) {
ret = - EINVAL ;
goto error ;
}
2018-07-03 13:14:50 +08:00
if ( fs_devices - > seeding ) {
2008-11-17 21:11:30 -05:00
seeding_dev = 1 ;
down_write ( & sb - > s_umount ) ;
mutex_lock ( & uuid_mutex ) ;
2020-07-22 11:09:23 +03:00
locked = true ;
2008-11-17 21:11:30 -05:00
}
2020-07-22 11:09:25 +03:00
sync_blockdev ( bdev ) ;
2008-06-25 16:01:30 -04:00
2020-07-22 11:09:22 +03:00
rcu_read_lock ( ) ;
list_for_each_entry_rcu ( device , & fs_devices - > devices , dev_list ) {
2008-04-28 15:29:42 -04:00
if ( device - > bdev = = bdev ) {
ret = - EEXIST ;
2020-07-22 11:09:22 +03:00
rcu_read_unlock ( ) ;
2008-11-17 21:11:30 -05:00
goto error ;
2008-04-28 15:29:42 -04:00
}
}
2020-07-22 11:09:22 +03:00
rcu_read_unlock ( ) ;
2008-04-28 15:29:42 -04:00
2016-06-22 18:54:23 -04:00
device = btrfs_alloc_device ( fs_info , NULL , NULL ) ;
2013-08-23 13:20:17 +03:00
if ( IS_ERR ( device ) ) {
2008-04-28 15:29:42 -04:00
/* we can safely leave the fs_devices entry around */
2013-08-23 13:20:17 +03:00
ret = PTR_ERR ( device ) ;
2008-11-17 21:11:30 -05:00
goto error ;
2008-04-28 15:29:42 -04:00
}
2016-02-11 14:25:38 +01:00
name = rcu_string_strdup ( device_path , GFP_KERNEL ) ;
2012-06-04 14:03:51 -04:00
if ( ! name ) {
2008-11-17 21:11:30 -05:00
ret = - ENOMEM ;
2017-10-30 19:29:46 +01:00
goto error_free_device ;
2008-04-28 15:29:42 -04:00
}
2012-06-04 14:03:51 -04:00
rcu_assign_pointer ( device - > name , name ) ;
2008-11-17 21:11:30 -05:00
2020-11-10 20:26:07 +09:00
device - > fs_info = fs_info ;
device - > bdev = bdev ;
ret = btrfs_get_dev_zone_info ( device ) ;
if ( ret )
goto error_free_device ;
2010-05-16 10:48:46 -04:00
trans = btrfs_start_transaction ( root , 0 ) ;
2011-01-20 06:19:37 +00:00
if ( IS_ERR ( trans ) ) {
ret = PTR_ERR ( trans ) ;
2020-11-10 20:26:07 +09:00
goto error_free_zone ;
2011-01-20 06:19:37 +00:00
}
2011-08-04 14:52:27 +00:00
q = bdev_get_queue ( bdev ) ;
2017-12-04 12:54:52 +08:00
set_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) ;
2008-11-17 21:11:30 -05:00
device - > generation = trans - > transid ;
2016-06-22 18:54:23 -04:00
device - > io_width = fs_info - > sectorsize ;
device - > io_align = fs_info - > sectorsize ;
device - > sector_size = fs_info - > sectorsize ;
2017-06-16 14:39:20 +03:00
device - > total_bytes = round_down ( i_size_read ( bdev - > bd_inode ) ,
fs_info - > sectorsize ) ;
2009-06-04 09:23:50 -04:00
device - > disk_total_bytes = device - > total_bytes ;
2014-09-03 21:35:33 +08:00
device - > commit_total_bytes = device - > total_bytes ;
2017-12-04 12:54:53 +08:00
set_bit ( BTRFS_DEV_STATE_IN_FS_METADATA , & device - > dev_state ) ;
2017-12-04 12:54:55 +08:00
clear_bit ( BTRFS_DEV_STATE_REPLACE_TGT , & device - > dev_state ) ;
2011-02-15 18:12:57 +00:00
device - > mode = FMODE_EXCL ;
2013-10-11 15:20:42 +02:00
device - > dev_stats_valid = 1 ;
2017-06-16 01:48:05 +02:00
set_blocksize ( device - > bdev , BTRFS_BDEV_BLOCKSIZE ) ;
2008-04-28 15:29:42 -04:00
2008-11-17 21:11:30 -05:00
if ( seeding_dev ) {
btrfs: fix race between RO remount and the cleaner task
When remounting a filesystem in RO mode we can race with the cleaner task
and leak a transaction if the filesystem is unmounted shortly after,
before the transaction kthread has had a chance to commit that
transaction. That also results in a crash during unmount, due to a
use-after-free, if hardware acceleration is not available for crc32c.
The following sequence of steps explains how the race happens.
1) The filesystem is mounted in RW mode and the cleaner task is running.
This means that currently BTRFS_FS_CLEANER_RUNNING is set at
fs_info->flags;
2) The cleaner task is currently running delayed iputs, for example;
3) A filesystem RO remount operation starts;
4) The RO remount task calls btrfs_commit_super(), which commits any
currently open transaction, and it finishes;
5) At this point the cleaner task is still running and it creates a new
transaction by doing one of the following things:
* When running the delayed iput() for an inode with a 0 link count,
in which case at btrfs_evict_inode() we start a transaction through
the call to evict_refill_and_join(), use it and then release its
handle through btrfs_end_transaction();
* When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
a transaction is started at btrfs_drop_snapshot() and then its handle
is released through a call to btrfs_end_transaction_throttle();
* When the remount task was still running, and before the remount task
called btrfs_delete_unused_bgs(), the cleaner task also called
btrfs_delete_unused_bgs() and it picked and removed one block group
from the list of unused block groups. Before the cleaner task started
a transaction, through btrfs_start_trans_remove_block_group() at
btrfs_delete_unused_bgs(), the remount task had already called
btrfs_commit_super();
6) So at this point the filesystem is in RO mode and we have an open
transaction that was started by the cleaner task;
7) Shortly after, a filesystem unmount operation starts. At close_ctree()
we stop the transaction kthread before it has had a chance to commit the
transaction, since less than 30 seconds (the default commit interval)
have elapsed since the last transaction was committed;
8) We end up calling iput() against the btree inode at close_ctree() while
there is an open transaction, and since that transaction was used to
update btrees by the cleaner, we have dirty pages in the btree inode
due to COW operations on metadata extents, and therefore writeback is
triggered for the btree inode.
So btree_write_cache_pages() is invoked to flush those dirty pages
during the final iput() on the btree inode. This results in creating a
bio and submitting it, which makes us end up at
btrfs_submit_metadata_bio();
9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
that calls btrfs_wq_submit_bio(), because check_async_write() returned
a value of 1. This value of 1 is because we did not have hardware
acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
set in fs_info->flags;
10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
workqueue at fs_info->workers, which was already freed before by the
call to btrfs_stop_all_workers() at close_ctree(). This results in an
invalid memory access due to a use-after-free, leading to a crash.
When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:
------------[ cut here ]------------
WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
Code: f0 01 00 00 48 39 c2 75 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 48 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c6 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Code: 48 83 bb b0 03 00 00 00 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c7 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Code: ad de 49 be 22 01 00 (...)
RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c8 ]---
BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:
stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
Code: 54 55 53 48 89 f3 (...)
RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
submit_one_bio+0x61/0x70 [btrfs]
btree_write_cache_pages+0x414/0x450 [btrfs]
? kobject_put+0x9a/0x1d0
? trace_hardirqs_on+0x1b/0xf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
? free_debug_processing+0x1e1/0x2b0
do_writepages+0x43/0xe0
? lock_acquired+0x199/0x490
__writeback_single_inode+0x59/0x650
writeback_single_inode+0xaf/0x120
write_inode_now+0x94/0xd0
iput+0x187/0x2b0
close_ctree+0x2c6/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3cfebabee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
---[ end trace dd74718fef1ed5cc ]---
Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, a consequence of the transaction that was never committed and got
leaked:
=============================================================================
BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x0000000050cbdd61 @offset=12104
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x0000000086e9b0ff @offset=12776
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
commit_cowonly_roots+0x248/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000001a340018 @offset=4408
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_commit_transaction+0x60/0xc40 [btrfs]
create_subvol+0x56a/0x990 [btrfs]
btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
btrfs_ioctl+0x1a92/0x36f0 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x000000002b46292a @offset=13648
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? __mutex_unlock_slowpath+0x45/0x2a0
kmem_cache_destroy+0x55/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000004cf95ea8 @offset=6264
INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
So fix this by making the remount path wait for the cleaner task before
calling btrfs_commit_super(). The remount path now waits for the bit
BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
btrfs_commit_super() and this ensures the cleaner cannot start a
transaction after that, because it sleeps when the filesystem is in RO
mode and we have already flagged the filesystem as RO before waiting for
BTRFS_FS_CLEANER_RUNNING to be cleared.
This also introduces a new flag BTRFS_FS_STATE_RO to be used for
fs_info->fs_state when the filesystem is in RO mode. This is because we
were doing the RO check using the flags of the superblock and setting the
RO mode simply by ORing into the superblock's flags - those operations are
not atomic and could result in the cleaner not seeing the update from the
remount task after it clears BTRFS_FS_CLEANER_RUNNING.
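As a rough user-space illustration of the ordering described above (not
the kernel code itself), the sketch below models the two flags with C11
atomics. Every name prefixed with demo_ is hypothetical and exists only
for this example. Both bits are kept in one word for brevity, the
busy-wait stands in for the kernel's sleeping wait on the bit, and
resetting the counter stands in for btrfs_commit_super().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers, chosen for this sketch only. */
enum { DEMO_STATE_RO = 0, DEMO_CLEANER_RUNNING = 1, DEMO_NR_BITS = 2 };

struct demo_fs {
	atomic_ulong bits;	/* stand-in for the fs_info flag words */
	int open_transactions;	/* stand-in for an uncommitted transaction */
};

static void demo_fs_init(struct demo_fs *fs)
{
	atomic_init(&fs->bits, 0UL);
	fs->open_transactions = 0;
}

/* Atomic read-modify-write, unlike a plain "flags |= bit". */
static void demo_set_bit(struct demo_fs *fs, int bit)
{
	assert(bit >= 0 && bit < DEMO_NR_BITS);
	atomic_fetch_or(&fs->bits, 1UL << bit);
}

static void demo_clear_bit(struct demo_fs *fs, int bit)
{
	assert(bit >= 0 && bit < DEMO_NR_BITS);
	atomic_fetch_and(&fs->bits, ~(1UL << bit));
}

static bool demo_test_bit(struct demo_fs *fs, int bit)
{
	assert(bit >= 0 && bit < DEMO_NR_BITS);
	return (atomic_load(&fs->bits) & (1UL << bit)) != 0;
}

/*
 * One pass of the modelled cleaner: announce that it is running, and
 * only start a transaction if the filesystem is not flagged RO.
 * Returns true if a transaction was started.
 */
static bool demo_cleaner_pass(struct demo_fs *fs)
{
	bool started = false;

	demo_set_bit(fs, DEMO_CLEANER_RUNNING);
	if (!demo_test_bit(fs, DEMO_STATE_RO)) {
		fs->open_transactions++;
		started = true;
	}
	demo_clear_bit(fs, DEMO_CLEANER_RUNNING);
	return started;
}

/*
 * Modelled RO remount: flag RO first, then wait for the cleaner to
 * finish its current pass, and only then "commit".
 */
static void demo_remount_ro(struct demo_fs *fs)
{
	demo_set_bit(fs, DEMO_STATE_RO);
	while (demo_test_bit(fs, DEMO_CLEANER_RUNNING))
		;	/* the kernel would sleep here instead of spinning */
	fs->open_transactions = 0;	/* stand-in for the commit */
}
```

The point of the ordering is that a cleaner pass which begins after the
RO bit is set does not start a transaction, so nothing is left open once
the modelled commit has run.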
Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-14 10:10:47 +00:00
btrfs_clear_sb_rdonly ( sb ) ;
2016-06-22 18:54:24 -04:00
ret = btrfs_prepare_sprout ( fs_info ) ;
2017-09-28 14:51:10 +08:00
if ( ret ) {
btrfs_abort_transaction ( trans , ret ) ;
goto error_trans ;
}
btrfs: update latest_dev when we create a sprout device
When we add a device to a seed filesystem (sprouting), the result is a
new filesystem (with a new fsid) on the added device. Update latest_dev
so that /proc/self/mounts shows the correct device.
Example:
$ btrfstune -S1 /dev/vg/seed
$ mount /dev/vg/seed /btrfs
mount: /btrfs: WARNING: device write-protected, mounted read-only.
$ cat /proc/self/mounts | grep btrfs
/dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0
$ btrfs dev add -f /dev/vg/new /btrfs
Before:
$ cat /proc/self/mounts | grep btrfs
/dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0
After:
$ cat /proc/self/mounts | grep btrfs
/dev/mapper/vg-new /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0
Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-24 13:05:21 +08:00
btrfs_assign_next_active_device ( fs_info - > fs_devices - > latest_dev ,
device ) ;
2008-11-17 21:11:30 -05:00
}
2008-04-28 15:29:42 -04:00
2018-07-03 13:14:50 +08:00
device - > fs_devices = fs_devices ;
2009-06-10 15:17:02 -04:00
2018-07-03 13:14:50 +08:00
mutex_lock ( & fs_devices - > device_list_mutex ) ;
2016-10-04 19:34:27 +02:00
mutex_lock ( & fs_info - > chunk_mutex ) ;
2018-07-03 13:14:50 +08:00
list_add_rcu ( & device - > dev_list , & fs_devices - > devices ) ;
list_add ( & device - > dev_alloc_list , & fs_devices - > alloc_list ) ;
fs_devices - > num_devices + + ;
fs_devices - > open_devices + + ;
fs_devices - > rw_devices + + ;
fs_devices - > total_devices + + ;
fs_devices - > total_rw_bytes + = device - > total_bytes ;
2008-09-05 16:43:54 -04:00
2017-05-11 09:17:46 +03:00
atomic64_add ( device - > total_bytes , & fs_info - > free_chunk_space ) ;
2011-09-26 17:12:22 -04:00
2017-04-04 18:40:19 +08:00
if ( ! blk_queue_nonrot ( q ) )
2019-11-13 11:27:28 +01:00
fs_devices - > rotating = true ;
2009-06-10 09:51:32 -04:00
2018-07-27 09:04:55 +09:00
orig_super_total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
2016-06-22 18:54:23 -04:00
btrfs_set_super_total_bytes(fs_info->super_copy,
2018-07-27 09:04:55 +09:00
round_down(orig_super_total_bytes + device->total_bytes,
fs_info->sectorsize));
2008-04-28 15:29:42 -04:00
2018-07-27 09:04:55 +09:00
orig_super_num_devices = btrfs_super_num_devices(fs_info->super_copy);
btrfs_set_super_num_devices(fs_info->super_copy,
orig_super_num_devices + 1);
2014-06-03 11:36:01 +08:00
2014-09-03 21:35:41 +08:00
/*
* we've got more storage, clear any full flags on the space
* infos
*/
2016-06-22 18:54:23 -04:00
btrfs_clear_space_info_full(fs_info);
2014-09-03 21:35:41 +08:00
2016-10-04 19:34:27 +02:00
mutex_unlock(&fs_info->chunk_mutex);
btrfs: sysfs: init devices outside of the chunk_mutex
While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the
following lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc3+ #4 Not tainted
------------------------------------------------------
kswapd0/100 is trying to acquire lock:
ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
but task is already holding lock:
ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x65/0x80
slab_pre_alloc_hook.constprop.0+0x20/0x200
kmem_cache_alloc+0x37/0x270
alloc_inode+0x82/0xb0
iget_locked+0x10d/0x2c0
kernfs_get_inode+0x1b/0x130
kernfs_get_tree+0x136/0x240
sysfs_get_tree+0x16/0x40
vfs_get_tree+0x28/0xc0
path_mount+0x434/0xc00
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (kernfs_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
kernfs_add_one+0x23/0x150
kernfs_create_link+0x63/0xa0
sysfs_do_create_link_sd+0x5e/0xd0
btrfs_sysfs_add_devices_dir+0x81/0x130
btrfs_init_new_device+0x67f/0x1250
btrfs_ioctl+0x1ef/0x2e20
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
btrfs_chunk_alloc+0x125/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_insert_empty_items+0x64/0xb0
btrfs_insert_delayed_items+0x90/0x4f0
btrfs_commit_inode_delayed_items+0x93/0x140
btrfs_log_inode+0x5de/0x2020
btrfs_log_inode_parent+0x429/0xc90
btrfs_log_new_name+0x95/0x9b
btrfs_rename2+0xbb9/0x1800
vfs_rename+0x64f/0x9f0
do_renameat2+0x320/0x4e0
__x64_sys_rename+0x1f/0x30
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
__mutex_lock+0x7e/0x7e0
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
kthread+0x138/0x160
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> kernfs_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(kernfs_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/100:
#0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
#2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
stack backtrace:
CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x8b/0xb8
check_noncircular+0x12d/0x150
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
__mutex_lock+0x7e/0x7e0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? lock_acquire+0xa7/0x3d0
? find_held_lock+0x2b/0x80
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
? _raw_spin_unlock_irqrestore+0x41/0x50
? add_wait_queue_exclusive+0x70/0x70
? balance_pgdat+0x670/0x670
kthread+0x138/0x160
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x1f/0x30
This happens because we are holding the chunk_mutex at the time of
adding in a new device. However we only need to hold the
device_list_mutex, as we're going to iterate over the fs_devices
devices. Move the sysfs init stuff outside of the chunk_mutex to get
rid of this lockdep splat.
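
The lock scoping described above can be sketched as a small userspace model. This is only an illustration, not the kernel code: struct toy_mutex, struct toy_fs, toy_sysfs_add_device() and toy_add_device() are hypothetical names, and the "mutex" merely records whether it is held so that the scoping can be checked.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a kernel mutex: it only records whether it is held. */
struct toy_mutex {
	bool held;
};

static void toy_lock(struct toy_mutex *m)
{
	assert(!m->held);	/* non-recursive: taking it twice is a bug */
	m->held = true;
}

static void toy_unlock(struct toy_mutex *m)
{
	assert(m->held);
	m->held = false;
}

/* Hypothetical model of the state touched while adding a device. */
struct toy_fs {
	struct toy_mutex device_list_mutex;
	struct toy_mutex chunk_mutex;
	int num_devices;
	bool chunk_held_during_sysfs;	/* was chunk_mutex held in the sysfs step? */
	bool list_held_during_sysfs;	/* was device_list_mutex held in the sysfs step? */
};

/* Stand-in for the sysfs registration step: records which locks are held. */
static void toy_sysfs_add_device(struct toy_fs *fs)
{
	fs->chunk_held_during_sysfs = fs->chunk_mutex.held;
	fs->list_held_during_sysfs = fs->device_list_mutex.held;
}

/* Counter updates happen under both locks, sysfs work under the outer one only. */
static void toy_add_device(struct toy_fs *fs)
{
	toy_lock(&fs->device_list_mutex);

	toy_lock(&fs->chunk_mutex);
	fs->num_devices++;
	toy_unlock(&fs->chunk_mutex);

	/* chunk_mutex is already dropped here, so no chunk_mutex -> sysfs dependency. */
	toy_sysfs_add_device(fs);

	toy_unlock(&fs->device_list_mutex);
}
```

Calling toy_add_device() on a zero-initialized struct toy_fs leaves chunk_held_during_sysfs false and list_held_during_sysfs true, which is the ordering the paragraph above asks for.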
CC: stable@vger.kernel.org # 4.4.x: f3cd2c58110dad14e: btrfs: sysfs, rename device_link add/remove functions
CC: stable@vger.kernel.org # 4.4.x
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-01 08:09:01 -04:00
/* Add sysfs device entry */
2020-09-05 01:34:26 +08:00
btrfs_sysfs_add_device(device);
2020-09-01 08:09:01 -04:00
2018-07-03 13:14:50 +08:00
mutex_unlock(&fs_devices->device_list_mutex);
2008-04-28 15:29:42 -04:00
2008-11-17 21:11:30 -05:00
if (seeding_dev) {
2016-10-04 19:34:27 +02:00
mutex_lock(&fs_info->chunk_mutex);
2019-03-20 16:29:13 +01:00
ret = init_first_rw_device(trans);
2016-10-04 19:34:27 +02:00
mutex_unlock(&fs_info->chunk_mutex);
2012-09-18 07:52:32 -06:00
if (ret) {
2016-06-10 18:19:25 -04:00
btrfs_abort_transaction(trans, ret);
2017-09-28 14:51:10 +08:00
goto error_sysfs;
2012-09-18 07:52:32 -06:00
}
2014-09-03 21:35:41 +08:00
}
2018-07-20 19:37:47 +03:00
ret = btrfs_add_dev_item(trans, device);
2014-09-03 21:35:41 +08:00
if (ret) {
2016-06-10 18:19:25 -04:00
btrfs_abort_transaction(trans, ret);
2017-09-28 14:51:10 +08:00
goto error_sysfs;
2014-09-03 21:35:41 +08:00
}
if (seeding_dev) {
2019-03-20 16:36:39 +01:00
ret = btrfs_finish_sprout(trans);
2012-09-18 07:52:32 -06:00
if (ret) {
2016-06-10 18:19:25 -04:00
btrfs_abort_transaction(trans, ret);
2017-09-28 14:51:10 +08:00
goto error_sysfs;
2012-09-18 07:52:32 -06:00
}
2014-06-03 11:36:03 +08:00
2020-08-12 16:18:51 +03:00
/*
* fs_devices now represents the newly sprouted filesystem and
* its fsid has been changed by btrfs_prepare_sprout
*/
btrfs_sysfs_update_sprout_fsid(fs_devices);
2008-11-17 21:11:30 -05:00
}
2016-09-09 21:39:03 -04:00
ret = btrfs_commit_transaction(trans);
2008-06-25 16:01:30 -04:00
2008-11-17 21:11:30 -05:00
if (seeding_dev) {
mutex_unlock(&uuid_mutex);
up_write(&sb->s_umount);
2020-07-22 11:09:23 +03:00
locked = false;
2008-04-28 15:29:42 -04:00
2012-03-12 16:03:00 +01:00
if (ret) /* transaction commit */
return ret;
2016-06-22 18:54:24 -04:00
ret = btrfs_relocate_sys_chunks(fs_info);
2012-03-12 16:03:00 +01:00
if (ret < 0)
2016-06-22 18:54:23 -04:00
btrfs_handle_fs_error(fs_info, ret,
2016-09-20 10:05:00 -04:00
"Failed to relocate sys chunks after device initialization. This can be fixed using the \"btrfs balance\" command.");
Btrfs: fix deadlock caused by the nested chunk allocation
Steps to reproduce:
# mkfs.btrfs -m raid1 <disk1> <disk2>
# btrfstune -S 1 <disk1>
# mount <disk1> <mnt>
# btrfs device add <disk3> <disk4> <mnt>
# mount -o remount,rw <mnt>
# dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1
Deadlock happened.
It is because of the nested chunk allocation. When we wrote the data
into the filesystem, we would allocate the data chunk because there was
no data chunk in the filesystem. At the end of the data chunk allocation,
we should insert the metadata of the data chunk into the extent tree, but
there was no raid1 chunk, so we tried to lock the chunk allocation mutex to
allocate the new chunk, but we had held the mutex, the deadlock happened.
By rights, we would allocate the raid1 chunk when we added the second device
because the profile of the seed filesystem is raid1 and we had two devices.
But we didn't do that in fact. It is because the last step of the first device
insertion didn't commit the transaction. So when we added the second device,
we didn't cow the tree, and just inserted the relative metadata into the leaves
which were generated by the first device insertion, and its profile was dup.
So, I fix this problem by committing the transaction at the end of the first
device insertion.
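
The nested allocation described above can be sketched as a toy userspace model. All names here (struct toy_alloc_state, toy_alloc_chunk(), toy_insert_chunk_metadata()) are hypothetical, and the model is deliberately simplified: a boolean stands in for the chunk allocation mutex, and a nested acquisition returns -EDEADLK instead of blocking forever as a real non-recursive mutex would. The have_metadata_chunk flag is an assumption that stands in for "a usable metadata chunk already exists".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model state; not the kernel's data structures. */
struct toy_alloc_state {
	bool chunk_mutex_held;		/* stands in for the chunk allocation mutex */
	bool have_metadata_chunk;	/* is there a metadata chunk to write into? */
	int data_chunks;		/* data chunks successfully allocated */
};

static int toy_alloc_chunk(struct toy_alloc_state *s, bool metadata);

/* Recording a new chunk needs metadata space; without it, allocate more. */
static int toy_insert_chunk_metadata(struct toy_alloc_state *s)
{
	if (!s->have_metadata_chunk)
		return toy_alloc_chunk(s, true);	/* nested allocation */
	return 0;
}

static int toy_alloc_chunk(struct toy_alloc_state *s, bool metadata)
{
	int ret = 0;

	/* A real non-recursive mutex would hang here; the model reports it. */
	if (s->chunk_mutex_held)
		return -EDEADLK;

	s->chunk_mutex_held = true;
	if (metadata) {
		s->have_metadata_chunk = true;
	} else {
		ret = toy_insert_chunk_metadata(s);
		if (!ret)
			s->data_chunks++;
	}
	s->chunk_mutex_held = false;

	return ret;
}
```

In this model a data chunk allocation with have_metadata_chunk unset fails with -EDEADLK, while the same call succeeds once the flag is set beforehand. That precondition loosely corresponds to the state the fix is meant to establish by committing the transaction after the first device insertion.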
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-16 11:26:46 +00:00
trans = btrfs_attach_transaction(root);
if (IS_ERR(trans)) {
if (PTR_ERR(trans) == -ENOENT)
return 0;
2017-09-28 14:51:11 +08:00
ret = PTR_ERR(trans);
trans = NULL;
goto error_sysfs;
2012-10-16 11:26:46 +00:00
}
2016-09-09 21:39:03 -04:00
ret = btrfs_commit_transaction(trans);
2008-11-17 21:11:30 -05:00
}
2012-01-16 22:04:47 +02:00
2020-05-05 02:58:26 +08:00
/*
* Now that we have written a new super block to this device, check all
* other fs_devices list if device_path alienates any other scanned
* device.
* We can ignore the return value as it typically returns -EINVAL and
* only succeeds if the device was an alien.
*/
btrfs_forget_devices(device_path);
/* Update ctime/mtime for blkid or udev */
2021-07-27 17:01:16 -04:00
update_dev_time(bdev);
2020-05-05 02:58:26 +08:00
2008-11-17 21:11:30 -05:00
return ret;
2012-03-12 16:03:00 +01:00
2017-09-28 14:51:10 +08:00
error_sysfs:
2020-09-05 01:34:27 +08:00
btrfs_sysfs_remove_device(device);
2018-07-27 09:04:55 +09:00
mutex_lock(&fs_info->fs_devices->device_list_mutex);
mutex_lock(&fs_info->chunk_mutex);
list_del_rcu(&device->dev_list);
list_del(&device->dev_alloc_list);
fs_info->fs_devices->num_devices--;
fs_info->fs_devices->open_devices--;
fs_info->fs_devices->rw_devices--;
fs_info->fs_devices->total_devices--;
fs_info->fs_devices->total_rw_bytes -= device->total_bytes;
atomic64_sub(device->total_bytes, &fs_info->free_chunk_space);
btrfs_set_super_total_bytes(fs_info->super_copy,
orig_super_total_bytes);
btrfs_set_super_num_devices(fs_info->super_copy,
orig_super_num_devices);
mutex_unlock(&fs_info->chunk_mutex);
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
2012-03-12 16:03:00 +01:00
error_trans:
2017-09-28 14:51:09 +08:00
if (seeding_dev)
btrfs: fix race between RO remount and the cleaner task
When we are remounting a filesystem in RO mode we can race with the cleaner
task and result in leaking a transaction if the filesystem is unmounted
shortly after, before the transaction kthread had a chance to commit that
transaction. That also results in a crash during unmount, due to a
use-after-free, if hardware acceleration is not available for crc32c.
The following sequence of steps explains how the race happens.
1) The filesystem is mounted in RW mode and the cleaner task is running.
This means that currently BTRFS_FS_CLEANER_RUNNING is set at
fs_info->flags;
2) The cleaner task is currently running delayed iputs for example;
3) A filesystem RO remount operation starts;
4) The RO remount task calls btrfs_commit_super(), which commits any
currently open transaction, and it finishes;
5) At this point the cleaner task is still running and it creates a new
transaction by doing one of the following things:
* When running the delayed iput() for an inode with a 0 link count,
in which case at btrfs_evict_inode() we start a transaction through
the call to evict_refill_and_join(), use it and then release its
handle through btrfs_end_transaction();
* When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
a transaction is started at btrfs_drop_snapshot() and then its handle
is released through a call to btrfs_end_transaction_throttle();
* When the remount task was still running, and before the remount task
called btrfs_delete_unused_bgs(), the cleaner task also called
btrfs_delete_unused_bgs() and it picked and removed one block group
from the list of unused block groups. Before the cleaner task started
a transaction, through btrfs_start_trans_remove_block_group() at
btrfs_delete_unused_bgs(), the remount task had already called
btrfs_commit_super();
6) So at this point the filesystem is in RO mode and we have an open
transaction that was started by the cleaner task;
7) Shortly after a filesystem unmount operation starts. At close_ctree()
we stop the transaction kthread before it had a chance to commit the
transaction, since less than 30 seconds (the default commit interval)
have elapsed since the last transaction was committed;
8) We end up calling iput() against the btree inode at close_ctree() while
there is an open transaction, and since that transaction was used to
update btrees by the cleaner, we have dirty pages in the btree inode
due to COW operations on metadata extents, and therefore writeback is
triggered for the btree inode.
So btree_write_cache_pages() is invoked to flush those dirty pages
during the final iput() on the btree inode. This results in creating a
bio and submitting it, which makes us end up at
btrfs_submit_metadata_bio();
9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
that calls btrfs_wq_submit_bio(), because check_async_write() returned
a value of 1. This value of 1 is because we did not have hardware
acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
set in fs_info->flags;
10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
workqueue at fs_info->workers, which was already freed before by the
call to btrfs_stop_all_workers() at close_ctree(). This results in an
invalid memory access due to a use-after-free, leading to a crash.
When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:
------------[ cut here ]------------
WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
Code: f0 01 00 00 48 39 c2 75 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 48 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c6 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Code: 48 83 bb b0 03 00 00 00 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c7 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Code: ad de 49 be 22 01 00 (...)
RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c8 ]---
BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:
stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
Code: 54 55 53 48 89 f3 (...)
RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
submit_one_bio+0x61/0x70 [btrfs]
btree_write_cache_pages+0x414/0x450 [btrfs]
? kobject_put+0x9a/0x1d0
? trace_hardirqs_on+0x1b/0xf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
? free_debug_processing+0x1e1/0x2b0
do_writepages+0x43/0xe0
? lock_acquired+0x199/0x490
__writeback_single_inode+0x59/0x650
writeback_single_inode+0xaf/0x120
write_inode_now+0x94/0xd0
iput+0x187/0x2b0
close_ctree+0x2c6/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3cfebabee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
---[ end trace dd74718fef1ed5cc ]---
Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, consequence of the transaction that was never committed and got
leaked:
=============================================================================
BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x0000000050cbdd61 @offset=12104
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x0000000086e9b0ff @offset=12776
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
commit_cowonly_roots+0x248/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000001a340018 @offset=4408
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_commit_transaction+0x60/0xc40 [btrfs]
create_subvol+0x56a/0x990 [btrfs]
btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
btrfs_ioctl+0x1a92/0x36f0 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x000000002b46292a @offset=13648
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? __mutex_unlock_slowpath+0x45/0x2a0
kmem_cache_destroy+0x55/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000004cf95ea8 @offset=6264
INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
So fix this by making the remount path wait for the cleaner task before
calling btrfs_commit_super(). The remount path now waits for the bit
BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
btrfs_commit_super() and this ensures the cleaner can not start a
transaction after that, because it sleeps when the filesystem is in RO
mode and we have already flagged the filesystem as RO before waiting for
BTRFS_FS_CLEANER_RUNNING to be cleared.
This also introduces a new flag BTRFS_FS_STATE_RO to be used for
fs_info->fs_state when the filesystem is in RO mode. This is because we
were doing the RO check using the flags of the superblock and setting the
RO mode simply by ORing into the superblock's flags - those operations are
not atomic and could result in the cleaner not seeing the update from the
remount task after it clears BTRFS_FS_CLEANER_RUNNING.
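The ordering described above can be sketched in user-space C11. This is a minimal illustration, not the kernel code: the struct, the helper names and the bit numbers are all hypothetical, and both flags are packed into one atomic word for brevity, whereas the message keeps them in fs_info->fs_state and fs_info->flags. The point is that each side sets its own bit atomically before testing the other side's bit, so either the cleaner sees RO or the remount path sees the cleaner running.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers; the kernel keeps these in separate words. */
enum { FS_STATE_RO = 0, FS_CLEANER_RUNNING = 1 };

struct fs_flags {
	atomic_ulong bits;
};

static void flag_set(struct fs_flags *f, int bit)
{
	assert(bit >= 0 && bit < 2);
	atomic_fetch_or(&f->bits, 1UL << bit);
}

static void flag_clear(struct fs_flags *f, int bit)
{
	assert(bit >= 0 && bit < 2);
	atomic_fetch_and(&f->bits, ~(1UL << bit));
}

static bool flag_test(struct fs_flags *f, int bit)
{
	return (atomic_load(&f->bits) & (1UL << bit)) != 0;
}

/* Cleaner side: announce that we are running, then re-check RO. */
static bool cleaner_try_start(struct fs_flags *f)
{
	flag_set(f, FS_CLEANER_RUNNING);
	if (flag_test(f, FS_STATE_RO)) {
		flag_clear(f, FS_CLEANER_RUNNING);
		return false;	/* filesystem is RO, go back to sleep */
	}
	return true;
}

/* Cleaner side: done with this round of work. */
static void cleaner_finish(struct fs_flags *f)
{
	flag_clear(f, FS_CLEANER_RUNNING);
}

/*
 * Remount side: flag RO first, then report whether a cleaner is still
 * running. When this returns true the caller has to wait for the bit
 * to clear before committing.
 */
static bool remount_ro_must_wait(struct fs_flags *f)
{
	flag_set(f, FS_STATE_RO);
	return flag_test(f, FS_CLEANER_RUNNING);
}
```

With a plain, non-atomic read-modify-write on the flag word, one side's update could be lost or seen late, which is the failure mode the message attributes to ORing into the superblock flags.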
Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-14 10:10:47 +00:00
btrfs_set_sb_rdonly ( sb ) ;
2017-09-28 14:51:11 +08:00
if ( trans )
btrfs_end_transaction ( trans ) ;
2020-11-10 20:26:07 +09:00
error_free_zone :
btrfs_destroy_dev_zone_info ( device ) ;
2017-10-30 19:29:46 +01:00
error_free_device :
2018-03-20 15:47:33 +01:00
btrfs_free_device ( device ) ;
2008-11-17 21:11:30 -05:00
error :
block: make blkdev_get/put() handle exclusive access
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.
* blkdev_get/put() are the primary open and close functions.
* bd_claim/release() deal with exclusive open.
* open/close_bdev_exclusive() are combination of open and claim and
the other way around, respectively.
* bd_link/unlink_disk_holder() to create and remove holder/slave
symlinks.
* open_by_devnum() wraps bdget() + blkdev_get().
The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open is for another
exclusive access. Reorganize the interface such that,
* blkdev_get() is extended to include exclusive access management.
@holder argument is added and, if @FMODE_EXCL is specified, it will
gain exclusive access atomically w.r.t. other exclusive accesses.
* blkdev_put() is similarly extended. It now takes @mode argument and
if @FMODE_EXCL is set, it releases an exclusive access. Also, when
the last exclusive claim is released, the holder/slave symlinks are
removed automatically.
* bd_claim/release() and close_bdev_exclusive() are no longer
necessary and either made static or removed.
* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
is no longer necessary and removed.
* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
and blkdev_get(). It also has an unexpected extra bdev_read_only()
test which probably should be moved into blkdev_get().
* open_by_devnum() is modified to take @holder argument and pass it to
blkdev_get().
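The idea of folding the exclusive claim into the open itself can be illustrated with a small user-space sketch. Everything here is hypothetical (struct blockdev, blockdev_claim(), blockdev_release()) and it is not the block layer's implementation; it only shows why a single atomic test-and-set on the holder closes the window that a separate open followed by claim leaves open.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a block device with one exclusive holder. */
struct blockdev {
	_Atomic(void *) holder;	/* NULL when nobody holds it exclusively */
};

/* Claim exclusive access for @holder in one atomic step. */
static int blockdev_claim(struct blockdev *bdev, void *holder)
{
	void *expected = NULL;

	assert(bdev != NULL);
	if (holder == NULL)
		return -EINVAL;
	if (atomic_compare_exchange_strong(&bdev->holder, &expected, holder))
		return 0;
	return -EBUSY;	/* someone else already holds it */
}

/* Release the claim; only the current holder may do so. */
static int blockdev_release(struct blockdev *bdev, void *holder)
{
	void *expected = holder;

	assert(bdev != NULL);
	if (holder == NULL)
		return -EINVAL;
	if (atomic_compare_exchange_strong(&bdev->holder, &expected, NULL))
		return 0;
	return -EPERM;
}
```

A second claimant fails with -EBUSY without ever disturbing the first holder. This sketch does not model repeated claims by the same holder or the holder/slave symlinks mentioned above.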
Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should). This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.
open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features. Well, let's leave them for another day.
Most conversions are straight-forward. drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13 11:55:17 +01:00
blkdev_put ( bdev , FMODE_EXCL ) ;
2020-07-22 11:09:23 +03:00
if ( locked ) {
2008-11-17 21:11:30 -05:00
mutex_unlock ( & uuid_mutex ) ;
up_write ( & sb - > s_umount ) ;
}
2012-01-16 22:04:47 +02:00
return ret ;
2008-04-28 15:29:42 -04:00
}
2009-01-05 21:25:51 -05:00
static noinline int btrfs_update_device ( struct btrfs_trans_handle * trans ,
struct btrfs_device * device )
2008-03-24 15:01:56 -04:00
{
int ret ;
struct btrfs_path * path ;
2016-06-22 18:54:23 -04:00
struct btrfs_root * root = device - > fs_info - > chunk_root ;
2008-03-24 15:01:56 -04:00
struct btrfs_dev_item * dev_item ;
struct extent_buffer * leaf ;
struct btrfs_key key ;
path = btrfs_alloc_path ( ) ;
if ( ! path )
return - ENOMEM ;
key . objectid = BTRFS_DEV_ITEMS_OBJECTID ;
key . type = BTRFS_DEV_ITEM_KEY ;
key . offset = device - > devid ;
ret = btrfs_search_slot ( trans , root , & key , path , 0 , 1 ) ;
if ( ret < 0 )
goto out ;
if ( ret > 0 ) {
ret = - ENOENT ;
goto out ;
}
leaf = path - > nodes [ 0 ] ;
dev_item = btrfs_item_ptr ( leaf , path - > slots [ 0 ] , struct btrfs_dev_item ) ;
btrfs_set_device_id ( leaf , dev_item , device - > devid ) ;
btrfs_set_device_type ( leaf , dev_item , device - > type ) ;
btrfs_set_device_io_align ( leaf , dev_item , device - > io_align ) ;
btrfs_set_device_io_width ( leaf , dev_item , device - > io_width ) ;
btrfs_set_device_sector_size ( leaf , dev_item , device - > sector_size ) ;
2014-09-03 21:35:38 +08:00
btrfs_set_device_total_bytes ( leaf , dev_item ,
btrfs_device_get_disk_total_bytes ( device ) ) ;
btrfs_set_device_bytes_used ( leaf , dev_item ,
btrfs_device_get_bytes_used ( device ) ) ;
2008-03-24 15:01:56 -04:00
btrfs_mark_buffer_dirty ( leaf ) ;
out :
btrfs_free_path ( path ) ;
return ret ;
}
2014-09-03 21:35:41 +08:00
int btrfs_grow_device ( struct btrfs_trans_handle * trans ,
2008-04-25 16:53:30 -04:00
struct btrfs_device * device , u64 new_size )
{
2016-06-22 18:54:23 -04:00
struct btrfs_fs_info * fs_info = device - > fs_info ;
struct btrfs_super_block * super_copy = fs_info - > super_copy ;
2014-09-03 21:35:41 +08:00
u64 old_total ;
u64 diff ;
2008-04-25 16:53:30 -04:00
2017-12-04 12:54:52 +08:00
if ( ! test_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) )
2008-11-17 21:11:30 -05:00
return - EACCES ;
2014-09-03 21:35:41 +08:00
2017-06-16 14:39:20 +03:00
new_size = round_down ( new_size , fs_info - > sectorsize ) ;
2016-10-04 19:34:27 +02:00
mutex_lock ( & fs_info - > chunk_mutex ) ;
2014-09-03 21:35:41 +08:00
old_total = btrfs_super_total_bytes ( super_copy ) ;
2017-07-21 11:28:24 +03:00
diff = round_down ( new_size - device - > total_bytes , fs_info - > sectorsize ) ;
2014-09-03 21:35:41 +08:00
2012-11-05 18:29:28 +01:00
if ( new_size < = device - > total_bytes | |
2017-12-04 12:54:55 +08:00
test_bit ( BTRFS_DEV_STATE_REPLACE_TGT , & device - > dev_state ) ) {
2016-10-04 19:34:27 +02:00
mutex_unlock ( & fs_info - > chunk_mutex ) ;
2008-11-17 21:11:30 -05:00
return - EINVAL ;
2014-09-03 21:35:41 +08:00
}
2008-11-17 21:11:30 -05:00
2017-06-16 14:39:20 +03:00
btrfs_set_super_total_bytes ( super_copy ,
round_down ( old_total + diff , fs_info - > sectorsize ) ) ;
2008-11-17 21:11:30 -05:00
device - > fs_devices - > total_rw_bytes + = diff ;
2014-09-03 21:35:38 +08:00
btrfs_device_set_total_bytes ( device , new_size ) ;
btrfs_device_set_disk_total_bytes ( device , new_size ) ;
2016-06-22 18:54:56 -04:00
btrfs_clear_space_info_full ( device - > fs_info ) ;
2019-03-25 14:31:22 +02:00
if ( list_empty ( & device - > post_commit_list ) )
list_add_tail ( & device - > post_commit_list ,
& trans - > transaction - > dev_update_list ) ;
2016-10-04 19:34:27 +02:00
mutex_unlock ( & fs_info - > chunk_mutex ) ;
2009-03-10 12:39:20 -04:00
2008-04-25 16:53:30 -04:00
return btrfs_update_device ( trans , device ) ;
}
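The size arithmetic in btrfs_grow_device() can be tried out in isolation. The sketch below is a user-space approximation under stated assumptions: struct dev_sizes, round_down_pow2() and grow_device() are hypothetical names, locking and the transaction list are left out, and the alignment is assumed to be a power of two, as the kernel's round_down() also requires.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Round x down to a multiple of align; align must be a power of two. */
static uint64_t round_down_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/* Hypothetical container for the two totals the function updates. */
struct dev_sizes {
	uint64_t device_total;	/* stands in for device->total_bytes */
	uint64_t super_total;	/* stands in for the superblock total */
};

static int grow_device(struct dev_sizes *s, uint64_t new_size,
		       uint64_t sectorsize)
{
	uint64_t diff;

	new_size = round_down_pow2(new_size, sectorsize);
	if (new_size <= s->device_total)
		return -EINVAL;	/* not a grow once aligned */

	diff = round_down_pow2(new_size - s->device_total, sectorsize);
	s->super_total = round_down_pow2(s->super_total + diff, sectorsize);
	s->device_total = new_size;
	return 0;
}
```

For example, with a 4096-byte sector size, a request to grow an 8192-byte device to 12300 bytes is aligned down to 12288, so the difference added to the superblock total is 4096. A request for 8292 bytes aligns down to 8192 and is rejected.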
2018-07-20 19:37:52 +03:00
static int btrfs_free_chunk ( struct btrfs_trans_handle * trans , u64 chunk_offset )
2008-04-25 16:53:30 -04:00
{
2018-07-20 19:37:52 +03:00
struct btrfs_fs_info * fs_info = trans - > fs_info ;
2016-06-21 10:40:19 -04:00
struct btrfs_root * root = fs_info - > chunk_root ;
2008-04-25 16:53:30 -04:00
int ret ;
struct btrfs_path * path ;
struct btrfs_key key ;
path = btrfs_alloc_path ( ) ;
if ( ! path )
return - ENOMEM ;
2017-07-27 14:37:29 +03:00
key . objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID ;
2008-04-25 16:53:30 -04:00
key . offset = chunk_offset ;
key . type = BTRFS_CHUNK_ITEM_KEY ;
ret = btrfs_search_slot ( trans , root , & key , path , - 1 , 1 ) ;
2012-03-12 16:03:00 +01:00
if ( ret < 0 )
goto out ;
else if ( ret > 0 ) { /* Logic error or corruption */
2016-06-22 18:54:23 -04:00
btrfs_handle_fs_error ( fs_info , - ENOENT ,
" Failed lookup while freeing chunk. " ) ;
2012-03-12 16:03:00 +01:00
ret = - ENOENT ;
goto out ;
}
2008-04-25 16:53:30 -04:00
ret = btrfs_del_item ( trans , root , path ) ;
2012-03-12 16:03:00 +01:00
if ( ret < 0 )
2016-06-22 18:54:23 -04:00
btrfs_handle_fs_error ( fs_info , ret ,
" Failed to delete chunk item. " ) ;
2012-03-12 16:03:00 +01:00
out :
2008-04-25 16:53:30 -04:00
btrfs_free_path ( path ) ;
2011-05-19 04:37:44 +00:00
return ret ;
2008-04-25 16:53:30 -04:00
}
2017-07-27 14:37:29 +03:00
static int btrfs_del_sys_chunk ( struct btrfs_fs_info * fs_info , u64 chunk_offset )
2008-04-25 16:53:30 -04:00
{
2016-06-22 18:54:23 -04:00
struct btrfs_super_block * super_copy = fs_info - > super_copy ;
2008-04-25 16:53:30 -04:00
struct btrfs_disk_key * disk_key ;
struct btrfs_chunk * chunk ;
u8 * ptr ;
int ret = 0 ;
u32 num_stripes ;
u32 array_size ;
u32 len = 0 ;
u32 cur ;
struct btrfs_key key ;
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by and large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
lockdep_assert_held ( & fs_info - > chunk_mutex ) ;
2008-04-25 16:53:30 -04:00
array_size = btrfs_super_sys_array_size ( super_copy ) ;
ptr = super_copy - > sys_chunk_array ;
cur = 0 ;
while ( cur < array_size ) {
disk_key = ( struct btrfs_disk_key * ) ptr ;
btrfs_disk_key_to_cpu ( & key , disk_key ) ;
len = sizeof ( * disk_key ) ;
if ( key . type = = BTRFS_CHUNK_ITEM_KEY ) {
chunk = ( struct btrfs_chunk * ) ( ptr + len ) ;
num_stripes = btrfs_stack_chunk_num_stripes ( chunk ) ;
len + = btrfs_chunk_item_size ( num_stripes ) ;
} else {
ret = - EIO ;
break ;
}
2017-07-27 14:37:29 +03:00
if ( key . objectid = = BTRFS_FIRST_CHUNK_TREE_OBJECTID & &
2008-04-25 16:53:30 -04:00
key . offset = = chunk_offset ) {
memmove ( ptr , ptr + len , array_size - ( cur + len ) ) ;
array_size - = len ;
btrfs_set_super_sys_array_size ( super_copy , array_size ) ;
} else {
ptr + = len ;
cur + = len ;
}
}
return ret ;
}
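The loop in btrfs_del_sys_chunk() walks a packed array of variable-length records and closes the gap with memmove() when it finds a match. A minimal sketch of the same technique follows, using a made-up record layout (one id byte, one payload-length byte, then the payload) rather than the real disk key and chunk item layout. The name remove_records() and the bounds checks are assumptions added for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Remove every record whose id byte equals @id from a packed buffer.
 * Hypothetical layout: [id][payload_len][payload bytes...].
 * Returns the number of records removed, or -1 if the buffer is malformed.
 */
static int remove_records(uint8_t *buf, size_t *size, uint8_t id)
{
	size_t cur = 0;
	int removed = 0;

	assert(buf != NULL && size != NULL);
	while (cur < *size) {
		size_t len;

		if (*size - cur < 2)
			return -1;	/* truncated header */
		len = 2 + (size_t)buf[cur + 1];
		if (len > *size - cur)
			return -1;	/* truncated payload */

		if (buf[cur] == id) {
			/* Slide the tail down; do not advance cur. */
			memmove(buf + cur, buf + cur + len,
				*size - (cur + len));
			*size -= len;
			removed++;
		} else {
			cur += len;
		}
	}
	return removed;
}
```

As in the original loop, the cursor only advances when a record is kept. After a removal the next record already sits at the current offset.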
2018-05-16 16:34:31 -07:00
/*
* btrfs_get_chunk_map ( ) - Find the mapping containing the given logical extent .
* @ logical : Logical block offset in bytes .
* @ length : Length of extent in bytes .
*
* Return : Chunk mapping or ERR_PTR .
*/
struct extent_map * btrfs_get_chunk_map ( struct btrfs_fs_info * fs_info ,
u64 logical , u64 length )
2017-03-14 13:33:55 -07:00
{
struct extent_map_tree * em_tree ;
struct extent_map * em ;
2019-05-17 11:43:17 +02:00
em_tree = & fs_info - > mapping_tree ;
2017-03-14 13:33:55 -07:00
read_lock ( & em_tree - > lock ) ;
em = lookup_extent_mapping ( em_tree , logical , length ) ;
read_unlock ( & em_tree - > lock ) ;
if ( ! em ) {
btrfs_crit ( fs_info , " unable to find logical %llu length %llu " ,
logical , length ) ;
return ERR_PTR ( - EINVAL ) ;
}
if ( em - > start > logical | | em - > start + em - > len < = logical ) {
btrfs_crit ( fs_info ,
" found a bad mapping, wanted %llu-%llu, found %llu-%llu " ,
logical , logical + length , em - > start , em - > start + em - > len ) ;
free_extent_map ( em ) ;
return ERR_PTR ( - EINVAL ) ;
}
/* callers are responsible for dropping em's ref. */
return em ;
}
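The sanity check in btrfs_get_chunk_map() boils down to asking whether a logical address falls inside a mapping's range. A small stand-alone sketch of that test, assuming the range is half-open, [start, start + len), and using hypothetical names (struct chunk_range, range_contains(), find_chunk()). A linear scan stands in for the extent map tree lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the start/len pair of a chunk mapping. */
struct chunk_range {
	uint64_t start;
	uint64_t len;
};

/* True if logical lies in [start, start + len); safe against overflow. */
static bool range_contains(const struct chunk_range *r, uint64_t logical)
{
	return logical >= r->start && logical - r->start < r->len;
}

/* Linear lookup; returns the containing range or NULL. */
static const struct chunk_range *find_chunk(const struct chunk_range *maps,
					    size_t count, uint64_t logical)
{
	assert(maps != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		if (range_contains(&maps[i], logical))
			return &maps[i];
	}
	return NULL;
}
```

Writing the upper bound as logical - start < len avoids computing start + len, which could wrap for ranges near the top of the 64-bit address space.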
2021-06-29 14:43:06 +01:00
static int remove_chunk_item(struct btrfs_trans_handle *trans,
			     struct map_lookup *map, u64 chunk_offset)
{
	int i;
	/*
	 * Removing chunk items and updating the device items in the chunks btree
	 * requires holding the chunk_mutex.
	 * See the comment at btrfs_chunk_alloc() for the details.
	 */
	lockdep_assert_held(&trans->fs_info->chunk_mutex);
	for (i = 0; i < map->num_stripes; i++) {
		int ret;
		ret = btrfs_update_device(trans, map->stripes[i].dev);
		if (ret)
			return ret;
	}
	return btrfs_free_chunk(trans, chunk_offset);
}
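The rule described in the commit message above (check for system space, reserve it and update the chunk btree inside one chunk_mutex critical section) can be illustrated with a small userspace sketch. Everything in it is hypothetical: struct sys_space, update_chunk_tree() and count_update() are illustrative names, and a pthread mutex stands in for fs_info->chunk_mutex. It only shows the shape of the locking, not how btrfs implements the reservation. Unlike the kernel code, which releases the reservation right after dropping the mutex, the sketch releases it before unlocking to stay short.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for system space accounting; not a btrfs structure. */
struct sys_space {
	pthread_mutex_t lock;	/* plays the role of fs_info->chunk_mutex */
	unsigned long free;	/* unreserved system space, arbitrary units */
	unsigned long reserved;	/* space held by the operation in progress */
};

/*
 * Check for space, reserve it, run the update and drop the reservation,
 * all inside a single critical section. A second caller can never observe
 * space that is reserved on behalf of an update that has not run yet.
 */
int update_chunk_tree(struct sys_space *s, unsigned long need,
		      int (*do_update)(void *), void *arg)
{
	int ret;

	pthread_mutex_lock(&s->lock);
	if (s->free < need) {
		pthread_mutex_unlock(&s->lock);
		return -ENOSPC;
	}
	s->free -= need;
	s->reserved += need;

	ret = do_update(arg);

	s->reserved -= need;
	s->free += need;
	pthread_mutex_unlock(&s->lock);
	return ret;
}

/* Example callback: counts how many times an update was performed. */
int count_update(void *arg)
{
	int *calls = arg;

	(*calls)++;
	return 0;
}
```

A caller would initialize the structure with PTHREAD_MUTEX_INITIALIZER and pass the update as a callback; a request larger than the free space fails with -ENOSPC without running the callback.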
2018-07-20 19:37:53 +03:00
int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
2008-04-25 16:53:30 -04:00
{
2018-07-20 19:37:53 +03:00
	struct btrfs_fs_info *fs_info = trans->fs_info;
2008-04-25 16:53:30 -04:00
	struct extent_map *em;
	struct map_lookup *map;
2014-09-03 21:35:41 +08:00
	u64 dev_extent_len = 0;
2014-09-18 11:20:02 -04:00
	int i, ret = 0;
2016-06-22 18:54:23 -04:00
	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
2008-04-25 16:53:30 -04:00
2018-05-16 16:34:31 -07:00
	em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
2017-03-14 13:33:55 -07:00
	if (IS_ERR(em)) {
2014-09-18 11:20:02 -04:00
		/*
		 * This is a logic error, but we don't want to just rely on the
2016-03-04 11:23:12 -08:00
		 * user having built with ASSERT enabled, so if ASSERT doesn't
2014-09-18 11:20:02 -04:00
		 * do anything we still error out.
		 */
		ASSERT(0);
2017-03-14 13:33:55 -07:00
		return PTR_ERR(em);
2014-09-18 11:20:02 -04:00
	}
2015-06-03 10:55:48 -04:00
	map = em->map_lookup;
2008-04-25 16:53:30 -04:00
2016-05-20 04:34:23 +01:00
	/*
2021-06-29 14:43:06 +01:00
	 * First delete the device extent items from the devices btree.
	 * We take the device_list_mutex to avoid racing with the finishing phase
	 * of a device replace operation. See the comment below before acquiring
	 * fs_info->chunk_mutex. Note that here we do not acquire the chunk_mutex
	 * because that can result in a deadlock when deleting the device extent
	 * items from the devices btree - COWing an extent buffer from the btree
	 * may result in allocating a new metadata chunk, which would attempt to
	 * lock again fs_info->chunk_mutex.
2016-05-20 04:34:23 +01:00
	 */
	mutex_lock(&fs_devices->device_list_mutex);
2008-04-25 16:53:30 -04:00
	for (i = 0; i < map->num_stripes; i++) {
2014-09-18 11:20:02 -04:00
		struct btrfs_device *device = map->stripes[i].dev;
2014-09-03 21:35:41 +08:00
		ret = btrfs_free_dev_extent(trans, device,
					    map->stripes[i].physical,
					    &dev_extent_len);
2014-09-18 11:20:02 -04:00
		if (ret) {
2016-05-20 04:34:23 +01:00
			mutex_unlock(&fs_devices->device_list_mutex);
2016-06-10 18:19:25 -04:00
			btrfs_abort_transaction(trans, ret);
2014-09-18 11:20:02 -04:00
			goto out;
		}
2008-05-07 11:43:44 -04:00
2014-09-03 21:35:41 +08:00
		if (device->bytes_used > 0) {
2016-10-04 19:34:27 +02:00
			mutex_lock(&fs_info->chunk_mutex);
2014-09-03 21:35:41 +08:00
			btrfs_device_set_bytes_used(device,
					device->bytes_used - dev_extent_len);
2017-05-11 09:17:46 +03:00
			atomic64_add(dev_extent_len, &fs_info->free_chunk_space);
2016-06-22 18:54:23 -04:00
			btrfs_clear_space_info_full(fs_info);
2016-10-04 19:34:27 +02:00
			mutex_unlock(&fs_info->chunk_mutex);
2014-09-03 21:35:41 +08:00
		}
2021-06-29 14:43:06 +01:00
	}
	mutex_unlock(&fs_devices->device_list_mutex);
2008-05-07 11:43:44 -04:00
2021-06-29 14:43:06 +01:00
	/*
	 * We acquire fs_info->chunk_mutex for 2 reasons:
	 *
	 * 1) Just like with the first phase of the chunk allocation, we must
	 *    reserve system space, do all chunk btree updates and deletions, and
	 *    update the system chunk array in the superblock while holding this
	 *    mutex. This is for similar reasons as explained in the comment at
	 *    the top of btrfs_chunk_alloc();
	 *
	 * 2) Prevent races with the final phase of a device replace operation
	 *    that replaces the device object associated with the map's stripes,
	 *    because the device object's id can change at any time during that
	 *    final phase of the device replace operation
	 *    (dev-replace.c:btrfs_dev_replace_finishing()), so we could grab the
	 *    replaced device and then see it with an ID of
	 *    BTRFS_DEV_REPLACE_DEVID, which would cause a failure when updating
	 *    the device item, which does not exist on the chunk btree.
	 *    The finishing phase of device replace acquires both the
	 *    device_list_mutex and the chunk_mutex, in that order, so we are
	 *    safe by just acquiring the chunk_mutex.
	 */
	trans->removing_chunk = true;
	mutex_lock(&fs_info->chunk_mutex);
	check_system_chunk(trans, map->type);
	ret = remove_chunk_item(trans, map, chunk_offset);
	/*
	 * Normally we should not get -ENOSPC since we reserved space before
	 * through the call to check_system_chunk().
	 *
	 * Despite our system space_info having enough free space, we may not
	 * be able to allocate extents from its block groups, because all have
	 * an incompatible profile, which will force us to allocate a new system
	 * block group with the right profile, or right after we called
	 * check_system_chunk() above, a scrub turned the only system block group
	 * with enough free space into RO mode.
	 * This is explained in more detail at do_chunk_alloc().
	 *
	 * So if we get -ENOSPC, allocate a new system chunk and retry once.
	 */
	if (ret == -ENOSPC) {
		const u64 sys_flags = btrfs_system_alloc_profile(fs_info);
		struct btrfs_block_group *sys_bg;
2021-08-18 13:41:19 +03:00
		sys_bg = btrfs_create_chunk(trans, sys_flags);
2021-06-29 14:43:06 +01:00
		if (IS_ERR(sys_bg)) {
			ret = PTR_ERR(sys_bg);
			btrfs_abort_transaction(trans, ret);
			goto out;
		}
		ret = btrfs_chunk_alloc_add_chunk_item(trans, sys_bg);
2018-10-26 14:43:19 +03:00
		if (ret) {
			btrfs_abort_transaction(trans, ret);
			goto out;
2008-05-13 13:46:40 -04:00
		}
2016-05-20 04:34:23 +01:00
2021-06-29 14:43:06 +01:00
		ret = remove_chunk_item(trans, map, chunk_offset);
		if (ret) {
			btrfs_abort_transaction(trans, ret);
			goto out;
		}
	} else if (ret) {
2016-06-10 18:19:25 -04:00
		btrfs_abort_transaction(trans, ret);
2014-09-18 11:20:02 -04:00
		goto out;
	}
2008-04-25 16:53:30 -04:00
2016-06-21 21:16:51 -04:00
	trace_btrfs_chunk_free(fs_info, map, chunk_offset, em->len);
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g.:
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordered_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are critical resources and produce a lot of corner cases during writeback,
so it is valuable to know how a page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when an inode is created and when an inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In a transaction-based filesystem, it is useful to know the generation and
who does the commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references; these tracepoints are helpful for
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
A chunk is a link between a physical offset and a logical offset and stands for
space information in btrfs; these are helpful for tracing space usage.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 11:18:59 +00:00
2008-04-25 16:53:30 -04:00
	if (map->type & BTRFS_BLOCK_GROUP_SYSTEM) {
2017-07-27 14:37:29 +03:00
		ret = btrfs_del_sys_chunk(fs_info, chunk_offset);
2014-09-18 11:20:02 -04:00
		if (ret) {
2016-06-10 18:19:25 -04:00
			btrfs_abort_transaction(trans, ret);
2014-09-18 11:20:02 -04:00
			goto out;
		}
2008-04-25 16:53:30 -04:00
	}
2021-06-29 14:43:06 +01:00
	mutex_unlock(&fs_info->chunk_mutex);
	trans->removing_chunk = false;
	/*
	 * We are done with chunk btree updates and deletions, so release the
	 * system space we previously reserved (with check_system_chunk()).
	 */
	btrfs_trans_release_chunk_metadata(trans);
2018-06-20 15:48:56 +03:00
	ret = btrfs_remove_block_group(trans, chunk_offset, em);
2014-09-18 11:20:02 -04:00
	if (ret) {
2016-06-10 18:19:25 -04:00
		btrfs_abort_transaction(trans, ret);
2014-09-18 11:20:02 -04:00
		goto out;
	}
2008-11-17 21:11:30 -05:00
2014-09-18 11:20:02 -04:00
out:
2021-06-29 14:43:06 +01:00
	if (trans->removing_chunk) {
		mutex_unlock(&fs_info->chunk_mutex);
		trans->removing_chunk = false;
	}
2008-11-17 21:11:30 -05:00
	/* once for us */
	free_extent_map(em);
2014-09-18 11:20:02 -04:00
	return ret;
}
2008-11-17 21:11:30 -05:00
2021-04-19 16:41:02 +09:00
int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
2014-09-18 11:20:02 -04:00
{
2016-06-21 10:40:19 -04:00
	struct btrfs_root *root = fs_info->chunk_root;
2016-10-10 13:43:31 -07:00
	struct btrfs_trans_handle *trans;
2019-12-13 16:22:14 -08:00
	struct btrfs_block_group *block_group;
2021-04-19 16:41:00 +09:00
	u64 length;
2014-09-18 11:20:02 -04:00
	int ret;
2008-11-17 21:11:30 -05:00
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this is:
             CPU 0                                   CPU 1

btrfs_balance()
  btrfs_relocate_chunk()
    btrfs_relocate_block_group(bg X)
      btrfs_lookup_block_group(bg X)
                                             cleaner_kthread
                                               locks fs_info->cleaner_mutex
                                                 btrfs_delete_unused_bgs()
                                                   finds bg X, which became
                                                   unused in the previous
                                                   transaction
                                                   checks bg X ->ro == 0,
                                                   so it proceeds
      sets bg X ->ro to 1
      (btrfs_set_block_group_ro(bg X))
      blocks on fs_info->cleaner_mutex
                                                   btrfs_remove_chunk(bg X)
                                               unlocks fs_info->cleaner_mutex
      acquires fs_info->cleaner_mutex
      relocate_block_group()
        --> does nothing, no extents found in
            the extent tree from bg X
      unlocks fs_info->cleaner_mutex
      btrfs_relocate_block_group(bg X) returns
    btrfs_remove_chunk(bg X)
      extent map not found
        --> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
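The serialization described above can be sketched in userspace as follows. This is only an illustration under stated assumptions, not the kernel code: `demo_block_group`, `demo_remove_lock`, `demo_balance_one` and `demo_delete_unused` are hypothetical names, and a pthread mutex plays the role of the new mutex. Both paths take the same lock and check whether the chunk is still mapped inside the critical section, so whichever path runs second finds nothing left to remove instead of hitting the failed lookup.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a block group and its chunk mapping. */
struct demo_block_group {
	bool chunk_mapped;	/* true while the chunk's extent map exists */
	int removals;		/* how many times the chunk was removed */
};

/* Hypothetical name; plays the role of the mutex added by the fix. */
static pthread_mutex_t demo_remove_lock = PTHREAD_MUTEX_INITIALIZER;

/* Remove the chunk mapping. The caller must hold demo_remove_lock. */
static int demo_remove_chunk(struct demo_block_group *bg)
{
	if (!bg->chunk_mapped)
		return -ENOENT;	/* the unserialized code hit ASSERT(0) here */
	bg->chunk_mapped = false;
	bg->removals++;
	return 0;
}

/* Balance path: look up, relocate and remove inside one critical section. */
int demo_balance_one(struct demo_block_group *bg)
{
	int ret = 0;

	pthread_mutex_lock(&demo_remove_lock);
	if (bg->chunk_mapped) {
		/* relocation of the extents would happen here */
		ret = demo_remove_chunk(bg);
	}
	pthread_mutex_unlock(&demo_remove_lock);
	return ret;
}

/* Cleaner path: remove an unused block group under the same lock. */
int demo_delete_unused(struct demo_block_group *bg)
{
	int ret = 0;

	pthread_mutex_lock(&demo_remove_lock);
	if (bg->chunk_mapped)
		ret = demo_remove_chunk(bg);
	pthread_mutex_unlock(&demo_remove_lock);
	return ret;
}
```

If `demo_delete_unused()` runs first, a later `demo_balance_one()` on the same block group returns 0 without attempting a second removal.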
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 00:58:53 +01:00
	/*
	 * Prevent races with automatic removal of unused block groups.
	 * After we relocate and before we remove the chunk with offset
	 * chunk_offset, automatic removal of the block group can kick in,
	 * resulting in a failure when calling btrfs_remove_chunk() below.
	 *
	 * Make sure to acquire this mutex before doing a tree search (dev
	 * or chunk trees) to find chunks. Otherwise the cleaner kthread might
	 * call btrfs_remove_chunk() (through btrfs_delete_unused_bgs()) after
	 * we release the path used to search the chunk/dev tree and before
	 * the current task acquires this mutex and calls us.
	 */
2021-04-19 16:41:01 +09:00
	lockdep_assert_held(&fs_info->reclaim_bgs_lock);
2015-06-11 00:58:53 +01:00
2014-09-18 11:20:02 -04:00
	/* step one, relocate all the extents inside this chunk */
2016-06-22 18:54:24 -04:00
	btrfs_scrub_pause(fs_info);
2016-06-22 18:54:23 -04:00
	ret = btrfs_relocate_block_group(fs_info, chunk_offset);
2016-06-22 18:54:24 -04:00
	btrfs_scrub_continue(fs_info);
2014-09-18 11:20:02 -04:00
	if (ret)
		return ret;
2019-12-13 16:22:14 -08:00
	block_group = btrfs_lookup_block_group(fs_info, chunk_offset);
	if (!block_group)
		return -ENOENT;
	btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
2021-04-19 16:41:00 +09:00
	length = block_group->length;
2019-12-13 16:22:14 -08:00
	btrfs_put_block_group(block_group);
2021-04-19 16:41:00 +09:00
	/*
	 * On a zoned file system, discard the whole block group, this will
	 * trigger a REQ_OP_ZONE_RESET operation on the device zone. If
	 * resetting the zone fails, don't treat it as a fatal problem from the
	 * filesystem's point of view.
	 */
	if (btrfs_is_zoned(fs_info)) {
		ret = btrfs_discard_extent(fs_info, chunk_offset, length, NULL);
		if (ret)
			btrfs_info(fs_info,
				"failed to reset zone %llu after relocation",
				chunk_offset);
	}
2016-10-10 13:43:31 -07:00
	trans = btrfs_start_trans_remove_block_group(root->fs_info,
						     chunk_offset);
	if (IS_ERR(trans)) {
		ret = PTR_ERR(trans);
		btrfs_handle_fs_error(root->fs_info, ret, NULL);
		return ret;
	}
2014-09-18 11:20:02 -04:00
	/*
2016-10-10 13:43:31 -07:00
	 * step two, delete the device extents and the
	 * chunk tree entries
2014-09-18 11:20:02 -04:00
	 */
2018-07-20 19:37:53 +03:00
	ret = btrfs_remove_chunk(trans, chunk_offset);
2016-09-09 21:39:03 -04:00
	btrfs_end_transaction(trans);
2016-10-10 13:43:31 -07:00
	return ret;
2008-11-17 21:11:30 -05:00
}
2016-06-22 18:54:24 -04:00
static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info)
2008-11-17 21:11:30 -05:00
{
2016-06-22 18:54:23 -04:00
	struct btrfs_root *chunk_root = fs_info->chunk_root;
2008-11-17 21:11:30 -05:00
	struct btrfs_path *path;
	struct extent_buffer *leaf;
	struct btrfs_chunk *chunk;
	struct btrfs_key key;
	struct btrfs_key found_key;
	u64 chunk_type;
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have a lot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
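The "return -ENOSPC and move on to the next chunk" behaviour can be sketched with a toy model. Everything here is hypothetical (`demo_fs`, `demo_relocate_one`, `demo_relocate_all`, and the one-unit space accounting), and the single retry pass is an assumption suggested by the `retried` and `failed` variables in the function below rather than something this description states. The point is only that skipping a chunk that cannot be moved lets later removals of empty chunks free the room it needs.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_CHUNKS 8

/* Hypothetical model: a chunk is either empty or holds data to be moved. */
struct demo_fs {
	int free_units;			/* space available to receive moved data */
	bool has_data[DEMO_MAX_CHUNKS];	/* false: empty chunk, removal frees a unit */
	bool done[DEMO_MAX_CHUNKS];	/* chunk already relocated and removed */
};

/* Try to relocate one chunk; returns 0 on success or -ENOSPC. */
static int demo_relocate_one(struct demo_fs *fs, size_t i)
{
	if (fs->done[i])
		return 0;
	if (fs->has_data[i]) {
		if (fs->free_units < 1)
			return -ENOSPC;	/* nowhere to move the data yet */
		/* data moves into free space; removing the chunk frees it again */
	} else {
		fs->free_units++;	/* removing an empty chunk frees space */
	}
	fs->done[i] = true;
	return 0;
}

/* Walk the chunks from last to first, skipping the ones that hit -ENOSPC. */
int demo_relocate_all(struct demo_fs *fs, size_t nr_chunks)
{
	bool retried = false;
	int failed;
	size_t i;

	if (nr_chunks > DEMO_MAX_CHUNKS)
		return -EINVAL;
again:
	failed = 0;
	for (i = nr_chunks; i-- > 0; ) {
		int ret = demo_relocate_one(fs, i);

		if (ret == -ENOSPC)
			failed++;	/* skip this chunk and move on */
		else if (ret < 0)
			return ret;
	}
	if (failed && !retried) {
		retried = true;	/* earlier removals may have freed space */
		goto again;
	}
	return failed ? -ENOSPC : 0;
}
```

With one empty chunk and one full chunk and no free space, the first pass skips the full chunk and removes the empty one; the retry pass can then move the full chunk. With only full chunks, both passes fail and the caller sees -ENOSPC instead of a panic.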
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 16:11:19 -04:00
	bool retried = false;
	int failed = 0;
2008-11-17 21:11:30 -05:00
	int ret;
	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;
2009-09-11 16:11:19 -04:00
again:
2008-11-17 21:11:30 -05:00
	key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
	key.offset = (u64)-1;
	key.type = BTRFS_CHUNK_ITEM_KEY;
	while (1) {
2021-04-19 16:41:01 +09:00
		mutex_lock(&fs_info->reclaim_bgs_lock);
2008-11-17 21:11:30 -05:00
		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
2015-06-11 00:58:53 +01:00
		if (ret < 0) {
2021-04-19 16:41:01 +09:00
			mutex_unlock(&fs_info->reclaim_bgs_lock);
2008-11-17 21:11:30 -05:00
			goto error;
2015-06-11 00:58:53 +01:00
		}
2012-03-12 16:03:00 +01:00
		BUG_ON(ret == 0); /* Corruption */
2008-11-17 21:11:30 -05:00
		ret = btrfs_previous_item(chunk_root, path, key.objectid,
					  key.type);
2015-06-11 00:58:53 +01:00
		if (ret)
2021-04-19 16:41:01 +09:00
			mutex_unlock(&fs_info->reclaim_bgs_lock);
2008-11-17 21:11:30 -05:00
		if (ret < 0)
			goto error;
		if (ret > 0)
			break;
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keep the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a separate inode to hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handling for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 10:09:34 -04:00
2008-11-17 21:11:30 -05:00
		leaf = path->nodes[0];
		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
2008-09-26 10:09:34 -04:00
2008-11-17 21:11:30 -05:00
chunk = btrfs_item_ptr ( leaf , path - > slots [ 0 ] ,
struct btrfs_chunk ) ;
chunk_type = btrfs_chunk_type ( leaf , chunk ) ;
2011-04-21 01:20:15 +02:00
btrfs_release_path ( path ) ;
2008-04-25 16:53:30 -04:00
2008-11-17 21:11:30 -05:00
if ( chunk_type & BTRFS_BLOCK_GROUP_SYSTEM ) {
2016-06-22 18:54:23 -04:00
ret = btrfs_relocate_chunk ( fs_info , found_key . offset ) ;
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are movable, such as when we have a lot of empty metadata block groups,
and that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 16:11:19 -04:00
if ( ret = = - ENOSPC )
failed + + ;
2014-07-09 03:51:41 +05:30
else
BUG_ON ( ret ) ;
2008-11-17 21:11:30 -05:00
}
2021-04-19 16:41:01 +09:00
mutex_unlock ( & fs_info - > reclaim_bgs_lock ) ;
2008-04-25 16:53:30 -04:00
2008-11-17 21:11:30 -05:00
if ( found_key . offset = = 0 )
break ;
key . offset = found_key . offset - 1 ;
}
ret = 0 ;
2009-09-11 16:11:19 -04:00
if ( failed & & ! retried ) {
failed = 0 ;
retried = true ;
goto again ;
2013-10-31 10:30:08 +05:30
} else if ( WARN_ON ( failed & & retried ) ) {
2009-09-11 16:11:19 -04:00
ret = - ENOSPC ;
}
2008-11-17 21:11:30 -05:00
error :
btrfs_free_path ( path ) ;
return ret ;
2008-04-25 16:53:30 -04:00
}
2017-11-15 16:28:11 -07:00
/*
* return 1 : allocate a data chunk successfully ,
* return < 0 : errors during allocating a data chunk ,
* return 0 : no need to allocate a data chunk .
*/
static int btrfs_may_alloc_data_chunk ( struct btrfs_fs_info * fs_info ,
u64 chunk_offset )
{
2019-10-29 19:20:18 +01:00
struct btrfs_block_group * cache ;
2017-11-15 16:28:11 -07:00
u64 bytes_used ;
u64 chunk_type ;
cache = btrfs_lookup_block_group ( fs_info , chunk_offset ) ;
ASSERT ( cache ) ;
chunk_type = cache - > flags ;
btrfs_put_block_group ( cache ) ;
2019-10-18 11:58:22 +02:00
if ( ! ( chunk_type & BTRFS_BLOCK_GROUP_DATA ) )
return 0 ;
spin_lock ( & fs_info - > data_sinfo - > lock ) ;
bytes_used = fs_info - > data_sinfo - > bytes_used ;
spin_unlock ( & fs_info - > data_sinfo - > lock ) ;
if ( ! bytes_used ) {
struct btrfs_trans_handle * trans ;
int ret ;
trans = btrfs_join_transaction ( fs_info - > tree_root ) ;
if ( IS_ERR ( trans ) )
return PTR_ERR ( trans ) ;
ret = btrfs_force_chunk_alloc ( trans , BTRFS_BLOCK_GROUP_DATA ) ;
btrfs_end_transaction ( trans ) ;
if ( ret < 0 )
return ret ;
return 1 ;
2017-11-15 16:28:11 -07:00
}
2019-10-18 11:58:22 +02:00
2017-11-15 16:28:11 -07:00
return 0 ;
}
2016-06-21 21:16:51 -04:00
static int insert_balance_item ( struct btrfs_fs_info * fs_info ,
2012-01-16 22:04:48 +02:00
struct btrfs_balance_control * bctl )
{
2016-06-21 21:16:51 -04:00
struct btrfs_root * root = fs_info - > tree_root ;
2012-01-16 22:04:48 +02:00
struct btrfs_trans_handle * trans ;
struct btrfs_balance_item * item ;
struct btrfs_disk_balance_args disk_bargs ;
struct btrfs_path * path ;
struct extent_buffer * leaf ;
struct btrfs_key key ;
int ret , err ;
path = btrfs_alloc_path ( ) ;
if ( ! path )
return - ENOMEM ;
trans = btrfs_start_transaction ( root , 0 ) ;
if ( IS_ERR ( trans ) ) {
btrfs_free_path ( path ) ;
return PTR_ERR ( trans ) ;
}
key . objectid = BTRFS_BALANCE_OBJECTID ;
2016-01-25 17:51:31 +01:00
key . type = BTRFS_TEMPORARY_ITEM_KEY ;
2012-01-16 22:04:48 +02:00
key . offset = 0 ;
ret = btrfs_insert_empty_item ( trans , root , path , & key ,
sizeof ( * item ) ) ;
if ( ret )
goto out ;
leaf = path - > nodes [ 0 ] ;
item = btrfs_item_ptr ( leaf , path - > slots [ 0 ] , struct btrfs_balance_item ) ;
2016-11-08 18:09:03 +01:00
memzero_extent_buffer ( leaf , ( unsigned long ) item , sizeof ( * item ) ) ;
2012-01-16 22:04:48 +02:00
btrfs_cpu_balance_args_to_disk ( & disk_bargs , & bctl - > data ) ;
btrfs_set_balance_data ( leaf , item , & disk_bargs ) ;
btrfs_cpu_balance_args_to_disk ( & disk_bargs , & bctl - > meta ) ;
btrfs_set_balance_meta ( leaf , item , & disk_bargs ) ;
btrfs_cpu_balance_args_to_disk ( & disk_bargs , & bctl - > sys ) ;
btrfs_set_balance_sys ( leaf , item , & disk_bargs ) ;
btrfs_set_balance_flags ( leaf , item , bctl - > flags ) ;
btrfs_mark_buffer_dirty ( leaf ) ;
out :
btrfs_free_path ( path ) ;
2016-09-09 21:39:03 -04:00
err = btrfs_commit_transaction ( trans ) ;
2012-01-16 22:04:48 +02:00
if ( err & & ! ret )
ret = err ;
return ret ;
}
2016-06-21 21:16:51 -04:00
static int del_balance_item ( struct btrfs_fs_info * fs_info )
2012-01-16 22:04:48 +02:00
{
2016-06-21 21:16:51 -04:00
struct btrfs_root * root = fs_info - > tree_root ;
2012-01-16 22:04:48 +02:00
struct btrfs_trans_handle * trans ;
struct btrfs_path * path ;
struct btrfs_key key ;
int ret , err ;
path = btrfs_alloc_path ( ) ;
if ( ! path )
return - ENOMEM ;
btrfs: allow use of global block reserve for balance item deletion
On a filesystem with exhausted metadata, but still enough to start
balance, it's possible to hit this error:
[324402.053842] BTRFS info (device loop0): 1 enospc errors during balance
[324402.060769] BTRFS info (device loop0): balance: ended with status: -28
[324402.172295] BTRFS: error (device loop0) in reset_balance_state:3321: errno=-28 No space left
It fails inside reset_balance_state and turns the filesystem to
read-only, which is unnecessary and should be fixed too, but the problem
is caused by lack for space when the balance item is deleted. This is a
one-time operation and from the same rank as unlink that is allowed to
use the global block reserve. So do the same for the balance item.
Status of the filesystem (100GiB) just after the balance fails:
$ btrfs fi df mnt
Data, single: total=80.01GiB, used=38.58GiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=19.99GiB, used=19.48GiB
GlobalReserve, single: total=512.00MiB, used=50.11MiB
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-25 12:35:28 +02:00
trans = btrfs_start_transaction_fallback_global_rsv ( root , 0 ) ;
2012-01-16 22:04:48 +02:00
if ( IS_ERR ( trans ) ) {
btrfs_free_path ( path ) ;
return PTR_ERR ( trans ) ;
}
key . objectid = BTRFS_BALANCE_OBJECTID ;
2016-01-25 17:51:31 +01:00
key . type = BTRFS_TEMPORARY_ITEM_KEY ;
2012-01-16 22:04:48 +02:00
key . offset = 0 ;
ret = btrfs_search_slot ( trans , root , & key , path , - 1 , 1 ) ;
if ( ret < 0 )
goto out ;
if ( ret > 0 ) {
ret = - ENOENT ;
goto out ;
}
ret = btrfs_del_item ( trans , root , path ) ;
out :
btrfs_free_path ( path ) ;
2016-09-09 21:39:03 -04:00
err = btrfs_commit_transaction ( trans ) ;
2012-01-16 22:04:48 +02:00
if ( err & & ! ret )
ret = err ;
return ret ;
}
2012-01-16 22:04:48 +02:00
/*
* This is a heuristic used to reduce the number of chunks balanced on
* resume after balance was interrupted .
*/
static void update_balance_args ( struct btrfs_balance_control * bctl )
{
/*
* Turn on soft mode for chunk types that were being converted .
*/
if ( bctl - > data . flags & BTRFS_BALANCE_ARGS_CONVERT )
bctl - > data . flags | = BTRFS_BALANCE_ARGS_SOFT ;
if ( bctl - > sys . flags & BTRFS_BALANCE_ARGS_CONVERT )
bctl - > sys . flags | = BTRFS_BALANCE_ARGS_SOFT ;
if ( bctl - > meta . flags & BTRFS_BALANCE_ARGS_CONVERT )
bctl - > meta . flags | = BTRFS_BALANCE_ARGS_SOFT ;
/*
* Turn on usage filter if it is not already used . The idea is
* that chunks that we have already balanced should be
* reasonably full . Don ' t do it for chunks that are being
* converted - that will keep us from relocating unconverted
* ( albeit full ) chunks .
*/
if ( ! ( bctl - > data . flags & BTRFS_BALANCE_ARGS_USAGE ) & &
2015-10-20 18:22:13 +02:00
! ( bctl - > data . flags & BTRFS_BALANCE_ARGS_USAGE_RANGE ) & &
2012-01-16 22:04:48 +02:00
! ( bctl - > data . flags & BTRFS_BALANCE_ARGS_CONVERT ) ) {
bctl - > data . flags | = BTRFS_BALANCE_ARGS_USAGE ;
bctl - > data . usage = 90 ;
}
if ( ! ( bctl - > sys . flags & BTRFS_BALANCE_ARGS_USAGE ) & &
2015-10-20 18:22:13 +02:00
! ( bctl - > sys . flags & BTRFS_BALANCE_ARGS_USAGE_RANGE ) & &
2012-01-16 22:04:48 +02:00
! ( bctl - > sys . flags & BTRFS_BALANCE_ARGS_CONVERT ) ) {
bctl - > sys . flags | = BTRFS_BALANCE_ARGS_USAGE ;
bctl - > sys . usage = 90 ;
}
if ( ! ( bctl - > meta . flags & BTRFS_BALANCE_ARGS_USAGE ) & &
2015-10-20 18:22:13 +02:00
! ( bctl - > meta . flags & BTRFS_BALANCE_ARGS_USAGE_RANGE ) & &
2012-01-16 22:04:48 +02:00
! ( bctl - > meta . flags & BTRFS_BALANCE_ARGS_CONVERT ) ) {
bctl - > meta . flags | = BTRFS_BALANCE_ARGS_USAGE ;
bctl - > meta . usage = 90 ;
}
}
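The resume heuristic above can be sketched in userspace as follows. This is a minimal illustration, not the kernel code: the `ARGS_*` bit values, `struct args`, and `resume_adjust` are hypothetical stand-ins for the `BTRFS_BALANCE_ARGS_*` flags, `struct btrfs_balance_args`, and the per-type logic in `update_balance_args`.

```c
#include <stdint.h>

/* Illustrative flag bits; assumed values, standing in for BTRFS_BALANCE_ARGS_*. */
#define ARGS_USAGE       (1ULL << 1)
#define ARGS_CONVERT     (1ULL << 8)
#define ARGS_SOFT        (1ULL << 9)
#define ARGS_USAGE_RANGE (1ULL << 10)

/* Hypothetical, reduced version of the per-type balance arguments. */
struct args {
	uint64_t flags;
	uint64_t usage;
};

/*
 * Apply the resume heuristic to one chunk type:
 *  - a conversion in progress is switched to soft mode, so chunks that
 *    already have the target profile are skipped;
 *  - otherwise, if no usage filter was requested, a 90% usage filter is
 *    added, on the assumption that already-balanced chunks are nearly full.
 */
static void resume_adjust(struct args *a)
{
	if (a->flags & ARGS_CONVERT)
		a->flags |= ARGS_SOFT;

	if (!(a->flags & (ARGS_USAGE | ARGS_USAGE_RANGE | ARGS_CONVERT))) {
		a->flags |= ARGS_USAGE;
		a->usage = 90;
	}
}
```

A user-supplied usage filter is left untouched, and a converting balance never gets the implicit usage filter, which matches the comment about not skipping unconverted but full chunks.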
2018-03-20 20:23:09 +01:00
/*
* Clear the balance status in fs_info and delete the balance item from disk .
*/
static void reset_balance_state ( struct btrfs_fs_info * fs_info )
2012-01-16 22:04:47 +02:00
{
struct btrfs_balance_control * bctl = fs_info - > balance_ctl ;
2018-03-20 20:23:09 +01:00
int ret ;
2012-01-16 22:04:47 +02:00
BUG_ON ( ! fs_info - > balance_ctl ) ;
spin_lock ( & fs_info - > balance_lock ) ;
fs_info - > balance_ctl = NULL ;
spin_unlock ( & fs_info - > balance_lock ) ;
kfree ( bctl ) ;
2018-03-20 20:23:09 +01:00
ret = del_balance_item ( fs_info ) ;
if ( ret )
btrfs_handle_fs_error ( fs_info , ret , NULL ) ;
2012-01-16 22:04:47 +02:00
}
2012-01-16 22:04:47 +02:00
/*
* Balance filters . Return 1 if chunk should be filtered out
* ( should not be balanced ) .
*/
2012-03-27 17:09:16 +03:00
static int chunk_profiles_filter ( u64 chunk_type ,
2012-01-16 22:04:47 +02:00
struct btrfs_balance_args * bargs )
{
2012-03-27 17:09:16 +03:00
chunk_type = chunk_to_extended ( chunk_type ) &
BTRFS_EXTENDED_PROFILE_MASK ;
2012-01-16 22:04:47 +02:00
2012-03-27 17:09:16 +03:00
if ( bargs - > profiles & chunk_type )
2012-01-16 22:04:47 +02:00
return 0 ;
return 1 ;
}
2015-11-17 12:29:32 +01:00
static int chunk_usage_range_filter ( struct btrfs_fs_info * fs_info , u64 chunk_offset ,
2012-01-16 22:04:47 +02:00
struct btrfs_balance_args * bargs )
2015-10-20 18:22:13 +02:00
{
2019-10-29 19:20:18 +01:00
struct btrfs_block_group * cache ;
2015-10-20 18:22:13 +02:00
u64 chunk_used ;
u64 user_thresh_min ;
u64 user_thresh_max ;
int ret = 1 ;
cache = btrfs_lookup_block_group ( fs_info , chunk_offset ) ;
2019-10-23 18:48:11 +02:00
chunk_used = cache - > used ;
2015-10-20 18:22:13 +02:00
if ( bargs - > usage_min = = 0 )
user_thresh_min = 0 ;
else
2019-10-23 18:48:22 +02:00
user_thresh_min = div_factor_fine ( cache - > length ,
bargs - > usage_min ) ;
2015-10-20 18:22:13 +02:00
if ( bargs - > usage_max = = 0 )
user_thresh_max = 1 ;
else if ( bargs - > usage_max > 100 )
2019-10-23 18:48:22 +02:00
user_thresh_max = cache - > length ;
2015-10-20 18:22:13 +02:00
else
2019-10-23 18:48:22 +02:00
user_thresh_max = div_factor_fine ( cache - > length ,
bargs - > usage_max ) ;
2015-10-20 18:22:13 +02:00
if ( user_thresh_min < = chunk_used & & chunk_used < user_thresh_max )
ret = 0 ;
btrfs_put_block_group ( cache ) ;
return ret ;
}
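The threshold arithmetic in the usage range filter can be tried out in userspace. The sketch below assumes that `div_factor_fine` computes "factor percent of num"; `percent_of` and `usage_range_filter` are hypothetical names, and the block group lookup is replaced by plain `length` and `used` parameters.

```c
#include <stdint.h>

/* Userspace stand-in (assumption) for the kernel helper: factor percent of num. */
static uint64_t percent_of(uint64_t num, uint32_t factor)
{
	if (factor == 100)
		return num;
	return num * factor / 100;
}

/*
 * Return 0 when the chunk's used bytes fall inside [min, max) and the
 * chunk should be balanced, 1 when it should be filtered out.
 */
static int usage_range_filter(uint64_t length, uint64_t used,
			      uint32_t usage_min, uint32_t usage_max)
{
	uint64_t lo = usage_min == 0 ? 0 : percent_of(length, usage_min);
	uint64_t hi;

	if (usage_max == 0)
		hi = 1;			/* only completely empty chunks match */
	else if (usage_max > 100)
		hi = length;		/* clamp to the chunk size */
	else
		hi = percent_of(length, usage_max);

	return (lo <= used && used < hi) ? 0 : 1;
}
```

Because the upper bound is exclusive, a 1 GiB chunk with exactly 512 MiB used is filtered out by a range of 10..50, while one byte less passes.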
2015-11-17 12:29:32 +01:00
static int chunk_usage_filter ( struct btrfs_fs_info * fs_info ,
2015-10-20 18:22:13 +02:00
u64 chunk_offset , struct btrfs_balance_args * bargs )
2012-01-16 22:04:47 +02:00
{
2019-10-29 19:20:18 +01:00
struct btrfs_block_group * cache ;
2012-01-16 22:04:47 +02:00
u64 chunk_used , user_thresh ;
int ret = 1 ;
cache = btrfs_lookup_block_group ( fs_info , chunk_offset ) ;
2019-10-23 18:48:11 +02:00
chunk_used = cache - > used ;
2012-01-16 22:04:47 +02:00
2015-10-20 18:22:13 +02:00
if ( bargs - > usage_min = = 0 )
2013-02-12 16:28:59 +00:00
user_thresh = 1 ;
2013-01-21 15:15:56 +02:00
else if ( bargs - > usage > 100 )
2019-10-23 18:48:22 +02:00
user_thresh = cache - > length ;
2013-01-21 15:15:56 +02:00
else
2019-10-23 18:48:22 +02:00
user_thresh = div_factor_fine ( cache - > length , bargs - > usage ) ;
2013-01-21 15:15:56 +02:00
2012-01-16 22:04:47 +02:00
if ( chunk_used < user_thresh )
ret = 0 ;
btrfs_put_block_group ( cache ) ;
return ret ;
}
2012-01-16 22:04:47 +02:00
static int chunk_devid_filter ( struct extent_buffer * leaf ,
struct btrfs_chunk * chunk ,
struct btrfs_balance_args * bargs )
{
struct btrfs_stripe * stripe ;
int num_stripes = btrfs_chunk_num_stripes ( leaf , chunk ) ;
int i ;
for ( i = 0 ; i < num_stripes ; i + + ) {
stripe = btrfs_stripe_nr ( chunk , i ) ;
if ( btrfs_stripe_devid ( leaf , stripe ) = = bargs - > devid )
return 0 ;
}
return 1 ;
}
2019-05-17 11:43:34 +02:00
static u64 calc_data_stripes ( u64 type , int num_stripes )
{
const int index = btrfs_bg_flags_to_raid_index ( type ) ;
const int ncopies = btrfs_raid_array [ index ] . ncopies ;
const int nparity = btrfs_raid_array [ index ] . nparity ;
2021-07-26 14:15:24 +02:00
return ( num_stripes - nparity ) / ncopies ;
2019-05-17 11:43:34 +02:00
}
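A worked example of the data stripe calculation, as a userspace sketch. The `enum profile` and `raid_table` below are hypothetical simplifications of `btrfs_raid_array`; the copy and parity counts are the commonly documented ones for each profile and are assumptions here, not read from the kernel table.

```c
#include <assert.h>

/* Hypothetical profile index, standing in for the kernel's raid index. */
enum profile {
	PROFILE_SINGLE,
	PROFILE_RAID0,
	PROFILE_RAID1,
	PROFILE_RAID10,
	PROFILE_RAID5,
	PROFILE_RAID6,
	PROFILE_COUNT
};

struct raid_attr {
	int ncopies;	/* number of copies of each data block */
	int nparity;	/* number of parity stripes */
};

/* Assumed attribute values for illustration. */
static const struct raid_attr raid_table[PROFILE_COUNT] = {
	[PROFILE_SINGLE] = { 1, 0 },
	[PROFILE_RAID0]  = { 1, 0 },
	[PROFILE_RAID1]  = { 2, 0 },
	[PROFILE_RAID10] = { 2, 0 },
	[PROFILE_RAID5]  = { 1, 1 },
	[PROFILE_RAID6]  = { 1, 2 },
};

/* Stripes that carry distinct data: drop parity, then divide by copies. */
static int data_stripes(enum profile p, int num_stripes)
{
	assert((int)p >= 0 && (int)p < PROFILE_COUNT);
	return (num_stripes - raid_table[p].nparity) / raid_table[p].ncopies;
}
```

For instance, a RAID6 chunk across six devices has four data stripes, and a RAID10 chunk across four devices has two. The drange filter divides the chunk length by this factor to get the length of one device stripe.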
2012-01-16 22:04:48 +02:00
/* [pstart, pend) */
static int chunk_drange_filter ( struct extent_buffer * leaf ,
struct btrfs_chunk * chunk ,
struct btrfs_balance_args * bargs )
{
struct btrfs_stripe * stripe ;
int num_stripes = btrfs_chunk_num_stripes ( leaf , chunk ) ;
u64 stripe_offset ;
u64 stripe_length ;
2019-05-17 11:43:34 +02:00
u64 type ;
2012-01-16 22:04:48 +02:00
int factor ;
int i ;
if ( ! ( bargs - > flags & BTRFS_BALANCE_ARGS_DEVID ) )
return 0 ;
2019-05-17 11:43:34 +02:00
type = btrfs_chunk_type ( leaf , chunk ) ;
factor = calc_data_stripes ( type , num_stripes ) ;
2012-01-16 22:04:48 +02:00
for ( i = 0 ; i < num_stripes ; i + + ) {
stripe = btrfs_stripe_nr ( chunk , i ) ;
if ( btrfs_stripe_devid ( leaf , stripe ) ! = bargs - > devid )
continue ;
stripe_offset = btrfs_stripe_offset ( leaf , stripe ) ;
stripe_length = btrfs_chunk_length ( leaf , chunk ) ;
2015-01-16 17:26:13 +01:00
stripe_length = div_u64 ( stripe_length , factor ) ;
2012-01-16 22:04:48 +02:00
if ( stripe_offset < bargs - > pend & &
stripe_offset + stripe_length > bargs - > pstart )
return 0 ;
}
return 1 ;
}
2012-01-16 22:04:48 +02:00
/* [vstart, vend) */
static int chunk_vrange_filter ( struct extent_buffer * leaf ,
struct btrfs_chunk * chunk ,
u64 chunk_offset ,
struct btrfs_balance_args * bargs )
{
if ( chunk_offset < bargs - > vend & &
chunk_offset + btrfs_chunk_length ( leaf , chunk ) > bargs - > vstart )
/* at least part of the chunk is inside this vrange */
return 0 ;
return 1 ;
}
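Both the drange and vrange filters use the same half-open interval test: an extent starting at `start` with length `len` intersects `[rstart, rend)` when it begins before the range ends and ends after the range begins. A minimal sketch, with the hypothetical helper name `ranges_overlap`:

```c
#include <stdint.h>

/*
 * Return 1 if [start, start + len) intersects the half-open range
 * [rstart, rend), 0 otherwise. Touching endpoints do not count as overlap.
 */
static int ranges_overlap(uint64_t start, uint64_t len,
			  uint64_t rstart, uint64_t rend)
{
	return start < rend && start + len > rstart;
}
```

Note the inverted convention in the filters themselves: they return 0 when the ranges overlap, meaning the chunk is not filtered out.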
2015-09-28 22:32:41 +00:00
static int chunk_stripes_range_filter ( struct extent_buffer * leaf ,
struct btrfs_chunk * chunk ,
struct btrfs_balance_args * bargs )
{
int num_stripes = btrfs_chunk_num_stripes ( leaf , chunk ) ;
if ( bargs - > stripes_min < = num_stripes
& & num_stripes < = bargs - > stripes_max )
return 0 ;
return 1 ;
}
2012-03-27 17:09:16 +03:00
static int chunk_soft_convert_filter ( u64 chunk_type ,
2012-01-16 22:04:48 +02:00
struct btrfs_balance_args * bargs )
{
if ( ! ( bargs - > flags & BTRFS_BALANCE_ARGS_CONVERT ) )
return 0 ;
2012-03-27 17:09:16 +03:00
chunk_type = chunk_to_extended ( chunk_type ) &
BTRFS_EXTENDED_PROFILE_MASK ;
2012-01-16 22:04:48 +02:00
2012-03-27 17:09:16 +03:00
if ( bargs - > target = = chunk_type )
2012-01-16 22:04:48 +02:00
return 1 ;
return 0 ;
}
2019-03-20 16:38:52 +01:00
static int should_balance_chunk ( struct extent_buffer * leaf ,
2012-01-16 22:04:47 +02:00
struct btrfs_chunk * chunk , u64 chunk_offset )
{
2019-03-20 16:38:52 +01:00
struct btrfs_fs_info * fs_info = leaf - > fs_info ;
2016-06-22 18:54:23 -04:00
struct btrfs_balance_control * bctl = fs_info - > balance_ctl ;
2012-01-16 22:04:47 +02:00
struct btrfs_balance_args * bargs = NULL ;
u64 chunk_type = btrfs_chunk_type ( leaf , chunk ) ;
/* type filter */
if ( ! ( ( chunk_type & BTRFS_BLOCK_GROUP_TYPE_MASK ) &
( bctl - > flags & BTRFS_BALANCE_TYPE_MASK ) ) ) {
return 0 ;
}
if ( chunk_type & BTRFS_BLOCK_GROUP_DATA )
bargs = & bctl - > data ;
else if ( chunk_type & BTRFS_BLOCK_GROUP_SYSTEM )
bargs = & bctl - > sys ;
else if ( chunk_type & BTRFS_BLOCK_GROUP_METADATA )
bargs = & bctl - > meta ;
2012-01-16 22:04:47 +02:00
/* profiles filter */
if ( ( bargs - > flags & BTRFS_BALANCE_ARGS_PROFILES ) & &
chunk_profiles_filter ( chunk_type , bargs ) ) {
return 0 ;
2012-01-16 22:04:47 +02:00
}
/* usage filter */
if ( ( bargs - > flags & BTRFS_BALANCE_ARGS_USAGE ) & &
2016-06-22 18:54:23 -04:00
chunk_usage_filter ( fs_info , chunk_offset , bargs ) ) {
2012-01-16 22:04:47 +02:00
return 0 ;
2015-10-20 18:22:13 +02:00
} else if ( ( bargs - > flags & BTRFS_BALANCE_ARGS_USAGE_RANGE ) & &
2016-06-22 18:54:23 -04:00
chunk_usage_range_filter ( fs_info , chunk_offset , bargs ) ) {
2015-10-20 18:22:13 +02:00
return 0 ;
2012-01-16 22:04:47 +02:00
}
/* devid filter */
if ( ( bargs - > flags & BTRFS_BALANCE_ARGS_DEVID ) & &
chunk_devid_filter ( leaf , chunk , bargs ) ) {
return 0 ;
2012-01-16 22:04:48 +02:00
}
/* drange filter, makes sense only with devid filter */
if ( ( bargs - > flags & BTRFS_BALANCE_ARGS_DRANGE ) & &
2017-07-19 10:48:42 +03:00
chunk_drange_filter ( leaf , chunk , bargs ) ) {
2012-01-16 22:04:48 +02:00
return 0 ;
2012-01-16 22:04:48 +02:00
}
/* vrange filter */
if ( ( bargs - > flags & BTRFS_BALANCE_ARGS_VRANGE ) & &
chunk_vrange_filter ( leaf , chunk , chunk_offset , bargs ) ) {
return 0 ;
2012-01-16 22:04:47 +02:00
}
2015-09-28 22:32:41 +00:00
/* stripes filter */
if ( ( bargs - > flags & BTRFS_BALANCE_ARGS_STRIPES_RANGE ) & &
chunk_stripes_range_filter ( leaf , chunk , bargs ) ) {
return 0 ;
}
2012-01-16 22:04:48 +02:00
/* soft profile changing mode */
if ( ( bargs - > flags & BTRFS_BALANCE_ARGS_SOFT ) & &
chunk_soft_convert_filter ( chunk_type , bargs ) ) {
return 0 ;
}
2014-05-07 17:37:51 +02:00
/*
* limited by count , must be the last filter
*/
if ( ( bargs - > flags & BTRFS_BALANCE_ARGS_LIMIT ) ) {
if ( bargs - > limit = = 0 )
return 0 ;
else
bargs - > limit - - ;
2015-10-10 17:16:50 +02:00
} else if ( ( bargs - > flags & BTRFS_BALANCE_ARGS_LIMIT_RANGE ) ) {
/*
* Same logic as the ' limit ' filter ; the minimum cannot be
2016-05-19 21:18:45 -04:00
* determined here because we do not have the global information
2015-10-10 17:16:50 +02:00
* about the count of all chunks that satisfy the filters .
*/
if ( bargs - > limit_max = = 0 )
return 0 ;
else
bargs - > limit_max - - ;
2014-05-07 17:37:51 +02:00
}
2012-01-16 22:04:47 +02:00
return 1 ;
}
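The count-based limit at the end of the filter chain behaves like a countdown that is only decremented for chunks that passed every earlier filter, which is why it has to come last. A reduced sketch, where `struct limit_args` and `limit_filter` are hypothetical names and the return value follows the should-balance convention (1 means balance the chunk):

```c
#include <stdint.h>

/* Hypothetical, reduced arguments: only the remaining chunk budget. */
struct limit_args {
	uint64_t limit;
};

/*
 * Return 1 and consume one unit of budget while any budget remains,
 * 0 once the budget is exhausted.
 */
static int limit_filter(struct limit_args *a)
{
	if (a->limit == 0)
		return 0;
	a->limit--;
	return 1;
}
```

Since the counter is mutated on every accepted chunk, a caller that makes a separate counting pass over the chunks has to restore the original value before the real pass, as the balance loop does with its saved limit values.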
2012-01-16 22:04:47 +02:00
static int __btrfs_balance ( struct btrfs_fs_info * fs_info )
2008-04-28 15:29:52 -04:00
{
2012-01-16 22:04:49 +02:00
struct btrfs_balance_control * bctl = fs_info - > balance_ctl ;
2012-01-16 22:04:47 +02:00
struct btrfs_root * chunk_root = fs_info - > chunk_root ;
2015-10-10 17:16:50 +02:00
u64 chunk_type ;
2012-01-16 22:04:47 +02:00
struct btrfs_chunk * chunk ;
2016-07-12 11:24:21 -07:00
struct btrfs_path * path = NULL ;
2008-04-28 15:29:52 -04:00
struct btrfs_key key ;
struct btrfs_key found_key ;
2012-01-16 22:04:47 +02:00
struct extent_buffer * leaf ;
int slot ;
2012-01-16 22:04:47 +02:00
int ret ;
int enospc_errors = 0 ;
2012-01-16 22:04:49 +02:00
bool counting = true ;
2015-10-10 17:16:50 +02:00
/* The single value limit and min/max limits use the same bytes in the balance args union */
2014-05-07 17:37:51 +02:00
u64 limit_data = bctl - > data . limit ;
u64 limit_meta = bctl - > meta . limit ;
u64 limit_sys = bctl - > sys . limit ;
2015-10-10 17:16:50 +02:00
u32 count_data = 0 ;
u32 count_meta = 0 ;
u32 count_sys = 0 ;
btrfs: Fix lost-data-profile caused by balance bg
Reproduce:
(In integration-4.3 branch)
TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp
umount "$TEST_DEV" >/dev/null
mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
btrfs balance start -dusage=0 $TEST_DIR
btrfs filesystem usage $TEST_DIR
dd if=/dev/zero of="$TEST_DIR"/file count=100
btrfs filesystem usage $TEST_DIR
Result:
We can see "no data chunk" in first "btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 1.07GiB
And "data chunks changed from raid1 to single" in second
"btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Data,single: Size:256.00MiB, Used:0.00B
/dev/vdh 256.00MiB
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 841.92MiB
Reason:
btrfs balance deletes the last data chunk when there is no data in
the filesystem, so the "fi usage" command then shows no data chunk.
When we later write to the fs, the only available data
profile is 0x0, so all new chunks are allocated with the single profile.
Fix:
Allocate a data chunk explicitly to ensure we don't lose the
raid profile for data.
Test:
Test by above script, and confirmed the logic by debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-09 11:51:32 +08:00
int chunk_reserved = 0 ;
2008-04-28 15:29:52 -04:00
path = btrfs_alloc_path ( ) ;
2011-07-12 11:10:23 -07:00
if ( ! path ) {
ret = - ENOMEM ;
goto error ;
}
2012-01-16 22:04:49 +02:00
/* zero out stat counters */
spin_lock ( & fs_info - > balance_lock ) ;
memset ( & bctl - > stat , 0 , sizeof ( bctl - > stat ) ) ;
spin_unlock ( & fs_info - > balance_lock ) ;
again :
2014-05-07 17:37:51 +02:00
if ( ! counting ) {
2015-10-10 17:16:50 +02:00
/*
* The single value limit and min / max limits use the same bytes
* in the balance args union
*/
2014-05-07 17:37:51 +02:00
bctl - > data . limit = limit_data ;
bctl - > meta . limit = limit_meta ;
bctl - > sys . limit = limit_sys ;
}
2008-04-28 15:29:52 -04:00
key . objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID ;
key . offset = ( u64 ) - 1 ;
key . type = BTRFS_CHUNK_ITEM_KEY ;
2009-01-05 21:25:51 -05:00
while ( 1 ) {
2012-01-16 22:04:49 +02:00
if ( ( ! counting & & atomic_read ( & fs_info - > balance_pause_req ) ) | |
2012-01-16 22:04:49 +02:00
atomic_read ( & fs_info - > balance_cancel_req ) ) {
2012-01-16 22:04:49 +02:00
ret = - ECANCELED ;
goto error ;
}
2021-04-19 16:41:01 +09:00
mutex_lock ( & fs_info - > reclaim_bgs_lock ) ;
2008-04-28 15:29:52 -04:00
ret = btrfs_search_slot ( NULL , chunk_root , & key , path , 0 , 0 ) ;
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this is:
        CPU 0                                     CPU 1
btrfs_balance()
  btrfs_relocate_chunk()
    btrfs_relocate_block_group(bg X)
      btrfs_lookup_block_group(bg X)
                                                  cleaner_kthread
                                                    locks fs_info->cleaner_mutex
                                                    btrfs_delete_unused_bgs()
                                                      finds bg X, which became
                                                      unused in the previous
                                                      transaction
                                                      checks bg X ->ro == 0,
                                                      so it proceeds
      sets bg X ->ro to 1
      (btrfs_set_block_group_ro(bg X))
      blocks on fs_info->cleaner_mutex
                                                      btrfs_remove_chunk(bg X)
                                                    unlocks fs_info->cleaner_mutex
      acquires fs_info->cleaner_mutex
      relocate_block_group()
        --> does nothing, no extents found in
            the extent tree from bg X
      unlocks fs_info->cleaner_mutex
      btrfs_relocate_block_group(bg X) returns
    btrfs_remove_chunk(bg X)
      extent map not found
        --> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 00:58:53 +01:00
if ( ret < 0 ) {
2021-04-19 16:41:01 +09:00
mutex_unlock ( & fs_info - > reclaim_bgs_lock ) ;
2008-04-28 15:29:52 -04:00
goto error ;
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this is:

        CPU 0                                 CPU 1

btrfs_balance()
  btrfs_relocate_chunk()
    btrfs_relocate_block_group(bg X)
      btrfs_lookup_block_group(bg X)

                                      cleaner_kthread
                                        locks fs_info->cleaner_mutex

                                        btrfs_delete_unused_bgs()
                                          finds bg X, which became
                                          unused in the previous
                                          transaction

                                          checks bg X ->ro == 0,
                                          so it proceeds

      sets bg X ->ro to 1
      (btrfs_set_block_group_ro(bg X))

      blocks on fs_info->cleaner_mutex
                                          btrfs_remove_chunk(bg X)

                                        unlocks fs_info->cleaner_mutex

      acquires fs_info->cleaner_mutex
      relocate_block_group()
        --> does nothing, no extents found in
            the extent tree from bg X
      unlocks fs_info->cleaner_mutex

    btrfs_relocate_block_group(bg X) returns

    btrfs_remove_chunk(bg X)
      extent map not found
        --> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
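The serialization described above can be sketched in user space. This is a toy model only, not the kernel code: struct toy_fs, toy_remove_chunk(), toy_delete_unused() and toy_relocate() are hypothetical names, and a pthread mutex stands in for the new kernel mutex. The point it illustrates is that the lookup and the removal sit in one critical section, so whichever side runs second sees the chunk is already gone instead of tripping an assertion.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model (hypothetical): a filesystem with a single chunk. */
struct toy_fs {
	pthread_mutex_t reclaim_lock; /* stands in for the serializing mutex */
	bool chunk_exists;            /* stands in for the chunk's extent map */
	int removals;                 /* how many times the chunk was removed */
};

/* Remove the chunk. The caller must hold reclaim_lock. */
static int toy_remove_chunk(struct toy_fs *fs)
{
	assert(fs != NULL);
	if (!fs->chunk_exists)
		return -ENOENT; /* the unserialized code hit ASSERT(0) here */
	fs->chunk_exists = false;
	fs->removals++;
	return 0;
}

/* Cleaner side: delete the unused block group under the lock. */
static int toy_delete_unused(struct toy_fs *fs)
{
	int ret;

	pthread_mutex_lock(&fs->reclaim_lock);
	ret = toy_remove_chunk(fs);
	pthread_mutex_unlock(&fs->reclaim_lock);
	return ret;
}

/* Balance side: look up and relocate inside one critical section. */
static int toy_relocate(struct toy_fs *fs)
{
	int ret = 0;

	pthread_mutex_lock(&fs->reclaim_lock);
	if (fs->chunk_exists) /* the lookup happens under the lock */
		ret = toy_remove_chunk(fs);
	pthread_mutex_unlock(&fs->reclaim_lock);
	return ret;
}
```

If the cleaner side runs first, toy_relocate() finds nothing to do and returns 0; the chunk is removed exactly once in either order.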
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 00:58:53 +01:00
}
2008-04-28 15:29:52 -04:00
/*
* this shouldn ' t happen , it means the last relocate
* failed
*/
if ( ret = = 0 )
2012-01-16 22:04:47 +02:00
BUG ( ) ; /* FIXME break ? */
2008-04-28 15:29:52 -04:00
ret = btrfs_previous_item ( chunk_root , path , 0 ,
BTRFS_CHUNK_ITEM_KEY ) ;
2012-01-16 22:04:47 +02:00
if ( ret ) {
2021-04-19 16:41:01 +09:00
mutex_unlock ( & fs_info - > reclaim_bgs_lock ) ;
2012-01-16 22:04:47 +02:00
ret = 0 ;
2008-04-28 15:29:52 -04:00
break ;
2012-01-16 22:04:47 +02:00
}
2008-07-08 14:19:17 -04:00
2012-01-16 22:04:47 +02:00
leaf = path - > nodes [ 0 ] ;
slot = path - > slots [ 0 ] ;
btrfs_item_key_to_cpu ( leaf , & found_key , slot ) ;
2008-07-08 14:19:17 -04:00
2015-06-11 00:58:53 +01:00
if ( found_key . objectid ! = key . objectid ) {
2021-04-19 16:41:01 +09:00
mutex_unlock ( & fs_info - > reclaim_bgs_lock ) ;
2008-04-28 15:29:52 -04:00
break ;
2015-06-11 00:58:53 +01:00
}
2008-07-08 14:19:17 -04:00
2012-01-16 22:04:47 +02:00
chunk = btrfs_item_ptr ( leaf , slot , struct btrfs_chunk ) ;
2015-10-10 17:16:50 +02:00
chunk_type = btrfs_chunk_type ( leaf , chunk ) ;
2012-01-16 22:04:47 +02:00
2012-01-16 22:04:49 +02:00
if ( ! counting ) {
spin_lock ( & fs_info - > balance_lock ) ;
bctl - > stat . considered + + ;
spin_unlock ( & fs_info - > balance_lock ) ;
}
2019-03-20 16:38:52 +01:00
ret = should_balance_chunk ( leaf , chunk , found_key . offset ) ;
btrfs: Fix lost-data-profile caused by balance bg
Reproduce:
(In integration-4.3 branch)
TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp
umount "$TEST_DEV" >/dev/null
mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
btrfs balance start -dusage=0 $TEST_DIR
btrfs filesystem usage $TEST_DIR
dd if=/dev/zero of="$TEST_DIR"/file count=100
btrfs filesystem usage $TEST_DIR
Result:
We can see "no data chunk" in first "btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 1.07GiB
And "data chunks changed from raid1 to single" in second
"btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Data,single: Size:256.00MiB, Used:0.00B
/dev/vdh 256.00MiB
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 841.92MiB
Reason:
When the filesystem holds no data, btrfs balance deletes the last data
chunk, so the "fi usage" command reports no data chunk.
On the next write to the fs, the only available data profile is 0x0,
so all new chunks are allocated with the single profile.
Fix:
Allocate a data chunk explicitly to ensure we don't lose the
raid profile for data.
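A minimal user-space sketch of this idea follows. It is a toy model, not the kernel implementation: struct toy_space, the TOY_PROFILE_* values and all toy_* functions are hypothetical, and the profile bit values are made up for illustration (only "single is 0x0" comes from the text above). It shows why reserving an empty data chunk before removing the last one preserves the profile.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PROFILE_SINGLE 0x0u /* "0x0" as described in the Reason above */
#define TOY_PROFILE_RAID1  0x1u /* hypothetical bit, not the on-disk value */

/* Toy model: the profile is only remembered while a data chunk exists. */
struct toy_space {
	unsigned int data_chunks;  /* number of data chunks */
	unsigned int data_profile; /* OR of the profiles of existing chunks */
};

static void toy_alloc_data_chunk(struct toy_space *s, unsigned int profile)
{
	s->data_chunks++;
	s->data_profile |= profile;
}

/* Drop one data chunk; once none remain, the profile is forgotten. */
static void toy_remove_data_chunk(struct toy_space *s)
{
	assert(s->data_chunks > 0);
	if (--s->data_chunks == 0)
		s->data_profile = TOY_PROFILE_SINGLE;
}

/*
 * Balance away one empty data chunk. With reserve_first set, an empty
 * chunk of the same profile is allocated in advance when the chunk being
 * removed is the last one. Returns the profile left afterwards.
 */
static unsigned int toy_balance_chunk(struct toy_space *s, int reserve_first)
{
	assert(s != NULL);
	if (reserve_first && s->data_chunks == 1)
		toy_alloc_data_chunk(s, s->data_profile);
	toy_remove_data_chunk(s);
	return s->data_profile;
}
```

Without the reservation the model ends with profile 0x0 and no data chunk; with it, one empty RAID1 chunk remains and later allocations can inherit its profile.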
Test:
Tested with the script above, and confirmed the logic via debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-09 11:51:32 +08:00
2011-04-21 01:20:15 +02:00
btrfs_release_path ( path ) ;
2015-06-11 00:58:53 +01:00
if ( ! ret ) {
2021-04-19 16:41:01 +09:00
mutex_unlock ( & fs_info - > reclaim_bgs_lock ) ;
2012-01-16 22:04:47 +02:00
goto loop ;
2015-06-11 00:58:53 +01:00
}
2012-01-16 22:04:47 +02:00
2012-01-16 22:04:49 +02:00
if ( counting ) {
2021-04-19 16:41:01 +09:00
mutex_unlock ( & fs_info - > reclaim_bgs_lock ) ;
2012-01-16 22:04:49 +02:00
spin_lock ( & fs_info - > balance_lock ) ;
bctl - > stat . expected + + ;
spin_unlock ( & fs_info - > balance_lock ) ;
2015-10-10 17:16:50 +02:00
if ( chunk_type & BTRFS_BLOCK_GROUP_DATA )
count_data + + ;
else if ( chunk_type & BTRFS_BLOCK_GROUP_SYSTEM )
count_sys + + ;
else if ( chunk_type & BTRFS_BLOCK_GROUP_METADATA )
count_meta + + ;
goto loop ;
}
/*
* Apply limit_min filter , no need to check if the LIMITS
* filter is used , limit_min is 0 by default
*/
if ( ( ( chunk_type & BTRFS_BLOCK_GROUP_DATA ) & &
count_data < bctl - > data . limit_min )
| | ( ( chunk_type & BTRFS_BLOCK_GROUP_METADATA ) & &
count_meta < bctl - > meta . limit_min )
| | ( ( chunk_type & BTRFS_BLOCK_GROUP_SYSTEM ) & &
count_sys < bctl - > sys . limit_min ) ) {
2021-04-19 16:41:01 +09:00
mutex_unlock ( & fs_info - > reclaim_bgs_lock ) ;
2012-01-16 22:04:49 +02:00
goto loop ;
}
2017-11-15 16:28:11 -07:00
if ( ! chunk_reserved ) {
/*
* We may be relocating the only data chunk we have ,
* which could potentially end up with losing data ' s
* raid profile , so let ' s allocate an empty one in
* advance .
*/
ret = btrfs_may_alloc_data_chunk ( fs_info ,
found_key . offset ) ;
2015-11-09 11:51:32 +08:00
if ( ret < 0 ) {
2021-04-19 16:41:01 +09:00
mutex_unlock ( & fs_info - > reclaim_bgs_lock ) ;
2015-11-09 11:51:32 +08:00
goto error ;
2017-11-15 16:28:11 -07:00
} else if ( ret = = 1 ) {
chunk_reserved = 1 ;
2015-11-09 11:51:32 +08:00
}
}
2016-06-21 10:40:19 -04:00
ret = btrfs_relocate_chunk ( fs_info , found_key . offset ) ;
2021-04-19 16:41:01 +09:00
mutex_unlock ( & fs_info - > reclaim_bgs_lock ) ;
2012-01-16 22:04:49 +02:00
if ( ret = = - ENOSPC ) {
2012-01-16 22:04:47 +02:00
enospc_errors + + ;
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-03 10:28:12 -07:00
} else if ( ret = = - ETXTBSY ) {
btrfs_info ( fs_info ,
" skipping relocation of block group %llu due to active swapfile " ,
found_key . offset ) ;
ret = 0 ;
} else if ( ret ) {
goto error ;
2012-01-16 22:04:49 +02:00
} else {
spin_lock ( & fs_info - > balance_lock ) ;
bctl - > stat . completed + + ;
spin_unlock ( & fs_info - > balance_lock ) ;
}
2012-01-16 22:04:47 +02:00
loop :
2013-08-27 13:50:44 +03:00
if ( found_key . offset = = 0 )
break ;
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have a lot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 16:11:19 -04:00
key . offset = found_key . offset - 1 ;
2008-04-28 15:29:52 -04:00
}
2012-01-16 22:04:47 +02:00
2012-01-16 22:04:49 +02:00
if ( counting ) {
btrfs_release_path ( path ) ;
counting = false ;
goto again ;
}
2008-04-28 15:29:52 -04:00
error :
btrfs_free_path ( path ) ;
2012-01-16 22:04:47 +02:00
if ( enospc_errors ) {
2013-12-20 11:37:06 -05:00
btrfs_info ( fs_info , " %d enospc errors during balance " ,
2016-09-20 10:05:00 -04:00
enospc_errors ) ;
2012-01-16 22:04:47 +02:00
if ( ! ret )
ret = - ENOSPC ;
}
2008-04-28 15:29:52 -04:00
return ret ;
}
2012-03-27 17:09:17 +03:00
/**
 * alloc_profile_is_valid - see if a given profile is valid and reduced
 * @flags: profile to validate
 * @extended: if true @flags is treated as an extended profile
 */
static int alloc_profile_is_valid(u64 flags, int extended)
{
	u64 mask = (extended ? BTRFS_EXTENDED_PROFILE_MASK :
			       BTRFS_BLOCK_GROUP_PROFILE_MASK);
	flags &= ~BTRFS_BLOCK_GROUP_TYPE_MASK;
	/* 1) check that all other bits are zeroed */
	if (flags & ~mask)
		return 0;
	/* 2) see if profile is reduced */
	if (flags == 0)
		return !extended; /* "0" is valid for usual profiles */
2019-10-01 19:44:42 +02:00
	return has_single_bit_set(flags);
2012-03-27 17:09:17 +03:00
}
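As a rough userspace illustration of the check above (not the kernel code itself): once the type bits are masked off, a profile passes only if no bit falls outside the mask and it is either zero (non-extended case) or has exactly one bit set. The DEMO_* mask values and demo_* names below are made up for this sketch; the real BTRFS_* constants are defined in the btrfs headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real btrfs masks. */
#define DEMO_TYPE_MASK		0x07ULL		/* three example type bits */
#define DEMO_PROFILE_MASK	0x38ULL		/* three example profile bits */
#define DEMO_SINGLE_BIT		(1ULL << 48)	/* in-memory "single" marker */
#define DEMO_EXTENDED_MASK	(DEMO_PROFILE_MASK | DEMO_SINGLE_BIT)

/* True when exactly one bit of n is set. */
static bool demo_has_single_bit_set(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

static bool demo_profile_is_valid(uint64_t flags, bool extended)
{
	uint64_t mask = extended ? DEMO_EXTENDED_MASK : DEMO_PROFILE_MASK;

	flags &= ~DEMO_TYPE_MASK;	/* type bits do not matter here */
	if (flags & ~mask)		/* 1) no stray bits */
		return false;
	if (flags == 0)			/* 2) zero only in the usual form */
		return !extended;
	return demo_has_single_bit_set(flags);
}

static void demo_profile_examples(void)
{
	assert(demo_profile_is_valid(0, false));	/* "0" means single */
	assert(!demo_profile_is_valid(0, true));	/* extended needs a bit */
	assert(demo_profile_is_valid(0x08, false));	/* one profile bit */
	assert(!demo_profile_is_valid(0x18, false));	/* two profile bits */
}
```

The extended form exists so that "single" can be represented by a bit of its own; that is why zero is rejected when the extended flag is set.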
2012-01-16 22:04:49 +02:00
static inline int balance_need_close(struct btrfs_fs_info *fs_info)
{
2012-01-16 22:04:49 +02:00
	/* cancel requested || normal exit path */
	return atomic_read(&fs_info->balance_cancel_req) ||
		(atomic_read(&fs_info->balance_pause_req) == 0 &&
		 atomic_read(&fs_info->balance_cancel_req) == 0);
2012-01-16 22:04:49 +02:00
}
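For a single snapshot of the two counters, the condition above reduces to "cancel requested, or no pause requested". A small userspace sketch of that equivalence follows; plain ints stand in for the atomic counters and the demo_* names are hypothetical. In the kernel the counters are read separately, so this only describes one consistent snapshot.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the expression in balance_need_close() with plain ints. */
static bool demo_need_close(int cancel_req, int pause_req)
{
	return cancel_req || (pause_req == 0 && cancel_req == 0);
}

/* Shorter form that is equivalent for a fixed pair of values. */
static bool demo_need_close_short(int cancel_req, int pause_req)
{
	return cancel_req || !pause_req;
}

static void demo_need_close_examples(void)
{
	int c, p;

	for (c = 0; c <= 2; c++)
		for (p = 0; p <= 2; p++)
			assert(demo_need_close(c, p) ==
			       demo_need_close_short(c, p));

	/* Only a pure pause leaves the balance state in place. */
	assert(!demo_need_close(0, 1));
}
```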
2020-02-27 21:00:52 +01:00
/*
* Validate target profile against allowed profiles and return true if it ' s OK .
* Otherwise print the error message and return false .
*/
static inline int validate_convert_profile ( struct btrfs_fs_info * fs_info ,
const struct btrfs_balance_args * bargs ,
u64 allowed , const char * type )
2015-09-22 20:02:25 +00:00
{
2020-02-27 21:00:52 +01:00
if ( ! ( bargs - > flags & BTRFS_BALANCE_ARGS_CONVERT ) )
return true ;
2021-07-26 14:35:01 +08:00
if ( fs_info - > sectorsize < PAGE_SIZE & &
bargs - > target & BTRFS_BLOCK_GROUP_RAID56_MASK ) {
btrfs_err ( fs_info ,
" RAID56 is not yet supported for sectorsize %u with page size %lu " ,
fs_info - > sectorsize , PAGE_SIZE ) ;
return false ;
}
2020-02-27 21:00:52 +01:00
/* Profile is valid and does not have bits outside of the allowed set */
if ( alloc_profile_is_valid ( bargs - > target , 1 ) & &
( bargs - > target & ~ allowed ) = = 0 )
return true ;
btrfs_err ( fs_info , " balance: invalid convert %s profile %s " ,
type , btrfs_bg_type_to_raid_name ( bargs - > target ) ) ;
return false ;
2015-09-22 20:02:25 +00:00
}
btrfs: balance: print args during start and resume
The information about balance arguments is important for system audit,
this patch prints the textual representation when balance starts or is
resumed.
Example command:
$ btrfs balance start -f -mprofiles=raid1,convert=single,soft -dlimit=10..20,usage=50 /btrfs
Example kernel log output:
BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 -mconvert=single,soft,profiles=raid1 -sconvert=single,soft,profiles=raid1
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, simplify code ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-20 16:12:56 +08:00
/*
 * Fill @buf with textual description of balance filter flags @bargs, up to
 * @size_buf including the terminating null. The output may be trimmed if it
 * does not fit into the provided buffer.
 */
static void describe_balance_args(struct btrfs_balance_args *bargs, char *buf,
				 u32 size_buf)
{
	int ret;
	u32 size_bp = size_buf;
	char *bp = buf;
	u64 flags = bargs->flags;
	char tmp_buf[128] = {'\0'};
	if (!flags)
		return;
#define CHECK_APPEND_NOARG(a) \
	do { \
		ret = snprintf(bp, size_bp, (a)); \
		if (ret < 0 || ret >= size_bp) \
			goto out_overflow; \
		size_bp -= ret; \
		bp += ret; \
	} while (0)
#define CHECK_APPEND_1ARG(a, v1) \
	do { \
		ret = snprintf(bp, size_bp, (a), (v1)); \
		if (ret < 0 || ret >= size_bp) \
			goto out_overflow; \
		size_bp -= ret; \
		bp += ret; \
	} while (0)
#define CHECK_APPEND_2ARG(a, v1, v2) \
	do { \
		ret = snprintf(bp, size_bp, (a), (v1), (v2)); \
		if (ret < 0 || ret >= size_bp) \
			goto out_overflow; \
		size_bp -= ret; \
		bp += ret; \
	} while (0)
2019-05-17 11:43:41 +02:00
if ( flags & BTRFS_BALANCE_ARGS_CONVERT )
CHECK_APPEND_1ARG ( " convert=%s, " ,
btrfs_bg_type_to_raid_name ( bargs - > target ) ) ;
2018-11-20 16:12:56 +08:00
	if (flags & BTRFS_BALANCE_ARGS_SOFT)
		CHECK_APPEND_NOARG("soft,");
	if (flags & BTRFS_BALANCE_ARGS_PROFILES) {
		btrfs_describe_block_groups(bargs->profiles, tmp_buf,
					    sizeof(tmp_buf));
		CHECK_APPEND_1ARG("profiles=%s,", tmp_buf);
	}
	if (flags & BTRFS_BALANCE_ARGS_USAGE)
		CHECK_APPEND_1ARG("usage=%llu,", bargs->usage);
	if (flags & BTRFS_BALANCE_ARGS_USAGE_RANGE)
		CHECK_APPEND_2ARG("usage=%u..%u,",
				  bargs->usage_min, bargs->usage_max);
	if (flags & BTRFS_BALANCE_ARGS_DEVID)
		CHECK_APPEND_1ARG("devid=%llu,", bargs->devid);
	if (flags & BTRFS_BALANCE_ARGS_DRANGE)
		CHECK_APPEND_2ARG("drange=%llu..%llu,",
				  bargs->pstart, bargs->pend);
	if (flags & BTRFS_BALANCE_ARGS_VRANGE)
		CHECK_APPEND_2ARG("vrange=%llu..%llu,",
				  bargs->vstart, bargs->vend);
	if (flags & BTRFS_BALANCE_ARGS_LIMIT)
		CHECK_APPEND_1ARG("limit=%llu,", bargs->limit);
	if (flags & BTRFS_BALANCE_ARGS_LIMIT_RANGE)
		CHECK_APPEND_2ARG("limit=%u..%u,",
				  bargs->limit_min, bargs->limit_max);
	if (flags & BTRFS_BALANCE_ARGS_STRIPES_RANGE)
		CHECK_APPEND_2ARG("stripes=%u..%u,",
				  bargs->stripes_min, bargs->stripes_max);
#undef CHECK_APPEND_2ARG
#undef CHECK_APPEND_1ARG
#undef CHECK_APPEND_NOARG
out_overflow:
	if (size_bp < size_buf)
		buf[size_buf - size_bp - 1] = '\0'; /* remove last , */
	else
		buf[0] = '\0';
}
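The append-and-check pattern used by describe_balance_args() can be tried out in userspace. The sketch below is an assumption-laden illustration, not the kernel code: it replaces the macros with a helper function, and demo_append_u64() and demo_describe() are hypothetical names. Each field is appended with a trailing comma; when a field does not fit, appending stops, and the final step cuts the string just before the last comma that was fully written.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Append "key=value," at *bp and shrink *left accordingly.
 * Returns 0 on success, -1 if snprintf() failed or the output was truncated.
 */
static int demo_append_u64(char **bp, size_t *left, const char *key,
			   unsigned long long val)
{
	int ret = snprintf(*bp, *left, "%s=%llu,", key, val);

	if (ret < 0 || (size_t)ret >= *left)
		return -1;
	*left -= (size_t)ret;
	*bp += ret;
	return 0;
}

/* Build "usage=..,devid=.." in buf, trimming whatever does not fit. */
static void demo_describe(char *buf, size_t size, unsigned long long usage,
			  unsigned long long devid)
{
	char *bp = buf;
	size_t left = size;

	if (size == 0)
		return;
	if (demo_append_u64(&bp, &left, "usage", usage) == 0)
		(void)demo_append_u64(&bp, &left, "devid", devid);

	if (left < size)
		buf[size - left - 1] = '\0';	/* remove last ',' */
	else
		buf[0] = '\0';			/* nothing was appended */
}

static void demo_describe_examples(void)
{
	char big[32], small[12], tiny[4];

	demo_describe(big, sizeof(big), 50, 1);
	assert(strcmp(big, "usage=50,devid=1") == 0);

	demo_describe(small, sizeof(small), 50, 1);
	assert(strcmp(small, "usage=50") == 0);	/* second field dropped */

	demo_describe(tiny, sizeof(tiny), 50, 1);
	assert(strcmp(tiny, "") == 0);		/* not even one field fit */
}
```

Because the remaining-space counter is only updated after a successful append, the partial output that snprintf() leaves behind on truncation is cut off by the final terminator write.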
static void describe_balance_start_or_resume(struct btrfs_fs_info *fs_info)
{
	u32 size_buf = 1024;
	char tmp_buf[192] = {'\0'};
	char *buf;
	char *bp;
	u32 size_bp = size_buf;
	int ret;
	struct btrfs_balance_control *bctl = fs_info->balance_ctl;
	buf = kzalloc(size_buf, GFP_KERNEL);
	if (!buf)
		return;
	bp = buf;
#define CHECK_APPEND_1ARG(a, v1) \
	do { \
		ret = snprintf(bp, size_bp, (a), (v1)); \
		if (ret < 0 || ret >= size_bp) \
			goto out_overflow; \
		size_bp -= ret; \
		bp += ret; \
	} while (0)
	if (bctl->flags & BTRFS_BALANCE_FORCE)
		CHECK_APPEND_1ARG("%s", "-f ");
	if (bctl->flags & BTRFS_BALANCE_DATA) {
		describe_balance_args(&bctl->data, tmp_buf, sizeof(tmp_buf));
		CHECK_APPEND_1ARG("-d%s ", tmp_buf);
	}
	if (bctl->flags & BTRFS_BALANCE_METADATA) {
		describe_balance_args(&bctl->meta, tmp_buf, sizeof(tmp_buf));
		CHECK_APPEND_1ARG("-m%s ", tmp_buf);
	}
	if (bctl->flags & BTRFS_BALANCE_SYSTEM) {
		describe_balance_args(&bctl->sys, tmp_buf, sizeof(tmp_buf));
		CHECK_APPEND_1ARG("-s%s ", tmp_buf);
	}
#undef CHECK_APPEND_1ARG
out_overflow:
	if (size_bp < size_buf)
		buf[size_buf - size_bp - 1] = '\0'; /* remove last " " */
	btrfs_info(fs_info, "balance: %s %s",
		   (bctl->flags & BTRFS_BALANCE_RESUME) ?
		   "resume" : "start", buf);
	kfree(buf);
}
2012-01-16 22:04:47 +02:00
/*
2018-03-21 00:20:05 +01:00
 * Should be called with balance mutex held
2012-01-16 22:04:47 +02:00
*/
2018-05-07 17:44:03 +02:00
int btrfs_balance ( struct btrfs_fs_info * fs_info ,
struct btrfs_balance_control * bctl ,
2012-01-16 22:04:47 +02:00
struct btrfs_ioctl_balance_args * bargs )
{
2017-03-07 23:34:44 +01:00
u64 meta_target , data_target ;
2012-01-16 22:04:47 +02:00
u64 allowed ;
2012-03-27 17:09:17 +03:00
int mixed = 0 ;
2012-01-16 22:04:47 +02:00
int ret ;
2012-11-06 13:15:27 +01:00
u64 num_devices ;
2013-01-29 10:13:12 +00:00
unsigned seq ;
2019-09-25 14:29:28 +08:00
bool reducing_redundancy ;
2019-05-17 11:43:27 +02:00
int i ;
2012-01-16 22:04:47 +02:00
2012-01-16 22:04:49 +02:00
if ( btrfs_fs_closing ( fs_info ) | |
2012-01-16 22:04:49 +02:00
atomic_read ( & fs_info - > balance_pause_req ) | |
2020-02-17 14:16:52 +08:00
btrfs_should_cancel_balance ( fs_info ) ) {
2012-01-16 22:04:47 +02:00
ret = - EINVAL ;
goto out ;
}
2012-03-27 17:09:17 +03:00
allowed = btrfs_super_incompat_flags ( fs_info - > super_copy ) ;
if ( allowed & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS )
mixed = 1 ;
2012-01-16 22:04:47 +02:00
/*
* In case of mixed groups both data and meta should be picked ,
* and identical options should be given for both of them .
*/
2012-03-27 17:09:17 +03:00
allowed = BTRFS_BALANCE_DATA | BTRFS_BALANCE_METADATA ;
if ( mixed & & ( bctl - > flags & allowed ) ) {
2012-01-16 22:04:47 +02:00
if ( ! ( bctl - > flags & BTRFS_BALANCE_DATA ) | |
! ( bctl - > flags & BTRFS_BALANCE_METADATA ) | |
memcmp ( & bctl - > data , & bctl - > meta , sizeof ( bctl - > data ) ) ) {
2016-09-20 10:05:00 -04:00
btrfs_err ( fs_info ,
2018-05-16 10:51:26 +08:00
" balance: mixed groups data and metadata options must be the same " ) ;
2012-01-16 22:04:47 +02:00
ret = - EINVAL ;
goto out ;
}
}
btrfs: check rw_devices, not num_devices for balance
The fstest btrfs/154 reports
[ 8675.381709] BTRFS: Transaction aborted (error -28)
[ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935
[ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286
[ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001
[ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971
[ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000
[ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4
[ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000
[ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000
[ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0
[ 8675.419801] Call Trace:
[ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs]
[ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs]
[ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs]
[ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs]
[ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0
[ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs]
[ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs]
[ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0
[ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400
[ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0
[ 8675.436618] ? _raw_spin_unlock+0x1f/0x30
[ 8675.438093] ? __handle_mm_fault+0x499/0x740
[ 8675.439619] ? do_vfs_ioctl+0x56e/0x770
[ 8675.441034] do_vfs_ioctl+0x56e/0x770
[ 8675.442411] ksys_ioctl+0x3a/0x70
[ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 8675.445333] __x64_sys_ioctl+0x16/0x20
[ 8675.446705] do_syscall_64+0x50/0x210
[ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left
We now use btrfs_can_overcommit() to see if we can flip a block group
read only. Before this would fail because we weren't taking into
account the usable un-allocated space for allocating chunks. With my
patches we were allowed to do the balance, which is technically correct.
The test is trying to start balance on degraded mount. So now we're
trying to allocate a chunk and cannot because we want to allocate a
RAID1 chunk, but there's only 1 device that's available for usage. This
results in an ENOSPC.
But we shouldn't even be making it this far, we don't have enough
devices to restripe. The problem is we're using btrfs_num_devices(),
that also includes missing devices. That's not actually what we want, we
need to use rw_devices.
The chunk_mutex is not needed here, rw_devices changes only in device
add, remove or replace, all are excluded by EXCL_OP mechanism.
Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add stacktrace, update changelog, drop chunk_mutex ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-10 11:11:24 -05:00
/*
* rw_devices will not change at the moment , device add / delete / replace
2020-08-25 10:02:32 -05:00
* are exclusive
2020-01-10 11:11:24 -05:00
*/
num_devices = fs_info - > fs_devices - > rw_devices ;
2019-09-25 10:13:27 +08:00
/*
* SINGLE profile on - disk has no profile bit , but in - memory we have a
* special bit for it , to make it easier to distinguish . Thus we need
* to set it manually , or balance would refuse the profile .
*/
allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE ;
2019-05-17 11:43:27 +02:00
for ( i = 0 ; i < ARRAY_SIZE ( btrfs_raid_array ) ; i + + )
if ( num_devices > = btrfs_raid_array [ i ] . devs_min )
allowed | = btrfs_raid_array [ i ] . bg_flag ;
2018-08-10 13:53:21 +08:00
2020-02-27 21:00:52 +01:00
if ( ! validate_convert_profile ( fs_info , & bctl - > data , allowed , " data " ) | |
! validate_convert_profile ( fs_info , & bctl - > meta , allowed , " metadata " ) | |
! validate_convert_profile ( fs_info , & bctl - > sys , allowed , " system " ) ) {
2012-01-16 22:04:48 +02:00
ret = - EINVAL ;
goto out ;
}
2019-05-17 11:43:29 +02:00
/*
* Allow to reduce metadata or system integrity only if force set for
* profiles with redundancy ( copies , parity )
*/
allowed = 0 ;
for ( i = 0 ; i < ARRAY_SIZE ( btrfs_raid_array ) ; i + + ) {
if ( btrfs_raid_array [ i ] . ncopies > = 2 | |
btrfs_raid_array [ i ] . tolerated_failures > = 1 )
allowed | = btrfs_raid_array [ i ] . bg_flag ;
}
2013-01-29 10:13:12 +00:00
do {
seq = read_seqbegin ( & fs_info - > profiles_lock ) ;
if ( ( ( bctl - > sys . flags & BTRFS_BALANCE_ARGS_CONVERT ) & &
( fs_info - > avail_system_alloc_bits & allowed ) & &
! ( bctl - > sys . target & allowed ) ) | |
( ( bctl - > meta . flags & BTRFS_BALANCE_ARGS_CONVERT ) & &
( fs_info - > avail_metadata_alloc_bits & allowed ) & &
2018-11-19 09:48:12 +00:00
! ( bctl - > meta . target & allowed ) ) )
2019-09-25 14:29:28 +08:00
reducing_redundancy = true ;
2018-11-19 09:48:12 +00:00
else
2019-09-25 14:29:28 +08:00
reducing_redundancy = false ;
2018-11-19 09:48:12 +00:00
/* if we're not converting, the target field is uninitialized */
meta_target = ( bctl - > meta . flags & BTRFS_BALANCE_ARGS_CONVERT ) ?
bctl - > meta . target : fs_info - > avail_metadata_alloc_bits ;
data_target = ( bctl - > data . flags & BTRFS_BALANCE_ARGS_CONVERT ) ?
bctl - > data . target : fs_info - > avail_data_alloc_bits ;
2013-01-29 10:13:12 +00:00
} while ( read_seqretry ( & fs_info - > profiles_lock , seq ) ) ;
2012-01-16 22:04:48 +02:00
2019-09-25 14:29:28 +08:00
if ( reducing_redundancy ) {
2018-11-19 09:48:12 +00:00
if ( bctl - > flags & BTRFS_BALANCE_FORCE ) {
btrfs_info ( fs_info ,
2019-09-25 14:29:28 +08:00
" balance: force reducing metadata redundancy " ) ;
2018-11-19 09:48:12 +00:00
} else {
btrfs_err ( fs_info ,
2019-09-25 14:29:28 +08:00
" balance: reduces metadata redundancy, use --force if you want this " ) ;
2018-11-19 09:48:12 +00:00
ret = - EINVAL ;
goto out ;
}
}
2017-03-07 23:34:44 +01:00
if ( btrfs_get_num_tolerated_disk_barrier_failures ( meta_target ) <
btrfs_get_num_tolerated_disk_barrier_failures ( data_target ) ) {
2016-01-06 08:46:12 +00:00
btrfs_warn ( fs_info ,
2018-05-16 10:51:26 +08:00
" balance: metadata profile %s has lower redundancy than data profile %s " ,
2019-05-17 11:43:41 +02:00
btrfs_bg_type_to_raid_name ( meta_target ) ,
btrfs_bg_type_to_raid_name ( data_target ) ) ;
2016-01-06 08:46:12 +00:00
}
2016-06-21 21:16:51 -04:00
ret = insert_balance_item ( fs_info , bctl ) ;
2012-01-16 22:04:48 +02:00
if ( ret & & ret ! = - EEXIST )
2012-01-16 22:04:48 +02:00
goto out ;
2012-01-16 22:04:48 +02:00
if ( ! ( bctl - > flags & BTRFS_BALANCE_RESUME ) ) {
BUG_ON ( ret = = - EEXIST ) ;
2018-03-21 02:41:30 +01:00
BUG_ON ( fs_info - > balance_ctl ) ;
spin_lock ( & fs_info - > balance_lock ) ;
fs_info - > balance_ctl = bctl ;
spin_unlock ( & fs_info - > balance_lock ) ;
2012-01-16 22:04:48 +02:00
} else {
BUG_ON ( ret ! = - EEXIST ) ;
spin_lock ( & fs_info - > balance_lock ) ;
update_balance_args ( bctl ) ;
spin_unlock ( & fs_info - > balance_lock ) ;
}
2012-01-16 22:04:47 +02:00
2018-03-21 01:31:04 +01:00
ASSERT ( ! test_bit ( BTRFS_FS_BALANCE_RUNNING , & fs_info - > flags ) ) ;
set_bit ( BTRFS_FS_BALANCE_RUNNING , & fs_info - > flags ) ;
2018-11-20 16:12:56 +08:00
describe_balance_start_or_resume ( fs_info ) ;
2012-01-16 22:04:47 +02:00
mutex_unlock ( & fs_info - > balance_mutex ) ;
ret = __btrfs_balance ( fs_info ) ;
mutex_lock ( & fs_info - > balance_mutex ) ;
2018-11-20 16:12:57 +08:00
if ( ret = = - ECANCELED & & atomic_read ( & fs_info - > balance_pause_req ) )
btrfs_info ( fs_info , " balance: paused " ) ;
2020-07-13 09:03:21 +08:00
/*
* Balance can be canceled by :
*
* - Regular cancel request
* Then ret = = - ECANCELED and balance_cancel_req > 0
*
* - Fatal signal to " btrfs " process
* Either the signal caught by wait_reserve_ticket ( ) and callers
* got - EINTR , or caught by btrfs_should_cancel_balance ( ) and
* got - ECANCELED .
* Either way , in this case balance_cancel_req = 0 , and
* ret = = - EINTR or ret = = - ECANCELED .
*
* So here we only check the return value to catch canceled balance .
*/
else if ( ret = = - ECANCELED | | ret = = - EINTR )
2018-11-20 16:12:57 +08:00
btrfs_info ( fs_info , " balance: canceled " ) ;
else
btrfs_info ( fs_info , " balance: ended with status: %d " , ret ) ;
2018-03-21 01:31:04 +01:00
clear_bit ( BTRFS_FS_BALANCE_RUNNING , & fs_info - > flags ) ;
2012-01-16 22:04:47 +02:00
if ( bargs ) {
memset ( bargs , 0 , sizeof ( * bargs ) ) ;
2018-03-21 02:05:27 +01:00
btrfs_update_ioctl_balance_args ( fs_info , bargs ) ;
2012-01-16 22:04:47 +02:00
}
2013-03-06 01:57:55 -07:00
if ( ( ret & & ret ! = - ECANCELED & & ret ! = - ENOSPC ) | |
balance_need_close ( fs_info ) ) {
2018-03-20 20:23:09 +01:00
reset_balance_state ( fs_info ) ;
2020-08-25 10:02:32 -05:00
btrfs_exclop_finish ( fs_info ) ;
2013-03-06 01:57:55 -07:00
}
2012-01-16 22:04:49 +02:00
wake_up ( & fs_info - > balance_wait_q ) ;
2012-01-16 22:04:47 +02:00
return ret ;
out :
2012-01-16 22:04:48 +02:00
if ( bctl - > flags & BTRFS_BALANCE_RESUME )
2018-03-20 20:23:09 +01:00
reset_balance_state ( fs_info ) ;
2018-03-20 17:28:05 +01:00
else
2012-01-16 22:04:48 +02:00
kfree ( bctl ) ;
2020-08-25 10:02:32 -05:00
btrfs_exclop_finish ( fs_info ) ;
2018-03-20 17:28:05 +01:00
2012-01-16 22:04:48 +02:00
return ret ;
}
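The device-count check in btrfs_balance() above builds a mask of permitted target profiles: "single" is always allowed, and every other profile is added when the number of writable devices reaches that profile's minimum. A conversion target is then rejected if it has any bit outside the mask. The sketch below shows the idea in userspace; the table, the DEMO_* bits and the minimum device counts are illustrative assumptions and are not copied from btrfs_raid_array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_raid_attr {
	uint64_t bg_flag;	/* profile bit */
	uint64_t devs_min;	/* minimum number of writable devices */
};

/* Illustrative profile bits and minimums, not the kernel's values. */
#define DEMO_SINGLE	(1ULL << 0)
#define DEMO_MIRROR2	(1ULL << 1)
#define DEMO_MIRROR3	(1ULL << 2)

static const struct demo_raid_attr demo_raid_array[] = {
	{ DEMO_MIRROR2, 2 },
	{ DEMO_MIRROR3, 3 },
};

/* Collect every profile that the given number of devices can hold. */
static uint64_t demo_allowed_profiles(uint64_t rw_devices)
{
	uint64_t allowed = DEMO_SINGLE;	/* single is always permitted */
	size_t i;

	for (i = 0; i < sizeof(demo_raid_array) / sizeof(demo_raid_array[0]); i++)
		if (rw_devices >= demo_raid_array[i].devs_min)
			allowed |= demo_raid_array[i].bg_flag;
	return allowed;
}

/* A target is acceptable when it has no bit outside the allowed set. */
static int demo_target_ok(uint64_t target, uint64_t allowed)
{
	return (target & ~allowed) == 0;
}

static void demo_allowed_examples(void)
{
	assert(demo_allowed_profiles(1) == DEMO_SINGLE);
	assert(demo_allowed_profiles(2) == (DEMO_SINGLE | DEMO_MIRROR2));
	assert(demo_target_ok(DEMO_MIRROR2, demo_allowed_profiles(2)));
	assert(!demo_target_ok(DEMO_MIRROR3, demo_allowed_profiles(2)));
}
```

Counting only writable devices matters on a degraded mount: a missing device would otherwise make a profile look reachable when no chunk of that profile can actually be allocated.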
static int balance_kthread ( void * data )
{
2012-06-22 12:24:13 -06:00
struct btrfs_fs_info * fs_info = data ;
2012-01-16 22:04:48 +02:00
int ret = 0 ;
2012-01-16 22:04:48 +02:00
mutex_lock ( & fs_info - > balance_mutex ) ;
2018-11-20 16:12:56 +08:00
if ( fs_info - > balance_ctl )
2018-05-07 17:44:03 +02:00
ret = btrfs_balance ( fs_info , fs_info - > balance_ctl , NULL ) ;
2012-01-16 22:04:48 +02:00
mutex_unlock ( & fs_info - > balance_mutex ) ;
2012-06-22 12:24:13 -06:00
2012-01-16 22:04:48 +02:00
return ret ;
}
2012-06-22 12:24:13 -06:00
int btrfs_resume_balance_async ( struct btrfs_fs_info * fs_info )
{
struct task_struct * tsk ;
2018-03-21 02:29:13 +01:00
mutex_lock ( & fs_info - > balance_mutex ) ;
2012-06-22 12:24:13 -06:00
if ( ! fs_info - > balance_ctl ) {
2018-03-21 02:29:13 +01:00
mutex_unlock ( & fs_info - > balance_mutex ) ;
2012-06-22 12:24:13 -06:00
return 0 ;
}
2018-03-21 02:29:13 +01:00
mutex_unlock ( & fs_info - > balance_mutex ) ;
2012-06-22 12:24:13 -06:00
2016-06-09 21:38:35 -04:00
if ( btrfs_test_opt ( fs_info , SKIP_BALANCE ) ) {
2018-05-16 10:51:26 +08:00
btrfs_info ( fs_info , " balance: resume skipped " ) ;
2012-06-22 12:24:13 -06:00
return 0 ;
}
2018-05-17 15:16:51 +08:00
/*
* A ro - > rw remount sequence should continue with the paused balance
* regardless of who pauses it , system or the user as of now , so set
* the resume flag .
*/
spin_lock ( & fs_info - > balance_lock ) ;
fs_info - > balance_ctl - > flags | = BTRFS_BALANCE_RESUME ;
spin_unlock ( & fs_info - > balance_lock ) ;
2012-06-22 12:24:13 -06:00
tsk = kthread_run ( balance_kthread , fs_info , " btrfs-balance " ) ;
2013-07-15 16:52:18 +05:30
return PTR_ERR_OR_ZERO ( tsk ) ;
2012-06-22 12:24:13 -06:00
}
2012-06-22 12:24:12 -06:00
int btrfs_recover_balance ( struct btrfs_fs_info * fs_info )
2012-01-16 22:04:48 +02:00
{
struct btrfs_balance_control * bctl ;
struct btrfs_balance_item * item ;
struct btrfs_disk_balance_args disk_bargs ;
struct btrfs_path * path ;
struct extent_buffer * leaf ;
struct btrfs_key key ;
int ret ;
path = btrfs_alloc_path ( ) ;
if ( ! path )
return - ENOMEM ;
key . objectid = BTRFS_BALANCE_OBJECTID ;
2016-01-25 17:51:31 +01:00
key . type = BTRFS_TEMPORARY_ITEM_KEY ;
2012-01-16 22:04:48 +02:00
key . offset = 0 ;
2012-06-22 12:24:12 -06:00
ret = btrfs_search_slot ( NULL , fs_info - > tree_root , & key , path , 0 , 0 ) ;
2012-01-16 22:04:48 +02:00
if ( ret < 0 )
2012-06-22 12:24:12 -06:00
goto out ;
2012-01-16 22:04:48 +02:00
if ( ret > 0 ) { /* ret = -ENOENT; */
ret = 0 ;
2012-06-22 12:24:12 -06:00
goto out ;
}
bctl = kzalloc ( sizeof ( * bctl ) , GFP_NOFS ) ;
if ( ! bctl ) {
ret = - ENOMEM ;
goto out ;
2012-01-16 22:04:48 +02:00
}
leaf = path - > nodes [ 0 ] ;
item = btrfs_item_ptr ( leaf , path - > slots [ 0 ] , struct btrfs_balance_item ) ;
2012-06-22 12:24:12 -06:00
bctl - > flags = btrfs_balance_flags ( leaf , item ) ;
bctl - > flags | = BTRFS_BALANCE_RESUME ;
2012-01-16 22:04:48 +02:00
btrfs_balance_data ( leaf , item , & disk_bargs ) ;
btrfs_disk_balance_args_to_cpu ( & bctl - > data , & disk_bargs ) ;
btrfs_balance_meta ( leaf , item , & disk_bargs ) ;
btrfs_disk_balance_args_to_cpu ( & bctl - > meta , & disk_bargs ) ;
btrfs_balance_sys ( leaf , item , & disk_bargs ) ;
btrfs_disk_balance_args_to_cpu ( & bctl - > sys , & disk_bargs ) ;
2018-03-20 20:07:58 +01:00
/*
* This should never happen , as the paused balance state is recovered
* during mount without any chance of other exclusive ops to collide .
*
* This gives the exclusive op status to balance and keeps in paused
* state until user intervention ( cancel or umount ) . If the ownership
* cannot be assigned , show a message but do not fail . The balance
* is in a paused state and must have fs_info : : balance_ctl properly
* set up .
*/
2020-08-25 10:02:32 -05:00
if ( ! btrfs_exclop_start ( fs_info , BTRFS_EXCLOP_BALANCE ) )
2018-03-20 20:07:58 +01:00
btrfs_warn ( fs_info ,
2018-05-16 10:51:26 +08:00
" balance: cannot set exclusive op status, resume manually " ) ;
2013-01-20 15:57:57 +02:00
2020-12-16 11:22:14 -05:00
btrfs_release_path ( path ) ;
2012-06-22 12:24:12 -06:00
mutex_lock ( & fs_info - > balance_mutex ) ;
2018-03-21 02:41:30 +01:00
BUG_ON ( fs_info - > balance_ctl ) ;
spin_lock ( & fs_info - > balance_lock ) ;
fs_info - > balance_ctl = bctl ;
spin_unlock ( & fs_info - > balance_lock ) ;
2012-06-22 12:24:12 -06:00
mutex_unlock ( & fs_info - > balance_mutex ) ;
2012-01-16 22:04:48 +02:00
out :
btrfs_free_path ( path ) ;
2008-04-28 15:29:52 -04:00
return ret ;
}
2012-01-16 22:04:49 +02:00
int btrfs_pause_balance ( struct btrfs_fs_info * fs_info )
{
int ret = 0 ;
mutex_lock ( & fs_info - > balance_mutex ) ;
if ( ! fs_info - > balance_ctl ) {
mutex_unlock ( & fs_info - > balance_mutex ) ;
return - ENOTCONN ;
}
2018-03-21 01:31:04 +01:00
if ( test_bit ( BTRFS_FS_BALANCE_RUNNING , & fs_info - > flags ) ) {
2012-01-16 22:04:49 +02:00
atomic_inc ( & fs_info - > balance_pause_req ) ;
mutex_unlock ( & fs_info - > balance_mutex ) ;
wait_event ( fs_info - > balance_wait_q ,
2018-03-21 01:31:04 +01:00
! test_bit ( BTRFS_FS_BALANCE_RUNNING , & fs_info - > flags ) ) ;
2012-01-16 22:04:49 +02:00
mutex_lock ( & fs_info - > balance_mutex ) ;
/* we are good with balance_ctl ripped off from under us */
2018-03-21 01:31:04 +01:00
BUG_ON ( test_bit ( BTRFS_FS_BALANCE_RUNNING , & fs_info - > flags ) ) ;
2012-01-16 22:04:49 +02:00
atomic_dec ( & fs_info - > balance_pause_req ) ;
} else {
ret = - ENOTCONN ;
}
mutex_unlock ( & fs_info - > balance_mutex ) ;
return ret ;
}
2012-01-16 22:04:49 +02:00
int btrfs_cancel_balance ( struct btrfs_fs_info * fs_info )
{
mutex_lock ( & fs_info - > balance_mutex ) ;
if ( ! fs_info - > balance_ctl ) {
mutex_unlock ( & fs_info - > balance_mutex ) ;
return - ENOTCONN ;
}
2018-03-21 01:45:32 +01:00
/*
* A paused balance with the item stored on disk can be resumed at
* mount time if the mount is read-write. Otherwise it's still paused
* and we must not allow cancelling as it deletes the item.
*/
if ( sb_rdonly ( fs_info - > sb ) ) {
mutex_unlock ( & fs_info - > balance_mutex ) ;
return - EROFS ;
}
2012-01-16 22:04:49 +02:00
atomic_inc ( & fs_info - > balance_cancel_req ) ;
/*
* If we are running, just wait and return; the balance item is
* deleted in btrfs_balance in this case.
*/
2018-03-21 01:31:04 +01:00
if ( test_bit ( BTRFS_FS_BALANCE_RUNNING , & fs_info - > flags ) ) {
2012-01-16 22:04:49 +02:00
mutex_unlock ( & fs_info - > balance_mutex ) ;
wait_event ( fs_info - > balance_wait_q ,
2018-03-21 01:31:04 +01:00
! test_bit ( BTRFS_FS_BALANCE_RUNNING , & fs_info - > flags ) ) ;
2012-01-16 22:04:49 +02:00
mutex_lock ( & fs_info - > balance_mutex ) ;
} else {
mutex_unlock ( & fs_info - > balance_mutex ) ;
2018-03-21 00:20:05 +01:00
/*
* Lock released to allow other waiters to continue, we'll
* reexamine the status again.
*/
2012-01-16 22:04:49 +02:00
mutex_lock ( & fs_info - > balance_mutex ) ;
2018-03-20 17:28:05 +01:00
if ( fs_info - > balance_ctl ) {
2018-03-20 20:23:09 +01:00
reset_balance_state ( fs_info ) ;
2020-08-25 10:02:32 -05:00
btrfs_exclop_finish ( fs_info ) ;
2018-05-16 10:51:26 +08:00
btrfs_info ( fs_info , "balance: canceled" ) ;
2018-03-20 17:28:05 +01:00
}
2012-01-16 22:04:49 +02:00
}
2018-03-21 01:31:04 +01:00
BUG_ON ( fs_info - > balance_ctl | |
test_bit ( BTRFS_FS_BALANCE_RUNNING , & fs_info - > flags ) ) ;
2012-01-16 22:04:49 +02:00
atomic_dec ( & fs_info - > balance_cancel_req ) ;
mutex_unlock ( & fs_info - > balance_mutex ) ;
return 0 ;
}
2020-02-18 16:56:08 +02:00
int btrfs_uuid_scan_kthread ( void * data )
2013-08-15 17:11:21 +02:00
{
struct btrfs_fs_info * fs_info = data ;
struct btrfs_root * root = fs_info - > tree_root ;
struct btrfs_key key ;
struct btrfs_path * path = NULL ;
int ret = 0 ;
struct extent_buffer * eb ;
int slot ;
struct btrfs_root_item root_item ;
u32 item_size ;
2013-08-28 10:28:34 +01:00
struct btrfs_trans_handle * trans = NULL ;
2020-02-14 15:05:01 -05:00
bool closing = false ;
2013-08-15 17:11:21 +02:00
path = btrfs_alloc_path ( ) ;
if ( ! path ) {
ret = - ENOMEM ;
goto out ;
}
key . objectid = 0 ;
key . type = BTRFS_ROOT_ITEM_KEY ;
key . offset = 0 ;
while ( 1 ) {
2020-02-14 15:05:01 -05:00
if ( btrfs_fs_closing ( fs_info ) ) {
closing = true ;
break ;
}
2018-03-07 17:29:18 +08:00
ret = btrfs_search_forward ( root , & key , path ,
BTRFS_OLDEST_GENERATION ) ;
2013-08-15 17:11:21 +02:00
if ( ret ) {
if ( ret > 0 )
ret = 0 ;
break ;
}
if ( key . type ! = BTRFS_ROOT_ITEM_KEY | |
( key . objectid < BTRFS_FIRST_FREE_OBJECTID & &
key . objectid ! = BTRFS_FS_TREE_OBJECTID ) | |
key . objectid > BTRFS_LAST_FREE_OBJECTID )
goto skip ;
eb = path - > nodes [ 0 ] ;
slot = path - > slots [ 0 ] ;
item_size = btrfs_item_size_nr ( eb , slot ) ;
if ( item_size < sizeof ( root_item ) )
goto skip ;
read_extent_buffer ( eb , & root_item ,
btrfs_item_ptr_offset ( eb , slot ) ,
( int ) sizeof ( root_item ) ) ;
if ( btrfs_root_refs ( & root_item ) = = 0 )
goto skip ;
2013-08-28 10:28:34 +01:00
if ( ! btrfs_is_empty_uuid ( root_item . uuid ) | |
! btrfs_is_empty_uuid ( root_item . received_uuid ) ) {
if ( trans )
goto update_tree ;
btrfs_release_path ( path ) ;
2013-08-15 17:11:21 +02:00
/*
* 1 - subvol uuid item
* 1 - received_subvol uuid item
*/
trans = btrfs_start_transaction ( fs_info - > uuid_root , 2 ) ;
if ( IS_ERR ( trans ) ) {
ret = PTR_ERR ( trans ) ;
break ;
}
2013-08-28 10:28:34 +01:00
continue ;
} else {
goto skip ;
}
update_tree :
btrfs: drop path before adding new uuid tree entry
With the conversion of the tree locks to rwsem I got the following
lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted
------------------------------------------------------
btrfs-uuid/7955 is trying to acquire lock:
ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
but task is already holding lock:
ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (btrfs-uuid-00){++++}-{3:3}:
down_read_nested+0x3e/0x140
__btrfs_tree_read_lock+0x39/0x180
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x4bd/0x990
btrfs_uuid_tree_add+0x89/0x2d0
btrfs_uuid_scan_kthread+0x330/0x390
kthread+0x133/0x150
ret_from_fork+0x1f/0x30
-> #0 (btrfs-root-00){++++}-{3:3}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
down_read_nested+0x3e/0x140
__btrfs_tree_read_lock+0x39/0x180
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x4bd/0x990
btrfs_find_root+0x45/0x1b0
btrfs_read_tree_root+0x61/0x100
btrfs_get_root_ref.part.50+0x143/0x630
btrfs_uuid_tree_iterate+0x207/0x314
btrfs_uuid_rescan_kthread+0x12/0x50
kthread+0x133/0x150
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(btrfs-uuid-00);
                               lock(btrfs-root-00);
                               lock(btrfs-uuid-00);
  lock(btrfs-root-00);
*** DEADLOCK ***
1 lock held by btrfs-uuid/7955:
#0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
stack backtrace:
CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? __btrfs_tree_read_lock+0x39/0x180
? btrfs_root_node+0x1c/0x1d0
down_read_nested+0x3e/0x140
? __btrfs_tree_read_lock+0x39/0x180
__btrfs_tree_read_lock+0x39/0x180
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x4bd/0x990
btrfs_find_root+0x45/0x1b0
btrfs_read_tree_root+0x61/0x100
btrfs_get_root_ref.part.50+0x143/0x630
btrfs_uuid_tree_iterate+0x207/0x314
? btree_readpage+0x20/0x20
btrfs_uuid_rescan_kthread+0x12/0x50
kthread+0x133/0x150
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x1f/0x30
This problem exists because we have two different rescan threads,
btrfs_uuid_scan_kthread which creates the uuid tree, and
btrfs_uuid_tree_iterate that goes through and updates or deletes any out
of date roots. The problem is they both do things in different order.
btrfs_uuid_scan_kthread() reads the tree_root, and then inserts entries
into the uuid_root. btrfs_uuid_tree_iterate() scans the uuid_root, but
then does a btrfs_get_fs_root() which can read from the tree_root.
It's actually easy enough to not be holding the path in
btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop
it further down and re-start the search when we loop. So simply move
the path release before we add our entry to the uuid tree.
This also fixes a problem where we're holding a path open after we do
btrfs_end_transaction(), which has its own problems.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
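The lock-order inversion described in the commit message above can be modelled in user space. The following is a minimal sketch, not lockdep's actual algorithm and not kernel code: the lock classes, `struct order_graph`, `toy_lock()` and the two thread functions are all hypothetical names introduced for illustration. It records which lock was taken while another was held, and flags the first acquisition that reverses a previously seen order.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes, named after the two trees in the splat. */
enum lock_class { LOCK_ROOT_TREE, LOCK_UUID_TREE, LOCK_NR };

struct order_graph {
	bool held[LOCK_NR];          /* locks held by the current task */
	bool dep[LOCK_NR][LOCK_NR];  /* dep[a][b]: b was taken while a was held */
};

/* A new task starts with nothing held; recorded dependencies persist. */
static void task_begin(struct order_graph *g)
{
	memset(g->held, 0, sizeof(g->held));
}

/* Returns false if taking @c reverses an order that was seen before. */
static bool toy_lock(struct order_graph *g, enum lock_class c)
{
	bool ok = true;

	assert(c < LOCK_NR);
	for (int h = 0; h < LOCK_NR; h++) {
		if (!g->held[h] || h == (int)c)
			continue;
		if (g->dep[c][h])
			ok = false;          /* h was once taken while c was held */
		g->dep[h][c] = true;
	}
	g->held[c] = true;
	return ok;
}

static void toy_unlock(struct order_graph *g, enum lock_class c)
{
	g->held[c] = false;
}

/* Models the scan thread: read tree_root, then insert into the uuid tree. */
static bool scan_thread(struct order_graph *g, bool release_path_first)
{
	bool ok;

	task_begin(g);
	ok = toy_lock(g, LOCK_ROOT_TREE);
	if (release_path_first)
		toy_unlock(g, LOCK_ROOT_TREE);   /* the fix: drop the path first */
	ok &= toy_lock(g, LOCK_UUID_TREE);
	return ok;
}

/* Models the iterate thread: scan the uuid tree, then read tree_root. */
static bool iterate_thread(struct order_graph *g)
{
	bool ok;

	task_begin(g);
	ok = toy_lock(g, LOCK_UUID_TREE);
	ok &= toy_lock(g, LOCK_ROOT_TREE);
	return ok;
}

/* Returns true when running both threads records no inversion. */
static bool order_is_safe(bool release_path_first)
{
	struct order_graph g;

	memset(&g, 0, sizeof(g));
	return scan_thread(&g, release_path_first) & iterate_thread(&g);
}
```

In this model `order_is_safe(false)` reports an inversion, because the scan thread takes the uuid lock while still holding the root lock, and `order_is_safe(true)` does not, because the root lock is released before the uuid lock is taken.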
2020-08-10 11:42:26 -04:00
btrfs_release_path ( path ) ;
2013-08-28 10:28:34 +01:00
if ( ! btrfs_is_empty_uuid ( root_item . uuid ) ) {
2018-05-29 15:01:53 +08:00
ret = btrfs_uuid_tree_add ( trans , root_item . uuid ,
2013-08-15 17:11:21 +02:00
BTRFS_UUID_KEY_SUBVOL ,
key . objectid ) ;
if ( ret < 0 ) {
2013-12-20 11:37:06 -05:00
btrfs_warn ( fs_info , "uuid_tree_add failed %d" ,
2013-08-15 17:11:21 +02:00
ret ) ;
break ;
}
}
if ( ! btrfs_is_empty_uuid ( root_item . received_uuid ) ) {
2018-05-29 15:01:53 +08:00
ret = btrfs_uuid_tree_add ( trans ,
2013-08-15 17:11:21 +02:00
root_item . received_uuid ,
BTRFS_UUID_KEY_RECEIVED_SUBVOL ,
key . objectid ) ;
if ( ret < 0 ) {
2013-12-20 11:37:06 -05:00
btrfs_warn ( fs_info , "uuid_tree_add failed %d" ,
2013-08-15 17:11:21 +02:00
ret ) ;
break ;
}
}
2013-08-28 10:28:34 +01:00
skip :
2020-08-10 11:42:26 -04:00
btrfs_release_path ( path ) ;
2013-08-15 17:11:21 +02:00
if ( trans ) {
2016-09-09 21:39:03 -04:00
ret = btrfs_end_transaction ( trans ) ;
2013-08-28 10:28:34 +01:00
trans = NULL ;
2013-08-15 17:11:21 +02:00
if ( ret )
break ;
}
if ( key . offset < ( u64 ) - 1 ) {
key . offset + + ;
} else if ( key . type < BTRFS_ROOT_ITEM_KEY ) {
key . offset = 0 ;
key . type = BTRFS_ROOT_ITEM_KEY ;
} else if ( key . objectid < ( u64 ) - 1 ) {
key . offset = 0 ;
key . type = BTRFS_ROOT_ITEM_KEY ;
key . objectid + + ;
} else {
break ;
}
cond_resched ( ) ;
}
out :
btrfs_free_path ( path ) ;
2013-08-28 10:28:34 +01:00
if ( trans & & ! IS_ERR ( trans ) )
2016-09-09 21:39:03 -04:00
btrfs_end_transaction ( trans ) ;
2013-08-15 17:11:21 +02:00
if ( ret )
2013-12-20 11:37:06 -05:00
btrfs_warn ( fs_info , "btrfs_uuid_scan_kthread failed %d" , ret ) ;
2020-02-14 15:05:01 -05:00
else if ( ! closing )
2016-09-02 15:40:02 -04:00
set_bit ( BTRFS_FS_UPDATE_UUID_TREE_GEN , & fs_info - > flags ) ;
2013-08-15 17:11:21 +02:00
up ( & fs_info - > uuid_tree_rescan_sem ) ;
return 0 ;
}
2013-08-15 17:11:19 +02:00
int btrfs_create_uuid_tree ( struct btrfs_fs_info * fs_info )
{
struct btrfs_trans_handle * trans ;
struct btrfs_root * tree_root = fs_info - > tree_root ;
struct btrfs_root * uuid_root ;
2013-08-15 17:11:21 +02:00
struct task_struct * task ;
int ret ;
2013-08-15 17:11:19 +02:00
/*
* 1 - root node
* 1 - root item
*/
trans = btrfs_start_transaction ( tree_root , 2 ) ;
if ( IS_ERR ( trans ) )
return PTR_ERR ( trans ) ;
2019-03-20 13:20:49 +01:00
uuid_root = btrfs_create_tree ( trans , BTRFS_UUID_TREE_OBJECTID ) ;
2013-08-15 17:11:19 +02:00
if ( IS_ERR ( uuid_root ) ) {
2015-04-24 19:12:01 +02:00
ret = PTR_ERR ( uuid_root ) ;
2016-06-10 18:19:25 -04:00
btrfs_abort_transaction ( trans , ret ) ;
2016-09-09 21:39:03 -04:00
btrfs_end_transaction ( trans ) ;
2015-04-24 19:12:01 +02:00
return ret ;
2013-08-15 17:11:19 +02:00
}
fs_info - > uuid_root = uuid_root ;
2016-09-09 21:39:03 -04:00
ret = btrfs_commit_transaction ( trans ) ;
2013-08-15 17:11:21 +02:00
if ( ret )
return ret ;
down ( & fs_info - > uuid_tree_rescan_sem ) ;
task = kthread_run ( btrfs_uuid_scan_kthread , fs_info , "btrfs-uuid" ) ;
if ( IS_ERR ( task ) ) {
2013-08-15 17:11:23 +02:00
/* fs_info->update_uuid_tree_gen remains 0 in all error cases */
2013-12-20 11:37:06 -05:00
btrfs_warn ( fs_info , "failed to start uuid_scan task" ) ;
2013-08-15 17:11:21 +02:00
up ( & fs_info - > uuid_tree_rescan_sem ) ;
return PTR_ERR ( task ) ;
}
return 0 ;
2013-08-15 17:11:19 +02:00
}
2013-08-15 17:11:21 +02:00
2008-04-25 16:53:30 -04:00
/*
* Shrinking a device means finding all of the device extents past
* the new size, and then following the back refs to the chunks.
* The chunk relocation code actually frees the device extent.
*/
int btrfs_shrink_device ( struct btrfs_device * device , u64 new_size )
{
2016-06-22 18:54:23 -04:00
struct btrfs_fs_info * fs_info = device - > fs_info ;
struct btrfs_root * root = fs_info - > dev_root ;
2008-04-25 16:53:30 -04:00
struct btrfs_trans_handle * trans ;
struct btrfs_dev_extent * dev_extent = NULL ;
struct btrfs_path * path ;
u64 length ;
u64 chunk_offset ;
int ret ;
int slot ;
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have a lot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
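The "try every chunk, skip the ones that hit -ENOSPC, and only then give up" policy from the commit message above can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel implementation: `relocate_all()`, the `relocate_fn` callback, and the demo state are hypothetical, and the single retry pass is an assumption suggested by the `failed` and `retried` variable names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns 0 on success or a negative errno. */
typedef int (*relocate_fn)(size_t chunk, void *ctx);

/*
 * Try to relocate every chunk. A chunk that fails with -ENOSPC is
 * skipped rather than aborting the whole pass; if anything failed,
 * one more pass is made, since earlier successes may have freed room.
 */
static int relocate_all(size_t nr_chunks, relocate_fn relocate, void *ctx)
{
	bool retried = false;
	int failed;

again:
	failed = 0;
	for (size_t i = 0; i < nr_chunks; i++) {
		int ret = relocate(i, ctx);

		if (ret == -ENOSPC) {
			failed++;        /* no room right now, move on */
			continue;
		}
		if (ret < 0)
			return ret;      /* any other error is fatal */
	}
	if (failed && !retried) {
		retried = true;
		goto again;
	}
	return failed ? -ENOSPC : 0;
}

/* Demo state: chunk 0 needs one unit of room, chunk 1 frees one unit. */
struct demo_state {
	bool moved[2];
	int free_space;
};

static int demo_relocate(size_t chunk, void *ctx)
{
	struct demo_state *s = ctx;

	if (s->moved[chunk])
		return 0;                /* already relocated on an earlier pass */
	if (chunk == 0) {
		if (s->free_space < 1)
			return -ENOSPC;
		s->free_space--;
	} else {
		s->free_space++;
	}
	s->moved[chunk] = true;
	return 0;
}
```

With both demo chunks present, the first pass skips chunk 0 and relocates chunk 1, and the retry pass then succeeds on chunk 0. With only chunk 0 present, both passes fail and the caller sees -ENOSPC.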
2009-09-11 16:11:19 -04:00
int failed = 0 ;
bool retried = false ;
2008-04-25 16:53:30 -04:00
struct extent_buffer * l ;
struct btrfs_key key ;
2016-06-22 18:54:23 -04:00
struct btrfs_super_block * super_copy = fs_info - > super_copy ;
2008-04-25 16:53:30 -04:00
u64 old_total = btrfs_super_total_bytes ( super_copy ) ;
2014-09-03 21:35:38 +08:00
u64 old_size = btrfs_device_get_total_bytes ( device ) ;
2017-06-16 14:39:20 +03:00
u64 diff ;
2019-03-25 14:31:23 +02:00
u64 start ;
2017-06-16 14:39:20 +03:00
new_size = round_down ( new_size , fs_info - > sectorsize ) ;
2019-03-25 14:31:23 +02:00
start = new_size ;
2017-07-21 11:28:24 +03:00
diff = round_down ( old_size - new_size , fs_info - > sectorsize ) ;
2008-04-25 16:53:30 -04:00
2017-12-04 12:54:55 +08:00
if ( test_bit ( BTRFS_DEV_STATE_REPLACE_TGT , & device - > dev_state ) )
2012-11-05 18:29:28 +01:00
return - EINVAL ;
2008-04-25 16:53:30 -04:00
path = btrfs_alloc_path ( ) ;
if ( ! path )
return - ENOMEM ;
2018-04-27 16:22:07 +08:00
path - > reada = READA_BACK ;
2008-04-25 16:53:30 -04:00
2019-03-25 14:31:23 +02:00
trans = btrfs_start_transaction ( root , 0 ) ;
if ( IS_ERR ( trans ) ) {
btrfs_free_path ( path ) ;
return PTR_ERR ( trans ) ;
}
2016-10-04 19:34:27 +02:00
mutex_lock ( & fs_info - > chunk_mutex ) ;
2008-07-08 14:19:17 -04:00
2014-09-03 21:35:38 +08:00
btrfs_device_set_total_bytes ( device , new_size ) ;
2017-12-04 12:54:52 +08:00
if ( test_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) ) {
2008-11-17 21:11:30 -05:00
device - > fs_devices - > total_rw_bytes - = diff ;
2017-05-11 09:17:46 +03:00
atomic64_sub ( diff , & fs_info - > free_chunk_space ) ;
2011-09-26 17:12:22 -04:00
}
2019-03-25 14:31:23 +02:00
/*
* Once the device's size has been set to the new size, ensure all
* in-memory chunks are synced to disk so that the loop below sees them
* and relocates them accordingly.
*/
2019-03-27 14:24:12 +02:00
if ( contains_pending_extent ( device , & start , diff ) ) {
2019-03-25 14:31:23 +02:00
mutex_unlock ( & fs_info - > chunk_mutex ) ;
ret = btrfs_commit_transaction ( trans ) ;
if ( ret )
goto done ;
} else {
mutex_unlock ( & fs_info - > chunk_mutex ) ;
btrfs_end_transaction ( trans ) ;
}
2008-04-25 16:53:30 -04:00
2009-09-11 16:11:19 -04:00
again :
2008-04-25 16:53:30 -04:00
key . objectid = device - > devid ;
key . offset = ( u64 ) - 1 ;
key . type = BTRFS_DEV_EXTENT_KEY ;
2012-03-27 17:09:18 +03:00
do {
2021-04-19 16:41:01 +09:00
mutex_lock ( & fs_info - > reclaim_bgs_lock ) ;
2008-04-25 16:53:30 -04:00
ret = btrfs_search_slot ( NULL , root , & key , path , 0 , 0 ) ;
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
        CPU 0                                CPU 1

btrfs_balance()
  btrfs_relocate_chunk()
    btrfs_relocate_block_group(bg X)
      btrfs_lookup_block_group(bg X)

                                      cleaner_kthread
                                        locks fs_info->cleaner_mutex

                                        btrfs_delete_unused_bgs()
                                          finds bg X, which became
                                          unused in the previous
                                          transaction

                                          checks bg X ->ro == 0,
                                          so it proceeds
      sets bg X ->ro to 1
      (btrfs_set_block_group_ro(bg X))

      blocks on fs_info->cleaner_mutex
                                          btrfs_remove_chunk(bg X)

                                        unlocks fs_info->cleaner_mutex

      acquires fs_info->cleaner_mutex
      relocate_block_group()
        --> does nothing, no extents found in
            the extent tree from bg X
      unlocks fs_info->cleaner_mutex

      btrfs_relocate_block_group(bg X) returns

    btrfs_remove_chunk(bg X)
      extent map not found
        --> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 00:58:53 +01:00
if ( ret < 0 ) {
2021-04-19 16:41:01 +09:00
mutex_unlock ( & fs_info - > reclaim_bgs_lock ) ;
2008-04-25 16:53:30 -04:00
goto done ;
2015-06-11 00:58:53 +01:00
}
2008-04-25 16:53:30 -04:00
ret = btrfs_previous_item ( root , path , 0 , key . type ) ;
if ( ret ) {
2021-04-19 16:41:01 +09:00
mutex_unlock ( & fs_info - > reclaim_bgs_lock ) ;
2020-12-17 15:21:16 +02:00
if ( ret < 0 )
goto done ;
2008-04-25 16:53:30 -04:00
ret = 0 ;
2011-04-21 01:20:15 +02:00
btrfs_release_path ( path ) ;
2009-07-22 09:59:00 -04:00
break ;
2008-04-25 16:53:30 -04:00
}
l = path - > nodes [ 0 ] ;
slot = path - > slots [ 0 ] ;
btrfs_item_key_to_cpu ( l , & key , path - > slots [ 0 ] ) ;
2009-09-11 16:11:19 -04:00
		if (key.objectid != device->devid) {
2021-04-19 16:41:01 +09:00
			mutex_unlock(&fs_info->reclaim_bgs_lock);
2011-04-21 01:20:15 +02:00
			btrfs_release_path(path);
2009-07-22 09:59:00 -04:00
			break;
2009-09-11 16:11:19 -04:00
		}
2008-04-25 16:53:30 -04:00
		dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
		length = btrfs_dev_extent_length(l, dev_extent);
2009-09-11 16:11:19 -04:00
		if (key.offset + length <= new_size) {
2021-04-19 16:41:01 +09:00
			mutex_unlock(&fs_info->reclaim_bgs_lock);
2011-04-21 01:20:15 +02:00
			btrfs_release_path(path);
2009-04-27 07:29:03 -04:00
			break;
2009-09-11 16:11:19 -04:00
		}
2008-04-25 16:53:30 -04:00
		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
2011-04-21 01:20:15 +02:00
		btrfs_release_path(path);
2008-04-25 16:53:30 -04:00
2017-11-15 16:28:11 -07:00
		/*
		 * We may be relocating the only data chunk we have,
		 * which could potentially end up with losing data's
		 * raid profile, so let's allocate an empty one in
		 * advance.
		 */
		ret = btrfs_may_alloc_data_chunk(fs_info, chunk_offset);
		if (ret < 0) {
2021-04-19 16:41:01 +09:00
			mutex_unlock(&fs_info->reclaim_bgs_lock);
2017-11-15 16:28:11 -07:00
			goto done;
		}
2016-06-22 18:54:23 -04:00
		ret = btrfs_relocate_chunk(fs_info, chunk_offset);
2021-04-19 16:41:01 +09:00
		mutex_unlock(&fs_info->reclaim_bgs_lock);
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
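The check described in the message above can be sketched in userspace C. This is only an illustration: the kernel keeps the pinned block groups in a red-black tree, while this sketch swaps in a sorted array searched with bsearch(). The names cmp_u64 and check_relocation_allowed are hypothetical and do not exist in btrfs.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison of two u64 values, as bsearch() expects. */
static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Hypothetical helper: "pinned" holds the start offsets of block groups
 * that contain an active swapfile, sorted in ascending order. Relocating
 * such a block group is refused with -ETXTBSY, otherwise 0 is returned.
 */
static int check_relocation_allowed(const uint64_t *pinned, size_t n,
				    uint64_t bg_start)
{
	if (bsearch(&bg_start, pinned, n, sizeof(*pinned), cmp_u64))
		return -ETXTBSY;
	return 0;
}
```

A caller that shrinks a device would stop on -ETXTBSY, while a balance could skip the block group and continue, matching the "skips it or errors out" wording above.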
2016-11-03 10:28:12 -07:00
		if (ret == -ENOSPC) {
2009-09-11 16:11:19 -04:00
			failed++;
2016-11-03 10:28:12 -07:00
		} else if (ret) {
			if (ret == -ETXTBSY) {
				btrfs_warn(fs_info,
		   "could not shrink block group %llu due to active swapfile",
					   chunk_offset);
			}
			goto done;
		}
2012-03-27 17:09:18 +03:00
	} while (key.offset-- > 0);
2009-09-11 16:11:19 -04:00
	if (failed && !retried) {
		failed = 0;
		retried = true;
		goto again;
	} else if (failed && retried) {
		ret = -ENOSPC;
		goto done;
2008-04-25 16:53:30 -04:00
	}
2009-04-27 07:29:03 -04:00
	/* Shrinking succeeded, else we would be at "done". */
2010-05-16 10:48:46 -04:00
	trans = btrfs_start_transaction(root, 0);
2011-01-20 06:19:37 +00:00
	if (IS_ERR(trans)) {
		ret = PTR_ERR(trans);
		goto done;
	}
2016-10-04 19:34:27 +02:00
	mutex_lock(&fs_info->chunk_mutex);
2020-07-31 19:29:11 +08:00
	/* Clear all state bits beyond the shrunk device size */
	clear_extent_bits(&device->alloc_state, new_size, (u64)-1,
			  CHUNK_STATE_MASK);
2014-09-03 21:35:38 +08:00
	btrfs_device_set_disk_total_bytes(device, new_size);
2019-03-25 14:31:22 +02:00
	if (list_empty(&device->post_commit_list))
		list_add_tail(&device->post_commit_list,
			      &trans->transaction->dev_update_list);
2009-04-27 07:29:03 -04:00
	WARN_ON(diff > old_total);
2017-06-16 14:39:20 +03:00
	btrfs_set_super_total_bytes(super_copy,
			round_down(old_total - diff, fs_info->sectorsize));
2016-10-04 19:34:27 +02:00
	mutex_unlock(&fs_info->chunk_mutex);
2014-09-03 21:35:41 +08:00
	/* Now btrfs_update_device() will change the on-disk size. */
	ret = btrfs_update_device(trans, device);
2018-08-06 18:12:37 +08:00
	if (ret < 0) {
		btrfs_abort_transaction(trans, ret);
		btrfs_end_transaction(trans);
	} else {
		ret = btrfs_commit_transaction(trans);
	}
2008-04-25 16:53:30 -04:00
done:
	btrfs_free_path(path);
2015-06-02 14:43:21 +01:00
	if (ret) {
2016-10-04 19:34:27 +02:00
		mutex_lock(&fs_info->chunk_mutex);
2015-06-02 14:43:21 +01:00
		btrfs_device_set_total_bytes(device, old_size);
2017-12-04 12:54:52 +08:00
		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
2015-06-02 14:43:21 +01:00
			device->fs_devices->total_rw_bytes += diff;
2017-05-11 09:17:46 +03:00
		atomic64_add(diff, &fs_info->free_chunk_space);
2016-10-04 19:34:27 +02:00
		mutex_unlock(&fs_info->chunk_mutex);
2015-06-02 14:43:21 +01:00
	}
2008-04-25 16:53:30 -04:00
	return ret;
}
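The failed/retried handling in the function above follows a "count -ENOSPC, retry the whole walk once" pattern. A minimal userspace sketch of that control flow is given here, under stated assumptions: relocate_fn, relocate_all_with_retry, demo_ctx and demo_relocate are hypothetical names, and unlike the kernel code the sketch revisits every item on the second pass instead of only the device extents that still remain.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: relocate item "index", return 0 or a negative errno. */
typedef int (*relocate_fn)(size_t index, void *ctx);

/*
 * Walk the items from last to first. -ENOSPC is counted instead of being
 * treated as fatal; any other error aborts at once. If something failed,
 * the whole walk is retried exactly once before giving up with -ENOSPC.
 */
static int relocate_all_with_retry(size_t count, relocate_fn relocate, void *ctx)
{
	bool retried = false;
	int failed;

again:
	failed = 0;
	for (size_t i = count; i-- > 0; ) {
		int ret = relocate(i, ctx);

		if (ret == -ENOSPC)
			failed++;
		else if (ret)
			return ret;
	}
	if (failed && !retried) {
		retried = true;
		goto again;
	}
	return failed ? -ENOSPC : 0;
}

/* Demo state: item 0 reports -ENOSPC for the first "enospc_until" calls. */
struct demo_ctx {
	int calls;
	int enospc_until;
};

static int demo_relocate(size_t index, void *ctx)
{
	struct demo_ctx *d = ctx;

	d->calls++;
	if (index == 0 && d->calls <= d->enospc_until)
		return -ENOSPC;
	return 0;
}
```

The second pass is useful because relocating other items during the first pass may have freed enough space for the one that failed.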
2016-06-22 18:54:24 -04:00
static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
2008-03-24 15:01:56 -04:00
				  struct btrfs_key *key,
				  struct btrfs_chunk *chunk, int item_size)
{
2016-06-22 18:54:23 -04:00
	struct btrfs_super_block *super_copy = fs_info->super_copy;
2008-03-24 15:01:56 -04:00
	struct btrfs_disk_key disk_key;
	u32 array_size;
	u8 *ptr;
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by and large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
	lockdep_assert_held(&fs_info->chunk_mutex);
2008-03-24 15:01:56 -04:00
	array_size = btrfs_super_sys_array_size(super_copy);
2014-04-21 20:13:11 +08:00
	if (array_size + item_size + sizeof(disk_key)
2021-06-29 14:43:06 +01:00
			> BTRFS_SYSTEM_CHUNK_ARRAY_SIZE)
2008-03-24 15:01:56 -04:00
		return -EFBIG;
	ptr = super_copy->sys_chunk_array + array_size;
	btrfs_cpu_key_to_disk(&disk_key, key);
	memcpy(ptr, &disk_key, sizeof(disk_key));
	ptr += sizeof(disk_key);
	memcpy(ptr, chunk, item_size);
	item_size += sizeof(disk_key);
	btrfs_set_super_sys_array_size(super_copy, array_size + item_size);
2014-09-03 21:35:39 +08:00
2008-03-24 15:01:56 -04:00
	return 0;
}
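The append logic of btrfs_add_system_chunk() can be illustrated with a small userspace sketch: a key and an item are packed back to back into a fixed-size byte array, and the call fails with -EFBIG when the pair would not fit. The names sys_array and sys_array_append are hypothetical, and the capacity of 2048 bytes is an assumption made here to mirror BTRFS_SYSTEM_CHUNK_ARRAY_SIZE.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Assumed capacity, chosen to mirror BTRFS_SYSTEM_CHUNK_ARRAY_SIZE. */
#define SYS_ARRAY_CAPACITY 2048

/* Hypothetical stand-in for the superblock's system chunk array. */
struct sys_array {
	uint32_t used;				/* bytes already occupied */
	uint8_t bytes[SYS_ARRAY_CAPACITY];
};

/*
 * Append "key" followed by "item" at the current end of the array.
 * Returns 0 on success or -EFBIG if the pair does not fit; in the
 * failure case the array is left untouched.
 */
static int sys_array_append(struct sys_array *arr,
			    const void *key, uint32_t key_size,
			    const void *item, uint32_t item_size)
{
	uint8_t *ptr;

	if ((uint64_t)arr->used + key_size + item_size > SYS_ARRAY_CAPACITY)
		return -EFBIG;

	ptr = arr->bytes + arr->used;
	memcpy(ptr, key, key_size);
	ptr += key_size;
	memcpy(ptr, item, item_size);
	arr->used += key_size + item_size;
	return 0;
}
```

Checking the size before copying anything keeps the array consistent on failure, which matters because the kernel version writes directly into the in-memory superblock copy.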
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 12:07:57 +02:00
/*
 * sort the devices in descending order by max_avail, total_avail
 */
static int btrfs_cmp_device_info(const void *a, const void *b)
2008-04-18 10:29:51 -04:00
{
2011-04-12 12:07:57 +02:00
	const struct btrfs_device_info *di_a = a;
	const struct btrfs_device_info *di_b = b;
2008-04-18 10:29:51 -04:00
2011-04-12 12:07:57 +02:00
	if (di_a->max_avail > di_b->max_avail)
2011-01-05 10:07:28 +00:00
		return -1;
2011-04-12 12:07:57 +02:00
	if (di_a->max_avail < di_b->max_avail)
2011-01-05 10:07:28 +00:00
		return 1;
2011-04-12 12:07:57 +02:00
	if (di_a->total_avail > di_b->total_avail)
		return -1;
	if (di_a->total_avail < di_b->total_avail)
		return 1;
	return 0;
2011-01-05 10:07:28 +00:00
}
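How a comparator of this shape orders devices can be shown with a userspace sketch that uses qsort() from the C standard library. The struct device_info and the helper sort_devices are hypothetical simplifications introduced here, not the kernel's types.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified per-device record. */
struct device_info {
	int id;
	uint64_t max_avail;	/* largest contiguous free extent */
	uint64_t total_avail;	/* total free space on the device */
};

/* Descending by max_avail, with ties broken by descending total_avail. */
static int cmp_device_info(const void *a, const void *b)
{
	const struct device_info *di_a = a;
	const struct device_info *di_b = b;

	if (di_a->max_avail > di_b->max_avail)
		return -1;
	if (di_a->max_avail < di_b->max_avail)
		return 1;
	if (di_a->total_avail > di_b->total_avail)
		return -1;
	if (di_a->total_avail < di_b->total_avail)
		return 1;
	return 0;
}

/* Sort so that the devices with the most available space come first. */
static void sort_devices(struct device_info *devs, size_t n)
{
	qsort(devs, n, sizeof(*devs), cmp_device_info);
}
```

Explicit comparisons are used instead of subtracting the two values because the difference of two 64-bit quantities does not fit the int return type.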
2008-03-24 15:01:56 -04:00
2013-01-29 18:40:14 -05:00
static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
{
2015-01-20 15:11:44 +08:00
	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
2013-01-29 18:40:14 -05:00
		return;
2013-04-11 10:30:16 +00:00
	btrfs_set_fs_incompat(info, RAID56);
2013-01-29 18:40:14 -05:00
}
2018-07-10 18:15:05 +02:00
static void check_raid1c34_incompat_flag(struct btrfs_fs_info *info, u64 type)
{
	if (!(type & (BTRFS_BLOCK_GROUP_RAID1C3 | BTRFS_BLOCK_GROUP_RAID1C4)))
		return;
	btrfs_set_fs_incompat(info, RAID1C34);
}
2020-02-25 12:56:10 +09:00
/*
2021-08-18 13:41:19 +03:00
 * Structure used internally for btrfs_create_chunk() function.
2020-02-25 12:56:10 +09:00
 * Wraps needed parameters.
 */
struct alloc_chunk_ctl {
	u64 start;
	u64 type;
	/* Total number of stripes to allocate */
	int num_stripes;
	/* sub_stripes info for map */
	int sub_stripes;
	/* Stripes per device */
	int dev_stripes;
	/* Maximum number of devices to use */
	int devs_max;
	/* Minimum number of devices to use */
	int devs_min;
	/* ndevs has to be a multiple of this */
	int devs_increment;
	/* Number of copies */
	int ncopies;
	/* Number of stripes worth of bytes to store parity information */
	int nparity;
	u64 max_stripe_size;
	u64 max_chunk_size;
2020-02-25 12:56:15 +09:00
	u64 dev_extent_min;
2020-02-25 12:56:10 +09:00
	u64 stripe_size;
	u64 chunk_size;
	int ndevs;
};
2020-02-25 12:56:11 +09:00
static void init_alloc_chunk_ctl_policy_regular(
				struct btrfs_fs_devices *fs_devices,
				struct alloc_chunk_ctl *ctl)
{
	u64 type = ctl->type;

	if (type & BTRFS_BLOCK_GROUP_DATA) {
		ctl->max_stripe_size = SZ_1G;
		ctl->max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE;
	} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
		/* For larger filesystems, use larger metadata chunks */
		if (fs_devices->total_rw_bytes > 50ULL * SZ_1G)
			ctl->max_stripe_size = SZ_1G;
		else
			ctl->max_stripe_size = SZ_256M;
		ctl->max_chunk_size = ctl->max_stripe_size;
	} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
		ctl->max_stripe_size = SZ_32M;
		ctl->max_chunk_size = 2 * ctl->max_stripe_size;
		ctl->devs_max = min_t(int, ctl->devs_max,
				      BTRFS_MAX_DEVS_SYS_CHUNK);
	} else {
		BUG();
	}
	/* We don't want a chunk larger than 10% of writable space */
	ctl->max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
				  ctl->max_chunk_size);
2020-02-25 12:56:15 +09:00
	ctl->dev_extent_min = BTRFS_STRIPE_LEN * ctl->dev_stripes;
2020-02-25 12:56:11 +09:00
}
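The size limits chosen by the regular policy can be worked through with a userspace sketch. It assumes that div_factor(x, 1) yields one tenth of x and that BTRFS_MAX_DATA_CHUNK_SIZE is 10 GiB; both are assumptions stated here rather than taken from the code above. The names chunk_kind, chunk_limits and regular_limits are hypothetical.

```c
#include <stdint.h>

#define SZ_1M			(1024ULL * 1024)
#define SZ_1G			(1024ULL * SZ_1M)
/* Assumed value of BTRFS_MAX_DATA_CHUNK_SIZE. */
#define MAX_DATA_CHUNK_SIZE	(10ULL * SZ_1G)

/* Hypothetical simplification of the block group type flags. */
enum chunk_kind { CHUNK_DATA, CHUNK_METADATA, CHUNK_SYSTEM };

struct chunk_limits {
	uint64_t max_stripe_size;
	uint64_t max_chunk_size;
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Per-type limits, then a cap at 10% of the writable space. */
static struct chunk_limits regular_limits(enum chunk_kind kind,
					  uint64_t total_rw_bytes)
{
	struct chunk_limits lim = { 0, 0 };

	switch (kind) {
	case CHUNK_DATA:
		lim.max_stripe_size = SZ_1G;
		lim.max_chunk_size = MAX_DATA_CHUNK_SIZE;
		break;
	case CHUNK_METADATA:
		/* Larger filesystems get larger metadata chunks. */
		if (total_rw_bytes > 50ULL * SZ_1G)
			lim.max_stripe_size = SZ_1G;
		else
			lim.max_stripe_size = 256 * SZ_1M;
		lim.max_chunk_size = lim.max_stripe_size;
		break;
	case CHUNK_SYSTEM:
		lim.max_stripe_size = 32 * SZ_1M;
		lim.max_chunk_size = 2 * lim.max_stripe_size;
		break;
	}
	lim.max_chunk_size = min_u64(total_rw_bytes / 10, lim.max_chunk_size);
	return lim;
}
```

For example, on a 20 GiB filesystem a data chunk is capped at 2 GiB by the 10% rule, while a metadata chunk stays at 256 MiB because the per-type limit is already smaller.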
2021-02-04 19:21:48 +09:00
static void init_alloc_chunk_ctl_policy_zoned (
struct btrfs_fs_devices * fs_devices ,
struct alloc_chunk_ctl * ctl )
{
u64 zone_size = fs_devices - > fs_info - > zone_size ;
u64 limit ;
int min_num_stripes = ctl - > devs_min * ctl - > dev_stripes ;
int min_data_stripes = ( min_num_stripes - ctl - > nparity ) / ctl - > ncopies ;
u64 min_chunk_size = min_data_stripes * zone_size ;
u64 type = ctl - > type ;
ctl - > max_stripe_size = zone_size ;
if ( type & BTRFS_BLOCK_GROUP_DATA ) {
ctl - > max_chunk_size = round_down ( BTRFS_MAX_DATA_CHUNK_SIZE ,
zone_size ) ;
} else if ( type & BTRFS_BLOCK_GROUP_METADATA ) {
ctl - > max_chunk_size = ctl - > max_stripe_size ;
} else if ( type & BTRFS_BLOCK_GROUP_SYSTEM ) {
ctl - > max_chunk_size = 2 * ctl - > max_stripe_size ;
ctl - > devs_max = min_t ( int , ctl - > devs_max ,
BTRFS_MAX_DEVS_SYS_CHUNK ) ;
2021-03-23 15:31:19 +01:00
} else {
BUG ( ) ;
2021-02-04 19:21:48 +09:00
}
/* We don't want a chunk larger than 10% of writable space */
limit = max ( round_down ( div_factor ( fs_devices - > total_rw_bytes , 1 ) ,
zone_size ) ,
min_chunk_size ) ;
ctl - > max_chunk_size = min ( limit , ctl - > max_chunk_size ) ;
ctl - > dev_extent_min = zone_size * ctl - > dev_stripes ;
}
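The zoned policy above differs from the regular one in that the 10% cap is rounded down to a zone boundary and is never allowed to fall below the smallest chunk the profile can form. A minimal user-space sketch of that arithmetic follows; `zoned_max_chunk_size` and `round_down_to` are hypothetical names, the modulo-based rounding is a substitute that works for any zone size, and `type_max_chunk_size` is assumed to be the per-type value computed before the cap.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_256M (256ULL << 20)
#define SZ_1G   (1ULL << 30)

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

static uint64_t max_u64(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/* Round x down to a multiple of step; works for any non-zero step. */
static uint64_t round_down_to(uint64_t x, uint64_t step)
{
	assert(step != 0);
	return x - (x % step);
}

/* Cap the chunk at 10% of writable space, but keep at least one minimal chunk. */
uint64_t zoned_max_chunk_size(uint64_t total_rw_bytes, uint64_t zone_size,
			      uint64_t type_max_chunk_size,
			      int devs_min, int dev_stripes,
			      int nparity, int ncopies)
{
	int min_num_stripes = devs_min * dev_stripes;
	int min_data_stripes = (min_num_stripes - nparity) / ncopies;
	uint64_t min_chunk_size = (uint64_t)min_data_stripes * zone_size;
	uint64_t limit = max_u64(round_down_to(total_rw_bytes / 10, zone_size),
				 min_chunk_size);

	return min_u64(limit, type_max_chunk_size);
}
```

On a 1 GiB filesystem with 256 MiB zones, 10% of the writable space rounds down to zero zones, so without the `max()` against the minimum chunk size no chunk could be allocated at all.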
2020-02-25 12:56:11 +09:00
static void init_alloc_chunk_ctl ( struct btrfs_fs_devices * fs_devices ,
struct alloc_chunk_ctl * ctl )
{
int index = btrfs_bg_flags_to_raid_index ( ctl - > type ) ;
ctl - > sub_stripes = btrfs_raid_array [ index ] . sub_stripes ;
ctl - > dev_stripes = btrfs_raid_array [ index ] . dev_stripes ;
ctl - > devs_max = btrfs_raid_array [ index ] . devs_max ;
if ( ! ctl - > devs_max )
ctl - > devs_max = BTRFS_MAX_DEVS ( fs_devices - > fs_info ) ;
ctl - > devs_min = btrfs_raid_array [ index ] . devs_min ;
ctl - > devs_increment = btrfs_raid_array [ index ] . devs_increment ;
ctl - > ncopies = btrfs_raid_array [ index ] . ncopies ;
ctl - > nparity = btrfs_raid_array [ index ] . nparity ;
ctl - > ndevs = 0 ;
switch ( fs_devices - > chunk_alloc_policy ) {
case BTRFS_CHUNK_ALLOC_REGULAR :
init_alloc_chunk_ctl_policy_regular ( fs_devices , ctl ) ;
break ;
2021-02-04 19:21:48 +09:00
case BTRFS_CHUNK_ALLOC_ZONED :
init_alloc_chunk_ctl_policy_zoned ( fs_devices , ctl ) ;
break ;
2020-02-25 12:56:11 +09:00
default :
BUG ( ) ;
}
}
2020-02-25 12:56:12 +09:00
static int gather_device_info ( struct btrfs_fs_devices * fs_devices ,
struct alloc_chunk_ctl * ctl ,
struct btrfs_device_info * devices_info )
2011-01-05 10:07:28 +00:00
{
2020-02-25 12:56:12 +09:00
struct btrfs_fs_info * info = fs_devices - > fs_info ;
2017-06-27 10:02:24 +03:00
struct btrfs_device * device ;
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 12:07:57 +02:00
u64 total_avail ;
2020-02-25 12:56:12 +09:00
u64 dev_extent_want = ctl - > max_stripe_size * ctl - > dev_stripes ;
2011-04-12 12:07:57 +02:00
int ret ;
2020-02-25 12:56:12 +09:00
int ndevs = 0 ;
u64 max_avail ;
u64 dev_offset ;
2010-03-17 20:45:56 +00:00
2010-04-06 09:37:47 -04:00
/*
2011-04-12 12:07:57 +02:00
* in the first pass through the devices list , we gather information
* about the available holes on each device .
2010-04-06 09:37:47 -04:00
*/
2017-06-27 10:02:24 +03:00
list_for_each_entry ( device , & fs_devices - > alloc_list , dev_alloc_list ) {
2017-12-04 12:54:52 +08:00
if ( ! test_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) ) {
2012-11-03 10:58:34 +00:00
WARN ( 1 , KERN_ERR
2013-12-20 11:37:06 -05:00
" BTRFS: read-only device in alloc_list \n " ) ;
2011-04-12 12:07:57 +02:00
continue ;
}
2011-01-05 10:07:28 +00:00
2017-12-04 12:54:53 +08:00
if ( ! test_bit ( BTRFS_DEV_STATE_IN_FS_METADATA ,
& device - > dev_state ) | |
2017-12-04 12:54:55 +08:00
test_bit ( BTRFS_DEV_STATE_REPLACE_TGT , & device - > dev_state ) )
2011-04-12 12:07:57 +02:00
continue ;
2011-01-05 10:07:28 +00:00
2011-04-12 12:07:57 +02:00
if ( device - > total_bytes > device - > bytes_used )
total_avail = device - > total_bytes - device - > bytes_used ;
else
total_avail = 0 ;
Btrfs: fix a bug of balance on full multi-disk partitions
When balancing, we'll first try to shrink devices for some space,
but if it is working on a full multi-disk partition with raid protection,
we may encounter a bug, that is, while shrinking, total_bytes may be less
than bytes_used, and btrfs may allocate a dev extent that accesses out of
device's bounds.
Then we will not be able to write or read the data which stores at the end
of the device, and get the followings:
device fsid 0939f071-7ea3-46c8-95df-f176d773bfb6 devid 1 transid 10 /dev/sdb5
Btrfs detected SSD devices, enabling SSD mode
btrfs: relocating block group 476315648 flags 9
btrfs: found 4 extents
attempt to access beyond end of device
sdb5: rw=145, want=546176, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546304, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546432, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546560, limit=546147
attempt to access beyond end of device
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-02 02:39:03 +00:00
/* If there is no space on this device, skip it. */
2020-02-25 12:56:15 +09:00
if ( total_avail < ctl - > dev_extent_min )
2011-08-02 02:39:03 +00:00
continue ;
2011-01-05 10:07:28 +00:00
2020-02-25 12:56:12 +09:00
ret = find_free_dev_extent ( device , dev_extent_want , & dev_offset ,
& max_avail ) ;
2011-04-12 12:07:57 +02:00
if ( ret & & ret ! = - ENOSPC )
2020-02-25 12:56:12 +09:00
return ret ;
2011-01-05 10:07:28 +00:00
2011-04-12 12:07:57 +02:00
if ( ret = = 0 )
2020-02-25 12:56:12 +09:00
max_avail = dev_extent_want ;
2011-01-05 10:07:28 +00:00
2020-02-25 12:56:15 +09:00
if ( max_avail < ctl - > dev_extent_min ) {
2018-01-22 13:50:54 +08:00
if ( btrfs_test_opt ( info , ENOSPC_DEBUG ) )
btrfs_debug ( info ,
2020-02-25 12:56:12 +09:00
" %s: devid %llu has no free space, have=%llu want=%llu " ,
2018-01-22 13:50:54 +08:00
__func__ , device - > devid , max_avail ,
2020-02-25 12:56:15 +09:00
ctl - > dev_extent_min ) ;
2011-04-12 12:07:57 +02:00
continue ;
2018-01-22 13:50:54 +08:00
}
2011-01-05 10:07:28 +00:00
2013-01-31 00:55:01 +00:00
if ( ndevs = = fs_devices - > rw_devices ) {
WARN ( 1 , " %s: found more than %llu devices \n " ,
__func__ , fs_devices - > rw_devices ) ;
break ;
}
2011-04-12 12:07:57 +02:00
devices_info [ ndevs ] . dev_offset = dev_offset ;
devices_info [ ndevs ] . max_avail = max_avail ;
devices_info [ ndevs ] . total_avail = total_avail ;
devices_info [ ndevs ] . dev = device ;
+ + ndevs ;
}
2020-02-25 12:56:12 +09:00
ctl - > ndevs = ndevs ;
2011-01-05 10:07:28 +00:00
2011-04-12 12:07:57 +02:00
/*
* now sort the devices by hole size / available space
*/
2020-02-25 12:56:12 +09:00
sort ( devices_info , ndevs , sizeof ( struct btrfs_device_info ) ,
2011-04-12 12:07:57 +02:00
btrfs_cmp_device_info , NULL ) ;
2011-01-05 10:07:28 +00:00
2020-02-25 12:56:12 +09:00
return 0 ;
}
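The final step of gather_device_info() sorts the collected entries by hole size and available space. The comparator itself is not shown in this section, so the sketch below assumes a descending order on `max_avail` with `total_avail` as the tie-breaker; `struct dev_info`, `cmp_dev_info` and `sort_dev_infos` are hypothetical user-space names, and `qsort()` stands in for the kernel's `sort()`.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, reduced version of the per-device record. */
struct dev_info {
	int devid;
	uint64_t max_avail;   /* largest usable hole, capped at the wanted size */
	uint64_t total_avail; /* total unallocated bytes on the device */
};

/* Assumed ordering: larger max_avail first, then larger total_avail. */
static int cmp_dev_info(const void *a, const void *b)
{
	const struct dev_info *da = a;
	const struct dev_info *db = b;

	if (da->max_avail != db->max_avail)
		return da->max_avail > db->max_avail ? -1 : 1;
	if (da->total_avail != db->total_avail)
		return da->total_avail > db->total_avail ? -1 : 1;
	return 0;
}

void sort_dev_infos(struct dev_info *infos, size_t n)
{
	qsort(infos, n, sizeof(*infos), cmp_dev_info);
}
```

Under that ordering, the entry at index `ndevs - 1` is the device with the smallest usable hole, which is why the stripe size decision can read its `max_avail` as the size every selected device is able to provide.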
2020-02-25 12:56:13 +09:00
static int decide_stripe_size_regular ( struct alloc_chunk_ctl * ctl ,
struct btrfs_device_info * devices_info )
{
/* Number of stripes that count for block group size */
int data_stripes ;
/*
* The primary goal is to maximize the number of stripes , so use as
* many devices as possible , even if the stripes are not maximum sized .
*
* The DUP profile stores more than one stripe per device , the
* max_avail is the total size so we have to adjust .
*/
ctl - > stripe_size = div_u64 ( devices_info [ ctl - > ndevs - 1 ] . max_avail ,
ctl - > dev_stripes ) ;
ctl - > num_stripes = ctl - > ndevs * ctl - > dev_stripes ;
/* This will have to be fixed for RAID1 and RAID10 over more drives */
data_stripes = ( ctl - > num_stripes - ctl - > nparity ) / ctl - > ncopies ;
/*
* Use the number of data stripes to figure out how big this chunk is
* really going to be in terms of logical address space , and compare
* that answer with the max chunk size . If it ' s higher , we try to
* reduce stripe_size .
*/
if ( ctl - > stripe_size * data_stripes > ctl - > max_chunk_size ) {
/*
* Reduce stripe_size , round it up to a 16 MB boundary again and
* then use it , unless it ends up being even bigger than the
* previous value we had already .
*/
ctl - > stripe_size = min ( round_up ( div_u64 ( ctl - > max_chunk_size ,
data_stripes ) , SZ_16M ) ,
ctl - > stripe_size ) ;
}
/* Align to BTRFS_STRIPE_LEN */
ctl - > stripe_size = round_down ( ctl - > stripe_size , BTRFS_STRIPE_LEN ) ;
ctl - > chunk_size = ctl - > stripe_size * data_stripes ;
return 0 ;
}
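The arithmetic in decide_stripe_size_regular() is easier to follow with concrete numbers. The sketch below is a user-space restatement under stated assumptions: `plan_regular` and `struct stripe_plan` are hypothetical names, `STRIPE_LEN` is assumed to be 64 KiB, and the power-of-two rounding helpers are local substitutes for `round_up()` and `round_down()`.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_64K (64ULL << 10)
#define SZ_16M (16ULL << 20)
#define SZ_1G  (1ULL << 30)
/* Assumption for this sketch: the stripe length is 64 KiB. */
#define STRIPE_LEN SZ_64K

struct stripe_plan {
	int num_stripes;
	int data_stripes;
	uint64_t stripe_size;
	uint64_t chunk_size;
};

/* Both helpers require align to be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

static uint64_t round_down_pow2(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

struct stripe_plan plan_regular(uint64_t smallest_max_avail, int ndevs,
				int dev_stripes, int nparity, int ncopies,
				uint64_t max_chunk_size)
{
	struct stripe_plan p;

	assert(dev_stripes > 0 && ncopies > 0);
	/* max_avail covers all stripes on one device, so split it. */
	p.stripe_size = smallest_max_avail / (uint64_t)dev_stripes;
	p.num_stripes = ndevs * dev_stripes;
	p.data_stripes = (p.num_stripes - nparity) / ncopies;
	assert(p.data_stripes > 0);

	if (p.stripe_size * (uint64_t)p.data_stripes > max_chunk_size) {
		uint64_t reduced = round_up_pow2(
			max_chunk_size / (uint64_t)p.data_stripes, SZ_16M);

		/* Never grow the stripe beyond what was already chosen. */
		if (reduced < p.stripe_size)
			p.stripe_size = reduced;
	}

	p.stripe_size = round_down_pow2(p.stripe_size, STRIPE_LEN);
	p.chunk_size = p.stripe_size * (uint64_t)p.data_stripes;
	return p;
}
```

For four devices with one parity stripe and a 2 GiB chunk limit, `plan_regular(SZ_1G, 4, 1, 1, 1, 2ULL * SZ_1G)` yields three data stripes of 688 MiB each. Because the reduced stripe size is rounded up to a 16 MiB boundary, the resulting chunk (2064 MiB) ends up slightly above the 2 GiB limit.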
2021-02-04 19:21:48 +09:00
static int decide_stripe_size_zoned ( struct alloc_chunk_ctl * ctl ,
struct btrfs_device_info * devices_info )
{
u64 zone_size = devices_info [ 0 ] . dev - > zone_info - > zone_size ;
/* Number of stripes that count for block group size */
int data_stripes ;
/*
* It should hold because :
* dev_extent_min = = dev_extent_want = = zone_size * dev_stripes
*/
ASSERT ( devices_info [ ctl - > ndevs - 1 ] . max_avail = = ctl - > dev_extent_min ) ;
ctl - > stripe_size = zone_size ;
ctl - > num_stripes = ctl - > ndevs * ctl - > dev_stripes ;
data_stripes = ( ctl - > num_stripes - ctl - > nparity ) / ctl - > ncopies ;
/* stripe_size is fixed in a zoned filesystem. Reduce ndevs instead. */
if ( ctl - > stripe_size * data_stripes > ctl - > max_chunk_size ) {
ctl - > ndevs = div_u64 ( div_u64 ( ctl - > max_chunk_size * ctl - > ncopies ,
ctl - > stripe_size ) + ctl - > nparity ,
ctl - > dev_stripes ) ;
ctl - > num_stripes = ctl - > ndevs * ctl - > dev_stripes ;
data_stripes = ( ctl - > num_stripes - ctl - > nparity ) / ctl - > ncopies ;
ASSERT ( ctl - > stripe_size * data_stripes < = ctl - > max_chunk_size ) ;
}
ctl - > chunk_size = ctl - > stripe_size * data_stripes ;
return 0 ;
}
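Since the stripe size is pinned to the zone size, decide_stripe_size_zoned() shrinks the number of devices when the chunk would be too large. A minimal user-space sketch of that reduction follows; `plan_zoned` and `struct zoned_plan` are hypothetical names, and plain integer division stands in for `div_u64()`.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_256M (256ULL << 20)
#define SZ_1G   (1ULL << 30)

struct zoned_plan {
	int ndevs;
	int num_stripes;
	uint64_t chunk_size;
};

struct zoned_plan plan_zoned(uint64_t zone_size, int ndevs, int dev_stripes,
			     int nparity, int ncopies, uint64_t max_chunk_size)
{
	struct zoned_plan p;
	int data_stripes;

	assert(zone_size > 0 && dev_stripes > 0 && ncopies > 0);
	p.ndevs = ndevs;
	p.num_stripes = ndevs * dev_stripes;
	data_stripes = (p.num_stripes - nparity) / ncopies;

	/* The stripe size cannot shrink, so use fewer devices instead. */
	if (zone_size * (uint64_t)data_stripes > max_chunk_size) {
		uint64_t stripes = max_chunk_size * (uint64_t)ncopies / zone_size
				   + (uint64_t)nparity;

		p.ndevs = (int)(stripes / (uint64_t)dev_stripes);
		p.num_stripes = p.ndevs * dev_stripes;
		data_stripes = (p.num_stripes - nparity) / ncopies;
		assert(zone_size * (uint64_t)data_stripes <= max_chunk_size);
	}

	p.chunk_size = zone_size * (uint64_t)data_stripes;
	return p;
}
```

With eight devices, 256 MiB zones, a single copy and a 1 GiB chunk limit, the eight-stripe layout would need 2 GiB, so the plan falls back to four devices and a 1 GiB chunk.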
2020-02-25 12:56:13 +09:00
static int decide_stripe_size ( struct btrfs_fs_devices * fs_devices ,
struct alloc_chunk_ctl * ctl ,
struct btrfs_device_info * devices_info )
{
struct btrfs_fs_info * info = fs_devices - > fs_info ;
/*
* Round down to number of usable stripes , devs_increment can be any
* number so we can ' t use round_down ( ) that requires power of 2 , while
* rounddown is safe .
*/
ctl - > ndevs = rounddown ( ctl - > ndevs , ctl - > devs_increment ) ;
if ( ctl - > ndevs < ctl - > devs_min ) {
if ( btrfs_test_opt ( info , ENOSPC_DEBUG ) ) {
btrfs_debug ( info ,
" %s: not enough devices with free space: have=%d minimum required=%d " ,
__func__ , ctl - > ndevs , ctl - > devs_min ) ;
}
return - ENOSPC ;
}
ctl - > ndevs = min ( ctl - > ndevs , ctl - > devs_max ) ;
switch ( fs_devices - > chunk_alloc_policy ) {
case BTRFS_CHUNK_ALLOC_REGULAR :
return decide_stripe_size_regular ( ctl , devices_info ) ;
2021-02-04 19:21:48 +09:00
case BTRFS_CHUNK_ALLOC_ZONED :
return decide_stripe_size_zoned ( ctl , devices_info ) ;
2020-02-25 12:56:13 +09:00
default :
BUG ( ) ;
}
}
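The comment in decide_stripe_size() about rounddown versus round_down() is worth a small demonstration: a mask-based round-down is only correct when the step is a power of two. The sketch below uses hypothetical user-space helpers (`rounddown_any`, `round_down_pow2`, `usable_ndevs`) and assumes `devs_max` has already been resolved to a non-zero value.

```c
#include <errno.h>

/* Correct for any positive step. */
int rounddown_any(int x, int step)
{
	return x - (x % step);
}

/* Mask-based variant: only correct when step is a power of two. */
int round_down_pow2(int x, int step)
{
	return x & ~(step - 1);
}

/*
 * Trim the device count to a usable multiple, reject it if it falls
 * below the profile minimum, and clamp it to the profile maximum.
 */
int usable_ndevs(int ndevs, int devs_min, int devs_max, int devs_increment)
{
	ndevs = rounddown_any(ndevs, devs_increment);
	if (ndevs < devs_min)
		return -ENOSPC;
	return ndevs < devs_max ? ndevs : devs_max;
}
```

With a step of 3, `rounddown_any(7, 3)` gives 6, while the mask-based variant gives 5, which is not a multiple of 3. That is the failure the comment warns about when `devs_increment` is not a power of two.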
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
static struct btrfs_block_group * create_chunk ( struct btrfs_trans_handle * trans ,
2020-02-25 12:56:14 +09:00
struct alloc_chunk_ctl * ctl ,
struct btrfs_device_info * devices_info )
2020-02-25 12:56:12 +09:00
{
struct btrfs_fs_info * info = trans - > fs_info ;
struct map_lookup * map = NULL ;
struct extent_map_tree * em_tree ;
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by and large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
struct btrfs_block_group *block_group;
2020-02-25 12:56:12 +09:00
struct extent_map *em;
2020-02-25 12:56:14 +09:00
u64 start = ctl->start;
u64 type = ctl->type;
2020-02-25 12:56:12 +09:00
int ret;
int i;
int j;
2020-02-25 12:56:14 +09:00
map = kmalloc(map_lookup_size(ctl->num_stripes), GFP_NOFS);
if (!map)
2021-06-29 14:43:06 +01:00
return ERR_PTR(-ENOMEM);
2020-02-25 12:56:14 +09:00
map->num_stripes = ctl->num_stripes;
2020-02-25 12:56:12 +09:00
2020-02-25 12:56:14 +09:00
for (i = 0; i < ctl->ndevs; ++i) {
for (j = 0; j < ctl->dev_stripes; ++j) {
int s = i * ctl->dev_stripes + j;
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if that failed, just
allocated a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 12:07:57 +02:00
map->stripes[s].dev = devices_info[i].dev;
map->stripes[s].physical = devices_info[i].dev_offset +
2020-02-25 12:56:14 +09:00
j * ctl->stripe_size;
2008-03-24 15:01:59 -04:00
}
}
2017-07-14 09:55:41 +03:00
map->stripe_len = BTRFS_STRIPE_LEN;
map->io_align = BTRFS_STRIPE_LEN;
map->io_width = BTRFS_STRIPE_LEN;
2008-11-17 21:11:30 -05:00
map->type = type;
2020-02-25 12:56:14 +09:00
map->sub_stripes = ctl->sub_stripes;
2008-03-24 15:01:56 -04:00
2020-02-25 12:56:14 +09:00
trace_btrfs_chunk_alloc(info, map, start, ctl->chunk_size);
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and can be greatly
helpful for debugging, e.g.:
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordered_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are critical resources and produce a lot of corner cases during writeback,
so it is valuable to know how a page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when an inode is created and when an inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In a transaction-based filesystem, it is useful to know the generation and
who does the commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references; these tracepoints are helpful for
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
A chunk is a link between physical offset and logical offset, and stands for space
information in btrfs; these tracepoints are helpful for tracing space usage.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 11:18:59 +00:00
2011-04-21 00:48:27 +02:00
em = alloc_extent_map();
2008-11-17 21:11:30 -05:00
if (!em) {
2014-06-19 10:42:52 +08:00
kfree(map);
2021-06-29 14:43:06 +01:00
return ERR_PTR(-ENOMEM);
2008-03-25 16:50:33 -04:00
}
2014-06-19 10:42:52 +08:00
set_bit(EXTENT_FLAG_FS_MAPPING, &em->flags);
2015-06-03 10:55:48 -04:00
em->map_lookup = map;
2008-11-17 21:11:30 -05:00
em->start = start;
2020-02-25 12:56:14 +09:00
em->len = ctl->chunk_size;
2008-11-17 21:11:30 -05:00
em->block_start = 0;
em->block_len = em->len;
2020-02-25 12:56:14 +09:00
em->orig_block_len = ctl->stripe_size;
2008-03-25 16:50:33 -04:00
2019-05-17 11:43:17 +02:00
em_tree = &info->mapping_tree;
2009-09-02 16:24:52 -04:00
write_lock(&em_tree->lock);
2013-04-05 16:51:15 -04:00
ret = add_extent_mapping(em_tree, em, 0);
2013-01-31 10:23:04 -05:00
if (ret) {
2017-08-21 12:43:49 +03:00
write_unlock(&em_tree->lock);
2013-01-31 10:23:04 -05:00
free_extent_map(em);
2021-06-29 14:43:06 +01:00
return ERR_PTR(ret);
2013-01-31 10:23:04 -05:00
}
2017-08-21 12:43:49 +03:00
write_unlock(&em_tree->lock);
2021-06-29 14:43:06 +01:00
block_group = btrfs_make_block_group(trans, 0, type, start, ctl->chunk_size);
if (IS_ERR(block_group))
2013-06-27 13:22:46 -04:00
goto error_del_extent;
2008-11-17 21:11:30 -05:00
2019-03-25 14:31:22 +02:00
for (i = 0; i < map->num_stripes; i++) {
struct btrfs_device *dev = map->stripes[i].dev;
2020-02-25 12:56:10 +09:00
btrfs_device_set_bytes_used(dev,
2020-02-25 12:56:14 +09:00
dev->bytes_used + ctl->stripe_size);
2019-03-25 14:31:22 +02:00
if (list_empty(&dev->post_commit_list))
list_add_tail(&dev->post_commit_list,
&trans->transaction->dev_update_list);
}
2014-09-03 21:35:36 +08:00
2020-02-25 12:56:14 +09:00
atomic64_sub(ctl->stripe_size * map->num_stripes,
2020-02-25 12:56:10 +09:00
&info->free_chunk_space);
2014-09-03 21:35:37 +08:00
2013-01-31 10:23:04 -05:00
free_extent_map(em);
2016-06-22 18:54:23 -04:00
check_raid56_incompat_flag(info, type);
2018-07-10 18:15:05 +02:00
check_raid1c34_incompat_flag(info, type);
2013-01-29 18:40:14 -05:00
2021-06-29 14:43:06 +01:00
return block_group;
2011-01-05 10:07:28 +00:00
2013-06-27 13:22:46 -04:00
error_del_extent:
2013-01-31 10:23:04 -05:00
write_lock(&em_tree->lock);
remove_extent_mapping(em_tree, em);
write_unlock(&em_tree->lock);
/* One for our allocation */
free_extent_map(em);
/* One for the tree reference */
free_extent_map(em);
2020-02-25 12:56:14 +09:00
2021-06-29 14:43:06 +01:00
return block_group;
2020-02-25 12:56:14 +09:00
}
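The stripe placement done by the nested loop in create_chunk() above can be sketched in isolation. This is a minimal userspace illustration, not kernel code: the sketch_* types and sketch_fill_stripes() are hypothetical stand-ins for struct btrfs_device_info, the stripes[] array of struct map_lookup and the loop body, and plain integers replace the device pointers. It only shows the index arithmetic: stripe s = i * dev_stripes + j is placed on device i at dev_offset + j * stripe_size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures; not the real btrfs types. */
struct sketch_device_info {
	int dev_id;          /* identifies the device holding the stripes */
	uint64_t dev_offset; /* start of the free extent chosen on that device */
};

struct sketch_stripe {
	int dev_id;
	uint64_t physical;
};

/*
 * Fill stripes[] with the same index arithmetic as the loop in
 * create_chunk(): stripe s = i * dev_stripes + j lives on device i at
 * dev_offset + j * stripe_size. Returns the number of stripes written.
 * The caller must pass non-negative counts and a large enough array.
 */
static size_t sketch_fill_stripes(struct sketch_stripe *stripes, size_t cap,
				  const struct sketch_device_info *devs,
				  int ndevs, int dev_stripes,
				  uint64_t stripe_size)
{
	size_t n;

	assert(ndevs >= 0 && dev_stripes >= 0);
	n = (size_t)ndevs * (size_t)dev_stripes;
	assert(n <= cap);

	for (int i = 0; i < ndevs; ++i) {
		for (int j = 0; j < dev_stripes; ++j) {
			size_t s = (size_t)i * (size_t)dev_stripes + (size_t)j;

			stripes[s].dev_id = devs[i].dev_id;
			stripes[s].physical = devs[i].dev_offset +
					      (uint64_t)j * stripe_size;
		}
	}
	return n;
}
```

With two devices and dev_stripes = 2, the sketch yields four stripes: the first two sit back to back on the first device, the last two on the second device, each pair separated by stripe_size bytes.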
2021-08-18 13:41:19 +03:00
struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
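The reworked first phase described in the commit message above can be pictured with a small userspace model. This is only a sketch under stated assumptions, not the btrfs implementation: `struct toy_fs`, `toy_fs_init()`, `toy_alloc_chunk_phase1()` and the two `TOY_*` constants are hypothetical names invented for illustration, and system space is counted in abstract item slots. The point it tries to show is that the space check, the optional system chunk allocation, the reservation and the chunk btree updates all happen inside one critical section, so no reservation is left pending when the mutex is released.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names exist in btrfs. */
#define TOY_SYS_CHUNK_SLOTS 8	/* assumed: item slots gained per system chunk */
#define TOY_ITEMS_PER_ALLOC 2	/* assumed: device item update + chunk item */

struct toy_fs {
	pthread_mutex_t chunk_mutex;	/* stands in for fs_info->chunk_mutex */
	size_t sys_free;		/* free system space, in item slots */
	size_t sys_reserved;		/* reserved but not yet consumed slots */
	size_t sys_chunks;		/* system chunks allocated so far */
	size_t chunk_items;		/* items written to the toy chunk btree */
};

void toy_fs_init(struct toy_fs *fs)
{
	pthread_mutex_init(&fs->chunk_mutex, NULL);
	fs->sys_free = 0;
	fs->sys_reserved = 0;
	fs->sys_chunks = 0;
	fs->chunk_items = 0;
}

/* Phase 1 of a (non-system) chunk allocation in the reworked scheme. */
void toy_alloc_chunk_phase1(struct toy_fs *fs)
{
	pthread_mutex_lock(&fs->chunk_mutex);

	/* Check for available system space, allocate a system chunk if needed. */
	if (fs->sys_free < TOY_ITEMS_PER_ALLOC) {
		fs->sys_chunks++;
		fs->sys_free += TOY_SYS_CHUNK_SLOTS;
	}

	/* Reserve system space for the chunk btree updates. */
	fs->sys_free -= TOY_ITEMS_PER_ALLOC;
	fs->sys_reserved += TOY_ITEMS_PER_ALLOC;

	/*
	 * Update the device items and insert the chunk item right away,
	 * consuming the reservation in the same critical section.
	 */
	fs->chunk_items += TOY_ITEMS_PER_ALLOC;
	fs->sys_reserved -= TOY_ITEMS_PER_ALLOC;

	/* No reservation outlives the critical section. */
	assert(fs->sys_reserved == 0);

	pthread_mutex_unlock(&fs->chunk_mutex);
}
```

In this model a task that enters phase 1 never sees space held back by another task's unfinished second phase, so system chunks are added only when the free slots are really used up. Tasks calling `toy_alloc_chunk_phase1()` concurrently are serialized by the mutex.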
2021-06-29 14:43:06 +01:00
					    u64 type)
2020-02-25 12:56:14 +09:00
{
	struct btrfs_fs_info *info = trans->fs_info;
	struct btrfs_fs_devices *fs_devices = info->fs_devices;
	struct btrfs_device_info *devices_info = NULL;
	struct alloc_chunk_ctl ctl;
2021-06-29 14:43:06 +01:00
	struct btrfs_block_group *block_group;
2020-02-25 12:56:14 +09:00
	int ret;
2020-03-02 12:29:25 +02:00
	lockdep_assert_held(&info->chunk_mutex);
2020-02-25 12:56:14 +09:00
	if (!alloc_profile_is_valid(type, 0)) {
		ASSERT(0);
2021-06-29 14:43:06 +01:00
		return ERR_PTR(-EINVAL);
2020-02-25 12:56:14 +09:00
	}
	if (list_empty(&fs_devices->alloc_list)) {
		if (btrfs_test_opt(info, ENOSPC_DEBUG))
			btrfs_debug(info, "%s: no writable device", __func__);
2021-06-29 14:43:06 +01:00
		return ERR_PTR(-ENOSPC);
2020-02-25 12:56:14 +09:00
	}
	if (!(type & BTRFS_BLOCK_GROUP_TYPE_MASK)) {
		btrfs_err(info, "invalid chunk type 0x%llx requested", type);
		ASSERT(0);
2021-06-29 14:43:06 +01:00
		return ERR_PTR(-EINVAL);
2020-02-25 12:56:14 +09:00
	}
2020-03-02 12:29:25 +02:00
	ctl.start = find_next_chunk(info);
2020-02-25 12:56:14 +09:00
	ctl.type = type;
	init_alloc_chunk_ctl(fs_devices, &ctl);
	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
			       GFP_NOFS);
	if (!devices_info)
2021-06-29 14:43:06 +01:00
		return ERR_PTR(-ENOMEM);
2020-02-25 12:56:14 +09:00
	ret = gather_device_info(fs_devices, &ctl, devices_info);
2021-06-29 14:43:06 +01:00
	if (ret < 0) {
		block_group = ERR_PTR(ret);
2020-02-25 12:56:14 +09:00
		goto out;
2021-06-29 14:43:06 +01:00
	}
2020-02-25 12:56:14 +09:00
	ret = decide_stripe_size(fs_devices, &ctl, devices_info);
2021-06-29 14:43:06 +01:00
	if (ret < 0) {
		block_group = ERR_PTR(ret);
2020-02-25 12:56:14 +09:00
		goto out;
2021-06-29 14:43:06 +01:00
	}
2020-02-25 12:56:14 +09:00
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array

Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically, too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.

However, that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.

This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.

The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.

For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.

Changes to the chunk btree are by and large done by chunk allocation and
chunk removal, which first reserve chunk system space and then later do
changes to the chunk btree. The other remaining cases are uncommon and
correspond to adding a device, removing a device and resizing a device.
None of these other cases pre-reserve system space. They modify the chunk
btree right away, so they don't hold reserved space for a long period like
chunk allocation and chunk removal do.

The diff of this change is huge, but more than half of it is just the
addition of comments describing how things work regarding chunk allocation
and removal, including both the new behavior and the parts of the old
behavior that did not change.

CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
	block_group = create_chunk(trans, &ctl, devices_info);
2020-02-25 12:56:14 +09:00
out:
2011-01-05 10:07:28 +00:00
	kfree(devices_info);
2021-06-29 14:43:06 +01:00
	return block_group;
2008-11-17 21:11:30 -05:00
}
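The rework described in the commit message above (check for space, reserve it, and update the chunk btree inside one critical section) can be sketched in userspace roughly as follows. This is only a toy model under stated assumptions: `struct demo_fs`, `demo_fs_init()` and `demo_alloc_chunk()` are hypothetical names that do not exist in the kernel, a pthread mutex stands in for fs_info->chunk_mutex, and a counter stands in for the chunk btree.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not kernel code: every name here is illustrative. */
struct demo_fs {
	pthread_mutex_t chunk_mutex;	/* stands in for fs_info->chunk_mutex */
	size_t sys_space_free;		/* free bytes of "system" space */
	size_t sys_space_reserved;	/* bytes reserved but not yet consumed */
	size_t chunk_items;		/* items inserted in the "chunk btree" */
};

void demo_fs_init(struct demo_fs *fs, size_t sys_space)
{
	pthread_mutex_init(&fs->chunk_mutex, NULL);
	fs->sys_space_free = sys_space;
	fs->sys_space_reserved = 0;
	fs->chunk_items = 0;
}

/*
 * Check, reserve, insert and release the reservation in a single critical
 * section, so no reservation outlives the lock hold.
 */
bool demo_alloc_chunk(struct demo_fs *fs, size_t item_bytes)
{
	bool ok = false;

	pthread_mutex_lock(&fs->chunk_mutex);
	if (fs->sys_space_free >= item_bytes) {
		/* Reserve system space for the new chunk item. */
		fs->sys_space_free -= item_bytes;
		fs->sys_space_reserved += item_bytes;

		/* "Insert" the chunk item while still holding the mutex. */
		fs->chunk_items++;

		/* The reservation is consumed before the mutex is dropped. */
		fs->sys_space_reserved -= item_bytes;
		ok = true;
	}
	pthread_mutex_unlock(&fs->chunk_mutex);
	return ok;
}
```

In this model a caller can never observe a non-zero `sys_space_reserved` once `demo_alloc_chunk()` returns, which mirrors the goal of not holding reserved system space until a later phase completes. The real code also allocates a new system chunk when space is short, which the sketch leaves out.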
2021-06-29 14:43:06 +01:00
/*
 * This function, btrfs_chunk_alloc_add_chunk_item(), typically belongs to
 * phase 1 of chunk allocation. It belongs to phase 2 only when allocating
 * system chunks.
 *
 * See the comment at btrfs_chunk_alloc() for details about the chunk allocation
 * phases.
 */
int btrfs_chunk_alloc_add_chunk_item(struct btrfs_trans_handle *trans,
				     struct btrfs_block_group *bg)
{
	struct btrfs_fs_info *fs_info = trans->fs_info;
	struct btrfs_root *extent_root = fs_info->extent_root;
	struct btrfs_root *chunk_root = fs_info->chunk_root;
	struct btrfs_key key;
	struct btrfs_chunk *chunk;
	struct btrfs_stripe *stripe;
	struct extent_map *em;
	struct map_lookup *map;
	size_t item_size;
	int i;
	int ret;
	/*
	 * We take the chunk_mutex for 2 reasons:
	 *
	 * 1) Updates and insertions in the chunk btree must be done while holding
	 *    the chunk_mutex, as well as updating the system chunk array in the
	 *    superblock. See the comment on top of btrfs_chunk_alloc() for the
	 *    details;
	 *
	 * 2) To prevent races with the final phase of a device replace operation
	 *    that replaces the device object associated with the map's stripes,
	 *    because the device object's id can change at any time during that
	 *    final phase of the device replace operation
	 *    (dev-replace.c:btrfs_dev_replace_finishing()), so we could grab the
	 *    replaced device and then see it with an ID of BTRFS_DEV_REPLACE_DEVID,
	 *    which would cause a failure when updating the device item, which does
	 *    not exist, or persisting a stripe of the chunk item with such ID.
	 *    Here we can't use the device_list_mutex because our caller already
	 *    has locked the chunk_mutex, and the final phase of device replace
	 *    acquires both mutexes - first the device_list_mutex and then the
	 *    chunk_mutex. Using either of those two mutexes protects us from a
	 *    concurrent device replace.
	 */
	lockdep_assert_held(&fs_info->chunk_mutex);
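The lock ordering argument in the comment above (device replace takes the device_list_mutex first and the chunk_mutex second, so a task already holding the chunk_mutex must not go back for the device_list_mutex) can be illustrated with a small single-threaded order checker. This is a sketch under assumptions, not how lockdep works internally: the `demo_*` names and the numeric ranks are hypothetical and only encode the order stated in the comment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ranks: a lower-ranked lock must be taken before a higher one. */
enum demo_lock {
	DEMO_DEVICE_LIST_MUTEX = 0,
	DEMO_CHUNK_MUTEX = 1,
	DEMO_NR_LOCKS
};

/* Tracks which of the modeled locks a single task currently holds. */
struct demo_task {
	bool held[DEMO_NR_LOCKS];
};

/*
 * Returns false when taking 'lock' would violate the modeled order, that is,
 * when the task already holds that lock or any higher-ranked one.
 */
bool demo_lock_acquire(struct demo_task *t, enum demo_lock lock)
{
	for (int i = lock; i < DEMO_NR_LOCKS; i++) {
		if (t->held[i])
			return false;
	}
	t->held[lock] = true;
	return true;
}

void demo_lock_release(struct demo_task *t, enum demo_lock lock)
{
	t->held[lock] = false;
}
```

A task modeling device replace can take both locks in rank order, while a task modeling this function's caller, which already holds the chunk mutex, is refused the device list mutex. That refusal corresponds to the inversion the comment avoids by relying on the chunk_mutex alone.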
	em = btrfs_get_chunk_map(fs_info, bg->start, bg->length);
	if (IS_ERR(em)) {
		ret = PTR_ERR(em);
		btrfs_abort_transaction(trans, ret);
		return ret;
	}
	map = em->map_lookup;
	item_size = btrfs_chunk_item_size(map->num_stripes);
	chunk = kzalloc(item_size, GFP_NOFS);
	if (!chunk) {
		ret = -ENOMEM;
		btrfs_abort_transaction(trans, ret);
Btrfs: fix race when finishing dev replace leading to transaction abort
During the final phase of a device replace operation, I ran into a
transaction abort that resulted in the following trace:
[23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]()
[23919.664742] BTRFS: Transaction aborted (error -2)
[23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs]
[23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1
[23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[23919.689151] 0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98
[23919.692604] ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48
[23919.694230] ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0
[23919.696716] Call Trace:
[23919.698669] [<ffffffff812566f4>] dump_stack+0x4e/0x79
[23919.700597] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[23919.701958] [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.703612] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[23919.705047] [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.706967] [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs]
[23919.708611] [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs]
[23919.710099] [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs]
[23919.711970] [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs]
[23919.713602] [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194
[23919.714756] [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78
[23919.716155] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.718918] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.724170] [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff
[23919.725482] [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b
[23919.726790] [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6
[23919.728428] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[23919.729642] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[23919.730782] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[23919.731847] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[23919.733330] ---[ end trace 166ef301a335832a ]---
This is due to a race between device replace and chunk allocation, which
the following diagram illustrates:
         CPU 1                                  CPU 2

 btrfs_dev_replace_finishing()

   at this point
    dev_replace->tgtdev->devid ==
    BTRFS_DEV_REPLACE_DEVID (0ULL)

   ...

   btrfs_start_transaction()
   btrfs_commit_transaction()

                                     btrfs_fallocate()
                                       btrfs_alloc_data_chunk_ondemand()
                                         btrfs_join_transaction()
                                           --> starts a new transaction
                                         do_chunk_alloc()
                                           lock fs_info->chunk_mutex
                                             btrfs_alloc_chunk()
                                               --> creates extent map for
                                                   the new chunk with
                                                   em->bdev->map->stripes[i]->dev->devid
                                                   == X (X > 0)
                                               --> extent map is added to
                                                   fs_info->mapping_tree
                                               --> initial phase of bg A
                                                   allocation completes
                                           unlock fs_info->chunk_mutex

   lock fs_info->chunk_mutex

   btrfs_dev_replace_update_device_in_mapping_tree()
     --> iterates fs_info->mapping_tree and
         replaces the device in every extent
         map's map->stripes[] with
         dev_replace->tgtdev, which still has
         an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)

                                     btrfs_end_transaction()
                                       btrfs_create_pending_block_groups()
                                         --> starts final phase of
                                             bg A creation (update device,
                                             extent, and chunk trees, etc)
                                         btrfs_finish_chunk_alloc()
                                           btrfs_update_device()
                                             --> attempts to update a device
                                                 item with ID == 0ULL
                                                 (BTRFS_DEV_REPLACE_DEVID)
                                                 which is the current ID of
                                                 bg A's
                                                 em->bdev->map->stripes[i]->dev->devid
                                             --> doesn't find such item
                                                 returns -ENOENT
                                             --> the device id should have been X
                                                 and not 0ULL

                                       got -ENOENT from
                                       btrfs_finish_chunk_alloc()
                                       and aborts current transaction

   finishes setting up the target device,
   namely it sets tgtdev->devid to the value
   of srcdev->devid, which is X (and X > 0)

   frees the srcdev

   unlock fs_info->chunk_mutex
So fix this by taking the device list mutex when processing the chunk's
extent map stripes to update the device items. This avoids getting the
wrong device id and use-after-free problems if the task finishing a
chunk allocation grabs the replaced device, which is freed while the
dev replace task is holding the device list mutex.
This happened while running fstest btrfs/071.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-11-20 10:42:47 +00:00
		goto out;
2008-11-17 21:11:30 -05:00
	}
2021-06-29 14:43:06 +01:00
	for (i = 0; i < map->num_stripes; i++) {
		struct btrfs_device *device = map->stripes[i].dev;
		ret = btrfs_update_device(trans, device);
		if (ret)
			goto out;
	}
2008-11-17 21:11:30 -05:00
	stripe = &chunk->stripe;
2013-06-27 13:22:46 -04:00
	for (i = 0; i < map->num_stripes; i++) {
2021-06-29 14:43:06 +01:00
		struct btrfs_device *device = map->stripes[i].dev;
		const u64 dev_offset = map->stripes[i].physical;
2008-03-24 15:01:56 -04:00
2008-04-15 15:41:47 -04:00
		btrfs_set_stack_stripe_devid(stripe, device->devid);
		btrfs_set_stack_stripe_offset(stripe, dev_offset);
		memcpy(stripe->dev_uuid, device->uuid, BTRFS_UUID_SIZE);
2008-11-17 21:11:30 -05:00
		stripe++;
2008-03-24 15:01:56 -04:00
	}
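The loop above fills a variable number of stripes that sit directly after the fixed part of the chunk item, which is why the buffer was sized from the stripe count beforehand. A minimal userspace sketch of that layout technique follows. The `demo_*` structs are simplified, hypothetical stand-ins (the real item has more fields and little-endian accessors), and `demo_chunk_item_size()` only assumes the usual "one embedded stripe plus N - 1 trailing stripes" sizing rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_UUID_SIZE 16

/* Simplified, hypothetical stand-ins for the stripe and the chunk item. */
struct demo_stripe {
	uint64_t devid;
	uint64_t offset;
	uint8_t dev_uuid[DEMO_UUID_SIZE];
};

struct demo_chunk {
	uint64_t length;
	uint16_t num_stripes;
	struct demo_stripe stripe;	/* first stripe, the rest follow it */
};

struct demo_device {
	uint64_t devid;
	uint64_t physical;
	uint8_t uuid[DEMO_UUID_SIZE];
};

/* One stripe is embedded in the struct, so only N - 1 extra ones are added. */
size_t demo_chunk_item_size(int num_stripes)
{
	assert(num_stripes > 0);
	return sizeof(struct demo_chunk) +
	       sizeof(struct demo_stripe) * (size_t)(num_stripes - 1);
}

/* Address of stripe i, computed from the start of the whole allocation. */
struct demo_stripe *demo_chunk_stripe(struct demo_chunk *chunk, int i)
{
	char *base = (char *)chunk + offsetof(struct demo_chunk, stripe);

	return (struct demo_stripe *)base + i;
}

/* Allocates a zeroed item and fills one stripe per device. */
struct demo_chunk *demo_build_chunk(const struct demo_device *devs,
				    int num_stripes, uint64_t length)
{
	struct demo_chunk *chunk = calloc(1, demo_chunk_item_size(num_stripes));
	struct demo_stripe *stripe;
	int i;

	if (!chunk)
		return NULL;

	stripe = demo_chunk_stripe(chunk, 0);
	for (i = 0; i < num_stripes; i++) {
		stripe->devid = devs[i].devid;
		stripe->offset = devs[i].physical;
		memcpy(stripe->dev_uuid, devs[i].uuid, DEMO_UUID_SIZE);
		stripe++;
	}
	chunk->length = length;
	chunk->num_stripes = (uint16_t)num_stripes;
	return chunk;
}
```

The caller owns the returned buffer and releases it with free(). Walking the stripes with a pointer that starts at the embedded member matches the `stripe++` pattern in the loop above.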
2021-06-29 14:43:06 +01:00
	btrfs_set_stack_chunk_length(chunk, bg->length);
2008-03-24 15:01:56 -04:00
	btrfs_set_stack_chunk_owner(chunk, extent_root->root_key.objectid);
2008-11-17 21:11:30 -05:00
	btrfs_set_stack_chunk_stripe_len(chunk, map->stripe_len);
	btrfs_set_stack_chunk_type(chunk, map->type);
	btrfs_set_stack_chunk_num_stripes(chunk, map->num_stripes);
	btrfs_set_stack_chunk_io_align(chunk, map->stripe_len);
	btrfs_set_stack_chunk_io_width(chunk, map->stripe_len);
2016-06-22 18:54:23 -04:00
	btrfs_set_stack_chunk_sector_size(chunk, fs_info->sectorsize);
2008-11-17 21:11:30 -05:00
	btrfs_set_stack_chunk_sub_stripes(chunk, map->sub_stripes);
2008-03-24 15:01:56 -04:00
2008-11-17 21:11:30 -05:00
	key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
	key.type = BTRFS_CHUNK_ITEM_KEY;
2021-06-29 14:43:06 +01:00
	key.offset = bg->start;
2008-03-24 15:01:56 -04:00
2008-11-17 21:11:30 -05:00
	ret = btrfs_insert_item(trans, chunk_root, &key, chunk, item_size);
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by and large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
if ( ret )
goto out ;
bg - > chunk_item_inserted = 1 ;
if ( map - > type & BTRFS_BLOCK_GROUP_SYSTEM ) {
2016-06-22 18:54:24 -04:00
ret = btrfs_add_system_chunk ( fs_info , & key , chunk , item_size ) ;
2021-06-29 14:43:06 +01:00
if ( ret )
goto out ;
2008-04-25 16:53:30 -04:00
}
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g.:
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordered_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are critical resources and produce a lot of corner cases during writeback,
so it is valuable to know how a page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when an inode is created and when an inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In a transaction-based filesystem, it is useful to know the generation and
who does the commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references; these tracepoints help in
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
A chunk links a physical offset to a logical offset and represents space
information in btrfs; these tracepoints help trace space usage.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 11:18:59 +00:00
2013-06-27 13:22:46 -04:00
out :
2008-03-24 15:01:56 -04:00
kfree ( chunk ) ;
2013-06-27 13:22:46 -04:00
free_extent_map ( em ) ;
2011-08-10 12:32:10 -07:00
return ret ;
2008-11-17 21:11:30 -05:00
}
2008-03-24 15:01:56 -04:00
2019-03-20 16:29:13 +01:00
static noinline int init_first_rw_device ( struct btrfs_trans_handle * trans )
2008-11-17 21:11:30 -05:00
{
2019-03-20 16:29:13 +01:00
struct btrfs_fs_info * fs_info = trans - > fs_info ;
2008-11-17 21:11:30 -05:00
u64 alloc_profile ;
2021-06-29 14:43:06 +01:00
struct btrfs_block_group * meta_bg ;
struct btrfs_block_group * sys_bg ;
/*
* When adding a new device for sprouting, the seed device is read-only
* so we must first allocate a metadata and a system chunk. But before
* adding the block group items to the extent, device and chunk btrees,
* we must first:
*
* 1) Create both chunks without doing any changes to the btrees, as
* otherwise we would get -ENOSPC since the block groups from the
* seed device are read-only;
*
* 2) Add the device item for the new sprout device - finishing the setup
* of a new block group requires updating the device item in the chunk
* btree, so it must exist when we attempt to do it. The previous step
* ensures this does not fail with -ENOSPC.
*
* After that we can add the block group items to their btrees:
* update existing device item in the chunk btree, add a new block group
* item to the extent btree, add a new chunk item to the chunk btree and
* finally add the new device extent items to the devices btree.
*/
2008-11-17 21:11:30 -05:00
2017-05-17 11:38:35 -04:00
alloc_profile = btrfs_metadata_alloc_profile ( fs_info ) ;
2021-08-18 13:41:19 +03:00
meta_bg = btrfs_create_chunk ( trans , alloc_profile ) ;
2021-06-29 14:43:06 +01:00
if ( IS_ERR ( meta_bg ) )
return PTR_ERR ( meta_bg ) ;
2008-11-17 21:11:30 -05:00
2017-05-17 11:38:35 -04:00
alloc_profile = btrfs_system_alloc_profile ( fs_info ) ;
2021-08-18 13:41:19 +03:00
sys_bg = btrfs_create_chunk ( trans , alloc_profile ) ;
2021-06-29 14:43:06 +01:00
if ( IS_ERR ( sys_bg ) )
return PTR_ERR ( sys_bg ) ;
return 0 ;
2008-11-17 21:11:30 -05:00
}
2014-07-03 18:22:13 +08:00
static inline int btrfs_chunk_max_errors ( struct map_lookup * map )
{
2019-05-17 11:43:22 +02:00
const int index = btrfs_bg_flags_to_raid_index ( map - > type ) ;
2008-11-17 21:11:30 -05:00
2019-05-17 11:43:22 +02:00
return btrfs_raid_array [ index ] . tolerated_failures ;
2008-11-17 21:11:30 -05:00
}
2021-08-24 13:27:42 +08:00
bool btrfs_chunk_writeable ( struct btrfs_fs_info * fs_info , u64 chunk_offset )
2008-11-17 21:11:30 -05:00
{
struct extent_map * em ;
struct map_lookup * map ;
2014-07-03 18:22:13 +08:00
int miss_ndevs = 0 ;
2008-11-17 21:11:30 -05:00
int i ;
2021-08-24 13:27:42 +08:00
bool ret = true ;
2008-11-17 21:11:30 -05:00
2018-05-16 16:34:31 -07:00
em = btrfs_get_chunk_map ( fs_info , chunk_offset , 1 ) ;
2017-03-14 13:33:55 -07:00
if ( IS_ERR ( em ) )
2021-08-24 13:27:42 +08:00
return false ;
2008-11-17 21:11:30 -05:00
2015-06-03 10:55:48 -04:00
map = em - > map_lookup ;
2008-11-17 21:11:30 -05:00
for ( i = 0 ; i < map - > num_stripes ; i + + ) {
2017-12-04 12:54:54 +08:00
if ( test_bit ( BTRFS_DEV_STATE_MISSING ,
& map - > stripes [ i ] . dev - > dev_state ) ) {
2014-07-03 18:22:13 +08:00
miss_ndevs + + ;
continue ;
}
2017-12-04 12:54:52 +08:00
if ( ! test_bit ( BTRFS_DEV_STATE_WRITEABLE ,
& map - > stripes [ i ] . dev - > dev_state ) ) {
2021-08-24 13:27:42 +08:00
ret = false ;
2014-07-03 18:22:13 +08:00
goto end ;
2008-11-17 21:11:30 -05:00
}
}
2014-07-03 18:22:13 +08:00
/*
2021-08-24 13:27:42 +08:00
* If the number of missing devices is larger than max errors, we cannot
* write the data into that chunk successfully.
2014-07-03 18:22:13 +08:00
*/
if ( miss_ndevs > btrfs_chunk_max_errors ( map ) )
2021-08-24 13:27:42 +08:00
ret = false ;
2014-07-03 18:22:13 +08:00
end :
2008-03-24 15:01:56 -04:00
free_extent_map ( em ) ;
2021-08-24 13:27:42 +08:00
return ret ;
2008-03-24 15:01:56 -04:00
}
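The writeability rule applied by btrfs_chunk_writeable() can be sketched in plain userspace C. This is a minimal illustration, not the kernel code: `struct sketch_dev`, `sketch_chunk_writeable()` and the `tolerated_failures` parameter are hypothetical stand-ins for the device state bits and the per-profile lookup used above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified device state; not the kernel's struct btrfs_device. */
struct sketch_dev {
	bool missing;   /* stands in for BTRFS_DEV_STATE_MISSING */
	bool writeable; /* stands in for BTRFS_DEV_STATE_WRITEABLE */
};

/*
 * A chunk is treated as writeable when every present device is writeable
 * and the number of missing devices does not exceed tolerated_failures.
 * tolerated_failures is passed in here; the function above looks it up
 * from the RAID profile of the chunk.
 */
static bool sketch_chunk_writeable(const struct sketch_dev *devs, size_t n,
				   int tolerated_failures)
{
	int missing = 0;

	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (devs[i].missing) {
			/* Missing devices are counted, not checked for writeability. */
			missing++;
			continue;
		}
		if (!devs[i].writeable)
			return false;
	}
	return missing <= tolerated_failures;
}
```

With two stripes, one of them missing, the sketch reports the chunk as writeable for a tolerance of 1 and as not writeable for a tolerance of 0.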
2019-05-17 11:43:17 +02:00
void btrfs_mapping_tree_free ( struct extent_map_tree * tree )
2008-03-24 15:01:56 -04:00
{
struct extent_map * em ;
2009-01-05 21:25:51 -05:00
while ( 1 ) {
2019-05-17 11:43:17 +02:00
write_lock ( & tree - > lock ) ;
em = lookup_extent_mapping ( tree , 0 , ( u64 ) - 1 ) ;
2008-03-24 15:01:56 -04:00
if ( em )
2019-05-17 11:43:17 +02:00
remove_extent_mapping ( tree , em ) ;
write_unlock ( & tree - > lock ) ;
2008-03-24 15:01:56 -04:00
if ( ! em )
break ;
/* once for us */
free_extent_map ( em ) ;
/* once for the tree */
free_extent_map ( em ) ;
}
}
2012-11-05 14:59:07 +01:00
int btrfs_num_copies ( struct btrfs_fs_info * fs_info , u64 logical , u64 len )
2008-04-09 16:28:12 -04:00
{
struct extent_map * em ;
struct map_lookup * map ;
int ret ;
2018-05-16 16:34:31 -07:00
em = btrfs_get_chunk_map ( fs_info , logical , len ) ;
2017-03-14 13:33:55 -07:00
if ( IS_ERR ( em ) )
/*
* We could return errors for these cases, but that could get
* ugly and we'd probably do the same thing which is just not do
* anything else and exit, so return 1 so the callers don't try
* to use other copies.
*/
2013-04-23 10:53:18 -04:00
return 1 ;
2015-06-03 10:55:48 -04:00
map = em - > map_lookup ;
2019-05-31 15:39:31 +02:00
if ( map - > type & ( BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1_MASK ) )
2008-04-09 16:28:12 -04:00
ret = map - > num_stripes ;
2008-04-16 10:49:51 -04:00
else if ( map - > type & BTRFS_BLOCK_GROUP_RAID10 )
ret = map - > sub_stripes ;
2013-01-29 18:40:14 -05:00
else if ( map - > type & BTRFS_BLOCK_GROUP_RAID5 )
ret = 2 ;
else if ( map - > type & BTRFS_BLOCK_GROUP_RAID6 )
Btrfs: make raid6 rebuild retry more
There is a scenario that can end up with rebuild process failing to
return good content, i.e.
suppose that all disks can be read without problems and if the content
that was read out doesn't match its checksum, currently for raid6
btrfs at most retries twice,
- the 1st retry is to rebuild with all other stripes, it'll eventually
be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
that it will do raid6 style rebuild,
however, the chances are that another non-parity stripe content also
has something corrupted, so that the above retries are not able to
return correct content, and users will think of this as data loss.
More seriously, if the loss happens on some important internal btree
roots, it could refuse to mount.
This extends btrfs to do more retries and each retry fails only one
stripe. Since raid6 can tolerate 2 disk failures, if there is one
more failure besides the failure on which we're recovering, this can
always work.
The worst case is to retry as many times as the number of raid6 disks,
but given the fact that such a scenario is really rare in practice,
it's still acceptable.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-02 13:36:41 -07:00
/*
* There could be two corrupted data stripes, so we need
* to retry in a loop in order to rebuild the correct data.
2018-06-20 15:48:55 +03:00
*
2018-01-02 13:36:41 -07:00
* Fail a stripe at a time on every retry except the
* stripe under reconstruction.
*/
ret = map - > num_stripes ;
2008-04-09 16:28:12 -04:00
else
ret = 1 ;
free_extent_map ( em ) ;
2012-11-06 15:06:47 +01:00
2018-09-07 16:11:23 +02:00
down_read ( & fs_info - > dev_replace . rwsem ) ;
2017-03-14 13:33:59 -07:00
if ( btrfs_dev_replace_is_ongoing ( & fs_info - > dev_replace ) & &
fs_info - > dev_replace . tgtdev )
2012-11-06 15:06:47 +01:00
ret + + ;
2018-09-07 16:11:23 +02:00
up_read ( & fs_info - > dev_replace . rwsem ) ;
2012-11-06 15:06:47 +01:00
2008-04-09 16:28:12 -04:00
return ret ;
}
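The copy counts returned by btrfs_num_copies() can be summarised in a small standalone sketch. The `enum sketch_profile` tags and the `sketch_num_copies()` helper are hypothetical; the kernel tests BTRFS_BLOCK_GROUP_* bit flags on map->type and takes the dev-replace state from fs_info under a lock.

```c
#include <assert.h>

/* Hypothetical profile tags for illustration only. */
enum sketch_profile {
	SKETCH_SINGLE,
	SKETCH_DUP,
	SKETCH_RAID1,
	SKETCH_RAID10,
	SKETCH_RAID5,
	SKETCH_RAID6,
};

/*
 * Mirrors the branch structure of the function above:
 * DUP/RAID1 -> num_stripes, RAID10 -> sub_stripes, RAID5 -> 2,
 * RAID6 -> num_stripes (one retry per stripe), anything else -> 1.
 * A present dev-replace target adds one more copy.
 */
static int sketch_num_copies(enum sketch_profile profile, int num_stripes,
			     int sub_stripes, int replace_target_present)
{
	int ret;

	assert(num_stripes > 0);
	switch (profile) {
	case SKETCH_DUP:
	case SKETCH_RAID1:
	case SKETCH_RAID6:
		ret = num_stripes;
		break;
	case SKETCH_RAID10:
		ret = sub_stripes;
		break;
	case SKETCH_RAID5:
		ret = 2;
		break;
	default:
		ret = 1;
		break;
	}
	if (replace_target_present)
		ret++;
	return ret;
}
```

Note that for RAID6 the value is a retry budget rather than a literal number of mirrored copies, as the comment in the function explains.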
2016-06-22 18:54:24 -04:00
unsigned long btrfs_full_stripe_len ( struct btrfs_fs_info * fs_info ,
2013-01-29 18:40:14 -05:00
u64 logical )
{
struct extent_map * em ;
struct map_lookup * map ;
2016-06-22 18:54:23 -04:00
unsigned long len = fs_info - > sectorsize ;
2013-01-29 18:40:14 -05:00
2018-05-16 16:34:31 -07:00
em = btrfs_get_chunk_map ( fs_info , logical , len ) ;
2013-01-29 18:40:14 -05:00
2017-07-11 16:55:51 +03:00
if ( ! WARN_ON ( IS_ERR ( em ) ) ) {
map = em - > map_lookup ;
if ( map - > type & BTRFS_BLOCK_GROUP_RAID56_MASK )
len = map - > stripe_len * nr_data_stripes ( map ) ;
free_extent_map ( em ) ;
}
2013-01-29 18:40:14 -05:00
return len ;
}
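The arithmetic in btrfs_full_stripe_len() can be worked through in a standalone sketch. It assumes that nr_data_stripes() yields the stripe count minus the parity stripes (one for RAID5, two for RAID6); that helper is not shown in this section, so treat the assumption and the names `sketch_nr_parity()` and `sketch_full_stripe_len()` as illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed parity count per profile: 1 for RAID5, 2 for RAID6. */
static int sketch_nr_parity(int is_raid6)
{
	return is_raid6 ? 2 : 1;
}

/* Full stripe length = per-device stripe length * number of data stripes. */
static uint64_t sketch_full_stripe_len(uint64_t stripe_len, int num_stripes,
				       int is_raid6)
{
	int data_stripes = num_stripes - sketch_nr_parity(is_raid6);

	assert(data_stripes > 0);
	return stripe_len * (uint64_t)data_stripes;
}
```

Under these assumptions, a 64 KiB stripe length on a six-device RAID6 chunk gives four data stripes and a 256 KiB full stripe.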
2017-07-19 10:48:42 +03:00
int btrfs_is_parity_mirror ( struct btrfs_fs_info * fs_info , u64 logical , u64 len )
2013-01-29 18:40:14 -05:00
{
struct extent_map * em ;
struct map_lookup * map ;
int ret = 0 ;
2018-05-16 16:34:31 -07:00
em = btrfs_get_chunk_map ( fs_info , logical , len ) ;
2013-01-29 18:40:14 -05:00
2017-07-11 16:55:51 +03:00
if ( ! WARN_ON ( IS_ERR ( em ) ) ) {
map = em - > map_lookup ;
if ( map - > type & BTRFS_BLOCK_GROUP_RAID56_MASK )
ret = 1 ;
free_extent_map ( em ) ;
}
2013-01-29 18:40:14 -05:00
return ret ;
}
2012-11-06 14:52:18 +01:00
static int find_live_mirror ( struct btrfs_fs_info * fs_info ,
2018-03-14 16:29:12 +08:00
struct map_lookup * map , int first ,
2018-03-14 16:29:13 +08:00
int dev_replace_is_ongoing )
2008-05-13 13:46:40 -04:00
{
int i ;
2018-03-14 16:29:12 +08:00
int num_stripes ;
2018-03-14 16:29:13 +08:00
int preferred_mirror ;
2012-11-06 14:52:18 +01:00
int tolerance ;
struct btrfs_device * srcdev ;
2018-03-14 16:29:12 +08:00
ASSERT ( ( map - > type &
2019-05-31 15:39:31 +02:00
( BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10 ) ) ) ;
2018-03-14 16:29:12 +08:00
if ( map - > type & BTRFS_BLOCK_GROUP_RAID10 )
num_stripes = map - > sub_stripes ;
else
num_stripes = map - > num_stripes ;
2020-10-28 21:14:46 +08:00
switch ( fs_info - > fs_devices - > read_policy ) {
default :
/* Shouldn't happen, just warn and use pid instead of failing */
btrfs_warn_rl ( fs_info ,
" unknown read_policy type %u, reset to pid " ,
fs_info - > fs_devices - > read_policy ) ;
fs_info - > fs_devices - > read_policy = BTRFS_READ_POLICY_PID ;
fallthrough ;
case BTRFS_READ_POLICY_PID :
preferred_mirror = first + ( current - > pid % num_stripes ) ;
break ;
}
2018-03-14 16:29:13 +08:00
2012-11-06 14:52:18 +01:00
if ( dev_replace_is_ongoing & &
fs_info - > dev_replace . cont_reading_from_srcdev_mode = =
BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID )
srcdev = fs_info - > dev_replace . srcdev ;
else
srcdev = NULL ;
/*
* try to avoid the drive that is the source drive for a
* dev-replace procedure, only choose it if no other non-missing
* mirror is available
*/
for ( tolerance = 0 ; tolerance < 2 ; tolerance + + ) {
2018-03-14 16:29:13 +08:00
if ( map - > stripes [ preferred_mirror ] . dev - > bdev & &
( tolerance | | map - > stripes [ preferred_mirror ] . dev ! = srcdev ) )
return preferred_mirror ;
2018-03-14 16:29:12 +08:00
for ( i = first ; i < first + num_stripes ; i + + ) {
2012-11-06 14:52:18 +01:00
if ( map - > stripes [ i ] . dev - > bdev & &
( tolerance | | map - > stripes [ i ] . dev ! = srcdev ) )
return i ;
}
2008-05-13 13:46:40 -04:00
}
2012-11-06 14:52:18 +01:00
2008-05-13 13:46:40 -04:00
/* we couldn't find one that doesn't fail. Just return something
* and the io error handling code will clean up eventually
*/
2018-03-14 16:29:13 +08:00
return preferred_mirror ;
2008-05-13 13:46:40 -04:00
}
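The two-pass selection in find_live_mirror() can be sketched outside the kernel as follows. `struct sketch_mirror` and `sketch_find_live_mirror()` are hypothetical; `present` stands in for a non-NULL dev->bdev and `is_srcdev` for the comparison against the dev-replace source device, and the pid is passed in instead of being read from current.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-mirror state for illustration. */
struct sketch_mirror {
	bool present;   /* device has a block device attached */
	bool is_srcdev; /* device is the dev-replace source to avoid */
};

/*
 * Pass 0 (tolerance == 0) skips the dev-replace source device.
 * Pass 1 (tolerance == 1) accepts any present device.
 * If nothing is present, the preferred index is returned anyway and
 * error handling is left to the I/O path, as in the function above.
 */
static int sketch_find_live_mirror(const struct sketch_mirror *m, int first,
				   int num_stripes, int pid)
{
	int preferred;

	assert(num_stripes > 0 && pid >= 0);
	preferred = first + (pid % num_stripes);

	for (int tolerance = 0; tolerance < 2; tolerance++) {
		if (m[preferred].present &&
		    (tolerance || !m[preferred].is_srcdev))
			return preferred;
		for (int i = first; i < first + num_stripes; i++) {
			if (m[i].present && (tolerance || !m[i].is_srcdev))
				return i;
		}
	}
	return preferred;
}
```

Because the preferred index depends only on the pid, a given process keeps reading from the same mirror as long as that mirror stays usable.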
2013-01-29 18:40:14 -05:00
/* Bubble-sort the stripe set to put the parity/syndrome stripes last */
2021-09-15 15:17:16 +08:00
static void sort_parity_stripes ( struct btrfs_io_context * bioc , int num_stripes )
2013-01-29 18:40:14 -05:00
{
int i ;
int again = 1 ;
while ( again ) {
again = 0 ;
2015-01-20 15:11:32 +08:00
for ( i = 0 ; i < num_stripes - 1 ; i + + ) {
2019-11-28 15:31:17 +01:00
/* Swap if parity is on a smaller index */
2021-09-15 15:17:16 +08:00
if ( bioc - > raid_map [ i ] > bioc - > raid_map [ i + 1 ] ) {
swap ( bioc - > stripes [ i ] , bioc - > stripes [ i + 1 ] ) ;
swap ( bioc - > raid_map [ i ] , bioc - > raid_map [ i + 1 ] ) ;
2013-01-29 18:40:14 -05:00
again = 1 ;
}
}
}
}
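The parallel-array bubble sort in sort_parity_stripes() can be tried in isolation. The sketch below assumes that parity entries in raid_map are encoded as the two largest 64-bit values so that an ascending sort pushes them to the end; the `SKETCH_P_STRIPE` and `SKETCH_Q_STRIPE` constants are illustrative, and a plain int array stands in for bioc->stripes.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sentinel encoding: parity markers compare greater than any address. */
#define SKETCH_P_STRIPE ((uint64_t)-2)
#define SKETCH_Q_STRIPE ((uint64_t)-1)

/*
 * Bubble-sort raid_map in ascending order and apply every swap to the
 * stripes array as well, so entry i of both arrays keeps describing the
 * same stripe.
 */
static void sketch_sort_parity(uint64_t *raid_map, int *stripes, int n)
{
	int again = 1;

	assert(n >= 0);
	while (again) {
		again = 0;
		for (int i = 0; i < n - 1; i++) {
			if (raid_map[i] > raid_map[i + 1]) {
				uint64_t tmp_map = raid_map[i];
				int tmp_stripe = stripes[i];

				raid_map[i] = raid_map[i + 1];
				raid_map[i + 1] = tmp_map;
				stripes[i] = stripes[i + 1];
				stripes[i + 1] = tmp_stripe;
				again = 1;
			}
		}
	}
}
```

A bubble sort is a reasonable fit here because the arrays hold only a handful of stripes and are usually close to sorted already.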
2021-09-23 14:00:08 +08:00
static struct btrfs_io_context * alloc_btrfs_io_context ( struct btrfs_fs_info * fs_info ,
int total_stripes ,
2021-09-15 15:17:16 +08:00
int real_stripes )
2015-01-20 15:11:34 +08:00
{
2021-09-15 15:17:16 +08:00
struct btrfs_io_context * bioc = kzalloc (
/* The size of btrfs_io_context */
sizeof ( struct btrfs_io_context ) +
/* Plus the variable array for the stripes */
sizeof ( struct btrfs_io_stripe ) * ( total_stripes ) +
/* Plus the variable array for the tgt dev */
2015-01-20 15:11:34 +08:00
sizeof ( int ) * ( real_stripes ) +
2015-02-19 17:51:39 -08:00
/*
2021-09-15 15:17:16 +08:00
* Plus the raid_map , which includes both the tgt dev
* and the stripes .
2015-02-19 17:51:39 -08:00
*/
sizeof ( u64 ) * ( total_stripes ) ,
2015-08-19 14:17:41 +02:00
GFP_NOFS | __GFP_NOFAIL ) ;
2015-01-20 15:11:34 +08:00
2021-09-15 15:17:16 +08:00
atomic_set ( & bioc - > error , 0 ) ;
refcount_set ( & bioc - > refs , 1 ) ;
2015-01-20 15:11:34 +08:00
2021-09-23 14:00:08 +08:00
bioc - > fs_info = fs_info ;
2021-09-15 15:17:16 +08:00
bioc - > tgtdev_map = ( int * ) ( bioc - > stripes + total_stripes ) ;
bioc - > raid_map = ( u64 * ) ( bioc - > tgtdev_map + real_stripes ) ;
2020-07-02 16:46:41 +03:00
2021-09-15 15:17:16 +08:00
return bioc ;
2015-01-20 15:11:34 +08:00
}
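/*
 * Illustration only, not part of the original file: a user-space sketch of the
 * "one allocation, several trailing arrays" layout used by
 * alloc_btrfs_io_context(). All demo_* names are hypothetical. Unlike the
 * function above, this sketch places the 64-bit array before the int array so
 * that every pointer stays naturally aligned in portable C, and it uses
 * calloc(), which can fail, instead of a no-fail kernel allocation.
 */

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct demo_stripe {
	uint64_t physical;
	uint64_t length;
};

struct demo_ctx {
	int *tgtdev_map;		/* points into the same allocation */
	uint64_t *raid_map;		/* points into the same allocation */
	struct demo_stripe stripes[];	/* flexible array member */
};

static struct demo_ctx *demo_alloc_ctx(int total_stripes, int real_stripes)
{
	struct demo_ctx *ctx;
	size_t size;

	assert(total_stripes >= 0 && real_stripes >= 0);
	/* Header, stripe array, u64 map and int map in one zeroed block. */
	size = sizeof(*ctx) +
	       sizeof(struct demo_stripe) * (size_t)total_stripes +
	       sizeof(uint64_t) * (size_t)total_stripes +
	       sizeof(int) * (size_t)real_stripes;
	ctx = calloc(1, size);
	if (!ctx)
		return NULL;

	/* Carve the trailing arrays out of the tail of the allocation. */
	ctx->raid_map = (uint64_t *)(ctx->stripes + total_stripes);
	ctx->tgtdev_map = (int *)(ctx->raid_map + total_stripes);
	return ctx;
}
```

/*
 * Usage: a single free(ctx) releases the header and all three arrays, which is
 * why the put path above only needs one kfree().
 */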
2021-09-15 15:17:16 +08:00
void btrfs_get_bioc ( struct btrfs_io_context * bioc )
2015-01-20 15:11:34 +08:00
{
2021-09-15 15:17:16 +08:00
WARN_ON ( ! refcount_read ( & bioc - > refs ) ) ;
refcount_inc ( & bioc - > refs ) ;
2015-01-20 15:11:34 +08:00
}
2021-09-15 15:17:16 +08:00
void btrfs_put_bioc ( struct btrfs_io_context * bioc )
2015-01-20 15:11:34 +08:00
{
2021-09-15 15:17:16 +08:00
if ( ! bioc )
2015-01-20 15:11:34 +08:00
return ;
2021-09-15 15:17:16 +08:00
if ( refcount_dec_and_test ( & bioc - > refs ) )
kfree ( bioc ) ;
2015-01-20 15:11:34 +08:00
}
2017-03-14 13:33:56 -07:00
/* can REQ_OP_DISCARD be sent with other REQ like REQ_OP_WRITE? */
/*
* Please note that discard won't be sent to the target device of a device
* replace.
*/
static int __btrfs_map_block_for_discard ( struct btrfs_fs_info * fs_info ,
btrfs: Ensure we trim ranges across block group boundary
[BUG]
When deleting large files (which cross block group boundary) with
discard mount option, we find some btrfs_discard_extent() calls only
trimmed part of its space, not the whole range:
btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
type: bbio->map_type, in above case, it's SINGLE DATA.
start: Logical address of this trim
len: Logical length of this trim
trimmed: Physically trimmed bytes
ratio: trimmed / len
Thus leaving some unused space not discarded.
[CAUSE]
When discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to cleanup pinned
extents in the following call chain:
btrfs_commit_transaction()
|- btrfs_finish_extent_commit()
|- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
|- btrfs_discard_extent()
However, pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.
When a large file gets deleted and it has adjacent file extents across
block group boundary, we will get a large merged range like this:
|<--- BG1 --->|<--- BG2 --->|
|//////|<-- Range to discard --->|/////|
To discard that range, we have the following calls:
btrfs_discard_extent()
|- btrfs_map_block()
| Returned bbio will end at BG1's end. As btrfs_map_block()
| never returns result across block group boundary.
|- btrfs_issue_discard()
Issue discard for each stripe.
So we will only discard the range in BG1, not the remaining part in BG2.
Furthermore, this bug is not that reliably observed, for above case, if
there is no other extent in BG2, BG2 will be empty and btrfs will trim
all space of BG2, covering up the bug.
[FIX]
- Allow __btrfs_map_block_for_discard() to modify the @length parameter
btrfs_map_block() uses its @length parameter to notify the caller how
many bytes are mapped in the current call.
With __btrfs_map_block_for_discard() also modifying @length,
btrfs_discard_extent() now understands when to do an extra trim.
- Call btrfs_map_block() in a loop until we hit the range end
Since we now know how many bytes are mapped each time, we can iterate
through each block group boundary and issue the correct trim for each range.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-23 21:57:27 +08:00
u64 logical , u64 * length_ret ,
2021-09-15 15:17:16 +08:00
struct btrfs_io_context * * bioc_ret )
2017-03-14 13:33:56 -07:00
{
struct extent_map * em ;
struct map_lookup * map ;
2021-09-15 15:17:16 +08:00
struct btrfs_io_context * bioc ;
2019-10-23 21:57:27 +08:00
u64 length = * length_ret ;
2017-03-14 13:33:56 -07:00
u64 offset ;
u64 stripe_nr ;
u64 stripe_nr_end ;
u64 stripe_end_offset ;
u64 stripe_cnt ;
u64 stripe_len ;
u64 stripe_offset ;
u64 num_stripes ;
u32 stripe_index ;
u32 factor = 0 ;
u32 sub_stripes = 0 ;
u64 stripes_per_dev = 0 ;
u32 remaining_stripes = 0 ;
u32 last_stripe = 0 ;
int ret = 0 ;
int i ;
2021-09-15 15:17:16 +08:00
/* Discard always returns a bioc. */
ASSERT ( bioc_ret ) ;
2017-03-14 13:33:56 -07:00
2018-05-16 16:34:31 -07:00
em = btrfs_get_chunk_map ( fs_info , logical , length ) ;
2017-03-14 13:33:56 -07:00
if ( IS_ERR ( em ) )
return PTR_ERR ( em ) ;
map = em - > map_lookup ;
/* we don't discard raid56 yet */
if ( map - > type & BTRFS_BLOCK_GROUP_RAID56_MASK ) {
ret = - EOPNOTSUPP ;
goto out ;
}
offset = logical - em - > start ;
2019-10-23 21:57:26 +08:00
length = min_t ( u64 , em - > start + em - > len - logical , length ) ;
2019-10-23 21:57:27 +08:00
* length_ret = length ;
2017-03-14 13:33:56 -07:00
stripe_len = map - > stripe_len ;
/*
* stripe_nr counts the total number of stripes we have to stride
* to get to this block
*/
stripe_nr = div64_u64 ( offset , stripe_len ) ;
/* stripe_offset is the offset of this block in its stripe */
stripe_offset = offset - stripe_nr * stripe_len ;
stripe_nr_end = round_up ( offset + length , map - > stripe_len ) ;
2017-04-03 13:45:24 -07:00
stripe_nr_end = div64_u64 ( stripe_nr_end , map - > stripe_len ) ;
2017-03-14 13:33:56 -07:00
stripe_cnt = stripe_nr_end - stripe_nr ;
stripe_end_offset = stripe_nr_end * map - > stripe_len -
( offset + length ) ;
/*
* after this , stripe_nr is the number of stripes on this
* device we have to walk to find the data , and stripe_index is
* the number of our device in the stripe array
*/
num_stripes = 1 ;
stripe_index = 0 ;
if ( map - > type & ( BTRFS_BLOCK_GROUP_RAID0 |
BTRFS_BLOCK_GROUP_RAID10 ) ) {
if ( map - > type & BTRFS_BLOCK_GROUP_RAID0 )
sub_stripes = 1 ;
else
sub_stripes = map - > sub_stripes ;
factor = map - > num_stripes / sub_stripes ;
num_stripes = min_t ( u64 , map - > num_stripes ,
sub_stripes * stripe_cnt ) ;
stripe_nr = div_u64_rem ( stripe_nr , factor , & stripe_index ) ;
stripe_index * = sub_stripes ;
stripes_per_dev = div_u64_rem ( stripe_cnt , factor ,
& remaining_stripes ) ;
div_u64_rem ( stripe_nr_end - 1 , factor , & last_stripe ) ;
last_stripe * = sub_stripes ;
2019-05-31 15:39:31 +02:00
} else if ( map - > type & ( BTRFS_BLOCK_GROUP_RAID1_MASK |
2017-03-14 13:33:56 -07:00
BTRFS_BLOCK_GROUP_DUP ) ) {
num_stripes = map - > num_stripes ;
} else {
stripe_nr = div_u64_rem ( stripe_nr , map - > num_stripes ,
& stripe_index ) ;
}
2021-09-23 14:00:08 +08:00
bioc = alloc_btrfs_io_context ( fs_info , num_stripes , 0 ) ;
2021-09-15 15:17:16 +08:00
if ( ! bioc ) {
2017-03-14 13:33:56 -07:00
ret = - ENOMEM ;
goto out ;
}
for ( i = 0 ; i < num_stripes ; i + + ) {
2021-09-15 15:17:16 +08:00
bioc - > stripes [ i ] . physical =
2017-03-14 13:33:56 -07:00
map - > stripes [ stripe_index ] . physical +
stripe_offset + stripe_nr * map - > stripe_len ;
2021-09-15 15:17:16 +08:00
bioc - > stripes [ i ] . dev = map - > stripes [ stripe_index ] . dev ;
2017-03-14 13:33:56 -07:00
if ( map - > type & ( BTRFS_BLOCK_GROUP_RAID0 |
BTRFS_BLOCK_GROUP_RAID10 ) ) {
2021-09-15 15:17:16 +08:00
bioc - > stripes [ i ] . length = stripes_per_dev *
2017-03-14 13:33:56 -07:00
map - > stripe_len ;
if ( i / sub_stripes < remaining_stripes )
2021-09-15 15:17:16 +08:00
bioc - > stripes [ i ] . length + = map - > stripe_len ;
2017-03-14 13:33:56 -07:00
/*
* Special for the first stripe and
* the last stripe :
*
* | - - - - - - - | . . . | - - - - - - - |
* | - - - - - - - - - - |
* off end_off
*/
if ( i < sub_stripes )
2021-09-15 15:17:16 +08:00
bioc - > stripes [ i ] . length - = stripe_offset ;
2017-03-14 13:33:56 -07:00
if ( stripe_index > = last_stripe & &
stripe_index < = ( last_stripe +
sub_stripes - 1 ) )
2021-09-15 15:17:16 +08:00
bioc - > stripes [ i ] . length - = stripe_end_offset ;
2017-03-14 13:33:56 -07:00
if ( i = = sub_stripes - 1 )
stripe_offset = 0 ;
} else {
2021-09-15 15:17:16 +08:00
bioc - > stripes [ i ] . length = length ;
2017-03-14 13:33:56 -07:00
}
stripe_index + + ;
if ( stripe_index = = map - > num_stripes ) {
stripe_index = 0 ;
stripe_nr + + ;
}
}
2021-09-15 15:17:16 +08:00
* bioc_ret = bioc ;
bioc - > map_type = map - > type ;
bioc - > num_stripes = num_stripes ;
2017-03-14 13:33:56 -07:00
out :
free_extent_map ( em ) ;
return ret ;
}
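/*
 * Illustration only, not part of the original file: a user-space sketch of the
 * stripe arithmetic at the top of __btrfs_map_block_for_discard(). The
 * demo_* names are hypothetical. The sketch uses plain division for the
 * round-up, so it does not assume that stripe_len is a power of two.
 */

```c
#include <assert.h>
#include <stdint.h>

struct demo_discard_geom {
	uint64_t stripe_nr;		/* first stripe touched by the range */
	uint64_t stripe_offset;		/* offset of the range start in that stripe */
	uint64_t stripe_nr_end;		/* one past the last stripe touched */
	uint64_t stripe_cnt;		/* number of stripes touched */
	uint64_t stripe_end_offset;	/* unused tail of the last stripe */
};

static struct demo_discard_geom demo_discard_geometry(uint64_t offset,
						      uint64_t length,
						      uint64_t stripe_len)
{
	struct demo_discard_geom g;

	assert(stripe_len > 0);
	g.stripe_nr = offset / stripe_len;
	g.stripe_offset = offset - g.stripe_nr * stripe_len;
	/* Round the end of the range up to the next stripe boundary. */
	g.stripe_nr_end = (offset + length + stripe_len - 1) / stripe_len;
	g.stripe_cnt = g.stripe_nr_end - g.stripe_nr;
	g.stripe_end_offset = g.stripe_nr_end * stripe_len - (offset + length);
	return g;
}
```

/*
 * Usage: offset is relative to the chunk start (logical - em->start in the
 * function above); the first and last stripes are then shortened by
 * stripe_offset and stripe_end_offset respectively.
 */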
2017-03-14 13:33:57 -07:00
/*
* In dev - replace case , for repair case ( that ' s the only case where the mirror
* is selected explicitly when calling btrfs_map_block ) , blocks left of the
* left cursor can also be read from the target drive .
*
* For REQ_GET_READ_MIRRORS , the target drive is added as the last one to the
* array of stripes .
* For READ , it also needs to be supported using the same mirror number .
*
* If the requested block is not left of the left cursor , EIO is returned . This
* can happen because btrfs_num_copies ( ) returns one more in the dev - replace
* case .
*/
static int get_extra_mirror_from_replace ( struct btrfs_fs_info * fs_info ,
u64 logical , u64 length ,
u64 srcdev_devid , int * mirror_num ,
u64 * physical )
{
2021-09-15 15:17:16 +08:00
struct btrfs_io_context * bioc = NULL ;
2017-03-14 13:33:57 -07:00
int num_stripes ;
int index_srcdev = 0 ;
int found = 0 ;
u64 physical_of_found = 0 ;
int i ;
int ret = 0 ;
ret = __btrfs_map_block ( fs_info , BTRFS_MAP_GET_READ_MIRRORS ,
2021-09-15 15:17:16 +08:00
logical , & length , & bioc , 0 , 0 ) ;
2017-03-14 13:33:57 -07:00
if ( ret ) {
2021-09-15 15:17:16 +08:00
ASSERT ( bioc = = NULL ) ;
2017-03-14 13:33:57 -07:00
return ret ;
}
2021-09-15 15:17:16 +08:00
num_stripes = bioc - > num_stripes ;
2017-03-14 13:33:57 -07:00
if ( * mirror_num > num_stripes ) {
/*
* BTRFS_MAP_GET_READ_MIRRORS does not contain this mirror ,
* that means that the requested area is not left of the left
* cursor
*/
2021-09-15 15:17:16 +08:00
btrfs_put_bioc ( bioc ) ;
2017-03-14 13:33:57 -07:00
return - EIO ;
}
/*
* process the rest of the function using the mirror_num of the source
* drive . Therefore look it up first . At the end , patch the device
* pointer to the one of the target drive .
*/
for ( i = 0 ; i < num_stripes ; i + + ) {
2021-09-15 15:17:16 +08:00
if ( bioc - > stripes [ i ] . dev - > devid ! = srcdev_devid )
2017-03-14 13:33:57 -07:00
continue ;
/*
* In case of DUP , in order to keep it simple , only add the
* mirror with the lowest physical address
*/
if ( found & &
2021-09-15 15:17:16 +08:00
physical_of_found < = bioc - > stripes [ i ] . physical )
2017-03-14 13:33:57 -07:00
continue ;
index_srcdev = i ;
found = 1 ;
2021-09-15 15:17:16 +08:00
physical_of_found = bioc - > stripes [ i ] . physical ;
2017-03-14 13:33:57 -07:00
}
2021-09-15 15:17:16 +08:00
btrfs_put_bioc ( bioc ) ;
2017-03-14 13:33:57 -07:00
ASSERT ( found ) ;
if ( ! found )
return - EIO ;
* mirror_num = index_srcdev + 1 ;
* physical = physical_of_found ;
return ret ;
}
2021-02-04 19:22:12 +09:00
static bool is_block_group_to_copy ( struct btrfs_fs_info * fs_info , u64 logical )
{
struct btrfs_block_group * cache ;
bool ret ;
2021-02-04 19:22:13 +09:00
/* Non zoned filesystem does not use "to_copy" flag */
2021-02-04 19:22:12 +09:00
if ( ! btrfs_is_zoned ( fs_info ) )
return false ;
cache = btrfs_lookup_block_group ( fs_info , logical ) ;
spin_lock ( & cache - > lock ) ;
ret = cache - > to_copy ;
spin_unlock ( & cache - > lock ) ;
btrfs_put_block_group ( cache ) ;
return ret ;
}
2017-03-14 13:33:58 -07:00
static void handle_ops_on_dev_replace ( enum btrfs_map_op op ,
2021-09-15 15:17:16 +08:00
struct btrfs_io_context * * bioc_ret ,
2017-03-14 13:33:58 -07:00
struct btrfs_dev_replace * dev_replace ,
2021-02-04 19:22:12 +09:00
u64 logical ,
2017-03-14 13:33:58 -07:00
int * num_stripes_ret , int * max_errors_ret )
{
2021-09-15 15:17:16 +08:00
struct btrfs_io_context * bioc = * bioc_ret ;
2017-03-14 13:33:58 -07:00
u64 srcdev_devid = dev_replace - > srcdev - > devid ;
int tgtdev_indexes = 0 ;
int num_stripes = * num_stripes_ret ;
int max_errors = * max_errors_ret ;
int i ;
if ( op = = BTRFS_MAP_WRITE ) {
int index_where_to_add ;
2021-02-04 19:22:12 +09:00
/*
* A block group which has "to_copy" set will eventually be
* copied by the dev-replace process. We can avoid cloning IO here.
*/
if ( is_block_group_to_copy ( dev_replace - > srcdev - > fs_info , logical ) )
return ;
2017-03-14 13:33:58 -07:00
/*
* duplicate the write operations while the dev replace
* procedure is running . Since the copying of the old disk to
* the new disk takes place at run time while the filesystem is
* mounted writable , the regular write operations to the old
* disk have to be duplicated to go to the new disk as well .
*
* Note that device - > missing is handled by the caller , and that
* the write to the old disk is already set up in the stripes
* array .
*/
index_where_to_add = num_stripes ;
for ( i = 0 ; i < num_stripes ; i + + ) {
2021-09-15 15:17:16 +08:00
if ( bioc - > stripes [ i ] . dev - > devid = = srcdev_devid ) {
2017-03-14 13:33:58 -07:00
/* write to new disk, too */
2021-09-15 15:17:16 +08:00
struct btrfs_io_stripe * new =
bioc - > stripes + index_where_to_add ;
struct btrfs_io_stripe * old =
bioc - > stripes + i ;
2017-03-14 13:33:58 -07:00
new - > physical = old - > physical ;
new - > length = old - > length ;
new - > dev = dev_replace - > tgtdev ;
2021-09-15 15:17:16 +08:00
bioc - > tgtdev_map [ i ] = index_where_to_add ;
2017-03-14 13:33:58 -07:00
index_where_to_add + + ;
max_errors + + ;
tgtdev_indexes + + ;
}
}
num_stripes = index_where_to_add ;
} else if ( op = = BTRFS_MAP_GET_READ_MIRRORS ) {
int index_srcdev = 0 ;
int found = 0 ;
u64 physical_of_found = 0 ;
/*
* During the dev - replace procedure , the target drive can also
* be used to read data in case it is needed to repair a corrupt
* block elsewhere . This is possible if the requested area is
* left of the left cursor . In this area , the target drive is a
* full copy of the source drive .
*/
for ( i = 0 ; i < num_stripes ; i + + ) {
2021-09-15 15:17:16 +08:00
if ( bioc - > stripes [ i ] . dev - > devid = = srcdev_devid ) {
2017-03-14 13:33:58 -07:00
/*
* In case of DUP , in order to keep it simple ,
* only add the mirror with the lowest physical
* address
*/
if ( found & &
2021-09-15 15:17:16 +08:00
physical_of_found < = bioc - > stripes [ i ] . physical )
2017-03-14 13:33:58 -07:00
continue ;
index_srcdev = i ;
found = 1 ;
2021-09-15 15:17:16 +08:00
physical_of_found = bioc - > stripes [ i ] . physical ;
2017-03-14 13:33:58 -07:00
}
}
if ( found ) {
2021-09-15 15:17:16 +08:00
struct btrfs_io_stripe * tgtdev_stripe =
bioc - > stripes + num_stripes ;
2017-03-14 13:33:58 -07:00
tgtdev_stripe - > physical = physical_of_found ;
tgtdev_stripe - > length =
2021-09-15 15:17:16 +08:00
bioc - > stripes [ index_srcdev ] . length ;
2017-03-14 13:33:58 -07:00
tgtdev_stripe - > dev = dev_replace - > tgtdev ;
2021-09-15 15:17:16 +08:00
bioc - > tgtdev_map [ index_srcdev ] = num_stripes ;
2017-03-14 13:33:58 -07:00
tgtdev_indexes + + ;
num_stripes + + ;
}
}
* num_stripes_ret = num_stripes ;
* max_errors_ret = max_errors ;
2021-09-15 15:17:16 +08:00
bioc - > num_tgtdevs = tgtdev_indexes ;
* bioc_ret = bioc ;
2017-03-14 13:33:58 -07:00
}
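/*
 * Illustration only, not part of the original file: a user-space sketch of the
 * write-duplication branch of handle_ops_on_dev_replace(). The demo_* names
 * are hypothetical and devices are reduced to plain ids. Every stripe that
 * targets the source device gets a copy appended for the target device, so
 * the caller must provide room for up to twice the original stripe count.
 */

```c
#include <assert.h>
#include <stdint.h>

struct demo_io_stripe {
	uint64_t devid;
	uint64_t physical;
};

/* Returns the new stripe count after appending the duplicated writes. */
static int demo_dup_writes_to_target(struct demo_io_stripe *stripes,
				     int num_stripes, int capacity,
				     uint64_t srcdev_devid,
				     uint64_t tgtdev_devid)
{
	int index_where_to_add = num_stripes;

	assert(num_stripes >= 0 && capacity >= 2 * num_stripes);
	for (int i = 0; i < num_stripes; i++) {
		if (stripes[i].devid != srcdev_devid)
			continue;
		/* Same physical offset, but on the replacement device. */
		stripes[index_where_to_add] = stripes[i];
		stripes[index_where_to_add].devid = tgtdev_devid;
		index_where_to_add++;
	}
	return index_where_to_add;
}
```

/*
 * Usage: with a DUP-like layout both stripes live on the source device, so
 * both are duplicated and the count doubles; this is the worst case the
 * capacity check guards against.
 */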
2017-03-14 13:34:00 -07:00
static bool need_full_stripe ( enum btrfs_map_op op )
{
return ( op = = BTRFS_MAP_WRITE | | op = = BTRFS_MAP_GET_READ_MIRRORS ) ;
}
2019-06-03 12:05:03 +03:00
/*
2021-01-27 14:57:27 +01:00
* Calculate the geometry of a particular ( address , len ) tuple . This
* information is used to calculate how big a particular bio can get before it
* straddles a stripe .
2019-06-03 12:05:03 +03:00
*
2021-01-27 14:57:27 +01:00
* @ fs_info : the filesystem
* @ em : mapping containing the logical extent
* @ op : type of operation - write or read
* @ logical : address that we want to figure out the geometry of
* @ io_geom : pointer used to return values
2019-06-03 12:05:03 +03:00
*
* Returns -EINVAL if the stripe offset computed for @logical is inconsistent,
* which shouldn't happen unless @logical is corrupted, and 0 otherwise.
*/
2021-01-27 14:57:27 +01:00
int btrfs_get_io_geometry ( struct btrfs_fs_info * fs_info , struct extent_map * em ,
2021-04-13 17:58:48 +08:00
enum btrfs_map_op op , u64 logical ,
2021-01-27 14:57:27 +01:00
struct btrfs_io_geometry * io_geom )
2019-06-03 12:05:03 +03:00
{
struct map_lookup * map ;
2021-04-13 17:58:48 +08:00
u64 len ;
2019-06-03 12:05:03 +03:00
u64 offset ;
u64 stripe_offset ;
u64 stripe_nr ;
u64 stripe_len ;
u64 raid56_full_stripe_start = ( u64 ) - 1 ;
int data_stripes ;
ASSERT ( op ! = BTRFS_MAP_DISCARD ) ;
map = em - > map_lookup ;
/* Offset of this logical address in the chunk */
offset = logical - em - > start ;
/* Len of a stripe in a chunk */
stripe_len = map - > stripe_len ;
2021-05-21 17:42:23 +02:00
/* Stripe where this block falls in */
2019-06-03 12:05:03 +03:00
stripe_nr = div64_u64 ( offset , stripe_len ) ;
/* Offset of stripe in the chunk */
stripe_offset = stripe_nr * stripe_len ;
if ( offset < stripe_offset ) {
btrfs_crit ( fs_info ,
" stripe math has gone wrong, stripe_offset=%llu offset=%llu start=%llu logical=%llu stripe_len=%llu " ,
stripe_offset , offset , em - > start , logical , stripe_len ) ;
2021-01-27 14:57:27 +01:00
return - EINVAL ;
2019-06-03 12:05:03 +03:00
}
/* stripe_offset is the offset of this block in its stripe */
stripe_offset = offset - stripe_offset ;
data_stripes = nr_data_stripes ( map ) ;
if ( map - > type & BTRFS_BLOCK_GROUP_PROFILE_MASK ) {
u64 max_len = stripe_len - stripe_offset ;
/*
* In case of raid56 , we need to know the stripe aligned start
*/
if ( map - > type & BTRFS_BLOCK_GROUP_RAID56_MASK ) {
unsigned long full_stripe_len = stripe_len * data_stripes ;
raid56_full_stripe_start = offset ;
/*
* Allow a write of a full stripe , but make sure we
* don ' t allow straddling of stripes
*/
raid56_full_stripe_start = div64_u64 ( raid56_full_stripe_start ,
full_stripe_len ) ;
raid56_full_stripe_start * = full_stripe_len ;
/*
* For writes to RAID [ 56 ] , allow a full stripeset across
* all disks . For other RAID types and for RAID [ 56 ]
* reads , just allow a single stripe ( on a single disk ) .
*/
if ( op = = BTRFS_MAP_WRITE ) {
max_len = stripe_len * data_stripes -
( offset - raid56_full_stripe_start ) ;
}
}
len = min_t ( u64 , em - > len - offset , max_len ) ;
} else {
len = em - > len - offset ;
}
io_geom - > len = len ;
io_geom - > offset = offset ;
io_geom - > stripe_len = stripe_len ;
io_geom - > stripe_nr = stripe_nr ;
io_geom - > stripe_offset = stripe_offset ;
io_geom - > raid56_stripe_offset = raid56_full_stripe_start ;
2021-01-27 14:57:27 +01:00
return 0 ;
2019-06-03 12:05:03 +03:00
}
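/*
 * Illustration only, not part of the original file: a user-space sketch of how
 * btrfs_get_io_geometry() bounds the length of one I/O for a chunk whose type
 * has a profile bit set. The name demo_max_io_len and its flat parameter list
 * are hypothetical. A normal I/O may not cross the end of its stripe; a
 * RAID5/6 write may extend to the end of the full stripe (stripe_len times
 * the number of data stripes). Either way the result is clamped to the end of
 * the chunk.
 */

```c
#include <assert.h>
#include <stdint.h>

static uint64_t demo_max_io_len(uint64_t offset, uint64_t chunk_len,
				uint64_t stripe_len, uint64_t data_stripes,
				int raid56_write)
{
	uint64_t stripe_nr, stripe_offset, max_len, remaining;

	assert(stripe_len > 0 && data_stripes > 0 && offset < chunk_len);
	/* Stripe that contains the offset, and the offset inside it. */
	stripe_nr = offset / stripe_len;
	stripe_offset = offset - stripe_nr * stripe_len;
	max_len = stripe_len - stripe_offset;

	if (raid56_write) {
		/* Allow a write up to the end of the full stripe. */
		uint64_t full_stripe_len = stripe_len * data_stripes;
		uint64_t full_stripe_start =
			offset / full_stripe_len * full_stripe_len;

		max_len = full_stripe_len - (offset - full_stripe_start);
	}

	/* Never run past the end of the chunk. */
	remaining = chunk_len - offset;
	return remaining < max_len ? remaining : max_len;
}
```

/*
 * Usage: offset is relative to the chunk start; the returned value plays the
 * role of io_geom->len in the function above.
 */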
2016-10-27 09:27:36 +02:00
static int __btrfs_map_block ( struct btrfs_fs_info * fs_info ,
enum btrfs_map_op op ,
2008-04-21 10:03:05 -04:00
u64 logical , u64 * length ,
2021-09-15 15:17:16 +08:00
struct btrfs_io_context * * bioc_ret ,
2015-01-20 15:11:33 +08:00
int mirror_num , int need_raid_map )
2008-03-24 15:01:56 -04:00
{
struct extent_map * em ;
struct map_lookup * map ;
2008-03-25 16:50:33 -04:00
u64 stripe_offset ;
u64 stripe_nr ;
2013-01-29 18:40:14 -05:00
u64 stripe_len ;
2015-02-20 18:42:11 +01:00
u32 stripe_index ;
2019-05-17 11:43:45 +02:00
int data_stripes ;
2008-04-09 16:28:12 -04:00
int i ;
2011-12-01 12:55:47 +08:00
int ret = 0 ;
2008-04-21 10:03:05 -04:00
int num_stripes ;
2008-04-29 09:38:00 -04:00
int max_errors = 0 ;
2014-11-14 16:06:25 +08:00
int tgtdev_indexes = 0 ;
2021-09-15 15:17:16 +08:00
struct btrfs_io_context * bioc = NULL ;
2012-11-06 14:43:46 +01:00
struct btrfs_dev_replace * dev_replace = & fs_info - > dev_replace ;
int dev_replace_is_ongoing = 0 ;
int num_alloc_stripes ;
2012-11-06 15:06:47 +01:00
int patch_the_first_stripe_for_dev_replace = 0 ;
u64 physical_to_patch_in_first_stripe = 0 ;
2013-01-29 18:40:14 -05:00
u64 raid56_full_stripe_start = ( u64 ) - 1 ;
2019-06-03 12:05:05 +03:00
struct btrfs_io_geometry geom ;
2021-09-15 15:17:16 +08:00
ASSERT ( bioc_ret ) ;
2018-08-03 00:36:29 +02:00
ASSERT ( op ! = BTRFS_MAP_DISCARD ) ;
2017-03-14 13:33:56 -07:00
2021-01-27 14:57:27 +01:00
em = btrfs_get_chunk_map ( fs_info , logical , * length ) ;
ASSERT ( ! IS_ERR ( em ) ) ;
2021-04-13 17:58:48 +08:00
ret = btrfs_get_io_geometry ( fs_info , em , op , logical , & geom ) ;
2019-06-03 12:05:05 +03:00
if ( ret < 0 )
return ret ;
2008-03-24 15:01:56 -04:00
2015-06-03 10:55:48 -04:00
map = em - > map_lookup ;
2008-03-25 16:50:33 -04:00
2019-06-03 12:05:05 +03:00
* length = geom . len ;
stripe_len = geom . stripe_len ;
stripe_nr = geom . stripe_nr ;
stripe_offset = geom . stripe_offset ;
raid56_full_stripe_start = geom . raid56_stripe_offset ;
2019-05-17 11:43:45 +02:00
data_stripes = nr_data_stripes ( map ) ;
2008-03-25 16:50:33 -04:00
2018-09-07 16:11:23 +02:00
down_read ( & dev_replace - > rwsem ) ;
2012-11-06 14:43:46 +01:00
dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing ( dev_replace ) ;
2018-04-05 01:41:06 +02:00
/*
* Hold the semaphore for read during the whole operation , write is
* requested at commit time but must wait .
*/
2012-11-06 14:43:46 +01:00
if ( ! dev_replace_is_ongoing )
2018-09-07 16:11:23 +02:00
up_read ( & dev_replace - > rwsem ) ;
2012-11-06 14:43:46 +01:00
2012-11-06 15:06:47 +01:00
if ( dev_replace_is_ongoing & & mirror_num = = map - > num_stripes + 1 & &
2017-03-14 13:34:00 -07:00
! need_full_stripe ( op ) & & dev_replace - > tgtdev ! = NULL ) {
2017-03-14 13:33:57 -07:00
ret = get_extra_mirror_from_replace ( fs_info , logical , * length ,
dev_replace - > srcdev - > devid ,
& mirror_num ,
& physical_to_patch_in_first_stripe ) ;
if ( ret )
2012-11-06 15:06:47 +01:00
goto out ;
2017-03-14 13:33:57 -07:00
else
patch_the_first_stripe_for_dev_replace = 1 ;
2012-11-06 15:06:47 +01:00
} else if ( mirror_num > map - > num_stripes ) {
mirror_num = 0 ;
}
2008-04-21 10:03:05 -04:00
num_stripes = 1 ;
2008-04-09 16:28:12 -04:00
stripe_index = 0 ;
2011-03-24 10:24:26 +00:00
if ( map - > type & BTRFS_BLOCK_GROUP_RAID0 ) {
2015-02-20 18:43:47 +01:00
stripe_nr = div_u64_rem ( stripe_nr , map - > num_stripes ,
& stripe_index ) ;
2017-10-12 16:43:00 +08:00
if ( ! need_full_stripe ( op ) )
2014-09-12 18:44:02 +08:00
mirror_num = 1 ;
2019-05-31 15:39:31 +02:00
} else if ( map - > type & BTRFS_BLOCK_GROUP_RAID1_MASK ) {
2017-10-12 16:43:00 +08:00
if ( need_full_stripe ( op ) )
2008-04-21 10:03:05 -04:00
num_stripes = map - > num_stripes ;
2008-04-29 14:12:09 -04:00
else if ( mirror_num )
2008-04-09 16:28:12 -04:00
stripe_index = mirror_num - 1 ;
2008-05-13 13:46:40 -04:00
else {
2012-11-06 14:52:18 +01:00
stripe_index = find_live_mirror ( fs_info , map , 0 ,
dev_replace_is_ongoing ) ;
2011-08-04 17:15:33 +02:00
mirror_num = stripe_index + 1 ;
2008-05-13 13:46:40 -04:00
}
2008-04-29 14:12:09 -04:00
2008-04-03 16:29:03 -04:00
} else if ( map - > type & BTRFS_BLOCK_GROUP_DUP ) {
2017-10-12 16:43:00 +08:00
if ( need_full_stripe ( op ) ) {
2008-04-21 10:03:05 -04:00
num_stripes = map - > num_stripes ;
2011-08-04 17:15:33 +02:00
} else if ( mirror_num ) {
2008-04-09 16:28:12 -04:00
stripe_index = mirror_num - 1 ;
2011-08-04 17:15:33 +02:00
} else {
mirror_num = 1 ;
}
2008-04-29 14:12:09 -04:00
2008-04-16 10:49:51 -04:00
} else if ( map - > type & BTRFS_BLOCK_GROUP_RAID10 ) {
2015-02-20 18:42:11 +01:00
u32 factor = map - > num_stripes / map - > sub_stripes ;
2008-04-16 10:49:51 -04:00
2015-02-20 18:43:47 +01:00
stripe_nr = div_u64_rem ( stripe_nr , factor , & stripe_index ) ;
2008-04-16 10:49:51 -04:00
stripe_index * = map - > sub_stripes ;
2017-10-12 16:43:00 +08:00
if ( need_full_stripe ( op ) )
2008-04-21 10:03:05 -04:00
num_stripes = map - > sub_stripes ;
2008-04-16 10:49:51 -04:00
else if ( mirror_num )
stripe_index + = mirror_num - 1 ;
2008-05-13 13:46:40 -04:00
else {
2012-04-27 12:41:45 -04:00
int old_stripe_index = stripe_index ;
2012-11-06 14:52:18 +01:00
stripe_index = find_live_mirror ( fs_info , map ,
stripe_index ,
dev_replace_is_ongoing ) ;
2012-04-27 12:41:45 -04:00
mirror_num = stripe_index - old_stripe_index + 1 ;
2008-05-13 13:46:40 -04:00
}
2013-01-29 18:40:14 -05:00
2015-01-20 15:11:44 +08:00
} else if ( map - > type & BTRFS_BLOCK_GROUP_RAID56_MASK ) {
2017-10-12 16:43:00 +08:00
if ( need_raid_map & & ( need_full_stripe ( op ) | | mirror_num > 1 ) ) {
2013-01-29 18:40:14 -05:00
/* push stripe_nr back to the start of the full stripe */
2017-04-03 13:45:24 -07:00
stripe_nr = div64_u64 ( raid56_full_stripe_start ,
2019-05-17 11:43:45 +02:00
stripe_len * data_stripes ) ;
2013-01-29 18:40:14 -05:00
/* RAID[56] write or recovery. Return all stripes */
num_stripes = map - > num_stripes ;
max_errors = nr_parity_stripes ( map ) ;
* length = map - > stripe_len ;
stripe_index = 0 ;
stripe_offset = 0 ;
} else {
/*
* Mirror # 0 or # 1 means the original data block .
* Mirror # 2 is RAID5 parity block .
* Mirror # 3 is RAID6 Q block .
*/
2015-02-20 18:43:47 +01:00
stripe_nr = div_u64_rem ( stripe_nr ,
2019-05-17 11:43:45 +02:00
data_stripes , & stripe_index ) ;
2013-01-29 18:40:14 -05:00
if ( mirror_num > 1 )
2019-05-17 11:43:45 +02:00
stripe_index = data_stripes + mirror_num - 2 ;
2013-01-29 18:40:14 -05:00
/* We distribute the parity blocks across stripes */
2015-02-20 18:43:47 +01:00
div_u64_rem ( stripe_nr + stripe_index , map - > num_stripes ,
& stripe_index ) ;
2017-10-12 16:43:00 +08:00
if ( ! need_full_stripe ( op ) & & mirror_num < = 1 )
2014-09-12 18:44:02 +08:00
mirror_num = 1 ;
2013-01-29 18:40:14 -05:00
}
2008-04-03 16:29:03 -04:00
} else {
/*
2015-02-20 18:43:47 +01:00
* after this , stripe_nr is the number of stripes on this
* device we have to walk to find the data , and stripe_index is
* the number of our device in the stripe array
2008-04-03 16:29:03 -04:00
*/
2015-02-20 18:43:47 +01:00
stripe_nr = div_u64_rem ( stripe_nr , map - > num_stripes ,
& stripe_index ) ;
2011-08-04 17:15:33 +02:00
mirror_num = stripe_index + 1 ;
2008-04-03 16:29:03 -04:00
}
2016-04-12 12:54:40 -04:00
if ( stripe_index > = map - > num_stripes ) {
2016-09-20 10:05:00 -04:00
btrfs_crit ( fs_info ,
" stripe index math went horribly wrong, got stripe_index=%u, num_stripes=%u " ,
2016-04-12 12:54:40 -04:00
stripe_index , map - > num_stripes ) ;
ret = - EINVAL ;
goto out ;
}
2008-04-09 16:28:12 -04:00
2012-11-06 14:43:46 +01:00
num_alloc_stripes = num_stripes ;
2017-03-14 13:33:59 -07:00
if ( dev_replace_is_ongoing & & dev_replace - > tgtdev ! = NULL ) {
2017-03-14 13:33:56 -07:00
if ( op = = BTRFS_MAP_WRITE )
2012-11-06 15:06:47 +01:00
num_alloc_stripes < < = 1 ;
2016-10-27 09:27:36 +02:00
if ( op = = BTRFS_MAP_GET_READ_MIRRORS )
2012-11-06 15:06:47 +01:00
num_alloc_stripes + + ;
2014-11-14 16:06:25 +08:00
tgtdev_indexes = num_stripes ;
2012-11-06 15:06:47 +01:00
}
2014-11-14 16:06:25 +08:00
2021-09-23 14:00:08 +08:00
bioc = alloc_btrfs_io_context ( fs_info , num_alloc_stripes , tgtdev_indexes ) ;
2021-09-15 15:17:16 +08:00
if ( ! bioc ) {
2011-12-01 12:55:47 +08:00
ret = - ENOMEM ;
goto out ;
}
2020-07-02 16:46:41 +03:00
for ( i = 0 ; i < num_stripes ; i + + ) {
2021-09-15 15:17:16 +08:00
bioc - > stripes [ i ] . physical = map - > stripes [ stripe_index ] . physical +
2020-07-02 16:46:41 +03:00
stripe_offset + stripe_nr * map - > stripe_len ;
2021-09-15 15:17:16 +08:00
bioc - > stripes [ i ] . dev = map - > stripes [ stripe_index ] . dev ;
2020-07-02 16:46:41 +03:00
stripe_index + + ;
}
2011-12-01 12:55:47 +08:00
2021-09-15 15:17:16 +08:00
/* Build raid_map */
2017-03-14 13:34:00 -07:00
if ( map - > type & BTRFS_BLOCK_GROUP_RAID56_MASK & & need_raid_map & &
( need_full_stripe ( op ) | | mirror_num > 1 ) ) {
2015-01-20 15:11:33 +08:00
u64 tmp ;
2015-02-20 18:42:11 +01:00
unsigned rot ;
2015-01-20 15:11:33 +08:00
/* Work out the disk rotation on this stripe-set */
2015-02-20 18:43:47 +01:00
div_u64_rem ( stripe_nr , num_stripes , & rot ) ;
2015-01-20 15:11:33 +08:00
/* Fill in the logical address of each stripe */
2019-05-17 11:43:45 +02:00
tmp = stripe_nr * data_stripes ;
for ( i = 0 ; i < data_stripes ; i + + )
2021-09-15 15:17:16 +08:00
bioc - > raid_map [ ( i + rot ) % num_stripes ] =
2015-01-20 15:11:33 +08:00
em - > start + ( tmp + i ) * map - > stripe_len ;
2021-09-15 15:17:16 +08:00
bioc - > raid_map [ ( i + rot ) % map - > num_stripes ] = RAID5_P_STRIPE ;
2015-01-20 15:11:33 +08:00
if ( map - > type & BTRFS_BLOCK_GROUP_RAID6 )
2021-09-15 15:17:16 +08:00
bioc - > raid_map [ ( i + rot + 1 ) % num_stripes ] =
2015-01-20 15:11:33 +08:00
RAID6_Q_STRIPE ;
2021-09-15 15:17:16 +08:00
sort_parity_stripes ( bioc , num_stripes ) ;
2008-03-25 16:50:33 -04:00
}
2011-12-01 12:55:47 +08:00
2017-03-14 13:34:00 -07:00
if ( need_full_stripe ( op ) )
2014-07-03 18:22:13 +08:00
max_errors = btrfs_chunk_max_errors ( map ) ;
2011-12-01 12:55:47 +08:00
2017-03-14 13:33:58 -07:00
if ( dev_replace_is_ongoing & & dev_replace - > tgtdev ! = NULL & &
2017-03-14 13:34:00 -07:00
need_full_stripe ( op ) ) {
2021-09-15 15:17:16 +08:00
handle_ops_on_dev_replace ( op , & bioc , dev_replace , logical ,
2021-02-04 19:22:12 +09:00
& num_stripes , & max_errors ) ;
2012-11-06 14:43:46 +01:00
}
2021-09-15 15:17:16 +08:00
* bioc_ret = bioc ;
bioc - > map_type = map - > type ;
bioc - > num_stripes = num_stripes ;
bioc - > max_errors = max_errors ;
bioc - > mirror_num = mirror_num ;
2012-11-06 15:06:47 +01:00
/*
* this is the case that REQ_READ && dev_replace_is_ongoing &&
* mirror_num == num_stripes + 1 && dev_replace target drive is
* available as a mirror
*/
if ( patch_the_first_stripe_for_dev_replace & & num_stripes > 0 ) {
WARN_ON ( num_stripes > 1 ) ;
2021-09-15 15:17:16 +08:00
bioc - > stripes [ 0 ] . dev = dev_replace - > tgtdev ;
bioc - > stripes [ 0 ] . physical = physical_to_patch_in_first_stripe ;
bioc - > mirror_num = map - > num_stripes + 1 ;
2012-11-06 15:06:47 +01:00
}
2008-04-09 16:28:12 -04:00
out :
Btrfs: fix lockdep deadlock warning due to dev_replace
Xfstests btrfs/011 complains about a deadlock warning,
[ 1226.649039] =========================================================
[ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
[ 1226.649039] 4.1.0+ #270 Not tainted
[ 1226.649039] ---------------------------------------------------------
[ 1226.652955] kswapd0/46 just changed the state of lock:
[ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.}
and interrupts could create inverse lock ordering between them.
[ 1226.652955]
other info that might help us debug this:
[ 1226.652955] Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
[ 1226.652955] Possible interrupt unsafe locking scenario:
[ 1226.652955] CPU0 CPU1
[ 1226.652955] ---- ----
[ 1226.652955] lock(&fs_info->dev_replace.lock);
[ 1226.652955] local_irq_disable();
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955] lock(&found->groups_sem);
[ 1226.652955] <Interrupt>
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955]
*** DEADLOCK ***
Commit 084b6e7c7607 ("btrfs: Fix a lockdep warning when running xfstest.") tried
to fix a similar problem that produced exactly the same warning, but even with
that fix we still hit this one.
The above lock chain comes from
btrfs_commit_transaction
->btrfs_run_delayed_items
...
->__btrfs_update_delayed_inode
...
->__btrfs_cow_block
...
->find_free_extent
->cache_block_group
->load_free_space_cache
->btrfs_readpages
->submit_one_bio
...
->__btrfs_map_block
->btrfs_dev_replace_lock
However, under high memory pressure, tasks that hold dev_replace.lock can
be interrupted by kswapd, which then tries to release memory occupied
by superblocks, inodes and dentries. That may call evict_inode, which leads
to
[ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
[ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), which leads
to an ABBA deadlock.
To fix this, we can use the "blocking rwlock" approach used for extent_buffer, but
things are simpler here since we only need to turn the read side's spinlock into a blocking lock.
With this, btrfs/011 no longer produces warnings in dmesg.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-07-17 16:49:19 +08:00
if ( dev_replace_is_ongoing ) {
2018-04-05 01:41:06 +02:00
lockdep_assert_held ( & dev_replace - > rwsem ) ;
/* Unlock and let waiting writers proceed */
2018-09-07 16:11:23 +02:00
up_read ( & dev_replace - > rwsem ) ;
2015-07-17 16:49:19 +08:00
}
2008-03-24 15:01:56 -04:00
free_extent_map ( em ) ;
2011-12-01 12:55:47 +08:00
return ret ;
2008-03-24 15:01:56 -04:00
}
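The "Build raid_map" step above can be sketched in isolation. This is a minimal illustration, not the kernel code: the names prefixed with sketch_ and the two sentinel values are assumptions made for the example, nparity stands in for the RAID5 vs. RAID6 test on map->type, and the final sort_parity_stripes() pass is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Sentinel markers for the parity slots; the values are assumed here. */
#define SKETCH_P_STRIPE ((uint64_t)-2)
#define SKETCH_Q_STRIPE ((uint64_t)-1)

/*
 * Fill raid_map[0..num_stripes-1] for one full stripe of a RAID5/6 chunk.
 * Data slots receive the logical address of the data they hold; the parity
 * slots receive a sentinel. The starting slot rotates with stripe_nr.
 */
static void sketch_fill_raid_map(uint64_t *raid_map, unsigned num_stripes,
				 unsigned nparity, uint64_t chunk_start,
				 uint64_t stripe_nr, uint64_t stripe_len)
{
	unsigned data_stripes, rot, i;
	uint64_t tmp;

	assert(nparity == 1 || nparity == 2);
	assert(num_stripes > nparity);

	data_stripes = num_stripes - nparity;
	/* Work out the disk rotation on this stripe-set. */
	rot = (unsigned)(stripe_nr % num_stripes);
	/* Index of the first data stripe of this full stripe. */
	tmp = stripe_nr * data_stripes;

	for (i = 0; i < data_stripes; i++)
		raid_map[(i + rot) % num_stripes] =
			chunk_start + (tmp + i) * stripe_len;

	/* After the loop i == data_stripes, so P follows the last data slot. */
	raid_map[(i + rot) % num_stripes] = SKETCH_P_STRIPE;
	if (nparity == 2)
		raid_map[(i + rot + 1) % num_stripes] = SKETCH_Q_STRIPE;
}
```

With three devices, one parity stripe and stripe_nr = 1, the rotation is 1, so the two data stripes land in slots 1 and 2 and P wraps around to slot 0.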
2016-10-27 09:27:36 +02:00
int btrfs_map_block ( struct btrfs_fs_info * fs_info , enum btrfs_map_op op ,
2008-04-21 10:03:05 -04:00
u64 logical , u64 * length ,
2021-09-15 15:17:16 +08:00
struct btrfs_io_context * * bioc_ret , int mirror_num )
2008-04-21 10:03:05 -04:00
{
2018-08-03 00:36:29 +02:00
if ( op = = BTRFS_MAP_DISCARD )
return __btrfs_map_block_for_discard ( fs_info , logical ,
2021-09-15 15:17:16 +08:00
length , bioc_ret ) ;
2018-08-03 00:36:29 +02:00
2021-09-15 15:17:16 +08:00
return __btrfs_map_block ( fs_info , op , logical , length , bioc_ret ,
2015-01-20 15:11:33 +08:00
mirror_num , 0 ) ;
2008-04-21 10:03:05 -04:00
}
2014-10-23 14:42:50 +08:00
/* For Scrub/replace */
2016-10-27 09:27:36 +02:00
int btrfs_map_sblock ( struct btrfs_fs_info * fs_info , enum btrfs_map_op op ,
2014-10-23 14:42:50 +08:00
u64 logical , u64 * length ,
2021-09-15 15:17:16 +08:00
struct btrfs_io_context * * bioc_ret )
2014-10-23 14:42:50 +08:00
{
2021-09-15 15:17:16 +08:00
return __btrfs_map_block ( fs_info , op , logical , length , bioc_ret , 0 , 1 ) ;
2014-10-23 14:42:50 +08:00
}
2021-09-15 15:17:16 +08:00
static inline void btrfs_end_bioc ( struct btrfs_io_context * bioc , struct bio * bio )
2014-06-19 10:42:55 +08:00
{
2021-09-15 15:17:16 +08:00
bio - > bi_private = bioc - > private ;
bio - > bi_end_io = bioc - > end_io ;
2015-07-20 15:29:37 +02:00
bio_endio ( bio ) ;
2015-05-22 09:14:03 -04:00
2021-09-15 15:17:16 +08:00
btrfs_put_bioc ( bioc ) ;
2014-06-19 10:42:55 +08:00
}
2015-07-20 15:29:37 +02:00
static void btrfs_end_bio ( struct bio * bio )
2008-04-03 16:29:03 -04:00
{
2021-09-15 15:17:16 +08:00
struct btrfs_io_context * bioc = bio - > bi_private ;
2008-08-05 10:13:57 -04:00
int is_orig_bio = 0 ;
2008-04-03 16:29:03 -04:00
2017-06-03 09:38:06 +02:00
if ( bio - > bi_status ) {
2021-09-15 15:17:16 +08:00
atomic_inc ( & bioc - > error ) ;
2017-06-03 09:38:06 +02:00
if ( bio - > bi_status = = BLK_STS_IOERR | |
bio - > bi_status = = BLK_STS_TARGET ) {
2021-09-15 15:17:18 +08:00
struct btrfs_device * dev = btrfs_bio ( bio ) - > device ;
2012-05-25 16:06:08 +02:00
2020-07-02 15:23:31 +03:00
ASSERT ( dev - > bdev ) ;
2021-02-04 19:21:59 +09:00
if ( btrfs_op ( bio ) = = BTRFS_MAP_WRITE )
2020-07-02 15:23:31 +03:00
btrfs_dev_stat_inc_and_print ( dev ,
2012-06-14 16:42:31 +02:00
BTRFS_DEV_STAT_WRITE_ERRS ) ;
2020-07-02 15:23:31 +03:00
else if ( ! ( bio - > bi_opf & REQ_RAHEAD ) )
btrfs_dev_stat_inc_and_print ( dev ,
2012-06-14 16:42:31 +02:00
BTRFS_DEV_STAT_READ_ERRS ) ;
2020-07-02 15:23:31 +03:00
if ( bio - > bi_opf & REQ_PREFLUSH )
btrfs_dev_stat_inc_and_print ( dev ,
2012-06-14 16:42:31 +02:00
BTRFS_DEV_STAT_FLUSH_ERRS ) ;
2012-05-25 16:06:08 +02:00
}
}
2008-04-03 16:29:03 -04:00
2021-09-15 15:17:16 +08:00
if ( bio = = bioc - > orig_bio )
2008-08-05 10:13:57 -04:00
is_orig_bio = 1 ;
2021-09-15 15:17:16 +08:00
btrfs_bio_counter_dec ( bioc - > fs_info ) ;
Btrfs: fix use-after-free in the finishing procedure of the device replace
During a device replace test, we hit a null pointer dereference (it was very easy
to reproduce it by running xfstests' btrfs/011 on the devices with the virtio
scsi driver). There were two bugs that caused this problem:
- We might allocate new chunks on the replaced device after we updated
the mapping tree. And we forgot to replace the source device in the
mappings of those new chunks.
- We might get mapping information that includes the source device
before the mapping information update, and then submit a bio based
on that mapping information after we freed the source device.
For the first bug, we can fix it by doing the mapping tree update and the source
device removal in the same context of the chunk mutex. The chunk mutex is
used to protect the allocable device list, so the above method prevents
new chunk allocation, and after we remove the source device, all
the new chunks will be allocated on the new device. So it can fix
the first bug.
For the second bug, we need to make sure all in-flight bios are finished and
no new bios are produced while we are removing the source device. To fix
this problem, we introduced a global @bio_counter. We not only inc/dec
@bio_counter outside of map_blocks, but also inc it before submitting a bio
and dec @bio_counter when ending bios.
Since Raid56 is a little different and device replace doesn't support raid56
yet, it is not addressed in the patch and I added comments to make sure we will
fix it in the future.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-01-30 16:46:55 +08:00
2021-09-15 15:17:16 +08:00
if ( atomic_dec_and_test ( & bioc - > stripes_pending ) ) {
2008-08-05 10:13:57 -04:00
if ( ! is_orig_bio ) {
bio_put ( bio ) ;
2021-09-15 15:17:16 +08:00
bio = bioc - > orig_bio ;
2008-08-05 10:13:57 -04:00
}
2014-01-08 14:19:52 -07:00
2021-09-15 15:17:18 +08:00
btrfs_bio ( bio ) - > mirror_num = bioc - > mirror_num ;
2008-04-29 09:38:00 -04:00
/* only send an error to the higher layers if it is
2013-01-29 18:40:14 -05:00
* beyond the tolerance of the btrfs bio
2008-04-29 09:38:00 -04:00
*/
2021-09-15 15:17:16 +08:00
if ( atomic_read ( & bioc - > error ) > bioc - > max_errors ) {
2017-06-03 09:38:06 +02:00
bio - > bi_status = BLK_STS_IOERR ;
2011-12-09 11:07:37 -05:00
} else {
2008-05-12 13:39:03 -04:00
/*
* this bio is actually up to date, we didn't
* go over the max number of errors
*/
2017-10-14 08:35:56 +08:00
bio - > bi_status = BLK_STS_OK ;
2008-05-12 13:39:03 -04:00
}
2014-06-19 10:42:54 +08:00
2021-09-15 15:17:16 +08:00
btrfs_end_bioc ( bioc , bio ) ;
2008-08-05 10:13:57 -04:00
} else if ( ! is_orig_bio ) {
2008-04-03 16:29:03 -04:00
bio_put ( bio ) ;
}
}
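The completion accounting in btrfs_end_bio() can be sketched with C11 atomics. This is only an illustration under assumptions: struct sketch_ioctx and the sketch_ functions are hypothetical stand-ins for the stripes_pending, error and max_errors fields used above, and the bio handling itself is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the counters kept in the I/O context. */
struct sketch_ioctx {
	atomic_int stripes_pending;	/* stripe bios still in flight */
	atomic_int error;		/* stripe bios that failed */
	int max_errors;			/* failures the profile tolerates */
};

enum sketch_result { SKETCH_PENDING, SKETCH_OK, SKETCH_IOERR };

static void sketch_ioctx_init(struct sketch_ioctx *c, int num_stripes,
			      int max_errors)
{
	assert(num_stripes > 0);
	atomic_init(&c->stripes_pending, num_stripes);
	atomic_init(&c->error, 0);
	c->max_errors = max_errors;
}

/*
 * Called once per finished stripe bio. Only the caller that drops the
 * pending count to zero decides the status reported to the upper layer.
 */
static enum sketch_result sketch_stripe_done(struct sketch_ioctx *c,
					     bool failed)
{
	if (failed)
		atomic_fetch_add(&c->error, 1);

	/* atomic_fetch_sub() returns the old value; 1 means we were last. */
	if (atomic_fetch_sub(&c->stripes_pending, 1) != 1)
		return SKETCH_PENDING;

	return atomic_load(&c->error) > c->max_errors ?
		SKETCH_IOERR : SKETCH_OK;
}
```

The point of the design is that individual stripe failures are counted but not reported on their own. The original bio only sees an error when the count exceeds what the profile can absorb.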
2021-09-15 15:17:16 +08:00
static void submit_stripe_bio ( struct btrfs_io_context * bioc , struct bio * bio ,
2020-07-03 11:14:27 +03:00
u64 physical , struct btrfs_device * dev )
2012-10-19 16:50:56 -04:00
{
2021-09-15 15:17:16 +08:00
struct btrfs_fs_info * fs_info = bioc - > fs_info ;
2012-10-19 16:50:56 -04:00
2021-09-15 15:17:16 +08:00
bio - > bi_private = bioc ;
2021-09-15 15:17:18 +08:00
btrfs_bio ( bio ) - > device = dev ;
2012-10-19 16:50:56 -04:00
bio - > bi_end_io = btrfs_end_bio ;
2013-10-11 15:44:27 -07:00
bio - > bi_iter . bi_sector = physical > > 9 ;
2021-02-04 19:22:05 +09:00
/*
* For zone append writing, bi_sector must point to the beginning of the
* zone
*/
if ( bio_op ( bio ) = = REQ_OP_ZONE_APPEND ) {
if ( btrfs_dev_is_sequential ( dev , physical ) ) {
u64 zone_start = round_down ( physical , fs_info - > zone_size ) ;
bio - > bi_iter . bi_sector = zone_start > > SECTOR_SHIFT ;
} else {
bio - > bi_opf & = ~ REQ_OP_ZONE_APPEND ;
bio - > bi_opf | = REQ_OP_WRITE ;
}
}
2018-08-02 16:19:07 +09:00
btrfs_debug_in_rcu ( fs_info ,
" btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u " ,
2020-11-26 15:41:27 +01:00
bio_op ( bio ) , bio - > bi_opf , bio - > bi_iter . bi_sector ,
2019-11-28 15:31:46 +01:00
( unsigned long ) dev - > bdev - > bd_dev , rcu_str_deref ( dev - > name ) ,
dev - > devid , bio - > bi_iter . bi_size ) ;
2017-08-23 19:10:32 +02:00
bio_set_dev ( bio , dev - > bdev ) ;
2014-01-30 16:46:55 +08:00
2016-06-22 18:54:24 -04:00
btrfs_bio_counter_inc_noblocked ( fs_info ) ;
2014-01-30 16:46:55 +08:00
2019-07-10 12:28:14 -07:00
btrfsic_submit_bio ( bio ) ;
2012-10-19 16:50:56 -04:00
}
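The sector arithmetic in submit_stripe_bio() can be shown on its own. This is a sketch under assumptions: the sketch_ names are hypothetical, the sector shift of 9 (512-byte sectors) mirrors the shifts used above, and a plain modulo replaces the kernel's round_down() helper so the example does not depend on the zone size being a power of two.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT 9	/* 512-byte sectors, as in "physical >> 9" */

/* Sector of a byte offset on the device (the normal write/read case). */
static uint64_t sketch_sector(uint64_t physical)
{
	return physical >> SKETCH_SECTOR_SHIFT;
}

/*
 * Sector to place in a zone append bio: the start of the zone containing
 * 'physical', not the offset itself. The device picks the final location
 * inside the zone.
 */
static uint64_t sketch_zone_append_sector(uint64_t physical,
					  uint64_t zone_size)
{
	uint64_t zone_start;

	assert(zone_size != 0);
	zone_start = physical - (physical % zone_size);
	return zone_start >> SKETCH_SECTOR_SHIFT;
}
```

For a 256 MiB zone size, an offset 4 KiB into the second zone maps to sector 524296 for a regular write, but to sector 524288 (the zone start) for a zone append.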
2021-09-15 15:17:16 +08:00
static void bioc_error ( struct btrfs_io_context * bioc , struct bio * bio , u64 logical )
2012-10-19 16:50:56 -04:00
{
2021-09-15 15:17:16 +08:00
atomic_inc ( & bioc - > error ) ;
if ( atomic_dec_and_test ( & bioc - > stripes_pending ) ) {
2016-05-19 21:18:45 -04:00
/* Should be the original bio. */
2021-09-15 15:17:16 +08:00
WARN_ON ( bio ! = bioc - > orig_bio ) ;
2014-06-19 10:42:55 +08:00
2021-09-15 15:17:18 +08:00
btrfs_bio ( bio ) - > mirror_num = bioc - > mirror_num ;
2013-10-11 15:44:27 -07:00
bio - > bi_iter . bi_sector = logical > > 9 ;
2021-09-15 15:17:16 +08:00
if ( atomic_read ( & bioc - > error ) > bioc - > max_errors )
2017-10-14 08:34:02 +08:00
bio - > bi_status = BLK_STS_IOERR ;
else
bio - > bi_status = BLK_STS_OK ;
2021-09-15 15:17:16 +08:00
btrfs_end_bioc ( bioc , bio ) ;
2012-10-19 16:50:56 -04:00
}
}
2017-08-22 23:45:59 -07:00
blk_status_t btrfs_map_bio ( struct btrfs_fs_info * fs_info , struct bio * bio ,
2019-07-10 12:28:14 -07:00
int mirror_num )
2008-03-24 15:01:56 -04:00
{
struct btrfs_device * dev ;
2008-04-03 16:29:03 -04:00
struct bio * first_bio = bio ;
2020-11-26 15:41:27 +01:00
u64 logical = bio - > bi_iter . bi_sector < < 9 ;
2008-03-24 15:01:56 -04:00
u64 length = 0 ;
u64 map_length ;
int ret ;
2015-02-12 15:42:16 +08:00
int dev_nr ;
int total_devs ;
2021-09-15 15:17:16 +08:00
struct btrfs_io_context * bioc = NULL ;
2008-03-24 15:01:56 -04:00
2013-10-11 15:44:27 -07:00
length = bio - > bi_iter . bi_size ;
2008-03-24 15:01:56 -04:00
map_length = length ;
2008-04-09 16:28:12 -04:00
2016-06-22 18:54:23 -04:00
btrfs_bio_counter_inc_blocked ( fs_info ) ;
2017-09-19 17:50:09 -06:00
ret = __btrfs_map_block ( fs_info , btrfs_op ( bio ) , logical ,
2021-09-15 15:17:16 +08:00
& map_length , & bioc , mirror_num , 1 ) ;
2014-01-30 16:46:55 +08:00
if ( ret ) {
2016-06-22 18:54:23 -04:00
btrfs_bio_counter_dec ( fs_info ) ;
2017-08-22 23:45:59 -07:00
return errno_to_blk_status ( ret ) ;
2014-01-30 16:46:55 +08:00
}
2008-04-09 16:28:12 -04:00
2021-09-15 15:17:16 +08:00
total_devs = bioc - > num_stripes ;
bioc - > orig_bio = first_bio ;
bioc - > private = first_bio - > bi_private ;
bioc - > end_io = first_bio - > bi_end_io ;
atomic_set ( & bioc - > stripes_pending , bioc - > num_stripes ) ;
2013-01-29 18:40:14 -05:00
2021-09-15 15:17:16 +08:00
if ( ( bioc - > map_type & BTRFS_BLOCK_GROUP_RAID56_MASK ) & &
2021-02-04 19:21:59 +09:00
( ( btrfs_op ( bio ) = = BTRFS_MAP_WRITE ) | | ( mirror_num > 1 ) ) ) {
2013-01-29 18:40:14 -05:00
/* In this case, map_length has been set to the length of
a single stripe, not the whole write */
2021-02-04 19:21:59 +09:00
if ( btrfs_op ( bio ) = = BTRFS_MAP_WRITE ) {
2021-09-23 14:00:09 +08:00
ret = raid56_parity_write ( bio , bioc , map_length ) ;
2013-01-29 18:40:14 -05:00
} else {
2021-09-23 14:00:09 +08:00
ret = raid56_parity_recover ( bio , bioc , map_length ,
mirror_num , 1 ) ;
2013-01-29 18:40:14 -05:00
}
2014-11-25 16:39:28 +08:00
2016-06-22 18:54:23 -04:00
btrfs_bio_counter_dec ( fs_info ) ;
2017-08-22 23:45:59 -07:00
return errno_to_blk_status ( ret ) ;
2013-01-29 18:40:14 -05:00
}
2008-04-09 16:28:12 -04:00
if ( map_length < length ) {
2016-06-22 18:54:23 -04:00
btrfs_crit ( fs_info ,
2016-09-20 10:05:00 -04:00
" mapping failed logical %llu bio len %llu len %llu " ,
logical , length , map_length ) ;
2008-04-09 16:28:12 -04:00
BUG ( ) ;
}
2011-08-04 17:15:33 +02:00
2015-02-12 15:42:16 +08:00
for ( dev_nr = 0 ; dev_nr < total_devs ; dev_nr + + ) {
2021-09-15 15:17:16 +08:00
dev = bioc - > stripes [ dev_nr ] . dev ;
2018-11-08 16:16:38 +02:00
if ( ! dev | | ! dev - > bdev | | test_bit ( BTRFS_DEV_STATE_MISSING ,
& dev - > dev_state ) | |
2021-02-04 19:21:59 +09:00
( btrfs_op ( first_bio ) = = BTRFS_MAP_WRITE & &
2017-12-04 12:54:52 +08:00
! test_bit ( BTRFS_DEV_STATE_WRITEABLE , & dev - > dev_state ) ) ) {
2021-09-15 15:17:16 +08:00
bioc_error ( bioc , first_bio , logical ) ;
2012-10-19 16:50:56 -04:00
continue ;
}
2017-06-02 17:38:30 +02:00
if ( dev_nr < total_devs - 1 )
2017-06-02 17:48:13 +02:00
bio = btrfs_bio_clone ( first_bio ) ;
2017-06-02 17:38:30 +02:00
else
2011-08-04 17:15:33 +02:00
bio = first_bio ;
2012-10-19 16:50:56 -04:00
2021-09-15 15:17:16 +08:00
submit_stripe_bio ( bioc , bio , bioc - > stripes [ dev_nr ] . physical , dev ) ;
2008-04-03 16:29:03 -04:00
}
2016-06-22 18:54:23 -04:00
btrfs_bio_counter_dec ( fs_info ) ;
2017-08-22 23:45:59 -07:00
return BLK_STS_OK ;
2008-03-24 15:01:56 -04:00
}
2021-10-05 16:12:42 -04:00
static bool dev_args_match_fs_devices ( const struct btrfs_dev_lookup_args * args ,
const struct btrfs_fs_devices * fs_devices )
{
if ( args - > fsid = = NULL )
return true ;
if ( memcmp ( fs_devices - > metadata_uuid , args - > fsid , BTRFS_FSID_SIZE ) = = 0 )
return true ;
return false ;
}
static bool dev_args_match_device ( const struct btrfs_dev_lookup_args * args ,
const struct btrfs_device * device )
{
ASSERT ( ( args - > devid ! = ( u64 ) - 1 ) | | args - > missing ) ;
if ( ( args - > devid ! = ( u64 ) - 1 ) & & device - > devid ! = args - > devid )
return false ;
if ( args - > uuid & & memcmp ( device - > uuid , args - > uuid , BTRFS_UUID_SIZE ) ! = 0 )
return false ;
if ( ! args - > missing )
return true ;
if ( test_bit ( BTRFS_DEV_STATE_IN_FS_METADATA , & device - > dev_state ) & &
! device - > bdev )
return true ;
return false ;
}
2019-01-19 14:48:55 +08:00
/*
* Find a device specified by @devid or @uuid in the list of @fs_devices, or
* return NULL.
*
* If devid and uuid are both specified, the match must be exact, otherwise
* only devid is used.
*/
2021-10-05 16:12:42 -04:00
struct btrfs_device * btrfs_find_device ( const struct btrfs_fs_devices * fs_devices ,
const struct btrfs_dev_lookup_args * args )
2008-03-24 15:01:56 -04:00
{
2008-11-17 21:11:30 -05:00
struct btrfs_device * device ;
2020-07-16 10:25:33 +03:00
struct btrfs_fs_devices * seed_devs ;
2021-10-05 16:12:42 -04:00
if ( dev_args_match_fs_devices ( args , fs_devices ) ) {
2020-07-16 10:25:33 +03:00
list_for_each_entry ( device , & fs_devices - > devices , dev_list ) {
2021-10-05 16:12:42 -04:00
if ( dev_args_match_device ( args , device ) )
2020-07-16 10:25:33 +03:00
return device ;
}
}
2008-11-17 21:11:30 -05:00
2020-07-16 10:25:33 +03:00
list_for_each_entry ( seed_devs , & fs_devices - > seed_list , seed_list ) {
2021-10-05 16:12:42 -04:00
if ( ! dev_args_match_fs_devices ( args , seed_devs ) )
continue ;
list_for_each_entry ( device , & seed_devs - > devices , dev_list ) {
if ( dev_args_match_device ( args , device ) )
return device ;
2008-11-17 21:11:30 -05:00
}
}
2020-07-16 10:25:33 +03:00
2008-11-17 21:11:30 -05:00
return NULL ;
2008-03-24 15:01:56 -04:00
}
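The per-device matching rules used by the lookup can be restated as a small standalone predicate. This is a sketch, not the kernel API: the sketch_ types are hypothetical, the two booleans stand in for the BTRFS_DEV_STATE_IN_FS_METADATA bit and a non-NULL bdev, and the 16-byte UUID size is an assumption for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_UUID_SIZE 16		/* assumed UUID length */
#define SKETCH_NO_DEVID UINT64_MAX	/* "(u64)-1": devid not specified */

struct sketch_dev {
	uint64_t devid;
	uint8_t uuid[SKETCH_UUID_SIZE];
	bool in_fs_metadata;	/* device is recorded in the fs metadata */
	bool has_bdev;		/* a block device is actually attached */
};

struct sketch_args {
	uint64_t devid;		/* SKETCH_NO_DEVID if unused */
	const uint8_t *uuid;	/* NULL if unused */
	bool missing;		/* look for a missing device */
};

static bool sketch_match(const struct sketch_args *a,
			 const struct sketch_dev *d)
{
	/* At least one of devid or "missing" must be given. */
	assert(a->devid != SKETCH_NO_DEVID || a->missing);

	if (a->devid != SKETCH_NO_DEVID && d->devid != a->devid)
		return false;
	if (a->uuid && memcmp(d->uuid, a->uuid, SKETCH_UUID_SIZE) != 0)
		return false;
	if (!a->missing)
		return true;
	/* "Missing" means: known to the filesystem but with no bdev. */
	return d->in_fs_metadata && !d->has_bdev;
}
```

A lookup by devid alone ignores the UUID. Supplying both narrows the match, and a lookup with missing set accepts only devices that are known to the filesystem but have no block device attached.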
2016-06-22 18:54:24 -04:00
static struct btrfs_device * add_missing_dev ( struct btrfs_fs_devices * fs_devices ,
2008-05-13 13:46:40 -04:00
u64 devid , u8 * dev_uuid )
{
struct btrfs_device * device ;
2020-08-31 10:52:42 -04:00
unsigned int nofs_flag ;
2008-05-13 13:46:40 -04:00
2020-08-31 10:52:42 -04:00
/*
* We call this under the chunk_mutex, so we want to use NOFS for this
* allocation, however we don't want to change btrfs_alloc_device() to
* always do NOFS because we use it in a lot of other GFP_KERNEL safe
* places.
*/
nofs_flag = memalloc_nofs_save ( ) ;
2013-08-23 13:20:17 +03:00
device = btrfs_alloc_device ( NULL , & devid , dev_uuid ) ;
2020-08-31 10:52:42 -04:00
memalloc_nofs_restore ( nofs_flag ) ;
2013-08-23 13:20:17 +03:00
if ( IS_ERR ( device ) )
2017-10-11 12:46:18 +08:00
return device ;
2013-08-23 13:20:17 +03:00
list_add ( & device - > dev_list , & fs_devices - > devices ) ;
2008-12-12 10:03:26 -05:00
device - > fs_devices = fs_devices ;
2008-05-13 13:46:40 -04:00
fs_devices - > num_devices + + ;
2013-08-23 13:20:17 +03:00
2017-12-04 12:54:54 +08:00
set_bit ( BTRFS_DEV_STATE_MISSING , & device - > dev_state ) ;
2010-12-13 14:56:23 -05:00
fs_devices - > missing_devices + + ;
2013-08-23 13:20:17 +03:00
2008-05-13 13:46:40 -04:00
return device ;
}
2013-08-23 13:20:17 +03:00
/**
* btrfs_alloc_device - allocate struct btrfs_device
* @fs_info: used only for generating a new devid, can be NULL if
* devid is provided (i.e. @devid != NULL).
* @devid: a pointer to devid for this device. If NULL a new devid
* is generated.
* @uuid: a pointer to UUID for this device. If NULL a new UUID
* is generated.
*
* Return: a pointer to a new &struct btrfs_device on success; ERR_PTR()
2017-10-30 18:10:25 +01:00
* on error. Returned struct is not linked onto any lists and must be
2018-03-20 15:47:33 +01:00
* destroyed with btrfs_free_device.
2013-08-23 13:20:17 +03:00
*/
struct btrfs_device * btrfs_alloc_device ( struct btrfs_fs_info * fs_info ,
const u64 * devid ,
const u8 * uuid )
{
struct btrfs_device * dev ;
u64 tmp ;
2013-10-31 10:30:08 +05:30
if ( WARN_ON ( ! devid & & ! fs_info ) )
2013-08-23 13:20:17 +03:00
return ERR_PTR ( - EINVAL ) ;
2021-07-26 14:15:21 +02:00
dev = kzalloc ( sizeof ( * dev ) , GFP_KERNEL ) ;
if ( ! dev )
return ERR_PTR ( - ENOMEM ) ;
/*
* Preallocate a bio that's always going to be used for flushing device
* barriers and matches the device lifespan
*/
dev - > flush_bio = bio_kmalloc ( GFP_KERNEL , 0 ) ;
if ( ! dev - > flush_bio ) {
kfree ( dev ) ;
return ERR_PTR ( - ENOMEM ) ;
}
INIT_LIST_HEAD ( & dev - > dev_list ) ;
INIT_LIST_HEAD ( & dev - > dev_alloc_list ) ;
INIT_LIST_HEAD ( & dev - > post_commit_list ) ;
atomic_set ( & dev - > reada_in_flight , 0 ) ;
atomic_set ( & dev - > dev_stats_ccnt , 0 ) ;
btrfs_device_data_ordered_init ( dev ) ;
INIT_RADIX_TREE ( & dev - > reada_zones , GFP_NOFS & ~ __GFP_DIRECT_RECLAIM ) ;
INIT_RADIX_TREE ( & dev - > reada_extents , GFP_NOFS & ~ __GFP_DIRECT_RECLAIM ) ;
extent_io_tree_init ( fs_info , & dev - > alloc_state ,
IO_TREE_DEVICE_ALLOC_STATE , NULL ) ;
2013-08-23 13:20:17 +03:00
if ( devid )
tmp = * devid ;
else {
int ret ;
ret = find_next_devid ( fs_info , & tmp ) ;
if ( ret ) {
2018-03-20 15:47:33 +01:00
btrfs_free_device ( dev ) ;
2013-08-23 13:20:17 +03:00
return ERR_PTR ( ret ) ;
}
}
dev - > devid = tmp ;
if ( uuid )
memcpy ( dev - > uuid , uuid , BTRFS_UUID_SIZE ) ;
else
generate_random_uuid ( dev - > uuid ) ;
return dev ;
}
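The error-return convention used by btrfs_alloc_device() above (a valid pointer on success, an errno value encoded in the pointer on failure) can be illustrated in userspace. This is a minimal sketch, not the kernel implementation: every `sketch_` name is hypothetical, the 4095 bound is assumed to match the kernel's MAX_ERRNO, and the fallback devid of 1 merely stands in for find_next_devid().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed to match the kernel's MAX_ERRNO; hypothetical name. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top SKETCH_MAX_ERRNO addresses. */
static inline int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)(-SKETCH_MAX_ERRNO);
}

/* Hypothetical stand-in for struct btrfs_device. */
struct sketch_device {
	uint64_t devid;
};

/*
 * Mirrors the argument check and allocation-failure paths only:
 * a devid must be given, or a context must exist to generate one.
 */
static struct sketch_device *sketch_alloc_device(const uint64_t *devid,
						 int have_fs_info)
{
	struct sketch_device *dev;

	if (!devid && !have_fs_info)
		return sketch_err_ptr(-EINVAL);

	dev = calloc(1, sizeof(*dev));
	if (!dev)
		return sketch_err_ptr(-ENOMEM);

	/* Placeholder for find_next_devid(): assume 1 is the next free id. */
	dev->devid = devid ? *devid : 1;
	return dev;
}
```

A caller tests the result with sketch_is_err() before dereferencing it, the same way read_one_chunk() and read_one_dev() test the result of add_missing_dev() with IS_ERR().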
2017-10-09 11:07:45 +08:00
static void btrfs_report_missing_device ( struct btrfs_fs_info * fs_info ,
2017-10-09 11:07:46 +08:00
u64 devid , u8 * uuid , bool error )
2017-10-09 11:07:45 +08:00
{
2017-10-09 11:07:46 +08:00
if ( error )
btrfs_err_rl ( fs_info , " devid %llu uuid %pU is missing " ,
devid , uuid ) ;
else
btrfs_warn_rl ( fs_info , " devid %llu uuid %pU is missing " ,
devid , uuid ) ;
2017-10-09 11:07:45 +08:00
}
2019-03-25 14:31:25 +02:00
static u64 calc_stripe_length ( u64 type , u64 chunk_len , int num_stripes )
{
2021-07-26 14:15:24 +02:00
const int data_stripes = calc_data_stripes ( type , num_stripes ) ;
2019-05-14 01:59:54 +02:00
2019-03-25 14:31:25 +02:00
return div_u64 ( chunk_len , data_stripes ) ;
}
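A worked example of the division performed by calc_stripe_length() may help. The sketch below assumes that the number of data stripes is (num_stripes - nparity) / ncopies for a given RAID profile; calc_data_stripes() is not shown in this section, so that formula, the profile table and every `sketch_` name are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified profile description; not btrfs_raid_array. */
struct sketch_profile {
	int ncopies;	/* copies kept of each data stripe */
	int nparity;	/* parity stripes per chunk */
};

static const struct sketch_profile SKETCH_RAID0  = { .ncopies = 1, .nparity = 0 };
static const struct sketch_profile SKETCH_RAID1  = { .ncopies = 2, .nparity = 0 };
static const struct sketch_profile SKETCH_RAID10 = { .ncopies = 2, .nparity = 0 };
static const struct sketch_profile SKETCH_RAID5  = { .ncopies = 1, .nparity = 1 };
static const struct sketch_profile SKETCH_RAID6  = { .ncopies = 1, .nparity = 2 };

/* Assumed formula: stripes that carry distinct data. */
static int sketch_data_stripes(struct sketch_profile p, int num_stripes)
{
	return (num_stripes - p.nparity) / p.ncopies;
}

/* Per-device stripe length: chunk length divided by the data stripes. */
static uint64_t sketch_stripe_length(struct sketch_profile p,
				     uint64_t chunk_len, int num_stripes)
{
	return chunk_len / (uint64_t)sketch_data_stripes(p, num_stripes);
}
```

Under these assumptions a 2 GiB RAID5 chunk on three devices has two data stripes, so each device holds a 1 GiB stripe, while a RAID1 chunk on two devices has one data stripe and each device holds the full chunk length.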
2021-02-25 09:18:14 +08:00
# if BITS_PER_LONG == 32
/*
* Due to page cache limit , metadata beyond BTRFS_32BIT_MAX_FILE_SIZE
* can ' t be accessed on 32 bit systems .
*
* This function does a mount-time check and rejects the fs if it already
* has a metadata chunk beyond that limit.
*/
static int check_32bit_meta_chunk ( struct btrfs_fs_info * fs_info ,
u64 logical , u64 length , u64 type )
{
if ( ! ( type & BTRFS_BLOCK_GROUP_METADATA ) )
return 0 ;
if ( logical + length < MAX_LFS_FILESIZE )
return 0 ;
btrfs_err_32bit_limit ( fs_info ) ;
return - EOVERFLOW ;
}
/*
* This is to give early warning for any metadata chunk reaching
* BTRFS_32BIT_EARLY_WARN_THRESHOLD .
* Although we can still access the metadata , it ' s not going to be possible
* once the limit is reached .
*/
static void warn_32bit_meta_chunk ( struct btrfs_fs_info * fs_info ,
u64 logical , u64 length , u64 type )
{
if ( ! ( type & BTRFS_BLOCK_GROUP_METADATA ) )
return ;
if ( logical + length < BTRFS_32BIT_EARLY_WARN_THRESHOLD )
return ;
btrfs_warn_32bit_limit ( fs_info ) ;
}
# endif
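The two 32-bit helpers above apply the same test at two thresholds: a hard limit that rejects the mount and a lower one that only warns. A combined userspace sketch follows. The concrete values (a 16 TiB limit, as would result from a 32-bit page index with 4 KiB pages, and a warning threshold at 5/8 of it) are assumptions for illustration; the real constants are defined elsewhere in the kernel, and every `sketch_` name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Block group type bits as laid out in the on-disk format. */
#define SKETCH_BG_DATA		(UINT64_C(1) << 0)
#define SKETCH_BG_METADATA	(UINT64_C(1) << 2)

/* Assumed values, see the lead-in; not taken from the kernel headers. */
#define SKETCH_LIMIT		(UINT64_C(16) << 40)		/* 16 TiB */
#define SKETCH_WARN_THRESHOLD	(SKETCH_LIMIT / 8 * 5)		/* 10 TiB */

enum sketch_verdict { SKETCH_OK, SKETCH_WARN, SKETCH_REJECT };

/*
 * Only metadata chunks are checked. A chunk whose end reaches the limit
 * is rejected; one whose end reaches the warning threshold is flagged.
 */
static enum sketch_verdict sketch_check_meta_chunk(uint64_t logical,
						   uint64_t length,
						   uint64_t type)
{
	const uint64_t end = logical + length;

	if (!(type & SKETCH_BG_METADATA))
		return SKETCH_OK;
	if (end >= SKETCH_LIMIT)
		return SKETCH_REJECT;
	if (end >= SKETCH_WARN_THRESHOLD)
		return SKETCH_WARN;
	return SKETCH_OK;
}
```

As in read_one_chunk(), the rejecting check runs first, so a chunk past the hard limit never produces a warning on top of the error.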
2019-03-20 16:43:07 +01:00
static int read_one_chunk ( struct btrfs_key * key , struct extent_buffer * leaf ,
2016-06-03 12:05:15 -07:00
struct btrfs_chunk * chunk )
{
2021-10-05 16:12:42 -04:00
BTRFS_DEV_LOOKUP_ARGS ( args ) ;
2019-03-20 16:43:07 +01:00
struct btrfs_fs_info * fs_info = leaf - > fs_info ;
2019-05-17 11:43:17 +02:00
struct extent_map_tree * map_tree = & fs_info - > mapping_tree ;
2016-06-03 12:05:15 -07:00
struct map_lookup * map ;
struct extent_map * em ;
u64 logical ;
u64 length ;
u64 devid ;
2021-02-25 09:18:14 +08:00
u64 type ;
2016-06-03 12:05:15 -07:00
u8 uuid [ BTRFS_UUID_SIZE ] ;
int num_stripes ;
int ret ;
int i ;
logical = key - > offset ;
length = btrfs_chunk_length ( leaf , chunk ) ;
2021-02-25 09:18:14 +08:00
type = btrfs_chunk_type ( leaf , chunk ) ;
2016-06-03 12:05:15 -07:00
num_stripes = btrfs_chunk_num_stripes ( leaf , chunk ) ;
2021-02-25 09:18:14 +08:00
# if BITS_PER_LONG == 32
ret = check_32bit_meta_chunk ( fs_info , logical , length , type ) ;
if ( ret < 0 )
return ret ;
warn_32bit_meta_chunk ( fs_info , logical , length , type ) ;
# endif
2019-03-20 13:42:33 +08:00
/*
* Only need to verify chunk item if we ' re reading from sys chunk array ,
* as chunk item in tree block is already verified by tree - checker .
*/
if ( leaf - > start = = BTRFS_SUPER_INFO_OFFSET ) {
2019-03-20 16:40:48 +01:00
ret = btrfs_check_chunk_valid ( leaf , chunk , logical ) ;
2019-03-20 13:42:33 +08:00
if ( ret )
return ret ;
}
2008-05-07 11:43:44 -04:00
2019-05-17 11:43:17 +02:00
read_lock ( & map_tree - > lock ) ;
em = lookup_extent_mapping ( map_tree , logical , 1 ) ;
read_unlock ( & map_tree - > lock ) ;
2008-03-24 15:01:56 -04:00
/* already mapped? */
if ( em & & em - > start < = logical & & em - > start + em - > len > logical ) {
free_extent_map ( em ) ;
return 0 ;
} else if ( em ) {
free_extent_map ( em ) ;
}
2011-04-21 00:48:27 +02:00
em = alloc_extent_map ( ) ;
2008-03-24 15:01:56 -04:00
if ( ! em )
return - ENOMEM ;
2008-03-25 16:50:33 -04:00
map = kmalloc ( map_lookup_size ( num_stripes ) , GFP_NOFS ) ;
2008-03-24 15:01:56 -04:00
if ( ! map ) {
free_extent_map ( em ) ;
return - ENOMEM ;
}
2014-06-19 10:42:52 +08:00
set_bit ( EXTENT_FLAG_FS_MAPPING , & em - > flags ) ;
2015-06-03 10:55:48 -04:00
em - > map_lookup = map ;
2008-03-24 15:01:56 -04:00
em - > start = logical ;
em - > len = length ;
2012-10-11 16:54:30 -04:00
em - > orig_start = 0 ;
2008-03-24 15:01:56 -04:00
em - > block_start = 0 ;
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
em - > block_len = em - > len ;
2008-03-24 15:01:56 -04:00
2008-03-25 16:50:33 -04:00
map - > num_stripes = num_stripes ;
map - > io_width = btrfs_chunk_io_width ( leaf , chunk ) ;
map - > io_align = btrfs_chunk_io_align ( leaf , chunk ) ;
map - > stripe_len = btrfs_chunk_stripe_len ( leaf , chunk ) ;
2021-02-25 09:18:14 +08:00
map - > type = type ;
2008-04-16 10:49:51 -04:00
map - > sub_stripes = btrfs_chunk_sub_stripes ( leaf , chunk ) ;
2018-08-01 10:37:19 +08:00
map - > verified_stripes = 0 ;
2021-02-25 09:18:14 +08:00
em - > orig_block_len = calc_stripe_length ( type , em - > len ,
2019-03-25 14:31:25 +02:00
map - > num_stripes ) ;
2008-03-25 16:50:33 -04:00
for ( i = 0 ; i < num_stripes ; i + + ) {
map - > stripes [ i ] . physical =
btrfs_stripe_offset_nr ( leaf , chunk , i ) ;
devid = btrfs_stripe_devid_nr ( leaf , chunk , i ) ;
2021-10-05 16:12:42 -04:00
args . devid = devid ;
2008-04-18 10:29:38 -04:00
read_extent_buffer ( leaf , uuid , ( unsigned long )
btrfs_stripe_dev_uuid_nr ( chunk , i ) ,
BTRFS_UUID_SIZE ) ;
2021-10-05 16:12:42 -04:00
args . uuid = uuid ;
map - > stripes [ i ] . dev = btrfs_find_device ( fs_info - > fs_devices , & args ) ;
2016-06-09 21:38:35 -04:00
if ( ! map - > stripes [ i ] . dev & &
2016-06-22 18:54:23 -04:00
! btrfs_test_opt ( fs_info , DEGRADED ) ) {
2008-03-25 16:50:33 -04:00
free_extent_map ( em ) ;
2017-10-09 11:07:46 +08:00
btrfs_report_missing_device ( fs_info , devid , uuid , true ) ;
2017-10-09 11:07:44 +08:00
return - ENOENT ;
2008-03-25 16:50:33 -04:00
}
2008-05-13 13:46:40 -04:00
if ( ! map - > stripes [ i ] . dev ) {
map - > stripes [ i ] . dev =
2016-06-22 18:54:24 -04:00
add_missing_dev ( fs_info - > fs_devices , devid ,
uuid ) ;
2017-10-11 12:46:18 +08:00
if ( IS_ERR ( map - > stripes [ i ] . dev ) ) {
2008-05-13 13:46:40 -04:00
free_extent_map ( em ) ;
2017-10-11 12:46:18 +08:00
btrfs_err ( fs_info ,
" failed to init missing dev %llu: %ld " ,
devid , PTR_ERR ( map - > stripes [ i ] . dev ) ) ;
return PTR_ERR ( map - > stripes [ i ] . dev ) ;
2008-05-13 13:46:40 -04:00
}
2017-10-09 11:07:46 +08:00
btrfs_report_missing_device ( fs_info , devid , uuid , false ) ;
2008-05-13 13:46:40 -04:00
}
2017-12-04 12:54:53 +08:00
set_bit ( BTRFS_DEV_STATE_IN_FS_METADATA ,
& ( map - > stripes [ i ] . dev - > dev_state ) ) ;
2008-03-24 15:01:56 -04:00
}
2019-05-17 11:43:17 +02:00
write_lock ( & map_tree - > lock ) ;
ret = add_extent_mapping ( map_tree , em , 0 ) ;
write_unlock ( & map_tree - > lock ) ;
2018-08-01 10:37:20 +08:00
if ( ret < 0 ) {
btrfs_err ( fs_info ,
" failed to add chunk map, start=%llu len=%llu: %d " ,
em - > start , em - > len , ret ) ;
}
2008-03-24 15:01:56 -04:00
free_extent_map ( em ) ;
2018-08-01 10:37:20 +08:00
return ret ;
2008-03-24 15:01:56 -04:00
}
2012-03-01 14:56:26 +01:00
static void fill_device_from_item ( struct extent_buffer * leaf ,
2008-03-24 15:01:56 -04:00
struct btrfs_dev_item * dev_item ,
struct btrfs_device * device )
{
unsigned long ptr ;
device - > devid = btrfs_device_id ( leaf , dev_item ) ;
2009-04-27 07:29:03 -04:00
device - > disk_total_bytes = btrfs_device_total_bytes ( leaf , dev_item ) ;
device - > total_bytes = device - > disk_total_bytes ;
2014-09-03 21:35:33 +08:00
device - > commit_total_bytes = device - > disk_total_bytes ;
2008-03-24 15:01:56 -04:00
device - > bytes_used = btrfs_device_bytes_used ( leaf , dev_item ) ;
2014-09-03 21:35:34 +08:00
device - > commit_bytes_used = device - > bytes_used ;
2008-03-24 15:01:56 -04:00
device - > type = btrfs_device_type ( leaf , dev_item ) ;
device - > io_align = btrfs_device_io_align ( leaf , dev_item ) ;
device - > io_width = btrfs_device_io_width ( leaf , dev_item ) ;
device - > sector_size = btrfs_device_sector_size ( leaf , dev_item ) ;
2012-11-06 13:15:27 +01:00
WARN_ON ( device - > devid = = BTRFS_DEV_REPLACE_DEVID ) ;
2017-12-04 12:54:55 +08:00
clear_bit ( BTRFS_DEV_STATE_REPLACE_TGT , & device - > dev_state ) ;
2008-03-24 15:01:56 -04:00
2013-08-20 13:20:11 +02:00
ptr = btrfs_device_uuid ( dev_item ) ;
2008-04-15 15:41:47 -04:00
read_extent_buffer ( leaf , device - > uuid , ptr , BTRFS_UUID_SIZE ) ;
2008-03-24 15:01:56 -04:00
}
2016-06-22 18:54:24 -04:00
static struct btrfs_fs_devices * open_seed_devices ( struct btrfs_fs_info * fs_info ,
2014-09-03 21:35:46 +08:00
u8 * fsid )
2008-11-17 21:11:30 -05:00
{
struct btrfs_fs_devices * fs_devices ;
int ret ;
2018-03-16 02:21:22 +01:00
lockdep_assert_held ( & uuid_mutex ) ;
2017-06-14 02:48:07 +02:00
ASSERT ( fsid ) ;
2008-11-17 21:11:30 -05:00
2020-08-12 17:04:36 +03:00
/* This will match only for multi-device seed fs */
2020-07-16 10:25:33 +03:00
list_for_each_entry ( fs_devices , & fs_info - > fs_devices - > seed_list , seed_list )
2017-07-29 17:50:09 +08:00
if ( ! memcmp ( fs_devices - > fsid , fsid , BTRFS_FSID_SIZE ) )
2014-09-03 21:35:46 +08:00
return fs_devices ;
2008-11-17 21:11:30 -05:00
2018-10-30 16:43:23 +02:00
fs_devices = find_fsid ( fsid , NULL ) ;
2008-11-17 21:11:30 -05:00
if ( ! fs_devices ) {
2016-06-22 18:54:23 -04:00
if ( ! btrfs_test_opt ( fs_info , DEGRADED ) )
2014-09-03 21:35:46 +08:00
return ERR_PTR ( - ENOENT ) ;
2018-10-30 16:43:23 +02:00
fs_devices = alloc_fs_devices ( fsid , NULL ) ;
2014-09-03 21:35:46 +08:00
if ( IS_ERR ( fs_devices ) )
return fs_devices ;
2019-11-13 11:27:27 +01:00
fs_devices - > seeding = true ;
2014-09-03 21:35:46 +08:00
fs_devices - > opened = 1 ;
return fs_devices ;
2008-11-17 21:11:30 -05:00
}
2008-12-12 10:03:26 -05:00
2020-08-12 17:04:36 +03:00
/*
* Upon first call for a seed fs fsid , just create a private copy of the
* respective fs_devices and anchor it at fs_info - > fs_devices - > seed_list
*/
2008-12-12 10:03:26 -05:00
fs_devices = clone_fs_devices ( fs_devices ) ;
2014-09-03 21:35:46 +08:00
if ( IS_ERR ( fs_devices ) )
return fs_devices ;
2008-11-17 21:11:30 -05:00
2018-04-12 10:29:28 +08:00
ret = open_fs_devices ( fs_devices , FMODE_READ , fs_info - > bdev_holder ) ;
2012-04-14 11:24:33 +02:00
if ( ret ) {
free_fs_devices ( fs_devices ) ;
2020-09-05 01:34:35 +08:00
return ERR_PTR ( ret ) ;
2012-04-14 11:24:33 +02:00
}
2008-11-17 21:11:30 -05:00
if ( ! fs_devices - > seeding ) {
2018-04-12 10:29:27 +08:00
close_fs_devices ( fs_devices ) ;
2008-12-12 10:03:26 -05:00
free_fs_devices ( fs_devices ) ;
2020-09-05 01:34:35 +08:00
return ERR_PTR ( - EINVAL ) ;
2008-11-17 21:11:30 -05:00
}
2020-07-16 10:25:33 +03:00
list_add ( & fs_devices - > seed_list , & fs_info - > fs_devices - > seed_list ) ;
2020-09-05 01:34:35 +08:00
2014-09-03 21:35:46 +08:00
return fs_devices ;
2008-11-17 21:11:30 -05:00
}
2019-03-20 16:45:15 +01:00
static int read_one_dev ( struct extent_buffer * leaf ,
2008-03-24 15:01:56 -04:00
struct btrfs_dev_item * dev_item )
{
2021-10-05 16:12:42 -04:00
BTRFS_DEV_LOOKUP_ARGS ( args ) ;
2019-03-20 16:45:15 +01:00
struct btrfs_fs_info * fs_info = leaf - > fs_info ;
2016-06-22 18:54:23 -04:00
struct btrfs_fs_devices * fs_devices = fs_info - > fs_devices ;
2008-03-24 15:01:56 -04:00
struct btrfs_device * device ;
u64 devid ;
int ret ;
2017-07-29 17:50:09 +08:00
u8 fs_uuid [ BTRFS_FSID_SIZE ] ;
2008-04-18 10:29:38 -04:00
u8 dev_uuid [ BTRFS_UUID_SIZE ] ;
2021-10-05 16:12:42 -04:00
devid = args . devid = btrfs_device_id ( leaf , dev_item ) ;
2013-08-20 13:20:11 +02:00
read_extent_buffer ( leaf , dev_uuid , btrfs_device_uuid ( dev_item ) ,
2008-04-18 10:29:38 -04:00
BTRFS_UUID_SIZE ) ;
2013-08-20 13:20:12 +02:00
read_extent_buffer ( leaf , fs_uuid , btrfs_device_fsid ( dev_item ) ,
2017-07-29 17:50:09 +08:00
BTRFS_FSID_SIZE ) ;
2021-10-05 16:12:42 -04:00
args . uuid = dev_uuid ;
args . fsid = fs_uuid ;
2008-11-17 21:11:30 -05:00
2018-10-30 16:43:24 +02:00
if ( memcmp ( fs_uuid , fs_devices - > metadata_uuid , BTRFS_FSID_SIZE ) ) {
2016-06-22 18:54:24 -04:00
fs_devices = open_seed_devices ( fs_info , fs_uuid ) ;
2014-09-03 21:35:46 +08:00
if ( IS_ERR ( fs_devices ) )
return PTR_ERR ( fs_devices ) ;
2008-11-17 21:11:30 -05:00
}
2021-10-05 16:12:42 -04:00
device = btrfs_find_device ( fs_info - > fs_devices , & args ) ;
2014-09-03 21:35:46 +08:00
if ( ! device ) {
2017-03-09 09:34:42 +08:00
if ( ! btrfs_test_opt ( fs_info , DEGRADED ) ) {
2017-10-09 11:07:46 +08:00
btrfs_report_missing_device ( fs_info , devid ,
dev_uuid , true ) ;
2017-10-09 11:07:44 +08:00
return - ENOENT ;
2017-03-09 09:34:42 +08:00
}
2008-11-17 21:11:30 -05:00
2016-06-22 18:54:24 -04:00
device = add_missing_dev ( fs_devices , devid , dev_uuid ) ;
2017-10-11 12:46:18 +08:00
if ( IS_ERR ( device ) ) {
btrfs_err ( fs_info ,
" failed to add missing dev %llu: %ld " ,
devid , PTR_ERR ( device ) ) ;
return PTR_ERR ( device ) ;
}
2017-10-09 11:07:46 +08:00
btrfs_report_missing_device ( fs_info , devid , dev_uuid , false ) ;
2014-09-03 21:35:46 +08:00
} else {
2017-03-09 09:34:42 +08:00
if ( ! device - > bdev ) {
2017-10-09 11:07:46 +08:00
if ( ! btrfs_test_opt ( fs_info , DEGRADED ) ) {
btrfs_report_missing_device ( fs_info ,
devid , dev_uuid , true ) ;
2017-10-09 11:07:44 +08:00
return - ENOENT ;
2017-10-09 11:07:46 +08:00
}
btrfs_report_missing_device ( fs_info , devid ,
dev_uuid , false ) ;
2017-03-09 09:34:42 +08:00
}
2014-09-03 21:35:46 +08:00
2017-12-04 12:54:54 +08:00
if ( ! device - > bdev & &
! test_bit ( BTRFS_DEV_STATE_MISSING , & device - > dev_state ) ) {
2010-12-13 14:56:23 -05:00
/*
* This happens when a device that was properly set up
* in the device info lists suddenly goes bad.
* device->bdev is NULL, so we have to set the
* BTRFS_DEV_STATE_MISSING bit here.
*/
2014-09-03 21:35:46 +08:00
device - > fs_devices - > missing_devices + + ;
2017-12-04 12:54:54 +08:00
set_bit ( BTRFS_DEV_STATE_MISSING , & device - > dev_state ) ;
2008-11-17 21:11:30 -05:00
}
2014-09-03 21:35:46 +08:00
/* Move the device to its own fs_devices */
if ( device - > fs_devices ! = fs_devices ) {
2017-12-04 12:54:54 +08:00
ASSERT ( test_bit ( BTRFS_DEV_STATE_MISSING ,
& device - > dev_state ) ) ;
2014-09-03 21:35:46 +08:00
list_move ( & device - > dev_list , & fs_devices - > devices ) ;
device - > fs_devices - > num_devices - - ;
fs_devices - > num_devices + + ;
device - > fs_devices - > missing_devices - - ;
fs_devices - > missing_devices + + ;
device - > fs_devices = fs_devices ;
}
2008-11-17 21:11:30 -05:00
}
2016-06-22 18:54:23 -04:00
if ( device - > fs_devices ! = fs_info - > fs_devices ) {
2017-12-04 12:54:52 +08:00
BUG_ON ( test_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) ) ;
2008-11-17 21:11:30 -05:00
if ( device - > generation ! =
btrfs_device_generation ( leaf , dev_item ) )
return - EINVAL ;
2008-03-24 15:01:59 -04:00
}
2008-03-24 15:01:56 -04:00
fill_device_from_item ( leaf , dev_item , device ) ;
2020-11-03 13:49:42 +08:00
if ( device - > bdev ) {
u64 max_total_bytes = i_size_read ( device - > bdev - > bd_inode ) ;
if ( device - > total_bytes > max_total_bytes ) {
btrfs_err ( fs_info ,
" device total_bytes should be at most %llu but found %llu " ,
max_total_bytes , device - > total_bytes ) ;
return - EINVAL ;
}
}
2017-12-04 12:54:53 +08:00
set_bit ( BTRFS_DEV_STATE_IN_FS_METADATA , & device - > dev_state ) ;
2017-12-04 12:54:52 +08:00
if ( test_bit ( BTRFS_DEV_STATE_WRITEABLE , & device - > dev_state ) & &
2017-12-04 12:54:55 +08:00
! test_bit ( BTRFS_DEV_STATE_REPLACE_TGT , & device - > dev_state ) ) {
2008-11-17 21:11:30 -05:00
device - > fs_devices - > total_rw_bytes + = device - > total_bytes ;
2017-05-11 09:17:46 +03:00
atomic64_add ( device - > total_bytes - device - > bytes_used ,
& fs_info - > free_chunk_space ) ;
2011-09-26 17:12:22 -04:00
}
2008-03-24 15:01:56 -04:00
ret = 0 ;
return ret ;
}
2016-06-21 21:16:51 -04:00
int btrfs_read_sys_array ( struct btrfs_fs_info * fs_info )
2008-03-24 15:01:56 -04:00
{
2016-06-21 21:16:51 -04:00
struct btrfs_root * root = fs_info - > tree_root ;
2016-09-20 10:05:02 -04:00
struct btrfs_super_block * super_copy = fs_info - > super_copy ;
2008-05-07 11:43:44 -04:00
struct extent_buffer * sb ;
2008-03-24 15:01:56 -04:00
struct btrfs_disk_key * disk_key ;
struct btrfs_chunk * chunk ;
2014-10-31 19:02:42 +01:00
u8 * array_ptr ;
unsigned long sb_array_offset ;
2008-04-25 09:04:37 -04:00
int ret = 0 ;
2008-03-24 15:01:56 -04:00
u32 num_stripes ;
u32 array_size ;
u32 len = 0 ;
2014-10-31 19:02:42 +01:00
u32 cur_offset ;
2016-06-03 12:05:15 -07:00
u64 type ;
2008-04-25 09:04:37 -04:00
struct btrfs_key key ;
2008-03-24 15:01:56 -04:00
2016-06-22 18:54:23 -04:00
ASSERT ( BTRFS_SUPER_INFO_SIZE < = fs_info - > nodesize ) ;
2014-06-15 02:39:54 +02:00
/*
* This will create an extent buffer of nodesize; the superblock size is
* fixed to BTRFS_SUPER_INFO_SIZE. If nodesize > sb size, this will
* overallocate, but we can keep it as-is because only the first page is used.
*/
2020-11-05 10:45:20 -05:00
sb = btrfs_find_create_tree_block ( fs_info , BTRFS_SUPER_INFO_OFFSET ,
root - > root_key . objectid , 0 ) ;
2016-06-06 12:01:23 -07:00
if ( IS_ERR ( sb ) )
return PTR_ERR ( sb ) ;
2015-12-03 13:06:46 +01:00
set_extent_buffer_uptodate ( sb ) ;
2011-10-07 18:06:13 +02:00
/*
2016-05-19 21:18:45 -04:00
* The sb extent buffer is artificial and just used to read the system array .
2015-12-03 13:06:46 +01:00
* set_extent_buffer_uptodate() call does not properly mark all its
2011-10-07 18:06:13 +02:00
* pages up - to - date when the page is larger : extent does not cover the
* whole page and consequently check_page_uptodate does not find all
* the page ' s extents up - to - date ( the hole beyond sb ) ,
* write_extent_buffer then triggers a WARN_ON .
*
* Regular short extents go through mark_extent_buffer_dirty / writeback cycle ,
* but sb spans only this function . Add an explicit SetPageUptodate call
* to silence the warning eg . on PowerPC 64.
*/
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
if ( PAGE_SIZE > BTRFS_SUPER_INFO_SIZE )
2010-08-06 13:21:20 -04:00
SetPageUptodate ( sb - > pages [ 0 ] ) ;
2009-02-12 14:09:45 -05:00
2008-05-07 11:43:44 -04:00
write_extent_buffer ( sb , super_copy , 0 , BTRFS_SUPER_INFO_SIZE ) ;
2008-03-24 15:01:56 -04:00
array_size = btrfs_super_sys_array_size ( super_copy ) ;
2014-10-31 19:02:42 +01:00
array_ptr = super_copy - > sys_chunk_array ;
sb_array_offset = offsetof ( struct btrfs_super_block , sys_chunk_array ) ;
cur_offset = 0 ;
2008-03-24 15:01:56 -04:00
2014-10-31 19:02:42 +01:00
while ( cur_offset < array_size ) {
disk_key = ( struct btrfs_disk_key * ) array_ptr ;
2014-11-05 15:24:51 +01:00
len = sizeof ( * disk_key ) ;
if ( cur_offset + len > array_size )
goto out_short_read ;
2008-03-24 15:01:56 -04:00
btrfs_disk_key_to_cpu ( & key , disk_key ) ;
2014-10-31 19:02:42 +01:00
array_ptr + = len ;
sb_array_offset + = len ;
cur_offset + = len ;
2008-03-24 15:01:56 -04:00
2019-10-18 11:58:23 +02:00
if ( key . type ! = BTRFS_CHUNK_ITEM_KEY ) {
btrfs_err ( fs_info ,
" unexpected item type %u in sys_array at offset %u " ,
( u32 ) key . type , cur_offset ) ;
ret = - EIO ;
break ;
}
2015-11-30 17:27:06 +01:00
2019-10-18 11:58:23 +02:00
chunk = ( struct btrfs_chunk * ) sb_array_offset ;
/*
* At least one btrfs_chunk with one stripe must be present ,
* exact stripe count check comes afterwards
*/
len = btrfs_chunk_item_size ( 1 ) ;
if ( cur_offset + len > array_size )
goto out_short_read ;
2016-06-03 12:05:15 -07:00
2019-10-18 11:58:23 +02:00
num_stripes = btrfs_chunk_num_stripes ( sb , chunk ) ;
if ( ! num_stripes ) {
btrfs_err ( fs_info ,
" invalid number of stripes %u in sys_array at offset %u " ,
num_stripes , cur_offset ) ;
ret = - EIO ;
break ;
}
2014-11-05 15:24:51 +01:00
2019-10-18 11:58:23 +02:00
type = btrfs_chunk_type ( sb , chunk ) ;
if ( ( type & BTRFS_BLOCK_GROUP_SYSTEM ) = = 0 ) {
2016-09-20 10:05:02 -04:00
btrfs_err ( fs_info ,
2019-10-18 11:58:23 +02:00
" invalid chunk type %llu in sys_array at offset %u " ,
type , cur_offset ) ;
2008-04-25 09:04:37 -04:00
ret = - EIO ;
break ;
2008-03-24 15:01:56 -04:00
}
2019-10-18 11:58:23 +02:00
len = btrfs_chunk_item_size ( num_stripes ) ;
if ( cur_offset + len > array_size )
goto out_short_read ;
ret = read_one_chunk ( & key , sb , chunk ) ;
if ( ret )
break ;
2014-10-31 19:02:42 +01:00
array_ptr + = len ;
sb_array_offset + = len ;
cur_offset + = len ;
2008-03-24 15:01:56 -04:00
}
2016-06-03 17:41:42 -07:00
clear_extent_buffer_uptodate ( sb ) ;
2016-05-13 17:06:59 -07:00
free_extent_buffer_stale ( sb ) ;
2008-04-25 09:04:37 -04:00
return ret ;
2014-11-05 15:24:51 +01:00
out_short_read :
2016-09-20 10:05:02 -04:00
btrfs_err ( fs_info , " sys_array too short to read %u bytes at offset %u " ,
2014-11-05 15:24:51 +01:00
len , cur_offset ) ;
2016-06-03 17:41:42 -07:00
clear_extent_buffer_uptodate ( sb ) ;
2016-05-13 17:06:59 -07:00
free_extent_buffer_stale ( sb ) ;
2014-11-05 15:24:51 +01:00
return - EIO ;
2008-03-24 15:01:56 -04:00
}
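The loop in btrfs_read_sys_array() validates a packed array in three steps per item: the key must fit, then a minimal one-stripe item must fit so the stripe count can be read safely, and only then is the full item length checked. The sketch below reproduces that order of checks on a simplified record layout. The layout (17-byte key, 4-byte item header beginning with a 16-bit stripe count in host byte order, 8-byte stripes) and all `sketch_` names are hypothetical and do not describe the on-disk format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record layout, chosen only for illustration. */
#define SKETCH_KEY_SIZE		17u
#define SKETCH_HDR_SIZE		4u	/* starts with a 16-bit stripe count */
#define SKETCH_STRIPE_SIZE	8u

static uint32_t sketch_item_size(uint32_t num_stripes)
{
	return SKETCH_HDR_SIZE + num_stripes * SKETCH_STRIPE_SIZE;
}

/* Returns the number of items walked, or -1 on short or invalid input. */
static int sketch_walk_array(const uint8_t *array, uint32_t array_size)
{
	uint32_t cur = 0;
	int items = 0;

	while (cur < array_size) {
		uint16_t num_stripes;
		uint32_t len;

		/* Step 1: the whole key must lie inside the array. */
		len = SKETCH_KEY_SIZE;
		if (cur + len > array_size)
			return -1;
		cur += len;

		/* Step 2: a one-stripe item must fit before the count is read. */
		len = sketch_item_size(1);
		if (cur + len > array_size)
			return -1;

		memcpy(&num_stripes, array + cur, sizeof(num_stripes));
		if (num_stripes == 0)
			return -1;

		/* Step 3: now the exact item length can be validated. */
		len = sketch_item_size(num_stripes);
		if (cur + len > array_size)
			return -1;

		cur += len;
		items++;
	}

	return items;
}
```

Checking the minimal size before reading the count is what keeps the count read itself from running past the end of a truncated array.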
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.
It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.
Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.
Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.
For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.
But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.
Such case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw
If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.
This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-09 09:34:36 +08:00
/*
* Check if all chunks in the fs are OK for read - write degraded mount
*
2017-12-18 17:08:59 +08:00
* If the @ failing_dev is specified , it ' s accounted as missing .
*
2017-03-09 09:34:36 +08:00
* Return true if all chunks meet the minimal RW mount requirements .
* Return false if any chunk doesn ' t meet the minimal RW mount requirements .
*/
2017-12-18 17:08:59 +08:00
bool btrfs_check_rw_degradable ( struct btrfs_fs_info * fs_info ,
struct btrfs_device * failing_dev )
2017-03-09 09:34:36 +08:00
{
2019-05-17 11:43:17 +02:00
struct extent_map_tree * map_tree = & fs_info - > mapping_tree ;
2017-03-09 09:34:36 +08:00
	struct extent_map *em;
	u64 next_start = 0;
	bool ret = true;
2019-05-17 11:43:17 +02:00
	read_lock(&map_tree->lock);
	em = lookup_extent_mapping(map_tree, 0, (u64)-1);
	read_unlock(&map_tree->lock);
2017-03-09 09:34:36 +08:00
	/* No chunk at all? Return false anyway */
	if (!em) {
		ret = false;
		goto out;
	}
	while (em) {
		struct map_lookup *map;
		int missing = 0;
		int max_tolerated;
		int i;
		map = em->map_lookup;
		max_tolerated =
			btrfs_get_num_tolerated_disk_barrier_failures(
					map->type);
		for (i = 0; i < map->num_stripes; i++) {
			struct btrfs_device *dev = map->stripes[i].dev;
2017-12-04 12:54:54 +08:00
			if (!dev || !dev->bdev ||
			    test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) ||
2017-03-09 09:34:36 +08:00
			    dev->last_flush_error)
				missing++;
2017-12-18 17:08:59 +08:00
			else if (failing_dev && failing_dev == dev)
				missing++;
2017-03-09 09:34:36 +08:00
		}
		if (missing > max_tolerated) {
2017-12-18 17:08:59 +08:00
			if (!failing_dev)
				btrfs_warn(fs_info,
2018-11-28 12:05:13 +01:00
	"chunk %llu missing %d devices, max tolerance is %d for writable mount",
2017-03-09 09:34:36 +08:00
					   em->start, missing, max_tolerated);
			free_extent_map(em);
			ret = false;
			goto out;
		}
		next_start = extent_map_end(em);
		free_extent_map(em);
2019-05-17 11:43:17 +02:00
		read_lock(&map_tree->lock);
		em = lookup_extent_mapping(map_tree, next_start,
2017-03-09 09:34:36 +08:00
					   (u64)(-1) - next_start);
2019-05-17 11:43:17 +02:00
		read_unlock(&map_tree->lock);
2017-03-09 09:34:36 +08:00
	}
out:
	return ret;
}
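The per-chunk check above can be modelled outside the kernel. The following is a minimal user-space sketch, not the btrfs implementation: the `demo_*` types and functions are hypothetical stand-ins, and `max_tolerated` is supplied directly instead of being derived from the chunk profile. It only illustrates why a per-chunk comparison is less strict than one global tolerance value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the btrfs types. */
struct demo_stripe {
	bool present;		/* false models a missing or failing device */
};

struct demo_chunk {
	int max_tolerated;	/* failures this chunk's profile can absorb */
	size_t num_stripes;
	const struct demo_stripe *stripes;
};

/* Count the stripes of one chunk whose device is unavailable. */
size_t demo_count_missing(const struct demo_chunk *chunk)
{
	size_t missing = 0;

	for (size_t i = 0; i < chunk->num_stripes; i++)
		if (!chunk->stripes[i].present)
			missing++;
	return missing;
}

/*
 * Return true only if every chunk tolerates its own missing stripes.
 * An empty chunk list yields false, mirroring the "no chunk at all" case.
 */
bool demo_rw_degradable(const struct demo_chunk *chunks, size_t n)
{
	if (n == 0)
		return false;
	for (size_t i = 0; i < n; i++) {
		assert(chunks[i].max_tolerated >= 0);
		if (demo_count_missing(&chunks[i]) >
		    (size_t)chunks[i].max_tolerated)
			return false;
	}
	return true;
}
```

With a single-profile chunk whose only stripe is present and a mirrored chunk that lost one of two stripes, the sketch reports the filesystem as degradable, which is the case the commit message describes.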
2020-07-08 22:55:14 +02:00
static void readahead_tree_node_children(struct extent_buffer *node)
{
	int i;
	const int nr_items = btrfs_header_nritems(node);
2020-11-05 10:45:09 -05:00
	for (i = 0; i < nr_items; i++)
		btrfs_readahead_node_child(node, i);
2020-07-08 22:55:14 +02:00
}
2016-06-21 10:40:19 -04:00
int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
2008-03-24 15:01:56 -04:00
{
2016-06-21 10:40:19 -04:00
	struct btrfs_root *root = fs_info->chunk_root;
2008-03-24 15:01:56 -04:00
	struct btrfs_path *path;
	struct extent_buffer *leaf;
	struct btrfs_key key;
	struct btrfs_key found_key;
	int ret;
	int slot;
2016-06-03 12:05:14 -07:00
	u64 total_dev = 0;
2020-07-08 22:55:14 +02:00
	u64 last_ra_node = 0;
2008-03-24 15:01:56 -04:00
	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;
2018-04-12 10:29:32 +08:00
	/*
	 * uuid_mutex is needed only if we are mounting a sprout FS
	 * otherwise we don't need it.
	 */
2011-12-07 11:38:24 +08:00
	mutex_lock(&uuid_mutex);
btrfs: fix mount failure caused by race with umount
It is possible to cause a btrfs mount to fail by racing it with a slow
umount. The crux of the sequence is generic_shutdown_super not yet
calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
If that occurs, btrfs_open_devices will decide the opened counter is
non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
0. From here, mount will call sget which will result in grab_super
trying to take the super block umount semaphore. That semaphore will be
held by the slow umount, so mount will block. Before up-ing the
semaphore, umount will delete the super block, resulting in mount's sget
reliably allocating a new one, which causes the mount path to dutifully
fill it out, and increment total_rw_bytes a second time, which causes
the mount to fail, as we see double the expected bytes.
Here is the sequence laid out in greater detail:
CPU0 CPU1
down_write sb->s_umount
btrfs_kill_super
kill_anon_super(sb)
generic_shutdown_super(sb);
shrink_dcache_for_umount(sb);
sync_filesystem(sb);
evict_inodes(sb); // SLOW
btrfs_mount_root
btrfs_scan_one_device
fs_devices = device->fs_devices
fs_info->fs_devices = fs_devices
// fs_devices-opened makes this a no-op
btrfs_open_devices(fs_devices, mode, fs_type)
s = sget(fs_type, test, set, flags, fs_info);
find sb in s_instances
grab_super(sb);
down_write(&s->s_umount); // blocks
sop->put_super(sb)
// sb->fs_devices->opened == 2; no-op
spin_lock(&sb_lock);
hlist_del_init(&sb->s_instances);
spin_unlock(&sb_lock);
up_write(&sb->s_umount);
return 0;
retry lookup
don't find sb in s_instances (deleted by CPU0)
s = alloc_super
return s;
btrfs_fill_super(s, fs_devices, data)
open_ctree // fs_devices total_rw_bytes improperly set!
btrfs_read_chunk_tree
read_one_dev // increment total_rw_bytes again!!
super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!
To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
before the calls to read_one_dev, while holding the sb umount semaphore
and the uuid mutex.
To reproduce, it is sufficient to dirty a decent number of inodes, then
quickly umount and mount.
for i in $(seq 0 500)
do
dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
done
umount /mnt/foo&
mount /mnt/foo
does the trick for me.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-16 13:29:46 -07:00
	/*
	 * It is possible for mount and umount to race in such a way that
	 * we execute this code path, but open_fs_devices failed to clear
	 * total_rw_bytes. We certainly want it cleared before reading the
	 * device items, so clear it here.
	 */
	fs_info->fs_devices->total_rw_bytes = 0;
2013-07-30 12:03:04 +01:00
	/*
	 * Read all device items, and then all the chunk items. All
	 * device items are found before any chunk item (their object id
	 * is smaller than the lowest possible object id for a chunk
	 * item - BTRFS_FIRST_CHUNK_TREE_OBJECTID).
2008-03-24 15:01:56 -04:00
	 */
	key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
	key.offset = 0;
	key.type = 0;
	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
2010-03-25 12:34:49 +00:00
	if (ret < 0)
		goto error;
2009-01-05 21:25:51 -05:00
	while (1) {
2020-07-08 22:55:14 +02:00
		struct extent_buffer *node;
2008-03-24 15:01:56 -04:00
		leaf = path->nodes[0];
		slot = path->slots[0];
		if (slot >= btrfs_header_nritems(leaf)) {
			ret = btrfs_next_leaf(root, path);
			if (ret == 0)
				continue;
			if (ret < 0)
				goto error;
			break;
		}
2020-07-08 22:55:14 +02:00
		/*
		 * The nodes on level 1 are not locked but we don't need to do
		 * that during mount time as nothing else can access the tree
		 */
		node = path->nodes[1];
		if (node) {
			if (last_ra_node != node->start) {
				readahead_tree_node_children(node);
				last_ra_node = node->start;
			}
		}
2008-03-24 15:01:56 -04:00
		btrfs_item_key_to_cpu(leaf, &found_key, slot);
2013-07-30 12:03:04 +01:00
		if (found_key.type == BTRFS_DEV_ITEM_KEY) {
			struct btrfs_dev_item *dev_item;
			dev_item = btrfs_item_ptr(leaf, slot,
2008-03-24 15:01:56 -04:00
						  struct btrfs_dev_item);
2019-03-20 16:45:15 +01:00
			ret = read_one_dev(leaf, dev_item);
2013-07-30 12:03:04 +01:00
			if (ret)
				goto error;
2016-06-03 12:05:14 -07:00
			total_dev++;
2008-03-24 15:01:56 -04:00
		} else if (found_key.type == BTRFS_CHUNK_ITEM_KEY) {
			struct btrfs_chunk *chunk;
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
			/*
			 * We are only called at mount time, so no need to take
			 * fs_info->chunk_mutex. Plus, to avoid lockdep warnings,
			 * we always lock first fs_info->chunk_mutex before
			 * acquiring any locks on the chunk tree. This is a
			 * requirement for chunk allocation, see the comment on
			 * top of btrfs_chunk_alloc() for details.
			 */
			ASSERT(!test_bit(BTRFS_FS_OPEN, &fs_info->flags));
2008-03-24 15:01:56 -04:00
			chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
2019-03-20 16:43:07 +01:00
			ret = read_one_chunk(&found_key, leaf, chunk);
2008-11-17 21:11:30 -05:00
			if (ret)
				goto error;
2008-03-24 15:01:56 -04:00
		}
		path->slots[0]++;
	}
2016-06-03 12:05:14 -07:00
	/*
	 * After loading chunk tree, we've got all device information,
	 * do another round of validation checks.
	 */
2016-06-22 18:54:23 -04:00
	if (total_dev != fs_info->fs_devices->total_devices) {
		btrfs_err(fs_info,
2016-06-03 12:05:14 -07:00
	"super_num_devices %llu mismatch with num_devices %llu found here",
2016-06-22 18:54:23 -04:00
			  btrfs_super_num_devices(fs_info->super_copy),
2016-06-03 12:05:14 -07:00
			  total_dev);
		ret = -EINVAL;
		goto error;
	}
2016-06-22 18:54:23 -04:00
	if (btrfs_super_total_bytes(fs_info->super_copy) <
	    fs_info->fs_devices->total_rw_bytes) {
		btrfs_err(fs_info,
2016-06-03 12:05:14 -07:00
	"super_total_bytes %llu mismatch with fs_devices total_rw_bytes %llu",
2016-06-22 18:54:23 -04:00
			  btrfs_super_total_bytes(fs_info->super_copy),
			  fs_info->fs_devices->total_rw_bytes);
2016-06-03 12:05:14 -07:00
		ret = -EINVAL;
		goto error;
	}
2008-03-24 15:01:56 -04:00
	ret = 0;
error:
2011-12-07 11:38:24 +08:00
	mutex_unlock(&uuid_mutex);
2008-11-17 21:11:30 -05:00
	btrfs_free_path(path);
2008-03-24 15:01:56 -04:00
	return ret;
}
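The function above relies on device items sorting before chunk items in the chunk tree. Below is a small stand-alone sketch of that key ordering. The `demo_*` names are hypothetical, and the numeric constants are written from memory of the btrfs on-disk format headers, so treat them as assumptions to verify rather than as authoritative values.

```c
#include <stdint.h>

/* Assumed values, recalled from the btrfs on-disk format; verify before use. */
#define DEMO_DEV_ITEMS_OBJECTID		1ULL
#define DEMO_FIRST_CHUNK_TREE_OBJECTID	256ULL
#define DEMO_DEV_ITEM_KEY		216
#define DEMO_CHUNK_ITEM_KEY		228

/* CPU-order view of a tree key: (objectid, type, offset). */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Compare by objectid, then type, then offset; returns <0, 0 or >0. */
int demo_comp_keys(const struct demo_key *a, const struct demo_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```

Because the objectid is compared first, a search that starts at `(DEMO_DEV_ITEMS_OBJECTID, 0, 0)` and walks forward would visit every device item before reaching the first chunk item, regardless of the type values.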
2012-05-25 16:06:08 +02:00
2013-05-15 07:48:19 +00:00
void btrfs_init_devices_late(struct btrfs_fs_info *fs_info)
{
2020-07-16 10:25:33 +03:00
	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices, *seed_devs;
2013-05-15 07:48:19 +00:00
	struct btrfs_device *device;
2020-07-16 10:25:33 +03:00
	fs_devices->fs_info = fs_info;
	mutex_lock(&fs_devices->device_list_mutex);
	list_for_each_entry(device, &fs_devices->devices, dev_list)
		device->fs_info = fs_info;
	list_for_each_entry(seed_devs, &fs_devices->seed_list, seed_list) {
		list_for_each_entry(device, &seed_devs->devices, dev_list)
2016-06-22 18:54:56 -04:00
			device->fs_info = fs_info;
2014-05-11 23:14:59 +08:00
2020-07-16 10:25:33 +03:00
		seed_devs->fs_info = fs_info;
2014-05-11 23:14:59 +08:00
	}
2020-09-05 01:34:31 +08:00
	mutex_unlock(&fs_devices->device_list_mutex);
2013-05-15 07:48:19 +00:00
}
2019-08-21 20:05:32 +02:00
static u64 btrfs_dev_stats_value(const struct extent_buffer *eb,
				 const struct btrfs_dev_stats_item *ptr,
				 int index)
{
	u64 val;
	read_extent_buffer(eb, &val,
			   offsetof(struct btrfs_dev_stats_item, values) +
			   ((unsigned long)ptr) + (index * sizeof(u64)),
			   sizeof(val));
	return val;
}
static void btrfs_set_dev_stats_value(struct extent_buffer *eb,
				      struct btrfs_dev_stats_item *ptr,
				      int index, u64 val)
{
	write_extent_buffer(eb, &val,
			    offsetof(struct btrfs_dev_stats_item, values) +
			    ((unsigned long)ptr) + (index * sizeof(u64)),
			    sizeof(val));
}
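The two helpers above compute a byte offset (item offset, plus the offset of `values`, plus `index * sizeof(u64)`) and copy eight bytes to or from the buffer. A minimal user-space sketch of the same arithmetic follows. It assumes a flat byte array in place of an extent buffer, uses hypothetical `demo_*` names, and, like the helpers, copies the value in host byte order without conversion.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_STAT_VALUES_MAX 5

/* Hypothetical layout: an array of 64-bit counters, like the stats item. */
struct demo_stats_item {
	uint64_t values[DEMO_STAT_VALUES_MAX];
};

/* Byte offset of counter `index` for an item starting at `item_off`. */
size_t demo_stats_offset(size_t item_off, int index)
{
	return item_off + offsetof(struct demo_stats_item, values) +
	       (size_t)index * sizeof(uint64_t);
}

/*
 * Read one counter. memcpy is used because the item may sit at an
 * unaligned offset inside the buffer.
 */
uint64_t demo_stats_value(const unsigned char *buf, size_t item_off, int index)
{
	uint64_t val;

	memcpy(&val, buf + demo_stats_offset(item_off, index), sizeof(val));
	return val;
}

/* Write one counter at the same computed offset. */
void demo_set_stats_value(unsigned char *buf, size_t item_off, int index,
			  uint64_t val)
{
	memcpy(buf + demo_stats_offset(item_off, index), &val, sizeof(val));
}
```

The caller is responsible for ensuring that the computed offset plus eight bytes stays inside the buffer; the sketch does no bounds checking.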
2020-09-18 16:44:33 -04:00
static int btrfs_device_init_dev_stats(struct btrfs_device *device,
				       struct btrfs_path *path)
2012-05-25 16:06:10 +02:00
{
2020-09-18 16:44:32 -04:00
	struct btrfs_dev_stats_item *ptr;
2012-05-25 16:06:10 +02:00
	struct extent_buffer *eb;
2020-09-18 16:44:32 -04:00
	struct btrfs_key key;
	int item_size;
	int i, ret, slot;
2021-03-11 11:23:15 -05:00
	if (!device->fs_info->dev_root)
		return 0;
2020-09-18 16:44:32 -04:00
	key.objectid = BTRFS_DEV_STATS_OBJECTID;
	key.type = BTRFS_PERSISTENT_ITEM_KEY;
	key.offset = device->devid;
	ret = btrfs_search_slot(NULL, device->fs_info->dev_root, &key, path, 0, 0);
	if (ret) {
		for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++)
			btrfs_dev_stat_set(device, i, 0);
		device->dev_stats_valid = 1;
		btrfs_release_path(path);
2020-09-18 16:44:33 -04:00
		return ret < 0 ? ret : 0;
2020-09-18 16:44:32 -04:00
	}
	slot = path->slots[0];
	eb = path->nodes[0];
	item_size = btrfs_item_size_nr(eb, slot);
	ptr = btrfs_item_ptr(eb, slot, struct btrfs_dev_stats_item);
	for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++) {
		if (item_size >= (1 + i) * sizeof(__le64))
			btrfs_dev_stat_set(device, i,
					   btrfs_dev_stats_value(eb, ptr, i));
		else
			btrfs_dev_stat_set(device, i, 0);
	}
	device->dev_stats_valid = 1;
	btrfs_dev_stat_print_on_load(device);
	btrfs_release_path(path);
2020-09-18 16:44:33 -04:00
	return 0;
2020-09-18 16:44:32 -04:00
}
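The loading loop above accepts on-disk items that are shorter than the current counter array and zero-fills whatever the item does not contain. A minimal sketch of that size check, with a hypothetical `demo_load_stats()` and a plain array standing in for the on-disk item:

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_STAT_VALUES_MAX 5

/*
 * Fill `out` from an item that holds `item_size` bytes of counters.
 * Counter i is taken from the item only if the item is long enough to
 * contain it in full; otherwise it is reported as zero.
 */
void demo_load_stats(const uint64_t *item, size_t item_size,
		     uint64_t out[DEMO_STAT_VALUES_MAX])
{
	for (size_t i = 0; i < DEMO_STAT_VALUES_MAX; i++) {
		if (item_size >= (i + 1) * sizeof(uint64_t))
			out[i] = item[i];
		else
			out[i] = 0;
	}
}
```

This pattern lets the counter array grow over time: an item written when fewer counters existed still loads, and the newer counters simply start at zero.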
int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info)
{
	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices, *seed_devs;
2012-05-25 16:06:10 +02:00
	struct btrfs_device *device;
	struct btrfs_path *path = NULL;
2020-09-18 16:44:33 -04:00
	int ret = 0;
2012-05-25 16:06:10 +02:00
	path = btrfs_alloc_path();
2019-08-21 17:26:32 +08:00
	if (!path)
		return -ENOMEM;
2012-05-25 16:06:10 +02:00
	mutex_lock(&fs_devices->device_list_mutex);
2020-09-18 16:44:33 -04:00
	list_for_each_entry(device, &fs_devices->devices, dev_list) {
		ret = btrfs_device_init_dev_stats(device, path);
		if (ret)
			goto out;
	}
2020-09-18 16:44:32 -04:00
	list_for_each_entry(seed_devs, &fs_devices->seed_list, seed_list) {
2020-09-18 16:44:33 -04:00
		list_for_each_entry(device, &seed_devs->devices, dev_list) {
			ret = btrfs_device_init_dev_stats(device, path);
			if (ret)
				goto out;
		}
2012-05-25 16:06:10 +02:00
	}
2020-09-18 16:44:33 -04:00
out:
2012-05-25 16:06:10 +02:00
	mutex_unlock(&fs_devices->device_list_mutex);
	btrfs_free_path(path);
2020-09-18 16:44:33 -04:00
	return ret;
2012-05-25 16:06:10 +02:00
}
static int update_dev_stat_item(struct btrfs_trans_handle *trans,
				struct btrfs_device *device)
{
2018-07-20 19:37:49 +03:00
	struct btrfs_fs_info *fs_info = trans->fs_info;
2016-06-21 21:16:51 -04:00
	struct btrfs_root *dev_root = fs_info->dev_root;
2012-05-25 16:06:10 +02:00
	struct btrfs_path *path;
	struct btrfs_key key;
	struct extent_buffer *eb;
	struct btrfs_dev_stats_item *ptr;
	int ret;
	int i;
2016-01-25 17:51:31 +01:00
	key.objectid = BTRFS_DEV_STATS_OBJECTID;
	key.type = BTRFS_PERSISTENT_ITEM_KEY;
2012-05-25 16:06:10 +02:00
	key.offset = device->devid;
	path = btrfs_alloc_path();
2017-02-15 09:35:01 +01:00
	if (!path)
		return -ENOMEM;
2012-05-25 16:06:10 +02:00
	ret = btrfs_search_slot(trans, dev_root, &key, path, -1, 1);
	if (ret < 0) {
2016-06-22 18:54:23 -04:00
		btrfs_warn_in_rcu(fs_info,
2015-10-08 09:01:03 +02:00
			"error %d while searching for dev_stats item for device %s",
2012-06-04 14:03:51 -04:00
			ret, rcu_str_deref(device->name));
2012-05-25 16:06:10 +02:00
		goto out;
	}
	if (ret == 0 &&
	    btrfs_item_size_nr(path->nodes[0], path->slots[0]) < sizeof(*ptr)) {
		/* need to delete old one and insert a new one */
		ret = btrfs_del_item(trans, dev_root, path);
		if (ret != 0) {
2016-06-22 18:54:23 -04:00
			btrfs_warn_in_rcu(fs_info,
2015-10-08 09:01:03 +02:00
				"delete too small dev_stats item for device %s failed %d",
2012-06-04 14:03:51 -04:00
				rcu_str_deref(device->name), ret);
2012-05-25 16:06:10 +02:00
			goto out;
		}
		ret = 1;
	}
	if (ret == 1) {
		/* need to insert a new item */
		btrfs_release_path(path);
		ret = btrfs_insert_empty_item(trans, dev_root, path,
					      &key, sizeof(*ptr));
		if (ret < 0) {
2016-06-22 18:54:23 -04:00
			btrfs_warn_in_rcu(fs_info,
2015-10-08 09:01:03 +02:00
				"insert dev_stats item for device %s failed %d",
				rcu_str_deref(device->name), ret);
2012-05-25 16:06:10 +02:00
			goto out;
		}
	}
	eb = path->nodes[0];
	ptr = btrfs_item_ptr(eb, path->slots[0], struct btrfs_dev_stats_item);
	for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++)
		btrfs_set_dev_stats_value(eb, ptr, i,
					  btrfs_dev_stat_read(device, i));
	btrfs_mark_buffer_dirty(eb);
out:
	btrfs_free_path(path);
	return ret;
}
/*
 * Called from commit_transaction. Writes all changed device stats to disk.
 */
2019-03-20 16:50:38 +01:00
int btrfs_run_dev_stats(struct btrfs_trans_handle *trans)
2012-05-25 16:06:10 +02:00
{
2019-03-20 16:50:38 +01:00
	struct btrfs_fs_info *fs_info = trans->fs_info;
2012-05-25 16:06:10 +02:00
	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
	struct btrfs_device *device;
2014-07-24 11:37:11 +08:00
	int stats_cnt;
2012-05-25 16:06:10 +02:00
	int ret = 0;
	mutex_lock(&fs_devices->device_list_mutex);
	list_for_each_entry(device, &fs_devices->devices, dev_list) {
2017-10-24 13:47:37 +03:00
		stats_cnt = atomic_read(&device->dev_stats_ccnt);
		if (!device->dev_stats_valid || stats_cnt == 0)
2012-05-25 16:06:10 +02:00
			continue;
2017-10-24 13:47:37 +03:00
		/*
		 * There is a LOAD-LOAD control dependency between the value of
		 * dev_stats_ccnt and updating the on-disk values which requires
		 * reading the in-memory counters. Such control dependencies
		 * require explicit read memory barriers.
		 *
		 * This memory barrier pairs with smp_mb__before_atomic in
		 * btrfs_dev_stat_inc/btrfs_dev_stat_set and with the full
		 * barrier implied by atomic_xchg in
		 * btrfs_dev_stats_read_and_reset
		 */
		smp_rmb();
2018-07-20 19:37:49 +03:00
		ret = update_dev_stat_item(trans, device);
2012-05-25 16:06:10 +02:00
		if (!ret)
2014-07-24 11:37:11 +08:00
			atomic_sub(stats_cnt, &device->dev_stats_ccnt);
2012-05-25 16:06:10 +02:00
	}
	mutex_unlock(&fs_devices->device_list_mutex);
	return ret;
}
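The change counter in the function above is sampled once, and only that sampled amount is subtracted after a successful write. The sketch below illustrates the idea with C11 atomics in user space. It is an assumption-laden model: the `demo_*` names are hypothetical, `write_ok` stands in for the result of the on-disk update, and the default sequentially consistent atomics are stronger than the explicit barrier pairing the kernel code uses.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_device {
	atomic_int stats_ccnt;	/* number of not-yet-persisted stat changes */
};

/* Called whenever a statistic changes. */
void demo_stat_changed(struct demo_device *dev)
{
	atomic_fetch_add(&dev->stats_ccnt, 1);
}

/*
 * Persist the stats if anything changed. Returns true if a flush was
 * attempted. On success only the sampled count is subtracted, so changes
 * recorded while the write was in progress stay pending.
 */
bool demo_flush_stats(struct demo_device *dev, bool write_ok)
{
	int snapshot = atomic_load(&dev->stats_ccnt);

	if (snapshot == 0)
		return false;
	/* ... the counters would be read and written out here ... */
	if (write_ok)
		atomic_fetch_sub(&dev->stats_ccnt, snapshot);
	return true;
}
```

Subtracting the snapshot instead of storing zero is the design point: a concurrent `demo_stat_changed()` between the load and the subtraction leaves the counter non-zero, so the next flush picks the change up.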
2012-05-25 16:06:08 +02:00
void btrfs_dev_stat_inc_and_print(struct btrfs_device *dev, int index)
{
	btrfs_dev_stat_inc(dev, index);
	btrfs_dev_stat_print_on_error(dev);
}
2013-04-25 20:41:01 +00:00
static void btrfs_dev_stat_print_on_error(struct btrfs_device *dev)
2012-05-25 16:06:08 +02:00
{
2012-05-25 16:06:10 +02:00
	if (!dev->dev_stats_valid)
		return;
2016-06-22 18:54:56 -04:00
	btrfs_err_rl_in_rcu(dev->fs_info,
2015-10-08 10:43:10 +02:00
		"bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
2012-06-04 14:03:51 -04:00
			   rcu_str_deref(dev->name),
2012-05-25 16:06:08 +02:00
			   btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
			   btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_READ_ERRS),
			   btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_FLUSH_ERRS),
2013-12-20 11:37:06 -05:00
			   btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_CORRUPTION_ERRS),
			   btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_GENERATION_ERRS));
2012-05-25 16:06:08 +02:00
}
2012-05-25 16:06:09 +02:00
2012-05-25 16:06:10 +02:00
static void btrfs_dev_stat_print_on_load(struct btrfs_device *dev)
{
2012-07-17 09:02:11 -06:00
	int i;
	for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++)
		if (btrfs_dev_stat_read(dev, i) != 0)
			break;
	if (i == BTRFS_DEV_STAT_VALUES_MAX)
		return; /* all values == 0, suppress message */
2016-06-22 18:54:56 -04:00
	btrfs_info_in_rcu(dev->fs_info,
2015-10-08 09:01:03 +02:00
		"bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
2012-06-04 14:03:51 -04:00
	       rcu_str_deref(dev->name),
2012-05-25 16:06:10 +02:00
	       btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
	       btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_READ_ERRS),
	       btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_FLUSH_ERRS),
	       btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_CORRUPTION_ERRS),
	       btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_GENERATION_ERRS));
}
2016-06-22 18:54:24 -04:00
int btrfs_get_dev_stats ( struct btrfs_fs_info * fs_info ,
2012-06-22 06:30:39 -06:00
struct btrfs_ioctl_get_dev_stats * stats )
2012-05-25 16:06:09 +02:00
{
2021-10-05 16:12:42 -04:00
BTRFS_DEV_LOOKUP_ARGS ( args ) ;
2012-05-25 16:06:09 +02:00
struct btrfs_device * dev ;
2016-06-22 18:54:23 -04:00
struct btrfs_fs_devices * fs_devices = fs_info - > fs_devices ;
2012-05-25 16:06:09 +02:00
int i ;
mutex_lock ( & fs_devices - > device_list_mutex ) ;
2021-10-05 16:12:42 -04:00
args . devid = stats - > devid ;
dev = btrfs_find_device ( fs_info - > fs_devices , & args ) ;
2012-05-25 16:06:09 +02:00
mutex_unlock ( & fs_devices - > device_list_mutex ) ;
if ( ! dev ) {
2016-06-22 18:54:23 -04:00
btrfs_warn ( fs_info , " get dev_stats failed, device not found " ) ;
2012-05-25 16:06:09 +02:00
return - ENODEV ;
2012-05-25 16:06:10 +02:00
} else if ( ! dev - > dev_stats_valid ) {
2016-06-22 18:54:23 -04:00
btrfs_warn ( fs_info , " get dev_stats failed, not yet valid " ) ;
2012-05-25 16:06:10 +02:00
return - ENODEV ;
2012-06-22 06:30:39 -06:00
} else if ( stats - > flags & BTRFS_DEV_STATS_RESET ) {
2012-05-25 16:06:09 +02:00
for ( i = 0 ; i < BTRFS_DEV_STAT_VALUES_MAX ; i + + ) {
if ( stats - > nr_items > i )
stats - > values [ i ] =
btrfs_dev_stat_read_and_reset ( dev , i ) ;
else
2019-08-07 16:21:19 +08:00
btrfs_dev_stat_set ( dev , i , 0 ) ;
2012-05-25 16:06:09 +02:00
}
2020-01-10 12:26:34 +08:00
btrfs_info ( fs_info , " device stats zeroed by %s (%d) " ,
current - > comm , task_pid_nr ( current ) ) ;
2012-05-25 16:06:09 +02:00
} else {
for ( i = 0 ; i < BTRFS_DEV_STAT_VALUES_MAX ; i + + )
if ( stats - > nr_items > i )
stats - > values [ i ] = btrfs_dev_stat_read ( dev , i ) ;
}
if ( stats - > nr_items > BTRFS_DEV_STAT_VALUES_MAX )
stats - > nr_items = BTRFS_DEV_STAT_VALUES_MAX ;
return 0 ;
}
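The reporting rules in btrfs_get_dev_stats() can be illustrated with a small userspace sketch. It is not the kernel code: the struct, the function name fill_stats() and the constant SKETCH_STAT_VALUES_MAX are hypothetical, and the value 5 is an assumption taken from the five counters printed above (wr, rd, flush, corrupt, gen). It shows that only the first nr_items counters are reported, that a reset clears every counter whether or not it was reported, and that nr_items is clamped to the number of counters that exist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: five counters, matching the five values printed above. */
#define SKETCH_STAT_VALUES_MAX 5

/* Hypothetical stand-in for the ioctl argument structure. */
struct stats_request_sketch {
	uint64_t nr_items;                       /* in: slots offered, out: slots valid */
	uint64_t values[SKETCH_STAT_VALUES_MAX]; /* out: reported counters */
};

/*
 * Copy up to nr_items counters into the request. When reset is non-zero,
 * every counter is cleared, including those the caller did not ask for.
 */
static void fill_stats(struct stats_request_sketch *req, uint64_t *counters,
		       int reset)
{
	assert(req != NULL && counters != NULL);

	for (size_t i = 0; i < SKETCH_STAT_VALUES_MAX; i++) {
		if (req->nr_items > i)
			req->values[i] = counters[i];
		if (reset)
			counters[i] = 0;
	}

	/* Tell the caller how many slots actually carry data. */
	if (req->nr_items > SKETCH_STAT_VALUES_MAX)
		req->nr_items = SKETCH_STAT_VALUES_MAX;
}
```

A caller that offers more slots than exist gets nr_items lowered to the real count, so it can tell which entries of values are meaningful.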
2012-11-05 15:50:14 +01:00
2014-09-03 21:35:33 +08:00
/*
2019-03-25 14:31:22 +02:00
* Update the size and bytes used for each device where it changed . This is
* delayed since we would otherwise get errors while writing out the
* superblocks .
*
* Must be invoked during transaction commit .
2014-09-03 21:35:33 +08:00
*/
2019-03-25 14:31:22 +02:00
void btrfs_commit_device_sizes ( struct btrfs_transaction * trans )
2014-09-03 21:35:33 +08:00
{
struct btrfs_device * curr , * next ;
2019-03-25 14:31:22 +02:00
ASSERT ( trans - > state = = TRANS_STATE_COMMIT_DOING ) ;
2014-09-03 21:35:34 +08:00
2019-03-25 14:31:22 +02:00
if ( list_empty ( & trans - > dev_update_list ) )
2014-09-03 21:35:34 +08:00
return ;
2019-03-25 14:31:22 +02:00
/*
* We don ' t need the device_list_mutex here . This list is owned by the
* transaction and the transaction must complete before the device is
* released .
*/
mutex_lock ( & trans - > fs_info - > chunk_mutex ) ;
list_for_each_entry_safe ( curr , next , & trans - > dev_update_list ,
post_commit_list ) {
list_del_init ( & curr - > post_commit_list ) ;
curr - > commit_total_bytes = curr - > disk_total_bytes ;
curr - > commit_bytes_used = curr - > bytes_used ;
2014-09-03 21:35:34 +08:00
}
2019-03-25 14:31:22 +02:00
mutex_unlock ( & trans - > fs_info - > chunk_mutex ) ;
2014-09-03 21:35:34 +08:00
}
2015-03-10 06:38:31 +08:00
2018-07-13 20:46:30 +02:00
/*
* Multiplicity factor for simple profiles : DUP , RAID1 - like and RAID10 .
*/
int btrfs_bg_type_to_factor ( u64 flags )
{
2019-05-17 11:43:31 +02:00
const int index = btrfs_bg_flags_to_raid_index ( flags ) ;
return btrfs_raid_array [ index ] . ncopies ;
2018-07-13 20:46:30 +02:00
}
2018-08-01 10:37:19 +08:00
static int verify_one_dev_extent ( struct btrfs_fs_info * fs_info ,
u64 chunk_offset , u64 devid ,
u64 physical_offset , u64 physical_len )
{
2021-10-05 16:12:42 -04:00
struct btrfs_dev_lookup_args args = { . devid = devid } ;
2019-05-17 11:43:17 +02:00
struct extent_map_tree * em_tree = & fs_info - > mapping_tree ;
2018-08-01 10:37:19 +08:00
struct extent_map * em ;
struct map_lookup * map ;
2018-10-05 17:45:55 +08:00
struct btrfs_device * dev ;
2018-08-01 10:37:19 +08:00
u64 stripe_len ;
bool found = false ;
int ret = 0 ;
int i ;
read_lock ( & em_tree - > lock ) ;
em = lookup_extent_mapping ( em_tree , chunk_offset , 1 ) ;
read_unlock ( & em_tree - > lock ) ;
if ( ! em ) {
btrfs_err ( fs_info ,
" dev extent physical offset %llu on devid %llu doesn't have corresponding chunk " ,
physical_offset , devid ) ;
ret = - EUCLEAN ;
goto out ;
}
map = em - > map_lookup ;
stripe_len = calc_stripe_length ( map - > type , em - > len , map - > num_stripes ) ;
if ( physical_len ! = stripe_len ) {
btrfs_err ( fs_info ,
" dev extent physical offset %llu on devid %llu length doesn't match chunk %llu, have %llu expect %llu " ,
physical_offset , devid , em - > start , physical_len ,
stripe_len ) ;
ret = - EUCLEAN ;
goto out ;
}
for ( i = 0 ; i < map - > num_stripes ; i + + ) {
if ( map - > stripes [ i ] . dev - > devid = = devid & &
map - > stripes [ i ] . physical = = physical_offset ) {
found = true ;
if ( map - > verified_stripes > = map - > num_stripes ) {
btrfs_err ( fs_info ,
" too many dev extents for chunk %llu found " ,
em - > start ) ;
ret = - EUCLEAN ;
goto out ;
}
map - > verified_stripes + + ;
break ;
}
}
if ( ! found ) {
btrfs_err ( fs_info ,
" dev extent physical offset %llu devid %llu has no corresponding chunk " ,
physical_offset , devid ) ;
ret = - EUCLEAN ;
}
2018-10-05 17:45:55 +08:00
2021-05-21 17:42:23 +02:00
/* Make sure no dev extent is beyond device boundary */
2021-10-05 16:12:42 -04:00
dev = btrfs_find_device ( fs_info - > fs_devices , & args ) ;
2018-10-05 17:45:55 +08:00
if ( ! dev ) {
btrfs_err ( fs_info , " failed to find devid %llu " , devid ) ;
ret = - EUCLEAN ;
goto out ;
}
2019-01-08 14:08:18 +08:00
2018-10-05 17:45:55 +08:00
if ( physical_offset + physical_len > dev - > disk_total_bytes ) {
btrfs_err ( fs_info ,
" dev extent devid %llu physical offset %llu len %llu is beyond device boundary %llu " ,
devid , physical_offset , physical_len ,
dev - > disk_total_bytes ) ;
ret = - EUCLEAN ;
goto out ;
}
2021-02-04 19:21:49 +09:00
if ( dev - > zone_info ) {
u64 zone_size = dev - > zone_info - > zone_size ;
if ( ! IS_ALIGNED ( physical_offset , zone_size ) | |
! IS_ALIGNED ( physical_len , zone_size ) ) {
btrfs_err ( fs_info ,
" zoned: dev extent devid %llu physical offset %llu len %llu is not aligned to device zone " ,
devid , physical_offset , physical_len ) ;
ret = - EUCLEAN ;
goto out ;
}
}
2018-08-01 10:37:19 +08:00
out :
free_extent_map ( em ) ;
return ret ;
}
static int verify_chunk_dev_extent_mapping ( struct btrfs_fs_info * fs_info )
{
2019-05-17 11:43:17 +02:00
struct extent_map_tree * em_tree = & fs_info - > mapping_tree ;
2018-08-01 10:37:19 +08:00
struct extent_map * em ;
struct rb_node * node ;
int ret = 0 ;
read_lock ( & em_tree - > lock ) ;
2018-08-23 03:51:52 +08:00
for ( node = rb_first_cached ( & em_tree - > map ) ; node ; node = rb_next ( node ) ) {
2018-08-01 10:37:19 +08:00
em = rb_entry ( node , struct extent_map , rb_node ) ;
if ( em - > map_lookup - > num_stripes ! =
em - > map_lookup - > verified_stripes ) {
btrfs_err ( fs_info ,
" chunk %llu has missing dev extent, have %d expect %d " ,
em - > start , em - > map_lookup - > verified_stripes ,
em - > map_lookup - > num_stripes ) ;
ret = - EUCLEAN ;
goto out ;
}
}
out :
read_unlock ( & em_tree - > lock ) ;
return ret ;
}
/*
* Ensure that all dev extents are mapped to correct chunk , otherwise
* later chunk allocation / free would cause unexpected behavior .
*
* NOTE : This will iterate through the whole device tree , which should be of
* the same size level as the chunk tree . This slightly increases mount time .
*/
int btrfs_verify_dev_extents ( struct btrfs_fs_info * fs_info )
{
struct btrfs_path * path ;
struct btrfs_root * root = fs_info - > dev_root ;
struct btrfs_key key ;
btrfs: volumes: Make sure there is no overlap of dev extents at mount time
Enhance btrfs_verify_dev_extents() to remember previous checked dev
extents, so it can verify no dev extents can overlap.
Analysis from Hans:
"Imagine allocating a DATA|DUP chunk.
In the chunk allocator, we first set...
max_stripe_size = SZ_1G;
max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
... which is 10GiB.
Then...
/* we don't want a chunk larger than 10% of writeable space */
max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
max_chunk_size);
Imagine we only have one 7880MiB block device in this filesystem. Now
max_chunk_size is down to 788MiB.
The next step in the code is to search for max_stripe_size * dev_stripes
amount of free space on the device, which is in our example 1GiB * 2 =
2GiB. Imagine the device has exactly 1578MiB free in one contiguous
piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail
Next we recalculate the stripe_size (which is actually the device extent
length), based on the actual maximum amount of available raw disk space:
stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
stripe_size is now 789MiB
Next we do...
data_stripes = num_stripes / ncopies
...where data_stripes ends up as 1, because num_stripes is 2 (the amount
of device extents we're going to have), and DUP has ncopies 2.
Next there's a check...
if (stripe_size * data_stripes > max_chunk_size)
...which matches because 789MiB * 1 > 788MiB.
We go into the if code, and next is...
stripe_size = div_u64(max_chunk_size, data_stripes);
...which resets stripe_size to max_chunk_size: 788MiB
Next is a fun one...
/* bump the answer up to a 16MB boundary */
stripe_size = round_up(stripe_size, SZ_16M);
...which changes stripe_size from 788MiB to 800MiB.
We're not done changing stripe_size yet...
/* But don't go higher than the limits we found while searching
* for free extents
*/
stripe_size = min(devices_info[ndevs - 1].max_avail,
stripe_size);
This is bad. max_avail is twice the stripe_size (we need to fit 2 device
extents on the same device for DUP).
The result here is that 800MiB < 1578MiB, so it's unchanged. However,
the resulting DUP chunk will need 1600MiB disk space, which isn't there,
and the second dev_extent might extend into the next thing (next
dev_extent? end of device?) for 22MiB.
The last shown line of code relies on a situation where there's twice
the value of stripe_size present as value for the variable stripe_size
when it's DUP. This was actually the case before commit 92e222df7b
"btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
"[...] in the meantime there's a check to see if the stripe_size does
not exceed max_chunk_size. Since during this check stripe_size is twice
the amount as intended, the check will reduce the stripe_size to
max_chunk_size if the actual correct to be used stripe_size is more than
half the amount of max_chunk_size."
In the previous version of the code, the 16MiB alignment (why is this
done, by the way?) would result in a 50% chance that it would actually
do an 8MiB alignment for the individual dev_extents, since it was
operating on double the size. Does this matter?
Does it matter that stripe_size can be set to anything which is not
16MiB aligned because of the amount of remaining available disk space
which is just taken?
What is the main purpose of this round_up?
The most straightforward thing to do seems something like...
stripe_size = min(
div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
stripe_size
)
..just putting half of the max_avail into stripe_size."
Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/
Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ add analysis from report ]
Signed-off-by: David Sterba <dsterba@suse.com>
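The arithmetic in the analysis above can be replayed with a short userspace sketch. This is not the chunk allocator: dup_stripe_size(), struct stripe_result_sketch and the helper names are hypothetical, and only the steps quoted in the report are modelled (the 10% cap, the division by dev_stripes, the 16MiB round-up and the final min()). The DUP parameters (dev_stripes = 2, num_stripes = 2, ncopies = 2) and the 10GiB maximum are the ones stated in the report.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MIB (1024ULL * 1024ULL)

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

static uint64_t round_up_u64(uint64_t x, uint64_t align)
{
	assert(align != 0);
	return (x + align - 1) / align * align;
}

/* Hypothetical result pair: the reported behaviour and the suggested fix. */
struct stripe_result_sketch {
	uint64_t reported; /* final min() against max_avail */
	uint64_t proposed; /* final min() against max_avail / dev_stripes */
};

/* Replays the quoted steps for a single-device DATA|DUP allocation. */
static struct stripe_result_sketch dup_stripe_size(uint64_t total_rw_bytes,
						   uint64_t max_avail)
{
	const uint64_t dev_stripes = 2, num_stripes = 2, ncopies = 2;
	/* No chunk larger than 10% of writeable space, capped at 10GiB. */
	uint64_t max_chunk_size = min_u64(total_rw_bytes / 10,
					  10ULL * 1024 * SKETCH_MIB);
	uint64_t stripe_size = max_avail / dev_stripes;
	uint64_t data_stripes = num_stripes / ncopies;
	struct stripe_result_sketch res;

	if (stripe_size * data_stripes > max_chunk_size)
		stripe_size = max_chunk_size / data_stripes;

	/* Bump the answer up to a 16MiB boundary. */
	stripe_size = round_up_u64(stripe_size, 16 * SKETCH_MIB);

	res.reported = min_u64(max_avail, stripe_size);
	res.proposed = min_u64(max_avail / dev_stripes, stripe_size);
	return res;
}
```

With the 7880MiB device and 1578MiB of contiguous free space from the report, the reported path yields an 800MiB stripe, so two copies need 1600MiB and overshoot the free space by 22MiB. Limiting by max_avail / dev_stripes instead yields 789MiB, and two copies fit exactly.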
2018-10-05 17:45:54 +08:00
u64 prev_devid = 0 ;
u64 prev_dev_ext_end = 0 ;
2018-08-01 10:37:19 +08:00
int ret = 0 ;
2020-10-16 11:29:18 -04:00
/*
* We don ' t have a dev_root because we mounted with ignorebadroots and
* failed to load the root , so we want to skip the verification in this
* case for sure .
*
* However if the dev root is fine , but the tree itself is corrupted
* we ' d still fail to mount . This verification is only to make sure
* writes can happen safely , so instead just bypass this check
* completely in the case of IGNOREBADROOTS .
*/
if ( btrfs_test_opt ( fs_info , IGNOREBADROOTS ) )
return 0 ;
2018-08-01 10:37:19 +08:00
key . objectid = 1 ;
key . type = BTRFS_DEV_EXTENT_KEY ;
key . offset = 0 ;
path = btrfs_alloc_path ( ) ;
if ( ! path )
return - ENOMEM ;
path - > reada = READA_FORWARD ;
ret = btrfs_search_slot ( NULL , root , & key , path , 0 , 0 ) ;
if ( ret < 0 )
goto out ;
if ( path - > slots [ 0 ] > = btrfs_header_nritems ( path - > nodes [ 0 ] ) ) {
2021-07-13 10:58:03 -03:00
ret = btrfs_next_leaf ( root , path ) ;
2018-08-01 10:37:19 +08:00
if ( ret < 0 )
goto out ;
/* No dev extents at all? Not good */
if ( ret > 0 ) {
ret = - EUCLEAN ;
goto out ;
}
}
while ( 1 ) {
struct extent_buffer * leaf = path - > nodes [ 0 ] ;
struct btrfs_dev_extent * dext ;
int slot = path - > slots [ 0 ] ;
u64 chunk_offset ;
u64 physical_offset ;
u64 physical_len ;
u64 devid ;
btrfs_item_key_to_cpu ( leaf , & key , slot ) ;
if ( key . type ! = BTRFS_DEV_EXTENT_KEY )
break ;
devid = key . objectid ;
physical_offset = key . offset ;
dext = btrfs_item_ptr ( leaf , slot , struct btrfs_dev_extent ) ;
chunk_offset = btrfs_dev_extent_chunk_offset ( leaf , dext ) ;
physical_len = btrfs_dev_extent_length ( leaf , dext ) ;
2018-10-05 17:45:54 +08:00
/* Check if this dev extent overlaps with the previous one */
if ( devid = = prev_devid & & physical_offset < prev_dev_ext_end ) {
btrfs_err ( fs_info ,
" dev extent devid %llu physical offset %llu overlap with previous dev extent end %llu " ,
devid , physical_offset , prev_dev_ext_end ) ;
ret = - EUCLEAN ;
goto out ;
}
2018-08-01 10:37:19 +08:00
ret = verify_one_dev_extent ( fs_info , chunk_offset , devid ,
physical_offset , physical_len ) ;
if ( ret < 0 )
goto out ;
2018-10-05 17:45:54 +08:00
prev_devid = devid ;
prev_dev_ext_end = physical_offset + physical_len ;
2018-08-01 10:37:19 +08:00
ret = btrfs_next_item ( root , path ) ;
if ( ret < 0 )
goto out ;
if ( ret > 0 ) {
ret = 0 ;
break ;
}
}
/* Ensure all chunks have corresponding dev extents */
ret = verify_chunk_dev_extent_mapping ( fs_info ) ;
out :
btrfs_free_path ( path ) ;
return ret ;
}
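The overlap check inside btrfs_verify_dev_extents() only needs to remember the previous dev extent, because the items are visited in key order: by devid first, then by physical offset. A minimal userspace sketch of that idea follows. The struct and the function first_overlap() are hypothetical and stand in for the tree walk; the input array is assumed to be already sorted the way the dev tree keys are.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat view of one dev extent item. */
struct dev_extent_sketch {
	uint64_t devid;
	uint64_t physical_offset;
	uint64_t physical_len;
};

/*
 * Return the index of the first extent that starts before the previous
 * extent on the same device ends, or -1 if none overlaps. The array must
 * be sorted by ( devid , physical_offset ).
 */
static long first_overlap(const struct dev_extent_sketch *exts, size_t n)
{
	uint64_t prev_devid = 0;
	uint64_t prev_dev_ext_end = 0;

	assert(exts != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (exts[i].devid == prev_devid &&
		    exts[i].physical_offset < prev_dev_ext_end)
			return (long)i;

		prev_devid = exts[i].devid;
		prev_dev_ext_end = exts[i].physical_offset +
				   exts[i].physical_len;
	}
	return -1;
}
```

Adjacent extents, where one starts exactly at the previous end, are not reported, and moving on to a different devid can never trigger the check, because the devid comparison fails first.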
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-03 10:28:12 -07:00
/*
* Check whether the given block group or device is pinned by any inode being
* used as a swapfile .
*/
bool btrfs_pinned_by_swapfile ( struct btrfs_fs_info * fs_info , void * ptr )
{
struct btrfs_swapfile_pin * sp ;
struct rb_node * node ;
spin_lock ( & fs_info - > swapfile_pins_lock ) ;
node = fs_info - > swapfile_pins . rb_node ;
while ( node ) {
sp = rb_entry ( node , struct btrfs_swapfile_pin , node ) ;
if ( ptr < sp - > ptr )
node = node - > rb_left ;
else if ( ptr > sp - > ptr )
node = node - > rb_right ;
else
break ;
}
spin_unlock ( & fs_info - > swapfile_pins_lock ) ;
return node ! = NULL ;
}
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and write it to
the damaged blocks. This, however, violates the sequential write
constraints of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to the behavior on regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessarily degrades non-damaged data.
Method (2) is much like device replace but done within the same device. It
is safe because it keeps the device extent until the replacement finishes.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
To prevent a block group from being relocated multiple times on multiple
IO errors, this commit introduces a "relocating_repair" bit to show that it
is now being relocated to repair IO failures. It also uses a new kthread,
"btrfs-relocating-repair", so that relocation does not block the IO path.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 19:22:16 +09:00
static int relocating_repair_kthread ( void * data )
{
struct btrfs_block_group * cache = ( struct btrfs_block_group * ) data ;
struct btrfs_fs_info * fs_info = cache - > fs_info ;
u64 target ;
int ret = 0 ;
target = cache - > start ;
btrfs_put_block_group ( cache ) ;
if ( ! btrfs_exclop_start ( fs_info , BTRFS_EXCLOP_BALANCE ) ) {
btrfs_info ( fs_info ,
" zoned: skip relocating block group %llu to repair: EBUSY " ,
target ) ;
return - EBUSY ;
}
2021-04-19 16:41:01 +09:00
mutex_lock ( & fs_info - > reclaim_bgs_lock ) ;
2021-02-04 19:22:16 +09:00
/* Ensure block group still exists */
cache = btrfs_lookup_block_group ( fs_info , target ) ;
if ( ! cache )
goto out ;
if ( ! cache - > relocating_repair )
goto out ;
ret = btrfs_may_alloc_data_chunk ( fs_info , target ) ;
if ( ret < 0 )
goto out ;
btrfs_info ( fs_info ,
" zoned: relocating block group %llu to repair IO failure " ,
target ) ;
ret = btrfs_relocate_chunk ( fs_info , target ) ;
out :
if ( cache )
btrfs_put_block_group ( cache ) ;
2021-04-19 16:41:01 +09:00
mutex_unlock ( & fs_info - > reclaim_bgs_lock ) ;
2021-02-04 19:22:16 +09:00
btrfs_exclop_finish ( fs_info ) ;
return ret ;
}
int btrfs_repair_one_zone ( struct btrfs_fs_info * fs_info , u64 logical )
{
struct btrfs_block_group * cache ;
/* Do not attempt to repair in degraded state */
if ( btrfs_test_opt ( fs_info , DEGRADED ) )
return 0 ;
cache = btrfs_lookup_block_group ( fs_info , logical ) ;
if ( ! cache )
return 0 ;
spin_lock ( & cache - > lock ) ;
if ( cache - > relocating_repair ) {
spin_unlock ( & cache - > lock ) ;
btrfs_put_block_group ( cache ) ;
return 0 ;
}
cache - > relocating_repair = 1 ;
spin_unlock ( & cache - > lock ) ;
kthread_run ( relocating_repair_kthread , cache ,
" btrfs-relocating-repair " ) ;
return 0 ;
}